Chapter 5: Multi-Turn Evals
Beyond Single-Turn
Single-turn evals test tool selection: "Given this prompt, does the LLM pick the right tool?" But agents are multi-turn. A real task might require:
- Listing files
- Reading one of them
- Modifying it
- Writing it back
Testing this behavior means running the full agent loop with multiple tool calls. But there's a catch: real tools have side effects. You don't want your eval suite creating and deleting files on disk. The solution: mocked tools.
Mocked Tools
A mocked tool has the same name and description as the real tool, but its execute function returns a fixed value instead of doing real work.
Add the mock tool builders to evals/utils.ts:
import { tool, type ModelMessage, type ToolSet } from "ai";
import { z } from "zod";
import { SYSTEM_PROMPT } from "../src/agent/system/prompt.ts";
import type { EvalData, MultiTurnEvalData } from "./types.ts";
/**
* Build mocked tools from data config.
* Each tool returns its configured mockReturn value.
*/
export const buildMockedTools = (
mockTools: MultiTurnEvalData["mockTools"],
): ToolSet => {
const tools: ToolSet = {};
for (const [name, config] of Object.entries(mockTools)) {
// Build parameter schema dynamically
const paramSchema: Record<string, z.ZodString> = {};
for (const paramName of Object.keys(config.parameters)) {
paramSchema[paramName] = z.string();
}
tools[name] = tool({
description: config.description,
inputSchema: z.object(paramSchema),
execute: async () => config.mockReturn,
});
}
return tools;
};
/**
* Build message array from eval data
*/
export const buildMessages = (
data: EvalData | { prompt?: string; systemPrompt?: string },
): ModelMessage[] => {
const systemPrompt = data.systemPrompt ?? SYSTEM_PROMPT;
return [
{ role: "system", content: systemPrompt },
{ role: "user", content: data.prompt! },
];
};
buildMockedTools takes a config object and creates real AI SDK tools. To the LLM they look identical to the real tools, but their return values are predetermined. The LLM sees the same tool names and descriptions and makes the same decisions, yet nothing actually happens on disk.
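For example, a minimal call (the config shape matches the MockToolConfig type defined later in this chapter):

const tools = buildMockedTools({
  readFile: {
    description: "Read the contents of a file at the specified path.",
    parameters: { path: "The path to the file to read" },
    mockReturn: '{ "name": "agi", "version": "1.0.0" }',
  },
});
// tools.readFile is a real AI SDK tool; executing it never touches disk.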
You can also create more specific mock helpers. Create evals/mocks/tools.ts:
import { tool } from "ai";
import { z } from "zod";
/**
* Create a mock readFile tool that returns fixed content
*/
export const createMockReadFile = (mockContent: string) =>
tool({
description:
"Read the contents of a file at the specified path. Use this to examine file contents.",
inputSchema: z.object({
path: z.string().describe("The path to the file to read"),
}),
execute: async ({ path }: { path: string }) => mockContent,
});
/**
* Create a mock writeFile tool that returns a success message
*/
export const createMockWriteFile = (mockResponse?: string) =>
tool({
description:
"Write content to a file at the specified path. Creates the file if it doesn't exist.",
inputSchema: z.object({
path: z.string().describe("The path to the file to write"),
content: z.string().describe("The content to write to the file"),
}),
execute: async ({ path, content }: { path: string; content: string }) =>
mockResponse ??
`Successfully wrote ${content.length} characters to ${path}`,
});
/**
* Create a mock listFiles tool that returns a fixed file list
*/
export const createMockListFiles = (mockFiles: string[]) =>
tool({
description:
"List all files and directories in the specified directory path.",
inputSchema: z.object({
directory: z
.string()
.describe("The directory path to list contents of")
.default("."),
}),
execute: async ({ directory }: { directory: string }) =>
mockFiles.join("\n"),
});
/**
* Create a mock deleteFile tool that returns a success message
*/
export const createMockDeleteFile = (mockResponse?: string) =>
tool({
description:
"Delete a file at the specified path. Use with caution as this is irreversible.",
inputSchema: z.object({
path: z.string().describe("The path to the file to delete"),
}),
execute: async ({ path }: { path: string }) =>
mockResponse ?? `Successfully deleted ${path}`,
});
/**
* Create a mock shell command tool that returns fixed output
*/
export const createMockShell = (mockOutput: string) =>
tool({
description:
"Execute a shell command and return its output. Use this for system operations.",
inputSchema: z.object({
command: z.string().describe("The shell command to execute"),
}),
execute: async ({ command }: { command: string }) => mockOutput,
});
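These helpers are handy when you want to assemble a ToolSet by hand instead of building it from a JSON config. A quick sketch:

import type { ToolSet } from "ai";
import { createMockListFiles, createMockReadFile } from "./mocks/tools.ts";

// A hand-assembled ToolSet with fixed return values
const tools: ToolSet = {
  listFiles: createMockListFiles(["package.json", "src"]),
  readFile: createMockReadFile('{ "name": "agi" }'),
};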
Multi-Turn Types
Add the multi-turn types to evals/types.ts:
/**
* Mock tool configuration for multi-turn evaluations.
* Tools return fixed values for deterministic testing.
*/
export interface MockToolConfig {
/** Tool description shown to the LLM */
description: string;
/** Parameter schema (simplified - all params treated as strings) */
parameters: Record<string, string>;
/** Fixed return value when tool is called */
mockReturn: string;
}
/**
* Input data for multi-turn agent evaluations.
* Supports both fresh conversations and mid-conversation scenarios.
*/
export interface MultiTurnEvalData {
/** User prompt for fresh conversation (use this OR messages, not both) */
prompt?: string;
/** Pre-filled message history for mid-conversation testing */
messages?: ModelMessage[];
/** Mocked tools with fixed return values */
mockTools: Record<string, MockToolConfig>;
/** Configuration for the agent run */
config?: {
model?: string;
maxSteps?: number;
};
}
/**
* Target expectations for multi-turn evaluations
*/
export interface MultiTurnTarget {
/** Original task description for LLM judge context */
originalTask: string;
/** Expected tools in order (for tool ordering evaluation) */
expectedToolOrder?: string[];
/** Tools that must NOT be called */
forbiddenTools?: string[];
/** Mock tool results for LLM judge context */
mockToolResults: Record<string, string>;
/** Category for grouping */
category: "task-completion" | "conversation-continuation" | "negative";
}
/**
* Result from multi-turn executor
*/
export interface MultiTurnResult {
/** Final text response from the agent */
text: string;
/** All steps taken during the agent loop */
steps: Array<{
toolCalls?: Array<{ toolName: string; args: unknown }>;
toolResults?: Array<{ toolName: string; result: unknown }>;
text?: string;
}>;
/** Unique tool names used during the run */
toolsUsed: string[];
/** All tool calls in order */
toolCallOrder: string[];
}
Note that MultiTurnEvalData supports two modes:
- prompt: a fresh conversation, the most common case
- messages: a pre-filled conversation history, for testing mid-conversation behavior
Multi-Turn Executor
Add the multi-turn executor to evals/executors.ts:
/**
* Multi-turn executor with mocked tools.
* Runs a complete agent loop with tools returning fixed values.
*/
export async function multiTurnWithMocks(
data: MultiTurnEvalData,
): Promise<MultiTurnResult> {
const tools = buildMockedTools(data.mockTools);
// Build messages from either prompt or pre-filled history
const messages: ModelMessage[] = data.messages ?? [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: data.prompt! },
];
const result = await generateText({
model: provider.chat(
data.config?.model ??
process.env.LLM_MODEL ??
"qwen3.5-flash-2026-02-23",
),
messages,
tools,
stopWhen: stepCountIs(data.config?.maxSteps ?? 20),
});
// Extract all tool calls in order from steps
const allToolCalls: string[] = [];
const steps = result.steps.map((step) => {
const stepToolCalls = (step.toolCalls ?? []).map((tc) => {
allToolCalls.push(tc.toolName);
return {
toolName: tc.toolName,
args: "args" in tc ? tc.args : {},
};
});
const stepToolResults = (step.toolResults ?? []).map((tr) => ({
toolName: tr.toolName,
result: "result" in tr ? tr.result : tr,
}));
return {
toolCalls: stepToolCalls.length > 0 ? stepToolCalls : undefined,
toolResults: stepToolResults.length > 0 ? stepToolResults : undefined,
text: step.text || undefined,
};
});
// Extract unique tools used
const toolsUsed = [...new Set(allToolCalls)];
return {
text: result.text,
steps,
toolsUsed,
toolCallOrder: allToolCalls,
};
}
The key difference from singleTurnExecutor: this executor uses stopWhen: stepCountIs(20) instead of stepCountIs(1). That allows the agent to run for up to 20 steps, tool calls and responses included, which is enough to cover complex tasks.
The executor uses generateText() rather than streamText(), because evals don't need streaming, only the final result. When given tools, the AI SDK's generateText() runs the tool → result → next step loop internally and automatically.
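To sanity-check the executor outside the eval harness, you can call it directly. A hypothetical smoke test, assuming the same .env setup the rest of the suite uses:

// Run the executor with a single mocked tool and inspect the result.
const result = await multiTurnWithMocks({
  prompt: "List the files in the current directory",
  mockTools: {
    listFiles: {
      description:
        "List all files and directories in the specified directory path.",
      parameters: { directory: "The directory to list" },
      mockReturn: "[file] package.json\n[dir] src",
    },
  },
});
console.log(result.toolCallOrder); // e.g. ["listFiles"]
console.log(result.text); // the agent's final answer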
New Evaluators
We need evaluators that understand multi-turn behavior. Add the following to evals/evaluators.ts:
/**
* Evaluator: Check if tools were called in the expected order.
* Returns the fraction of expected tools found in sequence.
* Order matters but tools don't need to be consecutive.
*/
export function toolOrderCorrect(
output: MultiTurnResult,
target: MultiTurnTarget,
): number {
if (!target.expectedToolOrder?.length) return 1;
const actualOrder = output.toolCallOrder;
// Check if expected tools appear in order (not necessarily consecutive)
let expectedIdx = 0;
for (const toolName of actualOrder) {
if (toolName === target.expectedToolOrder[expectedIdx]) {
expectedIdx++;
if (expectedIdx === target.expectedToolOrder.length) break;
}
}
return expectedIdx / target.expectedToolOrder.length;
}
This evaluator checks subsequence ordering. If we expect [listFiles, readFile, writeFile] and the actual order is [listFiles, readFile, readFile, writeFile], the score is still 1.0: the expected tools appeared in order, even with an extra readFile in between.
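The eval file at the end of this chapter also imports a toolsAvoided evaluator that checks the forbiddenTools list. It isn't shown elsewhere, so here is a minimal sketch, scoring the fraction of forbidden tools the agent avoided:

/**
 * Evaluator: Check that forbidden tools were never called.
 * Returns 1 when no forbidden tools were used.
 */
export function toolsAvoided(
  output: MultiTurnResult,
  target: MultiTurnTarget,
): number {
  if (!target.forbiddenTools?.length) return 1;
  const used = new Set(output.toolsUsed);
  const avoided = target.forbiddenTools.filter((name) => !used.has(name));
  return avoided.length / target.forbiddenTools.length;
}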
LLM-as-Judge
The most powerful evaluator uses another LLM to judge output quality:
import { generateObject } from "ai";
import { createOpenAI } from "@ai-sdk/openai";
import { z } from "zod";
const apiKey = process.env.LLM_API_KEY;
if (!apiKey) {
throw new Error("Missing LLM_API_KEY in .env");
}
const provider = createOpenAI({
apiKey,
baseURL: process.env.LLM_BASE_URL,
});
const judgeSchema = z.object({
score: z
.number()
.min(1)
.max(10)
.describe("Score from 1-10 where 10 is perfect"),
reason: z.string().describe("Brief explanation for the score"),
});
/**
* Evaluator: LLM-as-judge for output quality.
* Uses structured output to reliably assess if the agent's response is correct.
* Returns a score from 0-1 (internally uses 1-10 scale divided by 10).
*/
export async function llmJudge(
output: MultiTurnResult,
target: MultiTurnTarget,
): Promise<number> {
const result = await generateObject({
model: provider.chat(
process.env.LLM_JUDGE_MODEL ??
process.env.LLM_MODEL ??
"qwen3.5-flash-2026-02-23",
),
schema: judgeSchema,
schemaName: "evaluation",
schemaDescription: "Evaluation of an AI agent response",
messages: [
{
role: "system",
content: `You are an evaluation judge. Score the agent's response on a scale of 1-10.
Scoring criteria:
- 10: Response fully addresses the task using tool results correctly
- 7-9: Response is mostly correct with minor issues
- 4-6: Response partially addresses the task
- 1-3: Response is mostly incorrect or irrelevant`,
},
{
role: "user",
content: `Task: ${target.originalTask}
Tools called: ${JSON.stringify(output.toolCallOrder)}
Tool results provided: ${JSON.stringify(target.mockToolResults)}
Agent's final response:
${output.text}
Evaluate if this response correctly uses the tool results to answer the task.`,
},
],
});
// Convert 1-10 score to 0-1 range
return result.object.score / 10;
}
The LLM judge:
- Receives the original task, the tools that were called, and the mock results
- Reads the agent's final response
- Returns a structured score on a 1-10 scale, along with a reason
- Uses generateObject() with a Zod schema to guarantee valid output
If you have access to a stronger OpenAI-compatible model, set LLM_JUDGE_MODEL to use it as the judge. Ideally the judge model should be at least as capable as the model under test; otherwise you can reuse the same LLM_MODEL, but treat the judge score as a supporting signal rather than ground truth.
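For reference, the environment variables the judge reads (the names come from the code above; the values here are placeholders):

# .env
LLM_API_KEY=sk-...
LLM_BASE_URL=https://your-provider.example/v1
LLM_MODEL=qwen3.5-flash-2026-02-23
LLM_JUDGE_MODEL=your-stronger-judge-model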
Test Data
Create evals/data/agent-multiturn.json:
[
{
"data": {
"prompt": "List the files in the current directory, then read the contents of package.json",
"mockTools": {
"listFiles": {
"description": "List all files and directories in the specified directory path.",
"parameters": { "directory": "The directory to list" },
"mockReturn": "[file] package.json\n[file] tsconfig.json\n[dir] src\n[dir] node_modules"
},
"readFile": {
"description": "Read the contents of a file at the specified path.",
"parameters": { "path": "The path to the file to read" },
"mockReturn": "{ \"name\": \"agi\", \"version\": \"1.0.0\" }"
}
}
},
"target": {
"originalTask": "List files and read package.json",
"expectedToolOrder": ["listFiles", "readFile"],
"mockToolResults": {
"listFiles": "[file] package.json\n[file] tsconfig.json\n[dir] src\n[dir] node_modules",
"readFile": "{ \"name\": \"agi\", \"version\": \"1.0.0\" }"
},
"category": "task-completion"
},
"metadata": {
"description": "Two-step file exploration task"
}
},
{
"data": {
"prompt": "What is 2 + 2?",
"mockTools": {
"readFile": {
"description": "Read the contents of a file at the specified path.",
"parameters": { "path": "The path to the file to read" },
"mockReturn": "file contents"
},
"runCommand": {
"description": "Execute a shell command and return its output.",
"parameters": { "command": "The command to execute" },
"mockReturn": "command output"
}
}
},
"target": {
"originalTask": "Answer a simple math question without using tools",
"forbiddenTools": ["readFile", "runCommand"],
"mockToolResults": {},
"category": "negative"
},
"metadata": {
"description": "Simple question should not trigger any tool use"
}
}
]
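The category union also includes "conversation-continuation", which isn't represented above. A hypothetical third entry using the messages mode might look like this (plain-text history for simplicity; real entries could also include earlier tool calls):

{
  "data": {
    "messages": [
      { "role": "system", "content": "You are a helpful coding agent." },
      { "role": "user", "content": "Read package.json" },
      { "role": "assistant", "content": "package.json declares the project \"agi\" at version 1.0.0." },
      { "role": "user", "content": "Now check tsconfig.json too" }
    ],
    "mockTools": {
      "readFile": {
        "description": "Read the contents of a file at the specified path.",
        "parameters": { "path": "The path to the file to read" },
        "mockReturn": "{ \"compilerOptions\": { \"strict\": true } }"
      }
    }
  },
  "target": {
    "originalTask": "Continue the conversation by reading tsconfig.json",
    "expectedToolOrder": ["readFile"],
    "mockToolResults": {
      "readFile": "{ \"compilerOptions\": { \"strict\": true } }"
    },
    "category": "conversation-continuation"
  },
  "metadata": {
    "description": "Mid-conversation follow-up request"
  }
}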
Running Multi-Turn Evals
Create evals/agent-multiturn.eval.ts:
import { evaluate } from "@lmnr-ai/lmnr";
import { toolOrderCorrect, toolsAvoided, llmJudge } from "./evaluators.ts";
import type {
MultiTurnEvalData,
MultiTurnTarget,
MultiTurnResult,
} from "./types.ts";
import dataset from "./data/agent-multiturn.json" with { type: "json" };
import { multiTurnWithMocks } from "./executors.ts";
// Executor that runs multi-turn agent with mocked tools
const executor = async (data: MultiTurnEvalData): Promise<MultiTurnResult> => {
return multiTurnWithMocks(data);
};
// Run the evaluation
evaluate({
data: dataset as unknown as Array<{
data: MultiTurnEvalData;
target: MultiTurnTarget;
}>,
executor,
evaluators: {
// Check if tools were called in the expected order
toolOrder: (output, target) => {
if (!target) return 1;
return toolOrderCorrect(output, target);
},
// Check if forbidden tools were avoided
toolsAvoided: (output, target) => {
if (!target?.forbiddenTools?.length) return 1;
return toolsAvoided(output, target);
},
// LLM judge to evaluate output quality
outputQuality: async (output, target) => {
if (!target) return 1;
return llmJudge(output, target);
},
},
config: {
projectApiKey: process.env.LMNR_API_KEY,
},
groupName: "agent-multiturn",
});
Run it (the script was added back in Chapter 1):
npm run eval:agent
Summary
In this chapter you:
- Built multi-turn evaluations that exercise the full agent loop
- Created mocked tools for deterministic, side-effect-free testing
- Implemented tool-order evaluation via subsequence matching
- Built an LLM-as-judge evaluator to score output quality
- Learned why a stronger model makes a better judge
You now have a complete evaluation framework: single-turn evals test tool selection, multi-turn evals test end-to-end behavior. In the next chapter, we'll extend the agent with file system tools.
Next: Chapter 6: File System Tools →