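Evaluations ("evals") measure an agent's performance by assessing its execution trajectory, that is, the sequence of messages and tool calls it produces. Unlike integration tests, which verify basic correctness, evals score agent behavior against a reference or a rubric, which makes them useful for catching regressions when you change prompts, tools, or models.
An evaluator is a function that takes the agent's outputs (and, optionally, reference outputs) and returns a score: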
function evaluator({ outputs, referenceOutputs }: {
  outputs: Record<string, any>;
  referenceOutputs: Record<string, any>;
}) {
  const outputMessages = outputs.messages;
  const referenceMessages = referenceOutputs.messages;
  // compareMessages is a placeholder for your own comparison logic.
  const score = compareMessages(outputMessages, referenceMessages);
  return { key: "evaluator_score", score: score };
}
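The evaluator above leaves compareMessages undefined. The following is a minimal sketch of one way such a helper could work, not the agentevals implementation. It assumes a simplified, hypothetical message shape (SimpleMessage) and an assumed scoring rule: the score is 1 when both trajectories call the same tools in the same order, and 0 otherwise.

```typescript
// Hypothetical message shape for illustration; real LangChain messages carry more fields.
type ToolCall = { name: string; args: Record<string, unknown> };
type SimpleMessage = { role: string; content: string; tool_calls?: ToolCall[] };

// Collect tool-call names in the order they appear across the trajectory.
function toolCallNames(messages: SimpleMessage[]): string[] {
  const names: string[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls || []) {
      names.push(call.name);
    }
  }
  return names;
}

// Assumed rule: 1 if both trajectories call the same tools in the same order, else 0.
function compareMessages(
  outputMessages: SimpleMessage[],
  referenceMessages: SimpleMessage[]
): number {
  const actual = toolCallNames(outputMessages);
  const expected = toolCallNames(referenceMessages);
  if (actual.length !== expected.length) return 0;
  return actual.every((toolName, i) => toolName === expected[i]) ? 1 : 0;
}

const sampleReference: SimpleMessage[] = [
  { role: "user", content: "What's the weather in SF?" },
  { role: "assistant", content: "", tool_calls: [{ name: "get_weather", args: { city: "SF" } }] },
  { role: "assistant", content: "It is sunny." },
];

// Same tool call, different wording in the final answer.
const sampleRun: SimpleMessage[] = [
  { role: "user", content: "What's the weather in SF?" },
  { role: "assistant", content: "", tool_calls: [{ name: "get_weather", args: { city: "SF" } }] },
  { role: "assistant", content: "Sunny and 75 degrees." },
];

const sampleScore = compareMessages(sampleRun, sampleReference);
```

This rule ignores message content and tool arguments on purpose. A stricter helper could also compare arguments, at the cost of more brittle tests.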
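The agentevals package provides prebuilt evaluators for agent trajectories. You can evaluate with a trajectory match (a deterministic comparison) or with an LLM judge (a qualitative assessment):
| Approach | When to use |
| --- | --- |
| Trajectory match | You know the expected tool calls and want a fast, deterministic, no-cost check |
| LLM judge | You want to assess overall quality and reasoning without strict expectations |
Install AgentEvals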
npm install agentevals @langchain/core
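Alternatively, clone the AgentEvals repository directly.
Trajectory match evaluator
AgentEvals provides the createTrajectoryMatchEvaluator function to match your agent's trajectory against a reference. There are four modes:
| Mode | Description | Use case |
| --- | --- | --- |
| strict | Exact match of message structure and tool calls in the same order (message content may differ) | Testing a specific sequence (for example, looking up a policy before authorizing) |
| unordered | Same message structure and tool calls as the reference, but the tool calls may occur in any order | Verifying information retrieval when order does not matter |
| subset | The agent calls only tools that appear in the reference (no extra tools allowed) | Ensuring the agent stays within the expected scope |
| superset | The agent calls at least the tools in the reference (extra tools allowed) | Verifying that the minimum required actions were taken |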
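To make the difference between the four modes concrete, here is a simplified sketch that compares only lists of tool names. It is an illustration under stated assumptions, not how agentevals is implemented: the matchesMode function and MatchMode type are hypothetical, tool arguments and message structure are ignored, and subset/superset treat the lists as sets (duplicates are not counted).

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// True when both lists have the same items at the same positions.
function sameSequence(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((item, i) => item === b[i]);
}

// Hypothetical helper: compare the agent's tool names against the reference.
function matchesMode(mode: MatchMode, actual: string[], expected: string[]): boolean {
  if (mode === "strict") {
    // Same tools, same order.
    return sameSequence(actual, expected);
  }
  if (mode === "unordered") {
    // Same tools, any order: compare sorted copies.
    return sameSequence(actual.slice().sort(), expected.slice().sort());
  }
  if (mode === "subset") {
    // Every tool the agent called must appear in the reference.
    return actual.every((tool) => expected.indexOf(tool) !== -1);
  }
  // superset: every tool in the reference must have been called by the agent.
  return expected.every((tool) => actual.indexOf(tool) !== -1);
}

const expectedTools = ["get_events", "get_weather"];
const swappedTools = ["get_weather", "get_events"];

const strictOnSwapped = matchesMode("strict", swappedTools, expectedTools);
const unorderedOnSwapped = matchesMode("unordered", swappedTools, expectedTools);

// The agent made one extra call beyond the single required tool.
const supersetWithExtra = matchesMode("superset", ["get_weather", "get_detailed_forecast"], ["get_weather"]);
const subsetWithExtra = matchesMode("subset", ["get_weather", "get_detailed_forecast"], ["get_weather"]);
```

In this sketch the swapped order fails strict but passes unordered, and the extra call passes superset but fails subset.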
The following examples share a common setup: an agent with a get_weather tool.
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";
const getWeather = tool(
  async ({ city }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);
const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather],
});
strict mode ensures that the trajectory contains the same messages and the same tool calls in the same order, while allowing message content to differ. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.
const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "strict",
});
async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });
  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];
  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  expect(evaluation.score).toBe(true);
}
unordered mode allows the same tool calls to occur in any order. This helps when you want to verify that specific information was retrieved but do not care about the order. One example is an agent that checks a city's weather and events with separate tool calls.
const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);
const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getEvents],
});
const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered",
});
async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")]
  });
  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];
  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}
The superset and subset modes match partial trajectories. superset mode verifies that the agent called at least the tools in the reference trajectory and allows additional tool calls. subset mode ensures that the agent did not call any tool outside the reference.
const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);
const agent = createAgent({
  model: "claude-sonnet-4-6",
  tools: [getWeather, getDetailedForecast],
});
const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "superset",
});
async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });
  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];
  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  expect(evaluation.score).toBe(true);
}
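You can also set the toolArgsMatchMode property, toolArgsMatchOverrides, or both to customize how the evaluator decides whether a tool call in the actual trajectory equals one in the reference. By default, two tool calls are considered equal only if they call the same tool with the same arguments. See the AgentEvals repository for details.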
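As a rough illustration of what customizing tool-argument equality can mean, the sketch below contrasts a default-style exact comparison with a looser, per-tool comparator. Both functions are hypothetical and written only for this example. The sketch assumes that an override is a function that receives the actual and reference arguments and returns a boolean; confirm the exact option shape accepted by toolArgsMatchOverrides in the repository before relying on it.

```typescript
type ToolArgs = Record<string, unknown>;

// Exact comparison: same keys and strictly equal values (shallow only, for brevity).
function exactArgsMatch(actual: ToolArgs, expected: ToolArgs): boolean {
  const actualKeys = Object.keys(actual).sort();
  const expectedKeys = Object.keys(expected).sort();
  if (actualKeys.length !== expectedKeys.length) return false;
  return actualKeys.every(
    (key, i) => key === expectedKeys[i] && actual[key] === expected[key]
  );
}

// Hypothetical looser comparator for get_weather: city names match regardless of case.
function cityArgsMatch(actual: ToolArgs, expected: ToolArgs): boolean {
  const actualCity = actual["city"];
  const expectedCity = expected["city"];
  return (
    typeof actualCity === "string" &&
    typeof expectedCity === "string" &&
    actualCity.toLowerCase() === expectedCity.toLowerCase()
  );
}

const exactOnSameCase = exactArgsMatch({ city: "SF" }, { city: "SF" });
const exactOnDifferentCase = exactArgsMatch({ city: "sf" }, { city: "SF" });
const looseOnDifferentCase = cityArgsMatch({ city: "sf" }, { city: "SF" });
```

A looser comparator like this is useful when the model is free to vary harmless details of an argument, such as letter case, that would otherwise fail an exact match.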
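LLM judge evaluator
You can use an LLM to evaluate the agent's execution path with the createTrajectoryLLMAsJudge function. Unlike the trajectory match evaluator, it does not require a reference trajectory, but you can provide one if it is available.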
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});
async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });
  const evaluation = await evaluator({
    outputs: result.messages,
  });
  expect(evaluation.score).toBe(true);
}
If you have a reference trajectory, use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt:
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";
const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});
// result and referenceTrajectory are defined as in the earlier examples.
const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});
For more ways to configure how the LLM evaluates the trajectory, see the AgentEvals repository.
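The data flow of an LLM judge can be sketched without calling a model. In the example below every name is hypothetical (JudgeMessage, JudgeModel, renderTrajectory, judgeTrajectory), the prompt text is invented for illustration, and the judge is a deterministic stub. The result shape (key, score, comment) and the key "trajectory_accuracy" are assumptions about what such an evaluator typically returns, not a statement of the library's contract.

```typescript
type JudgeMessage = { role: string; content: string; toolCalls?: string[] };
type JudgeVerdict = { key: string; score: boolean; comment: string };
// Stand-in for a model call. A real judge would call an LLM asynchronously.
type JudgeModel = (prompt: string) => { pass: boolean; reasoning: string };

// Flatten the trajectory into a transcript the judge can read.
function renderTrajectory(messages: JudgeMessage[]): string {
  return messages
    .map((m) => {
      const calls =
        m.toolCalls && m.toolCalls.length > 0
          ? " [calls: " + m.toolCalls.join(", ") + "]"
          : "";
      return m.role + ": " + m.content + calls;
    })
    .join("\n");
}

// Build the prompt, ask the judge, and wrap the verdict in an evaluator-style result.
function judgeTrajectory(messages: JudgeMessage[], model: JudgeModel): JudgeVerdict {
  const prompt =
    "Grade whether this agent trajectory is reasonable:\n" + renderTrajectory(messages);
  const verdict = model(prompt);
  return { key: "trajectory_accuracy", score: verdict.pass, comment: verdict.reasoning };
}

// Deterministic stub so the sketch runs offline: pass if any tool was called.
const stubModel: JudgeModel = (prompt) => ({
  pass: prompt.indexOf("[calls:") !== -1,
  reasoning: "stubbed verdict",
});

const seattleTrajectory: JudgeMessage[] = [
  { role: "user", content: "What's the weather in Seattle?" },
  { role: "assistant", content: "", toolCalls: ["get_weather"] },
  { role: "assistant", content: "It is 75 degrees and sunny in Seattle." },
];

const stubVerdict = judgeTrajectory(seattleTrajectory, stubModel);
```

Injecting the model as a function keeps the prompt-building and result-wrapping logic testable without network access or API keys.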
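Run evals in LangSmith
To track experiments over time, log evaluator results to LangSmith. First, set the required environment variables:
export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"
LangSmith offers two main ways to run evals: the Vitest/Jest integration and the evaluate function.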
import * as ls from "langsmith/vitest";
// import * as ls from "langsmith/jest";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});
// agent is the agent created in the shared setup above.
ls.describe("trajectory accuracy", () => {
  ls.test("accurate trajectory", {
    inputs: {
      messages: [
        { role: "user", content: "What is the weather in SF?" }
      ]
    },
    referenceOutputs: {
      messages: [
        new HumanMessage("What is the weather in SF?"),
        new AIMessage({
          content: "",
          tool_calls: [
            { id: "call_1", name: "get_weather", args: { city: "SF" } }
          ]
        }),
        new ToolMessage({
          content: "It's 75 degrees and sunny in SF.",
          tool_call_id: "call_1"
        }),
        new AIMessage("The weather in SF is 75 degrees and sunny."),
      ],
    },
  }, async ({ inputs, referenceOutputs }) => {
    const result = await agent.invoke({
      messages: [new HumanMessage("What is the weather in SF?")]
    });
    ls.logOutputs({ messages: result.messages });
    await trajectoryEvaluator({
      inputs,
      outputs: result.messages,
      referenceOutputs,
    });
  });
});
Run the evaluation with your test runner:
vitest run test_trajectory.eval.ts
# or
jest test_trajectory.eval.ts
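Create a LangSmith dataset and use the evaluate function. The dataset must have the following schema:
input: {"messages": [...]}, the input messages used to call the agent.
output: {"messages": [...]}, the expected message history in the agent's output. For trajectory evaluation, you can choose to keep only the assistant messages.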
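The schema above can be pictured as a plain object per dataset example. The sketch below uses hypothetical types (DatasetMessage, DatasetExample) and a hypothetical helper (keepAssistantMessages) to show one way of building the expected output with only the assistant messages kept. It shows the data shape only and does not call the LangSmith API.

```typescript
// Hypothetical shapes for illustration; the field names follow the schema described above.
type DatasetMessage = { role: "user" | "assistant" | "tool"; content: string };
type DatasetExample = {
  input: { messages: DatasetMessage[] };
  output: { messages: DatasetMessage[] };
};

// Keep only the assistant turns of a recorded message history.
function keepAssistantMessages(messages: DatasetMessage[]): DatasetMessage[] {
  return messages.filter((m) => m.role === "assistant");
}

const userTurn: DatasetMessage = { role: "user", content: "What is the weather in SF?" };

const fullHistory: DatasetMessage[] = [
  userTurn,
  { role: "assistant", content: "" }, // the turn that issues the tool call
  { role: "tool", content: "It's 75 degrees and sunny in SF." },
  { role: "assistant", content: "The weather in SF is 75 degrees and sunny." },
];

const weatherExample: DatasetExample = {
  input: { messages: [userTurn] },
  output: { messages: keepAssistantMessages(fullHistory) },
};
```

Dropping user and tool messages from the expected output keeps the reference focused on what the agent itself decided to do.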
import { evaluate } from "langsmith/evaluation";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});
async function runAgent(inputs: any) {
  const result = await agent.invoke(inputs);
  return result.messages;
}
await evaluate(
  runAgent,
  {
    data: "your_dataset_name",
    evaluators: [trajectoryEvaluator],
  }
);