Documentation Index Fetch the complete documentation index at: https://nvd-54.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
评估(“evals”)通过评估智能体的执行轨迹(它产生的消息和工具调用序列)来衡量其表现。与验证基本正确性的集成测试 不同,评估根据参考或评分标准对智能体行为进行评分,使其在更改提示词、工具或模型时能够有效捕获回归。
评估器是一个函数,它接受智能体输出(以及可选的参考输出)并返回一个分数:
def evaluator ( * , outputs : dict , reference_outputs : dict ):
output_messages = outputs [ " messages " ]
reference_messages = reference_outputs [ " messages " ]
score = compare_messages ( output_messages , reference_messages )
return { "key" : "evaluator_score" , "score" : score }
agentevals 包提供了用于智能体轨迹的预构建评估器。你可以通过执行轨迹匹配 (确定性比较)或使用 LLM 评判者 (定性评估)来进行评估:
方法 何时使用 轨迹匹配 你知道预期的工具调用,并希望快速、确定性、零成本的检查 LLM 作为评判者 你想要评估整体质量和推理,无需严格的预期
安装 AgentEvals
或者,直接克隆 AgentEvals 仓库 。
轨迹匹配评估器
AgentEvals 提供 create_trajectory_match_evaluator 函数来将你的智能体轨迹与参考进行匹配。有四种模式:
模式 描述 用例 strict精确匹配消息结构和工具调用(相同顺序,消息内容可不同) 测试特定序列(例如,授权前先查询策略) unordered与参考相同的消息结构和工具调用,但工具调用可以是任意顺序 验证信息检索(顺序无关紧要时) subset智能体只调用参考中的工具(无额外调用) 确保智能体不超出预期范围 superset智能体至少调用参考中的工具(允许额外调用) 验证最低要求的操作已被执行
以下示例共用一个通用设置——一个带有 get_weather 工具的智能体:
from langchain . agents import create_agent
from langchain . tools import tool
from langchain . messages import HumanMessage , AIMessage , ToolMessage
from agentevals . trajectory . match import create_trajectory_match_evaluator
@tool
def get_weather ( city : str ):
"""Get weather information for a city."""
return f "It's 75 degrees and sunny in { city } ."
agent = create_agent ( "claude-sonnet-4-6" , tools = [ get_weather ])
strict 模式确保轨迹包含相同顺序的相同消息和相同工具调用,但允许消息内容不同。当你需要强制执行特定操作序列时(例如在授权操作前要求先进行策略查询),这很有用。evaluator = create_trajectory_match_evaluator (
trajectory_match_mode = "strict" ,
)
def test_weather_tool_called_strict ():
result = agent . invoke ({
"messages" : [ HumanMessage ( content = "What's the weather in San Francisco?" )]
})
reference_trajectory = [
HumanMessage ( content = "What's the weather in San Francisco?" ),
AIMessage ( content = "" , tool_calls = [
{ "id" : "call_1" , "name" : "get_weather" , "args" : { "city" : "San Francisco" }}
]),
ToolMessage ( content = "It's 75 degrees and sunny in San Francisco." , tool_call_id = "call_1" ),
AIMessage ( content = "The weather in San Francisco is 75 degrees and sunny." ),
]
evaluation = evaluator (
outputs = result [ " messages " ],
reference_outputs = reference_trajectory
)
# {
# 'key': 'trajectory_strict_match',
# 'score': True,
# 'comment': None,
# }
assert evaluation [ " score " ] is True
unordered 模式允许相同的工具调用以任意顺序出现。当你想验证特定信息已被检索但不关心顺序时,这很有帮助。例如,一个使用不同工具调用检查城市天气和活动的智能体。@tool
def get_events ( city : str ):
"""Get events happening in a city."""
return f "Concert at the park in { city } tonight."
agent = create_agent ( "claude-sonnet-4-6" , tools = [ get_weather , get_events ])
evaluator = create_trajectory_match_evaluator (
trajectory_match_mode = "unordered" ,
)
def test_multiple_tools_any_order ():
result = agent . invoke ({
"messages" : [ HumanMessage ( content = "What's happening in SF today?" )]
})
reference_trajectory = [
HumanMessage ( content = "What's happening in SF today?" ),
AIMessage ( content = "" , tool_calls = [
{ "id" : "call_1" , "name" : "get_events" , "args" : { "city" : "SF" }},
{ "id" : "call_2" , "name" : "get_weather" , "args" : { "city" : "SF" }},
]),
ToolMessage ( content = "Concert at the park in SF tonight." , tool_call_id = "call_1" ),
ToolMessage ( content = "It's 75 degrees and sunny in SF." , tool_call_id = "call_2" ),
AIMessage ( content = "Today in SF: 75 degrees and sunny with a concert at the park tonight." ),
]
evaluation = evaluator (
outputs = result [ " messages " ],
reference_outputs = reference_trajectory ,
)
assert evaluation [ " score " ] is True
superset 和 subset 模式匹配部分轨迹。superset 模式验证智能体至少调用了参考轨迹中的工具,允许额外的工具调用。subset 模式确保智能体没有调用参考之外的任何工具。@tool
def get_detailed_forecast ( city : str ):
"""Get detailed weather forecast for a city."""
return f "Detailed forecast for { city } : sunny all week."
agent = create_agent ( "claude-sonnet-4-6" , tools = [ get_weather , get_detailed_forecast ])
evaluator = create_trajectory_match_evaluator (
trajectory_match_mode = "superset" ,
)
def test_agent_calls_required_tools_plus_extra ():
result = agent . invoke ({
"messages" : [ HumanMessage ( content = "What's the weather in Boston?" )]
})
# 参考只要求 get_weather,但智能体可能调用额外的工具
reference_trajectory = [
HumanMessage ( content = "What's the weather in Boston?" ),
AIMessage ( content = "" , tool_calls = [
{ "id" : "call_1" , "name" : "get_weather" , "args" : { "city" : "Boston" }},
]),
ToolMessage ( content = "It's 75 degrees and sunny in Boston." , tool_call_id = "call_1" ),
AIMessage ( content = "The weather in Boston is 75 degrees and sunny." ),
]
evaluation = evaluator (
outputs = result [ " messages " ],
reference_outputs = reference_trajectory ,
)
assert evaluation [ " score " ] is True
你还可以设置 tool_args_match_mode 属性和/或 tool_args_match_overrides 来自定义评估器如何考虑实际轨迹与参考中工具调用之间的等价性。默认情况下,只有具有相同参数的相同工具的调用才被视为相等。访问仓库 了解更多详情。
LLM 作为评判者评估器
你可以使用 LLM 通过 create_trajectory_llm_as_judge 函数来评估智能体的执行路径。与轨迹匹配评估器不同,它不需要参考轨迹,但如果可用的话可以提供。
from agentevals . trajectory . llm import create_trajectory_llm_as_judge , TRAJECTORY_ACCURACY_PROMPT
evaluator = create_trajectory_llm_as_judge (
model = "openai:o3-mini" ,
prompt = TRAJECTORY_ACCURACY_PROMPT ,
)
def test_trajectory_quality ():
result = agent . invoke ({
"messages" : [ HumanMessage ( content = "What's the weather in Seattle?" )]
})
evaluation = evaluator (
outputs = result [ " messages " ],
)
assert evaluation [ " score " ] is True
如果你有参考轨迹,使用预构建的 TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE 提示词: from agentevals . trajectory . llm import create_trajectory_llm_as_judge , TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE
evaluator = create_trajectory_llm_as_judge (
model = "openai:o3-mini" ,
prompt = TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE ,
)
evaluation = evaluator (
outputs = result [ " messages " ],
reference_outputs = reference_trajectory ,
)
要了解 LLM 如何评估轨迹的更多可配置选项,请访问仓库 。
异步支持
所有 agentevals 评估器都支持 Python asyncio。异步版本通过在函数名中 create_ 后添加 async 来获取。
from agentevals . trajectory . llm import create_async_trajectory_llm_as_judge , TRAJECTORY_ACCURACY_PROMPT
from agentevals . trajectory . match import create_async_trajectory_match_evaluator
async_judge = create_async_trajectory_llm_as_judge (
model = "openai:o3-mini" ,
prompt = TRAJECTORY_ACCURACY_PROMPT ,
)
async_evaluator = create_async_trajectory_match_evaluator (
trajectory_match_mode = "strict" ,
)
async def test_async_evaluation ():
result = await agent . ainvoke ({
"messages" : [ HumanMessage ( content = "What's the weather?" )]
})
evaluation = await async_judge ( outputs = result [ " messages " ])
assert evaluation [ " score " ] is True
在 LangSmith 中运行评估
要跟踪随时间变化的实验,将评估器结果记录到 LangSmith 。首先,设置所需的环境变量:
export LANGSMITH_API_KEY = "your_langsmith_api_key"
export LANGSMITH_TRACING = "true"
LangSmith 提供两种主要的评估运行方式:pytest 集成和 evaluate 函数。
import pytest
from langsmith import testing as t
from agentevals . trajectory . llm import create_trajectory_llm_as_judge , TRAJECTORY_ACCURACY_PROMPT
trajectory_evaluator = create_trajectory_llm_as_judge (
model = "openai:o3-mini" ,
prompt = TRAJECTORY_ACCURACY_PROMPT ,
)
@pytest . mark . langsmith
def test_trajectory_accuracy ():
result = agent . invoke ({
"messages" : [ HumanMessage ( content = "What's the weather in SF?" )]
})
reference_trajectory = [
HumanMessage ( content = "What's the weather in SF?" ),
AIMessage ( content = "" , tool_calls = [
{ "id" : "call_1" , "name" : "get_weather" , "args" : { "city" : "SF" }},
]),
ToolMessage ( content = "It's 75 degrees and sunny in SF." , tool_call_id = "call_1" ),
AIMessage ( content = "The weather in SF is 75 degrees and sunny." ),
]
t . log_inputs ({})
t . log_outputs ({ "messages" : result [ " messages " ]})
t . log_reference_outputs ({ "messages" : reference_trajectory })
trajectory_evaluator (
outputs = result [ " messages " ],
reference_outputs = reference_trajectory
)
使用 pytest 运行评估: pytest test_trajectory.py --langsmith-output
创建一个 LangSmith 数据集 并使用 evaluate 函数。数据集必须具有以下模式:
input :{"messages": [...]} 用于调用智能体的输入消息。
output :{"messages": [...]} 智能体输出中的预期消息历史。对于轨迹评估,你可以选择只保留 assistant 消息。
from langsmith import Client
from agentevals . trajectory . llm import create_trajectory_llm_as_judge , TRAJECTORY_ACCURACY_PROMPT
client = Client ()
trajectory_evaluator = create_trajectory_llm_as_judge (
model = "openai:o3-mini" ,
prompt = TRAJECTORY_ACCURACY_PROMPT ,
)
def run_agent ( inputs ):
return agent . invoke ( inputs )[ "messages" ]
experiment_results = client . evaluate (
run_agent ,
data = "your_dataset_name" ,
evaluators = [ trajectory_evaluator ]
)
连接这些文档 到 Claude、VSCode 等,通过 MCP 获取实时答案。