Code Graders
Code graders are scripts that evaluate agent responses deterministically. Write them in any language — Python, TypeScript, Node, or any executable.
Contract
Code graders receive eval context via stdin JSON and return a result via stdout.
Input (stdin, raw wire format):

    {
      "input": [{ "role": "user", "content": "What is 15 + 27?" }],
      "input_files": [],
      "criteria": "Correctly calculates 15 + 27 = 42",
      "output": "The answer is 42.",
      "answer": "The answer is 42.",
      "expected_output": [{ "role": "assistant", "content": "42" }],
      "messages": [{ "role": "assistant", "content": "The answer is 42." }],
      "trace_summary": {
        "event_count": 1,
        "tool_calls": {},
        "error_count": 0,
        "llm_call_count": 1
      }
    }

Raw grader stdin is a process-boundary wire format, so keys are snake_case. TypeScript and JavaScript graders that use @agentv/eval receive the same payload converted to camelCase.
| Raw stdin key | SDK field | Meaning |
|---|---|---|
| output | output | Final answer / scored result as a string |
| answer | answer | Deprecated alias for the same final answer string |
| messages | messages | Transcript messages for transcript-aware graders |
| expected_output | expectedOutput | Reference answer messages |
| output_path | outputPath | Temp file containing large final answer JSON, when used |
| trace_summary | traceSummary | Lightweight metrics summary |
| token_usage | tokenUsage | Token usage metrics |
| cost_usd | costUsd | Estimated cost in USD |
| duration_ms | durationMs | Total execution duration |
| workspace_path | workspacePath | Temp workspace path, when configured |
Do not treat output as a message array. Use output / answer for answer-text checks, and use messages, trace.messages, or trace.events only when the grader intentionally evaluates transcript or tool behavior.
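The guidance above can be sketched in Python as two small helpers: one for answer-text checks and one for transcript-aware checks. The helper names (`load_payload`, `final_answer_text`, `assistant_messages`) are hypothetical and not part of AgentV, and the demo feeds an in-memory payload rather than real stdin.

```python
import io
import json


def load_payload(stream):
    """Parse the raw snake_case JSON payload from a text stream such as sys.stdin."""
    return json.load(stream)


def final_answer_text(payload):
    """Return the final answer string, preferring `output` over the deprecated `answer`."""
    return payload.get("output") or payload.get("answer") or ""


def assistant_messages(payload):
    """Return assistant transcript messages, for transcript-aware checks only."""
    return [m for m in payload.get("messages") or [] if m.get("role") == "assistant"]


# Demo with an in-memory payload instead of real stdin.
sample = io.StringIO(json.dumps({
    "output": "The answer is 42.",
    "answer": "The answer is 42.",
    "messages": [{"role": "assistant", "content": "The answer is 42."}],
}))
payload = load_payload(sample)
print(final_answer_text(payload))
print(len(assistant_messages(payload)))
```

In a real grader, `load_payload(sys.stdin)` would replace the in-memory stream.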
JSON output (full protocol)
Emit a JSON object for numeric scores or multi-aspect results:

    {
      "score": 1.0,
      "assertions": [
        { "text": "Answer contains correct value (42)", "passed": true }
      ]
    }

| Output Field | Type | Description |
|---|---|---|
| score | number | 0.0 to 1.0 |
| assertions | Array<{ text, passed, evidence? }> | Per-aspect results with verdict and optional evidence |
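A minimal Python sketch of a result builder for this protocol follows. The `build_result` helper is hypothetical, and scoring as the fraction of passed assertions is one possible convention (an assumption here), not something the protocol requires.

```python
import json


def build_result(assertions):
    """Build a result object whose score is the fraction of passed assertions.

    The fraction-passed scoring rule is an assumption for illustration; the
    protocol itself only requires a score between 0.0 and 1.0.
    """
    cleaned = []
    for a in assertions:
        item = {"text": str(a["text"]), "passed": bool(a["passed"])}
        # `evidence` is optional, so include it only when provided.
        if a.get("evidence") is not None:
            item["evidence"] = str(a["evidence"])
        cleaned.append(item)
    passed = sum(1 for a in cleaned if a["passed"])
    score = passed / len(cleaned) if cleaned else 0.0
    return {"score": score, "assertions": cleaned}


result = build_result([
    {"text": "Answer contains correct value (42)", "passed": True},
    {"text": "Answer shows its working", "passed": False, "evidence": "no steps found"},
])
print(json.dumps(result))
```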
Plain-text output (exit-code convention)
For simple pass/fail checks, skip the JSON protocol entirely. The exit code determines the score and stdout becomes the assertion text:
| Exit code | Score | Verdict |
|---|---|---|
| 0 | 1.0 | pass |
| non-zero (no stderr) | 0.0 | fail |
    #!/bin/bash
    # check-pages.sh: passes when the PDF has at least 5 pages
    pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
    if [ "$pages" -ge 5 ]; then
      echo "PDF has $pages pages (≥5 required)"
    else
      echo "PDF has only $pages pages (<5 required)"
      exit 1
    fi

Reference it from the eval file:

    assertions:
      - type: code-grader
        command: [bash, scripts/check-pages.sh]

Silent one-liners work too; stdout is optional:

    assertions:
      - type: code-grader
        command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]

Scripts that write to stderr and exit non-zero surface as execution errors rather than quality failures.
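The same exit-code convention can be followed from Python. The sketch below uses a hypothetical `check_min_lines` helper that returns the exit code and the stdout message as a pair, so the logic can be exercised without terminating the process.

```python
def check_min_lines(text, minimum):
    """Return (exit_code, message) for a minimum-line-count check.

    Exit code 0 maps to a pass and non-zero to a fail under the
    exit-code convention; the message is what the script would print.
    """
    count = len(text.splitlines())
    if count >= minimum:
        return 0, f"Output has {count} lines (>= {minimum} required)"
    return 1, f"Output has only {count} lines (< {minimum} required)"


code, message = check_min_lines("a\nb\nc\n", 2)
print(code, message)
```

In a real script, print `message` to stdout and finish with `sys.exit(code)`. Keep stderr empty on a quality failure, since a non-zero exit combined with stderr output is reported as an execution error.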
Python Example
    import json, sys

    data = json.load(sys.stdin)
    output_text = data.get("output") or data.get("answer") or ""

    assertions = []

    if "42" in output_text:
        assertions.append({"text": "Output contains correct value (42)", "passed": True})
    else:
        assertions.append({"text": "Output does not contain expected value (42)", "passed": False})

    passed = sum(1 for a in assertions if a["passed"])
    score = passed / len(assertions) if assertions else 0.0

    print(json.dumps({
        "score": score,
        "assertions": assertions,
    }))

TypeScript Example
    import { readFileSync } from "fs";

    const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
    const outputText: string = data.output ?? data.answer ?? "";

    const assertions: Array<{ text: string; passed: boolean }> = [];

    if (outputText.includes("42")) {
      assertions.push({ text: "Output contains correct value (42)", passed: true });
    } else {
      assertions.push({ text: "Output does not contain expected value (42)", passed: false });
    }

    const passed = assertions.filter(a => a.passed).length;

    console.log(JSON.stringify({
      score: passed > 0 ? 1.0 : 0.0,
      assertions,
      reasoning: `Passed ${passed} check(s)`,
    }));

Referencing in Eval Files
    assertions:
      - name: my_validator
        type: code-grader
        command: [./validators/check_answer.py]

@agentv/eval SDK
The @agentv/eval package provides a declarative API with automatic stdin/stdout handling. Use defineCodeGrader to skip boilerplate:

    #!/usr/bin/env bun
    import { defineCodeGrader } from '@agentv/eval';

    export default defineCodeGrader(({ output, criteria }) => {
      const outputText = output ?? '';
      const assertions: Array<{ text: string; passed: boolean }> = [];

      if (outputText.includes(criteria)) {
        assertions.push({ text: 'Output matches expected outcome', passed: true });
      } else {
        assertions.push({ text: 'Output does not match expected outcome', passed: false });
      }

      const passed = assertions.filter(a => a.passed).length;
      return {
        score: assertions.length === 0 ? 0 : passed / assertions.length,
        assertions,
      };
    });

SDK exports: defineCodeGrader, Message, ToolCall, Trace, TraceSummary, CodeGraderInput, CodeGraderResult
Target Access
Code graders can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).
Configuration
Add a target block to the grader config:

    assertions:
      - name: contextual-precision
        type: code-grader
        command: [bun, scripts/contextual-precision.ts]
        target:
          max_calls: 10  # Default: 50

Use createTargetClient from the SDK:
    #!/usr/bin/env bun
    import { createTargetClient, defineCodeGrader } from '@agentv/eval';

    export default defineCodeGrader(async ({ input, output }) => {
      const inputText = input
        .filter((message) => message.role === 'user')
        .map((message) => typeof message.content === 'string' ? message.content : '')
        .join('\n');
      const outputText = output ?? '';
      const target = createTargetClient();
      if (!target) return { score: 0, assertions: [{ text: 'Target not configured', passed: false }] };

      const response = await target.invoke({
        question: `Is this relevant to: ${inputText}? Response: ${outputText}`,
        systemPrompt: 'Respond with JSON: { "relevant": true/false }'
      });

      const result = JSON.parse(response.rawText ?? '{}');
      return { score: result.relevant ? 1.0 : 0.0 };
    });

Use target.invokeBatch(requests) for multiple calls in parallel.
Environment variables (set automatically when target is configured):
| Variable | Description |
|---|---|
| AGENTV_TARGET_PROXY_URL | Local proxy URL |
| AGENTV_TARGET_PROXY_TOKEN | Bearer token for authentication |
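A non-SDK grader could read these variables itself. The sketch below stops at configuration, because the request paths and bodies the proxy accepts are not described here. The `target_proxy_settings` helper is hypothetical, the `Authorization: Bearer` header shape is an assumption based on the "Bearer token" description, and the URL and token in the demo are made-up values.

```python
import os


def target_proxy_settings(env=None):
    """Return the proxy URL and auth header, or None when no target is configured."""
    env = os.environ if env is None else env
    url = env.get("AGENTV_TARGET_PROXY_URL")
    token = env.get("AGENTV_TARGET_PROXY_TOKEN")
    if not url or not token:
        return None
    # Assumption: the bearer token travels in a standard Authorization header.
    return {"url": url, "headers": {"Authorization": f"Bearer {token}"}}


# Demo with made-up values; a real run would read os.environ.
settings = target_proxy_settings({
    "AGENTV_TARGET_PROXY_URL": "http://127.0.0.1:4000",
    "AGENTV_TARGET_PROXY_TOKEN": "example-token",
})
print(settings["headers"]["Authorization"])
```

Returning `None` when either variable is missing mirrors the "Target not configured" branch in the SDK example.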
Advanced Input Fields
Beyond the basic fields (input, output, expected_output, criteria), code graders receive additional structured context:
| Field | Type | Description |
|---|---|---|
| input | Message[] | Full resolved input message array |
| output | string \| null | Final answer / scored result only |
| answer | string | Deprecated alias for output |
| messages | Message[] | Transcript messages from the target execution |
| expected_output | Message[] | Expected/reference output messages |
| output_path | string | Temp file containing large final answer JSON, when output is omitted |
| input_files | string[] | Paths to input files referenced in the eval |
| trace | Trace | Full execution trace with messages, events, metrics, and provenance |
| trace_summary | TraceSummary | Lightweight execution metrics summary |
| token_usage | {input, output} | Token consumption |
| cost_usd | number | Estimated cost in USD |
| duration_ms | number | Total execution duration |
| start_time | string | ISO timestamp of first event |
| end_time | string | ISO timestamp of last event |
| file_changes | string \| null | Unified diff of workspace file changes (populated when workspace is configured; includes files at workspace root, changes inside nested repos, and Copilot session-state artifacts) |
| workspace_path | string \| null | Absolute path to the temp workspace directory (populated when workspace is configured) |
trace_summary structure
    {
      "event_count": 5,
      "tool_calls": { "search": 2, "fetch": 1 },
      "error_count": 0,
      "llm_call_count": 2
    }

| Field | Type | Description |
|---|---|---|
| event_count | number | Total tool invocations |
| tool_calls | Record<string, number> | Count per tool |
| error_count | number | Failed tool calls |
| llm_call_count | number | Number of LLM calls (assistant messages) |
Use expected_output for reference answers and output for the actual final answer from live runs. Use messages or trace when you need tool calls, intermediate messages, or replay/provenance data.
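As an illustration of grading on trace_summary alone, the Python sketch below turns the summary into assertions about tool usage, errors, and LLM call count. The `check_trace_summary` helper and its parameters are hypothetical, and the thresholds are arbitrary example values.

```python
def check_trace_summary(summary, required_tools=(), max_llm_calls=None):
    """Turn a trace_summary object into a list of assertion dicts."""
    assertions = []
    tool_calls = summary.get("tool_calls") or {}
    for tool in required_tools:
        count = tool_calls.get(tool, 0)
        assertions.append({
            "text": f"Tool '{tool}' was called",
            "passed": count > 0,
            "evidence": f"{count} call(s)",
        })
    errors = summary.get("error_count", 0)
    assertions.append({
        "text": "No errors recorded",
        "passed": errors == 0,
        "evidence": f"error_count={errors}",
    })
    if max_llm_calls is not None:
        calls = summary.get("llm_call_count", 0)
        assertions.append({
            "text": f"At most {max_llm_calls} LLM calls",
            "passed": calls <= max_llm_calls,
            "evidence": f"llm_call_count={calls}",
        })
    return assertions


summary = {
    "event_count": 5,
    "tool_calls": {"search": 2, "fetch": 1},
    "error_count": 0,
    "llm_call_count": 2,
}
for assertion in check_trace_summary(summary, required_tools=["search", "write"], max_llm_calls=3):
    print(assertion)
```

With the sample summary, the `search` check passes and the `write` check fails, since `write` never appears in `tool_calls`.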
Workspace Access
When workspace is configured in the eval YAML (via workspace.template, workspace.path, or workspace.repos), code graders receive the workspace path in two ways:
- JSON payload: workspace_path field in the stdin input
- Environment variable: AGENTV_WORKSPACE_PATH
This enables functional grading — running commands like npm test, pytest, or cargo test directly in the agent’s workspace.
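A Python sketch of functional grading under these rules follows: resolve the workspace from the payload or the environment, then run a command there and convert its exit status into an assertion. Both helpers are hypothetical, preferring the payload field over the environment variable is an assumption, and the demo runs a trivial Python command in a temporary directory rather than a real test suite.

```python
import os
import subprocess
import sys
import tempfile


def resolve_workspace_path(payload, env=None):
    """Return the workspace path from the payload, falling back to the environment."""
    env = os.environ if env is None else env
    # Assumption: the payload field takes precedence when both are present.
    return payload.get("workspace_path") or env.get("AGENTV_WORKSPACE_PATH")


def run_check(label, command, cwd):
    """Run a command in the workspace and turn its exit status into an assertion."""
    try:
        proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    except OSError as exc:
        # The executable or working directory was not usable.
        return {"text": f"{label} could not start", "passed": False, "evidence": str(exc)}
    ok = proc.returncode == 0
    return {"text": f"{label} {'passed' if ok else 'failed'}", "passed": ok}


# Demo: a temporary directory stands in for the agent's workspace.
with tempfile.TemporaryDirectory() as workspace:
    cwd = resolve_workspace_path({"workspace_path": workspace}, env={})
    print(run_check("smoke check", [sys.executable, "-c", "raise SystemExit(0)"], cwd))
```

In a real grader the command would be something like `["pytest"]` or `["npm", "test"]`, with one `run_check` call per stage.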
What file_changes covers
file_changes is a unified diff built from two sources, merged in order:
- Git baseline: git diff against a baseline commit taken before the agent ran. Captures edits, new files at workspace root, and changes inside any nested git repos materialized via workspace.repos or set up via a before_all hook.
- Provider-reported artifacts: Copilot providers scan their session-state files/ directory after each run and append those as synthetic diffs. This surfaces files the agent wrote outside workspace_path entirely (e.g. ~/.copilot/session-state/<uuid>/files/).
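A grader can inspect file_changes by reading the file headers of the unified diff. The Python sketch below assumes git-style `--- a/path` and `+++ b/path` header pairs, with `/dev/null` marking created or deleted files. Whether the provider-reported synthetic diffs use the same header shape is an assumption, and the `changed_files` helper and sample diff are made up for illustration.

```python
def _strip_prefix(path):
    """Normalize a diff header path; return None for /dev/null."""
    path = path.split("\t")[0].strip()
    if path == "/dev/null":
        return None
    if path.startswith(("a/", "b/")):
        return path[2:]
    return path


def changed_files(diff_text):
    """Classify paths in a unified diff as created, deleted, or modified."""
    result = {"created": [], "deleted": [], "modified": []}
    lines = diff_text.splitlines()
    for i in range(len(lines) - 1):
        old, new = lines[i], lines[i + 1]
        # A file header is a '--- ' line immediately followed by a '+++ ' line.
        if not (old.startswith("--- ") and new.startswith("+++ ")):
            continue
        old_path = _strip_prefix(old[4:])
        new_path = _strip_prefix(new[4:])
        if old_path is None and new_path is not None:
            result["created"].append(new_path)
        elif new_path is None and old_path is not None:
            result["deleted"].append(old_path)
        elif new_path is not None:
            result["modified"].append(new_path)
    return result


SAMPLE_DIFF = """diff --git a/src/index.ts b/src/index.ts
index 1111111..2222222 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -1 +1 @@
-export const todo = null;
+export const todo = 42;
diff --git a/report.csv b/report.csv
new file mode 100644
--- /dev/null
+++ b/report.csv
@@ -0,0 +1 @@
+id,value
"""

print(changed_files(SAMPLE_DIFF))
```

The header detection is deliberately simple: a removed line whose content begins with `-- ` followed directly by an added line beginning with `++ ` would be misread as a header, so a stricter parser is advisable for diffs that may contain such content.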
Example: Deploy-and-Test Pattern
    #!/usr/bin/env bun
    import { readFileSync } from "fs";
    import { execFileSync } from "child_process";

    const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
    const cwd = input.workspace_path;

    const assertions: Array<{ text: string; passed: boolean }> = [];

    // Stage 1: Install dependencies
    try {
      execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
      assertions.push({ text: "npm install passed", passed: true });
    } catch {
      assertions.push({ text: "npm install failed", passed: false });
    }

    // Stage 2: Typecheck
    try {
      execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
      assertions.push({ text: "typecheck passed", passed: true });
    } catch {
      assertions.push({ text: "typecheck failed", passed: false });
    }

    // Stage 3: Run tests
    try {
      execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
      assertions.push({ text: "tests passed", passed: true });
    } catch {
      assertions.push({ text: "tests failed", passed: false });
    }

    const passed = assertions.filter(a => a.passed).length;
    console.log(JSON.stringify({
      score: assertions.length > 0 ? passed / assertions.length : 0,
      assertions,
    }));

The matching eval file:

    workspace:
      template: ./workspace-template  # copied into a temp dir before each run

    execution:
      target: my_agent

    tests:
      - id: implement-feature
        criteria: Agent implements the feature correctly
        input: "Implement the TODO functions in src/index.ts"
        assertions:
          - name: functional-check
            type: code-grader
            command: [bun, scripts/functional-check.ts]

See examples/features/functional-grading/ for a complete working example.
Examples
| Example | What it demonstrates |
|---|---|
| examples/features/functional-grading/ | workspace_path: deploy-and-test with npm install + tsc + npm test |
| examples/features/file-changes/ | file_changes: edits, creates, and deletes captured via git baseline |
| examples/features/workspace-artifact/ | file_changes: new file generated by agent (CSV) captured via git baseline |
| examples/features/file-changes-with-repos/ | file_changes: workspace-root files AND changes inside nested repos both captured |
Testing Locally
With agentv eval assert
Run a grader from .agentv/graders/ by name, with no manual JSON piping required:

    # Pass agent output and input directly
    agentv eval assert rouge-score --agent-output "The fox jumps over the dog" --agent-input "Summarise this"

    # Or pass a JSON file with { output, input } fields
    agentv eval assert rouge-score --file result.json

The command:
- Discovers the grader script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}
- Passes { output, input, criteria } to the script via stdin
- Prints the grader’s JSON result to stdout
- Exits 0 if score >= 0.5, exits 1 otherwise
This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits agentv eval assert instructions for code graders so external grading agents can run them directly.
With stdin pipe
Pipe JSON directly to the grader script for full control:

    echo '{"input":[{"role":"user","content":"What is 2+2?"}],"input_files":[],"criteria":"4","output":"4","expected_output":[{"role":"assistant","content":"4"}]}' | python validators/check_answer.py
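The same pipe can be driven from a Python test harness, which makes it easier to try several payloads. The sketch below uses a hypothetical `run_grader` helper and an inline stand-in grader so that it runs on its own. It assumes the grader follows the JSON output protocol rather than the plain-text exit-code convention.

```python
import json
import subprocess
import sys


def run_grader(command, payload):
    """Pipe a payload to a grader command and parse its JSON result from stdout."""
    proc = subprocess.run(command, input=json.dumps(payload), capture_output=True, text=True)
    # Non-zero exit plus stderr output is treated as an execution error.
    if proc.returncode != 0 and proc.stderr.strip():
        raise RuntimeError(f"grader execution error: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Stand-in grader used only for this demo; a real run would point at a script file.
INLINE_GRADER = (
    "import json, sys\n"
    "data = json.load(sys.stdin)\n"
    "ok = '4' in (data.get('output') or '')\n"
    "print(json.dumps({'score': 1.0 if ok else 0.0, "
    "'assertions': [{'text': 'Output contains 4', 'passed': ok}]}))\n"
)

payload = {
    "input": [{"role": "user", "content": "What is 2+2?"}],
    "input_files": [],
    "criteria": "4",
    "output": "4",
    "expected_output": [{"role": "assistant", "content": "4"}],
}
result = run_grader([sys.executable, "-c", INLINE_GRADER], payload)
print(result["score"])
```

To test a real grader, replace the command with something like `["python", "validators/check_answer.py"]`.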