
Agent Evaluation for Legal: Profile Any Agent in 72 Hours

Connect vendor or custom agents — get capability profiles across your actual work

Pnyx capability: Agent evaluation — profiling AI agents the same way Pnyx profiles models. Connect any agent (vendor-built or custom) to Pnyx's evaluation infrastructure and receive a capability profile in 24–72 hours: task completion, cost per completed task, error patterns, and comparative ranking against every other agent submitted.

Multi-Provider · Legal · Agent Intelligence · Governance · Quality Improvement · Cost Optimization
March 1, 2026

The Question Has Changed

For the past two years, the AI evaluation question for law firms was: which model should handle which task? That question mattered — and still does — because model selection determines cost, quality, and governance across every AI-enabled workflow.

But the landscape is shifting. Law firms aren't just choosing models anymore. They're choosing agents.

Harvey for legal research and drafting. Lexis+ AI for case law analysis. CoCounsel for document review. Spellbook for contract redlining. Microsoft Copilot for operational tasks. Custom agents built in-house on LangChain or CrewAI for firm-specific workflows. And every quarter, new vendors arrive with agents designed for legal discovery, compliance monitoring, due diligence, and client intake.

Each vendor demonstrates their agent in a controlled setting. Each provides benchmarks run against their own test sets. None of them can tell a firm how their agent actually performs on that firm's matters, in that firm's practice areas, against that firm's quality standards — or how it compares to the other agents the firm is already evaluating.

The model selection problem has become an agent selection problem. And it's harder.

Pnyx evaluates agents the same way it evaluates models. A firm connects its agents — vendor-provided, custom-built, or both — to Pnyx's evaluation infrastructure. Within 24 to 72 hours, each agent has a capability profile: what task types it handles well, where it fails, what it costs per completed task, and how it compares to every other agent the firm has submitted.

Why Agents Are Harder to Evaluate Than Models

Models respond to prompts. You send a prompt, you get a response, you can evaluate the output. The evaluation surface is well-defined.

Agents are different in ways that make evaluation fundamentally more complex:

Agents complete tasks, not prompts. An agent given a contract review task doesn't just respond — it reads the document, identifies risk clauses, cross-references governing law, flags deviations from the firm's standard positions, and produces a structured memo. Evaluation has to measure whether the task was completed correctly, not just whether the output reads well.

Agent costs are variable. A model costs a predictable amount per token. An agent might consume 500 tokens on a simple clause extraction and 50,000 on a complex multi-jurisdictional review — calling multiple models along the way. Cost per completed task, not cost per token, is what matters for the firm's economics.

Agents use tools — and tool use can fail. An agent that searches a case law database, retrieves the wrong precedent, and builds its analysis on that retrieval has a tool-use failure that cascades through every subsequent step. Evaluation needs to catch where in the chain things go wrong, not just whether the final output looks acceptable.

Agent behavior is opaque by design. Vendors don't expose their agent's internal reasoning chain, tool selection logic, or error recovery behavior. The firm sees the final output. Pnyx sees the full execution profile.
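The article doesn't publish Pnyx's internal schema, but the four differences above imply a concrete measurement surface. Here is a hypothetical sketch, in Python, of what a single task-level evaluation record might need to capture; every name in it is illustrative, not Pnyx's API:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """One tool invocation inside an agent's execution chain."""
    tool: str            # e.g. "case_law_search"
    succeeded: bool      # did the call return usable output?
    cost_usd: float      # spend attributable to this call

@dataclass
class TaskEvaluation:
    """Evaluation record for one agent task. Field names are
    illustrative assumptions, not Pnyx's published schema."""
    agent: str                    # e.g. "spellbook", "custom-intake"
    task_type: str                # e.g. "contract_review"
    completed_correctly: bool     # task-level outcome, not output fluency
    cost_usd: float               # all model calls, retries, recovery
    tool_calls: list[ToolCall] = field(default_factory=list)
    failed_silently: bool = False # confident but incorrect output

    def first_tool_failure(self) -> ToolCall | None:
        """Locate where in the chain a tool-use failure cascaded from."""
        return next((c for c in self.tool_calls if not c.succeeded), None)
```

Note what the record does not contain: the agent's internal reasoning. Everything above is observable from the outside, which is what makes black-box evaluation of vendor agents possible at all.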

Why Legal

Legal is where agent proliferation, high stakes, and governance requirements converge most acutely.

Agent proliferation is already here. Corporate legal AI adoption more than doubled in a single year, jumping from 23% to 52%. LexisNexis's CEO described 2026 as the year agentic AI becomes embedded in the core operating model of leading law firms — shifting from single actions to coordinated, multi-agent systems handling complex workflows. Law firms aren't evaluating one agent. They're evaluating five, six, seven — from different vendors, built on different architectures, with overlapping claims.

The stakes leave no room for guesswork. A model that produces a mediocre summary is a quality problem. An agent that misidentifies a risk clause in a $200 million acquisition agreement is a liability problem. An agent that retrieves outdated case law and builds a brief around it is a malpractice problem. The difference between a model error and an agent error is that the agent's error is embedded in a completed work product that looks authoritative.

Governance is becoming a compliance requirement. The EU AI Act reaches full application for high-risk systems in August 2026 — and AI systems used in legal services fall squarely in that category, with penalties reaching €35 million or 7% of global revenue. The Colorado AI Act takes effect in June 2026, requiring risk management and transparency for high-risk AI. Firms need to document not just which AI they use, but how they evaluated it, why they selected it, and how they monitor its performance. Vendor demos and informal testing don't satisfy that standard.

How It Works

Consider an Am Law 200 firm evaluating AI agents across its practice. The firm currently uses three vendor agents and two custom-built agents across five workflows:

| Workflow | Agent | Source |
| --- | --- | --- |
| Contract review & redlining | Spellbook | Vendor |
| Legal research & memo drafting | Harvey | Vendor |
| Document review for litigation | CoCounsel | Vendor |
| Client intake & matter classification | Custom (LangChain) | In-house |
| Regulatory compliance monitoring | Custom (CrewAI) | In-house |

Each agent was adopted through separate evaluation processes — different practice groups, different timelines, different standards. The firm's Chief Innovation Officer wants a unified picture: how does each agent actually perform on the firm's work, and are there better options?

Connect agents to Pnyx

The firm connects each agent to Pnyx's evaluation infrastructure. Pnyx doesn't need access to the agent's internals — it interacts with each agent the same way the firm's attorneys do: by assigning tasks and evaluating what comes back.

Pnyx submits the firm's actual task patterns — anonymized, with confidential client information removed — to each agent. Contract review agents receive the firm's typical contract types. Research agents receive the firm's typical research questions. Document review agents receive representative document sets.
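The article doesn't specify the connector API, but the interaction pattern it describes (assign a task, grade what comes back, no internals required) can be sketched as a uniform black-box interface. All names and signatures below are assumptions for illustration:

```python
from typing import Callable, Protocol

class AgentUnderTest(Protocol):
    """Any agent, vendor or custom, treated as a black box:
    it takes a task assignment and returns a finished work product."""
    name: str
    def run(self, task: dict) -> dict: ...

def profile_agent(
    agent: AgentUnderTest,
    tasks: list[dict],
    score: Callable[[dict, dict], bool],
) -> list[dict]:
    """Assign each anonymized task and record what comes back.

    `score` judges task completion (against a rubric, reference
    answer, or human review), not how fluent the output reads.
    """
    results = []
    for task in tasks:
        output = agent.run(task)
        results.append({
            "agent": agent.name,
            "task_type": task["task_type"],
            "completed_correctly": score(task, output),
        })
    return results
```

Because the interface is the same one an attorney effectively uses, the same harness works for a vendor product and a CrewAI agent built in-house.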

24–72 hours later: capability profiles

Each agent receives a structured profile across dimensions that matter for legal work:

Task completion rate. What percentage of assigned tasks does the agent complete correctly, by task type and complexity tier? An agent that handles routine NDA reviews at 95% accuracy but drops to 60% on multi-jurisdictional licensing agreements has a profile the firm needs to see before assigning it to M&A work.

Quality by practice area. How does the agent perform on corporate vs. litigation vs. regulatory work? Most agents are stronger in some domains than others — but vendor benchmarks don't reveal domain-specific weaknesses because they aren't run against the firm's practice mix.

Cost per completed task. Not cost per token — cost per completed task, including all model calls, tool invocations, retries, and error recovery the agent performs internally. A firm discovers that Agent A costs $0.12 per contract clause review while Agent B costs $0.85 for the same task at similar quality. Across thousands of clauses per month, that's the difference between sustainable and unsustainable AI economics.
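To make that arithmetic concrete, assume a hypothetical volume of 5,000 clause reviews per month at the per-clause figures above; the sketch below shows the gap compounding to roughly $43,800 a year:

```python
# Cost per completed task = total spend / tasks completed correctly,
# so retries and failed runs make the effective rate worse, not better.
def cost_per_completed_task(total_spend_usd: float, completed_tasks: int) -> float:
    return total_spend_usd / completed_tasks

monthly_clauses = 5_000                      # hypothetical volume
agent_a, agent_b = 0.12, 0.85                # $ per completed clause (from above)
monthly_gap = monthly_clauses * (agent_b - agent_a)  # $3,650 per month
annual_gap = monthly_gap * 12                        # $43,800 per year
```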

Error patterns. Where does the agent fail, and how? Does it fail silently — producing confident but incorrect output? Does it fail gracefully — flagging uncertainty for human review? Does it fail in tool use — retrieving wrong precedents, misreading document structures, losing context across long documents? The error profile tells the firm not just how often an agent fails, but how dangerous its failures are.

Comparative ranking. For each task type, Pnyx ranks every evaluated agent — vendor and custom — against each other. The firm sees which agent is the best fit for contract review, which is strongest on research, which handles document review most cost-effectively. Not based on vendor claims. Based on the firm's own work.
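One way to picture the ranking step: collapse per-task outcomes into a completion rate for each agent and task type, then sort. A minimal sketch, reusing the hypothetical result records from the harness sketch earlier:

```python
from collections import defaultdict

def rank_agents(results: list[dict]) -> dict[str, list[tuple[str, float]]]:
    """Rank every evaluated agent per task type by completion rate."""
    tally = defaultdict(list)
    for r in results:
        tally[(r["task_type"], r["agent"])].append(r["completed_correctly"])

    rankings = defaultdict(list)
    for (task_type, agent), outcomes in tally.items():
        rankings[task_type].append((agent, sum(outcomes) / len(outcomes)))
    for task_type in rankings:
        rankings[task_type].sort(key=lambda pair: pair[1], reverse=True)
    return dict(rankings)
```

A production scorer would weight quality and cost, not completion rate alone, but the structure is the same: one comparable number per agent, per task type.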

What the Firm Discovers

The profiles typically reveal patterns that no single vendor evaluation would surface:

Overlap and redundancy. Two vendor agents that seemed to serve different purposes turn out to perform the same tasks at similar quality levels. The firm is paying for both when one would suffice.

Hidden strengths in custom agents. An in-house agent built for client intake turns out to outperform vendor options on matter classification because it was trained on the firm's own taxonomy. The firm nearly replaced it with a vendor product that would have been worse.

Domain-specific weaknesses. A vendor agent that excels at corporate contract work performs poorly on regulatory compliance documents — a weakness invisible in the vendor's demo, which focused on corporate transactions.

Cost variance across agents. Agent costs per task vary by 5x to 10x across vendors for similar-quality output. Without comparative profiling, the firm has no basis for knowing which agent is cost-effective and which is burning budget.

Governance gaps. Some agents provide audit trails and explainability artifacts. Others produce only final output with no documentation of reasoning. The firm can now make agent selection decisions that account for compliance requirements — not just quality and cost.

The Shift From Model Evaluation to Agent Evaluation

Pnyx's evaluation infrastructure handles both models and agents — but the evaluation dimensions are different:

| Dimension | Model evaluation | Agent evaluation |
| --- | --- | --- |
| What's measured | Output quality per prompt | Task completion per workflow |
| Cost unit | Cost per token | Cost per completed task |
| Error surface | Response quality | Multi-step chain, tool use, error recovery |
| Comparison scope | Models across providers | Agents across vendors and frameworks |
| Profiling time | Minutes | 24–72 hours |
| Governance artifact | Skill profile per prompt | Execution profile per task type |

The 24–72 hour profiling window exists because agents are task-completers, not prompt-responders. Evaluating an agent requires running it through representative tasks across the firm's practice areas, at sufficient volume to establish statistical confidence in the capability profile. This isn't a benchmark — it's a production-representative evaluation of what the agent actually does on the firm's work.
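That volume requirement has a concrete statistical shape: a completion rate estimated from a handful of tasks carries a wide error bar. The Wilson score interval below is one standard way to see it (the article doesn't say which method Pnyx uses):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a task completion rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 19/20 correct looks like "95%", but the interval is wide:
print(wilson_interval(19, 20))    # ≈ (0.76, 0.99)
print(wilson_interval(190, 200))  # ≈ (0.91, 0.97), tighter at volume
```

Running hundreds of representative tasks per task type, rather than a ten-task demo, is what turns an anecdote into a capability profile.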

Why This Requires a Neutral Layer

Agent vendors are the least objective source of information about agent performance. Every vendor shows demos optimized for their strengths. Every benchmark is run on task types where the agent excels. No vendor will tell a firm that a competitor's agent handles their contract review better, or that their own agent is overpriced for the firm's document review volume.

The emerging agent tooling — LangSmith, Langfuse, Arize, Maxim — solves a different problem. These platforms help development teams debug and monitor their own agents. They don't compare agents from different vendors against each other on a firm's actual task patterns. They're observability tools, not comparative evaluation tools.

Pnyx fills the gap between vendor claims and production reality. It evaluates every agent — vendor-built, custom-built, from any framework — against the same task patterns, with the same scoring methodology, producing comparable profiles. The firm gets a neutral, data-driven basis for agent selection decisions.

The Adoption Path

Connect agents. The firm connects its current agents — and any new agents under evaluation — to Pnyx's infrastructure. Pnyx interacts with each agent through its standard interface. No access to agent internals required.

Submit task patterns. The firm provides representative task patterns from its practice areas, anonymized. Pnyx needs the structure and complexity of the work, not confidential client content.

Receive profiles in 24–72 hours. Each agent gets a capability profile: task completion, quality by domain, cost per task, error patterns, comparative ranking. The profiles are the firm's property.

Make selection decisions with data. Assign agents to the workflows they handle best. Consolidate redundant agents. Upgrade where quality gaps exist. Document the evaluation for governance and compliance.

Monitor continuously. Agent performance isn't static. Vendor updates change behavior. Workload patterns shift. New agents launch. Pnyx monitors agent performance over time and alerts the firm when profiles change — the same continuous optimization that applies to models applies to agents.
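At its simplest, that monitoring loop compares a rolling window of recent task outcomes against the profiled baseline and alerts when they diverge. A hypothetical sketch:

```python
def profile_drifted(baseline_rate: float, recent_outcomes: list[bool],
                    threshold: float = 0.05) -> bool:
    """Alert when the recent completion rate falls materially below the
    agent's profiled baseline, e.g. after a silent vendor update."""
    if not recent_outcomes:
        return False
    recent_rate = sum(recent_outcomes) / len(recent_outcomes)
    return baseline_rate - recent_rate > threshold

# Profiled at 95%, last 200 tasks at 88%: alert fires.
assert profile_drifted(0.95, [True] * 176 + [False] * 24)
```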

See how Pnyx routes your workloads

Try the Prompt Analyzer or request early access to the routing gateway.
