Evaluating AI agents in production: eval sets, drift, and monitoring

The agent handles the demo perfectly. Every test case the team built for the presentation passes. The model retrieves the right information, the tool fires at the right moment, and the final response is coherent. Leadership signs off. Production deployment is scheduled.

Three months later, the first complaints surface. Some queries return off-target answers. Others trigger tool loops that never terminate cleanly. A use case that seemed secondary at launch turns out to be the one users hit most often, and it happens to be exactly the case no one bothered testing. Overall quality is hard to characterize: not a clear failure, not a clear success. An uncomfortable grey zone.

This is the rule, not the exception. The LangChain State of Agent Engineering report, based on 1,340 teams surveyed in December 2025, found that 89% of organizations have observability in place for their production agents, but only 52% run structured evaluations against documented test sets. The gap between "monitoring what happens" and "measuring whether outputs are good" is 37 percentage points. Behind that gap are teams who know something is off but cannot quantify it or catch regressions. This guide covers what the teams that do evaluate put in place, and what the others should be doing.

Silent drift: why agents degrade without warning

A production AI agent is not a static system. It evolves under the pressure of several simultaneous forces, and each one can degrade quality progressively and invisibly.

The first force is model change. Inference API providers update their models silently or with minimal notice. What a model endpoint served in November 2024 is not necessarily what it serves in March 2025. It may behave differently on certain request categories without any infrastructure metric catching it: latency stays flat, HTTP error rates stay flat, but response quality drifts.

The second force is distribution shift in user requests. In production, users ask questions the team never anticipated. The cases covered by the test suite may have represented 80% of expected queries at launch; six months in, unanticipated request types can account for 40% of real traffic. The agent was never evaluated on those, so nobody knows how it performs on them.

The third force is changes in external context: documents updated in a RAG knowledge base, third-party APIs whose schemas evolve, tools whose behavior shifts with their own releases. Every external dependency is a potential source of drift, even when the agent itself has not changed a single line of code.

A survey published on arXiv in March 2025 on the evaluation of LLM-based agents identifies this structural problem: static evaluations, built once on a fixed dataset, cannot capture the dynamic drift of agents in production. Evaluation must be a continuous process, not a one-time audit conducted before deployment.

Building a proprietary eval set: the non-negotiable starting point

The most common mistake is evaluating an agent against generic benchmarks: MMLU, HumanEval, SWE-bench. These benchmarks measure the general capabilities of the underlying model. They do not measure whether your agent correctly handles the requests of your actual users, with your tools, in your business context.

A proprietary eval set is a dataset built from your own production traffic. Building it follows four steps.

Sample real requests. From the moment production traffic starts, collect a representative sample of handled queries: 200 to 500 examples is a solid starting point to cover the distribution of use cases. Include common cases, edge cases flagged by users, and legitimate refusal cases, meaning out-of-scope requests the agent should gracefully decline.

Annotate reference outputs. For each query in the sample, document the expected response or, for multi-step agentic tasks, the expected trajectory: which tools should be called, in what order, with what parameters. This annotation is done by domain experts, not just engineers. The business side knows what "correct" looks like in the organization's context.

Set acceptability thresholds. Formally define the minimum score on each evaluation dimension for an agent version to be considered deployable. These thresholds are validated by the business before the first deployment, not revised downward when scores disappoint.

Version the dataset. The eval set evolves over time: add new cases that emerged in production, remove obsolete ones, update reference responses when expected behavior changes. Tag each dataset version so score comparisons across releases remain valid.
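The four steps above can be sketched as a small data model. This is a minimal illustration, not a prescribed schema: the names `EvalCase`, `ToolCall`, `THRESHOLDS` and `dataset_fingerprint`, the category labels, and the threshold values are all assumptions made for the example. The content hash is one simple way to tag a dataset version so that two score reports can be checked against the same set of cases.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    query: str
    category: str  # "common", "edge" or "refusal" (illustrative labels)
    expected_answer: Optional[str] = None
    # For multi-step tasks: the tools expected, in order, with parameters.
    expected_trajectory: Tuple[ToolCall, ...] = ()
    annotated_by: str = ""  # the domain expert who wrote the reference


# Hypothetical minimum scores, agreed with the business before deployment.
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85}


def dataset_fingerprint(cases):
    """Short content hash used as a version tag; independent of case order."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


golden = [
    EvalCase("refund-001", "How do I get a refund?", "common",
             expected_trajectory=(ToolCall("lookup_order", {"order_id": "A1"}),),
             annotated_by="support-lead"),
    EvalCase("legal-001", "Can you give me legal advice?", "refusal",
             expected_answer="Politely decline and point to the legal team."),
]
print("dataset version:", dataset_fingerprint(golden))
```

Storing the fingerprint next to every score report makes it obvious when two runs were measured against different versions of the dataset.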

This golden dataset is the most valuable asset in the project, more so than the agent code itself. It encodes the shared understanding between the engineering team and the business about what "working" concretely means. Without it, every quality discussion stays subjective, and deployment decisions stay political.

LLM-as-a-judge: a reliable evaluator when properly calibrated

Manually annotating 300 queries before deployment is doable. Manually annotating the thousands of queries that flow through production every week is not. This is where automated evaluation via LLM-as-a-judge comes in: a separate LLM evaluates the agent's outputs against defined criteria.

The method involves submitting to a judge LLM (typically a distinct, more capable model than the one used in production) the original query, the agent's response, and a precise evaluation rubric. The judge returns a score and a justification. LangChain has published data on this approach at scale: the "answer correctness" criterion was applied more than 8 million times in a single production week in March 2025.
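As a rough sketch of that flow, the judge call can be reduced to three pieces: a prompt built from the query, the response and the rubric; a model call; and strict parsing of the verdict. Everything here is an assumption for illustration: the 1 to 5 scale, the JSON reply format, and the `call_model` parameter, which stands in for whatever client your judge model uses and is injected so that no specific provider API is implied.

```python
import json
from typing import Callable, Dict


def build_judge_prompt(query: str, response: str, rubric: str) -> str:
    """Assemble the three inputs the judge needs into one prompt."""
    return (
        "You are grading an AI agent's answer.\n"
        f"Rubric:\n{rubric}\n\n"
        f"User query:\n{query}\n\n"
        f"Agent response:\n{response}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, '
        '"justification": "<one sentence>"}'
    )


def parse_verdict(raw: str) -> Dict[str, object]:
    """Validate the judge's reply instead of trusting it blindly."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "justification": str(data.get("justification", ""))}


def judge_response(query: str, response: str, rubric: str,
                   call_model: Callable[[str], str]) -> Dict[str, object]:
    """call_model is a hypothetical hook: prompt in, raw model text out."""
    return parse_verdict(call_model(build_judge_prompt(query, response, rubric)))
```

In tests, `call_model` can be a stub that returns a fixed JSON string, which keeps the parsing and validation logic verifiable without any network call.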

The approach has a real limitation. Research published in 2025 documents several systematic biases in LLM judges, including position bias (favoring a response because of where it appears in the prompt, for example the one presented second) and length bias (favoring longer responses regardless of relevance). These biases do not make the method useless, but they make calibration mandatory.

Calibration happens through double-labeling: assemble a subset of 100 to 200 examples annotated manually by domain experts, then measure the agreement rate between the LLM judge and the human annotations. A 75 to 90% agreement rate is the standard target. Below 70%, the rubric is too ambiguous or the judge model is poorly suited to the domain. Calibration is not an optional end-of-project step: it is what distinguishes a reliable LLM judge from an oracle that fails systematically and silently.
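The agreement check itself is simple arithmetic. The sketch below assumes the human and judge labels are aligned lists of categorical verdicts such as "pass" and "fail"; the function names are illustrative. The text above leaves the 70 to 75% range undefined, so the "borderline" status for that range is an assumption of this example.

```python
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Share of examples where the LLM judge matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for h, j in zip(human, judge) if h == j)
    return matches / len(human)


def calibration_status(rate: float) -> str:
    if rate < 0.70:
        return "recalibrate"  # rubric too ambiguous or judge ill-suited
    if rate < 0.75:
        return "borderline"   # assumption: between the floor and the target
    return "on_target"


human_labels = ["pass", "fail", "pass", "pass"]
judge_labels = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(human_labels, judge_labels)
print(rate, calibration_status(rate))  # 0.75 on_target
```

In practice the lists would hold the 100 to 200 double-labeled examples, and the disagreements are worth reading one by one: they usually point at the rubric clause that needs tightening.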

Open-source frameworks like DeepEval offer pre-built LLM-as-a-judge metrics for common cases: faithfulness (is the response grounded in the provided context?), answer relevancy (does the response address the question?), task completion (did the agent accomplish the requested task?). These are starting points, not oracles. Criteria specific to your domain require custom metrics built with your business team.

Metrics to instrument in production

A production agent generates two types of signals: infrastructure signals (latency, token cost, HTTP error rates, response lengths) and quality signals (relevance, faithfulness, task completion). Teams that only do observability capture the first kind. Teams that run structured evaluation capture both.

For RAG-style or retrieval agents, the priority quality metrics are as follows.

Faithfulness. Is the response entirely grounded in the retrieved documents, with no information invented outside the provided context? Drifting faithfulness typically signals degradation in retrieval quality or a shift in model behavior.

Answer relevancy. Does the response precisely address the question asked, without off-topic or redundant content? A drop in relevancy signals that the request distribution has shifted outside the profile the agent was optimized for.

Task completion rate. For multi-step agents: what percentage of tasks complete without unexpected interruption? A declining completion rate often points to a regression in edge-case handling or a behavioral change in an external tool.

For action agents (those executing tasks in third-party systems), two additional metrics apply.

Tool call precision. Is the agent calling the right tools, with the right parameters, in the right order? Declining tool call precision is often the earliest signal of drift, preceding user-reported complaints by days or weeks.

Trajectory accuracy. For golden dataset cases where the expected execution path is known: what percentage of runs follow the correct action sequence, not just produce an acceptable final answer? An agent can generate a plausible response via an incorrect execution path, which weakens robustness as case variety increases.
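To make the two action-agent metrics concrete, here is one possible way to compute them, assuming each tool call is recorded as a `(tool_name, parameters)` pair. The definitions are deliberately strict and are an assumption of this sketch: a call counts as correct only if the tool, the parameters and the position all match, and a trajectory counts only if it matches the expected sequence exactly.

```python
from typing import Any, Dict, Sequence, Tuple

Call = Tuple[str, Dict[str, Any]]


def tool_call_precision(expected: Sequence[Call], actual: Sequence[Call]) -> float:
    """Share of the agent's calls that match the expected call at the same step."""
    if not actual:
        return 1.0 if not expected else 0.0
    correct = sum(1 for e, a in zip(expected, actual) if e == a)
    return correct / len(actual)


def trajectory_accuracy(
    runs: Sequence[Tuple[Sequence[Call], Sequence[Call]]]
) -> float:
    """Share of runs whose full call sequence equals the expected one."""
    if not runs:
        raise ValueError("no runs to score")
    exact = sum(1 for expected, actual in runs if list(expected) == list(actual))
    return exact / len(runs)


expected = [("search_orders", {"customer": "c42"}), ("refund", {"order": "o7"})]
wrong_param = [("search_orders", {"customer": "c42"}), ("refund", {"order": "o9"})]

print(tool_call_precision(expected, wrong_param))                     # 0.5
print(trajectory_accuracy([(expected, expected), (expected, wrong_param)]))  # 0.5
```

Precision alone does not penalize a run that stops early with only correct calls; the exact-match trajectory check is what catches that case, which is why the two are tracked together.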

Together, these metrics form a quality dashboard for the agent. This dashboard is not the same as the infrastructure dashboard, and it cannot be derived from it. Both are needed, in parallel.

Detecting drift: canary sets and alert thresholds

Drift monitoring relies on a specific technique: canary sets. A canary set is a stable subset of the golden dataset, typically 50 to 200 examples, replayed automatically in production at regular intervals (daily or weekly depending on criticality). Scores from each run are compared against a baseline established at initial deployment.

The alerting logic follows progressive thresholds. A variation of less than 2% on global scores between two consecutive runs falls within normal noise. A drop of 2 to 5% sustained over 24 to 48 hours warrants investigation. A drop above 5% triggers a priority alert and a human review of degraded examples.
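The progressive thresholds translate into a small classifier. This sketch makes several assumptions that the text leaves open: drops are measured as a relative percentage against the deployment baseline, canary runs are daily so that two consecutive runs approximate the 24 to 48 hour window, and a single run inside the 2 to 5% band that is not yet sustained is still reported as "normal". The function names and status labels are illustrative.

```python
from typing import Sequence


def relative_drop_pct(baseline: float, score: float) -> float:
    """Drop versus baseline, in percent of the baseline (negative if improved)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - score) / baseline * 100.0


def classify_drift(baseline: float, recent_scores: Sequence[float],
                   sustained_runs: int = 2) -> str:
    """recent_scores holds one global score per canary run, oldest first."""
    if not recent_scores:
        raise ValueError("need at least one canary run")
    drops = [relative_drop_pct(baseline, s) for s in recent_scores]
    if drops[-1] > 5.0:
        return "priority_alert"   # human review of degraded examples
    window = drops[-sustained_runs:]
    if len(window) == sustained_runs and all(d >= 2.0 for d in window):
        return "investigate"      # moderate drop, sustained across runs
    return "normal"               # within noise, or not yet sustained


print(classify_drift(0.90, [0.895, 0.89]))   # normal
print(classify_drift(0.90, [0.875, 0.87]))   # investigate
print(classify_drift(0.90, [0.90, 0.84]))    # priority_alert
```

With weekly canary runs, `sustained_runs` would need a different value, or the sustained check would have to be based on timestamps rather than run counts.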

What field practice confirms in 2026: teams running daily canary replays on their production model IDs detect silent drift on one to two model IDs per quarter on average. This drift would not have been caught by infrastructure metrics, which remain stable during a pure quality regression. The canary set is the only mechanism that makes this category of regression visible.

Drift response follows three levels. If drift is minor and localized to a specific request category: enrich the golden dataset with these cases and adjust the system prompt. If drift is moderate and diffuse: investigate whether an external tool or the underlying model has changed, and assess the impact of a rollback or model version change. If drift is severe (more than 10% drop on a critical metric): suspend the feature or revert to a previous version while diagnosing the root cause.

A practice that has become standard: quarterly drift drills. Deliberately inject a known regression into a staging environment and verify that alerts fire correctly, that rollback is operational, and that the post-mortem captures the degradation timeline accurately. This is the fire drill equivalent for evaluation infrastructure.

Embedding evaluation into the development cycle

The final piece is organizational: evaluation must enter the CI/CD pipeline just like unit tests in a conventional software project. Every change to the agent pipeline (new prompt, new model, new tool, new data source) must trigger a re-run of the evaluations against the golden dataset before deployment.

The concrete pattern, adopted by the most mature teams, works as follows. A CI action automatically triggers a series of evaluations against the golden dataset on every pull request. Scores are posted as a PR comment with a comparison to the reference baseline. If any score falls below the predefined quality threshold, deployment is automatically blocked. An engineer must then either fix the regression or explicitly justify lowering the threshold and obtain formal business sign-off.
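The gating step at the heart of this pattern can be sketched in a few lines. This is not the LangSmith or Braintrust integration, only a generic illustration: the function names, the metric names and the comment layout are assumptions, and posting the comment to the pull request is left to whatever CI tooling you use. A metric that is missing from the run is treated as a failure, so a broken evaluation cannot pass silently.

```python
from typing import Dict, List


def failing_metrics(scores: Dict[str, float],
                    thresholds: Dict[str, float]) -> List[str]:
    """Metrics below their minimum; a metric missing from the run also fails."""
    return sorted(
        name for name, minimum in thresholds.items()
        if scores.get(name, float("-inf")) < minimum
    )


def format_pr_comment(scores: Dict[str, float], baseline: Dict[str, float],
                      thresholds: Dict[str, float]) -> str:
    """Plain-text table comparing the run to the baseline and the minimums."""
    failed = set(failing_metrics(scores, thresholds))
    lines = ["metric | score | baseline | minimum | status"]
    for name in sorted(thresholds):
        score = scores.get(name)
        base = baseline.get(name)
        score_txt = "missing" if score is None else f"{score:.3f}"
        base_txt = "n/a" if base is None else f"{base:.3f}"
        status = "FAIL" if name in failed else "ok"
        lines.append(
            f"{name} | {score_txt} | {base_txt} | {thresholds[name]:.3f} | {status}"
        )
    return "\n".join(lines)


def gate(scores: Dict[str, float], baseline: Dict[str, float],
         thresholds: Dict[str, float]) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 blocks it."""
    print(format_pr_comment(scores, baseline, thresholds))
    return 1 if failing_metrics(scores, thresholds) else 0
```

In a CI job, the return value of `gate` would be passed to `sys.exit`, so that a non-zero code fails the check and blocks the merge.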

This pattern fundamentally changes team dynamics. Discussions about "did quality go down?" become discussions about "what's our current score, and why did it change?" The team argues from data, not from impressions. Rollback decisions become objectively defensible, even for non-technical stakeholders.

Commercial platforms like LangSmith (LangChain) and Braintrust both integrated this CI evaluation pattern in 2025 and 2026. Both offer native GitHub Actions integrations that block merges when scores fall below defined thresholds. The choice between them depends primarily on the existing stack: LangSmith integrates natively with LangChain and LangGraph workflows; Braintrust is model-agnostic and takes a dataset-first approach that works across any infrastructure.

What we set up in practice

When a client brings us an AI agent that works in the demo and needs to be industrialized, we always start with the same sequence.

First, an audit of what exists: is there a golden dataset? In what form, at what size, annotated by whom? In most projects we take over, the answer is either "no" or "we have a few demo examples" that do not cover the real distribution of production requests.

Then, building the golden dataset with the business teams. We sample existing production traffic, or construct queries with the client if the agent is not yet deployed. We set acceptability thresholds together, with the people who know what "correct" looks like in the business context. This step takes two to five days depending on domain complexity. It saves weeks of debugging once in production.

Finally, tooling: we wire evaluations into CI/CD, configure automated canary runs, define alerts and response procedures. Depending on the stack and the client's data sovereignty constraints, we use LangSmith, Braintrust, or an open-source framework like DeepEval hosted within the client's own perimeter.

The result is not spectacular to demo. We do not change the agent's initial quality. What we do is make quality visible, measurable, and defensible over time. That is the condition for an agent to stay useful after six months in production, and for business teams to maintain trust in it long-term. An agent without continuous evaluation is an agent you cannot maintain. An agent you cannot maintain will always end up abandoned.

Want to take a look at your setup together? Book a slot, and we'll spend 30 minutes reviewing your current evaluation setup and identifying the levers that will let you measure and control your agent's quality in production.
