Executive Summary

Agentic AI systems do more than generate answers. They retrieve evidence, call tools, use identity, maintain state, coordinate across components, and sometimes trigger side effects. That changes the evaluation problem.

A chatbot can often be evaluated by its response. An agent must be evaluated by its response, trajectory, tools, state changes, and recovery path when something goes wrong.

This blog introduces the OCI Agent Evaluation Framework as a lifecycle approach to evaluating agents: qualifying capabilities, testing prompts, retrieval-augmented generation (RAG), tools, and trajectories, assessing release readiness, monitoring production behavior, and turning incidents into recovery and recertification coverage.

The focus is not only whether the answer is correct, but whether the path, authorization, state change, reliability, production behavior, and recovery are safe, traceable, and production-ready.

Agent evaluation is a lifecycle discipline: evaluate the answer, path, tools, state change, reliability, production behavior, and recovery together.

Why final-answer scoring breaks for agents

Many agent evaluations still resemble chatbot evaluations with tool-call logs attached. That is not enough. A final answer can look useful while the path to produce it violated authorization, approval, privacy, tool-boundary, state-change, or recovery rules. Table 1 contrasts traditional output scoring with lifecycle evaluation for agents.

Table 1. Output scoring versus agent lifecycle evaluation

DimensionOutput scoringAgent lifecycle evaluation
Unit of evaluationFinal responseOutcome, path, state, and recovery
Main evidencePrompt, output, scoreTrace, retrieval, tool call, policy event, state verifier
Common failureWrong answerCorrect answer through unsafe or unreliable path
CadenceBenchmark or CI checkOnboard, Develop, Release, Assure, Recover
Production linkWeak or absentReplay, canary, drift, incident replay, runtime action

A correct answer can still take an unsafe path

Imagine an enterprise support agent that correctly tells a customer, “Your account status has been updated.” The final response looks good, and a final-answer evaluator might pass it.

The trace shows a different story: the agent used a privileged update tool, skipped the required approval, passed the wrong authorization context, wrote to the customer record, and then reported success. A lifecycle-aware evaluator should fail this interaction even though the customer-facing answer is correct.

Agent evaluation follows the lifecycle

Enterprise agent evaluation is not a one-time release activity. The same evaluation substrate – contracts, datasets, traces, metrics, and decisions – should follow the system from onboarding through recovery. Evidence depth increases as autonomy, data sensitivity, production exposure, and consequence increase. Figure 1 summarizes the lifecycle loop: evaluation starts with onboarding, continues through development and release, monitors production behavior, and feeds incidents and drift back into future coverage. Table 2 maps each lifecycle stage to the evaluation question and typical run types.

Agent evaluation follows the lifecycle; incidents and drift feed new coverage back into development and release evaluation.

Figure 1. Agent evaluation follows the lifecycle; incidents and drift feed new coverage back into development and release evaluation.

Table 2. Agent evaluation lifecycle stages and typical runs

StageEvaluation questionTypical runs
OnboardIs the capability safe and useful enough for scoped use?Capability, skill/plugin, evaluator certification
DevelopIs the agent improving without regressions or unsafe paths?Prompt, RAG, trajectory, safety regression
ReleaseIs the candidate ready under declared scope and risk?Release gate, statistical gate, reliability.
AssureIs production behavior inside the approved envelope?Replay, canary, drift, A/B testing
RecoverDid the fix resolve the incident and improve coverage?Incident replay, recertification, runtime drill

Keep development loops lightweight; increase evidence depth as autonomy, data sensitivity, production exposure, and consequence increase.

What changes as the system under test becomes more agentic

The system under test (SUT) may be a prompt, model endpoint, RAG workflow, tool workflow, Model Context Protocol (MCP) application, single agent, multi-agent system, long-running agent, or deployed service. Evaluation should match the surface being tested. Table 3 shows how evaluation depth changes as the system under test becomes more agentic.

Table 3. Evaluation focus by system under test

System under testEvaluateWhy final-answer scoring is insufficient
Prompt / LLM endpointInstruction following, format, factuality, refusal, latency, costOnly checks one response
RAG workflowRetrieval, grounding, citations, freshness, access controlEvidence may be stale or unauthorized
Tool workflowTool choice, arguments, schema, policy, state verificationPolicy can fail despite a good answer
MCP applicationDiscovery, descriptors, auth, consent, side effectsMCP is also a security boundary
Single agentOutcome, path, tools, memory/state, recovery, terminationUnsafe autonomy can still succeed
Multi-agent systemDelegation, role attribution, shared state, escalationTeam success can hide the violator
Long-running agentSession lifecycle, checkpoint/resume, callbacks, stale intent, budgetFailure can occur hours later
Production serviceReplay, canary, drift, runtime action, incident replayOffline scores do not prove production behavior

Final-answer scoring fits simple response checks; tool-using, stateful, MCP-enabled, and production agents need trajectory evidence. Use the lightest evaluation that matches the risk and action surface.

Evaluate outcome, path, state, and recovery

For agents, task completion and path quality must be evaluated separately. Outcome checks whether the task completed. Path checks whether the agent used allowed evidence, tools, identities, approvals, and side-effect controls. Tool use checks selection, arguments, authorization context, and policy decisions. State verification checks the target system, not the agent’s self-report. Recovery and termination check whether the agent handled errors, missing evidence, timeouts, loops, refusal, escalation, and stop conditions safely.

Quality and assurance should also remain separate. Quality evidence measures task value: success, correctness, retrieval quality, groundedness, citations, reliability, latency, and cost. Assurance evidence measures whether the system stayed inside safety, security, privacy, access-control, approval, side-effect, and autonomy boundaries. A high average quality score should not hide a privacy leak, approval bypass, unsafe tool path, or uncontrolled write.

RAG, tool, and MCP evidence are different

RAG evaluation: Evaluate retrieval sufficiency, grounding, citation fidelity, freshness, and access-control filtering separately. A citation is useful only if it supports the specific claim it is attached to; high citation count is not citation correctness.

Tool evaluation: Inspect the request, arguments, schema validation, authorization context, policy decision, approval state, and side-effect result. For write or privileged actions, verify target state through a fixture, sandbox, UI state, API response, or post-action invariant.

MCP evaluation: Treat discovery, descriptors, tools, resources, authorization, consent, token handling, egress, sessions, and side effects as protocol and security-boundary evidence. Boundary compliance should not be inferred from the final answer.

Reliability evaluation goes beyond one successful run

Teams should ask not only, “Can the agent succeed?” but also, “How often does it succeed, through which safe paths, under which perturbations, at what cost and latency, and with what failure modes?”

For stochastic agents, one successful run is not release evidence. Report outcome consistency, trajectory consistency, tool-choice variance, recovery variance, latency/cost variance, and policy-decision variance. Repeated attempts on the same sample should not be counted as independent samples. If a required high-risk slice is underpowered, route the result to REVIEW rather than PASS.

The score is only as good as the metric and evaluator. For release-grade use, gateable metrics should have MetricCards, gateable judges or review protocols should have EvaluatorCards, and G2/G3 decisions should use a StatisticalDecisionPlan.

Long-running agents add session risk: checkpoint/resume, callbacks, memory, approval TTL, stale-intent checks, budget exhaustion, dependency drift, safe termination, and runtime disablement must be evaluated across the session, not only at the first response.

Trace the behavior you evaluate

This depends on observable behavior, not private chain-of-thought. A useful trace captures model calls, retrieval spans, tool/MCP calls, policy decisions, approvals, state events, recovery events, termination events, errors, timings, costs, and verifier results. Figure 2 shows the observable path an evaluator should inspect when an agent plans, retrieves evidence, checks authorization, acts, verifies state, and recovers or replans.

Agent evaluation inspects both the outcome and the observable path: planning, retrieval, authorization, action, verification, recovery, and replanning. Scores should link back to trace evidence, not just final responses.

Figure 2. Agent evaluation inspects both the outcome and the observable path: planning, retrieval, authorization, action, verification, recovery, and replanning. Scores should link back to trace evidence, not just final responses.

For RAG, the trace should preserve chunk IDs, citation mapping, corpus or index version, freshness metadata, and access-control decisions. For write actions, it should link to a state verifier.

Minimal trace for a tool-enabled agent:

model_call -> retrieval -> policy_decision -> tool_call_request -> tool_call_result -> state_event -> recovery_or_termination_event

Without trace linkage, a score is only a dashboard. With trace linkage, teams can debug failures, replay cases, compare releases, and connect incidents back to coverage.

Debugging map: Wrong answer -> inspect prompt, model, retrieval, grounding. Correct answer but unsafe action -> inspect tools, policy, approvals, state verification. Flaky agent -> inspect repeated trajectories, variance, loops, cost, latency. Production drift -> inspect replay, canary, dependency changes, and incident traces.

Production assurance turns evidence into action

Evaluation does not stop at release. Production behavior can drift as traffic, corpora, tools, prompts, policies, models, and user behavior change. Checks should continue through replay, canary or online assurance, drift detection, long-running session monitoring, and incident replay.

When severe issues appear, the system should route to an accountable decision, not just an alert. Release and production-assurance decisions may emit PASS, FAIL, REVIEW, QUARANTINE, ROLLBACK, DOWNGRADE, or DISABLE depending on the evidence and lifecycle stage. Recertification is the follow-up workflow that proves the fix, updates coverage, and returns the capability to an approved state.

A production-assurance system should not silently drop high-risk slices, repeated stochastic trials, trace capture, state verification, or human-review paths because of queue pressure, provider throttling, budget exhaustion, stale cache reuse, or reviewer backlog. When required coverage is lost, the decision should route to REVIEW or FAIL with reason codes and owner accountability.

A release or production-assurance run should emit an accountable decision, not just a score. In the support-agent case above, the result would be FAIL, with reason codes such as approval_boundary_violation, wrong_authorization_context, and state_modified_before_approval, linked to the trace, policy event, and state verifier. The next action might be human review and a temporary downgrade to read-only mode.

Release and production-assurance decisions should remain trace-linked, reviewable, owned, and actionable. High-risk or audit-grade decisions may require an EvidenceBundle; lightweight development and CI runs usually do not.

How teams can start on OCI

Teams do not need the full release-grade path on day one.

  • SDK/CLI is best for local authoring, prompt and RAG regression, trajectory debugging, CI checks, and fast developer feedback.
  • Managed evaluation service is best when evaluation supports production promotion (includes A/B testing), replay, canary assurance, runtime action, shared evidence, sensitive data, tenant isolation, retention, cost/capacity controls, or audit-grade evidence.

A practical first week: pick one RAG path, one tool path, and one refusal or escalation path; run SDK/CLI diagnostics with versioned samples and traces; add trajectory checks for tool use, authorization, state verification, and recovery; then promote consequential slices toward managed evaluation.

The delivery model should not create two evaluation standards. SDK/CLI and managed-service paths should share declared SUTs, versioned datasets, traces, metrics, evaluator definitions, gate policies, decisions, and evidence artifacts where applicable.

In an OCI implementation, developers can use lightweight SDK/CLI runs for local diagnostics and CI regression. Evaluation engineers can add trace coverage, RAG evidence, tool/MCP checks, and reliability runs. Release owners can promote selected runs into managed gates. Security, privacy, and SRE teams can use production replay, canary assurance, runtime actions, and evidence packages to keep deployed agents inside the approved envelope.

Six evaluation questions before an agent acts

Before consequential action, ask:

  1. Which SUT, model, prompt, tool, policy, and dataset versions were tested?
  2. Which scenarios, risk slices, negative paths, and production journeys were covered?
  3. Which retrieval evidence, citations, and access-control decisions supported the answer?
  4. Which tools were called, with what arguments, identity, authorization, and approval context?
  5. What state changed, and how was it verified?
  6. What happens if the system drifts, fails, loops, violates a boundary, or loses evidence coverage?

Maturity boundary

This framework improves evidence quality and decision discipline; it does not prove the absence of all AI risk. Production-service, audit-ready, or Reference Architecture claims should be made only for the specific SUT class, risk tier, decision grade, and delivery model where implementation evidence has been produced, reviewed, and approved. Until those artifacts exist, the correct claim is narrower: framework design is defined, and implementation evidence is pending or scoped to the artifacts that actually exist.

From answer checks to lifecycle evaluation

The next frontier is lifecycle evaluation: testing prompts, retrieval, tools, trajectories, state changes, reliability, production behavior, and recovery together.

An agent capability should be promoted only when the outcome is useful, the path is acceptable, the tools and data are authorized, the state change is verified, and the system can recover when something goes wrong.

A good first evaluation set includes one RAG journey, one tool-enabled journey, one negative path, one recovery path, and one production replay slice. Make those paths traceable before expanding coverage.

The model proposes. The system acts. The platform must evaluate the full lifecycle.