
Agentic Systems are Breaking Reliability Frameworks 

Security teams have spent years building detection and response capabilities around a failure mode they understood well enough to instrument for. Typically, a service misbehaves, an alert fires and an engineer investigates. This kind of model worked because the systems producing the failures were deterministic enough that misbehavior was visible, measurable and attributable to a cause that a runbook could address.

However, what agentic systems have introduced into that environment is a category of failure that looks nothing like the one the detection infrastructure was built to catch — a failure that completes successfully, logs nothing unusual, returns a clean status code and disappears into the transaction history while the damage it caused propagates quietly through every system the agent touched.

“The governance gap this creates is not a configuration problem that a new tool can close,” says Shahid Ali Khan, principal engineer – DevOps at TestMu AI, an AI-native software testing platform. It is structural, rooted in the assumption that reliability failures and security events are categorically distinct, happen through different mechanisms and require different response processes. 

Agentic systems break that assumption, Khan explains, because the same root cause (a manipulated input, a drifted model or a misconfigured capability boundary) can produce either outcome depending on context. Organizations that route reliability and security to different teams with different runbooks will keep discovering that gap through incidents rather than through the architectural decisions that could have prevented them.

Testing Infrastructure That was Built for the Wrong Assumption 

The testing problem that agentic systems create is not a harder version of the testing problem that deterministic systems create. It is a structurally different problem that requires a different kind of answer. Ihor Zakutynskyi, chief technology officer at FORMA by Universe Group, describes the shift his team made when they encountered the limits of deterministic test assertions against probabilistic systems. 

“Rather than expecting exact outputs, we moved to constraint-based and statistical validation, asserting invariants and measuring distributions instead of matching outputs,” Zakutynskyi explains. “Hard guarantees, safety and schema contracts, monotonic side-effect rules, idempotency of repeated calls and bounded response times remain pass-fail invariants. Everything above that baseline moves to statistical validation, running Monte Carlo-style test suites over representative inputs and computing stability metrics from the semantic embeddings of responses rather than comparing strings.” 

The shift from exact match to distribution-based validation is not a concession to imprecision. It is a more accurate representation of what reliability actually means for a probabilistic system. Moreover, teams that resist it in favor of deterministic assertions will find themselves maintaining tests that pass consistently while missing the regressions that matter most. 
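The two tiers Zakutynskyi describes can be sketched roughly as follows. This is a minimal illustration, not his team's implementation: the response fields (`answer`, `latency_s`, `embedding`), the thresholds and the function names are all assumptions, and the embeddings are presumed to come from whatever embedding model a team already uses.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_invariants(response, max_latency_s=2.0):
    """Tier 1: hard pass-fail checks (hypothetical schema and latency bound)."""
    return (
        isinstance(response.get("answer"), str)
        and response.get("latency_s", float("inf")) <= max_latency_s
    )


def stability(embeddings):
    """Tier 2: mean pairwise cosine similarity of response embeddings."""
    pairs = [
        cosine(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    return mean(pairs) if pairs else 1.0


def validate(responses, min_stability=0.9):
    """Fail on any invariant breach; otherwise judge the whole distribution."""
    if not all(check_invariants(r) for r in responses):
        return False
    return stability([r["embedding"] for r in responses]) >= min_stability
```

A single invariant breach fails the suite outright, while the semantic check only fails when the responses as a group drift apart, which mirrors the split between hard guarantees and statistical validation.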

Ronak Desai, CEO and founder of Ciroos and formerly SVP and GM at Cisco, frames the same shift in terms that engineering leaders will find immediately actionable. “The question isn’t whether your test passed,” he notes. “It’s whether your system is reliably capable.” That reframe demands moving from assertion-based testing to distribution-based testing, asking across many independent runs on the same task how many produce a correct outcome and treating that ratio as the reliability signal rather than the result of any individual run. Variance stability, the consistency of an agent’s output distribution across runs, tells you whether an agent is reliable. A single passing run tells you almost nothing about the system’s actual capability envelope. 
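One way to make that ratio concrete is to report a conservative lower bound on the pass rate instead of a raw pass or fail. The sketch below uses the Wilson score interval, a standard statistical technique chosen here for illustration and not something Desai prescribes; `run_task` and `is_correct` are hypothetical callables standing in for an agent invocation and an outcome check.

```python
import math


def wilson_lower_bound(successes, runs, z=1.96):
    """Lower end of the Wilson score interval (z=1.96 is roughly 95%)."""
    if runs == 0:
        return 0.0
    p = successes / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denom


def pass_rate(run_task, is_correct, runs=50):
    """Run the same task independently `runs` times and count correct outcomes."""
    successes = sum(1 for _ in range(runs) if is_correct(run_task()))
    return successes, successes / runs
```

Under these assumptions a single passing run yields a lower bound of only about 0.21, while 50 passes out of 50 yields about 0.93, which puts a number on how little one green run says about the capability envelope.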

Before any new agentic component reaches production, it should be tested against real production sessions in parallel, not to match exact outputs but to measure how consistent the new agent’s output distribution is compared to the established baseline. That comparison is the test. Everything else is preparation for it. 

Arun Anbumani, principal cloud infrastructure engineer at Oracle, adds the infrastructure dimension to the testing picture that pure model-focused approaches miss. “Replay-style testing against captured production traffic patterns and fault injection introducing controlled disruptions, resource contention, device resets and driver mismatches give teams visibility into how systems respond when the hardware paths underneath the models start behaving differently,” Anbumani explains. “The broader challenge is that most SRE tooling was built for predictable services, and as infrastructure becomes more heterogeneous and development workflows incorporate AI-assisted tooling, the testing and observability platforms are still evolving to keep up.” 

Testing infrastructure for agentic systems, therefore, has to be built on the assumption that variability is not the exception but the operating condition. Fault injection and replay testing are not edge-case preparation but the core of a testing regime designed for an environment where the normal operating envelope is wider and less well-defined than in any previous generation of infrastructure.

The Governance Layer Nobody Built 

The hardest part of running agentic systems in production is not building them. Khan, drawing on his experience running agentic infrastructure at TestMu AI (formerly LambdaTest), pinpoints a difficulty that practitioners who have not yet operated agentic systems at scale may not have encountered. “Traditional runbooks assume failures are obvious,” he observes. “A service crashes, latency spikes, errors propagate. Agents fail subtly. They might complete successfully while doing something completely unintended.”

Detecting that failure mode without triggering false positives on every creative decision the agent makes is where existing governance frameworks fall short, and building the controls that close that gap is the work most organizations have not done yet.

Khan’s approach is to build governance controls at the platform level that operate on behavioral boundaries rather than system metrics. Every agent has an explicitly defined capability envelope covering what tools it can invoke, what data it can access, what output formats are valid and what actions require human approval. These are not permissions in the traditional sense. They are runtime assertions, checked at the moment of execution rather than granted at deployment time and assumed to hold thereafter. When an agent invokes a tool outside its envelope or generates output that does not match expected schemas, the event is captured with full context (the input that triggered it, the reasoning chain the agent followed and the attempted action) and routed to a dedicated anomaly pipeline separate from standard incident management.
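A capability envelope enforced at call time might look something like the sketch below. It is an illustration under stated assumptions, not TestMu AI's implementation: `Envelope`, `EnvelopeGuard`, `Violation` and the list used as the anomaly pipeline are all hypothetical names, and only the tool-invocation part of the envelope is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical capability envelope: which tools an agent may invoke."""
    allowed_tools: frozenset
    approval_required: frozenset = frozenset()


@dataclass
class Violation:
    """Context captured when an agent steps outside its envelope."""
    agent_id: str
    tool: str
    trigger_input: str
    reasoning: str


class ApprovalRequired(Exception):
    """Raised when an in-envelope action still needs human sign-off."""


class EnvelopeGuard:
    def __init__(self, agent_id, envelope, anomaly_sink):
        self.agent_id = agent_id
        self.envelope = envelope
        self.anomaly_sink = anomaly_sink  # stand-in for a dedicated anomaly pipeline

    def invoke(self, tool, fn, *, trigger_input, reasoning, approved=False):
        # The check runs on every call, not once at deployment time.
        if tool not in self.envelope.allowed_tools:
            self.anomaly_sink.append(
                Violation(self.agent_id, tool, trigger_input, reasoning)
            )
            raise PermissionError(f"{tool} is outside the envelope of {self.agent_id}")
        if tool in self.envelope.approval_required and not approved:
            raise ApprovalRequired(tool)
        return fn()
```

Because every tool call passes through `invoke`, a change in the agent's behavior is caught at the moment it acts, and the violation record carries the triggering input and reasoning alongside the attempted action.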

Building behavioral boundaries as runtime assertions rather than deployment-time permissions is the architectural decision that makes the governance layer enforceable rather than advisory. Permissions that are granted at deployment and never checked again are assumptions, and in an agentic system operating at machine speed, unverified assumptions are where the most consequential failures begin. 

“It is also worth implementing circuit breakers specifically for agent autonomy,” Khan notes. If an agent exceeds a threshold of envelope violations within a time window, it is automatically downgraded to a supervised mode where all actions require human approval. This limits the blast radius of a compromised or misbehaving agent while the team investigates, and it does so without requiring a human to notice the problem first. The circuit breaker fires on the pattern, not on a human’s recognition of it, which is the only response mechanism fast enough to be meaningful when an agent is operating at machine speed.
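The autonomy circuit breaker described here can be sketched as a sliding-window counter. The class name, the default threshold and window, and the explicit `now` timestamp parameter are illustrative assumptions and not details from Khan's system.

```python
from collections import deque


class AutonomyBreaker:
    """Trips an agent into supervised mode after too many envelope violations."""

    def __init__(self, max_violations=3, window_s=60.0):
        self.max_violations = max_violations
        self.window_s = window_s
        self._times = deque()
        self.supervised = False

    def record_violation(self, now):
        """Record a violation at time `now` (seconds); return supervised state."""
        self._times.append(now)
        # Drop violations that have aged out of the sliding window.
        while self._times and now - self._times[0] > self.window_s:
            self._times.popleft()
        if len(self._times) > self.max_violations:
            self.supervised = True  # stays tripped until a human resets it
        return self.supervised

    def reset(self):
        """Called by a human once the investigation is complete."""
        self._times.clear()
        self.supervised = False
```

The breaker latches once tripped, so the downgrade persists until someone explicitly calls `reset`. That design choice keeps the agent supervised for the full duration of the investigation even if violations stop.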

When a Reliability Failure is Also a Security Event 

The boundary between a reliability failure and a security event in an agentic system is not always a boundary at all. Khan explains how his team encountered this directly when building the classification system for envelope violations. An agent accessing an unauthorized API could be a misconfiguration, which is a reliability problem, or a prompt injection attack, which is a security problem. From the outside, the two events look identical. “We have adopted a classify later, capture now approach,” he explains. “Every envelope violation is logged with enough context for both SRE and security review.” A secondary classification system then tags events based on input source. If the anomaly correlates with user-provided content, it is flagged for security review. If it correlates with a model update or configuration change, it routes to SRE. 

The classify-later, capture-now approach resolves a real organizational tension that most incident response processes are not designed to handle. Forcing immediate classification of an event whose root cause is ambiguous leads either to misrouting, where a security event gets treated as a reliability problem until the damage is done, or to alert fatigue, where every ambiguous event gets escalated to both teams until neither team takes the escalation seriously.
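The separation of capture from classification might be sketched as below. The `input_source` labels, the `Route` values and the function names are hypothetical, and a real secondary classifier would correlate against actual deployment and input telemetry instead of a single field.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SECURITY = "security"
    SRE = "sre"
    UNCLASSIFIED = "unclassified"


@dataclass
class CapturedEvent:
    agent_id: str
    attempted_action: str
    input_source: str  # hypothetical labels: "user_content", "model_update", "config_change"
    context: dict      # enough detail for both SRE and security review
    route: Route = Route.UNCLASSIFIED


def capture(log, event):
    """Capture now: store the full event without deciding who owns it."""
    log.append(event)
    return event


def classify(event):
    """Classify later: tag the stored event based on what it correlates with."""
    if event.input_source == "user_content":
        event.route = Route.SECURITY
    elif event.input_source in {"model_update", "config_change"}:
        event.route = Route.SRE
    return event.route
```

Events that match neither pattern stay `UNCLASSIFIED` in this sketch and are not forced into one team's queue, which avoids the premature routing the passage warns about.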



from DevOps.com https://ift.tt/M1Xb9wA
