
Why Your AI Agent Is a Black Box and How to Fix It With OpenTelemetry

You built the agent. It works in testing. Then it hits production and starts giving wrong answers, timing out or burning through your token budget, and you have no idea why. This is when developers discover that print statements and log files weren’t designed for this kind of failure.

LLM applications fail in ways that traditional tooling can’t see. A hallucination doesn’t throw an exception. A slow retrieval step doesn’t show up in CPU metrics. A prompt that worked yesterday silently degrades today. 

The fix is observability, and the standard for doing it right is OpenTelemetry (OTel). 

What OpenTelemetry Actually Is 

OTel isn’t a monitoring product; it’s a vendor-neutral specification under the CNCF that defines a standard way to collect observability data: What gets collected, what it’s called and how it’s shipped. You instrument your application once and can send that data to Grafana, Datadog, Jaeger or a purpose-built LLM platform without rewriting your instrumentation. 

That portability matters more than people realize early on. Your observability investment is in your instrumentation code, not in the back end you happen to be using today. 

The Semantic Conventions Problem Nobody Talks About 

Every LLM observability platform claims OTel compatibility. Technically, most are — they’ll accept an OTLP payload without crashing. However, protocol-level compatibility says nothing about whether your spans will actually mean anything on the other side. 

The problem is semantic conventions. OTel defines how to send data but doesn’t fully define what to name LLM-specific attributes. Three competing standards have emerged: OTel’s own GenAI conventions (still evolving, not fully ratified), Arize’s OpenInference conventions (used by LlamaIndex, structurally different) and whatever each vendor decided to call things before any standard existed. 

In practice, this means your LlamaIndex pipeline emits OpenInference, your custom LLM wrapper emits GenAI conventions and your framework’s built-in tracing emits something proprietary. All three land as valid OTLP. None use the same attribute names. Your token usage dashboard is reading three different fields depending on which span it hits. 
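
One way to cope with the mismatch described above is a small normalization layer in front of the dashboard. The following is a minimal sketch in plain Python, not part of any OTel SDK. The first two sets of attribute names follow the GenAI and OpenInference conventions as published at the time of writing and should be checked against the current specs; `tokens_in` and `tokens_out` are made-up stand-ins for a proprietary scheme, and `normalize_token_usage` is a hypothetical helper.

```python
# Candidate attribute names for the same two quantities, in lookup order.
PROMPT_TOKEN_KEYS = (
    "gen_ai.usage.input_tokens",   # OTel GenAI conventions (still evolving)
    "llm.token_count.prompt",      # OpenInference
    "tokens_in",                   # hypothetical vendor-specific name
)
COMPLETION_TOKEN_KEYS = (
    "gen_ai.usage.output_tokens",
    "llm.token_count.completion",
    "tokens_out",                  # hypothetical vendor-specific name
)


def first_present(attributes: dict, keys: tuple) -> int:
    """Return the first matching attribute as an int, or 0 if none is set."""
    for key in keys:
        if key in attributes:
            return int(attributes[key])
    return 0


def normalize_token_usage(attributes: dict) -> dict:
    """Map whichever convention a span used onto one pair of field names."""
    return {
        "prompt_tokens": first_present(attributes, PROMPT_TOKEN_KEYS),
        "completion_tokens": first_present(attributes, COMPLETION_TOKEN_KEYS),
    }


# Three spans, three conventions, one dashboard query.
spans = [
    {"gen_ai.usage.input_tokens": 1240, "gen_ai.usage.output_tokens": 310},
    {"llm.token_count.prompt": 800, "llm.token_count.completion": 120},
    {"tokens_in": 500, "tokens_out": 90},
]
total_prompt = sum(normalize_token_usage(s)["prompt_tokens"] for s in spans)
print(total_prompt)  # 2540
```

A mapping like this is a stopgap. It keeps dashboards honest while the conventions settle, but it has to be updated whenever a library renames an attribute.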

This is the real state of LLM observability tooling in 2026. The protocol works. The conventions are a mess. It mirrors what happened with APM a decade ago: Fragmentation precedes consolidation. The difference is that the LLM space is moving faster, and teams that pick a coherent instrumentation strategy now will avoid painful migrations later. 

Why Traces Matter More Than Logs for LLM Work 

OTel has three pillars: Logs, metrics and traces. For LLM applications, traces do the heavy lifting. 

A log is fire-and-forget. Something happens, you write a line. No duration, no parent-child relationship, no shared context. Metrics tell you something is wrong but not what exactly went wrong with request #a7b3c9 at 2:32 p.m. 

A trace represents the complete life cycle of a single request. Inside that trace are spans, each wrapping one unit of work with a start time, end time, attributes and a known relationship to other spans: 

Trace: user_query [total: 1.2s]
├── Span: routing [12ms]
│   └── router.model=gpt-4o-mini, routing.result=rag
├── Span: retrieval [180ms]
│   └── retrieval.doc_count=5, retrieval.top_score=0.87
├── Span: llm.completion [890ms]
│   └── llm.model=gpt-4o, llm.tokens.prompt=1240, llm.cost_usd=0.0094
└── Span: post_processing [8ms]

You see exactly where time went, what each step cost and what decisions were made. When a user reports a wrong answer, you pull up the trace — already assembled, already timed, already attributed. No grepping through log files trying to reconstruct a sequence. 
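
To make the data model concrete, here is a toy tracer in plain Python that produces the same shape: one trace ID shared by every span, a parent ID linking each child to the root, and a start and end time per span. It is a sketch of the idea only, not the OpenTelemetry SDK. `Span`, `start_span` and `finished` are names invented for this example, and a real setup would use the SDK's tracer and an exporter instead.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


# The span currently in scope; new spans become its children.
_current = ContextVar("current_span", default=None)
finished = []  # stand-in for an exporter: spans land here when they end


@contextmanager
def start_span(name: str, **attributes):
    parent = _current.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
        start=time.perf_counter(),
    )
    token = _current.set(span)
    try:
        yield span
    finally:
        span.end = time.perf_counter()
        _current.reset(token)
        finished.append(span)


with start_span("user_query") as root:
    with start_span("routing") as s:
        s.attributes["routing.result"] = "rag"
    with start_span("retrieval") as s:
        s.attributes["retrieval.doc_count"] = 5
    with start_span("llm.completion") as s:
        s.attributes["llm.tokens.prompt"] = 1240

# Children finish before the root, so the root is printed last.
for span in finished:
    indent = "  " if span.parent_id else ""
    print(f"{indent}{span.name} [{span.duration_ms:.2f}ms] {span.attributes}")
```

The point of the sketch is that the timing and the parent-child links are captured as the work happens, which is why nothing has to be reconstructed afterward.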

One trend worth calling out: As model providers add more built-in capabilities (function calling, structured outputs, vision), the surface area of what needs to be traced per request keeps expanding. Teams that treat observability as an afterthought are accumulating blind spots faster than they realize. 

What Makes Agents Especially Hard to Observe 

Simple chat completions are easy — one request, one LLM call, one response. Agents are different. An agent might route to one of several tools, loop multiple times before settling on an answer, run subtasks in parallel or hand off to another agent. Each step is a span, and several things break naive approaches. 

Context propagation across tool calls requires trace IDs to travel with outgoing requests. Multi-agent traces need explicit context passing across agent boundaries, which most frameworks don’t do automatically. Loops need clearly labeled iterations, not identical span names you untangle manually. Routing decisions should be recorded as attributes so you can see why the agent chose RAG over a direct LLM call. 
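
Context propagation is usually done with the W3C Trace Context `traceparent` header, which carries the trace ID and the caller's span ID across a process boundary. The sketch below builds and parses that header by hand in plain Python to show what travels with the request. The helper names (`inject`, `extract`) and the `tool.search` span are assumptions for illustration, and in practice the OTel SDK's propagators do this for you. Only version `00` of the header is handled.

```python
import re
import secrets

# traceparent = version "-" trace-id "-" parent-id "-" trace-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Attach the caller's trace context to outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: dict):
    """Return (trace_id, parent_span_id), or None if the header is missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    return trace_id, parent_span_id


# Agent side: about to call a tool over HTTP.
trace_id, agent_span_id = new_trace_id(), new_span_id()
outgoing = inject({"content-type": "application/json"}, trace_id, agent_span_id)

# Tool side: continue the same trace instead of starting a new one.
incoming_trace_id, parent_span_id = extract(outgoing)
tool_span = {
    "name": "tool.search",          # hypothetical tool span
    "trace_id": incoming_trace_id,
    "parent_id": parent_span_id,
    "span_id": new_span_id(),
}
print(tool_span["trace_id"] == trace_id)  # True
```

If the tool ignores the header, its spans still export as valid data but under a fresh trace ID, which is how multi-agent traces end up in disconnected pieces.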

One useful mental model is representing agents as state machines, where each state transition is a span. This makes control flow visible in the trace rather than implied by span ordering. If you can see the states an agent moved through, debugging a wrong decision gets tractable. 
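
A minimal sketch of that state-machine model, in plain Python: every transition is validated against an allowed set and recorded as one span-like record with the reason and a step counter, so a loop shows up as numbered steps rather than identical names. The states, the `TracedAgent` class and the `agent.*` attribute names are all assumptions made for this example, not an established convention.

```python
from enum import Enum


class State(Enum):
    ROUTE = "route"
    RETRIEVE = "retrieve"
    GENERATE = "generate"
    DONE = "done"


# Which states may follow which. GENERATE can loop back for more context.
TRANSITIONS = {
    State.ROUTE: {State.RETRIEVE, State.GENERATE},
    State.RETRIEVE: {State.GENERATE},
    State.GENERATE: {State.RETRIEVE, State.DONE},
}


class TracedAgent:
    def __init__(self):
        self.state = State.ROUTE
        self.spans = []  # one record per transition, in order

    def transition(self, new_state: State, reason: str) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.spans.append({
            "name": f"agent.transition {self.state.value}->{new_state.value}",
            "agent.state.from": self.state.value,
            "agent.state.to": new_state.value,
            "agent.transition.reason": reason,
            "agent.step": len(self.spans) + 1,  # labels loop iterations
        })
        self.state = new_state


agent = TracedAgent()
agent.transition(State.RETRIEVE, "query mentions internal docs")
agent.transition(State.GENERATE, "5 documents retrieved")
agent.transition(State.RETRIEVE, "draft answer had low confidence")
agent.transition(State.GENERATE, "3 more documents retrieved")
agent.transition(State.DONE, "answer accepted")

path = ["route"] + [s["agent.state.to"] for s in agent.spans]
print(" -> ".join(path))
# route -> retrieve -> generate -> retrieve -> generate -> done
```

Recording the reason as an attribute is the part that pays off later: the trace then shows why the agent went back to retrieval, not only that it did.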

This also reflects something happening across the industry: As agent architectures grow more complex (multi-agent orchestration, dynamic tool selection, self-correction loops), the gap between ‘it ran’ and ‘I understand what it did’ keeps widening. Observability isn’t just debugging infrastructure anymore; it’s becoming the feedback loop that teams use to actually improve agent behavior over time. 

What It Looks Like When It Works

A user reports that your research agent gave outdated information. With traces, you open the specific request and see the retrieval step returned documents from 2023, the LLM generated a confident response from stale content and no error was thrown because technically nothing failed. A warning event flagged that the content age exceeded a threshold. 

This is the class of failure that defines LLM observability: Silent quality problems rather than system errors. The trace caught it because the instrumentation measured the content age and flagged it. That’s the difference between finding the problem in a trace and having a user report it three days later. 
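
The content-age check in that scenario could look roughly like the sketch below. It is plain Python with spans modeled as dictionaries. The one-year threshold, the `published_at` field, the attribute names and the `check_content_age` helper are all assumptions for illustration. With the OTel SDK the same idea would be an attribute plus an event on the retrieval span.

```python
from datetime import datetime, timedelta, timezone

MAX_CONTENT_AGE = timedelta(days=365)  # assumed threshold, tune per use case


def check_content_age(span: dict, documents: list, now: datetime) -> dict:
    """Record the oldest document's age and flag it if it exceeds the threshold."""
    if not documents:
        return span
    oldest = min(doc["published_at"] for doc in documents)
    age = now - oldest
    span["attributes"]["retrieval.oldest_doc_age_days"] = age.days
    if age > MAX_CONTENT_AGE:
        # Nothing failed, so this is a warning event rather than an error.
        span["events"].append({
            "name": "stale_content_warning",
            "severity": "WARN",
            "attributes": {
                "age_days": age.days,
                "threshold_days": MAX_CONTENT_AGE.days,
            },
        })
    return span


retrieval_span = {"name": "retrieval", "attributes": {}, "events": []}
docs = [
    {"id": "a", "published_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"id": "b", "published_at": datetime(2025, 11, 20, tzinfo=timezone.utc)},
]
check_content_age(retrieval_span, docs, now=datetime(2026, 1, 15, tzinfo=timezone.utc))

for event in retrieval_span["events"]:
    print(event["name"], event["attributes"])
```

The check is cheap because the retrieval step already has the documents in hand. The only extra work is deciding what counts as stale for your domain.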

Where This Is Heading

The LLM observability space is still early, but a few directions are becoming clear. 

First, evaluation and observability are converging. Today, most teams treat evals as a CI/CD concern and observability as a production concern. However, the same trace data that helps you debug a bad response can also feed automated quality scoring in production. Teams that connect these two loops will iterate faster than those running them separately. 

Second, cost observability is becoming a first-class requirement, not an afterthought. As teams scale from prototypes to production workloads with thousands of daily requests across multiple models, understanding cost per request, per feature and per user segment is table stakes. Token-level attribution across a multi-step pipeline is something most teams currently do in spreadsheets. It belongs in the trace. 
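
As a rough illustration of moving that spreadsheet into the trace, the sketch below computes a cost per LLM span from its token counts and rolls the result up per feature. Everything specific here is an assumption: the model names, the per-million-token prices and the `feature` attribute are placeholders, and real prices change often enough that they belong in configuration.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens: (prompt, completion).
PRICES_PER_MILLION = {
    "small-model": (0.15, 0.60),
    "large-model": (2.50, 10.00),
}


def span_cost_usd(span: dict) -> float:
    """Cost of one LLM call from its token counts and the model's price."""
    prompt_price, completion_price = PRICES_PER_MILLION[span["model"]]
    return (
        span["prompt_tokens"] * prompt_price
        + span["completion_tokens"] * completion_price
    ) / 1_000_000


def cost_by_feature(spans: list) -> dict:
    """Sum span costs per product feature."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["feature"]] += span_cost_usd(span)
    return dict(totals)


llm_spans = [
    {"feature": "search", "model": "small-model", "prompt_tokens": 200, "completion_tokens": 20},
    {"feature": "search", "model": "large-model", "prompt_tokens": 1240, "completion_tokens": 300},
    {"feature": "summarize", "model": "large-model", "prompt_tokens": 4000, "completion_tokens": 500},
]

for feature, total in cost_by_feature(llm_spans).items():
    print(f"{feature}: ${total:.6f}")
```

Writing the computed cost back onto each span as an attribute means the same number is available when looking at a single request and when aggregating across thousands of them.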

Third, the tooling will consolidate. Right now, the space has dozens of startups, several open-source projects and the major APM vendors all building LLM-specific features. History says this shakes out to a few winners within 2–3 years. The teams that instrument on OTel now, regardless of which back end they pick, will be the ones who navigate that consolidation without pain. 

The Bottom Line 

OTel is the right foundation because it is vendor-neutral and you instrument once. But OTel is infrastructure. What matters is what you build on top of it: The semantic understanding of token costs, the alerting that knows when retrieval quality degrades and the trace view that makes an agent’s decision-making visible. If you’re moving past simple chat completions into agents and RAG, the observability requirements go up fast. Traces are how you keep up.



from DevOps.com https://ift.tt/QSye32i
