News and Tech Update

Posts

Showing posts from May, 2026

The Validation Gap Is Costing You More Than You Think

Our latest State of Software Delivery report analyzed more than 28 million CI workflows and found a pattern that should give engineering leaders pause. Average throughput grew 59% year over year. Main branch activity for the median team declined 7%. Teams are generating more code than ever before. Less of it is reaching production. The cost of poor validation used to show up mostly in developer hours: debugging, blocked deployments, context switching. That cost hasn’t gone away. But there is a second bill now. Every failed build means agent retries. Every slow pipeline is compute burning while an agent waits. Main branch success rates have fallen to a five-year low of 70.8% against a 90% benchmark, and the AI spend attached to every failed cycle is climbing alongside it. The teams doing well are catching failures earlier and keeping their pipelines healthier. They are running the same tools as everyone else. What they have structured differently is where and when validation ha...

Why Enterprise AI Infrastructure Is Becoming a DevOps Problem

Most enterprise AI projects start with retrieval. You connect Jira, Confluence, SharePoint, and Slack. Maybe a few internal databases nobody has touched in five years. You tune embeddings, optimize chunking, wire up a vector database, and convince yourself you’ve built an AI-powered knowledge system. Then the model server crashes. And suddenly, you discover the uncomfortable truth about enterprise AI: The hard part was never retrieval. It was infrastructure. For the past two years, the industry has treated LLM deployment like a feature integration problem. In reality, it is rapidly becoming a platform engineering problem, one involving GPU orchestration, scaling economics, governance boundaries, workload scheduling, observability, and operational resilience. The moment organizations move beyond prototypes, the conversation changes fast. Search Was Never the Product Enterprise search already exists. Most organizations have had it for years. But what teams actually want is synthe...

AI Is Changing How We Write Infrastructure, But It’s Not Solving How We Control It

Over the past year, AI has fundamentally changed how software is written. Infrastructure code is no exception. Tasks that once required deep familiarity with tools, syntax, and workflows can now be handled through natural language. Engineers are no longer starting from a blank file. In many cases, reviewing and modifying code generated for them has become the norm. At a high level, this looks like progress, and in many ways, it is. Teams can move faster, the barrier to entry is lower, and experimentation is easier. But there is a growing gap that many organizations are only beginning to recognize: AI is accelerating how infrastructure is created, but it is not solving how infrastructure is understood, controlled, or governed. The Shift from Writing to Generating Traditionally, infrastructure as code assumed that the person writing the configuration understood what it would do. That assumption no longer holds. Today, it is increasingly common for infrastructure definitions to be gen...

Why Your AI Agent is a Black Box and How to fix it With OpenTelemetry

You built the agent. It works in testing. Then it hits production and starts giving wrong answers, timing out or burning through your token budget, and you have no idea why. This is when developers discover that print statements and log files weren’t designed for this. LLM applications fail in ways that traditional tooling can’t see. A hallucination doesn’t throw an exception. A slow retrieval step doesn’t show up in CPU metrics. A prompt that worked yesterday silently degrades today. The fix is observability, and the standard for doing it right is OpenTelemetry (OTel). What OpenTelemetry Actually Is OTel isn’t a monitoring product; it’s a vendor-neutral specification under the CNCF that defines a standard way to collect observability data: What gets collected, what it’s called and how it’s shipped. You instrument your application once and can send that data to Grafana, Datadog, Jaeger or a purpose-built LLM platform without rewriting your instrumentation. That portabil...

Agentic SRE: The Next Frontier of Reliability

Agentic SRE is the evolution of site reliability engineering where AI agents help observe systems, reason over telemetry and take bounded operational actions under human-defined guardrails. The goal is not to replace SREs, but to reduce toil, speed up diagnosis and make incident response more consistent and scalable. Why This Matters Modern systems are too distributed, noisy and fast-moving for purely manual operations to keep up. Engineers spend significant time correlating dashboards, reading logs, checking recent deploys and hunting for context before they can even start fixing the problem. Agentic SRE addresses this by turning telemetry into actionable context and automating safe parts of the response loo p . This shift is especially important because reliability work is full of repetitive, high-pressure tasks that are easy to standardize but hard to execute perfectly at 2 a.m. That makes it a perfect fit for agents that can summarize, correlate, recommend and execute wit...

5 Ways Agentic AI is Redefining DevOps Architecture for Self-Healing CI/CD Systems

In the past, the flaky test was a problem: A race condition, a timeout, an annoyance that needed to be rerun and forgotten. That’s no longer the case. As enterprises transition from deterministic applications to agentic AI, the flakiness problem has become a structural issue. Old CI/CD systems rely on binary assertions: Assert X == Y. But with AI agents, the output isn’t Y; it’s Y-like answers. Run the same agent again, and it will likely produce two defensible but varying results. So, the test suite built on a scenario that no longer exists, calls this a failure. DevOps teams and engineers don’t just face the challenge of building agents but also recreating the entire pipeline. In this post, we will share how agentic AI is transforming the DevOps architecture for self-healing CI/CD. What Does the Term “Agentic” Mean Here? Agentic AI is an automated system capable of receiving a target state, sensing its surroundings using telemetry and APIs, reasoning about the act...

JFrog Report Surfaces Need for Rapid DevSecOps Change in AI Era

A report published by JFrog finds that cybercriminals are now increasingly targeting the artificial intelligence (AI) tools and platforms used by application development teams. Based on an analysis of 18.2 billion artifacts managed via the JFrog Platform, security researchers discovered 969 AI agent skills carrying high-impact payloads in addition to 495 malicious AI models on the Hugging Face platform for hosting open source AI models. Additionally, 56 malicious extensions were also discovered on the OpenVSX registry. The survey also finds 41% of respondents work for organizations that are actively using AI libraries, with organizations on average employing 9.3 AI libraries each. At the same time, a separate global survey of 1,508 security and DevOps professionals conducted by JFrog finds more organizations are struggling to secure code generated by AI coding tools. Nearly half of respondents (45%) said reviewing and hardening AI-generated code is now a major time drain, with an eq...

On-Call: The Silent Force Shaping Engineering Culture

There is a silent force shaping engineering culture inside every technology organization. It affects productivity, team morale, psychological safety, and long-term retention. And yet, it is rarely discussed in executive meetings or reflected in meaningful KPIs. That force is on-call. On-call is one of the most direct touchpoints engineers have with the reality of the systems they own. When it’s healthy, it builds confidence, resilience, and pride. When it’s unhealthy, it quietly corrodes everything that makes engineering teams effective. And while most companies drastically underestimate this effect, a recent survey found that on-call is the least-liked aspect of software engineering, often leading to burnout and attrition. Poorly managed on-call isn’t only a mental health issue; it can also impact a company’s brand and finances, as recent significant outages from AWS, Azure, and Cloudflare have shown. In this article, I will go over why on-call matters, the current challenges...

Why DORA Metrics Look Different When AI Is Part of Your Development Workflow

DORA metrics have been a reliable compass for engineering teams for over a decade. Deployment frequency, lead time for changes, change failure rate, mean time to recovery, and reliability give teams a shared language for talking about delivery performance. The research behind them is solid, the benchmarks are well-established, and most engineering leaders know what good looks like for each metric. What is less discussed is how AI-assisted development changes the baseline assumptions those metrics were built on. Not whether DORA metrics are still relevant — they are — but how the same numbers can mean something different when a significant portion of your codebase is being generated by AI coding tools. Deployment Frequency Goes Up. Sometimes for the Wrong Reasons. AI coding assistants accelerate code production. Developers who use them ship features faster, close tickets quicker, and generate pull requests at a higher rate than before. For teams tracking deployment frequen...