Posts

AI Is Changing How We Write Infrastructure, But It’s Not Solving How We Control It

Over the past year, AI has fundamentally changed how software is written. Infrastructure code is no exception. Tasks that once required deep familiarity with tools, syntax, and workflows can now be handled through natural language. Engineers are no longer starting from a blank file. In many cases, reviewing and modifying code generated for them has become the norm.

At a high level, this looks like progress, and in many ways, it is. Teams can move faster, the barrier to entry is lower, and experimentation is easier. But there is a growing gap that many organizations are only beginning to recognize: AI is accelerating how infrastructure is created, but it is not solving how infrastructure is understood, controlled, or governed.

The Shift from Writing to Generating

Traditionally, infrastructure as code assumed that the person writing the configuration understood what it would do. That assumption no longer holds. Today, it is increasingly common for infrastructure definitions to be gen...
Recent posts

Why Your AI Agent is a Black Box and How to Fix It With OpenTelemetry

You built the agent. It works in testing. Then it hits production and starts giving wrong answers, timing out or burning through your token budget, and you have no idea why. This is when developers discover that print statements and log files weren’t designed for this.

LLM applications fail in ways that traditional tooling can’t see. A hallucination doesn’t throw an exception. A slow retrieval step doesn’t show up in CPU metrics. A prompt that worked yesterday silently degrades today.

The fix is observability, and the standard for doing it right is OpenTelemetry (OTel).

What OpenTelemetry Actually Is

OTel isn’t a monitoring product; it’s a vendor-neutral specification under the CNCF that defines a standard way to collect observability data: What gets collected, what it’s called and how it’s shipped. You instrument your application once and can send that data to Grafana, Datadog, Jaeger or a purpose-built LLM platform without rewriting your instrumentation.

That portabil...

Agentic SRE: The Next Frontier of Reliability

Agentic SRE is the evolution of site reliability engineering where AI agents help observe systems, reason over telemetry and take bounded operational actions under human-defined guardrails. The goal is not to replace SREs, but to reduce toil, speed up diagnosis and make incident response more consistent and scalable.

Why This Matters

Modern systems are too distributed, noisy and fast-moving for purely manual operations to keep up. Engineers spend significant time correlating dashboards, reading logs, checking recent deploys and hunting for context before they can even start fixing the problem. Agentic SRE addresses this by turning telemetry into actionable context and automating safe parts of the response loop.

This shift is especially important because reliability work is full of repetitive, high-pressure tasks that are easy to standardize but hard to execute perfectly at 2 a.m. That makes it a perfect fit for agents that can summarize, correlate, recommend and execute wit...

5 Ways Agentic AI is Redefining DevOps Architecture for Self-Healing CI/CD Systems

In the past, the flaky test was a problem: A race condition, a timeout, an annoyance that needed to be rerun and forgotten. That’s no longer the case. As enterprises transition from deterministic applications to agentic AI, the flakiness problem has become a structural issue.

Old CI/CD systems rely on binary assertions: Assert X == Y. But with AI agents, the output isn’t Y; it’s Y-like answers. Run the same agent twice, and it will likely produce two defensible but different results. So the test suite, built on a scenario that no longer exists, calls this a failure.

DevOps teams and engineers face the challenge not just of building agents but also of recreating the entire pipeline.

In this post, we will share how agentic AI is transforming the DevOps architecture for self-healing CI/CD.

What Does the Term “Agentic” Mean Here?

Agentic AI is an automated system capable of receiving a target state, sensing its surroundings using telemetry and APIs, reasoning about the act...

JFrog Report Surfaces Need for Rapid DevSecOps Change in AI Era

A report published by JFrog finds that cybercriminals are increasingly targeting the artificial intelligence (AI) tools and platforms used by application development teams. Based on an analysis of 18.2 billion artifacts managed via the JFrog Platform, security researchers discovered 969 AI agent skills carrying high-impact payloads in addition to 495 malicious AI models on the Hugging Face platform for hosting open source AI models. Additionally, 56 malicious extensions were discovered on the OpenVSX registry. The report also finds 41% of respondents work for organizations that are actively using AI libraries, with organizations on average employing 9.3 AI libraries each. At the same time, a separate global survey of 1,508 security and DevOps professionals conducted by JFrog finds more organizations are struggling to secure code generated by AI coding tools. Nearly half of respondents (45%) said reviewing and hardening AI-generated code is now a major time drain, with an eq...

On-Call: The Silent Force Shaping Engineering Culture

There is a silent force shaping engineering culture inside every technology organization. It affects productivity, team morale, psychological safety, and long-term retention. And yet, it is rarely discussed in executive meetings or reflected in meaningful KPIs. That force is on-call. On-call is one of the most direct touchpoints engineers have with the reality of the systems they own. When it’s healthy, it builds confidence, resilience, and pride. When it’s unhealthy, it quietly corrodes everything that makes engineering teams effective. And while most companies drastically underestimate this effect, a recent survey found that on-call is the least-liked aspect of software engineering, often leading to burnout and attrition. Poorly managed on-call isn’t only a mental health issue; it can also impact a company’s brand and finances, as recent significant outages from AWS, Azure, and Cloudflare have shown. In this article, I will go over why on-call matters, the current challenges...

Why DORA Metrics Look Different When AI Is Part of Your Development Workflow

DORA metrics have been a reliable compass for engineering teams for over a decade. Deployment frequency, lead time for changes, change failure rate, mean time to recovery, and reliability give teams a shared language for talking about delivery performance. The research behind them is solid, the benchmarks are well-established, and most engineering leaders know what good looks like for each metric.

What is less discussed is how AI-assisted development changes the baseline assumptions those metrics were built on. Not whether DORA metrics are still relevant (they are), but how the same numbers can mean something different when a significant portion of your codebase is being generated by AI coding tools.

Deployment Frequency Goes Up. Sometimes for the Wrong Reasons.

AI coding assistants accelerate code production. Developers who use them ship features faster, close tickets quicker, and generate pull requests at a higher rate than before. For teams tracking deployment frequen...