

Site reliability engineering (SRE) promised a better way. Born at Google and evangelized by a generation of platform engineers, SRE offered organizations a disciplined, engineering-first path from firefighting chaos to measured, sustainable operations. Yet years into mainstream adoption, many organizations find themselves spending more on SRE tooling than ever while their on-call engineers are still drowning at 2 a.m.
The pattern is consistent. Titles change. Dashboards multiply. AI-powered AIOps platforms get procured. Error budgets get defined in a spreadsheet and promptly forgotten. Six months later, the postmortems look identical to those from two years ago.
What’s going wrong? Across dozens of engineering organizations surveyed, the same five mistakes surface repeatedly, and they compound each other in ways that are hard to untangle once they’re entrenched.
Renaming your ops team ‘SRE’ without changing how work gets done is the organizational equivalent of putting a racing stripe on a station wagon.
1. Cultural Failure — Treating SRE as a Team Rename, not a Cultural Transformation
The org chart changes. The incentives don’t. Everything else follows from there.
The most common SRE implementation failure is also the most invisible: Declaring victory at the org chart level. A company announces an SRE function, reassigns existing ops engineers to it, and proceeds to operate identically to how it always did, except now the ticket queue says ‘SRE’ at the top.
True SRE requires a genuinely different relationship between development and operations. It demands that developers own reliability outcomes and that SREs are empowered to say no to feature velocity when error budgets are exhausted. It requires psychological safety for blameless postmortems, where engineers disclose what actually happened without fear of career consequences. It needs executives who understand that a burn rate of 90% on an error budget is information, not failure.
None of that comes from an org chart change. It requires sustained, visible leadership commitment to changing how reliability is discussed, measured and rewarded across the engineering organization, including product management and the C-suite. When product managers face zero pressure to absorb error budget burn, they have every incentive to keep pushing features. Meanwhile, the SRE team becomes a pressure valve with no authority: Accountable for reliability they can’t actually control.
How to Overcome This
Cultural transformation requires executive sponsorship with teeth. Start by instrumenting reliability at the leadership level: Make error budget health a standing agenda item in engineering all-hands and in product reviews. Train product managers on SLOs and error budget policy, not just engineers. Define the escalation path when a budget is burned: Who decides to freeze feature work? Make it unambiguous. Run blameless postmortems publicly and celebrate engineers who surface systemic failures early. Behavior follows incentives; change what gets rewarded.
2. Talent Strategy — Hiring for Credentials Instead of Judgment, and Ignoring Toil Literacy
The wrong SRE hire optimizes the wrong things and leaves the biggest leverage on the table.
SRE hiring has developed a predictable pathology. Recruiters copy job descriptions from Google’s public SRE book, list every infrastructure technology known to humanity and then optimize for candidates who have ‘done SRE at a FAANG’. The result is a team that may be technically brilliant but is philosophically misaligned with the organization’s actual reliability problems.
Effective SRE work is, at its core, about engineering judgment under uncertainty. It requires engineers who can reason rigorously about risk, communicate SLO trade-offs to non-technical stakeholders, identify toil and build durable automation to eliminate it, and resist the organizational gravity that turns every good SRE function back into an ops team over time. These capacities don’t appear on resumes, and they don’t surface in algorithm interview rounds.
A related failure is building a team with no appetite for eliminating toil. SRE teams that hire purely for incident response skill tend to become incident response teams, and they become very good at managing outages they could have eliminated. When no one on the team is energized by the project work side of SRE — the platform building, the automation pipelines, the reliability reviews — that work simply doesn’t happen.
How to Overcome This
Redesign your SRE interview process around judgment, not trivia. Include scenario-based panels that ask how a candidate would handle error budget trade-offs, what they’d do with 20% project time, and how they’d push back on a product team burning reliability. Explicitly assess toil literacy: Ask candidates to walk through a past example of toil they identified and eliminated. Build a team with complementary strengths: some engineers with deep incident triage experience, others who are motivated by building platforms that make incidents less likely. Pay SREs on par with your most senior software engineers; the role demands it.
3. Measurement — Defining SLOs in a Spreadsheet and Then Ignoring Them
An error budget no one enforces is just a number. SLO theater is worse than no SLOs at all.
Organizations that have read the SRE book understand, intellectually, that SLOs and error budgets are the foundation of everything. So, they set them. They bring engineers and product managers into a room, debate whether availability should be 99.9% or 99.95%, reach a number that feels politically safe and then document it in a wiki page that no one ever reads again.
This is SLO theater, and it is remarkably common. The tell is always the same: When you ask an engineering leader what their current error budget burn rate is, they don’t know. When you ask whether any team has ever had feature work paused because of a budget burn, the answer is usually no. The SLOs exist as documentation, not as operational reality.
Poorly defined SLOs have their own failure modes. Measuring availability as ‘uptime of our primary load balancer’ rather than as ‘the fraction of requests that completed successfully from the user’s perspective’ means you’re optimizing for infrastructure health, not user experience. When your load balancer is up but your database is saturated and 40% of requests are timing out, your SLO dashboard shows green. This is a measurement failure masquerading as a reliability success.
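To make the distinction concrete, here is a minimal sketch of a user-perspective availability SLI, computed as the fraction of requests that completed successfully rather than as component uptime. The record shape, field names and latency threshold are illustrative assumptions, not any particular vendor’s schema.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    status_code: int
    latency_ms: float


def availability_sli(requests: list[RequestOutcome],
                     latency_threshold_ms: float = 1000.0) -> float:
    """User-perspective SLI: share of requests that completed successfully
    and fast enough to count as 'good' from the caller's point of view."""
    if not requests:
        return 1.0  # no traffic, no failures
    good = sum(
        1 for r in requests
        if r.status_code < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


# A saturated database can push 40% of requests into timeouts while the load
# balancer stays 'up'; this SLI reflects that, a component uptime check does not.
```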
How to Overcome This
SLOs only work when they are operationally binding. This means three things: First, define SLOs from the user’s perspective and measure what users experience, not what infrastructure components report. Second, build real-time burn rate dashboards that are visible to both SRE and product teams, and integrate them into your deployment pipeline. Third, write and enforce an error budget policy before you deploy your SLOs. The policy must answer: What happens when the budget is 50% burned? 90%? 100%? Who has the authority to freeze releases? Without policy, SLOs are aspirational art.
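As an illustration of ‘operationally binding’, here is a hedged sketch that turns budget consumption into a pre-agreed action at the 50%, 90% and 100% thresholds above. The specific threshold-to-action mapping is an assumption about one reasonable policy, not a standard.

```python
def error_budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget consumed in the SLO window.

    The budget is (1 - slo_target); e.g. a 99.9% SLO tolerates 0.1% bad requests.
    """
    budget = 1.0 - slo_target
    burned = 1.0 - sli
    return burned / budget if budget > 0 else float("inf")


def policy_action(consumed: float) -> str:
    """Map budget consumption to a pre-agreed action (assumed policy)."""
    if consumed >= 1.0:
        return "freeze feature releases; reliability work only"
    if consumed >= 0.9:
        return "page the SLO owner; require SRE sign-off on risky deploys"
    if consumed >= 0.5:
        return "flag in product review; prioritize top reliability fixes"
    return "no action; ship normally"


# Example: 99.9% SLO, measured availability 99.92% over the window (~80% consumed)
print(policy_action(error_budget_consumed(0.9992, 0.999)))
```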
4. AI/AIOps — Rushing AI-powered Reliability Features Before the Observability Foundation Exists
Applying machine learning to noisy, poorly structured telemetry is one of the most expensive ways to stay exactly where you are.
AIOps and AI-assisted reliability tooling have been among the fastest-growing categories in the DevOps vendor landscape. The pitch is seductive: AI that correlates alerts, predicts incidents before they happen and automatically generates runbooks from historical postmortems. For an SRE team drowning in alert fatigue, it sounds like salvation.
The reality is that most organizations deploying these tools are not ready for them, and without a clean observability foundation the tools actively make things worse. AI-driven alert correlation applied to a system generating 4,000 noisy, overlapping alerts per hour will produce sophisticated-looking correlations of noisy, overlapping alerts. The AI doesn’t fix the underlying problem; it adds a layer of probabilistic interpretation on top of it, which engineers then must learn to distrust selectively. Alert fatigue becomes AI-assisted alert confusion.
The prerequisite for effective AI-assisted reliability is not a vendor contract; it is clean, well-structured, semantically rich telemetry. You need logs that are structured and queryable, metrics that are named consistently and tied to SLOs, distributed traces that accurately represent request flows and on-call workflows that are documented and standardized enough for a model to learn from. Without those foundations, the AI has nothing meaningful to work with.
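‘Structured and queryable’ can be as simple as emitting one JSON object per log event with consistent field names. A minimal sketch follows; the field names and the `checkout` service are illustrative assumptions, and the point is only that a query engine (or a model) can aggregate these fields without regex archaeology.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent, queryable fields."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": record.created,
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "route": getattr(record, "route", None),
            "status_code": getattr(record, "status_code", None),
            "latency_ms": getattr(record, "latency_ms", None),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Consistent fields on every event make SLO queries and AI correlation tractable.
log.info("request completed", extra={
    "service": "checkout", "route": "/pay",
    "status_code": 200, "latency_ms": 143.2, "trace_id": "abc123",
})
```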
There’s also a skills and trust problem. Teams that introduce AI-assisted incident response before engineers deeply understand their own systems create a dangerous dependency: Engineers learn to defer to the AI’s diagnosis rather than developing the pattern recognition and system intuition that makes SREs genuinely effective. When the AI is wrong, and at sufficient scale it inevitably will be, those engineers lack the foundational knowledge to catch the error.
How to Overcome This
- Audit your observability posture honestly before any AIOps procurement. Can your team answer: What is the p99 latency of our checkout service right now? Why did the error rate spike last Thursday at 2:43 p.m.? If the answer requires a tribal-knowledge expert rather than a dashboard, your telemetry is not AI-ready.
- Reduce alert volume first. A target of fewer than five actionable pages per on-call engineer per week is a reasonable benchmark (see the sketch after this list). AI cannot fix a fundamentally noisy alert landscape; it can only repackage it.
- Pilot AI reliability features on a single, well-understood service with clean telemetry. Measure whether engineers find the AI’s suggestions useful or whether they learn to ignore them. Let that signal drive broader rollout decisions.
- Preserve on-call engineers’ system intuition. AI tools should augment human judgment, not replace the process of building it. Keep postmortems human-led, even when AI assistance is available for correlation.
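To track the alert-volume benchmark above, here is a minimal sketch that computes actionable pages per on-call engineer per week from exported paging history. The record shape is an assumption; most paging tools can export equivalent fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    engineer: str
    fired_at: datetime
    actionable: bool  # required human action, or was it noise?


def pages_per_engineer_per_week(pages: list[Page], weeks: int) -> dict[str, float]:
    """Average actionable pages per engineer over the on-call period."""
    counts: dict[str, int] = defaultdict(int)
    for p in pages:
        if p.actionable:
            counts[p.engineer] += 1
    return {eng: n / weeks for eng, n in counts.items()}


def over_budget(rates: dict[str, float], threshold: float = 5.0) -> list[str]:
    """Engineers whose weekly actionable page load exceeds the benchmark."""
    return [eng for eng, rate in rates.items() if rate > threshold]
```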
5. Strategy — Scaling SRE Coverage Faster Than the Model Can Absorb, and Burning Out the Team
Embedding SREs into every product team before establishing shared standards turns one good SRE practice into thirty inconsistent ones.
SRE success in one team creates organizational pressure to replicate it everywhere, immediately. The pattern that follows is predictable and damaging: A handful of experienced SREs are spread thin across a dozen product teams. Each embedded SRE adapts practices to the local team culture. SLO definitions diverge. Runbook formats differ. Alert routing policies contradict each other. The shared infrastructure tooling that should underpin the entire function gets no investment because everyone is too busy fighting local fires.
The engineers caught in this expansion are typically the best ones: the people who proved the model in the first team, who can code, who communicate well, who understand the business. They get pulled into back-to-back on-call rotations across systems they barely know, with no time for the project work that made SRE valuable in the first place. Burnout follows. Attrition follows burnout. Then the organization, having lost its best reliability engineers, concludes that SRE ‘doesn’t scale’ rather than recognizing that they scaled it incorrectly.
This mistake is compounded by a failure to invest in the SRE team itself. SRE practitioners who spend 100% of their time on operational work have no time to learn new technologies, update their mental models or build the shared platforms that would multiply their leverage across teams. A team that can’t invest in itself compounds technical debt while standing still.
How to Overcome This
Grow deliberately. Before embedding SREs into a new team, define the readiness criteria: Does the team have instrumented services? A defined SLO? A documented on-call runbook? Use an ‘SRE engagement model’ that distinguishes between full embedding, consulting relationships and self-service; not every team needs an embedded SRE. Protect project time ruthlessly: The 50% project/50% operations split from the SRE book isn’t a suggestion. Build shared platforms (PagerDuty configurations, SLO tooling, structured logging libraries) that scale reliability standards without scaling headcount linearly. Measure on-call load per engineer; if it exceeds sustainable thresholds, treat it as an architectural problem requiring investment, not an HR problem requiring more hires.
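One way to keep the readiness criteria from staying abstract is to encode them as an explicit gate on each engagement request. The checklist and the mapping to engagement levels below are assumptions about one sensible policy, mirroring the questions in this section rather than prescribing a standard.

```python
from dataclasses import dataclass


@dataclass
class TeamReadiness:
    services_instrumented: bool   # metrics, logs and traces in place
    slo_defined: bool             # user-facing SLO with an owner
    runbook_documented: bool      # on-call runbook exists and is current
    error_budget_policy: bool     # agreed actions when the budget burns


def engagement_model(r: TeamReadiness) -> str:
    """Suggest an engagement level instead of defaulting to full embedding."""
    met = sum([r.services_instrumented, r.slo_defined,
               r.runbook_documented, r.error_budget_policy])
    if met == 4:
        return "embedded SRE"
    if met >= 2:
        return "consulting engagement: close the remaining gaps first"
    return "self-service: adopt shared platform tooling before SRE time is committed"
```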
The Common Thread
Look across all five mistakes and a pattern emerges: Organizations treat SRE as a technical solution to what is fundamentally an organizational and cultural challenge. The tooling, headcount and AI-powered platforms are the multipliers. What they multiply is either the clarity and discipline of a well-designed SRE function, or the confusion and toil of a poorly designed one. Getting the foundations right — incentives, measurement, talent and pacing — is not the boring work before the real work begins. It is the work. The teams that understand this are the ones running SRE five years from now.
from DevOps.com https://ift.tt/5GFnIXH