Configuration Drift in a Multi-Cloud World

Configuration drift is the gap between the infrastructure state declared in code and the state actually running in your environment. It occurs when resources are changed outside of your infrastructure as code (IaC) workflow, so the live system no longer matches its definition.

In a single cloud, drift is usually straightforward to find and correct. Across multiple providers, it is harder to detect and more costly to leave unaddressed.

Why Does Multi-Cloud Make Drift Worse?

Each provider has its own API, resource model, console, and defaults. A change made directly in one cloud does not resemble the equivalent change in another, so the signals used to detect drift differ in each environment. There is often no single source of truth that covers all providers, and tagging conventions and naming standards vary between accounts.

As a result, the number of places where an undeclared change can go unnoticed increases with each cloud you add. The practical effect is that the documented state of your infrastructure and its running state diverge, and the difference is often discovered only when a system fails.

How Drift Occurs

Most drift is unintentional and results from routine operations rather than misuse. Common causes include:

Manual fixes applied through a console during an incident and not reconciled back into code
Proof-of-concept resources created by hand and later forgotten
Failed or partial applies that leave resources in an inconsistent state
Third-party tools, operators, and autoscalers modifying resources they manage
IaC that is written but not applied, or applied from an outdated branch

These are normal byproducts of operating production systems.

The Cost of Drift

Drift has consequences in four areas.

First, security and compliance: an unreviewed change bypasses policy controls, and the running configuration can no longer be verified against its definition.

Second, reproducibility: environments that began identical no longer match, so a change validated in staging may fail in production.

Third, destructive applies: when IaC is eventually run against a drifted state, it may propose to delete or overwrite resources that are in use.

Fourth, cost: orphaned resources continue to incur charges across accounts.

Detecting and Remediating Drift

Start by codifying everything. Strictly speaking, drift is divergence from a declared state, so a resource that was never defined in IaC is an unmanaged resource rather than drift in the narrow sense. The practical effect is the same, though: anything outside IaC is invisible to plan- and refresh-based detection, so it can change without being noticed. Bringing it under code is what makes it observable in the first place.
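As a rough illustration of why unmanaged resources need their own check, the sketch below compares a live inventory against the set of resources tracked in state and reports whatever no state file knows about. The Resource type, the provider labels, and the sample identifiers are hypothetical; in practice both sets would have to be built from provider inventory APIs and from your state files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    provider: str      # hypothetical label such as "aws" or "gcp"
    resource_id: str   # provider-specific identifier


def find_unmanaged(live: set, managed: set) -> list:
    """Return resources that exist live but appear in no IaC state."""
    return sorted(live - managed, key=lambda r: (r.provider, r.resource_id))


# Illustrative data only: not taken from any real account.
live = {
    Resource("aws", "i-0abc"),
    Resource("aws", "i-0def"),
    Resource("gcp", "vm-poc-1"),
}
managed = {Resource("aws", "i-0abc")}

for r in find_unmanaged(live, managed):
    print(f"unmanaged: {r.provider}/{r.resource_id}")
```

A plan would report nothing for the two resources printed here, because neither appears in state. That is why an inventory comparison of this kind complements plan-based detection rather than replacing it.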

From there, find divergence early with regular plan or refresh cycles that compare declared state against live state across all providers. This turns drift into a small, frequent signal instead of a quarterly surprise.
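One way to turn a plan into a machine-readable drift signal, assuming Terraform is the IaC tool, is the -detailed-exitcode flag of terraform plan: exit code 0 means no changes, 2 means the plan contains changes, and any other code indicates an error. The sketch below wraps that convention. The function names and the assumption that each workspace directory is already initialized are illustrative, not part of any particular tool's workflow.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_exit_code(code: int) -> PlanResult:
    """Map `terraform plan -detailed-exitcode` exit codes to a result.

    0 = no changes, 2 = changes present, anything else = error.
    """
    if code == 0:
        return PlanResult.IN_SYNC
    if code == 2:
        return PlanResult.DRIFTED
    return PlanResult.ERROR


def check_workspace(path: str) -> PlanResult:
    """Run a plan in `path`; assumes `terraform init` has already run there."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=path,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(proc.returncode)
```

Note that a plain plan also reports changes when code has been merged but not yet applied, so a DRIFTED result covers both out-of-band changes and unapplied code. A scheduler could call check_workspace for every workspace directory and alert on anything other than IN_SYNC.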

Detection only helps if it leads to a considered response. Reverting automatically is not always right, because sometimes the manual change is correct and the code is out of date. Review each case before deciding whether to update the code or revert the resource. Policy as code that runs before apply prevents many out-of-band changes in the first place.
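To support that review, a plan can be summarized before anyone decides what to do with it. The sketch below assumes the plan has been exported with terraform show -json, whose output lists each planned change under resource_changes with an address and a list of actions. The triage function and its three buckets are hypothetical choices for illustration, and the sample resource addresses are made up.

```python
import json


def triage(plan_json: str) -> dict:
    """Group planned resource changes by how risky they are to apply unreviewed."""
    plan = json.loads(plan_json)
    report = {"destructive": [], "update": [], "create": []}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions:
            # Covers plain deletes and replacements (delete + create).
            report["destructive"].append(rc["address"])
        elif "update" in actions:
            report["update"].append(rc["address"])
        elif "create" in actions:
            report["create"].append(rc["address"])
        # "no-op" and "read" entries need no review.
    return report


# Illustrative plan fragment; the addresses are hypothetical.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.hotfix", "change": {"actions": ["delete", "create"]}},
        {"address": "google_storage_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
})

print(triage(sample))
```

Anything in the destructive bucket is a candidate for updating the code rather than reverting the resource, since applying the plan as-is would delete or replace something that may be in use.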

A separate check per provider can leave gaps, so coverage matters as much as method. Whether you run scheduled plans per workspace through CI or consolidate detection in a single control plane, the goal is consistent, scheduled coverage rather than ad hoc checks.
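Coverage itself can be checked mechanically. The following sketch assumes you record when each workspace was last checked for drift, and it flags any workspace that was never checked or whose last check is older than a chosen window. The workspace names, timestamps, and the one-day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_workspaces(last_checked: dict, now: datetime, max_age: timedelta) -> list:
    """Return workspaces never checked or not checked within `max_age`."""
    stale = []
    for name, checked_at in last_checked.items():
        if checked_at is None or now - checked_at > max_age:
            stale.append(name)
    return sorted(stale)


# Illustrative records only.
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
last_checked = {
    "aws-prod": datetime(2024, 1, 7, 23, 0, tzinfo=timezone.utc),
    "azure-prod": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
    "gcp-staging": None,  # never checked
}

print(stale_workspaces(last_checked, now, timedelta(days=1)))
```

Treating a missing record as stale is deliberate: a workspace that has never been checked is a coverage gap, not a clean result.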

Two tools illustrate different approaches to where detection lives.

Atlantis is a free, open-source tool that automates Terraform plan and apply through pull requests, with state locking and a full Git audit trail, and it is straightforward to self-host. Because it acts only on pull requests, it has no native drift detection: nothing in it notices when live infrastructure diverges from state between deployments. Teams that use it add detection separately, commonly as a scheduled CI job that runs terraform plan against each workspace.

Spacelift is an orchestration platform with built-in drift detection for Terraform, OpenTofu, Pulumi, and CloudFormation. It executes proposed runs against a stack’s stored state on a schedule you define, flags drifted resources, and can optionally reconcile them through a tracked run that follows your normal policies and approvals. One operational caveat: scheduled drift detection runs on private workers rather than the public worker pool.

Drift as an Ongoing Practice

Drift cannot be eliminated. Manual changes to production are sometimes necessary, and automation occasionally fails. Teams that manage drift effectively treat the declared state as authoritative, detect divergence regularly, and reconcile it promptly. Handled as a continuous operational practice, a manual change becomes a recorded event to be reconciled rather than the cause of a later outage.

from DevOps.com https://ift.tt/gmhMdwS
