
Iceberg Won the Format War — Now Comes the Hard Part

Apache Iceberg has effectively won the open table format conversation. AWS, Google Cloud, Microsoft, Snowflake, Databricks — every major platform has thrown its weight behind it. If you work in data engineering or platform operations, the question is no longer whether Iceberg is the right foundation. It’s what it actually takes to run it day to day.

That second question doesn’t get nearly enough airtime. And it’s the one that determines whether your Iceberg adoption goes well or becomes a slow-motion infrastructure project that nobody budgeted for.

The Gap Nobody Talks About

Here’s what Iceberg gives you: a table format with schema evolution, time travel, partition evolution, and engine independence. Here’s what Iceberg does not give you: a way to get data into those tables, a way to model and transform it once it’s there, a way to coordinate when things run, or a way to keep table health in check as data piles up.

Put differently, Iceberg defines how tables behave, not how to operate the pipelines around them.

Most teams discover this the hard way. They pick Iceberg for its openness and flexibility, then spend months wiring together ingestion tools, dbt jobs, schedulers, and homegrown maintenance scripts. The individual pieces work. The system they form is fragile, because reliability lives in the gaps between tools rather than in any single place you can point to and say, “This is responsible.”

Why This Is a DevOps Problem

If you’ve been in DevOps for any length of time, this should ring a bell. It’s the same mess software delivery was in before CI/CD grew up: too many disconnected steps, no single system of record, and failures that only show up at the seams between tools. Data pipelines have just been slower to hit this wall.

Here’s a scenario I’ve watched play out multiple times. A schema change is applied to a production database. The ingestion tool picks it up and starts writing new fields into Iceberg. A dbt Core job runs later on its hourly schedule. It either blows up because its assumptions about the schema are now wrong, or — worse — it succeeds while quietly producing partial results that nobody catches until a dashboard goes sideways downstream. Meanwhile, table maintenance (compaction, snapshot expiry, orphan file cleanup) is running on its own cadence, completely unaware of what ingestion or transformation just did.

When you debug this, you’re context-switching across four systems, none of which are broken on their own. The problem is coordination. And the coordination layer? That’s you, at 3 a.m., reading runbooks in a Slack thread.

DevOps figured out years ago that humans make terrible glue between systems. That’s what drove the shift from manual deployments to automated pipelines with feedback loops, observability, and rollback. Data engineering is now at the same inflection point, and Iceberg adoption is bringing it on faster.

Schedules Are Holding Your Iceberg Stack Together with Duct Tape

The deeper architectural issue is how most Iceberg stacks coordinate work. Almost everyone uses time-based schedules: ingest every five minutes, run dbt hourly, and compact nightly. Fine when everything is batch and nothing changes. Not fine when ingestion is continuous, schema changes are routine, and table operations have a real impact on query performance.

Iceberg’s own snapshot mechanism gives us a better primitive. Snapshots capture the exact state of a table at a point in time. A system built around table state can answer questions that schedules can’t: this snapshot invalidated these downstream models; these tables now need compaction; downstream consumers can read a consistent version once these steps finish.
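To make that concrete, here is a minimal sketch of state-based triggering in plain Python. Everything in it is an assumption for illustration: the DOWNSTREAM map, the table and model names, and the small integer snapshot IDs are invented, and a real system would read the current snapshot ID of each table from its Iceberg catalog instead of from a hard-coded dict.

```python
# Hypothetical dependency map: source table -> downstream models that read it.
DOWNSTREAM = {
    "raw.orders": ["stg_orders", "fct_revenue"],
    "raw.customers": ["stg_customers"],
}


def stale_models(last_seen: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return downstream models whose source table has a new snapshot ID."""
    stale: list[str] = []
    for table, snapshot_id in current.items():
        if last_seen.get(table) != snapshot_id:
            for model in DOWNSTREAM.get(table, []):
                if model not in stale:
                    stale.append(model)
    return stale


# Snapshot IDs recorded after the previous run, and what the catalog reports now
# (both made up for this example).
last_seen = {"raw.orders": 101, "raw.customers": 55}
current = {"raw.orders": 102, "raw.customers": 55}

print(stale_models(last_seen, current))  # ['stg_orders', 'fct_revenue']
```

Only the models fed by the table that actually changed are flagged. Nothing runs because a clock said so.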

If you’ve worked with event-driven architectures in application development, this should feel natural — the state change itself triggers the next action instead of a cron job polling for it. For data platforms, this means the scheduler becomes an implementation detail rather than the load-bearing wall. It also means lineage gets real: not just “this source feeds this table” but “this version of the source produced this version of the model,” traceable to the snapshot.
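Snapshot-level lineage can be as simple as recording, for every build, which snapshot it wrote and which snapshots it read. The sketch below assumes a hypothetical LineageRecord structure and invented snapshot IDs. It is one possible shape for such a log, not how any particular tool stores lineage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    model: str                           # model that was built
    output_snapshot: int                 # snapshot the build committed
    inputs: tuple[tuple[str, int], ...]  # (source table, snapshot read) pairs


def trace(records: list[LineageRecord], model: str, output_snapshot: int) -> dict[str, int]:
    """Look up which source snapshots produced a given model snapshot."""
    for record in records:
        if record.model == model and record.output_snapshot == output_snapshot:
            return dict(record.inputs)
    raise KeyError(f"no lineage recorded for {model}@{output_snapshot}")


# Hypothetical build log: two builds of the same model from different source versions.
log = [
    LineageRecord("fct_revenue", 9001, (("raw.orders", 101),)),
    LineageRecord("fct_revenue", 9002, (("raw.orders", 102),)),
]

print(trace(log, "fct_revenue", 9002))  # {'raw.orders': 102}
```

With a log like this, "which source data is behind this dashboard number" becomes a lookup rather than an investigation.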

You Traded Vendor Lock-In for Internal Platform Lock-In

There’s an irony that keeps coming up in my conversations with data platform teams. They chose Iceberg to avoid vendor lock-in. Then they spent six months building a bespoke pipeline platform — custom orchestration, monitoring scripts for table health, runbooks that span four tools, tribal knowledge about which manual steps to run when things break — and now they’re locked into that instead. Different lock-in, same result: you can’t easily change course because too much depends on the specific way you wired it all together.

This compounds fast. A handful of Iceberg tables can be managed with scripts and good intentions. A hundred tables with interdependencies need a system. Most teams hit this realization somewhere around table thirty, when operating pipelines starts eating more time than building them.

Anyone who’s been around long enough remembers when companies managed deployments with shell scripts and SSH. It worked until it didn’t. The shift to declarative infrastructure and managed delivery pipelines didn’t happen because the scripts stopped functioning. It happened because the cost of scaling that approach grew faster than the team. We’re at the same crossroads with data pipelines on Iceberg.

What to Look for in an Iceberg Pipeline Layer

If you’re evaluating your Iceberg strategy or about to start one, here’s how I’d think about the pipeline layer you’ll need.

Ingestion and transformation should live in the same system. When they’re separate, schema evolution and data quality become coordination headaches instead of pipeline features. Data quality tests shouldn’t be downstream assertions that catch problems after you’ve already burned the compute. They should be contracts at the boundary where data enters.
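As a rough illustration of a contract at the boundary, the sketch below checks each incoming record against an expected set of columns and types before anything is written. The ORDERS_CONTRACT mapping and the sample records are hypothetical, and a production contract would likely cover nullability, ranges, and nested types as well.

```python
# Hypothetical contract: column name -> expected Python type.
ORDERS_CONTRACT = {"order_id": int, "amount": float, "currency": str}


def violations(record: dict, contract: dict[str, type]) -> list[str]:
    """List every way a record breaks the contract; an empty list means it passes."""
    problems = []
    for column, expected in contract.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    for column in record:
        if column not in contract:
            problems.append(f"unexpected column: {column}")
    return problems


good = {"order_id": 1, "amount": 19.99, "currency": "USD"}
drifted = {"order_id": 2, "amount": "19.99", "region": "EU"}  # upstream schema changed

print(violations(good, ORDERS_CONTRACT))     # []
print(violations(drifted, ORDERS_CONTRACT))
# ['amount: expected float, got str', 'missing column: currency', 'unexpected column: region']
```

The drifted record is rejected, or routed for review, before any transformation compute is spent on it.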

Table operations (compaction, snapshot expiry, metadata cleanup) need to be treated as first-class concerns, not cron jobs you set up once and forget about. They directly affect query performance and storage costs, and they need awareness of what the pipeline is doing. Running compaction during a large ingestion batch is a great way to create problems that are hard to diagnose.
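One way to picture pipeline-aware maintenance is a gate that decides whether to compact based on table state rather than on the clock. In this sketch the TableState fields, the threshold of 200 small files, and the table names are all assumptions chosen for illustration. A real system would derive them from table metadata and from the ingestion layer's own status.

```python
from dataclasses import dataclass


@dataclass
class TableState:
    name: str
    small_file_count: int    # data files below the target size
    ingestion_running: bool  # True while a write batch is in flight


def should_compact(state: TableState, small_file_threshold: int = 200) -> bool:
    """Compact only when the table needs it and no ingestion batch is writing."""
    if state.ingestion_running:
        return False
    return state.small_file_count >= small_file_threshold


# Hypothetical snapshot of three tables at decision time.
tables = [
    TableState("raw.orders", small_file_count=850, ingestion_running=True),
    TableState("raw.customers", small_file_count=420, ingestion_running=False),
    TableState("raw.products", small_file_count=12, ingestion_running=False),
]

print([t.name for t in tables if should_compact(t)])  # ['raw.customers']
```

Here raw.orders has the most small files but is skipped because a batch is still landing, and raw.products is skipped because it does not need the work yet.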

The system should run inside your VPC. If you picked Iceberg for data sovereignty and security, sending your data through someone else’s infrastructure undermines the whole point. This isn’t hypothetical — in financial services and healthcare, regulations and company policies often mandate that data never leaves the VPC.

And the pipeline layer should let you build once and serve many consumers from the same Iceberg tables: analytics, data science, AI workloads, and data sharing. Iceberg makes this architecturally possible. The pipeline layer is what makes it operationally real.

Where These Conversations Are Happening

What I like about the Iceberg ecosystem is that these operational problems are being discussed in the open. The format is open source, governance is through the Apache Software Foundation, and the community skews heavily toward practitioners who are running this stuff for real.

If any of this resonates, Iceberg Summit 2026 is worth your time. It’s April 8-9 in San Francisco (Marriott Marquis), run under the Apache Software Foundation, and it’s the one event where you’ll find core maintainers, production users, and platform architects all in the same room. Last year’s edition had serious depth — not vendor keynotes, but real case studies and technical deep dives. I expect this year to go even deeper on the operational side, which is where the hard problems are right now.

The Hard Part Starts Now

Iceberg has crossed the adoption threshold. The format debate is over. What hasn’t been settled is how teams will actually run it without drowning in operational overhead.

If you’ve spent time in DevOps, you know the tools are only as good as the system they form. A great CI server doesn’t help much if your deployment process is held together with hope and shell scripts. The same is true for data pipelines on Iceberg. The question for 2026 isn’t whether to build on Iceberg. It’s whether your pipeline architecture can keep up.



from DevOps.com https://ift.tt/z6G9OvJ
