
Iceberg Won the Format War — Now Comes the Hard Part

Apache Iceberg has effectively won the open table format conversation. AWS, Google Cloud, Microsoft, Snowflake, Databricks — every major platform has thrown its weight behind it. If you work in data engineering or platform operations, the question is no longer whether Iceberg is the right foundation. It’s what it actually takes to run it day to day.

That second question doesn’t get nearly enough airtime. And it’s the one that determines whether your Iceberg adoption goes well or becomes a slow-motion infrastructure project that nobody budgeted for.

The Gap Nobody Talks About

Here’s what Iceberg gives you: a table format with schema evolution, time travel, partition evolution, and engine independence. Here’s what Iceberg does not give you: a way to get data into those tables, a way to model and transform it once it’s there, a way to coordinate when things run, or a way to keep table health in check as data piles up.

Put differently, Iceberg defines how tables behave, not how to operate the pipelines around them.

Most teams discover this the hard way. They pick Iceberg for its openness and flexibility, then spend months wiring together ingestion tools, dbt jobs, schedulers, and homegrown maintenance scripts. The individual pieces work. The thing they form is fragile because reliability lives in the gaps between tools rather than in any single place you can point to and say, “This is responsible.”

Why This Is a DevOps Problem

If you’ve been in DevOps for any length of time, this should ring a bell. It’s the same mess software delivery was in before CI/CD grew up: too many disconnected steps, no single system of record, and failures that only show up at the seams between tools. Data pipelines have just been slower to hit this wall.

Here’s a scenario I’ve watched play out multiple times. A schema change is applied to a production database. The ingestion tool picks it up and starts writing new fields into Iceberg. A dbt Core job runs later on its hourly schedule. It either blows up because its assumptions about the schema are now wrong, or — worse — it succeeds while quietly producing partial results that nobody catches until a dashboard goes sideways downstream. Meanwhile, table maintenance (compaction, snapshot expiry, orphan file cleanup) is running on its own cadence, completely unaware of what ingestion or transformation just did.

When you debug this, you’re context-switching across four systems, none of which are broken on their own. The problem is coordination. And the coordination layer? That’s you, at 3 a.m., reading runbooks in a Slack thread.

DevOps figured out years ago that humans make terrible glue between systems. That’s what drove the shift from manual deployments to automated pipelines with feedback loops, observability, and rollback. Data engineering is now at the same inflection point, and Iceberg adoption is accelerating it.

Schedules Are Holding Your Iceberg Stack Together with Duct Tape

The deeper architectural issue is how most Iceberg stacks coordinate work. Almost everyone uses time-based schedules: ingest every five minutes, run dbt hourly, and compact nightly. Fine when everything is batch and nothing changes. Not fine when ingestion is continuous, schema changes are routine, and table operations have a real impact on query performance.

Iceberg’s own snapshot mechanism gives us a better primitive. Snapshots capture the exact state of a table at a point in time. A system built around table state can answer questions that schedules can’t: this snapshot invalidated these downstream models; these tables now need compaction; downstream consumers can read a consistent version once these steps finish.

If you’ve worked with event-driven architectures in application development, this should feel natural — the state change itself triggers the next action instead of a cron job polling for it. For data platforms, this means the scheduler becomes an implementation detail rather than the load-bearing wall. It also means lineage gets real: not just “this source feeds this table” but “this version of the source produced this version of the model,” traceable to the snapshot.
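To make the idea concrete, here is a minimal sketch of state-driven coordination, assuming the pipeline layer records which source snapshot each model was last built from. The names (Model, stale_models, mark_built) and the table names and snapshot IDs are hypothetical, and nothing here calls a real Iceberg API. It only shows the bookkeeping that lets a snapshot change, rather than a clock, decide what runs next.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """A downstream model plus the source snapshot IDs it was last built from."""
    name: str
    sources: list[str]
    built_from: dict[str, int] = field(default_factory=dict)


def stale_models(models, current_snapshots):
    """Return names of models whose recorded source snapshots differ from the current ones."""
    stale = []
    for model in models:
        for source in model.sources:
            if model.built_from.get(source) != current_snapshots.get(source):
                stale.append(model.name)
                break
    return stale


def mark_built(model, current_snapshots):
    """Record lineage: which snapshot of each source produced this build."""
    model.built_from = {s: current_snapshots[s] for s in model.sources}


# Hypothetical table names and snapshot IDs, for illustration only.
current = {"raw.orders": 101, "raw.customers": 55}
models = [
    Model("orders_daily", ["raw.orders"], {"raw.orders": 101}),
    Model("customer_ltv", ["raw.orders", "raw.customers"],
          {"raw.orders": 100, "raw.customers": 55}),
]
to_rebuild = stale_models(models, current)
```

In this sketch only customer_ltv is flagged, because it was built from an older snapshot of raw.orders. The built_from mapping doubles as the version-level lineage described above: after a rebuild, mark_built records exactly which snapshot of each source produced the model.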

You Traded Vendor Lock-In for Internal Platform Lock-In

There’s an irony that keeps coming up in my conversations with data platform teams. They chose Iceberg to avoid vendor lock-in. Then they spent six months building a bespoke pipeline platform — custom orchestration, monitoring scripts for table health, runbooks that span four tools, tribal knowledge about which manual steps to run when things break — and now they’re locked into that instead. Different lock-in, same result: you can’t easily change course because too much depends on the specific way you wired it all together.

This compounds fast. A handful of Iceberg tables can be managed with scripts and good intentions. A hundred tables with interdependencies need a system. In my experience, most teams hit this realization somewhere around table thirty, when operating pipelines starts eating more time than building them.

Anyone who’s been around long enough remembers when companies managed deployments with shell scripts and SSH. It worked until it didn’t. The shift to declarative infrastructure and managed delivery pipelines didn’t happen because the scripts stopped functioning. It happened because the cost of scaling that approach grew faster than the team. We’re at the same crossroads with data pipelines on Iceberg.

What to Look for in an Iceberg Pipeline Layer

If you’re evaluating your Iceberg strategy or about to start one, here’s how I’d think about the pipeline layer you’ll need.

Ingestion and transformation should live in the same system. When they’re separate, schema evolution and data quality become coordination headaches instead of pipeline features. Data quality tests shouldn’t be downstream assertions that catch problems after you’ve already burned the compute. They should be contracts at the boundary where data enters.
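A contract at the boundary can be sketched in a few lines. The example below is an illustration under stated assumptions, not any particular tool's API: the contract format (column name mapped to expected type and nullability), the ORDERS_CONTRACT columns, and the function names are all hypothetical. The point is that records are checked before they are written, so violations are quarantined instead of being discovered by a downstream test.

```python
# Hypothetical contract: column name -> (expected Python type, nullable).
ORDERS_CONTRACT = {
    "order_id": (int, False),
    "customer_id": (int, False),
    "amount": (float, False),
    "coupon_code": (str, True),
}


def contract_violations(record, contract):
    """Return a list of human-readable violations for one record (empty means valid)."""
    problems = []
    for column, (expected_type, nullable) in contract.items():
        if column not in record or record[column] is None:
            if not nullable:
                problems.append(f"{column}: missing or null")
            continue
        if not isinstance(record[column], expected_type):
            problems.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    # Columns the contract does not know about signal an unannounced schema change.
    for column in record:
        if column not in contract:
            problems.append(f"{column}: not in contract")
    return problems


def split_batch(batch, contract):
    """Separate a batch into records to write and records to quarantine."""
    accepted, rejected = [], []
    for record in batch:
        violations = contract_violations(record, contract)
        if violations:
            rejected.append((record, violations))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejecting unknown columns outright is a deliberately strict choice for this sketch. A real pipeline might instead route them into a schema evolution step, which is easier to do when ingestion and transformation share one system.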

Table operations (compaction, snapshot expiry, metadata cleanup) need to be treated as first-class concerns, not cron jobs you set up once and forget about. They directly affect query performance and storage costs, and they need awareness of what the pipeline is doing. Running compaction during a large ingestion batch is a great way to create problems that are hard to diagnose.
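Here is a rough sketch of what pipeline-aware maintenance could look like, as opposed to a fixed nightly job. TableState, plan_maintenance, and the thresholds are hypothetical placeholders. How you would actually collect file and snapshot counts, or trigger the operations, depends on your engine and catalog and is not shown.

```python
from dataclasses import dataclass


@dataclass
class TableState:
    """Hypothetical health summary a pipeline layer might track per table."""
    name: str
    small_file_count: int
    snapshot_count: int
    ingestion_in_progress: bool


def plan_maintenance(state, max_small_files=200, max_snapshots=100):
    """Return the maintenance actions to run now, given what the pipeline is doing.

    Thresholds are illustrative defaults, not recommendations.
    """
    actions = []
    # Defer compaction while a batch is still landing in the table.
    if state.small_file_count > max_small_files and not state.ingestion_in_progress:
        actions.append("compact")
    if state.snapshot_count > max_snapshots:
        actions.append("expire_snapshots")
    return actions
```

The decision is driven by table state and pipeline activity rather than by the time of day: a table that needs compaction gets it as soon as ingestion is idle, and a healthy table is left alone.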

The system should run inside your VPC. If you picked Iceberg for data sovereignty and security, sending your data through someone else’s infrastructure undermines the whole point. This isn’t hypothetical — in financial services and healthcare, regulations and company policies often mandate that data never leaves the VPC.

And the pipeline layer should let you build once and serve many consumers from the same Iceberg tables: analytics, data science, AI workloads, and data sharing. Iceberg makes this architecturally possible. The pipeline layer is what makes it operationally real.

Where These Conversations Are Happening

What I like about the Iceberg ecosystem is that these operational problems are being discussed in the open. The format is open source, governance is through the Apache Software Foundation, and the community skews heavily toward practitioners who are running this stuff for real.

If any of this resonates, Iceberg Summit 2026 is worth your time. It’s April 8-9 in San Francisco (Marriott Marquis), run under the Apache Software Foundation, and it’s the one event where you’ll find core maintainers, production users, and platform architects all in the same room. Last year’s edition had serious depth — not vendor keynotes, but real case studies and technical deep dives. I expect this year to go even deeper on the operational side, which is where the hard problems are right now.

The Hard Part Starts Now

Iceberg has crossed the adoption threshold. The format debate is over. What hasn’t been settled is how teams will actually run it without drowning in operational overhead.

If you’ve spent time in DevOps, you know the tools are only as good as the system they form. A great CI server doesn’t help much if your deployment process is held together with hope and shell scripts. The same is true for data pipelines on Iceberg. The question for 2026 isn’t whether to build on Iceberg. It’s whether your pipeline architecture can keep up.



from DevOps.com https://ift.tt/z6G9OvJ
