Zero-copy is the backbone of scalable Agentic AI
Introduction
In the 2010s, we laughed off data virtualization and canonical forms: attempts at abstracting data access and standardizing meaning across systems, long before Agentic AI made those things non-negotiable. ETL jobs and lakehouses gave us just enough leverage to patch over bad architecture. That worked for a while. It doesn't anymore.
Agentic AI needs more. It doesn’t just need fast models; it needs structured, consistent access to data across its planning and reasoning loops. You don’t get that from 14 conflicting pipelines feeding five stale warehouses. You get it from coherent, contract-bound data sources that don’t splinter on every handoff.
You get it from systems that don’t copy data unless they have to.
One Domain, Many Views
No one runs analytics directly on OLTP tables or joins inside a vector DB. Every system that consumes data has its own query pattern: batch files, streaming, SQL queries, vector search, each with its own latency and its own format. Shaping data is not the same as copying data.
For instance, Kafka, Redpanda, Spark, Flink, ksqlDB, etc. let you project event streams into shape-specific views. Apache Iceberg and Delta Lake give you storage patterns where raw logs, compacted state, and time-based partitions can all coexist.
These are projections tailored for specific consumption patterns: stream joins, RAG hydration, real-time dashboards; not raw copies. Each has lineage and context. None should become a new, conflicting source of truth.
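To make the "projection, not copy" idea concrete, here is a minimal sketch of projecting a Kafka stream into a dashboard-shaped Delta view with Spark Structured Streaming. The broker address, topic name, schema, and paths are assumptions, not a reference setup.

```python
# Minimal sketch: project one governed Kafka stream into a dashboard-shaped Delta view.
# Broker address, topic name, schema, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_date
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("orders-projection").getOrCreate()

# Hypothetical event schema; in practice the contract comes from a schema registry.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the stream once, from the source of truth.
orders = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
    .option("subscribe", "orders")                       # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
    .withColumn("event_date", to_date(col("event_time")))
)

# Project it into a shape-specific view: a Delta table partitioned for dashboard scans.
# Lineage stays with the stream; this view is not a new source of truth.
(
    orders.writeStream
    .format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/orders_by_day")  # assumed path
    .partitionBy("event_date")
    .start("/lake/views/orders_by_day")                                # assumed path
)
```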
Zero Copy with Third-Party Data
Modern external data sharing isn’t about FTP drops and batch syncs (well, sometimes, it still is...). It’s about clean APIs, Snowflake sharing, Delta Sharing, Conduktor Exchange, and even federated access across organizations. For instance, with Conduktor Exchange, partners can subscribe to real-time Kafka topics directly: no pipelines, no duplication, no extra processing for the data provider. This shifts the cost structure and makes third-party consumption scalable by design.
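As a hedged example of what "no pipelines, no duplication" looks like on the consumer side, here is a sketch using the Delta Sharing Python client; the profile file and share/schema/table names are assumptions.

```python
# Sketch of reading a shared table in place with the Delta Sharing Python client.
# The profile file and share/schema/table names are assumptions.
import delta_sharing

profile = "partner.share"                     # credentials file issued by the data provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())               # discover what the provider exposes

# Query one shared table directly into pandas; the provider keeps ownership and lineage,
# and nothing is re-pipelined or duplicated on the consumer side.
table_url = f"{profile}#sales_share.public.daily_orders"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```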
Agentic architectures need this by default. Agentic systems won't wait for nightly copies; they need real-time remote access (with embedded security and constraints). Zero-copy isn’t about owning every byte. It’s about knowing where truth lives and pulling the right data or view, at the right time, for the right use.
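Here is a sketch of what that looks like from the agent's side, assuming a kafka-python consumer, a read-only principal, and an illustrative topic name.

```python
# Sketch of an agent tool pulling fresh, governed events at reasoning time instead of
# waiting for a nightly copy. Topic, broker, and credentials are illustrative assumptions.
import json
from kafka import KafkaConsumer

def latest_inventory_events(max_records: int = 100) -> list:
    """Collect whatever arrives on the topic within a short window, under a read-only principal."""
    consumer = KafkaConsumer(
        "inventory-updates",                      # assumed topic
        bootstrap_servers="broker:9092",          # assumed address
        security_protocol="SASL_SSL",
        sasl_mechanism="PLAIN",
        sasl_plain_username="agent-readonly",     # least-privilege credentials
        sasl_plain_password="********",
        auto_offset_reset="latest",
        consumer_timeout_ms=2000,                 # never block the reasoning loop
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    events = [record.value for _, record in zip(range(max_records), consumer)]
    consumer.close()
    return events
```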
Memory Lives in Logs
Apache Iceberg, Delta Lake, and Hudi have changed how we store time. Even Kafka has started streaming directly into S3-compatible data lakes (see KIP-405 "Tiered Storage" and, very recently, KIP-1150 "Diskless Topics" from Aiven), enabling tools like Snowflake, Databricks, or Starburst to query event data without intermediary pipelines.
This doesn’t turn Kafka into a long-term storage system: it reinforces its role as a high-throughput, policy-aware, schema-governed data router that links producers, consumers, and lakes, making data more accessible.
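For illustration, here is a minimal sketch of querying lake-resident event history in place with PyIceberg; the catalog endpoints and the table identifier are assumptions.

```python
# Sketch: query lake-resident event history in place with PyIceberg, no pipeline in between.
# The catalog endpoints and the table identifier are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lake",
    **{
        "uri": "http://iceberg-rest:8181",       # assumed REST catalog
        "s3.endpoint": "http://minio:9000",      # assumed object store
    },
)

events = catalog.load_table("events.clickstream")  # assumed namespace.table

# The filter is pushed down: only the matching files are read, nothing is copied out.
recent = events.scan(
    row_filter="event_date >= '2024-01-01'",
    selected_fields=("user_id", "event_type", "event_date"),
).to_pandas()
print(recent.head())
```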
Need audit logs for a compliance boundary? Store changes as first-class records with encryption, metadata, and retention policies. Need to retrain your LLM on customer behavior from the past year? Replay from Kafka, or query from Iceberg, both are zero-copy patterns.
History isn't a reason to copy. It's a reason to log immutably and query flexibly.
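As an example of the "replay from Kafka" half of the pattern, here is a hedged sketch that rewinds a topic to a point in time with kafka-python instead of keeping a separate copy of history; the broker address and topic name are assumptions.

```python
# Sketch: rewind a Kafka topic to a point in time and replay it, rather than keeping a
# separate copy of history. Broker address and topic name are assumptions.
import time
from kafka import KafkaConsumer, TopicPartition

def replay_since(topic: str, days: int, bootstrap: str = "broker:9092"):
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap,
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,   # stop iterating once we catch up
    )
    start_ms = int((time.time() - days * 86400) * 1000)

    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)

    # Map the timestamp to an offset per partition, then seek there.
    offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})
    for tp, offset_ts in offsets.items():
        if offset_ts is not None:
            consumer.seek(tp, offset_ts.offset)

    for record in consumer:           # immutable history, read in place
        yield record

# e.g. feed last year's customer behaviour into a retraining job:
# for event in replay_since("customer-activity", days=365): ...
```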
Break Application Isolation
Most applications are still walled gardens: Oracle, SAP, Salesforce, your 2005 ERP, etc. They rarely emit events and tend to keep the interesting data locked in and hidden.
You don’t need to replatform an ERP to participate in a modern data architecture. You need to isolate what it outputs and plug it into the real-time backbone. Instead of building brittle ETL jobs to pull data out of those apps, you can often use the Strangler pattern: gradually wrap legacy systems with change data capture (CDC), then define data contracts and enforce them at the edge.
The goal isn’t to refactor your old system; it’s to externalize its outputs into a reusable, owned data layer that respects policy, freshness, and schema evolution, and to stop letting every team write their own copy logic.
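To illustrate "enforce contracts at the edge", here is a small sketch that gates Debezium-style CDC events against a hypothetical contract before they reach the backbone.

```python
# Sketch: enforce a data contract at the edge of a legacy system. CDC events are assumed
# to arrive in Debezium-style envelopes; the contract itself is hypothetical.
from jsonschema import validate, ValidationError

# Hypothetical contract for the ERP "orders" table, owned by the publishing team.
ORDER_CONTRACT = {
    "type": "object",
    "required": ["order_id", "status", "updated_at"],
    "properties": {
        "order_id": {"type": "string"},
        "status": {"enum": ["NEW", "PAID", "SHIPPED", "CANCELLED"]},
        "updated_at": {"type": "string"},
    },
    "additionalProperties": False,
}

def gate(change_event: dict):
    """Admit only contract-compliant rows to the backbone; quarantine the rest."""
    row = change_event.get("after")   # Debezium puts the new row state under "after"
    if row is None:                   # deletes / tombstones pass through untouched
        return change_event
    try:
        validate(instance=row, schema=ORDER_CONTRACT)
        return change_event
    except ValidationError as err:
        # In a real setup, route to a dead-letter topic instead of printing.
        print(f"contract violation: {err.message}")
        return None
```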
Solve for Latency
Physics is real. If you're joining two massive tables across regions, yes, naive virtualization will fall over. That’s not a zero-copy failure. That’s a lack of locality planning.
Do what we already do in infra: plan for locality, cache hot paths, and replicate selectively where the latency budget demands it.
You don't need to build teleportation to make zero-copy work. You just need to treat data as a product, with SLAs, latency budgets, and cost-awareness.
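A toy sketch of that locality planning: pick the closest source that still satisfies the latency budget and freshness the consumer declared. All replica names and numbers are purely illustrative.

```python
# Toy sketch of locality planning: choose the closest source that still satisfies the
# latency budget and freshness the consumer declared. Names and numbers are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    name: str
    region: str
    round_trip_ms: int   # measured network latency from the consumer
    staleness_s: int     # replication lag behind the source of truth

REPLICAS = [
    Replica("orders-eu", "eu-west-1", round_trip_ms=4, staleness_s=30),
    Replica("orders-us", "us-east-1", round_trip_ms=95, staleness_s=0),
]

def pick_source(latency_budget_ms: int, max_staleness_s: int) -> Replica:
    """Cheapest replica that still meets the data product's SLA."""
    candidates = [
        r for r in REPLICAS
        if r.round_trip_ms <= latency_budget_ms and r.staleness_s <= max_staleness_s
    ]
    if not candidates:
        raise RuntimeError("no source meets the SLA; renegotiate the contract, don't silently copy")
    return min(candidates, key=lambda r: r.round_trip_ms)

# A dashboard tolerates 30s of staleness; a payment check does not.
print(pick_source(latency_budget_ms=50, max_staleness_s=60).name)    # -> orders-eu
print(pick_source(latency_budget_ms=200, max_staleness_s=0).name)    # -> orders-us
```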
Build Trust, Not Copies
Most teams duplicate data because they don’t trust upstream availability or correctness. So they copy it, rename columns, drop nulls, etc. The fix is ownership, in the spirit of Data Mesh: treat each dataset as a product with an accountable owner, a published contract, and explicit SLAs, so consumers can rely on the source instead of hoarding their own copies.
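One hedged way to make that ownership explicit is to publish the contract itself as a first-class, versioned artifact; the field names and values below are illustrative.

```python
# Sketch: publish the contract as a first-class, owned artifact so consumers can
# trust the source instead of copying it. Field names and values are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataProductContract:
    name: str
    owner: str                        # an accountable team, not a shared inbox
    schema_version: str               # consumers pin to this; breaking changes bump it
    freshness_slo_seconds: int        # how stale the data is allowed to get
    availability_slo: float           # e.g. 0.999
    pii_fields: tuple = field(default_factory=tuple)

ORDERS_PRODUCT = DataProductContract(
    name="orders.events.v2",
    owner="checkout-team",
    schema_version="2.3.0",
    freshness_slo_seconds=60,
    availability_slo=0.999,
    pii_fields=("customer_email",),
)
```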
Zero-copy means fewer "WTFs per query"
You can't scale agentic AI with batch ETLs. You can't fine-tune a small model on a windowed dataset when every team slices time differently. You can't secure your pipeline if every copy bypasses your policy engine. Zero-copy is not "copying nothing"; it's "exposing with control": governed views, enforced contracts, and access paths your policy engine can actually see.
It’s clearly not for everything; there are times to break the rule. But if you don't zero-copy anything, you’re just building debt, not just in storage, but in complexity (meaning cost and time-to-market).