Zero-copy is the backbone of scalable Agentic AI

Summary

  • One Domain, Many Views
  • Zero Copy with Third-Party Data
  • Memory Lives in Logs
  • Break App Isolation
  • Solve for Latency
  • Build Trust, Not Copies
  • Zero-copy means fewer "WTFs per query"


Introduction

In the 2010s, we laughed off data virtualization and canonical forms: attempts at abstracting data access and standardizing meaning across systems, long before Agentic AI made those things non-negotiable. ETL jobs and lakehouses gave us just enough leverage to patch over bad architecture. That worked for a while. It doesn't anymore.

Agentic AI needs more. It doesn’t just need fast models; it needs structured, consistent access to data across its planning and reasoning loops. You don’t get that from 14 conflicting pipelines feeding five stale warehouses. You get it from coherent, contract-bound data sources that don’t splinter on every handoff.

You get it from systems that don’t copy data unless they have to.


One Domain, Many Views

No one runs analytics directly on OLTP tables or joins inside a vector DB. Every system that consumes data has its own query pattern (batch files, streaming, SQL queries, vector search), its own latency, and its own format. Shaping data for those patterns is not the same as copying it.

For instance, Kafka, Redpanda, Spark, Flink, ksqlDB, and similar tools let you project event streams into shape-specific views. Apache Iceberg and Delta Lake give you storage patterns where raw logs, compacted state, and time-based partitions can all coexist.

  • Want real-time dashboards? Stream events into Apache Druid or Apache Pinot (StarTree) with inline pre-aggregations to serve low-latency metrics.
  • Need memory for your RAG? Sync embeddings into a vector store like Weaviate, Pinecone, or Milvus (created by Zilliz), while streaming metadata via Kafka or Redpanda to maintain live references back to the original documents (traceability).
  • Building a vector index for semantic search? Use OpenAI or Cohere to derive embeddings from canonical customer records in your lakehouse (e.g. Iceberg, Delta Lake).
  • Powering an agent planning loop? Expose task states via real-time views built in Materialize or ksqlDB instead of materializing full snapshots.
  • Monitoring application state in real time? Aggregate and window data streams with Apache Flink, RisingWave, or Confluent ksqlDB, avoiding the need to persist into yet another warehouse.

These are projections tailored for specific consumption patterns: stream joins, RAG hydration, real-time dashboards; not raw copies. Each has lineage and context. None should become a new, conflicting source of truth.
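To make "projection, not copy" concrete, here is a minimal sketch in PySpark Structured Streaming (assuming the spark-sql-kafka package is available and a hypothetical "orders" topic with the schema below): the event log stays the only source of truth, and the windowed view can be dropped and rebuilt at any time.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("orders-view").getOrCreate()

# Hypothetical schema of the raw "orders" events
schema = (StructType()
          .add("order_id", StringType())
          .add("region", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

# Read the single source of truth: the event stream itself
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load())

# Project it into a consumption-specific view (windowed revenue per region)
orders = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

revenue = (orders
           .withWatermark("event_time", "1 minute")
           .groupBy(F.window("event_time", "1 minute"), "region")
           .agg(F.sum("amount").alias("revenue")))

# In production the sink would be Druid/Pinot; console keeps the sketch self-contained
(revenue.writeStream
        .outputMode("update")
        .format("console")
        .start()
        .awaitTermination())
```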


Zero Copy with Third-Party Data

Modern external data sharing isn’t about FTP drops and batch syncs (well, sometimes, it still is...). It’s about clean APIs, Snowflake sharing, Delta Sharing, Conduktor Exchange, and even federated access across organizations. For instance, with Conduktor Exchange, partners can subscribe to real-time Kafka topics directly: no pipelines, no duplication, no extra processing for the data provider. This shifts the cost structure and makes third-party consumption scalable by design.

Agentic architectures need this by default:

  • agents pulling from partner services
  • reacting to external signals
  • calling external tools

Agentic systems won't wait for nightly copies. They need real-time remote access (with embedded security and constraints). Zero-copy isn’t about owning every byte. It’s about knowing where truth lives and pulling the right data/view, at the right time, for the right use.
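On the consuming side, subscribing to a provider's shared topic can be a few lines. The sketch below uses the kafka-python client; the topic name, endpoint, credentials, and handler are hypothetical placeholders.

```python
import json

from kafka import KafkaConsumer  # kafka-python client


def handle_signal(event: dict) -> None:
    # Placeholder for whatever the agent does with the external signal
    print("external signal:", event)


# Hypothetical shared topic exposed by the data provider: no local pipeline, no nightly copy
consumer = KafkaConsumer(
    "partner.orders.v1",
    bootstrap_servers="exchange.provider.example:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="partner-id",
    sasl_plain_password="********",
    group_id="partner-analytics",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    handle_signal(record.value)  # react as data arrives, straight from the provider's topic
```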


Memory Lives in Logs

Apache Iceberg, Delta Lake, and Hudi have changed how we store time. Even Kafka has started streaming directly into S3-compatible data lakes (see KIP-405 "Tiered Storage" and, very recently, KIP-1150 "Diskless Topics" from Aiven), enabling tools like Snowflake, Databricks, or Starburst to query event data without intermediary pipelines.

This doesn’t turn Kafka into a long-term storage system: it reinforces its role as a high-throughput, policy-aware, schema-governed data router that links producers, consumers, and lakes, making data more accessible.

Need audit logs for a compliance boundary? Store changes as first-class records with encryption, metadata, and retention policies. Need to retrain your LLM on customer behavior from the past year? Replay from Kafka, or query from Iceberg, both are zero-copy patterns.
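The "query from Iceberg" half of that sentence can look like this sketch with PyIceberg (catalog URI, table name, and filter are hypothetical): the year of history is read straight from the table's immutable snapshots, with no export job in between.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog and table; no intermediate warehouse copy involved
catalog = load_catalog("default", uri="http://rest-catalog:8181")
events = catalog.load_table("analytics.customer_events")

# Time-bounded scan over the immutable log of the past year
scan = events.scan(row_filter="event_time >= '2024-01-01T00:00:00'")
training_df = scan.to_pandas()
print(training_df.head())
```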

History isn't a reason to copy. It's a reason to log immutably and query flexibly.


Break App Isolation

Most applications are still walled gardens: Oracle, SAP, Salesforce, your 2005 ERP, etc. They rarely emit events and tend to keep the interesting data locked in and hidden.

You don’t need to replatform an ERP to participate in a modern data architecture. You need to isolate what it outputs and plug it into the real-time backbone. Rather than building brittle ETLs to pull data from those apps, use the Strangler pattern: gradually wrap legacy systems with change data capture (CDC), define data contracts, and enforce them at the edge.
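As a sketch of that wrap, assuming Debezium 2.x on a Kafka Connect cluster and a hypothetical Postgres-backed ERP, CDC can be switched on by registering a connector rather than writing another ETL:

```python
import json

import requests  # assumes a Kafka Connect cluster with Debezium 2.x installed

# Hypothetical connector: capture row-level changes from the ERP's Postgres database
connector = {
    "name": "erp-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "legacy-erp-db",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "********",
        "database.dbname": "erp",
        "topic.prefix": "erp",
        "table.include.list": "public.orders",
    },
}

# Register the connector; changes then flow as events on the "erp.public.orders" topic
resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("CDC connector created:", resp.json()["name"])
```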

The goal isn’t to refactor your old system; it’s to externalize its outputs into a reusable, owned data layer that respects policy, freshness, and schema evolution, and to stop letting every team write its own copy logic.


Solve for Latency

Physics is real. If you're joining two massive tables across regions, yes, naive virtualization will fall over. That’s not a zero-copy failure. That’s a lack of locality planning.

Do what we already do in infra:

  • Cache with purpose (Redis, DuckDB, Arrow Flight); see the sketch after this list
  • Materialize when queries repeat (Incremental View Maintenance is the future)
  • Push joins to edge systems (stream processing)
  • Avoid 3NF joins in read paths: normalize in write, denormalize in view

You don't need to build teleportation to make zero-copy work. You just need to treat data as a product, with SLAs, latency budgets, and cost-awareness.
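For the "cache with purpose" point above, a sketch with DuckDB (bucket path and columns are hypothetical, S3 credentials assumed configured) shows the idea: materialize only the hot slice locally and keep the lake as the source of truth.

```python
import duckdb

# Local, purpose-built cache; the lake remains the source of truth
con = duckdb.connect("cache.duckdb")
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 access (credentials assumed configured)

# Materialize only the hot slice that dashboards actually hit
con.execute("""
    CREATE OR REPLACE TABLE hot_orders AS
    SELECT order_id, region, amount, event_time
    FROM read_parquet('s3://lake/orders/date=2025-06-*/*.parquet')
    WHERE region = 'EU'
""")

# Repeated queries hit the local cache instead of re-scanning the lake
print(con.execute(
    "SELECT region, sum(amount) AS revenue FROM hot_orders GROUP BY region"
).fetchall())
```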


Build Trust, Not Copies

Most teams duplicate data because they don’t trust upstream availability or correctness. So they copy it, rename columns, drop nulls, etc. The fix is based on ownership concepts like Data Mesh:

  • Let producers define schemas and track usage
  • Make consumers submit access requests and understand lineage
  • Audit what’s shared and how it’s used (OpenTelemetry, Datadog, custom traces)
  • Automate via GitOps for auditability (CI/CD pipelines, data versioning, tests)
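As one concrete form of producer ownership, the sketch below uses Confluent's Schema Registry client to treat the schema as a contract and gate deploys on compatibility (registry URL, subject, and schema are hypothetical); the same check fits naturally into a GitOps pipeline.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Hypothetical registry URL and subject; the schema below is the producer-owned contract
client = SchemaRegistryClient({"url": "http://schema-registry:8081"})
subject = "customer.events-value"

contract = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "CustomerEvent",
      "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "event_type",  "type": "string"},
        {"name": "event_time",  "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }
    """,
    schema_type="AVRO",
)

# Gate the deploy: refuse schema changes that break existing consumers
# (assumes the subject already has at least one registered version)
if not client.test_compatibility(subject, contract):
    raise SystemExit("Breaking change: new schema violates the contract for " + subject)

schema_id = client.register_schema(subject, contract)
print(f"Contract registered for {subject} with schema id {schema_id}")
```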


Zero-copy means fewer "WTFs per query"

You can't scale agentic AI with batch ETLs. You can't fine-tune a small model on a windowed dataset when every team slices time differently. You can't secure your pipeline if every copy bypasses your policy engine. Zero-copy is not "copying nothing"; it's "exposing with control":

  • Flexible governance and lineage
  • Quick to expose for RAG and GenAI pipelines
  • Simpler LLMOps debugging and prompt tracing
  • Lower infra cost and faster time to insight

It’s clearly not for everything; there are times to break the rule. But if you don't zero-copy anything, you’re just building debt, not just in storage but in complexity (meaning cost and time to market).
