Zero-copy is the backbone of scalable Agentic AI

Summary

  • One Domain, Many Views
  • Zero Copy with Third-Party Data
  • Memory Lives in Logs
  • Break App Isolation
  • Solve for Latency
  • Build Trust, Not Copies
  • Zero-copy means fewer "WTFs per query"


Introduction

In the 2010s, we laughed off data virtualization and canonical forms: attempts at abstracting data access and standardizing meaning across systems, long before Agentic AI made those things non-negotiable. ETL jobs and lakehouses gave us just enough leverage to patch over bad architecture. That worked for a while. It doesn't anymore.

Agentic AI needs more. It doesn’t just need fast models; it needs structured, consistent access to data across its planning and reasoning loops. You don’t get that from 14 conflicting pipelines feeding five stale warehouses. You get it from coherent, contract-bound data sources that don’t splinter on every handoff.

You get it from systems that don’t copy data unless they have to.


One Domain, Many Views

No one runs analytics directly on OLTP tables or joins inside a vector DB. Every system that consumes data has its own query pattern (batch files, streaming, SQL queries, vector search), its own latency, and its own format. Shaping data for those patterns is not the same as copying it.

For instance, Kafka, Redpanda, Spark, Flink, ksqlDB, and similar tools let you project event streams into shape-specific views. Apache Iceberg and Delta Lake give you storage patterns where raw logs, compacted state, and time-based partitions can all coexist.

  • Want real-time dashboards? Stream events into Apache Druid or Apache Pinot (StarTree) with inline pre-aggregations to serve low-latency metrics.
  • Need memory for your RAG? Sync embeddings into a vector store like Weaviate, Pinecone, or Milvus (created by Zilliz), while streaming metadata via Kafka or Redpanda to maintain live references back to the original documents (traceability).
  • Building a vector index for semantic search? Use OpenAI or Cohere to derive embeddings from canonical customer records in your lakehouse (e.g. Iceberg, Delta Lake).
  • Powering an agent planning loop? Expose task states via real-time views built in Materialize or ksqlDB instead of materializing full snapshots.
  • Monitoring application state in real time? Aggregate and window data streams with Apache Flink, RisingWave, or Confluent ksqlDB, avoiding the need to persist into yet another warehouse.

These are projections tailored for specific consumption patterns: stream joins, RAG hydration, real-time dashboards; not raw copies. Each has lineage and context. None should become a new, conflicting source of truth.
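To make "projection, not copy" concrete, here is a minimal sketch in PySpark Structured Streaming (assuming the spark-sql-kafka package is available and a hypothetical "orders" topic with the schema below): the event log stays the only source of truth, and the windowed view can be dropped and rebuilt at any time.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("orders-view").getOrCreate()

# Hypothetical schema of the raw "orders" events
schema = (StructType()
          .add("order_id", StringType())
          .add("region", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

# Read the single source of truth: the event stream itself
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load())

# Project it into a consumption-specific view (windowed revenue per region)
orders = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

revenue = (orders
           .withWatermark("event_time", "1 minute")
           .groupBy(F.window("event_time", "1 minute"), "region")
           .agg(F.sum("amount").alias("revenue")))

# In production the sink would be Druid/Pinot; console keeps the sketch self-contained
(revenue.writeStream
        .outputMode("update")
        .format("console")
        .start()
        .awaitTermination())
```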


Zero Copy with Third-Party Data

Modern external data sharing isn’t about FTP drops and batch syncs (well, sometimes, it still is...). It’s about clean APIs, Snowflake sharing, Delta Sharing, Conduktor Exchange, and even federated access across organizations. For instance, with Conduktor Exchange, partners can subscribe to real-time Kafka topics directly: no pipelines, no duplication, no extra processing for the data provider. This shifts the cost structure and makes third-party consumption scalable by design.

Agentic architectures need this by default:

  • agents pulling from partner services
  • reacting to external signals
  • calling external tools

Agentic systems won't wait for nightly copies. They need real-time remote access (with embedded security and constraints). Zero-copy isn’t about owning every byte. It’s about knowing where truth lives and pulling the right data/view, at the right time, for the right use.
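On the consuming side, subscribing to a provider's shared topic can be a few lines. The sketch below uses the kafka-python client; the topic name, endpoint, credentials, and handler are hypothetical placeholders.

```python
import json

from kafka import KafkaConsumer  # kafka-python client


def handle_signal(event: dict) -> None:
    # Placeholder for whatever the agent does with the external signal
    print("external signal:", event)


# Hypothetical shared topic exposed by the data provider: no local pipeline, no nightly copy
consumer = KafkaConsumer(
    "partner.orders.v1",
    bootstrap_servers="exchange.provider.example:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="partner-id",
    sasl_plain_password="********",
    group_id="partner-analytics",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    handle_signal(record.value)  # react as data arrives, straight from the provider's topic
```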


Memory Lives in Logs

Apache Iceberg, Delta Lake, and Hudi have changed how we store time. Even Kafka has started streaming directly into S3-compatible data lakes (see KIP-405 "Tiered Storage" and, very recently, KIP-1150 "Diskless Topics" from Aiven), enabling tools like Snowflake, Databricks, or Starburst to query event data without intermediary pipelines.

This doesn’t turn Kafka into a long-term storage system: it reinforces its role as a high-throughput, policy-aware, schema-governed data router that links producers, consumers, and lakes, making data more accessible.

Need audit logs for a compliance boundary? Store changes as first-class records with encryption, metadata, and retention policies. Need to retrain your LLM on customer behavior from the past year? Replay from Kafka, or query from Iceberg, both are zero-copy patterns.
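The "query from Iceberg" half of that sentence can look like this sketch with PyIceberg (catalog URI, table name, and filter are hypothetical): the year of history is read straight from the table's immutable snapshots, with no export job in between.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog and table; no intermediate warehouse copy involved
catalog = load_catalog("default", uri="http://rest-catalog:8181")
events = catalog.load_table("analytics.customer_events")

# Time-bounded scan over the immutable log of the past year
scan = events.scan(row_filter="event_time >= '2024-01-01T00:00:00'")
training_df = scan.to_pandas()
print(training_df.head())
```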

History isn't a reason to copy. It's a reason to log immutably and query flexibly.


Break App Isolation

Most applications are still walled gardens: Oracle, SAP, Salesforce, your 2005 ERP, etc. They rarely emit events and tend to keep the interesting data locked in and hidden.

You don’t need to replatform an ERP to participate in a modern data architecture. You need to isolate what it outputs and plug it into the real-time backbone. Rather than building brittle ETLs to pull data from those apps, use the Strangler pattern: gradually wrap legacy systems with change data capture (CDC), define data contracts, and enforce them at the edge.
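As a sketch of that wrap, assuming Debezium 2.x on a Kafka Connect cluster and a hypothetical Postgres-backed ERP, CDC can be switched on by registering a connector rather than writing another ETL:

```python
import json

import requests  # assumes a Kafka Connect cluster with Debezium 2.x installed

# Hypothetical connector: capture row-level changes from the ERP's Postgres database
connector = {
    "name": "erp-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "legacy-erp-db",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "********",
        "database.dbname": "erp",
        "topic.prefix": "erp",
        "table.include.list": "public.orders",
    },
}

# Register the connector; changes then flow as events on the "erp.public.orders" topic
resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("CDC connector created:", resp.json()["name"])
```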

The goal isn’t to refactor your old system; it’s to externalize its outputs into a reusable, owned data layer that respects policy, freshness, and schema evolution, and to stop letting every team write its own copy logic.


Solve for Latency

Physics is real. If you're joining two massive tables across regions, yes, naive virtualization will fall over. That’s not a zero-copy failure. That’s a lack of locality planning.

Do what we already do in infra:

  • Cache with purpose (Redis, DuckDB, Arrow Flight); see the sketch after this list
  • Materialize when queries repeat (Incremental View Maintenance is the future)
  • Push joins to edge systems (stream processing)
  • Avoid 3NF joins in read paths: normalize in write, denormalize in view

You don't need to build teleportation to make zero-copy work. You just need to treat data as a product, with SLAs, latency budgets, and cost-awareness.
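For the "cache with purpose" point above, a sketch with DuckDB (bucket path and columns are hypothetical, S3 credentials assumed configured) shows the idea: materialize only the hot slice locally and keep the lake as the source of truth.

```python
import duckdb

# Local, purpose-built cache; the lake remains the source of truth
con = duckdb.connect("cache.duckdb")
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 access (credentials assumed configured)

# Materialize only the hot slice that dashboards actually hit
con.execute("""
    CREATE OR REPLACE TABLE hot_orders AS
    SELECT order_id, region, amount, event_time
    FROM read_parquet('s3://lake/orders/date=2025-06-*/*.parquet')
    WHERE region = 'EU'
""")

# Repeated queries hit the local cache instead of re-scanning the lake
print(con.execute(
    "SELECT region, sum(amount) AS revenue FROM hot_orders GROUP BY region"
).fetchall())
```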


Build Trust, Not Copies

Most teams duplicate data because they don’t trust upstream availability or correctness. So they copy it, rename columns, drop nulls, etc. The fix is based on ownership concepts like Data Mesh:

  • Let producers define schemas and track usage
  • Make consumers submit access requests and understand lineage
  • Audit what’s shared and how it’s used (OpenTelemetry, Datadog, custom traces)
  • Automate via GitOps for auditability (CI/CD pipelines, data versioning, tests)
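As one concrete form of producer ownership, the sketch below uses Confluent's Schema Registry client to treat the schema as a contract and gate deploys on compatibility (registry URL, subject, and schema are hypothetical); the same check fits naturally into a GitOps pipeline.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Hypothetical registry URL and subject; the schema below is the producer-owned contract
client = SchemaRegistryClient({"url": "http://schema-registry:8081"})
subject = "customer.events-value"

contract = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "CustomerEvent",
      "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "event_type",  "type": "string"},
        {"name": "event_time",  "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }
    """,
    schema_type="AVRO",
)

# Gate the deploy: refuse schema changes that break existing consumers
# (assumes the subject already has at least one registered version)
if not client.test_compatibility(subject, contract):
    raise SystemExit("Breaking change: new schema violates the contract for " + subject)

schema_id = client.register_schema(subject, contract)
print(f"Contract registered for {subject} with schema id {schema_id}")
```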


Zero-copy means fewer "WTFs per query"

You can't scale agentic AI with batch ETLs. You can't fine-tune a small model on a windowed dataset when every team slices time differently. You can't secure your pipeline if every copy bypasses your policy engine. Zero-copy is not "copying nothing"; it's "exposing with control":

  • Flexible governance and lineage
  • Quick to expose for RAG and GenAI pipelines
  • Simpler LLMOps debugging and prompt tracing
  • Lower infra cost and faster time to insight

It’s clearly not for everything; there are times to break the rule. But if you don't zero-copy anything, you’re just building debt, not just in storage but in complexity (meaning cost and time to market).
