Real-time Data Processing: Building a Zero-ETL Pipeline with AWS Services

Understanding the Architecture

Creating a real-time data pipeline from external APIs to analytics involves several crucial architectural decisions. At first glance, writing data directly to an analytics layer might seem like the simplest approach. However, experience shows that a well-designed multi-stage pipeline using AWS services provides superior reliability, scalability, and maintainability. Let's explore how to build such a system and understand why each component matters.

DynamoDB: The Foundation of Real-time Storage

The journey begins with choosing where to store incoming API data. Writing directly to an analytics store might appear straightforward, but this approach quickly reveals its limitations when dealing with API failures, retries, and high-load scenarios. DynamoDB addresses these challenges by serving as a reliable buffer, efficiently handling high-throughput writes while managing API timeouts and retries. Here's how we implement this initial stage:
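
A minimal sketch of this ingestion step is shown below. The table name api_events, its key schema, and the external API call are illustrative assumptions, not the original implementation.

```python
import json
import time

import boto3
import requests  # assumed HTTP client for the external API

# Assumed table name and key schema: partition key "id", sort key "ingested_at".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("api_events")


def fetch_and_store(api_url: str) -> int:
    """Poll an external API and buffer every record in DynamoDB."""
    response = requests.get(api_url, timeout=10)
    response.raise_for_status()
    records = response.json()

    # batch_writer handles batching and automatically retries unprocessed items,
    # which gives us the buffering behaviour we want under load.
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(
                Item={
                    "id": str(record["id"]),
                    "ingested_at": int(time.time() * 1000),
                    "payload": json.dumps(record),  # raw API payload as a string
                }
            )
    return len(records)
```

Storing the raw payload as a JSON string keeps the write path schema-agnostic; parsing is deferred to later stages of the pipeline.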

Capturing Changes with DynamoDB Streams

With data flowing into DynamoDB, we need a reliable way to capture changes. Traditional polling approaches often miss updates or consume excessive resources. DynamoDB Streams offers a better solution, providing ordered change capture without missing updates. To enable streaming:
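
Enabling the stream is a one-time table update. A boto3 sketch, assuming the same api_events table:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Capture both old and new images so downstream consumers see the full change.
dynamodb.update_table(
    TableName="api_events",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```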

Enhancing Reliability with Kinesis Data Streams

While connecting DynamoDB Streams directly to Redshift might seem tempting, this approach faces several limitations. DynamoDB Streams' 24-hour retention limit makes recovery from any outage longer than a day impossible. Additionally, multiple applications often need access to the same change data, and the ability to replay historical data proves invaluable for reprocessing or testing.

Kinesis Data Streams resolves these challenges by offering extended data retention of up to 365 days, support for multiple consumers, and robust replay capabilities. Here's our Lambda function (lambda_function.py) that bridges DynamoDB Streams to Kinesis:
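
The original lambda_function.py is not reproduced here; the sketch below shows one way such a bridge could look, with the stream name taken from an assumed environment variable and per-item ordering preserved via the partition key.

```python
import json
import os

import boto3
from boto3.dynamodb.types import TypeDeserializer

kinesis = boto3.client("kinesis")
deserializer = TypeDeserializer()

# Target stream name supplied through configuration (assumed environment variable).
STREAM_NAME = os.environ.get("KINESIS_STREAM_NAME", "api-events-stream")


def lambda_handler(event, context):
    """Forward DynamoDB Stream records to Kinesis Data Streams."""
    records = []
    for record in event["Records"]:
        new_image = record["dynamodb"].get("NewImage")
        if not new_image:
            continue  # deletes carry no NewImage; skip them here

        # Convert DynamoDB-typed JSON into plain JSON before forwarding.
        item = {k: deserializer.deserialize(v) for k, v in new_image.items()}
        records.append(
            {
                "Data": json.dumps(item, default=str).encode("utf-8"),
                # Partition by item id so updates to one item stay ordered.
                "PartitionKey": str(item["id"]),
            }
        )

    if records:
        # put_records accepts up to 500 records per call; a production version
        # should also check FailedRecordCount and retry any failed entries.
        kinesis.put_records(StreamName=STREAM_NAME, Records=records)

    return {"forwarded": len(records)}
```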

To deploy this function and ensure it processes the correct data stream:
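
One way to wire this up with boto3 (the function name and batch settings below are assumptions):

```python
import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

# Look up the stream ARN of the source table.
stream_arn = dynamodb.describe_table(TableName="api_events")["Table"]["LatestStreamArn"]

# Create the event source mapping so the bridge function consumes the stream.
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="dynamodb-to-kinesis-bridge",  # assumed function name
    StartingPosition="LATEST",
    BatchSize=100,
    MaximumBatchingWindowInSeconds=5,
)
```

Starting at LATEST avoids reprocessing historical changes on first deployment; use TRIM_HORIZON if you want the full 24-hour backlog instead.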

Redshift Integration: Implementing True Zero-ETL

The final and most crucial component of our pipeline is the Redshift integration. Here, we face a critical architectural decision. While Redshift can connect directly to Kinesis streams, we need to carefully design our approach to handle high-volume data efficiently while maintaining system performance.

Let's start with the foundation - setting up the zero-ETL connection to Kinesis:
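
A sketch of that setup; the IAM role ARN and stream name are placeholders to replace with your own:

```sql
-- Map the Kinesis stream into Redshift via an external schema.
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';

-- Streaming-ingestion materialized view over the stream.
CREATE MATERIALIZED VIEW mv_api_events
AUTO REFRESH YES
AS
SELECT
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    JSON_PARSE(kinesis_data) AS payload  -- raw record as a SUPER value
FROM kinesis_schema."api-events-stream"
WHERE CAN_JSON_PARSE(kinesis_data);
```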

The AUTO REFRESH YES clause is crucial here - it enables true zero-ETL by having Redshift refresh the materialized view automatically as new data arrives in the Kinesis stream. Keeping the view definition a simple projection of the stream also allows Redshift to maintain it incrementally rather than recomputing it on every refresh.

However, directly consuming from this materialized view for analytics can lead to resource contention and performance issues. Instead, we implement a staged approach with proper batch processing:
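
The table definitions below are a sketch; the column names and types are assumptions that follow the payload used earlier.

```sql
-- Staging table: EVEN distribution spreads the incoming batch across slices.
CREATE TABLE staging_api_events (
    id          VARCHAR(64),
    ingested_at TIMESTAMP,
    payload     SUPER
)
DISTSTYLE EVEN;

-- Production table: distribute and sort for the most common query patterns.
CREATE TABLE api_events_prod (
    id          VARCHAR(64),
    ingested_at TIMESTAMP,
    payload     SUPER
)
DISTKEY(id)
SORTKEY(ingested_at);
```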

Notice the DISTSTYLE EVEN on the staging table - this choice is intentional. It spreads the data evenly across nodes during the initial load, reducing hot spots and improving parallel processing. The production table, however, uses DISTKEY(id) to optimize for the most common query patterns.

The heart of our processing logic lies in the batch synchronization procedure:
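
A simplified sketch of such a procedure, using the assumed table names from above; a production version would track an explicit high-water mark and de-duplicate on id:

```sql
CREATE OR REPLACE PROCEDURE sync_api_events()
AS $$
BEGIN
    -- Pull only records that arrived after the last promoted batch.
    INSERT INTO staging_api_events (id, ingested_at, payload)
    SELECT
        payload.id::VARCHAR(64),
        approximate_arrival_timestamp,
        payload
    FROM mv_api_events
    WHERE approximate_arrival_timestamp >
          (SELECT COALESCE(MAX(ingested_at), '1970-01-01'::TIMESTAMP)
           FROM api_events_prod);

    -- Promote the batch into the production table, then clear staging.
    INSERT INTO api_events_prod
    SELECT id, ingested_at, payload FROM staging_api_events;

    TRUNCATE staging_api_events;
END;
$$ LANGUAGE plpgsql;
```

The call (CALL sync_api_events();) can then be scheduled, for example through the Redshift query scheduler or an EventBridge-triggered query.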

This procedure runs every five minutes, processing data in batches. That interval isn't arbitrary - it balances data freshness against system load. Too short an interval increases refresh and load overhead, while too long an interval delays data availability.

Monitoring becomes crucial in this setup. We need to track both the materialized view refresh status and overall pipeline health:
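
One way to surface this is sketched below against Redshift's SVL_MV_REFRESH_STATUS system view; verify the column names against your cluster's documentation before relying on it.

```sql
-- Assumed monitoring view over the materialized-view refresh history.
CREATE OR REPLACE VIEW v_pipeline_health AS
SELECT
    mv_name,
    status,
    refresh_type,
    starttime,
    endtime,
    DATEDIFF(second, starttime, endtime) AS refresh_seconds
FROM svl_mv_refresh_status
WHERE mv_name = 'mv_api_events';
```

Querying it, for example SELECT * FROM v_pipeline_health ORDER BY starttime DESC LIMIT 20, shows whether auto refreshes are succeeding and how long they take; pairing this with a lag check between the production table's latest ingested_at and the stream's latest arrival time covers end-to-end health.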

This monitoring view provides crucial insights into pipeline health, helping identify potential bottlenecks before they impact business operations.

Alternative Approaches

While this implementation has proven robust for many use cases, some scenarios might benefit from alternative approaches. One such pattern is API → Data Lake → Redshift, which can provide additional flexibility and cost optimization for certain workloads. The choice between approaches should consider factors like data volume, latency requirements, and query patterns.

Conclusion

Building a reliable real-time data pipeline requires careful consideration of each component's strengths and limitations. The architecture presented here balances data freshness with system stability, using batch processing and materialized views to manage resource contention effectively. Regular monitoring and performance optimization ensure the pipeline continues to meet business needs as data volumes grow.

The success of such implementations often lies not just in the initial architecture, but in understanding and addressing the specific challenges that emerge at scale. Whether using this approach or alternatives like Data Lake integration, the key is to maintain a balance between real-time capabilities and system stability.
