Learn How to Build a Datalake with DuckLake, DuckDB, and AWS S3 Express One Zone


TL;DR

DuckLake simplifies lakehouses by using a standard SQL database for all metadata instead of complex file-based systems, while still storing data in open formats like Parquet. This makes your data lake more reliable, faster, and easier to manage. Combine this with AWS S3 Express One Zone for blazing-fast reads and writes, and you have a next-generation analytics platform at your fingertips.


Background: The Lakehouse Revolution

Innovative data systems like BigQuery and Snowflake have shown that decoupling storage and compute is a great idea in an era where storage is a virtualized commodity. That way, storage and compute can scale independently, and we don't have to buy expensive database machines just to store tables we will never read.

At the same time, market forces have pushed people to insist that data systems use open formats like Parquet to avoid the all-too-common hostage-taking of data by a single vendor. In this new world, lots of data systems happily frolicked around a pristine “data lake” built on Parquet and S3, and all was well. Who needs those old-school databases anyway?

But it quickly emerged that – shockingly – people would like to make changes to their datasets. Simple appends worked pretty well by just dropping more files into a folder, but anything beyond that required complex and error-prone custom scripts without any notion of correctness or – Codd beware – transactional guarantees.

Enter DuckLake: Simplicity and Power

DuckLake, a new open-source extension for DuckDB, reimagines the lakehouse architecture. Instead of relying on complex file-based metadata systems (like Iceberg or Delta Lake), DuckLake uses a standard SQL database (DuckDB or Postgres) to store all metadata. Actual data lives in open formats like Parquet on object storage (such as S3).

Why is this a big deal?

  • Reliability: All metadata changes are ACID-compliant SQL transactions.

  • Simplicity: No more wrangling with JSON, Avro, or manifest files.

  • Performance: Metadata operations are fast and transactional.

  • Open Data: Your data is always in open formats, ready for any tool.

Why Use AWS S3 Express One Zone?

AWS S3 Express One Zone is Amazon’s new high-performance, single-AZ storage class. It’s designed for workloads that need ultra-low latency and high throughput, making it a perfect match for modern analytics and data lake workloads.

Key Benefits:

  • Faster Reads/Writes: Consistent single-digit-millisecond access times and the ability to scale to hundreds of thousands of requests per second.

  • Cost-Effective: Lower request costs than S3 Standard for data that doesn’t need cross-AZ redundancy.

  • Perfect for Analytics: Ideal for data lakes, ML, and real-time analytics where speed is critical.

Building Your Lakehouse: Step-by-Step Example

Let’s see how easy it is to build a data lake with DuckLake, DuckDB, and S3 Express One Zone.

1. Set Up Your S3 Express One Zone Directory Bucket

  • Create a directory bucket in the AWS Console. S3 Express One Zone uses directory buckets, which follow the naming pattern bucket-name--az-id--x-s3 and live in a single Availability Zone, so pick the zone closest to your compute.

2. Configure DuckDB and DuckLake

Install and load the required DuckDB extensions:
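DuckLake ships as a DuckDB extension, and S3 access comes from the httpfs extension. A minimal setup looks like this:

```sql
-- Install and load the DuckLake and HTTPFS (S3) extensions
INSTALL ducklake;
INSTALL httpfs;
LOAD ducklake;
LOAD httpfs;
```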

3. Set Up AWS Credentials and S3 Express Endpoint
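One way to wire up credentials is DuckDB’s Secrets Manager. The key ID, secret, region, and zonal endpoint below are placeholders; substitute your own values (S3 Express One Zone uses per-zone endpoints of the form s3express-az-id.region.amazonaws.com):

```sql
-- Register S3 credentials and the S3 Express One Zone zonal endpoint
-- (KEY_ID, SECRET, REGION, and ENDPOINT are placeholders)
CREATE OR REPLACE SECRET s3_express (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-west-2',
    ENDPOINT 's3express-usw2-az1.us-west-2.amazonaws.com'
);
```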

4. Attach Your DuckLake Metadata Database
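Assuming a local DuckDB file (metadata.ducklake) as the metadata catalog and a directory bucket named my-bucket--usw2-az1--x-s3 (both placeholders), attaching the lake looks roughly like this:

```sql
-- Attach a DuckLake catalog: metadata lives in the local DuckDB file,
-- table data is written as Parquet to the S3 Express One Zone bucket
ATTACH 'ducklake:metadata.ducklake' AS my_lake
    (DATA_PATH 's3://my-bucket--usw2-az1--x-s3/lake/');
USE my_lake;
```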

5. Create and Manipulate Tables
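A toy trips table (hypothetical schema) is enough to show the full read/write surface; inserts, updates, and deletes all run as ordinary SQL transactions against the DuckLake catalog:

```sql
-- Create a table in the attached DuckLake catalog
CREATE TABLE trips (
    trip_id INTEGER,
    rider   VARCHAR,
    fare    DECIMAL(8, 2),
    ts      TIMESTAMP
);

-- Inserts write Parquet files to S3; the metadata change commits transactionally
INSERT INTO trips VALUES
    (1, 'alice', 12.50, now()),
    (2, 'bob',    7.25, now());

-- Updates and deletes work too, with no Spark job or rewrite scripts required
UPDATE trips SET fare = 13.00 WHERE trip_id = 1;
DELETE FROM trips WHERE rider = 'bob';
```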

Query it
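Queries are plain DuckDB SQL over the same table, and because DuckLake snapshots every commit, earlier versions can be read back as well (the version number below is illustrative):

```sql
-- Ordinary aggregation over the lake table
SELECT rider, count(*) AS rides, sum(fare) AS total_fare
FROM trips
GROUP BY rider
ORDER BY total_fare DESC;

-- Time travel: read the table as of an earlier snapshot
SELECT * FROM trips AT (VERSION => 1);
```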

Conclusion

With DuckLake and DuckDB, you get a simple, reliable, and high-performance lakehouse—no more wrangling with complex metadata files or worrying about vendor lock-in. By pairing this with AWS S3 Express One Zone, you unlock blazing-fast analytics on open data formats, with all the flexibility and power of SQL.

Ready to build your own lightning-fast, open lakehouse? Try DuckLake with S3 Express One Zone today!



Magnus Eriksson

Architect and Cloud Specialist | ex-AWS


Supposedly Ducklake should work with other engines than DuckDB - would like an alternative that is hosted in AWS to avoid operating it (backup, patching....). Ideally I would have liked to use the new fully serverless AWS DSQL service but it may still lack some SQL features that are needed?! I suppose one could then run DuckDB locally (on multiple computers concurrently) to access the fully AWS hosted ducklake for ingestion and analytics....

Samarth M O

Senior Data Platform Engineer @ Take-Two Interactive


Love this, Soumil. Excited to try it out!

Subramanian Neelakantan

Sr Consultant @ Visa | Forever Python Blogger | Data Science & AI


Nice. That was quick. Will try it out today.

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & EMR | Data Lake(Hudi | Iceberg) Specialist | YouTuber


I think I’m really loving this—no need for Spark! We can perform full INSERT, UPDATE, and DELETE operations directly through the table service. That’s just amazing. With high-level metadata now kept in duckdb, this should be incredibly fast in theory. What I’m particularly interested in is how existing customers with large datasets can migrate to Duck Lake. What strategies are available for catalog migration? https://ducklake.select/docs/stable/specification/queries

