Learn How to Build a Datalake with DuckLake, DuckDB, and AWS S3 Express One Zone


TL;DR

DuckLake simplifies lakehouses by using a standard SQL database for all metadata instead of complex file-based systems, while still storing data in open formats like Parquet. This makes your data lake more reliable, faster, and easier to manage. Combine this with AWS S3 Express One Zone for blazing-fast reads and writes, and you have a next-generation analytics platform at your fingertips.


Background: The Lakehouse Revolution

Innovative data systems like BigQuery and Snowflake have shown that decoupling storage and compute is a great idea in an era where storage is a virtualized commodity. That way, storage and compute can scale independently, and we don't have to buy expensive database machines just to store tables we will never read.

At the same time, market forces have pushed people to insist that data systems use open formats like Parquet to avoid the all-too-common hostage-taking of data by a single vendor. In this new world, lots of data systems happily frolicked around a pristine “data lake” built on Parquet and S3, and all was well. Who needs those old-school databases anyway?

But it quickly emerged that – shockingly – people would like to make changes to their datasets. Simple appends worked pretty well by just dropping more files into a folder, but anything beyond that required complex and error-prone custom scripts without any notion of correctness or – Codd beware – transactional guarantees.

Enter DuckLake: Simplicity and Power

DuckLake, a new open-source extension for DuckDB, reimagines the lakehouse architecture. Instead of relying on complex file-based metadata systems (like Iceberg or Delta Lake), DuckLake uses a standard SQL database (DuckDB or Postgres) to store all metadata. Actual data lives in open formats like Parquet on object storage (such as S3).

Why is this a big deal?

  • Reliability: All metadata changes are ACID-compliant SQL transactions.

  • Simplicity: No more wrangling with JSON, Avro, or manifest files.

  • Performance: Metadata operations are fast and transactional.

  • Open Data: Your data is always in open formats, ready for any tool.

Why Use AWS S3 Express One Zone?

AWS S3 Express One Zone is Amazon’s new high-performance, single-AZ storage class. It’s designed for workloads that need ultra-low latency and high throughput, making it a perfect match for modern analytics and data lake workloads.

Key Benefits:

  • Faster Reads/Writes: Consistent single-digit-millisecond access times and the ability to scale to hundreds of thousands of requests per second.

  • Cost-Effective: Lower request costs than S3 Standard for data that doesn’t need cross-AZ redundancy.

  • Perfect for Analytics: Ideal for data lakes, ML, and real-time analytics where speed is critical.

Building Your Lakehouse: Step-by-Step Example

Let’s see how easy it is to build a data lake with DuckLake, DuckDB, and S3 Express One Zone.

1. Set Up Your S3 Express One Zone Directory Bucket

  • Create a directory bucket in the AWS Console. S3 Express One Zone uses directory buckets, which follow the naming pattern bucket-name--az-id--x-s3 and live in a single Availability Zone, so pick the zone closest to your compute.

2. Configure DuckDB and DuckLake

Install and load the required DuckDB extensions:
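DuckLake ships as a DuckDB extension, and S3 access comes from the httpfs extension. A minimal setup looks like this:

```sql
-- Install and load the DuckLake and HTTPFS (S3) extensions
INSTALL ducklake;
INSTALL httpfs;
LOAD ducklake;
LOAD httpfs;
```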

3. Set Up AWS Credentials and S3 Express Endpoint
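One way to wire up credentials is DuckDB’s Secrets Manager. The key ID, secret, region, and zonal endpoint below are placeholders; substitute your own values (S3 Express One Zone uses per-zone endpoints of the form s3express-az-id.region.amazonaws.com):

```sql
-- Register S3 credentials and the S3 Express One Zone zonal endpoint
-- (KEY_ID, SECRET, REGION, and ENDPOINT are placeholders)
CREATE OR REPLACE SECRET s3_express (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-west-2',
    ENDPOINT 's3express-usw2-az1.us-west-2.amazonaws.com'
);
```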

4. Attach Your DuckLake Metadata Database
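Assuming a local DuckDB file (metadata.ducklake) as the metadata catalog and a directory bucket named my-bucket--usw2-az1--x-s3 (both placeholders), attaching the lake looks roughly like this:

```sql
-- Attach a DuckLake catalog: metadata lives in the local DuckDB file,
-- table data is written as Parquet to the S3 Express One Zone bucket
ATTACH 'ducklake:metadata.ducklake' AS my_lake
    (DATA_PATH 's3://my-bucket--usw2-az1--x-s3/lake/');
USE my_lake;
```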

5. Create and Manipulate Tables
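A toy trips table (hypothetical schema) is enough to show the full read/write surface; inserts, updates, and deletes all run as ordinary SQL transactions against the DuckLake catalog:

```sql
-- Create a table in the attached DuckLake catalog
CREATE TABLE trips (
    trip_id INTEGER,
    rider   VARCHAR,
    fare    DECIMAL(8, 2),
    ts      TIMESTAMP
);

-- Inserts write Parquet files to S3; the metadata change commits transactionally
INSERT INTO trips VALUES
    (1, 'alice', 12.50, now()),
    (2, 'bob',    7.25, now());

-- Updates and deletes work too, with no Spark job or rewrite scripts required
UPDATE trips SET fare = 13.00 WHERE trip_id = 1;
DELETE FROM trips WHERE rider = 'bob';
```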

Query it
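Queries are plain DuckDB SQL over the same table, and because DuckLake snapshots every commit, earlier versions can be read back as well (the version number below is illustrative):

```sql
-- Ordinary aggregation over the lake table
SELECT rider, count(*) AS rides, sum(fare) AS total_fare
FROM trips
GROUP BY rider
ORDER BY total_fare DESC;

-- Time travel: read the table as of an earlier snapshot
SELECT * FROM trips AT (VERSION => 1);
```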

Conclusion

With DuckLake and DuckDB, you get a simple, reliable, and high-performance lakehouse—no more wrangling with complex metadata files or worrying about vendor lock-in. By pairing this with AWS S3 Express One Zone, you unlock blazing-fast analytics on open data formats, with all the flexibility and power of SQL.

Ready to build your own lightning-fast, open lakehouse? Try DuckLake with S3 Express One Zone today!



Magnus Eriksson

Architect and Cloud Specialist | ex-AWS


Supposedly Ducklake should work with other engines than DuckDB - would like an alternative that is hosted in AWS to avoid operating it (backup, patching....). Ideally I would have liked to use the new fully serverless AWS DSQL service but it may still lack some SQL features that are needed?! I suppose one could then run DuckDB locally (on multiple computers concurrently) to access the fully AWS hosted ducklake for ingestion and analytics....

Samarth M O

Senior Data Platform Engineer @ Take-Two Interactive


Love this, Soumil. Excited to try it out!

Subramanian Neelakantan

Sr Consultant @ Visa | Forever Python Blogger | Data Science & AI


Nice. That was quick. Will try it out today.

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & EMR | Data Lake(Hudi | Iceberg) Specialist | YouTuber


I think I’m really loving this—no need for Spark! We can perform full INSERT, UPDATE, and DELETE operations directly through the table service. That’s just amazing. With high-level metadata now kept in duckdb, this should be incredibly fast in theory. What I’m particularly interested in is how existing customers with large datasets can migrate to Duck Lake. What strategies are available for catalog migration? https://ducklake.select/docs/stable/specification/queries

