Docker - Spark + Delta Lake + MinIO + Hive Metastore: Your Quick-Start Guide to Building a Modern Data Platform
Are you looking for a quick way to get started with Delta Lake, MinIO, and Apache Spark? Whether you're building a proof of concept or setting up a development environment, this guide will help you get a fully functional data lakehouse running in minutes. We'll walk through a Docker-based setup that combines Delta Lake's ACID guarantees, MinIO's S3-compatible storage, and Spark's powerful processing capabilities – all orchestrated with Apache Hive Metastore for seamless table management.
Architecture Overview
Our setup consists of several key components:
Apache Spark with Delta Lake for data processing and storage
MinIO as S3-compatible object storage
Apache Hive Metastore for managing table metadata
PostgreSQL as the backend database for Hive Metastore
Prerequisites
Docker and Docker Compose
Basic understanding of Apache Spark and SQL
Familiarity with Python programming
Quick Start
About the Spark Delta Image
Our custom Spark Delta image is built on top of the official Spark 3.5.0 image and includes everything you need for a local Delta Lake development environment. Here's what's inside:
Base Image and Core Components:
Based on the official Spark 3.5.0 image
Python 3 with PySpark support
Java 11 runtime
Scala 2.12
Key Features:
Pre-installed Delta Lake dependencies
Configured S3A connector for MinIO
Hadoop AWS libraries for S3 compatibility
Built-in Hive configuration
Optimized for local development
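A minimal Dockerfile for an image along these lines is sketched below; the base tag, Delta Lake version, and jar versions are illustrative assumptions rather than the exact recipe used in this project.

```dockerfile
# Illustrative sketch only; versions, paths, and the pip install step are assumptions
FROM apache/spark:3.5.0

USER root

# pip is not guaranteed in the base image, so install it before the Python dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Delta Lake Python bindings compatible with Spark 3.5.x
RUN pip3 install --no-cache-dir delta-spark==3.1.0

# Hadoop AWS + AWS SDK bundle so the S3A connector can talk to MinIO
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.3.4.jar /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar

# Drop back to the unprivileged spark user
USER spark
```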
Implementation
1. Docker Compose Configuration
First, let's look at our configuration that sets up all required services:
docker-compose.yml
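The full compose file is specific to the project, but a trimmed-down sketch of the four services looks roughly like the following; image tags, ports, and credentials are illustrative assumptions, not the exact values used here.

```yaml
# Illustrative sketch; image tags, ports, and credentials are assumptions
version: "3.8"

services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
      POSTGRES_DB: metastore

  hive-metastore:
    image: apache/hive:3.1.3
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      # JDBC connection settings for Postgres omitted for brevity
    depends_on:
      - postgres
    ports:
      - "9083:9083"   # Thrift endpoint used by Spark

  spark:
    build: .             # the custom Spark Delta image described above
    env_file: .env.spark
    volumes:
      - ./scripts:/opt/spark/scripts
    depends_on:
      - minio
      - hive-metastore
```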
2. Spark Configuration
Spark Environment Configuration
Before starting the environment, create an environment file that sets Spark's resource allocation and performance settings:
.env.spark
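A small example of what such a file might contain; the values below are illustrative defaults, and the variable names are assumptions about how the image consumes them, so size and rename them for your own setup.

```env
# .env.spark : illustrative values only
SPARK_DRIVER_MEMORY=2g
SPARK_EXECUTOR_MEMORY=2g
SPARK_EXECUTOR_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_LOCAL_DIRS=/tmp/spark
```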
The following Python code sets up our Spark session with Delta Lake integration and necessary configurations for MinIO and Hive Metastore:
scripts/spark_config_delta.py
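The original script is not reproduced here; a minimal sketch, assuming MinIO is reachable at http://minio:9000 and the metastore at thrift://hive-metastore:9083 (hostnames and credentials match the compose sketch above and are assumptions), could look like this:

```python
# scripts/spark_config_delta.py : illustrative sketch; endpoints and credentials are assumptions
from pyspark.sql import SparkSession


def create_spark_session(app_name: str = "delta-lakehouse") -> SparkSession:
    """Build a SparkSession wired up for Delta Lake, MinIO (S3A), and the Hive Metastore."""
    return (
        SparkSession.builder.appName(app_name)
        # Delta Lake integration
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # S3A connector pointed at MinIO
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        # Hive Metastore for table metadata
        .config("spark.sql.catalogImplementation", "hive")
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")
        .enableHiveSupport()
        .getOrCreate()
    )


if __name__ == "__main__":
    spark = create_spark_session()
    print(spark.version)
```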
Start all services using Docker Compose:
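For example, from the directory containing the compose file:

```bash
# Build the custom Spark image (if needed) and start all services in the background
docker compose up -d --build
```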
3. Data Ingestion
Upload a few sample CSV files into the MinIO bucket source-data/, as shown below.
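A tiny hypothetical customers.csv is enough to exercise the pipeline; the file name and columns are made up for illustration.

```csv
customer_id,name,country,signup_date
1,Alice,DE,2024-01-15
2,Bob,US,2024-02-03
3,Chandra,IN,2024-02-21
```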
Here's a sample implementation for ingesting CSV files into Delta tables:
scripts/basic_spark_delta.py
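The repository script is not reproduced here; a minimal sketch that reuses the session builder sketched above (the bucket, path, and table names are illustrative assumptions) could look like this:

```python
# scripts/basic_spark_delta.py : illustrative sketch; bucket, path, and table names are assumptions
from spark_config_delta import create_spark_session

spark = create_spark_session("csv-to-delta-ingestion")

# Read every CSV file previously uploaded to the source-data bucket
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://source-data/")
)

# Write the data as a Delta table stored in MinIO and registered in the Hive Metastore
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("path", "s3a://delta-lake/customers")
    .saveAsTable("default.customers")
)

# Read it back through the metastore to verify the round trip
spark.table("default.customers").show()
```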
To run the ingestion script above inside the containers deployed with Docker Compose, submit it to the Spark service.
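One way to do this, assuming the Spark service is named spark and the scripts directory is mounted at /opt/spark/scripts as in the compose sketch above:

```bash
docker compose exec spark \
  spark-submit /opt/spark/scripts/basic_spark_delta.py
```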
4. Spark SQL Session
scripts/spark_sql_session_delta.py
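Tables registered in the Hive Metastore can also be queried with plain Spark SQL. A minimal sketch of such a session, with the database and table names carried over as assumptions from the ingestion example:

```python
# scripts/spark_sql_session_delta.py : illustrative sketch; database and table names are assumptions
from spark_config_delta import create_spark_session

spark = create_spark_session("spark-sql-session")

# List the tables registered in the Hive Metastore
spark.sql("SHOW TABLES IN default").show()

# Run an ordinary SQL query against the Delta table created during ingestion
spark.sql("""
    SELECT country, COUNT(*) AS customers
    FROM default.customers
    GROUP BY country
    ORDER BY customers DESC
""").show()

# Delta-specific commands also work through SQL
spark.sql("DESCRIBE HISTORY default.customers").show(truncate=False)
```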
Conclusion
This setup provides a robust foundation for building modern data applications. The combination of Spark, Delta Lake, MinIO, and Hive Metastore offers a powerful, scalable, and maintainable data platform that can handle various data processing needs while maintaining data consistency and providing SQL capabilities.
Remember to adjust configurations based on your specific needs and scale requirements. This setup can be extended with additional services like Airflow for orchestration or Superset for visualization as your needs grow.