Docker - Spark + Delta Lake + MinIO + Hive Metastore: Your Quick-Start Guide to Building a Modern Data Platform
Are you looking for a quick way to get started with Delta Lake, MinIO, and Apache Spark? Whether you're building a proof of concept or setting up a development environment, this guide will help you get a fully functional data lakehouse running in minutes. We'll walk through a Docker-based setup that combines Delta Lake's ACID guarantees, MinIO's S3-compatible storage, and Spark's powerful processing capabilities – all orchestrated with Apache Hive Metastore for seamless table management.
Architecture Overview
Our setup consists of several key components:
Apache Spark with Delta Lake for data processing and storage
MinIO as S3-compatible object storage
Apache Hive Metastore for managing table metadata
PostgreSQL as the backend database for Hive Metastore
Prerequisites
Docker and Docker Compose
Basic understanding of Apache Spark and SQL
Familiarity with Python programming
Quick Start
About the Spark Delta Image
Our custom Spark Delta image is built on top of the official Spark 3.5.0 image and includes everything you need for a local Delta Lake development environment. Here's what's inside:
Base Image and Core Components:
Based on the official Spark 3.5.0 image
Python 3 with PySpark support
Java 11 runtime
Scala 2.12
Key Features:
Pre-installed Delta Lake dependencies
Configured S3A connector for MinIO
Hadoop AWS libraries for S3 compatibility
Built-in Hive configuration
Optimized for local development
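A minimal Dockerfile for an image along these lines is sketched below; the base tag, Delta Lake version, and jar versions are illustrative assumptions rather than the exact recipe used in this project.

```dockerfile
# Illustrative sketch only; versions, paths, and the pip install step are assumptions
FROM apache/spark:3.5.0

USER root

# pip is not guaranteed in the base image, so install it before the Python dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Delta Lake Python bindings compatible with Spark 3.5.x
RUN pip3 install --no-cache-dir delta-spark==3.1.0

# Hadoop AWS + AWS SDK bundle so the S3A connector can talk to MinIO
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.3.4.jar /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar

# Drop back to the unprivileged spark user
USER spark
```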
Implementation
1. Docker Compose Configuration
First, let's look at our configuration that sets up all required services:
docker-compose.yml
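The full compose file is specific to the project, but a trimmed-down sketch of the four services looks roughly like the following; image tags, ports, and credentials are illustrative assumptions, not the exact values used here.

```yaml
# Illustrative sketch; image tags, ports, and credentials are assumptions
version: "3.8"

services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
      POSTGRES_DB: metastore

  hive-metastore:
    image: apache/hive:3.1.3
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      # JDBC connection settings for Postgres omitted for brevity
    depends_on:
      - postgres
    ports:
      - "9083:9083"   # Thrift endpoint used by Spark

  spark:
    build: .             # the custom Spark Delta image described above
    env_file: .env.spark
    volumes:
      - ./scripts:/opt/spark/scripts
    depends_on:
      - minio
      - hive-metastore
```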
2. Spark Configuration
Spark Environment Configuration
Before starting the environment, create an environment file that sets Spark's resource allocation and performance settings:
.env.spark
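A small example of what such a file might contain; the values below are illustrative defaults, and the variable names are assumptions about how the image consumes them, so size and rename them for your own setup.

```env
# .env.spark : illustrative values only
SPARK_DRIVER_MEMORY=2g
SPARK_EXECUTOR_MEMORY=2g
SPARK_EXECUTOR_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_LOCAL_DIRS=/tmp/spark
```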
The following Python code sets up our Spark session with Delta Lake integration and necessary configurations for MinIO and Hive Metastore:
scripts/spark_config_delta.py
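The original script is not reproduced here; a minimal sketch, assuming MinIO is reachable at http://minio:9000 and the metastore at thrift://hive-metastore:9083 (hostnames and credentials match the compose sketch above and are assumptions), could look like this:

```python
# scripts/spark_config_delta.py : illustrative sketch; endpoints and credentials are assumptions
from pyspark.sql import SparkSession


def create_spark_session(app_name: str = "delta-lakehouse") -> SparkSession:
    """Build a SparkSession wired up for Delta Lake, MinIO (S3A), and the Hive Metastore."""
    return (
        SparkSession.builder.appName(app_name)
        # Delta Lake integration
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # S3A connector pointed at MinIO
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        # Hive Metastore for table metadata
        .config("spark.sql.catalogImplementation", "hive")
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")
        .enableHiveSupport()
        .getOrCreate()
    )


if __name__ == "__main__":
    spark = create_spark_session()
    print(spark.version)
```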
Start all services using Docker Compose:
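For example, from the directory containing the compose file:

```bash
# Build the custom Spark image (if needed) and start all services in the background
docker compose up -d --build
```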
3. Data Ingestion
Upload a few sample CSV files into the MinIO bucket source-data/, as shown below.
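A tiny hypothetical customers.csv is enough to exercise the pipeline; the file name and columns are made up for illustration.

```csv
customer_id,name,country,signup_date
1,Alice,DE,2024-01-15
2,Bob,US,2024-02-03
3,Chandra,IN,2024-02-21
```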
Here's a sample implementation for ingesting CSV files into Delta tables:
scripts/basic_spark_delta.py
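The repository script is not reproduced here; a minimal sketch that reuses the session builder sketched above (the bucket, path, and table names are illustrative assumptions) could look like this:

```python
# scripts/basic_spark_delta.py : illustrative sketch; bucket, path, and table names are assumptions
from spark_config_delta import create_spark_session

spark = create_spark_session("csv-to-delta-ingestion")

# Read every CSV file previously uploaded to the source-data bucket
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://source-data/")
)

# Write the data as a Delta table stored in MinIO and registered in the Hive Metastore
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("path", "s3a://delta-lake/customers")
    .saveAsTable("default.customers")
)

# Read it back through the metastore to verify the round trip
spark.table("default.customers").show()
```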
To run the ingestion script above inside the containers deployed with Docker Compose, submit it to the Spark service.
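One way to do this, assuming the Spark service is named spark and the scripts directory is mounted at /opt/spark/scripts as in the compose sketch above:

```bash
docker compose exec spark \
  spark-submit /opt/spark/scripts/basic_spark_delta.py
```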
4. Spark SQL Session
scripts/spark_sql_session_delta.py
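Tables registered in the Hive Metastore can also be queried with plain Spark SQL. A minimal sketch of such a session, with the database and table names carried over as assumptions from the ingestion example:

```python
# scripts/spark_sql_session_delta.py : illustrative sketch; database and table names are assumptions
from spark_config_delta import create_spark_session

spark = create_spark_session("spark-sql-session")

# List the tables registered in the Hive Metastore
spark.sql("SHOW TABLES IN default").show()

# Run an ordinary SQL query against the Delta table created during ingestion
spark.sql("""
    SELECT country, COUNT(*) AS customers
    FROM default.customers
    GROUP BY country
    ORDER BY customers DESC
""").show()

# Delta-specific commands also work through SQL
spark.sql("DESCRIBE HISTORY default.customers").show(truncate=False)
```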
Conclusion
This setup provides a robust foundation for building modern data applications. The combination of Spark, Delta Lake, MinIO, and Hive Metastore offers a powerful, scalable, and maintainable data platform that can handle various data processing needs while maintaining data consistency and providing SQL capabilities.
Remember to adjust configurations based on your specific needs and scale requirements. This setup can be extended with additional services like Airflow for orchestration or Superset for visualization as your needs grow.