Docker - Spark + Delta Lake + MinIO + Hive Metastore: Your Quick-Start Guide to Building a Modern Data Platform

Are you looking for a quick way to get started with Delta Lake, MinIO, and Apache Spark? Whether you're building a proof of concept or setting up a development environment, this guide will help you get a fully functional data lakehouse running in minutes. We'll walk through a Docker-based setup that combines Delta Lake's ACID guarantees, MinIO's S3-compatible storage, and Spark's powerful processing capabilities – all orchestrated with Apache Hive Metastore for seamless table management.

Architecture Overview

Our setup consists of several key components:

  • Apache Spark with Delta Lake for data processing and storage

  • MinIO as S3-compatible object storage

  • Apache Hive Metastore for managing table metadata

  • PostgreSQL as the backend database for Hive Metastore

Prerequisites

  • Docker and Docker Compose

  • Basic understanding of Apache Spark and SQL

  • Familiarity with Python programming

Quick Start

About the Spark Delta Image

Our custom Spark Delta image is built on top of the official Spark 3.5.0 image and includes everything you need for a local Delta Lake development environment. Here's what's inside (a sketch of such a Dockerfile follows the feature list below):

Base Image and Core Components:

  • Based on the official Apache Spark 3.5.0 image

  • Python 3 with PySpark support

  • Java 11 runtime

  • Scala 2.12

Key Features:

  • Pre-installed Delta Lake dependencies

  • Configured S3A connector for MinIO

  • Hadoop AWS libraries for S3 compatibility

  • Built-in Hive configuration

  • Optimized for local development
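To make the image contents concrete, here is a hypothetical Dockerfile along the lines described above. The base tag, library versions, and the conf/hive-site.xml path are assumptions rather than the exact file from this setup; check the Delta Lake / Spark compatibility matrix before pinning versions.

Dockerfile (illustrative sketch)

FROM apache/spark:3.5.0

USER root

# Install pip if the base variant does not ship it, plus curl for fetching jars
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Delta Lake Python bindings (delta-spark 3.x pairs with Spark 3.5.x)
RUN pip3 install --no-cache-dir delta-spark==3.0.0

# Hadoop AWS + matching AWS SDK bundle so the S3A connector can talk to MinIO
RUN curl -fsSL -o /opt/spark/jars/hadoop-aws-3.3.4.jar \
      https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar && \
    curl -fsSL -o /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar \
      https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar

# Hive client configuration pointing Spark at the metastore service (assumed path)
COPY conf/hive-site.xml /opt/spark/conf/hive-site.xml

USER spark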

Implementation

1. Docker Compose Configuration

First, let's look at the Docker Compose configuration that sets up all the required services:

docker-compose.yml
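The full file lives in the repository; the sketch below shows its general shape. The service names, images, ports, and credentials (postgres, hive-metastore, minio, spark) are illustrative assumptions to adapt to your own layout.

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
      POSTGRES_DB: metastore
    volumes:
      - pg-data:/var/lib/postgresql/data

  hive-metastore:
    image: apache/hive:3.1.3
    depends_on:
      - postgres
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver
        -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/metastore
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=hive
    # Mount a PostgreSQL JDBC driver into /opt/hive/lib if your image does not bundle one
    ports:
      - "9083:9083"

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio-data:/data

  spark:
    build: .                     # the custom Spark Delta image described above
    env_file: .env.spark
    command: tail -f /dev/null   # keep the container alive for exec/spark-submit
    depends_on:
      - minio
      - hive-metastore
    volumes:
      - ./scripts:/opt/spark/scripts
    ports:
      - "4040:4040"              # Spark UI

volumes:
  pg-data:
  minio-data: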

2. Spark Configuration

Spark Environment Configuration

Before starting the environment, create an .env.spark file to configure Spark's resource allocation and performance settings:

.env.spark
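A minimal example of what such a file might contain. The variable names below are standard spark-env settings picked up by spark-submit, but treat the specific names and values as placeholders to tune for your machine:

# Illustrative .env.spark -- adjust to your hardware
SPARK_DRIVER_MEMORY=2g
SPARK_EXECUTOR_MEMORY=2g
SPARK_EXECUTOR_CORES=2
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_CORES=2
SPARK_LOCAL_DIRS=/tmp/spark-scratch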

The following Python code sets up our Spark session with Delta Lake integration and necessary configurations for MinIO and Hive Metastore:

scripts/spark_config_delta.py
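The repository's script isn't reproduced here, but a minimal sketch would look like the following. The MinIO endpoint, credentials, warehouse bucket, and the create_spark_session helper name are assumptions that match the compose sketch above:

# scripts/spark_config_delta.py (illustrative sketch)
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip


def create_spark_session(app_name: str = "delta-minio-hive") -> SparkSession:
    builder = (
        SparkSession.builder.appName(app_name)
        # Delta Lake SQL extensions and catalog
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # S3A connector pointed at MinIO instead of AWS S3
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        # Hive Metastore for table metadata (could also come from hive-site.xml)
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")
        # Default location for managed tables (bucket assumed to exist in MinIO)
        .config("spark.sql.warehouse.dir", "s3a://warehouse/")
        .enableHiveSupport()
    )
    # Resolves the matching Delta Lake jars when the session starts
    return configure_spark_with_delta_pip(builder).getOrCreate()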

Start all services using Docker Compose:
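For example, from the project root (flags as commonly used; adjust to taste):

docker compose up -d --build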

3. Data Ingestion

Place some sample CSV files into the MinIO bucket source-data/ (for example, via the MinIO console or the mc client).

Here's a sample implementation for ingesting CSV files into Delta tables:

scripts/basic_spark_delta.py
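As an illustration, an ingestion script along these lines could look like the sketch below. The bronze database, sample_data table, warehouse bucket, and the import of create_spark_session from the config sketch above are all assumptions:

# scripts/basic_spark_delta.py (illustrative sketch)
from spark_config_delta import create_spark_session

spark = create_spark_session("csv-to-delta-ingestion")

# Read the raw CSV files staged in the MinIO bucket
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://source-data/")
)

# Register a database in the Hive Metastore backed by a MinIO location
spark.sql("CREATE DATABASE IF NOT EXISTS bronze LOCATION 's3a://warehouse/bronze'")

# Write the data as a managed Delta table; overwrite keeps demo runs repeatable
(
    df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.sample_data")
)

print(f"Ingested {spark.table('bronze.sample_data').count()} rows")
spark.stop()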

To execute the script above inside the deployed Spark container, use docker compose exec against the spark service.
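Assuming the scripts directory is mounted at /opt/spark/scripts as in the compose sketch:

docker compose exec spark /opt/spark/bin/spark-submit /opt/spark/scripts/basic_spark_delta.py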

4. Spark SQL Session

scripts/spark_sql_session_delta.py
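Once tables are registered in the Hive Metastore, any new Spark session can query them with plain SQL. A minimal sketch, reusing the hypothetical names from the ingestion example:

# scripts/spark_sql_session_delta.py (illustrative sketch)
from spark_config_delta import create_spark_session

spark = create_spark_session("spark-sql-session")

# Tables registered in the Hive Metastore are immediately visible
spark.sql("SHOW TABLES IN bronze").show()

# Standard SQL over the Delta table
spark.sql("""
    SELECT *
    FROM bronze.sample_data
    LIMIT 10
""").show(truncate=False)

# Delta-specific features such as table history / time travel are also available
spark.sql("DESCRIBE HISTORY bronze.sample_data").show(truncate=False)

spark.stop()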

Conclusion

This setup provides a robust foundation for building modern data applications. The combination of Spark, Delta Lake, MinIO, and Hive Metastore offers a powerful, scalable, and maintainable data platform that can handle various data processing needs while maintaining data consistency and providing SQL capabilities.

Remember to adjust configurations based on your specific needs and scale requirements. This setup can be extended with additional services like Airflow for orchestration or Superset for visualization as your needs grow.
