Multi-Tenant Data Ingestion with Apache Iceberg Views: A Spark-Powered Single Table Design
In this tutorial, I’ll walk you through a comprehensive system for multi-tenant data ingestion using Apache Spark and Apache Iceberg. We’ll implement a single-table design that partitions data by tenant ID and creates specialized Iceberg views to expose each tenant’s data separately.
Architecture Overview
This solution uses:
The workflow:
Step 1: Set Up the Environment
First, let’s set up our Docker-based environment with the following docker-compose.yml:
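The exact compose file will vary, but a minimal sketch along these lines pairs a Spark image that ships with the Iceberg runtime, an Iceberg REST catalog, and MinIO as the S3-compatible warehouse. The image tags, credentials, and ports below are illustrative assumptions, not a fixed configuration:

```yaml
version: "3.8"
services:
  spark-iceberg:
    image: tabulario/spark-iceberg        # Spark with the Iceberg runtime pre-installed (assumed image)
    depends_on:
      - rest
      - minio
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - "8888:8888"                       # notebook / Spark UI

  rest:
    image: tabulario/iceberg-rest         # Iceberg REST catalog (assumed image)
    ports:
      - "8181:8181"
    environment:
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000

  minio:
    image: minio/minio                    # S3-compatible object store for the warehouse
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
```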
This setup provides:
Step 2: Data Ingestion Process
Let’s examine the data ingestion script that reads multi-tenant data and merges it into our partitioned Iceberg table:
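A minimal PySpark sketch of the pattern looks like the following; the catalog name, table name, schema, and source path are assumptions for illustration. It reads an incoming batch that carries a tenant_id column and upserts it into a single Iceberg table partitioned by tenant_id:

```python
from pyspark.sql import SparkSession, functions as F

# Spark session wired to an Iceberg REST catalog named "demo"
# (catalog name, URI, and warehouse layout are assumptions).
spark = (
    SparkSession.builder
    .appName("multi-tenant-ingestion")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")
    .getOrCreate()
)

# Single shared table, partitioned by tenant_id (schema is illustrative).
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.orders (
        order_id   STRING,
        tenant_id  STRING,
        amount     DOUBLE,
        updated_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (tenant_id)
""")

# Incoming multi-tenant batch; the source path and format are illustrative.
incoming = (
    spark.read.json("/home/iceberg/data/orders.json")
    .withColumn("updated_at", F.to_timestamp("updated_at"))
)
incoming.createOrReplaceTempView("incoming_orders")

# Upsert keyed on (tenant_id, order_id): update existing rows, insert new ones.
spark.sql("""
    MERGE INTO demo.db.orders AS t
    USING incoming_orders AS s
    ON t.tenant_id = s.tenant_id AND t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```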
Key aspects of this ingestion process:
Step 3: Creating Iceberg Views for Each Tenant
After ingesting data into our partitioned Iceberg table, we need to create tenant-specific views:
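As a sketch of the idea (the catalog, table, and view-naming convention are assumptions, and Iceberg views require a recent Iceberg release and a catalog, such as the REST catalog, that supports them): the script lists the distinct tenant IDs in the shared table and issues one CREATE OR REPLACE VIEW per tenant, each filtered to that tenant's partition.

```python
from pyspark.sql import SparkSession

# Same catalog configuration as the ingestion sketch (names are assumptions).
spark = (
    SparkSession.builder
    .appName("create-tenant-views")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")
    .getOrCreate()
)

# Find every tenant currently present in the shared table.
tenants = [
    row.tenant_id
    for row in spark.sql("SELECT DISTINCT tenant_id FROM demo.db.orders").collect()
]

# One view per tenant, each restricted to that tenant's partition.
# Tenant IDs are assumed to be simple identifiers safe to embed in SQL.
for tenant_id in tenants:
    spark.sql(f"""
        CREATE OR REPLACE VIEW demo.db.orders_{tenant_id} AS
        SELECT * FROM demo.db.orders
        WHERE tenant_id = '{tenant_id}'
    """)
    print(f"Created view demo.db.orders_{tenant_id}")
```

A tenant's consumers then query only their own view (for example, SELECT * FROM demo.db.orders_tenant_a) and never the shared table directly.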
This script:
Understanding the Multi-Tenant Architecture
Let’s break down the key components of this solution:
Single Table Design
Data Flow
Tenant-Specific Views
Benefits of Iceberg Views for Multi-Tenant Data
Running the Solution
Start the environment:
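Assuming a compose file like the one sketched above, this is a single command:

```bash
# Bring up Spark, the Iceberg REST catalog, and MinIO
docker compose up -d
```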
Create the tenant views:
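The exact entry point depends on the repository layout; as an assumption, submitting the view-creation script inside the Spark container might look like this:

```bash
# Container and script names are assumptions; adjust to the actual repo layout.
docker compose exec spark-iceberg spark-submit /home/iceberg/create_tenant_views.py
```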
GitHub:
Resources for Learning More
Video Tutorials
By using this approach, you can create a highly scalable, efficient, and secure multi-tenant data platform powered by Apache Iceberg and Spark.