Breaking Through Python’s GIL: Scaling Multi-Tenant S3 Archival from Hours to Minutes

The Challenge: Millions of Files Daily

Our multi-tenant application processes an enormous volume of files every day. After processing, we tag each file using put_object_tagging to mark it ready for archival.
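As a rough sketch (the tag key and value names here are illustrative, not our actual schema), the tagging step looks something like this:

```python
def build_archival_tagging(status="ready-for-archival"):
    """Build the Tagging payload that S3's put_object_tagging expects."""
    return {"TagSet": [{"Key": "archival-status", "Value": status}]}


def tag_for_archival(s3_client, bucket, key):
    """Mark one processed file as ready for archival.

    s3_client is a boto3 S3 client, e.g. boto3.client("s3").
    Note: put_object_tagging replaces the object's entire tag set,
    so include any existing tags you need to keep.
    """
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging=build_archival_tagging(),
    )
```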

The business requirement was clear: retain files for 30 days before archiving them to cheaper storage tiers. However, our archival job became the bottleneck that threatened to topple our entire data pipeline.

The Bottleneck: When Threading Isn’t Enough

Initially, our archival job was a single Python process with heavy threading, designed to handle all tenants simultaneously. The results were disappointing:

  • Production scale: archiving all N tenants took a staggering number of hours

This massive performance degradation puzzled me initially. With heavy threading, I expected near-linear scaling, but the reality was far different.

Then it hit me: Python’s Global Interpreter Lock (GIL).

Understanding the GIL Problem

Python’s GIL ensures that only one thread executes Python bytecode at a time, effectively limiting true parallelism in CPU-bound operations. While our archival job involved I/O operations (which can release the GIL), the coordination overhead and limited true concurrency were killing our performance at scale.
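A quick way to see this for yourself (a minimal sketch, not our production code): run the same CPU-bound task through a thread pool and a process pool. The threaded version is serialized by the GIL, while each process gets its own interpreter and can truly run in parallel.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def busy_sum(n):
    """A CPU-bound task: pure Python bytecode, so the GIL serializes it."""
    total = 0
    for i in range(n):
        total += i
    return total


def run_all(executor_cls, workloads):
    """Run the workloads in parallel and return results plus wall time."""
    start = time.perf_counter()
    with executor_cls(max_workers=len(workloads)) as pool:
        results = list(pool.map(busy_sum, workloads))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [2_000_000] * 4
    # Threads: the GIL lets only one run bytecode at a time -> ~serial wall time.
    _, threaded = run_all(ThreadPoolExecutor, jobs)
    # Processes: one interpreter per worker, so the work truly overlaps.
    _, forked = run_all(ProcessPoolExecutor, jobs)
    print(f"threads: {threaded:.2f}s  processes: {forked:.2f}s")
```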

The single-process job simply couldn’t keep up with our aggressive processing pipeline.

The Solution: Divide and Conquer with Process Sharding

Instead of fighting Python’s threading limitations, I decided to embrace process-based parallelism. The strategy was elegantly simple:

1. Tenant Sharding Strategy

Instead of one massive job handling all tenants, I implemented a bin-based sharding approach:

  • Divide: Split all tenants into N equal bins
  • Distribute: Each bin gets its own dedicated Python process
  • Deploy: Launch all processes simultaneously as separate jobs

2. Key-Based Distribution

The sharding logic distributes tenants evenly across bins:

  • Tenant 1, 31, 61… → Bin 1
  • Tenant 2, 32, 62… → Bin 2
  • Tenant 3, 33, 63… → Bin 3
  • And so on…

This ensures balanced workload distribution across all archival processes.
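In code, the bin assignment can be as simple as a modulo over the tenant ID (a sketch with illustrative names; with 30 bins, tenants 1, 31, 61… all land in the same bin, matching the layout above):

```python
def assign_bins(tenant_ids, num_bins):
    """Distribute tenants round-robin into num_bins bins.

    Tenant i goes to bin i % num_bins, so consecutive tenant IDs
    spread evenly across all bins.
    """
    bins = [[] for _ in range(num_bins)]
    for tenant_id in tenant_ids:
        bins[tenant_id % num_bins].append(tenant_id)
    return bins
```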

3. Job Submission Engine Integration

Each bin becomes an independent job submitted to our job submission engine:

  • Bin 1: Archive tenants [1, 31, 61, 91…]
  • Bin 2: Archive tenants [2, 32, 62, 92…]
  • Bin N: Archive tenants [N, N+30, N+60…]
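Our job submission engine is internal, but the launch pattern can be approximated with the standard multiprocessing module (a sketch; archive_tenant is a stand-in for the real per-tenant archival routine):

```python
import multiprocessing as mp


def archive_tenant(tenant_id):
    """Placeholder for the real per-tenant archival work (S3 tagging, copies)."""
    pass


def archive_bin(bin_index, tenant_ids):
    """Worker entry point: archive every tenant assigned to this bin."""
    for tenant_id in tenant_ids:
        archive_tenant(tenant_id)
    return bin_index


def launch_archival_jobs(bins):
    """Start one independent process per bin and wait for all of them.

    Each bin runs in its own interpreter, so a failure in one bin
    doesn't stop the others (fault isolation).
    """
    procs = [
        mp.Process(target=archive_bin, args=(i, tenants))
        for i, tenants in enumerate(bins)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    exit_codes = launch_archival_jobs([[1, 31, 61], [2, 32, 62], [3, 33, 63]])
    print(exit_codes)
```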

The Transformation: From Hours to Minutes

True Parallelism: By using separate processes instead of threads, we bypass Python’s GIL entirely. Each process has its own interpreter and memory space, enabling true parallel execution.

Resource Optimization

Instead of one process competing for resources, we have multiple processes that can:

  • Utilize multiple CPU cores effectively
  • Handle I/O operations concurrently without GIL contention
  • Scale horizontally with available infrastructure

Fault Isolation

If one bin fails, it doesn’t bring down the entire archival operation. Other bins continue processing independently.

Linear Scalability

Need to process more tenants? Simply increase the number of bins. The architecture scales linearly with available compute resources.

Lessons Learned

  1. Threading ≠ Parallelism: Python’s GIL makes threading less effective for CPU-bound or coordination-heavy tasks
  2. Process-based scaling: Sometimes the solution isn’t optimizing existing code, but rethinking the architecture
  3. Horizontal scaling: Dividing work across processes can be more effective than vertical optimization
  4. Testing at scale: Performance characteristics can change dramatically between small and large datasets

The Bottom Line

When dealing with high-volume data processing in Python, don’t let the GIL limit your scale. By rethinking our architecture from a single-process, multi-threaded approach to a multi-process, sharded approach, we transformed our archival pipeline from an hours-long bottleneck into a minutes-long job.

The key insight: sometimes the best optimization is architectural, not algorithmic.

The aggressive sharding strategy proves that with the right architecture, even Python’s GIL limitations can be overcome at enterprise scale.

