Learning System Design Through Real Systems: 9 Architectures Worth Your Time


System design is not an academic exercise. It’s the architecture of decision-making under constraints — budgetary, technical, human. When you’re expected to build scalable products or lead backend efforts, your understanding of how large systems hold together shifts from nice-to-have to essential.

And make no mistake: this isn’t something you master by watching a few videos or scanning a cheat sheet of acronyms.

The truth is, system design thinking is formed through exposure to real-world architecture — warts and all. That’s why studying real systems built and operated at scale gives you far more insight than hypothetical diagrams.

Here are nine architectures worth your time. Use them as benchmarks. Use them as springboards.

Set aside 30 minutes every third day. No skipping. No summarizing. Read, diagram, critique.

You’ll learn more in a month than most candidates do in six.


1. YouTube: Scaling with MySQL, Reluctantly and Deliberately

YouTube runs at a scale most backends will never approach and still relies heavily on MySQL. That’s not a mistake. At the scale of billions of video requests and asynchronous writes, they leaned into operational maturity instead of novel architecture.

But they couldn’t rely solely on vertical scaling or basic read replicas. So they implemented:

  • Horizontal sharding by user account and asset
  • Frontend-layer proxies that route queries to the right cluster
  • Separation between metadata (MySQL) and binary data (GFS or similar blob store)

Errors were expected. Failover strategies weren't optional. Teams had to know how to route around failed shards, recover from stale or orphaned replicas, and regularly evaluate long-tail query performance.
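
A minimal sketch of that routing idea, assuming hash-based sharding on the user ID — the shard map, DSNs, and connection handling here are illustrative, not YouTube's actual setup:

```python
import hashlib

# Hypothetical shard map: shard index -> MySQL cluster DSN (names are made up).
SHARD_DSNS = [
    "mysql://metadata-shard-0.internal/yt",
    "mysql://metadata-shard-1.internal/yt",
    "mysql://metadata-shard-2.internal/yt",
    "mysql://metadata-shard-3.internal/yt",
]

def shard_for_user(user_id: str) -> str:
    """Map a user ID to a shard deterministically via a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

def route_query(user_id: str, sql: str, params: tuple) -> None:
    """Send the query to the shard that owns this user's metadata.
    A real proxy would also detect failed shards and retry or reroute."""
    dsn = shard_for_user(user_id)
    # connect(dsn).execute(sql, params) would go here with a real MySQL client.
    print(f"routing to {dsn}: {sql} {params}")

route_query("user-8675309", "SELECT title FROM videos WHERE owner_id = %s", ("user-8675309",))
```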

Takeaway? SQL isn’t obsolete. It just has a breaking point. With sharding, it’s still a valid backbone — even for platforms like YouTube.


2. Google Ads: Mature SQL in a Low-Latency Ecosystem

Google Ads doesn’t operate like most analytics platforms. Ad selection happens in real time, and the serving system has to respond within the latency budget of the auction itself.

That means:

  • OLTP stores optimized for predictable latency
  • Periodic flushing to OLAP layers like Bigtable or Dremel
  • Query gateways that reroute expensive calls to asynchronous jobs

SQL survives here because the data model changes rarely. Campaigns don’t shift by the second. But traffic patterns do.

What most people don’t realize is how much tuning goes into SQL performance at this level — indexes fine-tuned to the query planner, intermediate caches at the query layer, and roll-up preprocessing to offload peak daytime traffic.
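
Here’s a rough sketch of the query-gateway idea from the list above — serve cheap queries on the low-latency path, push expensive ones to asynchronous jobs. The cost threshold and queue are illustrative stand-ins, not Google’s actual machinery:

```python
import queue
from dataclasses import dataclass

@dataclass
class Query:
    sql: str
    estimated_rows: int  # hypothetical cost signal from a planner or heuristic

LATENCY_BUDGET_ROWS = 10_000             # illustrative threshold, not a real number
async_jobs = queue.Queue()               # stand-in for an offline/OLAP job queue

def handle(q: Query) -> str:
    """Serve cheap queries on the low-latency path; defer expensive ones."""
    if q.estimated_rows <= LATENCY_BUDGET_ROWS:
        return f"OLTP fast path: {q.sql}"
    async_jobs.put(q)                    # a worker would run this against the OLAP layer
    return "accepted: results will be materialized asynchronously"

print(handle(Query("SELECT spend FROM campaigns WHERE id = 42", 1)))
print(handle(Query("SELECT region, SUM(spend) FROM impressions GROUP BY region", 5_000_000)))
```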

There’s no one trick. Just lots of small disciplines under the surface.


3. Slack Messaging: Durable Streams with Real-Time Expectations

Slack depends on an architecture that guarantees two things: your messages always go through, and they arrive in the right order — even if you’re on mobile with two bars at a train station.

The core pieces are:

  • WebSocket connections that remain open across client sessions
  • Kafka (or equivalent) for producing and consuming message streams
  • Device token management for per-session retries and ordered delivery

Multiple devices? They all subscribe to your message stream. Synchronization is the hard part. What if one is offline? What if another is delayed behind a firewall?

Messages aren’t stored on the frontend. They drop into durable, offset-based queues and are confirmed at the client level.
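 
A toy sketch of that offset-based delivery model, with an in-memory log standing in for Kafka and explicit client acknowledgments — not Slack’s actual code:

```python
from collections import defaultdict

class MessageStream:
    """Toy stand-in for a durable, offset-based log (think one Kafka partition)."""

    def __init__(self):
        self.log = []                    # append-only message log
        self.acked = defaultdict(int)    # device_id -> next offset to deliver

    def publish(self, message: str) -> int:
        self.log.append(message)
        return len(self.log) - 1         # offset of the new message

    def fetch(self, device_id: str):
        """Everything this device has not yet acknowledged, in order."""
        start = self.acked[device_id]
        return list(enumerate(self.log))[start:]

    def ack(self, device_id: str, offset: int) -> None:
        """Client confirms delivery; redelivery resumes after this offset."""
        self.acked[device_id] = max(self.acked[device_id], offset + 1)

stream = MessageStream()
stream.publish("standup moved to 10:30")
stream.publish("deploy is done")
for offset, msg in stream.fetch("phone-1"):   # a device that was offline catches up in order
    print(offset, msg)
    stream.ack("phone-1", offset)
```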

Real-time messaging isn’t about speed. It’s about predictability under unreliable conditions.


4. Meta (Facebook): Cache Consistency That Actually Works

You don’t keep three billion users online without strong caching strategies — but that’s not the headline. What matters is keeping those caches correct under race conditions, maintenance windows, and cross-datacenter replication.

Their approach:

  • Write-through cache to reduce race surfaces between origin and cache
  • Pub-sub propagation for invalidation events across zones
  • Version tokens to guard stale reads without delaying fast-path requests

It’s not Transactional Cache™. It’s lived-in, patched, refined through years of incidents.

And the best part: they assume eventual inconsistency and build mechanisms to reconcile it. You don’t see that in theoretical designs because theoretical designs aren’t evaluated under fire.
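
As a toy illustration of the version-token idea — a write-through cache that drops out-of-order writes and treats too-old entries as misses. The data structures here are illustrative, not Meta’s:

```python
class VersionedCache:
    """Toy write-through cache: each entry carries a version token, out-of-order
    writes are dropped, and reads older than the caller's known version miss."""

    def __init__(self):
        self.store = {}                  # key -> (version, value)

    def write_through(self, key, value, version):
        # In the real pattern the origin store is written first, then the cache.
        current_version, _ = self.store.get(key, (0, None))
        if version > current_version:    # ignore stale or reordered invalidations
            self.store[key] = (version, value)

    def read(self, key, min_version=0):
        entry = self.store.get(key)
        if entry is None or entry[0] < min_version:
            return None                  # stale or missing: fall back to the origin
        return entry[1]

cache = VersionedCache()
cache.write_through("user:42:name", "Ada", version=7)
cache.write_through("user:42:name", "Grace", version=6)   # late event, dropped
print(cache.read("user:42:name", min_version=7))            # -> Ada
```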


5. Bitly: Small URLs, Non-Trivial Infrastructure

At the surface, Bitly takes a long link and returns a short one.

Underneath:

  • Unique keys are generated using Base62 encoding on high-throughput counters
  • Requests are cached massively at the edge (Cloudflare, Akamai)
  • Abuse handling is embedded in the core request path — not delegated to a separate fraud service

Bitly has to be fast. But it also has to stop bad actors from generating a million links a minute. They strike a balance with throttling headers, abuse-pattern flagging, and batch log aggregation.
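
The key-generation piece, at least, is simple to sketch: Base62-encode a monotonically increasing counter. The counter below is a plain integer; in production it would come from a partitioned, high-throughput sequence service:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def base62(n: int) -> str:
    """Encode a non-negative counter value as a short Base62 slug."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

print(base62(125))             # -> '21'
print(base62(57_360_863_210))  # a much larger counter still yields a short slug
```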

It’s not architecture magic — it’s operational honesty that keeps it upright.


6. Real-Time Leaderboards: A Lesson in Constraint

Game devs edge right into the distributed-systems world the moment they build a leaderboard.

Problem: scoring is update-heavy. Millions of players try to beat the board in real time, and they all want to see themselves inch up with each level. That creates high churn in lists that must remain correctly ordered and deterministic.

Typical pattern:

  • Redis ZSETs for high-speed ordered inserts
  • Fan-out architecture to notify relevant clients
  • Periodic sync to durable cold storage for recovery

ZSETs break under specific size and eviction pressure, so they implement tiered sets — one per region/time window — and periodically merge the rank positions.
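
A minimal sketch of that pattern with redis-py, assuming a local Redis instance (6.2+ for ZADD GT) and hourly window keys as the tiering scheme:

```python
import time
import redis  # assumes the redis-py client and a Redis server >= 6.2 (for ZADD GT)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def window_key() -> str:
    """One sorted set per hourly window keeps any single ZSET bounded."""
    return f"leaderboard:{int(time.time() // 3600)}"

def submit_score(player: str, score: int) -> None:
    # GT keeps only a player's best score within this window.
    r.zadd(window_key(), {player: score}, gt=True)

def top(n: int = 10) -> list:
    # Highest scores first; a merge job would later fold windows into an all-time board.
    return r.zrevrange(window_key(), 0, n - 1, withscores=True)

submit_score("alice", 9100)
submit_score("bob", 8700)
print(top(3))
```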

No leaderboard stays flat unless it’s already broken.


7. AWS at 10 Million Users: Budget vs. Scale, Daily

Here’s how an actual team approached this: they built a product that scaled from 10K to 10M MAUs inside a year, and nothing about their system stayed still.

They started serverless:

  • Lambda for all APIs
  • DynamoDB as primary store
  • CloudFront for caching

Two months in, the cold start latency hit a wall. So they pivoted:

  • Load balancers feeding EC2 auto-scaling pools
  • Redis cache colocated with the compute nodes
  • PostgreSQL for transactional events needing indexed queries

Contrary to AWS guides, they found the best setup required mixing multiple compute layers — not choosing one.
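
One piece of that mix, sketched roughly: cache-aside Redis in front of the indexed PostgreSQL queries. The key names, TTL, and the load_from_postgres stand-in are illustrative assumptions, not their actual code:

```python
import json
import redis  # assumes the redis-py client and a Redis node near the compute tier

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 300  # illustrative TTL; in practice tuned per endpoint

def load_from_postgres(event_id: str) -> dict:
    """Stand-in for the indexed PostgreSQL query; swap in a real client call."""
    return {"id": event_id, "status": "confirmed"}

def get_event(event_id: str) -> dict:
    key = f"event:{event_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                    # fast path: served from Redis
    row = load_from_postgres(event_id)            # slow path: transactional store
    cache.set(key, json.dumps(row), ex=TTL_SECONDS)
    return row

print(get_event("evt_1001"))
```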

Lesson: don’t scale by copying Cloud Architecture 101. Scale by measuring the pain and adjusting infrastructure accordingly.


8. Cloudflare’s 55M RPS with PostgreSQL: In the Trenches

Cloudflare needs to do two things at the same time:

  1. Field 55 million requests per second.
  2. Log enough of that to review when something breaks.

Their Postgres layer backs the configuration that flows out to the edge — block lists, DNS routes, zero-day rule propagation.

They use:

  • Logical partitioning with lots of small Postgres shards, not one monster instance
  • Hot failovers with automated replica promotion
  • Queued writes over append-only logs

Their dashboards don’t query production data — they pull from streaming read feeds. That reduces interference. You just don’t often hear that side of the story.
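
A toy version of the queued-write idea from the list above: request handlers append write intents to a log and return, while a background applier drains them into the right shard. Everything here is a stand-in, not Cloudflare’s implementation:

```python
import queue
import threading

# Toy version of "queued writes over append-only logs": handlers append intents,
# a background applier drains them into the right Postgres shard in order.
write_log = queue.Queue()

def enqueue_config_change(zone_id: str, change: dict) -> None:
    """Hot path: record the write intent and return immediately."""
    write_log.put((zone_id, change))

def applier() -> None:
    """Background worker: apply intents to the (hypothetical) shard for each zone."""
    while True:
        zone_id, change = write_log.get()
        # execute_on_shard(zone_id, change) would run the actual INSERT/UPDATE here.
        print(f"applied to shard for zone {zone_id}: {change}")
        write_log.task_done()

threading.Thread(target=applier, daemon=True).start()
enqueue_config_change("zone-abc", {"rule": "block", "cidr": "203.0.113.0/24"})
write_log.join()   # wait until the applier has caught up
```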

These are blue-collar databases doing hard work, patched carefully, debugged weekly.


9. Distributed Counters: Simple Until They’re Not

If all you need is to count how many times someone clicked a button, you probably start by incrementing a key. That gets you through the first 10,000 users.

What about 50 million?

Trouble begins when:

  • Updates race each other in cross-zone calls
  • Connections fail mid-request
  • One batch of updates gets rolled back without reconciling the counter upstream

To fix this in real production systems:

  • Store counters in CRDTs (conflict-free replicated data types)
  • Assign updates to per-partition logical nodes
  • Merge deltas through high-confidence quorum reads

Crucially: the counter doesn’t pretend to be atomic unless it is. Everything else is a best-effort update with built-in reconciliation logic.
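
A minimal G-Counter — the textbook grow-only CRDT — shows what that looks like in practice: each node increments only its own slot, and replicas converge by merging per-node maximums. This is the standard construction, not any particular production system:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot, and
    replicas converge by taking per-node maximums when they merge."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}                 # node_id -> count observed from that node

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

# Two zones count clicks independently, then reconcile.
us_east, eu_west = GCounter("us-east"), GCounter("eu-west")
us_east.increment(3)
eu_west.increment(5)
us_east.merge(eu_west)       # merging again, or in a different order, gives the same total
print(us_east.value())       # -> 8
```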

Distributed counters work right when they know when they’re wrong.


What You’ll Actually Learn Studying These Systems

Most people read system design blogs and learn buzzwords: write-through cache, failover listener, CRDT, durable pub/sub.

You’ll get none of that from this exercise. You’ll get friction, and that’s more valuable.

You’ll understand why it’s hard to shard once the system’s already live. Why consistent hashing isn’t always consistent. Why caching solves 80% of read problems and creates 100% of consistency headaches.

These systems weren’t designed in study halls. They evolved — through outages, CDN misconfigurations, race conditions, deployment mistakes, and very long incident reviews.

Re-read that last part. That’s the real curriculum.

Make this your plan:

  • Choose a system from this list.
  • Spend three days reading everything you can about how it works — not summary blogs, the real engineering posts.
  • Write a half-page critique of the trade-offs that system made and why.
  • Then move on to the next one.

At the end of one month, you’ll see architecture differently. Not as a collection of patterns — but as realistic trade-offs shaped by limits nobody talks about until they’re already in trouble.

That’s what real system design looks like in the wild.


If you have any questions or need clarification, feel free to leave a comment on this blog or reach out to me directly:

Topmate: https://topmate.io/yash0307jain

You can read more blogs on Medium

Medium: https://medium.com/@yash0307jain

Thanks for reading, and I’ll see you next time!

