Lambda Architecture 
Analyzing large scale, unstructured, 
dynamic data 
Rajesh Muppalla (@codingnirvana) 
rajesh@indix.com
Indix - Quick Overview 
Am I priced higher or lower w.r.t. my competitor on Nikon D700?
Which product has the UPC 8745354434?
What are all the variants of Apple MacBook Air 13”?
What is the average price change of all Nike shoes in Walmart in the last 3 months?
Data Pipeline @ Indix 
[Diagram: Crawling → Parsing → Classification → Matching → Product & Price Catalog, with ML models in the pipeline and matched products grouped into clusters (C1, C2)]
Data Pipeline @ Indix 
[Diagram components: Product & Price Catalog, Analytics (Precomputes, Insights), Search Index, Experiences]
We released v1.0 of our API today: developer.indix.com
Data is Dynamic 
[Diagram: the same pipeline (Crawling → Parsing → Classification → Matching), shown with an ML Model and a new ML Model, and the product clusters (C1, C2)]
Data Scale 
● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites
Data Pipeline v1.0
Batch using HBase & MapReduce
Problem 1 
Mutable State 
Data Systems should be Human Fault Tolerant
Problem 2 
Compactions 
Random Write databases are hard to manage at large scale
Problem 3 
16 hours 
A 16-hour latency is a lot. We wanted it to be a couple of hours.
Three Problems 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Rethink our data systems
Lambda Architecture
Lambda Architecture 
● An approach to build big data systems 
○ Architectural Components & Principles 
○ Ties Batch & Real Time Systems 
○ General Purpose - Domain Agnostic 
● Coined by Nathan Marz 
○ Ex-Twitter Engineer 
○ Creator of Storm
Data System - Traditional Approach 
[Diagram: Application on top of HBase; HBase is the Source of Truth]
Data System - New Approach 
[Diagram: Immutable Raw Data is the Source of Truth; Processed View(s) are derived from it and serve the Application]
Let’s take an example 
Find the count of unique products in any 
given category for the entire time range
Two Requirements 
● Recomputations 
● Large Scale
Batch Layer Implementation 
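[Diagram: Products Master Data on HDFS (vertical partitioning), stored in hourly partitions from 9 am to 2 pm, including new data; MR Job 1 and MR Job 2; an intermediate view keyed by category (C1 to C5); and a batch view in HBase that serves the query]
Batch view in HBase (category, count):
C1 5
C2 7
C3 4
C4 7
C5 1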
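The batch layer on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Indix implementation: the record layout `(hour, category, product_id)`, the function names, the roles given to the two jobs, and the in-memory structures standing in for HDFS partitions and the HBase view are all hypothetical.

```python
from collections import defaultdict

# Hypothetical master data: immutable (hour, category, product_id) records,
# grouped by hour the way the hourly partitions appear on the slide.
MASTER_DATA = [
    ("9 am", "C1", "p1"),
    ("9 am", "C1", "p2"),
    ("10 am", "C1", "p1"),  # same product seen again: must not be double counted
    ("10 am", "C2", "p3"),
    ("11 am", "C2", "p4"),
]

def build_intermediate_view(records):
    """Job 1 (assumed role): collect the set of distinct products per category."""
    view = defaultdict(set)
    for _hour, category, product_id in records:
        view[category].add(product_id)
    return view

def build_batch_view(intermediate):
    """Job 2 (assumed role): reduce each set to the count that queries read."""
    return {category: len(products) for category, products in intermediate.items()}

batch_view = build_batch_view(build_intermediate_view(MASTER_DATA))
print(batch_view)  # {'C1': 2, 'C2': 2}
```

Because the view is a pure function of the master data, rerunning both jobs over all partitions always yields the same result.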
Handling Recomputations 
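[Diagram: Products Master Data on HDFS (vertical partitioning), stored in hourly partitions from 9 am to 2 pm, including new data; MR Job 1 and MR Job 2; an intermediate view keyed by category (C1 to C5); and a batch view in HBase that serves the query]
Batch view in HBase (category, count):
C1 5
C2 7
C3 4
C4 7
C5 1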
Handling Scale 
● Hadoop HDFS, MapReduce, HBase 
● Proven Linear Scalability
Three Problems (Recap) 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Human Fault Tolerance 
● Bugs in the batch jobs 
○ Discard views & Recompute 
● Bugs in the master data jobs 
○ Re-process the master data to hide the old data 
● Bugs in the query 
○ Re-deploy the query layer 
● Traceability as a side effect
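The "discard views & recompute" point can be illustrated with a small Python sketch. The data and both job functions are hypothetical; the only idea taken from the slide is that a bad view is thrown away and rebuilt from raw data that was never mutated.

```python
# Immutable raw data: (category, product_id) facts that are only ever appended.
RAW_DATA = [("C1", "p1"), ("C1", "p1"), ("C1", "p2"), ("C2", "p3")]

def buggy_count(records):
    """Buggy batch job: counts every record, so duplicates inflate the result."""
    view = {}
    for category, _product_id in records:
        view[category] = view.get(category, 0) + 1
    return view

def fixed_count(records):
    """Corrected batch job: counts distinct products per category."""
    seen = {}
    for category, product_id in records:
        seen.setdefault(category, set()).add(product_id)
    return {category: len(products) for category, products in seen.items()}

view = buggy_count(RAW_DATA)  # the bad view that was deployed
view = fixed_count(RAW_DATA)  # discard it and recompute from the same raw data
print(view)  # {'C1': 2, 'C2': 1}
```

With mutable state the buggy job would have overwritten the only copy of the data; here the raw records are untouched, so the fix is a redeploy and a rerun.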
Operational Complexity 
● No random writes in the batch layer 
○ Bulk Updates to build the batch view
Great. What about Latency?
Speed Layer 
[Diagram: Recent Data → Queue (Kafka) → Real Time Processing (Storm) → Random Writes (Updates) to HyperLogLog Sets in a Read-Write Data Store (Riak, HBase, Cassandra) → Query]
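The speed layer keeps HyperLogLog sets so that unique counts can be updated in place with a small, fixed amount of memory. Below is a minimal HyperLogLog sketch in Python, assuming SHA-1 as the hash function and 2^12 registers; both choices, and the class itself, are illustrative and say nothing about the library or parameters used at Indix.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counter; p index bits give 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Use the first 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: keep the larger value in each register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"product-{i}")
estimate = hll.count()
print(round(estimate))  # close to 10000, but not exact
```

Adding the same product twice leaves the registers unchanged, which is what makes the structure safe for repeated updates from a stream. The error depends on the number of registers, so the figure in this sketch will differ from the one quoted on the slides.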
Speed Layer has mutation... But 
● Speed layer deals with much smaller data 
○ Batch Layer - Months/years of data 
○ Speed Layer - Few hours or 1 day of data 
● Easy to manage operationally 
Complexity Isolation
Final Step - Merging Results 
Batch Layer: C1 - 50000
Speed Layer: C1 - 499 (approximate, with error 0.02%)
Merged Results (Query): C1 - 50499
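The merge step on this slide amounts to combining the two views at query time. A minimal Python sketch, assuming both views are dictionaries keyed by category and that the speed layer only counts products the batch view has not yet seen (the function name is hypothetical):

```python
def merged_count(batch_view, speed_view, category):
    """Query-time merge: exact batch count plus the approximate recent count."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)

batch_view = {"C1": 50000}  # from the last batch run
speed_view = {"C1": 499}    # approximate count for data since that run

print(merged_count(batch_view, speed_view, "C1"))  # 50499
```

Any error in the merged result comes only from the small speed-layer portion, since the batch portion is exact.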
What about Accuracy? 
Batch Layer: C1 - 50000, then C1’ - 50500
Speed Layer: C1 - 499 (approximate, with error 0.02%)
Merged Results (Query): C1’ - 50500
Eventually Accurate
Lambda Architecture
Lambda Architecture @ Indix
Batch Layer @ Indix 
● Pail 
○ Vertical partitioning 
○ Consolidation of small files 
● Scalding 
● Thrift for enforcing schemas 
● HBase/Solr for views 
○ Bulk updates to create views
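Vertical partitioning means laying the master data out in folders so that a job reads only the subset it needs. The Python sketch below illustrates the idea with plain files; it does not use Pail, and the folder scheme (`category=.../hour=...`), file name, and function names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

def partition_path(root, category, hour):
    """Hypothetical layout: one folder per category, then one per hour."""
    return Path(root) / f"category={category}" / f"hour={hour:02d}"

def append_record(root, category, hour, line):
    """Append a record to its partition; existing data is never rewritten."""
    folder = partition_path(root, category, hour)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "part-00000.txt", "a", encoding="utf-8") as out:
        out.write(line + "\n")

def read_category(root, category):
    """Read only the folders for one category, skipping all other data."""
    lines = []
    pattern = "hour=*/part-*.txt"
    for part in sorted((Path(root) / f"category={category}").glob(pattern)):
        lines.extend(part.read_text(encoding="utf-8").splitlines())
    return lines

with tempfile.TemporaryDirectory() as root:
    append_record(root, "C1", 9, "p1")
    append_record(root, "C1", 10, "p2")
    append_record(root, "C2", 9, "p3")
    c1_products = read_category(root, "C1")

print(c1_products)  # ['p1', 'p2']
```

Many small appends produce many small files, which is why the slide lists consolidation of small files alongside partitioning.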
Speed Layer @ Indix 
● Still WIP 
● To reduce latency 
○ Micro batches for Speed layer 
○ Use the last batch run + bulk update views
Open Challenges 
● Managing both Batch & Real Time is still painful 
● Two broad directions 
○ Abstractions 
■ Summingbird (Twitter) 
○ Unified Stack 
■ Spark 
■ Kafka + Samza/Storm (LinkedIn) 
■ Cloud Dataflow (Google)
In Conclusion... 
● Lambda Architecture 
○ A different approach to build data systems 
○ Solid principles 
○ Domain Agnostic 
○ Tools not yet mature
Resources 
● Indix Engineering Blog - http://engineering.indix.com 
● Runaway Complexity in Big Data Systems 
● Lambda Architecture 
● Big Data Book - Manning 
● Scalding 
● Spark 
● Pail 
● Summingbird
Key Takeaways 
- Human Fault Tolerance 
- Complexity Isolation 
- Higher Level Abstractions
Thank You
Batch vs Real Time Choices
Tying it all together - Go-CD
Extras 
● Monoids 
● LA is not new 
○ Search Engines (fast, slow crawl) 
○ Event Sourcing (immutable events to maintain state) 
○ Patch, Audit, Bootstrap
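The slide only names monoids. One common reading is that an aggregate with an associative combine operation and an identity value can be computed per partition and merged in any grouping, which suits both batch and incremental updates. A minimal Python sketch under that reading, using set union over product ids (the class and its method names are hypothetical):

```python
from functools import reduce

class DistinctProducts:
    """A set of product ids; union is associative and the empty set is the identity."""

    def __init__(self, ids=()):
        self.ids = frozenset(ids)

    @staticmethod
    def zero():
        return DistinctProducts()

    def plus(self, other):
        return DistinctProducts(self.ids | other.ids)

    def count(self):
        return len(self.ids)

hourly = [
    DistinctProducts({"p1", "p2"}),
    DistinctProducts({"p2", "p3"}),
    DistinctProducts({"p4"}),
]

# Combine everything in one pass, as a full recomputation would.
all_at_once = reduce(DistinctProducts.plus, hourly, DistinctProducts.zero())

# Combine the first two hours earlier, then fold in the newest hour.
incremental = hourly[0].plus(hourly[1]).plus(hourly[2])

print(all_at_once.count(), incremental.count())  # 4 4
```

Both groupings give the same answer, so partial results from earlier runs can be reused instead of recomputed.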
Problem Statement - Optimization
