Lambda architecture @ Indix

Lambda Architecture
Analyzing large-scale, unstructured, dynamic data
Rajesh Muppalla (@codingnirvana)
rajesh@indix.com

Indix - Quick Overview
Am I priced higher or lower than my competitor on the Nikon D700?
Which product has the UPC 8745354434?
What are all the variants of the Apple MacBook Air 13”?
What is the average price change of all Nike shoes at Walmart in the last 3 months?

Data Pipeline @ Indix
[Diagram: Crawling → Parsing → Classification (ML Model) → Matching (ML Model) → Product & Price Catalog, with records tagged C1 and C2 as they move through the pipeline]

Data Pipeline @ Indix
[Diagram: Product & Price Catalog → Analytics (Precomputes, Insights) and Search Index → Experiences]
We released v1.0 of our API today - developer.indix.com

Data is Dynamic
[Diagram: the same pipeline (Crawling → Parsing → Classification → Matching) with records tagged C1 and C2, where one ML Model is replaced by a new ML Model]

Data Scale
● 400 M product URLs
● 4 TB HTML data crawled daily
● 100 TB data processed daily
● 3000 categories
● 10 B price points
● 2000 sites

Problem 1
Mutable State
Data Systems should be Human Fault Tolerant

Problem 2
Compactions
Random Write databases are hard to manage at large scale

Problem 3
16 hours
A latency of 16 hours is too high. We wanted to bring it down to a couple of hours

Three Problems
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff

Lambda Architecture
● An approach to build big data systems
○ Architectural Components & Principles
○ Ties Batch & Real Time Systems
○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
○ Ex-Twitter Engineer
○ Creator of Storm

Data System - Traditional Approach
[Diagram: Application reads and writes HBase, which is the Source of Truth]

Data System - New Approach
[Diagram: Immutable Raw Data (Source of Truth) → Processed View(s) → Application]
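To make the contrast with the traditional approach concrete, here is a minimal sketch of the idea that raw data is only ever appended and every view is a pure function of it. The `PriceObservation` record, its field names, and the sample values are hypothetical and chosen only for illustration; they are not Indix's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PriceObservation:
    """One immutable fact: a price seen for a product at a point in time."""
    product_id: str
    price: float
    observed_at: int  # hypothetical timestamp; larger means more recent


def latest_price_view(raw: List[PriceObservation]) -> Dict[str, float]:
    """Derive a 'current price' view as a pure function of the raw facts."""
    latest: Dict[str, Tuple[int, float]] = {}
    for obs in raw:
        seen = latest.get(obs.product_id)
        if seen is None or obs.observed_at > seen[0]:
            latest[obs.product_id] = (obs.observed_at, obs.price)
    return {pid: price for pid, (_, price) in latest.items()}


# The raw data is only ever appended to; nothing is updated in place.
raw_data = [
    PriceObservation("nikon-d700", 2399.0, 100),
    PriceObservation("nikon-d700", 2299.0, 200),
    PriceObservation("macbook-air-13", 999.0, 150),
]
view = latest_price_view(raw_data)
```

Because the older Nikon observation is still present in `raw_data`, a different view (for example, price history) could be derived later without having planned for it up front.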

Let’s take an example
Find the count of unique products in any given category for the entire time range

Two Requirements
● Recomputations
● Large Scale

Batch Layer Implementation
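[Diagram: New Data lands in the Products Master Data on HDFS (vertical partitioning by hour: 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm). MR Job 1 and MR Job 2 produce an Intermediate view per category (C1 to C5) and the Batch View in HBase (C1 5, C2 7, C3 4, C4 7, C5 1), which answers the Query.]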
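The example query (unique products per category) can be sketched as two passes over the master data. This is only an illustration in plain Python, not MapReduce code, and it assumes that the first job deduplicates products per category and the second job counts them; the slide names the two jobs but does not spell out how the work is split. The record layout and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (hour partition, category, product id): a hypothetical record layout
Record = Tuple[str, str, str]


def build_intermediate_view(master: Iterable[Record]) -> Dict[str, Set[str]]:
    """First pass (assumed role of MR Job 1): unique products per category."""
    products_by_category: Dict[str, Set[str]] = defaultdict(set)
    for _hour, category, product_id in master:
        products_by_category[category].add(product_id)
    return dict(products_by_category)


def build_batch_view(intermediate: Dict[str, Set[str]]) -> Dict[str, int]:
    """Second pass (assumed role of MR Job 2): one count per category."""
    return {category: len(products) for category, products in intermediate.items()}


master_data = [
    ("9am", "C1", "p1"),
    ("9am", "C1", "p2"),
    ("10am", "C1", "p1"),  # same product seen again in a later hour
    ("10am", "C2", "p3"),
    ("11am", "C2", "p4"),
]
batch_view = build_batch_view(build_intermediate_view(master_data))
```

The batch view is what the query layer reads; the master data itself is never modified by either pass.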

Handling Recomputations
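[Diagram: New Data lands in the Products Master Data on HDFS (vertical partitioning by hour: 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm). MR Job 1 and MR Job 2 produce an Intermediate view per category (C1 to C5) and the Batch View in HBase (C1 5, C2 7, C3 4, C4 7, C5 1), which answers the Query.]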

Handling Scale
● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability

Three Problems (Recap)
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff

Human Fault Tolerance
● Bugs in the batch jobs
○ Discard views & Recompute
● Bugs in the master data jobs
○ Re-process the master data to hide the old data
● Bugs in the query
○ Re-deploy the query layer
● Traceability as a side effect

Operational Complexity
● No random writes in the batch layer
○ Bulk Updates to build the batch view

Speed Layer
[Diagram: Recent Data → Queue (Kafka) → Real Time Processing (Storm) → Random Writes (Updates) → Read-Write Data Store (Riak, HBase, Cassandra) holding HyperLogLog Sets, which answers the Query]
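Since the speed layer keeps HyperLogLog sets for approximate unique counts, a compact sketch of the data structure may help. This is a simplified textbook version (SHA-1 hashing, 2^12 registers, linear counting for small cardinalities) written for illustration; the parameter choices are assumptions, and it is not the implementation used in the talk's Storm topology.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog: approximate distinct counts in fixed memory."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Position of the leftmost 1 bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Union of two sketches: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # linear counting
        return estimate


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"product-{i}")
    hll.add(f"product-{i}")  # duplicates do not change the estimate
estimate = hll.count()
```

The register array has a fixed size regardless of how many products are added, and two sketches can be combined with `merge` without double counting, which is what makes the structure convenient for incremental updates.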

Speed Layer has mutation... But
● Speed layer deals with much smaller data
○ Batch Layer - Months/years of data
○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally
Complexity Isolation

Final Step - Merging Results
Data → Batch Layer: C1 - 50000
Data → Speed Layer: C1 - 499 (approximate, with error 0.02%)
Query → Merged Results: C1 - 50499
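The merge on this slide can be sketched as a query-time function over the two views. The function name and dictionaries are hypothetical. Simple addition is only valid under the assumption that the speed layer counts products not yet covered by the batch view; if both layers store HyperLogLog sets, the sets would be merged instead of the counts added.

```python
from typing import Dict


def merged_count(category: str,
                 batch_view: Dict[str, int],
                 speed_view: Dict[str, int]) -> int:
    """Answer a query by combining the batch view with the speed view."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)


batch_view = {"C1": 50000}   # exact, from the last batch run
speed_view = {"C1": 499}     # approximate, from data that arrived since then
result = merged_count("C1", batch_view, speed_view)
```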

What about Accuracy?
Data → Speed Layer: C1 - 499 (approximate, with error 0.02%)
Data → Batch Layer: C1 - 50000, replaced by C1’ - 50500
Query → Merged Results: C1’ - 50500
Eventually Accurate
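One way to picture "eventually accurate" is that each batch run absorbs the data the speed layer was approximating and the corresponding speed-layer entries are then dropped. The sketch below assumes an hour-based horizon and a hypothetical `ServingState` class; the talk does not describe the expiry mechanism, so treat this as one plausible arrangement.

```python
from typing import Dict


class ServingState:
    """Tracks one category's exact batch count plus recent approximate counts."""

    def __init__(self) -> None:
        self.batch_count = 0
        self.batch_horizon = 0                    # batch view covers hours <= horizon
        self.speed_counts: Dict[int, int] = {}    # hour -> approximate new products

    def record_recent(self, hour: int, approx_new: int) -> None:
        self.speed_counts[hour] = self.speed_counts.get(hour, 0) + approx_new

    def finish_batch_run(self, exact_count: int, horizon: int) -> None:
        """Install a new batch view and drop the speed data it now covers."""
        self.batch_count = exact_count
        self.batch_horizon = horizon
        self.speed_counts = {h: c for h, c in self.speed_counts.items() if h > horizon}

    def query(self) -> int:
        return self.batch_count + sum(self.speed_counts.values())


state = ServingState()
state.finish_batch_run(exact_count=50000, horizon=9)
state.record_recent(hour=10, approx_new=499)
before_batch = state.query()     # batch count plus approximate recent count
state.finish_batch_run(exact_count=50500, horizon=10)
after_batch = state.query()      # exact count once the batch run has caught up
```

Any approximation error introduced by the speed layer therefore lives only until the next batch run covers the same data.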

Batch Layer @ Indix
● Pail
○ Vertical partitioning
○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
○ Bulk updates to create views
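Vertical partitioning and small-file consolidation can be illustrated without Pail itself. The sketch below does not use Pail's API (Pail is a Java library); the directory layout (`type/hour`), the file names, and the helper functions are all hypothetical and only show the idea that a job reads just the directories it needs and that many small files can be merged into one.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def append_record(root: Path, record: dict) -> Path:
    """Write a record into a directory chosen by its attributes."""
    partition = root / record["type"] / record["hour"]
    partition.mkdir(parents=True, exist_ok=True)
    index = len(list(partition.iterdir()))
    path = partition / f"part-{index:05d}.jsonl"
    path.write_text(json.dumps(record) + "\n")
    return path


def read_partition(root: Path, *parts: str) -> List[dict]:
    """Read only the files under the requested sub-directory."""
    records = []
    for path in sorted(root.joinpath(*parts).rglob("*.jsonl")):
        for line in path.read_text().splitlines():
            records.append(json.loads(line))
    return records


def consolidate(partition: Path) -> None:
    """Merge all files in one partition into a single file (not atomic)."""
    files = sorted(partition.glob("*.jsonl"))
    lines = [line for f in files for line in f.read_text().splitlines()]
    for f in files:
        f.unlink()
    (partition / "consolidated.jsonl").write_text("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    append_record(root, {"type": "price", "hour": "9am", "product": "p1", "price": 10.0})
    append_record(root, {"type": "price", "hour": "9am", "product": "p2", "price": 12.5})
    append_record(root, {"type": "product", "hour": "9am", "product": "p1", "category": "C1"})
    before = read_partition(root, "price", "9am")   # product records are never read
    consolidate(root / "price" / "9am")
    after = read_partition(root, "price", "9am")
    files_after = [p.name for p in (root / "price" / "9am").iterdir()]
```

A job interested only in prices for one hour touches a single directory, and consolidation keeps the file count low without changing what a reader sees.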

Speed Layer @ Indix
● Still WIP
● To reduce latency
○ Micro batches for Speed layer
○ Use the last batch run + bulk update views

Open Challenges
● Managing both Batch & Real Time is still painful
● Two broad directions
○ Abstractions
■ Summingbird (Twitter)
○ Unified Stack
■ Spark
■ Kafka + Samza/Storm (LinkedIn)
■ Cloud Dataflow (Google)

In Conclusion...
● Lambda Architecture
○ A different approach to build data systems
○ Solid principles
○ Domain Agnostic
○ Tools not yet mature

Resources
● Indix Engineering Blog - http://engineering.indix.com
● Runaway Complexity in Big Data Systems
● Lambda Architecture
● Big Data Book - Manning
● Scalding
● Spark
● Pail
● Summingbird

Key Takeaways
- Human Fault Tolerance
- Complexity Isolation
- Higher Level Abstractions

Extras
● Monoids
● LA is not new
○ Search Engines (fast, slow crawl)
○ Event Sourcing (immutable events to maintain
state)
○ Patch, Audit, Bootstrap
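The "Monoids" bullet is worth a short illustration, since monoids are the reason partial aggregates from different chunks of data can be combined safely. A monoid is a type with an associative `plus` and an identity `zero`. The `Monoid` class below is a hypothetical minimal sketch in Python, not the API of Summingbird or any other library.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An identity element plus an associative combine operation."""
    zero: T
    plus: Callable[[T, T], T]

    def sum(self, values: Iterable[T]) -> T:
        return reduce(self.plus, values, self.zero)


int_sum = Monoid(0, lambda a, b: a + b)
set_union = Monoid(frozenset(), lambda a, b: a | b)

# Product ids for one category, arriving in two separate chunks.
chunk_a = [frozenset({"p1", "p2"}), frozenset({"p2"})]
chunk_b = [frozenset({"p3"}), frozenset({"p1"})]

# Aggregate each chunk independently, then combine the partial results.
partial_a = set_union.sum(chunk_a)
partial_b = set_union.sum(chunk_b)
combined = set_union.plus(partial_a, partial_b)

# Aggregating everything in one pass gives the same answer.
single_pass = set_union.sum(chunk_a + chunk_b)
total = int_sum.sum([5, 7, 4])
```

Because `plus` is associative, it does not matter whether the chunks are aggregated together or separately and merged afterwards.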

Problem Statement - Optimization
