Building a Data Pipeline - Case studies
Amit Sharma, Director @NoBroker
Raam Baranidharan, Associate Director @Treebo
Jitendra Agrawal, VP Technology @LendingKart
18th August 2018
Building data pipelines @NoBroker
Amit Sharma, Director Engineering
Why a Data Pipeline?
• Business needs data.
• Analytics is computationally taxing.
• Data exists on multiple platforms:
• CRM
• Apps
• Calls
• 3rd-party webhooks
What is a Data Pipeline?
• Moving, joining, and re-formatting data between systems.
• A data pipeline is the sum of all these steps.
• Its job is to ensure that all these steps happen reliably, on all data.
Parts and Processes of a Data Pipeline
• Sources
• Joins
• Extractions
• Standardization/Corrections
• Loads
• Automation
(A minimal sketch of these steps follows.)
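These parts compose in a fixed order: extract from sources, standardize/correct, join, and load, with automation wrapping the whole run. A minimal, hedged sketch in Python (not NoBroker's actual code; all table and field names here are hypothetical):

```python
# Minimal illustrative pipeline: extract -> standardize -> load.
# All table/field names are hypothetical; a real pipeline adds joins across
# sources, retries, monitoring, and scheduling (the "automation" part).
import sqlite3

def extract(conn, query):
    """Pull raw rows out of one source system."""
    return conn.execute(query).fetchall()

def standardize(rows):
    """Correct and normalize records so different sources can be joined."""
    return [(rid, city.strip().lower()) for rid, city in rows]

def load(conn, rows):
    """Write cleaned rows to the destination store."""
    conn.executemany("INSERT INTO leads_clean VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")   # stand-in for a CRM database
    src.execute("CREATE TABLE leads (id INTEGER, city TEXT)")
    src.execute("INSERT INTO leads VALUES (1, '  Bangalore ')")
    dst = sqlite3.connect(":memory:")   # stand-in for the warehouse
    dst.execute("CREATE TABLE leads_clean (id INTEGER, city TEXT)")
    load(dst, standardize(extract(src, "SELECT id, city FROM leads")))
    print(dst.execute("SELECT * FROM leads_clean").fetchall())
```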
How We Use the Data Pipeline
[Architecture diagram] Sources (Website, Mobile, CRM, Calls, User Footprint, Third-Party Data, Property Data) feed the Prism ML Engine and its inference systems:
• DATA HARVESTER
• NB ESTIMATES - Rentometer, Prop Worth, Lifestyle Score, Commute Score
• SIA - Synthetic Language Generator
• OCULUS - Image Intelligence
• LITMUS - Text Profanity Engine
• MASS - Property Quality Estimates
• HURACAN - Lead Intelligence
• JARVIS - Automatic Speech Recognition Engine (NB Voice Asst.)
• TROUBLESHOOT - Sentiment Analysis Engine
• DEEP LEON - Large-Scale Deep Learning Framework
These power downstream applications: HOROSCOPE, QUICKSILVER, CLICK & EARN, SMART SALES, NB BOT, SMART FOLLOW UPS, DEMAND-SUPPLY ANALYTICS, SMART GRIEVANCE REDRESSAL, and VIGILANTE (some marked work in progress).
Lessons from Building a Data Pipeline
• With analytics data, scale matters: one server is never enough.
• Once a data pipeline is the source of truth, reliability matters.
• Without enrichments, it is hard to derive insights.
Any Questions?
Data World @ Treebo
A perspective & some musings
Sneh
Presentation flow
❖ GROUND 0: The initial 3 slides prepare the ground
❖ GET: The following 3 slides set the broader context around:
➢ The problem statement
➢ Some fundamentals around systems, storage & general considerations
❖ SET: The intermediate 2 slides deep-dive into:
➢ The different phases, and the choices of AWS technology for them
❖ GO: The concluding 2 slides talk about:
➢ The overall architecture, put together
➢ Progressive thoughts
What is Data Strategy (DS)? .. ⅓
A set of techniques around the collection, storage, and usage of data, in a manner that the data can serve not only the operations of a company, but also open up potential additional monetisation avenues in the future.
A good DS has to be actionable and, at the same time, evolutionary to adjust to
disruptive market forces.
DS always has to be business-driven, never technology-driven.
Is DS worth it? .. ⅔
The data that we, as Treebo, own is a resource that has economic value and we expect
it to provide future benefit just like any other asset.
But, generally, data is an under-managed, underutilized asset, because it doesn't feature in the company's P&L at book closing.
To look at it differently: just as we have a people-focused strategy to retain employees (our asset), a data-focused strategy is required to retain good data (our asset, again)!
Without DS, we will be forced to deal with myriad data-related
initiatives taken up by various business groups.
High-level Framework for DS .. 3/3
❖ Planning and discovery: identify business objectives, key stakeholders & scope.
❖ Current-state assessment: focus on business processes, data sources, technology stack & policies.
❖ Analysis, prioritization and roadmap: analyse requirements, set criteria for prioritization & lay out the initiative roadmap.
❖ Change management: encompasses organizational change, cultural change, technology change, and changes in business processes.
Problem Statement .. GET ⅓
To design a highly scalable, highly available, low-latency data platform that can capture, move and transform transactions with zero data loss, and that supports replay when required.
So, essentially, a system needs to be designed which is/has:
Highly scalable; highly available; low latency; zero data loss; replay capability
Golden Rules .. GET ⅔
❖ Do not go distributed if the data size is small enough.
Any distributed system takes 10 years to debug; any database takes 10 years to debug; and
any distributed database is still being debugged!
❖ Do not go streaming if batches serve well.
The above two rules hold true for practically all data initiatives.
Revisiting Some Fundamentals .. GET 3/3
❖ Processing: scalability, availability, consistency, latency, durability, disaster recovery
❖ Source: RDBMSs, and how they fare on the features above
❖ Hosting: on-premise / hybrid / cloud
Different phases .. SET ½
❖ Source: logs, tools, open libraries, proprietary solutions
❖ Fetches/transient storage: ordering, delivery semantics, consistency guarantees, schema evolvability (see the producer sketch below)
❖ Processing: in-flight / at the destination; batch; (near) real time
❖ Destination: hardware; SQL (No/New); columnar
❖ Cache: optional!
❖ Visualisation: various options suited to the use case
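Ordering and delivery semantics in the fetch/transient-storage phase are largely decided at produce time. A hedged sketch with the kafka-python client (the broker address and topic name are assumptions; acks/retries buy an at-least-once guarantee, which pushes deduplication to consumers):

```python
# Sketch: per-key ordering with Kafka via kafka-python. Messages sharing a
# key hash to the same partition, so their relative order is preserved.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",   # wait for full replication: better durability/delivery
    retries=5,    # retries => at-least-once; consumers must tolerate dupes
)

# Hypothetical topic and event, for illustration only.
producer.send("booking-events", key="booking-42", value={"status": "confirmed"})
producer.flush()
```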
Different phase choices .. SET 2/2
Treebo Architecture .. GO ½
Progressive thoughts .. GO 2/2
❖ Append-only event logging for immutability (Kappa architecture)
❖ Ensure idempotency
❖ Custom checkpointing for better replay
❖ Specialised storage formats
❖ Data governance & workload management
❖ Transition higher up the matured analytics value pyramid
(A toy sketch of the first three points follows.)
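The first three points fit together; as a toy, file-based sketch (purely illustrative, not Treebo's implementation): events are only ever appended, the consumer checkpoints its offset for replay, and the apply step is a keyed upsert, so re-processing is harmless.

```python
# Toy sketch: append-only log + checkpointed, idempotent replay.
import json
import os

LOG, CHECKPOINT = "events.log", "checkpoint.txt"

def append_event(event):
    # Kappa-style immutability: events are appended, never updated in place.
    with open(LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(apply, state):
    # Resume from the last checkpoint; safe to re-run because apply() is
    # an idempotent upsert rather than a blind increment.
    start = int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0
    with open(LOG) as f:
        for offset, line in enumerate(f):
            if offset < start:
                continue
            apply(state, json.loads(line))
            with open(CHECKPOINT, "w") as c:   # custom checkpointing
                c.write(str(offset + 1))

state = {}
append_event({"id": "b1", "status": "confirmed"})
replay(lambda s, e: s.update({e["id"]: e["status"]}), state)
print(state)  # {'b1': 'confirmed'}; running replay() again changes nothing
```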
“I’m sure the highest-capacity storage device will not be enough to record all our stories, because every time with you is very valuable data.”
Not really sure if this was said by somebody with reference to technology or to their love interest! :)
Q & A
Thank You
Data Pipeline @ LK
By Jitendra Agrawal
Types of data
Event stream
Basic Lambda Architecture
[Diagram] Incoming events land on a message queue and fan out to two layers, real-time processing and batch processing; each layer serves queries and returns responses.
Stream vs. batch
● Stream / speed layer
○ Processing - Apache Storm, Apache Spark, Apache Samza
○ Store - Elasticsearch, Druid, Spark SQL, other DBs
○ Usage
■ Live dashboards (potentially inaccurate)
● Counts, Averages
■ Rate limiting
■ Triggers for further action
● Batch
○ Immutable(?) store
■ HDFS
■ Cassandra
■ Event stream to S3
○ Batch processing and precomputation
○ Data warehouse - HBase, Hive, Redshift, Postgres
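To make the two layers above concrete, a deliberately toy sketch with in-memory stand-ins for the stores listed above; real deployments substitute Kafka, HDFS/S3, and a store such as Druid:

```python
# Toy Lambda architecture: one event stream feeds both layers.
from collections import Counter

immutable_store = []     # stand-in for the batch layer's HDFS / S3 archive
live_counts = Counter()  # stand-in for a speed-layer store (e.g. Druid)

def ingest(event):
    immutable_store.append(event)    # batch layer: append, never mutate
    live_counts[event["type"]] += 1  # speed layer: cheap, possibly inaccurate

def batch_recompute():
    # Periodic precomputation; the exact result corrects speed-layer drift.
    return Counter(e["type"] for e in immutable_store)

for e in [{"type": "login"}, {"type": "loan_applied"}, {"type": "login"}]:
    ingest(e)
print(live_counts == batch_recompute())  # True here; can diverge on failures
```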
Database change logs
● MySQL
○ Row-level binlogs
○ Debezium -> Kafka (see the consumer sketch below)
○ Before and after values
○ Handles database restarts / restreams data (duplicates)
● MongoDB
○ Oplog
○ Oplog reader -> Kafka
○ After values only
○ Handles database restarts
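As a hedged sketch of the MySQL path: a consumer reading Debezium change events from Kafka with the kafka-python client. Debezium's envelope does carry "before"/"after" row images and an "op" code (c/u/d); the topic and broker address here are assumptions.

```python
# Sketch: reading Debezium change events (before/after images) from Kafka.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "lkdb.loans.applications",          # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092", # assumed broker
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:           # tombstone record after a delete
        continue
    # With the JSON converter, the change may be wrapped in a "payload" field.
    change = message.value.get("payload", message.value)
    print(change["op"], "before:", change["before"], "after:", change["after"])
```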
Data @ LK
● Multiple self-hosted MySQL instances (Application)
● On-premise MySQL installation (Calls)
● MongoDB (Application)
● Mixpanel
● Facebook Ads
● Google Ads
● Mandrill
● A couple of terabytes and increasing rapidly
Motivation for considering a data warehouse
● Joins across multiple databases
● MySQL just can’t run some analytics queries
● Some of the ‘changes’ are not sent to Mixpanel as events
● A lot of questions are asked of the data retrospectively
Data warehouse inputs
● MySQL
○ Sync the current state of all databases to Redshift
○ Send all changes in tables to Kafka
■ Debezium
○ All before/after values for changes are stored in S3
○ S3 data is processed to create audit-trail tables (see the sketch below)
● MongoDB
○ Send all changes in collections to Kafka
■ Oplog Reader
○ All changes are stored in S3
○ S3 data is processed to create a copy of MongoDB and an audit trail
● Store all changes
○ Filtering for duplicates can be done later
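A hedged sketch of the "S3 data is processed to create audit-trail tables" step using boto3; the bucket, prefix, and one-JSON-object-per-line record layout are hypothetical.

```python
# Sketch: flattening archived change events in S3 into audit-trail rows.
# A real job would run on a schedule and dedupe on (table, key, offset).
import json
import boto3

s3 = boto3.client("s3")

def audit_rows(bucket, prefix):
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():      # one JSON change per line
                change = json.loads(line)
                yield (change["table"], change["op"],
                       change.get("before"), change.get("after"))

# Hypothetical bucket/prefix:
# for row in audit_rows("lk-change-log", "mysql/2018/08/"):
#     print(row)
```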
Lendingkart Architecture
Questions?