Building a Data Pipeline - Case studies
Amit Sharma, Director @NoBroker
Raam Baranidharan, Associate Director @Treebo
Jitendra Agrawal, VP Technology @LendingKart
18th August 2018
Building data pipelines @NoBroker
Amit Sharma, Director Engineering
Why a Data Pipeline?
• Business needs data.
• Analytics is computationally taxing.
• Data exists on multiple platforms:
• CRM
• Apps
• Calls
• 3rd-party webhooks
What is a Data Pipeline?
• Moving, joining, and re-formatting data between systems.
• A data pipeline is the sum of all these steps.
• Its job is to ensure that all these steps happen reliably, on all data.
Parts and Processes of a Data Pipeline
• Sources
• Joins
• Extractions
• Standardization/Corrections
• Loads
• Automation
(A minimal sketch of these steps follows.)
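These parts compose in a fixed order: extract from sources, standardize/correct, join, and load, with automation wrapping the whole run. A minimal, hedged sketch in Python (not NoBroker's actual code; all table and field names here are hypothetical):

```python
# Minimal illustrative pipeline: extract -> standardize -> load.
# All table/field names are hypothetical; a real pipeline adds joins across
# sources, retries, monitoring, and scheduling (the "automation" part).
import sqlite3

def extract(conn, query):
    """Pull raw rows out of one source system."""
    return conn.execute(query).fetchall()

def standardize(rows):
    """Correct and normalize records so different sources can be joined."""
    return [(rid, city.strip().lower()) for rid, city in rows]

def load(conn, rows):
    """Write cleaned rows to the destination store."""
    conn.executemany("INSERT INTO leads_clean VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")   # stand-in for a CRM database
    src.execute("CREATE TABLE leads (id INTEGER, city TEXT)")
    src.execute("INSERT INTO leads VALUES (1, '  Bangalore ')")
    dst = sqlite3.connect(":memory:")   # stand-in for the warehouse
    dst.execute("CREATE TABLE leads_clean (id INTEGER, city TEXT)")
    load(dst, standardize(extract(src, "SELECT id, city FROM leads")))
    print(dst.execute("SELECT * FROM leads_clean").fetchall())
```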
How We Use the Data Pipeline
[Architecture diagram] Sources (Website, Mobile, CRM, Calls, User Footprint, Third-Party Data, Property Data) feed the Prism ML Engine and its inference systems:
• DATA HARVESTER
• NB ESTIMATES - Rentometer, Prop Worth, Lifestyle Score, Commute Score
• SIA - Synthetic Language Generator
• OCULUS - Image Intelligence
• LITMUS - Text Profanity Engine
• MASS - Property Quality Estimates
• HURACAN - Lead Intelligence
• JARVIS - Automatic Speech Recognition Engine (NB Voice Asst.)
• TROUBLESHOOT - Sentiment Analysis Engine
• DEEP LEON - Large-Scale Deep Learning Framework
These power downstream applications: HOROSCOPE, QUICKSILVER, CLICK & EARN, SMART SALES, NB BOT, SMART FOLLOW UPS, DEMAND-SUPPLY ANALYTICS, SMART GRIEVANCE REDRESSAL, and VIGILANTE (some marked work in progress).
Lessons from Building a Data Pipeline
• With analytics data, scale matters: one server is never enough.
• Once a data pipeline is the source of truth, reliability matters.
• Without enrichments, it is hard to derive insights.
Any Questions?
Data World @ Treebo
A perspective & some musings
Sneh
Presentation flow
❖ GROUND 0: The initial 3 slides prepare the ground
❖ GET: The following 3 slides set the broader context around:
➢ The problem statement
➢ Some fundamentals around systems, storage & general considerations
❖ SET: The intermediate 2 slides deep-dive into:
➢ The different phases, and the choices of AWS technology for them
❖ GO: The concluding 2 slides talk about:
➢ The overall architecture, put together
➢ Progressive thoughts
What is Data Strategy (DS)? .. ⅓
A set of techniques around the collection, storage, and usage of data, in a manner that the data can serve not only the operations of a company, but also open up potential additional monetisation avenues in the future.
A good DS has to be actionable and, at the same time, evolutionary to adjust to
disruptive market forces.
DS always has to be business-driven, never technology-driven.
Is DS worth it? .. ⅔
The data that we, as Treebo, own is a resource that has economic value and we expect
it to provide future benefit just like any other asset.
But, generally, data is an under-managed, underutilized asset, because it doesn't feature in the company's P&L at book closing.
To look at it differently: just as we have a people-focused strategy to retain employees (our asset), a data-focused strategy is required to retain good data (our asset, again)!
Without DS, we will be forced to deal with myriad data-related
initiatives taken up by various business groups.
High-level Framework for DS .. 3/3
❖ Planning and discovery: identify business objectives, key stakeholders & scope.
❖ Current-state assessment: focus on business processes, data sources, technology stack & policies.
❖ Analysis, prioritization and roadmap: analyse requirements, set criteria for prioritization & lay out the initiative roadmap.
❖ Change management: encompasses organizational change, cultural change, technology change, and changes in business processes.
Problem Statement .. GET ⅓
To design a highly scalable, highly available, low-latency data platform that can capture, move and transform transactions with zero data loss, and that supports replay when required.
So, essentially, a system needs to be designed which is/has:
Highly scalable; highly available; low latency; zero data loss; replay capability
Golden Rules .. GET ⅔
❖ Do not go distributed if the data size is small enough.
Any distributed system takes 10 years to debug; any database takes 10 years to debug; and
any distributed database is still being debugged!
❖ Do not go streaming if batches serve well.
The above two rules hold true for practically all data initiatives.
Revisiting Some Fundamentals .. GET 3/3
❖ Processing: scalability, availability, consistency, latency, durability, disaster recovery
❖ Source: RDBMSs, and how they fare on the features above
❖ Hosting: on-premise / hybrid / cloud
Different phases .. SET ½
❖ Source: logs, tools, open libraries, proprietary solutions
❖ Fetches/transient storage: ordering, delivery semantics, consistency guarantees, schema evolvability (see the producer sketch below)
❖ Processing: in-flight / at the destination; batch; (near) real time
❖ Destination: hardware; SQL (No/New); columnar
❖ Cache: optional!
❖ Visualisation: various options suited to the use case
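Ordering and delivery semantics in the fetch/transient-storage phase are largely decided at produce time. A hedged sketch with the kafka-python client (the broker address and topic name are assumptions; acks/retries buy an at-least-once guarantee, which pushes deduplication to consumers):

```python
# Sketch: per-key ordering with Kafka via kafka-python. Messages sharing a
# key hash to the same partition, so their relative order is preserved.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # assumed local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",   # wait for full replication: better durability/delivery
    retries=5,    # retries => at-least-once; consumers must tolerate dupes
)

# Hypothetical topic and event, for illustration only.
producer.send("booking-events", key="booking-42", value={"status": "confirmed"})
producer.flush()
```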
Different phase choices .. SET 2/2
Treebo Architecture .. GO ½
Progressive thoughts .. GO 2/2
❖ Append-only event logging for immutability (Kappa architecture)
❖ Ensure idempotency
❖ Custom checkpointing for better replay
❖ Specialised storage formats
❖ Data governance & workload management
❖ Transition higher up the matured analytics value pyramid
(A toy sketch of the first three points follows.)
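The first three points fit together; as a toy, file-based sketch (purely illustrative, not Treebo's implementation): events are only ever appended, the consumer checkpoints its offset for replay, and the apply step is a keyed upsert, so re-processing is harmless.

```python
# Toy sketch: append-only log + checkpointed, idempotent replay.
import json
import os

LOG, CHECKPOINT = "events.log", "checkpoint.txt"

def append_event(event):
    # Kappa-style immutability: events are appended, never updated in place.
    with open(LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(apply, state):
    # Resume from the last checkpoint; safe to re-run because apply() is
    # an idempotent upsert rather than a blind increment.
    start = int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0
    with open(LOG) as f:
        for offset, line in enumerate(f):
            if offset < start:
                continue
            apply(state, json.loads(line))
            with open(CHECKPOINT, "w") as c:   # custom checkpointing
                c.write(str(offset + 1))

state = {}
append_event({"id": "b1", "status": "confirmed"})
replay(lambda s, e: s.update({e["id"]: e["status"]}), state)
print(state)  # {'b1': 'confirmed'}; running replay() again changes nothing
```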
“I’m sure the highest-capacity storage device will not be enough to record all our stories, because every time with you is very valuable data.”
Not really sure if this was said by somebody with reference to technology or to their love interest! :)
Q & A
Thank You
Data Pipeline @ LK
By Jitendra Agrawal
Types of data
Event stream
Basic Lambda Architecture
[Diagram] Incoming events land on a message queue and fan out to two layers, real-time processing and batch processing; each layer serves queries and returns responses.
Stream vs. batch
● Stream / speed layer
○ Processing - Apache Storm, Apache Spark, Apache Samza
○ Store - Elasticsearch, Druid, Spark SQL, other DBs
○ Usage
■ Live dashboards (potentially inaccurate)
● Counts, Averages
■ Rate limiting
■ Triggers for further action
● Batch
○ Immutable(?) store
■ HDFS
■ Cassandra
■ Event stream to S3
○ Batch processing and precomputation
○ Data warehouse - HBase, Hive, Redshift, Postgres
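To make the two layers above concrete, a deliberately toy sketch with in-memory stand-ins for the stores listed above; real deployments substitute Kafka, HDFS/S3, and a store such as Druid:

```python
# Toy Lambda architecture: one event stream feeds both layers.
from collections import Counter

immutable_store = []     # stand-in for the batch layer's HDFS / S3 archive
live_counts = Counter()  # stand-in for a speed-layer store (e.g. Druid)

def ingest(event):
    immutable_store.append(event)    # batch layer: append, never mutate
    live_counts[event["type"]] += 1  # speed layer: cheap, possibly inaccurate

def batch_recompute():
    # Periodic precomputation; the exact result corrects speed-layer drift.
    return Counter(e["type"] for e in immutable_store)

for e in [{"type": "login"}, {"type": "loan_applied"}, {"type": "login"}]:
    ingest(e)
print(live_counts == batch_recompute())  # True here; can diverge on failures
```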
Database change logs
● MySQL
○ Row-level binlogs
○ Debezium -> Kafka (see the consumer sketch below)
○ Before and after values
○ Handles database restarts / restreams data (duplicates)
● MongoDB
○ Oplog
○ Oplog reader -> Kafka
○ After values only
○ Handles database restarts
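As a hedged sketch of the MySQL path: a consumer reading Debezium change events from Kafka with the kafka-python client. Debezium's envelope does carry "before"/"after" row images and an "op" code (c/u/d); the topic and broker address here are assumptions.

```python
# Sketch: reading Debezium change events (before/after images) from Kafka.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "lkdb.loans.applications",          # hypothetical Debezium topic name
    bootstrap_servers="localhost:9092", # assumed broker
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
)

for message in consumer:
    if message.value is None:           # tombstone record after a delete
        continue
    # With the JSON converter, the change may be wrapped in a "payload" field.
    change = message.value.get("payload", message.value)
    print(change["op"], "before:", change["before"], "after:", change["after"])
```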
Data @ LK
● Multiple self-hosted MySQL instances (Application)
● On-premise MySQL installation (Calls)
● MongoDB (Application)
● Mixpanel
● Facebook Ads
● Google Ads
● Mandrill
● A couple of terabytes and increasing rapidly
Motivation for considering a data warehouse
● Joins across multiple databases
● MySQL just can’t run some analytics queries
● Some of the ‘changes’ are not sent to Mixpanel as events
● A lot of questions are asked of the data retrospectively
Data warehouse inputs
● MySQL
○ Sync the current state of all databases to Redshift
○ Send all changes in tables to Kafka
■ Debezium
○ All before/after values for changes are stored in S3
○ S3 data is processed to create audit-trail tables (see the sketch below)
● MongoDB
○ Send all changes in collections to Kafka
■ Oplog Reader
○ All changes are stored in S3
○ S3 data is processed to create a copy of MongoDB and an audit trail
● Store all changes
○ Filtering for duplicates can be done later
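A hedged sketch of the "S3 data is processed to create audit-trail tables" step using boto3; the bucket, prefix, and one-JSON-object-per-line record layout are hypothetical.

```python
# Sketch: flattening archived change events in S3 into audit-trail rows.
# A real job would run on a schedule and dedupe on (table, key, offset).
import json
import boto3

s3 = boto3.client("s3")

def audit_rows(bucket, prefix):
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():      # one JSON change per line
                change = json.loads(line)
                yield (change["table"], change["op"],
                       change.get("before"), change.get("after"))

# Hypothetical bucket/prefix:
# for row in audit_rows("lk-change-log", "mysql/2018/08/"):
#     print(row)
```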
Lendingkart Architecture
Questions?