www.globalbigdataconference.com
Twitter : @bigdataconf
Big Data Ingestion
Navneet Gupta
Flipkart Data Platform
navneet.gupta@flipkart.com
● Data Governance - Democratizing data at Flipkart
● Divided into three sub-teams called Ingestion, Processing and Consumption.
● Created out of the vision to make Flipkart a data-centric company. (Examples of
such companies are Facebook, Google and LinkedIn.)
● Works with all teams in Flipkart and acts as a broker between teams for exchanging
data (raw or processed).
● Provides capabilities around data processing/consumption but is agnostic to the
underlying business processes. Does not itself build any apps on top of the
data collected.
● Examples of applications on top of FDP - Seller Analytics
Flipkart Data Platform (FDP)
● Responsibility to push data to FDP lies with source teams.
● Responsibility to report data availability lies with FDP, which should call out
when source teams are not pushing data.
● All business processes are modeled as entities/events, and FDP
provides a console to define those entities/events using custom schema
management (open-source alternatives include Avro, Thrift and Protocol
Buffers).
● Validation is bundled with schema definition.
● Having a schema makes it possible to hold strong assumptions about the fields in the data.
More about FDP ...
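As a rough illustration of "validation bundled with schema definition", here is a minimal Python sketch. The schema layout, the `order_created` field names and the `validate` helper are all hypothetical; FDP's actual schema format is not described in the slides.

```python
import json

# Hypothetical schema for an "order_created" event; the field names and the
# schema layout are illustrative, not FDP's actual schema format.
ORDER_CREATED_SCHEMA = {
    "order_id": {"type": str, "required": True},
    "amount": {"type": float, "required": True, "min": 0.0},
    "coupon": {"type": str, "required": False},
}


def validate(payload: str, schema: dict) -> list:
    """Return a list of validation errors; an empty list means the payload is valid."""
    try:
        record = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["payload must be a JSON object"]

    errors = []
    for name, rule in schema.items():
        if name not in record:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        value = record[name]
        # Accept ints where floats are expected, since JSON has one number type.
        expected = (int, float) if rule["type"] is float else rule["type"]
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"wrong type for field: {name}")
        elif "min" in rule and value < rule["min"]:
            errors.append(f"value below minimum for field: {name}")
    # Unknown fields are rejected so consumers can rely on the declared fields.
    for name in record:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors
```

Because the rules live next to the field definitions, every payload that passes `validate` has the declared fields with the declared types, which is what allows strong assumptions about the data later on.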
● Flipkart teams work with varied datastores such as MySQL, MongoDB, CouchDB,
HBase and Hadoop.
● Some teams onboard later than others, so huge volumes of existing data sometimes
have to be bootstrapped.
● A single ingestion mechanism might not suit all teams at Flipkart. Some
teams prefer streaming ingestion, others want batch, and some want
support for ingesting their data in a Hadoop cluster.
● Data could arrive in many formats such as binary blobs, JSON, XML or CSV. We
don't want to deal with each format, so we currently support only JSON payloads.
Data has many faces at Flipkart!
● Almost 2 billion ingestions seen on an average day
● Half of those ingestions happen in streaming fashion (HTTP endpoint)
● Other ingestion mechanisms
○ Hadoop-based ingestion
○ Java library
○ Daemon processes on source machines
○ Command-line tools to ingest a file in one shot
● Plan to support 5-10x these ingestion numbers for the next BBD (Big Billion Days)
Some numbers ...
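To put these figures in per-second terms, the short Python calculation below converts the daily volume into average rates. It assumes a uniform spread over the day, which real traffic will not follow, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_ingestions = 2_000_000_000  # "almost 2 billion" from the slide
streaming_share = 0.5             # "half" arrive via the HTTP endpoint

avg_per_second = daily_ingestions / SECONDS_PER_DAY
streaming_per_second = avg_per_second * streaming_share

# Target range for the next BBD: 5x to 10x the current daily volume.
target_daily = (5 * daily_ingestions, 10 * daily_ingestions)
target_per_second = tuple(t / SECONDS_PER_DAY for t in target_daily)

print(f"average:   {avg_per_second:,.0f} events/s")
print(f"streaming: {streaming_per_second:,.0f} events/s")
print(f"target:    {target_per_second[0]:,.0f} to {target_per_second[1]:,.0f} events/s")
```

That is roughly 23,000 ingestions per second on average today, about half of them over HTTP, and a target of 10 to 20 billion ingestions per day.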
● Dropwizard-based Java app with endpoints defined for ingesting data.
● Performs schema validation online.
● Relays validated data to Kafka.
● Validation failures go through a different flow, and customers are alerted
if the number of failures breaches predefined rules.
● Clients get a 200 response code and a traceId once the ingested data has
actually been accepted by the service.
● Monitoring is built for the service by exposing JMX metrics, which go to
a central monitoring service.
Streaming Ingestion
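The request flow described above can be sketched roughly as follows. This is a simplified Python model, not the actual Dropwizard service: the in-memory queues stand in for Kafka and the failure flow, and `is_valid`, `FAILURE_ALERT_THRESHOLD` and the 400 status for rejected payloads are assumptions made for the example.

```python
import json
import uuid
from collections import deque

validated_topic = deque()    # stand-in for the Kafka topic of validated data
failure_flow = deque()       # stand-in for the separate validation-failure flow
FAILURE_ALERT_THRESHOLD = 3  # hypothetical alerting rule


def is_valid(record) -> bool:
    # Placeholder check; the real service validates against the entity/event schema.
    return isinstance(record, dict) and "event" in record


def ingest(body: str) -> tuple:
    """Handle one ingestion request and return (status_code, response_body)."""
    try:
        record = json.loads(body)
    except json.JSONDecodeError:
        record = None

    if not is_valid(record):
        failure_flow.append(body)
        if len(failure_flow) >= FAILURE_ALERT_THRESHOLD:
            print(f"ALERT: {len(failure_flow)} validation failures")
        # Assumed status code; the slides do not say what rejected clients receive.
        return 400, {"error": "schema validation failed"}

    trace_id = str(uuid.uuid4())
    # Relay first, respond second: the 200 is only sent once the data is accepted.
    validated_topic.append({"traceId": trace_id, "payload": record})
    return 200, {"traceId": trace_id}
```

Returning the traceId only after the relay step gives the client a handle it can use to follow up on a specific ingestion.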
Big Data Ingestion @ Flipkart Data Platform
● Kafka is a distributed, partitioned, replicated and fault-tolerant
publish-subscribe system (but with a unique design)
● Invented at LinkedIn and used by many other large companies today (Yahoo,
Twitter, Netflix, Uber, Goldman Sachs)
● Has the notions of Producers, Consumers, Brokers, Topics and Partitions
● Messages are persistent. Multiple consumers can consume the same messages, and
a consumer can read a message again by resetting its offset (replay)
● Highly scalable and highly configurable
● Excellent documentation and community support.
● Battle-tested and easy to administer.
More about Kafka
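The persistence and replay behaviour can be pictured with a toy in-memory model of a single partition. This is a Python sketch of the idea only, not Kafka's client API; the `PartitionLog` and `Consumer` classes and their methods are invented for the example.

```python
class PartitionLog:
    """Append-only log for one partition; messages stay after being read."""

    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset: int, max_messages: int = 10) -> list:
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own offset, so consumers do not affect each other."""

    def __init__(self, log: PartitionLog):
        self._log = log
        self.offset = 0

    def poll(self) -> list:
        batch = self._log.read(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        self.offset = offset  # resetting the offset enables replay


log = PartitionLog()
for event in ("a", "b", "c"):
    log.append(event)

batch_consumer = Consumer(log)
realtime_consumer = Consumer(log)

first_read = batch_consumer.poll()     # reads a, b, c
other_read = realtime_consumer.poll()  # also reads a, b, c
batch_consumer.seek(1)
replayed = batch_consumer.poll()       # reads b, c again
```

Reading does not remove anything from the log, which is why several consumers can process the same data independently and any of them can rewind.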
● Kafka is a temporary store and retains data only for the last 30 days
(retention is configurable by number of days or by size).
● Current consumers of our Kafka cluster include batch-processing and
real-time processing flows.
● We use Camus to copy data from Kafka to Hadoop. A Camus instance currently runs
every hour to copy all the new data in Kafka to Hadoop.
● The stream-processing flow, built on top of Storm, uses the official KafkaSpout to
consume data from Kafka.
Onto downstream systems ...
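The hourly copy can be thought of as an incremental job that remembers how far it got in each partition. The Python sketch below illustrates that idea under simplifying assumptions; it is not Camus code, and `copy_new_messages` with its in-memory stand-ins for Kafka and Hadoop is hypothetical.

```python
def copy_new_messages(partitions: dict, committed: dict, sink: list) -> dict:
    """Copy messages past each partition's committed offset into the sink.

    partitions: partition id -> list of messages (stand-in for Kafka)
    committed:  partition id -> next offset to read, saved by the previous run
    sink:       list receiving copied messages (stand-in for files on Hadoop)
    Returns the offsets to persist for the next run.
    """
    next_offsets = dict(committed)
    for partition_id, messages in partitions.items():
        start = committed.get(partition_id, 0)
        sink.extend(messages[start:])
        next_offsets[partition_id] = len(messages)
    return next_offsets


kafka = {0: ["e1", "e2"], 1: ["e3"]}
hadoop = []

offsets = copy_new_messages(kafka, {}, hadoop)       # first hourly run
kafka[0].append("e4")                                # new data arrives
offsets = copy_new_messages(kafka, offsets, hadoop)  # next run copies only e4
```

Since Kafka keeps data for a limited period only, a job of this kind has to run well within the retention window so that nothing ages out before it is copied.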
● Streaming Ingestion and Processing at FDP -
speakerdeck.com/sids/streaming-ingestion-and-processing-at-flipkart
● Kafka - http://kafka.apache.org/081/documentation.html
● LinkedIn Camus - https://github.com/linkedin/camus
● Apache Avro - http://avro.apache.org/docs/current/
● Dropwizard - http://www.dropwizard.io/
● Blog on building stream data platform -
http://blog.confluent.io/2015/02/25/stream-data-platform-2/
References
Questions?
BTW, we are hiring!!
careers.flipkart.com
