BUILDING A DATA PIPELINE
From Scratch
1
Joe Crobak
@joecrobak
Tuesday, June 24, 2014
Axium Lyceum - New York, NY
INTRODUCTION
2
Software Engineer @ Project Florida
Previously:
•Foursquare
•Adconion Media Group
•Joost
OVERVIEW
3
Why do we care?
Defining Data Pipeline
Events
System Architecture
4
DATA PIPELINES ARE EVERYWHERE
RECOMMENDATIONS
5
Clicks
Views
Recommendations
http://blog.linkedin.com/2010/05/12/linkedin-pymk/
AD NETWORKS
7
Clicks
Impressions
User Ad Profile
SEARCH
9
http://lucene.apache.org/solr/
Search Rankings
Page Rank
http://www.jevans.com/pubnetmap.html
A / B TESTING
11
https://flic.kr/p/4ieVGa
A conversions
B conversions
Experiment Analysis
DATA WAREHOUSING
13
http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
key metrics
user events
Data Warehouse
15
WHAT IS A DATA PIPELINE?
DATA PIPELINE
16
A Data Pipeline is a unified system for
capturing events for analysis and
building products.
DATA PIPELINE
17
click data
user events
Data Warehouse
web visits
email sends
…
Product Features
Ad Hoc analysis
•Counting
•Machine Learning
•Extract Transform Load (ETL)
19
EVENTS
EVENTS
20
Each of these actions can be thought of as an
event.
COARSE-GRAINED EVENTS
21
•Events are captured as a by-product.
•Stored in text logs used primarily for
debugging and secondarily for analysis.
127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969
IP Address, Timestamp, Action, Status
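A minimal sketch of pulling the labeled fields out of the log line above. The regular expression and the field names are assumptions made for illustration; real access-log formats vary.

```python
import re

# Assumed pattern for the access-log line shown above; real formats vary.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the labeled fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


line = '127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969'
event = parse_log_line(line)
```

Every consumer of such logs has to carry a parser like this one, which is part of why text logs are awkward for analysis.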
COARSE-GRAINED EVENTS
23
Implicit tracking: a “page load” event is a proxy for ≥1 other event.
e.g. event GET /newsfeed corresponds to:
•App Load (but only if this is the first time loaded this session)
•Timeline load, user is in “group A” of an A/B Test
These implementation details have to be known at analysis time.
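To illustrate the cost, here is a sketch of what analysis code ends up looking like with implicit tracking. The function name, the session bookkeeping, and the event names are hypothetical; the rules are the two from the example above.

```python
def infer_events(path, seen_sessions, session_id, ab_group):
    """Guess the real events behind a page load (hypothetical rules)."""
    events = []
    if path == "/newsfeed":
        # "App Load" only counts the first time this session loads the page.
        if session_id not in seen_sessions:
            events.append("app_load")
            seen_sessions.add(session_id)
        # "Timeline load" only applies to users in group A of the A/B test.
        if ab_group == "A":
            events.append("timeline_load")
    return events


seen = set()
first = infer_events("/newsfeed", seen, "session-1", "A")
second = infer_events("/newsfeed", seen, "session-1", "A")
```

If the application changes how /newsfeed behaves, this logic silently goes stale.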
FINE-GRAINED EVENTS
24
Record events like:
•app opened
•auto refresh
•user pull down refresh
Rather than:
•GET /newsfeed
FINE-GRAINED EVENTS
25
Annotate events with contextual
information like:
•view the user was on
•which button was clicked
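A minimal sketch of a fine-grained event carrying contextual information. The class, its field names, and the sample values are assumptions for illustration, not the event framework used in the talk.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """Hypothetical fine-grained event annotated with context."""
    name: str                      # e.g. "app_opened", "auto_refresh"
    user_id: str
    timestamp: float = field(default_factory=time.time)
    context: dict = field(default_factory=dict)

    def to_json(self):
        # JSON is used here only to make the record easy to inspect.
        return json.dumps(asdict(self), sort_keys=True)


event = Event(
    name="user_pull_down_refresh",
    user_id="u123",
    timestamp=1403574796.0,
    context={"view": "newsfeed", "button": None},
)
record = json.loads(event.to_json())
```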
FINE-GRAINED EVENTS
26
Decouple logging and analysis. Create events
for everything!
FINE-GRAINED EVENTS
27
A couple of schema-less formats are popular
(e.g. JSON and CSV), but they have
drawbacks.
•harder to change schemas
•inefficient
•require writing parsers
SCHEMA
28
Used to describe data, providing a contract
about fields and their types.
Two schemas are compatible if you can read
data written in schema 1 with schema 2.
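A simplified sketch of such a compatibility check. The schema representation and the rule (a field missing from the written data needs a default in the reading schema, and shared fields must agree on type) are assumptions loosely modeled on how serialization frameworks resolve schemas; real frameworks have more detailed rules.

```python
_NO_DEFAULT = object()  # sentinel: the field has no default value


def make_field(type_name, default=_NO_DEFAULT):
    """Describe one field of a hypothetical schema."""
    return {"type": type_name, "default": default}


def can_read(writer_schema, reader_schema):
    """True if data written with writer_schema can be read with reader_schema."""
    for name, spec in reader_schema.items():
        written = writer_schema.get(name)
        if written is None:
            # Field absent from the data: the reader needs a default to fill in.
            if spec["default"] is _NO_DEFAULT:
                return False
        elif written["type"] != spec["type"]:
            return False
    return True


schema_1 = {"user_id": make_field("string"), "action": make_field("string")}
schema_2 = dict(schema_1, view=make_field("string", default="unknown"))
schema_3 = dict(schema_1, view=make_field("string"))  # new field, no default
```

Here schema_2 can read data written with schema_1, while schema_3 cannot because its new field has no default.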
SCHEMA
29
Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.
https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter
SCHEMA
31
client:page:section:component:element:action e.g.:
iphone:home:mentions:tweet:button:click
Count iPhone users clicking from home page:
iphone:home:*:*:*:click
Count home clicks on buttons or avatars:
*:home:*:*:{button,avatar}:click
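A small sketch of how the patterns above could be evaluated against event names. The matcher is an assumption written for illustration (it is not Twitter's implementation), and every sample event name except the first is made up.

```python
def component_matches(pattern, value):
    """Match one component: '*' is a wildcard, '{a,b}' is a set of alternatives."""
    if pattern == "*":
        return True
    if pattern.startswith("{") and pattern.endswith("}"):
        return value in pattern[1:-1].split(",")
    return pattern == value


def event_matches(pattern, event_name):
    """Compare the colon-separated components of a pattern and an event name."""
    pattern_parts = pattern.split(":")
    name_parts = event_name.split(":")
    if len(pattern_parts) != len(name_parts):
        return False
    return all(component_matches(p, v) for p, v in zip(pattern_parts, name_parts))


def count_matching(pattern, event_names):
    return sum(event_matches(pattern, name) for name in event_names)


events = [
    "iphone:home:mentions:tweet:button:click",
    "iphone:home:timeline:tweet:avatar:click",
    "web:home:mentions:tweet:button:click",
    "iphone:profile:tweets:tweet:button:click",
    "iphone:home:mentions:tweet:button:impression",
]
iphone_home_clicks = count_matching("iphone:home:*:*:*:click", events)
home_button_or_avatar = count_matching("*:home:*:*:{button,avatar}:click", events)
```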
32
KEY COMPONENTS
EVENT FRAMEWORK
33
For easily generating events from your
applications
BIG MESSAGE BUS
35
•Horizontally scalable
•Redundant
•APIs / easy to integrate
•Scribe (Facebook)
•Apache Chukwa
•Apache Flume
•Apache Kafka*
* My recommendation
DATA PERSISTENCE
37
For storing your events in files for batch
processing
Kite Software Development Kit
http://kitesdk.org/
Spring Hadoop
http://projects.spring.io/spring-hadoop/
WORKFLOW MANAGEMENT
39
For coordinating the tasks in your data
pipeline
… or your own system written in your own language of choice.
SERIALIZATION FRAMEWORK
41
Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
•Apache Avro*
•Apache Thrift
•Protocol Buffers (Google)
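A rough sketch of why a schema-driven binary encoding is more compact than a text format. It uses Python's standard struct module with a hypothetical fixed three-field schema; it is not the wire format of Avro, Thrift, or Protocol Buffers.

```python
import json
import struct

# Hypothetical fixed schema: user_id (unsigned 64-bit), timestamp (unsigned
# 32-bit), status (unsigned 16-bit), little-endian with no padding.
EVENT_FORMAT = struct.Struct("<QIH")


def serialize(event):
    """Encode an event dict to bytes using the fixed schema."""
    return EVENT_FORMAT.pack(event["user_id"], event["timestamp"], event["status"])


def deserialize(data):
    """Decode bytes back into an event dict."""
    user_id, timestamp, status = EVENT_FORMAT.unpack(data)
    return {"user_id": user_id, "timestamp": timestamp, "status": status}


event = {"user_id": 42, "timestamp": 1403574796, "status": 200}
binary = serialize(event)
text = json.dumps(event).encode("utf-8")
```

Because the schema carries the field names and types, the binary form stores only the values, whereas the JSON form repeats the field names in every record.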
BATCH PROCESSING AND AD HOC
ANALYSIS
43
•Apache Hadoop (MapReduce)
•Apache Hive (or other SQL-on-Hadoop)
•Apache Spark
SYSTEM OVERVIEW
44
Application
logging framework
data serialization
Message Bus
Persistent Storage
Data Warehouse
Ad hoc Analysis
Product
data flow
workflow engine
Production DB dumps
SYSTEM OVERVIEW (OPINIONATED)
45
Application
logging framework
data serialization: Apache Avro
Message Bus: Apache Kafka
Persistent Storage
Data Warehouse
Ad hoc Analysis
Product
data flow
workflow engine: Luigi
Production DB dumps
NEXT STEPS
46
This architecture opens up a lot of possibilities
•Near-real time computation—Apache
Storm, Apache Samza (incubating), Apache
Spark streaming.
•Sharing information between services
asynchronously—e.g. to augment user
profile information.
•Cross-datacenter replication
•Columnar storage
LAMBDA ARCHITECTURE
47
Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing.
Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
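A toy sketch of the idea, assuming the "insight" is a per-page view count. The function names and the event shape are hypothetical: the batch view is recomputed from all historical events, and events that arrived since the last batch run are layered on top at query time.

```python
from collections import Counter


def batch_view(events):
    """Recompute counts from the full event history (the source of truth)."""
    return Counter(event["page"] for event in events)


def merged_view(batch_counts, realtime_events):
    """Layer events seen since the last batch run on top of the batch counts."""
    merged = Counter(batch_counts)  # copy, so the batch view is left untouched
    merged.update(event["page"] for event in realtime_events)
    return merged


historical = [{"page": "home"}, {"page": "home"}, {"page": "profile"}]
recent = [{"page": "home"}]

batch = batch_view(historical)
view = merged_view(batch, recent)
```

When the next batch run includes the recent events, the real-time portion can be discarded.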
LAMBDA ARCHITECTURE
48
http://lambda-architecture.net/
SUMMARY
49
•Data Pipelines are everywhere.
•Useful to think of data as events.
•A unified data pipeline is very powerful.
•Plethora of open-source tools to build a data pipeline.
FURTHER READING
50
•The Unified Logging Infrastructure for Data Analytics at Twitter
•The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn)
•Big Data by Nathan Marz and James Warren
•Implementing Microservice Architectures
THANK YOU
51
Questions?
Shameless plug: www.hadoopweekly.com
52
EXTRA SLIDES
WHY KAFKA?
53
• https://kafka.apache.org/documentation.html#design
• Pull model works well
• Easy to configure and deploy
• Good JVM support
• Well-integrated with the LinkedIn stack
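A toy in-memory sketch of the pull model: the log only appends, and each consumer tracks its own offset and reads at its own pace. The class and method names are hypothetical and this is not Kafka's API.

```python
class Log:
    """Append-only list of messages, addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Pulls from the log and remembers how far it has read."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = Log()
for name in ["app_opened", "auto_refresh", "user_pull_down_refresh"]:
    log.append(name)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()
slow_batch = slow.poll(max_messages=1)
```

A slow consumer does not hold back a fast one, since neither depends on the log pushing data to it.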
WHY LUIGI?
54
• Scripting language (you’ll end up writing
scripts anyway)
• Simplicity (low learning curve)
• Idempotency
• Easy to deploy
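A sketch of idempotency in plain Python: a task is skipped when its output already exists, and the output is written atomically so a failed run leaves nothing behind at the final path. The function name and arguments are hypothetical; this illustrates the idea and is not Luigi's API.

```python
import os
import tempfile


def run_task(output_path, compute):
    """Run compute() only if output_path is missing; return True if work was done."""
    if os.path.exists(output_path):
        return False  # already complete, so re-running is a no-op
    directory = os.path.dirname(output_path) or "."
    # Write to a temporary file first, then rename it into place atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write(compute())
    os.replace(tmp_path, output_path)
    return True
```

With this shape, a scheduler can re-run a whole pipeline after a failure and only the missing outputs get recomputed.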
WHY AVRO?
55
• Self-describing files
• Integrated with nearly everything in the
ecosystem
• CLI tools for dumping to JSON, CSV

Building a Data Pipeline from Scratch - Joe Crobak