Building a Data Pipeline
with Tools from the Apache Hadoop Ecosystem
Rich Haase
Twitter - @richhaase
LinkedIn - linkedin.com/in/richhaase
About Me
• 18 years of experience in technology
• Infrastructure/Data
• Started using Hadoop in 2010 and haven’t looked
back
• Mildly obsessed with the movement and
management of lots of data
Why Hadoop?
This is your homegrown data pipeline toolkit.
http://www.heavycoverinc.com/heavy-cover-titanium-spork-multi-tool/
This is your data pipeline toolkit when you make
use of the Hadoop ecosystem.
http://www.overstock.com/Sports-Toys/Wenger-Giant-85-tool-141-function-Swiss-Army-Knife/3361457/product.html
Distribution Software Matrix
Component  CDH 5.7.1  HDP 2.4.2  Bigtop 1.1  MapR 5.1  EMR 4.7.0
Flume      1.6.0      1.5.2      1.6.0       1.6.0     -
Hadoop     2.6.0      2.7.1      2.7.1       2.7.1*    2.7.2
Hue        3.9.0      2.6.1      3.9.0       3.9.0     3.7.1
HBase      1.2.0      1.1.2      0.98.12     1.1       1.2.1
Hive       1.1.0      1.2.1      1.2.1       1.2.1     1.0.0
Oozie      4.1.0      4.2.0      4.2.0       4.2.0     4.2.0
Pig        0.12.0     0.15.0     0.15.0      0.15.0    0.14.0
Spark      1.6.0      1.6.1      1.5.1       1.6.1     1.6.1
Sqoop      1.4.6      1.4.6      1.4.5       1.4.6     1.4.6
ZooKeeper  3.4.5      3.4.6      3.4.6       -         3.4.8
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-release-components.html
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_57.html
http://www.apache.org/dist/bigtop/bigtop-1.1.0/repos
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_HDP_RelNotes/content/ch_relnotes_v242.html
http://maprdocs.mapr.com/home/#InteropMatrix/r_eco_matrix.html
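One way to put the matrix to work is to hold it as data and ask it questions, such as which distributions do not list a component, or what the oldest version is that a multi-distribution pipeline would have to support. The sketch below is illustrative only: the helper names (parse_version, missing_from, oldest_version) are hypothetical, and it assumes a "-" on the slide means the component is not listed for that distribution.

```python
# Version matrix transcribed from the slide. A "-" on the slide is stored
# as None (assumption: the component is not listed for that distribution).
DISTRIBUTIONS = ["CDH 5.7.1", "HDP 2.4.2", "Bigtop 1.1", "MapR 5.1", "EMR 4.7.0"]

MATRIX = {
    "Flume":     ["1.6.0", "1.5.2", "1.6.0", "1.6.0", None],
    "Hadoop":    ["2.6.0", "2.7.1", "2.7.1", "2.7.1*", "2.7.2"],
    "Hue":       ["3.9.0", "2.6.1", "3.9.0", "3.9.0", "3.7.1"],
    "HBase":     ["1.2.0", "1.1.2", "0.98.12", "1.1", "1.2.1"],
    "Hive":      ["1.1.0", "1.2.1", "1.2.1", "1.2.1", "1.0.0"],
    "Oozie":     ["4.1.0", "4.2.0", "4.2.0", "4.2.0", "4.2.0"],
    "Pig":       ["0.12.0", "0.15.0", "0.15.0", "0.15.0", "0.14.0"],
    "Spark":     ["1.6.0", "1.6.1", "1.5.1", "1.6.1", "1.6.1"],
    "Sqoop":     ["1.4.6", "1.4.6", "1.4.5", "1.4.6", "1.4.6"],
    "ZooKeeper": ["3.4.5", "3.4.6", "3.4.6", None, "3.4.8"],
}


def parse_version(text):
    """Turn '2.7.1*' into (2, 7, 1); the slide's '*' marker is dropped."""
    return tuple(int(part) for part in text.rstrip("*").split("."))


def missing_from(component):
    """Distributions for which the slide shows no version of the component."""
    return [d for d, v in zip(DISTRIBUTIONS, MATRIX[component]) if v is None]


def oldest_version(component):
    """Lowest version of the component across the distributions that list it."""
    listed = [v for v in MATRIX[component] if v is not None]
    return min(listed, key=parse_version).rstrip("*")


if __name__ == "__main__":
    for name in MATRIX:
        print(name, oldest_version(name), missing_from(name))
```

Versions are compared as integer tuples rather than strings, so that 0.98.12 sorts below 1.1 and 0.12.0 sorts below 0.15.0.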
The Hadoop Ecosystem
• ~150 projects
• Dozens of vendors
• Contributions from a wide variety of organizations
and individuals
• Constantly evolving
https://hadoopecosystemtable.github.io/
Tier 1
Included in all major distributions
Tier 2
Included in a major distribution
Apache DataFu™
Tier 3
Not included in any distribution
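The three tier definitions amount to a small classification rule, sketched below. The slides do not say which distributions count as "major", so the MAJOR set here (the five from the distribution matrix) is an assumption, and the function name tier is hypothetical.

```python
# Assumption: the "major distributions" are the five compared in the
# distribution matrix. The slides do not define the set explicitly.
MAJOR = frozenset({"CDH", "HDP", "Bigtop", "MapR", "EMR"})


def tier(included_in, major=MAJOR):
    """Classify a project by the distributions that ship it.

    Tier 1: included in all major distributions.
    Tier 2: included in at least one, but not all, major distributions.
    Tier 3: not included in any distribution.
    """
    if not major:
        raise ValueError("the set of major distributions must not be empty")
    shipped = set(included_in) & set(major)
    if shipped == set(major):
        return 1
    if shipped:
        return 2
    return 3


if __name__ == "__main__":
    print(tier({"CDH", "HDP", "Bigtop", "MapR", "EMR"}))  # shipped everywhere
    print(tier({"HDP"}))                                   # shipped by one
    print(tier(set()))                                     # shipped by none
```

The rule only counts packaging, which matches the disclaimer on the slides that tiers say nothing about software quality.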
Disclaimer:
I’m sure I’ve missed projects on this list. Any oversight was
completely unintentional.
Tiers are not indicative of software quality, nor are they an indictment
of the engineers/organizations who contributed to the project.
Every open source contribution brings a fairy back to life.
Also, some projects didn’t have logos.
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
Data Life Cycle
Capture
Enrichment
Analysis
Presentation Reporting
Archival
Removal
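The life cycle stages can be modeled as an ordered enumeration so that pipeline code can ask what comes next for a given dataset. This is only a sketch: it assumes the stages run strictly in the order listed on the slide, and the names Stage, next_stage, and remaining_stages are hypothetical rather than anything taken from the demo repository.

```python
from enum import Enum


class Stage(Enum):
    """Data life cycle stages, in the order they appear on the slide."""
    CAPTURE = "Capture"
    ENRICHMENT = "Enrichment"
    ANALYSIS = "Analysis"
    PRESENTATION_REPORTING = "Presentation Reporting"
    ARCHIVAL = "Archival"
    REMOVAL = "Removal"


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    ordered = list(Stage)  # Enum iteration preserves definition order
    position = ordered.index(stage)
    return ordered[position + 1] if position + 1 < len(ordered) else None


def remaining_stages(stage):
    """Return every stage that comes after `stage`."""
    ordered = list(Stage)
    return ordered[ordered.index(stage) + 1:]


if __name__ == "__main__":
    current = Stage.CAPTURE
    while current is not None:
        print(current.value)
        current = next_stage(current)
```

A real pipeline may loop back (for example, re-enriching archived data), in which case a transition table would fit better than a strict linear order.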
Demo
https://github.com/richhaase/building-a-data-pipeline
Questions
See https://github.com/richhaase/building-a-data-pipeline/blob/master/README.md for references.
