Introduction to the Hadoop ecosystem
About me
About us
Why Hadoop?
How to scale data?
[Diagram labels: w1, w2, w3; r1, r2, r3]
But…
What is Hadoop?
The Hadoop App Store
HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra Chukwa Intel Sync
Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop Scribe Tez Vertica Whirr
ZooKee Cloudera Horton MapR EMC IBM Talend TeraData Pivotal Informat Microsoft
Pentaho Jasper Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
Data Storage
Hadoop Distributed File System
HDFS Architecture
Data Processing
MapReduce
Typical large-data problem
MapReduce Flow
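Input:          (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
Map output:     (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)
After combine:  (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)
After shuffle:  a: [1, 3]   b: [2, 7]   c: [2, 8, 9]
Reduce output:  (a, 4) (b, 9) (c, 19)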
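The flow above can be traced in a few lines of plain Java. This is only an in-memory sketch and uses no Hadoop classes. The class name MapReduceFlowSketch, its method names, and the split of the eight map-output pairs into four map tasks of two pairs each are assumptions made for illustration; the numbers themselves are the ones on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory walk-through of map output -> combine -> shuffle -> reduce (hypothetical names). */
class MapReduceFlowSketch {

    /** Combine: pre-aggregates the output of ONE map task by summing the values per key. */
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> taskOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : taskOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        sums.forEach((k, v) -> combined.add(Map.entry(k, v)));
        return combined;
    }

    /** Shuffle and sort: groups the values of all map tasks by key, keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> perTask) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> task : perTask) {
            for (Map.Entry<String, Integer> pair : task) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce: sums the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    /** Runs the whole flow on the slide's map output. */
    static Map<String, Integer> run() {
        // Assumption: four map tasks, each emitting two of the eight pairs shown on the slide.
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2)),
                List.of(Map.entry("c", 3), Map.entry("c", 6)),
                List.of(Map.entry("a", 3), Map.entry("c", 2)),
                List.of(Map.entry("b", 7), Map.entry("c", 8)));

        List<List<Map.Entry<String, Integer>>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> task : mapOutputs) {
            combined.add(combine(task)); // (c, 3) and (c, 6) collapse into (c, 9) here
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=4, b=9, c=19}
    }
}
```

The combine step changes only how many pairs cross the network, not the final sums, which is why it can be applied per map task before the shuffle.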
Jobs & Tasks
Combined Hadoop Architecture
Word Count Mapper in Java
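import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line (old "mapred" API).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}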
Word Count Reducer in Java
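import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts emitted for one word and writes (word, total).
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            // Typed iterator, so no cast is needed
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}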
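To see what the mapper and reducer compute together without a cluster, the same logic can be mirrored in plain Java. This is a sketch under assumptions: WordCountSketch and its methods are hypothetical names, the grouping in a TreeMap stands in for Hadoop's shuffle, and nothing here uses the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java mirror of the word count mapper and reducer (hypothetical names, no Hadoop). */
class WordCountSketch {

    /** Mirrors the mapper: emits (word, 1) for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    /** Mirrors the reducer: sums all counts that were emitted for one word. */
    static int reduce(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    /** Stands in for the framework: maps every line, groups by word, reduces each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) -> counts.put(word, reduce(ones.iterator())));
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(List.of("to be or not to be", "to do")));
    }
}
```

As in the Hadoop version, tokens are split on whitespace only, so "Hadoop" and "hadoop," would count as different words.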
Scripting for Hadoop
Apache Pig
Pig in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Pig Latin
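-- Load both comma-separated files; declaring types avoids implicit bytearray casts
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user:chararray, url:chararray);
-- Keep users aged 18 to 50 and join them with the pages they visited
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
-- Count clicks per URL and keep the ten most-clicked sites
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';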
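For readers who want to check what the script computes, its semantics can be sketched in memory with Java streams. This is not a MapReduce job and not how Pig executes the script: Top10SitesSketch, its method, and the sample data are hypothetical, and the tie-break on URL is an added assumption because the script does not define an order for equal click counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** In-memory sketch of the filter / join / group / count / order / limit pipeline. */
class Top10SitesSketch {

    /**
     * users: (name, age) pairs; pages: (user, url) pairs.
     * Returns up to `limit` (url, clicks) entries, most-clicked first.
     */
    static List<Map.Entry<String, Long>> topSites(List<Map.Entry<String, Integer>> users,
                                                  List<Map.Entry<String, String>> pages,
                                                  int limit) {
        // FILTER users BY age >= 18 AND age <= 50, counting rows per name so that
        // duplicate user rows multiply matches exactly like an inner join would.
        Map<String, Long> eligible = users.stream()
                .filter(u -> u.getValue() >= 18 && u.getValue() <= 50)
                .collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));

        // JOIN ... BY name/user, then GROUP BY url and COUNT.
        Map<String, Long> clicks = new HashMap<>();
        for (Map.Entry<String, String> page : pages) {
            Long matches = eligible.get(page.getKey());
            if (matches != null) {
                clicks.merge(page.getValue(), matches, Long::sum);
            }
        }

        // ORDER BY clicks DESC (ties broken by url, an assumption), then LIMIT.
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the slides.
        List<Map.Entry<String, Integer>> users = List.of(
                Map.entry("alice", 25), Map.entry("bob", 17),
                Map.entry("carol", 50), Map.entry("dave", 51));
        List<Map.Entry<String, String>> pages = List.of(
                Map.entry("alice", "a.com"), Map.entry("bob", "a.com"),
                Map.entry("carol", "a.com"), Map.entry("carol", "b.com"),
                Map.entry("dave", "b.com"), Map.entry("alice", "c.com"));
        System.out.println(topSites(users, pages, 10));
    }
}
```

The sketch holds everything in one JVM's memory. Doing the same join and sort across a cluster is what takes several chained MapReduce jobs, which is the work Pig generates from the script.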
Pig Execution Plan
Try that with Java…
SQL for Hadoop
Apache Hive
Hive in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Query
Hive Architecture
Hive Example
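CREATE TABLE users(name STRING, age INT);
-- user is a reserved keyword in newer Hive releases, hence the backticks
CREATE TABLE pages(`user` STRING, url STRING);

-- Table names are identifiers and must not be quoted as strings
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.`user`)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
-- ORDER BY gives a total order; SORT BY only orders rows within each reducer
ORDER BY clicks DESC
LIMIT 10;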
Bringing it all together…
Online Advertising
Getting started…
Hortonworks Sandbox
Hadoop Training
The end…or the beginning?

Introduction to the Hadoop Ecosystem (codemotion Edition)