Introduction to the
Hadoop ecosystem
About me
About us
Why Hadoop?
How to scale data?
w1 w2 w3
r1 r2 r3
But…
What is Hadoop?
The Hadoop App Store
HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra
Chukwa
Intel
Sync
Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop
Scribe Tez Vertica Whirr ZooKee Cloudera Horton MapR EMC
IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper
Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
Data Storage
Hadoop Distributed File System
HDFS Architecture
Data Processing
MapReduce
Typical large-data problem
MapReduce Flow
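Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
Map: (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)
Combine: (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)
Shuffle and sort: a → [1, 3], b → [2, 7], c → [2, 8, 9]
Reduce: (a, 4) (b, 9) (c, 19)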
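The flow above can be replayed in plain Java, without any Hadoop classes. This is only a minimal single-process sketch: the class name MapReduceFlowSketch and the split of the map output into four mappers with two pairs each are assumptions made for illustration, and the combiner and reducer are both assumed to sum values, which matches the numbers on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceFlowSketch {

    /** Combiner: pre-aggregates ONE mapper's output locally by summing values per key. */
    public static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        sums.forEach((key, sum) -> combined.add(Map.entry(key, sum)));
        return combined;
    }

    /** Shuffle and sort: groups the values from all mappers by key, keys in sorted order. */
    public static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> outputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : outputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reducer: sums the grouped values of each key. */
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    /** Runs combine, shuffle, and reduce on the map output from the slide. */
    public static Map<String, Integer> run() {
        // Assumption: four mappers, each emitting two pairs.
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2)),
                List.of(Map.entry("c", 3), Map.entry("c", 6)),
                List.of(Map.entry("a", 3), Map.entry("c", 2)),
                List.of(Map.entry("b", 7), Map.entry("c", 8)));
        List<List<Map.Entry<String, Integer>>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only sees one mapper's output, which is why (c, 3) and (c, 6) collapse to (c, 9) while (a, 1) and (a, 3) stay separate until the shuffle.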
Jobs & Tasks
Combined Hadoop Architecture
Word Count Mapper in Java
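import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}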
Word Count Reducer in Java
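import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all counts collected for one word and emits (word, total).
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            // Typed iterator: no cast needed.
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}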
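To try the word-count logic without a cluster, the same tokenize-and-sum steps can be run in a single JVM. The sketch below is not part of the original deck and does not use the Hadoop API; the class name LocalWordCount and the sample lines are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    /** Counts whitespace-separated tokens across all lines, keys in sorted order. */
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: same tokenization as the mapper, every token counts as 1.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // "Reduce" step: add the 1 to the running total for this word.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        System.out.println(count(List.of("to be or not to be", "to do")));
        // prints {be=2, do=1, not=1, or=1, to=3}
    }
}
```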
Scripting for Hadoop
Apache Pig
Pig in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
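For comparison, here is what the script computes, written as a single-machine Java sketch over in-memory rows. It is not a MapReduce job and not from the original deck: the class name TopSitesSketch, the String[] row layout, and the sample data are assumptions for illustration. It also assumes user names are unique, so the join reduces to a set lookup, and it breaks ties between equally clicked URLs alphabetically, which the Pig script does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopSitesSketch {

    /**
     * users: rows of {name, age}; pages: rows of {user, url}.
     * Returns at most `limit` (url, clicks) entries, most clicked first.
     */
    public static List<Map.Entry<String, Long>> topSites(
            List<String[]> users, List<String[]> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = new HashSet<>();
        for (String[] user : users) {
            int age = Integer.parseInt(user[1].trim());
            if (age >= 18 && age <= 50) {
                filteredUsers.add(user[0]);
            }
        }
        // JOIN filteredUsers BY name, pages BY user; GROUP BY url; COUNT
        Map<String, Long> clicks = new HashMap<>();
        for (String[] page : pages) {
            if (filteredUsers.contains(page[0])) {
                clicks.merge(page[1], 1L, Long::sum);
            }
        }
        // ORDER summed BY clicks DESC (ties by url), then LIMIT
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(clicks.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows.
        List<String[]> users = List.of(
                new String[] {"ann", "25"}, new String[] {"bob", "17"}, new String[] {"cy", "40"});
        List<String[]> pages = List.of(
                new String[] {"ann", "x.com"}, new String[] {"ann", "y.com"},
                new String[] {"bob", "x.com"}, new String[] {"cy", "x.com"});
        for (Map.Entry<String, Long> site : topSites(users, pages, 10)) {
            System.out.println(site.getKey() + "\t" + site.getValue());
        }
    }
}
```

This fits in memory only for small inputs; the point of Pig is that the same few lines of script are compiled into MapReduce jobs that scale across a cluster.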
Pig Execution Plan
Try that with Java…
SQL for Hadoop
Apache Hive
Hive in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting Query
Hive Architecture
Hive Example
CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;
Bringing it all together…
Online Advertising
Getting started…
Hortonworks Sandbox
Hadoop Training
The end…or the beginning?

Introduction to the hadoop ecosystem by Uwe Seiler