Introduction to Hadoop
Ron Sher
Agenda
• Big data - big issues
• Hadoop to the rescue
• Storage - HDFS
• Processing - MapReduce
• Hadoop ecosystem
Big Data - Big Issues
● Volume, Velocity, Variability
● Lots of data - logs, sensors, social, pictures, video, etc.
● May not fit a single machine
● Access to data is slow
● Hardware may fail
● Network errors happen
Hadoop to the rescue
• Distributed “operating system”
• Scalable - many servers of commodity hardware with lots of cores and disks
• Reliable - detect failures, redundant storage
• Fault-tolerant - auto-retry, self-healing
• Simple - use many servers as one really big computer
• Suitable for batch processing (throughput over
Storage - HDFS
• Hadoop Distributed File System
• Replicated (3 copies by default), fixed-size blocks (64 MB by default)
• Runs on large clusters of commodity machines
• Optimized for write once - read many throughput of large files
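The block and replication defaults above translate into simple arithmetic. The following is a minimal sketch in plain Java, not HDFS client code: the class name HdfsBlockMath and the 200 MB file size are assumptions made for illustration, while the 64 MB block size and replication factor of 3 are the defaults quoted on the slide.

```java
// Sketch only: block/replica arithmetic for one file, not the HDFS API.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is cut into: ceil(fileSize / blockSize).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster. The last block only occupies
    // as much space as the data it holds, so this is fileSize * replication.
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 200 * MB;   // hypothetical file
        long blockSize = 64 * MB;   // default quoted on the slide
        int replication = 3;        // default quoted on the slide

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks: " + blocks);                        // blocks: 4
        System.out.println("block replicas: " + blocks * replication);  // block replicas: 12
        System.out.println("raw MB: " + rawStorageBytes(fileSize, replication) / MB); // raw MB: 600
    }
}
```

Three full 64 MB blocks plus one 8 MB block, each stored three times, give 12 block replicas and 600 MB of raw storage for a 200 MB file.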
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png
Useful HDFS commands
• hdfs dfs -get <file name> - copy a file from HDFS to local
• hdfs dfs -put <file name> [destination] - copy a file from local to HDFS in the specified destination
• hdfs dfs -cat <file name> - print a file to stdout
• hdfs dfs -ls <dir name> - show all files under the specified directory
• hdfs dfs -mv <file name> <changed name> - rename a file
• hdfs dfs -rm <file name> - remove a file
• hdfs dfs -rmr <directory name> - remove a directory (deprecated in newer releases in favor of hdfs dfs -rm -r)
• hdfs dfs -mkdir <dir name> - create a directory
Processing - MapReduce
• A distributed data processing model and execution environment that runs on large clusters of commodity machines
• Responsible for running a job in parallel on many servers
• Handles re-trying a task that fails, validating complete results
• Computation moved to the data
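The "re-trying a task that fails" point can be illustrated with a small retry loop. This is a sketch in plain Java under stated assumptions, not how the Hadoop framework is implemented: TaskRetrySketch, runWithRetries, the simulated failure and the limit of 4 attempts are all hypothetical names and values chosen for the example.

```java
import java.util.concurrent.Callable;

// Sketch only: the general retry idea, not Hadoop's scheduler.
class TaskRetrySketch {
    // Runs the task until it succeeds or maxAttempts is used up,
    // then rethrows the last failure.
    static <T> T runWithRetries(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical flaky task: fails twice, then succeeds.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated node failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts"); // done after 3 attempts
    }
}
```

In a real cluster the framework would typically reschedule the failed task, possibly on another machine; the sketch only shows the control flow of retrying up to a limit.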
MapReduce Sample - Word Count
input

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini
MapReduce Sample - Word Count

input
Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

splitting
split 1: Ini Mini Miny
split 2: Mo Mo Miny
split 3: Ini Mo Mini

mapping
split 1: Ini, 1 | Mini, 1 | Miny, 1
split 2: Mo, 1 | Mo, 1 | Miny, 1
split 3: Ini, 1 | Mo, 1 | Mini, 1

shuffling
Ini, 1 | Ini, 1
Mini, 1 | Mini, 1
Miny, 1 | Miny, 1
Mo, 1 | Mo, 1 | Mo, 1

reducing
Ini, [1,1]
Mini, [1,1]
Miny, [1,1]
Mo, [1,1,1]

final result
Ini, 2
Mini, 2
Miny, 2
Mo, 3
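The stages of the word count sample can be traced in plain Java without a cluster. This is a sketch only: the class WordCountSimulation and its method names are made up for illustration and use no Hadoop APIs, and treating each input line as one split is an assumption taken from the diagram.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: an in-memory walk through split / map / shuffle / reduce.
class WordCountSimulation {
    // Map phase: emit a (word, 1) pair for every token in one split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, sorted by key.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Splitting: one input line per split (assumption taken from the diagram).
    static TreeMap<String, Integer> wordCount(String input) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : input.split("\n")) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        String input = "Ini Mini Miny\nMo Mo Miny\nIni Mo Mini";
        System.out.println(wordCount(input)); // {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```

On a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; here everything happens sequentially in one process, which is enough to check the intermediate values against the diagram.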
How a MapReduce Job Runs in Hadoop
http://answers.oreilly.com/uploads/monthly_10_2009/post-118-125676084924_thumb.png
Monitoring MR jobs (machine:50030)
Useful Commands
• mapred job -kill <job id> - kill a running job
• mapred job -status <job id> - show status of a job
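Word Count Mapper
// Imports go at the top of the enclosing source file; the class below is
// nested inside the job's driver class (old org.apache.hadoop.mapred API).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused for every token so no new objects are allocated per record.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // With the default TextInputFormat, key is the byte offset of the line
    // and value is the line itself.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);  // emit (word, 1)
        }
    }
}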
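Word Count Reducer
// Imports go at the top of the enclosing source file (old mapred API).
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per word with all the counts emitted for that word.
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));  // emit (word, total)
    }
}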
Hadoop Ecosystem
• Hive - SQL-like language over big data using MR
• HBase - distributed, column-oriented database
• ZooKeeper - coordination service
• Avro - cross-language serialization
• Pig - language for exploring big data
• Impala - SQL-like queries directly over HDFS
• Sqoop - tool for moving data from DBs to HDFS
• Mahout - machine learning and data mining library
Some resources
• Motivation about Hadoop and where it’s going: video and whitepaper
• HDFS Architecture Guide
• How MapReduce Works With Hadoop
• HDFS shell commands
• VM
• MapReduce tutorial

