Hadoop and Big Data Training
Lessons learned
0
What’s Cloudera?
 Leading company in the NoSQL and cloud computing space
 Most popular Hadoop distribution
 Staffed with alumni of Google, Facebook, Oracle and other
leading tech companies
 Sample billion-dollar-company client list: eBay, JPMorgan Chase,
Experian, Groupon, Morgan Stanley, Nokia, Orbitz, National
Cancer Institute, RIM, The Walt Disney Company
 Consulting and training services
1
Why this training?
 MongoDB is great for OLTP
 Not an OLAP DB, not really aspiring to become one
 Big Data coming in, need for more advanced analysis
processes
2
Intended audience
 Software engineers and friends
3
What’s Hadoop?
 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models
 Modules:
 Hadoop Common
 Hadoop Distributed File System (HDFS™)
 Hadoop YARN
 Hadoop MapReduce
4
How does it fit in our Big Goal?
 MongoDB for OLTP
 RDBMS (MySQL) for config data
 Hadoop for OLAP
5
What’s MapReduce?
 MapReduce is a programming model for processing large data
sets, and the name of an implementation of the model by
Google. MapReduce is typically used to do distributed
computing on clusters of computers. (Wikipedia)
 Practically?
 Can perform computations in a distributed fashion
 Highly scalable
 Inherently highly available
 By design fault tolerant
6
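The model is easy to see in miniature. The sketch below is plain Java, not the Hadoop API — all class and method names are invented for illustration. It runs a word count through the three stages: map emits (word, 1) pairs, the framework shuffles (groups) them by key, and reduce sums per key.

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce model (NOT the Hadoop API).
public class WordCountModel {
    // map: one input line -> list of (key, value) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // reduce: (key, [values]) -> sum of values for that key
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // the "framework": map every line, shuffle (group by key), then reduce
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
        Map<String, Integer> out = new LinkedHashMap<>();
        shuffled.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }
}
```

In Hadoop the same map and reduce logic is distributed across the cluster, which is where the scalability and fault tolerance above come from.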
Bindings
 Native Java
 Any language, even scripting ones, via Hadoop Streaming
7
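Streaming’s contract is just stdin/stdout: a mapper is any executable that reads input records line by line and writes `key<TAB>value` lines. A minimal sketch (kept in Java for consistency with the rest of the deck, although Streaming exists precisely so you can write this in Python, Ruby, etc.; the class and method names are ours):

```java
import java.io.*;
import java.util.*;

// Sketch of a Hadoop Streaming word-count mapper: stdin in, "word\t1" out.
public class StreamingMapper {
    // core logic kept separate from I/O so it is easy to test
    static List<String> mapLine(String line) {
        List<String> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+"))
            if (!w.isEmpty()) out.add(w + "\t1");
        return out;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; )
            for (String rec : mapLine(line)) System.out.println(rec);
    }
}
```

It would be wired up with the streaming jar’s `-mapper`/`-reducer` options; the jar’s exact path varies by distribution.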
MapReduce framework vs. MapReduce functionality
 Several NoSQL technologies provide MR functionality
8
MR functionality
 A compromise…
 e.g. MongoDB
 CouchDB, where even the equivalent of select * from foo; goes
through MR views
9
MapReduce V1 vs MapReduce V2
 MR V1 cannot scale past 4k nodes per cluster
 More importantly for our goals, MR V1 is monolithic
10
MR V2 (YARN)
 Pluggable implementations on top of Hadoop
 Whole new set of problems can be solved:
 Graph processing
 MPI
11
MR V1 Architecture
12
MR V1 daemons
 client
 NameNode (HDFS)
 JobTracker
 DataNode(HDFS) + TaskTracker
13
MR V2 Architecture
14
MR V2 daemons
 Client
 ResourceManager (with ApplicationsManager)
 NodeManager
 ApplicationMaster (negotiates resource containers)
15
Data Locality in Hadoop
 First replica placed on the client node (or a random node for an
off-cluster client)
 Second off-rack
 Third in same rack as second but different node
16
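The placement policy above can be sketched as a small function. Everything here — the `Node` record, the deterministic “first match” selection — is invented for illustration; the real NameNode also weighs node load and randomizes choices within racks.

```java
import java.util.*;

// Illustrative sketch of the default HDFS replica placement policy.
public class ReplicaPlacement {
    record Node(String name, String rack) {}

    static List<Node> place(Node client, List<Node> cluster, Random rnd) {
        List<Node> replicas = new ArrayList<>();
        // 1st replica: the writing client's own node, or random if off-cluster
        Node first = (client != null && cluster.contains(client))
                ? client
                : cluster.get(rnd.nextInt(cluster.size()));
        replicas.add(first);
        // 2nd replica: a node on a different rack (off-rack)
        Node second = cluster.stream()
                .filter(n -> !n.rack().equals(first.rack()))
                .findFirst().orElseThrow();
        replicas.add(second);
        // 3rd replica: same rack as the 2nd, but a different node
        Node third = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .findFirst().orElseThrow();
        replicas.add(third);
        return replicas;
    }
}
```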
HDFS - Architecture
 Hot
 Very large files
 Streaming data access (seek time ~<1% transfer time)
 Commodity hardware (no iphones…)
 Not
 Low-latency data access
 Lots of small files
 Multiple writers, arbitrary file modification
17
HDFS – NameNode
 Namenode Master
 Filesystem tree
 Metadata for all files and directories
 Namespace image and edit log
 Secondary Namenode
 Not a backup node!
 Periodically merges edit log into namespace image
 Could take 30 mins to come back online
18
HDFS HA - NameNode
 Hadoop 2.x brings in HDFS HA
 Active-standby config for NameNodes
 Gotchas:
 Shared storage for the edit log
 Datanodes send block reports to both NameNodes
 Failover needs to be transparent to clients
19
HDFS – Read
20
HDFS - Read
 Client requests file from namenode (for first 10 blocks)
 Namenode returns addresses of datanodes
 Client contacts the datanodes directly
 Blocks are read in order
21
HDFS - Write
22
HDFS - Write
 Initial RPC call to create the file
 Permissions / file-exists checks in the NameNode etc.
 As data is written it queues in the client, which asks the
NameNode for datanodes to store it on
 The list of datanodes forms a pipeline
 An ack queue verifies all replicas have been written
 Close file
23
Job Configuration
 setInputFormatClass
 setOutputFormatClass
 setMapperClass
 setReducerClass
 set(Map)OutputKeyClass
 set(Map)OutputValueClass
 setNumReduceTasks
24
Job Configuration
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true); // blocks until the job finishes (true = log progress)
OR job.submit(); // non-blocking alternative
25
Job Configuration
 Simple to invoke:
 bin/hadoop jar WordCount inputPath outputPath
26
Map Reduce phases
27
Mapper – Life cycle
 Mapper inputs <K1,V1> outputs <K2,V2>
28
Shuffle and Sort
 All values for the same key are guaranteed to end up in the
same reducer, sorted by key
 Mapper output <K2,V2>: <‘the’,1>, <‘the’,2>, <‘cat’,1>
 Reducer input <K2,[V2]>: <‘cat’,[1]>, <‘the’,[1,2]>
29
Reducer – Life cycle
 Reducer inputs <K2, [V2]> outputs <K3, V3>
30
Hadoop interfaces and classes
 >=0.23 new API favoring abstract classes
 <0.23 old API with interfaces
 Packages mapred.* OLD API, mapreduce.* NEW API
31
Speculative execution
 After a task (map or reduce) has run for at least one minute,
the JobTracker decides based on the task’s progress
 Each task’s progress is compared to the average progress,
against a configurable threshold
 Laggards are relaunched on a different node and the copies race…
 Sometimes not wanted
 Cluster utilization
 Non idempotent partial output (OutputCollector)
32
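The heuristic boils down to a predicate like the following; the names and the single progress threshold are a simplification of what the JobTracker actually tracks:

```java
public class SpeculationCheck {
    // A task becomes a speculation candidate only after a minimum runtime,
    // and only if it lags the average progress by more than the threshold.
    static boolean shouldSpeculate(double taskProgress, double avgProgress,
                                   long runtimeMillis, double threshold) {
        return runtimeMillis >= 60_000
                && (avgProgress - taskProgress) > threshold;
    }
}
```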
Input Output Formats
 InputFormat<K,V> → FileInputFormat<K,V> → TextInputFormat,
KeyValueTextInputFormat, SequenceFileInputFormat
 Default TextInputFormat: key = byte offset, value = line
 KeyValueTextInputFormat (key \t value)
 SequenceFileInputFormat: binary, splittable format
 Corresponding output formats
33
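What “key = byte offset, value = line” means, sketched for a small in-memory “file” (simplified: assumes single-byte characters and \n line endings, unlike the real TextInputFormat, which handles arbitrary encodings and split boundaries; the class name is ours):

```java
import java.util.*;

public class TextRecords {
    // key = byte offset of the line within the file, value = the line itself
    static List<Map.Entry<Long, String>> records(String fileContents) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            out.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the '\n' terminator
        }
        return out;
    }
}
```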
Compression
 The billion files problem
 300 B/file × 10^9 files → 300 GB of NameNode RAM
 Big Data storage
 Solutions:
 Containers
 Compression
34
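The arithmetic behind the billion-files problem: every file, directory and block is an object in NameNode heap (the slide uses ~300 B per file as a round figure), so metadata cost scales with object count, not data size:

```java
public class NameNodeHeap {
    // rough heap needed for file metadata, in decimal gigabytes
    static double heapGigabytes(long files, long bytesPerEntry) {
        return files * (double) bytesPerEntry / 1e9;
    }
}
```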
Containers
 HAR (splittable)
 Sequence Files, RC files, Avro files (splittable, compressible)
35
Compression codecs
 LZO, LZ4 and Snappy codecs offer the best value for money in
compression speed
 Bzip2 offers native splitting but can be slow
36
Long story short
 Compression + sequence files
 Compression that supports splitting
 Split file into chunks in application layer with chunk size
aligned to HDFS block size
 Don’t bother
37
Partitioner
 Default is HashPartitioner
 Why implement our own partitioner?
 Sample case: Total ordering
 1 reducer
 Multiple reducers?
38
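The default HashPartitioner’s routing rule fits in one line; the mask clears the sign bit so negative `hashCode()` values still land in `[0, numReduceTasks)` (a sketch of the behaviour, not Hadoop’s source):

```java
public class HashPartition {
    // route a key to one of numReduceTasks reducers by hash
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```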
Partitioner
 TotalOrderPartitioner
 Samples the input to compute partition boundaries, spreading
key ranges evenly across reducers for maximum performance
39
Hadoop Ecosystem
 Pig
 Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them, and
applying functions to records or groups of records.
 Procedural language, lazy evaluated, pipeline split support
 Closer to developers (or relational algebra aficionados) than
not
40
Hadoop Ecosystem
 Hive
 Access to hadoop clusters for non developers
 Data analysts, data scientists, statisticians, SDMs etc
 Subset of SQL-92 plus Hive extensions
 Insert overwrite, no update or delete
 No transactions
 No indexes, parallel scanning
 “Near” real time
 Only equality joins
41
Hadoop Ecosystem
 Mahout
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
42
Hadoop ecosystem
 Algorithmic categories:
 Classification
 Clustering
 Pattern mining
 Regression
 Dimension reduction
 Recommendation engines
 Vector similarity
…
43
Reporting Services
 Pentaho, MicroStrategy and Jasper can all hook up to a Hadoop
cluster
44
References
 Hadoop: The Definitive Guide, 3rd edition
 hadoop.apache.org
 Hadoop in Practice
 Cloudera custom training slides
45
Editor's Notes

  • MR functionality slide (#10): Combiners are invoked by design in MongoDB
  • Partitioner slide (#39): 1 reducer is the default config