Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Introducing Hadoop
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Hadoop




HDFS Architecture




Namenode/Datanode, JobTracker/TaskTracker




MapReduce




ZK Namespace




Essential HBase Schema




Multi-dimensional View




A Map/Hash View

{
  "row_key_1" : {
    "name" : { "first_name" : "Jolly", "last_name" : "Goodfellow" },
    "location" : { "zip" : "94301" }
  }
}
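The map view above can be sketched as nested Python dictionaries: row key, then column family, then column qualifier, then value. This is only an illustration of the data model, not HBase client code. It assumes that `name` and `location` are both column families of `row_key_1`, and the `get_cell` helper is hypothetical.

```python
# In-memory illustration of the map/hash view (not an HBase API).
# Assumption: "name" and "location" are column families of "row_key_1".
table = {
    "row_key_1": {
        "name": {"first_name": "Jolly", "last_name": "Goodfellow"},
        "location": {"zip": "94301"},
    },
}


def get_cell(table, row_key, family, qualifier, default=None):
    """Look up one cell by row key, column family, and column qualifier.

    Missing rows, families, or qualifiers return `default` instead of raising.
    """
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


if __name__ == "__main__":
    print(get_cell(table, "row_key_1", "name", "first_name"))
    print(get_cell(table, "row_key_1", "location", "zip"))
    print(get_cell(table, "row_key_2", "name", "first_name", default="<none>"))
```

Because a missing row or column simply yields no entry, the nested-map picture also shows why sparse rows cost nothing to represent.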




Architectural View (HBase)




The Persistence Mechanism




The underlying file format




Installing & Setting up Hadoop

• Required software: Java 1.6.x, ssh + sshd


• Download


• Install


• Configure


   • single-node


   • pseudo-distributed


   • cluster




Download

• Source: http://hadoop.apache.org/


• Version:


   • 0.20.203.x -- current stable


   • 0.20.x -- previous stable


• Includes


   • Hadoop Common (shared utilities), HDFS, and MapReduce




Install

• Extract: tar zxvf hadoop-0.20.203.0rc1.tar.gz


• Move & Create Symbolic Link


   • ln -s hadoop-0.20.203.0 hadoop


• On Windows


   • http://developer.yahoo.com/hadoop/tutorial/module3.html




Configure -- single-node

• Edit: conf/hadoop-env.sh


  • Set JAVA_HOME


• Default configuration is single-node


• Run bin/hadoop with no arguments to list its command options


• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html




Configure -- pseudo-distributed

• Edit: conf/core-site.xml (configure HDFS daemon)


• Edit: conf/hdfs-site.xml (configure HDFS replication factor)


• Edit: conf/mapred-site.xml (configure MapReduce JobTracker daemon)


• Enable ssh to localhost (without passphrase)


• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
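As a rough sketch of what the three edits amount to, the script below renders minimal versions of the three files. The property names and values (`fs.default.name`, `dfs.replication`, `mapred.job.tracker`, `localhost:9000`, `localhost:9001`) are the ones the 0.20 single-node setup guide uses for pseudo-distributed mode; they do not appear on the slide, so treat them as assumptions and check them against the referenced page. The `render_site_xml` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed settings for pseudo-distributed mode (from the 0.20 setup guide,
# not from the slide). Adjust hosts and ports for your machine.
SETTINGS = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Return a Hadoop *-site.xml document as a string.

    Each entry becomes a <property> element with <name> and <value> children
    inside a single <configuration> root.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the files rather than writing them, so nothing is overwritten.
    for filename, properties in SETTINGS.items():
        print("# conf/" + filename)
        print(render_site_xml(properties))
        print()
```

A replication factor of 1 fits a single machine, where there is only one DataNode to hold each block.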




Start Hadoop
• Format HDFS: bin/hadoop namenode -format


• Start all daemons: bin/start-all.sh


• Verify logs


• Browse the web interface:


   • Namenode: http://localhost:50070/


   • JobTracker: http://localhost:50030/
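Checking the two web interfaces can also be scripted. The sketch below probes the URLs from the slide and reports whether each one answers; it assumes a local install on the default ports, and the `is_up` helper is hypothetical.

```python
import urllib.error
import urllib.request

# URLs taken from the slide; they assume default ports on a local install.
WEB_UIS = {
    "NameNode": "http://localhost:50070/",
    "JobTracker": "http://localhost:50030/",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers an HTTP request, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    for name, url in WEB_UIS.items():
        status = "up" if is_up(url) else "not reachable"
        print(name, status, url)
```

If a daemon is reported as not reachable, its log file is the next place to look.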




Take Hadoop for a test-drive
• Run examples (hadoop-examples-0.20.203.0.jar)


• Grep using regular expressions


  • Copy files to HDFS: bin/hadoop fs -put bin input


  • Run the grep example to find text beginning with ‘start’


  • Verify output on HDFS: bin/hadoop fs -cat output/*


  • Copy output to local filesystem & verify: bin/hadoop fs -get output output
    && cat output/*
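The bundled grep example is itself a MapReduce job: the map step emits each regex match with a count of 1, and the reduce step sums the counts per match. The in-process sketch below imitates that flow in plain Python to show the idea; it is not the Hadoop implementation, the function names are hypothetical, and the sample lines and the pattern `start[a-z.]*` are made up for illustration.

```python
import re
from collections import defaultdict


def map_phase(lines, pattern):
    """Emit a (matched_text, 1) pair for every regex match in every line."""
    regex = re.compile(pattern)
    for line in lines:
        for match in regex.finditer(line):
            yield match.group(0), 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def grep_count(lines, pattern):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    return reduce_phase(shuffle(map_phase(lines, pattern)))


if __name__ == "__main__":
    # Made-up input standing in for the files copied to HDFS.
    sample = [
        "start-all.sh starts every daemon",
        "run start-dfs.sh, then start-mapred.sh",
    ]
    for text, count in sorted(grep_count(sample, r"start[a-z.]*").items()):
        print(count, text)
```

On a cluster the map calls run in parallel on the nodes holding the input blocks, and the framework performs the grouping step; only the map and reduce logic is written by the user.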




Configure -- cluster
• References:


• http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html
  (official documentation)


• http://developer.yahoo.com/hadoop/tutorial/module7.html (Managing a
  Hadoop Cluster. Source: YDN)


• http://wiki.datameer.com/display/DAS1/Hadoop+Cluster+Configuration+Tips




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com
