Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Introducing Hadoop
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Hadoop




HDFS Architecture




Namenode/Datanode, JobTracker/TaskTracker




MapReduce




ZK Namespace




Essential HBase Schema




Multi-dimensional View




A Map/Hash View

{
  "row_key_1" : {
    "name" : {
      "first_name" : "Jolly", "last_name" : "Goodfellow"
    },
    "location" : { "zip": "94301" }
  }
}




Architectural View (HBase)




The Persistence Mechanism




The underlying file format




Installing & Setting up Hadoop

• Required software: Java 1.6.x, ssh + sshd


• Download


• Install


• Configure


   • single-node


   • pseudo-distributed


   • cluster




Download

• Source: http://hadoop.apache.org/


• Version:


   • 0.20.203.x -- current stable


   • 0.20.x -- previous stable


• Includes


   • Hadoop Common -- common utilities, HDFS, MapReduce




Install

• Extract: tar zxvf hadoop-0.20.203.0rc1.tar.gz


• Move & Create Symbolic Link


   • ln -s hadoop-0.20.203.0 hadoop


• On Windows


   • http://developer.yahoo.com/hadoop/tutorial/module3.html
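
The extract-and-link step above can be sketched as a small shell function. This is a minimal sketch, not the presenter's script: the function name `install_hadoop` and the destination argument are hypothetical, and it assumes the tarball unpacks into a directory named `hadoop-0.20.203.0`, as the `ln -s` line on the slide implies.

```shell
# Hypothetical helper: unpack a Hadoop tarball into a target directory
# and point a stable "hadoop" symlink at the unpacked release.
install_hadoop() {
    tarball=$1                        # e.g. hadoop-0.20.203.0rc1.tar.gz
    dest=$2                           # assumed install location
    release=${3:-hadoop-0.20.203.0}   # directory name inside the tarball

    mkdir -p "$dest" || return 1
    tar zxf "$tarball" -C "$dest" || return 1

    # -n replaces an existing "hadoop" link instead of descending into it.
    ( cd "$dest" && ln -sfn "$release" hadoop )
}

# Example (paths are assumptions):
# install_hadoop hadoop-0.20.203.0rc1.tar.gz "$HOME/opt"
```

Linking through a version-neutral name means later configuration and scripts can refer to `hadoop/` and survive an upgrade by re-pointing the link.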




Configure -- single-node

• Edit: conf/hadoop-env.sh


  • Set JAVA_HOME


• Default configuration is single-node


• Run bin/hadoop with no arguments (lists the command options)


• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
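
Setting JAVA_HOME might look like the following. It is a sketch under stated assumptions: the helper name `set_java_home` is hypothetical, and the JDK path in the usage comment is only a placeholder for wherever Java 1.6 lives on your machine.

```shell
# Hypothetical helper: append a JAVA_HOME export to conf/hadoop-env.sh.
set_java_home() {
    env_file=$1     # path to conf/hadoop-env.sh
    java_home=$2    # assumed JDK location; adjust for your machine

    # The file is sourced top to bottom, so an export appended at the end
    # overrides any earlier or commented-out setting.
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
}

# Example (the JDK path is an assumption):
# set_java_home conf/hadoop-env.sh /usr/lib/jvm/java-6-sun
# bin/hadoop      # with no arguments, prints the available commands
```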




Configure -- pseudo-distributed

• Edit: conf/core-site.xml (configure HDFS daemon)


• Edit: conf/hdfs-site.xml (configure HDFS replication factor)


• Edit: conf/mapred-site.xml (configure MapReduce JobTracker daemon)


• Enable ssh to localhost (without passphrase)


• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
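
The three edits above can be sketched as one shell function that writes the files. The property values (`hdfs://localhost:9000`, replication `1`, `localhost:9001`) follow the single-node setup guide referenced above rather than the slide itself; the helper name `write_pseudo_conf` is hypothetical, and the key type in the ssh comments is an assumption.

```shell
# Hypothetical helper: write the three pseudo-distributed config files.
write_pseudo_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1

    # core-site.xml: where clients find the HDFS NameNode.
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # hdfs-site.xml: one DataNode, so keep a single copy of each block.
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

    # mapred-site.xml: host and port of the JobTracker.
    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}

# Example:
# write_pseudo_conf conf
#
# Passphraseless ssh to localhost (run once by hand):
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# ssh localhost
```

Note that the function overwrites any existing files of the same names, so back up a customised `conf/` first.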




Start Hadoop
• Format HDFS: bin/hadoop namenode -format


• Start all daemons: bin/start-all.sh


• Verify logs


• Browse the web interface:


   • Namenode: http://localhost:50070/


   • JobTracker: http://localhost:50030/
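
Put together, the start-up sequence above might be scripted as follows. This is a sketch, with a hypothetical function name `start_hadoop`; it assumes the Hadoop directory contains `bin/hadoop` and `bin/start-all.sh` as in the 0.20.203 layout, and that daemon logs land in `logs/` under that directory.

```shell
# Hypothetical helper: format HDFS, start the daemons, show where to look next.
start_hadoop() {
    hadoop_home=$1
    (
        cd "$hadoop_home" || exit 1

        # Formatting wipes existing HDFS metadata: first-time setup only.
        bin/hadoop namenode -format || exit 1
        bin/start-all.sh || exit 1

        # Each daemon writes its own log file; list them if the directory exists.
        ls logs 2>/dev/null

        echo "NameNode UI:   http://localhost:50070/"
        echo "JobTracker UI: http://localhost:50030/"
    )
}

# Example (path is an assumption):
# start_hadoop "$HOME/opt/hadoop"
```

The subshell keeps the `cd` from changing the caller's working directory.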




Take Hadoop for a test-drive
• Run examples (hadoop-examples-0.20.203.0.jar)


• Grep using regular expressions


  • Copy files to HDFS: bin/hadoop fs -put bin input


  • Grep for files which have text beginning with ‘start’


  • Verify output on HDFS: bin/hadoop fs -cat output/*


  • Copy output to local filesystem & verify: bin/hadoop fs -get output output && cat output/*
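
The slide names the grep step without giving its command. A sketch of the whole test-drive follows, assuming the examples jar sits in the Hadoop directory and that its `grep` program takes an input directory, an output directory, and a regular expression. The function name `run_grep_example` and the default pattern `start[a-z.]*` are assumptions; the slide only says "text beginning with 'start'". As far as I know, the bundled grep example reports counts of matching strings rather than a list of matching files.

```shell
# Hypothetical helper: run the bundled grep example end to end.
run_grep_example() {
    hadoop_home=$1
    pattern=$2
    # Assumed regex for "text beginning with 'start'"; the slide gives none.
    [ -n "$pattern" ] || pattern='start[a-z.]*'
    (
        cd "$hadoop_home" || exit 1

        # Copy the local bin/ directory into HDFS as "input".
        bin/hadoop fs -put bin input || exit 1

        # Run the MapReduce grep job; "output" must not already exist in HDFS.
        bin/hadoop jar hadoop-examples-0.20.203.0.jar grep input output "$pattern" || exit 1

        # Quote the glob so HDFS expands it, not the local shell.
        bin/hadoop fs -cat 'output/*'
    )
}

# Example (path is an assumption):
# run_grep_example "$HOME/opt/hadoop"
```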




Configure -- cluster
• References:


• http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html (official documentation)


• http://developer.yahoo.com/hadoop/tutorial/module7.html (Managing a Hadoop Cluster. Source: YDN)


• http://wiki.datameer.com/display/DAS1/Hadoop+Cluster+Configuration+Tips




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com
