Presented By
 Framework for running applications on large clusters of commodity
hardware
 Scale: petabytes of data on thousands of nodes
 Includes
 Storage: HDFS
 Processing: MapReduce
 Support the Map/Reduce programming model
 Requirements
 Economy: use a cluster of commodity computers
 Easy to use
 Users: no need to deal with the complexity of distributed computing
 Reliable: can handle node failures automatically
http://www.kellytechno.com
 Implemented in Java
 Apache Top Level Project
 http://hadoop.apache.org/core/
 Core (15 Committers)
 HDFS
 MapReduce
 Community of contributors is growing
 Though mostly Yahoo for HDFS and MapReduce
 You can contribute too!
 Commodity HW
 Add inexpensive servers
 Storage servers and their disks are not assumed to be highly reliable and
available
 Use replication across servers to deal with unreliable storage/servers
 Metadata-data separation - simple design
 Namenode maintains metadata
 Datanodes manage storage
 Slightly Restricted file semantics
 Focus is mostly sequential access
 Single writers
 No file locking features
 Support for moving computation close to data
 Servers have 2 purposes: data storage and computation
 Single ‘storage + compute’ cluster vs. Separate clusters
[Figure: input data is stored in the Hadoop cluster as DFS blocks 1-3, each shown three times (replicas); three MAP tasks run over the blocks and one Reduce task produces the results.]
 Data is organized into files and directories
 Files are divided into uniform sized blocks and
distributed across cluster nodes
 Blocks are replicated to handle hardware
failure
 Filesystem keeps checksums of data for
corruption detection and recovery
 HDFS exposes block placement so that
computation can be migrated to data
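The points above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the 8-byte block size, the round-robin placement rule, the datanode names and the use of CRC32 are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 8  # hypothetical and tiny on purpose; real HDFS blocks are far larger


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into uniform blocks; only the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assumed round-robin rule: replica r of block b goes to node (b + r) mod N."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def find_corrupt_blocks(blocks, checksums):
    """Return indexes of blocks whose CRC32 no longer matches the stored value."""
    return [
        i for i, (block, expected) in enumerate(zip(blocks, checksums))
        if zlib.crc32(block) != expected
    ]


data = b"abcdefghijklmnopqrst"
blocks = split_into_blocks(data)
checksums = [zlib.crc32(b) for b in blocks]  # stored alongside the data
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
```

Because `placement` records which nodes hold each block, a scheduler could use it to run computation on a node that already stores the data.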
NameNode(Filename, replicationFactor, block-ids, …)
/users/user1/data/part-0, r:2, {1,3}, …
/users/user1/data/part-1, r:3, {2,4,5}, …
[Figure: Datanodes holding the block replicas; blocks 1 and 3 appear on two datanodes each, blocks 2, 4 and 5 on three each.]
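The metadata shown on this slide could be modelled roughly as follows. The file paths, replication factors and block ids come from the slide; the class name, the `locate` helper and the datanode names (`dn1` to `dn5`) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """One namespace record: replication factor plus ordered block ids."""
    replication: int
    block_ids: list


# File name -> metadata, as listed on the slide.
namespace = {
    "/users/user1/data/part-0": FileEntry(2, [1, 3]),
    "/users/user1/data/part-1": FileEntry(3, [2, 4, 5]),
}

# Block id -> datanodes holding a replica (datanode names are made up).
block_locations = {
    1: ["dn1", "dn2"],
    3: ["dn2", "dn3"],
    2: ["dn1", "dn3", "dn4"],
    4: ["dn2", "dn4", "dn5"],
    5: ["dn3", "dn4", "dn5"],
}


def locate(path):
    """Answer a client's metadata request: which blocks, and where they live."""
    entry = namespace[path]
    return [(block_id, block_locations[block_id]) for block_id in entry.block_ids]
```

A client would call something like `locate("/users/user1/data/part-0")` once and then fetch the blocks from the listed datanodes directly.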
 Master-Slave architecture
 DFS Master “Namenode”
 Manages the filesystem namespace
 Maintains the mapping from file name to list of blocks and their locations
 Manages block allocation/replication
 Checkpoints namespace and journals namespace changes
for reliability
 Controls access to the namespace
 DFS Slaves “Datanodes” handle block storage
 Store blocks using the underlying OS’s files
 Clients access the blocks directly from datanodes
 Periodically send block reports to the Namenode
 Periodically check block integrity
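As a rough sketch of the division of labour above, the Namenode can rebuild its view of block locations from the datanodes' periodic block reports and compare replica counts with each block's target. The function names, report format and datanode names below are assumptions for illustration, not the actual Hadoop protocol.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Invert {datanode: [block ids]} reports into {block id: {datanodes}}."""
    block_map = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return dict(block_map)


def replication_status(block_map, expected):
    """Compare reported replica counts with the target for each block.

    Returns (under, over): block id -> number of replicas missing or surplus.
    """
    under, over = {}, {}
    for block_id, wanted in expected.items():
        have = len(block_map.get(block_id, ()))
        if have < wanted:
            under[block_id] = wanted - have
        elif have > wanted:
            over[block_id] = have - wanted
    return under, over


# Hypothetical reports from four datanodes.
reports = {"dn1": [1, 2], "dn2": [1, 2, 3], "dn3": [2, 3], "dn4": [2]}
block_map = build_block_map(reports)
under, over = replication_status(block_map, expected={1: 2, 2: 3, 3: 3})
```

In this example block 3 is one replica short and block 2 has one replica too many, so a real system would schedule a copy for the former and a deletion for the latter.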
[Figure: HDFS architecture. Labels: Client, Namenode, Datanodes (Rack 1, Rack 2); Metadata ops; Metadata (Name, #replicas, …): /users/foo/data, 3, …; Block ops; Read; Write; Replication; Blocks.]
 A file’s replication factor can be set per file (default 3)
 Block placement is rack aware
 Guarantee placement on two racks
 1st replica is on the local node, 2nd/3rd replicas are on a remote rack
 Avoid hot spots: balance I/O traffic
 Writes are pipelined to block replicas
 Minimize bandwidth usage
 Overlapping disk writes and network writes
 Reads are from the nearest replica
 Block under-replication & over-replication are detected by the Namenode
 Balancer application rebalances blocks to balance DN utilization
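The rack-aware placement rule above (first replica on the writer's node, the others on a remote rack) can be sketched as below. This is a simplified, deterministic illustration: the function name and topology format are assumptions, and a real placement policy would also weigh load and free space and would typically choose nodes at random.

```python
def choose_replicas(writer, topology, replication=3):
    """Pick target nodes for one block.

    topology maps node name -> rack name. The first replica stays on the
    writer's node; the rest go to nodes of one remote rack when available.
    """
    local_rack = topology[writer]
    targets = [writer]

    # Group nodes by rack in a stable order so the result is reproducible.
    racks = {}
    for node in sorted(topology):
        racks.setdefault(topology[node], []).append(node)

    remote_racks = [rack for rack in sorted(racks) if rack != local_rack]
    if remote_racks:
        for node in racks[remote_racks[0]]:
            if len(targets) < replication:
                targets.append(node)

    # Fall back to any unused node if the remote rack was too small.
    for node in sorted(topology):
        if len(targets) >= replication:
            break
        if node not in targets:
            targets.append(node)
    return targets


topology = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
targets = choose_replicas("n1", topology)
```

With this topology the block ends up on two racks, so losing either rack still leaves at least one replica readable.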
 Scale cluster size
 Scale number of clients
 Scale namespace size (total number of files,
amount of data)
 Possible solutions
 Multiple namenodes
 Read-only secondary namenode
 Separate cluster management and namespace management
 Dynamically partitioned namespace
 Mounting
 Map/Reduce is a programming model for efficient
distributed computing
 It works like a Unix pipeline:
 cat input | grep | sort | uniq -c | cat > output
 Input | Map | Shuffle & Sort | Reduce | Output
 A simple model but good for a lot of applications
 Log processing
 Web index building
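The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be imitated in memory with a few lines of Python. This is a single-process sketch, not the Hadoop API; `run_mapreduce`, the log-line format and the status-code example are all hypothetical and stand in for the log-processing use case mentioned above.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle & sort, and reduce over an in-memory list of records."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle & sort: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: one call per key with all of that key's values.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def status_mapper(log_line):
    """Emit (status code, 1) for a line like 'GET /index.html 200'."""
    status = log_line.split()[-1]
    yield status, 1


def count_reducer(key, values):
    return key, sum(values)


logs = ["GET /a 200", "GET /b 404", "GET /c 200"]
result = run_mapreduce(logs, status_mapper, count_reducer)
```

The grouping step plays the role of `sort | uniq -c` in the Unix analogy: all values for one key arrive together at the reducer.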
 Mapper
 Input: value: lines of input text
 Output: key: word, value: 1
 Reducer
 Input: key: word, value: set of counts
 Output: key: word, value: sum
 Launching program
 Defines the job
 Submits job to cluster
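The three pieces of this word-count example (mapper, reducer, launching program) might look like the following in plain Python. The `Job` class and `submit` function are hypothetical stand-ins that run everything locally; they are not Hadoop's job API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


def mapper(line):
    """Input: one line of text. Output: (word, 1) for every word."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Input: a word and all of its counts. Output: (word, sum)."""
    return word, sum(counts)


@dataclass
class Job:
    """What the launching program defines before submitting."""
    name: str
    mapper: Callable
    reducer: Callable


def submit(job, lines):
    """Stand-in for cluster submission: run the job in this process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in job.mapper(line):
            grouped[key].append(value)
    return dict(job.reducer(key, values) for key, values in sorted(grouped.items()))


job = Job(name="wordcount", mapper=mapper, reducer=reducer)
counts = submit(job, ["the quick fox", "the lazy dog the end"])
```

On a real cluster the mapper calls would run in parallel over different input blocks, and the framework would handle the grouping between the two phases.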
 Fine grained Map and Reduce tasks
 Improved load balancing
 Faster recovery from failed tasks
 Automatic re-execution on failure
 In a large cluster, some nodes are always slow or flaky
 Framework re-executes failed tasks
 Locality optimizations
 With large data, bandwidth to data is a problem
 Map-Reduce + HDFS is a very effective solution
 Map-Reduce queries HDFS for locations of input data
 Map tasks are scheduled close to the inputs when
possible
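Two of the properties above, scheduling map tasks close to their input and re-executing failed tasks, can be sketched as follows. The function names, the three locality labels and the retry limit of 3 are assumptions for illustration, not Hadoop's actual scheduler or defaults.

```python
def pick_node(block_hosts, free_nodes, topology):
    """Choose a free node for a map task, preferring nodes near the input block.

    block_hosts: set of nodes storing a replica of the input block.
    topology: node name -> rack name.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    # Best case: a free node that already stores the block.
    for node in free_nodes:
        if node in block_hosts:
            return node, "node-local"
    # Next best: a free node on the same rack as some replica.
    host_racks = {topology[host] for host in block_hosts}
    for node in free_nodes:
        if topology[node] in host_racks:
            return node, "rack-local"
    # Otherwise the data has to cross racks.
    return free_nodes[0], "off-rack"


def run_with_retries(task, max_attempts=3):
    """Re-execute a task until it succeeds or the attempt budget runs out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError as err:
            last_error = err  # a real framework would retry on another node
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


topology = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
```

Keeping tasks fine grained makes both functions cheap to apply: a retried task redoes only a small slice of the work, and there are more chances to find a node close to each block.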
• Hadoop Wiki
– Introduction
• http://hadoop.apache.org/core/
– Getting Started
• http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– Map/Reduce Overview
• http://wiki.apache.org/hadoop/HadoopMapReduce
– DFS
• http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Javadoc
– http://hadoop.apache.org/core/docs/current/api/index.html
Thank you!
Hadoop training in bangalore-kellytechnologies


Editor's Notes

  • #9: 60M objects on 16G machine (e.g. 20M files with 2 blocks each)
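The arithmetic behind this note can be checked quickly. The sketch below assumes that "16G" means 16 GiB of memory and that every file and every block counts as one namespace object; the resulting bytes-per-object figure is therefore only an upper bound on the budget, not a measured value.

```python
def namespace_objects(files, blocks_per_file):
    """Count one object per file plus one per block."""
    return files + files * blocks_per_file


objects = namespace_objects(files=20_000_000, blocks_per_file=2)

memory_bytes = 16 * 2**30  # assumption: "16G" read as 16 GiB
bytes_per_object = memory_bytes / objects  # budget if all memory held metadata
```

Under these assumptions 20M files with 2 blocks each give 60M objects, leaving somewhat under 300 bytes of memory per object, which suggests why namespace size is listed as a scaling concern for a single Namenode.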