Introduction to Hadoop
• Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io
• CEO of Monokkel AS
• Former COO of Integrasco AS
• Persistence, processing and presentation of data
Persistence – Processing – Presentation
Bombshell
If you work with data today and do not start
learning the Hadoop ecosystem, you may be
unemployed soon
Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop - Architecture
– HDFS
– YARN
– MapReduce
• Languages and ecosystem
What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production
Big Data
Big Data – hype and hipsters
Big Data
Big Data – Let’s add some letters
• Volume
• Variety
• Velocity
• Variability
• Veracity / Data quality
and the step-brother
• Complexity
Big Data – Example
The Nordic Hotel Tycoon
1600 Hotels in 5 countries
I am a digital champion:
The website
I am a digital champion:
The desk
I am a digital champion:
The external provider
I am a digital champion
The IoT case
I am a digital champion
Social
Houston we have a problem
• Sales is declining and my stock price is
tumbling
The CEO
How can the CEO manage his
problem?
• Get control over the data
• Implement analytical
processes to aid sales
The data he needs to handle
• Volume – Gigabytes/terabytes
• Variety – Click stream, Voice, emails, sensor data,
social data, different languages, timestamp data,
transactional data, third party data
• Variability – Various quality
• Velocity – MB per second
The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models
How to handle?
Web
Emails
Sensors
Social
Processing
RDBMS
Search
How to understand?
Web
Emails
Sensors
Social
Processing
RDBMS
Search
So what does Hadoop solve?
Processing
What is Hadoop?
An operating system for data
An OS needs software on top
Distributions
Distributions
• ”Stable” compilation of the Hadoop Ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security
Distributions
HADOOP
An operating system for data
Layman’s terms
• Store huge files (unstructured) on many
machines
• Query and modify data
• Can run sophisticated analytics on top
How to start:
Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <commandline arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMWare Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)
DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch
NEXT: ARCHITECTURE AND HOW IT WORKS
DEMO
• Hortonworks Sandbox
• Hortonworks Ambari
• Hortonworks Hue
Hadoop - Architecture
HDFS
YARN
MapReduce
2.X.X
• Hadoop Distributed File System (HDFS)
• YARN (Yet Another Resource Negotiator)
• MapReduce
HDFS
D1
D2
DX
Name
Node
Failover
Name
Node
Client
HDFS
Block index
D1
D2
D3
Data
Nodes
B: 1, D1
B: 2, D2
B: 3, D3
B: 4, D1
B: 5, D2
B: 6, D3
Name node
HDFS Write
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write a
document!
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write
/path/to/document1, R:2, B:{3,4}
/path/to/document1, R:2, B:{5,6}
HDFS Write
Client
Name
Node
You can write to:
D1, D2, D3
D1
D2
D3
Data Nodes
HDFS Write
Client
Name
Node
D1
D2
D3
B:{D2:5,D3:6}
B:{D3:3,D1:4}
B:{D1:1,D2:2}
Split and write
HDFS Write
Client
Name
Node
D1
D2
D3
Replicate
B:1 to
D2:2
Success
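The write path on the slides above can be sketched as a toy simulation. This is a minimal illustration, not HDFS code: the class `ToyNameNode`, the helper functions, the round-robin placement and the tiny block size are all assumptions made for readability. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from itertools import cycle


class ToyNameNode:
    """Hypothetical stand-in for the name node: it only keeps the block index."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_index = {}  # path -> list of (block_no, [nodes])
        self._next_node = cycle(range(len(self.data_nodes)))

    def allocate(self, path, num_blocks):
        """Pick `replication` distinct data nodes for every block of a file."""
        placements = []
        for block_no in range(1, num_blocks + 1):
            start = next(self._next_node)
            nodes = [self.data_nodes[(start + i) % len(self.data_nodes)]
                     for i in range(self.replication)]
            placements.append((block_no, nodes))
        self.block_index[path] = placements
        return placements


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def write_file(name_node, storage, path, data, block_size=4):
    """Client side: split the file, ask the name node where to write, then write."""
    blocks = split_into_blocks(data, block_size)
    placements = name_node.allocate(path, len(blocks))
    for (block_no, nodes), block in zip(placements, blocks):
        for node in nodes:  # the first node holds the primary copy, the rest replicas
            storage[node][(path, block_no)] = block
    return placements


# Three data nodes, modelled as plain dictionaries (an assumption of this sketch).
storage = {"D1": {}, "D2": {}, "D3": {}}
name_node = ToyNameNode(storage.keys(), replication=2)
placements = write_file(name_node, storage, "/path/to/document1", b"Deer Bear River")
for block_no, nodes in placements:
    print(f"B:{block_no} -> {nodes}")
```

The name node never touches the file contents in this sketch; it only records which data nodes hold which block, which is the role the block index plays on the slides.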
HDFS Read
Client
Name
Node
D1
D2
D3
B:{D3:3,D3:6}
B:{D2:2,D2:5}
HDFS Delete/Update
• HDFS blocks are immutable: you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted
blocks
• Small files consume a lot of name node memory
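Why small files are costly can be shown with back-of-the-envelope arithmetic. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of name node memory per file or block object, and a block size of 128 MB. Both numbers are assumptions for illustration, not values taken from the slides, and the function name is made up.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 ** 2    # assumed block size of 128 MB


def name_node_bytes(num_files, file_size):
    """Estimate name node memory: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


total = 1024 ** 4  # the same 1 TB of data in both scenarios
large = name_node_bytes(num_files=total // 1024 ** 3, file_size=1024 ** 3)
small = name_node_bytes(num_files=total // (100 * 1024), file_size=100 * 1024)

print(f"1 GB files:   {large / 1024:.0f} KB of name node memory")
print(f"100 KB files: {small / 1024 ** 2:.0f} MB of name node memory")
```

Under these assumptions the same volume of data needs a few thousand times more name node memory when it is stored as 100 KB files instead of 1 GB files.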
HDFS Scalability
D1
D2
DX
Name
Node
Failover
Name
Node
YARN
HOW DOES HADOOP PROCESS
THE DATA STORED IN HDFS?
YARN
Client
Resource Manager
Scheduler
Applications manager
I want to process file
“document1” with
my-app.jar?
YARN
Resource Manager
Scheduler
Applications manager
You can process on D1!
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Start my-app.jar
Application Master
YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Applications manager
Application Master
AM to RM: “document1” is
located on D1 and D2 and I
need X GB RAM
YARN
D1 D2
Node Manager Node Manager
Application Master Container
Resource Manager
Scheduler
Applications manager
my-app.jar is running here!
Start my-app.jar
YARN + HDFS
D1
D2
D3
Name
Node
Client
Client
Client
• YARN will try to make
sure data is processed
where it is stored
• ….. data locality
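Data locality can be illustrated with a toy scheduler. This is only a sketch: the function `pick_node`, the memory figures and the fallback rule are hypothetical and far simpler than the real YARN scheduler.

```python
def pick_node(block_locations, free_memory_gb, needed_gb):
    """Return (node, is_local), preferring a node that already stores the data.

    block_locations: nodes holding a replica of the block to process
    free_memory_gb:  mapping node -> free memory reported by its node manager
    """
    # First choice: a node that holds the data and has enough free memory.
    local = [n for n in block_locations if free_memory_gb.get(n, 0) >= needed_gb]
    if local:
        return max(local, key=lambda n: free_memory_gb[n]), True
    # Fallback: any node with room; the block must then travel over the network.
    remote = [n for n, free in free_memory_gb.items() if free >= needed_gb]
    if remote:
        return max(remote, key=lambda n: free_memory_gb[n]), False
    return None, False  # nothing fits right now, so the request has to wait


free = {"D1": 2, "D2": 8, "D3": 16}
print(pick_node(["D1", "D2"], free, needed_gb=4))  # D2 holds the data and has room
print(pick_node(["D1"], free, needed_gb=4))        # D1 is too full, so fall back
```

The point of the sketch is the ordering of the two choices: moving the computation to the data is tried first, and moving the data to the computation is only the fallback.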
YARN + HDFS
• Blocks are immutable. This enables high write speeds
• Data is schema free! You can store any data you want
• Data locality is what differentiates HDFS from other data
storage
• You can read massive amounts of data, limited only by
disk read speeds
MapReduce and others
OK… BUT HOW DO I
PROCESS ?
YARN
Tez MapReduce <Name here>
Libraries: Mahout, MLlib, GraphX, Oryx
Languages: Hive, Pig, R, Spark SQL, Stinger
YARN
Tez <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
MapReduce
MapReduce
Document
Deer Bear River
Car Car River
Deer Car Bear
Document stored in HDFS
Splitting
Deer Bear River
Deer Car Bear
Deer Bear River
Car Car River
Car Car River
Deer Car Bear
Mapping
Deer Bear River
Car Car River
Deer Car Bear
Deer 1
Bear 1
River 1
Car 1
Car 1
River 1
Deer 1
Car 1
Bear 1
Shuffling
Deer 1
Bear 1
River 1
Deer 1
Car 1
Bear 1
Car 1
Car 1
River 1
Deer 1
Deer 1
Bear 1
Bear 1
Car 1
Car 1
Car 1
River 1
River 1
Reduce
Deer 1
Deer 1
Bear 1
Bear 1
Car 1
Car 1
Car 1
River 1
River 1
Deer 2
Bear 2
Car 3
River 2
Deer 2
Bear 2
Car 3
River 2
HDFS
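The four phases above can be reproduced in a few lines of plain Python. This is a single-process sketch of the idea only: the function names are made up for illustration, nothing is distributed, and it does not use the Hadoop API.

```python
from collections import defaultdict

# The three-line example document from the slides.
document = "Deer Bear River\nCar Car River\nDeer Car Bear"


def split(text):
    """Splitting: one input split per line."""
    return text.splitlines()


def map_words(line):
    """Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffling: group all values that share the same key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Reduce: collapse the grouped values into one total per word."""
    return word, sum(counts)


mapped = [pair for line in split(document) for pair in map_words(line)]
result = dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())
print(result)
```

In a real job each split would be mapped in its own container and the shuffle would move data between machines; here the same steps simply run one after another.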
API: Mapper
interface
API: Reduce
interface
API: Main
How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out
MapReduce
• Mappers and reducers are distributed in YARN
containers
• Chaining of MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSL languages instead
Languages
YARN
Tez MapReduce <Name here>
Languages: Hive, Pig, R, Spark SQL, Stinger
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
”Languages”
PIG
• Procedural language
• Executes on YARN
• Great for
• Structuring
• Moving
• Transforming
Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for
• Column data / Database dumps
• Aggregations
• Connect BI tools and Dashboards
• Data Warehouse for Hadoop++
Spark
• Core language (runs in YARN or standalone)
• Great for
• Anything that MapReduce can do
• Analytics, Machine Learning
• In-memory processing, with APIs in Java, Scala and
Python
Summary
• Hadoop is designed to handle/process massive amounts of data
through HDFS and/or YARN
• The data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data
extraction, data management, data analysis and data integration
• It is most convenient to begin with Hadoop by testing distributions.
E.g. Hortonworks, Cloudera, MapR etc.
• Learn MapReduce and learn to understand languages and a few
integration tools
Is it a fad?


Editor's Notes

  • #3: Started out building distributed systems for large amounts of data. Jumped on Hadoop around 2009/2010, when it was supposed to solve every problem. Jumped off again. Jumping back on now.
  • #4: How many work with data? How many have worked with Hadoop? How many have worked with ElasticSearch? How many are consultants? How many are consultants/employees in industry/oil/manufacturing? How many are consultants/employees in commerce/retail/service/IT? How many are consultants/employees in the public sector?
  • #6: Anyone with Hadoop experience?
  • #9: This is what I associate with Big Data right now. A lot of buzz… but let's look at what is at its core and where Hadoop fits into the picture.
  • #10: Doug Laney, who introduced the three Vs (volume, velocity, variety) back in 2001
  • #11: LOTS OF BORING WORDS… LET'S TRY LOOKING AT AN EXAMPLE. Volume. Variety – many datasets. Velocity – the speed at which data is generated. Variability – data can be inconsistent and come in various forms. Veracity – quality of data. Complexity.
  • #12: Doug Laney, who introduced the three Vs (volume, velocity, variety) back in 2001
  • #13: Doug Laney, who introduced the three Vs (volume, velocity, variety) back in 2001
  • #14: Clickstream data, ratings
  • #15: Clickstream data, ratings
  • #16: External agreements on ratings and traffic
  • #17: - The housekeeper registers all activities - IoT
  • #18: - The housekeeper registers all activities - IoT
  • #20: Doug Laney, who introduced the three Vs (volume, velocity, variety) back in 2001
  • #29: An open source operating system for data
  • #38: 2002: Open source crawler Nutch by Doug Cutting and Mike Cafarella: the internet crawler. The web was at most 1 billion pages large. Limited scalability. 2003: Google releases its GFS paper on a massively distributed filesystem. Cutting and Cafarella incorporate the filesystem into Nutch. 2004: Google releases its MapReduce paper on massively parallel computing. This is incorporated into Nutch as well. 2006: Yahoo hires Doug Cutting, and the filesystem and MapReduce components are extracted from the Nutch project into the Hadoop project.
  • #39: 2002: Open source crawler Nutch by Doug Cutting and Mike Cafarella: the internet crawler. The web was at most 1 billion pages large. Limited scalability. 2003: Google releases its GFS paper on a massively distributed filesystem. Cutting and Cafarella incorporate the filesystem into Nutch. 2004: Google releases its MapReduce paper on massively parallel computing. This is incorporated into Nutch as well. 2006: Yahoo hires Doug Cutting, and the filesystem and MapReduce components are extracted from the Nutch project into the Hadoop project.
  • #40: 2008: Hadoop was storing all data; even financial data was trusted to Hadoop. 2008: Cloudera was the first commercial company to support Hadoop. 2011: 42 000 nodes storing petabytes of data. 2011: Hortonworks was spun out of Yahoo as a Hadoop company; it focuses only on the open source software that originated at Yahoo.
  • #41: 2011 – First feature-complete 1.0 version of Hadoop. MapReduce and HDFS are tightly integrated in 1.0 and earlier versions. 2013 – First large refactoring of the operating system. MapReduce is detached and Hadoop is generalized to handle different processing paradigms.
  • #45: Data Nodes contain disks only
  • #46: Data Nodes contain disks only
  • #57: The Scheduler allocates resources based on information available from the nodes. The ApplicationsManager tracks the state of all applications (managers) in the cluster.
  • #58: Node Managers constantly update the ResourceManager with the current resource situation. Node Managers start the ApplicationMaster and the Container. Application Masters negotiate resources and allocate more containers if allowed.
  • #59: Node Managers constantly update the ResourceManager with the current resource situation. Node Managers start the ApplicationMaster and the Container. Application Masters negotiate resources and allocate more containers if allowed.
  • #60: Application Masters negotiate resources and allocate more containers if allowed. CPU cores and memory are requested, along with the information that my file is located on D2. The application started by the Node Manager does not need to be written in Java.
  • #84: The big companies: Spotify, Google, Netflix = the disruptors