Introduction to Hadoop V2

• Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io

• Managing director of Monokkel AS
• Former COO of Integrasco AS
• Persistence, processing and presentation of data
Persistence – Processing – Presentation

Bombshell
If you work with data today and do not start
learning the Hadoop ecosystem, you may be
unemployed soon

Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop - Architecture
– HDFS
– YARN
– MapReduce
• Languages and ecosystem

What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production

Big Data – hype and hipsters

Big Data – Let’s add some letters
• Volume
• Variety
• Velocity
• Variability
• Veracity / Data quality
and the step-brother
• Complexity

The Nordic Hotel Tycoon
1600 Hotels in 5 countries

I am a digital champion:
The website

The desk

The external provider

I am a digital champion
The IoT case

I am a digital champion
Social

Houston, we have a problem
• Sales are declining and my stock price is
tumbling

How can the CEO manage his
problem?
• Get control over the data
• Implement analytical
processes to aid sales

The data he needs to handle
• Volume – Gigabytes/Terabytes
• Variety – Click stream, Voice, emails, sensor data,
social data, different languages, timestamp data,
transactional data, third party data
• Variability – Various quality
• Velocity – MB per second
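Velocity and volume are linked: a steady stream measured in MB per second adds up to the gigabyte and terabyte range quickly. A minimal arithmetic sketch in Python, assuming a hypothetical sustained rate of 1 MB per second (the slides give no exact figure) and decimal units:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_volume_gb(mb_per_second):
    """Convert a sustained ingest rate in MB/s to GB per day (decimal units)."""
    return mb_per_second * SECONDS_PER_DAY / 1000

# Assumed rate for illustration only: 1 MB per second.
daily_gb = daily_volume_gb(1)
yearly_tb = daily_gb * 365 / 1000

print(f"{daily_gb:.1f} GB per day, {yearly_tb:.1f} TB per year")
```

At that assumed rate the stream produces roughly 86 GB per day and about 31.5 TB per year.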

The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models

How to handle?
Web
Emails
Sensors
Social
Processing
RDBMS
Search

How to understand?
Web
Emails
Sensors
Social
Processing
RDBMS
Search

So what does Hadoop solve?
Processing

What is Hadoop?
An operating system for data

Distributions
• ”Stable” compilation of the Hadoop Ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security

HADOOP
An operating system for data
Layman’s terms
• Store huge files (unstructured) on many
machines
• Query and modify data
• Can run sophisticated analytics on top

How to start:
Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <commandline arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMWare Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)

DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch
NEXT: ARCHITECTURE AND HOW IT WORKS

DEMO
• Hortonworks Sandbox
• Hortonworks Ambari
• Hortonworks Hue

Hadoop - Architecture
HDFS
YARN
MapReduce

2.X.X
• Hadoop Distributed File System (HDFS)
• YARN (Yet Another Resource Negotiator)
• MapReduce

HDFS
Client
Name Node
Failover Name Node
Data Nodes: D1, D2 … DX

HDFS
Name Node block index:
B: 1, D1
B: 2, D2
B: 3, D3
B: 4, D1
B: 5, D2
B: 6, D3
Data Nodes: D1, D2, D3
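The block index on this slide can be pictured as a plain lookup table from block number to data node. A minimal Python sketch, assuming round-robin placement as the numbering on the slide suggests; `build_block_index` is a hypothetical helper, not part of any Hadoop API:

```python
from itertools import cycle

def build_block_index(num_blocks, data_nodes):
    """Assign block numbers 1..num_blocks to data nodes in round-robin order."""
    nodes = cycle(data_nodes)
    return {block: next(nodes) for block in range(1, num_blocks + 1)}

# Six blocks spread over three data nodes, as on the slide.
block_index = build_block_index(6, ["D1", "D2", "D3"])
for block, node in block_index.items():
    print(f"B: {block}, {node}")
```

The point of the sketch is only that the name node holds metadata (which block lives where), while the data nodes hold the block contents.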

HDFS Write
Client
/path/to/document1, R:2, B:{1,2}
Name Node
I need to write a document!

HDFS Write
Client
Name Node
I need to write

HDFS Write
Client
Name Node
You can write to: D1, D2, D3
Data Nodes: D1, D2, D3

HDFS Write
Client
Name Node
Data Nodes: D1, D2, D3
Split and write
B:{D1:1,D2:2}
B:{D3:3,D1:4}
B:{D2:5,D3:6}

HDFS Write
Client
Name Node
Data Nodes: D1, D2, D3
Replicate B:1 to D2:2
Success
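The write path (ask the name node, split the file into blocks, write, replicate) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how HDFS is implemented: the function names are hypothetical, the 16-byte block size is chosen only so the example splits (Hadoop 2 defaults to 128 MB blocks), and replicas are placed by simple rotation rather than by HDFS's rack-aware policy.

```python
def split_into_blocks(data, block_size):
    """Cut the payload into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, data_nodes, replication):
    """Give every block `replication` distinct data nodes, rotating the start node."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    placement = {}
    for index in range(num_blocks):
        placement[index + 1] = [
            data_nodes[(index + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
    return placement

document = b"Deer Bear River\nCar Car River\nDeer Car Bear\n"
blocks = split_into_blocks(document, block_size=16)
placement = place_replicas(len(blocks), ["D1", "D2", "D3"], replication=2)

for block_id, nodes in placement.items():
    print(f"B:{block_id} -> {nodes}")
```

With a replication factor of 2 (the R:2 on the earlier slide), every block ends up on two different data nodes, so losing one node does not lose the block.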

HDFS Read
Client
Name Node
Data Nodes: D1, D2, D3
B:{D3:3,D3:6}
B:{D2:2,D2:5}
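Reading works the other way round: the client asks the name node where the blocks are, then fetches each block from a data node that holds it. A minimal Python sketch, assuming in-memory dictionaries stand in for the name node metadata and the data nodes; `read_file`, `namespace` and `data_nodes` are hypothetical names:

```python
def read_file(path, namespace, data_nodes):
    """Reassemble a file by fetching each block from the first reachable replica."""
    chunks = []
    for block_id, replicas in namespace[path]:
        for node in replicas:
            store = data_nodes.get(node)
            if store is not None and block_id in store:
                chunks.append(store[block_id])
                break
        else:
            # No replica of this block could be reached.
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)

# Name node metadata: ordered list of (block id, replica locations) per path.
namespace = {"/path/to/document1": [(1, ["D1", "D2"]), (2, ["D2", "D3"])]}

# D1 is missing here to simulate a failed node; its replica on D2 is used instead.
data_nodes = {
    "D2": {1: b"Deer Bear ", 2: b"River"},
    "D3": {2: b"River"},
}

content = read_file("/path/to/document1", namespace, data_nodes)
print(content)
```

The read still succeeds with D1 down, which is the practical benefit of the replication step shown on the write slides.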

HDFS Delete/Update
• HDFS blocks are immutable: you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted
blocks
• Small files consume a lot of name node memory
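Why small files hurt can be shown with simple arithmetic. The sketch below assumes the commonly cited rule of thumb of roughly 150 bytes of name node memory per file or block object; that figure is an assumption for illustration, not a measured value, and `name_node_bytes` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure

def name_node_bytes(num_files, blocks_per_file=1):
    """Estimate name node memory: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Ten million small files with one block each ...
small = name_node_bytes(10_000_000)
# ... versus 1,000 large files with 80 blocks each.
large = name_node_bytes(1_000, blocks_per_file=80)

print(f"small files: {small / 1e9:.1f} GB of name node memory")
print(f"large files: {large / 1e6:.2f} MB of name node memory")
```

Under these assumptions the many small files need about 3 GB of name node memory, while the few large files need about 12 MB, even though the large files can hold a comparable amount of data.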

HDFS Scalability
Name Node
Failover Name Node
Data Nodes: D1, D2 … DX

YARN
HOW DOES HADOOP PROCESS
THE DATA STORED IN HDFS?

YARN
Client
Resource Manager
Scheduler
Applications manager
I want to process file
“document1” with
my-app.jar!

YARN
Resource Manager
Scheduler
You can process on D1!

YARN
D1 D2
Node Manager Node Manager
Resource Manager
Scheduler
Start my-app.jar
Application Master

YARN
D1 D2
Resource Manager
Scheduler
Application Master
AM to RM: “document1” is
located on D1 and D2 and I
need X GB RAM

YARN
D1 D2
Application Master Container
Resource Manager
Scheduler
my-app.jar is running here!
Start my-app.jar

YARN + HDFS
Data Nodes: D1, D2, D3
Name Node
Client
Client
Client
• YARN will try to make sure data is processed where it is stored
• … data locality
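Data locality can be sketched as a scheduling preference: first try a node that already stores the block, and only fall back to another node when none of those has capacity. A minimal Python sketch under that assumption; `pick_node` and the slot counts are hypothetical, and real YARN scheduling also weighs rack locality and queue capacity.

```python
def pick_node(block_locations, free_slots):
    """Prefer a node that stores the block; otherwise take any node with capacity.

    Returns (node, is_local); node is None if the cluster is full.
    """
    for node in block_locations:
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False

# Hypothetical cluster state: D1 is busy, D2 and D3 have free containers.
free_slots = {"D1": 0, "D2": 1, "D3": 2}

# The block lives on D1 and D2: D1 is full, so the local choice is D2.
node, is_local = pick_node(["D1", "D2"], free_slots)

# The block lives only on D1: the task must run remotely and read over the network.
remote_node, remote_is_local = pick_node(["D1"], free_slots)

print(node, is_local, remote_node, remote_is_local)
```

Running the computation where the data already sits avoids moving blocks across the network, which is what the slide means by processing data where it is stored.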

YARN + HDFS
• Blocks are immutable. This enables high write speeds
• Data is schema free! You can store any data you want
• Data locality is what differentiates HDFS from other data
storage
• You can read massive amounts of data only limited by
disk read speeds

MapReduce and others
OK… BUT HOW DO I PROCESS?

YARN
Tez MapReduce <Name here>
Libraries: Mahout, MLlib, GraphX, Oryx
Languages: Hive, Pig, R, Spark SQL, Stinger

YARN
Tez <Name here>
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx
MapReduce

Document
Deer Bear River
Car Car River
Deer Car Bear
Document stored in HDFS

Splitting
Input:
Deer Bear River
Car Car River
Deer Car Bear
Splits:
Deer Bear River
Car Car River
Deer Car Bear

Mapping
Deer Bear River
Car Car River
Deer Car Bear
Deer 1
Bear 1
River 1
Car 1
Car 1
River 1
Deer 1
Car 1
Bear 1

Shuffling
Deer 1
Bear 1
River 1
Deer 1
Car 1
Bear 1
Car 1
Car 1
River 1
Deer 1
Deer 1
Bear 1
Bear 1
Car 1
Car 1
Car 1
River 1
River 1

Reduce
Deer 1
Deer 1
Bear 1
Bear 1
Car 1
Car 1
Car 1
River 1
River 1
Deer 2
Bear 2
Car 3
River 2
Deer 2
Bear 2
Car 3
River 2
HDFS
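The four steps (splitting, mapping, shuffling, reducing) can be tried out locally in plain Python. This is a single-process sketch of the idea, not the Hadoop MapReduce API; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffling: group all emitted values by key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reducing: sum the grouped values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Splitting: one split per line of the document.
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(mapped))

for word, total in counts.items():
    print(word, total)
```

For the three input lines this gives Deer 2, Bear 2, River 2 and Car 3. In Hadoop the map and reduce functions run in parallel on different nodes, and the shuffle moves data between them over the network.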

How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out

MapReduce
• Mappers and reducers are distributed in YARN
containers
• Chaining MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSLs instead

YARN
Tez MapReduce <Name here>
Libraries: Mahout, Crunch, MLlib, GraphX, Oryx

PIG
• Procedural language
• Executes on YARN
• Great for
• Structuring
• Moving
• Transforming

Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for
• Column data / Database dumps
• Aggregations
• Connect BI tools and Dashboards
• Data Warehouse for Hadoop++
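What "declarative" means in practice: you state the result you want and the engine decides how to compute it. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in so the query can be run without a cluster; it is not Hive, Drill or Spark SQL, and the `bookings` table and its rows are hypothetical. An aggregation of this shape is the kind of query those engines are typically used for.

```python
import sqlite3

# In-memory database as a stand-in for a table backed by files in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (country TEXT, hotel TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [
        ("NO", "Hotel A", 1200.0),  # hypothetical rows
        ("NO", "Hotel B", 800.0),
        ("SE", "Hotel C", 1500.0),
    ],
)

# Declarative aggregation: no loops, no explicit map or reduce steps.
rows = conn.execute(
    "SELECT country, COUNT(*) AS bookings, SUM(revenue) AS total "
    "FROM bookings GROUP BY country ORDER BY country"
).fetchall()
conn.close()

for country, bookings, total in rows:
    print(country, bookings, total)
```

The GROUP BY here plays the same role as the shuffle and reduce steps in MapReduce, but the query author never has to write them.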

Spark
• Processing engine (runs on YARN or standalone)
• Great for
• Anything that MapReduce can do
• Analytics, Machine Learning
• In-memory processing, with APIs in Java, Scala and
Python

Summary
• Hadoop is designed to handle/process massive amounts of data
through HDFS and/or YARN
• The data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data
extraction, data management, data analysis and data integration
• It is most convenient to begin with Hadoop by testing a distribution,
e.g. Hortonworks, Cloudera or MapR
• Learn MapReduce, learn the languages and learn a few
integration tools
