Introduction to Hadoop Ecosystem

A Definition For Your Daddy

$ hdfs dfs -ls /user/tiger
$ hdfs dfs -put songs.txt /user/tiger
$ hdfs dfs -cat /user/tiger/songs.txt
$ hdfs dfs -mkdir songs
$ hdfs dfs -mv songs.txt songs
$ hdfs dfs -rm -r songs

$ hdfs dfs -put songs.txt /user/tiger
Question?

Image source: http://pixgood.com/slicing-bread.html

$ hdfs dfs -cat /user/tiger/songs.txt
Question?

1. Offers compute resources such as CPU and RAM
2. Runs tasks of the applications submitted by users
3. Reports to the Master

1. Knows about all Slaves
2. Knows about available and occupied resources on each Slave
3. Schedules jobs submitted by clients
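
The Master and Slave roles described above can be sketched as a toy model. This is only an illustration, not YARN code: the class names `Master` and `Slave`, the first-fit placement rule and the node sizes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """A worker node offering CPU and RAM (hypothetical model, not a YARN class)."""
    name: str
    total_cores: int
    total_ram_gb: int
    used_cores: int = 0
    used_ram_gb: int = 0

    def free_cores(self) -> int:
        return self.total_cores - self.used_cores

    def free_ram_gb(self) -> int:
        return self.total_ram_gb - self.used_ram_gb


@dataclass
class Master:
    """Knows every slave and the available and occupied resources on each."""
    slaves: dict = field(default_factory=dict)

    def register(self, slave: Slave) -> None:
        # Slaves report to the master, which is how it learns about them.
        self.slaves[slave.name] = slave

    def schedule(self, cores: int, ram_gb: int):
        """Place a task on the first slave with enough free resources (first-fit, an assumption)."""
        for slave in self.slaves.values():
            if slave.free_cores() >= cores and slave.free_ram_gb() >= ram_gb:
                slave.used_cores += cores
                slave.used_ram_gb += ram_gb
                return slave.name
        return None  # nothing fits right now; a real scheduler would queue the request


master = Master()
master.register(Slave("slave-1", total_cores=4, total_ram_gb=8))
master.register(Slave("slave-2", total_cores=8, total_ram_gb=32))

first = master.schedule(cores=4, ram_gb=8)    # fits on slave-1
second = master.schedule(cores=2, ram_gb=4)   # slave-1 is now full, so slave-2 is used
third = master.schedule(cores=16, ram_gb=4)   # no slave has 16 free cores
```

The point of the sketch is that scheduling decisions need nothing more than the bookkeeping in points 1 and 2: which slaves exist and how much of each is still free.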

A user can submit any type of application that is supported by YARN

1. Started and overseen by the Resource Manager
2. Coordinates the execution of all tasks within an application
3. Asks for resources needed to run its tasks
4. Runs on the Node Manager
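
A rough sketch of the loop implied by the four points above: an application-level coordinator asks the Resource Manager for containers, runs its tasks in them, and gives the containers back. The classes `ResourceManager` and `ApplicationMaster` below are simplified stand-ins written for this example, not the YARN API, and running the tasks sequentially in "waves" is an assumption.

```python
class ResourceManager:
    """Hands out a limited number of containers (toy stand-in, not the YARN API)."""

    def __init__(self, free_containers: int):
        self.free_containers = free_containers

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are currently free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        # Containers are deleted when their tasks finish; capacity returns.
        self.free_containers += count


class ApplicationMaster:
    """Coordinates all tasks of a single application."""

    def __init__(self, rm: ResourceManager, tasks):
        self.rm = rm
        self.pending = list(tasks)
        self.results = []
        self.waves = 0

    def run(self):
        while self.pending:
            granted = self.rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            # Each task runs in its own container (executed sequentially here).
            self.results.extend(task() for task in batch)
            self.rm.release(granted)
            self.waves += 1
        return self.results


rm = ResourceManager(free_containers=2)
tasks = [lambda i=i: i * i for i in range(5)]   # five tiny hypothetical tasks
am = ApplicationMaster(rm, tasks)
results = am.run()
```

With two containers and five tasks, the application needs three rounds of allocation, which is one way to picture containers being created and deleted dynamically.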

Containers are dynamically created and deleted

Large volume of data
Computation, e.g. a JAR file

1. NodeManagers should be collocated with DataNodes
2. The Resource Manager tries to schedule tasks on a node which is the closest to the data
3. Large volumes of data don’t have to be sent over the network
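
The locality preference in point 2 can be illustrated with a small helper. This is a simplified sketch: the function `pick_node`, the node names and the two-level choice (node holding the data, otherwise any free node) are assumptions for the example; a real scheduler has more levels, such as preferring a node in the same rack.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one block.

    block_replicas: names of the nodes that store a copy of the block
    free_nodes:     set of node names that currently have spare capacity
    Returns (node, is_local); node is None when nothing is free.
    """
    # Best case: a node that already stores the data, so no network transfer.
    for node in block_replicas:
        if node in free_nodes:
            return node, True
    # Fallback: any free node; the block then has to be read over the network.
    if free_nodes:
        return min(free_nodes), False
    return None, False


local = pick_node(["node-2", "node-3"], {"node-1", "node-3"})
remote = pick_node(["node-2"], {"node-1", "node-4"})
nothing = pick_node(["node-2"], set())
```

Moving the computation (a small JAR file) to `node-3` is cheap; moving the data to the computation is what the design tries to avoid.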

Their reality
Their conclusion

SELECT trackid,
       COUNT(*) AS cnt
FROM stream
GROUP BY trackid
ORDER BY cnt DESC;

[Diagram: the query goes into "some magic", which executes it as MapReduce (MR) jobs on Hadoop and returns the results]

Some magic:
1. Parses query
2. Plans execution
3. Submits jobs
4. Monitors jobs
5. Returns results

SELECT trackid,
       COUNT(*) AS cnt
FROM stream
GROUP BY trackid
ORDER BY cnt DESC;

[Diagram: the query goes into Apache Hive, which executes it as MapReduce (MR) jobs on Hadoop and returns the results]

Apache Hive:
1. Parses query
2. Plans execution
3. Submits jobs
4. Monitors jobs
5. Returns results
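
To make the "MR, MR" part of the picture more concrete, here is a plain-Python sketch of how the query could be split into two MapReduce-style jobs: one that counts rows per `trackid` and one that sorts by the count. The sample rows and the function names `job1_count` and `job2_sort` are made up for the illustration, and the exact plan Hive produces for a given query may differ.

```python
from collections import defaultdict

# Hypothetical contents of the `stream` table: one row per played track.
stream = [
    {"trackid": 7}, {"trackid": 3}, {"trackid": 7},
    {"trackid": 9}, {"trackid": 7}, {"trackid": 3},
]


def job1_count(rows):
    """Job 1: GROUP BY trackid with COUNT(*)."""
    groups = defaultdict(list)
    for row in rows:
        # Map: emit (trackid, 1). Shuffle: group the values by key.
        groups[row["trackid"]].append(1)
    # Reduce: sum the ones for each key.
    return {trackid: sum(ones) for trackid, ones in groups.items()}


def job2_sort(counts):
    """Job 2: ORDER BY cnt DESC."""
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


counts = job1_count(stream)
results = job2_sort(counts)
for trackid, cnt in results:
    print(trackid, cnt)
```

Writing, submitting and monitoring these jobs by hand is the work the five steps above take off the user's shoulders.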

RDBMS: stores Hive metadata
Hive Metastore: manages metadata about databases, tables and views

[Diagram: Hive Shell CLI, Hive Metastore, RDBMS]

[Diagram: Hive Shell CLI, Beeswax (HUE), Beeline CLI and JDBC/ODBC clients; Hive Server 2; Hive Metastore; RDBMS]
Hive Server 2: acts as a proxy for "light" clients

[Diagram: Job 1 reads the dataset from HDFS and caches it in memory; Job 2 reads it from memory]
It is possible to cache a dataset in the cluster's (distributed) memory so that it can be read faster in subsequent jobs.
©
Job 1 Job 2
Great fit for
iterative algorithms
and interactive
queries!
HDFS
Read
Memory
Read
Cache In
Memory
Cache In
Memory
Possible to cache dataset
in cluster’s (distributed)
memory to read it faster
in next jobs
Memory
Read

Interactive queries: Input -> Distributed Memory -> Query 1, Query 2, Query 3
Iterative algorithms: Input -> Iteration 1 -> Iteration 2, with data kept in Distributed Memory between iterations
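
Why caching helps repeated queries can be shown with a toy model in plain Python. This is not Spark code: the class `CachedDataset`, the `load_stream` loader and the three "queries" are all hypothetical, and the only thing measured is how often the slow storage read happens.

```python
class CachedDataset:
    """Toy dataset that counts how often it is read from slow storage."""

    def __init__(self, loader, cache=False):
        self._loader = loader          # stands in for a read from HDFS
        self._cache_enabled = cache
        self._in_memory = None
        self.storage_reads = 0

    def rows(self):
        if self._in_memory is not None:
            return self._in_memory     # fast path: served from memory
        self.storage_reads += 1
        rows = self._loader()
        if self._cache_enabled:
            self._in_memory = rows     # keep the data for the next job
        return rows


def load_stream():
    # Hypothetical input: track ids of played songs.
    return [7, 3, 7, 9, 7, 3]


def run_queries(dataset):
    """Three interactive queries over the same input."""
    return [
        len(dataset.rows()),           # Query 1: number of plays
        len(set(dataset.rows())),      # Query 2: number of distinct tracks
        max(dataset.rows()),           # Query 3: highest track id
    ]


uncached = CachedDataset(load_stream, cache=False)
cached = CachedDataset(load_stream, cache=True)
uncached_results = run_queries(uncached)
cached_results = run_queries(cached)
```

Both variants return the same answers, but the uncached dataset goes back to storage for every query while the cached one does so only once. The same reasoning applies to each iteration of an iterative algorithm.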

[Diagram: Spark on YARN]
Client -> Resource Manager
NodeManager: YARN Container running the Spark Application Master and the Spark Driver
NodeManager: YARN Container running a Spark Executor with Spark Tasks
NodeManager: YARN Container running a Spark Executor with Spark Tasks

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 20g \
    --executor-cores 3 \
    lib/spark-examples*.jar \
    10
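
If the same submission has to be launched from a script, the command can be assembled as an argument list. This is a minimal sketch: the helper `spark_submit_command` is hypothetical, and the concrete jar name `lib/spark-examples.jar` is an assumption, because the wildcard in the command above is expanded by the shell and would not be expanded in an argument list.

```python
import shlex


def spark_submit_command(main_class, jar, app_args,
                         driver_memory="4g", executor_memory="20g",
                         executor_cores=3):
    """Build the spark-submit argument list for a YARN cluster-mode run."""
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        jar,
        *app_args,
    ]


cmd = spark_submit_command(
    "org.apache.spark.examples.SparkPi",
    "lib/spark-examples.jar",   # assumed concrete jar name (no shell glob here)
    ["10"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Spark is installed; that step is left out here because it needs a real cluster.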

Spark Core
Spark SQL
Spark Streaming (near real-time, micro-batch)
MLlib (machine learning)
GraphFrames (graph processing)
SparkR (R on Spark)
