Hadoop Ecosystem
Lior Sidi
Sep 2016
Hello!
I am Lior Sidi
Big data V’s
Volume
Velocity
Variety
What is Hadoop?
• Hadoop – open source implementation of MapReduce (MR)
• Runs MR jobs quickly and efficiently
Goal
Generating value from large datasets that cannot be analyzed using traditional technologies
Hadoop Concepts
Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model
Challenges and solution
• Ch1: Data access bottleneck
• Sol1: Store and process data on the same node
• Ch2: Distributed programming is difficult
• Sol2: Use high-level language APIs
Hadoop Timeline
• 2003 Oct – Google File System paper released
• 2004 Dec – MapReduce: Simplified Data Processing on Large Clusters
• 2006 Oct – Hadoop 1.0 released
• 2007 Oct – Yahoo Labs creates Pig
• 2008 Oct – Cloudera, a Hadoop distributor, is founded
• 2010 Sep – Hive and Pig graduate
• 2011 Jan – ZooKeeper graduates
• 2013 Mar – YARN deployed at Yahoo
• 2014 Feb – Apache Spark becomes a top-level Apache project
Hadoop Ecosystem
[Figure: Hadoop Ecosystem map, built up layer by layer through the deck – Storage, Resource Management, Processing, Coordination, Data Formats, Data Ingestion, Data Streaming, Analysis, Workflow, Search, Visualization, Cluster Management]
Storage / HDFS
• “Hadoop Distributed File System”
• Design:
  • Write once, read many times pattern
  • Cheap (commodity) hardware
  • High-throughput streaming data access (not designed for low-latency access)
• Concepts:
  • Block – a file is split into 128 MB blocks (default size), each replicated 3 times (default)
  • NameNode (Master) – one per cluster – maintains the file system namespace and block metadata (single point of failure)
  • DataNode (Worker) – one per node – stores and retrieves blocks
• Functions:
  • High availability – run a second (standby) NameNode
  • Block caching – a block is normally cached in only one DataNode
  • Locality – rack-aware, based on network topology
  • File permissions – POSIX-like r/w/x with owner, group, and mode for files and directories
  • Interfaces – HTTP (proxy/direct), Java API
  • Cluster balance – spread blocks evenly across the cluster
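The block and replication concepts above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming the 128 MB block size and replication factor 3 quoted on the slide; the function names and the round-robin placement are illustrative assumptions and do not reproduce the real rack-aware HDFS placement policy.

```python
BLOCK_SIZE_MB = 128   # block size quoted on the slide
REPLICATION = 3       # redundancy quoted on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data.
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement (assumption: NOT the real rack-aware policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(300)               # a hypothetical 300 MB file
raw_storage_mb = sum(blocks) * REPLICATION    # space used across the cluster
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With these assumptions a 300 MB file occupies three blocks, and every block lives on three different DataNodes, so losing one node never loses a block.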
[Figure: HDFS example – a data file is split into Block 1, Block 2 and Block 3; each block is stored on several DataNodes spread over Rack 1 and Rack 2; the NameNode tracks the blocks, and a client reaches the data through an HDFS proxy.]
A file is distributed and accessed on Hadoop HDFS
Resource Management / YARN
• “Yet Another Resource Negotiator”
• Manages and schedules cluster resources
• Daemons:
  • Resource Manager – one per cluster – manages resources across the cluster
  • Node Manager – one per node – launches and monitors containers
• Container – executes an application process
• Resource requests for containers:
  • Amount of compute resources (CPU & memory)
  • Locality (node/rack)
• Lifespan: one application per user job, or long-running apps shared by users
• Scheduling:
  • Allocate resources by policy: FIFO, Capacity (per organization), Fair
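The container request described above (an amount of CPU and memory plus a locality preference) can be sketched as a small allocation routine. This is a toy model, not the YARN API: the `Node` and `ContainerRequest` classes, the node names, and the "preferred node, then preferred rack, then anywhere" fallback order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class ContainerRequest:
    memory_mb: int
    vcores: int
    preferred_node: str
    preferred_rack: str


def fits(node, req):
    """True when the node has enough free memory and CPU for the request."""
    return node.free_memory_mb >= req.memory_mb and node.free_vcores >= req.vcores


def allocate(nodes, req):
    """Pick a node: preferred node first, then preferred rack, then anywhere."""
    tiers = [
        [n for n in nodes if n.name == req.preferred_node],
        [n for n in nodes if n.rack == req.preferred_rack],
        nodes,
    ]
    for tier in tiers:
        for node in tier:
            if fits(node, req):
                # Reserve the resources for the new container.
                node.free_memory_mb -= req.memory_mb
                node.free_vcores -= req.vcores
                return node.name
    return None  # no capacity right now; a real scheduler would queue the request


nodes = [
    Node("node1", "rack1", free_memory_mb=1024, free_vcores=1),
    Node("node2", "rack1", free_memory_mb=8192, free_vcores=4),
    Node("node3", "rack2", free_memory_mb=8192, free_vcores=4),
]
request = ContainerRequest(memory_mb=2048, vcores=1,
                           preferred_node="node1", preferred_rack="rack1")
chosen = allocate(nodes, request)
```

Here the preferred node is too small, so the request falls back to another node on the same rack.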
[Figure: YARN on a Hadoop cluster – a client application on the client node asks the ResourceManager (on the resource manager node) to launch a YARN app; NodeManagers on the NodeManager nodes launch Containers (one Master, two Workers) and send heartbeats.]
Job scheduling on top of a Hadoop cluster
Processing / MapReduce
• Simplifies development of large-scale, automatically parallelized, fault-tolerant data processing
• Origin – Google paper, 2004
• Batch processing
• Hadoop MR (classic MapReduce 1 daemons):
  • JobTracker – one per cluster – master process; schedules tasks on workers and monitors progress
  • TaskTracker – one per worker – executes map/reduce tasks locally and reports progress
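Processing / MapReduce
Example: counting names (Data → Map → Shuffle & Sort → Reduce)
• Data (three input splits):
  • Split 1: Lior, Ron, Lior
  • Split 2: Ron, Ron, Andrey
  • Split 3: Lior, Andrey, Lior
• Map output (Name, Count):
  • Split 1: (Lior, 1), (Ron, 1), (Lior, 1)
  • Split 2: (Ron, 1), (Ron, 1), (Andrey, 1)
  • Split 3: (Lior, 1), (Andrey, 1), (Lior, 1)
• Reduce output (Name, Count), after Shuffle & Sort:
  • (Lior, 4)
  • (Ron, 3)
  • (Andrey, 2)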
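The name-count example from the slide can be replayed in plain Python. This is a single-process sketch of the Map, Shuffle & Sort, and Reduce phases using the slide's data; the function names are illustrative, and nothing here uses the Hadoop API or runs in parallel.

```python
from collections import defaultdict

# The three input splits shown on the slide.
splits = [
    ["Lior", "Ron", "Lior"],
    ["Ron", "Ron", "Andrey"],
    ["Lior", "Andrey", "Lior"],
]


def map_phase(split):
    """Emit a (name, 1) pair for every record in one split."""
    return [(name, 1) for name in split]


def shuffle_and_sort(mapped):
    """Group all emitted values by key and order the keys."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


mapped = [map_phase(split) for split in splits]   # one map task per split
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster each map task would run on the node that stores its split, and the shuffle would move the pairs across the network to the reducers.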
[Figure: MapReduce on a Hadoop cluster – an MR program on the client node asks the ResourceManager (on the resource manager node) to launch a YARN app; NodeManagers on the NodeManager nodes launch Containers (one JobTracker, two TaskTrackers) and send heartbeats.]
MR job scheduling on top of a Hadoop cluster
Storage / HBase
• Distributed column-oriented database on top of HDFS
• Real-time random read/write access to large datasets
• Region – a table is split horizontally into regions by row
• Phoenix – SQL on HBase
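[Figure: HBase data model – each row is addressed by a RowKey and grouped into column families (Column Family 1, Column Family 2); each column (e.g. Col 1.1, Col 1.2, Col 1.3) stores one or more Version → Data cells.]
HBase Data Model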
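The data model above (row key, column family and column, versioned cells) maps naturally onto nested dictionaries. The following is a toy in-memory sketch, assuming integer timestamps as versions and a hypothetical `ToyTable` class; it is not the HBase client API, and the row and column names are made up.

```python
from collections import defaultdict


class ToyTable:
    """Row key -> "family:qualifier" -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self.rows[row_key][column]
        versions.append((timestamp, value))
        versions.sort(reverse=True)         # keep the newest version first
        del versions[self.max_versions:]    # drop versions beyond the limit

    def get(self, row_key, column, timestamp=None):
        """Return the newest value, or the newest value at or before `timestamp`."""
        for ts, value in self.rows[row_key][column]:
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = ToyTable(max_versions=2)
table.put("user1", "cf1:city", "Haifa", timestamp=1)
table.put("user1", "cf1:city", "Tel Aviv", timestamp=2)
table.put("user1", "cf1:city", "Beer Sheva", timestamp=3)

latest = table.get("user1", "cf1:city")
as_of_2 = table.get("user1", "cf1:city", timestamp=2)
as_of_1 = table.get("user1", "cf1:city", timestamp=1)   # oldest version was dropped
```

Keeping only a bounded number of versions per cell is the design choice sketched here; the limit of two is an arbitrary value for the example.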
Coordination / ZooKeeper
• Hadoop’s distributed coordination service
• Coordinates read/write actions on shared data
• Behaves like a highly available filesystem
• Implementation:
  • Data model:
    • Tree built from znodes (each holds up to 1 MB of data)
    • Znode – stores data, tracks data changes, and has an ACL (access control list)
  • Leader – performs writes and broadcasts updates
  • Follower – forwards write requests to the leader; updates are atomic
  • Lock service
  • User groups
  • Replicated mode
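The lock service mentioned above is commonly built from sequential znodes: every client creates a numbered child under a lock path, and the lowest number holds the lock. The sketch below imitates that idea with an in-memory dictionary. `ToyZooKeeper` is a hypothetical stand-in, not a client for a real ensemble, and it simplifies details such as ephemeral nodes, watches, and per-parent counters.

```python
class ToyZooKeeper:
    """In-memory stand-in for a znode tree (assumption: single process, no sessions)."""

    def __init__(self):
        self.znodes = {}       # path -> data
        self._sequence = 0     # simplified: one global counter

    def create(self, path, data=b"", sequential=False):
        if sequential:
            # Append a zero-padded, monotonically increasing suffix.
            path = f"{path}{self._sequence:010d}"
            self._sequence += 1
        self.znodes[path] = data
        return path

    def delete(self, path):
        del self.znodes[path]

    def children(self, parent):
        """Return the direct children of `parent`, sorted by path."""
        prefix = parent.rstrip("/") + "/"
        return sorted(
            path for path in self.znodes
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        )


def holds_lock(zk, lock_dir, my_path):
    """The client whose sequential znode sorts first holds the lock."""
    return zk.children(lock_dir)[0] == my_path


zk = ToyZooKeeper()
zk.create("/locks")
client_a = zk.create("/locks/lock-", sequential=True)
client_b = zk.create("/locks/lock-", sequential=True)

a_first = holds_lock(zk, "/locks", client_a)
b_waits = not holds_lock(zk, "/locks", client_b)
zk.delete(client_a)                                  # client A releases the lock
b_after_release = holds_lock(zk, "/locks", client_b)
```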
Coordination / ZooKeeper
[Figure: ZooKeeper coordination example – the ZooKeeper service consists of a Leader and two Followers, each holding the same znode tree (/, /HBase, /HDFS) with LOCK entries; HDFS (NameNode, DataNodes), HBase (HMaster, Regions) and other clients use the service.]
Row Based / Avro
• Language-neutral data serialization system
• Share data between programs written in many languages
• Splittable and sortable – allows easy MapReduce processing
• Rich schema resolution – flexible schemas
• Other row-based formats
  • SequenceFile – log file format
  • MapFile – sorted SequenceFile
Row Based / Avro
File Structure
• File: Header, Block 1, Block 2, ... Block N
• Header: identifier, Metadata (Schema & codec), SyncMarker
• Block: Count objs, Size objs, Serialized objs, SyncMarker
Schema
{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "firstName",
      "type": "string",
      "order": "descending"
    }, {
      "name": "age",
      "type": "int"
    }, {...
  ]
}
Parquet
• Columnar storage format
• Skip unneeded columns
• Fast queries & small size
• Efficient nested data store
File Structure
• File: Header, Block 1, Block 2, ... Block N, Footer
• Header: Magic Number
• Block: Column chunks, each made of Pages
• Footer: File Metadata
Schema
message Person {
  required binary name (UTF8);
  required int32 age;
  required group hobbies (LIST) {
    repeated binary array (UTF8);
  }
}
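The "skip unneeded columns" benefit can be made concrete by comparing how many values a query touches in a row layout and in a column layout. This is a pure-Python sketch with made-up records; it does not read or write Parquet files, and the helper names are assumptions.

```python
# Made-up records, following the Person schema shape used on the slide.
rows = [
    {"name": "Lior", "age": 30, "hobbies": ["chess"]},
    {"name": "Ron", "age": 40, "hobbies": ["running", "music"]},
    {"name": "Andrey", "age": 50, "hobbies": []},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def mean_age_row_layout(records):
    """Row layout: every field of every record is read to reach the ages."""
    values_read = sum(len(record) for record in records)
    mean = sum(record["age"] for record in records) / len(records)
    return mean, values_read


def mean_age_column_layout(columns):
    """Column layout: only the 'age' column is read."""
    ages = columns["age"]
    return sum(ages) / len(ages), len(ages)


columns = to_columns(rows)
row_mean, row_values_read = mean_age_row_layout(rows)
col_mean, col_values_read = mean_age_column_layout(columns)
```

Both layouts give the same answer, but the column layout reads a third of the values here. Storing similar values next to each other is also what lets columnar formats compress well.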
Data Integration / Sqoop
• Import/export structured data
• Sqoop connector – imports from and exports to a database (connects to relational databases)
• Sqoop 1 – command-line tool
• Sqoop 2 – service
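[Figure: Sqoop import and export – the Sqoop client reads table metadata from the database and launches MapReduce jobs on the Hadoop cluster; in the import job, map tasks copy the database table into HDFS; in the export job, map tasks write HDFS data back to the database table.]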
Data Integration / Flume
• Event-based data ingestion into Hadoop
• Flume agent components:
  • Sources – spoolingDir (creates events from files), Avro (RPC), HTTP (requests)
  • Channel
  • Sink – Avro, HDFS, HBase, Solr (near real time)
• Reliability – uses separate transactions
• Fan out – one source, many sinks
• Scale – agent tiers aggregate events from multiple sources
• Sink groups – failover and load balancing
Data Integration / Flume
[Figure: Flume topologies – a Flume agent (Source → Channel → Sink) moves data from a file system into Hadoop; Fan Out: one agent delivering to both HDFS and HBase; Scale: Tier 1 agents feed Tier 2 agents, which feed Tier 3 agents; Sink Grouping: several sinks grouped behind one channel.]
Data Integration / Kafka
• Distributed publish-subscribe messaging system
• Fast, scalable, durable
• Components:
  • Topics – categories (feeds) of messages
  • Producers – processes that publish messages to a topic
  • Consumers – processes that subscribe to a topic
  • Broker – Kafka servers in the cluster
• Distribution
  • Leader – handles reads and writes
  • Follower – replicates the leader
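The topic, producer, and consumer roles above can be sketched with an append-only log per topic and a read offset per consumer. `ToyBroker` is a hypothetical in-memory class for illustration only; it is not the Kafka client API and leaves out partitions, replication, and persistence.

```python
from collections import defaultdict


class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> append-only message log
        self.offsets = defaultdict(int)    # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Producer side: append to the topic log and return the message offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def poll(self, consumer, topic, max_messages=10):
        """Consumer side: read from this consumer's own offset and advance it."""
        start = self.offsets[(consumer, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
for event in ["click-1", "click-2", "click-3"]:
    broker.publish("clicks", event)

first_batch_a = broker.poll("consumer-a", "clicks", max_messages=2)
all_for_b = broker.poll("consumer-b", "clicks")      # independent offset
second_batch_a = broker.poll("consumer-a", "clicks")
```

Because messages stay in the log and each consumer tracks its own position, several subscribers can read the same topic at their own pace.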
Data Integration / Streaming
• Stream processing
• Kafka Streams – process and analyze data in Kafka
• Storm – real-time computation
• Spark Streaming – processes live data and can apply Spark MLlib and GraphX
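[Figure: Streaming pipeline – data flows through Flume Agent 1 and Flume Agent 2 into Kafka (Topic A, Topic B), is consumed by Spark Streaming and Storm, and lands in HDFS; the two paths are numbered 1 and 2.]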
Computation / Spark
• Cluster computing framework
• In-memory processing
• Languages: Scala, Java and Python
• RDD – Resilient Distributed Dataset
  • Read-only collection spread across the cluster
  • Transformations are lazy; computation happens only when an action is called
• DAG engine – schedules many transformations as one optimized job
• Spark context
  • Parallel jobs
  • Caching
  • Broadcast variables (data/functions)
• Cluster manager of executors:
  • Local, Standalone, Mesos, YARN
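The idea that transformations only run when an action is called can be shown without Spark itself. The sketch below is a toy, single-machine imitation: `LazyDataset` is a hypothetical class, not the RDD API, and its `map`, `filter`, and `collect` methods only borrow the names of the corresponding Spark operations.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, transformations=()):
        self._source = source
        self._transformations = tuple(transformations)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._transformations + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._transformations + (("filter", predicate),))

    def collect(self):
        """Action: only now is the recorded chain of transformations executed."""
        items = iter(self._source)
        for kind, fn in self._transformations:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)      # record when the function really runs
    return x * x


dataset = LazyDataset(range(5)).map(square).filter(lambda value: value % 2 == 0)
calls_before_action = list(calls)   # still empty: nothing has run yet
result = dataset.collect()          # the action triggers the computation
```

Deferring the work is what lets an engine see the whole chain of transformations and plan it as one job.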
[Figure: Spark on Hadoop – the Driver runs the Spark program and its SparkContext; a Job passes through the DAG Scheduler (stages), the Task Scheduler (tasks) and the Scheduler backend, which sends Tasks to the Executors.]
Scripting / Pig
• Data flow programming language – a MapReduce abstraction
• Supports: user-defined functions (UDF), streaming, nested data
• Does not support: random read/write
• Pig Latin – scripting language
  • Load, Store, Filter, Group, Join, Sort, Union and Split, UDF, Co-group
• Modes
  • Local – small datasets
  • MR mode – runs on the cluster
• Execution – script, Grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar
  • Crunch – MR pipelines in Java (no UDF)
Query / Hive
• Components
  • MetaStore – table descriptions
  • HiveQL – SQL dialect (SQL:2003)
• Table management
  • Warehouse directory
  • External tables
• Functionality
  • Bucketing and partitioning by column
  • Supports UDF and UDAF (aggregate)
  • Insert, Update, Delete:
    • Saved in delta files
    • Background MR jobs
    • (Available in a transaction context)
  • Lock table (avoid drop)
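Partitioning and bucketing by column can be pictured as a two-level layout: one directory per partition value, and within it a fixed number of buckets chosen by hashing a column. The sketch below assumes a hypothetical `logs` table partitioned by `dt` and bucketed by an integer `user_id`, and it uses a plain modulo in place of Hive's own hash function.

```python
from collections import defaultdict

WAREHOUSE_DIR = "/user/hive/warehouse"   # assumed warehouse location
NUM_BUCKETS = 4                          # assumed bucket count


def partition_path(warehouse_dir, table, partition_column, partition_value):
    """One directory per partition value, named column=value."""
    return f"{warehouse_dir}/{table}/{partition_column}={partition_value}"


def bucket_index(user_id, num_buckets):
    """Toy bucketing rule (assumption: modulo on an integer id)."""
    return user_id % num_buckets


# Made-up rows for the hypothetical `logs` table.
rows = [
    {"dt": "2016-09-01", "user_id": 17, "url": "/home"},
    {"dt": "2016-09-01", "user_id": 20, "url": "/cart"},
    {"dt": "2016-09-02", "user_id": 21, "url": "/home"},
]

# partition directory -> bucket number -> rows stored there
layout = defaultdict(lambda: defaultdict(list))
for row in rows:
    directory = partition_path(WAREHOUSE_DIR, "logs", "dt", row["dt"])
    layout[directory][bucket_index(row["user_id"], NUM_BUCKETS)].append(row)

partition_dirs = sorted(layout)
```

A query that filters on `dt` only needs to open the matching directory, and a join on `user_id` can pair up buckets with the same number.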
Query / Comparison
               | Hive          | Impala                        | SparkSQL (Shark)
Usage          | Batch         | BI & SQL analytics            | Procedural development
Speed          | Bad           | Best                          | OK
Implementation | MapReduce     | Dedicated daemons on DataNode | Memory
Similar tools  | Hive on Spark | Presto, Drill (SQL:2011)      |
Workflow / Oozie
• Schedules Hadoop jobs
• Job types:
  • Workflows – sequences of jobs defined as Directed Acyclic Graphs (DAGs)
  • Coordinator – triggers jobs by time or data availability
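A workflow defined as a DAG can be turned into an execution order with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) on a hypothetical workflow; the action names are assumptions, and real Oozie workflows are written in XML rather than Python.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must finish before it can run.
workflow = {
    "sqoop_import": {"start"},
    "fork": {"sqoop_import"},
    "pig_clean": {"fork"},
    "mr_aggregate": {"fork"},
    "join": {"pig_clean", "mr_aggregate"},
    "end": {"join"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()

batches = []
while sorter.is_active():
    # Every node returned together has all its predecessors done,
    # so the nodes of one batch could run in parallel.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)
```

The two actions between the fork and the join land in the same batch, which mirrors how a fork node starts parallel branches and a join node waits for all of them.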
[Figure: Example Oozie workflow – control-flow nodes (start, Fork, Join, End) connect action nodes (Sqoop, Pig, MR, Sub workflow, FS (HDFS), Email).]
Search / Solr
• Full-text search over Hadoop
• Near-real-time indexing
• REST API
• Based on the Apache Lucene Java search library
Visualization / Hue
• Open source web interface for analyzing data with Hadoop
• Applications:
  • File Browser: HDFS, HBase
  • Scheduling of jobs and workflows: Oozie
  • Job Browser: YARN
  • SQL: Hive, Impala
  • Data analysis: Pig, UDF
  • Dynamic Search: Solr
  • Notebooks: Spark
  • Data Transfer: Sqoop 2
Cluster Management / Cloudera
• 100% open source
• The most complete and tested distribution of Hadoop
• Integrates all Hadoop projects
• Express – free, end-to-end administration
• Enterprise – extra features and support
Cluster Management / Comparison
https://talendexpert.com/cloudera-vs-honworks-vs-mapr
Basic Cluster configuration
[Figure: Basic cluster configuration – three Master nodes and Other Servers host the NameNode, Secondary NameNode, Resource Manager, Standby Resource Manager, ZooKeeper, Cloudera Manager, Impala State, and the Hive, Sqoop and Spark gateways (GW); each Worker node runs a NodeManager, a DataNode and an Impala Daemon.]
Thanks!
Any questions?


Editor's Notes

  • #10: HDFS manages the file system across a network of machines. It is designed to store big files and follows a master-worker pattern. The NameNode maintains the directory tree; it does not keep block locations persistently but reconstructs them on reboot. The NameNode is the most important component in the cluster: when it is lost, all access to the cluster is lost. It is therefore possible to set up high availability, where we
  • #13: Designed to support MapReduce, but it is used for other operations as well