Introduction to Hadoop Ecosystem
■ StreamRock™
A Definition For Your Daddy
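$ hdfs dfs -ls /user/tiger              # list the directory
$ hdfs dfs -put songs.txt /user/tiger   # upload a local file to HDFS
$ hdfs dfs -cat /user/tiger/songs.txt   # print the file's contents
$ hdfs dfs -mkdir songs                 # relative paths resolve against the user's HDFS home directory
$ hdfs dfs -mv songs.txt songs          # move the file into the new directory
$ hdfs dfs -rm -r songs                 # recursive delete (-rmr is deprecated)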
$ hdfs dfs -put songs.txt /user/tiger
Question?
Answer!
$ hdfs dfs -cat /user/tiger/songs.txt
Question?
Answer!
Slave (NodeManager)
1. Offers compute resources such as CPU and RAM
2. Runs tasks of the applications submitted by users
3. Reports to the Master

Master (ResourceManager)
1. Knows about all Slaves
2. Knows about available and occupied resources on each Slave
3. Schedules jobs submitted by clients

Client
A user can submit any type of application that is supported by YARN

Application Master
1. Started and overseen by the Resource Manager
2. Coordinates the execution of all tasks within an application
3. Asks for resources needed to run its tasks
4. Runs on the Node Manager

Containers
Containers are dynamically created and deleted
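
The bookkeeping described above (a master that knows each slave's free resources and creates and deletes containers on demand) can be sketched as a toy model. This is only an illustration, not YARN's API: the names `Node` and `ToyResourceManager` and the first-fit placement rule are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A slave node offering CPU cores and RAM (hypothetical model)."""
    name: str
    free_cores: int
    free_mem_gb: int


@dataclass
class ToyResourceManager:
    """Tracks free resources on every node and hands out containers."""
    nodes: list = field(default_factory=list)
    containers: dict = field(default_factory=dict)  # id -> (node, cores, mem)
    _next_id: int = 0

    def allocate(self, cores, mem_gb):
        """Create a container on the first node with enough free resources."""
        for node in self.nodes:
            if node.free_cores >= cores and node.free_mem_gb >= mem_gb:
                node.free_cores -= cores
                node.free_mem_gb -= mem_gb
                self._next_id += 1
                self.containers[self._next_id] = (node, cores, mem_gb)
                return self._next_id
        return None  # no node can satisfy the request right now

    def release(self, container_id):
        """Delete a container and return its resources to the node."""
        node, cores, mem_gb = self.containers.pop(container_id)
        node.free_cores += cores
        node.free_mem_gb += mem_gb


rm = ToyResourceManager(nodes=[Node("slave-1", 4, 16), Node("slave-2", 8, 32)])
am = rm.allocate(cores=1, mem_gb=2)    # container for an Application Master
task = rm.allocate(cores=6, mem_gb=8)  # only slave-2 has 6 free cores
rm.release(task)                       # container deleted, resources returned
```

The real ResourceManager uses pluggable schedulers with queues and priorities; first-fit is used here only to keep the sketch short.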
Large volume of data
Computation, e.g. a JAR file

1. NodeManagers should be collocated with DataNodes
2. The Resource Manager tries to schedule tasks on a node which is the closest to the data
3. Large volumes of data don’t have to be sent over the network
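
The "closest to the data" rule can be illustrated with a small sketch. It is a simplification under stated assumptions: the function `pick_node`, the node names and the rack map are hypothetical, and the preference order (same node, then same rack, then anywhere) is a reduced version of what a real scheduler does.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Return (node, locality) for the free node closest to the data.

    Preference order: a node that holds a replica of the block, then a node
    in the same rack as a replica, then any free node.
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None


# Hypothetical cluster: five nodes in three racks, one block with two replicas.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
replicas = ["n1", "n3"]

best = pick_node(replicas, ["n2", "n3"], rack_of)    # n3 holds a replica
same_rack = pick_node(replicas, ["n2"], rack_of)     # n2 shares rack r1 with n1
far_away = pick_node(replicas, ["n5"], rack_of)      # data must cross racks
```

Only the last case sends the block over the network between racks, which is what the list above aims to avoid.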
Their reality
Their conclusion
Query -> SOME MAGIC (APACHE HIVE) -> MR jobs on HADOOP -> Results

Execution
1. Parses query
2. Plans execution
3. Submits jobs
4. Monitors jobs
5. Returns results

SELECT trackid,
       COUNT(*) AS cnt
FROM stream
GROUP BY trackid
ORDER BY cnt DESC;
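
What the query on this slide computes can be sketched in plain Python. The rows of `stream` below are made up for illustration, and the sketch runs in one process rather than as MapReduce jobs; it only mirrors the logic of GROUP BY, COUNT(*) and ORDER BY.

```python
from collections import Counter

# Hypothetical rows of the `stream` table: one row per played track.
stream = [
    {"trackid": 7}, {"trackid": 3}, {"trackid": 7},
    {"trackid": 5}, {"trackid": 7}, {"trackid": 3},
]

# GROUP BY trackid with COUNT(*): conceptually, a map step emits one key per
# row and a reduce step sums the occurrences of each key.
counts = Counter(row["trackid"] for row in stream)

# ORDER BY cnt DESC
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [(7, 3), (3, 2), (5, 1)]
```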
■ RDBMS: stores Hive metadata
■ Hive Metastore: manages metadata about databases, tables and views
■ Hive Server 2: acts as a proxy for “light” clients
● JDBC/ODBC
● Beeline CLI
■ Hive Shell CLI
■ BeesWax (HUE)
HDFS Read -> Job 1 -> Cache In Memory -> Memory Read -> Job 2 -> Cache In Memory -> Memory Read

A dataset can be cached in the cluster’s (distributed) memory so that subsequent jobs read it faster.
Great fit for iterative algorithms and interactive queries!
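
The effect of caching can be shown with a toy, single-process sketch. `ToyDataset` and `load_from_hdfs` are hypothetical stand-ins invented for this example; in Spark itself the corresponding call is `cache()` on an RDD or DataFrame, and the data is held in executor memory across the cluster.

```python
class ToyDataset:
    """Stand-in for a distributed dataset; `load` simulates an HDFS read."""

    def __init__(self, load):
        self._load = load
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        """Mark the dataset to be kept in memory after its first read."""
        self._cache_enabled = True
        return self

    def read(self):
        if self._cached is not None:
            return self._cached   # memory read (fast path)
        data = self._load()       # simulated HDFS read (slow path)
        if self._cache_enabled:
            self._cached = data
        return data


stats = {"hdfs_reads": 0}


def load_from_hdfs():
    """Pretend to read a file from HDFS and count how often that happens."""
    stats["hdfs_reads"] += 1
    return [3, 1, 2]


songs = ToyDataset(load_from_hdfs).cache()
job1 = sum(songs.read())    # Job 1: reads from "HDFS", then caches
job2 = max(songs.read())    # Job 2: served from memory
print(stats["hdfs_reads"])  # 1
```

Without the `cache()` call, both jobs would go back to the slow path, which is why caching pays off for iterative and interactive workloads.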
Interactive queries: Input -> Distributed Memory -> Query 1, Query 2, Query 3
Iterative algorithms: Input -> Iteration 1 -> Iteration 2 (through Distributed Memory)
■ Client
■ Resource Manager
■ NodeManager: YARN Container (Spark Application Master, Spark Driver)
■ NodeManager: YARN Container (Spark Executor, Spark Task)
■ NodeManager: YARN Container (Spark Executor, Spark Task)
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 20g \
    --executor-cores 3 \
    lib/spark-examples*.jar \
    10
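
For context, `SparkPi` is the example job bundled with Spark that estimates π by random sampling, and the trailing `10` is, as far as the bundled example goes, the number of partitions ("slices") the sampling is split into. A single-process sketch of the same idea follows; the function name `estimate_pi`, the sample count per slice and the fixed seed are assumptions made for this illustration.

```python
import random


def estimate_pi(slices, samples_per_slice=10_000, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    square [-1, 1] x [-1, 1] that fall inside the unit circle is pi / 4."""
    rng = random.Random(seed)
    total = slices * samples_per_slice
    inside = 0
    for _ in range(total):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return 4.0 * inside / total


pi_estimate = estimate_pi(slices=10)
print(pi_estimate)
```

On a cluster, each slice would be sampled by a separate Spark task and the per-slice counts summed at the end.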
Spark Core
■ Spark SQL
■ Spark Streaming (near real-time, micro-batch)
■ MLlib (machine learning)
■ GraphFrames (graph processing)
■ SparkR (R on Spark)
INGEST, STORE, MANAGE, ANALYZE
Real-time event collection -> Stream processing
Non-stop: each event, or each minute, or each user session
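
The "each minute" granularity can be sketched as a per-minute count that is updated as events arrive. The events and their fields below are hypothetical, and the loop stands in for a real stream processor, which would handle an unbounded stream rather than a list.

```python
from collections import defaultdict

# Hypothetical events: (timestamp in seconds, user, track id)
events = [(5, "ann", 7), (42, "bob", 3), (61, "ann", 7), (130, "bob", 5)]

per_minute = defaultdict(int)
for ts, user, track in events:
    # Each event is handled as it arrives and assigned to its minute window.
    per_minute[ts // 60] += 1

print(dict(per_minute))  # {0: 2, 1: 1, 2: 1}
```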

Introduction to Hadoop Ecosystem