How YARN Enables Multiple Data Processing Engines in Hadoop

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
We Do Hadoop
Eric Mizell - Director, Solution Engineering

Agenda
• HDFS Overview - Storage
• YARN 101 - Compute
– Yet Another Resource Negotiator
• Enabling a Modern Data Architecture
• YARN in action
– Demo of streaming application
• Hadoop Tools
– Demos
• Sample Code - https://github.com/emizell/HBase-Code-Samples

HDFS Overview
•  Typical Hardware for DataNodes
–  2@8 Core
–  256GB RAM
–  2@24TB Disk
–  10 GbE
•  Hadoop is rack aware
–  Data is replicated across racks so that losing a node or a whole rack does not lose data
•  Scale up or down
–  Add or remove DataNodes; HDFS re-replicates blocks as needed, and the balancer evens out data across nodes
•  HDFS is a file system
–  Store any kind of data
–  Inexpensive storage
–  Replication factor of 3 by default (configurable)
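The rack awareness and default replication factor above can be sketched in a few lines. This is a simplified model, not HDFS code: the function `place_replicas`, the topology dictionary and the node names are all hypothetical. It follows the placement idea commonly described for HDFS (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack), and assumes at least one other rack with two or more nodes.

```python
import random


def place_replicas(topology, writer_node, replication=3, seed=None):
    """Choose nodes for one block in a simplified rack-aware model.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    Returns a list of `replication` distinct node names.
    """
    rng = random.Random(seed)
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Remote racks that can hold two replicas on two different nodes.
    remote_racks = [r for r in topology
                    if r != local_rack and len(topology[r]) >= 2]
    if not remote_racks:
        raise ValueError("sketch needs a second rack with at least two nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    chosen = [writer_node, second, third]

    # Any replicas beyond the third go to nodes not used yet.
    extras = [n for n in rack_of if n not in chosen]
    rng.shuffle(extras)
    return (chosen + extras)[:replication]
```

With three replicas spread over two racks, a single node failure or a single rack failure still leaves at least one copy of the block.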

YARN Concepts
• Application
– Application is a job submitted to the framework
– Example – MapReduce Job
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
– container_0 = 2GB, 1CPU
– container_1 = 1GB, 6 CPU
– Replaces the fixed map/reduce slots
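A minimal sketch of the container idea, assuming a hypothetical node with 4 GB of memory and 8 CPUs. The `Resource` class and `allocate` function are illustrative names, not YARN APIs. The two requests mirror container_0 and container_1 above: each container asks for its own mix of resources instead of occupying a fixed map or reduce slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A two-dimensional resource vector (memory and CPU only, for brevity)."""
    memory_gb: int
    cpus: int

    def fits_in(self, other):
        return self.memory_gb <= other.memory_gb and self.cpus <= other.cpus

    def minus(self, other):
        return Resource(self.memory_gb - other.memory_gb, self.cpus - other.cpus)


def allocate(node_capacity, requests):
    """Grant requests in order while they fit; return (granted names, free resources)."""
    free = node_capacity
    granted = []
    for name, request in requests:
        if request.fits_in(free):
            granted.append(name)
            free = free.minus(request)
    return granted, free


# Hypothetical node and requests; container_2 is added to show a rejection.
node = Resource(memory_gb=4, cpus=8)
requests = [
    ("container_0", Resource(memory_gb=2, cpus=1)),
    ("container_1", Resource(memory_gb=1, cpus=6)),
    ("container_2", Resource(memory_gb=2, cpus=1)),
]
granted, free = allocate(node, requests)
```

Here container_2 is not granted because only 1 GB of memory remains, even though a CPU is still free.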

YARN Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
– Application management
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
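The division of labor between the three roles can be sketched as a toy model. None of this is the YARN API: the classes, the `submit` function, the first-fit placement and the memory-only accounting are simplifying assumptions. The flow it shows is the one described above: the Resource Manager places a container for the Application Master, which then requests containers for its tasks, and Node Managers launch them.

```python
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it launched."""

    def __init__(self, name, memory_gb):
        self.name = name
        self.free_gb = memory_gb
        self.containers = []

    def launch(self, container_id, memory_gb):
        self.free_gb -= memory_gb
        self.containers.append(container_id)


class ResourceManager:
    """Global scheduler: first-fit placement (an assumption of this sketch)."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, container_id, memory_gb):
        for nm in self.node_managers:
            if nm.free_gb >= memory_gb:
                nm.launch(container_id, memory_gb)
                return nm.name
        return None  # no capacity anywhere in the cluster


class ApplicationMaster:
    """Per-application: asks the ResourceManager for one container per task."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_count, memory_gb):
        placements = {}
        for i in range(1, task_count + 1):
            container_id = f"C{self.app_id}.{i}"
            placements[container_id] = self.rm.allocate(container_id, memory_gb)
        return placements


def submit(rm, app_id, task_count, task_memory_gb, am_memory_gb=1):
    """Client-side view: start the Application Master, then let it run its tasks."""
    am_node = rm.allocate(f"AM{app_id}", am_memory_gb)
    if am_node is None:
        raise RuntimeError("no capacity to start the Application Master")
    return am_node, ApplicationMaster(app_id, rm).run_tasks(task_count, task_memory_gb)
```

A real scheduler also applies queues, priorities and locality; the sketch only shows who talks to whom.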

YARN – Running Apps
[Diagram: Hadoop Client 1 and Hadoop Client 2 create and submit app1 and app2 to the ResourceManager (Scheduler with queues, ASM). The Application Masters AM1 and AM2 negotiate resources, and their containers (C1.1 to C1.4, C2.1 to C2.3) run on NodeManagers spread across Rack1, Rack2 … RackN. NodeManagers send status reports to the ResourceManager.]

Hadoop 2.x Stack – Enabled by YARN
[Diagram: Hadoop stack with YARN: Data Operating System (Cluster Resource Management) running engines 1 … N on top of HDFS (Hadoop Distributed File System)]
•  Batch, interactive & real-time data access
–  Script: Pig
–  SQL: Hive
–  Java, Scala: Cascading
–  Stream: Storm
–  Search: Solr
–  NoSQL: HBase, Accumulo
–  In-Memory: Spark
–  Others: ISV engines
–  Execution layers shown beneath the engines: Tez, Slider
•  Governance (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
•  Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
•  Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie)
•  Deployment choice: Linux, Windows, on-premises, cloud
YARN is the architectural center of HDP
•  Enables batch, interactive and real-time workloads
•  Provides comprehensive enterprise capabilities
•  The widest range of deployment options

Hadoop 2.2.x Stack – Versions

Enabling a Modern Data Architecture
with Apache Hadoop
Hortonworks. We do Hadoop.

Existing Siloed Data Architectures Under Pressure
[Diagram: applications on top of siloed data systems fed by existing sources]
•  Applications: Business Analytics, Custom Applications, Packaged Applications
•  Data system: RDBMS, EDW, MPP, each one a silo
•  Sources: existing sources (CRM, ERP, clickstream, logs)
•  Data growth: new data types, 85% (Source: IDC)
–  OLTP, ERP, CRM systems
–  Unstructured docs, emails
–  Clickstream
–  Server logs
–  Social/web data
–  Sensor, machine data
–  Geolocation
•  Can’t manage new data paradigm
•  Constrains data to specific schema
•  Siloed data
•  Limited scalability
•  Economically unfeasible
•  Limited analytics

HDP2 and YARN enable the Modern Data Architecture
Hortonworks architected and led development of YARN
Common data set, multiple applications
•  Optionally land all data in a single cluster
•  Batch, interactive & real-time use cases
•  Support multi-tenant access, processing & segmentation of data
YARN: Architectural center of Hadoop
•  Consistent security, governance & operations
•  Ecosystem applications certified by Hortonworks to run natively in Hadoop
[Diagram: sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed the data system, where RDBMS, EDW and MPP sit alongside YARN: Data Operating System running batch, interactive and real-time engines 1 … N on HDFS. Applications on top: business analytics, custom applications, packaged applications.]

YARN in Action

Trucking Company’s YARN-enabled Architecture
[Diagram components]
•  Truck sensors
•  Inbound messaging (Kafka)
•  Stream processing (Storm)
•  Real-time serving (HBase)
•  Interactive query (Hive on Tez)
•  Microsoft Excel
•  Alerts & events (ActiveMQ)
•  Real-time user interface
•  Distributed storage: HDFS
•  Many workloads: YARN

Components of the Topology
• 9 Node HDP 2.2 Cluster with Storm and HBase on YARN
• 4 Node Kafka 0.8 Cluster
• 1 Node ActiveMQ with STOMP Protocol Enabled
• Spring 4.0 WebMVC Web App Using SockJS & ActiveMQ over STOMP

Topology Architecture
[Diagram components]
•  Truck simulator: truck stream generator via Akka (T(1), T(2) … T(N)) with a Kafka collector
•  Kafka grid, captures all driving events: brokers BR(1) to BR(5), ZooKeeper (ZK), truck_events topic
•  Storm on YARN on HDP: Kafka spout, HBase bolt, monitoring bolt, WebSocket bolt
•  HBase on HDP: driver dangerous events, driver dangerous events count
•  ActiveMQ: alert topic (email alerts) and violation events topic
•  Spring web app with SockJS WebSockets: real-time streaming driver monitoring app
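The spout-and-bolt flow in this topology can be sketched with plain Python generators. This is not Storm code, and the event fields (`driver_id`, `event_type`) and the "Normal" event type are assumptions made for illustration. The function names only echo the components in the diagram.

```python
from collections import Counter


def kafka_spout(events):
    """Stand-in for the Kafka spout: emits truck events one at a time."""
    for event in events:
        yield event


def monitoring_bolt(stream, normal_type="Normal"):
    """Pass through only the events that are not of the normal type."""
    for event in stream:
        if event["event_type"] != normal_type:
            yield event


def count_by_driver(stream):
    """Stand-in for the counting step: dangerous events per driver."""
    counts = Counter()
    for event in stream:
        counts[event["driver_id"]] += 1
    return dict(counts)


# Hypothetical sample of the truck_events topic.
sample_events = [
    {"driver_id": 11, "event_type": "Normal"},
    {"driver_id": 11, "event_type": "Overspeed"},
    {"driver_id": 12, "event_type": "Lane Departure"},
    {"driver_id": 11, "event_type": "Unsafe following distance"},
]
dangerous_counts = count_by_driver(monitoring_bolt(kafka_spout(sample_events)))
```

In the real topology each stage runs in parallel across the cluster and the counts are written to HBase; the sketch only shows the shape of the data flow.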

Demo

Hadoop Tools

Agenda
•  The Basics
•  MapReduce & Java
•  Pig
•  Hive
•  HBase, Solr & Spark
•  Abstractions: .NET, Cascading and Spring XD
•  Intro to the Sandbox
•  Basic Hello World Using Hive and Pig
•  HBase and Phoenix demo and code discussion

Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: HDP 2.2 stack. Data access engines for batch, interactive and real-time work (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark and ISV engines, with Tez and Slider) run on YARN over HDFS, alongside governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), security (authentication, authorization, audit, data protection; Ranger, Knox) and operations (Ambari, Zookeeper, Oozie).]
YARN is the architectural center of HDP
•  Common data set across all applications
•  Batch, interactive & real-time workloads
•  Multi-tenant access & processing
Provides comprehensive enterprise capabilities
•  Governance
•  Security
•  Operations
Enables broad ecosystem adoption
•  ISVs can plug directly into Hadoop
The widest range of deployment options
•  Linux & Windows
•  On premises & cloud

Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
We will cover:
•  What it is & where it is used
•  Basic elements

MapReduce
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
Developers rarely need to write MapReduce directly anymore:
•  Many tools have been created to abstract this complexity
[Diagram: chained stages of map (M) and reduce (R) tasks, with HDFS between stages]
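The map and reduce steps can be illustrated with the classic word count, here as a single-process Python sketch rather than a Hadoop job. The function names are illustrative. On a cluster the map calls run in parallel on many machines, and the framework performs the shuffle and writes the results to HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["we do hadoop", "We do YARN"])` counts "we" and "do" twice each and "hadoop" and "yarn" once each.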

Pig
•  Apache™ Pig allows you to write complex MapReduce transformations using a simple scripting language.
•  Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort.
•  Pig Latin can be extended with UDFs (User Defined Functions), written in Java or a scripting language and then called directly from Pig Latin.
Developers use Pig for:
•  ETL
•  Basic “spreadsheet” functions
•  Prepare data for data science

Example
RAW_LOGS = LOAD '/user/paul/data/apache/access' USING TextLoader AS (line:chararray);

CLICKS_RAW = LOAD '$input' USING PigStorage('|') AS
    (sls_key:chararray, sls_item_ln_id:int, chn_id:int, loc_id:int,
     all_chnl_rpt_chn_id:int, all_chnl_rpt_loc_id:int,
     sls_bsns_dt:chararray, sku_id:int);

RECORDS = LOAD 'config' USING org.apache.hcatalog.pig.HCatLoader();

Pig Operators

Hive
•  Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop.
•  Created by a team at Facebook.
•  Provides a standard SQL interface to data stored in Hadoop.
•  Quickly find value in raw data files.
•  Proven at petabyte scale.
•  Compatible with popular BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
Developers use it to:
•  Perform SQL queries
•  Interface with existing tools via JDBC/ODBC

Sample SQL with Hive
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Hive - Select Syntax

Hive Demonstration
HDP Sandbox
•  Up and running with a Hadoop environment in minutes
•  Basic and advanced tutorials with discrete learning paths
•  Ecosystem partner tutorials
hortonworks.com/sandbox

HBase
•  Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS).
•  It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data.
•  It also adds record-level write capabilities to Hadoop, allowing users to perform updates, inserts and deletes.
•  HBase was created for hosting very large tables with billions of rows and millions of columns.
•  Provide low latency access to massive amounts of data (e.g., recommendation engine results)
•  Document store

Phoenix
•  Apache™ Phoenix is a high-performance relational database layer over HBase for low-latency applications.
•  SQL queries are compiled into a series of HBase scans, producing regular JDBC result sets.
•  Table metadata is stored in an HBase table, is versioned, and can be queried by version.
•  Query performance is in the millisecond to low seconds range.
•  Largest known table size is over a trillion rows, with query response times in the 30 second range.
Developers use it for:
•  Low latency queries
•  SQL skin on HBase

Phoenix Functions

HBase/Phoenix Demonstration
HDP Sandbox
•  Up and running with a Hadoop environment in minutes
•  Basic and advanced tutorials with discrete learning paths
•  Ecosystem partner tutorials
hortonworks.com/sandbox

Storm
•  Apache™ Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Hadoop.
•  Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
•  Apache Kafka is a publish-subscribe messaging system that works well with Storm.
•  Analyze stream data in real-time

Solr
•  Apache Solr provides full-text search and near real-time indexing for data stored in Hadoop.
•  Whether users search for tabular, text, geolocation or sensor data in Hadoop, they find it quickly with Apache Solr.
•  Apache Solr indexes via XML, JSON, CSV or binary over HTTP. Users can query petabytes of data via HTTP GET and receive XML, JSON, CSV or binary results.
•  Provide search capability for a cluster
•  Data scientists often use it to explore data found in HDFS

Spark
•  Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets.
•  Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query.
•  Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications.
•  Spark is YARN-ready: another engine on YARN!
•  Data Science: machine learning and iterative analytics

Cascading
•  Cascading is an application development framework for building data applications. It converts applications into MapReduce jobs.
•  The Cascading SDK provides a collection of tools, documentation, libraries, tutorials and example projects.
•  Lingual. Simplifies systems integration through ANSI SQL compatibility and a JDBC driver
•  Pattern. Enables various machine learning scoring algorithms through PMML compatibility
•  Scalding. Enables development with Scala, a powerful language for solving functional problems
•  Cascalog. Enables development with Clojure, a Lisp dialect
•  Build complex native Hadoop applications without getting into MapReduce.

.NET
•  The Microsoft .NET SDK for Hadoop provides API access to HDP and Microsoft HDInsight, including HDFS, HCatalog, Oozie and Ambari, as well as some PowerShell scripts for cluster management.
•  There are also libraries for MapReduce and LINQ to Hive. The latter builds on LINQ, the established technology .NET developers use to access most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query.
•  Build complex Microsoft .NET Hadoop applications.

Java & Spring XD
•  Spring for Apache Hadoop (SHDP) provides a developer API for Pig, Hive and Cascading, plus extensions to Spring Batch for orchestrating Hadoop-based workflows.
•  It integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch.
•  These foundational parts of the Spring IO platform make Hadoop development more accessible to a wider range of Java developers.
•  Build complex Hadoop applications using Java and the Spring framework

Hadoop Summit 2015

Thank You!
Eric Mizell – Director, Solutions Engineering
emizell@hortonworks.com
@ericmizell
