Apache Apex as YARN Application
Chinmay Kolhatkar (chinmay@apache.org)
Jan 6, 2016
Apache Apex Meetup
Agenda
• Directed Acyclic Graph
• Apex as a YARN Application
• Application Components of Apex
• Lifecycle of Apex as a YARN Application
Directed Acyclic Graph (DAG)
• Defines the compute stages of a streaming application
• Defines how tuples flow between Operators via Streams
[Figure: example DAG with four compute stages, Compute 1 to Compute 4]
DAG Components
• Tuple
  ● Atomic unit of data that flows over a stream
• Operator
  ● Basic compute unit, applied to each tuple
• Stream
  ● Connector abstraction between operators
  ● Tuples flow over it
[Figure: tuple1, tuple2 and tuple3 flowing over a Stream from Operator1 to Operator2]
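The relationship between the three components can be modeled in a few lines. This is a minimal Python sketch for illustration only, not the Apex API: the `Operator` and `Stream` classes, the `run_example` helper and the two example functions are hypothetical names introduced here.

```python
from collections import deque
from typing import Any, Callable, List


class Operator:
    """Hypothetical compute unit: applies a function to each tuple it receives."""

    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name = name
        self.fn = fn

    def process(self, tup: Any) -> Any:
        return self.fn(tup)


class Stream:
    """Hypothetical connector: a FIFO queue between an upstream and a downstream operator."""

    def __init__(self, source: Operator, sink: Operator):
        self.source = source
        self.sink = sink
        self.queue: deque = deque()

    def emit(self, tup: Any) -> None:
        # The upstream operator processes the tuple and places the result on the stream.
        self.queue.append(self.source.process(tup))

    def drain(self) -> List[Any]:
        # The downstream operator consumes tuples in arrival order.
        results = []
        while self.queue:
            results.append(self.sink.process(self.queue.popleft()))
        return results


def run_example() -> List[Any]:
    operator1 = Operator("Operator1", lambda x: x * 2)
    operator2 = Operator("Operator2", lambda x: f"value={x}")
    stream = Stream(operator1, operator2)
    for tup in (1, 2, 3):  # stands in for tuple1, tuple2, tuple3
        stream.emit(tup)
    return stream.drain()
```

Each tuple is handled independently and in order, which is the property the slide attributes to operators ("per tuple") and streams.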
DAG Types
• Logical Plan
  ● Logical representation of computation
  ● Defines operators, streams and dataflow
• Physical Plan
  ● Deployable plan on cluster
  ● Contains partition information of operators
  ● Has ready-to-deploy serialized operator instances
[Figure: Logical Plan with operators O1 to O5, and the corresponding Physical Plan with partitions O1P1 to O1P3 and O2P1 to O2P3, a node U, and operators O3, O4 and O5]
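The step from logical to physical plan can be illustrated as an expansion of each logical operator into its partitions. The sketch below is a simplified model, not how Apex computes its physical plan: the function name `expand_to_physical`, the all-to-all wiring between partitioned operators, and the merge node named `U_<operator>` (assumed here to play the role of the node labelled U in the figure) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def expand_to_physical(
    operators: List[str],
    streams: List[Tuple[str, str]],
    partitions: Dict[str, int],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Expand a logical plan into physical operator instances.

    Operators with a partition count above 1 become instances named O1P1, O1P2, ...
    When a partitioned operator feeds an unpartitioned one, a merge node
    (named U_<operator>, an assumption of this sketch) is inserted between them.
    """
    instances: Dict[str, List[str]] = {}
    for op in operators:
        count = partitions.get(op, 1)
        instances[op] = [op] if count == 1 else [f"{op}P{i}" for i in range(1, count + 1)]

    physical_ops = [inst for op in operators for inst in instances[op]]
    physical_streams: List[Tuple[str, str]] = []

    for src, dst in streams:
        src_insts, dst_insts = instances[src], instances[dst]
        if len(src_insts) > 1 and len(dst_insts) == 1:
            unifier = f"U_{src}"
            if unifier not in physical_ops:
                physical_ops.append(unifier)
                physical_streams.extend((s, unifier) for s in src_insts)
            physical_streams.append((unifier, dst_insts[0]))
        else:
            # Simplification: connect every upstream instance to every downstream instance.
            physical_streams.extend((s, d) for s in src_insts for d in dst_insts)
    return physical_ops, physical_streams
```

For example, calling it with operators `O1` to `O5` and partition counts `{"O1": 3, "O2": 3}` yields instances `O1P1` to `O1P3` and `O2P1` to `O2P3`, similar in shape to the physical plan in the figure. The stream topology passed in is up to the caller, since the figure's exact wiring is not recoverable from the slide text.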
Apex as YARN application
[Figure: a generic YARN application next to its Apex equivalent. Generic: a YarnClient talks to the ResourceManager (AsM + Scheduler) over the ClientRM Protocol, the AppMaster talks to the ResourceManager over the AMRM Protocol, and YarnContainers on NodeManager (NM) nodes are managed over the ContainerManager Protocol. Apex: DTCLI/StrAMClient acts as the YarnClient, StrAM is the AppMaster, and StrAMChild instances run operators O1, O2 and O3 inside YarnContainers]
Application Components of Apex - StrAMClient
• Part of the dtcli client interface
• Invoked by the “launch” command of dtcli
• Tasks:
  ● Copy the required application package files into HDFS
  ● Validate the Logical Plan
  ● Serialize the Logical Plan to HDFS
  ● Launch the Application Master, i.e. StrAM
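The slides do not say what logical plan validation covers. Since the plan is a directed acyclic graph, one plausible check is that the streams form no cycle. The following Python sketch shows that single check using Kahn's algorithm; `validate_acyclic` is a hypothetical name and is not part of dtcli or Apex.

```python
from collections import deque
from typing import Dict, List, Tuple


def validate_acyclic(operators: List[str], streams: List[Tuple[str, str]]) -> List[str]:
    """Return the operators in a topological order, or raise ValueError on a cycle."""
    in_degree: Dict[str, int] = {op: 0 for op in operators}
    downstream: Dict[str, List[str]] = {op: [] for op in operators}
    for src, dst in streams:
        if src not in in_degree or dst not in in_degree:
            raise ValueError(f"stream {src}->{dst} references an unknown operator")
        downstream[src].append(dst)
        in_degree[dst] += 1

    # Start from operators with no incoming stream and peel the graph layer by layer.
    ready = deque(op for op in operators if in_degree[op] == 0)
    order: List[str] = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in downstream[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any operator left unvisited sits on a cycle.
    if len(order) != len(operators):
        raise ValueError("logical plan contains a cycle")
    return order
```

Running the check on the client before submission would reject a malformed plan before any cluster resources are requested.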
Application Components of Apex - StrAM
• Streaming Application Master
• Started by StrAMClient in a YarnContainer
• Tasks:
  ● Convert the logical plan to a physical plan
  ● Serialize operators to HDFS
  ● Request resources from the ResourceManager
  ● Start StrAMChild in YarnContainer(s)
  ● Monitor StrAMChild using the ContainerManager protocol
  ● Generate application statistics
  ● Host results on WebService (dtManage)
  ● Fault tolerance
  ● Checkpointing/committing application states
  ● Support security
  ● Shut down the application
Application Components of Apex - StrAMChild
• Deployed on a YarnContainer
• Started by the NodeManager as instructed by StrAM
• Instance of StreamingContainer
• Contains Operators (compute-related)
• Contains BufferServer (stream-related)
• Tasks:
  ● Regularly send heartbeats to StrAM
  ● Execute commands from StrAM
  ● Shut down or kill itself if instructed
  ● Manage the lifecycle of an Operator
  ● Network communication using BufferServer
Lifecycle of Apex/YARN Application - Start
[Figure: DTCLI/StrAMClient (YarnClient) connects to the ResourceManager (AsM + Scheduler) over ClientRMProtocol; StrAM (AppMaster) connects to the ResourceManager over AMRMProtocol; StrAMChild instances in YarnContainers on NodeManager (NM) nodes run operators O1 to O4 and are reached over the ContainerManager Protocol; HDFS is shown alongside]
1) Access cluster information
2) Copy files to HDFS
3) Submit application to RM
4) StrAM registers with RM
5) StrAM sends heartbeats regularly
6) StrAM requests containers with specifications
7) StrAMChild reads serialized operator from HDFS
8) StrAMChild starts operator lifecycle
Lifecycle of Apex/YARN Application - Running
[Figure: cluster layout with DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, StrAM (AppMaster) and three StrAMChild containers running operators O1 to O4]
1) StrAMChild sends heartbeats
2) StrAMChild sends operator data
3) StrAM sends regular heartbeats to RM
4) Query status of application
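Steps 1 and 3 both depend on periodic heartbeats, which let the receiving side notice when a sender has gone silent. The slides do not describe how StrAM tracks them, so the following is only a sketch of the general technique: the `HeartbeatMonitor` class, the container identifiers and the timeout value are hypothetical.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical tracker of the most recent heartbeat time per container."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, container_id: str, now: float) -> None:
        # Called whenever a heartbeat arrives from a container.
        self.last_seen[container_id] = now

    def overdue(self, now: float) -> List[str]:
        # Containers whose last heartbeat is older than the timeout.
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic. What an application master does with an overdue container (for example, requesting a replacement) is outside what this sketch models.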
Lifecycle of Apex/YARN Application - Shutdown
[Figure: cluster layout with DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, StrAM (AppMaster) and three StrAMChild containers running operators O1 to O4]
1) Connect on WebService
2) Send shutdown command to StrAM (REST API)
3) Send shutdown signal to StrAMChild
4) StrAMChild finishes operator lifecycle
5) Check if all containers are freed
6) StrAM unregisters itself
7) StrAM exits
8) Check if application has shut down
Lifecycle of Apex/YARN Application - Kill
[Figure: cluster layout with DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, StrAM (AppMaster) and three StrAMChild containers running operators O1 to O4]
1) Send kill-app command to YARN
2) RM kills all containers
Summary
• Apex enables YARN to be used for streaming applications
• Apex takes care of YARN-specific work
• Users can focus on the business logic defined in Operators
Resources
• Apache Apex page
  ● http://apex.incubator.apache.org
  ● http://apex.incubator.apache.org/community.html
• Mailing list
  ● dev@apex.incubator.apache.org
  ● users@apex.incubator.apache.org
• Repository
  ● https://github.com/apache/incubator-apex-core
  ● https://github.com/apache/incubator-apex-malhar
• Issue Tracking
  ● https://issues.apache.org/jira/browse/APEXCORE/
  ● https://issues.apache.org/jira/browse/APEXMALHAR/