© 2015 DataTorrent
Akshay Gore, Bhupesh Chawda
DataTorrent
Apex Hands-on Lab - Into the code!
Getting started with your first Apex Application!
Operators
• Input Adapter vs. Generic Operators?
• What are streams?
• What are ports?
Apex Operator Lifecycle
Apex Streaming Application
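import org.apache.hadoop.conf.Configuration;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Add operators to the DAG: dag.addOperator(args)
    // Add streams between operators: dag.addStream(args)
    // Optional: additional configuration and hints to YARN
  }
}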
Apex Application - FilterWords
Apex Application DAG
• Problem statement - Filter words in the file
ᵒ Read a file located on HDFS
ᵒ Split each line into words, drop any word that is on the forbidden list, and write the remaining words to HDFS
[Diagram: HDFS → Lines → Filtered Words → HDFS]
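The split-and-filter step in the problem statement above can be sketched in plain Java. This is a minimal standalone sketch, not the lab's actual operators: it does not use the Apex API, and the class name `FilterWordsSketch`, the forbidden-word list, and the rule of splitting on non-letter characters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Standalone sketch of the FilterWords logic; it does not use the Apex API.
class FilterWordsSketch {
  // Hypothetical forbidden-word list, chosen only for illustration.
  static final Set<String> FORBIDDEN = new HashSet<>(Arrays.asList("the", "a", "an"));

  // Tokenize step: lower-case the line and split it on runs of non-letter characters.
  static List<String> tokenize(String line) {
    List<String> words = new ArrayList<>();
    for (String w : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      if (!w.isEmpty()) {
        words.add(w);
      }
    }
    return words;
  }

  // Processor step: keep only the words that are not on the forbidden list.
  static List<String> filter(List<String> words) {
    List<String> kept = new ArrayList<>();
    for (String w : words) {
      if (!FORBIDDEN.contains(w)) {
        kept.add(w);
      }
    }
    return kept;
  }
}
```

Calling `FilterWordsSketch.filter(FilterWordsSketch.tokenize(line))` on each line gives the words that the Writer stage would receive. In the lab application these two steps live in separate operators connected by a stream.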
FilterWords Application DAG
[Diagram: HDFS → Reader → (Lines) → Tokenize → (Words) → Processor → (Filtered Words) → Writer → HDFS. Reader is the input operator (adapter), Writer is the output operator (adapter), and Tokenize and Processor are generic operators.]
Prerequisites
• Java 1.7 or above
• Maven 3.0 or above
• Apache Apex projects:
ᵒ Apache Apex Core: core platform, engine
ᵒ Apache Apex Malhar: operators library
• Hadoop cluster in running state
• Your favourite IDE - Eclipse / vi
Demo time!
• Apex application structure
• Application code walkthrough
• How to execute the application
• Assignment
Assignment - WordCount
Apex Application DAG
• Problem statement - Count occurrences of words in a file
ᵒ Read a file located on HDFS
ᵒ Emit the counts at the end of every window and write them to HDFS
[Diagram: HDFS → Lines → <Word, Count> → HDFS]
Assignment - Word Count Application DAG
[Diagram: HDFS → Reader → (Lines) → Tokenize → (Words) → Counter → (<Word, Count>) → Output → HDFS]
Assignment - What you need to do
[Diagram, existing pipeline: Reader → (String: Line) → Tokenizer → (String: Words) → Processor → (String: Words’) → Writer]
[Diagram, assignment: Counter → (Map: {Word: Count}) → Writer]
Assignment - Hints
• Create a copy of Processor.java and name it Counter.java
• Modify Counter.java as follows:
ᵒ Define a data structure which can hold counts for words
ᵒ Process method of input port must count the occurrences
ᵒ Clear the counts in beginWindow() call
ᵒ Emit the counts in endWindow() call
Solution - Changes to Counter.java
• Define a data structure which can hold counts for words
    private HashMap<String, Integer> counts = new HashMap<>();
• Process method of input port must count the occurrences
    if (counts.containsKey(refinedWord)) {
      counts.put(refinedWord, counts.get(refinedWord) + 1);
    } else {
      counts.put(refinedWord, 1);
    }
• Clear the counts in beginWindow() call
    counts.clear();
• Emit the counts in endWindow() call
    output.emit(counts.toString());
• Run Application Test
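The steps above can be tried outside a cluster with a plain Java stand-in for the operator. This is only a sketch under stated assumptions: `WindowedCounterSketch` is a hypothetical class that does not extend any Apex class, its method names merely mirror the callbacks named in the hints, and `endWindow()` returns the counts instead of emitting them on an output port. It uses `Map.merge`, which needs Java 8; on Java 1.7 the `containsKey` branch from the solution does the same job.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the per-window Counter logic; it does not extend any Apex class.
class WindowedCounterSketch {
  private final Map<String, Integer> counts = new HashMap<>();

  // Mirrors beginWindow(): start every window with empty counts.
  void beginWindow() {
    counts.clear();
  }

  // Mirrors the input port's process(): count one occurrence of the word.
  void process(String word) {
    counts.merge(word, 1, Integer::sum);
  }

  // Mirrors endWindow(): return a copy of the counts; a real operator would emit them.
  Map<String, Integer> endWindow() {
    return new HashMap<>(counts);
  }
}
```

Because the map is cleared at the start of each window, a word seen in one window does not appear in the next window's result unless it arrives again.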
Assignment - Are we done yet?
• Change the DAG
ᵒ Replace Processor operator with the newly created operator - Counter
Assignment - Slight change
• We are emitting a Map, but it is still sent as a String.
ᵒ Change the type of the Counter output port to Map
ᵒ Change the type of the Writer input port to Map
ᵒ Change Writer so that it reads the Map and writes one line per word.
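The formatting half of the Writer change can be sketched without any Apex code. The helper below is hypothetical: the class name `CountFormatterSketch`, the tab separator, and the alphabetical ordering are assumptions for illustration, not something the slides specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of turning a {word: count} map into one output line per word.
class CountFormatterSketch {
  // Produces "word<TAB>count" lines; the tab separator is an assumption.
  static List<String> toLines(Map<String, Integer> counts) {
    List<String> lines = new ArrayList<>();
    // Copy into a TreeMap so the lines come out sorted by word.
    for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
      lines.add(e.getKey() + "\t" + e.getValue());
    }
    return lines;
  }
}
```

Sorting is not required by the assignment; it only makes the output deterministic, which helps when comparing results in a test.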
Assignment - Final change
• Change the code so that each count is the overall count, not just the count for each window.
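One possible way to get overall counts is to stop clearing the map at the start of each window. The sketch below is a plain Java stand-in under that assumption; `RunningCounterSketch` is a hypothetical class, not an Apex operator, and it says nothing about how operator state is preserved across failures on a real cluster.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of a counter that accumulates across windows.
class RunningCounterSketch {
  private final Map<String, Integer> counts = new HashMap<>();

  // Mirrors beginWindow(): deliberately does NOT clear, so counts carry over.
  void beginWindow() {
  }

  // Mirrors the input port's process(): count one occurrence of the word.
  void process(String word) {
    Integer current = counts.get(word);
    counts.put(word, current == null ? 1 : current + 1);
  }

  // Mirrors endWindow(): return a snapshot of the totals seen so far.
  Map<String, Integer> endWindow() {
    return new HashMap<>(counts);
  }
}
```

Each window's result now contains every word seen so far, so downstream output grows over time and repeats earlier words with updated totals.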
Summary - Recap
• Writing Apache Apex operators
• Chaining the operators into an Apache Apex application
• Executing the application on the Apache Apex platform
Where to go from here?
Apache Apex Documentation - http://apex.incubator.apache.org/docs.html
Apache Apex Core Git - https://github.com/apache/incubator-apex-core
Apache Apex Malhar Git - https://github.com/apache/incubator-apex-malhar
Join Users Mailing List - users-subscribe@apex.incubator.apache.org
Join Dev Mailing List - dev-subscribe@apex.incubator.apache.org
Send queries to Users Mailing List - users@apex.incubator.apache.org
Send queries to Dev Mailing List - dev@apex.incubator.apache.org
Thank You

Building YARN Applications


Editor's Notes

  • #3: Operators are basic compute units. Operators process each incoming tuple and emit zero or more tuples on output ports as per the business logic.
    ᵒ Input Adapter - One of the starting points in the application DAG, responsible for getting tuples from an external system. Such data may also be generated by the operator itself, without interacting with the outside world.
    ᵒ Generic Operator - Accepts input tuples from the previous operators and passes them on to the following operators in the DAG.
    ᵒ Output Adapter - One of the ending points in the application DAG, responsible for writing the data out to an external system.
  • #4: An operator passes through various stages during its lifetime. Each stage is an API call that the Streaming Application Master makes for an operator.
    ᵒ setup() initializes the operator and prepares it to start processing tuples.
    ᵒ beginWindow() marks the beginning of an application window and allows for any processing to be done before a window starts.
    ᵒ process() belongs to the input port and is triggered when a tuple arrives at the input port of the operator.
    ᵒ emitTuples() is used by input adapters to emit any tuples that are fetched from external systems.
    ᵒ endWindow() marks the end of the window and allows for any processing to be done after the window ends.
    ᵒ teardown() is used for gracefully shutting down the operator and releasing any resources held by the operator.
  • #5: Skeleton for Apex application
  • #8: For application development or functional testing, a Hadoop cluster and its services are not required, because the application can run against the local file system as a single process with multiple threads. A distributed Hadoop cluster is recommended for benchmarking and production testing. On a single-node cluster, throughput will not be as high as on a multi-node cluster because of memory constraints.
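The call order described in note #4 can be illustrated with a small plain Java driver. This is a sketch only: `LifecycleSketch` is a hypothetical class whose method names mirror the notes, it does not implement the Apex operator interfaces, and the `run` helper stands in for the platform that would normally make these calls. It covers a generic operator, so emitTuples() is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic operator; it records each lifecycle call it receives.
class LifecycleSketch {
  final List<String> calls = new ArrayList<>();

  void setup() { calls.add("setup"); }
  void beginWindow(long windowId) { calls.add("beginWindow " + windowId); }
  void process(String tuple) { calls.add("process " + tuple); }
  void endWindow() { calls.add("endWindow"); }
  void teardown() { calls.add("teardown"); }

  // Drives one operator through the given windows in the order the notes describe.
  static List<String> run(List<List<String>> windows) {
    LifecycleSketch op = new LifecycleSketch();
    op.setup();
    long windowId = 0;
    for (List<String> tuples : windows) {
      op.beginWindow(windowId++);
      for (String t : tuples) {
        op.process(t);
      }
      op.endWindow();
    }
    op.teardown();
    return op.calls;
  }
}
```

Passing two windows, the first holding tuples "a" and "b" and the second holding "c", records setup once, then a beginWindow/process/endWindow group per window, then teardown once.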