Intro to Apache Apex
Pramod Immaneni
Apache Apex PMC, Architect DataTorrent
Oct 12th 2016
Next Gen Stream Data Processing
• Data from a variety of sources (IoT, Kafka, files, social media, etc.)
• Unbounded, continuous data streams
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to a variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
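Processing with temporal boundaries can be illustrated in plain Java. The sketch below is a minimal model, not Apex code: the class `TumblingWindowCounter` and its methods are hypothetical names. It assigns each event to a fixed-size (tumbling) window by timestamp and keeps one count per window as state.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Apex: counts events per fixed-size (tumbling) time window. */
class TumblingWindowCounter {
    private final long windowMillis;
    // window start timestamp -> number of events seen in that window
    private final Map<Long, Integer> counts = new TreeMap<>();

    TumblingWindowCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Assigns the event to the window containing its timestamp and updates that window's count. */
    void onEvent(long timestampMillis) {
        long windowStart = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
        counts.merge(windowStart, 1, Integer::sum);
    }

    Map<Long, Integer> counts() {
        return counts;
    }
}
```

With a window of 1000 ms, events at 10 and 999 fall into the window starting at 0, and an event at 1000 opens the next window.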
[Figure: log-processing pipeline. Labels: Browser; Web Server; Logs; Kafka; Kafka Input (logs); Decompress, Parse, Filter; Dimensions Aggregate; Kafka]
Apache Apex
• In-memory, distributed stream processing
ᵒ Application logic broken into components called operators that run in a distributed fashion across your cluster
• Natural programming model
ᵒ Unobtrusive Java API to express (custom) logic
ᵒ Maintain state and metrics in your member variables
• Scalable, high throughput, low latency
ᵒ Operators can be scaled up or down at runtime according to the load and SLA
ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
ᵒ Automatically recover from node outages without having to reprocess from beginning
ᵒ State is preserved, checkpointing, incremental recovery
ᵒ End-to-end exactly-once
• Operability
ᵒ System and application metrics, record/visualize data
ᵒ Dynamic changes
Apex Platform Overview
Native Hadoop Integration
• YARN is the resource manager
• HDFS for storing persistent state
Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG). Operators connected by streams of tuples, ending in an output stream]
Scalability
[Figure: partitioning examples. Operators 0 to 4 with partitions (1a, 1b, 1c, 2a, 2b, 3a, 3b) merged by unifiers, and unifiers placed across containers and NICs (uopr1 to uopr4, dopr)]
Dynamic Partitioning
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
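Partitioning and unifying can be modelled in a few lines of plain Java. This is a sketch under assumptions, not the Apex partitioner API: `PartitionedCount`, `partitionFor`, `countInPartitions` and `unify` are hypothetical names. Tuples are routed to a partition by key hash, each partition keeps its own partial counts, and a unifier merges the partial results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model, not Apex API: key-based partitioning with a merging unifier. */
class PartitionedCount {
    /** Routes a key to one of n partitions; floorMod keeps the index non-negative. */
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    /** Each partition counts only the keys routed to it. */
    static List<Map<String, Integer>> countInPartitions(List<String> keys, int n) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new HashMap<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, n)).merge(key, 1, Integer::sum);
        }
        return partitions;
    }

    /** The unifier merges the partial results of all partitions into one set of totals. */
    static Map<String, Integer> unify(List<Map<String, Integer>> partitions) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partitions) {
            partial.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }
}
```

The unified totals do not depend on the number of partitions, which is what makes it safe to change that number while the application is running.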
[Figure: repartitioning examples with operator partitions 1a, 1b, 2a to 2d, 3, 3a, 3b. Unifiers not shown]
Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
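The interplay of checkpointing and buffered replay can be sketched in plain Java. The classes below (`CheckpointDemo`, `SumOperator`, `Checkpoint`, `ReplayBuffer`) are hypothetical stand-ins, not engine classes. A checkpoint records the operator state together with the last window it covers, and recovery restores that state and replays only the windows that came after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model, not Apex API: checkpoint an operator's state and recover by replay. */
class CheckpointDemo {
    /** A stateful operator whose only state is a running sum. */
    static class SumOperator {
        long sum;

        void process(int tuple) {
            sum += tuple;
        }
    }

    /** Saved copy of the state together with the last window it includes. */
    static class Checkpoint {
        final long windowId;
        final long sum;

        Checkpoint(long windowId, long sum) {
            this.windowId = windowId;
            this.sum = sum;
        }
    }

    /** Upstream buffer: keeps the tuples of each window so they can be replayed. */
    static class ReplayBuffer {
        private final TreeMap<Long, List<Integer>> windows = new TreeMap<>();

        void add(long windowId, int tuple) {
            windows.computeIfAbsent(windowId, id -> new ArrayList<>()).add(tuple);
        }

        /** All windows strictly after the given window, in ascending order. */
        Map<Long, List<Integer>> after(long windowId) {
            return windows.tailMap(windowId, false);
        }
    }

    /** Builds a fresh operator from the checkpoint and replays the windows that followed it. */
    static SumOperator recover(Checkpoint cp, ReplayBuffer buffer) {
        SumOperator op = new SumOperator();
        op.sum = cp.sum;
        for (List<Integer> tuples : buffer.after(cp.windowId).values()) {
            for (int t : tuples) {
                op.process(t);
            }
        }
        return op;
    }
}
```

Recovery is incremental in this model: windows up to the checkpoint are never reprocessed, only the ones after it.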
Checkpointing
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
Buffer Server
• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
[Figure: Operator 1 (Container 1, Node 1), Buffer Server, Operator 2 (Container 2, Node 2)]
End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of
application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
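One way to read "at-least-once + idempotency + consistent state" is a sink that remembers the last window it wrote. The sketch below is plain Java with hypothetical names (`IdempotentSink`, `writeWindow`), not a library operator. It assumes repeatable windowing, meaning a replayed window carries the same ID and the same tuples as the first time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model, not Apex API: a sink that stays exactly-once under replay. */
class IdempotentSink {
    // Stand-ins for the external system: the stored totals and the last window written.
    // A real database would have to update both in a single transaction.
    private final Map<String, Integer> table = new HashMap<>();
    private long lastCommittedWindow = -1;

    /** Applies one window of increments unless that window was already written. */
    boolean writeWindow(long windowId, Map<String, Integer> increments) {
        if (windowId <= lastCommittedWindow) {
            return false; // replayed window: skip it to avoid duplicates
        }
        increments.forEach((word, n) -> table.merge(word, n, Integer::sum));
        lastCommittedWindow = windowId;
        return true;
    }

    Map<String, Integer> table() {
        return table;
    }
}
```

After a failure the platform may deliver window 1 a second time. The sink sees that window 1 is already committed and ignores it, so the stored totals are neither duplicated nor lost.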
Example application – Streaming WordCount
• Kafka to MySQL
• Streaming source
• Traditional database output
• Functionality
- Stream messages that contain lines of text from Kafka, break them into words, drop
commonly occurring words such as articles, keep a running count of how many times each word
occurs, and keep updating the totals in a database table
• Five operators
• Kafka Input to stream messages from Kafka
• Parser to break lines into words
• Filter to drop the articles
• Counter to count occurrence of each word
• Database output to write counts to database
DAG
[Figure: Apex Application DAG. Kafka, Kafka Input, (Lines), Parser, (Words), Filter, (Filtered), Word Counter, (Counts), Database Output, Database]
• Design and develop operators or use existing ones from the library
• Connect operators to form an Application
• Configure operators
• Configure scaling and other platform attributes
• Test functionality, performance and iterate
Kafka Input
[Figure: two partitioning modes, 1 to 1 partition and 1 to N partition]
• Available in Apex Malhar library - KafkaSinglePortStringInputOperator
• Operator scales dynamically with the number of partitions on the Kafka side
• Fault tolerant and idempotent, keeps track of offsets for idempotent replay during failure recovery
Parser Operator
• Simple parser implementation, splits strings based on a regex pattern
• Define input port to receive data and an output port to output data
• The callback process is called whenever data is available
• To send data, the operator calls the emit method
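The shape of such a parser can be shown without the Apex classes. In this plain-Java sketch the class `LineParser` is hypothetical, and a `Consumer<String>` stands in for the output port. The `process` method plays the role of the input-port callback and `output.accept` plays the role of emit.

```java
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Hypothetical model, not Apex API: a parser with one input callback and one output. */
class LineParser {
    private final Pattern separator;
    private final Consumer<String> output; // stands in for the output port

    LineParser(String regex, Consumer<String> output) {
        this.separator = Pattern.compile(regex);
        this.output = output;
    }

    /** Called once per incoming line; emits each non-empty word downstream. */
    void process(String line) {
        for (String word : separator.split(line)) {
            if (!word.isEmpty()) {
                output.accept(word); // the "emit" step
            }
        }
    }
}
```

The empty-string check matters because a line that starts with a separator makes `split` return an empty first element.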
Filter
• Removes articles
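A filter of this kind is a one-method operator. The plain-Java sketch below uses the hypothetical class `ArticleFilter` and assumes the English articles "a", "an" and "the"; the slide does not list which words are dropped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical model, not Apex API: drops English articles and forwards everything else. */
class ArticleFilter {
    // Assumed word list; the original slide does not specify it.
    private static final Set<String> ARTICLES = new HashSet<>(Arrays.asList("a", "an", "the"));
    private final Consumer<String> output; // stands in for the output port

    ArticleFilter(Consumer<String> output) {
        this.output = output;
    }

    /** Called once per incoming word; the comparison ignores case. */
    void process(String word) {
        if (!ARTICLES.contains(word.toLowerCase(Locale.ROOT))) {
            output.accept(word);
        }
    }
}
```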
Word Counter
• Simple implementation that keeps the counts in a HashMap in memory
• Counts are automatically saved and restored during failure recovery
• Periodically emits counts at the end of every window
• Unifier not shown here, available in the Apex Malhar library
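The counter described above can be modelled in plain Java. `WordCounter` and its `endWindow` method are hypothetical names chosen for this sketch, and a `Consumer` stands in for the output port. The map is the operator's only state, so it is what a checkpoint would need to save.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical model, not Apex API: counts words and emits totals at each window end. */
class WordCounter {
    // Running totals; this map is the state that would be saved and restored on recovery.
    private final Map<String, Integer> counts = new HashMap<>();
    private final Consumer<Map<String, Integer>> output;

    WordCounter(Consumer<Map<String, Integer>> output) {
        this.output = output;
    }

    /** Called for every incoming word. */
    void process(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    /** Called at the end of every window; emits a snapshot of the running totals. */
    void endWindow() {
        output.accept(new HashMap<>(counts));
    }
}
```

Emitting a copy rather than the live map keeps downstream consumers from observing later updates.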
Database Output
• Operator in library abstracts the low level logic to communicate with database
• Only need to specify the SQL statement and how to populate it with data
Application
• Instantiate operators and connect operators by connecting the respective ports
• Give friendly names to operators and streams
Configuration
• Use friendly names to specify properties of the operator
Attributes
• Attributes are platform features and apply to all operators
• Scaling, memory, checkpointing, etc. can be configured using attributes
Attributes
• Memory can be configured per operator or globally
• Locality controls how operators are deployed on the cluster
Runtime topology
Higher level Application specification
Java Stream API (declarative)
Next Release (3.5): Support for Windowing à la Apache Beam (incubating):
@ApplicationAnnotation(name = "WordCountStreamingApiDemo")
public class ApplicationWithStreamAPI implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration configuration)
  {
    String localFolder = "./src/test/resources/data";
    ApexStream<String> stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal()).print();
    stream.populateDag(dag);
  }
}
Operator Library
RDBMS
• Vertica
• MySQL
• Oracle
• JDBC
NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase/ CouchDB
• Redis, MongoDB
• Geode
Messaging
• Kafka
• Solace
• Flume, ActiveMQ
• Kinesis, NiFi
File Systems
• HDFS/ Hive
• NFS
• S3
Parsers
• XML
• JSON
• CSV
• Avro
• Parquet
Transformations
• Filters
• Rules
• Expression
• Dedup
• Enrich
Analytics
• Dimensional Aggregations (with state management for historical data + query)
Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP
Other
• Elasticsearch
• Script (JavaScript, Python, R)
• Solr
• Twitter
Monitoring Console
Logical View
Physical View
Real-Time Dashboards
Application Designer
Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics,
yield management, and workflow automation, PubMatic enables publishers to make smarter inventory
decisions and improve revenue performance
Business Need
• Ingest and analyze high-volume clicks & views in real time to help customers improve revenue
- 200K events/second data flow
• Report critical metrics for campaign monetization from auction and client logs
- 22 TB/day data generated
• Handle ever-increasing traffic with efficient resource utilization
• Always-on ad network
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & Data Visualization console
Client Outcome
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real time instead of 5+ hours
• Helps publishers visualize campaign performance and adjust ad inventory in real time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access the ad network
Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their
devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its
customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need
• Ingest and analyze high-volume, high-speed data from thousands of devices, sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling
Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city
infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million
remote operations per year.
Business Need
• Ingest high-volume, high-speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd party apps
• Centralized management of software & operations
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Pre-built operator
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
Client Outcome
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy to partition operators
• Automatic recovery from failures
Resources for the use cases
• PubMatic
• https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
• https://www.youtube.com/watch?v=hmaSkXhHNu0
• http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• Silver Spring Networks
• https://www.youtube.com/watch?v=8VORISKeSjI
• http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
Q&A

Intro to Apache Apex @ Women in Big Data

  • 1. Intro to Apache Apex Pramod Immaneni Apache Apex PMC, Architect DataTorrent Oct 12th 2016
  • 2. Next Gen Stream Data Processing • Data from variety of sources (IoT, Kafka, files, social media etc.) • Unbounded, continuous data streams ᵒ Batch can be processed as stream (but a stream is not a batch) • (In-memory) Processing with temporal boundaries (windows) • Stateful operations: Aggregation, Rules, … -> Analytics • Results stored to variety of sinks or destinations ᵒ Streaming application can also serve data with very low latency 2 Browser Web Server Kafka Input (logs) Decompress, Parse, Filter Dimensions Aggregate Kafka Logs Kafka
  • 3. Apache Apex 3 • In-memory, distributed stream processing • Application logic broken into components called operators that run in a distributed fashion across your cluster • Natural programming model • Unobtrusive Java API to express (custom) logic • Maintain state and metrics in your member variables • Scalable, high throughput, low latency • Operators can be scaled up or down at runtime according to the load and SLA • Dynamic scaling (elasticity), compute locality • Fault tolerance & correctness • Automatically recover from node outages without having to reprocess from beginning • State is preserved, checkpointing, incremental recovery • End-to-end exactly-once • Operability • System and application metrics, record/visualize data • Dynamic changes
  • 5. Native Hadoop Integration 5 • YARN is the resource manager • HDFS for storing persistent state
  • 6. Application Development Model 6  A Stream is a sequence of data tuples  A typical Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded  Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output Stream Tupl e Tupl e er Operator er Operator er Operator er Operator er Operator er Operator
  • 7. Scalability 7 0 Unifier 1a 1b 1c 2a 2b Unifier 3 Unifier 0 4 3a2a1a 1b 2b 3b Unifier uopr1 uopr2 uopr3 uopr4 doprunifier unifier unifier Container Container NICNIC NICNIC NIC Container
  • 8. Dynamic Partitioning 8 • Partitioning change while application is running ᵒ Change number of partitions at runtime based on stats ᵒ Determine initial number of partitions dynamically • Kafka operators scale according to number of kafka partitions ᵒ Supports re-distribution of state when number of partitions change ᵒ API for custom scaler or partitioner 2b 2c 3 2a 2d 1b 1a1a 2a 1b 2b 3 1a 2b 1b 2c 3b 2a 2d 3a Unifiers not shown
  • 9. Fault Tolerance 9 • Operator state is checkpointed to persistent store ᵒ Automatically performed by engine, no additional coding needed ᵒ Asynchronous and distributed ᵒ In case of failure operators are restarted from checkpoint state • Automatic detection and recovery of failed containers ᵒ Heartbeat mechanism ᵒ YARN process status notification • Buffering to enable replay of data from recovered point ᵒ Fast, incremental recovery, spike handling • Application master state checkpointed ᵒ Snapshot of physical (and logical) plan ᵒ Execution layer change log
  • 10. Checkpointing 10  Application window  Sliding window and tumbling window  Checkpoint window  No artificial latency
  • 11. • In-memory PubSub • Stores results emitted by operator until committed • Handles backpressure / spillover to local disk • Ordering, idempotency Operator 1 Container 1 Buffer Server Node 1 Operator 2 Container 2 Node 2 Buffer Server 11
  • 12. End-to-End Exactly Once 12 • Important when writing to external systems • Data should not be duplicated or lost in the external system in case of application failures • Common external systems ᵒ Databases ᵒ Files ᵒ Message queues • Exactly-once = at-least-once + idempotency + consistent state • Data duplication must be avoided when data is replayed from checkpoint ᵒ Operators implement the logic dependent on the external system ᵒ Platform provides checkpointing and repeatable windowing
  • 13. Example application – Streaming WordCount 13 • Kafka to Mysql • Streaming source • Traditional database output • Functionality - Stream messages that contain lines of text from kafka, break them into words, drop commonly occurring words like articles, do a running count of how many times each word occurs and keep updating the totals into a database table • Five operators • Kafka Input to stream messages from Kafka • Parser to break lines into words • Filter to drop the articles • Counter to count occurrence of each word • Database output to write counts to database
  • 14. DAG 14 Kafka Input Parser Word Counter Database Output CountsWordsLines Kafka Database Apex Application • Design and develop operators or use existing ones from the library • Connect operators to form an Application • Configure operators • Configure scaling and other platform attributes • Test functionality, performance and iterate Filter Filtered
  • 15. Kafka Input 15 1 to 1 partition 1 to N partition • Available in Apex Malhar library - KafkaSinglePortStringInputOperator • Operator dynamically scales with partitions of Kafka side • Fault tolerant and idempotent, keeps track of offset for idempotent replay during failure recovery
  • 16. Parser Operator 16 • Simple parser implementation, splits strings based on a regex pattern • Define input port to receive data and an output port to output data • The callback process is called whenever data is available • To send data operator calls the emit method
  • 18. Word Counter 18 • Simple implementation that keeps the counts in a HashMap in memory • Counts are automatically saved and restored during failure recovery • Periodically emits counts at the end of every window • Unifier not shown here, available in the Apex Malhar library
  • 19. Database Output 19 • Operator in library abstracts the low level logic to communicate with database • Only need to specify the SQL statement and how to populate it with data
  • 20. Application 20 • Instantiate operators and connect operators by connecting the respective ports • Give friendly names to operators and streams
  • 21. Configuration 21 • Use friendly names to specify properties of the operator
  • 22. Attributes 22 • Attributes are platform features and apply to all operators • Scaling, memory, checkpointing etc can be configured using attributes
  • 23. Attributes 23 • Memory can be configured on per operator level or globally • Locality controls how operators are deployed on the cluster
  • 25. Higher level Application specification 25 Java Stream API (declarative) Next Release (3.5): Support for Windowing à la Apache Beam (incubating): @ApplicationAnnotation(name = "WordCountStreamingApiDemo") public class ApplicationWithStreamAPI implements StreamingApplication { @Override public void populateDAG(DAG dag, Configuration configuration) { String localFolder = "./src/test/resources/data"; ApexStream<String> stream = StreamFactory .fromFolder(localFolder) .flatMap(new Split()) .window(new WindowOption.GlobalWindow(), new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes()) .countByKey(new ConvertToKeyVal()).print(); stream.populateDag(dag); } }
  • 26. Operator Library 26 RDBMS • Vertica • MySQL • Oracle • JDBC NoSQL • Cassandra, Hbase • Aerospike, Accumulo • Couchbase/ CouchDB • Redis, MongoDB • Geode Messaging • Kafka • Solace • Flume, ActiveMQ • Kinesis, NiFi File Systems • HDFS/ Hive • NFS • S3 Parsers • XML • JSON • CSV • Avro • Parquet Transformations • Filters • Rules • Expression • Dedup • Enrich Analytics • Dimensional Aggregations (with state management for historical data + query) Protocols • HTTP • FTP • WebSocket • MQTT • SMTP Other • Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter
  • 30. Maximize Revenue w/ real-time insights 30 PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance Business Need Apex based Solution Client Outcome • Ingest and analyze high volume clicks & views in real-time to help customers improve revenue - 200K events/second data flow • Report critical metrics for campaign monetization from auction and client logs - 22 TB/day data generated • Handle ever increasing traffic with efficient resource utilization • Always-on ad network • DataTorrent Enterprise platform, powered by Apache Apex • In-memory stream processing • Comprehensive library of pre-built operators including connectors • Built-in fault tolerance • Dynamically scalable • Management UI & Data Visualization console • Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of 5+ hours • Helps Publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue • Enables PubMatic reduce OPEX with efficient compute resource utilization • Built-in fault tolerance ensures customers can always access ad network
  • 31. Industrial IoT applications
    GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
    Business Need:
    ᵒ Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss
    ᵒ Predictive analytics to reduce costly maintenance and improve customer service
    ᵒ Unified monitoring of all connected sensors and devices to minimize disruptions
    ᵒ Fast application development cycle
    ᵒ High scalability to meet changing business and application workloads
    Apex based Solution:
    ᵒ Ingestion application using DataTorrent Enterprise platform
    ᵒ Powered by Apache Apex
    ᵒ In-memory stream processing
    ᵒ Built-in fault tolerance
    ᵒ Dynamic scalability
    ᵒ Comprehensive library of pre-built operators
    ᵒ Management UI console
    Client Outcome:
    ᵒ Helps GE improve performance and lower cost by enabling real-time Big Data analytics
    ᵒ Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
    ᵒ Enables faster innovation with short application development cycle
    ᵒ No data loss and 24x7 availability of applications
    ᵒ Helps GE adjust to scalability needs with auto-scaling
  • 32. Smart energy applications
    Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million remote operations per year.
    Business Need:
    ᵒ Ingest high-volume, high speed data from millions of devices & sensors in real-time without data loss
    ᵒ Make data accessible to applications without delay to improve customer service
    ᵒ Capture & analyze historical data to understand & improve grid operations
    ᵒ Reduce the cost, time, and pain of integrating with 3rd party apps
    ᵒ Centralized management of software & operations
    Apex based Solution:
    ᵒ DataTorrent Enterprise platform, powered by Apache Apex
    ᵒ In-memory stream processing
    ᵒ Pre-built operator
    ᵒ Built-in fault tolerance
    ᵒ Dynamically scalable
    ᵒ Management UI console
    Client Outcome:
    ᵒ Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
    ᵒ Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
    ᵒ Enables fast application development for faster time to market
    ᵒ Helps Silver Spring Networks scale with easy to partition operators
    ᵒ Automatic recovery from failures
  • 33. Resources for the use cases
    ᵒ PubMatic
      https://www.youtube.com/watch?v=JSXpgfQFcU8
    ᵒ GE
      https://www.youtube.com/watch?v=hmaSkXhHNu0
      http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
    ᵒ Silver Spring Networks
      https://www.youtube.com/watch?v=8VORISKeSjI
      http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks
  • 34. Resources
    ᵒ http://apex.apache.org/
    ᵒ Learn more: http://apex.apache.org/docs.html
    ᵒ Subscribe - http://apex.apache.org/community.html
    ᵒ Download - http://apex.apache.org/downloads.html
    ᵒ Follow @ApacheApex - https://twitter.com/apacheapex
    ᵒ Meetups - http://www.meetup.com/pro/apacheapex/
    ᵒ More examples: https://github.com/DataTorrent/examples
    ᵒ Slideshare: http://www.slideshare.net/ApacheApex/presentations
    ᵒ https://www.youtube.com/results?search_query=apache+apex
    ᵒ Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
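
The Stream API pipeline on slide 25 (split lines into words, count by key in a global window, fire early with accumulating panes) can be approximated in plain Java to show what "accumulating" means. This is only a sketch and does not use the Apex API: the class `AccumulatingWordCount` and its methods are hypothetical, and splitting on whitespace is an assumption about what the slide's `Split` function does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the slide's pipeline; not part of the Apex API. */
final class AccumulatingWordCount {
    // Running counts for the single global window.
    private final Map<String, Long> counts = new LinkedHashMap<>();

    /** Roughly flatMap + countByKey: split a line into words and count each one. */
    void process(String line) {
        // Assumption: words are separated by whitespace.
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    /** An early firing: emit a copy of the totals so far without clearing them. */
    Map<String, Long> fire() {
        return new LinkedHashMap<>(counts);
    }
}
```

Calling `fire()` after each batch of `process(...)` calls plays the role of the periodic early trigger. Because the panes accumulate, each firing reports the totals seen so far rather than only the words that arrived since the previous firing.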

Editor's Notes

  • #3: Partitioning & Scaling built-in
    ᵒ Operators can be dynamically scaled: throughput, latency or any custom logic
    ᵒ Streams can be split in flexible ways: tuple hashcode, tuple field or custom logic
    ᵒ Parallel partitioning for parallel pipelines
    ᵒ MxN partitioning for generic pipelines
    ᵒ Unifier concept for merging results from partitions
    ᵒ Helps in handling skew imbalance
    Advanced Windowing support
    ᵒ Application window configurable per operator
    ᵒ Sliding window and tumbling window support
    ᵒ Checkpoint window control for fault recovery
    ᵒ Windowing does not introduce artificial latency
    Stateful fault tolerance out of the box
    ᵒ Operators recover automatically from a precise point before failure
    ᵒ At least once
    ᵒ At most once
    ᵒ Exactly once at window boundaries
  • #4: In-memory stream processing platform
    ᵒ Developed since 2012, ASF TLP since 04/2016
    ᵒ Unobtrusive Java API to express (custom) logic
    ᵒ Scale out, distributed, parallel
    ᵒ High throughput & low latency processing
    ᵒ Windowing (temporal boundary)
    ᵒ Reliability, fault tolerance, operability
    ᵒ Hadoop native
    ᵒ Compute locality, affinity
    ᵒ Dynamic updates, elasticity
  • #9: Partitioning decision (yes/no) by trigger (StatsListener)
    ᵒ Pluggable component, can use any system or custom metric
    ᵒ Externally driven partitioning example: KafkaInputOperator
    Stateful!
    ᵒ Uses checkpointed state
    ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
    ᵒ Steps:
      1. Call partitioner
      2. Modify physical plan, rewrite checkpoints as needed
      3. Undeploy old partitions from execution layer
      4. Release/request container resources
      5. Deploy new partitions (from rewritten checkpoint)
    ᵒ No loss of data (buffered)
    ᵒ Incremental operation, partitions that don’t change continue processing
    API: Partitioner interface
  • #27: A snapshot of our Apex Malhar library listing the commonly needed operators. Talk about robust support for Kafka: dynamic partitioning, 0.8 and 0.9 API support with offset management and idempotent recovery. Operators for streaming and batch.
  • #28: When your application is running, you want to know what is going on with it. That is where the monitoring console comes in. Touch upon a few salient features:
    ᵒ Gives you a detailed look into your application including all operators and their partitions.
    ᵒ Gives you performance statistics, resource usage and container information for each of the partitions.
    ᵒ Helps development and operations by providing access to the individual logs right in the console and allowing users to search the logs or change log levels on the fly.
    ᵒ Also lets you record data in a stream live at any stage.
    Talk about locality here quickly. By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster.
    Locality options for streams:
    ᵒ RACK_LOCAL: Data does not traverse network switches
    ᵒ NODE_LOCAL: Data transfer via loopback interface, frees up network bandwidth
    ᵒ CONTAINER_LOCAL: Data transfer via in-memory queues between operators, does not require serialization
    ᵒ THREAD_LOCAL: Data passed through call stack, operators share thread
    Host locality: operators can be deployed on specific hosts.
    New in 3.4.0: (Anti-)Affinity (APEXCORE-10), the ability to express relative deployment without specifying a host.
  • #29: Use the built-in real-time dashboards and widgets in your application or connect to your own. This picture shows the variety of widgets we have.
  • #31: Wanted to talk about some instances where Apache Apex and DataTorrent are being used in production today. In the cases mentioned here, the customers have talked openly about using Apache Apex and have presented their use cases at meetups. If you want to know more, we have provided links to those meetup resources at the end. PubMatic is in the advertising space and provides real-time analytics around ad impressions and clicks for publishers. They are using dimensional computation operators provided by DataTorrent to slice and dice the data in different ways and provide the results of those analytics through real-time dashboards.
  • #32: GE is the leader in Industrial IoT and has built a cloud platform called Predix to store and analyze machine data from all over the world. They are using DataTorrent and Apache Apex for high speed ingestion and processing in a fault tolerant way.
  • #33: Silver Spring Networks, also in the IoT space, provides technology to collect and analyze data from smart energy meters for utilities. They perform ingestion and analytics with DataTorrent to detect failures in the systems in advance and reduce outages.
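
Notes #3 and #9 describe splitting a stream by tuple hashcode, merging partition results with a unifier, and moving state from old to new partitions when the partition count changes. The following plain-Java sketch models those three ideas under simplifying assumptions: everything runs in one process, state is a per-partition count map, and the class `PartitionedCounter` with its methods is hypothetical rather than the Apex `Partitioner` interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of key-partitioned counting; names are not the Apex API. */
final class PartitionedCounter {
    // One state map per partition; each map stands in for one operator instance.
    private List<Map<String, Long>> partitions = new ArrayList<>();

    PartitionedCounter(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Route by tuple hashcode so the same key always reaches the same partition. */
    static int route(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Process one tuple in the partition that owns its key. */
    void process(String key) {
        partitions.get(route(key, partitions.size())).merge(key, 1L, Long::sum);
    }

    /** Unifier: merge the partial results of all partitions into one view. */
    Map<String, Long> unify() {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partition : partitions) {
            partition.forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return merged;
    }

    /** Repartition: carry existing state over to a new set of partitions. */
    void repartition(int newCount) {
        List<Map<String, Long>> rebuilt = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            rebuilt.add(new HashMap<>());
        }
        for (Map<String, Long> old : partitions) {
            old.forEach((k, v) -> rebuilt.get(route(k, newCount)).merge(k, v, Long::sum));
        }
        partitions = rebuilt;
    }

    int partitionCount() {
        return partitions.size();
    }
}
```

Re-routing every key on `repartition` keeps the sketch short. The notes describe an incremental operation in which partitions that do not change continue processing, which this simplified model does not attempt to reproduce.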