Next Gen Big Data Analytics with Apache Apex
Apache Big Data, Vancouver
May 9th 2016
Thomas Weise, Apache Apex PMC
@thweise thw@apache.org
Stream Processing
• Data from a variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded stream data
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to a variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
[Figure: example log pipeline. Labels: Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]
Apache Apex Features
• In-memory stream processing platform
ᵒ Developed since 2012, ASF TLP since 04/2016
• Unobtrusive Java API to express (custom) logic
• Scale out, distributed, parallel
• High throughput & low latency processing
• Windowing (temporal boundary)
• Reliability, fault tolerance, operability
• Hadoop native
• Compute locality, affinity
• Dynamic updates, elasticity
Applications on Apex
• Distributed processing
ᵒ Application logic broken into components called operators that run in a distributed fashion across your cluster
• Natural programming model
ᵒ Code as if you were writing normal Java logic
ᵒ Maintain state in your application variables
• Scalable
ᵒ Operators can be scaled up or down at runtime according to the load and SLA
• Fault tolerant
ᵒ Automatically recover from node outages without having to reprocess from beginning
ᵒ State is preserved, checkpointing, incremental recovery
ᵒ Long running applications
• Operational insight
ᵒ See how each operator is performing and even record data
Apex Stack Overview
Apache Apex Malhar Library
Native Hadoop Integration
• YARN is the resource manager
• HDFS used for storing any persistent state
Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG). Labels: Operator (six instances), Tuple, Stream, Output Stream]
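The stream / operator / DAG model above can be sketched in plain Java. This is a simplified illustration, not the Apex API: the `Operator` interface, `stream`, `run`, and the three example operators are hypothetical names invented here, and the DAG is a single linear chain processed on one thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, simplified operator model; this is NOT the Apex API. */
final class MiniDag {
    /** An operator consumes one tuple at a time and may emit tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> out);
    }

    /** Connects two operators with a stream: every tuple "first" emits is fed to "second". */
    static <A, B, C> Operator<A, C> stream(Operator<A, B> first, Operator<B, C> second) {
        return (tuple, out) -> first.process(tuple, mid -> second.process(mid, out));
    }

    /** Pushes an input stream through a linear DAG and collects the output stream. */
    static <I, O> List<O> run(List<I> input, Operator<I, O> dag) {
        List<O> output = new ArrayList<>();
        for (I tuple : input) {
            dag.process(tuple, output::add);
        }
        return output;
    }

    /** Example DAG: trim -> drop empty lines -> emit line length. */
    static List<Integer> lengthsOfNonEmptyLines(List<String> lines) {
        Operator<String, String> trim = (line, out) -> out.accept(line.trim());
        Operator<String, String> filter = (line, out) -> {
            if (!line.isEmpty()) {
                out.accept(line);
            }
        };
        Operator<String, Integer> length = (line, out) -> out.accept(line.length());
        return run(lines, stream(stream(trim, filter), length));
    }

    public static void main(String[] args) {
        System.out.println(lengthsOfNonEmptyLines(List.of(" GET /a ", "   ", "POST /b")));
    }
}
```

Each operator only sees one tuple at a time and knows nothing about its neighbours, which is what lets a platform place, partition, and restart operators independently.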
Streaming Windows
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
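To make the difference between tumbling and sliding windows concrete, here is a minimal plain-Java sketch. It assumes the input is already one aggregate value per small streaming window; the class and method names are illustrative and not part of Apex.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative window arithmetic over per-streaming-window values (not the Apex API). */
final class WindowSketch {
    /** Tumbling: one sum per group of n consecutive streaming windows; a trailing partial group is not emitted. */
    static List<Long> tumbling(long[] perWindow, int n) {
        List<Long> out = new ArrayList<>();
        long sum = 0;
        for (int i = 0; i < perWindow.length; i++) {
            sum += perWindow[i];
            if ((i + 1) % n == 0) {
                out.add(sum);
                sum = 0; // state is reset at the window boundary
            }
        }
        return out;
    }

    /** Sliding: after every streaming window, the sum over the most recent n windows. */
    static List<Long> sliding(long[] perWindow, int n) {
        List<Long> out = new ArrayList<>();
        Deque<Long> recent = new ArrayDeque<>();
        long sum = 0;
        for (long value : perWindow) {
            recent.addLast(value);
            sum += value;
            if (recent.size() > n) {
                sum -= recent.removeFirst(); // the oldest window slides out
            }
            if (recent.size() == n) {
                out.add(sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] perWindow = {1, 2, 3, 4, 5, 6};
        System.out.println("tumbling(3): " + tumbling(perWindow, 3));
        System.out.println("sliding(3):  " + sliding(perWindow, 3));
    }
}
```

With six input windows and n = 3, the tumbling variant emits two results while the sliding variant emits four, one per streaming window once the first three have arrived.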
Event time & Dimensions Computation
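[Figure: dimensional aggregates keyed by event time, shown at three points as events arrive]
Events shown (key, event time): k=A t=4:00; k=A t=4:30; k=B t=5:00; k=B t=5:59; k=A t=5:00
Aggregates at (All) : 1: t=4:00 : 1; k=A, t=4:00 : 1
Aggregates at (All) : 4: t=4:00 : 2; t=5:00 : 2; k=A, t=4:00 : 2; k=B, t=5:00 : 2
Aggregates at (All) : 5: t=4:00 : 2; t=5:00 : 3; k=A, t=4:00 : 2; k=A, t=5:00 : 1; k=B, t=5:00 : 2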
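The counts on this slide can be reproduced with a small sketch that buckets each event by the hour of its event time and increments three aggregates: the global count, the per-hour count, and the per-key-per-hour count. The class below is a plain-Java illustration with invented names; the arrival order used in `replaySlideEvents` is an assumption chosen to be consistent with the three aggregate snapshots.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative event-time dimension counting (not the Apex/Malhar dimensions operator). */
final class DimensionsSketch {
    private final Map<String, Integer> counts = new TreeMap<>();

    /** Adds one event, identified by its key and its event time in minutes since midnight. */
    void add(String key, int minuteOfDay) {
        int hour = minuteOfDay / 60; // bucket by event time, not by arrival time
        String bucket = "t=" + hour + ":00";
        counts.merge("(All)", 1, Integer::sum);
        counts.merge(bucket, 1, Integer::sum);
        counts.merge("k=" + key + ", " + bucket, 1, Integer::sum);
    }

    int count(String dimension) {
        return counts.getOrDefault(dimension, 0);
    }

    /** Replays the five events from the slide; the arrival order is assumed. */
    static DimensionsSketch replaySlideEvents() {
        DimensionsSketch d = new DimensionsSketch();
        d.add("A", 4 * 60);      // k=A, t=4:00
        d.add("A", 4 * 60 + 30); // k=A, t=4:30 falls into the 4:00 bucket
        d.add("B", 5 * 60);      // k=B, t=5:00
        d.add("B", 5 * 60 + 59); // k=B, t=5:59 falls into the 5:00 bucket
        d.add("A", 5 * 60);      // k=A, t=5:00
        return d;
    }

    public static void main(String[] args) {
        System.out.println(replaySlideEvents().counts);
    }
}
```

Because the bucket is derived from the event's own timestamp, an event that arrives late still updates the aggregate for the hour it belongs to.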
Application in Java
Operators
Operators (contd)
Partitioning
[Figure: NxM Partitions and Unifier. Panels: Logical DAG (operators 0, 1, 2, 3); Physical Diagram with operator 1 with 3 partitions; Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck; Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier]
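The role of a unifier can be sketched as follows: each partition of an operator keeps its own partial state, and the unifier merges the partial results into the value a single unpartitioned instance would have produced. This is a plain-Java illustration with invented names and round-robin routing, not the Apex partitioning API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative partition + unifier arithmetic (not the Apex API). */
final class UnifierSketch {
    /** One partition of a summing operator; it only sees its share of the tuples. */
    static final class SumPartition {
        private long sum;

        void process(int tuple) {
            sum += tuple;
        }

        long partialSum() {
            return sum;
        }
    }

    /** Unifier: merges the partial sums of all partitions into one result. */
    static long unify(List<SumPartition> partitions) {
        long total = 0;
        for (SumPartition partition : partitions) {
            total += partition.partialSum();
        }
        return total;
    }

    /** Routes tuples round-robin to the partitions, then unifies the partial results. */
    static long partitionedSum(int[] tuples, int partitionCount) {
        List<SumPartition> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new SumPartition());
        }
        for (int i = 0; i < tuples.length; i++) {
            partitions.get(i % partitionCount).process(tuples[i]);
        }
        return unify(partitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionedSum(new int[] {1, 2, 4, 3, 1}, 3));
    }
}
```

A single unifier receives the output of every upstream partition, which is why the slide distinguishes layouts where it becomes a bottleneck from layouts where it does not.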
Advanced Partitioning
[Figure: Parallel Partition. Panels: Logical DAG (operators 0, 1, 2, 3, 4); Physical DAG (1a, 1b, Unifier, 2, 3, 4); Physical DAG with Parallel Partition (1a, 2a, 3a and 1b, 2b, 3b, Unifier)]
[Figure: Cascading Unifiers. Panels: Logical Plan (uopr, dopr); Execution Plan, for N = 4; M = 1; Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers. Labels: Container, NIC, uopr1 to uopr4, unifier, dopr]
Dynamic Partitioning
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
[Figure: physical plans with different partition counts for operators 1, 2, and 3. Unifiers not shown]
How tuples are partitioned
• Tuple hashcode and mask used to determine destination partition
ᵒ Mask picks the last n bits of the hashcode of the tuple
ᵒ hashcode method can be overridden
• StreamCodec can be used to specify custom hashcode for tuples
ᵒ Can also be used for specifying custom serialization
tuple: { Name, 24204842, San Jose }
Hashcode: 00101010001 0101
Mask: 0b11 (last 2 bits of the hashcode)
Masked bits -> Partition: 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 4
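The mask step can be written in a few lines of Java. This sketch assumes the number of partitions is a power of two, so that the mask is simply `partitionCount - 1`; the class and method names are illustrative, and the returned index is 0-based, whereas the slide numbers partitions from 1.

```java
/** Illustrative hash-and-mask routing (not the Apex API). */
final class MaskPartitioner {
    /**
     * Returns the 0-based partition index for a tuple hash code by keeping
     * only its last n bits, where 2^n == partitionCount.
     */
    static int partitionFor(int hashCode, int partitionCount) {
        if (partitionCount <= 0 || Integer.bitCount(partitionCount) != 1) {
            throw new IllegalArgumentException("partitionCount must be a power of two");
        }
        int mask = partitionCount - 1; // e.g. 4 partitions -> 0b11
        return hashCode & mask;        // non-negative even for negative hash codes
    }

    public static void main(String[] args) {
        int slideHash = 0b001010100010101; // the hashcode shown on the slide
        // Last two bits are 01, i.e. index 1, which the slide labels partition 2.
        System.out.println(partitionFor(slideHash, 4));
    }
}
```

To partition on a single field rather than the whole tuple, the value passed as `hashCode` could be computed from that field only, which is the kind of customization the slide attributes to StreamCodec.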
Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
Buffer Server
• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
[Figure labels: Operator 1, Container 1, Buffer Server, Node 1; Operator 2, Container 2, Node 2]
Recovery Scenario
[Figure: four snapshots of a sum operator reading the stream … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 (BW = begin window, EW = end window), with operator state sum = 0, then 7, then 10, then 7]
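One way to read the four snapshots is: the operator starts at 0, reaches 7 at the end of window 1 (1 + 2 + 4), reaches 10 partway through window 2, then fails and is restored to the checkpointed value 7 before window 2 is replayed. The sketch below follows that reading. It assumes a checkpoint is taken at every window boundary, which is a simplification made here, and all names are illustrative rather than Apex API.

```java
/** Illustrative checkpoint-and-replay of a stateful sum (not the Apex API). */
final class RecoverySketch {
    static final class SumOperator {
        long sum;                // live operator state
        private long checkpoint; // last saved copy of the state

        void process(int tuple) {
            sum += tuple;
        }

        /** Assumption: state is checkpointed at every end-of-window marker. */
        void endWindow() {
            checkpoint = sum;
        }

        /** On failure the operator restarts from the last checkpoint. */
        void recover() {
            sum = checkpoint;
        }
    }

    /** Returns the state at: start, EW1, mid window 2, after recovery, after replaying window 2. */
    static long[] run() {
        SumOperator op = new SumOperator();
        long atStart = op.sum;

        for (int tuple : new int[] {1, 2, 4}) { // window 1
            op.process(tuple);
        }
        op.endWindow();
        long atEndOfWindow1 = op.sum;

        op.process(3);                          // window 2, first tuple only
        long beforeFailure = op.sum;

        op.recover();                           // failure: uncheckpointed work is discarded
        long afterRecovery = op.sum;

        for (int tuple : new int[] {3, 1}) {    // window 2 replayed in full
            op.process(tuple);
        }
        op.endWindow();
        long afterReplay = op.sum;

        return new long[] {atStart, atEndOfWindow1, beforeFailure, afterRecovery, afterReplay};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run()));
    }
}
```

Replaying whole windows from the checkpoint is what keeps the tuple 3 from being counted twice in the operator's state.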
Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
ᵒ No messages lost
ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
ᵒ Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
Exactly-once
• At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior
End-to-End Exactly Once
• Becomes important when writing to external systems
• Data should not be duplicated or lost in the external system even in case of application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Platform support for at least once is a must so that no data is lost
• Data duplication must still be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
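One common way for an output operator to avoid duplicates on replay is to store the ID of the last committed window together with the data, in the same transaction, and to skip any replayed window whose ID is not newer. The sketch below illustrates that idea with an in-memory stand-in for the external store. `Store`, `Sink`, and their methods are hypothetical names, and the approach relies on replayed windows containing the same tuples as the original ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative idempotent sink keyed on window IDs (not the Apex API). */
final class ExactlyOnceSinkSketch {
    /** Stand-in for an external transactional store (hypothetical, in-memory). */
    static final class Store {
        final List<String> rows = new ArrayList<>();
        long committedWindowId = -1;

        /** Writes the rows and the window ID together, as a single transaction would. */
        void commit(long windowId, List<String> batch) {
            rows.addAll(batch);
            committedWindowId = windowId;
        }
    }

    /** Output operator: buffers one window of tuples and commits it at most once. */
    static final class Sink {
        private final Store store;
        private final List<String> pending = new ArrayList<>();
        private long currentWindowId;

        Sink(Store store) {
            this.store = store;
        }

        void beginWindow(long windowId) {
            currentWindowId = windowId;
            pending.clear();
        }

        void process(String tuple) {
            pending.add(tuple);
        }

        void endWindow() {
            // Windows at or below the committed ID were already written before the failure.
            if (currentWindowId > store.committedWindowId) {
                store.commit(currentWindowId, new ArrayList<>(pending));
            }
        }
    }

    /** Writes windows 1 and 2, simulates a restart that replays both, then writes window 3. */
    static List<String> demo() {
        Store store = new Store();

        Sink first = new Sink(store);
        first.beginWindow(1); first.process("a"); first.process("b"); first.endWindow();
        first.beginWindow(2); first.process("c"); first.endWindow();

        Sink recovered = new Sink(store); // fresh instance after a simulated failure
        recovered.beginWindow(1); recovered.process("a"); recovered.process("b"); recovered.endWindow();
        recovered.beginWindow(2); recovered.process("c"); recovered.endWindow();
        recovered.beginWindow(3); recovered.process("d"); recovered.endWindow();

        return store.rows;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The platform side of the contract is at-least-once delivery with repeatable windows; the sink adds the idempotency, so together they behave as exactly-once from the external system's point of view.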
Compute Locality
• By default operators are deployed in containers (processes) on different nodes across the Hadoop cluster
• Locality options for streams
ᵒ RACK_LOCAL: Data does not traverse network switches
ᵒ NODE_LOCAL: Data transfer via loopback interface, frees up network bandwidth
ᵒ CONTAINER_LOCAL: Data transfer via in memory queues between operators, does not require serialization
ᵒ THREAD_LOCAL: Data passed through call stack, operators share thread
• Host Locality
ᵒ Operators can be deployed on specific hosts
• New in 3.4.0: (Anti-)Affinity (APEXCORE-10)
ᵒ Ability to express relative deployment without specifying a host
Data Processing Pipeline Example: App Builder
Monitoring Console: Logical View
Monitoring Console: Physical View
Real-Time Dashboards: Real Time Visualization
Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance.
Business Need
• Ingest and analyze high volume clicks & views in real-time to help customers improve revenue
ᵒ 200K events/second data flow
• Report critical metrics for campaign monetization from auction and client logs
ᵒ 22 TB/day data generated
• Handle ever increasing traffic with efficient resource utilization
• Always-on ad network
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & Data Visualization console
Client Outcome
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of 5+ hours
• Helps Publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access ad network
Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need
• Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling
Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million remote operations per year.
Business Need
• Ingest high-volume, high speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd party apps
• Centralized management of software & operations
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Pre-built operators
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
Client Outcome
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy to partition operators
• Automatic recovery from failures
Learn about another use case?
Next-Gen Decision Making in Under 2ms
Ilya Ganelin, Capital One Data Innovation Lab
Monday May 9, 2016 5:10pm - 6:00pm
Plaza B
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - http://www.meetup.com/topics/apache-apex
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
Q&A
