Intro to Apache Apex @ Women in Big Data

Intro to Apache Apex
Pramod Immaneni
Apache Apex PMC, Architect DataTorrent
Oct 12th 2016

Next Gen Stream Data Processing
• Data from a variety of sources (IoT, Kafka, files, social media, etc.)
• Unbounded, continuous data streams
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to a variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
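To make "processing with temporal boundaries" concrete, here is a minimal plain-Java sketch of a stateful count over fixed-size (tumbling) windows. It is an illustration only and does not use the Apex API; the class name TumblingWindowCounter and its methods are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration in plain Java; this is not the Apex windowing API.
class TumblingWindowCounter {
    private final long windowMillis;
    // Operator state: window start time (ms) -> number of events in that window.
    private final TreeMap<Long, Long> counts = new TreeMap<>();

    TumblingWindowCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    // Start of the fixed-size window that contains the given event time.
    long windowStart(long eventTimeMillis) {
        return Math.floorDiv(eventTimeMillis, windowMillis) * windowMillis;
    }

    // Processes one event of the unbounded stream by updating the state.
    void add(long eventTimeMillis) {
        counts.merge(windowStart(eventTimeMillis), 1L, Long::sum);
    }

    Map<Long, Long> counts() {
        return counts;
    }

    public static void main(String[] args) {
        TumblingWindowCounter counter = new TumblingWindowCounter(1000);
        long[] eventTimes = {10, 250, 999, 1000, 1500, 2300};
        for (long t : eventTimes) {
            counter.add(t);
        }
        System.out.println(counter.counts()); // prints {0=3, 1000=2, 2000=1}
    }
}
```

The stream never has to end for a window's result to be complete, which is the sense in which a batch can be treated as a stream but not the other way around.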
[Figure: example log-processing pipeline with the labels Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, and Kafka]

Apache Apex
• In-memory, distributed stream processing
ᵒ Application logic broken into components called operators that run in a distributed fashion across your cluster
• Natural programming model
ᵒ Unobtrusive Java API to express (custom) logic
ᵒ Maintain state and metrics in your member variables
• Scalable, high throughput, low latency
ᵒ Operators can be scaled up or down at runtime according to the load and SLA
ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
ᵒ Automatically recover from node outages without having to reprocess from beginning
ᵒ State is preserved, checkpointing, incremental recovery
ᵒ End-to-end exactly-once
• Operability
ᵒ System and application metrics, record/visualize data
ᵒ Dynamic changes

Native Hadoop Integration
• YARN is the resource manager
• HDFS for storing persistent state

Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations, and emits one or more output streams
ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
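The model above can be sketched in a few lines of plain Java: an operator receives a tuple and emits zero or more tuples, and chaining operators forms a (here linear) DAG. The interface SketchOperator and the class StreamSketch are hypothetical stand-ins for illustration; the real Apex interfaces differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical stand-in for an operator; not the Apex Operator interface.
interface SketchOperator<I, O> {
    // Called once per input tuple; results are sent downstream through emit.
    void process(I tuple, Consumer<O> emit);
}

class StreamSketch {
    // Operator 1: splits a line into words (one input tuple, many output tuples).
    static final SketchOperator<String, String> SPLIT = (line, emit) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word);
            }
        }
    };

    // Operator 2: normalizes each word to upper case.
    static final SketchOperator<String, String> UPPER =
        (word, emit) -> emit.accept(word.toUpperCase(Locale.ROOT));

    // Wires SPLIT -> UPPER -> sink, which is a three-node linear DAG.
    static List<String> run(List<String> lines) {
        List<String> sink = new ArrayList<>();
        for (String line : lines) {
            SPLIT.process(line, word -> UPPER.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        // prints [HELLO, APEX, STREAMS, OF, TUPLES]
        System.out.println(run(Arrays.asList("hello apex", "streams of tuples")));
    }
}
```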
[Figure: Directed Acyclic Graph (DAG) in which operators pass tuples along streams, ending in an output stream]

Scalability
[Figure: logical operators split into partitions (1a, 1b, 1c; 2a, 2b; 3a, 3b) whose outputs are merged by unifiers, and their deployment across containers and NICs]

Dynamic Partitioning
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
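One way to picture partitioning with a unifier is the plain-Java sketch below: tuples are routed to a partition by key hash, each partition keeps partial counts, and a unifier merges them. Because the merged result does not depend on the partition count, the count can change between runs. This is an assumption-laden illustration, not the Apex Partitioner or Unifier API; PartitionSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Apex Partitioner or Unifier API.
class PartitionSketch {
    // Routes a key to one of n partitions; floorMod keeps the index non-negative.
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    // Each partition counts only the keys routed to it.
    static List<Map<String, Integer>> countInPartitions(List<String> keys, int n) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new HashMap<String, Integer>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, n)).merge(key, 1, Integer::sum);
        }
        return partitions;
    }

    // Unifier: merges the partial results of all partitions into one result.
    static Map<String, Integer> unify(List<Map<String, Integer>> partitions) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partitions) {
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b", "a");
        Map<String, Integer> withTwo = unify(countInPartitions(keys, 2));
        Map<String, Integer> withFour = unify(countInPartitions(keys, 4));
        System.out.println(withTwo.equals(withFour)); // prints true
    }
}
```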
[Figure: operator partitions before and after repartitioning; unifiers not shown]

Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure, operators are restarted from the checkpointed state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
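The checkpoint-then-replay idea can be sketched in plain Java as follows. State is serialized after a window, later tuples stay buffered upstream, and after a simulated failure the state is restored and the buffered tuples are replayed. Java serialization and the class CheckpointSketch are choices made for this sketch only; the engine's actual checkpoint mechanism and storage may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical sketch of checkpoint and recovery; not the Apex engine's implementation.
class CheckpointSketch {
    // Serializes operator state; stands in for writing to a persistent store.
    static byte[] checkpoint(HashMap<String, Integer> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds operator state from a previously saved checkpoint.
    @SuppressWarnings("unchecked")
    static HashMap<String, Integer> restore(byte[] saved) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
            return (HashMap<String, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashMap<String, Integer> state = new HashMap<>();
        state.merge("a", 1, Integer::sum);          // window 1
        byte[] saved = checkpoint(state);           // checkpoint taken after window 1
        String[] window2 = {"a", "b"};              // tuples still buffered upstream
        for (String tuple : window2) {
            state.merge(tuple, 1, Integer::sum);
        }

        // Simulated failure: restore the checkpoint, then replay window 2.
        HashMap<String, Integer> recovered = restore(saved);
        for (String tuple : window2) {
            recovered.merge(tuple, 1, Integer::sum);
        }
        System.out.println(recovered.equals(state)); // prints true
    }
}
```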

Checkpointing
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency

Buffer Server
• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
[Figure: Operator 1 in Container 1 on Node 1, with a Buffer Server, connected to Operator 2 in Container 2 on Node 2]
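As a rough mental model of the buffer server, the plain-Java sketch below retains published tuples per window, replays them in order from a requested window, and purges windows once they are committed. The real buffer server is a separate networked component with spillover to disk; BufferSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model; the actual buffer server works differently in detail.
class BufferSketch {
    // Window id -> tuples published in that window, kept in window order.
    private final TreeMap<Long, List<String>> windows = new TreeMap<>();

    void publish(long windowId, String tuple) {
        windows.computeIfAbsent(windowId, id -> new ArrayList<String>()).add(tuple);
    }

    // Replays, in order, every retained tuple from the given window onward.
    List<String> replayFrom(long windowId) {
        List<String> out = new ArrayList<>();
        for (List<String> tuples : windows.tailMap(windowId, true).values()) {
            out.addAll(tuples);
        }
        return out;
    }

    // Drops windows up to and including the committed one; they are no longer needed.
    void purgeThrough(long committedWindowId) {
        windows.headMap(committedWindowId, true).clear();
    }

    public static void main(String[] args) {
        BufferSketch buffer = new BufferSketch();
        buffer.publish(1, "a");
        buffer.publish(1, "b");
        buffer.publish(2, "c");
        buffer.publish(3, "d");
        System.out.println(buffer.replayFrom(2)); // prints [c, d]
        buffer.purgeThrough(2);
        System.out.println(buffer.replayFrom(1)); // prints [d]
    }
}
```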

End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Exactly-once = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
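A minimal sketch of the "at-least-once + idempotency + consistent state" recipe, in plain Java: the external system stores the id of the last window written together with the data, so a window replayed after recovery is recognized and skipped. It assumes windows are repeatable (the same tuples land in the same window on replay). ExactlyOnceSketch is a hypothetical in-memory stand-in for an external store, not an Apex output operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an external system such as a transactional database.
class ExactlyOnceSketch {
    private final List<String> rows = new ArrayList<>();
    private long lastCommittedWindow = -1;

    // Writes one window's output. In a real store, the rows and the window id
    // would be committed in a single transaction.
    // Returns false when the window was already written (a replay after recovery).
    boolean writeWindow(long windowId, List<String> windowRows) {
        if (windowId <= lastCommittedWindow) {
            return false; // duplicate delivery: skip to stay idempotent
        }
        rows.addAll(windowRows);
        lastCommittedWindow = windowId;
        return true;
    }

    List<String> rows() {
        return rows;
    }

    public static void main(String[] args) {
        ExactlyOnceSketch store = new ExactlyOnceSketch();
        store.writeWindow(1, Arrays.asList("a=1"));
        store.writeWindow(2, Arrays.asList("b=1"));
        // Recovery from a checkpoint taken after window 1: window 2 is replayed.
        store.writeWindow(2, Arrays.asList("b=1"));
        store.writeWindow(3, Arrays.asList("c=1"));
        System.out.println(store.rows()); // prints [a=1, b=1, c=1]
    }
}
```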

Example application – Streaming WordCount
• Kafka to MySQL
• Streaming source
• Traditional database output
• Functionality
- Stream messages that contain lines of text from Kafka, break them into words, drop commonly occurring words like articles, keep a running count of how many times each word occurs, and keep updating the totals in a database table
• Five operators
• Kafka Input to stream messages from Kafka
• Parser to break lines into words
• Filter to drop the articles
• Counter to count occurrence of each word
• Database output to write counts to database

DAG
[Figure: Apex application DAG. Kafka → Kafka Input → (Lines) → Parser → (Words) → Filter → (Filtered) → Word Counter → (Counts) → Database Output → Database]
• Design and develop operators or use existing ones from the library
• Connect operators to form an Application
• Configure operators
• Configure scaling and other platform attributes
• Test functionality, performance and iterate

Kafka Input
[Figure: two partitioning layouts, labeled "1 to 1 partition" and "1 to N partition"]
• Available in Apex Malhar library - KafkaSinglePortStringInputOperator
• Operator dynamically scales with the partitions on the Kafka side
• Fault tolerant and idempotent; keeps track of offsets for idempotent replay during failure recovery

Parser Operator
• Simple parser implementation, splits strings based on a regex pattern
• Define input port to receive data and an output port to output data
• The callback process is called whenever data is available
• To send data, the operator calls the emit method
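The slide's code is not in this transcript, so here is a self-contained sketch of the shape it describes: an input port whose process callback splits each line with a regex and calls emit on an output port. InPortSketch and OutPortSketch are hypothetical stand-ins so the example compiles without Apex on the classpath; they only loosely mirror Apex's DefaultInputPort and DefaultOutputPort. The regex passed in main is an assumed example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for an input port with a process callback.
abstract class InPortSketch<T> {
    abstract void process(T tuple);
}

// Hypothetical stand-in for an output port; a real port hands tuples downstream.
class OutPortSketch<T> {
    final List<T> emitted = new ArrayList<>();

    void emit(T tuple) {
        emitted.add(tuple);
    }
}

class ParserSketch {
    private final Pattern separator;
    final OutPortSketch<String> output = new OutPortSketch<>();
    final InPortSketch<String> input;

    ParserSketch(String separatorRegex) {
        this.separator = Pattern.compile(separatorRegex);
        this.input = new InPortSketch<String>() {
            @Override
            void process(String line) {
                // Called for every incoming line; emits one tuple per word.
                for (String word : separator.split(line)) {
                    if (!word.isEmpty()) {
                        output.emit(word);
                    }
                }
            }
        };
    }

    public static void main(String[] args) {
        ParserSketch parser = new ParserSketch("\\W+");
        parser.input.process("The quick, brown fox");
        System.out.println(parser.output.emitted); // prints [The, quick, brown, fox]
    }
}
```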

Filter
• Removes articles
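A filter of this kind can be as small as a set lookup. The sketch below is plain Java rather than an Apex operator; the class name ArticleFilterSketch is hypothetical, and the exact word list (a, an, the) is an assumption because the slide does not spell it out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the filtering rule only; port wiring is omitted.
class ArticleFilterSketch {
    // Assumed list of English articles; the slide does not give the exact words.
    private static final Set<String> ARTICLES =
        new HashSet<>(Arrays.asList("a", "an", "the"));

    // Returns true when the word should be passed downstream.
    static boolean keep(String word) {
        return !ARTICLES.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String word : Arrays.asList("The", "quick", "fox")) {
            if (keep(word)) {
                System.out.println(word); // prints quick, then fox
            }
        }
    }
}
```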

Word Counter
• Simple implementation that keeps the counts in a HashMap in memory
• Counts are automatically saved and restored during failure recovery
• Periodically emits counts at the end of every window
• Unifier not shown here, available in the Apex Malhar library
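The counter described above might look roughly like the following plain-Java sketch: counts live in a HashMap member, and a snapshot of the running totals is emitted when a window ends. The method name endWindow is modeled on the Apex operator lifecycle, but WordCounterSketch itself is a hypothetical stand-in, and the emitted list replaces a real output port.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a counting operator; not an Apex class.
class WordCounterSketch {
    // Operator state kept in memory; per the slide, the platform saves and restores it.
    private final HashMap<String, Integer> counts = new HashMap<>();
    // Stand-in for an output port: one emitted snapshot per window.
    final List<Map<String, Integer>> emitted = new ArrayList<>();

    // Called for every incoming word.
    void process(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Called at the end of each window: emit a copy of the running totals.
    void endWindow() {
        emitted.add(new HashMap<String, Integer>(counts));
    }

    public static void main(String[] args) {
        WordCounterSketch counter = new WordCounterSketch();
        counter.process("apex");
        counter.process("stream");
        counter.process("apex");
        counter.endWindow();
        counter.process("apex");
        counter.endWindow();
        System.out.println(counter.emitted.get(0).get("apex")); // prints 2
        System.out.println(counter.emitted.get(1).get("apex")); // prints 3
    }
}
```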

Database Output
• Operator in library abstracts the low level logic to communicate with database
• Only need to specify the SQL statement and how to populate it with data

Application
• Instantiate operators and connect operators by connecting the respective ports
• Give friendly names to operators and streams
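To illustrate what "instantiate and connect with friendly names" amounts to, here is a toy graph builder in plain Java. The method names addOperator and addStream echo the Apex DAG API, but DagSketch is a hypothetical stand-in that tracks names only (no ports or operator instances); the operator and stream names are taken from the word-count example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of an application graph; not the Apex DAG class.
class DagSketch {
    // Operator name -> names of downstream operators, in insertion order.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();
    // Stream name -> description of the connection it represents.
    private final Map<String, String> streams = new LinkedHashMap<>();

    void addOperator(String name) {
        if (edges.putIfAbsent(name, new ArrayList<String>()) != null) {
            throw new IllegalArgumentException("duplicate operator: " + name);
        }
    }

    void addStream(String streamName, String from, String to) {
        if (!edges.containsKey(from) || !edges.containsKey(to)) {
            throw new IllegalArgumentException("unknown operator in stream " + streamName);
        }
        edges.get(from).add(to);
        streams.put(streamName, from + " -> " + to);
    }

    // Kahn's algorithm: returns a valid processing order or throws on a cycle.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String op : edges.keySet()) {
            inDegree.put(op, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String op = ready.remove();
            order.add(op);
            for (String target : edges.get(op)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("graph has a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch();
        for (String op : new String[] {"kafkaInput", "parser", "filter", "counter", "dbOutput"}) {
            dag.addOperator(op);
        }
        dag.addStream("lines", "kafkaInput", "parser");
        dag.addStream("words", "parser", "filter");
        dag.addStream("filtered", "filter", "counter");
        dag.addStream("counts", "counter", "dbOutput");
        // prints [kafkaInput, parser, filter, counter, dbOutput]
        System.out.println(dag.topologicalOrder());
    }
}
```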

Configuration
• Use friendly names to specify properties of the operator
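As a sketch of how friendly names can address operator properties, the plain-Java snippet below builds keys of the form dt.operator.<name>.prop.<property>. That key pattern is how I recall Apex configuration files being written and should be checked against the Apex documentation; the operator names, property names, and values here are illustrative only, and OperatorConfigSketch is a hypothetical helper.

```java
import java.util.Properties;

// Hypothetical helper; the key pattern is an assumption to verify against the Apex docs.
class OperatorConfigSketch {
    // Builds a property key from the operator's friendly name and a property name.
    static String propertyKey(String operatorName, String propertyName) {
        return "dt.operator." + operatorName + ".prop." + propertyName;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // "kafkaInput", "topic", "dbOutput" and "tablename" are illustrative names only.
        conf.setProperty(propertyKey("kafkaInput", "topic"), "lines");
        conf.setProperty(propertyKey("dbOutput", "tablename"), "word_counts");
        System.out.println(conf.getProperty("dt.operator.kafkaInput.prop.topic")); // prints lines
    }
}
```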

Attributes
• Attributes are platform features and apply to all operators
• Scaling, memory, checkpointing, etc. can be configured using attributes

Attributes
• Memory can be configured at the per-operator level or globally
• Locality controls how operators are deployed on the cluster
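The "per operator or globally" idea can be sketched as a lookup with a fallback: a value set for a specific operator wins, otherwise the value set for all operators ("*") applies. The key pattern dt.operator.<name>.attr.<ATTRIBUTE> and the attribute name MEMORY_MB are as I recall them from Apex and should be verified; the resolution logic in AttributeSketch is a hypothetical simplification, not the platform's actual rules.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of attribute lookup; not the Apex implementation.
class AttributeSketch {
    private final Map<String, String> settings = new HashMap<>();

    // Assumed key pattern; "*" as the operator name stands for all operators.
    static String attributeKey(String operatorName, String attribute) {
        return "dt.operator." + operatorName + ".attr." + attribute;
    }

    void set(String operatorName, String attribute, String value) {
        settings.put(attributeKey(operatorName, attribute), value);
    }

    // Per-operator value wins; otherwise fall back to the global ("*") value.
    String resolve(String operatorName, String attribute) {
        String specific = settings.get(attributeKey(operatorName, attribute));
        return specific != null ? specific : settings.get(attributeKey("*", attribute));
    }

    public static void main(String[] args) {
        AttributeSketch attributes = new AttributeSketch();
        attributes.set("*", "MEMORY_MB", "1024");       // global default
        attributes.set("counter", "MEMORY_MB", "2048"); // override for one operator
        System.out.println(attributes.resolve("counter", "MEMORY_MB")); // prints 2048
        System.out.println(attributes.resolve("parser", "MEMORY_MB"));  // prints 1024
    }
}
```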

Higher level Application specification
Java Stream API (declarative)
Next Release (3.5): Support for Windowing à la Apache Beam (incubating):
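// Imports are assumed from the Apex Malhar 3.5-era package layout; verify against your version.
import org.apache.apex.malhar.lib.window.TriggerOption;
import org.apache.apex.malhar.lib.window.WindowOption;
import org.apache.apex.malhar.stream.api.ApexStream;
import org.apache.apex.malhar.stream.api.impl.StreamFactory;
import org.apache.hadoop.conf.Configuration;
import org.joda.time.Duration;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// Split and ConvertToKeyVal are user-defined functions that are not shown on the slide.
@ApplicationAnnotation(name = "WordCountStreamingApiDemo")
public class ApplicationWithStreamAPI implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration configuration)
  {
    String localFolder = "./src/test/resources/data";
    ApexStream<String> stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal()).print();
    stream.populateDag(dag);
  }
}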

Operator Library
RDBMS
• Vertica
• MySQL
• Oracle
• JDBC
NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase / CouchDB
• Redis, MongoDB
• Geode
Messaging
• Kafka
• Solace
• Flume, ActiveMQ
• Kinesis, NiFi
File Systems
• HDFS / Hive
• NFS
• S3
Parsers
• XML
• JSON
• CSV
• Avro
• Parquet
Transformations
• Filters
• Rules
• Expression
• Dedup
• Enrich
Analytics
• Dimensional Aggregations (with state management for historical data + query)
Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP
Other
• Elasticsearch
• Script (JavaScript, Python, R)
• Solr
• Twitter

Monitoring Console
[Figures: monitoring console screenshots, Logical View and Physical View]

Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance.
Business Need
• Ingest and analyze high volume clicks & views in real-time to help customers improve revenue
- 200K events/second data flow
• Report critical metrics for campaign monetization from auction and client logs
- 22 TB/day data generated
• Handle ever increasing traffic with efficient resource utilization
• Always-on ad network
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & Data Visualization console
Client Outcome
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of 5+ hours
• Helps Publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access ad network

Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need
• Ingest and analyze high-volume, high speed data from thousands of devices, sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling

Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices, conducts 2 million remote operations per year.
Business Need
• Ingest high-volume, high speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd party apps
• Centralized management of software & operations
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• Pre-built operator
• Dynamically scalable
• Management UI console
Client Outcome
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy to partition operators
• Automatic recovery from failures

Resources for the use cases
• PubMatic
ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• Silver Spring Networks
ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks

Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
