Smart Partitioning with Apache Apex
Pramod Immaneni, Architect, PMC member
Thomas Weise, Architect & Co-founder, PMC member
May 19th 2016
Stream Processing
• Data from a variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded stream data
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to a variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
[Figure: example pipeline with stages labeled Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]
Apache Apex Features
• In-memory stream processing platform
ᵒ Developed since 2012, ASF TLP since 04/2016
• Unobtrusive Java API to express (custom) logic
• Scale out, distributed, parallel
• High throughput & low latency processing
• Windowing (temporal boundary)
• Reliability, fault tolerance, operability
• Hadoop native
• Compute locality, affinity
• Dynamic updates, elasticity
Big data processing & partitioning
• Large amount of data to process
• Data could be streaming in at high velocity
• Pipelining and partitioning to solve the problem
• Partitioning
ᵒ Run same logic in multiple processes or threads
ᵒ Each partition processes a subset of the data
• Apex supports partitioning out of the box
ᵒ Different partitioning schemes
ᵒ Unification
ᵒ Static & Dynamic Partitioning
ᵒ Separation of processing logic from scaling decisions
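The idea of running the same logic on subsets of the data and then unifying the results can be sketched in plain Java. This is only an illustration and does not use the Apex API: the class `PartitionedCount` and its methods `split`, `countPartition`, `unify` and `run` are hypothetical names, and the split by `hashCode` modulo the partition count is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Apex code: the same counting logic runs in each
// partition on a subset of the data, and a unifier merges the partial results.
class PartitionedCount {

    // Operator logic, identical in every partition: count the words it receives.
    static Map<String, Integer> countPartition(List<String> subset) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : subset) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Give each partition a disjoint subset of the input, chosen by hash code.
    static List<List<String>> split(List<String> input, int partitionCount) {
        List<List<String>> subsets = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            subsets.add(new ArrayList<>());
        }
        for (String word : input) {
            subsets.get(Math.floorMod(word.hashCode(), partitionCount)).add(word);
        }
        return subsets;
    }

    // Unifier: merge the partial counts of all partitions into one result.
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        }
        return merged;
    }

    // Scaling decision (partitionCount) is kept apart from the counting logic.
    static Map<String, Integer> run(List<String> input, int partitionCount) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> subset : split(input, partitionCount)) {
            partials.add(countPartition(subset));
        }
        return unify(partials);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(run(words, 3));
    }
}
```

The result is the same for any partition count, which is the point of separating processing logic from scaling decisions.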
Apex Platform Overview
Native Hadoop Integration
• YARN is the resource manager
• HDFS used for storing any persistent state
Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations & emits one or more output streams
ᵒ Each Operator is your custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG) of operators connected by streams of tuples, ending in an output stream]
Streaming Windows
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
Partitioning
[Figure: NxM partitions with Unifier. Panels: Logical DAG (operators 0, 1, 2, 3); physical diagram with operator 1 in 3 partitions; physical DAG with (1a, 1b, 1c) and (2a, 2b), no bottleneck; physical DAG with (1a, 1b, 1c) and (2a, 2b), bottleneck on intermediate Unifier]
Advanced Partitioning
[Figure: Parallel Partition. Panels: Logical DAG (operators 0 to 4); physical DAG with a Unifier; physical DAG with parallel partition (1a, 2a, 3a) and (1b, 2b, 3b)]
[Figure: Cascading Unifiers. Panels: Logical Plan; Execution Plan for N = 4, M = 1; Execution Plan for N = 4, M = 1, K = 2 with cascading unifiers. Operators (uopr1 to uopr4, dopr) and unifiers are placed in containers connected through NICs]
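The shape of a cascading unifier tree can be worked out with simple arithmetic. The sketch below assumes that K in the execution plan is the maximum number of inputs a single unifier may merge; that reading, the class name `CascadingUnifierPlan` and the method `unifiersPerLevel` are assumptions for illustration, not part of the Apex API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: how many unifiers each cascade level needs when every
// unifier may merge at most `limit` inputs (assumed meaning of K on the slide).
class CascadingUnifierPlan {

    // Returns the unifier count per level, first level (next to the partitions) first.
    static List<Integer> unifiersPerLevel(int upstreamPartitions, int limit) {
        if (upstreamPartitions < 1 || limit < 2) {
            throw new IllegalArgumentException("need >= 1 partition and a limit >= 2");
        }
        List<Integer> levels = new ArrayList<>();
        int inputs = upstreamPartitions;
        // Keep adding levels until a single stream reaches the downstream operator.
        while (inputs > 1) {
            int unifiers = (inputs + limit - 1) / limit; // ceiling division
            levels.add(unifiers);
            inputs = unifiers;
        }
        return levels;
    }

    public static void main(String[] args) {
        // N = 4 upstream partitions, K = 2 inputs per unifier.
        System.out.println(unifiersPerLevel(4, 2));
    }
}
```

For N = 4 and K = 2 this yields two unifiers on the first level and one on the second, three in total, which spreads the merge work over several containers instead of a single one.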
Dynamic Partitioning
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
[Figure: physical plans before and after dynamic repartitioning of operators 1, 2 and 3; unifiers not shown]
Kafka Consumer Partitioning
[Figure: two panels, 1 to 1 Partitioning and 1 to N Partitioning]
File Reader Partitioning
[Figure: input files File1, File2, File3, File4, File5, …, Filen]
• User can specify the number of partitions to start with
• Partitions are created accordingly
• Files are distributed among the partitions
• User can change the number of partitions at runtime by setting a property
• New partitions will get created automatically
• Remaining files will be balanced among the new partition set
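The distribution and rebalancing of files can be sketched as follows. This is not the file reader operator's code: round-robin assignment is an assumption chosen for simplicity, and `FileDistribution`, `distribute` and `rebalance` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: spread files over partitions, then rebalance the files
// that are still pending when the partition count is changed at runtime.
class FileDistribution {

    // Round-robin assignment of files to the given number of partitions.
    static List<List<String>> distribute(List<String> files, int partitionCount) {
        if (partitionCount < 1) {
            throw new IllegalArgumentException("need at least one partition");
        }
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            assignment.get(i % partitionCount).add(files.get(i));
        }
        return assignment;
    }

    // Collect the files not yet processed by the old partitions and
    // distribute them over the new partition set.
    static List<List<String>> rebalance(List<List<String>> remaining, int newPartitionCount) {
        List<String> pending = new ArrayList<>();
        for (List<String> files : remaining) {
            pending.addAll(files);
        }
        return distribute(pending, newPartitionCount);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("File1", "File2", "File3", "File4", "File5");
        System.out.println(distribute(files, 2));
    }
}
```

Only the pending files move; files that were already read stay accounted for by the partition that processed them.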
Block Reader Partitioning
[Figure: input files File1, File2, File3, File4, File5, …, Filen]
• Users can specify a minimum and maximum number of partitions
• New partitions are added on the fly as there are more blocks to read
• Partitions are released when there are no pending blocks
Throughput Partitioner
• Scale based on thresholds for processed events
• Demo: Location tracking, scale with number of active devices
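A threshold-based scaling decision can be sketched in a few lines. This is an illustration only, not the partitioner shipped with Apex: the class `ThroughputScaler`, the `Decision` enum, the one-partition step size and the min/max clamp are all assumptions made for the example.

```java
// Hypothetical sketch: decide whether to scale based on processed events per
// second, then clamp the new partition count to a configured range.
class ThroughputScaler {

    enum Decision { SCALE_UP, SCALE_DOWN, NONE }

    // Compare the observed throughput with the two thresholds.
    static Decision decide(long eventsPerSecond, long lowerThreshold, long upperThreshold) {
        if (eventsPerSecond > upperThreshold) {
            return Decision.SCALE_UP;
        }
        if (eventsPerSecond < lowerThreshold) {
            return Decision.SCALE_DOWN;
        }
        return Decision.NONE;
    }

    // Add or remove one partition (assumed step size), staying within [min, max].
    static int nextPartitionCount(int current, Decision decision, int min, int max) {
        int next = current;
        if (decision == Decision.SCALE_UP) {
            next = current + 1;
        } else if (decision == Decision.SCALE_DOWN) {
            next = current - 1;
        }
        return Math.max(min, Math.min(max, next));
    }

    public static void main(String[] args) {
        Decision d = decide(1500, 100, 1000);
        System.out.println(d + " -> " + nextPartitionCount(2, d, 1, 4));
    }
}
```

Keeping a gap between the lower and upper threshold avoids flapping between two partition counts when the load sits near a single cut-off.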
How dynamic partitioning works
• Partitioning decision (yes/no) by trigger (StatsListener)
ᵒ Pluggable component, can use any system or custom metric
ᵒ Externally driven partitioning example: KafkaInputOperator
• Stateful!
ᵒ Uses checkpointed state
ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
ᵒ Steps:
• Call partitioner
• Modify physical plan, rewrite checkpoints as needed
• Undeploy old partitions from execution layer
• Release/request container resources
• Deploy new partitions (from rewritten checkpoint)
ᵒ No loss of data (buffered)
ᵒ Incremental operation, partitions that don’t change continue processing
• API: Partitioner interface
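The state transfer step can be illustrated with a small sketch. It assumes each partition's checkpointed state is a map of per-key totals and that keys are reassigned by hash code modulo the new partition count; both are assumptions for the example, and `StateRepartitioner` and `repartition` are hypothetical names rather than Apex classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: move checkpointed per-key state from the old set of
// partitions to a new set, so that no state is lost when the count changes.
class StateRepartitioner {

    // oldStates holds one frozen key -> total map per old partition.
    static List<Map<String, Long>> repartition(List<Map<String, Long>> oldStates, int newCount) {
        if (newCount < 1) {
            throw new IllegalArgumentException("need at least one partition");
        }
        List<Map<String, Long>> newStates = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            newStates.add(new HashMap<>());
        }
        for (Map<String, Long> state : oldStates) {
            for (Map.Entry<String, Long> entry : state.entrySet()) {
                // The key's hash code decides which new partition owns it.
                int target = Math.floorMod(entry.getKey().hashCode(), newCount);
                // Merge in case two old partitions held state for the same key.
                newStates.get(target).merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return newStates;
    }

    public static void main(String[] args) {
        Map<String, Long> p0 = new HashMap<>();
        p0.put("a", 1L);
        p0.put("b", 2L);
        Map<String, Long> p1 = new HashMap<>();
        p1.put("c", 3L);
        System.out.println(repartition(Arrays.asList(p0, p1), 4));
    }
}
```

In this sketch the new maps play the role of the rewritten checkpoints from which the new partitions would be deployed.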
Writing custom partitioner
• Partitioner accepts set of partitions as of last checkpoint (frozen state) and returns set of new partitions
• Access to operator state
• Control partitioning for each input port
• Can implement any custom logic to derive new from old partitions
• StatsListener
ᵒ public Response processStats(BatchedOperatorStats stats)
ᵒ Throughput, latency, CPU, …
• Partitioner
ᵒ public Collection<Partition<T>> definePartitions(Collection<Partition<T>> partitions, PartitioningContext context)
• Examples: https://github.com/apache/incubator-apex-malhar/tree/master/library/src/main/java/com/datatorrent/lib/partitioner
How tuples are split between partitions
• Tuple hashcode and mask used to determine destination partition
ᵒ Mask picks the last n bits of the hashcode of the tuple
ᵒ hashcode method can be overridden
• StreamCodec can be used to specify custom hashcode for tuples
ᵒ Can also be used for specifying custom serialization
Example: tuple {Name, 24204842, San Jose}
Hashcode: 001010100010101
Mask: binary 11 (0x3), which selects the last 2 bits (01)
Partition key -> partition: 00 -> 1, 01 -> 2, 10 -> 3, 11 -> 4
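The mask arithmetic from the example can be checked with a short sketch. It assumes a power-of-two partition count so that the mask is simply the count minus one; `MaskRouter`, `maskFor` and `partitionKey` are hypothetical names, not Apex API.

```java
// Hypothetical sketch: pick the destination partition from the last n bits
// of a tuple's hash code by applying a bit mask.
class MaskRouter {

    // For a power-of-two partition count the mask is count - 1 (4 -> binary 11).
    static int maskFor(int partitionCount) {
        if (partitionCount < 1 || Integer.bitCount(partitionCount) != 1) {
            throw new IllegalArgumentException("partition count must be a power of two");
        }
        return partitionCount - 1;
    }

    // The masked hash code is the partition key.
    static int partitionKey(int hashCode, int mask) {
        return hashCode & mask;
    }

    public static void main(String[] args) {
        int hashCode = 0b001010100010101; // hash code from the example above
        int key = partitionKey(hashCode, maskFor(4));
        // Key 01 corresponds to the partition labeled 2 in the example.
        System.out.println(Integer.toBinaryString(key));
    }
}
```

Because only the low bits are used, a `hashCode` with poorly distributed low bits leads to skew, which is one reason to override it or to supply a custom StreamCodec.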
Custom splits
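• Custom distribution of tuples
ᵒ E.g. Broadcast
Example: tuple {Name, 24204842, San Jose}
Hashcode: 001010100010101
Mask: 0x00, so every masked hashcode is 00
Partition key -> partition: 00 -> 1, 00 -> 2, 00 -> 3, 00 -> 4 (every partition receives the tuple)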
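The broadcast case follows from the same mask test: if each partition is described by a mask and a set of accepted keys, a mask of 0 with key 0 matches every hash code. The sketch below models that idea in plain Java; `PartitionKeysSketch` and its methods are hypothetical and only loosely modeled on the mask-plus-keys concept, not copied from the Apex API.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch: a partition accepts a tuple when the masked hash code
// is one of its keys. Mask 0 with key 0 turns this into a broadcast.
class PartitionKeysSketch {
    final int mask;
    final Set<Integer> keys;

    PartitionKeysSketch(int mask, Set<Integer> keys) {
        this.mask = mask;
        this.keys = keys;
    }

    boolean accepts(int hashCode) {
        return keys.contains(hashCode & mask);
    }

    // Hash split: partition i owns key i under the mask (count - 1).
    // The count is assumed to be a power of two.
    static PartitionKeysSketch[] hashed(int count) {
        PartitionKeysSketch[] partitions = new PartitionKeysSketch[count];
        for (int i = 0; i < count; i++) {
            partitions[i] = new PartitionKeysSketch(count - 1, Collections.singleton(i));
        }
        return partitions;
    }

    // Broadcast: every partition uses mask 0 and key 0.
    static PartitionKeysSketch[] broadcast(int count) {
        PartitionKeysSketch[] partitions = new PartitionKeysSketch[count];
        for (int i = 0; i < count; i++) {
            partitions[i] = new PartitionKeysSketch(0, Collections.singleton(0));
        }
        return partitions;
    }

    // Number of partitions that would receive a tuple with this hash code.
    static int countAccepting(PartitionKeysSketch[] partitions, int hashCode) {
        int accepted = 0;
        for (PartitionKeysSketch partition : partitions) {
            if (partition.accepts(hashCode)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        int hashCode = 0b001010100010101;
        System.out.println(countAccepting(hashed(4), hashCode));
        System.out.println(countAccepting(broadcast(4), hashCode));
    }
}
```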
Resources
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
Q&A

Editor's Notes

  • #3:
• Partitioning & Scaling built-in
ᵒ Operators can be dynamically scaled: throughput, latency or any custom logic
ᵒ Streams can be split in flexible ways: tuple hashcode, tuple field or custom logic
ᵒ Parallel partitioning for parallel pipelines
ᵒ MxN partitioning for generic pipelines
ᵒ Unifier concept for merging results from partitions; helps in handling skew imbalance
• Advanced Windowing support
ᵒ Application window configurable per operator
ᵒ Sliding window and tumbling window support
ᵒ Checkpoint window control for fault recovery
ᵒ Windowing does not introduce artificial latency
• Stateful fault tolerance out of the box
ᵒ Operators recover automatically from a precise point before failure
ᵒ At least once, at most once, exactly once at window boundaries