Low-latency ingestion and analytics with
Apache Kafka and Apache Apex
Thomas Weise, Architect DataTorrent, PPMC member Apache Apex
March 28th 2016
Apache Apex Features
• In-memory Stream Processing
• Scale out, Distributed, Parallel, High Throughput
• Windowing (temporal boundary)
• Reliability, Fault Tolerance
• Operability
• YARN native
• Compute Locality
• Dynamic updates
Apex Platform Overview
Apache Apex Malhar Library
Apache Kafka
“A high-throughput distributed messaging system.”
“Fast, Scalable, Durable, Distributed”
Kafka is a natural fit to deliver events
into Apex for low-latency processing.
Kafka Integration - Consumer
• Low-latency, high throughput ingest
• Scales with Kafka topics
ᵒ Auto-partitioning
ᵒ Flexible and customizable partition mapping
• Fault tolerance (the Kafka 0.8 integration is based on SimpleConsumer)
ᵒ Metadata monitoring/failover to new broker
ᵒ Offset checkpointing
ᵒ Idempotency
ᵒ External offset storage
• Support for multiple clusters
ᵒ Built for better resource utilization
• Bandwidth control
ᵒ Bytes per second
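The offset checkpointing and idempotency bullets above can be illustrated with a minimal sketch. This is not the Malhar operator: the class `IdempotentReader`, the Python list standing in for a Kafka partition, and the per-window offset map are hypothetical names chosen for illustration. The idea shown is to record which offset range was emitted in each streaming window, so that a replay after recovery emits the same tuples in the same windows.

```python
class IdempotentReader:
    """Toy reader over one partition (a Python list stands in for Kafka)."""

    def __init__(self, log):
        self.log = log
        self.next_offset = 0      # would be checkpointed with operator state
        self.window_ranges = {}   # window id -> (start, end); assumed stored durably

    def emit_window(self, window_id, max_tuples):
        if window_id in self.window_ranges:
            # Replay: reuse the recorded range so the output is identical.
            start, end = self.window_ranges[window_id]
        else:
            start = self.next_offset
            end = min(start + max_tuples, len(self.log))
            self.window_ranges[window_id] = (start, end)
        self.next_offset = end
        return self.log[start:end]


log = ["e0", "e1", "e2", "e3", "e4"]
reader = IdempotentReader(log)
first = [reader.emit_window(w, n) for w, n in [(1, 2), (2, 1)]]

# Simulate recovery from a checkpoint taken before window 1.
reader.next_offset = 0
# More data could be read now, but replayed windows repeat exactly.
replayed = [reader.emit_window(w, n) for w, n in [(1, 5), (2, 5)]]
fresh = reader.emit_window(3, 5)   # a new window continues after the replay
```

Keeping the window-to-offset map outside the operator's own checkpoint is the design choice that matters here: the checkpoint alone only says where to resume, not how the replayed data was originally split into windows.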
Kafka Integration - Producer
• Output operator is a Kafka producer
• Exactly-once strategy
ᵒ On failure, data already sent to the message queue should not be re-sent
ᵒ Sends a monotonically increasing key along with the data
ᵒ On recovery, the operator asks the message queue for the last sent message
• Gets the recovery key from the message
ᵒ Ignores all replayed data with a key less than or equal to the recovered key
ᵒ If the key is not monotonically increasing, data can be sorted on the key at the end of the window and then sent to the message queue
• Implemented in the operator AbstractExactlyOnceKafkaOutputOperator in the apache/incubator-apex-malhar GitHub repository
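The exactly-once strategy above can be sketched in a few lines. This is an illustration under stated assumptions, not the code of AbstractExactlyOnceKafkaOutputOperator: `QueueStub` is a hypothetical in-memory stand-in for a Kafka topic, and `ExactlyOnceWriter` is a made-up name. Keys are assumed to be monotonically increasing integers.

```python
class QueueStub:
    """Stands in for a message queue: an append-only list of (key, value)."""

    def __init__(self):
        self.messages = []

    def send(self, key, value):
        self.messages.append((key, value))

    def last_key(self):
        return self.messages[-1][0] if self.messages else None


class ExactlyOnceWriter:
    def __init__(self, queue):
        self.queue = queue
        # On (re)start, ask the queue for the key of the last sent message.
        self.recovered_key = queue.last_key()

    def write(self, key, value):
        # Skip replayed tuples whose key was already sent before the failure.
        if self.recovered_key is not None and key <= self.recovered_key:
            return False
        self.queue.send(key, value)
        return True


queue = QueueStub()
writer = ExactlyOnceWriter(queue)
for k in (1, 2, 3):
    writer.write(k, f"v{k}")

# Failure: a new writer instance is created and data is replayed from key 2.
writer = ExactlyOnceWriter(queue)
results = [writer.write(k, f"v{k}") for k in (2, 3, 4, 5)]
sent_keys = [k for k, _ in queue.messages]
```

The replayed keys 2 and 3 are dropped and only 4 and 5 reach the queue, so each key appears once even though the input was delivered at least once.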
Apex Application Specification
Logical and Physical Plan
Partitioning
[Figure: NxM partitions with unifiers. Panels: logical DAG (operators 0, 1, 2, 3); physical diagram with operator 1 in 3 partitions followed by a unifier; physical DAG with (1a, 1b, 1c) and (2a, 2b), no bottleneck; physical DAG with (1a, 1b, 1c) and (2a, 2b), bottleneck on the intermediate unifier.]
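The relationship between partitions and a unifier can be sketched as follows. This is a toy model, not the Apex API: the round-robin split, the counting operator, and all function names are assumptions made for illustration. Each partition computes a partial result on its share of the stream, and the unifier merges the partial results into what a single unpartitioned operator would have produced.

```python
from collections import Counter


def partition(tuples, n):
    """Split a stream round-robin into n partitions."""
    parts = [[] for _ in range(n)]
    for i, item in enumerate(tuples):
        parts[i % n].append(item)
    return parts


def count_operator(part):
    """Each partition counts keys in its own share of the stream."""
    return Counter(key for key, _ in part)


def unifier(partials):
    """Merge the partial counts from all partitions."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


stream = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e"), (1, "f")]
partials = [count_operator(p) for p in partition(stream, 3)]
merged = unifier(partials)
unpartitioned = count_operator(stream)
```

Because a single unifier receives the output of every partition, it can become the bottleneck that the last panel of the figure describes.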
Advanced Partitioning
[Figure: Parallel partition. Panels: logical DAG (operators 0 to 4); physical DAG with partitions 1a and 1b merged by a unifier; physical DAG with parallel partition, (1a, 2a, 3a) and (1b, 2b, 3b), merged by a single unifier.]
[Figure: Cascading unifiers. Panels: logical plan (uopr, dopr); execution plan for N = 4, M = 1; execution plan for N = 4, M = 1, K = 2 with cascading unifiers. Operators and unifiers are shown in containers connected through NICs.]
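The cascading-unifier plan (N = 4, M = 1, K = 2) can be modelled as a merge tree in which no unifier takes more than K inputs. The sketch below is a hypothetical illustration; `cascade_unify` and its parameters are not Apex names, and the merge function is assumed to be associative.

```python
def cascade_unify(partials, k, merge):
    """Merge partial results with a fan-in of at most k per unifier.

    Returns the final value and the number of unifier levels used.
    """
    if k < 2:
        raise ValueError("fan-in k must be at least 2")
    if not partials:
        raise ValueError("at least one partial result is required")
    level, levels = list(partials), 0
    while len(level) > 1:
        merged_level = []
        for i in range(0, len(level), k):
            group = level[i:i + k]
            acc = group[0]
            for item in group[1:]:
                acc = merge(acc, item)
            merged_level.append(acc)   # output of one unifier
        level = merged_level
        levels += 1
    return level[0], levels


def add(a, b):
    return a + b


# N = 4 upstream partitions, fan-in K = 2: two levels of unifiers.
total, levels = cascade_unify([10, 20, 30, 40], k=2, merge=add)
# Same input with a single unifier that accepts all 4 inputs.
flat_total, flat_levels = cascade_unify([10, 20, 30, 40], k=4, merge=add)
```

Both plans produce the same result; the cascade trades an extra hop for a lower input load on each unifier.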
Dynamic Scaling
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
– Kafka operators scale according to number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaling or partitioning
[Figure: three physical plans of the same DAG with different partition counts for operators 2 and 3. Unifiers not shown.]
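Stats-based scaling with redistribution of state can be sketched as below. All names are hypothetical and the policy is an assumption for illustration: the partition count is derived from observed throughput, and keyed state is reassigned with a simple modulo mapping. Apex exposes its own API for custom partitioning, which this sketch does not reproduce.

```python
def assign(key, n):
    """Hypothetical mapping of integer keys to one of n partitions."""
    return key % n


def desired_partitions(events_per_sec, capacity_per_partition):
    """Pick a partition count from observed throughput (ceiling division)."""
    return max(1, -(-events_per_sec // capacity_per_partition))


def repartition(states, new_n):
    """Redistribute per-key state from the old partitions to new_n partitions."""
    new_states = [{} for _ in range(new_n)]
    for state in states:
        for key, value in state.items():
            new_states[assign(key, new_n)][key] = value
    return new_states


old = [{0: 5, 2: 7}, {1: 3, 3: 9}]   # state held by 2 partitions
n = desired_partitions(events_per_sec=3500, capacity_per_partition=1000)
new = repartition(old, n)
```

No state is lost in the move: every key ends up in exactly one of the new partitions, which is the property the re-distribution bullet depends on.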
Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
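Recovery from a checkpoint combined with replay from a buffer can be sketched as follows. This is a simplified model with assumed names (`SumOperator`, a dictionary as the upstream buffer); it only shows why restoring state and replaying the windows after the checkpoint yields the same result as an uninterrupted run.

```python
import copy


class SumOperator:
    """Stateful operator: keeps a running total across windows."""

    def __init__(self):
        self.total = 0

    def process(self, window):
        self.total += sum(window)


# Upstream buffer: window id -> tuples, retained so they can be replayed.
buffer = {1: [1, 2], 2: [3], 3: [4, 5], 4: [6]}

op = SumOperator()
checkpoint = None
for wid in (1, 2, 3):
    op.process(buffer[wid])
    if wid == 2:
        # Checkpoint after window 2: state snapshot plus the window id.
        checkpoint = (wid, copy.deepcopy(op))

# Failure during window 4: restore the snapshot, then replay every
# buffered window that came after the checkpoint.
last_wid, op = checkpoint[0], copy.deepcopy(checkpoint[1])
for wid in sorted(w for w in buffer if w > last_wid):
    op.process(buffer[wid])
recovered_total = op.total
expected_total = sum(sum(w) for w in buffer.values())
```

Only windows 3 and 4 are reprocessed, which is what makes the recovery incremental rather than a restart from the beginning of the stream.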
Streaming Windows
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
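The difference between tumbling and sliding windows can be shown with a short sketch. It assumes an application window is made of a whole number of consecutive streaming windows; the function names and the sample data are hypothetical.

```python
def tumbling(streaming_windows, size):
    """Group consecutive streaming windows into non-overlapping application windows."""
    return [streaming_windows[i:i + size]
            for i in range(0, len(streaming_windows), size)]


def sliding(streaming_windows, size, slide=1):
    """Overlapping application windows that advance by `slide` streaming windows."""
    return [streaming_windows[i:i + size]
            for i in range(0, len(streaming_windows) - size + 1, slide)]


# Each inner list holds the tuples of one streaming window.
sw = [[1], [2, 3], [4], [5, 6], [7], [8]]
tumbling_sums = [sum(sum(w) for w in group) for group in tumbling(sw, 2)]
sliding_sums = [sum(sum(w) for w in group) for group in sliding(sw, 3)]
```

In the tumbling case every tuple contributes to one result; in the sliding case a tuple contributes to every application window that overlaps its streaming window.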
Checkpointing Operator State
• Save state of operator so that it can be recovered on failure
• Pluggable storage handler
• Default implementation
ᵒ Serialization with Kryo
ᵒ All non-transient fields serialized
ᵒ Serialized state written to HDFS
ᵒ Writes asynchronous, non-blocking
• Possible to implement custom handlers for an alternative approach to extracting state or a different storage backend (such as an IMDG)
• Needed for operators that rely on previous state for computation
ᵒ Operators that do not can be marked @Stateless to skip checkpointing
• Checkpoint frequency tunable (by default 30s)
ᵒ Based on streaming windows for consistent state
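The checkpointing scheme above (serialize all non-transient fields, write through a pluggable storage handler) can be sketched as follows. The sketch swaps in JSON for Kryo and an in-memory dictionary for HDFS; `InMemoryStorage`, `CountingOperator` and the `TRANSIENT` tuple are hypothetical names, not part of Apex.

```python
import json


class InMemoryStorage:
    """Hypothetical storage handler; the default described above writes to HDFS."""

    def __init__(self):
        self.blobs = {}

    def save(self, operator_id, window_id, data):
        self.blobs[(operator_id, window_id)] = data

    def load(self, operator_id, window_id):
        return self.blobs[(operator_id, window_id)]


class CountingOperator:
    TRANSIENT = ("connection",)   # fields excluded from the checkpoint

    def __init__(self):
        self.count = 0
        self.connection = object()   # stands in for a non-serializable resource

    def snapshot(self):
        # Serialize every field except the ones marked transient.
        state = {k: v for k, v in vars(self).items() if k not in self.TRANSIENT}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def restore(cls, data):
        op = cls()
        op.__dict__.update(json.loads(data.decode("utf-8")))
        op.connection = None   # to be re-established after recovery
        return op


storage = InMemoryStorage()
op = CountingOperator()
op.count = 42
storage.save("op1", 7, op.snapshot())   # checkpoint at window 7
restored = CountingOperator.restore(storage.load("op1", 7))
```

Keying the saved state by window id mirrors the last bullet: a checkpoint taken at a window boundary gives a consistent point to restart from.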
Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
ᵒ No messages lost
ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
ᵒ Transactions with meta information, Rewinding output, Feedback from
external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
ᵒ Useful in use cases where some data loss is acceptable and latest data is
sufficient
Exactly-once
ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to
achieve end-to-end exactly once behavior
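How at-least-once replay plus an idempotent write gives exactly-once results in a store can be sketched with the "transactions with meta information" approach named above. The `Store` class is a hypothetical stand-in for an external database; it records the last committed window id together with the data.

```python
class Store:
    """Hypothetical external store that keeps the last committed window id
    alongside the data."""

    def __init__(self):
        self.rows = []
        self.committed_window = 0

    def write_window(self, window_id, rows):
        if window_id <= self.committed_window:
            return   # already written before the failure; ignore the replay
        # In a real store these two updates would share one transaction.
        self.rows.extend(rows)
        self.committed_window = window_id


windows = {1: ["a"], 2: ["b", "c"], 3: ["d"]}
store = Store()
for wid in (1, 2):
    store.write_window(wid, windows[wid])

# At-least-once recovery: windows are replayed from an earlier checkpoint.
for wid in (1, 2, 3):
    store.write_window(wid, windows[wid])
```

The replay delivers windows 1 and 2 a second time, but the store ends up with each row once.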
Idempotency with Kafka Consumer
Use Case – Ad Tech
Customer:
• Leading digital automation software company for publishers
• Helps publishers monetize their digital assets
• Enables publishers to make smarter inventory decisions and improve revenue
Features:
• Reporting of critical metrics from auctions and client logs
• Revenue, impression, and click information
• Aggregate counters and reporting on top N metrics
• Low latency querying using pub-sub model
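The aggregate counters and top N reporting listed above can be illustrated with a small sketch. The event fields (`publisher`, `country`) and the sample values are made up for illustration; the slides do not describe the actual schema.

```python
from collections import defaultdict
import heapq


def aggregate(events, dimensions):
    """Sum revenue, impressions and clicks per combination of dimension values."""
    totals = defaultdict(lambda: {"revenue": 0.0, "impressions": 0, "clicks": 0})
    for e in events:
        key = tuple(e[d] for d in dimensions)
        for metric in ("revenue", "impressions", "clicks"):
            totals[key][metric] += e[metric]
    return dict(totals)


def top_n(totals, metric, n):
    """Return the n dimension keys with the largest value of `metric`."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1][metric])


events = [
    {"publisher": "p1", "country": "US", "revenue": 0.5, "impressions": 1, "clicks": 0},
    {"publisher": "p2", "country": "US", "revenue": 1.5, "impressions": 1, "clicks": 1},
    {"publisher": "p1", "country": "DE", "revenue": 0.25, "impressions": 1, "clicks": 0},
    {"publisher": "p1", "country": "US", "revenue": 0.25, "impressions": 1, "clicks": 1},
]
by_publisher = aggregate(events, ["publisher"])
best = top_n(by_publisher, "revenue", 1)
```

The same events can be aggregated over a different dimension combination, for example `["publisher", "country"]`, without changing the function.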
Use Case – Ad Tech
[Figure: Ad Tech pipeline. Labels: user browser, AdServer, REST proxies, CDN (caching of logs), Kafka cluster carrying auction logs and client logs; Kafka Input (auction logs) and Kafka Input (client logs) emitting Kafka messages; ETL (decompress and flatten); Filter (filtered events); Dimensions Aggregator (aggregates); Dimensions Store; Query and Query Result operators exchanging queries and results with the middleware through a Kafka cluster.]
Use Case – Ad Tech
• 15+ billion impressions per day
• Average data inflow of 200K events/sec
• 64 Kafka Input operators reading from 6 geographically distributed DCs
• 32 instances of in-memory distributed store
• 64 aggregators
• ~150 container processes, 30+ nodes
• 1.2 TB memory footprint @ peak load
Resources
• Exactly-once processing: https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
• Examples with Kafka and Files: https://github.com/tweise/apex-samples/tree/master/exactly-once
• Learn more: http://apex.incubator.apache.org/docs.html
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Apex website - http://apex.incubator.apache.org/
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - http://www.meetup.com/topics/apache-apex
Q&A


Editor's Notes

  • #3:
    • Partitioning & Scaling built-in
      ᵒ Operators can be dynamically scaled
        – Throughput, latency or any custom logic
      ᵒ Streams can be split in flexible ways
        – Tuple hashcode, tuple field or custom logic
      ᵒ Parallel partitioning for parallel pipelines
      ᵒ MxN partitioning for generic pipelines
      ᵒ Unifier concept for merging results from partitions
      ᵒ Helps in handling skew imbalance
    • Advanced Windowing support
      ᵒ Application window configurable per operator
      ᵒ Sliding window and tumbling window support
      ᵒ Checkpoint window control for fault recovery
      ᵒ Windowing does not introduce artificial latency
    • Stateful fault tolerance out of the box
      ᵒ Operators recover automatically from a precise point before failure
      ᵒ At least once
      ᵒ At most once
      ᵒ Exactly once at window boundaries