Extending the Yahoo Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup
May 4th, 2016
Sandesh Hegde
sandesh@apache.org
Background
• Yahoo created a benchmark for stream processing systems and used it to compare Storm, Flink and Spark Streaming [1]
• dataArtisans extended the benchmark by comparing Flink and Storm under different scenarios [2]
• No stream processing benchmark comparison is complete without Apache Apex.
Yahoo Streaming Benchmark
Simple advertisement application: count how many times each ad campaign has been seen in a window.
• Read ads from Kafka
• Deserialize JSON string
• Filter unnecessary ads
• Projection of Fields ( remove non-essential fields )
• Join ad id with campaign id from Redis
• Windowed count per campaign and output to Redis
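The steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the Apex or Yahoo implementation: the in-memory dict standing in for Redis, the "view" filter predicate, the 10-second window, and the name `process` are all assumptions made for the example.

```python
import json
from collections import Counter

# Hypothetical stand-in for the Redis ad_id -> campaign_id lookup.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-B"}
WINDOW_MS = 10_000      # assumed window length; the slides do not state one
WANTED_EVENT = "view"   # assumed filter predicate


def process(raw_messages, ad_to_campaign=AD_TO_CAMPAIGN, window_ms=WINDOW_MS):
    """Return {(campaign_id, window_start_ms): count} for the given JSON strings."""
    counts = Counter()
    for raw in raw_messages:
        event = json.loads(raw)                      # deserialize JSON string
        if event["event_type"] != WANTED_EVENT:      # filter unnecessary ads
            continue
        ad_id = event["ad_id"]                       # projection: keep only
        event_time = int(event["event_time"])        # the fields still needed
        campaign = ad_to_campaign.get(ad_id)         # join ad id with campaign id
        if campaign is None:
            continue                                 # unknown ad: drop it
        window_start = event_time - event_time % window_ms
        counts[(campaign, window_start)] += 1        # windowed count per campaign
    return dict(counts)
```

In the real application each step is a separate operator, so the steps can be partitioned and placed independently; the single loop here only shows what each step does to an event.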
Application - with Kafka
[Diagram: Kafka -> Kafka Input -> Deserialize -> Filter -> Filter Fields -> Redis Join -> Redis Output]
Setup
• Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
• 10GigE between compute nodes
• 4 Kafka brokers (2 partitions each, 1 replica)
• Kafka version: 0.8.2
• Apex (3.4-SNAPSHOT and 3.3) and Flink (1.0.2)
• YARN container size: 16 GB
• 1 ZooKeeper
• Message size: 218 bytes
• Sample message: {"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c","page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":"600589859","ad_type":"banner78","event_type":"purchase","event_time":"1462374087774","ip_address":"1.2.3.4"}
Apex Application
Physical Plan
Quick Primer on Locality
• CONTAINER_LOCAL
■ Deployed in the same process, different threads
■ No serialization
■ Queue between the operators
• THREAD_LOCAL
■ Same thread
■ No serialization
■ Use it only when operators do light work
Note: Anti-affinity, a new feature, is not covered here.
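The difference between the two localities can be illustrated with plain Python threads. This is an analogy only and does not use the Apex API: the operator functions `double` and `add_one` and the two `run_*` helpers are hypothetical names. In both variants the tuples are passed by reference, so nothing is serialized.

```python
import queue
import threading


def double(x):       # hypothetical upstream operator
    return x * 2


def add_one(x):      # hypothetical downstream operator
    return x + 1


def run_thread_local(items):
    """THREAD_LOCAL analogy: both operators run in the same thread,
    so the downstream operator is just a direct call."""
    return [add_one(double(x)) for x in items]


def run_container_local(items):
    """CONTAINER_LOCAL analogy: same process, one thread per operator,
    with an in-memory queue between them."""
    q = queue.Queue()
    done = object()          # sentinel marking end of stream
    out = []

    def upstream():
        for x in items:
            q.put(double(x))
        q.put(done)

    def downstream():
        while True:
            item = q.get()
            if item is done:
                break
            out.append(add_one(item))

    threads = [threading.Thread(target=upstream),
               threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The direct call avoids the queue hand-off but makes the two operators share one thread, which is why the slide suggests it only for operators that do light work.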
Benchmarking Against Previous Releases
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
Part of Release Certification
Application : with Kafka
https://github.com/sandeshh/streaming-benchmarks
Application - With Generator
[Diagram: the operators of the Kafka version (Kafka, Kafka Input, Deserialize, Filter, Filter Fields, Redis Join, Redis Output) with a Generator operator added]
Application - With Generator
https://github.com/sandeshh/streaming-benchmarks
Setup: Single Partition
State of the Art & Streaming
[Diagram: Generator -> Filter -> Filter Fields -> Redis Join -> Redis Output]
What is our recommendation for querying the state?
An in-memory key-value store in the operators?
Application - State Store & Query
[Diagram: Generator -> Filter -> Filter Fields -> Redis Join -> Dimensional Computation -> Store (HDHT); Query and Result connect to the store]
1. Durable state (HDHT is a key-value store native to Hadoop) [4]
2. Single system, scales with your application
3. Easy integration with external consoles [7]
4. Low operational cost
5. Complex dimensional computation [5][6]
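As a rough illustration of the dimensional computation, store, and query stages, the sketch below pre-aggregates event counts for every combination of the chosen dimensions per time bucket and answers point queries from a key-value store. It is not HDHT and not the Apex dimensions operator: a plain in-memory dict stands in for the durable store, and the names `dimensional_counts` and `query` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dimensional_counts(events, dimensions, bucket_ms):
    """Count events per time bucket for every non-empty combination
    of the given dimension names."""
    store = defaultdict(int)
    for e in events:
        bucket = e["event_time"] - e["event_time"] % bucket_ms
        for r in range(1, len(dimensions) + 1):
            for combo in combinations(dimensions, r):
                # The key is the time bucket plus the (name, value) pairs.
                key = (bucket, frozenset((d, e[d]) for d in combo))
                store[key] += 1
    return dict(store)


def query(store, bucket, **dims):
    """Look up the pre-aggregated count for one bucket and dimension values."""
    return store.get((bucket, frozenset(dims.items())), 0)
```

For example, after `store = dimensional_counts(events, ["campaign", "ad_type"], 60_000)`, a call such as `query(store, 0, campaign="A")` reads one precomputed value instead of rescanning the events. The durability listed in item 1 is what a plain dict lacks.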
Demo
Q&A
References
1. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
3. https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
4. https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/
5. https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/
6. https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2-implementation/
7. http://docs.datatorrent.com/app_data_framework/
© 2016 DataTorrent
Resources
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
