Streaming Data Analytics
Gizem Akman | Software Infrastructure
Nov. 21, 2017
Between 2017 and 2022, Gartner estimates that the market for event
stream processing (ESP) platforms will grow 15% year over year
(compound annual growth rate).
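As a rough worked example of what that rate implies, assuming five full years of compounding between 2017 and 2022 (the slide does not spell out the compounding periods), with M denoting market size:

```latex
\[
\frac{M_{2022}}{M_{2017}} = (1 + 0.15)^{5} \approx 2.01
\]
```

Under that assumption, the market would roughly double over the period.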
Streaming Analytics – What?
Analytics
Real Time
«Data in motion»
Batch
«Data at rest»
Stream Analytics
Streaming Analytics – What?
Software that can filter, aggregate, enrich, and analyze a high
throughput of data from multiple, disparate live data sources and in
any data format to identify simple and complex patterns to provide
applications with context to detect opportune situations, automate
immediate actions, and dynamically adapt.
What is Real Time?
Paradigms – Processing Types
• Atomic
• Micro Batching
• Windowing
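The difference between micro batching and windowing can be sketched in a few lines of Python. This is only an illustration, not how Storm or Spark Streaming implement it: the `Event` shape, the function and class names, and the 3-sigma outlier rule are all assumptions made for the example. Atomic processing would simply call a handler once per event.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, List, Tuple

Event = Tuple[int, float]  # (timestamp in ms, value): a hypothetical event shape


def micro_batches(events: Iterable[Event], interval_ms: int) -> List[List[Event]]:
    """Micro batching: group time-ordered events into consecutive fixed-length slices."""
    batches: List[List[Event]] = []
    current: List[Event] = []
    slice_end = None
    for ts, value in events:
        if slice_end is None:
            slice_end = ts + interval_ms  # the first slice starts at the first event
        while ts >= slice_end:
            if current:  # close the finished slice; empty slices are skipped
                batches.append(current)
                current = []
            slice_end += interval_ms
        current.append((ts, value))
    if current:
        batches.append(current)
    return batches


class SlidingWindow:
    """Windowing: keep only the events from the last `span_ms` milliseconds."""

    def __init__(self, span_ms: int) -> None:
        self.span_ms = span_ms
        self.items: Deque[Event] = deque()

    def add(self, ts: int, value: float) -> None:
        self.items.append((ts, value))
        # Evict events that fell out of the window (ts - span_ms, ts].
        while self.items and self.items[0][0] <= ts - self.span_ms:
            self.items.popleft()

    def values(self) -> List[float]:
        return [v for _, v in self.items]

    def is_outlier(self, value: float, k: float = 3.0) -> bool:
        """True if `value` lies more than k standard deviations from the window mean."""
        vals = self.values()
        if len(vals) < 2:
            return False  # not enough history to judge
        sigma = pstdev(vals)
        if sigma == 0:
            return value != vals[0]
        return abs(value - mean(vals)) > k * sigma
```

A caller would feed each arriving event to `SlidingWindow.add` and check new readings with `is_outlier` before adding them, while `micro_batches` shows how the same stream would be cut into slices for batch-style processing.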
Paradigms – Data Handling Guarantees
• At Most Once
• At Least Once
• Exactly Once
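The editor's notes for this slide say that exactly-once handling typically relies on idempotent updates. A minimal sketch of that idea, assuming every message carries a unique ID (the `Message` shape and the `IdempotentLedger` class are hypothetical, not taken from any platform): the source redelivers freely, as under at-least-once, and the consumer ignores IDs it has already applied.

```python
from typing import Dict, Set, Tuple

Message = Tuple[str, str, int]  # (message_id, account, amount): hypothetical shape


class IdempotentLedger:
    """Applies each message at most once, even if it is delivered several times."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self.applied: Set[str] = set()  # IDs of messages already applied

    def apply(self, message: Message) -> bool:
        message_id, account, amount = message
        if message_id in self.applied:
            return False  # duplicate delivery: ignore it
        self.balances[account] = self.balances.get(account, 0) + amount
        self.applied.add(message_id)
        return True
```

In a real system the set of applied IDs would have to be stored durably together with the state it protects; here both live in memory for brevity.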
Requirements & Characteristics
• Data must be real time (current)
• High-volume, high-velocity data
• «Perishable data»
• Analytic logic must be predefined
• Ultra-high-performance messaging
• Unbounded: execution never stops
Streaming Analytics – How?
• Analytic logic must be predefined
• In-memory processing
• Parallelism (scale out)
• Faster chips or GPUs
• Efficient algorithms (e.g., minimizing context switches)
• Innovative data architectures (e.g., hashing)
• Compromising on flexibility (such as limiting random data access)
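One way hashing supports scale-out can be sketched as follows: events are routed to workers by a stable hash of their key, so all events for a given key land on the same worker and per-key state stays local. The function names and the choice of CRC32 are assumptions for illustration only.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_workers: int) -> int:
    """Map a key to a worker index with a stable hash (CRC32, chosen for the example)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def route(events: Iterable[Tuple[str, int]], num_workers: int) -> Dict[int, List[Tuple[str, int]]]:
    """Group (key, value) events by the worker that owns each key."""
    routed: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, value in events:
        routed[partition_for(key, num_workers)].append((key, value))
    return dict(routed)
```

Python's built-in `hash()` is avoided here because it is randomized per process for strings, which would send the same key to different workers on different machines.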
Zero-Copy
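As a small illustration of the zero-copy idea, Python's standard `socket.sendfile()` hands the file-to-socket transfer to the operating system where `os.sendfile` is available and falls back to ordinary `send()` calls otherwise. The helper names and the socket-pair demo below are assumptions for the sketch; this is not Kafka's code.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket; returns the bytes sent."""
    with open(path, "rb") as f:  # sendfile() requires a file opened in binary mode
        return sock.sendfile(f)


def demo(payload: bytes = b"log-segment " * 100) -> bytes:
    """Write `payload` to a temp file, send it through a socket pair, return what arrived."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(path, sender)
        sender.close()  # signal end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        data = b"".join(chunks)
        assert sent == len(data)
        return data
    finally:
        sender.close()  # closing an already closed socket is harmless
        receiver.close()
        os.unlink(path)
```

The payload is kept small so that it fits in the socket buffer without a concurrent reader; a real transfer would have a peer consuming data on the other end.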
Popular Platforms
Open Source
Apache Storm
Apache Flink
Spark Streaming
Apache Samza
Vendors
IBM Streams
Software AG – Apama
Streaming Analytics
Azure Stream Analytics
SAP Event Stream Processor
Oracle Stream Analytics
SAS Event Stream Processing
Cisco Streaming Analytics
Amazon Kinesis
Google Cloud Dataflow
TIBCO Event Analytics
Informatica
Striim
DataTorrent
StreamAnalytix
SQLStream Blaze
Data Artisans
Impetus Technologies
EsperTech
Stream Analytics
Basics
• Free and open source distributed real-time computation engine: «Hadoop for real time»
• Fast: over 1M tuples processed per second per node
Architecture
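The editor's notes describe a Storm topology as spouts (input streams) wired to bolts (processing and output modules). The toy sketch below imitates that wiring in a single Python process with generators, as a word count. It does not use the Storm API, and all function names are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: brings data into the topology, one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(tuples: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(tuples: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count per word and emits the updated count."""
    counts: Counter = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> split bolt -> count bolt and collect the latest count per word."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

Unlike a real topology, which runs on a cluster until it is stopped, this sketch ends when its finite input is exhausted.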
Data Model
Groupings
Data Processing Guarantee
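The at-least-once mechanism behind this slide (each tuple leaves the spout with a message ID that must be acked or failed, and failed tuples are replayed) can be sketched as below. The method names echo the concepts rather than Storm's actual classes, and the single-threaded `run` loop is an assumption made to keep the example small.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Optional, Tuple


class ReplayingSpout:
    """Tracks emitted tuples by message ID and re-queues the ones that fail."""

    def __init__(self, items: Iterable[str]) -> None:
        self.queue: Deque[Tuple[int, str]] = deque(enumerate(items))
        self.pending: Dict[int, str] = {}  # emitted but not yet acked or failed

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id: int) -> None:
        self.pending.pop(msg_id, None)  # fully processed: forget it

    def fail(self, msg_id: int) -> None:
        payload = self.pending.pop(msg_id)
        self.queue.append((msg_id, payload))  # schedule a replay


def run(spout: ReplayingSpout, bolt: Callable[[str], None]) -> List[str]:
    """Drive the spout until nothing is left, acking successes and failing errors."""
    processed: List[str] = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, payload = emitted
        try:
            bolt(payload)
        except Exception:
            spout.fail(msg_id)
        else:
            processed.append(payload)
            spout.ack(msg_id)
    return processed
```

A bolt that always fails on some payload would make this loop replay forever, so a practical version would cap retries or divert such tuples elsewhere.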
Fault Tolerance
«fail-fast, auto restart»
• When a worker dies: the supervisor will restart it. If it continuously fails on
startup and is unable to heartbeat to Nimbus, Nimbus will reschedule the
worker.
• When a node dies: The tasks assigned to that machine will time-out and
Nimbus will reassign those tasks to other machines.
• When Nimbus dies: Nimbus is fail-fast (the process self-destructs
whenever any unexpected situation is encountered) and stateless (all state
is kept in ZooKeeper or on disk), so it can be restarted as if nothing happened.
• If you lose the Nimbus node, the workers will still continue to function.
Additionally, supervisors will continue to restart workers if they die.
However, without Nimbus, workers won't be reassigned to other machines
when necessary (like if you lose a worker machine).
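The "node dies" case above can be illustrated with a toy scheduler: tasks on nodes whose last heartbeat is older than a timeout are moved to the least-loaded live node. This is a sketch of the general idea only. The function, its parameters, and the least-loaded policy are assumptions, not Nimbus's actual scheduling logic.

```python
from typing import Dict


def reassign_tasks(assignments: Dict[str, str], last_heartbeat: Dict[str, int],
                   now: int, timeout: int) -> Dict[str, str]:
    """Return a new task -> node mapping with tasks moved off timed-out nodes."""
    live = [node for node, seen in last_heartbeat.items() if now - seen <= timeout]
    if not live:
        raise RuntimeError("no live nodes available")
    # Current load of each live node, counting only tasks that stay where they are.
    load = {node: 0 for node in live}
    for node in assignments.values():
        if node in load:
            load[node] += 1
    result: Dict[str, str] = {}
    for task, node in sorted(assignments.items()):
        if node in load:
            result[task] = node  # node is alive: keep the assignment
        else:
            target = min(live, key=lambda n: (load[n], n))  # least-loaded live node
            load[target] += 1
            result[task] = target
    return result
```

For example, with nodes A and B alive and C silent for longer than the timeout, a task that ran on C moves to whichever of A or B currently has fewer tasks.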
Parallelism
Integration
With External Systems and Other Libraries
• Apache Kafka
• Apache HBase
• Apache HDFS
• Apache Hive
• Apache Solr
• Apache Cassandra
• JDBC
• JMS
• Redis
• Event Hubs
• Elasticsearch
• MQTT
• MongoDB
• OpenTSDB
• Kinesis
• Druid
• Kestrel
With Containers and Resource Management Systems
• YARN
• Mesos
• Docker
• Kubernetes
And many others: http://storm.apache.org/Powered-By.html

Editor's Notes

  • #6: Visualize in real time; detect urgent situations; automate immediate actions. Normal programming: code execution controls the data. Streaming applications: incoming data controls the code.
  • #7: «loosely real time»
  • #8: pipeline
  • #10: Atomic (aka one-key-value-at-a-time): Processes each inbound data event as a separate element. This seems logical, but it is the most computationally expensive design. It is used, for example, to guarantee the fastest processing of individual events with the least delay in transmitting the event to the subscriber. It is often seen for customer transactional inputs, so that if some element of the event block fails, the entire block is not deleted but moved to a bad-record file that can be processed further later. Apache Storm uses this paradigm.
    Micro batching: These are batches of events, but typically only those that occur within a few milliseconds of each other. The time window is adjustable. This makes the process somewhat more efficient. Spark Streaming uses this paradigm.
    Windowing: Similar to batching, windowing allows micro batches that are simple time-based batches, but it also allows more sophisticated interpretations such as sliding windows (e.g., everything that occurred in the last X period of time). This can be very useful for aggregating events or for detecting outliers by comparison with averages or standard deviations.
  • #11: At most once: data loss is possible. At least once: no data loss; redelivery attempts are made, so duplication is possible. Exactly once: typically requires the use of idempotent updates.
  • #12: Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity. Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records. Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.
  • #13: Most of these can be hidden within modern analytics products, so the user does not have to be aware of exactly how they are being used. Analytic logic must be predefined: it must not depend on earlier data; the data passes through a pipeline of independently predefined rules. Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general-purpose processing with more randomized data access (such as databases). Sacrificing some flexibility in the model allows easier, faster and more efficient execution. Depending on the context, processor design may be tuned for maximum efficiency or trade some efficiency for flexibility.
  • #14: Zero-copy. Why is Kafka fast? **There are no extra context switches in between.** Zero copy: the application asks the OS kernel to move the data directly (for example, from the file system cache to the network socket) instead of copying it through application-level buffers. Batching data in chunks: Kafka batches data into chunks, which minimizes cross-machine latency and the buffering and copying that accompany it. Avoiding random disk access: because Kafka is an immutable commit log, it does not need to seek around the disk with many random I/O operations and can access the disk sequentially. This lets it get speeds from a physical disk that are similar to memory. Horizontal scaling: the ability to spread thousands of partitions for a single topic among thousands of machines means Kafka can handle huge loads.
  • #17: Storm is designed for massive scalability, supports fault tolerance with a “fail fast, auto restart” approach to processes, and offers a strong guarantee that every tuple will be processed. Storm defaults to an “at least once” guarantee for messages, but offers the ability to implement “exactly once” processing as well.
    Storm is written primarily in Clojure and is designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters, and the Storm scheduler distributes work to nodes around the cluster based on the topology configuration. You can think of a topology as roughly analogous to a MapReduce job in Hadoop, except that, given Storm’s focus on real-time, stream-based processing, topologies default to running forever or until manually terminated.
    Once a topology is started, the spouts bring data into the system and hand the data off to bolts (which may in turn hand data to subsequent bolts), where the main computational work is done. As processing progresses, one or more bolts may write data out to a database or file system, send a message to another external system, or otherwise make the results of the computation available to users.
    Hadoop: store and query. Store the data first, and indefinitely; analyze when you like. Storm: analyze while the data is being produced. No need to store anything at all.
  • #21: Each tuple leaves the spout with a unique message ID, and that message ID must be either acked or failed. Tuples that are not acked are considered failed and are re-emitted from the spout. Every bolt must ack the tuples it processes.
  • #27: Storm: cbdevapp02
    Redis: nsidevapp01
    redis/redis-4.0.2/src/redis-cli psubscribe WordCountTopology
    redis/redis-4.0.2/src/redis-server
    /WebSphere85/AppServer/java_1.8_64/bin/java -jar storm/Storm.jar 1>/dev/null 2>&1