Stream Analytics

Streaming Data Analytics
Gizem Akman | Software Infrastructure
Nov. 21, 2017

Between 2017 and 2022, Gartner estimates that the market for event
stream processing (ESP) platforms will grow 15% year over year
(compound annual growth rate).

Streaming Analytics – What?
Analytics
• Real time: «Data in motion»
• Batch: «Data at rest»

Streaming Analytics – What?
Software that can filter, aggregate, enrich, and analyze a high
throughput of data from multiple, disparate live data sources and in
any data format to identify simple and complex patterns to provide
applications with context to detect opportune situations, automate
immediate actions, and dynamically adapt.
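As a rough illustration of the filter / enrich / aggregate / detect chain in the definition above, here is a minimal Python sketch built from generators. The event shape, the `SENSOR_LOCATIONS` lookup table, and the threshold are all hypothetical and not taken from any platform named in these slides.

```python
from collections import defaultdict

# Hypothetical reference data used to enrich events (an assumption for this sketch).
SENSOR_LOCATIONS = {"s1": "hall", "s2": "roof"}


def filter_valid(events):
    """Filter: drop events that carry no reading."""
    for event in events:
        if event.get("value") is not None:
            yield event


def enrich(events):
    """Enrich: attach a location looked up from reference data."""
    for event in events:
        location = SENSOR_LOCATIONS.get(event["sensor"], "unknown")
        yield {**event, "location": location}


def running_average(events):
    """Aggregate: emit the running mean per location after each event."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        loc = event["location"]
        totals[loc] += event["value"]
        counts[loc] += 1
        yield {"location": loc, "mean": totals[loc] / counts[loc]}


def detect_high(averages, threshold):
    """Detect: pass on only the averages that exceed the threshold."""
    for avg in averages:
        if avg["mean"] > threshold:
            yield avg


def pipeline(events, threshold=30.0):
    """Chain the stages; each event flows through as soon as it arrives."""
    return detect_high(running_average(enrich(filter_valid(events))), threshold)
```

Because every stage is a generator, nothing is buffered: an alert can be emitted as soon as the event that triggers it has been read, which is the "data in motion" idea in miniature.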

Paradigms – Processing Types
• Atomic
• Micro Batching
• Windowing
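The three processing types can be contrasted with a small Python sketch. It assumes events are `(timestamp, value)` tuples arriving in timestamp order, and it uses a tumbling (fixed, non-overlapping) window as one possible form of windowing; the function names are illustrative only.

```python
from itertools import islice


def total(batch):
    """Example handler: sum the values of a group of (timestamp, value) events."""
    return sum(value for _, value in batch)


def atomic(events, handle):
    """Atomic: handle each event individually as it arrives."""
    for event in events:
        yield handle([event])


def micro_batches(events, batch_size, handle):
    """Micro-batching: collect a fixed number of events, then handle them together."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield handle(batch)


def tumbling_windows(events, width, handle):
    """Windowing: group events into fixed time windows of the given width.

    Assumes events arrive in timestamp order.
    """
    current, bucket = None, []
    for timestamp, value in events:
        window = timestamp // width
        if current is not None and window != current:
            yield handle(bucket)  # the previous window is complete
            bucket = []
        current = window
        bucket.append((timestamp, value))
    if bucket:
        yield handle(bucket)  # flush the last, possibly partial, window
```

With the events `[(0, 1), (1, 2), (5, 3), (6, 4), (12, 5)]`, atomic processing yields one result per event, micro-batching with `batch_size=2` groups by count, and a window of width 10 groups by time regardless of how many events fall into it.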

Paradigms – Data Handling Guarantees
• At Most Once
• At Least Once
• Exactly Once
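The difference between the three guarantees shows up when a message or its acknowledgement is lost. The toy simulation below is a sketch under simple assumptions: `lost_sends` and `lost_acks` name the messages whose first send or first ack is dropped, and "exactly once" is approximated as at-least-once delivery plus receiver-side de-duplication. Real platforms implement this differently.

```python
def transmit(messages, lost_sends, lost_acks, retry):
    """Simulate a lossy channel; return what the receiver sees, in order.

    Only the first attempt for a message can be lost (an assumption of this sketch).
    """
    received = []
    for msg in messages:
        attempt = 0
        while True:
            attempt += 1
            first = attempt == 1
            arrived = not (first and msg in lost_sends)
            if arrived:
                received.append(msg)
            acked = arrived and not (first and msg in lost_acks)
            if acked or not retry:
                break  # either confirmed, or we never retry
    return received


def at_most_once(messages, lost_sends, lost_acks):
    """Send once, never retry: messages may be lost but never duplicated."""
    return transmit(messages, lost_sends, lost_acks, retry=False)


def at_least_once(messages, lost_sends, lost_acks):
    """Retry until acknowledged: nothing is lost but duplicates are possible."""
    return transmit(messages, lost_sends, lost_acks, retry=True)


def exactly_once(messages, lost_sends, lost_acks):
    """At-least-once delivery plus de-duplication at the receiver."""
    seen, result = set(), []
    for msg in at_least_once(messages, lost_sends, lost_acks):
        if msg not in seen:
            seen.add(msg)
            result.append(msg)
    return result
```

The de-duplication step assumes each message carries a unique identity; here the message itself plays that role.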

Requirements & Characteristics
• Data must be real time (current)
• High volume & High Velocity Data
• «Perishable Data»
• Analytic logic must be predefined
• Ultra-High performance messaging
• Unbounded - Execution never stops

Streaming Analytics – How?
• Analytic logic must be predefined
• In-memory processing
• Parallel – scale out
• Faster chips or GPUs
• Efficient algorithms (e.g., minimizing context switches)
• Innovative data architectures (e.g., hashing)
• Compromise on flexibility (such as limiting random data access)
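One common way to combine in-memory state, scale-out, and hashing is to partition the stream by key, so that every event for a given key lands on the same worker. The sketch below assumes CRC32 as the stable hash and uses in-process `Counter` objects to stand in for workers; both choices are illustrative, not a description of any specific platform.

```python
import zlib
from collections import Counter


def partition_for(key, num_workers):
    """Map a key to a worker index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def count_by_key(events, num_workers):
    """Route each event to one worker and keep per-key counts in memory there."""
    workers = [Counter() for _ in range(num_workers)]
    for key in events:
        workers[partition_for(key, num_workers)][key] += 1
    return workers
```

Because a key is always routed to the same worker, each worker can aggregate its keys locally without coordinating with the others; merging the per-worker counters gives the global result.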

Popular Platforms
Open Source
• Apache Storm
• Apache Flink
• Spark Streaming
• Apache Samza
Vendors
• IBM Streams
• Software AG – Apama Streaming Analytics
• Azure Stream Analytics
• SAP Event Stream Processor
• Oracle Stream Analytics
• SAS Event Stream Processing
• Cisco Streaming Analytics
• Amazon Kinesis
• Google Cloud Dataflow
• TIBCO Event Analytics
• Informatica
• Striim
• DataTorrent
• StreamAnalytix
• SQLStream Blaze
• Data Artisans
• Impetus Technologies
• EsperTech

Apache Storm – Basics
• Free & open source distributed real-time computation engine: «Hadoop for real time»
• Fast: over 1M tuples processed per second per node

Fault Tolerance
«fail-fast, auto restart»
• When a worker dies: the supervisor restarts it. If it continuously fails on startup and cannot heartbeat to Nimbus, Nimbus reschedules the worker.
• When a node dies: the tasks assigned to that machine time out and Nimbus reassigns them to other machines.
• When Nimbus dies: Nimbus is fail-fast (the process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in ZooKeeper or on disk), so it restarts as if nothing happened.
• If you lose the Nimbus node: workers continue to function, and supervisors continue to restart workers that die. However, without Nimbus, workers won't be reassigned to other machines when necessary (for example, if you lose a worker machine).
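The «fail-fast, auto restart» idea can be shown with a toy supervisor loop. This is not Storm's API: `WorkerCrashed`, `make_worker`, and `supervise` are hypothetical names, and a plain dictionary stands in for the external state store. The point is only that a worker which keeps its progress outside the process can crash at any time and resume after a restart.

```python
class WorkerCrashed(Exception):
    """Raised by the worker to fail fast instead of limping along."""


def make_worker(store, items, crash_at):
    """Build a worker that sums items, resuming from store['offset'].

    The worker crashes once at each index listed in crash_at.
    """
    pending_crashes = set(crash_at)

    def worker():
        while store["offset"] < len(items):
            i = store["offset"]
            if i in pending_crashes:
                pending_crashes.discard(i)
                raise WorkerCrashed(f"simulated failure at item {i}")
            store["total"] += items[i]
            store["offset"] = i + 1  # progress lives outside the worker
        return store["total"]

    return worker


def supervise(worker, max_restarts):
    """Restart the worker whenever it crashes, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except WorkerCrashed:
            if restarts == max_restarts:
                raise  # give up and let a higher level reschedule
            restarts += 1
```

Because the worker itself holds no state between runs, the supervisor does not need to know what the worker was doing when it died; it simply starts it again.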

Integration
With External Systems, and Other Libraries
• Apache Kafka
• Apache HBase
• Apache HDFS
• Apache Hive
• Apache Solr
• Apache Cassandra
• JDBC
• JMS
• Redis
• Event Hubs
• Elasticsearch
• MQTT
• MongoDB
• OpenTSDB
• Kinesis
• Druid
• Kestrel
With Containers, and Resource Management Systems
• YARN
• Mesos
• Docker
• Kubernetes

And many others >> http://storm.apache.org/Powered-By.html
