How to extract valuable information from real-time data feeds
Gene Leybzon, February 2016

“The critical challenge is using this data when it is still in motion – and extracting valuable information from it.”
- Frédéric Combaneyre, SAS
IoT Challenge

• Detect events of interest and trigger appropriate actions
• Aggregate information for monitoring
• Sensor data cleansing and validation
• Real-time predictive and optimized operations (support for real-time decision making)
Role of Data Streams
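
The first and third roles above (event detection, and sensor data cleansing and validation) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the valid range, the alert threshold, and the function names `cleanse` and `detect` are all assumptions.

```python
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical plausible range for a temperature sensor, in degrees Celsius.
VALID_RANGE = (-40.0, 125.0)
# Hypothetical threshold that defines an "event of interest".
ALERT_THRESHOLD = 80.0


def cleanse(readings: Iterable[Optional[float]]) -> Iterator[float]:
    """Drop missing values and readings outside the plausible range."""
    low, high = VALID_RANGE
    for value in readings:
        if value is not None and low <= value <= high:
            yield value


def detect(readings: Iterable[float], on_event: Callable[[float], None]) -> int:
    """Call on_event for each reading above the threshold; return the event count."""
    count = 0
    for value in readings:
        if value > ALERT_THRESHOLD:
            on_event(value)
            count += 1
    return count


# Usage: the "action" here simply records the alert in a list.
alerts = []
raw = [21.5, None, 999.0, 85.2, 79.9, 90.1]
event_count = detect(cleanse(raw), alerts.append)
```

Both functions consume their input lazily, so they work on an unbounded feed as well as on the short list used here.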

• Transform data: convert the data into another format, for example converting a captured device signal voltage to a calibrated temperature measurement.
• Aggregate and compute data: combining data lets you add checks, such as averaging data across multiple devices to avoid acting on a single spurious device, or ensuring you still have actionable data if a single device goes offline. Adding computation to the pipeline lets you apply streaming analytics to data while it is still being processed.
• Enrich data: combine the device-generated data with other metadata about the device, or with other datasets such as weather or traffic data, for use in subsequent analysis.
• Move data: store the processed data in one or more final storage locations.
Role of “Pipelines”
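
The four pipeline stages above can be sketched as plain Python functions. Everything specific here is an assumption made for illustration: the linear voltage-to-temperature calibration, the `DEVICE_INFO` metadata table, the event field names, and the use of a dictionary as the final storage location.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical device metadata used for enrichment.
DEVICE_INFO = {
    "dev-1": {"site": "roof"},
    "dev-2": {"site": "roof"},
    "dev-3": {"site": "lobby"},
}


def transform(event):
    """Convert raw voltage to Celsius (assumed sensor: 0.5 V offset, 10 mV per degree)."""
    celsius = (event["volts"] - 0.5) * 100.0
    return {"device": event["device"], "celsius": celsius}


def enrich(event):
    """Attach device metadata; unknown devices get a placeholder site."""
    return {**event, **DEVICE_INFO.get(event["device"], {"site": "unknown"})}


def aggregate(events):
    """Average the temperature across all devices at the same site."""
    by_site = defaultdict(list)
    for event in events:
        by_site[event["site"]].append(event["celsius"])
    return {site: round(mean(values), 2) for site, values in by_site.items()}


def run_pipeline(raw_events, sink):
    """Transform, enrich, aggregate, then move the result into the sink."""
    enriched = [enrich(transform(event)) for event in raw_events]
    sink.update(aggregate(enriched))
    return sink


raw_events = [
    {"device": "dev-1", "volts": 0.70},
    {"device": "dev-2", "volts": 0.74},
    {"device": "dev-3", "volts": 0.71},
]
storage = run_pipeline(raw_events, {})
```

Averaging per site means a single spurious or offline roof sensor does not by itself determine the roof reading.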

• Fault tolerance against hardware failures and human errors
• Support for a variety of use cases, including low-latency querying as well as updates
• Linear scale-out capabilities, meaning that adding more machines should help get the job done
• Extensibility, so that the system is manageable and can accommodate new features easily
• Consistency: data is the same across the cluster
• Availability: the cluster can be accessed even if one of its nodes goes down
• Partition tolerance: the cluster continues to function even if there is a "partition" (communications break) between two nodes
What do we want from a stream architecture?

“It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (a guarantee that every request receives a response about whether it succeeded or failed)
• Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)”
CAP Theorem

Facing the CAP Theorem
[Figure: CAP triangle with corners Consistency, Availability, and Partition Tolerance. Labels in the diagram: ∅; Cassandra, Riak, CouchBase, MongoDB; λ; Paxos, Zab, Raft.]

• One-way data flow (it does not transact or make per-event decisions on the streaming data, nor does it respond immediately to incoming events)
• Eventual consistency
• NoSQL
• Complexity
Limitations of the λ-Architecture
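
For context, the λ-architecture answers queries by merging a periodically recomputed batch view with an incrementally updated real-time view. The sketch below is a toy page-view counter under that assumption; the function names and the event shape are hypothetical. It shows the one-way flow: events only ever feed the views, and nothing in the path reacts to an individual event.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable master dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: incremental counts for events not yet in the batch view."""
    view = Counter()
    for event in recent_events:
        view[event["page"]] += 1
    return view


def query(page, batch, speed):
    """Serving layer: merge both views at read time."""
    return batch[page] + speed[page]


history = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}]
home_views = query("/home", batch_view(history), speed_view(recent))
```

The same counting logic appears twice, once per layer, which hints at the complexity limitation listed above.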

• Designed for low latency
• Open-sourced in 2012
• Long history of data
• Scale: more than 500K events/sec on average
Druid Project

• Distributed stream processing framework
• Simple API
• Fault tolerance
• Manages stream state
• Guarantees that messages are processed in the order they were written to a partition, and that no messages are ever lost
Apache Samza
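
The per-partition ordering guarantee can be illustrated without any Samza code. The sketch below is a plain-Python model, not the Samza API: `PartitionedLog` and `consume` are hypothetical names, and the log lives in memory. Messages with the same key always land in the same partition, and a consumer reads each partition strictly in write order, advancing its offset only after a message has been handled.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """In-memory stand-in for a partitioned, append-only message log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Stable hash, so the same key always maps to the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index


def consume(log, offsets, handler):
    """Process unread messages of each partition in the order they were written."""
    for index, partition in enumerate(log.partitions):
        while offsets[index] < len(partition):
            key, value = partition[offsets[index]]
            handler(key, value)
            # Advance only after the handler returned, so a failure causes a retry.
            offsets[index] += 1


log = PartitionedLog(num_partitions=4)
for i in range(5):
    log.append("sensor-a", i)
    log.append("sensor-b", i * 10)

seen = defaultdict(list)
offsets = [0] * 4
consume(log, offsets, lambda key, value: seen[key].append(value))
```

Because the offset moves only after the handler succeeds, a crash mid-message leads to reprocessing rather than loss. Order is guaranteed within a partition, not across partitions.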

Stream Databases and Pipelines
Building Blocks

Apache Cassandra
• Decentralized: every node in the cluster has the same role; no single point of failure
• Scalable: read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications
• Fault-tolerant
• Tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable"
• Hadoop integration, including MapReduce support
• Query language (CQL)
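
Tunable consistency is commonly reasoned about with the replica-overlap rule: if a write is acknowledged by W replicas and a read consults R replicas out of N, every read is guaranteed to see the latest acknowledged write when R + W > N. The helper names below are hypothetical; the sketch only does the arithmetic and does not talk to a cluster.

```python
def quorum(replication_factor):
    """Size of a majority quorum: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3
q = quorum(n)
quorum_both = reads_see_latest_write(n, q, q)  # quorum writes, quorum reads
one_both = reads_see_latest_write(n, 1, 1)     # single-replica writes and reads
write_all = reads_see_latest_write(n, n, 1)    # write to all, read from one
```

Lower values of R and W trade this guarantee for latency and availability, which is the range the consistency bullet above describes.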

Apache Flink
• High performance
• Low latency
• Support for out-of-order events
• Flexible streaming windows
• Fault tolerance
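
Out-of-order handling and windowing can be illustrated in plain Python. This is a conceptual sketch and not the Flink API: the function name, the integer timestamps, and the `max_out_of_orderness` parameter are assumptions. Events are assigned to tumbling windows by event time, and a window is emitted once a watermark (the highest timestamp seen minus the tolerated disorder) passes the window's end.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per event-time window.

    events: iterable of (event_time, value) pairs in arrival order.
    Returns (results, dropped), where results lists (window_start, count)
    in the order the windows were closed.
    """
    open_windows = defaultdict(int)
    results, dropped = [], []
    watermark = float("-inf")
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            # The window was already emitted, so this event arrived too late.
            dropped.append((event_time, value))
            continue
        open_windows[window_start] += 1
        watermark = max(watermark, event_time - max_out_of_orderness)
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                results.append((start, open_windows.pop(start)))
    # End of stream: flush whatever is still open.
    for start in sorted(open_windows):
        results.append((start, open_windows[start]))
    return results, dropped


arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (17, "e"), (3, "f"), (25, "g")]
results, dropped = tumbling_window_counts(arrivals, window_size=10, max_out_of_orderness=5)
```

Event `"d"` arrives after `"c"` but still counts toward the first window because the watermark has not yet passed it; event `"f"` arrives after that window closed and is dropped.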

• Finding frequent items
• Estimating the number of distinct items
• Statistics
• Finding “signal”
• Error correction
• Filtering
• Anomaly detection
• Incremental learning
• Data clustering
Popular Stream Algorithms
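
Two entries from this list can be sketched in bounded memory. The slides do not name specific algorithms, so the choices here are assumptions: the Misra-Gries summary for frequent items, and Welford's online mean and variance for statistics, with a simple z-score test standing in for anomaly detection.

```python
import math


def misra_gries(stream, k):
    """Approximate frequent items using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    remain in the result; the stored counts are lower bounds.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


class RunningStats:
    """Welford's online algorithm for mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomaly(self, x, z_threshold=3.0):
        """Flag values more than z_threshold standard deviations from the mean."""
        return self.std > 0 and abs(x - self.mean) / self.std > z_threshold


heavy = misra_gries(["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"], k=3)

stats = RunningStats()
for reading in [10.0, 10.5, 9.5, 10.2, 9.8]:
    stats.update(reading)
```

Neither structure stores the stream itself: memory use depends on `k` and on three numbers, not on how many items have been seen.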

Machine Learning from Stream Data

Takes recent history into account
ML model is updatable (“evolves” as new data comes in)
How is ML from stream data different from traditional ML techniques?
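
One simple way to weight recent history more heavily is an exponentially weighted moving average. It is shown here only as an illustration of the idea; the slides do not prescribe it, and `alpha` is an assumed smoothing parameter between 0 and 1.

```python
def ewma(values, alpha):
    """Yield the exponentially weighted moving average after each value.

    Higher alpha puts more weight on recent values.
    """
    average = None
    for x in values:
        average = x if average is None else alpha * x + (1 - alpha) * average
        yield average


smoothed = list(ewma([10.0, 20.0, 20.0], alpha=0.5))
```

The estimate is updated in constant time and memory per value, so it "evolves" with the stream instead of being recomputed from scratch.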

• Incremental algorithms (both support vector machines and neural networks can work incrementally)
• Periodic retraining with each new batch of data
Two Approaches to Adapt ML to Stream Data
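
The two approaches can be contrasted on a deliberately small model. A one-variable linear model stands in here for the SVMs and neural networks mentioned above; that substitution, the class and function names, and the learning rate are all assumptions. `OnlineLinearModel.update` is the incremental approach (one stochastic gradient step per example), and `retrain_on_batch` is the periodic approach (a fresh least-squares fit on the latest batch).

```python
class OnlineLinearModel:
    """Fits y ~ w * x + b, updated one example at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # Gradient step on the squared error of a single example.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def retrain_on_batch(batch):
    """Closed-form least-squares fit on a batch of (x, y) pairs."""
    n = len(batch)
    mean_x = sum(x for x, _ in batch) / n
    mean_y = sum(y for _, y in batch) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in batch)
    w = sum((x - mean_x) * (y - mean_y) for x, y in batch) / var_x
    return w, mean_y - w * mean_x


# Synthetic stream following y = 2x + 1 (hypothetical data).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
model = OnlineLinearModel()
for _ in range(2000):
    for x in xs:
        model.update(x, 2.0 * x + 1.0)

batch_w, batch_b = retrain_on_batch([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

The incremental model never needs the past examples again, while periodic retraining needs each batch to be stored until the next fit.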
