How to extract valuable information from real-time data feeds
Gene Leybzon, February 2016
IoT Challenge
“The critical challenge is using this data when it is still in motion – and extracting valuable information from it.”
- Frédéric Combaneyre, SAS
Role of Data Streams
• Detect events of interest and trigger appropriate actions
• Aggregate information for monitoring
• Sensor data cleansing and validation
• Real-time predictive and optimized operations (support for real-time decision making)
Platforms
Google Cloud Platform
AWS IoT Initiative
SAS
Role of “Pipelines”
• Transform data: convert the data into another format, for example, converting a captured device signal voltage to a calibrated unit of temperature.
• Aggregate and compute data: combining data lets you add checks, such as averaging data across multiple devices to avoid acting on a single spurious device, or ensuring you still have actionable data if a single device goes offline. By adding computation to your pipeline, you can apply streaming analytics to data while it is still in the processing pipeline.
• Enrich data: you can combine the device-generated data with other metadata about the device, or with other datasets, such as weather or traffic data, for use in subsequent analysis.
• Move data: you can store the processed data in one or more final storage locations.
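The four pipeline stages above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the calibration formula, the `DEVICE_SITES` metadata table, and the in-memory `storage` list are hypothetical stand-ins, not part of any platform named in this deck. Enrichment runs before aggregation here because the averages are grouped by the enriched site name.

```python
from statistics import mean

# Hypothetical calibration: 10 mV per degree Celsius with a 0.5 V offset.
def volts_to_celsius(volts):
    return (volts - 0.5) * 100.0

# Hypothetical device metadata used for enrichment.
DEVICE_SITES = {"dev-1": "roof", "dev-2": "roof", "dev-3": "lobby"}

def run_pipeline(raw_readings, sink):
    # Transform: convert raw voltages into calibrated temperatures.
    temps = [(dev, volts_to_celsius(v)) for dev, v in raw_readings]
    # Enrich: attach the site each device belongs to.
    enriched = [(DEVICE_SITES[dev], t) for dev, t in temps]
    # Aggregate: average per site so one spurious device has less influence.
    by_site = {}
    for site, t in enriched:
        by_site.setdefault(site, []).append(t)
    averages = {site: round(mean(ts), 2) for site, ts in by_site.items()}
    # Move: write the processed result to the final storage location.
    sink.append(averages)
    return averages

storage = []  # stands in for a database or object store
result = run_pipeline([("dev-1", 0.70), ("dev-2", 0.74), ("dev-3", 0.71)], storage)
```

In a real deployment each stage would typically be a separate streaming operator rather than a list comprehension, but the order of responsibilities stays the same.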
Architecture
What do we want from a stream architecture?
• Fault tolerance against hardware failures and human errors
• Support for a variety of use cases, including low-latency querying as well as updates
• Linear scale-out capabilities, meaning that adding machines should increase capacity proportionally
• Extensibility, so that the system is manageable and can accommodate new features easily
• Consistency: data is the same across the cluster
• Availability: the cluster remains accessible even if a node in the cluster goes down
• Partition tolerance: the cluster continues to function even if there is a "partition" (communications break) between two nodes
CAP Theorem
“It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (a guarantee that every request receives a response about whether it succeeded or failed)
• Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)”
Facing the CAP Theorem
[Figure: Venn diagram of Consistency, Availability, and Partition Tolerance. Labels in the diagram: ∅; Cassandra, Riak, CouchBase, MongoDB; λ; Paxos, Zab, Raft]
λ-Architecture
Limitations of the λ-Architecture
• One-way data flow (doesn’t transact and make per-event decisions on the streaming data, nor does it respond immediately to the events coming in)
• Eventual consistency
• NoSQL
• Complexity
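To make these limitations concrete, here is a toy sketch of the λ-architecture's three layers as a simple event counter. It is not taken from the deck or from any framework; the class and method names are illustrative assumptions. Note how the counting logic exists twice (once in the batch layer, once in the speed layer), and how queries only read merged views and never feed decisions back into the stream.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with a batch layer, a speed layer, and a serving query."""

    def __init__(self):
        self.master_log = []         # append-only record of all events
        self.batch_view = Counter()  # recomputed periodically from the full log
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # Events flow one way: into the log and into the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute from scratch, then reset the speed view it supersedes.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge the possibly stale batch view with recent increments.
        return self.batch_view[key] + self.speed_view[key]

lc = LambdaCounter()
for k in ["a", "b", "a"]:
    lc.ingest(k)
before_batch = lc.query("a")
lc.run_batch()
lc.ingest("a")
after_batch = lc.query("a")
```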
Out-of-the-box Solutions
Druid Project
• Designed for low latency
• Open-sourced in 2012
• Long history of data
• Scale > 500K events/sec on average
Druid data store
Apache Samza
• Distributed stream processing framework
• Simple API
• Fault tolerance
• Manages stream state
• Guarantees that messages are processed in the order they were written to a partition, and that no messages are ever lost
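The per-partition ordering guarantee can be illustrated without Samza itself. The sketch below is plain Python and does not use the Samza API; `NUM_PARTITIONS`, `partition_for`, and `process` are hypothetical names. It assumes messages are routed by a stable hash of their key, so all messages for one key land in one partition and are consumed in the order they were written.

```python
from collections import defaultdict
import zlib

NUM_PARTITIONS = 2  # hypothetical partition count

def partition_for(key):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def process(messages):
    # Append each message to its partition log, preserving arrival order.
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key)].append((key, value))
    # One task per partition reads its log sequentially and keeps local state.
    state = defaultdict(list)
    for pid in sorted(partitions):
        for key, value in partitions[pid]:
            state[key].append(value)
    return dict(state)

seen = process([("user-a", 1), ("user-b", 10), ("user-a", 2), ("user-a", 3)])
```

Ordering is only guaranteed within a partition; nothing in this scheme orders messages across different keys that hash to different partitions.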
Apache Samza
Samza Architecture
VoltDB
Stream Databases and Pipelines
Building Blocks
PipelineDB (example of usage)
AWS Kinesis
Apache Cassandra
• Decentralized (every node in the cluster has the same role)
• No single point of failure
• Scalable
• Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications
• Fault-tolerant
• Tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable"
• Hadoop integration, including MapReduce
• Query language (CQL)
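Tunable consistency is easiest to reason about with the replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The following is a small arithmetic sketch of that rule, not Cassandra driver code; the function names are illustrative.

```python
def is_strongly_consistent(replication_factor, write_replicas, read_replicas):
    """True when every read set overlaps every write set in at least one replica."""
    return read_replicas + write_replicas > replication_factor

def quorum(replication_factor):
    # A majority of the replicas.
    return replication_factor // 2 + 1

N = 3
q = quorum(N)
quorum_both = is_strongly_consistent(N, q, q)  # write QUORUM, read QUORUM
one_one = is_strongly_consistent(N, 1, 1)      # write ONE, read ONE
all_one = is_strongly_consistent(N, N, 1)      # write ALL, read ONE
```

With a replication factor of 3, reading and writing at ONE favors availability and latency, while QUORUM on both sides trades some of that for overlapping replica sets.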
Apache Flink
• High performance
• Low latency
• Support for out-of-order events
• Flexible streaming windows
• Fault tolerance
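Out-of-order handling and windowing can be sketched in plain Python. This is not the Flink API; it is a simplified model that assumes integer event timestamps, fixed-size (tumbling) windows, and a watermark that trails the largest timestamp seen by a fixed `allowed_lateness`. All names are hypothetical.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, allowed_lateness):
    """Count events per event-time window, tolerating bounded out-of-order arrival.

    events: iterable of event timestamps in arrival order.
    Returns (results, dropped): emitted (window_start, count) pairs and late events.
    """
    open_windows = defaultdict(int)
    results, dropped = [], []
    watermark = float("-inf")
    for ts in events:
        # The watermark trails the largest timestamp seen by the allowed lateness.
        watermark = max(watermark, ts - allowed_lateness)
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append(ts)  # its window has already been emitted
        else:
            open_windows[start] += 1
        # Emit every window whose end is at or before the watermark.
        for w in sorted(open_windows):
            if w + window_size <= watermark:
                results.append((w, open_windows.pop(w)))
    # End of stream: flush the windows that are still open.
    for w in sorted(open_windows):
        results.append((w, open_windows[w]))
    return results, dropped

results, dropped = tumbling_window_counts(
    [1, 4, 3, 12, 9, 15, 2, 27], window_size=10, allowed_lateness=5
)
```

In this run the event with timestamp 9 arrives after 12 but is still counted in the first window, while the event with timestamp 2 arrives after the watermark has passed the end of its window and is reported as late.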
Stream Processing Algorithms
Popular Stream Algorithms
• Finding frequent items
• Estimating the number of distinct items
• Statistics
• Finding “signal”
• Error correction
• Filtering
• Anomaly detection
• Incremental learning
• Data clustering
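As one example from this list, frequent items can be found in bounded memory with the Misra-Gries summary. The deck does not name a specific algorithm, so this choice is an assumption made for illustration. With at most k - 1 counters, every item that occurs more than n/k times in a stream of length n is guaranteed to survive as a candidate, although the stored counts are underestimates.

```python
def misra_gries(stream, k):
    """Return candidate items that may occur more than len(stream)/k times.

    Keeps at most k - 1 counters regardless of stream length.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

candidates = misra_gries(["a", "b", "a", "c", "a", "d", "a", "b", "a"], k=3)
```

A second pass over the data (or an exact counter restricted to the candidates) is needed to confirm which candidates are truly frequent.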
Machine Learning from Stream Data
How is ML from stream data different from traditional ML techniques?
Takes recent history into account
The ML model is updatable (“evolves” as new data comes in)
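One simple way to weight recent history more heavily is an exponentially weighted moving average, sketched below. The deck does not prescribe this technique; it is shown only as an assumed example of a statistic that is updated per event and gradually forgets old data. The smoothing factor `alpha` is a hypothetical parameter.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; larger alpha favors recent values."""
    avg = None
    out = []
    for x in values:
        # The first value seeds the average; later values are blended in.
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        out.append(avg)
    return out

smoothed = ewma([10.0, 10.0, 20.0, 20.0], alpha=0.5)
```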
Two Approaches to Adapt ML to Stream Data
• Incremental algorithms (both support vector machines and neural networks can work incrementally)
• Periodic retraining with a new batch of data
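The incremental approach can be sketched with a single-feature linear model trained by stochastic gradient descent, one example at a time. This is a minimal illustration, not a recommendation of a particular model; the class name, learning rate, and simulated stream are assumptions. The periodic-retraining approach would instead refit a model from scratch each time a new batch is collected.

```python
class OnlineLinearRegression:
    """Single-feature linear model updated one example at a time with SGD."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def partial_fit(self, x, y):
        # Gradient step on the squared error of this one example.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error

model = OnlineLinearRegression()
# Simulated stream drawn from y = 2x + 1, replayed several times.
for _ in range(2000):
    for x in (0.0, 1.0, 2.0, 3.0):
        model.partial_fit(x, 2.0 * x + 1.0)
```

Because each update touches only the current example, memory use stays constant no matter how long the stream runs.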
Questions?


Editor's Notes

  • #7: https://aws.amazon.com/iot/how-it-works/#shadows
  • #12: https://en.wikipedia.org/wiki/CAP_theorem
  • #13: http://www.slideshare.net/gakhov/bbuzz-overview-part1
  • #14: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html https://www.mapr.com/developercentral/lambda-architecture
  • #15: http://radar.oreilly.com/2015/02/improving-on-the-lambda-architecture-for-streaming-analysis.html
  • #17: https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
  • #18: https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
  • #24: https://github.com/pipelinedb/pipelinedb
  • #25: https://github.com/pipelinedb/pipelinedb
  • #27: https://flink.apache.org/features.html https://flink.apache.org/
  • #32: Considerations: data horizon, data obsolescence