© 2015 IBM Corporation
How Spark Enables the Internet of Things:
Efficient Integration of Multiple Spark
Components for Smart City Use Cases
Paula Ta-Shma
IBM Research
paula@il.ibm.com
Joint work with:
Adnan Akbar, University of Surrey
Michael Factor, IBM Research
Guy Hadash, IBM Research
Juan Sancho, ATOS
The Evolution of Data Collection
Internet of Things
IoT: The Biggest Big Data
The IoT market will grow to $1.7 trillion in 2020 (IDC)
By 2020 the number of networked devices will be 30 billion (IDC), more than 4 times the entire global population
[Chart: global data volume in exabytes, 2005 to 2017]
EMT Madrid Bus Company Needs to Make Decisions According to Current and Predicted Future Traffic State
• The Problem
– EMT needs to staff control rooms where employees manually analyze Madrid traffic sensor output. This can be slow and costly.
• Objective
– Improve customer satisfaction and reduce costs by responding more efficiently and quickly to real-time traffic problems
• Approach
– Monitor data from up to 3000 sensors. React by rerouting buses, modifying traffic lights, etc., based upon knowledge derived from historical data
[Figure labels: Today, Tomorrow]
Generic IoT Architecture – Data Flow
1. Collect historical time series data
– Collect data from devices
– Aggregate into objects
– Index and/or partition
2. Learn patterns in data
– May be time/location dependent
– Generate thresholds, classifiers etc.
3. Apply what was learned on real time data stream
– Take action
[Diagram labels: IoT, Secor, Swift, CEP. Green flows: real time. Purple flows: batch.]
IoT Architecture – Madrid Traffic – Ingestion Flow
Aim: Collect historical time series data for analysis
– Continuously collect data from up to 3000 Madrid council traffic sensors via web service
- Data includes traffic speeds and intensities, updated every 5 mins
– Push the messages to Kafka
– Use Secor to aggregate multiple messages into a single Swift object
- According to policy, e.g., every 60 mins
- Possibly partition the data, e.g., according to date
- Convert to Parquet format
- Annotate with metadata, e.g., min/max speed, start/end time
– Index Swift objects according to their metadata using Elasticsearch
[Diagram labels: IoT, Secor, Swift]
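The aggregation and annotation step can be illustrated with a minimal sketch in plain Python. It is not Secor code: the `Reading` and `StoredObject` types, their field names, and the `aggregate` helper are all hypothetical, and the 60 minute window simply mirrors the example policy on the slide. The sketch groups readings into fixed time windows and attaches the min/max speed and start/end time that a metadata index could later use.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    """One sensor message; the field names are hypothetical."""
    sensor_id: str
    timestamp: int   # seconds since some epoch
    speed: float     # km/h
    intensity: int   # vehicles per hour


@dataclass
class StoredObject:
    """A batch of readings plus the metadata used for indexing."""
    readings: list
    metadata: dict = field(default_factory=dict)


def aggregate(readings, window_secs=3600):
    """Group readings into fixed time windows and annotate each batch."""
    windows = {}
    for r in readings:
        windows.setdefault(r.timestamp // window_secs, []).append(r)
    objects = []
    for _, batch in sorted(windows.items()):
        speeds = [r.speed for r in batch]
        times = [r.timestamp for r in batch]
        objects.append(StoredObject(
            readings=batch,
            metadata={
                "min_speed": min(speeds),
                "max_speed": max(speeds),
                "start_time": min(times),
                "end_time": max(times),
            },
        ))
    return objects


# Synthetic readings: two fall in the first hour, one in the second.
readings = [
    Reading("s1", 0, 42.0, 310),
    Reading("s1", 300, 38.5, 355),
    Reading("s1", 3600, 21.0, 640),
]
objects = aggregate(readings)
```

In the deck's pipeline the batches would additionally be written as Parquet objects to Swift and their metadata indexed in Elasticsearch; those steps are left out here.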
IoT Architecture – Madrid Traffic – Data Access
Aim: Access data efficiently and cost effectively
– Store IoT data in OpenStack Swift object storage
- Open source, low cost deployment, and highly scalable
– Parquet data is accessible via Spark SQL
– Optimized predicate pushdown
- Custom Spark SQL external data source driver
- Uses object metadata indexes
- Searches for Swift objects whose min/max values overlap requested ranges
Get all data for morning traffic:
SELECT codigo, intensidad, velocidad
FROM madridtraffic
WHERE tf >= '08:00:00' AND tf <= '12:00:00'
Brute force method: 13245 Swift requests
Optimized predicate pushdown: 616 Swift requests
21.5 times improvement
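The idea behind the metadata-based pushdown can be sketched in a few lines of plain Python. This is an illustration only, not the custom Spark SQL data source driver: the `ObjectMeta` type, the object names, and the in-memory `index` list are hypothetical stand-ins for the Elasticsearch metadata index. An object needs to be fetched only if its [min, max] range for `tf` overlaps the range requested in the WHERE clause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectMeta:
    """Index entry for one stored object (names are hypothetical)."""
    name: str
    min_tf: str  # earliest time of day in the object, zero-padded HH:MM:SS
    max_tf: str  # latest time of day in the object


def overlaps(meta, lo, hi):
    """True if the object's [min_tf, max_tf] range intersects [lo, hi]."""
    # Zero-padded HH:MM:SS strings compare in chronological order.
    return meta.min_tf <= hi and meta.max_tf >= lo


def select_objects(index, lo, hi):
    """Return only the object names that can contain matching rows."""
    return [m.name for m in index if overlaps(m, lo, hi)]


index = [
    ObjectMeta("obj-00", "00:00:00", "05:55:00"),
    ObjectMeta("obj-06", "06:00:00", "08:55:00"),
    ObjectMeta("obj-09", "09:00:00", "11:55:00"),
    ObjectMeta("obj-12", "12:00:00", "17:55:00"),
    ObjectMeta("obj-18", "18:00:00", "23:55:00"),
]

# Same range as the morning traffic query on the slide.
needed = select_objects(index, "08:00:00", "12:00:00")

# Ratio of the request counts reported on the slide.
improvement = 13245 / 616
```

Objects that are selected may still contain rows outside the range, so the query engine applies the filter to the fetched data as usual; the index only avoids requests for objects that cannot match.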
IoT Architecture – Madrid Traffic – Machine Learning
Aim: Learn to differentiate between ‘good’ and ‘bad’ traffic
– Depends on context
- Time (morning/evening), Day (weekday/weekend)
- Location
– Use Spark MLlib k-means clustering
– Produce threshold values for real-time decision making
– Re-run algorithm when quality of clusters decreases
- Can use silhouette index to measure quality
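As an illustration of the quality check, the following plain-Python sketch computes the mean silhouette coefficient for one-dimensional data such as speed readings. It is a stand-in written for this example, not Spark code; the sample values and the retraining cut-off of 0.5 are assumptions chosen only to show the mechanics. It expects at least two clusters.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels.

    Requires at least two distinct labels.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


speeds = [10, 12, 14, 50, 52, 54]            # synthetic speed readings
good = silhouette(speeds, [0, 0, 0, 1, 1, 1])  # well separated clusters
bad = silhouette(speeds, [0, 1, 0, 1, 0, 1])   # clusters that mix both groups

RETRAIN_BELOW = 0.5  # hypothetical cut-off, not taken from the talk
needs_retraining = good < RETRAIN_BELOW
```

A score near 1 means points sit close to their own cluster and far from the other; a falling score on recent data is one possible trigger for re-running the clustering.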
IoT Architecture – Madrid Traffic – Machine Learning
Event Detection:
• Use Spark MLlib k-means clustering to separate data into 2 clusters
• Find the midpoint between the 2 cluster centres
• Use this midpoint to generate the thresholds
• Repeat for each context, e.g. time period (morning, afternoon, evening, night)
Anomaly Detection:
• Use a single cluster and define an anomaly to be further than a certain distance from the cluster centre
[Chart: Morning Traffic on Weekdays]
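The threshold and anomaly steps above can be sketched as follows. The talk uses Spark MLlib k-means; this sketch swaps in a small one-dimensional k-means written in plain Python so that it runs without a cluster. The speed values, the context names, and the `max_distance` value are made up for illustration.

```python
def two_means(values, iterations=20):
    """1-D k-means with k=2, initialised at the min and max values."""
    c_low, c_high = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        if not high:  # all values identical, nothing to split
            break
        c_low, c_high = sum(low) / len(low), sum(high) / len(high)
    return c_low, c_high


def threshold(values):
    """Midpoint between the two cluster centres."""
    c_low, c_high = two_means(values)
    return (c_low + c_high) / 2


def is_anomaly(value, centre, max_distance):
    """Single-cluster rule: anomalous if too far from the cluster centre."""
    return abs(value - centre) > max_distance


# Synthetic speed readings (km/h) for one location, split by context.
speeds_by_context = {
    "morning": [12, 15, 18, 55, 60, 65],
    "night": [40, 45, 50, 80, 85, 90],
}
thresholds = {ctx: threshold(v) for ctx, v in speeds_by_context.items()}

# Anomaly detection on free-flowing morning traffic only.
free_flow = [55, 60, 65]
centre = sum(free_flow) / len(free_flow)
```

Each context gets its own threshold, so a speed that counts as congested at night may be normal in the morning rush.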
IoT Architecture – Madrid Traffic – Real Time Decision Making
Aim: Respond in real time to traffic conditions
– Use Complex Event Processing (CEP) approach
- Rule based
- Process events record by record
- CEP rules are typically defined manually, but in many cases it is difficult to get them right
- We automate this process by learning rule thresholds from historical data
- uCEP has a small footprint and can be run at the edge
[Diagram labels: IoT, CEP]
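A record-by-record rule of this kind can be sketched in plain Python. This is not the uCEP API: the `make_congestion_rule` helper, the event fields, and the "N consecutive slow readings" condition are hypothetical, and the threshold is assumed to come from a learning step such as the k-means midpoint.

```python
def make_congestion_rule(speed_threshold, consecutive=3):
    """Return a stateful rule that fires after N consecutive slow readings."""
    slow_streak = {}  # per-sensor count of consecutive slow readings

    def on_event(event):
        sensor = event["sensor_id"]
        if event["speed"] < speed_threshold:
            slow_streak[sensor] = slow_streak.get(sensor, 0) + 1
        else:
            slow_streak[sensor] = 0
        # Fire exactly once, when the streak first reaches the limit.
        if slow_streak[sensor] == consecutive:
            return {"type": "congestion", "sensor_id": sensor, "time": event["time"]}
        return None

    return on_event


# Threshold assumed to be learned from historical data.
rule = make_congestion_rule(speed_threshold=37.5, consecutive=2)

stream = [
    {"sensor_id": "s1", "time": 0, "speed": 50.0},
    {"sensor_id": "s1", "time": 300, "speed": 30.0},
    {"sensor_id": "s1", "time": 600, "speed": 25.0},
    {"sensor_id": "s1", "time": 900, "speed": 20.0},
]
alerts = [a for a in map(rule, stream) if a is not None]
```

Because the rule keeps only a small counter per sensor, the same pattern fits a small-footprint engine running at the edge; updating the threshold when the model is retrained does not require rewriting the rule.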
Work in Progress
Proactive approach:
• Use Spark streaming linear regression to predict traffic behavior (e.g. speed, intensity) for near future
• Apply CEP on predicted data
• Respond proactively to predicted events such as traffic congestion
– e.g. EMT can proactively reroute buses
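The proactive approach can be illustrated with a small online linear regression in plain Python. The talk refers to Spark streaming linear regression; this sketch swaps in a hand-written single-feature model updated by stochastic gradient descent so that it runs standalone. The training data follows a made-up linear relation between the current and the next normalised speed, and the 0.4 threshold is likewise an assumption.

```python
class OnlineLinearRegression:
    """Single-feature linear model y = w * x + b, updated one record at a time."""

    def __init__(self, learning_rate=0.5):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """One stochastic gradient step on the squared error for (x, y)."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


model = OnlineLinearRegression()

# Synthetic stream: x is the current speed and y the next speed, both
# normalised to [0, 1]; the relation y = 0.5 * x + 0.2 is invented.
for _ in range(1000):
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        model.update(x, 0.5 * x + 0.2)

# Apply a CEP-style threshold check to the predicted value.
predicted_next = model.predict(0.3)
CONGESTION_THRESHOLD = 0.4  # hypothetical normalised speed threshold
congestion_expected = predicted_next < CONGESTION_THRESHOLD
```

Feeding `predicted_next` instead of the latest measurement into the rule is what turns a reactive alert into a proactive one, at the cost of acting on a forecast that may be wrong.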
Demo
Our Architecture Applies to Many IoT Use Cases
• Energy/utilities
– Anomaly detection
- Pipe leakage
- Appliance malfunction
– Occupancy detection
• Healthcare
– Patient monitoring/alert/response
• Insurance
– Driver behavior and location monitoring
• Transportation
– Connected vehicles, engine diagnostics, automated service scheduling
• Logistics
– Goods tracking, sensitive goods management
The Madrid Traffic Use Case on IBM Bluemix
[Architecture diagram. Stages: Data Sources, Message Bus, Data Storage, Data Analytics, Data Visualization. Components shown: Madrid Traffic Sensors, Node-RED, MQTT, Secor, Object Storage, Apache Spark, Freeboard Dashboard]
Joint work with Naeem Altaf and team
Thank You!
Backup
COSMOS
• Funding: EU FP7 at level of 2PY x 3 years
• Started: Sept 2013
• Coordinator: ATOS
• Technical partners: IBM, NTUA, Univ Surrey, Siemens, ATOS
• Use Case Partners: Hildebrand/Camden, EMT Madrid Bus Transport/Madrid Council, III Taiwan – Smart Cities use cases
• Project Vision: Enable ‘things’ to interact with each other based on shared experience, trust, reputation etc.
IBM Bluemix Data Analytics for IoT Architecture
Kafka + Secor
• What is it?
– Apache Kafka is a high throughput distributed publish/subscribe messaging system.
– Secor is an open source tool developed by Pinterest, which aggregates Kafka messages and saves them as S3 objects.
• What extensions were needed?
– Support for OpenStack Swift as a Secor target. We also added support for Parquet format and for annotating objects with metadata to support indexing and metadata search.
• What is the value of integration with Swift?
– Enables bringing new data and applications to Swift, which is an open source solution. Parquet and metadata search enable improved performance for batch analytics.
• Status
– We contributed OpenStack Swift support to the Secor community and it is now part of Secor.
Parquet
• What is it?
– A column-based, semi-structured, schema-based storage format supported by Hadoop and Spark. Enables column-wise compression and projection pushdown.
• What integration is needed?
– Since Swift is now part of the Hadoop ecosystem, no additional integration is needed. Data in Swift can be stored in Apache Parquet format, inheriting associated advantages.
• Status
– Spark SQL supports storing tabular data in Parquet format in Hadoop compatible storage systems such as Swift.
Elasticsearch
• What is it?
– A distributed, scalable, real-time search and analytics engine, built on Apache Lucene.
• What integration is needed?
– Index object metadata, allowing search for objects by attributes.
• What is the value of integration with Swift?
– Use search to select objects for further processing, e.g., relevant objects for analytics.
- Note that S3 does not yet have native search according to metadata.
• Status
– The IBM SoftLayer object service includes a basic implementation of metadata search; at IBM Research, we added extensions such as data type support and range searches.

Editor's Notes

  • #3: So what really is the Internet of Things? It is made up of physical objects (“things”) that have chips and sensors embedded in them that allow the sensing, capturing and communication of all types of data. These devices are then linked through both wired and wireless networks to the Internet. Advanced “things” have actuators embedded in them as well, giving them the capability to interact with other devices, computing systems and the external environment, including people. IoT takes this one step further: actuation. Quantity of data and quality of solution (actuation). Sensors have existed for a long time; think how many sensors you need to send a rocket into space. Today, however, this is not rocket science: sensors are becoming commodities, leading to adoption on a massive scale and enabling new applications, e.g. placing large numbers of sensors in agricultural fields to measure soil humidity and nutrient levels.
  • #4: Big data versus huge data. IoT data: typically sensor readings and associated data together with timestamps. Why so big? 1) There are many more networked devices than networked humans, and the number is growing fast. 2) Associated data can be video, audio, images and social networking data (yes, things will join social networks like humans do). So IoT is going to be the biggest big data. According to new research from International Data Corporation (IDC), the worldwide Internet of Things market will grow from $655.8 billion in 2014 to $1.7 trillion in 2020 with a compound annual growth rate (CAGR) of 16.9%. http://www.idc.com/getdoc.jsp?containerId=prUS25658015 http://www.emc.com/leadership/digital-universe/2014iview/internet-of-things.htm
  • #5: Sensors record speed and intensity of traffic Open data
  • #6: Swift is highly scalable and low cost in comparison to using a database. This is essential for keeping IoT data long term. What kind of data? Together with traffic speed and intensity data we also have camera images, which are less suitable for a database. Some kind of indexing is needed to make access efficient.
  • #7: Learn from historical data to respond in real time to live data
  • #8: Actuate: CEP responds according to thresholds. Possible actions: reroute buses, alter traffic light behavior, alert citizens, etc. The action to be taken is application specific.
  • #9: There is a need for both real time and historical data; real time is the green arrows. We are adding the historical dimension: the left cycle is real time only, and we added the larger cycle. The stopwatch graphic is from Wikimedia Commons.
  • #10: Object storage (OpenStack Swift) serves as a long term repository for IoT data. It is scalable and relatively low cost. By adding metadata to describe what is contained in each object, together with metadata search, we can access it efficiently. Databases are often overkill for what is needed by analytics. Partitioning needs to be done statically and is difficult to change later on. It is difficult to change the schema. Partitioning must be hierarchical, so there are advantages to selecting on columns high up in the hierarchy. HDFS versus Swift for analytics: Swift is more scalable and reliable (NameNode SPOF issue?), supports metadata, supports a stronger security model, and is more natural for storing some types of data, e.g. traffic images.
  • #12: Can depend on other elements of context like weather etc. Note: table is for one location only
  • #14: Importance of responding in time
  • #16: Value of actuation in real time, e.g. pipe leakage. There is potential to learn new insights by tapping into the historical data, e.g. as you heard in a previous talk by Intel, in healthcare, improving the quality of care for patients with Parkinson's disease. Insurance: pay as you drive and pay how you drive models. Pay where you drive? New business models. I’ve been thinking a lot about the last one since I arrived in Amsterdam. I think it's ironic that I lost my things on the way to a conference where I’m talking about the Internet of Things. Airlines, please work on this ;-)
  • #17: IBM Bluemix is IBM’s Platform as a Service offering. Based on Cloud Foundry. Bluemix has services for most of the components we used, for example there are Spark and object storage services as well as a MessageHub service based on Kafka. Together with a team led by Naeem Altaf, we ported this use case to run on Bluemix to give customers an example of an IoT use case that can be built on the platform. This work was demoed yesterday in Las Vegas at IBM’s Insight conference.