A REFERENCE
ARCHITECTURE
FOR INTERNET
OF THINGS
SUJEE MANIYAM
FOUNDER / PRINCIPAL @ ELEPHANTSCALE
SUJEE@ELEPHANTSCALE.COM
(c) Elephant Scale 2015
HI, I’M SUJEE MANIYAM
•  Founder / Principal @ ElephantScale
•  Consulting & Training in Big Data
•  Training in Spark / Hadoop / NoSQL / Data Science
•  Author
•  “Hadoop illuminated” open source book
•  “HBase Design Patterns”
•  Open Source contributor: github.com/sujee
•  sujee@elephantscale.com
•  http://sujee.net
Spark
Training
available!
INTERNET OF THINGS – A
REALITY
DATA INFRASTRUCTURE
?
DATA VOLUME ?
A NAPKIN CALCULATION
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
This will result in :
•  1.44 billion events / day !
•  1.44 TB / day !!
SENSOR DATA WORKSHEET
IoT Temperature Sensor Projection
•  sensors : 1 million (1.00E+06)
•  signal frequency : every minute (60 secs)
•  event size : 1 KB (1000 bytes)
•  events per day per sensor : 1440
•  total events per day : 1.44E+09 (1440 million = 1.44 billion)
•  total events / sec : 1.67E+04 (16,666.67)
•  total data size per day : 1.44E+12 bytes (1440 GB = 1.44 TB)
SENSOR DATA : TEXAS UTILITIES
SMART METER DATA
Texas Smart Meter Projections
•  sensors : 10 million customers (1.00E+07)
•  signal frequency : every 15 mins (900 secs)
•  event size : 1.4 KB (1400 bytes)
•  events per day per sensor : 96
•  total events per day : 9.60E+08 (960 million = 0.96 billion)
•  total events / sec : 1.11E+04 (11,111.11)
•  total data size per day : 1.34E+12 bytes (1344 GB = 1.344 TB)
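The worksheet arithmetic is easy to reproduce. The sketch below is one way to do it in Python; the function and field names (`project`, `Projection`) are made up for illustration, and 1 KB is treated as 1000 bytes, as in the worksheets.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class Projection:
    events_per_day_per_sensor: float
    total_events_per_day: float
    events_per_sec: float
    bytes_per_day: float


def project(sensors: int, interval_secs: int, event_bytes: int) -> Projection:
    """Reproduce the worksheet arithmetic for one fleet of sensors."""
    per_sensor = SECONDS_PER_DAY / interval_secs
    total = sensors * per_sensor
    return Projection(
        events_per_day_per_sensor=per_sensor,
        total_events_per_day=total,
        events_per_sec=total / SECONDS_PER_DAY,
        bytes_per_day=total * event_bytes,
    )


# Inputs copied from the two worksheets.
temperature = project(sensors=1_000_000, interval_secs=60, event_bytes=1000)
smart_meters = project(sensors=10_000_000, interval_secs=900, event_bytes=1400)

for name, p in [("temperature sensors", temperature), ("Texas smart meters", smart_meters)]:
    print(f"{name}: {p.total_events_per_day:.3g} events/day, "
          f"{p.events_per_sec:,.0f} events/sec, {p.bytes_per_day / 1e12:.3f} TB/day")
```

Changing the reporting interval or event size shows quickly how sensitive the daily volume is to each variable.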
BIG ‘DATA’
DATA VELOCITY
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
This will result in :
•  1 million events / minute
•  ~ 17,000 events / sec
DATA PROCESSING SPEED
•  Need (near) real time processing most of the time
•  E.g. Need to alert if temperature suddenly spikes
[Figure: temperature plotted against time; a sudden spike triggers an alert]
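A spike alert can be as simple as comparing each reading with the previous one. This is a minimal sketch in plain Python, not tied to any streaming engine; the 5-degree threshold and the name `spike_alerts` are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, temperature)


def spike_alerts(readings: Iterable[Reading],
                 max_rise: float = 5.0) -> Iterator[Tuple[int, float, float]]:
    """Yield (timestamp, previous_temp, current_temp) whenever the temperature
    rises by more than max_rise between two consecutive readings."""
    previous = None
    for ts, temp in readings:
        if previous is not None and temp - previous > max_rise:
            yield ts, previous, temp
        previous = temp


# One sensor reporting every minute; the jump at t=180 should raise an alert.
readings = [(0, 70.1), (60, 70.4), (120, 70.2), (180, 85.0), (240, 85.3)]
alerts = list(spike_alerts(readings))
print(alerts)
```

Because it is a generator, the same function works on an unbounded stream of readings as well as on a list.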
CHALLENGE = BIG DATA +
REAL TIME
•  Don’t lose events !
•  Any event could be important
•  Most events are mundane (e.g. temperature stays
between 68°F and 72°F)
•  Process them in near real time
•  Store the events for a long time
•  Audit
•  Diagnose
•  Support various queries
•  Real time (what is the latest temperature for sensor id
123?)
•  Aggregate (what is the avg. temp in zipcode 12345?)
HIGH LEVEL ARCHITECTURE
NEXT : (1) CAPTURE
(1) CAPTURE
REQUIREMENTS
Requirements:
•  Capture events coming at high speed
•  Tens of thousands of events / sec (sometimes millions / sec)
•  Don’t lose events
•  Tolerate hardware / software failure
•  Tolerate intermittent connectivity issues
•  Scale ‘easily’
(1) CAPTURE
CHOICES
•  MQ (RabbitMQ, etc.)
•  Good adoption in enterprises / durable
•  Fluentd
•  Data collector for various sources
•  Flume
•  Part of Hadoop ecosystem
•  Good for collecting logs from many sources
•  AWS Kinesis
•  Queue system in Amazon Cloud
•  Kafka
•  Distributed queue
(1) CAPTURE
MEET KAFKA
•  Apache Kafka is a distributed messaging system
•  Came out of LinkedIn… open sourced in 2011
•  Built to tolerate hardware / software / network failures
•  Built for high throughput and scale
•  LinkedIn : 220 billion messages / day
•  At peak : 3+ million messages / sec
(1) CAPTURE
KAFKA ARCHITECTURE
•  Publisher - subscriber / producer – consumer model
(1) CAPTURE
KAFKA ARCHITECTURE
•  Producers write data to brokers
•  Consumers read data from brokers
•  All of this is distributed / parallel
•  Failure tolerant
•  Data is stored as topics
•  “sensor_data”
•  “alerts”
•  “emails”
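The producer / broker / consumer model can be illustrated without a cluster. The toy below is an in-memory stand-in written in plain Python; it is not the Kafka client API, and the class names (`ToyBroker`, `ToyConsumer`) are hypothetical. It only shows the idea that a topic is an append-only log and that each consumer keeps its own read position.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[Any]] = defaultdict(list)

    def publish(self, topic: str, message: Any) -> int:
        """Append a message to a topic and return its offset in that topic."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic: str, offset: int, max_messages: int = 100) -> List[Any]:
        """Return up to max_messages starting at offset; reading removes nothing."""
        return self._logs[topic][offset:offset + max_messages]


class ToyConsumer:
    """Each consumer tracks its own offset, so consumers read independently."""

    def __init__(self, broker: ToyBroker, topic: str) -> None:
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self) -> List[Any]:
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages


broker = ToyBroker()
broker.publish("sensor_data", {"sensor_id": 123, "temp": 70.2})
broker.publish("sensor_data", {"sensor_id": 456, "temp": 71.0})
broker.publish("alerts", {"sensor_id": 123, "reason": "spike"})

dashboard = ToyConsumer(broker, "sensor_data")
archiver = ToyConsumer(broker, "sensor_data")
first_batch = dashboard.poll()   # both sensor_data messages
second_batch = dashboard.poll()  # empty: nothing new since the last poll
archived = archiver.poll()       # the archiver still sees both messages
```

A real deployment adds what the toy leaves out: partitioning, replication across brokers, and durable storage.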
(1) CAPTURE
KAFKA USERS
•  LinkedIn
•  Track user activities
•  Sending emails
•  Netflix
•  Real time monitoring
•  Spotify
•  Ship logs to Hadoop
•  Uber, Airbnb, …
(1) CAPTURE
ARCHITECTURE WITH KAFKA
NEXT : (2) PROCESSING
(2) PROCESSING
REQUIREMENTS
•  Process events in real time or near real time
•  High velocity
•  Tens of thousands → millions of events / sec
•  Guaranteed processing
•  Process an event at-least-once
•  Exactly-once (harder to achieve)
•  Failure tolerant
•  Scale ‘easily’
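With at-least-once delivery the same event can arrive twice, for example when it is redelivered after a failure. One common way to get an exactly-once effect is to make processing idempotent. The sketch below assumes every event carries a unique `event_id` (a field that is not in the slides) and keeps the set of seen ids in memory; a real system would need to persist that state.

```python
from typing import Dict, Iterable, Set


def process_at_least_once(deliveries: Iterable[dict],
                          totals: Dict[int, float],
                          seen: Set[str]) -> int:
    """Apply each event to the running totals once, even if it is delivered
    more than once. Returns the number of duplicate deliveries skipped."""
    duplicates = 0
    for event in deliveries:
        if event["event_id"] in seen:
            duplicates += 1
            continue
        sensor_id = event["sensor_id"]
        totals[sensor_id] = totals.get(sensor_id, 0.0) + event["value"]
        seen.add(event["event_id"])
    return duplicates


deliveries = [
    {"event_id": "e1", "sensor_id": 123, "value": 1.5},
    {"event_id": "e2", "sensor_id": 123, "value": 2.0},
    {"event_id": "e1", "sensor_id": 123, "value": 1.5},  # redelivered after a simulated failure
]
totals: Dict[int, float] = {}
seen: Set[str] = set()
skipped = process_at_least_once(deliveries, totals, seen)
print(totals, skipped)
```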
(2) PROCESSING
CHOICES
•  Storm
•  Process streams
•  Event based
•  Came out of Twitter
•  Apache Samza
•  Stream processing framework based on Kafka + Hadoop YARN
•  Apache NiFi
•  Data flow
•  New project / incubating
•  Spark Streaming
(2) PROCESSING
MEET SPARK
•  Spark is the new darling of the ‘Big Data’ world
•  Lots of activity and interest
•  Fast and Expressive Cluster Compute Engine
•  “First Big Data platform to integrate batch, streaming and
interactive computations in a unified framework” –
stratio.com
Hadoop
Spark
(2) PROCESSING
SPARK ECO-SYSTEM
•  Spark Core
•  Spark SQL : schema / SQL
•  Spark Streaming : real time
•  MLlib : machine learning
•  GraphX : graph processing
•  Cluster managers : Standalone / YARN / Mesos
(2) PROCESSING
SPARK DATA SOURCE
ABSTRACTION
•  Spark (compute engine) reads each data source through an RDD
•  Data sources : HDFS, Amazon S3, Cassandra, ???
•  RDD types : RDD, Hadoop RDD, Cassandra RDD
(2) PROCESSING
SPARK : ‘UNIFIED’ STACK
Spark supports multiple programming models
•  Map reduce style batch processing
•  Streaming / real time processing
•  Querying via SQL
•  Machine learning
All modules are tightly integrated
•  Facilitates rich applications
Spark can be the only stack you need !
Image: buymeposters.com
(2) PROCESSING
SPARK STREAMING
[Diagram: streaming sources → Spark Streaming → storage]
(2) PROCESSING
SPARK STREAMING
•  Provides ‘high level’ operations in time windows
•  E.g. ‘calculate X for the past 10 seconds’
•  Good adoption
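To make "calculate X for the past 10 seconds" concrete, here is a sliding-window average in plain Python. It is only an illustration of the windowing idea, not the Spark Streaming API, and it assumes events arrive in timestamp order; the class name `SlidingWindowAverage` is made up.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindowAverage:
    """Average of the values seen in the last window_secs seconds."""

    def __init__(self, window_secs: float = 10.0) -> None:
        self.window_secs = window_secs
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._sum = 0.0

    def add(self, ts: float, value: float) -> None:
        """Record a new event and drop events that fell out of the window."""
        self._events.append((ts, value))
        self._sum += value
        while self._events and self._events[0][0] <= ts - self.window_secs:
            _, old_value = self._events.popleft()
            self._sum -= old_value

    def average(self) -> Optional[float]:
        """Return the current window average, or None if the window is empty."""
        return self._sum / len(self._events) if self._events else None


window = SlidingWindowAverage(window_secs=10)
for ts, temp in [(0, 70.0), (4, 72.0), (8, 74.0), (12, 76.0)]:
    window.add(ts, temp)
print(window.average())  # the reading at t=0 is older than 10 seconds and no longer counts
```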
(2) PROCESSING
ARCHITECTURE WITH SPARK
STREAMING
NEXT : STORAGE
(3) STORAGE
REQUIREMENTS
•  Handle ‘Big Data’ ( 1 TB / day !)
•  Traditional storage is not effective (or too expensive)
•  Need two types of storage
1.  ‘forever’ storage
•  Store many terabytes of data for long periods
•  Support Batch queries
2.  ‘fast / real-time lookup’ storage
•  Query in real time (milliseconds)
“what is the latest reading for sensor-123 ?”
•  Store latest / new data (e.g. last 3 months)
•  Flexible schema for semi-structured data
•  Both need to scale
(3) STORAGE
CHOICES
•  ‘forever’ storage
•  Scalable distributed file systems
•  Hadoop ! (HDFS actually)
•  ‘real time store’
•  Traditional RDBMS won’t work
•  Doesn’t scale well (or too expensive)
•  Rigid schema layout
•  NoSQL !
(3) STORAGE
HDFS (IN 20 SECS)
•  Distributed file system
•  Runs on commodity servers
•  → high ROI
•  Can keep ticking even when nodes go down
•  → fault tolerant
•  Replicates data to prevent data loss in case of node failures
•  → built in backup ☺
•  Scales to petabytes (horizontal scalability)
•  Proven in the field
(3) STORAGE
HDFS ARCHITECTURE
(3) STORAGE
COST OF BIG DATA
Source : hortonworks
(3) STORAGE
HDFS
•  Can handle big data
•  Scales easily
•  Cost effective
•  “Source of Truth”
•  Files are immutable within HDFS (new data is ‘appended’ )
•  Audit friendly
(3) STORAGE
CAPACITY PLANNING (HADOOP)
Variables (tweak these)
•  Average daily ingest : 1000 GB
•  Raw data node storage : 36 TB (e.g. 12 disks x 3 TB)
•  Replication : 3 (default)
•  Space allocated for HDFS : 75% (HDFS 75% + MapReduce 25%)
•  Growth per month : 0 (not calculated)
Calculation
•  Effective data storage per node : 27 TB
•  Data size includes 3x replication; a month is counted as 30 days
•  After 1 month : 90 TB → 3.33 data nodes (round up to 4)
•  After 6 months : 540 TB → 20 data nodes
•  After 1 year : 1,080 TB → 40 data nodes
•  After 2 years : 2,160 TB → 80 data nodes
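The capacity worksheet can be turned into a small calculator. This sketch follows the same assumptions as the worksheet (30-day months, no monthly growth) and rounds the node count up, since a fraction of a node cannot be bought; the function name `hadoop_capacity` is made up for illustration.

```python
import math
from typing import Dict, Tuple


def hadoop_capacity(daily_ingest_gb: float, raw_node_tb: float, replication: int,
                    hdfs_fraction: float, days: int) -> Tuple[float, int]:
    """Return (replicated data size in TB, data nodes needed) after `days` of ingest."""
    stored_tb = daily_ingest_gb / 1000 * replication * days
    effective_node_tb = raw_node_tb * hdfs_fraction  # share of each node usable by HDFS
    return stored_tb, math.ceil(stored_tb / effective_node_tb)


# Horizons use 30-day months, matching the worksheet figures.
horizons: Dict[str, int] = {"1 month": 30, "6 months": 180, "1 year": 360, "2 years": 720}
plan = {
    label: hadoop_capacity(daily_ingest_gb=1000, raw_node_tb=36, replication=3,
                           hdfs_fraction=0.75, days=days)
    for label, days in horizons.items()
}
for label, (stored_tb, nodes) in plan.items():
    print(f"{label}: {stored_tb:,.0f} TB stored, {nodes} data nodes")
```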
(3) STORAGE (REAL TIME)
CHOICES FOR NOSQL
•  Too many ! ☺
•  HBase
•  Part of Hadoop ecosystem
•  Uses HDFS for storage
•  Provides consistent view of data
•  Cassandra
•  Popular NoSQL store
•  No Single Point of Failure (SPOF) – ring architecture
•  No dependency on Hadoop
•  Accumulo
•  Came out of NSA !
•  Uses HDFS for storage
•  Provides very good security (naturally !)
(3) STORAGE
CAP THEOREM
(3) STORAGE
ARCHITECTURE SO FAR
NEXT : QUERY
(4) QUERY
REQUIREMENTS
•  Real Time queries
•  “what is the latest reading for the sensor id = 123”
•  Useful for building applications / dashboards
•  Latency : milliseconds
•  Batch / Aggregate queries
•  “What is the average temperature in zip code = 12345 ?”
•  May need to scan a large number of data points
•  Latency : ‘batch’ (minutes / hours)
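The two query types stress storage in different ways. The sketch below uses a handful of made-up events in plain Python: the real-time query is a keyed lookup against a "latest reading" table that is maintained as events arrive, while the aggregate query has to scan every stored event. The field names (`sensor_id`, `zip`, `ts`, `temp`) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

events: List[dict] = [
    {"sensor_id": 123, "zip": "12345", "ts": 100, "temp": 70.0},
    {"sensor_id": 456, "zip": "12345", "ts": 110, "temp": 72.0},
    {"sensor_id": 123, "zip": "12345", "ts": 160, "temp": 71.0},
    {"sensor_id": 789, "zip": "67890", "ts": 170, "temp": 65.0},
]

# Real-time path: keep only the newest reading per sensor, keyed for a direct lookup.
latest: Dict[int, dict] = {}
for event in events:
    current = latest.get(event["sensor_id"])
    if current is None or event["ts"] > current["ts"]:
        latest[event["sensor_id"]] = event


def latest_reading(sensor_id: int) -> float:
    """Real-time query: one lookup, independent of how much history exists."""
    return latest[sensor_id]["temp"]


def average_temp_by_zip(all_events: List[dict]) -> Dict[str, float]:
    """Batch query: scans every stored event and aggregates per zip code."""
    sums: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0])
    for event in all_events:
        sums[event["zip"]][0] += event["temp"]
        sums[event["zip"]][1] += 1
    return {zip_code: total / count for zip_code, (total, count) in sums.items()}


print(latest_reading(123))
print(average_temp_by_zip(events))
```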
(4) QUERY
SOLUTIONS
•  Batch queries
•  Query data in HDFS (and / or NoSQL)
•  Hadoop MapReduce (Pig / Hive)
•  Spark batch analytics
•  Real time queries
•  Queries go to the NoSQL store
[Diagram: batch queries go to HDFS; real time queries go to NoSQL]
(4) QUERY
ARCHITECTURE
FINAL ARCHITECTURE
LAMBDA ARCHITECTURE
LAMBDA ARCHITECTURE
EXPLAINED
1.  All new data is sent to both batch layer and speed layer
2.  Batch layer
•  Holds master data set (immutable, append-only)
•  Answers batch queries
3.  Serving layer
•  Updates batch views so they can be queried ad hoc
4.  Speed Layer
•  Handles new data
•  Facilitates fast / real-time queries
5.  Query layer
•  Answers queries using batch & real-time views
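The layers above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the batch view is recomputed from the master data set up to a cutoff timestamp, the speed view covers only events after that cutoff, and a query merges the two. The example counts events per sensor; all names and data are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, int]  # (timestamp, sensor_id)


def count_by_sensor(events: Iterable[Event]) -> Counter:
    """View definition shared by both layers: number of events per sensor."""
    return Counter(sensor_id for _, sensor_id in events)


# Master data set: immutable and append-only, new events are only added at the end.
master_dataset: List[Event] = [(10, 123), (20, 123), (30, 456), (40, 123), (50, 456)]
batch_cutoff = 30  # the last batch run covered events with timestamp <= 30

# Batch layer: view computed over everything up to the cutoff.
batch_view = count_by_sensor(e for e in master_dataset if e[0] <= batch_cutoff)
# Speed layer: view over the recent events the batch run has not seen yet.
speed_view = count_by_sensor(e for e in master_dataset if e[0] > batch_cutoff)


def query(sensor_id: int) -> int:
    """Query layer: merge the batch view and the real-time view."""
    return batch_view[sensor_id] + speed_view[sensor_id]


print(query(123), query(456))
```

When the next batch run finishes, the cutoff moves forward and the speed view can discard everything the new batch view now covers.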
INCORPORATING LAMBDA
ARCHITECTURE
OUR ARCHITECTURE
•  Each component is scalable
•  Each component is fault tolerant
•  Incorporates best practices
•  All open source !
… AND ONE MORE THING…
•  Security !
Source : businessinsider.com
HOWEVER…
At scale nothing works as advertised !
GOOD NEWS !
•  We’d like to build an open source, reference data platform for IoT /
connected devices!
•  Yes, open source ! ☺
•  ElephantScale is a strong believer in open source
•  “Hadoop illuminated” – open source Hadoop book
•  Github.com/elephantscale
•  Best practices
•  Bringing together lots of expertise in Big Data systems
•  Register your interest
http://elephantscale.com/iotx/
GOALS FOR IOTX
http://elephantscale.com/iotx/
•  Use open-source proven components
•  Capture :
•  Kafka
•  Kinesis (AWS)
•  Processing : Spark Streaming
•  Batch storage : Hadoop / HDFS
•  Real Time Store : support multiple data stores
•  Cassandra
•  HBase
•  Accumulo
•  ???
GOALS FOR IOTX…
http://elephantscale.com/iotx/
•  Query templates using
•  Spark
•  Hadoop Map Reduce (Pig / Hive)
•  Incorporate third party libraries for
•  Outlier detection (temperature is outside norms)
•  Trend detection (stock price is trending up)
•  Alerts (fire !)
•  Monitoring & Metrics (key !!)
•  What’s going on in the system?
•  Host / system level (cpu / network, etc.) – easier
•  application level (e.g. find slow queries) – harder
•  Incorporate third party libraries
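As a rough idea of what outlier detection could look like before any third party library is chosen, the sketch below flags readings that fall more than `k` standard deviations from the mean of a baseline sample. The rule, the threshold `k = 3` and the function name are assumptions for illustration, not part of the IoTX plan.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_outliers(baseline: Sequence[float], readings: Sequence[float],
                  k: float = 3.0) -> List[float]:
    """Return the readings further than k sample standard deviations
    from the mean of the baseline sample."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [r for r in readings if abs(r - mu) > k * sigma]


# Baseline: normal readings in the 68 to 72 degree band mentioned earlier.
baseline = [68.0, 69.0, 70.0, 71.0, 72.0]
outliers = find_outliers(baseline, [69.5, 70.5, 95.0, 40.0])
print(outliers)
```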
THANKS AND QUESTIONS?
“A Reference Architecture for Internet of
Things (IoT)”
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com
Elephantscale.com
Project sign up page : http://elephantscale.com/iotx/
IMAGE CREDITS
•  www.engadget.com
•  Xfinity.com
•  Tesla.com

IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Internet of Things

  • 1. A REFERENCE ARCHITECTURE FOR INTERNET OF THINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANTSCALE SUJEE@ELEPHANTSCALE.COM (c) Elephant Scale 2015
  • 2. HI, I’M SUJEE MANIYAM •  Founder / Principal @ ElephantScale •  Consulting & Training in Big Data •  Training in Spark / Hadoop / NoSQL / Data Science •  Author •  “Hadoop illuminated” open source book •  “HBase Design Patterns •  Open Source contributor: github.com/sujee •  sujee@elephantscale.com •  http://sujee.net (c) Elephant Scale 2015 Spark Training available!
  • 3. INTERNET OF THINGS – A REALITY (c) Elephant Scale 2015
  • 5. DATA VOLUME ? A NAPKIN CALCULATION Say we have •  Million sensors •  Each sensor reports every minute •  data size 1KB This will result in : •  1.44 Billions events / day ! •  1.44 TB / day !! (c) Elephant Scale 2015
  • 6. SENSOR DATA WORKSHEET IoT Temperature Sensor Projection variables description sensors 1M 1.00E+06 1 million signal frequency every min / 60 secs 60 secs event size 1KB 1000 bytes events per day per sensor 1440 total events per day 1.44E+09 1440$$millions1.44$$billion total events / sec 1.67E+04 16,666.67$$ total data size per day 1.44E+12 1440$$$GB 1.44$$TB (c) Elephant Scale 2015
  • 7. SENSOR DATA : TEXAS UTILITIES SMART METER DATA Texas Smart Meter Projections variables description sensors 10 million customers 1.00E+07 10 million signal frequency every 15 mins 900 secs event size 1.4 K 1400 bytes events per day per sensor 96 total events per day 9.60E+08 960$$millions 0.96$$billion total events / sec 1.11E+04 11,111.11$$ total data size per day 1.34E+12 1344$$$GB 1.344$$TB (c) Elephant Scale 2015
  • 9. DATA VELOCITY Say we have •  Million sensors •  Each sensor reports every minute •  data size 1KB è •  Millions events / minute •  ~ 17,000 events / sec (c) Elephant Scale 2015
  • 10. DATA PROCESSING SPEED •  Need (near) real time processing most of the time •  E.g. Need to alert if temperature suddenly spikes temp Time Alert (c) Elephant Scale 2015
  • 11. CHALLENGE = BIG DATA + REAL TIME •  Don’t loose events ! •  Any event could be important •  Most events are mundane (e.g. temperature stays between 68’F – 72’ F) •  Process them in near real time •  Store the events for a long time •  Audit •  Diagnose •  Support various queries •  Real time (what is the latest temperature for sensor id 123?) •  Aggregate (what is the avg. temp in zipcode 12345) (c) Elephant Scale 2015
  • 12. HIGH LEVEL ARCHITECTURE (c) Elephant Scale 2015
  • 13. NEXT : (1) CAPTURE (c) Elephant Scale 2015
  • 14. (1) CAPTURE REQUIREMENTS Requirements: •  Capture events coming at high speed •  Tens of thousands events / sec (some times millions / sec) •  Don’t loose events •  Tolerate hardware / software failure •  Tolerate intermittent connectivity issues •  Scale ‘easily’ (c) Elephant Scale 2015
  • 15. (1) CAPTURE CHOICES •  MQ (RabbitMQ ..etc) •  Good adoption in enterprises / durable •  FluentD •  Data collector for various sources •  Flume •  Part of Hadoop eco system •  Good for collecting logs from many sources •  AWS Kinesis •  Queue system in Amazon Cloud •  Kafka •  Distributed queue (c) Elephant Scale 2015
  • 16. (1) CAPTURE MEET KAFKA •  Apache Kafka is a distributed messaging system •  Came out of LinkedIn… open sourced in 2011 •  Built to tolerate hardware / software / network failures •  Built for high throughput and scale •  LinkedIn : 220 Billion messages / day •  At peak : 3+ million messages / day (c) Elephant Scale 2015
  • 17. (1) CAPTURE KAFKA ARCHITECTURE •  Publisher - subscriber / producer – consumer model (c) Elephant Scale 2015
  • 18. (1) CAPTURE KAFKA ARCHITECTURE •  Producers write data to brokers •  Consumers read data from brokers •  All of this is distributed / parallel •  Failure tolerant •  Data is stored as topics •  “sensor_data” •  “alerts” •  “emails” (c) Elephant Scale 2015
  • 19. (1) CAPTURE KAFKA USERS •  Linked In •  Track user activities •  Sending emails •  Netflix •  Real time monitoring •  Spotify •  Ship logs to hadoop •  Uber… AirBnB…. (c) Elephant Scale 2015
  • 20. (1) CAPTURE ARCHITECTURE WITH KAFKA (c) Elephant Scale 2015
  • 21. NEXT : (2) PROCESSING (c) Elephant Scale 2015
  • 22. (2) PROCESSING REQUIREMENTS •  Process events in real time or near real time •  High velocity •  Tens of thousands ! millions of events / sec •  Guaranteed processing •  Process an event at-least-once •  Exactly-once (harder to achieve) •  Failure tolerant •  Scale ‘easily’ (c) Elephant Scale 2015
  • 23. (2) PROCESSING CHOICES •  Storm •  Process streams •  Events based •  Came out of twitter •  Apache Samza •  Stream processing framework based on Kafka + Hadoop YARN •  Apache NiFi •  Data flow •  New project / incubating •  Spark streaming (c) Elephant Scale 2015
  • 24. (2) PROCESSING MEET SPARK •  Spark is the new darling of ‘Big Data’ world •  Lot’s of activity and interest •  Fast and Expressive Cluster Compute Engine •  “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com Hadoop Spark (c) Elephant Scale 2015
  • 25. (2) PROCESSING SPARK ECO-SYSTEM (c) Elephant Scale 2015 Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning Stand alone YARN MESOS Cluster managers GraphX Graph processing
  • 26. (2) PROCESSING SPARK DATA SOURCE ABSTRACTION Spark (compute engine) HDFS Amazon S3 Cassandra ??? RDD Hadoop RDD Cassandra RDD (c) Elephant Scale 2015
  • 27. (2) PROCESSING SPARK : ‘UNIFIED’ STACK Spark supports multiple programming models •  Map reduce style batch processing •  Streaming / real time processing •  Querying via SQL •  Machine learning All modules are tightly integrated •  Facilitates rich applications Spark can be only stack you need ! (c) Elephant Scale 2015 Image: buymeposters.com
  • 29. (2) PROCESSING SPARK STREAMING •  Provides ‘high level’ operations in time windows •  E.g. ‘calculate X for the past 10 seconds’ •  Good adoption (c) Elephant Scale 2015
  • 30. (2) PROCESSING ARCHITECTURE WITH SPARK STREAMING (c) Elephant Scale 2015
  • 31. NEXT : STORAGE (c) Elephant Scale 2015
  • 32. (3) STORAGE REQUIREMENTS •  Handle ‘Big Data’ ( 1 TB / day !) •  Traditional storages are not effective (or too expensive) •  Need two types of storage 1.  ‘forever’ storage •  Store multi terabytes of data for a long periods •  Support Batch queries 2.  ‘fast / real-time lookup’ storage •  Query in real time (milliseconds) “what is the latest reading for sensor-123 ?” •  Store latest / new data (e.g. last 3 months) •  Flexible schema for semi-structured data •  Both need to scale (c) Elephant Scale 2015
  • 34. (3) STORAGE CHOICES •  ‘forever’ storage •  Scalable distributed file systems •  Hadoop! (HDFS actually) •  ‘real time store’ •  Traditional RDBMS won’t work •  Doesn’t scale well (or too expensive) •  Rigid schema layout •  NoSQL!
  • 35. (3) STORAGE HDFS (IN 20 SECS) •  Distributed file system •  Runs on commodity servers •  → high ROI •  Can keep ticking even when nodes go down •  → fault tolerant •  Replicates data to prevent data loss in case of node failures •  → built in backup ☺ •  Scales to petabytes (horizontal scalability) •  Proven in the field
  • 36. (3) STORAGE HDFS ARCHITECTURE
  • 37. (3) STORAGE COST OF BIG DATA Source : hortonworks
  • 38. (3) STORAGE HDFS •  Can handle big data •  Scales easily •  Cost effective •  “Source of Truth” •  Files are immutable within HDFS (new data is ‘appended’ ) •  Audit friendly
  • 39. (3) STORAGE CAPACITY PLANNING (HADOOP) Variables (tweak these): •  Average daily ingest: 1000 GB •  Raw data node storage (e.g. 12 disks x 3 TB): 36 TB •  Replication (default 3): 3 •  Space allocated for HDFS (HDFS 75% + MapReduce 25%): 75% •  Growth per month (not calculated): 0 Calculation: •  Effective data storage per node: 27 TB •  After 1 month: data size 90 TB, 3.33 data nodes •  After 6 months: data size 540 TB, 20 data nodes •  After 1 year: data size 1,080 TB, 40 data nodes •  After 2 years: data size 2,160 TB, 80 data nodes
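The capacity worksheet above can be reproduced with a few lines of arithmetic. The sketch below is one way to express it; the function name `hadoop_capacity_plan` and its parameter names are hypothetical. It assumes 30-day months (so one year is 360 days), which is what the worksheet figures imply, and it ignores monthly growth just as the worksheet does.

```python
import math


def hadoop_capacity_plan(days, daily_ingest_tb=1.0, raw_tb_per_node=36.0,
                         replication=3, hdfs_fraction=0.75):
    """Return (stored data in TB, fractional data nodes, whole data nodes).

    Defaults mirror the worksheet: 1000 GB/day ingest, 12 x 3 TB disks
    per node, replication factor 3, and 75% of raw disk given to HDFS.
    """
    # Usable HDFS space per node: 36 TB * 0.75 = 27 TB.
    usable_tb_per_node = raw_tb_per_node * hdfs_fraction
    # Every ingested byte is stored `replication` times.
    data_tb = daily_ingest_tb * days * replication
    nodes = data_tb / usable_tb_per_node
    return data_tb, nodes, math.ceil(nodes)


# Horizons from the worksheet, using 30-day months.
for label, days in [("1 month", 30), ("6 months", 180),
                    ("1 year", 360), ("2 years", 720)]:
    data_tb, nodes, whole_nodes = hadoop_capacity_plan(days)
    print(f"{label}: {data_tb:,.0f} TB, {nodes:.2f} data nodes "
          f"(round up to {whole_nodes})")
```

The fractional node count matches the worksheet; in practice the value is rounded up, since a cluster cannot run a third of a node.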
  • 40. (3) STORAGE (REAL TIME) CHOICES FOR NOSQL •  Too many! ☺ •  HBase •  Part of Hadoop eco system •  Uses HDFS for storage •  Provides consistent view of data •  Cassandra •  Popular NoSQL store •  No Single Point of Failure (SPOF) – ring architecture •  No dependency on Hadoop •  Accumulo •  Came out of NSA! •  Uses HDFS for storage •  Provides very good security (naturally!)
  • 41. (3) STORAGE CAP THEOREM
  • 42. (3) STORAGE ARCHITECTURE SO FAR
  • 43. NEXT : QUERY
  • 44. (3) QUERY REQUIREMENTS •  Real Time queries •  “What is the latest reading for sensor id = 123?” •  Useful for building applications / dashboards •  Latency : milliseconds •  Batch / Aggregate queries •  “What is the average temperature in zip code = 12345?” •  May need to go through a large number of data points •  Latency : ‘batch’ (minutes / hours)
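The difference between the two query types can be sketched in plain Python. This is an illustration only, not HBase, Cassandra, or Hadoop code: the record layout, the sample values, and the function names are all made up. The real-time query is a keyed lookup against a small index of latest values, while the batch query scans every reading.

```python
from collections import defaultdict

# Made-up sample data: (sensor_id, zip_code, timestamp_secs, temperature).
SAMPLE_READINGS = [
    ("sensor-123", "12345", 0, 20.0),
    ("sensor-123", "12345", 60, 22.0),
    ("sensor-456", "12345", 30, 24.0),
    ("sensor-789", "94000", 10, 15.0),
]


def build_latest_index(readings):
    """Real-time side: keep only the newest reading per sensor."""
    latest = {}
    for sensor_id, _zip_code, ts, temp in readings:
        if sensor_id not in latest or ts > latest[sensor_id][0]:
            latest[sensor_id] = (ts, temp)
    return latest


def average_temperature_by_zip(readings):
    """Batch side: scan all readings and aggregate per zip code."""
    totals = defaultdict(lambda: [0.0, 0])  # zip -> [sum, count]
    for _sensor_id, zip_code, _ts, temp in readings:
        totals[zip_code][0] += temp
        totals[zip_code][1] += 1
    return {z: total / count for z, (total, count) in totals.items()}


latest = build_latest_index(SAMPLE_READINGS)
print(latest["sensor-123"])                              # keyed lookup
print(average_temperature_by_zip(SAMPLE_READINGS)["12345"])  # full scan
```

The keyed lookup stays fast regardless of how much history exists, which is why it suits a NoSQL store; the aggregate has to touch every reading, which is why it is treated as a batch job.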
  • 45. (3) QUERY SOLUTIONS •  Batch queries •  Query data in HDFS (and/or NoSQL) •  Hadoop MapReduce (Pig / Hive) •  Spark batch analytics •  Real time queries •  Queries go to the NoSQL store
  • 49. LAMBDA ARCHITECTURE EXPLAINED 1.  All new data is sent to both batch layer and speed layer 2.  Batch layer •  Holds master data set (immutable, append-only) •  Answers batch queries 3.  Serving layer •  Updates batch views so they can be queried ad hoc 4.  Speed layer •  Handles new data •  Facilitates fast / real-time queries 5.  Query layer •  Answers queries using batch & real-time views
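The five layers above can be sketched as a toy, single-process Python class that counts events per sensor. This is a deliberately simplified illustration under stated assumptions, not how a production system is built: the class `LambdaCounts` and its method names are hypothetical, and the batch recomputation runs synchronously here, whereas a real batch layer would run as a separate, long-running job.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda-style event counter (hypothetical, for illustration)."""

    def __init__(self):
        self.master = []                # batch layer: append-only master data set
        self.batch_view = Counter()     # serving layer: precomputed batch view
        self.realtime_view = Counter()  # speed layer: covers data since last batch run

    def ingest(self, sensor_id):
        """Step 1: send each new event to both batch and speed layers."""
        self.master.append(sensor_id)
        self.realtime_view[sensor_id] += 1

    def run_batch(self):
        """Steps 2-3: recompute the batch view from the full master data set."""
        self.batch_view = Counter(self.master)
        # The batch view now covers everything, so the speed layer starts over.
        self.realtime_view.clear()

    def query(self, sensor_id):
        """Step 5: answer by combining batch and real-time views."""
        return self.batch_view[sensor_id] + self.realtime_view[sensor_id]
```

Because the batch view is always rebuilt from the immutable master data set, a bug in the speed layer only affects results until the next batch run.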
  • 51. OUR ARCHITECTURE •  Each component is scalable •  Each component is fault tolerant •  Incorporates best practices •  All open source !
  • 52. … AND ONE MORE THING… •  Security ! Source : businessinsider.com
  • 53. HOWEVER… At scale nothing works as advertised!
  • 54. GOOD NEWS ! •  We’d like to build an open source, reference data platform for IoT / connected devices! •  Yes, open source! ☺ •  ElephantScale is a strong believer in open source •  “Hadoop illuminated” – open source Hadoop book •  Github.com/elephantscale •  Best practices •  Bringing together lots of expertise in Big Data systems •  Register your interest http://elephantscale.com/iotx/
  • 55. GOALS FOR IOTX http://elephantscale.com/iotx/ •  Use open-source proven components •  Capture : •  Kafka •  Kinesis (AWS) •  Processing : Spark Streaming •  Batch storage : Hadoop / HDFS •  Real Time Store : support multiple data stores •  Cassandra •  HBase •  Accumulo •  ???
  • 56. GOALS FOR IOTX… http://elephantscale.com/iotx/ •  Query templates using •  Spark •  Hadoop MapReduce (Pig / Hive) •  Incorporate third party libraries for •  Outlier detection (temperature is outside norms) •  Trend detection (stock price is trending up) •  Alerts (fire!) •  Monitoring & Metrics (key!!) •  What’s going on in the system? •  Host / system level (CPU / network, etc.) – easier •  Application level (e.g. find slow queries) – harder •  Incorporate third party libraries
  • 57. THANKS AND QUESTIONS? “A Reference Architecture for Internet of Things (IoT)” Sujee Maniyam Founder / Principal @ ElephantScale Expert Consulting + Training in Big Data technologies sujee@elephantscale.com Elephantscale.com Project sign up page : http://elephantscale.com/iotx/
  • 58. IMAGE CREDITS •  www.engadget.com •  Xfinity.com •  Tesla.com