An Architect's Guide to Building Real Time Big Data Systems
Raja SP
10 July 2014, Singapore
Lead Architect & Head of Products
< Real Time >
Big Data
WHY WHAT HOW
What is the right time to shoot me?
There is a rhythm in the universe
Telecom Marketing Scenario
Cell Utilisation is Low
In a Geo-Fence
High Balance
Frequent Visitor
High Data User in the Past
What is out there?
Square Kilometers of Arrays
Tens of Thousands of Antennae
Terabits of Data
Security / Intelligence
< Real Time >
Big Data
WHY WHAT HOW
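Partitioned Parallel Processing
[Diagram: DATA i, DATA j and DATA k, each processed by its own TASK]
Pipelined Parallel Processing
[Diagram: DATA flows through TASK i, then TASK j, then TASK k]
Hybrid Parallel Processing
[Diagram: DATA flows through a network of TASK i, TASK j, TASK k, TASK l and TASK m]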
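The partitioned and pipelined styles can be sketched in a few lines of plain Java. This is only an illustration of the data flow, not code from the talk: the class name `ParallelStyles` and both method names are hypothetical, the "task" is assumed to be a simple sum, and the pipeline composes its stages without overlapping them in time as a real pipelined system would.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class ParallelStyles {

    // Partitioned: the same task (here, a sum) runs on every partition,
    // and the partitions may be processed at the same time.
    static List<Integer> partitionedSums(List<List<Integer>> partitions) {
        return partitions.parallelStream()
                .map(p -> p.stream().mapToInt(Integer::intValue).sum())
                .collect(Collectors.toList());
    }

    // Pipelined: every item passes through task i, then task j, then task k.
    // Only the data flow is shown; the stages are not run concurrently here.
    static IntUnaryOperator pipeline(IntUnaryOperator... stages) {
        IntUnaryOperator chain = IntUnaryOperator.identity();
        for (IntUnaryOperator stage : stages) {
            chain = chain.andThen(stage);
        }
        return chain;
    }
}
```

For example, `ParallelStyles.pipeline(x -> x + 1, x -> x * 2, x -> x - 3).applyAsInt(5)` evaluates (5 + 1) * 2 - 3. A hybrid design would partition the data first and run a pipeline like this on each partition.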
Should Data go to Tasks?
Or
Tasks go to Data?
Static Data / Data at Rest
[Diagram: many TASKs brought to one stationary DATA store]
Streaming Data / Data in Motion
[Diagram: DATA items flowing past one stationary TASK]
Streaming Data / Data in Motion Analytics
The classic “Word Count” (Stream Computing Version)
[Diagram: word tuples (Java, Python, Lisp, C++) flow from a Token Splitter to three Counters and on to a Sink, which receives counts such as "Python 2"]
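The Word Count flow can be sketched in plain Java without any streaming framework. The sketch below is an assumption-laden illustration, not the speaker's code: `StreamingWordCount` and its methods are hypothetical names, a single in-memory counter stands in for the three Counters, and the "sink" is simply the returned list. What it does show is the streaming property: a running count is emitted for every word as it arrives, instead of one total at the end of a batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process stand-in for Splitter -> Counter -> Sink.
final class StreamingWordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    // Token splitter: one sentence tuple in, zero or more word tuples out.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Counter: updates the running count for a word and returns the new value.
    int count(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    // Processes one sentence and returns what would be sent to the sink:
    // one "word count" line per incoming word.
    List<String> process(String sentence) {
        List<String> emitted = new ArrayList<>();
        for (String w : split(sentence)) {
            emitted.add(w + " " + count(w));
        }
        return emitted;
    }
}
```

Feeding "Java Python" and then "Python Java" through one instance emits "Python 2" on the second sentence, matching the count shown at the sink on the slide.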
Stream Computing Programming Constructs
Stream
Tuple
Operator / Bolt
[Diagram: the Word Count flow (Token Splitter, Counters, Sink) labelled with these constructs]
IBM InfoSphere Streams    Apache Storm
Operator                  Bolt
Source Operator           Spout
Sink Operator             -------
Composite                 Topology
IBM InfoSphere Streams
// Sketch only: Split and Count stand for word-splitting and counting operators.
composite WordCountApp {
    graph
        stream<rstring sentence> Sentence = FileSource() {}
        stream<rstring word> Word = Split(Sentence) {}
        stream<rstring word, int32 count> Counts = Count(Word) {}
}
[Diagram: Source → Split → Count, connected by the streams Sentence, Word and Counts]
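Apache Storm
// The last argument of setSpout/setBolt is the parallelism hint.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("Source", new RandomSentenceSpout(), 5);
// Shuffle grouping spreads sentences randomly across the Split tasks.
builder.setBolt("Split", new SplitSentence(), 8).shuffleGrouping("Source");
// Fields grouping sends every occurrence of a given word to the same Count task.
builder.setBolt("Count", new WordCount(), 12).fieldsGrouping("Split", new Fields("word"));
[Diagram: Source → Split → Count]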
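The difference between the two groupings used in the topology can be illustrated in plain Java. This is a simplified model and an assumption on my part, not Storm's actual routing code: `GroupingSketch` and its methods are hypothetical, and Storm's own implementation differs in detail. The point it makes is why the Count bolt needs a fields grouping: every occurrence of a word must reach the same task for that task's count to be complete.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical model of how tuples are routed to one of numTasks bolt tasks.
final class GroupingSketch {

    // Fields grouping idea: the task index depends only on the field value,
    // so the same word is always routed to the same task.
    static int fieldsGrouping(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    // Shuffle grouping idea: any task may receive the tuple,
    // which spreads load evenly but gives no key affinity.
    static int shuffleGrouping(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }
}
```

With 12 Count tasks, `fieldsGrouping("Python", 12)` returns the same index on every call, whereas `shuffleGrouping(8)` can return a different Split task each time.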
IBM InfoSphere Streams – Some Operators
Functor: Perform tuple-level manipulations (~250 functions)
Filter: Remove some tuples from a stream
Aggregate: Group and summarize incoming tuples
Sort: Impose an order on incoming tuples in a stream
Join: Correlate two streams
Punctor: Insert window punctuation markers into a stream
IBM InfoSphere Streams – Some Operators (continued)
Barrier: Synchronize tuples from sequence-correlated streams
Pair: Group tuples from multiple streams of same type
Split: Forward tuples to output streams based on a predicate
ThreadedSplit: Distribute tuples over output streams by availability
Union: Construct an output tuple from each input tuple
DeDuplicate: Suppress duplicate tuples seen within a given time period
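As one concrete case, the behaviour described for DeDuplicate can be approximated in plain Java. This is a hedged sketch of the idea only, not the InfoSphere Streams operator: the class `DeDuplicateSketch`, its `accept` method, the use of a string key and caller-supplied timestamps are all assumptions, and the map of seen keys is never pruned, which a production operator would have to do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a time-based de-duplication operator.
final class DeDuplicateSketch {
    private final long periodMillis;
    private final Map<String, Long> lastForwarded = new HashMap<>();

    DeDuplicateSketch(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    // Returns true if the tuple should be forwarded downstream, false if it
    // duplicates a tuple with the same key forwarded within the period.
    boolean accept(String key, long timestampMillis) {
        Long previous = lastForwarded.get(key);
        if (previous != null && timestampMillis - previous < periodMillis) {
            return false; // duplicate inside the time period: suppress
        }
        lastForwarded.put(key, timestampMillis);
        return true;
    }
}
```

With a period of 1000 ms, a key seen at t = 0 is forwarded, the same key at t = 500 is suppressed, and the same key at t = 1000 is forwarded again.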
Stream Window
[Diagram: a window selecting a run of DATA tuples from a stream]
Aggregate
Sort
Join
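Operators such as Aggregate, Sort and Join need a window because a stream never ends, so they work on a bounded slice of it. Two common window kinds are tumbling and sliding; the sketch below shows both for a count-based window with a sum as the aggregate. The class and method names are hypothetical and the choice of a count-based window is an assumption; real engines also offer time-based and other eviction policies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical count-based windows over a stream of integers.
final class WindowSketch {

    // Tumbling window: emit one sum per full window, then start again empty.
    static List<Integer> tumblingSums(List<Integer> stream, int size) {
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        int n = 0;
        for (int v : stream) {
            sum += v;
            n++;
            if (n == size) {
                out.add(sum);
                sum = 0;
                n = 0;
            }
        }
        return out;
    }

    // Sliding window: once the window is full, emit the sum of the most
    // recent `size` tuples on every arrival, evicting the oldest tuple.
    static List<Integer> slidingSums(List<Integer> stream, int size) {
        List<Integer> out = new ArrayList<>();
        Deque<Integer> window = new ArrayDeque<>();
        int sum = 0;
        for (int v : stream) {
            window.addLast(v);
            sum += v;
            if (window.size() > size) {
                sum -= window.removeFirst();
            }
            if (window.size() == size) {
                out.add(sum);
            }
        }
        return out;
    }
}
```

For the stream 1, 2, 3, 4, 5 and a window of three tuples, the tumbling version emits a single sum (6) and leaves 4 and 5 waiting for a full window, while the sliding version emits 6, 9 and 12.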
< Real Time >
Big Data
WHY WHAT HOW
Streams Application Development Method
Runtime Components
IBM InfoSphere Streams: Instance (Management Host, Application Host)
Apache Storm: Cluster (Nimbus, ZooKeeper, Node 1, Node 2, Node 3)
Application Deployment Units
IBM InfoSphere Streams: Instance (Management Host; Application Host 1 running Processing Element 1 and Processing Element 2)
Apache Storm: Cluster (Management Node (Nimbus); ZooKeeper Node; Node 1 running Worker 1 and Worker 2, with Executors inside the workers)
High Availability & Adaptability
IBM InfoSphere Streams and Apache Storm
Optimizing scheduler assigns jobs to nodes, and continually manages resource allocation
Dynamically add Nodes and Jobs
Execution Units on Failed Nodes can be moved automatically with communications re-routed
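The failover behaviour can be pictured with a toy placement model in plain Java. This is not the scheduler of either product: `FailoverSketch`, the map-based placement and the least-loaded rule are all assumptions made for illustration. It only shows the core step, which is taking the execution units of a failed node and placing them on surviving nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: node name -> execution units currently placed on it.
final class FailoverSketch {

    // Returns a new placement in which the failed node is gone and each of its
    // execution units has been moved to the least-loaded surviving node.
    static Map<String, List<String>> reassign(Map<String, List<String>> placement,
                                              String failedNode) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : placement.entrySet()) {
            if (!e.getKey().equals(failedNode)) {
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        List<String> orphans = placement.getOrDefault(failedNode, new ArrayList<>());
        if (result.isEmpty() && !orphans.isEmpty()) {
            throw new IllegalStateException("no surviving node to take over");
        }
        for (String unit : orphans) {
            String target = null;
            for (String node : result.keySet()) {
                if (target == null || result.get(node).size() < result.get(target).size()) {
                    target = node;
                }
            }
            result.get(target).add(unit);
        }
        return result;
    }
}
```

A real runtime also has to detect the failure in the first place and reconnect the streams between moved units, which this model leaves out.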
Topic:
Organized by
UNICOM Trainings & Seminars Pvt. Ltd.
contact@unicomlearning.com
DEMO
Speaker name: Raja SP
Email ID: raja@knowesis.com
Thank You

Editor's Notes

  • #3: Enough chaos. What – architectural thinking, programming concepts. Streams, Storm – the map/reduce idea comes from Lisp (1958). The 80's game. Can't roll up sleeves and deploy a 1000-node system
  • #5: Option 1 – I am here and you are pointing your gun at me. Will you pull the trigger right now? OR Option 2 – Wait until 3 hours after I have left this place and THEN pull the trigger?
  • #6: Wife cooks rarely… I thank God for that… ½ km/s spin speed, 30 km/s orbit speed
  • #8: Radio Astronomy. Tycho Brahe. Uppsala University and the LOFAR Outrigger in Scandinavia (LOIS)
  • #9: NSA breakout – PRISM, Snowden. Torture the data and it will confess to anything. Fallacy – Endogeneity. Big Data has arrived but not big Analytics – Tim Harford – The Undercover Economist – Financial Times
  • #11: Singlish – sequential process – until cows come home oredy. Shared nothing data. Divide Data – Example – Calculating tax for all Singaporeans. Work hard and earn less group
  • #13: Hadoop – Map Reduce. Stream Computing
  • #15: Compare with map reduce. Splitter heuristics. Continuous running streams – transient counts… sorts, aggregates… windows. A man cannot bathe in the same river twice
  • #16: Tuple – composite of fields. Tuple Schema
  • #17: 2 Popular frameworks
  • #19: InputDeclarer
  • #20: Relational Operators
  • #21: Utility Operators
  • #22: Tumbling Windows Sliding Windows
  • #25: Describe the components Describe how they are deployed