Simultaneous Analysis of Massive Data
Streams in Real-Time and Batch
Anjana Fernando
Technical Lead
WSO2
Agenda
• How massive data streams are created
• How to receive them
• How to store them
• How to analyze them: batch vs real-time
• WSO2 Big Data solution
• Demo
Massive Data Streams -> Data Streams with Big Data
What is Big Data?
❏ The 3 Vs
❏ Velocity
❏ Volume
❏ Variety
Where does it originate from?
• Machine logs
• Social media
• Archives
• Traffic information
• Weather data
• Sensor data (IoT)
What do I do with it?
Create intelligence..
• Should I take an umbrella to work today?
• What is the best route to go back home?
• What are the current market trends?
• Are my servers running healthily?
Protocols used to publish data..
• HTTP
• MQTT
• Zigbee
• Thrift
• Avro
• ProtoBuf
How to store the data?
• Relational databases
• Block data stores
-> HDFS
• Column oriented
-> HBase
-> Cassandra
• Document based
-> MongoDB
-> CouchDB
• In-Memory
-> VoltDB
[Figure: CAP triangle: Availability (A), Consistency (C), Partition tolerance (P)]
How to analyse data?
• Two options:
-> Batch processing: Schedule data processing jobs
and receive the processed data later
-> Real-time processing: Queries are executed on the data
as it arrives and the results are available almost instantly
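The two options can be contrasted on the same set of events. The following is a minimal sketch, not taken from the slides: the event list and the function names are hypothetical, and a real system would read from a data store or a network receiver instead of a Python list.

```python
from collections import defaultdict

# Hypothetical sales events: (brand, quantity)
EVENTS = [("A", 2), ("B", 1), ("A", 3), ("B", 4)]


def batch_totals(events):
    """Batch style: take the complete data set and compute the result in one job."""
    totals = defaultdict(int)
    for brand, quantity in events:
        totals[brand] += quantity
    return dict(totals)


def streaming_totals(events):
    """Real-time style: update state per event and emit a result immediately."""
    totals = defaultdict(int)
    for brand, quantity in events:
        totals[brand] += quantity
        yield brand, totals[brand]


if __name__ == "__main__":
    print(batch_totals(EVENTS))
    for update in streaming_totals(EVENTS):
        print(update)
```

The batch function returns one answer after seeing everything; the streaming function produces an updated answer after every event.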
Analysing data..
• Batch processing
-> Apache Hadoop: MapReduce processing system
and a distributed file system (HDFS)
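The MapReduce model can be illustrated with a word count over log lines. This is a single-process sketch of the map, shuffle and reduce steps only; it does not use the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, then shuffle, then reduce every key."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["error disk full", "error network"]))
```

In Hadoop the map and reduce calls run on many machines in parallel, and the shuffle moves data between them over the network.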
Analysing data..
• Batch processing - Data Warehouse
-> Apache Hive - Hadoop based framework for working
on large scale data stores with SQL-like queries
INSERT OVERWRITE TABLE UserTable
SELECT userName,
       COUNT(DISTINCT orderID),
       SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY userName;
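To see what the SELECT part of this query computes, the same aggregation can be tried on a few rows. The sketch below uses Python's built-in SQLite as a stand-in for Hive, which is an assumption made only to keep the example runnable; the table layout and the sample rows are hypothetical, and the INSERT OVERWRITE step is left out.

```python
import sqlite3


def summarize_sales(rows):
    """Return (userName, distinct order count, total quantity) per user
    for rows of stream version 1.0.0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PhoneSalesTable "
        "(userName TEXT, orderID TEXT, quantity INTEGER, version TEXT)"
    )
    conn.executemany("INSERT INTO PhoneSalesTable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT userName, COUNT(DISTINCT orderID), SUM(quantity)
        FROM PhoneSalesTable
        WHERE version = '1.0.0'
        GROUP BY userName
        ORDER BY userName
        """
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample rows: (userName, orderID, quantity, version)
SAMPLE_ROWS = [
    ("alice", "o1", 2, "1.0.0"),
    ("alice", "o1", 1, "1.0.0"),
    ("alice", "o2", 3, "1.0.0"),
    ("bob", "o3", 1, "1.0.0"),
    ("bob", "o4", 5, "2.0.0"),  # filtered out by the version condition
]

if __name__ == "__main__":
    print(summarize_sales(SAMPLE_ROWS))
```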
Analysing data..
• Batch processing - In-Memory Computing
-> Apache Spark - Functional programming model,
in-memory computing, claims to be 10x - 100x faster than
Hadoop MapReduce
Analysing data..
• Real-time processing - Stream Processing
-> Apache Storm - Distributed and fault-tolerant
[Figure: Storm topology built from spouts and bolts]
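In Storm, spouts are the sources of a stream and bolts are the processing steps that consume and emit tuples. The sketch below imitates that structure with Python generators in a single process; it does not use the Storm API, it is neither distributed nor fault-tolerant, and the function names are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes sentences and emits one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) for each input tuple."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and collect the emitted tuples."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    print(run_topology(["a b", "a"]))
```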
Analysing data..
• Real-time processing - Complex Event Processing
-> WSO2 Siddhi:
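Complex event processing evaluates standing queries over windows of recent events. The following is a minimal sketch of that idea in Python, not Siddhi's query language or API: the function name, the window size and the threshold are assumptions chosen for illustration.

```python
from collections import deque


def detect_large_orders(events, window_size=3, threshold=10):
    """Emit (brand, total) whenever the sum of quantities in the last
    `window_size` events of a brand exceeds `threshold`."""
    windows = {}
    for brand, quantity in events:
        # One bounded window per brand; the oldest entry drops out automatically.
        window = windows.setdefault(brand, deque(maxlen=window_size))
        window.append(quantity)
        total = sum(window)
        if total > threshold:
            yield brand, total


if __name__ == "__main__":
    sample = [("A", 4), ("A", 4), ("A", 4), ("B", 1), ("A", 1)]
    print(list(detect_large_orders(sample)))
```

The query stays registered while events flow through it, which is the reverse of a database where the data stays and queries come and go.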
Big Data Architecture with WSO2..
• Data Streams
{
  'name': 'phone.retail.shop',
  'version': '1.0.0',
  'nickName': 'Phone_Retail_Shop',
  'description': 'Phone Sales',
  'metaData': [
    {'name': 'clientType', 'type': 'STRING'}
  ],
  'payloadData': [
    {'name': 'brand', 'type': 'STRING'},
    {'name': 'quantity', 'type': 'INT'},
    {'name': 'total', 'type': 'INT'},
    {'name': 'user', 'type': 'STRING'}
  ]
}
The common stream format used in both CEP and BAM. The stream
definition contains the stream name, version and the other attributes
that make up the stream.
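A stream definition like this can be used to check events before they are published. The sketch below is an illustration only and not part of any WSO2 library: the `validate` function and the mapping of `STRING` and `INT` to Python types are assumptions.

```python
STREAM_DEFINITION = {
    "name": "phone.retail.shop",
    "version": "1.0.0",
    "metaData": [{"name": "clientType", "type": "STRING"}],
    "payloadData": [
        {"name": "brand", "type": "STRING"},
        {"name": "quantity", "type": "INT"},
        {"name": "total", "type": "INT"},
        {"name": "user", "type": "STRING"},
    ],
}

# Assumed mapping from the stream's type names to Python types.
PYTHON_TYPES = {"STRING": str, "INT": int}


def validate(definition, section, values):
    """Check that `values` match the attribute count and types of one
    section ('metaData' or 'payloadData') of the stream definition."""
    attributes = definition[section]
    if len(values) != len(attributes):
        return False
    # Exact type match, so that a bool is not accepted as an INT.
    return all(
        type(value) is PYTHON_TYPES[attribute["type"]]
        for attribute, value in zip(attributes, values)
    )


if __name__ == "__main__":
    print(validate(STREAM_DEFINITION, "payloadData", ["Acme", 2, 500, "alice"]))
    print(validate(STREAM_DEFINITION, "payloadData", ["Acme", "2", 500, "alice"]))
```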
Big Data Architecture with WSO2..
• WSO2 BAM
-> Data Receiver - High performance binary format data
publishing with Apache Thrift, shared with WSO2 CEP
-> Data Storage - Cassandra as a highly scalable data store
-> Data Analyzer - Hive based batch processing
Big Data Architecture with WSO2..
• WSO2 BAM..
-> Activity Monitoring: Implemented using a custom indexing
mechanism to instantly search for events of a specific activity in
the system
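The slides do not describe how the custom index is built. As a rough illustration of the general idea, the sketch below keeps a lookup from an activity ID to the positions of its events, so a search does not have to scan every stored event. The class and field names are hypothetical and this is not BAM's implementation.

```python
from collections import defaultdict


class ActivityIndex:
    """Stores events and indexes them by activity ID for direct lookup."""

    def __init__(self):
        self._events = []
        self._by_activity = defaultdict(list)

    def add(self, activity_id, event):
        """Store the event and record its position under the activity ID."""
        self._by_activity[activity_id].append(len(self._events))
        self._events.append(event)

    def search(self, activity_id):
        """Return all events of one activity, in arrival order."""
        return [self._events[i] for i in self._by_activity.get(activity_id, [])]


if __name__ == "__main__":
    index = ActivityIndex()
    index.add("order-1", {"step": "received"})
    index.add("order-2", {"step": "received"})
    index.add("order-1", {"step": "shipped"})
    print(index.search("order-1"))
```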
Big Data Architecture with WSO2..
• WSO2 BAM..
-> Incremental Data Processing - Customized Hive to support
incremental data processing:
@Incremental(name="salesAnalysis", tables="PhoneSalesTable")
SELECT brandname,
       COUNT(DISTINCT orderid),
       SUM(quantity)
FROM PhoneSalesTable
WHERE version = "1.0.0"
GROUP BY brandname;
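The idea behind incremental processing is that each run only touches rows that arrived since the previous run. The sketch below shows one way this could work and is not how the customized Hive is implemented: it assumes every row carries a strictly increasing timestamp, and the class and attribute names are hypothetical.

```python
class IncrementalSalesAnalysis:
    """Keeps aggregation state between runs and skips rows already processed."""

    def __init__(self):
        self.last_timestamp = None  # marker of the newest row processed so far
        self.order_ids = {}  # brand -> set of distinct order IDs
        self.quantities = {}  # brand -> total quantity

    def run(self, rows):
        """rows: iterable of (timestamp, brand, order_id, quantity).
        Returns the number of rows processed in this run."""
        processed = 0
        newest = self.last_timestamp
        for timestamp, brand, order_id, quantity in rows:
            if self.last_timestamp is not None and timestamp <= self.last_timestamp:
                continue  # handled by an earlier run
            self.order_ids.setdefault(brand, set()).add(order_id)
            self.quantities[brand] = self.quantities.get(brand, 0) + quantity
            newest = timestamp if newest is None else max(newest, timestamp)
            processed += 1
        self.last_timestamp = newest
        return processed

    def result(self):
        """Return brand -> (distinct order count, total quantity)."""
        return {b: (len(self.order_ids[b]), self.quantities[b]) for b in self.quantities}


if __name__ == "__main__":
    table = [(1, "A", "o1", 2), (2, "A", "o2", 3)]
    job = IncrementalSalesAnalysis()
    print(job.run(table), job.result())
    table += [(3, "B", "o3", 1), (4, "A", "o2", 1)]
    print(job.run(table), job.result())
```

Note that the distinct count needs the set of order IDs to be kept between runs; a plain counter would count a repeated order twice.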
Big Data Architecture with WSO2..
• WSO2 CEP
-> Same data receiver as BAM: this is the point where the
same event is sent to both servers, BAM for batch
processing and CEP for real-time processing of the same data
streams
-> Real-time in-memory processing, based on the WSO2 Siddhi
engine, with data adapters for receiving and sending events in
different data formats and transports, e.g. XML, JSON, Text, HTTP,
JMS, SMTP
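The shared receiver can be pictured as a fan-out point: each incoming event is handed to a batch path and to a real-time path. The following single-process sketch only illustrates that pattern; the classes are hypothetical stand-ins and do not reflect the actual BAM or CEP interfaces.

```python
class DataReceiver:
    """Fan-out point: every published event goes to all subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


class BatchStore:
    """Stand-in for the batch path: store now, analyze later."""

    def __init__(self):
        self.events = []

    def __call__(self, event):
        self.events.append(event)

    def total_quantity(self):
        return sum(event["quantity"] for event in self.events)


class RealTimeAlerts:
    """Stand-in for the real-time path: react to each event on arrival."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.alerts = []

    def __call__(self, event):
        if event["quantity"] > self.threshold:
            self.alerts.append(event)


if __name__ == "__main__":
    receiver = DataReceiver()
    store = BatchStore()
    alerts = RealTimeAlerts(threshold=5)
    receiver.subscribe(store)
    receiver.subscribe(alerts)
    for quantity in (2, 9, 1):
        receiver.publish({"brand": "A", "quantity": quantity})
    print(store.total_quantity(), alerts.alerts)
```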
Demo
Questions?
Thank you!
