High-throughput data analysis: A Streaming Reports Platform
Authors: J Singh, Early Stage IT; David Zheng, Early Stage IT
Contributor: Satya Gupta, Virsec Systems
October 3, 2011

High-throughput data analysis
  • A few examples of streaming data problems
  • A concrete problem we solved
  • How we solved it
  • Take-away lessons

Streaming Data
  • Data arrives continuously
  • Must be processed continuously
  • Emit analysis results or alerts as needed

Examples of streaming data problems:
Security: Scrapers, Spammers, …
Monitoring and Alerts
Financial Markets
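
The streaming pattern described above (data arrives continuously, is processed continuously, and results or alerts are emitted as needed) can be illustrated with a small sketch. This is not the authors' code: the function name `rolling_alerts`, the window size, and the threshold are hypothetical choices made only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_alerts(
    readings: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[int, float]]:
    """Consume readings one at a time and yield (index, rolling mean)
    whenever the mean of the last `window` readings exceeds `threshold`.

    Hypothetical helper: the window and threshold rule are assumptions.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` readings
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean


# Readings are processed as they arrive; alerts appear as soon as they are due.
alerts = list(rolling_alerts([1, 1, 1, 10, 10, 1], window=3, threshold=5))
```

Because the function is a generator, it never needs the whole data set in memory, which is the property that distinguishes stream-mode analysis from batch analysis.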

Example Use Case
  • Resolve Virtual Machine (a product of Virsec Systems)
      • Profiles an application and gathers data about it
      • Data analysis was blocked until data collection was complete
      • It took several hours before conclusions could be drawn
  • Project goals
      • Stream-mode analysis
          • Begin within a few seconds of the start of profiling
          • Continuous update
      • Data rates up to 5 GB per hour
      • Ability to sustain that rate for 24 hrs
  • Analysis and Reporting configured to run in the Amazon EC2 environment
      • Can be scaled up (bigger machines)
      • Or scaled out (more machines)

Approach
  • A variant of Eric Ries’ Lean Startup approach
      • Introduced in his book, The Lean Startup, Crown Publishing Group, 2011
      • It is a recipe for learning quickly
      • Re-adapted by us for this and other “learning projects”
  • Do what it takes to get an end-to-end solution
  • Measure, Learn, Build, repeat
  • When the cycle has stabilized, expand scope appropriately

Requirements
  • Fast inserts into the database
  • The nature and amount of analysis required was hard to judge in the beginning
      • Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application
  • Slick, demo-worthy web interface for presenting results
  • Stream-mode operation
      • Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed
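
To put the project goals (up to 5 GB per hour, sustained for 24 hours) in terms of what "fast inserts" must deliver, the arithmetic can be worked through as follows. The sketch assumes decimal gigabytes (10^9 bytes); the slides do not say which unit was meant, and the function names are illustrative.

```python
GB = 10**9  # assumption: decimal gigabytes; the slides do not specify


def sustained_rate_bytes_per_second(gb_per_hour: float) -> float:
    """Average ingest rate implied by an hourly data volume."""
    return gb_per_hour * GB / 3600


def total_volume_gb(gb_per_hour: float, hours: float) -> float:
    """Total data collected when the hourly rate is sustained."""
    return gb_per_hour * hours


rate = sustained_rate_bytes_per_second(5)  # roughly 1.39 million bytes per second
total = total_volume_gb(5, 24)             # 120 GB over a full day
```

Under these assumptions the database has to absorb about 1.4 MB of new data every second, and a full 24-hour run accumulates about 120 GB.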

Key decisions
  • Chunk up data into 1-second “slices” as it arrives
      • Use a collection for signaling the availability of each data slice
      • Process each chunk as it becomes available
  • Use Map/Reduce for analysis
      • Exploit parallelism of the data by using as many processors as needed to maintain the “flow rate”
      • Pipeline the various Map/Reduce jobs to maintain sequentiality of data
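
The slicing and signaling decisions above can be sketched in plain Python. This is only an in-memory illustration under stated assumptions: `SliceStore` stands in for the data collection plus the signaling collection, records are assumed to arrive in time order, and none of the names come from the original system.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Deque, Dict, List, Optional, Tuple


class SliceStore:
    """In-memory stand-in (hypothetical) for the data collection and the
    signaling collection that announces completed 1-second slices."""

    def __init__(self) -> None:
        self.slices: Dict[int, List[Any]] = defaultdict(list)
        self.ready: Deque[int] = deque()  # completed slice ids, oldest first
        self._open: Optional[int] = None  # slice currently being filled

    def add(self, timestamp: float, payload: Any) -> None:
        """Store a record; assumes non-negative, time-ordered timestamps."""
        slice_id = int(timestamp)  # 1-second slices
        if self._open is not None and slice_id != self._open:
            self.ready.append(self._open)  # signal: previous slice is complete
        self._open = slice_id
        self.slices[slice_id].append(payload)

    def close(self) -> None:
        """Signal the final slice when the stream ends."""
        if self._open is not None:
            self.ready.append(self._open)
            self._open = None


def drain(store: SliceStore, analyze: Callable[[List[Any]], Any]) -> List[Tuple[int, Any]]:
    """Process every signaled slice in arrival order, preserving sequence."""
    results = []
    while store.ready:
        slice_id = store.ready.popleft()
        results.append((slice_id, analyze(store.slices.pop(slice_id))))
    return results


store = SliceStore()
for ts, payload in [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d")]:
    store.add(ts, payload)
first_batch = drain(store, len)  # slices 0 and 1 are complete; slice 2 is still open
store.close()
last_batch = drain(store, len)
```

Keeping the "ready" signal separate from the data lets analysis workers poll a small queue instead of scanning the incoming data, and handing out slices in order is one simple way to preserve the sequentiality the slide asks for.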

Pipeline Component: Listener
  • Goal: push the data into MongoDB as fast as possible
  • Receives the data from the Resolve Virtual Machine and stores it into MongoDB
  • Self-describing data
  • 12 different types of data fed over 12 different sockets
  • Written in C++
      • Socket interface at one end
      • MongoDB C++ driver at the other end

Pipeline Component: MongoDB
  • Goal: persistence
  • Did everything it was supposed to do
  • Allowed us to focus on our problem, not on MongoDB
  • Will use replica sets for making the data available to analysis servers

Pipeline Component: Map/Reduce (Analysis Program)
  • “Function Call Structure” data type
      • Calculation of T was better done in the listener, moved there
      • Could scale the solution up, but could not scale it out
      • Map/Reduce was much faster and could scale out
  • “Memory Usage” data type
      • Needed multiple map/reduce stages

“Function Call Structure” Analysis
  • Fields: FnName, TotalTime, SrcFnAddress, PID, SF_CFA
  • map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …)
  • Shuffle stage:
      FnName: CreateRaceObjects {
        {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
        {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
      }
  • reduce
  • Output:
      FnName: CreateRaceObjects {TotalTime: 20, min_addr: 1, NumOfCalls: 2, PID: 1, …}
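
The map, shuffle, and reduce stages of this analysis can be mimicked in plain Python to show how the grouped values collapse into one output record. This is a sketch, not the MongoDB Map/Reduce job itself: the function names are illustrative, the input records are assumed to already carry `min_addr`, and the combining rules (sum `TotalTime`, take the minimum `min_addr`, sum `NumOfCalls`) are assumptions. The slide does not state how `NumOfCalls` is combined; summing the four values shown gives 4, not the 2 printed in the slide's output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Value = Dict[str, int]


def map_call(record: dict) -> Tuple[str, Value]:
    """Emit (FnName, value) for one call record; each record counts as one call."""
    value = {
        "TotalTime": record["TotalTime"],
        "min_addr": record["min_addr"],
        "NumOfCalls": 1,
        "PID": record["PID"],
    }
    return record["FnName"], value


def shuffle(pairs: Iterable[Tuple[str, Value]]) -> Dict[str, List[Value]]:
    """Group emitted values by key, as the shuffle stage does."""
    groups: Dict[str, List[Value]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_calls(key: str, values: List[Value]) -> Value:
    """Combine values for one function (assumed rules: sum, min, sum)."""
    return {
        "TotalTime": sum(v["TotalTime"] for v in values),
        "min_addr": min(v["min_addr"] for v in values),
        "NumOfCalls": sum(v["NumOfCalls"] for v in values),
        "PID": values[0]["PID"],
    }


records = [
    {"FnName": "CreateRaceObjects", "TotalTime": 3, "min_addr": 2, "PID": 1},
    {"FnName": "CreateRaceObjects", "TotalTime": 7, "min_addr": 3, "PID": 1},
    {"FnName": "CreateRaceObjects", "TotalTime": 4, "min_addr": 1, "PID": 1},
    {"FnName": "CreateRaceObjects", "TotalTime": 6, "min_addr": 1, "PID": 1},
]
grouped = shuffle(map_call(r) for r in records)
output = {key: reduce_calls(key, values) for key, values in grouped.items()}
```

The reduce output has the same shape as the emitted values, so partial results can be reduced again without changing the answer. That property matters when the work is spread across several processors.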

Pipeline Component: Presentation
  • Each feed type presents a different view of the user’s application
  • And requires a page design of its own
  • Tool of choice: DjangoNonRel

  • But the Python driver for MongoDB was sufficient for most work
  • DjangoNonRel was not really required

Endpoint Stack
  • Data Capture (Listener)
      • Custom, preferably written in C++ or Java
  • NoSQL Database
      • MongoDB
      • Well suited for high-speed inserts
  • Calculation Platform
      • MongoDB Map/Reduce
      • Could use Hadoop but startup times are a concern
  • Presentation
      • Django Non-Rel

About Us
  • Involved with Map/Reduce and NoSQL technologies on several platforms
      • Many students in J’s Database Systems class at WPI did a project on a NoSQL database
  • DataThinks.org is a new service of Early Stage IT
      • Building and operating “Big Data” analytics services
  • Thanks