Take Control of Your Data: CaliStream.com © CaliStream.com 2016
Meetup
Real-time Distributed Stream Processing @ Scale
Mountain View Meetup (March 2016)
Jerome Boulon
CaliStream.com
Take Control of Your Data
Jerome Boulon
CEO/Founder
jboulon@caliStream.com https://www.linkedin.com/in/jboulon
Quick History
1999 2008 2009 2010 2012
Yahoo!: Chukwa
Hadoop Monitoring Solution
Netflix: Honu
Data Collection Pipeline
CaliStream: Founder
Honu: Data As a Service
Monitoring Solution for cable modems/TV network
Ontology/Semantic Search
Acquired by Microsoft
CaliStream.com
Take Control of Your Data
TO ENABLE ANY COMPANY
TO QUICKLY LEVERAGE BIG DATA
AS A STRATEGIC ADVANTAGE
Big Data
• LIMITED TALENT WITH HADOOP KNOWLEDGE
• LIMITED TALENT TO BUILD SCALABLE SYSTEMS
• A TORRENT OF DATA
… BUT A STRATEGIC ADVANTAGE
THAT YOU CANNOT IGNORE
CONSULTING
• Architecture Design & Review
• Big Data Projects
• Distributed & Large Scale Projects
• Research Projects
CALISTREAM SAAS
Schema-less data processing pipeline
to easily stream large volumes of events
from your applications directly
to Hive/Hadoop
Agenda
• CaliStream Data Pipeline
• Stream Analysis:
– Samza
– Live Monitoring
– Network Analysis
– Samza New Features
• Q & A
Data Pipeline,
Some Challenges …
Challenges …
• Lack of talent
• Event format
• Schema evolution
• Time to market
• Rapidly changing infrastructure
• Collecting massive amounts of data live
• Batch & Real-time Analysis
• Load balancing, Auto-scaling, Discovery, etc.
• Being up 24/7 (no downtime!)
• Cost
• …
Traditional Pipeline Vs. Ease of Use
TRADITIONAL PIPELINE
• Schema Based
• Uncompressed
• Dedicated team
• Global schema
• Synchronization
• Upgrade/Downtime
EASE OF USE
• Schema-Less
• Hadoop Bin. Compressed
• Self-Service
• No Synchronization
• No Upgrade/Downtime
• Continuous integration
• Continuous delivery
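The schema-less side of this comparison can be illustrated with a minimal sketch. It is not the CaliStream API or wire format: JSON lines and the helper names `encode_event` and `read_field` are assumptions chosen only to show why a producer can add a field without a global schema change or a synchronized upgrade.

```python
import json


def encode_event(event_type, **fields):
    """Serialize an event as a self-describing JSON line (no shared schema)."""
    record = {"type": event_type, **fields}
    return json.dumps(record, sort_keys=True)


def read_field(line, name, default=None):
    """Read one field, tolerating events written before the field existed."""
    return json.loads(line).get(name, default)


# An older producer and a newer producer can coexist in the same stream.
old = encode_event("install", version="1.0")
new = encode_event("install", version="1.1", region="us-west")

print(read_field(old, "region", "unknown"))
print(read_field(new, "region", "unknown"))
```

Because each event carries its own field names, the reader decides how to handle a missing field instead of every producer and consumer agreeing on one schema version first.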
Our Solution: CaliStream
CaliStream provides
a SaaS data processing pipeline
to easily stream large volumes of events
from your applications directly to Hive/Hadoop
in a robust, scalable, and cost-effective way
without any prior Hadoop knowledge
CaliStream: Native integration with Hive
[Diagram labels: Sensor Data, Social, Click Stream, Location, Logs, …; CaliStream; Big Data]
Real-time Analysis
- Log Search
- User Activity
- Click Stream
- Customer Support
- …
Stream Processing
- Event Routing
- Data augmentation
- Data Security
- …
BI Analytics
Data Science
- Sensor Data
- Telemetry Data
- User behavior
- Location
- Social
- …
CaliStream
Everything in CaliStream
is represented as a Hive table
so you can easily analyze your data
Select […]
from […]
where […]
group by […] ;
• Managed Service
• Self-Service SDK
• Schema-Less API
• SQL Compliant
• Scalable
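The select / from / where / group by pattern on this slide can be tried out with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for Hive, and the `installs` table, its columns, and the `installs_by_version` helper are hypothetical names invented for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE installs (version TEXT, country TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO installs VALUES (?, ?, ?)",
    [
        ("1.0", "US", "ok"),
        ("1.0", "US", "error"),
        ("1.1", "FR", "ok"),
        ("1.1", "US", "ok"),
    ],
)


def installs_by_version(connection, status):
    """Count rows with the given status, grouped by application version."""
    query = """
        SELECT version, COUNT(*) AS n
        FROM installs
        WHERE status = ?
        GROUP BY version
        ORDER BY version
    """
    return connection.execute(query, (status,)).fetchall()


print(installs_by_version(conn, "ok"))
```

The same query shape should carry over to HiveQL, although Hive's parameter binding and client libraries differ from SQLite's.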
CaliStream: Data Lake
[Diagram labels: Apps, Clusters; REST API, JS API, Java API; CaliStream, CaliStream RT, …; Data, Collection, Analysis]
Ease of Use
CaliStream
1. Start
+ Online Signup
+ Download from CaliStream.com
2. Prototype
Implement a few lines of code
3. Deploy and collect data
Less than a week deployment cycle
VS
Others
1. Start
Bring/train talent in-house
and hire consultants
2. Prototype/Iterate
Understand Hadoop Ecosystem
Deploy/Fix cycles
3. Deploy and collect data
14-24 months deployment cycle
Ease of Use (Java, REST, MQTT, …)
CaliStream
Hive Table
Stream Analysis
Stream Analysis
• Gaming:
– Real-time Deployment Monitoring & Alerting
– Billions of events to analyze
Number of Downloads/Installs/Errors
GroupBy Version/Region/Country
Over 1-, 5-, and 10-minute windows
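The grouped, windowed counts described on this slide can be sketched as follows. This is plain Python, not Samza code, and it assumes tumbling (non-overlapping) windows, which the slide does not specify. The event tuple layout and the `window_counts` helper are hypothetical.

```python
from collections import Counter


def window_counts(events, window_seconds):
    """Count events per (window start, event type, version, country)."""
    counts = Counter()
    for ts, event_type, version, country in events:
        # Align the timestamp to the start of its tumbling window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, event_type, version, country)] += 1
    return counts


# (timestamp in seconds, event type, version, country)
events = [
    (0, "install", "1.0", "US"),
    (30, "install", "1.0", "US"),
    (59, "error", "1.0", "US"),
    (61, "install", "1.0", "US"),
    (299, "install", "1.1", "FR"),
    (300, "install", "1.1", "FR"),
]

one_minute = window_counts(events, 60)
five_minutes = window_counts(events, 300)

for key in sorted(five_minutes):
    print(key, five_minutes[key])
```

Running the same function with 60, 300, and 600 seconds gives the 1-, 5-, and 10-minute views. A production job would also need to expire old windows and handle late events, which this sketch leaves out.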
Apache Samza
• App logic decoupled from message transport
• Native Kafka Integration
• Scalable & Fault Tolerant
• Automatic State Management
• Etc.
• Top Level Apache Project
http://samza.apache.org/
Architecture
Kibana
• Data augmentation (IP -> Region, Country, ISP, etc.)
• Native Operators
• GroupBy
• State Management
• Windowed aggregation
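The data augmentation step (IP -> Region, Country, ISP) might look roughly like the sketch below. The lookup table is hypothetical and uses reserved documentation address ranges. A real deployment would presumably consult a GeoIP database, which the slides do not name.

```python
import ipaddress

# Hypothetical lookup table built on documentation-only address ranges.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"),
     {"region": "CA", "country": "US", "isp": "ExampleNet"}),
    (ipaddress.ip_network("198.51.100.0/24"),
     {"region": "IDF", "country": "FR", "isp": "ExampleTel"}),
]
UNKNOWN = {"region": None, "country": None, "isp": None}


def augment(event):
    """Return a copy of the event enriched with region, country, and ISP."""
    ip = ipaddress.ip_address(event["ip"])
    for network, attrs in GEO_TABLE:
        if ip in network:
            return {**event, **attrs}
    # No match: keep the event and mark the location fields as unknown.
    return {**event, **UNKNOWN}


print(augment({"ip": "192.0.2.10", "type": "install"}))
```

Enriching each event before the GroupBy step is what makes per-region and per-country aggregation possible downstream.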
Stream Analysis
• Network Security:
– Real-time DDoS and Breach Analysis
• Monitor, Detect, Mitigate
– Billions of events to analyze
• Player behavior
• Network events
• Audit Trail
• Etc.
Samza
• Native stream-stream join: join Player, Network, and Audit Trail activity streams
• Native RocksDB integration (local key/value storage)
• Native windowed aggregation: aggregate over 1-, 5-, and 10-minute windows
• Native log compaction & broadcast stream
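The idea behind a stream-stream join backed by local key/value state can be sketched in plain Python. This is not the Samza API: an in-memory dict stands in for the RocksDB store, and the `StreamJoiner` class and the stream names are hypothetical.

```python
from collections import defaultdict


class StreamJoiner:
    """Joins events from named streams on a shared key using local state."""

    def __init__(self, streams):
        self.streams = tuple(streams)
        # key -> {stream name: latest event}; stands in for a local KV store.
        self.store = defaultdict(dict)

    def process(self, stream, key, event):
        """Buffer the event; return a joined record once all streams have one."""
        self.store[key][stream] = event
        seen = self.store[key]
        if all(name in seen for name in self.streams):
            return {"key": key, **{name: seen[name] for name in self.streams}}
        return None


joiner = StreamJoiner(["players", "network", "audit"])
first = joiner.process("players", "user-1", {"action": "login"})
second = joiner.process("network", "user-1", {"bytes": 512})
third = joiner.process("audit", "user-1", {"change": "password"})

print(first, second)
print(third)
```

The sketch keeps only the latest event per stream and never evicts state. A real job would bound the buffer, for example by pairing the join with a time window.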
Apache Samza (New Features)
• Coordinator Stream (Dyn. config, etc.)
• Broadcast Stream (Update/Change Behavior)
• Host affinity & State reuse (Faster restart)
• Elasticsearch Producer (Write output to ES)
• Samza @ scale on EC2 (Netflix)
CaliStream.com
Take Control of Your Data
Jerome Boulon
CEO/Founder
jboulon@caliStream.com
