Tagging and Processing Data in Real Time
Hari Shreedharan
Cloudera
Siddhartha Jain
Salesforce.com
Features
● Process and normalize log streams in near real time
○ regex matching
● Scale from 100k events per second to a million
○ More producers could get added in real time
○ Must scale to increasing data volumes
● Horizontal scalability and fault tolerance
○ Throwing more hardware at the app should not break the app
○ Machines/Tiers can fail and come back
● Ability to plug in existing frameworks
○ Kafka, HDFS, Elasticsearch, Spark, ...
Goals
● Search/Analysis
○ Investigations and casual exploration
● Compute
○ Data Science, correlation and alerting
● Enrich
○ For integrating external threat feeds and internally generated profiles
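The enrichment goal can be pictured as a lookup against reference data at processing time. The sketch below is only an illustration: the deck does not describe how enrichment is implemented, and `THREAT_FEED`, `USER_PROFILES`, the `threatMatch` and `userProfile` field names, and the sample values are all hypothetical.

```python
# Hypothetical external threat feed: a set of flagged source IPs.
THREAT_FEED = {"1.1.1.1", "4.4.4.4"}

# Hypothetical internally generated profiles, keyed by user name.
USER_PROFILES = {"jdoe": {"team": "infra"}}


def enrich(record, threat_feed=THREAT_FEED, profiles=USER_PROFILES):
    """Return a copy of the record with lookup results attached."""
    enriched = dict(record)
    # Flag the record when its source IP appears in the threat feed.
    enriched["threatMatch"] = record.get("srcIP") in threat_feed
    # Attach the internal profile for the user, if one exists.
    profile = profiles.get(record.get("user"))
    if profile is not None:
        enriched["userProfile"] = profile
    return enriched
```

The input record is copied rather than mutated, so the same event can still be forwarded unchanged to other consumers.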
Expectations
● Stream processing delay tolerance
○ Worst case < 5 minutes
● Volume of messages
○ Anywhere between 100k/sec and 1 million/sec
● Maintain a common data dictionary across ingest, store and compute pipelines
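One way to read the common data dictionary requirement is that every pipeline stage agrees on a single set of field names and types. A minimal sketch under that assumption follows; the field names come from the normalized JSON example later in the deck, while `DATA_DICTIONARY` and `validate` are hypothetical names, not the authors' implementation.

```python
# Hypothetical shared data dictionary: field name -> expected Python type.
# Field names are taken from the deck's normalized JSON example.
DATA_DICTIONARY = {
    "srcIP": str,
    "srcPort": str,
    "serviceType": str,
    "regexMatch": str,
    "user": str,
    "product": str,
    "@version": str,
    "@timestamp": str,
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Every dictionary field must be present with the expected type.
    for field, expected in DATA_DICTIONARY.items():
        if field not in record:
            problems.append("missing field: " + field)
        elif not isinstance(record[field], expected):
            problems.append("wrong type for field: " + field)
    # Fields outside the dictionary are reported rather than silently kept.
    for field in record:
        if field not in DATA_DICTIONARY:
            problems.append("unknown field: " + field)
    return problems
```

Running the same check in the ingest, store and compute stages is one way to keep the three pipelines from drifting apart.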
Possible solution stacks
Flume: Syslog Source -> File Channel -> Custom Interceptor -> Sink to HDFS
Storm: Custom Syslog Spout -> Custom Bolts -> Custom “Save to HDFS”
Spark Streaming: RSyslog Kafka Plugin -> Kafka -> Spark Streaming Receiver -> Custom Spark App -> Kafka -> Storage/Compute
Log collection and aggregation: Geo-distributed - RSyslog
● Log Normalization & Enrichment
○ Kafka
○ Spark
● Search and Transient LookUp Tables
○ LogStash (Transport)
○ ElasticSearch (Index/Store)
○ Kibana (Web UI / Visualization)
● Permanent Storage and Compute
○ Flume (Transport)
○ HDFS (Store)
○ Impala
○ Spark (+Streaming/SQL)
○ MapReduce
● Single CDH Cluster
<86>May 15 20:29:59 rh6-x64-test-template sshd[32632]: Accepted publickey for jdoe from 10.3.1.1 port 62902 ssh2
RSyslog
<86>May 15 20:29:59 rh6-x64-test-template sshd[32632]: Accepted publickey for jdoe from 10.3.1.1 port 62902 ssh2
Kafka (topic=rawUnStructured)
Spark (Apply Regex match/extract, map to JSON)
topic=NormalizedStructured
JSON = {
"srcIP": "10.3.1.1",
"srcPort": "62902",
"serviceType": "authentication",
"regexMatch": "sshdAcceptedSessionsLinux",
"user": "jdoe",
"product": "Linux",
"@version": "1",
"@timestamp": "2015-05-15T20:30:53.714Z"
}
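The regex match/extract step that turns the raw sshd line into this JSON could look roughly like the sketch below. It is not the authors' Spark code: the pattern `SSHD_ACCEPTED` and the `normalize` function are hypothetical, and setting `@timestamp` to the processing time is an assumption based on the example, where it is later than the time in the log line.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern for the sshd "Accepted publickey" line shown above.
SSHD_ACCEPTED = re.compile(
    r"^<(?P<pri>\d+)>"                    # syslog priority, e.g. <86>
    r"(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"    # event time, e.g. May 15 20:29:59
    r"(?P<host>\S+)\ssshd\[\d+\]:\s"      # host name and sshd[pid]:
    r"Accepted publickey for (?P<user>\S+) "
    r"from (?P<srcIP>\S+) port (?P<srcPort>\d+) ssh2$"
)


def normalize(line, now=None):
    """Map one raw syslog line to a normalized dict, or None if no match."""
    m = SSHD_ACCEPTED.match(line)
    if m is None:
        return None
    # Assumption: @timestamp records when the event was processed.
    now = now or datetime.now(timezone.utc)
    millis = "%03dZ" % (now.microsecond // 1000)
    return {
        "srcIP": m.group("srcIP"),
        "srcPort": m.group("srcPort"),
        "serviceType": "authentication",
        "regexMatch": "sshdAcceptedSessionsLinux",
        "user": m.group("user"),
        "product": "Linux",
        "@version": "1",
        "@timestamp": now.strftime("%Y-%m-%dT%H:%M:%S.") + millis,
    }
```

The returned dict can be serialized with `json.dumps` before it is written back to Kafka. In a Spark job, a function of this shape would typically be applied to each record of the raw stream, with lines that match no pattern handled separately.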
Kafka
topic=sshdAcceptedSessionsLinux
JSON = {
"srcIP": "10.3.1.1",
"srcPort": "62902",
"serviceType": "authentication",
"regexMatch": "sshdAcceptedSessionsLinux",
"user": "jdoe",
"product": "Linux",
"@version": "1",
"@timestamp": "2015-05-15T20:30:53.714Z"
}
topic=srcIP
1.1.1.1
2.2.2.2
4.4.4.4
topic=User
joe
jane
doe
john ...
topic=dstIP
10.1.1.1
10.2.2.2
7.2.2.2
3.3.3.3
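The fan-out on this slide, where one normalized record lands in several Kafka topics, can be sketched as a pure routing function. The topic names below are the ones shown on the slide; the `route` function, the `FIELD_TOPICS` mapping, and the choice to send bare field values to the per-field topics are assumptions read off the diagram. No Kafka client is used here, so the returned pairs would still need to be handed to a producer.

```python
import json

# Record field -> per-field Kafka topic, using the topic names on the slide.
FIELD_TOPICS = {"srcIP": "srcIP", "user": "User", "dstIP": "dstIP"}


def route(record):
    """Return (topic, payload) pairs for one normalized record."""
    payload = json.dumps(record, sort_keys=True)
    # Every normalized record goes to the common structured topic.
    out = [("NormalizedStructured", payload)]
    # It also goes to a topic named after the regex that matched it.
    if "regexMatch" in record:
        out.append((record["regexMatch"], payload))
    # Selected field values are published on their own topics.
    for field, topic in FIELD_TOPICS.items():
        if field in record:
            out.append((topic, record[field]))
    return out
```

Keeping the routing decision separate from the producer call makes it possible to check the fan-out with plain test data, without a running cluster.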
Production stats and lessons
● 100k EPS peak, 3 billion events/day
○ End-to-end delay of <20 seconds
● RSyslog to Kafka bottleneck
○ Too few producer instances: only 2 RSyslog->Kafka producers
● Spark specific configuration
○ Executors - 100
○ Memory - Total YARN memory: 201 GB
○ CPU - 400 virtual threads/cores
● Parallel scheduling vs Union + Inherent-Partitioning
○ Had to use concurrent jobs (an undocumented feature) until Spark 1.1
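The headline numbers on this slide can be put in proportion with a little arithmetic. The inputs below are the figures from the slide; the derived per-executor values assume resources are spread evenly across executors, which the deck does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
events_per_day = 3_000_000_000
peak_eps = 100_000
executors = 100
total_yarn_memory_gb = 201
virtual_cores = 400

# Average rate implied by the daily volume (about 34,722 events/sec).
average_eps = events_per_day / SECONDS_PER_DAY

# How much higher the peak is than the daily average (about 2.88x).
peak_to_average = peak_eps / average_eps

# Assumption: memory and cores are divided evenly among executors.
memory_per_executor_gb = total_yarn_memory_gb / executors
cores_per_executor = virtual_cores / executors

# Peak load each core must absorb under the same even-split assumption.
peak_eps_per_core = peak_eps / virtual_cores
```

Under these assumptions each executor has roughly 2 GB and 4 cores, and each core handles about 250 events per second at peak.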
Spark Streaming Choices
● Concurrent Jobs vs Union/Partitioning
● Kafka Read/Write choices
● Scale/Partition Kafka as per Spark cluster size
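The deck does not name the setting behind "concurrent jobs". The undocumented property usually meant by that phrase is `spark.streaming.concurrentJobs`, which controls how many streaming jobs the scheduler may run at once. The sketch below only renders `spark-submit` arguments from a list of properties; the property names are real Spark settings, but every value other than the executor count is an illustrative assumption, not the authors' configuration.

```python
# (property, value) pairs; values are illustrative unless noted.
SPARK_CONF = [
    ("spark.executor.instances", "100"),      # "Executors - 100" from the deck
    ("spark.executor.cores", "4"),            # assumption: 400 cores / 100 executors
    ("spark.streaming.concurrentJobs", "4"),  # undocumented; value is a guess
]


def spark_submit_args(conf):
    """Render properties as repeated --conf key=value arguments."""
    args = []
    for key, value in conf:
        args += ["--conf", key + "=" + value]
    return args
```

The alternative named on the slide, union plus partitioning, keeps a single job per batch and gets its parallelism from merging the input streams and repartitioning them, so it needs no scheduler setting.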
Issues
● None of the issues were because of Spark Streaming!
● Going from Spark 0.9 to 1.3
● YARN log aggregation
○ too verbose, crashing the executor nodes
● YARN client vs cluster mode
○ failed at first attempt
● Too many config options, switches/knobs
○ hard to test/identify bottlenecks/bugs
More Issues
● Can’t debug issues in an IDE
● Can’t easily debug or identify bottlenecks using test data or test flows
● If something eventually gets fixed in Spark, don’t bother trying to find the root cause of why it didn’t work earlier
○ You probably won’t be able to figure it out in a reasonable amount of time (100s of commits per release!)
What it’s not
● All open source, no proprietary data stores or compute frameworks
● It is a platform, adaptable to any big data problem; no silver bullet or magic sauce
● Uses scalable data stores, no massive monolithic databases
● Mostly plumbing/integration work, no massive code to maintain
● Runs on commodity hardware, no custom hardware
