SPARK SUMMIT EUROPE 2016
FROM SINGLE-TENANT HADOOP
TO 3000 TENANTS IN APACHE SPARK
RUBEN PULIDO
BEHAR VELIQI
IBM WATSON ANALYTICS FOR SOCIAL MEDIA
IBM GERMANY R&D LAB
WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
WATSON ANALYTICS
A data analytics solution for business users in the cloud
WATSON ANALYTICS FOR SOCIAL MEDIA
◉ Part of Watson Analytics
◉ Allows users to harvest and analyze social media content
◉ To understand how brands, products, services, social issues, etc. are perceived
Cognos BI
Text Analytics
BigInsights
DB2
Evolving Topics
End to End Orchestration
Content is stored in HDFS, analyzed within our components, stored in HDFS again, and loaded into DB2
Our previous architecture:
a single-tenant “big data” ETL batch pipeline
Photo by Ghislain Bonneau, gbphotodidactical.ca
What people think about Social Media data rates
What Stream Processing is really good at
Our requirement:
Efficiently processing
“trickle feeds” of data
What the size of a customer specific
social media data-feed really is
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
Step 1: Revisit our analytics workload
Concept detection
  “Everyone I ask says my PhoneA is way better than my brother’s PhoneB”
Concept-level sentiment detection
  PhoneA: Positive
  PhoneB: Negative
Author features detection
  “My wife’s PhoneA died yesterday.” → Author: Is Married = true
  “My son wants me to get him PhoneB.” → Author: Has Children = true
Creation of author profiles
  Author: Is Married = true, Has Children = true
Step 2: Assign the right processing
model for the right workload
Document-level analytics
Concept detection
Sentiment detection
Extracting author features
◉ Can easily work on a data stream
◉ User gets additional value from
every document
Collection-level analytics
Consolidation of author profiles
Influence detection


◉ Most algorithms need a batch of
documents (or all documents)
◉ User value is getting the “big picture”
Step 3: Turn “trickle feeds” into a stream through multitenancy
◉ Document-level analytics makes sense on a stream
◉ ...document analysis for a single tenant often does not justify stream processing
◉ ...but what if we process all documents for all tenants in a single system?
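A minimal sketch of the idea behind Step 3, assuming each tenant's feed is an iterable of `(timestamp, tenant_id, text)` tuples already sorted by timestamp. The tuple layout and the `merge_feeds` name are illustrative assumptions; the pipeline described in these slides uses Kafka for this, not an in-process merge.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Assumed document layout for this sketch: (timestamp, tenant_id, text)
Document = Tuple[int, str, str]


def merge_feeds(*feeds: Iterable[Document]) -> Iterator[Document]:
    """Lazily merge per-tenant feeds (each sorted by timestamp) into one stream."""
    return heapq.merge(*feeds, key=lambda doc: doc[0])


# Three sparse "trickle feeds" become one denser stream.
tenant_a = [(1, "tenant-a", "PhoneA is great"), (50, "tenant-a", "PhoneA died")]
tenant_b = [(20, "tenant-b", "I like BrandX")]
tenant_c = [(35, "tenant-c", "ServiceY was slow")]

stream = list(merge_feeds(tenant_a, tenant_b, tenant_c))
for doc in stream:
    print(doc)
```

Each document keeps its tenant id, so downstream steps can still pick the right tenant-specific configuration per document.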
Step 4: Address Multitenancy Requirements
◉ Minimize “switching costs” when analyzing data for different tenants
  − Separated text analytics steps into:
    − tenant-specific
    − language-specific
  − Optimized tenant-specific steps for “low latency” switching
◉ Avoid storing tenant state in processing components
  − Using ZooKeeper as distributed configuration store
WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
ANALYSIS PIPELINE
[Diagram components: COLLECTION SERVICE, ORCHESTRATION, DATA FETCHING, DOCUMENT ANALYSIS, EXPORT, AUTHOR ANALYSIS, DATASET CREATION]
[Diagram labels: Stream Processing, Batch Processing; Kafka (urls, raw, analyzed); HTTP; HDFS]
[Diagram: cluster layout in three tiers]
Orchestration Tier: master (×3), ZK (×3); per-job state for JOB-1, JOB-2, ... JOB-N: status, errorCode, # fetched, # analyzed
Data Processing Tier: worker + broker nodes
Storage Tier: HDFS
LANGUAGE-SPECIFIC ANALYSIS
• Expensive in initialization
• High memory usage
  → Not appropriate for caching
• Stable across tenants
• One analysis engine
  • per language
  • per Spark worker
  → created at deployment time
• Processing threads co-use these engines
[Diagram: every worker in the Data Processing Tier holds one engine per language (en, de, fr, es, ...)]
TENANT-SPECIFIC ANALYSIS
• Low memory usage compared to language analysis engines
  → LRU cache of 100 tenant-specific analysis engines per Spark worker
• Very low tenant-switching overhead
READ (raw) → ANALYZE
AQL:
create view Parent as
  extract regex /(my|our) (sons?|daughters?|kids?|baby( boy| girl)?|babies|child|children)/
    with flags 'CASE_INSENSITIVE'
    on D.text as match
  from Document D;
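For readers unfamiliar with AQL, the view above can be approximated with a plain regular expression. This is a rough Python sketch, not the text analytics engine used in the pipeline. The whitespace handling in the pattern (a single space after "my"/"our", and "baby boy" / "baby girl") is an interpretation of the slide, and `has_children` is a hypothetical name.

```python
import re

# Approximation of the AQL "Parent" view as a Python regex (assumed whitespace handling).
PARENT = re.compile(
    r"(my|our) (sons?|daughters?|kids?|baby( boy| girl)?|babies|child|children)",
    re.IGNORECASE,
)


def has_children(text: str) -> bool:
    """Return True if the text contains a phrase like 'my son' or 'our kids'."""
    return PARENT.search(text) is not None


print(has_children("My son wants me to get him PhoneB."))
print(has_children("My wife's PhoneA died yesterday."))
```

The pattern has no word boundaries, so it can over-match (for example "my kidney"). A production rule would need additional constraints.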
READ (raw) → ANALYZE → WRITE (analyzed)
status = RUNNING
errorCode = NONE
# fetched = 5000
# analyzed = 4000
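The status fields above could be modelled as a small record that stateless components serialize to and from a shared store. The sketch below is an assumption-laden illustration: the `JobStatus` class, the snake_case field names, and the JSON encoding are not taken from the slides, which only show the field names and example values. ZooKeeper znodes hold byte arrays, which is why the record is encoded to bytes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class JobStatus:
    """Per-job progress record (hypothetical layout based on the slide's fields)."""

    status: str = "RUNNING"
    error_code: str = "NONE"
    fetched: int = 0
    analyzed: int = 0

    def progress(self) -> float:
        """Fraction of fetched documents that have been analyzed so far."""
        return self.analyzed / self.fetched if self.fetched else 0.0

    def to_bytes(self) -> bytes:
        """Encode as JSON bytes, e.g. for storing in a znode (assumed encoding)."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "JobStatus":
        return cls(**json.loads(raw.decode("utf-8")))


job = JobStatus(fetched=5000, analyzed=4000)
restored = JobStatus.from_bytes(job.to_bytes())
print(restored, restored.progress())
```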


◉ More than 3000 tenants
◉ Between 150 and 300 analysis jobs per day
◉ Between 25K and 4M documents analyzed per job
◉ 3 data centers: Toronto, Dallas, Amsterdam
◉ Cluster of 10 VMs, each with 16 cores and 32 GB RAM
◉ Text-analytics throughput (full stack, from tokenization to sentiment / behavior):
  − Mixed sources: ~60K documents per minute
  − Tweets only: 150K tweets per minute
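A back-of-the-envelope reading of these figures, as a sketch. It assumes that the throughput numbers refer to the 10-VM cluster, that all 160 cores do text analytics, and that one job has the whole cluster to itself. The slides do not state any of these, so the derived values are only rough orientation.

```python
# Figures from the slide
VMS, CORES_PER_VM = 10, 16
MIXED_DOCS_PER_MIN = 60_000
TWEETS_PER_MIN = 150_000
LARGEST_JOB_DOCS = 4_000_000

# Derived values (under the assumptions named in the lead-in)
total_cores = VMS * CORES_PER_VM                             # cores in the cluster
mixed_per_sec = MIXED_DOCS_PER_MIN / 60                      # mixed documents per second
tweets_per_sec = TWEETS_PER_MIN / 60                         # tweets per second
mixed_per_core_sec = mixed_per_sec / total_cores             # documents per core per second
largest_job_minutes = LARGEST_JOB_DOCS / MIXED_DOCS_PER_MIN  # minutes for a 4M-document job

print(total_cores, mixed_per_sec, tweets_per_sec)
print(mixed_per_core_sec, round(largest_job_minutes, 1))
```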
DEPLOYMENT
[Diagram labels: GitHub, CI, Uber JAR, Deployment Container, REST-API, ACTIVE]
Steps: UPLOAD TO HDFS, GRACEFUL SHUTDOWN, START APPS


offsetSettingStrategy: "keepWithNewPartitions",
segment: {
app: "fetcher",
spark.cores.max: 10,
spark.executor.memory: "3G",
spark.streaming.stopGracefullyOnShutdown: true,


}
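One way such a segment could be turned into application start-up arguments is sketched below. How the deployment service actually consumes this configuration is not shown in the slides; `to_conf_args` is a hypothetical helper, and only the `--conf key=value` form of `spark-submit` and the property names from the slide are taken as given.

```python
from typing import Dict, List, Union

Value = Union[str, int, bool]


def to_conf_args(segment: Dict[str, Value]) -> List[str]:
    """Translate the spark.* entries of a segment into spark-submit --conf arguments."""
    args: List[str] = []
    for key, value in segment.items():
        if not key.startswith("spark."):
            continue                     # e.g. "app" is not a Spark property
        if isinstance(value, bool):
            value = str(value).lower()   # render booleans as true/false
        args += ["--conf", f"{key}={value}"]
    return args


segment = {
    "app": "fetcher",
    "spark.cores.max": 10,
    "spark.executor.memory": "3G",
    "spark.streaming.stopGracefullyOnShutdown": True,
}
print(" ".join(["spark-submit"] + to_conf_args(segment)))
```

With `spark.streaming.stopGracefullyOnShutdown` set to true, a streaming application finishes processing received data before it stops, which matches the GRACEFUL SHUTDOWN step.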


WATSON ANALYTICS FOR SOCIAL MEDIA
PREVIOUS ARCHITECTURE
THOUGHT PROCESS TOWARDS MULTITENANCY
NEW ARCHITECTURE
LESSONS LEARNED
◉ “Driver Only” – No Stream or Batch Processing
◉ Uses Spark’s Fail-Over mechanisms
◉ No implementation of custom master election necessary
ORCHESTRATION DATA FETCHING PARTITIONING EXPORTING AUTHOR ANALYSIS


// update progress on ZooKeeper
// trigger the batch phases
KAFKA EVENT → HTTP REQUEST → DOCUMENT BATCH
◉ Wrong workload for Spark’s micro-batch based stream processing
◉ Hard to find an appropriate scheduling interval
◉ Builds up high scheduling delays
◉ Hard to distribute equally across the cluster
◉ No benefit from Spark settings like spark.streaming.kafka.maxRatePerPartition or spark.streaming.backpressure.enabled
◉ Sometimes leading to OOM
NON-SPARK APPLICATION
◉ Run the fetcher as a separate application outside of Spark (“long running” thread)
◉ Fetch documents from social media providers
◉ Write the raw documents to Kafka
◉ Let the pipeline analyze these documents
◉ spark.streaming.backpressure.enabled = true
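The shape of a fetcher that runs as a long-running thread outside of Spark can be sketched like this. It is a simplified stand-in: `queue.Queue` replaces the Kafka "raw" topic, `fake_fetch` replaces the HTTP calls to social media providers, and the names `run_fetcher` and `raw_topic` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_fetcher(
    urls: Iterable[str],
    fetch: Callable[[str], List[str]],
    out: "queue.Queue[str]",
) -> threading.Thread:
    """Start a long-running thread that fetches documents and publishes them."""

    def loop() -> None:
        for url in urls:
            for doc in fetch(url):  # one request may yield a batch of documents
                out.put(doc)        # publish each raw document individually

    thread = threading.Thread(target=loop, name="fetcher", daemon=True)
    thread.start()
    return thread


def fake_fetch(url: str) -> List[str]:
    """Stand-in for an HTTP request to a social media provider."""
    return [f"{url}#doc{i}" for i in range(3)]


raw_topic: "queue.Queue[str]" = queue.Queue()  # stand-in for the Kafka "raw" topic
worker = run_fetcher(["http://example.invalid/a", "http://example.invalid/b"], fake_fetch, raw_topic)
received = [raw_topic.get() for _ in range(6)]  # the analysis pipeline would consume these
worker.join()
print(received)
```

The point of the split is that the analysis side then sees individual documents rather than one event that expands into a large batch, so rate limiting and backpressure on the consuming side can take effect.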
[Diagram: JOB-1 and JOB-2 distributed over worker + broker nodes under three partitioning strategies]
Job-ID
• Jobs don’t influence each other
• Run fully in parallel
• Cluster is not fully/equally utilized
“Random” + all URLs at once
• Cluster is fully utilized
• Workload equally distributed
• Jobs run sequentially
“Random” + “next” URLs
• Cluster is fully utilized
• Workload equally distributed
• Jobs run in parallel
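The three strategies can be compared with a toy model. This is an interpretation of terse slide labels, not the authors' implementation: "Random" is read as random partition assignment, "all URLs at once" as enqueueing a whole job before the next one, and "next URLs" as enqueueing only the next chunk of each job in turn. All function names and the chunk size are assumptions.

```python
import random
import zlib
from itertools import zip_longest
from typing import Dict, List, Tuple

Jobs = Dict[str, List[str]]  # job id -> URLs to fetch
Event = Tuple[str, str]      # (job id, url)


def by_job_id(jobs: Jobs, partitions: int) -> List[List[Event]]:
    """Key by job id: every URL of a job lands in the same partition."""
    out: List[List[Event]] = [[] for _ in range(partitions)]
    for job_id, urls in jobs.items():
        p = zlib.crc32(job_id.encode("utf-8")) % partitions
        out[p].extend((job_id, url) for url in urls)
    return out


def random_all_at_once(jobs: Jobs, partitions: int, rng: random.Random) -> List[List[Event]]:
    """Spread URLs randomly, but enqueue each job's URLs in one go."""
    out: List[List[Event]] = [[] for _ in range(partitions)]
    for job_id, urls in jobs.items():
        for url in urls:
            out[rng.randrange(partitions)].append((job_id, url))
    return out


def random_next(jobs: Jobs, partitions: int, rng: random.Random, chunk: int = 2) -> List[List[Event]]:
    """Spread URLs randomly, enqueueing only the next chunk of each job in turn."""
    out: List[List[Event]] = [[] for _ in range(partitions)]
    chunked = [
        [[(job_id, url) for url in urls[i:i + chunk]] for i in range(0, len(urls), chunk)]
        for job_id, urls in jobs.items()
    ]
    for round_of_chunks in zip_longest(*chunked, fillvalue=[]):
        for piece in round_of_chunks:
            for event in piece:
                out[rng.randrange(partitions)].append(event)
    return out


jobs = {"JOB-1": [f"u{i}" for i in range(8)], "JOB-2": [f"v{i}" for i in range(8)]}
keyed = by_job_id(jobs, 4)
at_once = random_all_at_once(jobs, 4, random.Random(0))
interleaved = random_next(jobs, 1, random.Random(0))  # one partition shows the ordering
print([job for job, _ in interleaved[0]])
```

In the toy model, keying by job id leaves partitions idle, enqueueing everything at once puts all of JOB-1 ahead of JOB-2 in every partition, and the chunked variant interleaves both jobs while still spreading the load.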
Write documents for each job to a separate HDFS directory.

First version: Group documents using Scala collections API
◉ Grouping happens in User Memory region
◉ When a large batch of documents is processed:
  ◉ Spark does not spill to disk
  ◉ OOM errors
→ Configure a small streaming batch interval, or set maxRatePerPartition to limit the events of the batch

Second version: Let Spark do the grouping and write to HDFS
◉ Group and write job documents through DataFrame APIs
◉ Uses the Execution Memory region for computations
◉ Spark can prevent OOM errors by borrowing space from the Storage memory region or by spilling to disk if needed
→ No OOM
Within-Application Job Scheduling
FIFO SCHEDULER
◉ First job gets available resources while its stages have tasks to launch
◉ Big jobs “block” smaller jobs started later
FAIR SCHEDULER
◉ Tasks between jobs assigned “round robin”
◉ Smaller jobs submitted while a big job is running can start receiving resources right away
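The difference between the two schedulers can be illustrated with a toy simulation. It is a deliberately simplified model, not Spark's scheduler: every task takes one tick, all jobs are submitted up front in list order, and `simulate` is a hypothetical function. In Spark itself, fair scheduling within an application is enabled by setting `spark.scheduler.mode` to `FAIR`.

```python
from typing import Dict, List, Tuple


def simulate(jobs: List[Tuple[str, int]], slots: int, mode: str) -> Dict[str, int]:
    """Return the tick at which each job finishes; each task occupies one slot for one tick."""
    remaining = dict(jobs)  # job name -> tasks left, in submission order
    finish: Dict[str, int] = {}
    tick = 0
    while remaining:
        tick += 1
        free = slots
        if mode == "FIFO":
            # Earlier jobs take every slot they can use before later jobs get any.
            for name in remaining:
                take = min(free, remaining[name])
                remaining[name] -= take
                free -= take
                if free == 0:
                    break
        else:
            # FAIR: hand out slots one at a time, round robin across jobs.
            while free > 0 and any(remaining.values()):
                for name in remaining:
                    if free == 0:
                        break
                    if remaining[name] > 0:
                        remaining[name] -= 1
                        free -= 1
        for name in [n for n, left in remaining.items() if left == 0]:
            finish[name] = tick
            del remaining[name]
    return finish


jobs = [("big", 100), ("small", 4)]
fifo = simulate(jobs, slots=4, mode="FIFO")
fair = simulate(jobs, slots=4, mode="FAIR")
print("FIFO:", fifo)
print("FAIR:", fair)
```

Under FIFO the small job waits for the big one to drain. Under FAIR it completes almost immediately, at the cost of the big job finishing slightly later.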
SUMMARY
◉ Watson Analytics for Social Media allows harvesting and analyzing social media data
◉ Becoming a real-time multi-tenant analytics pipeline required:
– Splitting analytics into tenant-specific and language-specific steps
– Aggregating all tenants’ trickle feeds into a single stream
– Ensuring low-latency context switching between tenants
– Removing state from processing components
◉ New pipeline based on Spark, Kafka and ZooKeeper
– robust
– fulfills latency and throughput requirements
– additional analytics
SPARK SUMMIT EUROPE 2016
THANK YOU.
RUBEN PULIDO
www.linkedin.com/in/ruben-pulido
@_rubenpulido
FREE TRIAL: www.watsonanalytics.com
BEHAR VELIQI
www.linkedin.com/in/beharveliqi
@beveliqi
