Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
Data Solutions Engineer @ Databricks
Who am I?
2
Data Platform Engineer
playboy.com
Streaming Platform Engineer
NetflixOSS Committer
netflix.com, github.com/Netflix
Data Solutions Engineer
Apache Spark Contributor
databricks.com, github.com/apache/spark
Why After Dark?
Playboy After Dark
Late 1960’s TV Show
Progressive Show For Its Time
And it rhymes!!
3
What is Spark?
4
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
BlinkDB: approx queries
…
in Production
5
What is Databricks?
6
Founded by the creators of Spark
Spark as a Service
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring
in Production
7
8
Goals of After Dark?
① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
Spark Streaming -> Kafka, Approximations
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way.
Popular Dating Sites
9
Themes of this Talk
10
① Performance
② Parallelism
③ Columnar Storage
④ Approximations
⑤ Similarity
⑥ Minimize Shuffle
Performance
11
Daytona Gray Sort Contest
12
On-disk only
250,000 partitions
No in-memory caching
(2014)(2013) (2014)
Improved Shuffle and Network Layer
13
① Introduced sort-based shuffle
Mapper maintains large buffer grouped by keys
Reducer seeks directly to group and scans
② Minimizes OS resources
Fewer open files and connections between mappers and reducers
③ Netty: Async keeps CPU hot, reuse ByteBuffer
④ epoll: disk-network comm in kernel space only
Project Tungsten: CPU and Memory
14
① Largest change to Spark exec engine to date
② Cache-aware data structures and sorting
③ Expand JVM bytecode generation, JIT optimizations
④ Custom memory management, serializers, HashMap
DataFrames and Catalyst
15
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Tip: Use DataFrames! -->
JVM bytecode
generation
Parallelism
16
Brady Bunch circa 1974
17
Season 5, Episode 18: “Two Petes in a Pod”
Parallel Algorithm : O(log n)
18
O(log n)
Non-parallel Algorithm : O(n)
19
O(n)
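The contrast between the two slides above can be sketched as a pairwise (tree) reduction versus a sequential fold. This is a minimal illustration in plain Python, not Spark code. The function names are hypothetical, and the tree version only counts the rounds that could run concurrently instead of starting real threads.

```python
from typing import Callable, List, Tuple


def sequential_reduce(values: List[int], op: Callable[[int, int], int]) -> Tuple[int, int]:
    """Fold left to right; returns (result, number of dependent steps)."""
    acc = values[0]
    steps = 0
    for v in values[1:]:
        acc = op(acc, v)
        steps += 1  # each step waits for the previous one: O(n)
    return acc, steps


def tree_reduce(values: List[int], op: Callable[[int, int], int]) -> Tuple[int, int]:
    """Combine neighbours pairwise; returns (result, number of rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Pairs within one round are independent, so they could run in parallel.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd element is carried to the next round
        level = nxt
        rounds += 1  # O(log n) rounds in total
    return level[0], rounds
```

Summing the 16 values 1..16 takes 15 dependent steps sequentially but only 4 rounds with the tree version; both return 136.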
Columnar Storage
20
Columnar Storage Format
21
*Skip whole chunks with min-max heuristics
stored in each chunk (sorted data only)
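A rough sketch of the min-max idea above, assuming a sorted integer column split into fixed-size chunks. The helper names and the tuple layout are illustrative only and do not reflect any real file format's API.

```python
from typing import List, Tuple

Chunk = Tuple[int, int, List[int]]  # (min, max, values)


def make_chunks(sorted_values: List[int], size: int) -> List[Chunk]:
    """Split a sorted column into chunks, storing min and max with each chunk."""
    chunks = []
    for i in range(0, len(sorted_values), size):
        part = sorted_values[i:i + size]
        chunks.append((min(part), max(part), part))
    return chunks


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of chunks actually read."""
    matches: List[int] = []
    chunks_read = 0
    for lo, hi, part in chunks:
        if hi <= threshold:
            continue  # the stored max proves nothing in this chunk can match
        chunks_read += 1
        matches.extend(v for v in part if v > threshold)
    return matches, chunks_read
```

With the values 0..99 in chunks of 10, a filter of "greater than 84" reads only 2 of the 10 chunks. The skipping only pays off when the data is sorted (or clustered) on the filtered column, which is the caveat on the slide.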
Parquet File Format
22
① Based on Google Dremel Paper
② Implemented by Twitter and Cloudera
③ Columnar storage format
④ Optimized for fast columnar aggregations
⑤ Tight compression
⑥ Supports pushdowns
⑦ Nested, self-describing, evolving schema
Types of Compression
23
① Run Length Encoding
Repeated data
② Dictionary Encoding
Fixed set of values
③ Delta, Prefix Encoding
Sorted dataset
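The three encodings above can be illustrated with toy versions in plain Python. These are simplified sketches with hypothetical function names; a real columnar format packs the results into bits and bytes, which is omitted here.

```python
from typing import Dict, List, Tuple


def run_length_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Repeated data: store each run as (value, run length)."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Fixed set of values: store each distinct value once, then small integer codes."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def delta_encode(sorted_values: List[int]) -> List[int]:
    """Sorted data: store the first value, then the (small) differences."""
    if not sorted_values:
        return []
    return [sorted_values[0]] + [b - a for a, b in zip(sorted_values, sorted_values[1:])]
```

For example, `delta_encode([1000, 1003, 1004, 1010])` gives `[1000, 3, 1, 6]`, and the small deltas need far fewer bits than the original values.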
Types of Pushdowns
24
① Column, Partition Pruning
② Row, Predicate Filtering
Approximations
25
Sketch Algorithm: Count Min Sketch
26
①  Approximate counters
②  Better than HashMap
③  Fixed, low memory
④  Known error bounds
⑤  Large num of counters
⑥  Available in Twitter’s Algebird
⑦  Streaming example in Spark
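A minimal Count-Min Sketch in plain Python, to make the "fixed, low memory" and "known error bounds" points above concrete. This is not Algebird's implementation. The class name, the default `depth` and `width`, and the use of seeded MD5 as the hash family are all assumptions made for the illustration.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate counters in a fixed depth x width table."""

    def __init__(self, depth: int = 4, width: int = 1000) -> None:
        self.depth = depth
        self.width = width
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The estimate never undercounts; it can overcount when items collide in every row. Memory stays at `depth * width` counters no matter how many distinct items are added, which is the advantage over a HashMap with one entry per key.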
Probabilistic Data Structure: HyperLogLog
27
①  Fixed memory
②  Known error distribution
③  Measures set cardinality
④  Approx count distinct
⑤  Number of unique users
⑥  From Twitter’s Algebird
⑦  Streaming example in Spark
⑧  RDD: countApproxDistinctByKey()
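A compact HyperLogLog sketch in plain Python, showing how a fixed array of registers can estimate the number of unique users. This is a simplified textbook version, not Algebird's or Spark's implementation. The class name, the default precision `p=12`, and the SHA-1 based 64-bit hash are assumptions of this example.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate count-distinct using 2**p small registers."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p  # number of registers; keep m >= 128 for the alpha formula below
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")  # 64-bit hash
        idx = h >> (64 - self.p)                        # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)    # small-range correction
        return raw
```

Adding the same user ID repeatedly never changes the estimate, because a register only keeps its maximum. The typical relative error is roughly 1.04 / sqrt(m), so about 1.6% for the 4096 registers assumed here.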
Similarity
28
Types of Similarity
29
① Euclidean: linear measure
Magnitude bias
② Cosine: angle measure
Adjusts for magnitude bias
③ Jaccard: set intersection divided by union
Popularity bias
④ Log Likelihood
Adjusts for bias -->
[Table: like matrix, 1 = like. Columns: Ali, Matei, Reynold, Patrick, Andy. Likes per row: Kimberly 4, Paula 1, Lisa 1, Cindy 2, Holden 5.]
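The first three measures above can be written out in a few lines of plain Python. This is only a sketch with hypothetical function names; the vectors and like-sets in the usage note are made-up inputs, not the values from the table.

```python
import math
from typing import Sequence, Set


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; grows with magnitude even when direction matches."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between the vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For the vectors `[1, 1, 0]` and `[2, 2, 0]`, the Euclidean distance is sqrt(2) although they point the same way, while the cosine similarity is 1. That is the magnitude bias the slide refers to. For the sets `{"Ali", "Matei"}` and `{"Matei", "Reynold"}`, the Jaccard similarity is 1/3.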
All-pairs Similarity
30
① Compare everything to everything
② a.k.a. “pair-wise similarity” or “similarity join”
③ Naïve shuffle: O(m*n^2); m=rows, n=cols
④ Minimize shuffle: reduce data size and approximate
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (0?)
Minimize Shuffle
31
Sampling Algo: DIMSUM
32
① “Dimension Independent Matrix Square Using MR”
② Remove rows with low similarity probability
③ MLlib: RowMatrix.columnSimilarities(…)
④ Twitter: 40% efficiency gain over Cosine
Bucket Algo: Locality Sensitive Hashing
33
①  Split into b buckets using similarity hash algo
Requires pre-processing of data
②  Compare bucket contents in parallel
③  Converts O(m*n^2) -> O(m*n/b*b^2);
m=rows, n=cols, b=buckets
④  Example: 500k x 500k matrix
O(1.25E17) -> O(1.25E13); b=50
⑤  github.com/mrsqueeze/spark-hash
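One common way to build the similarity hash above is MinHash signatures split into bands; the sketch below assumes that choice and does not use the spark-hash library linked on the slide. The function names, the `bands` and `rows` defaults, and the seeded MD5 hash family are hypothetical, and every input set must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def minhash_signature(items: Set[str], num_hashes: int) -> List[int]:
    """One minimum per seeded hash function; similar sets share many minima."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(sets: Dict[str, Set[str]], bands: int = 4, rows: int = 2) -> Set[Tuple[str, str]]:
    """Bucket each band of the signature; only names sharing a bucket get compared."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], List[str]] = defaultdict(list)
    for name, items in sets.items():
        signature = minhash_signature(items, bands * rows)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].append(name)
    pairs: Set[Tuple[str, str]] = set()
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs
```

Sets with identical contents always land in the same buckets, and sets with nothing in common practically never do, so the expensive exact comparison is only run on the candidate pairs. Each bucket can be processed independently, which is where the parallelism comes from.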
MLlib: SparseVector vs. DenseVector
34
①  Remove columns using sparse vectors
②  Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
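The saving from sparse vectors can be seen in a dot product that only touches stored entries. This sketch uses a plain dict from index to value as a stand-in for a sparse vector; the helper names are hypothetical, and the `default` parameter illustrates the tip about choosing the most frequent value to omit, which is assumed to be 0 here.

```python
from typing import Dict, List


def to_sparse(dense: List[float], default: float = 0.0) -> Dict[int, float]:
    """Keep only the entries that differ from the most frequent (default) value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over stored entries only: work scales with nnz, not n."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer stored entries
    return sum(v * b[i] for i, v in a.items() if i in b)
```

For two length-8 vectors with two nonzeros each, `sparse_dot` inspects 2 entries instead of 8. Note that `sparse_dot` is only a valid dot product when the omitted value is 0.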
Interactive Demo!
35
Audience Participation Needed!
36
① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses
You are here
Recommendation Terminology
37
① User
User seeking likeable recommendations
② Item
User who has been liked
*Also a user seeking likeable recommendations!
③ Types of Feedback
Explicit: Ratings, Like/Dislike
Implicit: Search, Click, Hover, View, Scroll
Types of Recommendations
38
① Non-personalized
Cold Start
No preference or behavior data for user, yet
② Personalized
Items that others with similar prefs have liked
User-Item Similarity
Items similar to your previously-liked items
Item-Item Similarity
Non-personalized
Recommendations
39
Summary Statistics and Aggregations
40
① Top Users by Like Count
“I might like users with the highest sum aggregation
of likes overall.”
SparkSQL + DataFrame: Aggregations
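The aggregation above amounts to a group-by on the liked user, a count, and a descending sort. The sketch below shows that logic in plain Python rather than Spark SQL; the function name and the `(from_user, to_user)` pair layout are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple


def top_liked(likes: List[Tuple[str, str]], n: int) -> List[Tuple[str, int]]:
    """likes holds (from_user, to_user) pairs; return the n most-liked users."""
    counts = Counter(to_user for _, to_user in likes)
    return counts.most_common(n)
```

Given the pairs `[("a", "x"), ("b", "x"), ("c", "y"), ("a", "y"), ("b", "x"), ("c", "z")]`, `top_liked(likes, 2)` returns `[("x", 3), ("y", 2)]`. This is a non-personalized recommendation: every new user sees the same list.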
Like Graph Analysis
41
② Top Influencers by Like Graph
“I might like users who have the highest probability of
me liking them randomly while walking the like graph.”
GraphX: PageRank
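The "random walk over the like graph" reading of PageRank above can be sketched with a plain power iteration. This is a small single-machine illustration, not GraphX code; the function name and the `damping` and `iterations` defaults are assumptions.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]], damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """edges holds (liker, liked) pairs; returns the probability of landing on each user."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by users who like nobody is spread evenly over everyone.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                nxt[dst] += damping * rank[src] / len(out_links[src])
        rank = nxt
    return rank
```

In a toy graph where a, b, and d all like c, and c likes a, user c gets the highest rank and a comes second, because a is liked by the most influential user. The ranks always sum to 1.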
Demo!
Spark SQL + DataFrames + GraphX
42
Personalized
Recommendations
43
Collaborative Filtering Personalized Recs
44
③ Like behavior of similar users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
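The quoted intuition above ("what other people did you like that I haven't seen?") can be sketched with a simple neighbour-based approach. Note that this is a stand-in and not matrix factorization or ALS: it scores unseen profiles by the Jaccard overlap between like-sets. The function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def recommend(likes: Dict[str, Set[str]], user: str, n: int = 3) -> List[str]:
    """Suggest profiles liked by users whose likes overlap with this user's likes."""
    mine = likes[user]
    scores: Dict[str, float] = defaultdict(float)
    for other, theirs in likes.items():
        if other == user:
            continue
        union = mine | theirs
        overlap = len(mine & theirs) / len(union) if union else 0.0  # Jaccard similarity
        if overlap == 0.0:
            continue  # nothing in common, so their likes say nothing about mine
        for candidate in theirs - mine - {user}:
            scores[candidate] += overlap  # weight by how similar the other user is
    return sorted(scores, key=lambda c: (-scores[c], c))[:n]
```

If "me" likes Ali and Matei, u1 likes Ali, Matei, and Reynold, and u2 likes Ali and Patrick, the result is `["Reynold", "Patrick"]`: Reynold ranks first because u1 shares more likes with "me" than u2 does. Matrix factorization goes further by learning latent factors, so it can relate users who share no likes at all.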
Text-based Personalized Recs
45
④ Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams.
We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
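A minimal TF/IDF plus cosine similarity sketch for the profile comparison above, in plain Python rather than MLlib. It works on single words, not k-skip n-grams, and the function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions of this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term by its count in the profile and its rarity across profiles."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: a term that appears in every profile gets weight 0.
        vectors.append({t: c * math.log((n + 1) / (doc_freq[t] + 1)) for t, c in tf.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
```

With the made-up profiles "i love hiking and jazz", "i love jazz and hiking trips", and "i love spreadsheets and taxes", the first two score well above zero while the first and third score 0: the words shared by all three profiles carry no weight, so only the distinctive terms count.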
More Text-based Personalized Recs
46
⑤ Similar profiles from my past likes
“Your profile shares a similar feature vector space to
others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
More Text-based Personalized Recs
47
⑥ Relevant, High-Value Emails
“Your initial email has similar named entities to my profile.
I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
^
Her Email
< My Profile
Demo!
MLlib + ALS + Word2Vec + TF/IDF
48
Bonus!
The Future of Recommendations
49
Facial Recognition
50
⑦ Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Conversation Starter Bot
51
⑧ NLP and DecisionTrees
“If your responses to my trite opening lines are positive,
I might actually read your profile.”
MLlib: TF/IDF, DecisionTree,
Sentiment Analysis
Positive
responses ->
Negative
<- responses
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Double Bonus!
52
Maintaining the Spark
Compromise Recommendations (Couples)
53
⑨ Similarity Pathways
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> … … <- similar actors
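The "something in between" idea above can be sketched as a shortest path through an item-item similarity graph, here with a breadth-first search in plain Python instead of GraphX. The function name is hypothetical, and the graph in the usage note uses placeholder titles ("Film A", "Film B") because the real similarity edges are not given in the slides.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the fewest-hops path or None if unreachable."""
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path: List[str] = []
            step: Optional[str] = node
            while step is not None:  # walk back to the start
                path.append(step)
                step = previous[step]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None
```

If "Mad Max" is similar to "Film A", "Film A" to "Film B", and "Film B" to "Message In a Bottle", the path is those four titles in order, and the titles in the middle are the compromise candidates.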
And the Final,
54
⑩ Personalized Recommendation
My Personalized Recommendation
55
⑩ Get Off Your Computer and Be Social!!
Thank you!
cfregly@databricks.com
@cfregly
Image courtesy of http://www.duchess-france.org/

IMCSummit 2015 - Day 1 Developer Track - Spark After Dark: Generating High Quality Dating Recommendations Using Advanced Real Time Analytics