Microservices, containers,
and machine learning
2015-07-23 • PDX
Paco Nathan, @pacoid

O’Reilly Learning
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
Spark Brief
Spark Brief: Components
• generalized patterns: unified engine for many use cases
• lazy evaluation of the lineage graph: reduces wait states, better pipelining
• generational differences in hardware: off-heap use of large memory spaces
• functional programming / ease of use: reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
Spark Brief: Key Distinctions vs. MapReduce
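The lazy-evaluation point can be illustrated outside Spark. This is a minimal Python sketch of the idea, not Spark's implementation: `LazyDataset` is a hypothetical class that only records transformations as a lineage and runs them when `collect()` is called, streaming each element through all recorded steps.

```python
class LazyDataset:
    """Hypothetical stand-in for a lazily evaluated dataset (not a Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record the transformation; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def lineage(self):
        # The recorded chain of transformations, in order.
        return [kind for kind, _ in self._steps]

    def collect(self):
        # The "action": build one pipelined iterator and drain it.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # track when work actually happens
    return x * x


dataset = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_collect = list(calls)
collected = dataset.collect()
```

No call to `square` happens until `collect()` runs, which is the property that lets an engine plan and pipeline the whole lineage at once.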
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark Brief: Smashing the Previous Petabyte Sort Record
GraphX Examples
[Figure: graph with nodes 0, 1, 2, 3 and edge costs 4, 3, 1, 2, 1]
GraphX:
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
• graph-parallel systems
• emphasis on integrated workflows
• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Graph Analytics in Spark
Ankur Dave, Databricks
spark-summit.org/east-2015/talk/graphx-graph-analytics-in-spark

Topic modeling with LDA: MLlib meets GraphX
Joseph Bradley, Databricks
databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html
GraphX: Further Reading…
GraphX: Compose Node + Edge RDDs into a Graph
val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…)
val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…)
val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)
// http://spark.apache.org/docs/latest/graphx-programming-guide.html
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
GraphX: Example – simple traversals
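To make the triplet view concrete without a Spark cluster, here is a plain-Python sketch of the same traversal. It reuses the node and edge data from the Scala example; the `triplets` helper is a hypothetical illustration of what a (source attribute, destination attribute, edge attribute) view looks like, not the GraphX API.

```python
from collections import namedtuple

Peep = namedtuple("Peep", ["name", "age"])

# Same vertices and edges as the Scala example above.
nodes = {
    1: Peep("Kim", 23), 2: Peep("Pat", 31), 3: Peep("Chris", 52),
    4: Peep("Kelly", 39), 5: Peep("Leslie", 45),
}
edges = [
    (2, 1, 7), (2, 4, 2), (3, 2, 4),
    (3, 5, 3), (4, 1, 1), (5, 3, 9),
]


def triplets(nodes, edges):
    """Join each edge with the attributes of its two endpoints."""
    for src, dst, attr in edges:
        yield nodes[src], nodes[dst], attr


# Keep only edges whose attribute is strictly greater than 7.
results = [
    f"{src.name} loves {dst.name}"
    for src, dst, attr in triplets(nodes, edges)
    if attr > 7
]
```

Only the edge with attribute 9 passes the filter, since the comparison is strict and the edge with attribute 7 is excluded.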
GraphX: Example – routing problems
[Figure: graph with nodes 0, 1, 2, 3 and edge costs 4, 3, 1, 2, 1]
What is the cost to reach node 0 from any other
node in the graph? This is a common use case for
graph algorithms, e.g., Dijkstra
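The routing question can be sketched with Dijkstra's algorithm run on the reversed graph, which gives the cost to reach one target from every other node in a single pass. The node ids and the costs (4, 3, 1, 2, 1) come from the slide, but which edge carries which cost, and the edge directions, are assumptions here because the figure itself is not recoverable.

```python
import heapq

# ASSUMPTION: hypothetical edge layout (src, dst, cost); only the node ids
# and the multiset of costs are taken from the slide.
edges = [(1, 0, 4), (3, 0, 3), (1, 3, 1), (2, 1, 2), (2, 3, 1)]


def cost_to_target(edges, target):
    """Cheapest cost from every node to `target` (Dijkstra on reversed edges)."""
    reverse = {}
    for src, dst, cost in edges:
        reverse.setdefault(dst, []).append((src, cost))
        reverse.setdefault(src, [])

    dist = {target: 0}
    heap = [(0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for prev, cost in reverse[node]:
            candidate = d + cost
            if candidate < dist.get(prev, float("inf")):
                dist[prev] = candidate
                heapq.heappush(heap, (candidate, prev))
    return dist


costs = cost_to_target(edges, target=0)
```

Under the assumed layout, node 2 reaches node 0 more cheaply through node 3 than through node 1.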
GraphX: code examples…
Let’s check
some code!
Graph Analytics
Graph Analytics: terminology
• many real-world problems can be represented as graphs
• graphs can generally be converted into sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in a system defined by matrices, which may be more efficient to compute
• beyond simpler graphs, complex data may require work with tensors
Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
Graph Analytics: example
[Figure: graph with vertices u, v, w, x]
We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based 

on the vertices
• entries get a 1 if an edge connects the
corresponding vertices, or 0 otherwise
Graph Analytics: representation
    u  v  w  x
u   0  1  0  1
v   1  0  1  1
w   0  1  0  1
x   1  1  1  0
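The two steps above (label rows and columns, then mark connected pairs) can be sketched in a few lines of Python. The edge list is read off the matrix on the slide; the function name is a hypothetical helper.

```python
vertices = ["u", "v", "w", "x"]
# Undirected edges, read off the adjacency matrix on the slide.
edges = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]


def adjacency_matrix(vertices, edges):
    """Build a dense 0/1 adjacency matrix for an undirected graph."""
    index = {name: i for i, name in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for a, b in edges:
        # An undirected edge sets both (a, b) and (b, a).
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


A = adjacency_matrix(vertices, edges)
```

For large sparse graphs a dense list of lists wastes memory; a sparse format would be the usual choice, but the dense form keeps the correspondence with the slide obvious.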
The adjacency matrix of an undirected graph always has certain properties:
• it is symmetric, i.e., A = A^T
• it has real eigenvalues (a consequence of the symmetry)
Therefore algebraic graph theory bridges between linear algebra and graph theory
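As a small worked check of these properties, the sketch below verifies symmetry for the example matrix and estimates its largest eigenvalue with power iteration. Power iteration is one simple technique chosen for illustration, not something the slides prescribe; the function names are hypothetical.

```python
import math

# Adjacency matrix of the example graph (rows and columns: u, v, w, x).
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]


def is_symmetric(m):
    """True when m equals its transpose."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))


def dominant_eigenpair(m, iterations=200):
    """Estimate the largest eigenvalue and its eigenvector by power iteration."""
    n = len(m)
    vec = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    value = 0.0
    for _ in range(iterations):
        product = [sum(m[i][j] * vec[j] for j in range(n)) for i in range(n)]
        # Rayleigh quotient; vec has unit length, so no division is needed.
        value = sum(p * v for p, v in zip(product, vec))
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    return value, vec


symmetric = is_symmetric(A)
eigenvalue, eigenvector = dominant_eigenpair(A)
```

For this graph the estimate converges to (1 + sqrt(17)) / 2, and the eigenvector gives the largest weights to v and x, the two vertices with three neighbors each.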
Graph Analytics: algebraic graph theory
Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Graph Analytics: beauty in sparsity
Algebraic Graph Theory
Norman Biggs
Cambridge (1974)
amazon.com/dp/0521458978

Graph Analysis and Visualization
Richard Brath, David Jonker
Wiley (2015)
shop.oreilly.com/product/9781118845844.do
See also examples in: Just Enough Math
Graph Analytics: resources
Although tensor factorization is considered problematic, it may provide more general case solutions, and some work leverages Spark:

The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html

Spacey Random Walks and Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
Graph Analytics: tensor solutions emerging
Graph Analytics: watch this space carefully
Data Preparation
Data Prep: Exsto Project Overview
https://github.com/ceteri/exsto/
• insights about dev communities, via data mining
their email forums
• works with any Apache project email archive
• applies NLP and ML techniques to analyze
message threads
• graph analytics surface themes and interactions
• results provide feedback for communities, e.g.,
leaderboards
Data Prep: Exsto Project Overview – four links
https://github.com/ceteri/exsto/
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
http://mail-archives.apache.org/mod_mbox/spark-user/
http://goo.gl/2YqJZK
Data Prep: Scraper pipeline
https://github.com/ceteri/exsto/
Data Prep: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:
• ~2K msgs/month
• ~18 MB/month parsed in JSON
Six months’ list activity represents a graph of:
• 1882 senders
• 1,762,113 nodes
• 3,232,174 edges
A large graph?! In any case, it satisfies the definition of a graph-parallel system: lots of data locality to leverage
Data Prep: idealized ML workflow…
[Diagram: idealized ML workflow, circa 2010, in three phases: representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; Data Prep; Features; Explore; Unsupervised Learning; Learners, Parameters; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; visualize, reporting; data pipelines; actionable results; decisions, feedback; foo algorithms; bar developers]
Data Prep: Microservices meet Parallel Processing
[Diagram: pipeline from email archives to community insights and community leaderboards. Labels: services; Scraper / Parser; NLTK data; Unique Word IDs; Data Prep; Features; Explore; SparkSQL; TextRank, Word2Vec, etc.]
not so big data… relatively big compute…
( we’ll come back to this point! )
Data Prep: Scraper pipeline
[Diagram: scraper pipeline. Labels: Apache email list archive; urllib2: crawl monthly list by date; Py: filter quoted content; Py: segment paragraphs; message JSON]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
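The two "Py" stages of the scraper (filter quoted content, segment paragraphs) might look roughly like the sketch below. This is an assumption about the approach, not the Exsto code: it treats lines starting with `>` as quoted replies and blank lines as paragraph breaks. The first sentence of the sample comes from the message above; the quoted line and second paragraph are made up for illustration.

```python
import re


def strip_quoted(text):
    """Drop lines that look like quoted replies (leading '>')."""
    kept = [
        line for line in text.split("\n")
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept)


def segment_paragraphs(text):
    """Split on blank lines and re-join hard-wrapped lines within a paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        joined = " ".join(
            line.strip() for line in block.split("\n") if line.strip()
        )
        if joined:
            paragraphs.append(joined)
    return paragraphs


# Hypothetical message body; only the first sentence is from the slide.
message = (
    "\nOnly fit the data in memory where you want to run the iterative\n"
    "algorithm.\n\n> quoted reply line\n\nSecond paragraph."
)
paragraphs = segment_paragraphs(strip_quoted(message))
```

Real email threads use several quoting conventions (attribution lines, inline signatures), so a production filter would need more rules than this.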
[Diagram: parser pipeline. Labels: message JSON; TextBlob: segment sentences; TextBlob: tag and lemmatize words; TextBlob: sentiment analysis; Py: generate skip-grams; Treebank, WordNet; parsed JSON]
Data Prep: Parser pipeline
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
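The `tile` field in the parsed JSON appears to hold pairs of word ids that occur near each other, which later become edges in the word graph. A minimal sketch of generating such pairs follows; the function name and the `window` parameter are assumptions, with `window=1` reproducing the adjacent pairs shown in the sample.

```python
def tile_pairs(word_ids, window=1):
    """Pair each word id with the next `window` ids that follow it."""
    pairs = []
    for i, left in enumerate(word_ids):
        for right in word_ids[i + 1 : i + 1 + window]:
            pairs.append([left, right])
    return pairs


adjacent = tile_pairs([1, 2, 3, 4])
skipping = tile_pairs([1, 2, 3, 4], window=2)
```

A wider window links words that are separated by intervening tokens, which is the "skip" in skip-grams.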
Data Prep: Example data
Example data from the Apache Spark email list 

is available as JSON on S3:
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json
Data Prep: code examples…
Let’s check
some code!
WTF ML Frameworks?
WTF ML Frameworks? Microservices meet Parallel Processing
[Diagram: the same pipeline as before, from email archives to community insights and community leaderboards]
not so big data… relatively big compute…
• Big Compute, not Big Data: MBs per month, organized as millions of elements in a graph
• The required libraries (NLTK, etc.) are nearly 1000x larger than the data!
• This data does not change, does not need to be recomputed…
• Also: assigning unique IDs to entities during NLP parsing … that doesn’t readily fit Spark’s compute model (immutable data)
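The unique-ID point is about shared mutable state: every parser needs to see the same word-to-id mapping and extend it atomically. The sketch below shows that get-or-create behavior with an in-memory, lock-protected dictionary. `WordIdRegistry` is a hypothetical stand-in for whatever small service holds the mapping, not code from the project.

```python
import threading


class WordIdRegistry:
    """In-memory stand-in for a shared service that hands out unique word ids."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def get_id(self, word):
        # Get-or-create must be atomic so two callers never mint
        # different ids for the same word.
        with self._lock:
            if word not in self._ids:
                self._ids[word] = len(self._ids) + 1
            return self._ids[word]


registry = WordIdRegistry()
ids = [registry.get_id(word) for word in ["only", "fit", "only", "data"]]
```

Inside a distributed job each executor would hold its own copy of such a dictionary, so the ids would diverge; keeping the mapping in one small service sidesteps that.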
WTF ML Frameworks? Microservices meet Parallel Processing
Personal observation:
• There’s an unfortunate tendency within machine
learning frameworks to try to subsume all of the
data handling within the framework…
• In the case of Spark, app performance would
already be upside-down just by distributing
NLTK across all of the executors
• Also, installing NLTK data can be “interesting”
WTF ML Frameworks???
Keep in mind that “One Size Fits All” is an anti-pattern, especially for Big Data tools:
• consider provisioning cost vs. frequency of use
• serialization overhead in workflows
• be mindful of crafting the “working set” for memory resources
WTF ML Frameworks? Microservices meet Parallel Processing
instead of OSFA… (yes, yes, Big Data `blasphemy`, sigh)
[Diagram: revised architecture. Labels: containerized microservices; Flask; Redis; email archives; Scraper / Parser; NLTK data; Unique Word IDs; Data Prep; Features; Explore; SparkSQL; Spark executors; Mesos / DCOS]
TextRank in Spark
TextRank: original paper
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (July 2004)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/papers.html
http://www.cse.unt.edu/~tarau/
TextRank: other implementations
Jeff Kubina (Perl / English):
http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Paco Nathan (Hadoop / English+Spanish):
https://github.com/ceteri/textrank/
Karin Christiasen (Java / Icelandic):
https://github.com/karchr/icetextsum
TextRank: Spark-based pipeline
[Diagram: Spark-based TextRank pipeline. Labels: parsed JSON; Spark: create word graph; RDD: word graph; NetworkX: visualize graph; GraphX: run TextRank; Spark: extract phrases; ranked phrases]
TextRank: raw text input
TextRank: data results
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[Figure: graph of the stems compat, system, linear, constraint; callouts 1, 2, 3]
TextRank: dependencies
https://en.wikipedia.org/wiki/PageRank
TextRank: how it works
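In outline, TextRank keeps the content words of a text, links words that occur near each other, and ranks the resulting graph with PageRank. The sketch below runs that outline on the tagged tokens from the data-results slide. It is a simplified illustration in plain Python, not the Spark/GraphX pipeline from the talk: keeping only noun and adjective tags, the window of 2, the damping factor 0.85, and the fixed iteration count are all assumptions.

```python
# Tagged tokens, copied from the "data results" slide.
tokens = [
    {"index": 0, "stem": "compat", "tag": "NNP", "word": "compatibility"},
    {"index": 1, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 2, "stem": "system", "tag": "NNS", "word": "systems"},
    {"index": 3, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 4, "stem": "linear", "tag": "JJ", "word": "linear"},
    {"index": 5, "stem": "constraint", "tag": "NNS", "word": "constraints"},
]


def build_word_graph(tokens, window=2):
    """Link stems of nouns/adjectives that occur within `window` kept words."""
    kept = [t["stem"] for t in tokens if t["tag"].startswith(("NN", "JJ"))]
    graph = {stem: set() for stem in kept}
    for i, a in enumerate(kept):
        for b in kept[i + 1 : i + 1 + window]:
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank on an undirected graph stored as adjacency sets."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / n + damping * incoming
        rank = updated
    return rank


word_graph = build_word_graph(tokens)
ranks = pagerank(word_graph)
```

With this window, "system" and "linear" each have three neighbors and end up ranked above "compat" and "constraint". Adjacent high-ranking words are then typically merged back into candidate phrases.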
TextRank: code examples…
Let’s check
some code!
Social Graph
Social Graph: use GraphX to run graph analytics
// run graph analytics
// assumes nodes: RDD[(VertexId, String)] and edges: RDD[Edge[Int]] were built earlier
import org.apache.spark.graphx._

val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

// collect to the driver so the ranked output prints in sorted order
r.join(nodes).sortBy(_._2._1, ascending = false).collect().foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// strongly connected components (at most 10 iterations)
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
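The max-degree reduce is easy to check by hand on a small graph. This Python sketch applies the same pairwise reduce to the six edges from the earlier Peep example (edge attributes dropped); it mirrors the logic of the Scala snippet rather than calling GraphX.

```python
from collections import Counter
from functools import reduce

# (src, dst) pairs from the Peep example earlier in the deck.
edges = [(2, 1), (2, 4), (3, 2), (3, 5), (4, 1), (5, 3)]


def max_by_degree(a, b):
    """Same reduce as the Scala `max`: keep the pair with the larger degree."""
    return a if a[1] > b[1] else b


in_degrees = Counter(dst for _, dst in edges)
out_degrees = Counter(src for src, _ in edges)
degrees = in_degrees + out_degrees  # total degree per vertex

max_in = reduce(max_by_degree, in_degrees.items())
max_out = reduce(max_by_degree, out_degrees.items())
max_total = reduce(max_by_degree, degrees.items())
```

When several vertices tie for the maximum, as vertices 2 and 3 do for out-degree here, which one the reduce returns depends on the order of the input, and the same holds for a distributed reduce.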
Social Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))
// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
Social Graph: code examples…
Let’s check
some code!
Whither Next?
Misc., Etc., Maybe:
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html
[Diagram: possible next steps. Labels: ranked phrases; GraphX: run Con.Comp.; MLlib: run Word2Vec; aggregated by topic; MLlib: run KMeans; topic vectors; better than LDA?]
features… models… insights…
O’Reilly Studios, O’Reilly Learning:
Embracing Jupyter Notebooks at O'Reilly
https://beta.oreilly.com/ideas/jupyter-at-oreilly
Andrew Odewahn, 2015-05-07
“O'Reilly Media is using our Atlas platform to make Jupyter Notebooks a first class authoring environment for our publishing program.”
Jupyter, Thebe, Docker, Mesos, etc.
Thank you
presenter:
Just Enough Math
O’Reilly (2014)
justenoughmath.com

preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/
Intro to Apache Spark
O’Reilly (2015)
shop.oreilly.com/product/0636920036807.do

Microservices, containers, and machine learning

  • 1. Microservices, containers, and machine learning 2015-07-23 • PDX Paco Nathan, @pacoid
 O’Reilly Learning Licensed under a Creative Commons Attribution- NonCommercial-NoDerivatives 4.0 International License http://www.oscon.com/open-source-2015/public/schedule/detail/41579
  • 4. • generalized patterns
 unified engine for many use cases • lazy evaluation of the lineage graph
 reduces wait states, better pipelining • generational differences in hardware
 off-heap use of large memory spaces • functional programming / ease of use
 reduction in cost to maintain large apps • lower overhead for starting jobs • less expensive shuffles Spark Brief: Key Distinctions vs. MapReduce
  • 7. GraphX: spark.apache.org/docs/latest/graphx- programming-guide.html Key Points: • graph-parallel systems • emphasis on integrated workflows • optimizations
  • 8. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
 J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
 graphlab.org/files/osdi2012-gonzalez-low-gu-bickson- guestrin.pdf Pregel: Large-scale graph computing at Google
 Grzegorz Czajkowski, et al.
 googleresearch.blogspot.com/2009/06/large-scale-graph- computing-at-google.html GraphX: Graph Analytics in Spark
 Ankur Dave, Databricks
 spark-summit.org/east-2015/talk/graphx-graph- analytics-in-spark Topic modeling with LDA: MLlib meets GraphX
 Joseph Bradley, Databricks
 databricks.com/blog/2015/03/25/topic-modeling-with- lda-mllib-meets-graphx.html GraphX: Further Reading…
  • 9. GraphX: Compose Node + Edge RDDs into a Graph val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…) val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…) val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)
  • 10. // http://spark.apache.org/docs/latest/graphx-programming-guide.html import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD case class Peep(name: String, age: Int) val nodeArray = Array( (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), (5L, Peep("Leslie", 45)) ) val edgeArray = Array( Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 5L, 3), Edge(4L, 1L, 1), Edge(5L, 3L, 9) ) val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray) val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD) val results = g.triplets.filter(t => t.attr > 7) for (triplet <- results.collect) { println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") } GraphX: Example – simple traversals
  • 11. GraphX: Example – routing problems cost 4 node 0 node 1 node 3 node 2 cost 3 cost 1 cost 2 cost 1 What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Dijkstra
  • 14. Graph Analytics: terminology • many real-world problems are often represented as graphs • graphs can generally be converted into sparse matrices (bridge to linear algebra) • eigenvectors find the stable points in 
 a system defined by matrices – which 
 may be more efficient to compute • beyond simpler graphs, complex data 
 may require work with tensors
  • 15. Suppose we have a graph as shown below: We call x a vertex (sometimes called a node) An edge (sometimes called an arc) is any line connecting two vertices Graph Analytics: example v u w x
  • 16. We can represent this kind of graph as an adjacency matrix: • label the rows and columns based 
 on the vertices • entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise Graph Analytics: representation v u w x u v w x u 0 1 0 1 v 1 0 1 1 w 0 1 0 1 x 1 1 1 0
  • 17. An adjacency matrix always has certain properties: • it is symmetric, i.e., A = AT • it has real eigenvalues Therefore algebraic graph theory bridges between linear algebra and graph theory Graph Analytics: algebraic graph theory
  • 18. Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms University of Florida Sparse Matrix Collection
 cise.ufl.edu/ research/sparse/ matrices/ Graph Analytics: beauty in sparsity
  • 19. Algebraic GraphTheory
 Norman Biggs
 Cambridge (1974)
 amazon.com/dp/0521458978 Graph Analysis andVisualization
 Richard Brath, David Jonker
 Wiley (2015)
 shop.oreilly.com/product/9781118845844.do See also examples in: Just Enough Math Graph Analytics: resources
  • 20. Although tensor factorization is considered problematic, it may provide more general case solutions, and some work leverages Spark: TheTensor Renaissance in Data Science
 Anima Anandkumar @UC Irvine
 radar.oreilly.com/2015/05/the-tensor- renaissance-in-data-science.html Spacey RandomWalks and Higher Order Markov Chains
 David Gleich @Purdue
 slideshare.net/dgleich/spacey-random-walks- and-higher-order-markov-chains Graph Analytics: tensor solutions emerging
  • 21. Although tensor problematic, it may provide more general case solutions, and some work leverages Spark: TheTensor Renaissance in Data Science Anima Anandkumar radar.oreilly.com/2015/05/the-tensor- renaissance-in-data-science.html Spacey RandomWalks and Higher Order Markov Chains David Gleich slideshare.net/dgleich/spacey-random-walks- and-higher-order-markov-chains Graph Analytics: watch this space carefully
  • 23. Data Prep: Exsto Project Overview https://github.com/ceteri/exsto/ • insights about dev communities, via data mining their email forums • works with any Apache project email archive • applies NLP and ML techniques to analyze message threads • graph analytics surface themes and interactions • results provide feedback for communities, e.g., leaderboards
  • 24. Data Prep: Exsto Project Overview – four links https://github.com/ceteri/exsto/ http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf http://mail-archives.apache.org/mod_mbox/spark-user/ http://goo.gl/2YqJZK
  • 25. Data Prep: Scraper pipeline https://github.com/ceteri/exsto/ +
  • 26. Data Prep: Scraper pipeline Typical data rates, e.g., for dev@spark.apache.org: • ~2K msgs/month • ~18 MB/month parsed in JSON Six months’ list activity represents a graph of: • 1882 senders • 1,762,113 nodes • 3,232,174 edges A large graph?! In any case, it satisfies definition of a 
 graph-parallel system – lots of data locality to leverage
  • 27. Data Prep: idealized ML workflow… evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms
  • 28. Data Prep: Microservices meet Parallel Processing services email archives community leaderboards SparkSQL Data Prep Features Explore Scraper / Parser NLTK data Unique Word IDs TextRank, Word2Vec, etc. community insights not so big data… relatively big compute… ( we’ll come back to this point! )
  • 29. Data Prep: Scraper pipeline message JSON Py filter quoted content Apache email list archive urllib2 crawl monthly list by date Py segment paragraphs
  • 30. Data Prep: Scraper pipeline [diagram labels: Apache email list archive; urllib2: crawl monthly list by date; Py: filter quoted content; Py: segment paragraphs; message JSON] message JSON:
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
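The two "Py" stages named on this slide (filter quoted content, segment paragraphs) can be sketched roughly as follows. This is a minimal illustration, not the Exsto code: the function names, the regular expressions, and the quoted-reply lines in the sample are assumptions made up for the example; only the first paragraph of text comes from the message shown above.

```python
import json
import re


def filter_quoted(text):
    """Drop quoted reply lines ('>' prefix) and 'On ... wrote:' attribution lines."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith(">"):
            continue
        if re.match(r"^On .+ wrote:$", stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


def segment_paragraphs(text):
    """Split on blank lines, then collapse whitespace inside each paragraph."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


# First paragraph is from the slide; the quoted reply is invented for illustration.
raw = (
    "\nOnly fit the data in memory where you want to run the iterative\n"
    "algorithm....\n\n"
    "On Tue, Sep 30, 2014, someone wrote:\n"
    "> how much memory do I need?\n"
)

message = {
    "subject": "Re: memory vs data_size",
    "paragraphs": segment_paragraphs(filter_quoted(raw)),
}
print(json.dumps(message, indent=2))
```

Real mailing-list archives quote replies in many other ways (indented blocks, forwarded headers, signatures), so a production filter would need more rules than these two.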
  • 32. Data Prep: Parser pipeline [diagram labels: message JSON; TextBlob: segment sentences; TextBlob: tag and lemmatize words; TextBlob: sentiment analysis; Py: generate skip-grams; Treebank, WordNet; parsed JSON] parsed JSON:
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
message JSON:
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
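The "tile" field in the parsed JSON looks like pairs of neighboring word IDs, which is what the "generate skip-grams" stage would produce. A minimal sketch of that idea follows; the function name `tile_pairs` and the `window` parameter are hypothetical, and reading "tile" as adjacent-ID pairs is an assumption based only on the sample values shown on the slide.

```python
def tile_pairs(word_ids, window=1):
    """Pair each word ID with the IDs that follow it within `window` positions."""
    pairs = []
    for i, left in enumerate(word_ids):
        for right in word_ids[i + 1 : i + 1 + window]:
            pairs.append([left, right])
    return pairs


ids = [1, 2, 3, 4]            # word IDs in sentence order
print(tile_pairs(ids))            # adjacent pairs only
print(tile_pairs(ids, window=2))  # also skips over one intervening word
```

With `window=1` this yields plain bigrams of IDs; widening the window adds the "skip" pairs that later become candidate edges in a word graph.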
  • 33. Data Prep: Example data Example data from the Apache Spark email list is available as JSON on S3: • https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json • https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json
  • 34. Data Prep: code examples… Let’s check some code!
  • 36. WTF ML Frameworks? Microservices meet Parallel Processing: not so big data… relatively big compute… [diagram labels: email archives; services; Scraper / Parser; NLTK data; Unique Word IDs; SparkSQL; Data Prep; Features; Explore; TextRank, Word2Vec, etc.; community insights; community leaderboards]
  • 37. WTF ML Frameworks? Microservices meet Parallel Processing • Big Compute, not Big Data: MBs per month, organized as millions of elements in a graph • The required libraries (NLTK, etc.) are nearly 1000x larger than the data! • This data does not change, does not need to be recomputed… • Also: assigning unique IDs to entities during NLP parsing… that doesn’t readily fit Spark’s compute model (immutable data)
  • 38. WTF ML Frameworks? Microservices meet Parallel Processing Personal observation: • There’s an unfortunate tendency within machine learning frameworks to try to subsume all of the data handling within the framework… • In the case of Spark, app performance would already be upside-down just by distributing NLTK across all of the executors • Also, installing NLTK data can be “interesting”
  • 39. WTF ML Frameworks???
  • 40. WTF ML Frameworks? Microservices meet Parallel Processing Keep in mind that “One Size Fits All” is an anti-pattern, especially for Big Data tools: • consider provisioning cost vs. frequency of use • serialization overhead in workflows • be mindful of crafting the “working set” for memory resources
  • 41. WTF ML Frameworks? Microservices meet Parallel Processing: instead of OSFA (One Size Fits All)… (yes, yes, Big Data `blasphemy`, sigh) [diagram labels: containerized microservices; Flask; Redis; Scraper / Parser; NLTK data; Unique Word IDs; email archives; Mesos / DCOS; Spark executors; SparkSQL; Data Prep; Features; Explore]
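To make the "Unique Word IDs" box concrete: assigning IDs needs shared mutable state, which is why it sits in a small service next to the parser rather than inside Spark's immutable-data model. The sketch below is a hypothetical in-memory stand-in; the class name `WordIdRegistry` is invented for illustration, and in the architecture on the slide the state would presumably live in a shared store such as Redis, fronted by a Flask endpoint.

```python
import threading


class WordIdRegistry:
    """Hands out one stable integer ID per distinct string (illustrative helper)."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def get_id(self, word):
        # The lock keeps check-then-assign atomic when several parsers call in.
        with self._lock:
            if word not in self._ids:
                self._ids[word] = len(self._ids) + 1
            return self._ids[word]


registry = WordIdRegistry()
lemmas = ["only", "fit", "the", "data", "fit"]
print([registry.get_id(w) for w in lemmas])
```

The repeated lemma gets the same ID both times, which is the property the downstream graph construction depends on.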
  • 43. TextRank: original paper TextRank: Bringing Order into Texts, Rada Mihalcea, Paul Tarau, Conference on Empirical Methods in Natural Language Processing (July 2004) https://goo.gl/AJnA76 http://web.eecs.umich.edu/~mihalcea/papers.html http://www.cse.unt.edu/~tarau/
  • 44. TextRank: other implementations Jeff Kubina (Perl / English): http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm Paco Nathan (Hadoop / English+Spanish): https://github.com/ceteri/textrank/ Karin Christiasen (Java / Icelandic): https://github.com/karchr/icetextsum
  • 45. TextRank: Spark-based pipeline [diagram labels: parsed JSON; Spark: create word graph; RDD: word graph; NetworkX: visualize graph; GraphX: run TextRank; Spark: extract phrases; ranked phrases]
  • 47. TextRank: data results "Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
[diagram: word graph over the stems compat, system, linear, constraint]
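As a rough illustration of how the tagged tokens above turn into ranked stems, here is a simplified TextRank-style sketch in plain Python. It is not the deck's Spark/GraphX pipeline: the function `textrank`, the choice to keep only noun and adjective tags, and linking only consecutive kept stems are all simplifying assumptions made for this example.

```python
def textrank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration over an undirected, unweighted word graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[m] / len(neighbors[m]) for m in neighbors[node])
            for node in neighbors
        }
    return rank


tokens = [
    {"index": 0, "stem": "compat", "tag": "NNP", "word": "compatibility"},
    {"index": 1, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 2, "stem": "system", "tag": "NNS", "word": "systems"},
    {"index": 3, "stem": "of", "tag": "IN", "word": "of"},
    {"index": 4, "stem": "linear", "tag": "JJ", "word": "linear"},
    {"index": 5, "stem": "constraint", "tag": "NNS", "word": "constraints"},
]

# Assumption: keep nouns and adjectives, then link consecutive kept stems.
keep = [t["stem"] for t in tokens if t["tag"].startswith(("NN", "JJ"))]
edges = list(zip(keep, keep[1:]))

ranks = textrank(edges)
for stem, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{stem:12s}{score:.4f}")
```

On this four-word chain the two interior stems (system, linear) outrank the two end stems, since they have more neighbors feeding them rank; a real run would use a wider co-occurrence window over a whole message.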
  • 52. Social Graph: use GraphX to run graph analytics
import org.apache.spark.graphx._

// run graph analytics
// (nodes and edges are assumed to be the vertex and edge RDDs defined elsewhere)
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// strongly connected components (at most 10 iterations)
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
  • 53. Social Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))
// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
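The three max-degree figures above relate in a simple way: a vertex's total degree is its in-degree plus its out-degree (here 126 + 170 = 296). A small plain-Python sketch of the same computation on an edge list follows; the function name `max_degrees` and the toy edges are made up for illustration and are not data from the mailing list.

```python
from collections import Counter


def max_degrees(edges):
    """Return (vertex, count) for the highest in-, out-, and total degree."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    total = in_deg + out_deg  # Counter addition sums counts per vertex

    def top(counts):
        return max(counts.items(), key=lambda kv: kv[1])

    return top(in_deg), top(out_deg), top(total)


# Toy directed edges (sender -> recipient); vertex 2 is the busiest.
toy_edges = [(1, 2), (3, 2), (2, 3), (2, 4), (2, 1)]
max_in, max_out, max_total = max_degrees(toy_edges)
print(max_in, max_out, max_total)
```

As in the GraphX output, one very active participant can top all three rankings at once.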
  • 54. Social Graph: code examples… Let’s check some code!
  • 56. Misc., Etc., Maybe: Feature learning with Word2Vec, Matt Krzus, www.yseam.com/blog/WV.html [diagram labels: ranked phrases; GraphX: run Con.Comp.; MLlib: run Word2Vec; aggregated by topic; MLlib: run KMeans; topic vectors] better than LDA? features… models… insights…
  • 58. O’Reilly Studios, O’Reilly Learning: Embracing Jupyter Notebooks at O'Reilly https://beta.oreilly.com/ideas/jupyter-at-oreilly Andrew Odewahn, 2015-05-07 “O'Reilly Media is using our Atlas platform to make Jupyter Notebooks a first class authoring environment for our publishing program.” Jupyter, Thebe, Docker, Mesos, etc.
  • 60. presenter: • Just Enough Math, O’Reilly (2014), justenoughmath.com, preview: youtu.be/TQ58cWgdCpA • monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/ • Intro to Apache Spark, O’Reilly (2015), shop.oreilly.com/product/0636920036807.do