Microservices, containers, and machine learning

2015-07-23 • PDX
Paco Nathan, @pacoid 
O’Reilly Learning
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
http://www.oscon.com/open-source-2015/public/schedule/detail/41579

Spark Brief: Key Distinctions vs. MapReduce
• generalized patterns: unified engine for many use cases
• lazy evaluation of the lineage graph: reduces wait states, better pipelining
• generational differences in hardware: off-heap use of large memory spaces
• functional programming / ease of use: reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles

Spark Brief: Smashing The Previous Petabyte Sort Record
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

GraphX Examples
[Figure: graph of four nodes (node 0 through node 3) with edge costs 4, 3, 1, 2, and 1]

GraphX:
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
• graph-parallel systems
• emphasis on integrated workflows
• optimizations

GraphX: Further Reading…
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Graph Analytics in Spark
Ankur Dave, Databricks
spark-summit.org/east-2015/talk/graphx-graph-analytics-in-spark
Topic modeling with LDA: MLlib meets GraphX
Joseph Bradley, Databricks
databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html

GraphX: Compose Node + Edge RDDs into a Graph
val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…)
val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…)
val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)

GraphX: Example – simple traversals
// http://spark.apache.org/docs/latest/graphx-programming-guide.html
// assumes a SparkContext named sc, as provided by spark-shell
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
case class Peep(name: String, age: Int)
// vertices: (id, attribute) pairs
val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
// edges: Edge(source id, destination id, attribute)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)
val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)
// keep only the triplets whose edge attribute exceeds 7
val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
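For readers without a Spark shell at hand, the triplet traversal in this example can be approximated in plain Python. This is only a single-machine sketch of the idea (join each edge with its source and destination attributes, then filter), not how GraphX implements it; the helper names `triplets` and `strong_links` are made up for illustration.

```python
from typing import NamedTuple


class Peep(NamedTuple):
    name: str
    age: int


class Edge(NamedTuple):
    src: int
    dst: int
    attr: int


# same vertices and edges as the Scala example
nodes = {
    1: Peep("Kim", 23), 2: Peep("Pat", 31), 3: Peep("Chris", 52),
    4: Peep("Kelly", 39), 5: Peep("Leslie", 45),
}
edges = [
    Edge(2, 1, 7), Edge(2, 4, 2), Edge(3, 2, 4),
    Edge(3, 5, 3), Edge(4, 1, 1), Edge(5, 3, 9),
]


def triplets(nodes, edges):
    """Yield (source attribute, edge attribute, destination attribute)."""
    for edge in edges:
        yield nodes[edge.src], edge.attr, nodes[edge.dst]


def strong_links(nodes, edges, threshold=7):
    """Return (source name, destination name) for edges above the threshold."""
    return [(src.name, dst.name)
            for src, attr, dst in triplets(nodes, edges)
            if attr > threshold]


for src_name, dst_name in strong_links(nodes, edges):
    print(f"{src_name} loves {dst_name}")
```

Only the edge from vertex 5 to vertex 3 has an attribute above 7, so a single line is printed.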

GraphX: Example – routing problems
[Figure: graph of four nodes (node 0 through node 3) with edge costs 4, 3, 1, 2, and 1]
What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Dijkstra's algorithm
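A minimal sketch of that routing question in plain Python follows. The transcript does not preserve which nodes each edge connects, so the edge list below is a hypothetical layout that merely reuses the five costs from the figure. The function runs Dijkstra's algorithm backwards from the target, which gives the cost to reach it from every other node in one pass.

```python
import heapq

# HYPOTHETICAL edge layout (src, dst, cost); only the costs come from the slide
EDGES = [(1, 0, 4), (3, 1, 3), (2, 1, 1), (3, 2, 2), (2, 0, 1)]


def cost_to_target(edges, target):
    """Cheapest cost from every node that can reach `target`, via Dijkstra."""
    # index incoming edges so the search can walk backwards from the target
    incoming = {}
    for src, dst, cost in edges:
        incoming.setdefault(dst, []).append((src, cost))
        incoming.setdefault(src, [])

    best = {target: 0}
    heap = [(0, target)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for src, cost in incoming[node]:
            candidate = dist + cost
            if candidate < best.get(src, float("inf")):
                best[src] = candidate
                heapq.heappush(heap, (candidate, src))
    return best


print(cost_to_target(EDGES, 0))
```

Nodes that cannot reach the target are simply absent from the result. Edge costs must be non-negative for Dijkstra's algorithm to be correct.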

GraphX: code examples…
Let’s check
some code!

Graph Analytics: terminology
• many real-world problems can be represented as graphs
• graphs can generally be converted into
sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in  
a system defined by matrices – which  
may be more efficient to compute
• beyond simpler graphs, complex data  
may require work with tensors

Graph Analytics: example
Suppose we have a graph as shown below:
[Figure: graph with four vertices labeled u, v, w, x]
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line connecting two vertices

Graph Analytics: representation
We can represent this kind of graph as an adjacency matrix:
• label the rows and columns based on the vertices
• entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise
    u  v  w  x
u   0  1  0  1
v   1  0  1  1
w   0  1  0  1
x   1  1  1  0
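The two rules for filling in the matrix can be sketched in a few lines of Python. The edge list is read off the matrix on this slide (u-v, u-x, v-w, v-x, w-x); the function name `adjacency_matrix` is just an illustrative choice.

```python
VERTICES = ["u", "v", "w", "x"]
EDGES = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]


def adjacency_matrix(vertices, edges):
    """Build a dense 0/1 adjacency matrix for an undirected graph."""
    index = {vertex: i for i, vertex in enumerate(vertices)}
    size = len(vertices)
    matrix = [[0] * size for _ in range(size)]
    for a, b in edges:
        # an undirected edge marks both (a, b) and (b, a)
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix


A = adjacency_matrix(VERTICES, EDGES)
for label, row in zip(VERTICES, A):
    print(label, row)
```

A dense list of lists is fine for four vertices; for graphs of realistic size, most entries are zero and a sparse representation is the usual choice.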

Graph Analytics: algebraic graph theory
The adjacency matrix of an undirected graph always has certain properties:
• it is symmetric, i.e., A = A^T
• it has real eigenvalues
Therefore algebraic graph theory bridges between linear algebra and graph theory
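Both properties can be checked by hand for the four-vertex example. The sketch below uses plain Python and power iteration, one simple way to estimate the largest eigenvalue; in practice a linear algebra library would be the more likely choice. Power iteration converges here because the example graph is connected and contains a triangle; that is an assumption about the input, not a general guarantee.

```python
# adjacency matrix of the u, v, w, x example
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]


def is_symmetric(matrix):
    """True when matrix[i][j] == matrix[j][i] for every pair of indices."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(size))


def matvec(matrix, vector):
    """Multiply a dense matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, steps=200):
    """Estimate the dominant eigenvalue and a unit-length eigenvector."""
    vector = [1.0] * len(matrix)
    for _ in range(steps):
        product = matvec(matrix, vector)
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    # Rayleigh quotient; the vector already has unit length
    eigenvalue = sum(a * b for a, b in zip(vector, matvec(matrix, vector)))
    return eigenvalue, vector


print(is_symmetric(A))
eigenvalue, eigenvector = power_iteration(A)
print(eigenvalue, eigenvector)
```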

Graph Analytics: beauty in sparsity
Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/

Graph Analytics: resources
Algebraic Graph Theory
Norman Biggs
Cambridge (1974)
amazon.com/dp/0521458978
Graph Analysis and Visualization
Richard Brath, David Jonker
Wiley (2015)
shop.oreilly.com/product/9781118845844.do
See also examples in: Just Enough Math

Graph Analytics: tensor solutions emerging
Although tensor factorization is considered problematic, it may provide more general case solutions, and some work leverages Spark:
The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains

Graph Analytics: watch this space carefully

Data Prep: Exsto Project Overview
https://github.com/ceteri/exsto/
• insights about dev communities, via data mining
their email forums
• works with any Apache project email archive
• applies NLP and ML techniques to analyze
message threads
• graph analytics surface themes and interactions
• results provide feedback for communities, e.g.,
leaderboards

Data Prep: Exsto Project Overview – four links
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
http://mail-archives.apache.org/mod_mbox/spark-user/
http://goo.gl/2YqJZK

Data Prep: Scraper pipeline

Typical data rates, e.g., for dev@spark.apache.org:
• ~2K msgs/month
• ~18 MB/month parsed in JSON
Six months’ list activity represents a graph of:
• 1882 senders
• 1,762,113 nodes
• 3,232,174 edges
A large graph?! In any case, it satisfies the definition of a graph-parallel system: lots of data locality to leverage

Data Prep: idealized ML workflow…
[Figure: idealized ML workflow, circa 2010, spanning representation, optimization, and evaluation. Labels shown: ETL into cluster/cloud; Data Prep; Features; Explore; Unsupervised Learning; Learners, Parameters; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; visualize, reporting; bar developers; foo algorithms]

Data Prep: Microservices meet Parallel Processing
[Figure: pipeline from email archives to community insights. Labels shown: services; Scraper / Parser; NLTK data; Unique Word IDs; Data Prep; Features; Explore; SparkSQL; TextRank, Word2Vec, etc.; community leaderboards]
not so big data… relatively big compute…
(we’ll come back to this point!)

[Figure: scraper pipeline from Apache email list archive to message JSON. Stages shown: urllib2: crawl monthly list by date; Py: filter quoted content; Py: segment paragraphs]

{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
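The two "Py" stages of the scraper (filter quoted content, segment paragraphs) might look roughly like the sketch below. This is not the Exsto code. It assumes quoted lines begin with ">" and that reply attributions look like "On ... wrote:", and the function names are hypothetical. The second and third paragraphs of the sample message are invented for the demonstration.

```python
import re

# ASSUMPTION: reply attribution lines have the common "On <date>, <who> wrote:" shape
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def filter_quoted(text):
    """Drop quoted reply lines and attribution lines from an email body."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">") or ATTRIBUTION.match(stripped):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept)


def segment_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


raw_body = (
    "Only fit the data in memory where you want to run the iterative\n"
    "algorithm.\n"
    "\n"
    "On Tue, Sep 30, 2014, someone wrote:\n"
    "> how much memory do I need?\n"
    "> thanks\n"
    "\n"
    "For the rest, stream it from disk.\n"
)

for paragraph in segment_paragraphs(filter_quoted(raw_body)):
    print(paragraph)
```

Real mailing-list traffic has many more quoting styles (forwarded headers, top-posting, signatures), so a production filter would need more rules than these two.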

Data Prep: Parser pipeline
[Figure: parser pipeline from message JSON to parsed JSON. Stages shown: TextBlob: segment sentences; TextBlob: tag and lemmatize words; TextBlob: sentiment analysis; Py: generate skip-grams. Resources shown: Treebank, WordNet]

{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
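The "tile" field in the parsed record appears to hold pairs of neighbouring word IDs, which is what a skip-gram generator with no skipping would emit. The sketch below is one hedged reading of the "generate skip-grams" stage; the function name and the `max_skip` parameter are assumptions, not taken from Exsto.

```python
def skip_grams(word_ids, max_skip=0):
    """Pair each word ID with the IDs that follow it, skipping at most max_skip words."""
    pairs = []
    for i, left in enumerate(word_ids):
        last = min(i + 2 + max_skip, len(word_ids))
        for j in range(i + 1, last):
            pairs.append([left, word_ids[j]])
    return pairs


# word IDs as they appear in the first positions of the "graf" entries
ids = [1, 2, 3, 4]
print(skip_grams(ids))              # adjacent pairs only
print(skip_grams(ids, max_skip=1))  # also allow one skipped word
```

With `max_skip=0` the output for IDs 1 through 4 matches the visible start of the "tile" list on the slide.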

Data Prep: Example data
Example data from the Apache Spark email list is available as JSON on S3:
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/original/2015_01.json
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/exsto/parsed/2015_01.json

Data Prep: code examples…
Let’s check
some code!

WTF ML Frameworks? Microservices meet Parallel Processing
[Figure: pipeline from email archives to community insights. Labels shown: services; Scraper / Parser; NLTK data; Unique Word IDs; Data Prep; Features; Explore; SparkSQL; TextRank, Word2Vec, etc.; community leaderboards]
not so big data… relatively big compute…

• Big Compute, not Big Data: MBs per month, organized as millions of elements in a graph
• The required libraries (NLTK, etc.) are nearly 1000x larger than the data!
• This data does not change, does not need to be recomputed…
• Also: assigning unique IDs to entities during NLP parsing … that doesn’t readily fit Spark’s compute model (immutable data)
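The last point is easier to see in code: handing out unique IDs means every caller reads and updates one shared, mutable table. The sketch below keeps that table in a Python dict guarded by a lock. It is a stand-in for whatever shared store a service would use (the deck later lists Redis among the microservices), and the class and method names are hypothetical.

```python
import threading


class UniqueIdRegistry:
    """Hand out one stable integer ID per distinct word (hypothetical helper)."""

    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def get_id(self, word):
        # the lookup and the insert must happen atomically,
        # otherwise two callers could assign different IDs to one word
        with self._lock:
            if word not in self._ids:
                self._ids[word] = len(self._ids) + 1
            return self._ids[word]


registry = UniqueIdRegistry()
print([registry.get_id(word) for word in ["only", "fit", "the", "only"]])
```

The shared mutable state is the awkward part for a model built around immutable data, which is one reason to keep this step in a small service outside the parallel job.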

Personal observation:
• There’s an unfortunate tendency within machine
learning frameworks to try to subsume all of the
data handling within the framework…
• In the case of Spark, app performance would
already be upside-down just by distributing
NLTK across all of the executors
• Also, installing NLTK data can be “interesting”

WTF ML Frameworks?
WTF ML Frameworks ???

Keep in mind that “One Size Fits All” is an anti-pattern, especially for Big Data tools:
• consider provisioning cost vs. frequency of use
• serialization overhead in workflows
• be mindful of crafting the “working set” for memory resources

instead of OSFA…  
(yes, yes, Big Data `blasphemy`, sigh)
[Figure: proposed architecture. Labels shown: containerized microservices; Flask; Redis; email archives; Scraper / Parser; NLTK data; Unique Word IDs; SparkSQL; Data Prep; Features; Explore; Spark executors; Mesos / DCOS]

TextRank: original paper
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (July 2004)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/papers.html
http://www.cse.unt.edu/~tarau/

TextRank: other implementations
Jeff Kubina (Perl / English):
http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Paco Nathan (Hadoop / English+Spanish):
https://github.com/ceteri/textrank/
Karin Christiasen (Java / Icelandic):
https://github.com/karchr/icetextsum

TextRank: Spark-based pipeline
[Figure: pipeline from parsed JSON to ranked phrases. Stages shown: Spark: create word graph; RDD word graph; NetworkX: visualize graph; GraphX: run TextRank; Spark: extract phrases]

TextRank: data results
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[Figure: word graph over the stems compat, system, linear, constraint]

TextRank: how it works
https://en.wikipedia.org/wiki/PageRank
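As a rough illustration of the idea, the sketch below runs a PageRank-style iteration over a small word graph built from the tagged example phrase "Compatibility of systems of linear constraints". It is a plain-Python approximation for one phrase, not the Spark and GraphX pipeline from the deck. The co-occurrence window of 2, the damping factor of 0.85, and the choice to keep only noun and adjective tags are assumptions made for the demonstration.

```python
# tagged tokens from the example phrase
TOKENS = [
    {"index": 0, "stem": "compat", "tag": "NNP"},
    {"index": 1, "stem": "of", "tag": "IN"},
    {"index": 2, "stem": "system", "tag": "NNS"},
    {"index": 3, "stem": "of", "tag": "IN"},
    {"index": 4, "stem": "linear", "tag": "JJ"},
    {"index": 5, "stem": "constraint", "tag": "NNS"},
]

KEEP_TAGS = ("NN", "JJ")  # ASSUMPTION: keep nouns and adjectives only


def build_word_graph(tokens, window=2):
    """Link kept stems that occur within `window` positions of each other."""
    kept = [t for t in tokens if t["tag"].startswith(KEEP_TAGS)]
    graph = {t["stem"]: set() for t in kept}
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if b["index"] - a["index"] <= window and a["stem"] != b["stem"]:
                graph[a["stem"]].add(b["stem"])
                graph[b["stem"]].add(a["stem"])
    return graph


def text_rank(graph, damping=0.85, iterations=50):
    """Iterate rank = (1 - d) + d * sum(neighbour rank / neighbour degree)."""
    ranks = {node: 1.0 for node in graph}
    for _ in range(iterations):
        ranks = {
            node: (1 - damping) + damping * sum(
                ranks[other] / len(graph[other]) for other in graph[node])
            for node in graph
        }
    return ranks


graph = build_word_graph(TOKENS)
for stem, rank in sorted(text_rank(graph).items(), key=lambda kv: -kv[1]):
    print(f"{stem:12s} {rank:.3f}")
```

With these settings the four kept stems form a chain, so the two middle stems (system, linear) end up ranked above the two ends.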

TextRank: code examples…
Let’s check
some code!

Social Graph: use GraphX to run graph analytics
// assumes nodes: RDD[(VertexId, String)] and edges: RDD[Edge[Int]] are already defined
import org.apache.spark.graphx._
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
// collect to the driver so the sorted order is preserved when printing
r.join(nodes).sortBy(_._2._1, ascending=false).collect.foreach(println)
// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)
// strongly connected components
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)

Social Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))
// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)

Social Graph: code examples…
Let’s check
some code!

Misc., Etc., Maybe:
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html
[Figure: pipeline from ranked phrases to topic vectors. Stages shown: GraphX: run Con.Comp.; MLlib: run Word2Vec; aggregated by topic; MLlib: run KMeans. Annotation: better than LDA?]
features… models… insights…

O’Reilly Studios, O’Reilly Learning:
Embracing Jupyter Notebooks at O'Reilly
https://beta.oreilly.com/ideas/jupyter-at-oreilly
Andrew Odewahn, 2015-05-07
“O'Reilly Media is using our Atlas platform to make Jupyter Notebooks a first class authoring environment for our publishing program.”
Jupyter, Thebe, Docker, Mesos, etc.

presenter:
Just Enough Math
O’Reilly (2014)
justenoughmath.com 
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/
Intro to Apache Spark
O’Reilly (2015)
shop.oreilly.com/product/0636920036807.do
