Microservices, Containers,
and Machine Learning
Paco Nathan, @pacoid
Downloads
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

• follow the license agreement instructions
• then click the download for your OS
• need JDK instead of JRE (for Maven, etc.)
• JDK 6, 7, or 8 is fine
Downloads: Java JDK
For Python 2.7, check out Anaconda by
Continuum Analytics for a full-featured
platform:	

store.continuum.io/cshop/anaconda/
Downloads: Python
Let’s get started using Apache Spark, in just a few
easy steps… Download code from:

databricks.com/spark-training-resources#itas

or as a fallback: spark.apache.org/downloads.html

Also, the GitHub project:

github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
cd into the extracted “spark” directory,
then run:

./bin/spark-shell
Downloads: Spark
Spark Deconstructed
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster…
submitting the code above to be evaluated on it.

[diagram: a Driver coordinating three Workers]
At this point, take a look at the transformed
RDD operator graph:

scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
The first action triggers execution across the cluster:

[diagram: each Worker is assigned an HDFS block: block 1, block 2, block 3]
[diagram: each Worker reads its assigned HDFS block]
[diagram: each Worker processes its block and caches the data (cache 1, cache 2, cache 3), then returns its result for action 1 to the Driver]
The second action runs against the data cached during action 1:
[diagram: each Worker processes from cache, skipping the HDFS read]
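Note the pattern here: the transformations (filter, map) are lazy, and merely define the RDD operator graph; each action (count) launches a job. Because messages was cached while computing action 1, action 2 runs against data already in memory instead of re-reading from HDFS, which is why it returns so much faster.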
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html

Key Points:

• graph-parallel systems
• importance of workflows
• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
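With the sample data above, the only edge whose attribute exceeds 7 is Edge(5L, 3L, 9), so the loop prints a single line: Leslie loves Chris.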
GraphX: demo
TextRank Demo:

cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb

IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
Workflows
Typical Workflows:

[diagram, circa 2010: data → ETL into cluster/cloud → Data Prep → Features → Learners, Parameters; Unsupervised Learning and Explore feed back into Features; train set / test set → models → Evaluate → Optimize → Scoring against production data; visualize, reporting → actionable results → decisions, feedback; recurring themes: representation, evaluation, optimization; "foo algorithms", "bar developers", use cases, data pipelines]
Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:

• ~2K msgs/month
• ~6 MB as JSON
• ~13 MB parsed

Three months’ list activity represents a graph of:

• 1061 senders
• 753,400 nodes
• 1,027,806 edges

A big graph! However, it satisfies the definition of a
graph-parallel system; lots of data locality to leverage
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture
Martin Fowler
http://martinfowler.com/articles/microservices.html
Workflows: An Example…
Python-based service in a Docker container?

Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/

Docker Jumpstart
Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
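As a minimal sketch of what that unified access looks like from the shell (Spark 1.2-era API; the file path here is hypothetical), SQLContext can infer a schema directly from JSON and expose it to SQL:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// infer a schema from semi-structured JSON (path is hypothetical)
val msgs = sqlCtx.jsonFile("hdfs://.../messages.json")
msgs.registerTempTable("msg")

// then query it like any other table
val senders = sqlCtx.sql("SELECT DISTINCT sender FROM msg")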
This Workflow: Microservices meet Parallel Processing
[diagram: Apache email archives → Scraper / Parser microservice (NLTK) → data → SparkSQL for Data Prep, Features, Explore → Unique Word IDs → TextRank, Word2Vec, etc. → community insights, surfaced through services such as community leaderboards]

not so big data… relatively big compute…
Workflows: Scraper pipeline
[diagram: Apache email list archive → urllib2: crawl monthly list by date → Py: filter quoted content → Py: segment paragraphs → message JSON]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
Workflows: Parser pipeline

[diagram: message JSON → TextBlob: segment sentences → TextBlob: tag and lemmatize words (Treebank, WordNet) → TextBlob: sentiment analysis → Py: generate skip-grams → parsed JSON]
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
Workflows: TextRank pipeline
[diagram: parsed JSON → Spark: create word graph → word graph RDD → GraphX: run TextRank → Spark: extract phrases → ranked phrases; NetworkX: visualize graph]
Workflows: TextRank pipeline
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[diagram: word graph linking the stems compat, system, linear, constraint]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank – how it works
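For reference, the score that TextRank iterates to convergence, from the Mihalcea & Tarau paper, where $d$ is the damping factor (typically 0.85) and $w_{ij}$ are edge weights:

$$ WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) $$

The implementation below approximates this by running GraphX’s built-in pageRank over the word graph.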
TextRank impl
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """!
SELECT node_id, root !
FROM node !
WHERE id='%s' AND keep='1'!
""".format(msg_id)!
!
val n = sqlCtx.sql(sql.stripMargin).distinct()!
val nodes: RDD[(Long, String)] = n.map{ p =>!
(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])!
}!
nodes.collect()!
!
val sql = """!
SELECT node0, node1 !
FROM edge !
WHERE id='%s'!
""".format(msg_id)!
!
val e = sqlCtx.sql(sql.stripMargin).distinct()!
val edges: RDD[Edge[Int]] = e.map{ p =>!
Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)!
}!
edges.collect()
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

// simple median, used below as the threshold for keyphrase membership
def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

val min_rank = median(r.map(_._2).collect())
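For example, given collected ranks [0.8, 0.9, 1.2, 2.4], the even-sized branch applies and min_rank = (0.9 + 1.2) / 2 = 1.05; only words ranked at or above that median get pulled into keyphrases below.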
TextRank impl: join ranked words with parsed text
var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
TextRank impl: “pull strings” for the top-ranked keyphrases
s.foreach { x =>
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}

// flush any phrase still accumulating when the message ends
if (span.size > 0) {
  phrases += (span.mkString(" ") -> rank_sum)
}
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
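A quick way to eyeball the results from the shell, for instance:

summary.take(5).foreach { case (phrase, rank) =>
  println(f"$rank%.4f\t$phrase")
}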
Reply Graph
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"!
val n = sqlCtx.sql(sql).distinct()!
val nodes: RDD[(Long, String)] = n.map{ p =>!
(p(0).asInstanceOf[Long], p(1).asInstanceOf[String])!
}!
nodes.collect()!
!
val sql = "SELECT replier, sender, num FROM edge"!
val e = sqlCtx.sql(sql).distinct()!
val edges: RDD[Edge[Int]] = e.map{ p =>!
Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])!
}!
edges.collect()
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
node.join(scc).foreach(println)
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
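For a flavor of it, here is a minimal SSSP sketch using the GraphX Pregel API, in the style of the GraphX programming guide; the toy graph, weights, and source vertex below are made up for illustration:

import org.apache.spark.graphx._

// toy graph: edge attributes are distances (made-up values)
val triples = sc.parallelize(Array(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 4.0)
))
val graph = Graph.fromEdges(triples, defaultValue = 0.0)

val sourceId: VertexId = 1L

// initialize distances: 0.0 at the source, infinity everywhere else
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet => {  // send messages along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)  // merge messages
)

sssp.vertices.collect.foreach(println)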
Look Ahead: Where is this heading?
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html
[diagram: ranked phrases → GraphX: run Connected Components → MLlib: run Word2Vec → aggregated by topic → MLlib: run KMeans → topic vectors (better than LDA?)]

features… models… insights…
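A rough sketch of the MLlib portion of that pipeline; the toy corpus below is invented, with each word repeated often enough to clear Word2Vec's default frequency cutoff, and a real run would feed in tokenized phrases from the parser:

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// invented toy corpus, repeated so every word passes Word2Vec's minCount
val docs = sc.parallelize(
  Seq.fill(20)(Seq("spark", "streaming", "window")) ++
  Seq.fill(20)(Seq("parquet", "columnar", "storage"))
)

// learn word vectors
val w2v = new Word2Vec().fit(docs)

// represent each doc as the average of its word vectors
val vectors = docs.map { words =>
  val vs = words.map(w => w2v.transform(w).toArray)
  val sum = vs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  Vectors.dense(sum.map(_ / vs.size))
}.cache()

// cluster the doc vectors into topics
val model = KMeans.train(vectors, 2, 10)
model.clusterCenters.foreach(println)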
Resources
certification:

Apache Spark developer certificate program

• oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
MOOCs:

Anthony Joseph, UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar, UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:

spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK

video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
confs:

Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015

Spark Summit East
NYC, Mar 18-19
spark-summit.org/east

Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com

Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015

Spark Summit 2015
SF, Jun 15-17
spark-summit.org
books:

Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
presenter:

Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
