Microservices, Containers,
and Machine Learning
Paco Nathan, @pacoid
Downloads
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

• follow the license agreement instructions
• then click the download for your OS
• need JDK instead of JRE (for Maven, etc.)
• JDK 6, 7, or 8 is fine
Downloads: Java JDK
For Python 2.7, check out Anaconda by
Continuum Analytics for a full-featured
platform:	

store.continuum.io/cshop/anaconda/
Downloads: Python
Let’s get started using Apache Spark, in just a few
easy steps… Download code from:

databricks.com/spark-training-resources#itas

or as a fallback: spark.apache.org/downloads.html

Also, the GitHub project:

github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
cd into the extracted “spark” directory,
then run:

./bin/spark-shell
Downloads: Spark
Spark Deconstructed
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster…
submitting the code above to be evaluated on it.

[diagram: a Driver coordinating three Workers]
At this point, take a look at the transformed
RDD operator graph:

scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
The first action triggers execution across the cluster:

[diagram: each Worker is assigned an HDFS block: block 1, block 2, block 3]
[diagram: each Worker reads its assigned HDFS block]
[diagram: each Worker processes its block and caches the data (cache 1, cache 2, cache 3), then returns its result for action 1 to the Driver]
The second action runs against the data cached during action 1:
[diagram: each Worker processes from cache, skipping the HDFS read]
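Note the pattern here: the transformations (filter, map) are lazy, and merely define the RDD operator graph; each action (count) launches a job. Because messages was cached while computing action 1, action 2 runs against data already in memory instead of re-reading from HDFS, which is why it returns so much faster.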
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html

Key Points:

• graph-parallel systems
• importance of workflows
• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf

Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
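With the sample data above, the only edge whose attribute exceeds 7 is Edge(5L, 3L, 9), so the loop prints a single line: Leslie loves Chris.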
GraphX: demo
TextRank Demo:

cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb

IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
Workflows
Typical Workflows:

[diagram, circa 2010: data → ETL into cluster/cloud → Data Prep → Features → Learners, Parameters; Unsupervised Learning and Explore feed back into Features; train set / test set → models → Evaluate → Optimize → Scoring against production data; visualize, reporting → actionable results → decisions, feedback; recurring themes: representation, evaluation, optimization; "foo algorithms", "bar developers", use cases, data pipelines]
Workflows: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:

• ~2K msgs/month
• ~6 MB as JSON
• ~13 MB parsed

Three months’ list activity represents a graph of:

• 1061 senders
• 753,400 nodes
• 1,027,806 edges

A big graph! However, it satisfies the definition of a
graph-parallel system; lots of data locality to leverage
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices
Abel Avram
http://www.infoq.com/news/2014/05/microservices

DockerCon EU Keynote: State of the Art in Microservices
Adrian Cockcroft
https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/

Microservices Architecture
Martin Fowler
http://martinfowler.com/articles/microservices.html
Workflows: An Example…
Python-based service in a Docker container?

Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/

Docker Jumpstart
Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Michael Armbrust
databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
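As a minimal sketch of what that unified access looks like from the shell (Spark 1.2-era API; the file path here is hypothetical), SQLContext can infer a schema directly from JSON and expose it to SQL:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// infer a schema from semi-structured JSON (path is hypothetical)
val msgs = sqlCtx.jsonFile("hdfs://.../messages.json")
msgs.registerTempTable("msg")

// then query it like any other table
val senders = sqlCtx.sql("SELECT DISTINCT sender FROM msg")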
This Workflow: Microservices meet Parallel Processing
[diagram: Apache email archives → Scraper / Parser microservice (NLTK) → data → SparkSQL for Data Prep, Features, Explore → Unique Word IDs → TextRank, Word2Vec, etc. → community insights, surfaced through services such as community leaderboards]

not so big data… relatively big compute…
Workflows: Scraper pipeline
[diagram: Apache email list archive → urllib2: crawl monthly list by date → Py: filter quoted content → Py: segment paragraphs → message JSON]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n
}
Workflows: Parser pipeline

[diagram: message JSON → TextBlob: segment sentences → TextBlob: tag and lemmatize words (Treebank, WordNet) → TextBlob: sentiment analysis → Py: generate skip-grams → parsed JSON]
{
  "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "polr": 0.2,
  "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
  "size": 14,
  "subj": 0.7,
  "tile": [ [1, 2], [2, 3], [3, 4] ... ]
}
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
  "prev_thread": "",
  "sender": "Debasish Das <debasish.da...@gmail.com>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor
}
Workflows: TextRank pipeline
[diagram: parsed JSON → Spark: create word graph → word graph RDD → GraphX: run TextRank → Spark: extract phrases → ranked phrases; NetworkX: visualize graph]
Workflows: TextRank pipeline
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
[diagram: word graph linking the stems compat, system, linear, constraint]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank – how it works
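For reference, the score that TextRank iterates to convergence, from the Mihalcea & Tarau paper, where $d$ is the damping factor (typically 0.85) and $w_{ij}$ are edge weights:

$$ WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) $$

The implementation below approximates this by running GraphX’s built-in pageRank over the word graph.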
TextRank impl
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """!
SELECT node_id, root !
FROM node !
WHERE id='%s' AND keep='1'!
""".format(msg_id)!
!
val n = sqlCtx.sql(sql.stripMargin).distinct()!
val nodes: RDD[(Long, String)] = n.map{ p =>!
(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])!
}!
nodes.collect()!
!
val sql = """!
SELECT node0, node1 !
FROM edge !
WHERE id='%s'!
""".format(msg_id)!
!
val e = sqlCtx.sql(sql.stripMargin).distinct()!
val edges: RDD[Edge[Int]] = e.map{ p =>!
Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)!
}!
edges.collect()
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

// simple median, used below as the threshold for keyphrase membership
def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_<_).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}

val min_rank = median(r.map(_._2).collect())
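For example, given collected ranks [0.8, 0.9, 1.2, 2.4], the even-sized branch applies and min_rank = (0.9 + 1.2) / 2 = 1.05; only words ranked at or above that median get pulled into keyphrases below.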
TextRank impl: join ranked words with parsed text
var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
TextRank impl: “pull strings” for the top-ranked keyphrases
s.foreach { x =>
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}

// flush any phrase still accumulating when the message ends
if (span.size > 0) {
  phrases += (span.mkString(" ") -> rank_sum)
}
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
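A quick way to eyeball the results from the shell, for instance:

summary.take(5).foreach { case (phrase, rank) =>
  println(f"$rank%.4f\t$phrase")
}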
Reply Graph
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"!
val n = sqlCtx.sql(sql).distinct()!
val nodes: RDD[(Long, String)] = n.map{ p =>!
(p(0).asInstanceOf[Long], p(1).asInstanceOf[String])!
}!
nodes.collect()!
!
val sql = "SELECT replier, sender, num FROM edge"!
val e = sqlCtx.sql(sql).distinct()!
val edges: RDD[Edge[Int]] = e.map{ p =>!
Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])!
}!
edges.collect()
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
node.join(scc).foreach(println)
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))

// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
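For a flavor of it, here is a minimal SSSP sketch using the GraphX Pregel API, in the style of the GraphX programming guide; the toy graph, weights, and source vertex below are made up for illustration:

import org.apache.spark.graphx._

// toy graph: edge attributes are distances (made-up values)
val triples = sc.parallelize(Array(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 4.0)
))
val graph = Graph.fromEdges(triples, defaultValue = 0.0)

val sourceId: VertexId = 1L

// initialize distances: 0.0 at the source, infinity everywhere else
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet => {  // send messages along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)  // merge messages
)

sssp.vertices.collect.foreach(println)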
Look Ahead: Where is this heading?
Feature learning with Word2Vec
Matt Krzus
www.yseam.com/blog/WV.html
[diagram: ranked phrases → GraphX: run Connected Components → MLlib: run Word2Vec → aggregated by topic → MLlib: run KMeans → topic vectors (better than LDA?)]

features… models… insights…
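A rough sketch of the MLlib portion of that pipeline; the toy corpus below is invented, with each word repeated often enough to clear Word2Vec's default frequency cutoff, and a real run would feed in tokenized phrases from the parser:

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// invented toy corpus, repeated so every word passes Word2Vec's minCount
val docs = sc.parallelize(
  Seq.fill(20)(Seq("spark", "streaming", "window")) ++
  Seq.fill(20)(Seq("parquet", "columnar", "storage"))
)

// learn word vectors
val w2v = new Word2Vec().fit(docs)

// represent each doc as the average of its word vectors
val vectors = docs.map { words =>
  val vs = words.map(w => w2v.transform(w).toArray)
  val sum = vs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  Vectors.dense(sum.map(_ / vs.size))
}.cache()

// cluster the doc vectors into topics
val model = KMeans.train(vectors, 2, 10)
model.clusterCenters.foreach(println)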
Resources
certification:

Apache Spark developer certificate program

• oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
MOOCs:

Anthony Joseph, UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar, UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
community:

spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK

video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
confs:

Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015

Spark Summit East
NYC, Mar 18-19
spark-summit.org/east

Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com

Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015

Spark Summit 2015
SF, Jun 15-17
spark-summit.org
books:

Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
presenter:

Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.:
liber118.com/pxn/

Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
