Scio

A Scala API for Google Cloud Dataflow
Neville Li @sinisa_lyh

Origin Story
Scalding and Spark
ML, recommendations, analytics
50+ users, 400+ unique jobs

Moving to Google Cloud
Early 2015 - Dataflow Scala hack project

Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets of APIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unified API

Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
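The "static execution planning" and "transforms only, no actions" points above can be illustrated with a toy model. This is only a sketch in plain Scala, not the Dataflow or Scio API: `Plan`, `source`, and `run` are hypothetical names invented for the illustration. Each transform records a step and composes a function, and no user code runs until the finished plan is handed over for execution.

```scala
object DeferredPlanSketch {
  // Hypothetical toy type: a recorded chain of transforms over in-memory data.
  final case class Plan[A](steps: List[String], compute: () => Seq[A]) {
    // A transform only records its name and composes a function; nothing runs yet.
    def map[B](name: String)(f: A => B): Plan[B] =
      Plan(steps :+ name, () => compute().map(f))

    def filter(name: String)(p: A => Boolean): Plan[A] =
      Plan(steps :+ name, () => compute().filter(p))

    // Stand-in for submitting the finished plan; only here is user code invoked.
    def run(): Seq[A] = compute()
  }

  // Hypothetical source step wrapping an in-memory sequence.
  def source[A](name: String, data: Seq[A]): Plan[A] =
    Plan(List(name), () => data)
}
```

Because the whole chain is known before anything executes, a runner is free to inspect and optimize it up front, which is the contrast with a driver that issues actions one at a time.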

Why not Scalding on GCE
Pros
• Community: Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven

Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy: resource contention and utilization
• No streaming mode (Summingbird?)

Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue

Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management

Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem: GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model

Why Dataflow with Scala
Scala
• High-level DSL: easy transition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze, Algebird

Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.

WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
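To see what each step of the pipeline above computes, here is a minimal local sketch using only Scala standard collections, with no Scio or Dataflow dependency. The object name `WordCountSketch` and the helper `tokenize` are assumptions made for the illustration. The regular expression is the one from the slide, and `groupBy` plus a size count stands in for `countByValue`.

```scala
object WordCountSketch {
  // Same tokenization as the slide: split on runs of characters that are
  // neither letters nor apostrophes, then drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("[^a-zA-Z']+").filter(_.nonEmpty).toSeq

  // Local stand-in for flatMap + countByValue: word -> number of occurrences.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    val lines = Seq("To be, or not to be", "that is the question")
    // Counting is case-sensitive, so "To" and "to" are separate keys.
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```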

PageRank
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
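The same iteration can be traced on a small in-memory graph. The following is a sketch with plain Scala collections rather than `SCollection`, so `PageRankSketch` and its `iterations` parameter are assumptions introduced for the example. It mirrors the slide's steps: group edges by source, join links with ranks, spread each rank evenly over the out-links, then sum per destination and apply the 0.85 damping factor.

```scala
object PageRankSketch {
  // edges: (source URL, destination URL), mirroring SCollection[(String, String)].
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // groupByKey: source -> all of its destinations.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }

    // Every page with outgoing links starts at rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => src -> 1.0 }

    for (_ <- 1 to iterations) {
      // Inner join of links and ranks: a source without a current rank contributes nothing.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toSeq.flatMap(rank => urls.map(url => (url, rank / urls.size)))
      }
      // sumByKey, then damping: (1 - 0.85) + 0.85 * summed contributions.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        url -> ((1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

On a two-page cycle such as `Seq(("a", "b"), ("b", "a"))`, each page passes its full rank to the other, so both ranks stay at about 1.0 on every iteration, which makes it a convenient sanity check.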

Spotify Running
60 million tracks
30m users * 10 tempo buckets * 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering

Personalized new releases
• Pre-computed weekly on Hadoop (on-premise cluster)
• 100GB recommendations from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC

User conversion analysis
• For marketing and campaigning strategies
• Track user transitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out

Design and Implementation
• Simplicity over premature optimization
• Usability over Python/Java inter-op
• Ser/de: ☑kryo/chill ☒Coder[T]
• Closure cleaner

What’s next?
• Apache Beam donation
• Migrating internal teams
• BigQuery SQL-2011 dialect
• Better streaming support
• PRs and issues welcome!

Neville Li
@sinisa_lyh
Thank you!
