Scio
A Scala API for Google Cloud Dataflow
Neville Li @sinisa_lyh
Who am I?
Origin Story
Scalding and Spark
ML, recommendations, analytics
50+ users, 400+ unique jobs
Moving to Google Cloud
Early 2015 - Dataflow Scala hack project
What is Dataflow?
Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets of APIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unified API
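
As a rough illustration of "windowed and timestamped values", the sketch below assigns timestamped elements to fixed-size windows and aggregates per window. It uses plain Scala collections only; `Timestamped`, `FixedWindows`, `windowStart`, and `sumPerWindow` are hypothetical names invented for this example and are not part of the Dataflow SDK or Scio.

```scala
// Hypothetical types for illustration; not Dataflow or Scio API.
final case class Timestamped[A](value: A, timestampMillis: Long)

object FixedWindows {
  // Start of the fixed window containing the timestamp
  // (assumes non-negative timestamps and a positive window size).
  def windowStart(timestampMillis: Long, sizeMillis: Long): Long =
    timestampMillis - (timestampMillis % sizeMillis)

  // Group elements by the window they fall into and sum the values per window.
  def sumPerWindow(events: Seq[Timestamped[Int]], sizeMillis: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timestampMillis, sizeMillis))
      .map { case (start, es) => start -> es.map(_.value).sum }
}
```

The same grouping logic applies whether the elements come from a bounded file or an unbounded stream, which is the intuition behind a single collection type for both modes.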
Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
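
One way to picture "static execution planning" with "transforms only, no actions" is a pipeline object whose transforms merely record steps, with nothing computed until the finished plan is run as a whole. The toy below is a minimal sketch under that assumption; `Plan`, `source`, and `run` are hypothetical names for this example and do not reflect how Dataflow or Scio are implemented.

```scala
// Hypothetical toy types for illustration only; not part of Dataflow, Beam, or Scio.
final class Plan[A] private (val steps: Vector[String], compute: () => Seq[A]) {
  // A transform only extends the recorded plan; no data is touched here.
  def map[B](name: String)(f: A => B): Plan[B] =
    new Plan[B](steps :+ name, () => compute().map(f))

  def filter(name: String)(p: A => Boolean): Plan[A] =
    new Plan[A](steps :+ name, () => compute().filter(p))

  // The finished plan is executed as a whole, after construction.
  def run(): Seq[A] = compute()
}

object Plan {
  // The data argument is by-name, so even the source is not read until run().
  def source[A](name: String)(data: => Seq[A]): Plan[A] =
    new Plan[A](Vector(name), () => data)
}
```

Because the full list of steps is known before anything executes, a runner is free to inspect and optimize the graph up front, in contrast to a driver that triggers work each time an action is called.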
Why Dataflow?
Why not Scalding on GCE
Pros
• Community: Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven
Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy: resource contention and utilization
• No streaming mode (Summingbird?)
Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue
Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management
Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem: GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model
Why Dataflow with Scala
Scala
• High-level DSL: easy transition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze, Algebird
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
github.com/spotify/scio
WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  // split each line into words, dropping empty tokens
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  // count occurrences of each distinct word
  .countByValue()
  .saveAsTextFile("wordcount.txt")
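
To see what the pipeline computes without a Dataflow project, the same steps can be sketched on an in-memory sequence with the Scala standard library. This is only a local stand-in: `LocalWordCount` and `wordCount` are hypothetical names, and `groupBy` plus a size count is used here in place of `countByValue`.

```scala
object LocalWordCount {
  // Same tokenizing regex as the slide: split on runs of characters
  // that are neither letters nor apostrophes, then drop empty tokens.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .groupBy(identity) // local stand-in for countByValue
      .map { case (word, hits) => word -> hits.size.toLong }
}
```

Note that the counts are case-sensitive, since neither version lowercases the input.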
PageRank
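def pageRank(in: SCollection[(String, String)]) = {
  // (source, destination) edges grouped into source -> outgoing links
  val links = in.groupByKey()
  // every source page starts with rank 1.0
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    // each page splits its current rank evenly across its outgoing links
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    // sum incoming contributions and apply the 0.85 damping factor
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}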
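
The iteration above can be traced locally with plain Scala collections. The sketch below mirrors each step (group by key, inner join, spread rank over out-links, sum by key, damping) on in-memory maps; `LocalPageRank` and its `iterations` parameter are hypothetical names for this example, not Scio API.

```scala
object LocalPageRank {
  // Plain-collections stand-in for the SCollection pipeline.
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // groupByKey: source URL -> all destination URLs
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    // every URL with outgoing links starts at rank 1.0
    var ranks: Map[String, Double] = links.map { case (src, _) => src -> 1.0 }
    for (_ <- 1 to iterations) {
      // inner join of links and ranks, then spread each rank evenly over the out-links
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toSeq.flatMap(rank => urls.map(url => (url, rank / urls.size)))
      }
      // sumByKey, then apply the 0.85 damping factor
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        url -> ((1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

As in the slide's version, a page that receives no contributions drops out of `ranks` after the first iteration, because the new ranks are built only from `contribs`.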
Spotify Running
60 million tracks
30m users * 10 tempo buckets * 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering
Personalized new releases
• Pre-computed weekly on Hadoop (on-premise cluster)
• 100GB recommendations from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC
User conversion analysis
• For marketing and campaigning strategies
• Track user transitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out
Demo Time!
Design and Implementation
• Simplicity over premature optimization
• Usability over Python/Java inter-op
• Ser/de: ☑kryo/chill ☒Coder[T]
• Closure cleaner
What’s next?
• Apache Beam donation
• Migrating internal teams
• BigQuery SQL-2011 dialect
• Better streaming support
• PRs and issues welcome!
Neville Li
@sinisa_lyh
Thank you!
