Splitgraph: Docker for Data

Splitgraph
"Docker for Data"
Artjoms Iškovs, Miles Richardson

"B.D." Building Packages Before Docker
The Dark Ages
• Sourcing packages
• Rebuilding, reconfiguring, rebuilding...
• Googling, rage-inducing

Data preparation accounts for about 80% of the work of data scientists.

Why so hard to build and maintain data sets?
• Sourcing data is not composable
• Why can’t I query multiple data sets at once?
• Wrangling and cleaning data is not maintainable
• Why can’t I keep my data sets up to date?
• Running ad-hoc queries is not reproducible
• Why can’t I share my data sets?

What do we mean by data?
Sources
• Open Data
• Internal Data
• Licensed Data
Types
• SQL Databases
• NoSQL Databases
• CSV Files...

The journey of a dataset: Scenario
• Two publishers:
• NOAA publishes climate data
• USDA publishes corn yields
• Consumer wants to merge both data sets
• Let’s follow the climate data...

The journey of a dataset: Introduction

The journey of a dataset 1: Creation

Ingesting data from another DB via CLI
$ sgr mount -t mongo_fdw me:pwd@my_db:27017 '
{ "rainfall": {
    "db": "observations",
    "coll": "rainfall",
    "schema": {
      "timestamp": "timestamp",
      "state": "varchar",
      "rainfall": "numeric"
} } }' staging
$ sgr import staging \
    'SELECT timestamp, state, rainfall FROM rainfall' \
    noaa/climate rainfall

The journey of a dataset 2: Publication

Committing and Publishing Data via CLI
$ sgr publish noaa/climate data.splitgraph.com

The journey of a dataset 3: Usage

SGFiles: Dockerfiles for data
• Like Dockerfiles.
• Image: state of a database schema
• Layers with deterministic hashes; a cached layer is invalidated if:
• Previous layer changes
• Command changes
• Commands:
• FROM – base the image on something else
• IMPORT – import tables from another image
• SQL – run SQL against the image
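The caching rule above can be sketched in a few lines of Python. This is a minimal illustration, not Splitgraph's actual hashing scheme: it assumes a layer ID is the SHA-256 of the parent layer ID plus the command text, and the `layer_hash` and `build` helpers, the `wet` table, and the `:v2` tag are all hypothetical.

```python
import hashlib


def layer_hash(parent_hash: str, command: str) -> str:
    """Derive a layer ID from the parent layer ID and the command text."""
    payload = f"{parent_hash}\n{command}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build(commands, cache):
    """Return (layer IDs, commands that had to be executed)."""
    parent = ""  # assumption: the root layer has an empty parent ID
    hashes, executed = [], []
    for command in commands:
        current = layer_hash(parent, command)
        if current not in cache:
            # Cache miss: either this command or an earlier layer changed.
            executed.append(command)
            cache.add(current)
        hashes.append(current)
        parent = current
    return hashes, executed


cache = set()
sgfile = [
    "FROM noaa/climate:latest IMPORT rainfall",
    # Hypothetical SQL step, used only for illustration.
    "SQL CREATE TABLE wet AS SELECT * FROM rainfall WHERE rainfall > 10",
]

first_hashes, first_run = build(sgfile, cache)   # everything runs
_, second_run = build(sgfile, cache)             # everything is cached

# Changing the first command also invalidates the unchanged second one,
# because the second layer's ID depends on its parent's ID.
changed = [sgfile[0].replace("latest", "v2"), sgfile[1]]
changed_hashes, third_run = build(changed, cache)
```

Because each ID is chained to its parent, an unchanged command is still re-run whenever any layer before it changes.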

Consumption: Demo
FROM usda/yields IMPORT crop_yields
FROM noaa/climate:latest IMPORT rainfall
SQL CREATE TABLE rainfall_yields AS
SELECT * FROM rainfall JOIN crop_yields ...

The journey of a dataset 4: Updating
• Puerto Rico is now a US state
• NOAA wants to revise its climate data
• Can the consumer get just the changes?

Delta compression
• Only care about changes
• Need to efficiently:
• Create diffs (→ commit, push)
• Apply diffs (→ checkout, pull)
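As a rough illustration of the two operations above, the following Python sketch creates and applies a row-level diff between two snapshots of a table keyed by primary key. It is not Splitgraph's storage format: the `create_diff` and `apply_diff` helpers are hypothetical, and the rainfall values are made up for the example.

```python
def create_diff(old, new):
    """Compare two snapshots (dicts keyed by primary key) row by row."""
    return {
        "deleted": sorted(key for key in old if key not in new),
        "upserted": {
            key: value
            for key, value in new.items()
            if key not in old or old[key] != value
        },
    }


def apply_diff(snapshot, diff):
    """Return a new snapshot with the diff applied; the input is not modified."""
    result = dict(snapshot)
    for key in diff["deleted"]:
        result.pop(key, None)
    result.update(diff["upserted"])
    return result


# Primary key: (month, state). All values are invented for illustration.
v1 = {("2017-01", "IA"): 21.3, ("2017-01", "NE"): 14.5}
v2 = {("2017-01", "IA"): 21.3, ("2017-01", "NE"): 15.2, ("2017-01", "PR"): 88.9}

diff = create_diff(v1, v2)       # what a publisher would push
restored = apply_diff(v1, diff)  # what a consumer would pull and apply
```

Only the changed and added rows end up in `diff`; the unchanged Iowa row is never transferred.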

Delta compression
Docker
• Files
• Custom FS
Git
• Lines
• diff
Splitgraph
• Rows
• Audit triggers
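To make the "audit triggers" idea concrete, here is a small sketch that records row changes as they happen. It uses Python's built-in `sqlite3` module as a stand-in for the real database, so the trigger syntax is SQLite's; the `audit` table layout and trigger names are assumptions for illustration, not Splitgraph's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rainfall (state TEXT PRIMARY KEY, rainfall REAL);
-- Hypothetical audit table: one entry per changed row.
CREATE TABLE audit (action TEXT, state TEXT, rainfall REAL);

CREATE TRIGGER rainfall_ins AFTER INSERT ON rainfall BEGIN
    INSERT INTO audit VALUES ('I', NEW.state, NEW.rainfall);
END;
CREATE TRIGGER rainfall_upd AFTER UPDATE ON rainfall BEGIN
    INSERT INTO audit VALUES ('U', NEW.state, NEW.rainfall);
END;
CREATE TRIGGER rainfall_del AFTER DELETE ON rainfall BEGIN
    INSERT INTO audit VALUES ('D', OLD.state, NULL);
END;
""")

# Ordinary writes; the triggers capture each change without extra code.
conn.execute("INSERT INTO rainfall VALUES ('IA', 21.3)")
conn.execute("INSERT INTO rainfall VALUES ('NE', 14.5)")
conn.execute("UPDATE rainfall SET rainfall = 15.2 WHERE state = 'NE'")
conn.execute("DELETE FROM rainfall WHERE state = 'IA'")

# The audit log is the raw material for a row-level diff at commit time.
changes = conn.execute(
    "SELECT action, state, rainfall FROM audit ORDER BY rowid"
).fetchall()
```

The point of the design is that a diff can be read from the change log instead of being computed by comparing two full copies of the table.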

The journey of a dataset 5: Maintenance
• Can we update it?
• Where did this dataset come from?
• Build context fully encapsulated within the metadata
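One way to picture this: if the commands that built an image are recorded alongside it, the question "where did this dataset come from?" can be answered by reading them back. The sketch below assumes the recorded commands are the SGFile lines from the demo, stored as plain strings; the `sources_of` helper is hypothetical and does not reflect how Splitgraph actually stores its metadata.

```python
def sources_of(commands):
    """Extract upstream repository names from recorded FROM ... commands."""
    found = []
    for command in commands:
        parts = command.split()
        if len(parts) >= 2 and parts[0] == "FROM":
            repository = parts[1].split(":")[0]  # drop an optional :tag
            if repository not in found:
                found.append(repository)
    return found


# Commands as they appear in the consumption demo (the JOIN condition is
# elided in the original and is left elided here).
recorded = [
    "FROM usda/yields IMPORT crop_yields",
    "FROM noaa/climate:latest IMPORT rainfall",
    "SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...",
]

upstream = sources_of(recorded)
```

With the same recorded commands, an update amounts to replaying them against newer versions of the upstream images.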

Q&A
twitter.com/splitgraph · splitgraph.com
