Splitgraph
"Docker for Data"
Artjoms Iškovs, Miles Richardson
"B.D." Building Packages Before Docker
The Dark Ages
• Sourcing packages
• Rebuilding, reconfiguring, rebuilding...
• Googling, rage inducing
Data preparation accounts for about 80% of the work of data scientists.
Why is it so hard to build and maintain data sets?
• Sourcing data is not composable
• Why can’t I query multiple data sets at once?
• Wrangling and cleaning data is not maintainable
• Why can’t I keep my data sets up to date?
• Running ad-hoc queries is not reproducible
• Why can’t I share my data sets?
What do we mean by data?
Sources
• Open Data
• Internal Data
• Licensed Data
Types
• SQL Databases
• NoSQL Databases
• CSV Files...
The journey of a dataset: Scenario
• Two publishers:
• NOAA publishes climate data
• USDA publishes corn yields
• Consumer wants to merge both data sets
• Let’s follow the climate data...
The journey of a dataset: Introduction
The journey of a dataset 1: Creation
Ingesting data from another DB via CLI
$ sgr mount -t mongo_fdw me:pwd@my_db:27017 '
{ "rainfall": {
    "db": "observations",
    "coll": "rainfall",
    "schema": {
      "timestamp": "timestamp",
      "state": "varchar",
      "rainfall": "numeric"
} } }' staging
$ sgr import staging \
    'SELECT timestamp, state, rainfall FROM rainfall' \
    noaa/climate rainfall
The journey of a dataset 2: Publication
Committing and Publishing Data via CLI
$ sgr publish noaa/climate data.splitgraph.com
The journey of a dataset 3: Usage
SGFiles: Dockerfiles for data
• Like Dockerfiles.
• Image: state of a database schema
• Layers w/ deterministic hashes and cache invalidation if:
• Previous layer changes
• Command changes
• Commands:
• FROM – base the image on something else
• IMPORT – import tables from another image
• SQL – run SQL against the image
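The caching rule above (a layer is invalidated when the previous layer or its own command changes) can be sketched in Python. This is a minimal illustration, not Splitgraph's actual hashing scheme: the functions `layer_hash` and `build`, the all-zero base hash, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def layer_hash(parent_hash: str, command: str) -> str:
    """Derive a layer's hash from its parent's hash and its own command."""
    payload = f"{parent_hash}\n{command}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build(commands, cache):
    """Walk the commands in order and return (layer hashes, commands executed).

    `cache` is a set of layer hashes that already exist. A command is only
    "executed" when its layer hash is missing from the cache.
    """
    parent = "0" * 64  # hypothetical hash standing in for an empty base image
    hashes, executed = [], []
    for command in commands:
        current = layer_hash(parent, command)
        if current not in cache:
            cache.add(current)        # stand-in for actually running the command
            executed.append(command)
        hashes.append(current)
        parent = current
    return hashes, executed
```

Because each hash includes the parent's hash, editing one command changes the hash of that layer and of every layer after it, while earlier layers are still served from the cache.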
Consumption: Demo
FROM usda/yields IMPORT crop_yields
FROM noaa/climate:latest IMPORT rainfall
SQL CREATE TABLE rainfall_yields AS
SELECT * FROM rainfall JOIN crop_yields ...
The journey of a dataset 4: Updating
• Puerto Rico is now a US state
• NOAA wants to revise its climate data
• Can the consumer get just the changes?
Delta compression
• Only care about changes
• Need to efficiently:
• Create diffs (→ commit, push)
• Apply diffs (→ checkout, pull)
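Creating and applying a diff can be illustrated with tables held as Python dictionaries keyed by primary key. This is a sketch under stated assumptions: `make_diff` and `apply_diff` are hypothetical helpers, the diff format (upserts plus deletes) is chosen for the example, and it is not how Splitgraph stores its changes.

```python
def make_diff(old, new):
    """Compare two versions of a table and return only what changed.

    Both tables are dicts mapping a primary key to a row value.
    """
    upserts = {pk: row for pk, row in new.items() if old.get(pk) != row}
    deletes = sorted(pk for pk in old if pk not in new)
    return {"upserts": upserts, "deletes": deletes}


def apply_diff(table, diff):
    """Return a new table with the diff applied; the input is left untouched."""
    result = dict(table)
    for pk in diff["deletes"]:
        result.pop(pk, None)
    result.update(diff["upserts"])
    return result
```

A consumer holding the old version only needs the diff to reach the new one: `apply_diff(old, make_diff(old, new))` equals `new`.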
Delta compression
Docker
• Files
• Custom FS
Git
• Lines
• diff
Splitgraph
• Rows
• Audit triggers
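The idea of capturing row-level changes with audit triggers can be tried out with SQLite from Python's standard library. This is only a stand-in to show the mechanism: the `audit` table, the trigger names, and the sample values are hypothetical, and the deck does not describe Splitgraph's own trigger definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rainfall (state TEXT PRIMARY KEY, rainfall REAL);

-- Hypothetical audit table: one entry per changed row, in order.
CREATE TABLE audit (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT,
    state TEXT,
    rainfall REAL
);

CREATE TRIGGER rainfall_ins AFTER INSERT ON rainfall BEGIN
    INSERT INTO audit (action, state, rainfall)
    VALUES ('I', NEW.state, NEW.rainfall);
END;

CREATE TRIGGER rainfall_upd AFTER UPDATE ON rainfall BEGIN
    INSERT INTO audit (action, state, rainfall)
    VALUES ('U', NEW.state, NEW.rainfall);
END;

CREATE TRIGGER rainfall_del AFTER DELETE ON rainfall BEGIN
    INSERT INTO audit (action, state, rainfall)
    VALUES ('D', OLD.state, NULL);
END;
""")

# Ordinary writes against the table; the triggers record each one.
conn.execute("INSERT INTO rainfall VALUES ('IA', 1.2)")
conn.execute("INSERT INTO rainfall VALUES ('TX', 0.4)")
conn.execute("UPDATE rainfall SET rainfall = 1.3 WHERE state = 'IA'")
conn.execute("DELETE FROM rainfall WHERE state = 'TX'")

# The audit log now holds the row-level changes since the triggers were created.
changes = conn.execute(
    "SELECT action, state, rainfall FROM audit ORDER BY id"
).fetchall()
```

The audit log is what a commit would turn into a diff: the writer never has to compare full table snapshots.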
Updating: Demo
The journey of a dataset 5: Maintenance
• Can we update it?
• Where did this dataset come from?
• Build context fully encapsulated within the metadata
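If each image's metadata records the command that built it and the images it was built from, the question "where did this dataset come from?" becomes a walk over that metadata. The sketch below assumes a hypothetical `Image` record and `provenance` function; the metadata format Splitgraph actually uses is not shown in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical metadata record for one image."""
    name: str
    command: str = ""
    sources: tuple = ()  # the images this one was built from


def provenance(image):
    """Return the names of all upstream images, depth-first, without repeats."""
    seen = []

    def visit(current):
        for source in current.sources:
            if source.name not in seen:
                seen.append(source.name)
                visit(source)

    visit(image)
    return seen
```

For an image built from `noaa/climate:latest` and `usda/yields`, `provenance` returns those two names. Since the commands are stored alongside the sources, the same metadata could in principle be replayed against a newer source image to rebuild the dataset.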
Provenance and rebasing demo
Q&A
twitter.com/splitgraph · splitgraph.com
