THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE
SPEAKER: Matt Wood
Principal Data Scientist
Amazon Web Services




Monday, April 1, 13
The Missing Manual: Reproduce, Reuse, Remix

Dr. Matt Wood
matthew@amazon.com
@mza

Hello.

Data.

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

Generation challenge.

Amazing data generators: cell phones tracking cholera in Haiti
Linus Bengtsson et al. PLoS Medicine, 2011

Amazing data generators: social networks tracking influenza
You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Amazing data generators: web app logs targeting advertising
500% return on ad spend

Chromosome 11 : ACTN3 : rs1815739

Chromosome X : rs6625163

Chromosome 19 : FUT2 : rs601338

Chromosome 2 : rs10427255

Chromosome 10 : rs7903146
TYPE II

Chromosome 15 : rs2472297
+0.25

Generation challenge.

Generation challenge.
X

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

Utility computing.

Remove constraints.

Analytics challenge.

Analytics challenge.
X

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

Beautiful, unique.

Impossible to recreate.

Snowflake Data Science

Reproducibility.

Reproducibility scales data science.

Reproduce. Reuse. Remix.

Value++

How do we get from here to there?

5 PRINCIPLES OF REPRODUCIBILITY

1. Data has gravity

Increasingly large data collections.

Challenging to obtain and manage.

Expensive to experiment.

Large barrier to reproducibility.

Move data to the users.

Move data to the users.
X

Move tools to the data.

Place data where it can be consumed by tools.

Place tools where they can access data.

More data, more users, more uses, more locations

Cost

Force multiplier.

Cost and complexity kill reproducibility.

5 PRINCIPLES OF REPRODUCIBILITY

2. Ease of use is a prerequisite

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Help overcome the suck threshold.

Easy to embrace and extend.

Choose the right abstraction for the user.

$ ec2-run-instances

$ starcluster start

Package and automate.

Expert-as-a-service.

1000 Genomes Project
Cloud BioLinux

1000 Genomes Project + your genomic data
Illumina Basespace

[Diagram: Cassandra, Aegisthus, Amazon S3, Hadoop, Hive, Pig, legacy data warehousing, with Sting, Microstrategy, and R]
http://www.youtube.com/watch?v=oGcZ7WVx6EI

5 PRINCIPLES OF REPRODUCIBILITY

3. Reuse is as important as reproduction

Seven Deadly sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

Data scientists are hackers.

They have their own way of working.

Beware the Big Red Button.

Fire-and-forget reproduction is a good first step, but limits longer-term value.

Monolithic, one-stop-shop.

Work well for intended purpose.

Challenging to install, dependency heavy.

Difficult to grok.

Data scientists are hackers: embrace it.

Small things. Loosely coupled.

Easier to grok, reuse and integrate.

Lower barrier to entry.
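
One way to picture "small things, loosely coupled" is a handful of single-purpose functions chained together. This is only a sketch: the step names (`parse_counts`, `drop_negatives`, `total`) and the `compose` helper are hypothetical, invented for illustration and not taken from the talk.

```python
from functools import reduce
from typing import Callable, Iterable


# Each step does one job and knows nothing about the other steps.
def parse_counts(lines: Iterable[str]) -> list[int]:
    """Turn text lines into integers, skipping blank lines."""
    return [int(line) for line in lines if line.strip()]


def drop_negatives(values: list[int]) -> list[int]:
    """Keep only values that are zero or greater."""
    return [v for v in values if v >= 0]


def total(values: list[int]) -> int:
    """Sum the values."""
    return sum(values)


def compose(*steps: Callable) -> Callable:
    """Chain steps left to right: each output feeds the next step."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical pipeline assembled from the small pieces above.
pipeline = compose(parse_counts, drop_negatives, total)
result = pipeline(["3", "", "-1", "4"])
```

Because the steps are independent, someone else can reuse `parse_counts` and `total` without `drop_negatives`, or swap in a step of their own, without touching the rest.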




5 PRINCIPLES OF REPRODUCIBILITY

4. Build for collaboration

Workflows are memes.

Reproduction is just the first step.

Bill of materials: code, data, configuration, infrastructure.
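
A bill of materials like this could be written down as a small, serializable manifest. The sketch below assumes a hypothetical `BillOfMaterials` class; every value in it (repository address, bucket path, image name, instance type) is a made-up placeholder, not something named in the talk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BillOfMaterials:
    """The four ingredients named on the slide, gathered in one record."""
    code: str              # where the analysis code lives, pinned to a revision
    data: list             # input datasets
    configuration: dict    # parameters used for the run
    infrastructure: dict   # machine image and compute shape

    def to_json(self) -> str:
        # Sorted keys give a stable text form that is easy to diff and share.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# All values below are hypothetical placeholders.
bom = BillOfMaterials(
    code="git://example.org/analysis.git@abc123",
    data=["s3://example-bucket/input.csv"],
    configuration={"threshold": 0.05},
    infrastructure={"image": "ami-EXAMPLE", "instance_type": "m1.large", "count": 4},
)
manifest = bom.to_json()
```

Shipping the manifest alongside the results gives a collaborator one file that lists everything needed to attempt the same run.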



Full definition for reproduction.

Utility computing provides a playground for data science.

Code + AMI + custom datasets + public datasets + databases + compute + result data

5 PRINCIPLES OF REPRODUCIBILITY

5. Provenance is a first class object

Versioning becomes really important.

Especially in an active community.

Doubly so with loosely coupled tools.

Provenance metadata is a first class entity.
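
One reading of "first class" is that every result travels with a record of how it was made. The following is a minimal sketch under that assumption: the `Provenance` class, the `run_step` helper, and the `total_length` example step are all hypothetical names, and hashing inputs with SHA-256 is an illustrative choice rather than anything the talk prescribes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(payload: bytes) -> str:
    """Content digest used to identify an input exactly."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Provenance:
    tool: str
    tool_version: str
    parameters: dict
    input_digests: tuple

    def record_id(self) -> str:
        # Hash a canonical JSON form so equal records get equal ids.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return sha256_hex(canonical)


def run_step(tool, tool_version, parameters, inputs, func):
    """Run func on inputs and return the result together with its provenance."""
    prov = Provenance(
        tool=tool,
        tool_version=tool_version,
        parameters=parameters,
        input_digests=tuple(sha256_hex(item) for item in inputs),
    )
    return func(inputs, **parameters), prov


# Hypothetical analysis step: scaled total size of the inputs in bytes.
def total_length(inputs, scale=1):
    return scale * sum(len(item) for item in inputs)


result, prov = run_step("total_length", "0.1.0", {"scale": 2}, [b"abc", b"de"], total_length)
```

Changing the tool version, a parameter, or any input byte changes `record_id()`, which is what lets versions of loosely coupled tools be told apart after the fact.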



Distributed provenance.

5 PRINCIPLES OF REPRODUCIBILITY

1. Data has gravity
2. Ease of use is a prerequisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object

Thank you

matthew@amazon.com
aws.amazon.com
@mza

THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE from Structure:Data 2013