TWC 
Why Data Science Matters 
Xiaogang (Marshall) Ma 
Tetherless World Constellation 
Rensselaer Polytechnic Institute 
Email: max7@rpi.edu; Twitter: @MarshallXMa 
ICSU-WDS Data Stewardship Award Lecture 
SciDataCon 2014, New Delhi, India, Nov. 02-05
Acknowledgements
• Dr. Mustapha Mokrane and Dr. Simon Hodson 
• Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, 
AGU/ESSI, ICSU-WDS, RDA, ITC, and more 
• My mentor Prof. Peter Fox 
• My family 
• All of you
Outline
• Technical trends 
– Data management, publication & citation 
• Methodology 
– Interoperability & Provenance 
• Data management is just a start 
– Data analysis 
– Semantic eScience 
Data Management
data work 
Image courtesy Randy Glasbergen
Data Management Plan
• Data Management Plan 
– A formal document that outlines what you will do with your data 
during and after you complete your research 
• Resources and tools that help create DMPs:
– NSF Data Management Plan Requirements:
http://www.nsf.gov/eng/general/dmp.jsp
– DCC Data Management Plans:
http://www.dcc.ac.uk/resources/data-management-plans
– DMPTool: https://dmptool.org
– DCC DMPOnline: https://dmponline.dcc.ac.uk
Data Publication
• Data as first class products of research
– e.g., NSF bio-sketches can include data publications
See: http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp
Image from j4h.net
“All data necessary to understand, assess, and extend the conclusions of 
the manuscript must be available to any reader of Science.”
“…authors are required to make materials, data and associated protocols 
promptly available to readers without undue qualifications.” 
“…authors must make materials, data, and associated protocols available 
to readers.” 
“…it is a condition of publication that authors make available the data and 
research materials supporting the results in the article.” 
“…require authors to make all data underlying the findings described in 
their manuscript fully available without restriction…” 
“Earth and space science data should be widely accessible in multiple 
formats and long‐term preservation of data is an integral responsibility of 
scientists and sponsoring institutions.” 
“…support the principle that research data should be made freely 
available to all researchers…” 
“…recommends depositing data that correspond to journal articles in 
reliable data repositories…”
• Ways of data publication 
– Data as supplemental material of a paper 
– Standalone data 
– Data paper: data in a repository + descriptive ‘data paper’ 
Examples: 
• Standalone data journals: Nature Scientific Data, Geoscience Data 
Journal, Ecological Archives, Data in Brief … 
• Journals that publish data papers: Earth and Space Science, 
GigaScience, F1000 Research, Internet Archaeology … 
Strasser, GeoData 2014 Workshop Presentation (2014)
An isolated data island?!
Image from nature.com
Data Citation
• Data Citation Index 
– Indexes the world's leading data repositories 
– Connects datasets to related refereed literature indexed in 
the Web of Science™ 
– Efficient access to data across subjects and regions 
Image courtesy http://wokinfo.com
Data interoperability
Interoperability: 
“Data should be discoverable, accessible, decodable, 
understandable and usable, and data sharing should be 
legal and ethical for all participants.” 
Ma et al., Nature Geoscience (2011)
Original image from: http://ehna.org
Provenance of research
Provenance documentation 
“Linking a range of observations and model outputs, research 
activities, people and organizations involved in the production of 
scientific findings with the supporting data sets and methods 
used to generate them” 
Image from nature.com 
Ma et al., Nature Climate Change (2014) 
http://data.globalchange.gov
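The linking described in the provenance quote can be pictured as a small graph of statements. The following is a minimal sketch, not the data model used at data.globalchange.gov: every identifier (finding:1, dataset:obs, person:alice, and so on) is hypothetical, and the relation names are borrowed from the W3C PROV-O vocabulary only as an assumed convention. It walks from a finding back to the data sets that support it.

```python
from collections import defaultdict

# Hypothetical provenance statements as (subject, relation, object) triples.
# Relation names follow W3C PROV-O terms; all identifiers are made up.
TRIPLES = [
    ("finding:1", "wasDerivedFrom", "figure:1"),
    ("figure:1", "wasGeneratedBy", "activity:analysis"),
    ("activity:analysis", "used", "dataset:obs"),
    ("activity:analysis", "used", "dataset:model"),
    ("activity:analysis", "wasAssociatedWith", "person:alice"),
    ("person:alice", "actedOnBehalfOf", "org:lab"),
]


def build_index(triples):
    """Group outgoing (relation, object) pairs by subject."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def trace(start, index):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = [start]
    reached = []
    while queue:
        node = queue.pop(0)
        for _relation, obj in index.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
                reached.append(obj)
    return reached


def supporting_datasets(start, index):
    """Keep only the data sets found while tracing back from `start`."""
    return sorted(n for n in trace(start, index) if n.startswith("dataset:"))


INDEX = build_index(TRIPLES)
print(supporting_datasets("finding:1", INDEX))
```

The same traversal also reaches the activity, the person and the organization, which is the point of the quote: one chain of links connects a finding to everything involved in producing it.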
• IPython Notebook:
A web-based interactive computational environment
Code, APIs, datasets, text…
PDF document
• We extended the IPython Notebook environment to enable
automatic provenance capture during a scientific workflow
Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)
Semantic eScience
• Artificial Intelligence accelerates scientific discovery 
– Data search, synthesis and hypothesis representation 
– Data analysis: reasoning with models of the data 
Gil et al., Science (2014) 
Image from science.com 
A state-of-the-art example: 
Hanalyzer (high-throughput analyzer) 
• Uses natural language processing to 
automatically extract a semantic network from 
all PubMed papers relevant to a scientist 
• Uses Semantic Web technology to integrate 
assertions from other biomedical sources 
• Reasons about the network to find new 
correlations that suggest new genes to 
investigate 
Leach et al., PLoS Comput Bio (2009) 
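The integrate-then-reason idea can be illustrated with a toy example. This is a loose sketch and not the Hanalyzer algorithm: the gene names, the two edge lists and the scoring rule (count links to genes already under study) are all assumptions chosen for brevity, whereas the real system works on networks extracted by natural language processing and applies richer reasoning.

```python
from collections import Counter

# Hypothetical gene-gene assertions from two kinds of sources.
LITERATURE_LINKS = [("GENE_A", "GENE_X"), ("GENE_B", "GENE_X"), ("GENE_A", "GENE_Y")]
DATABASE_LINKS = [("GENE_C", "GENE_X"), ("GENE_B", "GENE_Y"), ("GENE_C", "GENE_Z")]


def merge_network(*sources):
    """Integrate assertions from several sources into one undirected edge set."""
    edges = set()
    for source in sources:
        for a, b in source:
            edges.add(frozenset((a, b)))
    return edges


def rank_candidates(edges, genes_of_interest):
    """Score each unstudied gene by how many studied genes it is linked to."""
    scores = Counter()
    for edge in edges:
        known = edge & genes_of_interest
        unknown = edge - genes_of_interest
        if len(known) == 1 and len(unknown) == 1:
            scores[next(iter(unknown))] += 1
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


NETWORK = merge_network(LITERATURE_LINKS, DATABASE_LINKS)
RANKING = rank_candidates(NETWORK, {"GENE_A", "GENE_B", "GENE_C"})
print(RANKING)
```

In this toy network GENE_X ranks first only once both sources are merged, which is the motivation for integrating assertions before reasoning over them.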
Deep Carbon Virtual Observatory
Fox, RDA Fourth Plenary Meeting Presentation (2014)
A cyber-enabled platform for linked science
http://deepcarbon.net
Summary
• Data as first class products of research 
• eScience: the digital or electronic facilitation of science 
• Semantic eScience 
– A virtuous circle between science and semantic technologies 
– Data driven + Knowledge driven? 
Image courtesy @WileyExchanges 
More information: 
Marshall X Ma 
max7@rpi.edu 
Thank you!
