Kerstin Lehnert
Lamont-Doherty Earth Observatory of Columbia University
Palisades, NY, 10964
Success and Challenges in the Earth Sciences
Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality
February 27, 2012 by R "Ray" Wang
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/
The sixth ‘V’: Value
3/22/2016
Making Small Data BIG: Success and Challenges in the Earth Sciences
• heterogeneous
• customized & optimized for research questions
• lack of data standards
• culture of data ‘hoarding’
• lack of data infrastructure (facilities)
“While the data volumes are small when viewed
individually, in total they represent a very significant
portion of the country’s scientific output.”
“The long tail is a breeding ground for new ideas and
never before attempted science.”
(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
… that form a picture
The PetDB Synthesis
Map shows data from >300 publications
Symbols are locations of rock samples. Color is scaled to the 87Sr/86Sr isotope ratio in the rocks.
Example #1: Did New Zealand Dust Influence the Last Ice Age?
“Understanding where the dust that's in the atmosphere and oceans comes from can help scientists estimate its impact on earth's climate system.”
Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)
http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/
Note the number of data points generated in this study (the yellow dots) in light of the effort involved, which ranged from collecting samples in NZ to operating expensive equipment in the lab.
Example #2: Do convergent margin volcanoes really represent continental crust?
“As it is crucial to understand the extent and origin of the compositional difference between central Aleutian lavas and plutons through time and space, this project will map and sample plutonic rocks exposed on the central Aleutians and their coeval volcanic host rocks.”
“Results and the samples acquired in this study will help to answer fundamental questions of continental crust formation, and shed light on the formation mechanisms of plutons and volcanics in arcs.”
http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=135851&org=NSF
Anticipated Data:
• ~ 250 samples
• ~ 200 major element analyses
• ~ 150 trace element analyses
• 50 U/Pb zircon geochronology
• 30 Ar-Ar ages
• 80 Sr, Nd, Hf and Pb isotope analyses
• 4 scientists (3 institutions)
• 5 weeks on remote islands
• a boat (with crew)
• a helicopter
• They are widely dispersed in the literature (past & present).
• They are not openly accessible.
• They lack sufficient and standardized metadata.
• They are never published (“dark data”).
[Diagram: the value of small data]
findable: identification, persistence
accessible: protection, protocols
re-usable: context, provenance
interoperable: harmonized, machine-readable
Labels: small data; Data Curation Standards; Generic Repositories; Domain-specific Data Standards; Community Data Collections
[Diagram: the value of small data, with domain repositories]
findable: identification, persistence
accessible: protection, protocols
re-usable: context, provenance
interoperable: harmonized, machine-readable
Labels: small data; Data Curation Standards; Domain-specific Data Standards; Domain Repositories
[Diagram: the domain-specific data facility and its stakeholders]
Stakeholders shown: Science Community; Libraries, Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities, Registries
Discipline-specific data services
• Context & provenance metadata
• Semantics
• Workflows
Data curation services
CI development
Disciplinary Expertise
Data Services for the
Solid Earth Sciences
www.iedadata.org
www.iedadata.org
• Solid Earth Observational Data
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology
• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMs, USAP-DCC, etc.)
  • GeoMapApp
  • Interoperability
IEDA Repositories
• >720,000 files
• 59 TB
• 4 x 10^6 samples
IEDA Syntheses
• 19 x 10^6 analytical values in EarthChem
• 2.79 x 10^6 miles of data from 875 cruises in the Global Multi-Resolution Topography (GMRT)
[Diagram: EarthChem data flow. Labels: Investigators; EarthChem Data Managers; External Systems; EarthChem Library; PetDB, SedDB; EarthChem Portal; Data Publication & Preservation; Data Mining & Analysis; Metadata Catalog; Data & Metadata]
FINDABLE & ACCESSIBLE
• DOI registration
• Long-term archiving
• CC license
• Guidelines for data reporting (community endorsed)
• QC by data managers
RE-USABLE & INTEROPERABLE
• Data & metadata harmonization
• Standards-compliant data model
• Service Oriented Architecture (ECP)
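As a rough illustration of what data & metadata harmonization can involve, the sketch below maps two differently formatted investigator tables onto one shared schema. The column names, the unit conversion, and the sample values are hypothetical and chosen only for illustration; this is not EarthChem's actual data model or workflow.

```python
# Hypothetical sketch: harmonize heterogeneous geochemical tables.
# Column names, units, and values are invented for illustration only.

# Map each source's column names onto one shared schema.
COLUMN_MAP = {
    "lab_a": {"Sample": "sample_id", "SiO2 (wt%)": "sio2_wt_pct", "Sr (ppm)": "sr_ppm"},
    "lab_b": {"sample_name": "sample_id", "SiO2": "sio2_wt_pct", "Sr_ppb": "sr_ppb"},
}


def harmonize(rows, source):
    """Rename columns to the shared schema and convert Sr from ppb to ppm."""
    mapping = COLUMN_MAP[source]
    out = []
    for row in rows:
        record = {mapping[key]: value for key, value in row.items() if key in mapping}
        if "sr_ppb" in record:
            record["sr_ppm"] = record.pop("sr_ppb") / 1000.0
        record["source"] = source  # keep provenance with every record
        out.append(record)
    return out


lab_a = [{"Sample": "A-1", "SiO2 (wt%)": 50.2, "Sr (ppm)": 120.0}]
lab_b = [{"sample_name": "B-7", "SiO2": 49.8, "Sr_ppb": 95000.0}]

combined = harmonize(lab_a, "lab_a") + harmonize(lab_b, "lab_b")
for record in combined:
    print(record)
```

After harmonization both records carry the same keys and units, so they can be filtered or plotted together, while the `source` field preserves where each value came from.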
DOI to allow proper citation
Link to publications
Link to funding source
Global compilation of geochemical data for igneous rocks from the ocean floor & mantle xenoliths
> 2,200 data sets/publications
> 84,000 samples
> 3.2 million observed values
http://www.earthchem.org/petdb
Data from
• >13,000 publications
• >850,000 samples
Total: >19.6 million analytical values
Partner Databases:
• PetDB
• SedDB
• GEOROC
• USGS
• MetPetDB
• GANSEKI
Filter by method or concentration
• 500 - 800 downloads per quarter
• >600 citations in the literature
• many fundamental new discoveries & insights
• Disciplinary
• Multi-disciplinary
• Unanticipated purposes
• new scientific approaches
• Statistical rather than hypothetical
• Many samples and collections are not ‘online’.
• Repositories lack resources & expertise to develop & maintain digital collection catalogs.
• Samples often only described in publications.
• Existing online catalogs are not connected or federated.
• No easy way to search for samples.
February 25, 2016, DFG Rundgespräch (Roundtable) Geochemical Databases
• Linking physical samples to the digital data generated by their study.
  • Reproducibility! Access to the physical samples is required to verify & reproduce observations.
  • Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data.
• Broad sharing of physical samples for use & re-use.
  • Samples are often expensive to collect (drilling, remote locations).
  • Many samples are unique and irreplaceable.
  • Re-analysis augments utility of existing data.
  • Samples often serve in ways that the collectors and repositories could not have imagined.
• Discovery & Access for Re-use and Reproducibility
• Sample Citation
• Data Integration
• Sample Management
3/22/2016
Making Small Data BIG: Succss and Challenges in the Earth Sciences 36
IGSN = International Geo Sample Number
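One way to picture the data-integration role of a persistent sample identifier is as a join key: analyses reported in different publications can be brought together when they cite the same identifier. The sketch below is only an illustration under that assumption. The identifier strings, publication names, and values are made up, and the code does not reproduce the real IGSN syntax or any registry API.

```python
from collections import defaultdict

# Hypothetical analyses from two publications; identifiers and values are invented.
analyses = [
    {"sample_id": "SAMPLE0001", "publication": "Paper A", "analyte": "SiO2", "value": 50.1},
    {"sample_id": "SAMPLE0002", "publication": "Paper A", "analyte": "SiO2", "value": 48.7},
    {"sample_id": "SAMPLE0001", "publication": "Paper B", "analyte": "87Sr/86Sr", "value": 0.7031},
]


def group_by_sample(records):
    """Collect every analysis that cites the same sample identifier."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["sample_id"]].append(record)
    return dict(grouped)


by_sample = group_by_sample(analyses)
for sample_id, records in by_sample.items():
    sources = sorted({r["publication"] for r in records})
    print(sample_id, len(records), "analyses from", sources)
```

Without a shared identifier, the two papers' measurements on the same rock would have to be matched by sample name, which is exactly the kind of ambiguity a unique sample number is meant to remove.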
“… AGU Publications also strongly encourages use of other identifiers in our journal papers. International Geo Sample Numbers (IGSNs) uniquely identify items, such as a rock sample, a piece of coral, or a vial of water taken from the natural environment, and provide important, consistent information about these samples.”
Hanson, B. (2016), AGU opens its journals to author identifiers, Eos, 97, doi:10.1029/2016EO043183. Published on 7 January 2016.
Technical
Organizational
Social/cultural
• Limitation of resources versus diversity of data
• Need best practices for all small data communities
• Need flexibility and performance of database schemas & search applications
• Need tools for investigators to improve quality of submitted data
• Need tools for data managers to support (semi-automate?) QC workflow
• Repository standards/certification
• Inclusion of legacy data (data rescue)
How can we grow small data across the Geosciences?
Coalition for Publishing Data in the Earth & Space Sciences
• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
  • Alignment of data policies across different publishers
  • Advancing integration of publication and data submission workflows
  • Support for authors and editors to comply with publishers’ data policies
    • e.g., online community directory of appropriate Earth science community repositories that meet leading standards on curation, quality, and access
Increases development and enforcement of data best practices
Reduces effort of metadata QC
Increases flow of small data into repositories
www.copdess.org
• Cross-disciplinary development of community data model ODM2 (Observations Data Model)
• Collaboration with commercial software engineering
• Advances coordination, collaboration, and integration
  • Community governance
  • Integrative Activities
• Fosters new data communities
  • Research Coordination Networks
• Develops and adapts new technologies to structure, transform, integrate, document, harmonize data & metadata
  • Building Blocks
The Alliance Testbed Project
“Interdisciplinary Earth Data Alliance as a Model for Integrating EarthCube Technology Resources and Engaging the Broad Community”
• Design & develop the organizational and technical architecture of a data facility that operates as an alliance of scientifically related data communities
• Sharing data services and infrastructure that support trusted data curation and interdisciplinary science.
• Build on and transition existing infrastructure of an established data facility (IEDA) to provide shared data services for all Alliance partners
  • Data Submission Hub
  • Trusted repository services (DOI registration, long-term preservation)
• Deploy newly developed EC technologies to align and integrate with EC architecture
  • CINERGI: pipeline for harvesting, improving, unifying, and re-publishing metadata records assembled by Alliance partners
  • GeoWS: mechanism for Alliance partners to exchange data with data discovery, search, and visualization tools across the Alliance
  • GeoLink: Vocabulary services to support the Data Submission Hub
• Data Facility: IEDA
• Including existing IEDA Partners: MGDS, EarthChem, SESAR, Geochron, ASP@UTIG, LEPR
• Community Data Collection: MetPetDB
• New data communities: Mineral Physics, Deep Seafloor Processes
• New data provider: IcePod
• EarthCube Building Blocks: CINERGI, GeoLink, GeoWS
• Stakeholder Alignment: WayMark Systems
• Small data grows BIG when properly curated, documented, harmonized, and integrated.
• Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement.
• Current approaches are not sufficiently scalable.
• Partnerships and collaborations help address the challenges.
• Integration with publications will augment the flow of data into repositories and data products.
• Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation.
• Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.

Making Small Data BIG (UT Austin, March 2016)

  • 1. Kerstin Lehnert Lamont -Doherty Earth Observatory of Columbia University Palisades, NY, 10964 Success and Challenges in the Earth Sciences
  • 2. Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality February 27, 2012 by R "Ray" Wang http://blog.softwareinsider.org/2012/02/27/mondays- musings-beyond-the-three-vs-of-big-data-viscosity-and- virality/ 2 ValueThe sixth ‘V’: 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences
  • 3. • heterogeneous • customized & optimized for research questions • lack of data standards • culture of data ‘hording’ • lack of data infrastructure (facilities) Making Small Data BIG: Succss and Challenges in the Earth Sciences 3 3/22/2016
  • 4. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 4 “While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.” “The long tail is a breeding ground for new ideas and never before attempted science.” (Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
  • 5. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 5
  • 6. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 6 … that form a picture
  • 7. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 7 The PetDB Synthesis Map shows data from >300 publications Symbols are locations of rock samples. Color is scaled to the 87Sr/86Sr isotope ratio in the rocks.
  • 8. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 8
  • 9. 9 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences “Understanding where the dust that's in the atmosphere and oceans comes from can help scientists estimate its impact on earth's climate system.” Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell) http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/ Example #1: Did New Zealand Dust Influence the Last Ice Age?
  • 10. Making Small Data BIG: Succss and Challenges in the Earth Sciences 10 3/22/2016
  • 11. Making Small Data BIG: Succss and Challenges in the Earth Sciences 11 3/22/2016
  • 12. Making Small Data BIG: Succss and Challenges in the Earth Sciences 12 3/22/2016
  • 13. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 13 Note the number of data points generated in this study (the yellow dots) in light of the effort that included collecting samples in NZ to operating expensive equipment in the lab.
  • 14. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 14 Example #2: Do convergent margin volcanoes really represent continental crust? “As it is crucial to understand the extent and origin of the compositional difference between central Aleutian lavas and plutons through time and space, this project will map and sample plutonic rocks exposed on the central Aleutians and their coeval volcanic host rocks.” “Results and the samples acquired in this study will help to answer fundamental questions of continental crust formation, and shed light on the formation mechanisms of plutons and volcanics in arcs.” http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=135851&org=NSF
  • 15. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 15 Anticipated Data: • ~ 250 samples • ~ 200 major element analyses • ~ 150 trace element analyses • 50 U/Pb zircon geochronology • 30 Ar-Ar ages • 80 Sr, Nd, Hf and Pb isotope analyses • 4 scientists (3 institutions) • 5 weeks on remote islands • a boat (with crew) • a helicopter
  • 16. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 16
  • 17. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 17
  • 18. Making Small Data BIG: Succss and Challenges in the Earth Sciences 18 3/22/2016
  • 19. • They are widely dispersed in the literature (past & present). • They are not openly accessible. • They lack sufficient and standardized metadata. • They are never published (“dark data”). 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 19
  • 20. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 20 findable identification, persistence accessible protection, protocols context, provenance re-usable harmonized, machine-readable interoperable small data Data Curation Standards Generic Repositories Domain-specific Data Standards Community Data Collections V a l u e
  • 21. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 21 findable identification, persistence accessible protection, protocols context, provenance re-usable harmonized, machine-readable interoperable small data Data Curation Standards Domain-specific Data Standards V a l u e Domain Repositories
  • 22. Making Small Data BIG: Succss and Challenges in the Earth Sciences 22 Science Community Domain specific Data facility 22 Libraries Archives CI, Computer Science Publishers, editors Discipline-specific data services • Context & provenance metadata • Semantics • Workflows Funding Agencies Data Facilities Registries 3/22/2016 Data curation services CI development Disciplinary Expertise
  • 23. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 23 Data Services for the Solid Earth Sciences www.iedadata.org
  • 24. 24 www.iedadata.org • Solid Earth Observational Data • High-T Geochemistry • Low-T Geochemistry • Petrology • Marine Geophysics & Geology • Geochronology • Cross-disciplinary tools & services • Sample registry SESAR • IEDA Data Browser • Portals (GeoPRISMs, USAP-DCC, etc.) • GeoMapApp • Interoperability Making Small Data BIG: Succss and Challenges in the Earth Sciences 3/22/2016
  • 25. 25 IEDA Repositories  >720,000 files  59 TB  4 x 106 samples IEDA Syntheses  19 x 106 analytical values in EarthChem  2.79 x 106 miles of data from 875 cruises in the Global Multi-Resolution Topography (GMRT) 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences
  • 26. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 26
  • 27. 27 Data Data Data Data Data EarthChem Library Data Data Data Data Data PetDB, SedDB EarthChem Portal Data Publication & Preservation Data Mining & Analysis Investigators Metadata Catalog Data & Metadata Data & Metadata External Systems EarthChem Data Managers FINDABLE & ACCESSIBLE • DOI registration • Long-term archiving • CC license • Guidelines for data reporting (community endorsed) • QC by data managers RE-USABLE & INTEROPERABLE • Data & metadata harmonization • Standards-compliant data model • Service Oriented Architecture (ECP) 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences
  • 28. Making Small Data BIG: Succss and Challenges in the Earth Sciences 28 DOI to allow proper citation Link to publications Link to funding source 28 3/22/2016
  • 29. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 29 Global compilation of geochemical data for igneous rocks from the ocean floor & mantle xenoliths > 2,200 data sets/publications > 84,000 samples > 3.2 million observed values http://www.earthchem.org/petdb
  • 30. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 30 Data from • >13,000 publications • >850,000 samples Total: >19.6 million analytical values Partner Databases: • PetDB • SedDB • GEOROC • USGS • MetPetDB • GANSEKI
  • 31. 3/22/2016 Making Small Data BIG: Succss and Challenges in the Earth Sciences 31 Filter by method or concentration
  • 33. • 500–800 downloads per quarter • >600 citations in the literature • many fundamental new discoveries & insights • Disciplinary • Multi-disciplinary • Unanticipated purposes • new scientific approaches • Statistical rather than hypothetical
  • 34. • Many samples and collections are not ‘online’. • Repositories lack resources & expertise to develop & maintain digital collection catalogs. • Samples often only described in publications. • Existing online catalogs are not connected or federated. • No easy way to search for samples. (Slide footer: February 25, 2016, DFG Rundgespräch Geochemical Databases)
  • 35. • Linking physical samples to the digital data generated by their study. • Reproducibility! Access to the physical samples is required to verify & reproduce observations. • Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data. • Broad sharing of physical samples for use & re-use. • Samples are often expensive to collect (drilling, remote locations). • Many samples are unique and irreplaceable. • Re-analysis augments utility of existing data. • Samples often serve in ways that the collectors and repositories could not have imagined.
  • 36. • Discovery & Access for Re-use and Reproducibility • Sample Citation • Data Integration • Sample Management IGSN = International Geo Sample Number
  • 38. “… AGU Publications also strongly encourages use of other identifiers in our journal papers. International Geo Sample Numbers (IGSNs) uniquely identify items, such as a rock sample, a piece of coral, or a vial of water taken from the natural environment, and provide important, consistent information about these samples.” Hanson, B. (2016), AGU opens its journals to author identifiers, Eos, 97, doi:10.1029/2016EO043183. Published on 7 January 2016.
  • 40. Technical • Organizational • Social/cultural
  • 41. • Limitation of resources versus diversity of data • Need best practices for all small data communities • Need flexibility and performance of database schemas & search applications • Need tools for investigators to improve quality of submitted data • Need tools for data managers to support (semi-automate?) QC workflow • Repository standards/certification • Inclusion of legacy data (data rescue) How can we grow small data across the Geosciences?
  • 43. Coalition for Publishing Data in the Earth & Space Sciences • Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice. • Alignment of data policies across different publishers • Advancing integration of publication and data submission workflows • Support for authors and editors to comply with publishers’ data policies • e.g., online community directory of appropriate Earth science community repositories that meet leading standards on curation, quality, and access • Increases development and enforcement of data best practices • Reduces effort of metadata QC • Increases flow of small data into repositories www.copdess.org
  • 44. • Cross-disciplinary development of community data model ODM2 (Observation Data Model) • Collaboration with commercial software engineering
  • 45. • Advances coordination, collaboration, and integration • Community governance • Integrative Activities • Fosters new data communities • Research Coordination Networks • Develops and adapts new technologies to structure, transform, integrate, document, harmonize data & metadata • Building Blocks
  • 46. The Alliance Testbed Project “Interdisciplinary Earth Data Alliance as a Model for Integrating EarthCube Technology Resources and Engaging the Broad Community” • Design & develop the organizational and technical architecture of a data facility that operates as an alliance of scientifically related data communities • Sharing data services and infrastructure that support trusted data curation and interdisciplinary science.
  • 48. • Build on and transition existing infrastructure of an established data facility (IEDA) to provide shared data services for all Alliance partners • Data Submission Hub • Trusted repository services (DOI registration, long-term preservation) • Deploy newly developed EarthCube (EC) technologies to align and integrate with EC architecture • CINERGI: pipeline for harvesting, improving, unifying, and republishing metadata records assembled by Alliance partners • GeoWS: mechanism for Alliance partners to exchange data with data discovery, search, and visualization tools across the Alliance • GeoLink: Vocabulary services to support the Data Submission Hub
  • 49. • Data Facility: IEDA • Including existing IEDA Partners: MGDS, EarthChem, SESAR, Geochron, ASP@UTIG, LEPR • Community Data Collection: MetPetDB • New data communities: Mineral Physics, Deep Seafloor Processes • New data provider: IcePod • EarthCube Building Blocks: CINERGI, GeoLink, GeoWS • Stakeholder Alignment: WayMark Systems
  • 50. • Small data grows BIG when properly curated, documented, harmonized, and integrated. • Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement. • Current approaches are not sufficiently scalable. • Partnerships and collaborations help address the challenges. • Integration with publications will augment the flow of data into repositories and data products. • Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation. • Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.
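As an illustration of what "harmonized and integrated" can mean in practice, here is a minimal Python sketch that merges two small, differently formatted data sets into one table with a shared schema. Everything in it is an assumption made for illustration: the field names, the alias table, the unit factors, and the sample records are hypothetical and do not reflect the actual EarthChem or PetDB data model.

```python
# Hypothetical conversion factors to one common unit (parts per million).
TO_PPM = {"ppm": 1.0, "wt%": 10_000.0, "ppb": 0.001}

# Hypothetical synonyms for analyte names as different authors might report them.
ANALYTE_ALIASES = {"sr": "Sr", "strontium": "Sr", "rb": "Rb", "rubidium": "Rb"}


def harmonize(record, source):
    """Map one author-specific record onto a shared (hypothetical) schema."""
    analyte = ANALYTE_ALIASES[record["analyte"].strip().lower()]
    value_ppm = float(record["value"]) * TO_PPM[record["unit"]]
    return {
        "sample_id": record["sample"],
        "analyte": analyte,
        "value_ppm": value_ppm,
        "source": source,  # keep provenance so each value stays citable
    }


def integrate(datasets):
    """Combine several small data sets into a single harmonized table."""
    table = []
    for source, records in datasets.items():
        for record in records:
            table.append(harmonize(record, source))
    return table


# Two made-up "long tail" data sets with different naming and units.
example = {
    "paper_A": [{"sample": "A-1", "analyte": "Strontium", "value": "0.012", "unit": "wt%"}],
    "paper_B": [{"sample": "B-7", "analyte": "sr", "value": 135, "unit": "ppm"}],
}
for row in integrate(example):
    print(row)
```

Keeping the `source` field on every row is a deliberate choice in this sketch: it preserves the link back to the original publication, which the slides treat as essential for citation and trusted re-use.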