Big Data Repository for
Structural Biology:
Challenges and Opportunities
Piotr Sliz, PhD
sliz@hkl.hms.harvard.edu
SBGrid: http://sbgrid.org
SBGrid Data Bank: http://data.sbgrid.org
Twitter: @SBGrid
YouTube: SBGridTV
SBGrid
Consortium
Support Center at Harvard Medical School
300 Research Groups
13 Countries
Long Term Sustainability: Membership Fee
Harvard Medical School
Core Mission:
SBGrid supports compilation, installation
and upgrades of ~300 scientific applications
Several Software Categories (EM, NMR, X-ray, Comp Chem, etc.)
Multiple versions of most applications
OS X (10.6-10.10) and Linux (CentOS 5-7) support
No additional end-user configuration required
Software always works = more time for research
Other Activities:
Grid Computing (Open Science Grid VO + Grid Portal)
General Research Infrastructure (Boston Area)
Training (workshops, software cataloguing, webtales)
Webinars at youtube.com/SBGridTV
Developer Resources
Advocating for Open Source Software
Morin et al. Shining Light into Black Boxes. Science, 2012.
Additional Publications
Primary Citation:
Other Citations:
New Opportunity:
Data
anonymous SBGrid member 1:
“we cannot find the original frames for many of our
structures (move from X to Y), including recent high
impact projects. What do you recommend that we do?”
anonymous SBGrid member 2:
“I was able to locate the data directory
but I must have done a good job
cleaning up the disk space before I
left: usually there are only two .img files
left in the data directory, the 1st and
the last image of a full run.”
Lack of Storage Support
for Diffraction Images
derive
reproduce
improve
correct
• Stokes-Rees, I., Levesque, I., Murphy, F.V., Yang, W., Deacon, A., and Sliz, P. (2012). Adapting federated cyberinfrastructure for shared data collection facilities in structural biology. J. Synchrotron Radiat. 19, 462–467.
• Terwilliger, T.C., and Bricogne, G. (2014). Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data. Acta Crystallogr. D Biol. Crystallogr. 70, 2533–2543.
• Terwilliger, T.C. (2014). Archiving raw crystallographic data. Acta Crystallogr. D Biol. Crystallogr.
• Guss, J.M., and McMahon, B. (2014). How to make deposition of images a reality. Acta Crystallogr. D Biol. Crystallogr. 70, 2520–2532.
Focus on Primary Data
SBGrid Data Bank. Pilot: May 1st, Production: June 1st, 2015
EZID
Dataset
Lock
BIODBCORE-000683
re3data.org
Data Mining and Annotation
Web Interface
Related Datasets
Depositors:
URL: data.sbgrid.org
Dataset Landing Page
DataCite Schema
CC0 License
Download
Dataset URL
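To make the landing-page elements above concrete, here is a minimal sketch of what a DataCite-style metadata record for one deposited dataset might look like. The field list reflects the mandatory properties of recent DataCite schema versions and should be checked against the version actually in use. Every value, the DOI included, is a hypothetical placeholder, and the helper `missing_fields` is an illustration, not part of the SBGrid Data Bank.

```python
# Hypothetical sketch: checking a DataCite-style record for mandatory fields.
# Field names mirror DataCite property names; all values are placeholders.

MANDATORY_FIELDS = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceTypeGeneral",
)


def missing_fields(record):
    """Return the mandatory fields that are absent or empty in `record`."""
    return [name for name in MANDATORY_FIELDS if not record.get(name)]


example_record = {
    "identifier": "doi:10.0000/placeholder",  # not a real DOI
    "creators": ["Doe, Jane"],                # placeholder depositor
    "titles": ["X-ray diffraction images for an example structure"],
    "publisher": "SBGrid Data Bank",
    "publicationYear": 2015,
    "resourceTypeGeneral": "Dataset",
    "rights": "CC0 1.0 Universal",            # optional field, matches the CC0 license
}

print(missing_fields(example_record))
```

A record that passes this check is complete only in the narrow sense of having the mandatory fields filled in; it says nothing about whether the values are correct.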
Current Statistics
Publication Workflow:
Data Access Alliance:
Make Data easily accessible for reprocessing
Minimize Project Cost
Increase Redundancy
Challenges
Dataset Size (APIs, Data Access Alliance)
Journal + Data Automation
automated embargo release
cross-referencing
coordination/communication with journals
Data vs Journal Citations
Metrics:
Dataset Deposition Rates
Data Use: DAA Membership vs. direct downloads
Dataset Quality (Level 0-2)
Data Citations
Master Format
OME-TIFF vs DataCite vs DataVerse schema
Transition to a Research Data Management Software
ORCID integration and adoption
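The "automated embargo release" item listed under Journal + Data Automation could be sketched as a periodic check that selects datasets whose embargo date has passed. This is only an illustration under assumed names: the record layout (`id`, `embargo_until`, `released`) and the function `datasets_to_release` are hypothetical and do not describe the SBGrid Data Bank implementation.

```python
# Hypothetical sketch of an embargo-release check; the record layout is assumed.
from datetime import date


def datasets_to_release(datasets, today):
    """Return IDs of unreleased datasets whose embargo date is on or before `today`."""
    return [
        d["id"]
        for d in datasets
        if not d["released"] and d["embargo_until"] <= today
    ]


example = [
    {"id": "ds-1", "embargo_until": date(2015, 6, 1), "released": False},
    {"id": "ds-2", "embargo_until": date(2015, 12, 1), "released": False},
    {"id": "ds-3", "embargo_until": date(2015, 1, 1), "released": True},
]

print(datasets_to_release(example, date(2015, 7, 1)))  # ['ds-1']
```

In practice the embargo date would probably be tied to the journal publication date, which is why the slide pairs this item with coordination and communication with journals.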
Opportunities
Better support to ~300 structural biology laboratories:
Compliance
Reproducibility
Integration with PDB and other repositories
Other data types in addition to X-ray diffraction
Thank you
Piotr Sliz, PhD
sliz@hkl.hms.harvard.edu
SBGrid: http://sbgrid.org
SBGrid Data Bank: http://data.sbgrid.org
Twitter: @SBGrid
YouTube: SBGridTV
Stephanie Socias
Pete Meyer
Merce Crosas

