Data Review and Clean-Up Using
Crowdsourced Input via the
US EPA CompTox Dashboard
Antony Williams1, Katie Paul-Friedman1, Jason Brown2,
Chris Grulke1, Emma L. Schymanski3 and Jeff Edwards1
1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC
2) ORAU
3) Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, Luxembourg
August 2018
ACS Fall Meeting, Boston
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Overview
• An introduction to the EPA’s CompTox Dashboard
• Our data: experimental and predicted property data, high-throughput screening assay data, hazard and environmental exposure data, and more
• Data quality: high-quality data are needed but challenging to produce
– Millions of individual data points and annotations
– Hundreds of thousands of chemicals
• What is the role of user feedback?
• Our efforts to curate our ToxCast bioassay data
CompTox Portal
The CompTox Chemistry Dashboard
• A publicly accessible website delivering:
– A new entry portal for all NCCT dashboards
– ~762,000 chemicals with related property data
– Search by chemical, product use, gene and assay (ToxCast)
– Experimental and predicted physicochemical property data
– “Bioactivity data” for the ToxCast/Tox21 project
– Generalized Read-Across (GenRA) module
– Links to other agency websites and public data resources
– “Literature” searches for chemicals using public resources
– “Batch searching” for thousands of chemicals
– Downloadable open data for reuse and repurposing
CompTox Dashboard
https://comptox.epa.gov/dashboard
CompTox Dashboard: Chemicals
CompTox Dashboard: Products and Use Categories
CompTox Dashboard: Assays and Genes
Detailed Chemical Pages
Physicochemical Properties
Access to Chemical Hazard Data
Hazard Data from “ToxVal_DB”
• The ToxVal database contains the following data:
– 30,050 chemicals
– 772,721 toxicity values
– 29 sources of data
– 21,507 sub-sources
– 4,585 journals cited
– 69,833 literature citations
In Vitro Bioassay Screening: ToxCast and Tox21
How can we curate our data?
• Crowdsourcing is now a well-proven approach
• Comments can be added at the record level
• Administrators review and respond to submitted comments
Public Crowdsourced Comments
https://comptox.epa.gov/dashboard/comments/public_index
Reviewer comments are public
MassBank/CompTox Curation of External Data
o A “nice” example: 4,4'-Bis(2-sulfostyryl)biphenyl
Purchased: CAS 27344-41-8, DTXSID6036467
Registered: CAS 38775-22-3 (UFZ), DTXSID7047017
Comments to date
• Most comments so far fall into four groups:
– Structure and names/CASRN do not match
– Requests to add synonyms
– Requests to add specific property data
– Structure layout/depiction needs improving
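A first-pass check for the most common comment type (names/CASRN that do not match the structure) can be automated before a curator looks at the record. The sketch below is an illustration, not the Dashboard's own validation code: it applies the standard CAS Registry Number check-digit rule to the two CASRNs from the MassBank example. The function name `is_valid_casrn` is hypothetical. A correct check digit only shows that a number is well formed; it cannot show that the number belongs to the depicted structure.

```python
import re

def is_valid_casrn(casrn: str) -> bool:
    """Return True if `casrn` is well formed and its check digit is correct.

    Standard CAS rule: weight the digits before the check digit 1, 2, 3, ...
    from right to left, sum the products, and take the result modulo 10.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit

# The first two CASRNs come from the MassBank example; the third is a
# deliberately corrupted copy of the first (illustration only).
for casrn in ("27344-41-8", "38775-22-3", "27344-41-3"):
    print(casrn, is_valid_casrn(casrn))
```

Both CASRNs from the example pass, which is the point: a checksum catches typing errors, while a mismatch between two valid numbers still needs a curator.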
Crowdsourcing Comments
Single-cell commenting added
• Highlight an alphanumeric text string
Bioactivity Data
• Hundreds of thousands of bioactivity curves to review
• Impossible to review every one manually
• Now accepting public crowdsourced comments
• Public crowdsourcing alone will not suffice
Internal Review of 25,000 Curves
Screenshot of the entry page of the beta R Shiny application for NCCT users (Brown & Paul-Friedman)
Example curves from the review:
• A “good fit” bioactivity curve
• A single point in the middle of the concentration range drives an ACTIVE hit call
• …and a gain-loss fit with a cell viability assay (makes little sense)
• Abnormally high noise
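The problem patterns named on these slides can be written down as simple screening flags that decide which curves go to a reviewer first. The sketch below is a minimal illustration under stated assumptions, not the ToxCast pipeline's flagging logic or the reviewers' method: the function `review_flags`, its parameters, the `"gain-loss"` label, the flag names and the `noise_factor` threshold are all hypothetical.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: a robust measure of scatter."""
    center = median(values)
    return median(abs(v - center) for v in values)

def review_flags(responses, cutoff, typical_bmad,
                 fit_model="", is_viability_assay=False, noise_factor=3.0):
    """Return review flags for one concentration-response series.

    responses    : responses ordered by increasing concentration
    cutoff       : activity threshold behind the hit call (assumed given)
    typical_bmad : typical baseline scatter for the assay (assumed given)
    """
    flags = []
    last = len(responses) - 1
    above = [i for i, r in enumerate(responses) if r >= cutoff]

    # Exactly one point exceeds the cutoff and it is not at either end.
    if len(above) == 1 and 0 < above[0] < last:
        flags.append("single_point_mid_range")

    # A gain-loss winning model is hard to interpret for a viability readout.
    if fit_model == "gain-loss" and is_viability_assay:
        flags.append("gain_loss_in_viability_assay")

    # Scatter of the sub-cutoff points far above the usual baseline scatter.
    below = [r for r in responses if r < cutoff]
    if len(below) >= 3 and mad(below) > noise_factor * typical_bmad:
        flags.append("high_noise")

    return flags

print(review_flags([0, 1, -1, 30, 2, 0, 1, -2], cutoff=15, typical_bmad=1.5))
```

Flags like these only prioritise curves for review; whether a flagged curve is actually a poor fit remains a reviewer's judgment.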
Internal Curve Review Results
• Internal curve review has resulted in:
– Corrections to fitting procedures in the ToxCast pipeline
– Identification of issues with source data
– Identification of additional flags or filters that could be used, depending on the application of ToxCast data
– A beta implementation of quality assurance for HTS data
– Brown & Paul-Friedman, Uncertainty in ToxCast Curve-Fitting: Quantitative and Qualitative Descriptors Inform a Model to Predict Reproducible Fits (in preparation)
The CompTox Dashboard for Structure Identification by MS
Collaborative Data Curation
• Mapping between our data (and websites) has resulted in collaborative data curation
• Collaboration with Emma Schymanski on the NORMAN Suspect Exchange
https://www.norman-network.com/?q=node/236
• Our process for mapping data is iterative
NORMAN Suspect Exchange
http://www.norman-network.com/?q=node/236
Example: NORMAN Priority List
Mapping on Two Identifiers
Mapping Quality Control (I)
Mapping Quality Control (II)
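The idea behind mapping on two identifiers with a quality-control step can be sketched as follows. This is a hedged illustration, not the Dashboard's mapping code: the slides do not say which two identifiers are used, so CASRN and chemical name are assumed here, and the function `map_record`, the status labels, and the toy lookup tables (with made-up IDs `X1` and `X2`) are all hypothetical.

```python
def map_record(casrn, name, by_casrn, by_name):
    """Map one list entry via two identifiers and report how they agree.

    Returns (mapped_id, status), where status is one of
    "agree", "conflict", "single_identifier" or "unmapped".
    """
    id_from_casrn = by_casrn.get(casrn.strip())
    id_from_name = by_name.get(name.strip().lower())

    if id_from_casrn and id_from_name:
        if id_from_casrn == id_from_name:
            return id_from_casrn, "agree"
        # The two identifiers point at different records: curate manually.
        return None, "conflict"
    if id_from_casrn or id_from_name:
        # Only one identifier resolved: usable, but lower confidence.
        return id_from_casrn or id_from_name, "single_identifier"
    return None, "unmapped"

# Toy lookup tables with made-up IDs and names (illustration only).
by_casrn = {"27344-41-8": "X1", "38775-22-3": "X2"}
by_name = {"substance a": "X1", "substance b": "X1"}

print(map_record("27344-41-8", "Substance A", by_casrn, by_name))
print(map_record("38775-22-3", "Substance B", by_casrn, by_name))
```

Separating "agree" from "conflict" and "single_identifier" lets each pass resolve the clear cases automatically and leave the remainder for the next round of curation, in keeping with the iterative process described earlier.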
Example: NORMAN Priority List
>23 NORMAN Lists Available
Progressive Curation on #Lists
(many) more registrations…
Cleaning up lists to remove errors
Undefined mixtures (UVCBs)
Conclusion
• The CompTox Dashboard provides access to data for ~760,000 chemicals
• Crowdsourced comments from users may help clean the data
• Record-level data curation: comments can target any alphanumeric string
• Improved data quality provides better data for the modeling that underpins our prediction algorithms
• Collaborative data curation benefits extended communities of users
• Data curation is never complete: our data expands daily
Contact
Antony Williams
US EPA Office of Research and Development
National Center for Computational Toxicology (NCCT)
Williams.Antony@epa.gov
ORCID: https://orcid.org/0000-0002-2668-4821



Editor's Notes

  • #28: Comments from Katie: In terms of verbal comments, you might note that we are reviewing only the uncertainty related to the curve-fitting itself. Sometimes people conflate uncertainty in the response (e.g., assay interference due to autofluorescence or cytotoxicity) with curve-fitting uncertainty. These are two distinct sources of uncertainty. An autofluorescent compound should produce a beautiful curve in an assay that measures fluorescence as an indirect marker of some bioactivity, and such curves will get an “active” score. When filtering the data for use (say you only want to know the chemicals that definitively activate some receptor), you might want to layer in not only the reproducibility of the curve fitting but also other information, such as the cytotoxicity burst or information on autofluorescence or nonspecific protein inhibition. This is probably elementary to you, but it is a point I like to make because so many people think that we should auto-filter data for these different sources of assay interference. I am okay with pre-set options to select for filtering, but different applications will require different levels of stringency in terms of filtering.
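The note's argument for pre-set filtering options instead of one automatic filter can be sketched as named presets with different stringency. This is an illustration under assumptions, not an existing Dashboard or ToxCast feature: the `CurveResult` fields, the preset names, the `passes` function and the example records are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CurveResult:
    chemical: str
    active: bool                 # hit call from curve fitting
    fit_reproducible: bool       # curve-fitting uncertainty is acceptable
    autofluorescent: bool        # known assay-interference source
    in_cytotoxicity_burst: bool  # activity coincides with cytotoxicity

# Each preset names the checks to apply on top of the hit call.
PRESETS = {
    "screening": set(),
    "curve_fit_only": {"fit"},
    "definitive": {"fit", "autofluorescence", "cytotoxicity"},
}

def passes(result: CurveResult, preset: str) -> bool:
    """Return True if an active call survives the chosen preset."""
    checks = PRESETS[preset]
    if not result.active:
        return False
    if "fit" in checks and not result.fit_reproducible:
        return False
    if "autofluorescence" in checks and result.autofluorescent:
        return False
    if "cytotoxicity" in checks and result.in_cytotoxicity_burst:
        return False
    return True

# A well-fitted curve from an autofluorescent compound (made-up record).
example = CurveResult("compound A", True, True, True, False)
for preset in PRESETS:
    print(preset, passes(example, preset))
```

The example record passes the two looser presets and fails only the strictest one, which mirrors the note's distinction between curve-fitting uncertainty and uncertainty from assay interference.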