www.sib.swiss
Exploring neXtProt data and beyond:
A SPARQLing solution
Biocuration 2019, Cambridge, UK
Monique Zahn
Overview
01 Introduction
02 Exploring neXtProt data using SPARQL
03 Biocuration and quality control
04 Summary and future perspectives
neXtProt – the SIB knowledgebase on human proteins
https://www.nextprot.org/
neXtProt data sources
A FAIR resource
Findable, Accessible, Interoperable, Reusable
• Entries have a globally unique and persistent ID
• Use controlled vocabularies and ontologies
• SPARQL endpoint
• Detailed provenance
• Data license CC BY 4.0
• All data since 2011 on FTP
• Code on GitHub
• Web: human readable
• API: machine parsable
• FTP: XML, PEFF, RDF/ttl
neXtProt Advanced Search
Complex queries to retrieve entries matching criteria, i.e. queries that cannot be performed using the simple search
Advanced Search results
NXQ_00022 Proteins with no function annotated
neXtProt SNORQL Search
Queries to retrieve entries and other information matching criteria
SNORQL Search results
NXQ_00125 Domains that are entirely covered by 3D structures
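A query like the ones run in SNORQL can also be sent to a SPARQL endpoint from a script. The sketch below builds such a request following the SPARQL 1.1 Protocol (a GET with a `query` parameter and an `Accept` header asking for JSON results). The endpoint URL and the generic triple-pattern query are assumptions for illustration, not taken from the slides; check the neXtProt documentation for the current endpoint address and data model.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint address (not stated in the slides); verify before use.
ENDPOINT = "https://sparql.nextprot.org/"

# Generic query that is valid on any SPARQL endpoint; it does not rely on
# the neXtProt data model.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(EXAMPLE_QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send the query; this is left out here so that the example runs without network access.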
Data visualization
Use JavaScript to visualize SPARQL query results with Highcharts (highcharts.com)
SPARQL query
Retrieve results
Plot results
Visualizing SPARQL query results
Automatically updated at each data release
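The three steps (SPARQL query, retrieve results, plot results) can be sketched as follows. The slides use JavaScript with Highcharts; this sketch uses Python only to show the data transformation, turning results in the standard SPARQL 1.1 JSON results format into a Highcharts-style options object. The variable names `label` and `count`, the chart title and the sample values are placeholders invented for illustration, not neXtProt data.

```python
import json

# Invented sample in the SPARQL 1.1 JSON results format (placeholder values).
SAMPLE_RESULTS = {
    "head": {"vars": ["label", "count"]},
    "results": {
        "bindings": [
            {"label": {"type": "literal", "value": "category A"},
             "count": {"type": "literal", "value": "3"}},
            {"label": {"type": "literal", "value": "category B"},
             "count": {"type": "literal", "value": "5"}},
        ]
    },
}


def bindings_to_series(results, label_var, count_var):
    """Extract parallel lists of labels and integer counts from the bindings."""
    labels, counts = [], []
    for row in results["results"]["bindings"]:
        labels.append(row[label_var]["value"])
        counts.append(int(row[count_var]["value"]))
    return labels, counts


def highcharts_options(title, labels, counts):
    """Build a column-chart options object using Highcharts option names."""
    return {
        "chart": {"type": "column"},
        "title": {"text": title},
        "xAxis": {"categories": labels},
        "series": [{"name": title, "data": counts}],
    }


labels, counts = bindings_to_series(SAMPLE_RESULTS, "label", "count")
options = highcharts_options("Placeholder chart", labels, counts)
print(json.dumps(options, indent=2))
```

Because the chart is generated from query results, re-running the query after a data release is enough to refresh it.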
SPARQL for biocuration
Routine maintenance
Updating controlled vocabularies
Finding potential annotation targets
NXQ_00022 Proteins with no function annotated
NXQ_00125 Domains that are NOT entirely covered by 3D structures
Tracking progress
Number of entries with a GO molecular function
Number of GO molecular function terms annotated
Updating controlled vocabularies
A SPARQL query that finds protein domains not associated with any entry returns the domains to be deleted
SPARQL for quality control
Examples for spot checks
Entry with the most isoforms (CACNA1C NX_Q13936 37 isoforms)
Entry with the most variants (TTN NX_Q8WZ42 24,240 variants)
Global checks
Find data which is:
• missing
• incorrect
• incomplete
Automated quality control (1)
Automated quality control (2)
Each SPARQL query must return 0 rows if there is no error
Execute a batch of queries on either the test or the production server
There are 4 protein domains with no associated entry; these are to be deleted.
Automated quality control (3)
There are 10 sequence conflict annotations for which the
quality has been incorrectly assigned.
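The rule that a quality-control query returns 0 rows when there is no error lends itself to a small batch runner. The following is a minimal sketch of that idea, not the neXtProt implementation: the check names, the query strings and the `fake_run_query` stand-in are hypothetical, and in practice `run_query` would send each query to the test or production endpoint.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_quality_checks(
    checks: Dict[str, str],
    run_query: Callable[[str], List[Row]],
) -> Dict[str, int]:
    """Run every check and return {check name: row count} for failing checks.

    A check passes when its query returns no rows.
    """
    failures = {}
    for name, query in checks.items():
        rows = run_query(query)
        if rows:
            failures[name] = len(rows)
    return failures


# Hypothetical checks; the strings stand in for real SPARQL queries.
CHECKS = {
    "domains-without-entry": "HYPOTHETICAL QUERY 1",
    "entries-without-name": "HYPOTHETICAL QUERY 2",
}


def fake_run_query(query: str) -> List[Row]:
    """Stand-in for an endpoint call, returning canned placeholder rows."""
    canned = {
        "HYPOTHETICAL QUERY 1": [{"domain": f"placeholder-{i}"} for i in range(4)],
        "HYPOTHETICAL QUERY 2": [],
    }
    return canned[query]


failures = run_quality_checks(CHECKS, fake_run_query)
for name, count in failures.items():
    print(f"FAILED {name}: {count} row(s) returned")
```

Passing the query runner in as a function means the same batch can be pointed at either server without changing the checks.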
Summary and future perspectives
Exploration
Execute complex queries
Query several database endpoints (federated queries)
Biocuration
Quality control
User guide to using SPARQL in neXtProt
Dashboard
Visual detection of anomalies
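Federated queries, mentioned under Exploration, rely on the `SERVICE` keyword of SPARQL 1.1 Federated Query, which delegates part of a query to a second endpoint. The sketch below only assembles such a query string. The triple patterns are generic placeholders that do not reflect the neXtProt or UniProt data models, and the remote endpoint URL is an assumption to verify before use.

```python
def federated_query(local_pattern: str, remote_endpoint: str,
                    remote_pattern: str, variables: str = "*") -> str:
    """Assemble a SELECT query whose second pattern runs on a remote endpoint."""
    return (
        f"SELECT {variables} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )


# Placeholder patterns joined on the shared variable ?shared.
query = federated_query(
    local_pattern="?entry ?p ?shared .",
    remote_endpoint="https://sparql.uniprot.org/sparql",  # assumed URL
    remote_pattern="?shared ?q ?value .",
    variables="?entry ?value",
)
print(query)
```

The endpoint that receives this query evaluates the first pattern itself and sends the pattern inside `SERVICE` to the remote endpoint, joining the two on `?shared`.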
Acknowledgements
neXtProt
Directors
Amos Bairoch, Lydie Lane
Biocurators
Pascale Gaudet, Paula Duek
Developers
Pierre-André Michel, Alain Gateau, Frédéric Nikitin
Researcher
Mathieu Schaeffer (poster 80), Kasun Samarasinghe
Quality assurance
Monique Zahn (poster 92)
Web: https://www.nextprot.org/
Twitter: @neXtProt_news
ResearchGate: neXtProt project
Photograph taken by Chris James
Thank you!
E-mail: monique.zahn@sib.swiss
