UniProt and the Semantic Web
                     Chimezie Ogbuji
‘Omics’ Data Challenges
 Advances in protein science are a major catalyst in the
  exploding availability of bioinformatics data
 We have already discussed the dimensions of omics
  data:
   Molecular components, interactions, and phenotype
    observations

 Data from large-scale experiments are no longer
  published conventionally but stored in a database
 Protein sequence databases are one of the most
  comprehensive information resources for scientists
Protein Sequence Databases
 Universal protein sequence databases cover all species

 Specialized protein databases are particular to a protein
  family or organism

 Sequence repositories
   A simple registry of sequence records
   No annotations

 Curated protein databases
   Enrich sequence information with links to various sources
    (scientific literature primarily)
Informatics Challenges
 The standard data integration challenge is the lack of
  common conventions

 This applies not just to notation but also to:
   Use of identifiers
   Representation of cross-references
   Framework for defining terms and relationships between
    them

 Links between omics sources are another important
  component of data integration
What is UniProt?
 A comprehensive repository of protein sequences and
  their functional annotations

 Curators add value to raw data by annotating it against
  the scientific literature

 Objective: the creation and maintenance of stable,
  comprehensive, and high-quality protein databases,
  with a high level of accessibility, to facilitate
  cross-database information retrieval

 Makes use of Semantic Web technologies to address its
  challenges
UniProt: Core Activities
 Sequence archiving

 Manual (peer-reviewed) and automated curation of
  sequences

 Development of a human- and machine-readable UniProt web
  site

 Interaction with other protein-related databases for
  expanding cross references
UniProt: Components
  UniProtKB – Protein sequence annotations and metadata:
    Protein name, function, taxonomy, enzyme-specific
     information, domains, sites, subcellular location, interactions,
     relationships to disease etc.
    Links to external sources: DNA sequence repositories, protein
     structure databases, protein domain and family databases, and
     species & function-specific data collections
  UniRef – Clusters similar sequences at different resolutions
    Parameterized by the percent identity between two sequences or
     sub-sequences (100, 90, 50).
  UniParc – Non-redundant archive of all publicly
   available protein sequences
    Manages globally unique identifiers, the sequence, information
     on the source database, and a CRC checksum.
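The non-redundancy idea behind UniParc can be sketched in a few lines. This is only an illustration, not UniParc's implementation: the class name, the identifier format, and the use of CRC-32 from zlib are assumptions made for the example (the slide says only that a CRC check number is kept).

```python
import zlib


class SequenceArchive:
    """Toy non-redundant archive: one record per distinct sequence."""

    def __init__(self):
        self._by_sequence = {}  # sequence string -> record
        self._next_number = 1

    def add(self, sequence, source_db):
        """Register a sequence seen in source_db and return its archive id."""
        record = self._by_sequence.get(sequence)
        if record is None:
            record = {
                # Hypothetical identifier format, for illustration only.
                "id": f"SEQ{self._next_number:06d}",
                "sequence": sequence,
                "sources": set(),
                # CRC-32 stands in for whatever checksum the real archive uses.
                "crc": zlib.crc32(sequence.encode("ascii")),
            }
            self._next_number += 1
            self._by_sequence[sequence] = record
        record["sources"].add(source_db)
        return record["id"]

    def sources(self, sequence):
        """Return the set of source databases that reported this sequence."""
        return set(self._by_sequence[sequence]["sources"])

    def __len__(self):
        return len(self._by_sequence)


archive = SequenceArchive()
first = archive.add("MKTAYIAKQR", "DatabaseA")
second = archive.add("MKTAYIAKQR", "DatabaseB")  # same sequence, new source
third = archive.add("MSDNELLQKA", "DatabaseA")
```

The same sequence submitted from two databases receives a single identifier, and the record simply accumulates its sources.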
Semantic Web Technologies
 Set of standards for managing web-based content in a way
  that emphasizes use by an automaton
   Automaton: a machine that performs a function according to
    a predetermined set of coded instructions
 The architectural vision (the Semantic Web) is to extend the
  standards and best practices behind the World Wide Web with
  new standards that emphasize the meaning of data over its
  structure.
   Common data formats
   Provide a means to make assertions about the world such that
     an automaton can reason about it through them
 The vision is often confused with the tools meant to achieve
  it (i.e., the set of standards)
RDF: Data Model
 Standardized format for representing arbitrary
  information as a labelled, directed graph

 Composed of statements (triples): subject, predicate, object

 Terms in statements can be Uniform Resource
  Identifiers (URIs), Blank Nodes (anonymous entities), or
  Literals

 Abstract data model: a labelled, directed graph

 Various serializations: XML-based and text-based
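A minimal sketch of the data model described above, using only the Python standard library. The example.org URIs, property names, and values are hypothetical, and the text output only approximates a line-based serialization (no literal escaping).

```python
from collections import namedtuple

# The three kinds of terms that may appear in a statement.
URI = namedtuple("URI", "value")
Literal = namedtuple("Literal", "value")
BlankNode = namedtuple("BlankNode", "label")

EX = "http://example.org/"  # hypothetical namespace

protein = URI(EX + "protein/P1")
citation = BlankNode("b0")  # an anonymous entity with no URI of its own

# A graph is a set of (subject, predicate, object) statements.
graph = {
    (protein, URI(EX + "name"), Literal("Example protein")),
    (protein, URI(EX + "organism"), URI(EX + "taxon/T1")),
    (protein, URI(EX + "citation"), citation),
    (citation, URI(EX + "title"), Literal("An example paper")),
}


def outgoing(graph, subject):
    """Return the labelled edges (predicate, object) leaving a node."""
    return {(p, o) for (s, p, o) in graph if s == subject}


def term_to_text(term):
    """Render one term in a simple text-based form."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    return f'"{term.value}"'


def statement_to_text(statement):
    """Render one statement as a single line ending in a period."""
    return " ".join(term_to_text(t) for t in statement) + " ."
```

Each statement is one labelled edge: the predicate is the label, and the subject and object are the nodes it connects.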
Information About John Smith
Modelling vocabulary: RDFS/OWL
 RDF Schema (RDFS)
   Simple, minimal schema language for RDF

 Web Ontology Language (OWL)
   Vocabulary for defining classes, relationships, and various
    constraints that limit how RDF is interpreted
   More powerful modeling language

 Tools for constraining & defining reality that can be
  used to codify scientific understanding
 Gene Ontology is modelled in this way to capture our
  understanding of macromolecular reality
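To illustrate how a modelling vocabulary lets a machine draw conclusions, the sketch below computes what follows from a subclass hierarchy. The class names are hypothetical, and the transitive-closure rule shown is only one small part of what RDFS and OWL can express.

```python
def superclasses(cls, subclass_of):
    """Return every class reachable from cls by following subclass links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(stated_type, subclass_of):
    """An instance of a class is also an instance of all its superclasses."""
    return {stated_type} | superclasses(stated_type, subclass_of)


# Hypothetical class hierarchy: child class -> set of direct parent classes.
subclass_of = {
    "Kinase": {"Enzyme"},
    "Enzyme": {"Protein"},
    "Protein": {"Macromolecule"},
}

types_of_kinase_instance = inferred_types("Kinase", subclass_of)
```

Only "Kinase" is stated for the instance; the other three types are entailed by the schema.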
Query Language: SPARQL
 Provides a common graph-matching language for
  querying RDF data

 Similar to SQL in many respects
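Graph matching can be sketched without a SPARQL engine. The toy matcher below handles only conjunctions of triple patterns, with variables written as strings starting with "?"; the data and property names are hypothetical. The patterns correspond roughly to the SPARQL query SELECT ?protein WHERE { ?protein ex:organism ?taxon . ?taxon ex:name "Taxon one" }.

```python
def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    extended = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable binds on first use and must agree afterwards.
            bound = extended.setdefault(pattern_term, triple_term)
            if bound != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def query(graph, patterns):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical data, with plain strings standing in for URIs and literals.
graph = [
    ("P1", "organism", "T1"),
    ("P2", "organism", "T1"),
    ("P3", "organism", "T2"),
    ("T1", "name", "Taxon one"),
    ("T2", "name", "Taxon two"),
]

patterns = [
    ("?protein", "organism", "?taxon"),
    ("?taxon", "name", "Taxon one"),
]

proteins = sorted(solution["?protein"] for solution in query(graph, patterns))
```

The shared variable ?taxon plays the role that a join column plays in SQL.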
Nature of UniProt Data
 Very large number of cross references to external
  resources

 Cross-reference topology is that of a graph, not a tree

 Automated and manual annotation require storage of
  provenance information (how / when data was
  acquired)

 Requires a framework for both data as well as metadata
  (data about data)
UniProt Distribution
UniProt: Data Conventions
 All outbound RDF statements are grouped together
  (statements about the same subject)

 Each dataset (a node in the previous graph) is distributed as a
  single file

 Only stores stated data, not entailed data.
   For instance, relationships involving symmetric properties
    are only stored in one direction
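The difference between stated and entailed data can be shown with a symmetric property. This is a sketch under assumed names: the property "interactsWith" and the sample statements are hypothetical and are not taken from the UniProt vocabulary.

```python
# Properties declared symmetric in the (hypothetical) schema.
SYMMETRIC_PROPERTIES = {"interactsWith"}

# Stated data: the symmetric relationship is stored in one direction only.
stated = {
    ("P1", "interactsWith", "P2"),
    ("P1", "locatedIn", "Nucleus"),
}


def with_symmetric_entailments(triples, symmetric_properties):
    """Return the stated triples plus the reverse of each symmetric one."""
    entailed = set(triples)
    for subject, predicate, obj in triples:
        if predicate in symmetric_properties:
            entailed.add((obj, predicate, subject))
    return entailed


complete = with_symmetric_entailments(stated, SYMMETRIC_PROPERTIES)
```

A consumer that needs both directions derives the reverse statement itself, so the distributed files stay smaller.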
UniProt: Naming Conventions
 Generally, in semiotics: a symbol denotes a referent.

 In Web architecture, URIs identify resources
   URIs that can be resolved over the web are URLs

 UniProt URIs identify:
   Resources that correspond to database entries
   Modeling vocabulary that uses standard namespaces: RDFS
    and OWL
   Classes and properties used by UniProt
     For example: http://purl.uniprot.org/core/Gene
   Resources without stable identifiers (from their source)
The Omics Identification Problem
 UniProt uses a templated naming convention:
   http://purl.uniprot.org/{database}/{identifier}
   http://purl.uniprot.org/uniprot/{protein_identifier}

 Problem
     http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
      protein EX-1
     If loading that address in a browser returns a web page, can an
      automaton infer that Malaria protein EX-1 is a web page?
     How do you identify abstract concepts vs. digital media?
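The templated naming convention can be expressed as a pair of small helper functions. The function names are hypothetical; the base address and the {database}/{identifier} template are the ones given on the slide.

```python
from urllib.parse import quote, unquote

BASE = "http://purl.uniprot.org/"


def purl(database, identifier):
    """Fill in the {database}/{identifier} template."""
    return f"{BASE}{quote(database, safe='')}/{quote(identifier, safe='')}"


def protein_uri(accession):
    """Proteins live in the 'uniprot' database segment of the template."""
    return purl("uniprot", accession)


def parse_purl(uri):
    """Split a templated URI back into (database, identifier)."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a templated URI: {uri}")
    database, _, identifier = uri[len(BASE):].partition("/")
    if not database or not identifier:
        raise ValueError(f"missing database or identifier: {uri}")
    return unquote(database), unquote(identifier)
```

For example, protein_uri("P04926") yields the address discussed above, and parse_purl reverses the mapping.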
The PURL Solution
 A Persistent Uniform Resource Locator (PURL) service is a public
  URI management service that allocates a ‘URI space’: a
  mapping of identifiers (aliases) for resources that the
  service is not directly responsible for
 PURLs are web addresses that act as permanent
  identifiers in the face of a dynamic and changing Web
  infrastructure
 A request to a PURL returns a 303 HTTP status code and
  a location:
   303 (See Other) indicates that a response can be found under the
    returned location
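How a client might interpret the 303 response can be sketched without touching the network. Everything here is a simulation: fake_fetch, its redirect table, and the target address are assumptions for illustration, not the behaviour of the actual PURL server.

```python
# Hypothetical redirect table standing in for a PURL server.
REDIRECTS = {
    "http://purl.uniprot.org/uniprot/P04926":
        "http://www.uniprot.org/uniprot/P04926",
}


def fake_fetch(uri):
    """Simulated HTTP request: returns (status code, headers)."""
    location = REDIRECTS.get(uri)
    if location is None:
        return 404, {}
    return 303, {"Location": location}


def resolve(uri, fetch):
    """Separate the thing identified from the document that describes it."""
    status, headers = fetch(uri)
    if status == 303:
        # 303: the URI names something else; a description lives elsewhere.
        return {"identifier": uri, "described_at": headers["Location"]}
    if status == 200:
        # 200: the URI directly names the document that was returned.
        return {"identifier": uri, "described_at": uri}
    raise LookupError(f"cannot resolve {uri}: HTTP {status}")


result = resolve("http://purl.uniprot.org/uniprot/P04926", fake_fetch)
```

Because the identifier and the description have different addresses, an automaton has no reason to conclude that the protein is a web page.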
The PURL Solution: Continued
 Can use PURL addresses to identify abstract concepts

 Redirect requests to such addresses to an informative
  web page (for humans) with a means for machines to
  extract other formats

 RDF statements are about proteins, machines can
  reason about proteins, and humans resolve protein
  identifiers to view informative web pages
 RDF/XML link:

    http://www.uniprot.org/uniprot/P04926.rdf
UniProt: Protein Class
UniProt: Annotation Hierarchy
Serendipitous Re-use
 Having a rich repository of protein sequence metadata,
  annotations, and taxonomic classification in a
  distributed, standard format encourages scientific
  collaboration
General UniProt Re-Use Scenario
 User A refers to protein P1 in their dataset
   User A’s dataset doesn’t include statements about P1 (the
    host organism for instance)

 User B comes across this dataset and (in order to find
  out more about protein P1) puts the URI of protein P1
  in their browser and pulls up human-readable
  information about it (including the host organism)
 Automaton C comes across the same dataset, fetches
  the web page, fetches the RDF about P1 and has access
  to the same information as user B and can reason about
  the major taxon the host organism belongs to
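The scenario can be sketched as a merge of two small graphs that share one URI. All identifiers, property names, and taxa below are hypothetical placeholders, and the statements fetched "from UniProt" are simply written out by hand.

```python
P1 = "http://purl.uniprot.org/uniprot/P1"  # placeholder protein URI

# User A's dataset mentions P1 but says nothing else about it.
dataset_a = {
    ("http://example.org/experiment/1", "ex:observed", P1),
}

# Statements Automaton C would obtain by fetching the RDF about P1.
fetched_about_p1 = {
    (P1, "ex:organism", "taxon:species1"),
    ("taxon:species1", "ex:parentTaxon", "taxon:genus1"),
    ("taxon:genus1", "ex:parentTaxon", "taxon:kingdom1"),
}

# Graphs that use the same URIs merge by plain set union.
merged = dataset_a | fetched_about_p1


def objects(graph, subject, predicate):
    """Return all objects of statements with this subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


def lineage(graph, taxon):
    """Walk parent links upward; assumes at most one parent per taxon."""
    chain = [taxon]
    while True:
        parents = objects(graph, chain[-1], "ex:parentTaxon") - set(chain)
        if not parents:
            return chain
        chain.append(sorted(parents)[0])


def major_taxon_of_observed(graph):
    """For each observed protein, find the top of its host's lineage."""
    result = {}
    for (_, predicate, protein) in graph:
        if predicate == "ex:observed":
            for organism in objects(graph, protein, "ex:organism"):
                result[protein] = lineage(graph, organism)[-1]
    return result


answer = major_taxon_of_observed(merged)
```

Neither graph alone answers the question; the shared protein URI is what joins User A's observation to the taxonomy.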