Group Meeting 2016-08-17, Tech
“GPKB: Genomic and Proteomic
Knowledge Base”
by Davide Chicco
davide.chicco@gmail.com
● A data warehouse developed and maintained by my
former colleagues at Politecnico di Milano
● Integration of several data sources:
  ● KEGG (Kyoto Encyclopedia of Genes and Genomes)
  ● OMIM (Online Mendelian Inheritance in Man)
  ● Gene Ontology Annotations (GOA)
  ● Gene Ontology (GO)
  ● Expasy Enzyme
  ● Entrez Gene
  ● Reactome
  ● UniProt
  ● BioCyc
  ● IntAct
Genomic and Proteomic Knowledge Base (GPKB)
(c) Flickr Vitlava: database-integration
● Large amounts of biological data are available
worldwide
● In particular, biomolecular annotations (associations
between genes or gene products and biological function
features) can help scientists understand biology and
the life sciences
● The hierarchical structure of the ontologies behind
these datasets can highlight semantic relationships
between data
Motivation
● Implemented in PostgreSQL
● It can be downloaded or used through a web interface
● Quantitative characteristics of the dataset:
  ~ 20 million genes
  ~ 20 million proteins
  ~ 17 million gene annotations
  ~ 31 million protein annotations
● Some tables are simply imported from data sources
(GO, Reactome, etc.)
● Other tables are inferred from the available datasets
Technical details and quantitative characteristics
● Data tables available:
Technical details and quantitative characteristics
Image from M. Masseroli, et al. "Explorative search of distributed bio-data to answer complex biomedical questions." BMC
Bioinformatics 15.1 (2014): 1.
Green-gray boxes: data tables available in the general data warehouse and publicly
available on the web interface
Gray boxes: data tables available in the general data warehouse (to be made publicly
available in the future)
Two main execution modes:
● Basic search
● Easy search
GPKB
● Basic search retrieves all information directly
associated with a single feature instance, whether
imported from external sources or inferred from the
integrated data
● For example, all annotations and interactions of a
specific gene or protein (e.g. the human insulin-like
growth factor 2 (somatomedin A) (IGF2) gene, Entrez
Gene ID 3481), or all genes and proteins annotated to
a particular biomedical feature instance, such as a
specific pathway or genetic disorder (e.g. Alzheimer
disease, OMIM ID 104300).
Basic search
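A Basic search is essentially a lookup keyed on one feature instance. The following is only a minimal sketch of that idea, using Python's built-in sqlite3 module instead of PostgreSQL so that it runs anywhere. The table names, column names and annotation rows are hypothetical placeholders; the real GPKB schema is not shown in these slides. Only the Entrez Gene ID 3481 (IGF2) comes from the example above.

```python
import sqlite3

# Hypothetical two-table schema (assumption, not the real GPKB schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE gene_annotation (
        gene_id      INTEGER REFERENCES gene(gene_id),
        feature_type TEXT,  -- e.g. pathway, biological function
        feature_name TEXT,
        origin       TEXT   -- 'imported' or 'inferred'
    );
""")
conn.execute("INSERT INTO gene VALUES (3481, 'IGF2')")
# Placeholder annotation rows, for illustration only.
conn.executemany(
    "INSERT INTO gene_annotation VALUES (?, ?, ?, ?)",
    [
        (3481, "pathway", "placeholder pathway P1", "imported"),
        (3481, "biological function", "placeholder function F1", "inferred"),
    ],
)


def basic_search(conn, gene_id):
    """Return every annotation row directly attached to one gene."""
    cur = conn.execute(
        "SELECT feature_type, feature_name, origin "
        "FROM gene_annotation WHERE gene_id = ? "
        "ORDER BY feature_type",
        (gene_id,),
    )
    return cur.fetchall()


rows = basic_search(conn, 3481)
for row in rows:
    print(row)
```

The `origin` column mirrors the distinction made above between imported and inferred information; whether GPKB stores that distinction in this way is an assumption.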
● The authors also implemented an enhanced functionality
and graphical interface for multi-feature search, named
Easy search.
● It supports the simple graphical composition of
complex queries on multiple features just by selecting
the required features in order, e.g. gene, pathway,
enzyme, biological function feature, genetic disorder,
clinical synopsis, etc.; if needed, display and filtering
constraints can be defined for any attribute of each
selected feature just by specifying them in the feature
window.
Easy search
● Query example: relationships between genes, biological
function features and pathologies (e.g. Muscular
dystrophy, Duchenne type).
● Using the Easy search functionality, the user selects,
in order, the gene feature, then the biological function
feature and the genetic disorder feature associated with
the gene, and then the clinical synopsis feature
associated with the genetic disorder; finally, before
submitting the query, a user who wants to investigate
only some related pathologies can specify them as
values of the name attribute in the genetic disorder
feature window.
Easy search
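The feature chain in the example above (gene, then biological function and genetic disorder, then clinical synopsis, with an optional filter on the disorder name) can be read as a chain of joins. The sketch below illustrates that reading with Python's sqlite3 module. All table names, column names, gene labels and synopsis values are hypothetical placeholders; only the disorder name comes from the slide, and this is not how the GPKB interface is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical association tables (assumption, not the real GPKB schema).
conn.executescript("""
    CREATE TABLE gene2function (gene TEXT, bio_function TEXT);
    CREATE TABLE gene2disorder (gene TEXT, disorder TEXT);
    CREATE TABLE disorder2synopsis (disorder TEXT, synopsis TEXT);
""")
DMD_NAME = "Muscular dystrophy, Duchenne type"
# Placeholder rows, for illustration only.
conn.executemany(
    "INSERT INTO gene2function VALUES (?, ?)",
    [("GENE_A", "function F1"), ("GENE_B", "function F2")],
)
conn.executemany(
    "INSERT INTO gene2disorder VALUES (?, ?)",
    [("GENE_A", DMD_NAME), ("GENE_B", "disorder D2")],
)
conn.executemany(
    "INSERT INTO disorder2synopsis VALUES (?, ?)",
    [(DMD_NAME, "synopsis S1"), ("disorder D2", "synopsis S2")],
)


def easy_search(conn, disorder_name=None):
    """Join the selected features; optionally filter on the disorder name."""
    sql = (
        "SELECT gf.gene, gf.bio_function, gd.disorder, ds.synopsis "
        "FROM gene2function gf "
        "JOIN gene2disorder gd ON gd.gene = gf.gene "
        "JOIN disorder2synopsis ds ON ds.disorder = gd.disorder"
    )
    params = ()
    if disorder_name is not None:
        # Corresponds to typing a value for the name attribute
        # in the genetic disorder feature window.
        sql += " WHERE gd.disorder = ?"
        params = (disorder_name,)
    return conn.execute(sql + " ORDER BY gf.gene", params).fetchall()


filtered = easy_search(conn, DMD_NAME)
unfiltered = easy_search(conn)
print(filtered)
```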
Distinct: returns only distinct results
Exact count: computes the exact number of query results;
otherwise the result count is estimated
Conceptual query (C): the query also includes the
conceptually equivalent database items coming
from other data sources
Semantic expansion: when a query is executed with
semantic expansion for a feature, the result contains
not only the items that satisfy the query but also
semantically related, more general items taken from the
feature ontologies
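One way to picture semantic expansion is as a walk up the ontology from each matching term to all of its more general ancestors. The sketch below assumes the ontology is given as a child-to-parents mapping; the term names `t1` to `t4` and both function names are hypothetical, and the slides do not say how GPKB actually implements this step.

```python
def ancestors(term, parents):
    """Collect all more general terms reachable from `term` via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def semantic_expansion(matches, parents):
    """Return the matching terms plus every ancestor of each match."""
    expanded = set(matches)
    for term in matches:
        expanded |= ancestors(term, parents)
    return expanded


# Toy ontology (placeholder terms): child -> list of direct parents.
PARENTS = {"t3": ["t2"], "t2": ["t1"], "t4": ["t1", "t2"]}

expanded = semantic_expansion({"t3"}, PARENTS)
print(sorted(expanded))
```

A term can have several parents, as in Gene Ontology, so the `seen` set keeps shared ancestors from being visited twice.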
Expand query: after obtaining results for an initial
query, expands the query only for the rows that the user
selected in the previous query result
Show all: shows all the query results
Only matching: shows only the query results with
matching values between all the selected features
“Find all the genes that are involved both in breast
cancer and in prostate cancer, and then retrieve all the
proteins that are encoded by one of those genes”
http://www.bioinformatics.deib.polimi.it/GPKB
Demo
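The logic of the demo question can be sketched in two steps: intersect the gene sets of the two disorders, then collect the proteins encoded by the shared genes. The code below is only an illustration on placeholder data. The gene and protein labels are invented, the two dictionaries stand in for GPKB association tables, and the function names are hypothetical.

```python
# Placeholder associations, for illustration only (not real annotations).
DISORDER_GENES = {
    "breast cancer": {"GENE_A", "GENE_B", "GENE_C"},
    "prostate cancer": {"GENE_B", "GENE_C", "GENE_D"},
}
GENE_PROTEINS = {
    "GENE_B": {"PROT_B1", "PROT_B2"},
    "GENE_C": {"PROT_C1"},
    "GENE_D": {"PROT_D1"},
}


def genes_in_all(disorders, disorder_genes):
    """Genes annotated to every one of the given disorders."""
    gene_sets = [disorder_genes.get(d, set()) for d in disorders]
    return set.intersection(*gene_sets) if gene_sets else set()


def encoded_proteins(genes, gene_proteins):
    """Proteins encoded by at least one of the given genes."""
    proteins = set()
    for gene in genes:
        proteins |= gene_proteins.get(gene, set())
    return proteins


shared = genes_in_all(["breast cancer", "prostate cancer"], DISORDER_GENES)
proteins = encoded_proteins(shared, GENE_PROTEINS)
print(sorted(shared), sorted(proteins))
```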
Main advantages of GPKB compared to other systems
(such as BioWarehouse, Biozon, etc.):
1) flexible data schema and software architecture, to
facilitate data import
2) integration of datasets from different sources, which
highlights semantic relationships between data
elements
3) ability to answer multi-domain biomedical
questions
GPKB advantages
M. Masseroli, A. Canakoglu, and S. Ceri. "Integration
and querying of genomic and proteomic semantic
annotations for biomedical knowledge extraction"
IEEE/ACM Transactions on Computational Biology and
Bioinformatics 13.2 (2016): 209-219.
http://www.bioinformatics.deib.polimi.it/GPKB
Citation and web link
