APP
NGS Applications
J. S. Freitas¹, M. P. Caraciolo¹, V. M. Diniz¹, R. B. de Alexandre¹, J. B. Oliveira¹
¹ Genomika Diagnósticos
API-Centric Data Integration for Human Genomics Reference
Databases: Achievements, Lessons Learned and Challenges
MOTIVATION
Data integration is a major challenge in clinical genetics, where multiple
heterogeneous databases span several domains and are published in confusing
formats without clear, common standards. In variant analysis for molecular
diagnostics, one central task is to connect biological information to clinical
data so that specialists can determine the potential impact of a variant
associated with a disease [1, 2].
This task requires the flexible assembly of tailored, continuously curated data
sets, without wasting biologists' and geneticists' time on searching several
databases individually online and then parsing, cleaning and integrating those
data in complex spreadsheets.
We are building a platform that leverages Linked Data to provide integrated
access to bioinformatics databases such as OMIM and ClinVar through a common
and well-defined interface.
Our assumption is that exposing those datasets via Application Programming
Interfaces (APIs) facilitates data access from several sources to a big data
infrastructure, which provides integrated access to information covering
biology, carrier testing, variant analysis and literature mining.
bioinfo@genomika.com.br | genomika.com.br
Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil
OUR COLLABORATION
DATA INFRASTRUCTURE
Lessons Learned
REFERENCES
[1] Anguita, A., et al. "A review of methods and tools for database integration in biomedicine." Current Bioinformatics 5 (2010): 253-269.
[2] Peterson, Thomas A., Emily Doughty, and Maricel G. Kann. "Towards precision medicine: advances in computational approaches for the analysis of human variants." Journal of Molecular Biology 425.21 (2013): 4047-4063.
[3] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[4] Apache Spark. "Lightning-fast cluster computing." (2015): 345-353.
[5] Stockinger, Heinz, et al. "Experience using web services for biological sequence analysis." Briefings in Bioinformatics 9.6 (2008): 493-505.
[Figure panels: DISTRIBUTED AGGREGATION; NEW SOURCE CONSUMPTION]
The number of databases keeps growing, and so does the variability of their
schemata. To tackle this, we designed a global schema that uses meta-modeling
concepts to abstract the data fields and values.
Aggregating the facets that share the same key calls for novel approaches. Good
solutions: NoSQL databases (Cassandra) and large-scale data processing engines
built on MapReduce concepts (Spark) [3, 4].
Loading several databases and their versions requires a replication and
distribution policy for the database engine. Some data engines achieve great
results here by partitioning data with a distributed strategy.
RESTful APIs work well for exposing data: they support several formats (XML,
JSON), and available frameworks work out of the box [5].
Challenges
The underlying datasets can change their schema, so developing fixes in the
source data consumption is intellectually complex.
The number of new versions that can be built is limited: the whole process
requires bandwidth and substantial computing power, so how do we overcome the
limit on the number of fetching jobs running simultaneously?
How should we deal with semantic mappings between datasets or depositories?
What should the single integrated vocabulary be in order to identify possible
relationships?
[Figure labels: sample; genomic position; Sequencing Machine; Annotator]
[Figure: distributed aggregation. Records of the form (rowID, (DataField, facetValue)),
such as (rowA, (DFA, FV1)) and (rowB, (DFB, FV5)), are spread across partitions
and then grouped by row key into one list per row:
(rowA, [(DFA, FV1), (DFB, FV3), (DFC, FV4), (DFD, FV7), (DFE, FV8), (DFF, FV9)])
(rowB, [(DFA, FV2), (DFB, FV5), (DFC, FV6), (DFD, FV10), (DFE, FV11), (DFF, FV12)])]
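The grouping step in the aggregation figure can be sketched in plain Python. This is only an illustration of the group-by-key idea, not the poster's Spark job; the function name `aggregate_facets` and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_facets(records):
    """Group (row_id, (data_field, facet_value)) pairs by row_id.

    Hypothetical helper: it mimics, on a single machine, what a
    distributed group-by-key step would do across partitions.
    """
    grouped = defaultdict(list)
    for row_id, field_and_value in records:
        grouped[row_id].append(field_and_value)
    return dict(grouped)


# Sample records in the (rowID, (DataField, facetValue)) shape used on the poster.
records = [
    ("rowA", ("DFA", "FV1")),
    ("rowB", ("DFA", "FV2")),
    ("rowA", ("DFB", "FV3")),
    ("rowB", ("DFB", "FV5")),
]

aggregated = aggregate_facets(records)
for row_id, facets in aggregated.items():
    print(row_id, facets)
```

In a distributed engine the same logic would run per partition and then merge the partial lists for each key.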
[Figure labels: ClinGen Tool; Patient Data; Genotype Annotator; 150,000,000 variants observed vs. variants we understand (2003, 2007, 2015); sources: ClinVar, dbSNP, UniProt, OMIM, NCBI Gene, 1,000 Genomes, ..., Depository N]
DATA EXPOSURE
Source table:

omim_id   Gene Symbol   ...   Datafield N
100650    ALDH          ...   Facet #1
104760    APP           ...   Facet #n

Global schema:

rowID   DataField     ...   DataFacet
1       Gene_Symbol   ...   ALDH
1       OMIM_ID       ...   100650
2       Gene_Symbol   ...   APP
2       OMIM_ID       ...   104760

Hierarchy: Depository > Version (1.0.0, 2.0.0, ...) > Dataset (Genes, Phenotypes, ..., Dataset N)
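A minimal sketch of how a wide source row could be flattened into the global schema of (rowID, DataField, DataFacet) records. The function name `to_global_schema` and the dictionary input format are assumptions for illustration; the poster does not describe the actual conversion code.

```python
def to_global_schema(rows, start_row_id=1):
    """Flatten wide rows into (row_id, data_field, data_facet) records.

    Hypothetical helper: each input row is a dict mapping a column
    name to its value; every cell becomes one record in the output.
    """
    records = []
    for row_id, row in enumerate(rows, start=start_row_id):
        for data_field, data_facet in row.items():
            records.append((row_id, data_field, data_facet))
    return records


# The two example rows shown in the source table.
omim_rows = [
    {"Gene_Symbol": "ALDH", "OMIM_ID": 100650},
    {"Gene_Symbol": "APP", "OMIM_ID": 104760},
]

global_records = to_global_schema(omim_rows)
for record in global_records:
    print(record)
```

Because every source is reduced to the same three-column shape, a new database with a different set of columns needs no change to the storage layout.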
curl "https://$GENDB_API_KEY@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      ["gene_symbol", "BRCA1"]
    ]
  }'
{
"dataset": "OMIM/3.5.0/Genes",
"dataset_id": 65,
"genome_build": "GRCh37",
"limit": 100,
"total": 111425,
"took": 5,
"results": [ "..." ]
}
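The same query could be built from Python with the standard library. This is a sketch under assumptions: the URL layout is copied from the curl example, the API key is assumed to be sent as an HTTP Basic username with an empty password (which is how curl treats `key@host`), and the function name `build_query` is hypothetical. The request is only constructed here, not sent.

```python
import base64
import json
import urllib.request

API_ROOT = "https://api.gendb.com/v1"  # host taken from the curl example


def build_query(api_key, depository, version, dataset, filters):
    """Build (but do not send) a POST request for one dataset query."""
    url = f"{API_ROOT}/datasets/{depository}/{version}/{dataset}/data"
    # Assumption: the key acts as a Basic-auth username with an empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    body = json.dumps({"filters": filters}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_query(
    "my-key", "OMIM", "3.5.0", "Genes", [["gene_symbol", "BRCA1"]]
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Keeping depository, version and dataset as separate arguments mirrors the Depository > Version > Dataset hierarchy in the URL path.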
The number of human variant resources used in variant analysis keeps
increasing, and the number of reported variants grows faster every year, yet
there is only initial work on understanding all this information and on how to
extract and link those variant sources.
[Figure: data infrastructure. GENDB fetches data from MIM, 1000 Genomes, Entrez Gene, dbSNP, dbNSFP, COSMIC, ClinVar and other sources, and serves it together with sequencer data through an API.]
[Class diagram]

Source
  Abstract class for any data sources that we'll import and process. Each of
  the subclasses will fetch() the data, scrub() it as necessary, then parse()
  it into a database.
  + name
  + output_dir
  - fetch(is_dl_forced=False)
  - parse()
  - prepare_new_dataset(name, version)
  - update_new_version(version_name)
  - check_if_remote_newer(remote, local)
  - get_files(is_dl_forced)
  - fetch_from_url(remote_file, local_file)
  - fetch_from_db(query, conn, limit, is_dl_forced)
  - fetch_from_source(...)

OMIM (extends Source)
  + name: OMIM
  + output_dir: "./raw/omim"
  - _get_omim_ids()
  - _process_all()
  - _process_morbidmap()
  - _process_phenotypicseries()

Source N (extends Source)
  + name: Source N
  + output_dir: "output/dir"
  - local functions()
  - inherited_functions()
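The class diagram could translate into Python roughly as follows. This is a sketch, not the authors' code: only the class and method names come from the diagram, while the method bodies, the timestamp comparison in `check_if_remote_newer`, and the hard-coded placeholder data are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Abstract base for any data source that is imported and processed."""

    def __init__(self, name, output_dir):
        self.name = name
        self.output_dir = output_dir

    @abstractmethod
    def fetch(self, is_dl_forced=False):
        """Retrieve the raw data for this source."""

    @abstractmethod
    def parse(self):
        """Turn the fetched raw data into records."""

    def check_if_remote_newer(self, remote, local):
        # Assumed semantics: compare comparable version stamps or timestamps.
        return local is None or remote > local


class OMIM(Source):
    def __init__(self):
        super().__init__(name="OMIM", output_dir="./raw/omim")
        self.raw = []

    def fetch(self, is_dl_forced=False):
        # Placeholder data instead of a real download (assumption).
        self.raw = ["100650\tALDH", "104760\tAPP"]
        return self.raw

    def parse(self):
        # Split each tab-separated line into an (omim_id, gene_symbol) tuple.
        return [tuple(line.split("\t")) for line in self.raw]


source = OMIM()
source.fetch()
parsed = source.parse()
print(source.name, parsed)
```

Adding a new source then means writing one subclass that overrides fetch() and parse(), while shared logic such as version checks stays in the base class.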
