Index San Francisco @mirocupak
Miro Cupak
Senior Software Engineer, DNAstack
22/02/2018
How we built a global search engine for genetic data

What and why?
• Beacon Network (https://beacon-network.org/)
  • from the Global Alliance for Genomics and Health (GA4GH)
  • largest search and discovery engine of human genomic variation
• case study talk
  • domain background
  • standard, architecture and technologies
  • fun with stats

Background
https://beacon-network.org

Trends
• sequencing cost decreasing exponentially (3M times since 2000)
https://www.nature.com/news/technology-the-1-000-genome-1.14901

Trends
• genomic data volumes increasing exponentially (1M times since 2000)
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Trends
• up to 2 billion human genomes sequenced in the next 10 years (more data annually than uploaded to Twitter and YouTube)
[Figure: Expected Data Volumes by 2025. Data volumes (GB) for Twitter, YouTube and Genomics, with lower and upper bounds.]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Problem
• no single institution will have sufficient resources
• still, institutions don't have enough data
  • common diseases
  • rare diseases
• challenge
  • discovering data
• solution
  • traditional approach of data aggregation in a single centralized site not working
  • federated system capable of executing cross-dataset and cross-institution queries is needed: Beacon Network

Global Alliance for Genomics & Health
• nonprofit standards alliance
• a coalition of over 500 leading institutions working in health care, research, disease advocacy, life science, and information technology
• goal: enable responsible sharing of genomic and clinical data
• since 2013
http://ga4gh.org/
https://www.broadinstitute.org/files/news/pdfs/GAWhitePaperJune3.pdf

Beacon Project
• experiment to test the willingness of international sites to share genetic data in the simplest of all technical contexts
• named after the SETI project (Search for Extra-Terrestrial Intelligence, http://history.nasa.gov/seti.html)
• initiative requiring collaboration of many different GA4GH groups
• started in March 2014, quickly gained traction

Beacon
• simple web service allowing users to query an institution's databases to determine whether they contain a genetic variant of interest
• receives questions of the form "Do you have information about this mutation?"
• responds with yes or no, optionally with additional information about the mutation
• design principles
  • A beacon has to be technically simple.
  • A beacon has to minimize risks associated with genomic data sharing.
  • It has to be possible to make a beacon publicly available.
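
The yes/no contract described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` tuple, the in-memory `KNOWN_VARIANTS` set and the `beacon_has` function are hypothetical names made up for this example, not part of any Beacon specification or of the Beacon Network code base. The two sample variants come from the stats slides, and their assembly is assumed.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical query shape: one allele at one position on one assembly."""
    assembly: str    # e.g. "GRCh37"
    chromosome: str  # e.g. "13"
    position: int    # coordinate convention is up to the individual beacon
    allele: str      # e.g. "C"


# Stand-in for an institution's database (illustrative data only,
# assembly assumed to be GRCh37 for the sake of the example).
KNOWN_VARIANTS = {
    Variant("GRCh37", "13", 32936732, "C"),
    Variant("GRCh37", "7", 140453136, "C"),
}


def beacon_has(query: Variant) -> bool:
    """Answer 'Do you have information about this mutation?' with yes or no."""
    return query in KNOWN_VARIANTS


print(beacon_has(Variant("GRCh37", "13", 32936732, "C")))
print(beacon_has(Variant("GRCh37", "13", 1, "C")))
```

The response carries no record-level data, only a boolean, which is one way to read the "minimize risks" design principle.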

Standard: Before Beacon Network
• no formal specification
• Receives questions of the form "Do you have information about this mutation?". Responds with yes or no.
• 4 public beacons, each with a different API:
  • request method
  • supported parameters
  • parameter names
  • chromosome identifiers
  • positional base
  • assembly notation
  • supported alleles
  • dataset support
  • response format
  • data included in the response

Standard: 0.1
• 2014
• really simple (2 records)
• true/false response
• format: Avro
• not enough traction
• too vague
• issues partially addressed by the Beacon Network

Standard: 0.2
• 2015
• complex (9 records)
• true/false/overlap/null response
• datasets
• simple data use conditions
• self description
• format: Avro
• not well adopted
• not polished enough

Standard: 0.3
• 2016
• simplified 0.2
• based on real needs, successful
• true/false/null response
• improved support for datasets and cross-dataset queries
• modular and extensible
• data versioning
• various improvements to the data model, more metadata, extended response
• tooling
• format: Avro to Proto3

Standard: 0.4
• 2018
• stable and more flexible
• support for complex variants
• improved error handling
• improved data use conditions
• various minor improvements
• developer experience
• format: Proto3 to OpenAPI

Beacon Network

Requirements
• federation of queries across beacons
• integration of publicly available beacons
• aggregation of data from multiple sources
• online distribution of queries without the need to store genomic data
• registry of public beacons
• programmatically accessible
• easily accessible
• unified beacon API
• push for standardization of the standard
• performance
• scalability
• modularity and extensibility
• logging and audit trail
• beacon monitoring
• lower barrier of entry for beacon developers
• development under the umbrella of GA4GH

Architecture

Data
• accessing data stored in a relational database

Service
• communication with other subsystems
• query normalization
• aggregators
• participant resolution
• query distribution
• audit trail
• L1 parallelization
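
Query distribution with a first level of parallelism might look roughly like the Python sketch below. It assumes that "L1 parallelization" means fanning one query out to all participating beacons concurrently; the slides do not spell this out. `distribute`, `aggregate` and the modelling of a beacon as a plain callable are hypothetical choices for this example, and the aggregation rule is an assumption rather than the rule used by the Beacon Network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# A beacon is modelled as a callable taking a query and returning True, False or None.
Beacon = Callable[[dict], Optional[bool]]


def distribute(query: dict, beacons: Dict[str, Beacon],
               max_workers: int = 8) -> Dict[str, Optional[bool]]:
    """Send one query to every beacon concurrently and collect per-beacon answers."""
    def ask(beacon: Beacon) -> Optional[bool]:
        try:
            return beacon(query)
        except Exception:
            # A failing beacon must not break the whole search; report "no answer".
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(ask, beacon) for name, beacon in beacons.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(responses: Dict[str, Optional[bool]]) -> Optional[bool]:
    """True if any beacon said yes, False if some answered but none said yes,
    None if no beacon answered at all."""
    answers = [r for r in responses.values() if r is not None]
    if not answers:
        return None
    return any(answers)
```

Catching exceptions per beacon keeps one slow or broken participant from failing the whole federated search, which matters when the participants are operated by independent institutions.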

Processor
• executing a query against a beacon and processing its response
• management of a flexible, dynamic and easily extensible query execution pipeline
• pipeline stages resolution (CDI and EJB)
• L2 parallelization
• cross-assembly query handling
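
The idea of a pipeline whose stages are resolved per beacon can be illustrated with a small Python analogy. The slide states that the real system resolves stages with CDI and EJB; the dictionary lookup below only mimics that idea and is not how the Beacon Network is implemented. `STAGE_ORDER`, `resolve_stages` and `run_pipeline` are hypothetical names, and the toy stages only record their own name so that the execution order is visible.

```python
from typing import Any, Callable, Dict, List

Stage = Callable[[Any], Any]

# Order of the stages as listed in the talk.
STAGE_ORDER = ["converter", "requester", "fetcher", "parser"]


def resolve_stages(defaults: Dict[str, Stage], overrides: Dict[str, Stage]) -> List[Stage]:
    """Pick the beacon-specific implementation of a stage if one exists, else the default."""
    return [overrides.get(name, defaults[name]) for name in STAGE_ORDER]


def run_pipeline(stages: List[Stage], query: Any) -> Any:
    """Pass the query through the stages in order; each output feeds the next stage."""
    result = query
    for stage in stages:
        result = stage(result)
    return result


# Toy stages that only append their name to a trace list.
defaults = {name: (lambda trace, name=name: trace + [name]) for name in STAGE_ORDER}
# A beacon with a non-standard response format overrides only the parser.
overrides = {"parser": lambda trace: trace + ["custom-parser"]}

trace = run_pipeline(resolve_stages(defaults, overrides), [])
print(trace)  # ['converter', 'requester', 'fetcher', 'custom-parser']
```

With this shape, supporting a new beacon means supplying only the stages that differ from the defaults.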

Converter
• first stage in the query execution pipeline
• translating query parameters
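
Parameter translation can be sketched as follows, using the differences listed on the "Before Beacon Network" slide: parameter names, chromosome identifiers, positional base and assembly notation. The `BeaconDialect` class, its fields and the `convert` function are hypothetical, and the canonical query is assumed to use 0-based positions and bare chromosome names. The mapping of GRCh37 to hg19 in the usage example is illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class BeaconDialect:
    """Hypothetical description of how one beacon expects its parameters."""
    param_names: Dict[str, str] = field(default_factory=dict)     # canonical -> beacon-specific
    chromosome_prefix: str = ""                                   # e.g. "chr" for "chr13"
    zero_based: bool = True                                       # positional base of the beacon
    assembly_names: Dict[str, str] = field(default_factory=dict)  # canonical -> beacon notation


def convert(query: Dict[str, object], dialect: BeaconDialect) -> Dict[str, str]:
    """Translate a canonical query into the parameters one specific beacon expects."""
    position = int(query["position"])
    if not dialect.zero_based:
        position += 1  # canonical queries are assumed to be 0-based in this sketch
    assembly = str(query["assembly"])
    values = {
        "chromosome": dialect.chromosome_prefix + str(query["chromosome"]),
        "position": str(position),
        "allele": str(query["allele"]),
        "assembly": dialect.assembly_names.get(assembly, assembly),
    }
    # Rename the keys that this beacon calls something else.
    return {dialect.param_names.get(key, key): value for key, value in values.items()}


dialect = BeaconDialect(
    param_names={"chromosome": "chrom", "position": "pos", "assembly": "ref"},
    chromosome_prefix="chr",
    zero_based=False,
    assembly_names={"GRCh37": "hg19"},
)
converted = convert(
    {"chromosome": "13", "position": 32936731, "allele": "C", "assembly": "GRCh37"},
    dialect,
)
print(converted)
```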

Requester
• second stage in the query execution pipeline
• constructing beacon requests based on their URIs and parameters produced by the converters
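
For a beacon queried over HTTP GET, constructing the request from a base URI and converted parameters could look like the sketch below. `build_request_url` and the `beacon.example.org` address are hypothetical. Since beacons also differed in request method, a real requester would presumably need to cover other methods as well; that is not shown here.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request_url(base_uri: str, params: dict) -> str:
    """Attach query parameters to a beacon's base URI for a GET-style request."""
    parts = urlsplit(base_uri)
    # Sorting makes the resulting URL deterministic, which helps with logging and caching.
    query = urlencode(sorted(params.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = build_request_url(
    "https://beacon.example.org/query",
    {"chrom": "13", "pos": "32936732", "allele": "C"},
)
print(url)  # https://beacon.example.org/query?allele=C&chrom=13&pos=32936732
```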

Fetcher
• third stage in the query execution pipeline
• unit actually talking to the API of beacons
• submitting requests over the network and obtaining the raw response

Parser
• last stage in the pipeline
• extracting information of interest from the raw response obtained by a fetcher
• dealing with various formats
• handling metadata, multiple responses, errors
• response normalization
• parallelized
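
Response normalization across heterogeneous formats can be sketched as below, mapping whatever a beacon returns onto the true/false/null answers used by the later versions of the standard. The function names, the accepted spellings of yes and no, and the default field name `exists` are assumptions for this example; only JSON and plain-text bodies are handled.

```python
import json
from typing import Optional

_TRUE = {"true", "yes", "1"}
_FALSE = {"false", "no", "0"}


def normalize(value: object) -> Optional[bool]:
    """Map one beacon-specific answer onto True, False or None (unknown)."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    return None


def parse_response(raw: str, field: str = "exists") -> Optional[bool]:
    """Extract the answer from a raw response body that is either JSON or plain text."""
    try:
        body = json.loads(raw)
    except ValueError:
        return normalize(raw)  # not JSON: treat the whole body as the answer
    if isinstance(body, dict):
        return normalize(body.get(field))
    return normalize(body)


print(parse_response('{"exists": true}'))
print(parse_response("YES"))
print(parse_response('{"error": "bad request"}'))
```

Anything that cannot be interpreted, including error payloads, maps to `None` rather than `False`, so that a failed lookup is not reported as the absence of a variant.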

Mapper
• translation between different representations of objects

REST
• handling client requests
• data serialization

Search execution

Stats

Size
• ~100 installations, 40 institutions, 18 countries, 6 continents

Users
• 13k users, 136 countries

Searches

Assemblies
• GRCh37: 83%
• GRCh38: 6%
• Others: 11%

Chromosomes
• Chr. 2: 18%
• Chr. 17: 14%
• Chr. 1: 11%
• Chr. 13: 11%
• Chr. 7: 7%
• Others: 39%

Variants
• 84k distinct variants
• breakdown (chromosome : position allele (gene)):
  • 2 : 38938 C (FAM110C): 6%
  • 13 : 32936732 C (BRCA2): 6%
  • 22 : 46546565 A (PPARA): 3%
  • 2 : 45895 G (FAM110C): 3%
  • 7 : 140453136 C (BRAF): 2%
  • 1 : 43815163 C (MPL): 2%
  • 1 : 115258747 A (NRAS): 1%
  • 14 : 23894969 A (MYH7): 1%
  • 2 : 29432776 C (ALK): 1%
  • 2 : 212289100 C (ERBB4): 1%
  • Others: 74%

Deleteriousness
[Figure: two histograms of number of variants (log scale) by score, 0.00 to 0.99]
• SIFT (Sorting Intolerant From Tolerant): 69% damaging, 31% tolerated
• PolyPhen-2 HDIV (Polymorphism Phenotyping v2): 55% probably damaging, 22% possibly damaging, 23% benign

Rarity
• 25% rare variants (in 1,000 Genomes Project, August 2015 release)
[Figure: histogram of number of variants (log scale) by allele frequency, 0.00 to 1.00]

Genes
    Symbol   Name                                              Share
 1  FAM110C  Family With Sequence Similarity 110 Member C      11%
 2  BRCA1    BRCA1, DNA Repair Associated                      10%
 3  BRCA2    BRCA2, DNA Repair Associated                      9%
 4  PPARA    Peroxisome Proliferator Activated Receptor Alpha  4%
 5  ERBB4    Erb-B2 Receptor Tyrosine Kinase 4                 3%
 6  BRAF     B-Raf Proto-Oncogene, Serine/Threonine Kinase     3%
 7  MPL      MPL Proto-Oncogene, Thrombopoietin Receptor       2%
 8  MYH7     Myosin Heavy Chain 7                              2%
 9  KIT      KIT Proto-Oncogene Receptor Tyrosine Kinase       1%
10  RET      Ret Proto-Oncogene                                1%
    Others                                                     53%

Disorders & clinical abnormalities
    OMIM                                      HPO
 1  Pancreatic cancer, susceptibility to, 4   Autosomal dominant inheritance
 2  Breast-ovarian cancer, familial, 1        Autosomal recessive inheritance
 3  Fanconi anemia, complementation group D1  Scoliosis
 4  Prostate cancer                           Short stature
 5  Pancreatic cancer 2                       Cognitive impairment
 6  Medulloblastoma                           Constipation
 7  Glioblastoma 3                            Somatic mutation
 8  Breast-ovarian cancer, familial, 2        Cafe-au-lait spot
 9  Breast cancer, male, susceptibility to    Failure to thrive
10  Wilms tumor                               Nausea and vomiting

Getting involved
• Contribute on GitHub
  • https://github.com/ga4gh/beacon-team/
• Google Summer of Code
  • https://summerofcode.withgoogle.com/organizations/5727014175113216/
• DNAstack
  • https://dnastack.com/#/team/careers

Questions?
https://mirocupak.com

