Accelerate Pharmaceutical R&D with MongoDB

Mongo Boston 2013
Accelerate Pharmaceutical R&D with Big Data and MongoDB

Jason Tetrault
Architect - AstraZeneca

AstraZeneca at a glance
We are a global, innovation-led biopharmaceutical company with a mission to make a meaningful difference to patient health through great medicines and a belief that health connects us all.

Global
Constantly anticipating and adapting to the needs of a changing world.
• 57,000 people
• Sales in 100 countries
• Manufacturing in 16
• R&D across 3 continents
• $4 bn invested in R&D
• $33 bn sales in 2011

Targeted
Driving continued innovation where we can make the most difference.
• Cancer
• Cardiovascular
• Gastrointestinal
• Infection
• Neuroscience
• Respiratory & inflammation

Collaborative
Connecting with others to achieve common goals in improving healthcare.
• HCPs
• Patients
• Payers
• Regulators
• Partners
• Local communities

Committed to driving business success responsibly

Architect: R&D Information
What does this mean?

• Support the Researchers
• AstraZeneca has multiple iMeds that are focused on different areas of R&D
• Specifically, I work with the Oncology and Infection iMeds here in Waltham
• Support different software and system builds and/or purchases
• Looking to apply new technologies to enable Researchers

• Core Focus:
• Next Generation Sequencing Scaling
• IaaS
• Big Data Pilots and Exploration

Introduction of Disruptive Technology:
Step 1: Introduce Concepts

• What
• Unstructured Data
• NoSQL
• Categories (Document, Key Value, Graph)
• Hadoop
• Map Reduce
• Horizontal Scalability
• Cloud (IaaS and SaaS)

• How
• Lunch and Learns
• Examples (Craigslist uses this)
• “Big Cookies for Big Data”
• Demonstrations

Step 2: Pilots

• Goals:
• We needed to show what “Unstructured Data” actually means.
• We needed to prove what these technologies can and cannot do for us.
• Find something difficult and make it easy!
• We needed to find the best way to enable researchers.

Iterative Agile Analytics
How quickly can I make indirect associations between gene sequence features and structural fingerprints?
Now scale up to 4M compounds, 20K assays … and more decoration – 5 to 50 TB

Data sources
• Compound with Fingerprints
• Gene sequence
• Target mappings
• Assay results

[Diagram: data moves through four stages: Gather, Aggregate, Analyze, Decorate. Step labels: JSON, Pivot, Map Reduce, Matrix. Dataset labels: Compound, AssayResults, GeneCatalog, Fingerprint with compounds, Tanimoto matrix, Gene matrix, Target mappings. Size annotations: (300K Compounds) – 200 GB; (1.4M fingerprints) – 1 GB; (500M pairs) – 81 GB.]

• Easily convert to JSON and import an initial cut of data from different sources (e.g. spreadsheets, RDBMS, …)
• Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
• Ability to decorate data, adding fields and additional datastores quickly
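
The Tanimoto matrix named in the pipeline holds pairwise similarity scores between compound fingerprints. A minimal sketch of a single matrix entry, assuming each fingerprint is stored as an array of the indices of its set bits (the slides do not show the stored representation, and `tanimoto` is a hypothetical helper name):

```javascript
// Tanimoto (Jaccard) similarity between two fingerprints.
// Assumption: a fingerprint is an array of the indices of its "on" bits.
function tanimoto(bitsA, bitsB) {
  const a = new Set(bitsA);
  const b = new Set(bitsB);
  if (a.size === 0 && b.size === 0) {
    return 0; // convention chosen here for two empty fingerprints
  }
  let shared = 0;
  for (const bit of a) {
    if (b.has(bit)) shared += 1;
  }
  // |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|
  return shared / (a.size + b.size - shared);
}

// Two made-up fingerprints that share bits 3 and 7.
const similarity = tanimoto([1, 3, 7, 9], [3, 7, 12]);
console.log(similarity); // 0.4
```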

Pilot Findings

• Tech Findings:
• GSON can help with weird character conversions.
• Per-node write limits (500 per second), but you can save a bunch of documents at once (change to bulk insert).
• Users think that even though they could do it relationally, this was way quicker.
• Using arrays for multiple results in a doc can be interesting.
• JSON and JavaScript are fairly natural to technical researchers (Python).
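
The bulk-insert finding above can be sketched as follows. This is a minimal illustration, not the pilot's code: `chunk` and `saveInBatches` are hypothetical helpers, the batch size of 1000 is arbitrary, the `compoundId` field is made up, and `insertBatch` stands in for whatever bulk-write call the driver in use provides.

```javascript
// Split an array of documents into batches of at most `size` items.
function chunk(docs, size) {
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Send each batch in one call instead of one call per document.
// `insertBatch` is an assumed callback wrapping the driver's bulk insert.
function saveInBatches(docs, size, insertBatch) {
  let calls = 0;
  for (const batch of chunk(docs, size)) {
    insertBatch(batch);
    calls += 1;
  }
  return calls;
}

// Demo with a stand-in writer that only counts documents.
const docs = Array.from({ length: 2500 }, (_, i) => ({ compoundId: i }));
let stored = 0;
const calls = saveInBatches(docs, 1000, (batch) => {
  stored += batch.length;
});
console.log(calls, stored); // 3 2500
```

With a per-node limit on write operations, the number of calls matters more than the number of documents, which is why batching helps.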

• We are not alone…
• Davy Suvee
• tranSMART
• Seven Bridges
• …

Next Generation Sequencing:
Driving Question:

Can we predict which drug is most effective against specific tumors?

How many other cancer types that I have processed have the same variation as the cancer type I am working on?

Fairly Inaccurate Overview of Genetics Processing
A 2-Minute Oversimplification of a Really Hard Problem

Processing
Sequencing

Processing
Alignment
HG19

Processing
Downstream Processing (Variant)
HG19

Can I Process 88 Whole Human Genomes?
Researcher: I would like to process 88 public genomic samples from cancer patients. They are whole human genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.

Tech:
• 200 GB raw uncompressed fastq per experiment
• 176 genome pipelines to process
• Each “pipeline” runs on an m1.xlarge
• We ran 4 runs of ~3.5 days on 50 nodes
• Total processed data in the pipeline may be 5X per experiment
• Could expand to 10X or more for more complex pipelines
• ~86 GB result average to save
• Stored in S3 / Glacier
• Totals:
• ~171 TB Total Processed Storage
• ~14,784 hours of processing
• ~15 TB of results
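
The totals follow from the per-pipeline figures. A quick check of the arithmetic, assuming 1 TB = 1024 GB and 3.5 days (84 hours) of compute per pipeline, which is how the slide's numbers appear to have been derived:

```javascript
// Figures taken from the slide.
const patients = 88;
const genomesPerPatient = 2;      // one tumor and one normal sequence
const rawGbPerExperiment = 200;   // raw uncompressed fastq
const processedMultiplier = 5;    // "5X per experiment"
const resultGbPerPipeline = 86;   // average result size
const hoursPerPipeline = 3.5 * 24;

const GB_PER_TB = 1024; // assumption: binary terabytes

const pipelines = patients * genomesPerPatient;
const processedTb =
  (pipelines * rawGbPerExperiment * processedMultiplier) / GB_PER_TB;
const resultsTb = (pipelines * resultGbPerPipeline) / GB_PER_TB;
const computeHours = pipelines * hoursPerPipeline;

console.log(pipelines);              // 176
console.log(processedTb.toFixed(1)); // 171.9
console.log(resultsTb.toFixed(1));   // 14.8
console.log(computeHours);           // 14784
```

Four runs on 50 nodes give 200 pipeline slots, enough for the 176 pipelines at one pipeline per node.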

Elastic HPC Infrastructure

[Diagram labels: Scripts, programs, reference; Shared Storage; Compute; Amazon; StarCluster; Elastic Node Expansion; Local Storage; Processing; Result offload to S3; Transition to Glacier]

A Possible Vision for Experiment Management
NGS Data
• Explants
• Tumors – FFPE
• Tumors – fresh frozen
• Cell lines

• Patient stratification
• Biomarkers for prognosis, drug response, safety

Expression
• RNASeq

Variants
• Amplicon
• DNASeq
• Whole exome
• Whole genome
• Coding and non-coding variants
• Coding variants

• Mechanism of drug action
• Mechanism of disease
• New Target ID

[Diagram labels: Inbound; Seven Bridges; GenePattern; Storage; Partners; Big Data Store; Experiment Management / Metadata Management; Services; Genome Upload / Curation; Pipeline Engines; Long Term Storage; Partner Integration; Big Data Storage and Analytics]

Let's look at a Variant …
Another Area Mongo May Help


VCF Format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G   A      29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T   A      3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A   G,T    67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T   .      47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC G,GTCT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
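
The structured `##` meta lines in this header are comma-separated key=value lists in which quoted values may themselves contain commas. A minimal sketch of splitting one such line into an object follows. `parseMetaLine` is a hypothetical helper, not the converter used for this talk, and unlike the JSON shown later in this deck it strips the surrounding quotes from quoted values.

```javascript
// Parse one VCF meta line such as
//   ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
// into { key: "INFO", value: { ID: "DP", ... } }.
// Simple lines such as ##fileformat=VCFv4.1 yield a string value.
function parseMetaLine(line) {
  const match = /^##([^=]+)=(.*)$/.exec(line);
  if (!match) {
    throw new Error("Not a VCF meta line: " + line);
  }
  const key = match[1];
  const raw = match[2];
  if (!(raw.startsWith("<") && raw.endsWith(">"))) {
    return { key, value: raw };
  }
  const fields = {};
  // A value is either a double-quoted string (which may contain commas)
  // or a run of characters up to the next comma.
  const pair = /([^=,]+)=("[^"]*"|[^,]*)/g;
  for (const m of raw.slice(1, -1).matchAll(pair)) {
    const value = m[2];
    const quoted =
      value.length >= 2 && value.startsWith('"') && value.endsWith('"');
    fields[m[1]] = quoted ? value.slice(1, -1) : value;
  }
  return { key, value: fields };
}

const parsed = parseMetaLine(
  '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
);
console.log(parsed.value.Description); // dbSNP membership, build 129
```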


VCF as JSON
Header and Variant Information
// Header document
{
    "_id" : ObjectId("52617b613004b77f64efed67"),
    "phasing" : "partial",
    "fileformat" : "VCFv4.1",
    "fileDate" : "20090805",
    "source" : "myImputationProgramV3.1",
    "FORMAT" : {
        "Description" : "\"Haplotype Quality\"",
        "Type" : "Integer",
        "Number" : "2",
        "ID" : "HQ"
    },
    "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
    "contig" : {
        "species" : "\"Homo sapiens\"",
        "assembly" : "B36",
        "md5" : "f126cdf8a6e0c7f379d618ff66beb2da",
        "length" : "62435964",
        "ID" : "20",
        "taxonomy" : "x"
    },
    "INFO" : {
        "Description" : "\"HapMap2 membership\"",
        "Type" : "Flag",
        "Number" : "0",
        "ID" : "H2"
    },
    "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta",
    "FILTER" : {
        "Description" : "\"Less than 50% of samples have data\"",
        "ID" : "s50"
    }
}

// Variant document
{
    "_id" : ObjectId("52617b613004b77f64efed62"),
    "ALT" : [
        "A"
    ],
    "QUAL" : "29",
    "NA00001" : "0|0:48:1:51,51",
    "POS" : 14370,
    "NA00002" : "1|0:48:8:51,51",
    "FILTER" : "PASS",
    "CHROM" : "20",
    "NA00003" : "1/1:43:5:.,.",
    "FORMAT" : "GT:GQ:DP:HQ",
    "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
    "ID" : "rs6054257",
    "INFO" : {
        "DP" : "14",
        "AF" : "0.5",
        "NS" : "3"
    },
    "REF" : "G"
}
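
A minimal sketch of how one VCF data row might be turned into a variant document of this shape. `vcfLineToDocument` is a hypothetical helper, not the converter from the talk's repository. It assumes tab-separated columns and column names with the leading `#` already stripped, keeps sample columns as raw strings, and omits valueless INFO flags because the variant document on this slide carries none.

```javascript
// Convert one VCF data row into a plain object.
// `columns` is the list of names from the #CHROM header row
// (assumed here to have the leading "#" removed).
function vcfLineToDocument(columns, line) {
  const values = line.split("\t");
  const doc = {};
  columns.forEach((name, i) => {
    doc[name] = values[i];
  });
  doc.POS = Number(doc.POS);    // numeric so that range queries work
  doc.ALT = doc.ALT.split(","); // one entry per alternate allele
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const [key, value] = entry.split("=");
    if (value !== undefined) {
      info[key] = value;        // flags such as DB or H2 are skipped
    }
  }
  doc.INFO = info;
  return doc;
}

const columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                 "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"];
const row = ["20", "14370", "rs6054257", "G", "A", "29", "PASS",
             "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
             "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."].join("\t");
const variantDoc = vcfLineToDocument(columns, row);
console.log(JSON.stringify(variantDoc, null, 2));
```

Storing `POS` as a number rather than a string is what allows `$gte` and `$lt` comparisons to behave as numeric ranges.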

Query
Search Variant Ranges
// Here is our range definition
var begin = 10000;
var end = 10200;
// The chromosome value is fuzzy in format, so we use a regex
var chromosome = ".*17$";
var variant = "A";

// Query for range and chromosome position.
db.publicvariants.find({
    "POS": {$gte: begin, $lt: end},
    "CHROM": {$regex: chromosome}
})
db.variants.find({
    "POS": {$gte: begin, $lt: end},
    "CHROM": {$regex: chromosome}
})

// Query for a specific variant in a range
db.publicvariants.find({
    "POS": {$gte: begin, $lt: end},
    "CHROM": {$regex: chromosome},
    "ALT": variant
})
db.variants.find({
    "POS": {$gte: begin, $lt: end},
    "CHROM": {$regex: chromosome},
    "ALT": variant
})

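The two query shapes on the Query slide differ only in the optional `ALT` clause, so the filter document can be built by one function. A minimal sketch, with `buildVariantQuery` as a hypothetical helper name. It only constructs the filter object, which would then be passed to `find` on the relevant collection.

```javascript
// Build a MongoDB filter for variants on a chromosome within [begin, end).
// `chromosomePattern` is a regex string such as ".*17$".
// `alt` is optional; when given, only variants with that ALT allele match.
function buildVariantQuery(chromosomePattern, begin, end, alt) {
  const query = {
    POS: { $gte: begin, $lt: end },
    CHROM: { $regex: chromosomePattern },
  };
  if (alt !== undefined) {
    // Equality on an array field matches documents whose array
    // contains this value.
    query.ALT = alt;
  }
  return query;
}

const rangeOnly = buildVariantQuery(".*17$", 10000, 10200);
const withAllele = buildVariantQuery(".*17$", 10000, 10200, "A");
console.log(JSON.stringify(withAllele));
// In the mongo shell: db.publicvariants.find(withAllele)
```
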

Wrap Up and Panel
• Panel
• Deniz Kural: Founder and CEO – Seven Bridges

• Code:
• https://github.com/jjtetrault/bio-mongo

• Thanks
• Todd Nelson, Rajan Desai
• Sebastien Lefebvre, Robin Brouwer
• Sara Dempster

Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and
remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or
disclosure of the contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 2 Kingdom Street, London, W2
6BD, UK, T: +44(0)20 7604 8000, F: +44 (0)20 7604 8151, www.astrazeneca.com
