Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2

Alternative Approaches to Managing and Integrating
Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division

* Bioinformatics and Relational Database Management Systems (RDBMS)
* Use Cases – Text Mining and Atherosclerosis
* Bioinformatics and NoSQL Databases
* How to Choose a Database for Your Project
* Closing Comments

Relational Database – “a database that [explicitly] stores information about both the data and how it is related.”
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)

Volume of data
Variety of data
Integration of data

* Pragmatic
* Widely applicable
* Many options
* Modeling
* Reduce risk of data anomalies
* Separate logical and physical models

The key,
The whole key, and
Nothing but the key.
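
The mnemonic above summarizes normalization: each non-key attribute should depend on the key, the whole key, and nothing but the key. A minimal sketch of what that buys, using Python's built-in sqlite3 module. The table names, column names, and values are hypothetical and are not the schema used in the project described in this talk.

```python
import sqlite3

# Hypothetical two-table schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        subject_id TEXT PRIMARY KEY,
        age        INTEGER,
        sex        TEXT
    );
    CREATE TABLE sample (
        sample_id  INTEGER PRIMARY KEY,
        subject_id TEXT NOT NULL REFERENCES subject(subject_id),
        type       TEXT,
        aha_score  INTEGER
    );
""")
conn.execute("INSERT INTO subject VALUES (?, ?, ?)", ("S1", 26, "M"))
conn.executemany(
    "INSERT INTO sample (subject_id, type, aha_score) VALUES (?, ?, ?)",
    [("S1", "Thoracic Aorta", 1), ("S1", "Abdominal Aorta", 2)],
)

# Age is stored once, so a correction cannot leave inconsistent copies behind.
conn.execute("UPDATE subject SET age = 27 WHERE subject_id = 'S1'")

# A join reassembles the subject attributes with each sample row.
rows = conn.execute("""
    SELECT s.subject_id, s.age, x.type, x.aha_score
    FROM subject AS s
    JOIN sample AS x ON x.subject_id = s.subject_id
    ORDER BY x.sample_id
""").fetchall()
```

Both sample rows pick up the corrected age through the join, which is the kind of update anomaly a normalized model is meant to prevent.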

Data modeler vs. developer
* Implementation bottlenecks
* Scaling up vs. scaling out
* Frequent need for denormalization

Text Mining
* Storing text
* Caching word vectors
* Extracted features
* Experiment results
Atherosclerosis Research
* Demographics
* Sample tracking
* Genomic data
* Sequence variants
* Mass spec results

Early 1950s: Korean War autopsies
1985-1998: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
2012-2016: Genomic and Proteomic Architecture of Atherosclerosis (GPAA)

“… tell your
children not to do
what I have done …”
House of the Rising Sun
American Folk Song

Started with MySQL
Could have stayed with the relational model, but:
* Requirements change
* New data sets
* Unknown data structures
* Increasingly complex normalized model

Scalability
Cost
Availability
Consistency
Flexibility

* Key Value Databases
* Document Databases
* Wide Column Stores
* Graph Databases
* Search Engines

Key Value Databases
Features
* Simple primitive data structure
* No predefined schema
* Limited query capabilities
* Dictionary-like functionality at large scale
Structure: key1 -> value1, key2 -> value2, key3 -> value3
Bioinformatics Use Case
* Word vectors in text mining
* Caching
Limitations
* Key lookup only, no generalized query
* Small number of attributes per entity

>>> import redis
>>> # decode_responses=True returns str instead of bytes under Python 3
>>> r_server = redis.Redis("localhost", decode_responses=True)
>>> r_server.set("sample:123:type", "Aorta")
True
>>> r_server.get("sample:123:type")
'Aorta'
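
The word-vector caching use case can be sketched along the same lines. The example below is an illustration under stated assumptions: `KeyValueCache` is a hypothetical in-memory stand-in that mimics only the set/get calls, so the sketch runs without a server, and the `doc:<id>:wordvec` key naming is an assumed convention, not one taken from the talk.

```python
import json


class KeyValueCache:
    """Hypothetical in-memory stand-in for a key-value store client."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value
        return True

    def get(self, key):
        # Key lookup only: there is no way to query by value.
        return self._store.get(key)


def cache_word_vector(cache, doc_id, counts):
    # Values are opaque to the store, so the vector is serialized to a string.
    cache.set(f"doc:{doc_id}:wordvec", json.dumps(counts, sort_keys=True))


def load_word_vector(cache, doc_id):
    raw = cache.get(f"doc:{doc_id}:wordvec")
    return json.loads(raw) if raw is not None else None


cache = KeyValueCache()
cache_word_vector(cache, 42, {"aorta": 3, "lesion": 1})
vector = load_word_vector(cache, 42)    # cached vector for document 42
missing = load_word_vector(cache, 99)   # nothing cached for document 99
```

Because the store sees each value as an opaque string, finding every document that mentions a given term would mean scanning all keys, which is the "key lookup only" limitation in practice.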

Document Databases
Features
* JSON/XML structures
* Fields vary between documents
* No predefined schema
* Documents analogous to rows
* Collections analogous to tables
* Query capabilities
Bioinformatics Use Case
* Text mining
* Atherosclerosis
Limitations
* No joins
* No referential integrity checks
* Object-based query language
General document structure:
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}

{
  subject_id: "F8273",
  age: "26",
  sex: "M",
  date_of_death: "12-Jan-1995",
  glycohemoglobin: "10%",
  BMI: 22,
  samples: [ {type: "Thoracic Aorta", AHA_score: 1},
             {type: "Abdominal Aorta", AHA_score: 2},
             {type: "LAD", AHA_score: 5} ],
  sequence: {seq_file: "F8273_08152014.bam",
             variant_file: "F8273_08152014.vcf"}
}
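
A document like the one above can be queried on its embedded fields even when other documents in the collection carry different attributes. The following is a minimal sketch in plain Python, with a list of dicts standing in for a collection. The subject IDs, values, and the helper `subjects_with_score_at_least` are hypothetical, and a real document database would run such a filter on the server through its own query language.

```python
# Hypothetical collection: the documents do not all share the same fields.
subjects = [
    {"subject_id": "A1", "age": 26,
     "samples": [{"type": "Thoracic Aorta", "AHA_score": 1},
                 {"type": "LAD", "AHA_score": 5}]},
    {"subject_id": "B2", "age": 31, "BMI": 22,
     "samples": [{"type": "Abdominal Aorta", "AHA_score": 2}]},
    {"subject_id": "C3", "age": 19},  # no samples recorded yet
]


def subjects_with_score_at_least(docs, threshold):
    """Return IDs of documents with any embedded sample scoring >= threshold."""
    return [
        d["subject_id"]
        for d in docs
        if any(s.get("AHA_score", 0) >= threshold for s in d.get("samples", []))
    ]


high = subjects_with_score_at_least(subjects, 2)
```

Missing fields are handled with defaults rather than schema changes, which is the flexibility the document model trades for joins and referential integrity checks.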

Wide Column Stores
Features
* Groups attributes into column families
* Column families store key-value pairs
* Implemented as sparse multi-dimensional arrays
* Denormalized
* 10^4 to 10^6 columns; 10^9 rows
Bioinformatics Use Case
* Large studies
* Many experiments & data types
* Simulations
Limitations
* Operationally challenging
* Suitable for large number of servers
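
The sparse, column-family layout can be pictured as nested dictionaries: row key, then column family, then column name. This is a toy sketch of the data model only. `WideColumnTable`, the row keys, and the values are hypothetical, and the sketch says nothing about how a production wide column store distributes data across servers.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy sparse table: row key -> column family -> column name -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Only cells that are actually written take up space.
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        if row_key not in self._rows:
            return {}
        return dict(self._rows[row_key].get(family, {}))


table = WideColumnTable()
table.put("sample:1", "variants", "chr10:90973326", "A>G")
table.put("sample:1", "mass_spec", "peptide_17", 0.82)
table.put("sample:2", "variants", "chr1:12345", "C>T")

s1_variants = table.get_family("sample:1", "variants")
s2_mass_spec = table.get_family("sample:2", "mass_spec")  # never written
```

Each row can have a different set of columns within a family, so a table with a very large number of possible columns stays small when most cells are empty.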

Graph Databases
Features
* Highly normalized
* Graph-based query language (Gremlin)
* SQL-inspired query language (Cypher)
* Support for path finding and recursion
Bioinformatics Use Case
* Epidemiology simulations
* Interaction networks
Limitations
* Less suited for tabular data
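
Path finding over an interaction network is the kind of query a graph database supports natively. As an illustration only, here is a breadth-first search over an adjacency list in plain Python. The node names and the `shortest_path` helper are hypothetical; in a graph database the same question would be asked in a query language such as Gremlin or Cypher.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None


# Hypothetical undirected interaction network as an adjacency list.
interactions = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2", "P5"],
    "P5": ["P4"],
    "P6": [],
}

path = shortest_path(interactions, "P1", "P5")
unreachable = shortest_path(interactions, "P1", "P6")
```

Expressing the same traversal in SQL typically needs recursive queries or repeated self-joins, which is why connected data is listed as a reason to consider a graph database.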

Relational:
* Requirements known at start of project
* Entities described by common attributes
* Compliance and audit issues
* Need normalization
* Acceptable performance on small number of servers
* Need server-side joins


Key value:
* Caching
* Few attributes
Document databases:
* Varying attributes
* Integrate diverse data types
* Use denormalized data

Wide column data stores:
* Extremely large volumes of data
* High availability
Graph databases:
* Connected data
* Need path finding and recursive queries

Multiple types of databases
NoSQL complements relational models
Research question drives selection
Balance benefits and limitations
May use multiple types of databases in a single project
NoSQL databases are improving rapidly, gaining additional functionality


* Dr. Rebecca Wattam, Advisor
* Becky Will, GPAA VT PI
* Chengdong Zhang, DBA & SE
* Cyberinfrastructure Division
* GPAA Collaborators