Alternative Approaches to Managing and Integrating
Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division
• Bioinformatics and Relational Database Management Systems (RDBMSs)
• Use Cases – Text Mining and Atherosclerosis
• Bioinformatics and NoSQL Databases
• How to Choose a Database for Your Project
• Closing Comments
Relational Database – “a database that [explicitly] stores
information about both the data and how it is related.”
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a
mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used
in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)
Volume of data
Variety of data
Integration of data
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
• Pragmatic
  - Widely applicable
  - Many options
• Modeling
  - Reduce risk of data anomalies
  - Separate logical and physical models
The key,
The whole key, and
Nothing but the key.
Implementation bottlenecks: data modeler vs. developer
Scaling up vs. scaling out
Frequent need for denormalization
Use Cases – Text Mining and Atherosclerosis
Text Mining
• Storing text
• Caching word vectors
• Extracted features
• Experiment results
Atherosclerosis Research
• Demographics
• Sample tracking
• Genomic data
• Sequence variants
• Mass spec results
Early 1950s: Korean War autopsies
1985-1998: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
2012-2016: Genomic and Proteomic Architecture of Atherosclerosis (GPAA)
“… tell your
children not to do
what I have done …”
House of the Rising Sun
American Folk Song
Started with MySQL
Could have stayed with relational model, but:
• Requirements change
• New data sets
• Unknown data structures
• Increasingly complex normalized model
Bioinformatics and NoSQL Databases
Scalability
Cost
Availability
Consistency
Flexibility
• Key Value Databases
• Document Databases
• Wide Column Stores
• Graph Databases
• Search Engines
Features
• Simple primitive data structure
• No predefined schema
• Limited query capabilities
• Dictionary-like functionality at large scale
[Figure: keys mapped to values: key1 → value1, key2 → value2, key3 → value3]
Bioinformatics Use Case
• Word vectors in text mining
• Caching
Limitations
• Key lookup only, no generalized query
• Small number of attributes per entity
>>> import redis
>>> r_server = redis.Redis("localhost", decode_responses=True)  # return str rather than bytes
>>> r_server.set("sample:123:type", "Aorta")
True
>>> r_server.get("sample:123:type")
'Aorta'
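A minimal sketch of the key-value pattern above, using a plain Python dict as a stand-in for a Redis server so that it runs without one. The `KVStore` class, the `make_key` helper, the `entity:id:attribute` key scheme, and the sample values are illustrative assumptions, not part of the original slides. It shows the dictionary-like lookup and the limitation that a stored JSON document can only be fetched and decoded as a whole.

```python
import json


class KVStore:
    """In-memory stand-in for a key-value database (hypothetical helper)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Values are opaque strings to the store, as in a simple KV database.
        self._data[key] = str(value)

    def get(self, key):
        # Key lookup is the only access path; a missing key returns None.
        return self._data.get(key)


def make_key(entity, entity_id, attribute):
    """Build a namespaced key such as 'sample:123:type'."""
    return f"{entity}:{entity_id}:{attribute}"


store = KVStore()
store.set(make_key("sample", 123, "type"), "Aorta")

# A non-atomic value: the whole JSON document is stored under one key.
store.set(make_key("sample", 123, "doc"),
          json.dumps({"type": "Aorta", "AHA_score": 2}))

sample_type = store.get(make_key("sample", 123, "type"))

# To read one field, the client must fetch and decode the entire document.
doc = json.loads(store.get(make_key("sample", 123, "doc")))
aha_score = doc["AHA_score"]
```

Because the store cannot search by value, a question such as "which samples are of type Aorta?" would need either a scan of every key or a second, hand-maintained index key.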
Features
• JSON/XML structures
• Fields vary between docs
• No predefined schema
• Documents analogous to rows
• Collections analogous to tables
• Query capabilities
Bioinformatics Use Case
• Text mining
• Atherosclerosis
Limitations
• No joins
• No referential integrity checks
• Object-based query language
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
{
  subject_id: "F8273",
  age: "26",
  sex: "M",
  date_of_death: "12-Jan-1995",
  glycohemoglobin: "10%",
  BMI: 22,
  samples: [ {type: "Thoracic Aorta", AHA_score: 1},
             {type: "Abdominal Aorta", AHA_score: 2},
             {type: "LAD", AHA_score: 5} ],
  sequence: {seq_file: "F8273_08152014.bam",
             variant_file: "F8273_08152014.vcf"}
}
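A rough sketch of how documents like the one above might be queried, using a list of Python dicts in place of a document database collection. The second subject (`X0001`) and the `find_by_sample` helper are hypothetical and only illustrate that fields can vary between documents and that queries can reach into embedded arrays.

```python
subjects = [
    {
        "subject_id": "F8273",
        "sex": "M",
        "samples": [
            {"type": "Thoracic Aorta", "AHA_score": 1},
            {"type": "Abdominal Aorta", "AHA_score": 2},
            {"type": "LAD", "AHA_score": 5},
        ],
        "sequence": {"variant_file": "F8273_08152014.vcf"},
    },
    {
        # Hypothetical second subject: no sex and no sequence fields.
        "subject_id": "X0001",
        "samples": [{"type": "LAD", "AHA_score": 1}],
    },
]


def find_by_sample(collection, sample_type, min_score):
    """Return ids of subjects with a sample of the given type at or above min_score."""
    return [
        doc["subject_id"]
        for doc in collection
        if any(
            s.get("type") == sample_type and s.get("AHA_score", 0) >= min_score
            for s in doc.get("samples", [])
        )
    ]


high_lad = find_by_sample(subjects, "LAD", 4)

# Fields are optional, so code must tolerate documents that lack them.
has_sequence = [d["subject_id"] for d in subjects if "sequence" in d]
```

The samples are embedded in the subject document rather than joined from a separate table, which is the denormalized style the slide describes.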
Features
• Groups attributes into column families
• Column families store key-value pairs
• Implemented as sparse multi-dimensional arrays
• Denormalized
• 10^4 to 10^6 columns; 10^9 rows
Bioinformatics Use Case
• Large studies
• Many experiments & data types
• Simulations
Limitations
• Operationally challenging
• Suitable for large number of servers
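The column-family idea can be sketched with nested dictionaries: row key, then column family, then column, then value. The `WideColumnTable` class and the family and column names below are assumptions made for illustration; real wide column stores add distribution, versioning, and persistence that this toy omits.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy sparse table: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Only cells that are written are stored; absent cells take no space.
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        row = self._rows.get(row_key)
        if row is None:
            return default
        return row.get(family, {}).get(column, default)

    def cell_count(self):
        """Number of cells actually stored across all rows and families."""
        return sum(len(cols) for row in self._rows.values() for cols in row.values())


table = WideColumnTable(["demographics", "mass_spec"])
table.put("F8273", "demographics", "sex", "M")
table.put("F8273", "mass_spec", "peptide_0001", 0.82)
table.put("X0001", "mass_spec", "peptide_9999", 0.10)
```

Each row may use a different set of columns within a family, which is how a table can nominally span a very large number of columns while most cells stay empty.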
Features
• Highly normalized
• Graph-based query language (Gremlin)
• SQL-inspired query language (Cypher)
• Support for path finding and recursion
Bioinformatics Use Case
• Epidemiology simulations
• Interaction networks
Limitations
• Less suited for tabular data
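Path finding is the kind of query graph databases are built for. As a rough illustration in plain Python rather than Gremlin or Cypher, the sketch below runs a breadth-first search over a small, made-up interaction network; the node names and the `shortest_path` helper are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected interaction network stored as adjacency sets.
interactions = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
    "F": set(),
}
path = shortest_path(interactions, "A", "E")
```

In a relational model the same question would typically need a recursive query or repeated self-joins on an edge table.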
How to Choose a Database for Your Project
Relational:
• Requirements known at start of project
• Entities described by common attributes
• Compliance and audit issues
• Need normalization
• Acceptable performance on small number of servers
• Need server-side joins
Key value:
• Caching
• Few attributes
Document databases:
• Varying attributes
• Integrate diverse data types
• Use denormalized data
Wide column data stores:
• Extremely large volumes of data
• High availability
Graph databases:
• Connected data
• Need path finding and recursive queries
Multiple types of databases
NoSQL complements relational models
Research question drives selection
Balance benefits and limitations
May use multiple types of databases in a single project
NoSQL databases are improving rapidly, gaining additional functionality
*Dr. Rebecca Wattam, Advisor
*Becky Will, GPAA VT PI
*Chengdong Zhang, DBA & SE
*Cyberinfrastructure Division
*GPAA Collaborators

Editor's Notes

  • #4: Relational databases take advantage of relationships between entities (things, nouns) to minimize the amount of data stored. NoSQL databases model entities, but relationships are often implicit in the structure. They place less emphasis on minimizing storage, preserving data integrity, or avoiding data anomalies.
  • #5: Projects with any two of these can probably be handled well by an RDBMS. When all three are encountered in one project, NoSQL can often provide better performance, with different levels of support for consistency, availability, and partition tolerance (CAP theorem).
  • #6: Simple data sets can be managed in spreadsheets. This is not ideal but works in some cases. Larger and more complicated data sets require a database. Relational is a natural next step from spreadsheets because of the tabular nature of the data.
  • #7: Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL, along with many commercial options. There is a mature set of tools, such as IDEs for database developers, and many resources and best practices are available. From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete, and update anomalies). It also separates the logical model (what we see as database users) from the physical model (e.g. how data is actually stored on disk or other persistent storage media). There are some performance disadvantages due to the need for joins: gathering related information that is stored in separate tables, and therefore on different parts of the disk.
  • #8: Normalization is a process of reducing redundancy and the risk of data anomalies. There are several rules of normalization; the most important are Codd’s first three normal forms. Much of the code in an RDBMS is designed to support querying normalized data: how to bring related data together, and how to do it with an optimal set of steps (the query optimizer).
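A small sketch of what normalization looks like in practice, using Python's built-in `sqlite3` module with an in-memory database. The `subject` and `sample` tables and their columns are assumptions loosely based on the atherosclerosis example in the slides, not the project's actual schema. Subject attributes are stored once, samples reference the subject by key, and a join brings them back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE subject (
        subject_id TEXT PRIMARY KEY,
        sex        TEXT
    );
    CREATE TABLE sample (
        sample_id  INTEGER PRIMARY KEY,
        subject_id TEXT NOT NULL REFERENCES subject(subject_id),
        type       TEXT NOT NULL,
        aha_score  INTEGER
    );
""")
conn.execute("INSERT INTO subject VALUES ('F8273', 'M')")
conn.executemany(
    "INSERT INTO sample (subject_id, type, aha_score) VALUES (?, ?, ?)",
    [("F8273", "Thoracic Aorta", 1), ("F8273", "Abdominal Aorta", 2), ("F8273", "LAD", 5)],
)

# Subject attributes are stored once; a join brings related rows back together.
rows = conn.execute("""
    SELECT s.subject_id, s.sex, m.type, m.aha_score
    FROM subject AS s JOIN sample AS m ON m.subject_id = s.subject_id
    ORDER BY m.aha_score
""").fetchall()

# Referential integrity: a sample for an unknown subject is rejected.
try:
    conn.execute("INSERT INTO sample (subject_id, type, aha_score) VALUES ('NOPE', 'LAD', 3)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The foreign-key check is the kind of guarantee that is given up when the same data is embedded in documents instead.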
  • #9: RDBMSs run well on a single server. You can implement failover solutions and load-balance read-only queries, but it is difficult to run a distributed RDBMS with write operations and immediate consistency. Network and database latency causes a delay between the time a row is updated in one instance and the time it is updated in all others. This can require locking all replicas of a row until every replica is updated. Distributing an RDBMS involves: two-phase commit for writes in a master-master configuration; master-slave replication, which helps with reads but not writes; sharding, which helps if querying by shard key, otherwise all servers must be queried; and vertical partitioning, in which tables are placed on different servers, making it hard to join tables on different servers. Watch out for software license costs if scaling out with COTS software. NoSQL databases relax the consistency constraint; some implement eventual consistency. Implementation bottlenecks: an RDBMS needs a data modeler to change the schema and a DBA to implement those changes, whereas NoSQL allows developers to add columns, collections, and other structures on the fly. The trade-off is losing some benefits of an RDBMS, such as referential integrity. Joins are time and resource consuming, so developers often denormalize to improve performance. That makes one question the use of an RDBMS if its core functionality is not used.
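The point about sharding can be illustrated with a toy example in which each Python dict stands in for one server. The shard count, the `subject_id` shard key, and the helper functions are assumptions for illustration only. A stable hash (MD5 here, rather than Python's per-process salted `hash`) maps each key to a shard, so a lookup by shard key touches one shard while a query on any other attribute has to visit all of them.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a shard key to a shard number with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # each dict stands in for one server


def insert(row):
    """Place a row on the shard chosen by its subject_id."""
    shards[shard_for(row["subject_id"], N_SHARDS)][row["subject_id"]] = row


def get_by_subject_id(subject_id):
    """Query by shard key: exactly one shard is consulted."""
    return shards[shard_for(subject_id, N_SHARDS)].get(subject_id)


def find_by_sex(sex):
    """Query by a non-key attribute: every shard must be scanned."""
    hits, shards_scanned = [], 0
    for shard in shards:
        shards_scanned += 1
        hits.extend(r for r in shard.values() if r["sex"] == sex)
    return hits, shards_scanned


for i in range(20):
    insert({"subject_id": f"S{i:04d}", "sex": "M" if i % 2 else "F"})

one = get_by_subject_id("S0007")
males, scanned = find_by_sex("M")
```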
  • #11: Relational is a good fit when: audit and compliance are important; referential integrity is needed; immediate consistency is needed; relational integrity is needed; durability is satisfied by backups. Use cases: financial services, health care, manufacturing, even our own beloved Hokie Spa. Our use cases are different. Is relational really the best data model? It is not necessary when: the application is tolerant of some errors; availability is the primary concern; durability is important.
  • #12: The most important point of this talk: don’t choose a database model based on what you are familiar with, what others say is the “best” data model, or what has been used before just because it has been used before. Let research requirements, subject to constraints (time, funding, etc.), drive the decision. Some of us learn this lesson the hard way.
  • #13: I’ll discuss how NoSQL databases can be used in two different bioinformatics areas: text mining and atherosclerosis. I described the text mining project in detail in a seminar last semester, so I won’t go into much detail in that area, but I will spend a few minutes providing background on atherosclerosis. I’ll also use atherosclerosis examples when describing NoSQL data models.
  • #14: Atherosclerosis is a build-up of plaque inside the arteries. Plaque consists of fat, cholesterol, calcium, and other substances. It limits the flow of oxygen and leads to heart attack and stroke. From http://www.nhlbi.nih.gov/health/health-topics/topics/atherosclerosis/causes.html: The exact cause of atherosclerosis isn't known. However, studies show that atherosclerosis is a slow, complex disease that may start in childhood. It develops faster as you age. Atherosclerosis may start when certain factors damage the inner layers of the arteries. These factors include: smoking; high amounts of certain fats and cholesterol in the blood; high blood pressure; high amounts of sugar in the blood due to insulin resistance or diabetes. Plaque may begin to build up where the arteries are damaged. Over time, plaque hardens and narrows the arteries. Eventually, an area of plaque can rupture (break open). When this happens, blood cell fragments called platelets (PLATE-lets) stick to the site of the injury. They may clump together to form blood clots. Clots narrow the arteries even more, limiting the flow of oxygen-rich blood to your body.
  • #15: Autopsies performed during the Korean War found evidence of early-onset atherosclerosis. There had not been enough time for lifestyle factors, such as a high-fat diet, smoking, and inactivity, to be the sole cause of plaque. Hypothesis: a genetic factor influences atherosclerosis. PDAY confirmed and expanded on the earlier findings. A large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes: 3,000 autopsies of 15-34 year olds. Aorta and LAD samples were preserved in formalin-fixed, paraffin-embedded (FFPE) blocks. Liver samples were also collected. GPAA uses the liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks, which makes genomic and proteomic analysis possible today.
  • #16: Time for a confession. I ignored my earlier advice about letting requirements and constraints drive database selection in the GPAA project. I’ve worked with relational databases extensively and had developed models for demographic, phenotypic, genomic, and proteomic data before. I did not pay enough attention to the “unknown unknowns”: collaborators had additional ideas about how to leverage other data on GWAS, eQTL, histones, chromatin, etc. I did not appreciate how much would change.
  • #17: Could have stayed with the relational model, but: requirements were changing; there were new data sets (GWAS, eQTL, chromatin segmentation, histones); data structures were unknown for Multiple Reaction Monitoring (MRM) mass spec and SWATH; and the normalized model was beginning to be more trouble than it was worth. Flexibility was a primary concern.
  • #19: The first four are especially important to organizations with big data and a need for constant access to data and applications, e.g. Facebook, Amazon, Google. Flexibility is the primary driver for us to consider, and eventually adopt, a NoSQL database.
  • #20: These are the 4 most commonly referenced database types in the NoSQL community and press. I will not discuss search databases here. PATRIC is using a hybrid relational-search database strategy, which is significantly improving performance over a relational-only approach. Integration is key for bioinformaticians and biologists; don’t make them integrate data.
  • #22: So simple, it is almost trivial. Key-value databases can store non-atomic values as well, e.g. JSON documents, but you can only access the entire document; you cannot select a single value in the document or search for values of a particular field.
  • #23: Example KV databases. Redis: popular, easy to use, commonly used for caching; master-slave replication; multiple servers respond to read requests; one server handles writes. Riak: scalable, masterless. BerkeleyDB: first widely used KV data store. Aerospike and FoundationDB: support ACID transactions. Amazon DynamoDB: available in the cloud (it was just announced on 10/9/2014 that DynamoDB will support documents as well as KVs).
  • #24: JSON/BSON or XML storage
  • #28: Cassandra: developed by Facebook. HBase: part of the Hadoop ecosystem. Accumulo: designed to support cell-level access control; originally created by the NSA. Hypertable: used commercially.
  • #30: Neo4j is probably the most widely used of the graph databases. OrientDB incorporates document database features as well as graph database features. Titan runs on a cluster and uses Cassandra or HDFS (I think) for distributed storage. GraphChi-DB: a project to run large graphs on small machines, e.g. Mac Minis. AllegroGraph: a commercial product from Franz, a long-established Lisp vendor.