SQL, noSQL or no database at all?
Are databases still a core skill?
Neil Saunders
COMPUTATIONAL INFORMATICS
www.csiro.au
Databases: Slide 2 of 24
alternative title: should David Lovell learn databases?
Databases: Slide 3 of 24
actual recent email request
Hi Neil,
I was wondering if you could help me with something. I am trying to put
together a table but it is rather slow by hand. Do you know if you can
help me with this task with a script? If it is too much of your time,
don’t worry about it. Just thought I’d ask before I start.
The task is:
The targets listed in A tab need to be found in B tab then the entire row
copied into C tab. Then the details in column C of C tab then need to be
matched with the details in D tab so that the patients with the mutations
are listed in row AG and AH of C tab.
Again, if this isn’t an easy task for you then don’t worry about it.
Databases: Slide 4 of 24
sounds like a database to me (c. 2004)
Databases: Slide 5 of 24
database design is a profession in itself
-- KEGG_DB schema
CREATE TABLE ec2go (
ec_no VARCHAR(16) NOT NULL, -- EC number (with "EC:" prefix)
go_id CHAR(10) NOT NULL -- GO ID
);
CREATE TABLE pathway2gene (
pathway_id CHAR(8) NOT NULL, -- KEGG pathway long ID
gene_id VARCHAR(20) NOT NULL -- Entrez Gene or ORF ID
);
CREATE TABLE pathway2name (
path_id CHAR(5) NOT NULL UNIQUE, -- KEGG pathway short ID
path_name VARCHAR(80) NOT NULL UNIQUE -- KEGG pathway name
);
-- Indexes.
CREATE INDEX Ipathway2gene ON pathway2gene (gene_id);
Databases: Slide 6 of 24
know your ORM from your MVC
(do you DSL?)
http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller
Databases: Slide 7 of 24
my one tip for today: use ORM
= object relational mapping
#!/usr/bin/ruby
require 'sequel'
# connect to UCSC Genomes MySQL server
DB = Sequel.connect(:adapter => "mysql", :host => "genome-mysql.cse.ucsc.edu",
:user => "genome", :database => "hg19")
# instead of "SELECT count(*) FROM knownGene"
DB.from(:knownGene).count
# => 82960
# instead of "SELECT name, chrom, txStart FROM knownGene LIMIT 1"
DB.from(:knownGene).select(:name, :chrom, :txStart).first
# => {:name=>"uc001aaa.3", :chrom=>"chr1", :txStart=>11873}
# instead of "SELECT name FROM knownGene WHERE chrom = 'chrM'"
DB.from(:knownGene).select(:name).where(:chrom => "chrM").all
# => [{:name=>"uc004coq.4"}, {:name=>"uc022bqo.2"}, {:name=>"uc004cor.1"}, {:name=>"uc004cos.5"},
# {:name=>"uc022bqp.1"}, {:name=>"uc022bqq.1"}, {:name=>"uc022bqr.1"}, {:name=>"uc031tga.1"},
# {:name=>"uc022bqs.1"}, {:name=>"uc011mfi.2"}, {:name=>"uc022bqt.1"}, {:name=>"uc022bqu.2"},
# {:name=>"uc004cov.5"}, {:name=>"uc031tgb.1"}, {:name=>"uc004cow.2"}, {:name=>"uc004cox.4"},
# {:name=>"uc022bqv.1"}, {:name=>"uc022bqw.1"}, {:name=>"uc022bqx.1"}, {:name=>"uc004coz.1"}]
Databases: Slide 8 of 24
don’t want to CREATE? you still might want to SELECT
Question: How to map a SNP to a gene around +/- 60KB ?
I am looking at a bunch of SNPs. Some of them are part of genes,
but other are not. I am interested to look up +60KB or -60KB of
those SNPs to get details about some nearby genes. Please share
your experience in dealing with such a situation or thoughts on
any methods that can do this. Thanks in advance.
http://www.biostars.org/p/413/
Databases: Slide 9 of 24
example SELECT
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
  K.proteinID, K.name, S.name,
  S.avHet, S.chrom, S.chromStart,
  K.txStart, K.txEnd
from snp130 as S
left join knownGene as K on
  (S.chrom = K.chrom and not(K.txEnd + 60000 < S.chromStart or
  S.chromEnd + 60000 < K.txStart))
where
  S.name in ("rs25","rs100","rs75","rs9876","rs101")
'
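The join condition above is an interval-overlap test: a gene is kept when its transcript, padded by 60,000 bases on either side, does not lie entirely before or entirely after the SNP. A minimal Ruby sketch of the same predicate follows. The `near?` helper, the hash keys and the coordinates are all hypothetical and are not taken from the UCSC tables.

```ruby
# Window size in bases, matching the 60000 used in the SQL join.
WINDOW = 60_000

# True when the SNP lies within `window` bases of the gene on the same chromosome.
# Mirrors: S.chrom = K.chrom AND NOT (K.txEnd + 60000 < S.chromStart
#                                     OR S.chromEnd + 60000 < K.txStart)
def near?(snp, gene, window = WINDOW)
  return false unless snp[:chrom] == gene[:chrom]
  !(gene[:tx_end] + window < snp[:chrom_start] ||
    snp[:chrom_end] + window < gene[:tx_start])
end

# Hypothetical coordinates, for illustration only.
snp   = { chrom: "chr1", chrom_start: 100_000, chrom_end: 100_001 }
close = { chrom: "chr1", tx_start: 150_000, tx_end: 170_000 }
far   = { chrom: "chr1", tx_start: 200_000, tx_end: 220_000 }
other = { chrom: "chr2", tx_start: 100_000, tx_end: 120_000 }

puts near?(snp, close)  # gene starts 49,999 bases after the SNP ends
puts near?(snp, far)    # gene starts well outside the window
puts near?(snp, other)  # different chromosome
```

Writing the test as "not (entirely before or entirely after)" keeps it to two comparisons and also covers SNPs that fall inside a gene.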
Databases: Slide 10 of 24
example SELECT result
Databases: Slide 11 of 24
let’s talk about noSQL
http://www.infoivy.com/2013/07/nosql-database-comparison-chart-only.html
Databases: Slide 12 of 24
(potentially) a good fit for biological data
Databases: Slide 13 of 24
many data sources are “key-value ready”
(or close enough)
http://togows.dbcls.jp/entry/pathway/hsa00030/genes.json
[
{
"2821": "GPI; glucose-6-phosphate isomerase [KO:K01810] [EC:5.3.1.9]",
"2539": "G6PD; glucose-6-phosphate dehydrogenase [KO:K00036] [EC:1.1.1.49]",
"25796": "PGLS; 6-phosphogluconolactonase [KO:K01057] [EC:3.1.1.31]",
...
"5213": "PFKM; phosphofructokinase, muscle [KO:K00850] [EC:2.7.1.11]",
"5214": "PFKP; phosphofructokinase, platelet [KO:K00850] [EC:2.7.1.11]",
"5211": "PFKL; phosphofructokinase, liver [KO:K00850] [EC:2.7.1.11]"
}
]
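"Close enough" matters here: each value packs a gene symbol, a name, a KO identifier and an EC number into one string. A minimal Ruby sketch of splitting those fields apart, using two entries copied from the response above, might look like this. The `parse_gene` helper and its field names are hypothetical, and the sketch assumes every value follows the "SYMBOL; name [KO:...] [EC:...]" pattern seen in this sample.

```ruby
require "json"

# Two entries copied from the TogoWS response shown on the slide.
json = <<~JSON
  [
    {
      "2821": "GPI; glucose-6-phosphate isomerase [KO:K01810] [EC:5.3.1.9]",
      "5213": "PFKM; phosphofructokinase, muscle [KO:K00850] [EC:2.7.1.11]"
    }
  ]
JSON

# Split "SYMBOL; name [KO:...] [EC:...]" into separate fields.
# parse_gene is a hypothetical helper, not part of TogoWS or any library.
def parse_gene(id, desc)
  symbol, rest = desc.split("; ", 2)
  {
    id:     id,
    symbol: symbol,
    name:   rest.sub(/\s*\[.*\z/, ""),   # drop the bracketed annotations
    ko:     rest[/\[KO:([^\]]+)\]/, 1],  # nil if no KO tag is present
    ec:     rest[/\[EC:([^\]]+)\]/, 1]   # nil if no EC tag is present
  }
end

# The response is an array holding one object of id => description pairs.
genes = JSON.parse(json).flat_map do |record|
  record.map { |id, desc| parse_gene(id, desc) }
end

genes.each { |g| puts g.inspect }
```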
Databases: Slide 14 of 24
schema-free: save first, worry later
(= agile)
#!/usr/bin/ruby
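require "mongo"     # legacy 1.x driver API (Mongo::Connection)
require "json"
require "open-uri"  # on Ruby 3.0+, call URI.open instead of open
# connect to a local MongoDB server: database "kegg", collection "genes"
db = Mongo::Connection.new.db('kegg')
col = db.collection('genes')
# fetch the gene list for KEGG pathway hsa00030 as JSON
j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa00030/genes.json").read)
j.each do |g|
  g.each_pair do |key, val|
    # one document per gene: gene ID as _id, description as desc
    col.save({:_id => key, :desc => val})
  end
end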
Ruby code to save JSON from the TogoWS REST service
Databases: Slide 15 of 24
example application - PMRetract
ask later if interested
http://pmretract.heroku.com/
https://github.com/neilfws/PubMed/tree/master/retractions
Databases: Slide 16 of 24
when rows + columns != database
- sometimes a database is overkill
Databases: Slide 17 of 24
example 1 - R/IRanges
Databases: Slide 18 of 24
example 2 - bedtools
http://bedtools.readthedocs.org/en/latest/
Databases: Slide 19 of 24
example 3 - unix join (and the shell in general)
Databases: Slide 20 of 24
when are databases good?
- when data are updated frequently
- when multiple users do the updating
- when queries are complex or ever-changing
- as backends to web applications
Databases: Slide 21 of 24
when are databases not/less good?
- for basic “set operations”
- for sequence data [1]
(?)
[1] no time to discuss BioSQL, GBrowse/Bio::DB::GFF, BioDAS etc.
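As an illustration of the "basic set operations" case, comparing two ID lists needs nothing more than the standard library. This is a minimal Ruby sketch; the two lists are hypothetical and stand in for IDs read from two text files.

```ruby
require "set"

# Hypothetical gene ID lists, e.g. read from two text files.
list_a = Set["2821", "2539", "25796", "5213"]
list_b = Set["5213", "5214", "5211", "2821"]

shared = list_a & list_b  # IDs in both lists
either = list_a | list_b  # IDs in at least one list
only_a = list_a - list_b  # IDs in A but not in B

puts "shared: #{shared.to_a.sort.join(', ')}"
puts "either: #{either.size} IDs"
puts "only A: #{only_a.to_a.sort.join(', ')}"
```

For lists that fit in memory, this avoids creating tables and loading data just to ask which IDs overlap.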
Databases: Slide 22 of 24
so how did I answer that email?
options(java.parameters = "-Xmx4g")
library(XLConnect)
wb <- loadWorkbook("~/Downloads/NGS Target list Tumour for Neil.xlsx")
s1 <- readWorksheet(wb, sheet = 1, startCol = 1, endCol = 1, header = FALSE)
s2 <- readWorksheet(wb, sheet = 2, startCol = 1, endCol = 32, header = TRUE)
s4 <- readWorksheet(wb, sheet = 4, startCol = 1, endCol = 3, header = TRUE)
# then use gsub, match, %in% etc. to clean and join the data
# ...
Read spreadsheet into R using the XLConnect package, then “munge”
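The "munge" step is a filter followed by a lookup, which can be sketched without any database. The Ruby below is not the R code used for the actual answer. It is a minimal sketch of the task described in the email, and every tab layout, column name and value in it is a hypothetical stand-in.

```ruby
# Hypothetical stand-ins for the tabs; the real column layout is not shown.
tab_a = ["geneA", "geneC"]                # targets of interest
tab_b = [                                 # full table, one row per target
  { target: "geneA", detail: "m1" },
  { target: "geneB", detail: "m2" },
  { target: "geneC", detail: "m3" }
]
tab_d = [                                 # patients, keyed by the same detail
  { detail: "m1", patient: "P01" },
  { detail: "m3", patient: "P02" },
  { detail: "m1", patient: "P03" }
]

# Step 1: keep the rows of B whose target is listed in A (like R's %in%).
tab_c = tab_b.select { |row| tab_a.include?(row[:target]) }

# Step 2: build a detail => patients lookup from D (like R's match).
patients_by_detail = tab_d
  .group_by { |row| row[:detail] }
  .transform_values { |rows| rows.map { |r| r[:patient] } }

# Step 3: attach the matching patients to each selected row.
tab_c = tab_c.map do |row|
  row.merge(patients: patients_by_detail.fetch(row[:detail], []))
end

tab_c.each { |row| puts row.inspect }
```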
