All kmers are not created equal:
finding the signal from the noise in large-scale metagenomes.

Will Trimble
Metagenomic annotation group
Argonne National Laboratory

BEACON seminar, April 23, 2014, MSU

Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to answer questions with ambiguous data.

Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to answer questions with ambiguous data.
•  Shoveling data from the data-producing machine into the data-consuming furnace.

Outline
•  Sequences are different
•  How much did my sequencing run give me?  kmerspectrumanalyzer
•  How much did I sample?  nonpareil-k
•  Pretty pictures  thumbnailpolish

Outline
•  Sequences are different  (math)
•  How much did my sequencing run give me?  kmerspectrumanalyzer (graphs)
•  How much did I sample?  nonpareil-k (graphs)
•  Pretty pictures  thumbnailpolish (micrographs)

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

Instrument readings, spectra, micrographs: not categorical.
Low-throughput categorical data: categories are sound.
High-throughput sequence data: categorization is an art.

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

Instrument readings, spectra, micrographs: not categorical.
Low-throughput categorical data: categories are sound.
High-throughput sequence data: categorization is an art.

10^7 channels · 10^3 channels · 10^11 channels

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.
•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

We know what to do with these puzzles.  You go to this website, and type it in…

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

How long do reads need to be to recognize them?

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

How long do reads need to be to recognize them?  To do what, to place on a reference genome?

This can be turned into a math problem that I will illustrate with a search engine analogy.

How long do reads need to be?
Information (Shannon, 1949, BSTJ):

H = \sum_i p_i \log_2\!\left(\frac{1}{p_i}\right)

is a quantitative summary of the uncertainty of a probability distribution – a model of the data.

Profound applicability in pattern matching + modeling.
Logarithmic measurements have units!

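A minimal numerical sketch of this definition (my addition, not from the slides): Shannon entropy of a categorical distribution in bits, computed in Python. The uniform distribution over A, C, G, T gives the 2 bits per base used a few slides later.

import math

def shannon_entropy_bits(probs):
    # H = sum_i p_i * log2(1/p_i); zero-probability outcomes contribute nothing
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(shannon_entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform bases: 2.0 bits
print(shannon_entropy_bits([0.4, 0.4, 0.1, 0.1]))      # skewed composition: ~1.72 bits
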
A word on the sign of the entropy
•  A popular straw man among-mathematicians-and-CS-people is the "random sequence model."  Uniform categorical distribution over all 4^L sequences.
•  When we learn something—like we collect some genomes and expect our new sequences to look like them—we implicitly construct a less flat distribution.  Models always have less entropy than the model of ignorance.

How long do phrases need to be?
Exercise:  Pick a book from your bookshelf.  Pick an arbitrary page and arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.
How long do phrases need to be?
•  Information content of English words:  H_word is ca. 12 bits per word.
•  Size of google books?  Big libraries have a few 10^7 books, each one has 10^5 indexed words ….so a database size of 10^12 words.
   log2(database size) = log2(10^12) = 39.9 ≈ 40 bits
•  So we expect on average 40 / 12 = 3.3 ≈ 4 words to be enough to find a phrase in google's index.  Try it.
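The slide's arithmetic as a runnable back-of-the-envelope check (a sketch; the 12 bits per word and the 10^12-word corpus size are the slide's own estimates, not measured values):

import math

bits_per_word = 12                       # slide's estimate for English text
corpus_words = 1e12                      # ~10^7 books x ~10^5 indexed words each
corpus_bits = math.log2(corpus_words)    # bits needed to name one word position
print(corpus_bits)                             # ~39.9 bits
print(corpus_bits / bits_per_word)             # ~3.3 words
print(math.ceil(corpus_bits / bits_per_word))  # 4 words in practice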
  
How long do phrases need to be?
Exercise:  Pick a book from your bookshelf.  Pick an arbitrary page and arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

Most often takes 4 words.

How long do phrases need to be?
•  Information content of English words:  H_word is ca. 12 bits per word.
•  Size of google books?  Big libraries have a few 10^7 books, each one has 10^5 indexed words ….so a database size of 10^12 words.
   log2(database size) = log2(10^12) = 39.9 ≈ 40 bits
•  So we expect on average 40 / 12 = 3.3 ≈ 4 words to be enough to find a phrase in google's index.  Try it.

Not all phrases are equally distinctive.

How long do reads need to be?
•  Maximum information content of ℓ base pairs:  H_read = 2ℓ bits per length-ℓ sequence.
•  Most long kmers are distinct:  genome of size G (ca 10^10 bp),
   log2(G) = log2(10^10) = 33.2 ≈ 34 bits
•  So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.
•  That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
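The same arithmetic for reads, as a sketch: at most 2 bits per base against log2 of a reference catalog of size G (the 10^10 bp figure is the slide's; substitute your own G).

import math

bits_per_base = 2            # maximum information content per base
G = 1e10                     # size of the reference catalog in bp (slide's figure)
bits_needed = math.log2(G)   # bits needed to name one position in the catalog
print(bits_needed)                             # ~33.2 bits
print(math.ceil(bits_needed / bits_per_base))  # 17 bp, the slide's answer
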
The data deluge
•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
•  The data has outgrown some favorite algorithms from the 1990s (BLAST).

Picture, if you will, a hiseq flowcell
What's in there?
Pairs of microbial genomes
Microbial transcriptomes + replicates
Environmental isolate genomes
Environmental extract sequencing
Preparation-intensive sequencing
Eukaryotic sequencing
Eukaryotic sequencing for variants

Picture, if you will, a hiseq flowcell
What's in there?
Pairs of microbial genomes
Microbial transcriptomes + replicates
Environmental isolate genomes
Environmental extract sequencing
Preparation-intensive sequencing
Eukaryotic sequencing
Eukaryotic sequencing for variants

Let's count kmers!

The kmer spectrum.
[Plot: number of kmers vs. 21-mer abundance for a microbial genome]

The kmer spectrum.
[Plot: number of kmers vs. 21-mer abundance for a microbial genome, rare to abundant]
Low-abundance errors; a peak that contains most of the genome; a high-abundance peak that contains multicopy genes; really-high-abundance stuff is often artifacts.
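A memory-naive sketch of how a kmer spectrum like this is built (a plain Python Counter with no reverse-complement canonicalization; production counters behind tools like kmerspectrumanalyzer are far more compact). The toy reads and k are illustrative.

from collections import Counter

def kmer_spectrum(reads, k=21):
    # Count every k-mer, then histogram the counts: abundance -> number of distinct kmers.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:      # skip ambiguous bases
                counts[kmer] += 1
    return sorted(Counter(counts.values()).items())

reads = ["CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT",
         "CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT"]
print(kmer_spectrum(reads, k=21))    # e.g. [(1, 34)] when all 21-mers are distinct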
  
Ranked kmer spectrum
[Plot: 21-mer abundance vs. kmer rank (cumulative sum of number of kmers), rare to abundant]

Ranked kmers consumed
[Plot: 21-mer abundance vs. fraction of observed kmers, rare to abundant]
The data fraction is unusually stable.

Different kinds of data have different spectra.
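A sketch of the transformation behind the "ranked kmers consumed" view two slides back: walk the spectrum from abundant to rare and accumulate the fraction of all kmer observations (the data fraction) consumed so far. The toy spectrum is illustrative.

def consumed_fraction(spectrum):
    # spectrum: list of (abundance, number of distinct kmers) pairs.
    # Returns (abundance, cumulative fraction of observed kmers), most abundant first.
    total = sum(a * n for a, n in spectrum)
    out, running = [], 0
    for abundance, n_kmers in sorted(spectrum, reverse=True):
        running += abundance * n_kmers
        out.append((abundance, running / total))
    return out

toy = [(1, 500000), (30, 90000), (60, 4000)]   # error tail, genome peak, repeats
for abundance, frac in consumed_fraction(toy):
    print(abundance, round(frac, 3))           # 60 0.07 / 30 0.855 / 1 1.0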
  
Redundancy is good
•  OMG!  Check out these three sequences!  I've found the fourth, fifth, and sixth domains of life.
•  OMG!  I see this sequence 10 million times.
•  OMG!  There are more than 10 billion distinct 31mers in my dataset.  I only have 128 Gbytes of memory.
•  Error correction and diginorm somewhat amusingly strive for opposite ends.

Redundancy is good
•  OMG!  Check out these three sequences!  I've found the fourth, fifth, and sixth domains of life.
•  OMG!  I see this sequence 10 million times.
•  OMG!  There are more than 10 billion distinct 31mers in my dataset.  I only have 128 Gbytes of memory.
•  Error correction and diginorm somewhat amusingly strive for opposite ends.

Abundance-based inferences are better in the high-abundance part of the data.

kmerspectrumanalyzer: infer genome size and depth

P_{\mathrm{NO}}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NBpdf}\!\left(x;\ \mu = c\,n,\ \alpha = s/n\right)

Generalization of a mixed-Poisson model to estimate how much sequence is in each peak.
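A hedged sketch of evaluating a mixture of that shape with scipy; how the slide's α maps onto scipy's (n, p) negative-binomial parameterization is my guess, so treat this as an illustration of the model's form rather than kmerspectrumanalyzer's actual code.

from scipy.stats import nbinom

def nb_pmf(x, mu, size):
    # Negative binomial by mean and size: scipy's nbinom(n, p) has mean n*(1-p)/p,
    # so p = size / (size + mu).
    p = size / (size + mu)
    return nbinom.pmf(x, size, p)

def mixture_pmf(x, c, weights, s):
    # Sum of components centered at coverage c, 2c, 3c, ... weighted by a_n (cf. P_NO above).
    return sum(a_n * nb_pmf(x, mu=c * n, size=s / n)
               for n, a_n in enumerate(weights, start=1))

# Toy example: 30x coverage, 90% single-copy sequence, 10% two-copy sequence.
print(mixture_pmf(30, c=30.0, weights=[0.9, 0.1], s=20.0))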
  
Counting kmers tells you genome size
[Plot (Fig 2): estimated genome size (kb) vs. complete genome size (kb), both axes 0–10000 kb]
…for single genomes, most of the time.
So much for calibration data.

The kink does measure error
Artificial E. coli data varying substitution errors
[Plot: curves for substitution error rates of 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, and 0.1%]

But I want to sequence everything!
Ok, we can count kmers in everything too..
kmerspectrumanalyzer summarizes the distribution, estimates genome size and coverage depth.

How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which less than half of your new observations are already in the catalog.

How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which less than half of your new observations are already in the catalog.

We can calculate this efficiently using the kmer spectrum:

\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i,\, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i,\, 0)}
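A sketch of that calculation with scipy's Poisson CDF. Here r_i is a kmer abundance, n_i the number of distinct kmers at that abundance, and ε the subsampling fraction; reading the two Poisson terms as a ratio (kmers seen at least twice among those seen at all) is my interpretation of the slide, and the toy spectrum is illustrative.

from scipy.stats import poisson

def nonunique_fraction(eps, spectrum):
    # spectrum: list of (r_i, n_i) = (kmer abundance, number of distinct kmers).
    total = sum(n * r for r, n in spectrum)
    acc = 0.0
    for r, n in spectrum:
        lam = eps * r
        seen_once_or_more = 1.0 - poisson.cdf(0, lam)
        seen_twice_or_more = 1.0 - poisson.cdf(1, lam)
        if seen_once_or_more > 0:
            acc += (n * r / total) * (seen_twice_or_more / seen_once_or_more)
    return acc

toy = [(1, 500000), (30, 90000), (60, 4000)]
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, toy), 3))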
  
Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction summary of sequence diversity

Nonpareil – uses subset-against-all alignment to find out how much of a dataset is unique.
Nonpareil-k – crunches the kmer spectrum to approximate the unique fraction, 300x faster.

Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction summary of sequence diversity

Nonpareil-k: stratify datasets by coverage distribution
Most of this dataset is likely contained in the assembly.
The assembly is likely to miss or attenuate the large unique fraction of this dataset.

kmer spectra reveal sequencing problems
•  Amok PCR – seemingly random sequences
•  Amok MDA – 10 Gbases of sequence, one gene
•  PCR duplicates: an entire sequencing run was 50x exact- and near-exact duplicate reads
•  Unusually high error rate: indicated by a low fraction of "solid" kmers (for isolate genomes)
•  Contaminated samples: 95% E. coli, 5% E. faecalis

Hey kid, you want some unlabeled data?
HMP / quantile norm / euclidean / colored by alpha
MG-RAST API
R-package matR

Hey kid, you want some pretty ordinations?
[Figure 2a]

Generalities from the kmer counting mines
•  Many datasets have as much as 5-45% of the sequence yield in adapters.
•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
•  Shannon entropy is oversensitive to errors.  Higher-order Rényi entropy is more stable.

kmer statistical summaries
•  H0 kmer richness  (VERY BAD)
•  H1 Shannon entropy  (BAD)
•  H2 Rényi entropy / Simpson index  (GOOD)
•  observation-weighted coverage  (BAD)
•  observation-weighted size  (BAD)
•  observation-median coverage  (GOOD)
•  observation-median size  (GOOD)
•  fraction in top 100 kmers  (USEFUL)
•  fraction unique  (OK but requires size correction)

kmer statistical summaries
•  H0 kmer richness  (VERY BAD)
•  H1 Shannon entropy  (BAD)
•  H2 Rényi entropy / Simpson index  (GOOD)
•  observation-weighted coverage  (BAD)
•  observation-weighted size  (BAD)
•  observation-median coverage  (GOOD)
•  observation-median size  (GOOD)
•  fraction in top 100 kmers  (USEFUL)
•  fraction unique  (OK but requires size correction)

Most of these give answers which vary so strongly with sampling depth as to be unusable.
Observation-weighted fraction-of-data metrics behave fairly well.
Fractions of the data with particular properties are stable with respect to sampling.
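A sketch of a few of these summaries computed from a kmer spectrum: H0, H1, and H2 of the abundance-weighted kmer distribution, plus the unique fraction. The GOOD/BAD verdicts above are the slide's, about sensitivity to sampling depth, not about this code; the toy spectrum is illustrative.

import math

def kmer_summaries(spectrum):
    # spectrum: list of (abundance, number of distinct kmers at that abundance).
    distinct = sum(n for _, n in spectrum)            # total distinct kmers
    observations = sum(a * n for a, n in spectrum)    # total kmer observations (data size)
    # each kmer of abundance a has probability a / observations of being drawn
    h1 = sum(n * (a / observations) * math.log2(observations / a) for a, n in spectrum)
    simpson = sum(n * (a / observations) ** 2 for a, n in spectrum)
    return {
        "H0_richness_bits": math.log2(distinct),
        "H1_shannon_bits": h1,
        "H2_renyi_bits": -math.log2(simpson),         # order-2 (collision) entropy
        "fraction_unique": sum(n for a, n in spectrum if a == 1) / distinct,
    }

print(kmer_summaries([(1, 500000), (30, 90000), (60, 4000)]))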
  	
  	
  
thumbnailpolish
http://www.mcs.anl.gov/~trimble/flowcell/

Sometimes the sequencer has a bad day.

Metagenomic annotation group
Folker Meyer
Elizabeth Glass
Narayan Desai
Kevin Keegan
Adina Howe
Wolfgang Gerlach
Wei Tang
Travis Harrison
Jared Bishof
Dan Braithwaite
Hunter Matthews
Sarah Owens

Formerly of Yale:
Howard Ochman
David Williams

Georgia Tech:
Kostas Konstantinidis
Luis Rodriguez-Rojas

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Adina and I volunteer with

We teach scientists how to get more done
Woods Hole
Tufts
U. Chicago
U. Chicago
UIC
