All kmers are not created equal:
finding the signal from the noise in large-scale metagenomes.

Will Trimble
Metagenomic annotation group
Argonne National Laboratory

BEACON seminar, April 23, 2014, MSU

Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to answer questions with ambiguous data.

Apology: I speak biology with an accent
•  I spent six years in dark rooms with lasers.
•  Now I use computers to analyze high-throughput sequence data.
•  I introduce myself as an applied mathematician.
•  Finding scoring functions to answer questions with ambiguous data.
•  Shoveling data from the data-producing machine into the data-consuming furnace.

Outline
•  Sequences are different
•  How much did my sequencing run give me?  kmerspectrumanalyzer
•  How much did I sample?  nonpareil-k
•  Pretty pictures  thumbnailpolish

Outline
•  Sequences are different  (math)
•  How much did my sequencing run give me?  kmerspectrumanalyzer (graphs)
•  How much did I sample?  nonpareil-k (graphs)
•  Pretty pictures  thumbnailpolish (micrographs)

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

Instrument readings, spectra, micrographs: not categorical.
Low-throughput categorical data: categories are sound.
High-throughput sequence data: categorization is an art.

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGII
@HWI-ST1035:125:D1K4CACXX:8:1101:1190
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGI
@HWI-ST1035:125:D1K4CACXX:8:1101:1339
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJ

Instrument readings, spectra, micrographs: not categorical.
Low-throughput categorical data: categories are sound.
High-throughput sequence data: categorization is an art.

10^7 channels · 10^3 channels · 10^11 channels

Sequences are different
•  Sequencing produces sequences.  Sequences are qualitatively different from all other data types.
•  Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

We know what to do with these puzzles.  You go to this website, and type it in…

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

How long do reads need to be to recognize them?

Searching
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
"be regarded as unproved until it has been checked against more exact results"

How long do reads need to be to recognize them?  To do what, to place on a reference genome?

This can be turned into a math problem that I will illustrate with a search engine analogy.

How long do reads need to be?
Information (Shannon, 1949, BSTJ):

H = \sum_i p_i \log_2\!\left(\frac{1}{p_i}\right)

is a quantitative summary of the uncertainty of a probability distribution – a model of the data.

Profound applicability in pattern matching + modeling.
Logarithmic measurements have units!

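A minimal numerical sketch of this definition (my addition, not from the slides): Shannon entropy of a categorical distribution in bits, computed in Python. The uniform distribution over A, C, G, T gives the 2 bits per base used a few slides later.

import math

def shannon_entropy_bits(probs):
    # H = sum_i p_i * log2(1/p_i); zero-probability outcomes contribute nothing
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(shannon_entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform bases: 2.0 bits
print(shannon_entropy_bits([0.4, 0.4, 0.1, 0.1]))      # skewed composition: ~1.72 bits
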
A word on the sign of the entropy
•  A popular straw man among-mathematicians-and-CS-people is the "random sequence model."  Uniform categorical distribution over all 4^L sequences.
•  When we learn something—like we collect some genomes and expect our new sequences to look like them—we implicitly construct a less flat distribution.  Models always have less entropy than the model of ignorance.

How long do phrases need to be?
Exercise:  Pick a book from your bookshelf.  Pick an arbitrary page and arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.
How long do phrases need to be?
•  Information content of English words:  H_word is ca. 12 bits per word.
•  Size of google books?  Big libraries have a few 10^7 books, each one has 10^5 indexed words ….so a database size of 10^12 words.
   log2(database size) = log2(10^12) = 39.9 ≈ 40 bits
•  So we expect on average 40 / 12 = 3.3 ≈ 4 words to be enough to find a phrase in google's index.  Try it.
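The slide's arithmetic as a runnable back-of-the-envelope check (a sketch; the 12 bits per word and the 10^12-word corpus size are the slide's own estimates, not measured values):

import math

bits_per_word = 12                       # slide's estimate for English text
corpus_words = 1e12                      # ~10^7 books x ~10^5 indexed words each
corpus_bits = math.log2(corpus_words)    # bits needed to name one word position
print(corpus_bits)                             # ~39.9 bits
print(corpus_bits / bits_per_word)             # ~3.3 words
print(math.ceil(corpus_bits / bits_per_word))  # 4 words in practice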
  
How long do phrases need to be?
Exercise:  Pick a book from your bookshelf.  Pick an arbitrary page and arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

Most often takes 4 words.

How long do phrases need to be?
•  Information content of English words:  H_word is ca. 12 bits per word.
•  Size of google books?  Big libraries have a few 10^7 books, each one has 10^5 indexed words ….so a database size of 10^12 words.
   log2(database size) = log2(10^12) = 39.9 ≈ 40 bits
•  So we expect on average 40 / 12 = 3.3 ≈ 4 words to be enough to find a phrase in google's index.  Try it.

Not all phrases are equally distinctive.

How long do reads need to be?
•  Maximum information content of ℓ base pairs:  H_read = 2ℓ bits per length-ℓ sequence.
•  Most long kmers are distinct:  genome of size G (ca 10^10 bp),
   log2(G) = log2(10^10) = 33.2 ≈ 34 bits
•  So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.
•  That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
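The same arithmetic for reads, as a sketch: at most 2 bits per base against log2 of a reference catalog of size G (the 10^10 bp figure is the slide's; substitute your own G).

import math

bits_per_base = 2            # maximum information content per base
G = 1e10                     # size of the reference catalog in bp (slide's figure)
bits_needed = math.log2(G)   # bits needed to name one position in the catalog
print(bits_needed)                             # ~33.2 bits
print(math.ceil(bits_needed / bits_per_base))  # 17 bp, the slide's answer
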
The data deluge
•  There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
•  The data has outgrown some favorite algorithms from the 1990s (BLAST).

Picture, if you will, a hiseq flowcell
What's in there?
Pairs of microbial genomes
Microbial transcriptomes + replicates
Environmental isolate genomes
Environmental extract sequencing
Preparation-intensive sequencing
Eukaryotic sequencing
Eukaryotic sequencing for variants

Picture, if you will, a hiseq flowcell
What's in there?
Pairs of microbial genomes
Microbial transcriptomes + replicates
Environmental isolate genomes
Environmental extract sequencing
Preparation-intensive sequencing
Eukaryotic sequencing
Eukaryotic sequencing for variants

Let's count kmers!

The kmer spectrum.
[Plot: number of kmers vs. 21-mer abundance for a microbial genome]

The kmer spectrum.
[Plot: number of kmers vs. 21-mer abundance for a microbial genome, rare to abundant]
Low-abundance errors; a peak that contains most of the genome; a high-abundance peak that contains multicopy genes; really-high-abundance stuff is often artifacts.
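A memory-naive sketch of how a kmer spectrum like this is built (a plain Python Counter with no reverse-complement canonicalization; production counters behind tools like kmerspectrumanalyzer are far more compact). The toy reads and k are illustrative.

from collections import Counter

def kmer_spectrum(reads, k=21):
    # Count every k-mer, then histogram the counts: abundance -> number of distinct kmers.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:      # skip ambiguous bases
                counts[kmer] += 1
    return sorted(Counter(counts.values()).items())

reads = ["CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTT",
         "CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTAT"]
print(kmer_spectrum(reads, k=21))    # e.g. [(1, 34)] when all 21-mers are distinct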
  
Ranked kmer spectrum
[Plot: 21-mer abundance vs. kmer rank (cumulative sum of number of kmers), rare to abundant]

Ranked kmers consumed
[Plot: 21-mer abundance vs. fraction of observed kmers, rare to abundant]
The data fraction is unusually stable.

Different kinds of data have different spectra.
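A sketch of the transformation behind the "ranked kmers consumed" view two slides back: walk the spectrum from abundant to rare and accumulate the fraction of all kmer observations (the data fraction) consumed so far. The toy spectrum is illustrative.

def consumed_fraction(spectrum):
    # spectrum: list of (abundance, number of distinct kmers) pairs.
    # Returns (abundance, cumulative fraction of observed kmers), most abundant first.
    total = sum(a * n for a, n in spectrum)
    out, running = [], 0
    for abundance, n_kmers in sorted(spectrum, reverse=True):
        running += abundance * n_kmers
        out.append((abundance, running / total))
    return out

toy = [(1, 500000), (30, 90000), (60, 4000)]   # error tail, genome peak, repeats
for abundance, frac in consumed_fraction(toy):
    print(abundance, round(frac, 3))           # 60 0.07 / 30 0.855 / 1 1.0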
  
Redundancy is good
•  OMG!  Check out these three sequences!  I've found the fourth, fifth, and sixth domains of life.
•  OMG!  I see this sequence 10 million times.
•  OMG!  There are more than 10 billion distinct 31mers in my dataset.  I only have 128 Gbytes of memory.
•  Error correction and diginorm somewhat amusingly strive for opposite ends.

Redundancy is good
•  OMG!  Check out these three sequences!  I've found the fourth, fifth, and sixth domains of life.
•  OMG!  I see this sequence 10 million times.
•  OMG!  There are more than 10 billion distinct 31mers in my dataset.  I only have 128 Gbytes of memory.
•  Error correction and diginorm somewhat amusingly strive for opposite ends.

Abundance-based inferences are better in the high-abundance part of the data.

kmerspectrumanalyzer: infer genome size and depth

P_{\mathrm{NO}}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NBpdf}\!\left(x;\ \mu = c\,n,\ \alpha = s/n\right)

Generalization of a mixed-Poisson model to estimate how much sequence is in each peak.
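A hedged sketch of evaluating a mixture of that shape with scipy; how the slide's α maps onto scipy's (n, p) negative-binomial parameterization is my guess, so treat this as an illustration of the model's form rather than kmerspectrumanalyzer's actual code.

from scipy.stats import nbinom

def nb_pmf(x, mu, size):
    # Negative binomial by mean and size: scipy's nbinom(n, p) has mean n*(1-p)/p,
    # so p = size / (size + mu).
    p = size / (size + mu)
    return nbinom.pmf(x, size, p)

def mixture_pmf(x, c, weights, s):
    # Sum of components centered at coverage c, 2c, 3c, ... weighted by a_n (cf. P_NO above).
    return sum(a_n * nb_pmf(x, mu=c * n, size=s / n)
               for n, a_n in enumerate(weights, start=1))

# Toy example: 30x coverage, 90% single-copy sequence, 10% two-copy sequence.
print(mixture_pmf(30, c=30.0, weights=[0.9, 0.1], s=20.0))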
  
Counting kmers tells you genome size
[Plot (Fig 2): estimated genome size (kb) vs. complete genome size (kb), both axes 0–10000 kb]
…for single genomes, most of the time.
So much for calibration data.

The kink does measure error
Artificial E. coli data varying substitution errors
[Plot: curves for substitution error rates of 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, and 0.1%]

But I want to sequence everything!
Ok, we can count kmers in everything too..
kmerspectrumanalyzer summarizes the distribution, estimates genome size and coverage depth.

How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which less than half of your new observations are already in the catalog.

How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which less than half of your new observations are already in the catalog.

We can calculate this efficiently using the kmer spectrum:

\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i,\, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i,\, 0)}
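A sketch of that calculation with scipy's Poisson CDF. Here r_i is a kmer abundance, n_i the number of distinct kmers at that abundance, and ε the subsampling fraction; reading the two Poisson terms as a ratio (kmers seen at least twice among those seen at all) is my interpretation of the slide, and the toy spectrum is illustrative.

from scipy.stats import poisson

def nonunique_fraction(eps, spectrum):
    # spectrum: list of (r_i, n_i) = (kmer abundance, number of distinct kmers).
    total = sum(n * r for r, n in spectrum)
    acc = 0.0
    for r, n in spectrum:
        lam = eps * r
        seen_once_or_more = 1.0 - poisson.cdf(0, lam)
        seen_twice_or_more = 1.0 - poisson.cdf(1, lam)
        if seen_once_or_more > 0:
            acc += (n * r / total) * (seen_twice_or_more / seen_once_or_more)
    return acc

toy = [(1, 500000), (30, 90000), (60, 4000)]
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, toy), 3))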
  
Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction summary of sequence diversity

Nonpareil – uses subset-against-all alignment to find out how much of a dataset is unique.
Nonpareil-k – crunches the kmer spectrum to approximate the unique fraction, 300x faster.

Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction summary of sequence diversity

Nonpareil-k: stratify datasets by coverage distribution
Most of this dataset is likely contained in the assembly.
The assembly is likely to miss or attenuate the large unique fraction of this dataset.

kmer spectra reveal sequencing problems
•  Amok PCR – seemingly random sequences
•  Amok MDA – 10 Gbases of sequence, one gene
•  PCR duplicates: an entire sequencing run was 50x exact- and near-exact duplicate reads
•  Unusually high error rate: indicated by a low fraction of "solid" kmers (for isolate genomes)
•  Contaminated samples: 95% E. coli, 5% E. faecalis

Hey kid, you want some unlabeled data?
HMP / quantile norm / euclidean / colored by alpha
MG-RAST API
R-package matR

Hey kid, you want some pretty ordinations?
[Figure 2a]

Generalities from the kmer counting mines
•  Many datasets have as much as 5-45% of the sequence yield in adapters.
•  FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
•  Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
•  Shannon entropy is oversensitive to errors.  Higher-order Rényi entropy is more stable.

kmer statistical summaries
•  H0 kmer richness  (VERY BAD)
•  H1 Shannon entropy  (BAD)
•  H2 Rényi entropy / Simpson index  (GOOD)
•  observation-weighted coverage  (BAD)
•  observation-weighted size  (BAD)
•  observation-median coverage  (GOOD)
•  observation-median size  (GOOD)
•  fraction in top 100 kmers  (USEFUL)
•  fraction unique  (OK but requires size correction)

kmer statistical summaries
•  H0 kmer richness  (VERY BAD)
•  H1 Shannon entropy  (BAD)
•  H2 Rényi entropy / Simpson index  (GOOD)
•  observation-weighted coverage  (BAD)
•  observation-weighted size  (BAD)
•  observation-median coverage  (GOOD)
•  observation-median size  (GOOD)
•  fraction in top 100 kmers  (USEFUL)
•  fraction unique  (OK but requires size correction)

Most of these give answers which vary so strongly with sampling depth as to be unusable.
Observation-weighted fraction-of-data metrics behave fairly well.
Fractions of the data with particular properties are stable with respect to sampling.
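A sketch of a few of these summaries computed from a kmer spectrum: H0, H1, and H2 of the abundance-weighted kmer distribution, plus the unique fraction. The GOOD/BAD verdicts above are the slide's, about sensitivity to sampling depth, not about this code; the toy spectrum is illustrative.

import math

def kmer_summaries(spectrum):
    # spectrum: list of (abundance, number of distinct kmers at that abundance).
    distinct = sum(n for _, n in spectrum)            # total distinct kmers
    observations = sum(a * n for a, n in spectrum)    # total kmer observations (data size)
    # each kmer of abundance a has probability a / observations of being drawn
    h1 = sum(n * (a / observations) * math.log2(observations / a) for a, n in spectrum)
    simpson = sum(n * (a / observations) ** 2 for a, n in spectrum)
    return {
        "H0_richness_bits": math.log2(distinct),
        "H1_shannon_bits": h1,
        "H2_renyi_bits": -math.log2(simpson),         # order-2 (collision) entropy
        "fraction_unique": sum(n for a, n in spectrum if a == 1) / distinct,
    }

print(kmer_summaries([(1, 500000), (30, 90000), (60, 4000)]))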
  	
  	
  
thumbnailpolish
http://www.mcs.anl.gov/~trimble/flowcell/

Sometimes the sequencer has a bad day.

Metagenomic annotation group
Folker Meyer
Elizabeth Glass
Narayan Desai
Kevin Keegan
Adina Howe
Wolfgang Gerlach
Wei Tang
Travis Harrison
Jared Bishof
Dan Braithwaite
Hunter Matthews
Sarah Owens

Formerly of Yale:
Howard Ochman
David Williams

Georgia Tech:
Kostas Konstantinidis
Luis Rodriguez-Rojas

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Adina and I volunteer with

We teach scientists how to get more done
Woods Hole
Tufts
U. Chicago
U. Chicago
UIC
