Search at Twitter

Search @twitter
Michael Busch
@michibusch
michael@twitter.com
buschmi@apache.org

Agenda
‣ Introduction
- Search Architecture
- Inverted Index 101
- Realtime Posting Lists

Introduction

• Twitter has more than 230 million monthly active users.
• 500 million tweets are sent per day.
• More than 300 billion tweets have been sent since the company's founding in 2006.
• Tweets-per-second world record: 33,388 TPS.
• More than 2 billion search queries per day.

Introduction

2008: Twitter acquires Summize (MySQL-based RT search engine)
2009-2010: Modified Lucene (Earlybird) ships and replaces MySQL indexes
2011: New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index
2012-2014: Tweet archive search on SSD with vanilla Lucene; new RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets


Realtime Search @twitter
Agenda
- Introduction
‣ Search Architecture


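Search Architecture

[Diagram: search architecture]
• Realtime path: RT stream -> raw tweets -> Analyzer/Partitioner -> analyzed tweets -> RT index (Earlybird)
• Archive path: Tweet archive (HDFS) -> raw tweets -> Mapreduce Analyzer -> analyzed tweets -> Archive index
• Search requests -> Blender -> RT index (Earlybird) and Archive index
• Legend: writes, searches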

Search Architecture
Analyzer/
Partitioner

• Pre-processes Tweets for indexing
• Analyzing (tokenization/normalization) of text
• Geo-coding, URL expansion, etc.
• Hash partitioning


Search Architecture
RT index (Earlybird)

• Modified Lucene index implementation optimized for realtime search
• IndexWriter buffer is searchable (no need to flush to allow searching)
• In-memory
• Hash-partitioned, static layout

Cluster layout

[Diagram: grid of Earlybird servers]
• n hash partitions (docId % n)
• Each partition is served by several replicas
• Timeslices: a writable timeslice and complete timeslices
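The `docId % n` rule from the cluster layout can be sketched in a few lines. This is only an illustration: the function name `partition_for` and the choice of four partitions are assumptions, not something the slides specify.

```python
def partition_for(doc_id: int, num_partitions: int) -> int:
    """Return the hash partition that owns a document (docId % n)."""
    return doc_id % num_partitions


# Hypothetical example: route six doc IDs across n = 4 partitions.
doc_ids = [5, 15, 9000, 9002, 100000, 100090]
assignment = {doc_id: partition_for(doc_id, 4) for doc_id in doc_ids}
```

Because the layout is static, a given document always maps to the same partition, so a search has to fan out to every partition while a write touches exactly one.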

Search Architecture
Mapreduce Analyzer

• Daily jobs that process raw tweets
• Analyzes text
• Aggregates metadata and signals


Search Architecture
Archive index

• Standard Lucene (4.4) indexes
• Reverse time-sorted (new to old)
• Cluster layout similar to realtime search cluster

Search Architecture
Archive index

• Two tiers: in-memory and on SSD
• In-memory index: contains a small number of the best tweets of all time
• SSD index: much bigger index with more tweets; lower max. QPS, limited by SSD IOPS. It only needs to be queried if the in-memory index did not yield enough results
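The two-tier lookup can be sketched as a simple fallback. Everything named here (`search_tiered`, the two stub tiers, the tweet labels) is hypothetical and only illustrates the rule that the SSD tier is queried when the in-memory tier returns too few results.

```python
def search_tiered(query, in_memory_search, ssd_search, num_results):
    """Query the small in-memory tier first; fall back to the SSD tier only if needed."""
    hits = list(in_memory_search(query, num_results))
    if len(hits) >= num_results:
        return hits[:num_results]
    # Not enough results: pay the SSD IOPS cost for the remainder.
    seen = set(hits)
    for hit in ssd_search(query, num_results):
        if len(hits) == num_results:
            break
        if hit not in seen:
            hits.append(hit)
            seen.add(hit)
    return hits


# Stub tiers for illustration only.
ssd_calls = []

def in_memory_tier(query, k):
    return ["t9", "t7"][:k]

def ssd_tier(query, k):
    ssd_calls.append(query)
    return ["t7", "t5", "t3"][:k]
```

With these stubs, a request for two results is answered from memory alone, while a request for four results also touches the SSD tier once.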

Search Architecture
Blender

• Blender is our Thrift service aggregator
• Queries multiple Earlybirds, merges results
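The "queries multiple Earlybirds, merges results" step can be sketched as a k-way merge. This is a simplification under stated assumptions: each partition is assumed to return tweet IDs sorted newest-first, the merge ranks purely by recency, and `blend` is a made-up name. The real Blender talks to Thrift services and may rank differently.

```python
import heapq
from itertools import islice


def blend(per_partition_hits, num_results):
    """Merge hit lists that are each sorted newest-first (descending ID)."""
    merged = heapq.merge(*per_partition_hits, reverse=True)
    # Only the first num_results items are pulled from the lazy merge.
    return list(islice(merged, num_results))


# Hypothetical responses from three partitions.
top = blend([[100090, 9000, 5], [100000, 15], [9002]], 3)
```

Because `heapq.merge` is lazy, asking for three results reads only as many items from each partition's list as the merge needs.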

Search Architecture

[Diagram: the same architecture extended with an update path. Components shown: Tweets, Analyzer/Partitioner, RT index (Earlybird), queue, Updates (deletes and engagement, e.g. retweets/favs), HDFS, Mapreduce Analyzer, Archive index, Blender, Search requests. Legend: writes, searches.]

Agenda
- Introduction
‣ Inverted Index 101


Inverted Index 101

Table with 6 documents:

1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006

Inverted Index 101

Dictionary and posting lists:

term     freq   posting list
and      1      <6>
big      2      <2> <3>
dark     1      <6>
did      1      <4>
gown     1      <2>
had      1      <3>
house    2      <2> <3>
in       5      <1> <2> <3> <5> <6>
keep     3      <1> <3> <5>
keeper   3      <1> <4> <5>
keeps    3      <1> <5> <6>
light    1      <6>
never    1      <4>
night    3      <1> <4> <5>
old      4      <1> <2> <3> <4>
sleep    1      <4>
sleeps   1      <6>
the      6      <1> <2> <3> <4> <5> <6>
town     2      <1> <3>
where    1      <4>

Inverted Index 101

Query: keeper
• Look up "keeper" in the dictionary: freq 3
• Read its posting list: <1> <4> <5>
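A small sketch that rebuilds the dictionary and posting lists for the six example documents and answers the "keeper" query. The tokenizer (lowercase, letters only) is an assumption chosen to reproduce the table; it is not the analyzer used in production.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def build_index(docs):
    """Map each term to the ascending list of doc IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # set() keeps one posting per (term, document) pair.
        for term in set(re.findall(r"[a-z]+", docs[doc_id].lower())):
            postings[term].append(doc_id)
    return dict(postings)


INDEX = build_index(DOCS)


def search(term):
    """Single-term query: a dictionary lookup followed by reading the posting list."""
    return INDEX.get(term.lower(), [])
```

Calling `search("keeper")` returns the same posting list as the table, and `len(INDEX["keeper"])` gives the document frequency.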

Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression:
• 5 is stored as 00000101: values 0 <= delta <= 127 need one byte
• 8985 is stored as 11000110 00011001: values 128 <= delta <= 16383 need two bytes
• First bit indicates whether next byte belongs to the same value
• Variable number of bytes - a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically

Read direction: old to new
• Each posting depends on previous one; decoding only possible in old-to-new direction
• With recency ranking (new-to-old) no early termination is possible

• By default Lucene uses a combination of delta encoding and VInt
compression
• VInts are expensive to decode
• Problem 1: How to traverse posting lists backwards?
• Problem 2: How to write a posting atomically?
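The delta plus VInt scheme can be sketched as follows, using the example doc IDs from the encoding slides. The byte layout here puts the most significant 7-bit group first so that 8985 comes out as `11000110 00011001`, matching the slide. Note that Lucene's own `writeVInt` emits the low-order seven bits first, so its byte order for multi-byte values differs from this picture. All function names are illustrative.

```python
def delta_encode(doc_ids):
    """Replace each ascending doc ID with its gap to the previous one."""
    deltas, prev = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def vint_encode(value):
    """Variable-byte encode a non-negative int, 7 payload bits per byte.

    The high bit of a byte is 1 when another byte of the same value follows.
    """
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # most significant group first, as drawn on the slide
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode_all(data):
    """Decode a byte stream front to back; there is no way to start at the end."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:
            values.append(current)
            current = 0
    return values


def delta_decode(deltas):
    """Rebuild absolute doc IDs; each one depends on all earlier deltas."""
    doc_ids, total = [], 0
    for delta in deltas:
        total += delta
        doc_ids.append(total)
    return doc_ids


DOC_IDS = [5, 15, 9000, 9002, 100000, 100090]
DELTAS = delta_encode(DOC_IDS)
ENCODED = b"".join(vint_encode(d) for d in DELTAS)
```

The six postings take 9 bytes here instead of 24 bytes as plain 32-bit ints, but both problems from the slide are visible: decoding must run old-to-new, and a posting can span several bytes.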


Agenda
- Introduction
‣ Realtime Posting Lists


Posting list encoding in Earlybird v1

int (32 bits) = docID (24 bits, max. 16.7M) | textPosition (8 bits, max. 255)

• Tweet text can only have 140 chars
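A minimal sketch of packing a v1 posting into one 32-bit value. The field widths come from the slide; placing the docID in the high 24 bits is an assumption, and the function names are made up. In Java the result would be a single `int`; Python integers are unbounded, so the range checks stand in for the fixed width.

```python
DOC_ID_BITS = 24
POSITION_BITS = 8
MAX_DOC_ID = (1 << DOC_ID_BITS) - 1      # 16,777,215 (about 16.7M)
MAX_POSITION = (1 << POSITION_BITS) - 1  # 255


def pack_posting(doc_id, text_position):
    """Combine docID and textPosition into one 32-bit posting."""
    if not 0 <= doc_id <= MAX_DOC_ID:
        raise ValueError("docID does not fit into 24 bits")
    if not 0 <= text_position <= MAX_POSITION:
        raise ValueError("textPosition does not fit into 8 bits")
    return (doc_id << POSITION_BITS) | text_position


def unpack_posting(posting):
    """Split a posting back into (docID, textPosition)."""
    return posting >> POSITION_BITS, posting & MAX_POSITION
```

Since every posting is self-contained, a reader can start decoding at any posting without looking at its neighbours.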

Posting list encoding in Earlybird v1
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Earlybird encoding: 5, 15, 9000, 9002, 100000, 100090 (absolute doc IDs, no deltas)

Read direction: backwards, newest posting first

Early query termination
E.g. if 3 results are requested, we can terminate after reading 3 postings
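Early termination with absolute doc IDs can be sketched as a backwards walk that stops once enough results are collected. The function name and the returned read counter are illustrative only.

```python
def newest_first(postings, num_results):
    """Walk the list from the most recent posting backwards and stop early."""
    results, reads = [], 0
    for doc_id in reversed(postings):
        reads += 1
        results.append(doc_id)
        if len(results) == num_results:
            break
    return results, reads


hits, postings_read = newest_first([5, 15, 9000, 9002, 100000, 100090], 3)
```

Only three of the six postings are touched here; with delta encoding the whole list would have to be decoded from the oldest posting first.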

Inverted index components

• Dictionary
• Parallel arrays: pointer to the most recently indexed posting for a term
• Posting list storage: ?

Posting lists storage - Objectives
• Store many single-linked lists of different lengths space-efficiently
• The number of Java objects should be independent of the number of lists or number of items in the lists
• Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding)
• Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)
• Traversal in backwards order

Memory management

• 4 int[] pools, each built from 32K int[] blocks
• Each pool can be grown individually by adding 32K blocks
• For simplicity we can forget about the blocks for now and think of the pools as contiguous, unbounded int[] arrays
• Small total number of Java objects (each 32K block is one object)

Memory management

Slice sizes of the four pools: 2^11, 2^7, 2^4, 2^1

• Slices can be allocated in each pool
• Each pool has a different, but fixed slice size

Adding and appending to a list

[Diagram: four pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list]

• Store first two postings in a slice of the first pool (slice size 2^1)
• When first slice is full, allocate another one in second pool
• Allocate a slice on each level as list grows
• On uppermost level one list can own multiple slices
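The growth pattern can be sketched by listing which slices a single posting list would own for a given length. This is a simplification: it assumes every slot in a slice holds a posting and ignores any space taken by pointers that link the slices. The name `slices_for` is hypothetical.

```python
SLICE_SIZES = [2**1, 2**4, 2**7, 2**11]  # one fixed slice size per pool


def slices_for(num_postings):
    """Return the sizes of the slices one list owns, in allocation order."""
    slices, remaining, level = [], num_postings, 0
    while remaining > 0:
        size = SLICE_SIZES[level]
        slices.append(size)
        remaining -= size
        # Move up one pool per slice; stay on the top level once reached.
        level = min(level + 1, len(SLICE_SIZES) - 1)
    return slices
```

Short lists, which are the common case for rare terms, waste at most a handful of slots, while long lists quickly move to 2^11-sized slices where the per-slice overhead is small.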

Posting list format v1

int (32 bits) = docID (24 bits, max. 16.7M) | textPosition (8 bits, max. 255)

• Tweet text can only have 140 chars

Addressing items
• Use 32 bit (int) pointers to address any item in any list unambiguously:

int (32 bits) = poolIndex (2 bits, 0-3) | sliceIndex (19-29 bits, depends on pool) | offset in slice (1-11 bits, depends on pool)

• Nice symmetry: Postings and address pointers both fit into a 32 bit int
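A sketch of the pointer layout, assuming the pool index sits in the top two bits, the offset in the lowest bits, and pool 0 is the pool with the smallest slices. The slide gives only the field widths, so this ordering and the function names are assumptions.

```python
# Offset width per pool = log2(slice size): 2^1, 2^4, 2^7, 2^11.
OFFSET_BITS = [1, 4, 7, 11]
POOL_SHIFT = 30  # 32 bits minus the 2-bit pool index


def encode_pointer(pool, slice_index, offset):
    """Pack (pool, slice, offset) into one 32-bit pointer."""
    if not 0 <= pool < len(OFFSET_BITS):
        raise ValueError("pool index must be 0-3")
    bits = OFFSET_BITS[pool]
    if not 0 <= offset < (1 << bits):
        raise ValueError("offset does not fit into this pool's slice size")
    if not 0 <= slice_index < (1 << (POOL_SHIFT - bits)):
        raise ValueError("slice index out of range for this pool")
    return (pool << POOL_SHIFT) | (slice_index << bits) | offset


def decode_pointer(pointer):
    """Recover (pool, slice, offset); the pool decides how the rest is split."""
    pool = pointer >> POOL_SHIFT
    bits = OFFSET_BITS[pool]
    slice_index = (pointer >> bits) & ((1 << (POOL_SHIFT - bits)) - 1)
    offset = pointer & ((1 << bits) - 1)
    return pool, slice_index, offset
```

The slice index gets whatever the offset does not need: 29 bits in the pool with 2^1-sized slices and 19 bits in the pool with 2^11-sized slices, which matches the 19-29 bit range on the slide.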

Linking the slices

[Diagram: four pools with slice sizes 2^11, 2^7, 2^4, 2^1, with the slices of the current list linked together; legend: available, allocated, current list]

• Dictionary
• Parallel arrays: pointer to the last posting indexed for a term

Posting list encoding - Summary
• ints can be written atomically in Java
• Backwards traversal easy on absolute docIDs (not deltas)
• Every posting is a possible entry point for a searcher
• Skipping can be done without additional data structures as binary search,
though there are better approaches (skip lists)
• Repeating docIDs if a term occurs multiple times in the same document only
works for small docs
• Max. segment size: 2^24 = 16.7M tweets
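Skipping by binary search can be sketched on a flat, ascending array of v1 postings. This assumes the docID occupies the high 24 bits of each posting, so that the packed ints sort by docID, and it ignores the fact that a real list is split across slices. The function name is made up.

```python
from bisect import bisect_left

POSITION_BITS = 8


def skip_to(postings, target_doc_id):
    """Index of the first posting with docID >= target, or len(postings)."""
    # The smallest possible posting for the target docID has textPosition 0.
    return bisect_left(postings, target_doc_id << POSITION_BITS)


# Example postings: the six doc IDs from the slides, all at text position 3.
POSTINGS = [(doc_id << POSITION_BITS) | 3
            for doc_id in [5, 15, 9000, 9002, 100000, 100090]]
```

No extra index is needed because every posting carries an absolute docID; with delta encoding a skip structure would be required to jump into the middle of the list.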


New posting list encoding
• Objectives:
• 32 bit positions and variable-length payloads
• Store term frequency (TF) instead of repeating docIDs
• Keep:
• Concurrency model
• Space-efficiency for short documents
• Performance

[Diagram: two parallel streams]
• Stream 1: ... | DocID, termFreq | DocID, termFreq | DocID, termFreq | ... (fixed length for each posting, 32 bits)
• Stream 2: ... | Position, Payload | Position, Payload, Position | ... | Position, Payload (variable length)

• Store TF instead of repeating the same DocID
• Store DocID/TF pairs separately from position/payloads
• Find a way to synchronously decode the two streams without storing a pointer for each posting (expensive)


• Idea: Use an embedded skip list as periodic "synchronization points"
• Keeps memory overhead for pointers low and improves search performance

[Diagram: four pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list]

Slice header

• Header contains:
• Back-pointer to previous slice (as before)
• Skip list
• Slice id

int (32 bits) = docID (24 bits, max. 16.7M) | textPosition (8 bits, max. 255)

• Observation: Most tweets don't need all 8 bits for text position
• Idea: Use the position "inlining" approach for short documents, but support Lucene's 32-bit positions and variable length payloads

int (32 bits) = docID (24 bits, max. 16.7M) | textPosition or termFreq (7 bits, max. 127) | flag (1 bit: 0=textPosition, 1=termFreq)

As a storage optimization, the text position is stored with the docID if:
• termFreq == 1 (term occurs once only in the doc) AND
• textPosition <= 127 AND
• Posting has no payload AND
• Posting is not at a skip point of the docID posting list (see later).
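The inlining rule can be sketched as follows. The field order (docID in the high 24 bits, then the 7-bit value, then the flag in the lowest bit) follows the order in which the slide lists the fields but is still an assumption, as are the function names. The slides do not say how term frequencies above 127 are stored, so this sketch simply rejects them.

```python
def encode_posting(doc_id, term_freq, text_position,
                   has_payload=False, at_skip_point=False):
    """Pack docID plus either an inlined text position or the term frequency."""
    inline_position = (
        term_freq == 1
        and text_position <= 127
        and not has_payload
        and not at_skip_point
    )
    if inline_position:
        value, flag = text_position, 0
    else:
        if term_freq > 127:
            raise ValueError("term frequencies above 127 are not covered by this sketch")
        # Positions and payloads would go to the separate variable-length stream.
        value, flag = term_freq, 1
    return (doc_id << 8) | (value << 1) | flag


def decode_posting(posting):
    """Return (docID, value, is_term_freq)."""
    return posting >> 8, (posting >> 1) & 0x7F, bool(posting & 1)
```

For a typical tweet, where a term occurs once at a small position, the posting still fits into a single 32-bit value, so the space cost stays the same as in the v1 format.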

New posting list encoding - Summary
• Support for 32 bit positions and arbitrary length payloads stored in separate
data structure
• Performance and space consumption very similar compared to previous
encoding for tweet search
• Skip lists used for speed and synchronization points
• For short documents positions can still be inlined


Questions?

Previous talk: http://vimeo.com/31195040