Search @twitter
Michael Busch
@michibusch
michael@twitter.com
buschmi@apache.org

1
Search @twitter
Agenda
‣ Introduction
- Search Architecture
- Inverted Index 101
- Realtime Posting Lists

2
Introduction

3
Introduction

Twitter has more than 230 million
monthly active users.

4
Introduction

500 million tweets are sent per day.

5
Introduction

More than 300 billion tweets have been
sent since company founding in 2006.

6
Introduction

Tweets-per-second world record:
33,388 TPS.

7
Introduction

More than 2 billion search queries per
day.

8
Introduction
2008

Twitter acquires Summize (MySQL-based RT search engine)

2009
2010

Modified Lucene (Earlybird) ships and replaces MySQL indexes

2011

New Earlybird features: image/video search; index compression;
efficient relevance search in time-sorted index

2012
2013
2014

Tweet archive search on SSD with vanilla Lucene
New RT posting list format that supports arbitrary document
lengths, but keeps performance optimizations for tweets
9
Realtime Search @twitter
Agenda
- Introduction
‣ Search Architecture
- Inverted Index 101
- Realtime Posting Lists

14
Search Architecture

15
Search Architecture
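(Diagram) Two indexing pipelines feed the indexes that the Blender searches:
RT stream -> raw tweets -> Analyzer/Partitioner -> analyzed tweets -> RT index (Earlybird)
Tweet archive (HDFS) -> raw tweets -> Mapreduce Analyzer -> analyzed tweets -> Archive index
Search requests -> Blender -> RT index (Earlybird) and Archive index
Legend: writes, searches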
16
Search Architecture
Analyzer/
Partitioner

• Pre-processes Tweets for indexing
• Analyzing (tokenization/normalization) of text
• Geo-coding, URL expansion, etc.
• Hash partitioning

17
Search Architecture
RT index (Earlybird)

• Modified Lucene index implementation optimized for realtime search
• IndexWriter buffer is searchable (no need to flush to allow searching)
• In-memory
• Hash-partitioned, static layout

19
Cluster layout
(Diagram) A grid of Earlybird servers:
• n hash partitions (docId % n)
• Replicas: several Earlybirds per partition
• Timeslices: one writable timeslice, followed by complete timeslices

23
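The "docId % n" partitioning from the cluster layout can be sketched as follows. This is a minimal illustration only: the class name PartitionRouter and its method are hypothetical and not taken from Earlybird.

```java
// Hypothetical sketch of hash partitioning by "docId % n"; not Earlybird code.
final class PartitionRouter {
    private final int numPartitions;

    PartitionRouter(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Maps a document ID to a partition index in [0, n). */
    int partitionFor(long docId) {
        // floorMod keeps the result non-negative even if an ID were negative.
        return (int) Math.floorMod(docId, (long) numPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(4);
        for (long docId : new long[] {5, 15, 9000, 9002}) {
            System.out.println(docId + " -> partition " + router.partitionFor(docId));
        }
    }
}
```

Because the layout is static, every writer and every searcher can compute the same partition for a document without a lookup table.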
Search Architecture
Mapreduce
Analyzer

• Daily jobs that process raw tweets
• Analyzes text
• Aggregates metadata and signals

26
Search Architecture
Archive index

• Standard Lucene (4.4) indexes
• Reverse time-sorted (new to old)
• Cluster layout similar to realtime search cluster

28
Search Architecture
Archive index

• Two tiers: In-memory and on SSD
• In-memory index: Contains small number of best tweets of all time
• SSD index: Much bigger index with more tweets, less max. QPS, limited by SSD IOPS. Only needs to be queried if in-memory index did not yield enough results

31
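The two-tier lookup can be sketched as below: query the in-memory tier first and touch the SSD tier only when results are missing. The Tier interface and the class name are assumptions made for illustration; the slides do not show the actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tiered archive lookup; names are illustrative only.
final class TieredArchiveSearch {
    /** Assumed tier abstraction: returns up to `limit` tweet IDs for a query. */
    interface Tier {
        List<Long> search(String query, int limit);
    }

    private final Tier inMemory;
    private final Tier ssd;

    TieredArchiveSearch(Tier inMemory, Tier ssd) {
        this.inMemory = inMemory;
        this.ssd = ssd;
    }

    List<Long> search(String query, int wanted) {
        List<Long> results = new ArrayList<>();
        for (Long id : inMemory.search(query, wanted)) {
            if (results.size() == wanted) break;
            results.add(id);
        }
        if (results.size() < wanted) {
            // Only pay the SSD IOPS cost when the in-memory tier fell short.
            for (Long id : ssd.search(query, wanted)) {
                if (results.size() == wanted) break;
                if (!results.contains(id)) results.add(id);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Tier memory = (q, n) -> List.of(1L, 2L);
        Tier disk = (q, n) -> List.of(2L, 3L, 4L);
        System.out.println(new TieredArchiveSearch(memory, disk).search("lucene", 3));
    }
}
```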
Search Architecture

• Blender is our Thrift service aggregator
• Queries multiple Earlybirds, merges results

(Diagram) Search requests -> Blender -> RT index (Earlybird) and Archive index
Legend: writes, searches
33
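One way to picture the "merges results" step is a k-way merge of per-partition result lists. The sketch below assumes each Earlybird returns tweet IDs sorted newest first and that a larger ID means a newer tweet; both are assumptions for illustration, and ResultMerger is a hypothetical name, not part of the Blender.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge per-partition hit lists into one newest-first list.
final class ResultMerger {
    /** Cursor over one partition's result list. */
    private static final class Cursor {
        final List<Long> ids;
        int pos;

        Cursor(List<Long> ids) {
            this.ids = ids;
        }

        long current() {
            return ids.get(pos);
        }
    }

    /** Each input list must already be sorted in descending ID order. */
    static List<Long> mergeNewestFirst(List<List<Long>> perPartition, int k) {
        // Max-heap on the current ID of each cursor.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Long.compare(b.current(), a.current()));
        for (List<Long> ids : perPartition) {
            if (!ids.isEmpty()) heap.add(new Cursor(ids));
        }
        List<Long> merged = new ArrayList<>();
        while (merged.size() < k && !heap.isEmpty()) {
            Cursor cursor = heap.poll();
            merged.add(cursor.current());
            cursor.pos++;
            if (cursor.pos < cursor.ids.size()) heap.add(cursor);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Long>> hits = List.of(
            List.of(100090L, 9000L, 5L), List.of(100000L, 15L), List.of(9002L));
        System.out.println(mergeNewestFirst(hits, 4));
    }
}
```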
Search Architecture
(Diagram) Components shown:
• Tweets, Analyzer/Partitioner, queue, RT index (Earlybird)
• Updates: Deletes/Engagement (e.g. retweets/favs)
• HDFS, Mapreduce Analyzer, Archive index
• Blender, Search requests
Legend: writes, searches
35
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
‣ Inverted Index 101
- Realtime Posting Lists

36
Inverted Index 101

37
Inverted Index 101
1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

Table with 6 documents

Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006

38
Inverted Index 101
term    freq  posting list
and     1     <6>
big     2     <2> <3>
dark    1     <6>
did     1     <4>
gown    1     <2>
had     1     <3>
house   2     <2> <3>
in      5     <1> <2> <3> <5> <6>
keep    3     <1> <3> <5>
keeper  3     <1> <4> <5>
keeps   3     <1> <5> <6>
light   1     <6>
never   1     <4>
night   3     <1> <4> <5>
old     4     <1> <2> <3> <4>
sleep   1     <4>
sleeps  1     <6>
the     6     <1> <2> <3> <4> <5> <6>
town    2     <1> <3>
where   1     <4>

Dictionary and posting lists (built from the table with 6 documents)
39
Inverted Index 101
Query: keeper

term    freq  posting list
keeper  3     <1> <4> <5>

Dictionary and posting lists
41
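The dictionary and posting lists for the six example documents can be rebuilt with a few lines of Java. This is a teaching sketch with a naive tokenizer (lowercase, split on non-letters); the class name TinyInvertedIndex is hypothetical and the code is unrelated to Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Teaching sketch of an inverted index; not how Lucene stores its postings.
final class TinyInvertedIndex {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Documents must be added in ascending docId order. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            List<Integer> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each document only once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) list.add(docId);
        }
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    static TinyInvertedIndex example() {
        String[] docs = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        TinyInvertedIndex index = new TinyInvertedIndex();
        for (int i = 0; i < docs.length; i++) index.add(i + 1, docs[i]);
        return index;
    }

    public static void main(String[] args) {
        System.out.println("keeper -> " + example().lookup("keeper"));
    }
}
```

Answering the query "keeper" is then a single dictionary lookup followed by a walk over that term's posting list.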
Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression:
• 5 is encoded as 00000101. Values 0 <= delta <= 127 need one byte
• 8985 is encoded as 11000110 00011001. Values 128 <= delta <= 16383 need two bytes
• First bit indicates whether next byte belongs to the same value
• Variable number of bytes - a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically

47
Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Delta encoding: 5, 10, 8985, 2, 90998, 90

Read direction: old to new

• Each posting depends on previous one; decoding only possible in old-to-new direction
• With recency ranking (new-to-old) no early termination is possible

48
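A small sketch of delta encoding plus VInt compression for the example doc IDs follows. The class DeltaVInt is hypothetical. It emits the least significant 7-bit group first, which is the order Lucene's DataOutput.writeVInt uses; the slide draws the most significant group first, so the byte order differs from the picture while the byte counts are the same.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch of delta + VInt encoding; DeltaVInt is not a Lucene class.
final class DeltaVInt {
    /** Replaces each doc ID by its distance to the previous one (first is taken from 0). */
    static int[] toDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    /** Writes 7 payload bits per byte; the high bit says "another byte follows". */
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int value : values) {
            int v = value;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    /** Decoding must start at the first byte: every doc ID depends on its predecessor. */
    static int[] decodeDocIds(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += value;
            docIds[i] = previous;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 15, 9000, 9002, 100000, 100090};
        byte[] encoded = encode(toDeltas(docIds));
        System.out.println(encoded.length + " bytes instead of " + 4 * docIds.length);
    }
}
```

The decoder makes both problems visible: a reader cannot start in the middle of the byte stream, and a single posting may span several bytes, so it cannot be published with one atomic write.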
Posting list encoding
• By default Lucene uses a combination of delta encoding and VInt
compression
• VInts are expensive to decode
• Problem 1: How to traverse posting lists backwards?
• Problem 2: How to write a posting atomically?

49
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
- Inverted Index 101
‣ Realtime Posting Lists

50
Realtime Posting Lists

51
Posting list encoding in Earlybird v1
int (32 bits)
docID: 24 bits, max. 16.7M
textPosition: 8 bits, max. 255

• Tweet text can only have 140 chars

52
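Packing a posting into one int can be sketched as below. The slide gives only the field widths; placing docID in the upper 24 bits and textPosition in the lower 8 bits is an assumption, and PostingV1 is a hypothetical name.

```java
// Sketch of the 24/8-bit posting layout; the bit order is an assumption.
final class PostingV1 {
    static final int MAX_DOC_ID = (1 << 24) - 1;   // 16,777,215
    static final int MAX_POSITION = (1 << 8) - 1;  // 255

    static int pack(int docId, int textPosition) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        if (textPosition < 0 || textPosition > MAX_POSITION) {
            throw new IllegalArgumentException("textPosition out of range: " + textPosition);
        }
        return (docId << 8) | textPosition;
    }

    static int docId(int posting) {
        // Unsigned shift: the top bit of a large docId must not be read as a sign.
        return posting >>> 8;
    }

    static int textPosition(int posting) {
        return posting & 0xFF;
    }

    public static void main(String[] args) {
        int posting = pack(100090, 17);
        System.out.println(docId(posting) + " @ " + textPosition(posting));
    }
}
```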
Posting list encoding in Earlybird v1
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Earlybird encoding: 5, 15, 9000, 9002, 100000, 100090

Read direction: new to old

53
Early query termination
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

Earlybird encoding: 5, 15, 9000, 9002, 100000, 100090

Read direction: new to old
E.g. 3 results are requested: Here we can terminate after reading 3 postings

54
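Early termination on absolute doc IDs can be sketched as a backwards scan that stops once enough postings were read. The sketch assumes the postings sit in one int[] with the docID in the upper 24 bits; BackwardsScan is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of new-to-old traversal with early termination; the layout is assumed.
final class BackwardsScan {
    /**
     * Reads at most `wanted` postings, starting at the most recently appended one.
     * `size` is the number of valid entries in `postings`.
     */
    static List<Integer> newestDocIds(int[] postings, int size, int wanted) {
        List<Integer> hits = new ArrayList<>();
        for (int i = size - 1; i >= 0 && hits.size() < wanted; i--) {
            hits.add(postings[i] >>> 8); // drop the 8 position bits
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 15, 9000, 9002, 100000, 100090};
        int[] postings = new int[docIds.length];
        for (int i = 0; i < docIds.length; i++) postings[i] = docIds[i] << 8;
        System.out.println(newestDocIds(postings, postings.length, 3));
    }
}
```

With the delta-encoded format this is not possible, because the newest doc ID is only known after decoding every older posting.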
Inverted index components
Posting list storage

?
Dictionary

Parallel arrays

pointer to the most recently
indexed posting for a term

55
Posting lists storage - Objectives
• Store many singly linked lists of different lengths space-efficiently
• The number of Java objects should be independent of the number of lists or
number of items in the lists
• Every item should be a possible entry point into the lists for iterators, i.e.
items should not be dependent on other items (e.g. no delta encoding)
• Append and read possible by multiple threads in a lock-free fashion (single
append thread, multiple reader threads)
• Traversal in backwards order

57
Memory management
• 4 int[] pools, each made of blocks (one block = 32K int[])
• Each pool can be grown individually by adding 32K blocks

59
Memory management
4 int[]
pools

• For simplicity we can forget about the blocks for now and think of the pools
as continuous, unbounded int[] arrays
• Small total number of Java objects (each 32K block is one object)

60
Memory management
Slice sizes of the four pools: 2^11, 2^7, 2^4, 2^1

• Slices can be allocated in each pool
• Each pool has a different, but fixed slice size

61
Adding and appending to a list
(Diagram) Four pools with slice sizes 2^11, 2^7, 2^4, 2^1; slices are marked as available, allocated, or part of the current list

• Store first two postings in this slice (pool with slice size 2^1)
• When first slice is full, allocate another one in second pool
• Allocate a slice on each level as list grows
• On uppermost level one list can own multiple slices

66
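The growth pattern can be sketched by counting how many slices a list owns in each pool after a number of appends. For simplicity the sketch assumes every slot of a slice holds a posting and ignores any slots used for linking; SlicePools is a hypothetical name.

```java
// Sketch of slice allocation across the four pools; header slots are ignored.
final class SlicePools {
    // Slice sizes from the smallest pool to the largest: 2^1, 2^4, 2^7, 2^11.
    static final int[] SLICE_SIZES = {1 << 1, 1 << 4, 1 << 7, 1 << 11};

    /** Returns how many slices a list with `postings` entries owns in each pool. */
    static int[] slicesPerPool(int postings) {
        int[] slices = new int[SLICE_SIZES.length];
        int remaining = postings;
        int level = 0;
        while (remaining > 0) {
            slices[level]++;
            remaining -= SLICE_SIZES[level];
            // Move up one level per full slice; the top level can repeat.
            if (level < SLICE_SIZES.length - 1) level++;
        }
        return slices;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 3, 147, 2195}) {
            System.out.println(n + " postings -> " + java.util.Arrays.toString(slicesPerPool(n)));
        }
    }
}
```

Short lists, which are the common case, therefore waste at most a few ints, while long lists quickly move to 2^11-entry slices.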
Posting list format v1
int (32 bits)
docID: 24 bits, max. 16.7M
textPosition: 8 bits, max. 255

• Tweet text can only have 140 chars

67
Addressing items
• Use 32 bit (int) pointers to address any item in any list unambiguously:

int (32 bits)
poolIndex: 2 bits, 0-3
sliceIndex: 19-29 bits, depends on pool
offset in slice: 1-11 bits, depends on pool

• Nice symmetry: Postings and address pointers both fit into a 32 bit int

68
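The pointer layout can be sketched as follows. The offset needs as many bits as the pool's slice size requires (1, 4, 7 or 11), which leaves 29, 26, 23 or 19 bits for the slice index. Putting poolIndex in the top two bits and numbering the pools from the smallest slice size upwards are assumptions; SlicePointer is a hypothetical name.

```java
// Sketch of 32-bit item addressing; field order and pool numbering are assumed.
final class SlicePointer {
    // Offset bits per pool, for slice sizes 2^1, 2^4, 2^7 and 2^11.
    static final int[] OFFSET_BITS = {1, 4, 7, 11};

    static int encode(int pool, int sliceIndex, int offset) {
        int bits = OFFSET_BITS[pool];
        if (offset < 0 || offset >= (1 << bits)) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (sliceIndex < 0 || sliceIndex >= (1 << (30 - bits))) {
            throw new IllegalArgumentException("sliceIndex out of range: " + sliceIndex);
        }
        return (pool << 30) | (sliceIndex << bits) | offset;
    }

    static int pool(int pointer) {
        return pointer >>> 30;
    }

    static int sliceIndex(int pointer) {
        int bits = OFFSET_BITS[pool(pointer)];
        return (pointer & 0x3FFFFFFF) >>> bits;
    }

    static int offset(int pointer) {
        int bits = OFFSET_BITS[pool(pointer)];
        return pointer & ((1 << bits) - 1);
    }

    public static void main(String[] args) {
        int pointer = encode(3, 5, 2047);
        System.out.println(pool(pointer) + " / " + sliceIndex(pointer) + " / " + offset(pointer));
    }
}
```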
Linking the slices
(Diagram) The slices of the current list in the four pools (slice sizes 2^11, 2^7, 2^4, 2^1) are linked to each other

Dictionary / Parallel arrays: pointer to the last posting indexed for a term

70
Posting list encoding - Summary
• ints can be written atomically in Java
• Backwards traversal easy on absolute docIDs (not deltas)
• Every posting is a possible entry point for a searcher
• Skipping can be done without additional data structures as binary search,
though there are better approaches (skip lists)
• Repeating docIDs if a term occurs multiple times in the same document only
works for small docs
• Max. segment size: 2^24 = 16.7M tweets

71
New posting list encoding
• Objectives:
• 32 bit positions and variable-length payloads
• Store term frequency (TF) instead of repeating docIDs
• Keep:
• Concurrency model
• Space-efficiency for short documents
• Performance

72
New posting list encoding
(Diagram) Two parallel streams:
... | DocID, termFreq | DocID, termFreq | DocID, termFreq     (fixed length for each posting, 32 bits)
Position, Payload | Position, Payload, Position | ... | Position, Payload     (variable length)

• Store TF instead of repeating the same DocID
• Store DocID/TF pairs separately from position/payloads
• Find a way to synchronously decode the two streams without storing a
pointer for each posting (expensive)

79
New posting list encoding

• Idea: Use an embedded skip list as periodical “synchronization points”
• Keeps memory overhead for pointers low and improves search performance

80
New posting list encoding
(Diagram) Slices of the current list in the four pools (slice sizes 2^11, 2^7, 2^4, 2^1), each starting with a slice header

• Header contains:
  • Back-pointer to previous slice (as before)
  • Skip list
  • Slice id

82
New posting list encoding
int (32 bits)
docID: 24 bits, max. 16.7M
textPosition: 8 bits, max. 255

• Observation: Most tweets don’t need all 8 bits for text position
• Idea: Use the position “inlining” approach for short documents, but support
Lucene’s 32-bit positions and variable length payloads

83
New posting list encoding
int (32 bits)
docID: 24 bits, max. 16.7M
textPosition or termFreq: 7 bits, max. 127
1 bit: 0=textPosition, 1=termFreq

As a storage optimization, the text position is stored with the docID if:
o termFreq == 1 (term occurs once only in the doc) AND
o textPosition <= 127 AND
o Posting has no payload AND
o Posting is not at a skip point of the docID posting list (see later).

84
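The inlining rule can be sketched as an encoder that chooses between the two meanings of the 7-bit field. The bit order (docID in the upper 24 bits, then the 7-bit value, flag in the lowest bit) is an assumption, as is rejecting term frequencies above 127, which the slides do not cover. PostingV2 is a hypothetical name.

```java
// Sketch of the docID + (textPosition | termFreq) posting; bit order is assumed.
final class PostingV2 {
    static final int FLAG_TERM_FREQ = 1; // lowest bit: 0=textPosition, 1=termFreq

    /** The four conditions from the slide for storing the position inline. */
    static boolean canInlinePosition(int termFreq, int textPosition,
                                     boolean hasPayload, boolean atSkipPoint) {
        return termFreq == 1
            && textPosition >= 0 && textPosition <= 127
            && !hasPayload
            && !atSkipPoint;
    }

    static int encode(int docId, int termFreq, int textPosition,
                      boolean hasPayload, boolean atSkipPoint) {
        if (docId < 0 || docId >= (1 << 24)) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        if (canInlinePosition(termFreq, textPosition, hasPayload, atSkipPoint)) {
            return (docId << 8) | (textPosition << 1); // flag bit stays 0
        }
        if (termFreq < 1 || termFreq > 127) {
            // Assumption of this sketch only; larger values are not handled here.
            throw new IllegalArgumentException("sketch supports termFreq 1..127");
        }
        return (docId << 8) | (termFreq << 1) | FLAG_TERM_FREQ;
    }

    static int docId(int posting) {
        return posting >>> 8;
    }

    static boolean holdsTermFreq(int posting) {
        return (posting & FLAG_TERM_FREQ) != 0;
    }

    /** Returns the textPosition or the termFreq, depending on the flag. */
    static int value(int posting) {
        return (posting >>> 1) & 0x7F;
    }

    public static void main(String[] args) {
        int inlined = encode(9000, 1, 42, false, false);
        int withFreq = encode(9000, 3, 200, false, false);
        System.out.println(holdsTermFreq(inlined) + " " + value(inlined));
        System.out.println(holdsTermFreq(withFreq) + " " + value(withFreq));
    }
}
```

When the flag is set, the positions and payloads of that posting would live in the separate position stream described earlier.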
New posting list encoding - Summary
• Support for 32 bit positions and arbitrary length payloads stored in separate
data structure
• Performance and space consumption very similar compared to previous
encoding for tweet search
• Skip lists used for speed and synchronization points
• For short documents positions can still be inlined

85
Questions?
Michael Busch
@michibusch
michael@twitter.com
buschmi@apache.org

Previous talk: http://vimeo.com/31195040
86

More Related Content

PDF
Faceted Search with Lucene
PDF
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
PDF
Search at Twitter: Presented by Michael Busch, Twitter
PDF
Improved Search with Lucene 4.0 - Robert Muir
PDF
Recent Additions to Lucene Arsenal
PPT
Lucene Introduction
PPTX
Lucene indexing
PDF
Apache Lucene intro - Breizhcamp 2015
Faceted Search with Lucene
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Search at Twitter: Presented by Michael Busch, Twitter
Improved Search with Lucene 4.0 - Robert Muir
Recent Additions to Lucene Arsenal
Lucene Introduction
Lucene indexing
Apache Lucene intro - Breizhcamp 2015

What's hot (20)

PDF
What is in a Lucene index?
PPT
Lucene and MySQL
PDF
Elasticsearch speed is key
ODP
Search Lucene
PDF
Apache Solr/Lucene Internals by Anatoliy Sokolenko
PPTX
Hacking Lucene for Custom Search Results
ODP
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
PDF
Twitter Search Architecture
PDF
Berlin Buzzwords 2013 - How does lucene store your data?
PDF
High Performance JSON Search and Relational Faceted Browsing with Lucene
PDF
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
PDF
Flexible Indexing in Lucene 4.0
PPT
Lucece Indexing
PPT
Intelligent crawling and indexing using lucene
PPT
Lucene basics
PDF
Munching & crunching - Lucene index post-processing
PPTX
Apache lucene
PDF
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
PDF
Portable Lucene Index Format & Applications - Andrzej Bialecki
PDF
Rapid Prototyping with Solr
What is in a Lucene index?
Lucene and MySQL
Elasticsearch speed is key
Search Lucene
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Hacking Lucene for Custom Search Results
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Twitter Search Architecture
Berlin Buzzwords 2013 - How does lucene store your data?
High Performance JSON Search and Relational Faceted Browsing with Lucene
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Flexible Indexing in Lucene 4.0
Lucece Indexing
Intelligent crawling and indexing using lucene
Lucene basics
Munching & crunching - Lucene index post-processing
Apache lucene
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Portable Lucene Index Format & Applications - Andrzej Bialecki
Rapid Prototyping with Solr
Ad

Viewers also liked (15)

PDF
Real-time systems at Twitter (Velocity 2012)
PDF
Adapting Alax Solr to Compare different sets of documents - Joan Codina
PPTX
Personalizing Search at LinkedIn
PDF
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
PPTX
Ektron 8.5 RC - Search
PPTX
Streaming ETL for All
PDF
Events, Signals, and Recommendations
PPTX
Solr JDBC - Lucene/Solr Revolution 2016
PDF
Netflix Global Search - Lucene Revolution
PDF
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
PDF
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
PDF
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
PDF
Big & Personal: the data and the models behind Netflix recommendations by Xa...
PPTX
Netflix Cloud Architecture and Open Source
PPTX
Building production spark streaming applications
Real-time systems at Twitter (Velocity 2012)
Adapting Alax Solr to Compare different sets of documents - Joan Codina
Personalizing Search at LinkedIn
Anatomy of Relevance - From Data to Action: Presented by Saïd Radhouani, Yell...
Ektron 8.5 RC - Search
Streaming ETL for All
Events, Signals, and Recommendations
Solr JDBC - Lucene/Solr Revolution 2016
Netflix Global Search - Lucene Revolution
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Big & Personal: the data and the models behind Netflix recommendations by Xa...
Netflix Cloud Architecture and Open Source
Building production spark streaming applications
Ad

Similar to Search at Twitter (20)

PPTX
Elastic meetup june16
PPTX
Apache IOTDB: a Time Series Database for Industrial IoT
PDF
Roaring with elastic search sangam2018
PPTX
Elastic Stack Introduction
PDF
Avast Premium Security 24.12.9725 + License Key Till 2050
PDF
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
PDF
FastStone Capture 10.4 Crack + Serial Key [Latest]
PDF
EASEUS Partition Master 18.8 Crack + License Code [2025]
PDF
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
PDF
4K Video Downloader Crack (2025) + License Key Free
PDF
Capcut Pro Crack For PC Latest 2025 Full
PDF
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
PDF
minitool partition wizard crack 12.8 latest
PDF
Solr and ElasticSearch demo and speaker feb 2014
PDF
NoSQL Couchbase Lite & BigData HPCC Systems
PDF
What’s Evolving in the Elastic Stack
PPT
Lucene BootCamp
PDF
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
PDF
Apache CarbonData:New high performance data format for faster data analysis
PPTX
Tutorial(release)
Elastic meetup june16
Apache IOTDB: a Time Series Database for Industrial IoT
Roaring with elastic search sangam2018
Elastic Stack Introduction
Avast Premium Security 24.12.9725 + License Key Till 2050
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
FastStone Capture 10.4 Crack + Serial Key [Latest]
EASEUS Partition Master 18.8 Crack + License Code [2025]
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
4K Video Downloader Crack (2025) + License Key Free
Capcut Pro Crack For PC Latest 2025 Full
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
minitool partition wizard crack 12.8 latest
Solr and ElasticSearch demo and speaker feb 2014
NoSQL Couchbase Lite & BigData HPCC Systems
What’s Evolving in the Elastic Stack
Lucene BootCamp
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Apache CarbonData:New high performance data format for faster data analysis
Tutorial(release)

More from lucenerevolution (20)

PDF
Text Classification Powered by Apache Mahout and Lucene
PDF
State of the Art Logging. Kibana4Solr is Here!
PDF
Building Client-side Search Applications with Solr
PDF
Integrate Solr with real-time stream processing applications
PDF
Scaling Solr with SolrCloud
PDF
Administering and Monitoring SolrCloud Clusters
PDF
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
PDF
Using Solr to Search and Analyze Logs
PDF
Enhancing relevancy through personalization & semantic search
PDF
Real-time Inverted Search in the Cloud Using Lucene and Storm
PDF
Solr's Admin UI - Where does the data come from?
PDF
Schemaless Solr and the Solr Schema REST API
PDF
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
PDF
Turning search upside down
PDF
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
PDF
Shrinking the haystack wes caldwell - final
PDF
The First Class Integration of Solr with Hadoop
PDF
A Novel methodology for handling Document Level Security in Search Based Appl...
PDF
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
PDF
Query Latency Optimization with Lucene
Text Classification Powered by Apache Mahout and Lucene
State of the Art Logging. Kibana4Solr is Here!
Building Client-side Search Applications with Solr
Integrate Solr with real-time stream processing applications
Scaling Solr with SolrCloud
Administering and Monitoring SolrCloud Clusters
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Using Solr to Search and Analyze Logs
Enhancing relevancy through personalization & semantic search
Real-time Inverted Search in the Cloud Using Lucene and Storm
Solr's Admin UI - Where does the data come from?
Schemaless Solr and the Solr Schema REST API
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Turning search upside down
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Shrinking the haystack wes caldwell - final
The First Class Integration of Solr with Hadoop
A Novel methodology for handling Document Level Security in Search Based Appl...
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
Query Latency Optimization with Lucene

Recently uploaded (20)

PDF
Encapsulation theory and applications.pdf
PPT
Teaching material agriculture food technology
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
A Presentation on Artificial Intelligence
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Machine learning based COVID-19 study performance prediction
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
cuic standard and advanced reporting.pdf
PPTX
Big Data Technologies - Introduction.pptx
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Modernizing your data center with Dell and AMD
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Encapsulation theory and applications.pdf
Teaching material agriculture food technology
Digital-Transformation-Roadmap-for-Companies.pptx
A Presentation on Artificial Intelligence
CIFDAQ's Market Insight: SEC Turns Pro Crypto
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Reach Out and Touch Someone: Haptics and Empathic Computing
NewMind AI Weekly Chronicles - August'25 Week I
Machine learning based COVID-19 study performance prediction
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
cuic standard and advanced reporting.pdf
Big Data Technologies - Introduction.pptx
Network Security Unit 5.pdf for BCA BBA.
Understanding_Digital_Forensics_Presentation.pptx
Modernizing your data center with Dell and AMD
Mobile App Security Testing_ A Comprehensive Guide.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...

Search at Twitter

  • 2. Search @twitter Agenda ‣ Introduction - Search Architecture - Inverted Index 101 - Realtime Posting Lists 2
  • 4. Introduction Twitter has more than 230 million monthly active users. 4
  • 5. Introduction 500 million tweets are sent per day. 5
  • 6. Introduction More than 300 billion tweets have been sent since company founding in 2006. 6
  • 8. Introduction More than 2 billion search queries per day. 8
  • 9. Introduction 2008 Twitter acquires Summize (MySQL-based RT search engine) 2009 2010 Modified Lucene (Earlybird) ships and replaces MySQL indexes 2011 New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index 2012 2013 2014 Tweet archive search on SSD with vanilla Lucene New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets 9
  • 10. Introduction 2008 Twitter acquires Summize (MySQL-based RT search engine) 2009 2010 Modified Lucene (Earlybird) ships and replaces MySQL indexes 2011 New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index 2012 2013 2014 Tweet archive search on SSD with vanilla Lucene New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets 10
  • 11. Introduction 2008 Twitter acquires Summize (MySQL-based RT search engine) 2009 2010 Modified Lucene (Earlybird) ships and replaces MySQL indexes 2011 New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index 2012 2013 2014 Tweet archive search on SSD with vanilla Lucene New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets 11
  • 12. Introduction 2008 Twitter acquires Summize (MySQL-based RT search engine) 2009 2010 Modified Lucene (Earlybird) ships and replaces MySQL indexes 2011 New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index 2012 2013 2014 Tweet archive search on SSD with vanilla Lucene New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets 12
  • 13. Introduction 2008 Twitter acquires Summize (MySQL-based RT search engine) 2009 2010 Modified Lucene (Earlybird) ships and replaces MySQL indexes 2011 New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index 2012 2013 2014 Tweet archive search on SSD with vanilla Lucene New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets 13
  • 14. Realtime Search @twitter Agenda - Introduction ‣ Search Architecture - Inverted Index 101 - Realtime Posting Lists 14
  • 16. Search Architecture (diagram) • Realtime path: RT stream → raw tweets → Analyzer/Partitioner → analyzed tweets → RT index (Earlybird) • Archive path: Tweet archive (HDFS) → raw tweets → Mapreduce Analyzer → analyzed tweets → Archive index • Blender receives search requests and searches the indexes • Legend: writes, searches
  • 17. Search Architecture Analyzer/ Partitioner • Pre-processes Tweets for indexing • Analyzing (tokenization/normalization) of text • Geo-coding, URL expansion, etc. • Hash partitioning 17
  • 19. Search Architecture: RT index (Earlybird) • Modified Lucene index implementation optimized for realtime search • IndexWriter buffer is searchable (no need to flush to allow searching) • In-memory • Hash-partitioned, static layout
  • 21. Cluster layout (diagram) • n hash partitions (docId % n) • Each partition is served by several Earlybird replicas
  • 22. Cluster layout (diagram) • n hash partitions (docId % n) • Timeslices • Each partition of each timeslice is served by several Earlybird replicas
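The slide gives the partitioning rule as `docId % n`. A minimal sketch of that rule follows; the class and method names are hypothetical, and the slides do not show how Earlybird actually routes documents or handles timeslices and replicas.

```java
// Hypothetical sketch: routes a document to one of n hash partitions (docId % n).
final class PartitionRouter {
    private final int numPartitions;

    PartitionRouter(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    /** Returns a partition index in [0, numPartitions). */
    int partitionFor(long docId) {
        // floorMod keeps the result non-negative even if an id is negative.
        return (int) Math.floorMod(docId, (long) numPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(4);
        for (long docId : new long[] {5, 15, 9000, 9002}) {
            System.out.println(docId + " -> partition " + router.partitionFor(docId));
        }
    }
}
```

Because the layout is static, a query has to fan out to all n partitions, while a write for a given docId goes to exactly one of them.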
  • 26. Search Architecture Mapreduce Analyzer • Daily jobs that process raw tweets • Analyzes text • Aggregates metadata and signals 26
  • 28. Search Architecture: Archive index • Standard Lucene (4.4) indexes • Reverse time-sorted (new to old) • Cluster layout similar to realtime search cluster
  • 29. Search Architecture: Archive index • Two tiers: in-memory index and SSD index
  • 30. Search Architecture: Archive index • Two tiers: in-memory and on SSD • In-memory index: contains a small number of the best tweets of all time
  • 31. Search Architecture: Archive index • Two tiers: in-memory and on SSD • SSD index: much bigger index with more tweets and lower max. QPS, limited by SSD IOPS; only needs to be queried if the in-memory index did not yield enough results
  • 33. Search Architecture: Blender • Blender is our Thrift service aggregator • Queries multiple Earlybirds, merges results
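One way to picture the "merges results" step is a k-way merge of per-partition hit lists. The sketch below assumes, purely for illustration, that each Earlybird returns doc IDs sorted newest first and that recency is the only ranking signal; `BlenderMerge` and `mergeNewestFirst` are hypothetical names, not the Blender API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of merging per-partition results (each sorted newest first).
final class BlenderMerge {
    static List<Long> mergeNewestFirst(List<List<Long>> perPartition, int k) {
        // Heap entries are {partition index, position in that partition's list};
        // the entry pointing at the largest (newest) doc ID comes out first.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Long.compare(perPartition.get(b[0]).get(b[1]),
                                   perPartition.get(a[0]).get(a[1])));
        for (int p = 0; p < perPartition.size(); p++) {
            if (!perPartition.get(p).isEmpty()) {
                heap.add(new int[] {p, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (merged.size() < k && !heap.isEmpty()) {
            int[] head = heap.poll();
            List<Long> source = perPartition.get(head[0]);
            merged.add(source.get(head[1]));
            if (head[1] + 1 < source.size()) {
                heap.add(new int[] {head[0], head[1] + 1}); // advance within that partition
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Long>> hits = List.of(List.of(100L, 40L, 4L), List.of(90L, 50L), List.of());
        System.out.println(mergeNewestFirst(hits, 4));
    }
}
```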
  • 35. Search Architecture (diagram, update path) • Diagram labels: Tweets; Updates: Deletes/Engagement (e.g. retweets/favs); queue; Analyzer/Partitioner; RT index (Earlybird); HDFS; Mapreduce Analyzer; Archive index; Blender; Search requests • Legend: writes, searches
  • 36. Realtime Search @twitter Agenda - Introduction - Search Architecture ‣ Inverted Index 101 - Realtime Posting Lists 36
  • 38. Inverted Index 101: Table with 6 documents
      1  The old night keeper keeps the keep in the town
      2  In the big old house in the big old gown.
      3  The house in the town had the big old keep
      4  Where the old night keeper never did sleep.
      5  The night keeper keeps the keep in the night
      6  And keeps in the dark and sleeps in the light.
      Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR) v.38 n.2, p.6-es, 2006
  • 39. Inverted Index 101: Dictionary and posting lists for the 6 documents of slide 38
      term    freq  posting list
      and     1     <6>
      big     2     <2> <3>
      dark    1     <6>
      did     1     <4>
      gown    1     <2>
      had     1     <3>
      house   2     <2> <3>
      in      5     <1> <2> <3> <5> <6>
      keep    3     <1> <3> <5>
      keeper  3     <1> <4> <5>
      keeps   3     <1> <5> <6>
      light   1     <6>
      never   1     <4>
      night   3     <1> <4> <5>
      old     4     <1> <2> <3> <4>
      sleep   1     <4>
      sleeps  1     <6>
      the     6     <1> <2> <3> <4> <5> <6>
      town    2     <1> <3>
      where   1     <4>
  • 40. Inverted Index 101: same documents, dictionary and posting lists as on slide 39 • Query: keeper
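The dictionary and posting lists on these slides can be reproduced with a few lines of code. This is a minimal sketch, not Lucene's implementation: `TinyInvertedIndex` is a hypothetical class, and its analysis step (lower-casing and splitting on non-letters) is an assumption that happens to be sufficient for the six example documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: dictionary (sorted map) from term to ascending doc-ID posting list.
final class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Documents must be added in increasing docId order. */
    void add(int docId, String text) {
        // Simplistic analysis (an assumption): lower-case, split on anything but a-z.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document once per term, as in the slide's table.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    static TinyInvertedIndex zobelMoffatExample() {
        String[] docs = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        TinyInvertedIndex index = new TinyInvertedIndex();
        for (int i = 0; i < docs.length; i++) {
            index.add(i + 1, docs[i]);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(zobelMoffatExample().lookup("keeper")); // prints [1, 4, 5]
    }
}
```

Answering the query `keeper` is a single dictionary lookup followed by a walk over the posting list, without touching the documents themselves.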
  • 42. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 42
  • 43. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 43
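The delta step on this slide is plain subtraction of neighbouring doc IDs. The sketch below uses a hypothetical `DeltaCodec` class; it treats the ID before the first posting as 0, which matches the slide's first delta of 5.

```java
import java.util.Arrays;

// Hypothetical sketch of delta encoding for an ascending doc-ID list.
final class DeltaCodec {
    static int[] encode(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0; // the first delta is the first doc ID itself
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    static int[] decode(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int running = 0;
        // Each value depends on the running sum, so decoding only works front to back.
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            docIds[i] = running;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 15, 9000, 9002, 100000, 100090};
        System.out.println(Arrays.toString(encode(docIds))); // prints [5, 10, 8985, 2, 90998, 90]
    }
}
```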
  • 44. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 VInt compression of the delta 5: 00000101 • Values 0 <= delta <= 127 need one byte
  • 45. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 VInt compression of the delta 8985: 11000110 00011001 • Values 128 <= delta <= 16383 need two bytes
  • 46. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 VInt compression of the delta 8985: 11000110 00011001 • First bit indicates whether next byte belongs to the same value
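The byte patterns on these slides can be reproduced with a small variable-length encoder. This sketch writes the most significant 7-bit group first, because that is the order the slide shows for 8985; Lucene's own VInt format writes the low-order group first, so the class below (hypothetical name `VIntSketch`) illustrates the idea and is not byte-compatible with Lucene.

```java
// Hypothetical sketch of a variable-length int encoding (7 payload bits per byte,
// high bit set when another byte of the same value follows).
final class VIntSketch {
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        int groups = 1;
        for (int v = value >>> 7; v != 0; v >>>= 7) {
            groups++;
        }
        byte[] out = new byte[groups];
        for (int i = groups - 1; i >= 0; i--) {
            int bits = value & 0x7F;
            // Every byte except the last one carries the continuation flag.
            out[i] = (byte) (i == groups - 1 ? bits : bits | 0x80);
            value >>>= 7;
        }
        return out;
    }

    static int decode(byte[] bytes) {
        int value = 0;
        for (byte b : bytes) {
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                break; // flag cleared: this was the last byte of the value
            }
        }
        return value;
    }

    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("0".repeat(8 - s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(encode(5)));    // prints 00000101
        System.out.println(toBits(encode(8985))); // prints 11000110 00011001
    }
}
```

With 7 payload bits per byte, one byte covers 0 to 127 and two bytes cover 128 to 16383.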
  • 47. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 VInt compression of the delta 8985: 11000110 00011001 • Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically
  • 48. Posting list encoding Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Delta encoding: 5 10 8985 2 90998 90 Read direction • Each posting depends on previous one; decoding only possible in old-to-new direction • With recency ranking (new-to-old) no early termination is possible 48
  • 49. Posting list encoding • By default Lucene uses a combination of delta encoding and VInt compression • VInts are expensive to decode • Problem 1: How to traverse posting lists backwards? • Problem 2: How to write a posting atomically? 49
  • 50. Realtime Search @twitter Agenda - Introduction - Search Architecture - Inverted Index 101 ‣ Realtime Posting Lists 50
  • 52. Posting list encoding in Earlybird v1 • One posting = one int (32 bits): docID, 24 bits (max. 16.7M) | textPosition, 8 bits (max. 255) • Tweet text can only have 140 chars
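Packing a 24-bit docID and an 8-bit text position into one int is a matter of shifts and masks. In this sketch the docID sits in the upper 24 bits; the slide gives the field widths but not the bit order, so that placement and the `PostingV1` name are assumptions.

```java
// Hypothetical sketch of the v1 posting: 24-bit docID plus 8-bit text position in one int.
final class PostingV1 {
    static final int MAX_DOC_ID = (1 << 24) - 1; // 16,777,215, the "16.7M" on the slide
    static final int MAX_POSITION = 255;

    static int pack(int docId, int textPosition) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("docId does not fit in 24 bits");
        }
        if (textPosition < 0 || textPosition > MAX_POSITION) {
            throw new IllegalArgumentException("textPosition does not fit in 8 bits");
        }
        // Assumed layout: docID in the upper 24 bits, position in the lower 8 bits.
        return (docId << 8) | textPosition;
    }

    static int docId(int posting) {
        return posting >>> 8; // unsigned shift, so large docIDs do not turn negative
    }

    static int textPosition(int posting) {
        return posting & 0xFF;
    }

    public static void main(String[] args) {
        int posting = pack(100090, 17);
        System.out.println(docId(posting) + " @ " + textPosition(posting));
    }
}
```

Since every posting is a self-contained int holding an absolute docID, it can be stored with a single array write and read without looking at its neighbours.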
  • 53. Posting list encoding in Earlybird v1 Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Earlybird encoding: 5 15 9000 9002 100000 100090 Read direction 53
  • 54. Early query termination Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090 Earlybird encoding: 5 15 9000 9002 100000 100090 Read direction: new to old • E.g. 3 results are requested: here we can terminate after reading 3 postings
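Early termination falls out naturally once postings hold absolute doc IDs and can be read from the newest end. The sketch below scans an array backwards and stops as soon as enough hits are collected; `NewestFirstScan` and the `accept` filter are hypothetical stand-ins for whatever the real query evaluates.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical sketch: newest-to-oldest scan that stops after `wanted` hits.
final class NewestFirstScan {
    static int[] topNewest(int[] docIdsOldToNew, int wanted, IntPredicate accept) {
        int[] hits = new int[Math.max(0, Math.min(wanted, docIdsOldToNew.length))];
        int found = 0;
        // Start at the most recently appended posting and walk backwards.
        for (int i = docIdsOldToNew.length - 1; i >= 0 && found < hits.length; i--) {
            if (accept.test(docIdsOldToNew[i])) {
                hits[found++] = docIdsOldToNew[i];
            }
        }
        return Arrays.copyOf(hits, found);
    }

    public static void main(String[] args) {
        int[] postings = {5, 15, 9000, 9002, 100000, 100090};
        // Three results requested: only the last three postings are read.
        System.out.println(Arrays.toString(topNewest(postings, 3, id -> true)));
    }
}
```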
  • 55. Inverted index components • Dictionary: parallel arrays, holding a pointer to the most recently indexed posting for a term • Posting list storage: ?
  • 57. Posting lists storage - Objectives • Store many singly linked lists of different lengths space-efficiently • The number of Java objects should be independent of the number of lists or number of items in the lists • Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding) • Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads) • Traversal in backwards order
  • 59. Memory management • 4 int[] pools, each made of 32K int[] blocks • Each pool can be grown individually by adding 32K blocks
  • 60. Memory management 4 int[] pools • For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays • Small total number of Java objects (each 32K block is one object) 60
  • 61. Memory management • Slice sizes per pool: 2^11, 2^7, 2^4, 2^1 • Slices can be allocated in each pool • Each pool has a different, but fixed slice size
  • 62. Adding and appending to a list (diagram: pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list)
  • 63. Adding and appending to a list (same diagram) • Store first two postings in the first slice (size 2^1)
  • 64. Adding and appending to a list (same diagram) • When first slice is full, allocate another one in second pool
  • 65. Adding and appending to a list (same diagram) • Allocate a slice on each level as list grows
  • 66. Adding and appending to a list (same diagram) • On uppermost level one list can own multiple slices
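The growth pattern across the four pools can be worked out with simple arithmetic. The sketch below is a simplification: it treats every int of a slice as a posting and ignores any ints reserved for linking slices, and `SlicePools` with its methods is a hypothetical name.

```java
// Hypothetical sketch: which pool a posting lands in as one list grows through
// slices of size 2^1, 2^4, 2^7 and then repeated slices of size 2^11.
final class SlicePools {
    static final int[] SLICE_SIZES = {1 << 1, 1 << 4, 1 << 7, 1 << 11};

    /** Pool (level) holding the posting with the given 0-based index in its list. */
    static int poolFor(long postingIndex) {
        long firstIndexOfNextLevel = 0;
        for (int level = 0; level < SLICE_SIZES.length - 1; level++) {
            firstIndexOfNextLevel += SLICE_SIZES[level];
            if (postingIndex < firstIndexOfNextLevel) {
                return level;
            }
        }
        return SLICE_SIZES.length - 1; // everything else lives in the largest slices
    }

    /** Number of slices a list with the given number of postings owns. */
    static long slicesOwned(long postings) {
        long slices = 0;
        long remaining = postings;
        for (int level = 0; level < SLICE_SIZES.length - 1 && remaining > 0; level++) {
            slices++; // exactly one slice on each of the lower levels
            remaining -= SLICE_SIZES[level];
        }
        if (remaining > 0) {
            int top = SLICE_SIZES[SLICE_SIZES.length - 1];
            slices += (remaining + top - 1) / top; // several slices on the top level
        }
        return slices;
    }

    public static void main(String[] args) {
        for (long n : new long[] {2, 3, 146, 147, 5000}) {
            System.out.println(n + " postings -> " + slicesOwned(n) + " slices");
        }
    }
}
```

Under this simplification a rare term with one or two postings costs only a two-int slice, while a frequent term quickly reaches the 2^11 slices.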
  • 67. Posting list format v1 • One posting = one int (32 bits): docID, 24 bits (max. 16.7M) | textPosition, 8 bits (max. 255) • Tweet text can only have 140 chars
  • 68. Addressing items • Use 32 bit (int) pointers to address any item in any list unambiguously • Pointer = one int (32 bits): poolIndex, 2 bits (0-3) | sliceIndex, 19-29 bits (depends on pool) | offset in slice, 1-11 bits (depends on pool) • Nice symmetry: Postings and address pointers both fit into a 32 bit int
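The pointer layout can be sketched with shifts and masks. The offset widths of 1, 4, 7 and 11 bits follow from slice sizes 2^1, 2^4, 2^7 and 2^11, which leaves 29, 26, 23 and 19 bits for the slice index; the field order (pool, then slice, then offset) and the `PostingPointer` name are assumptions.

```java
// Hypothetical sketch of a 32-bit pointer: 2-bit pool index, then a slice index
// and an in-slice offset whose widths depend on the pool.
final class PostingPointer {
    private static final int[] OFFSET_BITS = {1, 4, 7, 11}; // log2 of each pool's slice size

    static int encode(int pool, int slice, int offset) {
        if (pool < 0 || pool >= OFFSET_BITS.length) {
            throw new IllegalArgumentException("pool must be 0-3");
        }
        int offsetBits = OFFSET_BITS[pool];
        int sliceBits = 30 - offsetBits; // 29, 26, 23 or 19 bits
        if (slice < 0 || (slice >>> sliceBits) != 0) {
            throw new IllegalArgumentException("slice index out of range for pool " + pool);
        }
        if (offset < 0 || (offset >>> offsetBits) != 0) {
            throw new IllegalArgumentException("offset out of range for pool " + pool);
        }
        return (pool << 30) | (slice << offsetBits) | offset;
    }

    static int pool(int pointer) {
        return pointer >>> 30;
    }

    static int slice(int pointer) {
        return (pointer & 0x3FFFFFFF) >>> OFFSET_BITS[pool(pointer)];
    }

    static int offset(int pointer) {
        return pointer & ((1 << OFFSET_BITS[pool(pointer)]) - 1);
    }

    public static void main(String[] args) {
        int p = encode(3, 12345, 2047);
        System.out.println(pool(p) + " / " + slice(p) + " / " + offset(p));
    }
}
```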
  • 69. Linking the slices (diagram: pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list)
  • 70. Linking the slices (same diagram) • Dictionary (parallel arrays) holds a pointer to the last posting indexed for a term
  • 71. Posting list encoding - Summary • ints can be written atomically in Java • Backwards traversal easy on absolute docIDs (not deltas) • Every posting is a possible entry point for a searcher • Skipping can be done without additional data structures as binary search, though there are better approaches (skip lists) • Repeating docIDs if a term occurs multiple times in the same document only works for small docs • Max. segment size: 2^24 = 16.7M tweets 71
  • 72. New posting list encoding • Objectives: • 32 bit positions and variable-length payloads • Store term frequency (TF) instead of repeating docIDs • Keep: • Concurrency model • Space-efficiency for short documents • Performance 72
  • 73. New posting list encoding • Two parts per posting: DocID, termFreq | Position, Payload
  • 74. New posting list encoding • DocID, termFreq: fixed length for each posting
  • 75. New posting list encoding • Position, Payload: variable length
  • 77. New posting list encoding • DocID stream: ... | DocID, termFreq | DocID, termFreq | DocID, termFreq • Position stream: Position, Payload | Position, Payload, Position ... | Position, Payload
  • 78. New posting list encoding (same two streams as on slide 77) • Store TF instead of repeating the same DocID • Store DocID/TF pairs separately from position/payloads • Find a way to synchronously decode the two streams without storing a pointer for each posting (expensive)
  • 79. New posting list encoding (same two streams as on slide 77) • DocID stream: fixed length for each posting (32 bits) • Store TF instead of repeating the same DocID • Store DocID/TF pairs separately from position/payloads • Find a way to synchronously decode the two streams without storing a pointer for each posting (expensive)
  • 80. New posting list encoding • Idea: Use an embedded skip list whose entries serve as periodic “synchronization points” • Keeps memory overhead for pointers low and improves search performance
  • 81. New posting list encoding (diagram: pools with slice sizes 2^11, 2^7, 2^4, 2^1; legend: available, allocated, current list)
  • 82. New posting list encoding Slice header • Header contains: • Back-pointer to previous slice (as before) • Skip list • Slice id 82
  • 83. New posting list encoding • Previous format, one int (32 bits): docID, 24 bits (max. 16.7M) | textPosition, 8 bits (max. 255) • Observation: Most tweets don’t need all 8 bits for text position • Idea: Use the position “inlining” approach for short documents, but support Lucene’s 32-bit positions and variable length payloads
  • 84. New posting list encoding • New format, one int (32 bits): docID, 24 bits (max. 16.7M) | textPosition or termFreq, 7 bits (max. 127) | flag, 1 bit (0=textPosition, 1=termFreq) • As a storage optimization, the text position is stored with the docID if: o termFreq == 1 (term occurs only once in the doc) AND o textPosition <= 127 AND o Posting has no payload AND o Posting is not at a skip point of the docID posting list (see later).
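The four inlining conditions translate directly into a predicate, and the 24 + 7 + 1 bit layout into shifts and masks. In this sketch the field order (docID high, flag in the lowest bit) and the `PostingV2` name are assumptions; the slides also do not say how a termFreq above 127 is stored, so the sketch simply rejects it.

```java
// Hypothetical sketch of the new 32-bit docID posting with optional position inlining.
final class PostingV2 {
    static boolean canInline(int termFreq, int textPosition, boolean hasPayload, boolean atSkipPoint) {
        return termFreq == 1
            && textPosition >= 0 && textPosition <= 127
            && !hasPayload
            && !atSkipPoint;
    }

    static int pack(int docId, int termFreq, int firstPosition, boolean hasPayload, boolean atSkipPoint) {
        if (docId < 0 || docId >= (1 << 24)) {
            throw new IllegalArgumentException("docId does not fit in 24 bits");
        }
        if (canInline(termFreq, firstPosition, hasPayload, atSkipPoint)) {
            return (docId << 8) | (firstPosition << 1); // flag bit 0: value is the text position
        }
        if (termFreq < 1 || termFreq > 127) {
            // Larger term frequencies are not covered by the slides.
            throw new IllegalArgumentException("termFreq must be 1-127 in this sketch");
        }
        return (docId << 8) | (termFreq << 1) | 1;      // flag bit 1: value is the term frequency
    }

    static int docId(int posting) {
        return posting >>> 8;
    }

    static boolean holdsTermFreq(int posting) {
        return (posting & 1) == 1;
    }

    static int value(int posting) {
        return (posting >>> 1) & 0x7F;
    }

    public static void main(String[] args) {
        int inlined = pack(42, 1, 17, false, false);
        int separate = pack(42, 3, 200, false, false);
        System.out.println(holdsTermFreq(inlined) + " " + value(inlined));
        System.out.println(holdsTermFreq(separate) + " " + value(separate));
    }
}
```

When the flag says termFreq, the positions and payloads for that posting have to come from the separate position stream.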
  • 85. New posting list encoding - Summary • Support for 32 bit positions and arbitrary length payloads stored in separate data structure • Performance and space consumption very similar compared to previous encoding for tweet search • Skip lists used for speed and synchronization points • For short documents positions can still be inlined 85