Impact of Crowdsourcing OCR Improvements
on Retrievability Bias
Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman

Centrum Wiskunde & Informatica, Amsterdam, NL 1
Motivation: Retrievability (Bias)
• Introduced by Azzopardi et al. in 2008 [1]
• Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries
• Gini coefficient quantifies inequality in the distribution of scores
[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
2
Study on Retrievability Bias (JCDL2016)
• Follow-up study of Querylog-based Assessment of Retrievability Bias in
a Large Newspaper Corpus
• Large-scale study based on 102 million newspaper items, 4 million
simulated queries and 957,239 real user queries
• Findings:
• Large inequalities among documents, indicating retrievability bias
• Document length impacts retrievability; no evidence of other technical bias was found
• Simulated queries yield very different results from real queries; experiments should take operators and facets into account
3
Potential Causes for
Retrievability Bias
• Skills and interest of users
• Collection bias
• Ranking algorithm
• UI design
• (OCR) quality
4
Courante uyt Italien, Duytslandt, &c (14-06-1618)
Research Questions
• RQ1: Relation between OCR quality and
retrievability
• RQ2: Direct impact of correction on
retrievability bias of corrected documents
• RQ3: Indirect impact of correction of a
fraction of documents on non-corrected
ones
How does bias caused by
OCR quality impact my (re-)search
results?
5
Experimental Setup
6
Documents & Queries
• Subset of the historic newspaper archive maintained by the National Library of the Netherlands (public, KB)
• Ground truth set of 100 manually corrected newspaper issues (822 articles) published in the 17th century and the WWII period (public, KB)
• Character error rates (CER) computed with the tool from [1] (illustrated below)
• User queries collected from delpher.nl (confidential, KB, same set as in the previous study); stopwords and short terms removed, queries deduplicated
7
[1] https://www.digitisation.eu/
De geus onder studenten (14-10-1940)
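For illustration only (the reported CERs were computed with the tool from [1], not with this snippet), a minimal character-error-rate computation based on edit distance could look as follows; the function names and example strings are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """CER: edit distance between OCR output and ground truth, divided by ground-truth length."""
    return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Hypothetical example: one OCR'ed snippet against its manually corrected version.
print(character_error_rate("Courante uyt ltalien", "Courante uyt Italien"))  # 0.05
```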
4 Corpora
• Ground truth set (822 documents):
• uncorrected
• corrected
• Ground truth + mixed in (1644 documents):
• uncorrected
• partially corrected
8
Setup - Retrievability
• Documents and queries are fed into the Indri search engine [1]
• We report on c=1, c=10, c=100, and c=infinite
• Carried out on each of the four corpora
[1] http://www.lemurproject.org/
9
Retrievability Scores r(d)
Azzopardi et al. introduced retrievability to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d is the result of a cumulative scoring function over a query set Q:

r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)

• r(d): retrievability score of document d
• Q: the query set; the sum runs over all queries q in Q
• o_q: weight of query q (possibility to give more weight to certain queries; we use o_q = 1)
• k_{dq}: rank of document d in the result list of query q
• c: cutoff value, the number of documents a user is willing to examine in a ranked list
• f(k_{dq}, c): a generalized utility/cost function; with the cutoff used here, 1 if k_{dq} <= c and 0 otherwise
10
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
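To make the definition concrete, here is a minimal sketch of the cumulative scoring with a binary f and o_q = 1; the data structures (a dict of per-query rankings) are illustrative and not our actual Indri pipeline:

```python
from collections import defaultdict

def retrievability_scores(rankings, c, o=None):
    """Compute r(d) = sum over q in Q of o_q * f(k_dq, c), with f = 1 if the
    rank of d for query q is within the cutoff c, and 0 otherwise.

    rankings: dict mapping each query q to its ranked list of document ids
    c:        cutoff (number of results a user is willing to examine); None = entire list
    o:        optional per-query weights o_q; we use o_q = 1 for all queries
    """
    r = defaultdict(int)
    for q, ranked_docs in rankings.items():
        weight = 1 if o is None else o[q]
        top = ranked_docs if c is None else ranked_docs[:c]
        for d in top:
            r[d] += weight
    return dict(r)

# Toy example with two queries; document "a" is retrievable by both, "b" only by the first.
rankings = {"nieuwe": ["a", "b", "c"], "amsterdam": ["a", "d"]}
print(retrievability_scores(rankings, c=1))     # {'a': 2}
print(retrievability_scores(rankings, c=None))  # {'a': 2, 'b': 1, 'c': 1, 'd': 1}
```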
Impact Assessment
• Wealth: How many documents were retrieved in total?
• Sum of all r(d) scores
• Equality: How are r(d) scores distributed among documents?
• Gini coefficient (see the sketch below)
• Retrieval per document/query:
• Changes due to correction
• Impact of individual (query) terms
11
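A minimal sketch of the two collection-level measures, assuming the r(d) scores are already computed; the Gini formula used here is the standard one over sorted scores, and the example values are made up:

```python
def wealth(r_scores):
    """Wealth: the sum of all r(d) scores in the collection."""
    return sum(r_scores)

def gini(r_scores):
    """Gini coefficient over r(d) scores: 0 = all documents equally retrievable,
    values close to 1 = retrievability concentrated in few documents."""
    xs = sorted(r_scores)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over sorted values: G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

scores = [0, 0, 1, 3, 12]          # made-up r(d) scores of five documents
print(wealth(scores))              # 16
print(round(gini(scores), 3))      # 0.675 -> most of the "wealth" sits in one document
```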
OCR Quality & Retrievability
RQ1: What is the relation between a document’s OCR character error rate and its
retrievability score?
12
RQ1: OCR Quality & Retrievability
• CER in the 17cent collection is significantly higher
• r(d) scores are higher in the WWII collection
• Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman), both with p < 0.001
[Figure: r(d) score vs. character error rate (CER), points distinguished by subset (17cent, WWII) and document length (1000/2000/3000).]
13
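The reported correlations can be reproduced with standard routines; a sketch assuming one CER value and one r(d) score per ground-truth document (toy numbers, not the study data):

```python
from scipy.stats import pearsonr, spearmanr

# One entry per ground-truth document: its character error rate and its r(d) score.
# Toy values only; in the study these come from the 822 manually corrected articles.
cer = [0.05, 0.12, 0.30, 0.55, 0.70]
r_d = [4200, 3900, 2500, 800, 150]

pearson_r, pearson_p = pearsonr(cer, r_d)
spearman_r, spearman_p = spearmanr(cer, r_d)
print(pearson_r, spearman_r)  # both negative: higher CER goes with lower retrievability
```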
Direct Impact of OCR Quality
RQ2: How does the correction of OCR errors impact the retrievability bias of the
corrected documents?
14
Direct Impact
• The complete ground-truth corpus was corrected
• We compare the uncorrected version with the corrected version
15
Impact of Correction on Wealth
• More documents are retrieved from the corrected corpus
• Number of queries with
results increased by 8%
• Impact is largest for
users willing to look at
the entire result list
16
Sum of all r(d) scores (wealth), error-prone vs. corrected:
• c=1: 338,139 → 365,855 (+8%)
• c=10: 1,750,340 → 2,023,283 (+16%)
• c=100: 4,341,536 → 5,477,566 (+26%)
• c=infinite: 4,521,030 → 6,033,099 (+34%)
Impact on Equality
• Correction lowers inequality among
documents
• In contrast to earlier findings, Gini
coefficients do not decrease with
larger c’s
• Correction fixes more FN than FP (c=infinite):
• Increases both wealth and equality
17
[Figure: Direct Impact: Gini coefficients for conditions 822GTcor and 822GTerr at c = 1, 10, 100, and infinite.]
Retrieval per Document
• Few documents lose r(d) scores after correction: good, these are former FP caused by OCR errors that are no longer retrieved
• Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level
18
Retrieval per Query
• Only 44% of the queries retrieved at least one document
• Despite small collection size, we see large gains
• Some queries lose because they retrieved FP from the uncorrected document set
19
Retrieval per Query
Top 10 terms cause 35% of the
wealth increase. These terms:
1. Appear very frequently in
user queries and
2. Are highly susceptible to
OCR errors in the
documents
Conclusion: Real queries are
also a source of bias
20
[Figure: cumulative r(d) difference (%) over query terms ordered by difference in impact (descending); very few single-term queries contribute a large fraction of the overall wealth increase.]
* Translation of the top terms: new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old
Top ten query terms* (frequency in queries, in 822GTerr, in 822GTcor, cumulative impact):
nieuwe: 1,903 | 99 | 166 | 7.36%
amsterdam: 7,885 | 41 | 57 | 14.65%
ende: 185 | 103 | 480 | 18.69%
heer: 826 | 20 | 89 | 21.99%
overleden: 3,698 | 5 | 18 | 24.78%
groot: 1,573 | 125 | 153 | 27.33%
willem: 5,375 | 5 | 13 | 29.81%
twee: 319 | 64 | 175 | 31.83%
drie: 401 | 34 | 120 | 33.81%
oude: 991 | 50 | 78 | 35.41%
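The cumulative-impact curve behind this table can be computed by ordering single-term queries by their contribution to the wealth gain; a sketch with hypothetical inputs (per-term r(d) mass in the error-prone vs. corrected corpus):

```python
def cumulative_term_impact(wealth_err, wealth_cor):
    """wealth_err / wealth_cor: dicts mapping a single-term query to the total r(d) mass
    it retrieves from the error-prone / corrected corpus. Returns (term, cumulative share
    of the overall gain in %) ordered by descending gain."""
    gains = {t: wealth_cor.get(t, 0) - wealth_err.get(t, 0) for t in wealth_cor}
    total_gain = sum(gains.values())
    ordered = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    result, running = [], 0
    for term, gain in ordered:
        running += gain
        result.append((term, 100 * running / total_gain))
    return result

# Toy numbers in the spirit of the table above (not the real counts).
err = {"nieuwe": 99, "ende": 103, "heer": 20}
cor = {"nieuwe": 166, "ende": 480, "heer": 89}
print(cumulative_term_impact(err, cor))
# [('ende', ~73.5), ('heer', ~86.9), ('nieuwe', 100.0)]
```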
Indirect Impact of OCR Quality
RQ3: How does the correction of a fraction of error-prone documents influence the
retrievability of non-corrected ones?
21
Indirect Impact
• Mixed corpus: half of the corpus was corrected
• 50% are the same documents as in the previous RQ (these were corrected)
• 50% are new documents that remain uncorrected; we're mainly interested in these documents
22
Equality still increases!
• Equality in r(d) scores is higher in
the corrected document collection
• Again, correction has decreased
retrievability bias
23
[Figure: Indirect Impact: Gini coefficients for conditions 1644err and 1644mix at c = 1, 10, 100, and infinite.]
Impact on Wealth
[Figure: wealth (sum of r(d) scores) for conditions 1644_err and 1644_mix at c = 1, 10, 100, and infinity, shown separately for the complete document collection, the ground-truth documents only, and the mixed-in documents only.]
• Complete: correction increases wealth
• GT only: increase in wealth
• c=1: +20%
• c=10: +22%
• c=100: +23%
• c=infinite: +25%
• Mixed-in only: decrease in wealth
• c=1: -13%
• c=10: -10%
• c=100: -5%
24
Retrieval per Document (mixed-in only, c=10)
• Most documents’ scores change very little; if they change, they tend to lose r(d) scores
• 171 documents gain r(d) scores
• These benefit from FP matches that disappeared after correction
25
Conclusions
26
Conclusions
• In our study, OCR correction
• Increases overall retrievability
• Reduces retrievability bias, even in a partially corrected corpus
• Higher scores are mainly driven by a small set of terms that are
• frequent in queries and
• susceptible to OCR errors
• Using real user queries is essential to understand actual bias caused
by OCR errors.
27
Impact of Crowdsourcing
OCR Improvements on
Retrievability Bias
We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research.
28
This research is partly funded by the Dutch COMMIT/ program, the
WebART project and the VRE4EIC project, a project that has received
funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 676247.
