Impact of Crowdsourcing OCR Improvements
on Retrievability Bias
Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Lynda Hardman

Centrum Wiskunde & Informatica, Amsterdam, NL 1
Motivation: Retrievability (Bias)
• Introduced by Azzopardi et al. in 2008 [1]
• Retrievability score counts how often a document is retrieved as one of the top K documents by a given set of queries
• Gini coefficient quantifies inequality in the distribution of scores
[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
2
Study on Retrievability Bias (JCDL2016)
• Follow-up study of Querylog-based Assessment of Retrievability Bias in
a Large Newspaper Corpus
• Large-scale study based on 102 million newspaper items, 4 million
simulated queries and 957,239 real user queries
• Findings:
• Large inequalities among documents, indicating retrievability bias
• Document length impacts retrievability; no evidence of other technical bias was found
• Simulated queries yield very different results from real queries; experiments should take operators and facets into account
3
Potential Causes for
Retrievability Bias
• Skills and interest of users
• Collection bias
• Ranking algorithm
• UI design
• (OCR) quality
4
Courante uyt Italien, Duytslandt, &c (14-06-1618)
Research Questions
• RQ1: Relation between OCR quality and
retrievability
• RQ2: Direct impact of correction on
retrievability bias of corrected documents
• RQ3: Indirect impact of correction of a
fraction of documents on non-corrected
ones
How does bias caused by
OCR quality impact my (re-)search
results?
5
Experimental Setup
6
Documents & Queries
• Subset of the historic newspaper archive maintained by the National Library of the Netherlands (public, KB)
• Ground truth set of 100 manually corrected newspaper issues (822 articles) published in the 17th century and the WWII period (public, KB)
• Character error rates (CER) computed with the tool from [1] (illustrated below)
• User queries collected from delpher.nl (confidential, KB, same set as in the previous study); stopwords and short terms removed, queries deduplicated
7
[1] https://www.digitisation.eu/
De geus onder studenten (14-10-1940)
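For illustration only (the reported CERs were computed with the tool from [1], not with this snippet), a minimal character-error-rate computation based on edit distance could look as follows; the function names and example strings are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """CER: edit distance between OCR output and ground truth, divided by ground-truth length."""
    return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Hypothetical example: one OCR'ed snippet against its manually corrected version.
print(character_error_rate("Courante uyt ltalien", "Courante uyt Italien"))  # 0.05
```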
4 Corpora
• Ground truth set (822 documents):
• uncorrected
• corrected
• Ground truth + mixed in (1644 documents):
• uncorrected
• partially corrected
8
Setup - Retrievability
• Documents and queries are fed into the Indri search engine [1]
• We report on c=1, c=10, c=100, and c=infinite
• Carried out on each of the four corpora
[1] http://www.lemurproject.org/
9
Retrievability Scores r(d)
Azzopardi et al. introduced retrievability to measure how retrieval systems influence the accessibility of documents in a collection [1]. The retrievability score of a document d is the result of a cumulative scoring function over a query set Q:

r(d) = \sum_{q \in Q} o_q \cdot f(k_{dq}, c)

• r(d): retrievability score of document d
• Q: the query set; the sum runs over all queries q in Q
• o_q: weight of query q (possibility to give more weight to certain queries; we use o_q = 1)
• k_{dq}: rank of document d in the result list of query q
• c: cutoff value, the number of documents a user is willing to examine in a ranked list
• f(k_{dq}, c): a generalized utility/cost function; with the cutoff used here, 1 if k_{dq} <= c and 0 otherwise
10
[1] L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
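To make the definition concrete, here is a minimal sketch of the cumulative scoring with a binary f and o_q = 1; the data structures (a dict of per-query rankings) are illustrative and not our actual Indri pipeline:

```python
from collections import defaultdict

def retrievability_scores(rankings, c, o=None):
    """Compute r(d) = sum over q in Q of o_q * f(k_dq, c), with f = 1 if the
    rank of d for query q is within the cutoff c, and 0 otherwise.

    rankings: dict mapping each query q to its ranked list of document ids
    c:        cutoff (number of results a user is willing to examine); None = entire list
    o:        optional per-query weights o_q; we use o_q = 1 for all queries
    """
    r = defaultdict(int)
    for q, ranked_docs in rankings.items():
        weight = 1 if o is None else o[q]
        top = ranked_docs if c is None else ranked_docs[:c]
        for d in top:
            r[d] += weight
    return dict(r)

# Toy example with two queries; document "a" is retrievable by both, "b" only by the first.
rankings = {"nieuwe": ["a", "b", "c"], "amsterdam": ["a", "d"]}
print(retrievability_scores(rankings, c=1))     # {'a': 2}
print(retrievability_scores(rankings, c=None))  # {'a': 2, 'b': 1, 'c': 1, 'd': 1}
```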
Impact Assessment
• Wealth: How many documents were retrieved in total?
• Sum of all r(d) scores
• Equality: How are r(d) scores distributed among documents?
• Gini coefficient (see the sketch below)
• Retrieval per document/query:
• Changes due to correction
• Impact of individual (query) terms
11
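A minimal sketch of the two collection-level measures, assuming the r(d) scores are already computed; the Gini formula used here is the standard one over sorted scores, and the example values are made up:

```python
def wealth(r_scores):
    """Wealth: the sum of all r(d) scores in the collection."""
    return sum(r_scores)

def gini(r_scores):
    """Gini coefficient over r(d) scores: 0 = all documents equally retrievable,
    values close to 1 = retrievability concentrated in few documents."""
    xs = sorted(r_scores)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over sorted values: G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

scores = [0, 0, 1, 3, 12]          # made-up r(d) scores of five documents
print(wealth(scores))              # 16
print(round(gini(scores), 3))      # 0.675 -> most of the "wealth" sits in one document
```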
OCR Quality & Retrievability
RQ1: What is the relation between a document’s OCR character error rate and its
retrievability score?
12
RQ1: OCR Quality & Retrievability
• CER in the 17cent collection is significantly higher
• r(d) scores are higher in the WWII collection
• Correlation between r(d) and CER: -0.57 (Pearson) and -0.61 (Spearman), both with p < 0.001
[Figure: r(d) score vs. character error rate (CER), points distinguished by subset (17cent, WWII) and document length (1000/2000/3000).]
13
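The reported correlations can be reproduced with standard routines; a sketch assuming one CER value and one r(d) score per ground-truth document (toy numbers, not the study data):

```python
from scipy.stats import pearsonr, spearmanr

# One entry per ground-truth document: its character error rate and its r(d) score.
# Toy values only; in the study these come from the 822 manually corrected articles.
cer = [0.05, 0.12, 0.30, 0.55, 0.70]
r_d = [4200, 3900, 2500, 800, 150]

pearson_r, pearson_p = pearsonr(cer, r_d)
spearman_r, spearman_p = spearmanr(cer, r_d)
print(pearson_r, spearman_r)  # both negative: higher CER goes with lower retrievability
```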
Direct Impact of OCR Quality
RQ2: How does the correction of OCR errors impact the retrievability bias of the
corrected documents?
14
Direct Impact
• The complete ground-truth corpus was corrected
• We compare the uncorrected version with the corrected version
15
Impact of Correction on Wealth
• More documents are retrieved from the corrected corpus
• Number of queries with
results increased by 8%
• Impact is largest for
users willing to look at
the entire result list
16
Sum of all r(d) scores (wealth), error-prone vs. corrected:
• c=1: 338,139 → 365,855 (+8%)
• c=10: 1,750,340 → 2,023,283 (+16%)
• c=100: 4,341,536 → 5,477,566 (+26%)
• c=infinite: 4,521,030 → 6,033,099 (+34%)
Impact on Equality
• Correction lowers inequality among
documents
• In contrast to earlier findings, Gini
coefficients do not decrease with
larger c’s
• Correction fixes more FN than FP (c=infinite):
• Increases both wealth and equality
17
[Figure: Direct Impact: Gini coefficients for conditions 822GTcor and 822GTerr at c = 1, 10, 100, and infinite.]
Retrieval per Document
• Few documents lose r(d) scores after correction: good, these are former FP caused by OCR errors that are no longer retrieved
• Most documents, however, gain, with the 17cent corpus improving to a larger extent but still remaining at a lower level
18
Retrieval per Query
• Only 44% of the queries retrieved at least one document
• Despite small collection size, we see large gains
• Some queries lose because they retrieved FP from the uncorrected document set
19
Retrieval per Query
Top 10 terms cause 35% of the
wealth increase. These terms:
1. Appear very frequently in
user queries and
2. Are highly susceptible to
OCR errors in the
documents
Conclusion: Real queries are
also a source of bias
20
[Figure: cumulative r(d) difference (%) over query terms ordered by difference in impact (descending); very few single-term queries contribute a large fraction of the overall wealth increase.]
* Translation of the top terms: new, Amsterdam, end, Mister, died/dead, grand/large, Willem (name), two, three, old
Top ten query terms* (frequency in queries, in 822GTerr, in 822GTcor, cumulative impact):
nieuwe: 1,903 | 99 | 166 | 7.36%
amsterdam: 7,885 | 41 | 57 | 14.65%
ende: 185 | 103 | 480 | 18.69%
heer: 826 | 20 | 89 | 21.99%
overleden: 3,698 | 5 | 18 | 24.78%
groot: 1,573 | 125 | 153 | 27.33%
willem: 5,375 | 5 | 13 | 29.81%
twee: 319 | 64 | 175 | 31.83%
drie: 401 | 34 | 120 | 33.81%
oude: 991 | 50 | 78 | 35.41%
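The cumulative-impact curve behind this table can be computed by ordering single-term queries by their contribution to the wealth gain; a sketch with hypothetical inputs (per-term r(d) mass in the error-prone vs. corrected corpus):

```python
def cumulative_term_impact(wealth_err, wealth_cor):
    """wealth_err / wealth_cor: dicts mapping a single-term query to the total r(d) mass
    it retrieves from the error-prone / corrected corpus. Returns (term, cumulative share
    of the overall gain in %) ordered by descending gain."""
    gains = {t: wealth_cor.get(t, 0) - wealth_err.get(t, 0) for t in wealth_cor}
    total_gain = sum(gains.values())
    ordered = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    result, running = [], 0
    for term, gain in ordered:
        running += gain
        result.append((term, 100 * running / total_gain))
    return result

# Toy numbers in the spirit of the table above (not the real counts).
err = {"nieuwe": 99, "ende": 103, "heer": 20}
cor = {"nieuwe": 166, "ende": 480, "heer": 89}
print(cumulative_term_impact(err, cor))
# [('ende', ~73.5), ('heer', ~86.9), ('nieuwe', 100.0)]
```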
Indirect Impact of OCR Quality
RQ3: How does the correction of a fraction of error-prone documents influence the
retrievability of non-corrected ones?
21
Indirect Impact
• Mixed corpus: half of the corpus was corrected
• 50% are the same documents as in the previous RQ (these were corrected)
• 50% are new documents that remain uncorrected; we're mainly interested in these documents
22
Equality still increases!
• Equality in r(d) scores is higher in
the corrected document collection
• Again, correction has decreased
retrievability bias
23
[Figure: Indirect Impact: Gini coefficients for conditions 1644err and 1644mix at c = 1, 10, 100, and infinite.]
Impact on Wealth
[Figure: wealth (sum of r(d) scores) for conditions 1644_err and 1644_mix at c = 1, 10, 100, and infinity, shown separately for the complete document collection, the ground-truth documents only, and the mixed-in documents only.]
• Complete: correction increases wealth
• GT only: increase in wealth
• c=1: +20%
• c=10: +22%
• c=100: +23%
• c=infinite: +25%
• Mixed-in only: decrease in wealth
• c=1: -13%
• c=10: -10%
• c=100: -5%
24
Retrieval per Document (mixed-in only, c=10)
• Most documents’ scores change very little; if they change, they tend to lose r(d) scores
• 171 documents gain r(d) scores
• These benefit from FP matches that disappeared after correction
25
Conclusions
26
Conclusions
• In our study, OCR correction
• Increases overall retrievability
• Reduces retrievability bias, even in a partially corrected corpus
• Higher scores are mainly driven by a small set of terms that are
• frequent in queries and
• susceptible to OCR errors
• Using real user queries is essential to understand actual bias caused
by OCR errors.
27
Impact of Crowdsourcing
OCR Improvements on
Retrievability Bias
We would like to thank the National Library of the Netherlands (KB) for making the newspaper corpus and the (sensitive) user data available to us for research.
28
This research is partly funded by the Dutch COMMIT/ program, the
WebART project and the VRE4EIC project, a project that has received
funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 676247.
