Towards a Quality Assessment of Web Corpora for Language Technology Applications

2018-06-08
Towards a quality assessment of web corpora
for language technology applications
Wiktor Strandqvist
RISE SICS East and Linköping University, Linköping,
Sweden
wikst813@student.liu.se
(co-authors: M. Santini, L. Lind, A. Jönsson)
Outline
• Introduction
• Purpose
• Domain-specific web corpora
• Two-step approach:
1. Extraction of term seeds from use cases, personas and scenarios
• Results
2. Bootstrapping and evaluation of domain-specific web corpora
• Results
• Conclusion and Future Work

Introduction
• Research exists to assess the representativeness of general purpose web corpora
by comparing them to traditional corpora.
• Our focus and purpose:
• Creation and evaluation of domain-specific web corpora to be used to build Language
Technology real-world applications.
• We propose a two-step approach:
1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios;
2. Creation and validation of specialized and domain-specific web corpora bootstrapped with
term seeds automatically extracted in step 1.
Step 1: Term extraction from use cases, personas and scenarios
• Why use use cases, personas and scenarios, when available?
• They are based on numerous interviews and observations of real situations;
• Checked by domain experts who know how to correctly use terms in their own domain.
• Use cases, personas and scenarios are a good starting point for automating the manual process
(often arbitrary and tedious) of identifying term seeds to bootstrap domain-specific web corpora.
• We focus on the medical terms that occur in use cases, personas and scenarios written in
English for the E-care@home project.
• Challenge: building an accurate term extractor from a relatively short text (a few dozen pages)

Step 2: Corpus bootstrapping and evaluation
• Bootstrap a web corpus using term seeds automatically extracted from use cases,
personas and scenarios.
• Automatically evaluate the “quality” of the bootstrapped domain-specific web
corpora.
Open issues and proposed answers
Q1: What is meant by “quality” of a web corpus?
A1: Here, “quality” means a high density of medical terms (lay or specialized) related to certain illnesses.
Q2: How can we assess the quality of a corpus automatically bootstrapped from the web?
A2: by using metrics that are well-established and easily replicable.
Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target
domain?
A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is
satisfactorily domain-specific or whether the corpus needs to be amended before being used.
Q4: Can we measure the domain-specificity of a corpus?
A4: Yes, we use word frequency lists (without stopwords) and apply some statistical measures, see part
2 of this presentation.

Word frequency lists: a compact corpus representation
• Our assumptions:
• “Words are not selected at random” (Adam Kilgarriff)
• Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking
much of the information in the corpus but small and easily tractable” (Adam Kilgarriff)
• We use frequency lists of content words (i.e. after stopword removal) to evaluate
the “quality” of the web corpora.
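A content-word frequency list of this kind can be sketched in a few lines of Python. This is only an illustration: the tokenizer and the tiny stopword list below are assumptions, since the slides do not say which tools or stopword list were used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the slides do not name the list used).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "with"}

def content_word_frequencies(text, stopwords=STOPWORDS):
    """Return a Counter of lowercased content words, i.e. tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical example sentence, not taken from the corpora.
freqs = content_word_frequencies(
    "The patient with diabetes monitors the blood glucose; the glucose is high."
)
print(freqs.most_common(3))
```

The resulting counter is the compact unigram representation that the later measures take as input.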
Part 1:
Term-Extraction from Use Cases, Personas, Scenarios
• Term candidate extraction
• Part-of-speech tagging (Stanford tagger)
• Syntactic patterns
• Term validation
• Partial matching against a medical database (SNOMED CT)
• Ranking the terms based on DF/IDF
• Cutoff
• Seed generation
• Triples sampled from the same context
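The seed generation step might look roughly like the sketch below. It assumes that a "context" is a single source document (a use case, persona or scenario) and that the validated terms are already grouped by context; both are assumptions, as are the function name and the example terms.

```python
import random

def sample_seed_triples(terms_by_context, n_triples, rng):
    """Sample seed triples so that the three terms of a triple share one context."""
    # Only contexts with at least three distinct terms can yield a triple.
    eligible = [sorted(set(terms)) for terms in terms_by_context.values()
                if len(set(terms)) >= 3]
    if not eligible:
        return []
    triples = []
    for _ in range(n_triples):
        context_terms = rng.choice(eligible)          # pick one context
        triples.append(tuple(rng.sample(context_terms, 3)))  # three distinct terms from it
    return triples

# Hypothetical validated terms grouped by the document they came from.
terms_by_context = {
    "persona_1": ["diabetes", "blood glucose", "insulin", "hypoglycemia"],
    "scenario_2": ["heart failure", "blood pressure", "oedema"],
    "use_case_3": ["fall", "dizziness"],  # too few terms, never sampled
}
triples = sample_seed_triples(terms_by_context, 5, random.Random(0))
for triple in triples:
    print(triple)
```

Each triple could then be submitted as one query to whatever corpus bootstrapping tool is in use.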

Part 1:
Term-Extraction results
• Term candidate extraction
• Extraction recall: 81%
• Term validation
• Precision: 34.2%
• Recall: 71%
• F1: 46.2%
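As a quick arithmetic check, the reported F1 follows from the reported precision and recall when F1 is taken as their harmonic mean (the standard definition, assumed here):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Term validation figures reported on the slide.
f1 = f1_score(0.342, 0.71)
print(f"F1 = {f1:.1%}")  # F1 = 46.2%
```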
Part 2:
Evaluating domain-specific web corpora
• In this part of the presentation:
• We show that a corpus bootstrapped with term seeds automatically extracted from use
cases, personas and scenarios (Auto corpus) has the same “quality” as a corpus bootstrapped
with hand-picked seeds (Gold corpus).
• We show that both the Gold corpus and the Auto corpus have similar domain-specificity
(domainhood), and do not share any similarity with a general language web corpus, like
ukWaC.

The Web Corpora used in our experiments
• ukWaCsample (872 565 words): a random subset of ukWaC (general language
corpus)
• Gold (544 677 words): a web corpus collected with hand-picked seeds
• Auto (492 479 words): a web corpus collected with automatically extracted seeds
Plotting normalized frequencies (words per million, wpm)
• ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
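Because the three corpora differ in size, raw counts are not directly comparable. A minimal sketch of the words-per-million normalization, using the corpus sizes from the slide and a hypothetical raw count:

```python
def words_per_million(count, corpus_size):
    """Normalize a raw word count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

# Corpus sizes as given on the slide.
CORPUS_SIZES = {"ukWaCsample": 872_565, "Gold": 544_677, "Auto": 492_479}

raw_count = 500  # hypothetical raw count, identical in all three corpora
for name, size in CORPUS_SIZES.items():
    print(name, round(words_per_million(raw_count, size), 1))
```

The same raw count yields a higher normalized frequency in the smaller corpora, which is why the plots and ranks use wpm rather than raw counts.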

Plotting ranks (top 1000 words)
• The ranks are based on the normalized frequencies (wpm)
Rank Correlation: Kendall
• Non-parametric Kendall Tau
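For illustration, Kendall's tau can be computed directly from concordant and discordant pairs. The sketch below implements the simple tau-a variant without tie correction and assumes both rank lists cover the same words in the same order; how the slides handle words missing from one list is not stated. Library routines such as `scipy.stats.kendalltau` would normally be used instead.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a for two equal-length sequences (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical ranks of four shared words in two corpora.
ranks_a = [1, 2, 3, 4]
ranks_b = [1, 3, 2, 4]
tau = kendall_tau_a(ranks_a, ranks_b)
print(round(tau, 4))
```

A value near 1 indicates that the two corpora rank their frequent words in nearly the same order; a value near 0 indicates no agreement.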

Rank Correlation: Spearman
• Non-parametric Spearman's rho
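A minimal sketch of Spearman's rho between two frequency lists, assuming no tied frequencies and restricting the comparison to words present in both lists (both are assumptions; the frequencies are hypothetical):

```python
def ranks(freqs, words):
    """Rank the given words by descending frequency (1 = most frequent)."""
    ordered = sorted(words, key=lambda w: -freqs[w])
    return {w: r for r, w in enumerate(ordered, start=1)}

def spearman_rho(freqs_a, freqs_b):
    """Spearman's rho over the shared vocabulary, using the no-ties formula."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared words")
    ra, rb = ranks(freqs_a, shared), ranks(freqs_b, shared)
    d_squared = sum((ra[w] - rb[w]) ** 2 for w in shared)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical normalized frequencies (wpm) for four shared words.
gold = {"patient": 900, "diabetes": 700, "blood": 500, "care": 300}
auto = {"patient": 850, "blood": 650, "diabetes": 600, "care": 200}
rho = spearman_rho(gold, auto)
print(round(rho, 2))  # 0.8
```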
Smoothing: 0.01
• We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).

KL divergence (aka relative entropy)
• R: entropy package, function KL.empirical()
• KL: ukWaCsample vs Gold = 7.544118
• KL: ukWaCsample vs Auto = 6.519677
• KL: Gold vs Auto = 1.843863
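The slides compute KL divergence in R. As a rough Python equivalent, the sketch below adds the smoothing value 0.01 to every count over the joint vocabulary, normalizes to probabilities, and sums p·ln(p/q). Where exactly the smoothing was applied in the original experiments is not stated, so that detail is an assumption, as are the example counts.

```python
import math

def kl_divergence(counts_p, counts_q, smoothing=0.01):
    """KL(P || Q) in nats between two word-count dictionaries.

    Additive smoothing keeps every probability non-zero, so the
    logarithm is always defined.
    """
    vocab = sorted(set(counts_p) | set(counts_q))
    p = [counts_p.get(w, 0) + smoothing for w in vocab]
    q = [counts_q.get(w, 0) + smoothing for w in vocab]
    total_p, total_q = sum(p), sum(q)
    return sum((a / total_p) * math.log((a / total_p) / (b / total_q))
               for a, b in zip(p, q))

# Hypothetical counts, not taken from the actual corpora.
general = {"time": 50, "people": 40, "patient": 2}
medical = {"patient": 60, "diabetes": 30, "time": 5}
print(kl_divergence(general, medical))
print(kl_divergence(medical, general))  # KL is not symmetric
```

Identical distributions give 0, and larger values mean the first distribution is less well described by the second, which matches the pattern on the slide: Gold vs Auto is far smaller than either comparison against ukWaCsample.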
Log-likelihood (LL-G2)
• Corpus profiling: the larger the LL-G2 scores, the more significant the difference
between two corpora.
• The total LL-G2 scores for the three web corpora (top 1000-ranked words) are:
• LL-G2: ukWaCsample vs Gold = 453 441.6
• LL-G2: ukWaCsample vs Auto = 393 705.9
• LL-G2: Gold vs Auto = 114 694.2
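A per-word LL-G2 score can be sketched with the two-term formulation commonly used in corpus profiling: expected counts are derived from the pooled relative frequency, and G2 is twice the sum of observed·ln(observed/expected). The slides do not state which variant was used, so this formulation, the function names and the example counts are assumptions; the total score is taken to be the sum over the words considered.

```python
import math

def log_likelihood_g2(count_a, count_b, size_a, size_b):
    """LL-G2 for one word observed count_a times in corpus A and count_b times in B."""
    pooled = (count_a + count_b) / (size_a + size_b)
    expected_a = size_a * pooled
    expected_b = size_b * pooled
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def total_g2(counts_a, counts_b, size_a, size_b, words):
    """Sum of per-word LL-G2 scores over the given word list."""
    return sum(log_likelihood_g2(counts_a.get(w, 0), counts_b.get(w, 0), size_a, size_b)
               for w in words)

# Hypothetical counts; corpus sizes are those reported for Gold and Auto.
gold_counts = {"patient": 1200, "diabetes": 800}
auto_counts = {"patient": 1000, "diabetes": 300}
print(total_g2(gold_counts, auto_counts, 544_677, 492_479, ["patient", "diabetes"]))
```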

List of LL-G2 scores
From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto
For the individual LL scores, a G2 score of 3.8415 or higher is significant at the level
of p < 0.05 and a G2 score of 10.8276 or higher is significant at the level of p < 0.001.
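The two thresholds are the critical values of a chi-square distribution with one degree of freedom. Assuming that reference distribution, the p-value for any individual G2 score can be recovered with the standard library alone, since the upper tail for one degree of freedom equals erfc(sqrt(G2 / 2)):

```python
import math

def g2_p_value(g2):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2))

# Thresholds quoted on the slide.
for threshold in (3.8415, 10.8276):
    print(threshold, g2_p_value(threshold))
```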
Discussion
• These simple measures based on word frequency lists give a clear indication of
the “quality” of a bootstrapped domain-specific web corpus:
• Rank correlation
• KL divergence
• Log-likelihood (LL-G2)
• These measures can be used to assess the corpus quality BEFORE the corpus is
used to build LT applications, thus avoiding bad surprises.
• If the values returned by the metrics are not satisfactory, a corpus can be
amended accordingly.

Conclusion and Future Work
• It is possible to create a fairly accurate term extractor from a relatively short text
written by domain experts.
• It is possible to assess the quality and domain-specificity of web corpora by using
well-established metrics.
• Future work: expanding the word frequency lists (to include bigrams and trigrams) and
identifying more metrics that can help in the evaluation of the quality of the
corpora, such as burstiness and perplexity.
Thank you for your attention!
