2018-06-08
Towards a quality assessment of web corpora
for language technology applications
Wiktor Strandqvist
RISE SICS East and Linköping University, Linköping,
Sweden
wikst813@student.liu.se
(co-authors: M. Santini, L. Lind, A. Jönsson)
Outline
• Introduction
• Purpose
• Domain-specific web corpora
• Two-step approach:
1. Extraction of term seeds from use cases, personas and scenarios
• Results
2. Bootstrapping and evaluation of domain-specific web corpora
• Results
• Conclusion and Future Work
Introduction
• Research exists to assess the representativeness of general purpose web corpora
by comparing them to traditional corpora.
• Our focus and purpose:
• Creation and evaluation of domain-specific web corpora to be used to build Language
Technology real-world applications.
• We propose a two-step approach:
1. Automatic extraction and evaluation of term seeds from use cases, personas and scenarios;
2. Creation and validation of specialized and domain-specific web corpora bootstrapped with
term seeds automatically extracted in step 1.
Step 1: Term extraction from use cases, personas and scenarios
• Why use use cases, personas and scenarios, when available?
• They are based on numerous interviews and observations of real situations;
• They are checked by domain experts who know how to correctly use terms in their own domain.
• Use cases, personas and scenarios are a good starting point for automating the manual process
(often arbitrary and tedious) of identifying term seeds to bootstrap domain-specific web corpora.
• We focus on the medical terms that occur in use cases, personas and scenarios written in
English for the E-care@home project.
• Challenge: building an accurate term extractor from a relatively short text (a few dozen pages)
Step 2: Corpus bootstrapping and evaluation
• Bootstrap a web corpus using term seeds automatically extracted from use cases,
personas and scenarios.
• Automatically evaluate the “quality” of the bootstrapped domain-specific web
corpora.
Open issues and proposed answers
Q1: What is meant by “quality” of a web corpus?
A1: Here, “quality” means a high density of medical terms (lay or specialized) related to certain illnesses.
Q2: How can we assess the quality of a corpus automatically bootstrapped from the web?
A2: By using metrics that are well-established and easily replicable.
Q3: What if a bootstrapped web corpus contains documents that are NOT relevant to the target
domain?
A3: It depends. We can measure the domain-specificity of a corpus and assess whether it is
satisfactorily domain-specific or whether the corpus needs to be amended before being used.
Q4: Can we measure the domain-specificity of a corpus?
A4: Yes. We use word frequency lists (without stopwords) and apply some statistical measures; see Part
2 of this presentation.
Word frequency lists: a compact corpus representation
• Our assumptions:
• “Words are not selected at random” (Adam Kilgarriff)
• Word frequency lists (aka unigram lists) are a “compact representation of a corpus, lacking
much of the information in the corpus but small and easily tractable” (Adam Kilgarriff)
• We use frequency lists of content words (i.e. after stopword removal) to evaluate
the “quality” of the web corpora.
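A content-word frequency list of this kind can be sketched in a few lines of Python. This is only an illustration: the tokenizer, the tiny stopword list and the sample sentence are assumptions, since the slides do not say which tools or stopword list were used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the slides do not name the list used).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "or", "the", "to", "with"}

def content_word_frequencies(text, stopwords=STOPWORDS):
    """Return a Counter of lowercased alphabetic tokens with stopwords removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical sample text, not taken from any of the corpora.
sample = "The patient is treated with insulin, and the insulin dose is adjusted."
freqs = content_word_frequencies(sample)
print(freqs["insulin"], freqs["patient"])  # 2 1
```

The resulting counter is the "compact representation" that the later measures operate on.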
Part 1:
Term-Extraction from Use Cases, Personas, Scenarios
• Term candidate extraction
• Part-of-speech tagging (Stanford tagger)
• Syntactic patterns
• Term validation
• Partial matching against a medical database (SNOMED CT)
• Ranking the terms based on DF/IDF
• Cutoff
• Seed generation
• Triples sampled from the same context
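The seed-generation step can be illustrated with a small sketch. It assumes that "context" means a single source document (a use case, persona or scenario) and that a seed is a tuple of three distinct validated terms from that document; the function name, the grouping and the example terms are all hypothetical.

```python
import random

def sample_seed_triples(terms_by_context, n_triples, seed=0):
    """Sample n_triples triples, each drawn from the terms of one context."""
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    # Only contexts with at least three distinct terms can yield a triple.
    eligible = [sorted(set(terms)) for terms in terms_by_context.values()
                if len(set(terms)) >= 3]
    if not eligible:
        return []
    return [tuple(rng.sample(rng.choice(eligible), 3)) for _ in range(n_triples)]

# Hypothetical validated terms grouped by source document.
terms_by_context = {
    "scenario_1": ["blood pressure", "hypertension", "heart rate", "dizziness"],
    "scenario_2": ["insulin", "blood glucose", "diabetes"],
    "persona_1": ["fatigue"],
}
for triple in sample_seed_triples(terms_by_context, 3):
    print(triple)
```

Keeping the three terms of a seed within one context is meant to make each web query topically coherent; the same triple may be drawn more than once in this sketch.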
Part 1:
Term-Extraction results
• Term candidate extraction
• Extraction recall: 81%
• Term validation
• Precision: 34.2%
• Recall: 71%
• F1: 46.2%
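As a quick check, the reported F1 follows from the reported precision and recall via the standard harmonic-mean formula F1 = 2PR / (P + R). The helper name below is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for term validation.
f1 = f1_score(0.342, 0.71)
print(f"F1 = {f1:.1%}")  # prints "F1 = 46.2%"
```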
Part 2:
Evaluating domain-specific web corpora
• In this part of the presentation:
• We show that a corpus bootstrapped with term seeds automatically extracted from use
cases, personas and scenarios (Auto corpus) has the same “quality” as a corpus bootstrapped
with hand-picked seeds (Gold corpus).
• We show that both the Gold corpus and the Auto corpus have similar domain-specificity
(domainhood), and do not share any similarity with a general language web corpus such as
ukWaC.
The Web Corpora used in our experiments
• ukWaCsample (872 565 words): a random subset of ukWaC (general language
corpus)
• Gold (544 677 words): a web corpus collected with hand-picked seeds
• Auto (492 479 words): a web corpus collected with automatically extracted seeds
Plotting normalized frequencies (wpm)
• ukWaCsample (872 565 words), Gold (544 677 words), Auto (492 479 words)
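Normalizing to words per million (wpm) makes counts comparable across corpora of different sizes. A minimal sketch, assuming the corpus sizes listed above are the denominators; the raw count is hypothetical and only there for illustration.

```python
def words_per_million(count, corpus_size):
    """Normalize a raw word count to a frequency per million words."""
    return count / corpus_size * 1_000_000

# Corpus sizes as reported on the slide.
CORPUS_SIZES = {"ukWaCsample": 872_565, "Gold": 544_677, "Auto": 492_479}

# Hypothetical raw count, for illustration only (not a figure from the study).
raw_count = 500
for name, size in CORPUS_SIZES.items():
    print(f"{name}: {words_per_million(raw_count, size):.1f} wpm")
```

The same raw count yields a higher wpm value in a smaller corpus, which is why raw counts should not be compared directly.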
Plotting ranks (top 1000 words)
• The ranks are based on the normalized frequencies (wpm)
Rank Correlation: Kendall
• Non-parametric Kendall Tau
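For readers unfamiliar with the measure, Kendall's tau counts how many word pairs are ordered the same way in both rankings versus the opposite way. The sketch below implements the simple tau-a variant without tie correction; which variant and which tool were actually used is not stated on the slide, and the toy ranks are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a for two equal-length sequences (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical ranks of the same five words in two corpora.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [2, 1, 3, 5, 4]
print(kendall_tau_a(ranks_a, ranks_b))  # 0.6
```

A value near 1 means the two corpora rank their frequent words in nearly the same order; a value near 0 means the orderings are unrelated.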
Rank Correlation: Spearman
• Non-parametric Spearman Rho
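Spearman's rho can be computed directly from rank differences when neither ranking contains ties, using rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The sketch below assumes tie-free ranks over the same set of words; the toy ranks are hypothetical, and how words missing from one corpus's top 1000 were handled is not stated on the slide.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rho for two tie-free rank lists over the same items."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    # Sum of squared rank differences.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks of the same five words in two corpora.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [2, 1, 3, 5, 4]
print(spearman_rho(ranks_a, ranks_b))  # 0.8
```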
Smoothing: 0.01
• We apply smoothing before calculating KL divergence and log-likelihood (LL-G2).
KL divergence (aka relative entropy)
• R: entropy package, function KL.empirical()
• KL: ukWaCsample vs Gold = 7.544118
• KL: ukWaCsample vs Auto = 6.519677
• KL: Gold vs Auto = 1.843863
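The computation can be approximated in Python as follows. This is a sketch under assumptions, not a port of the R code: the smoothing value 0.01 is assumed to be added to every count over the joint vocabulary, natural logarithms are assumed, and the word counts are hypothetical.

```python
import math

def kl_divergence(counts_p, counts_q, smoothing=0.01):
    """KL(P || Q) over the joint vocabulary, with additive smoothing of counts."""
    vocab = sorted(set(counts_p) | set(counts_q))
    # Smoothing keeps every probability non-zero, so the logarithm is defined.
    p = [counts_p.get(w, 0) + smoothing for w in vocab]
    q = [counts_q.get(w, 0) + smoothing for w in vocab]
    total_p, total_q = sum(p), sum(q)
    return sum((a / total_p) * math.log((a / total_p) / (b / total_q))
               for a, b in zip(p, q))

# Hypothetical content-word counts, for illustration only.
counts_domain = {"insulin": 30, "patient": 50, "dose": 20}
counts_general = {"patient": 5, "people": 60, "time": 35}
print(kl_divergence(counts_domain, counts_general))
print(kl_divergence(counts_domain, counts_domain))  # 0.0
```

KL divergence is asymmetric, so the order of the two corpora matters; a lower value, as reported for Gold vs Auto, indicates more similar word distributions.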
Log-likelihood (LL-G2)
• Corpus profiling: the larger the LL-G2 scores, the more significant the difference
between two corpora.
• The total LL-G2 scores for the three web corpora (top 1000-ranked words) are:
• LL-G2: ukWaCsample vs Gold = 453 441.6
• LL-G2: ukWaCsample vs Auto = 393 705.9
• LL-G2: Gold vs Auto = 114 694.2
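A per-word LL-G2 score is commonly computed with the two-corpus log-likelihood formula used in corpus profiling: G2 = 2 * sum(O * ln(O / E)), where the expected count E of a word in each corpus is proportional to that corpus's size. The sketch below assumes this variant; the slides do not state which formulation was used, and the example counts are hypothetical.

```python
import math

def log_likelihood_g2(count_a, count_b, size_a, size_b):
    """G2 for one word observed count_a and count_b times in two corpora."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts of one word in two corpora of 1000 words each.
print(log_likelihood_g2(100, 10, 1000, 1000))
print(log_likelihood_g2(10, 10, 1000, 1000))  # 0.0
```

A total score for a corpus pair, like the ones listed above, is the sum of the per-word scores over the selected words.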
List of LL-G2 scores
From left to right: ukWaCsample vs Gold; ukWaCsample vs Auto; Gold vs Auto
For the individual LL scores, a G2 score of 3.8415 or higher is significant at the level
of p < 0.05, and a G2 score of 10.8276 or higher is significant at the level of p < 0.001.
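The two thresholds are the chi-square critical values for one degree of freedom. Applying them to individual word scores can be sketched as below; the function name and labels are illustrative.

```python
# Chi-square critical values (1 degree of freedom), as quoted on the slide.
THRESHOLD_P05 = 3.8415
THRESHOLD_P001 = 10.8276

def significance_label(g2):
    """Map a per-word G2 score to the strictest significance level it reaches."""
    if g2 >= THRESHOLD_P001:
        return "p < 0.001"
    if g2 >= THRESHOLD_P05:
        return "p < 0.05"
    return "not significant"

for score in (2.0, 5.0, 12.0):  # hypothetical scores
    print(score, significance_label(score))
```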
Discussion
• These simple measures based on word frequency lists give a clear indication of
the “quality” of a bootstrapped domain-specific web corpus:
• Rank correlation
• KL divergence
• Log-likelihood (LL-G2)
• These measures can be used to assess the corpus quality BEFORE the corpus is
used to build LT applications, thus avoiding bad surprises.
• If the values returned by the metrics are not satisfactory, a corpus can be
amended accordingly.
Conclusion and Future Work
• It is possible to create a fairly accurate term extractor from a relatively short text
written by domain experts.
• It is possible to assess the quality and domain-specificity of web corpora by using
well-established metrics.
• Future work: expanding the word frequency lists (to include bigrams and trigrams) and
identifying more metrics that can help in the evaluation of the quality of the
corpora, such as burstiness and perplexity.
Thank you for your attention!
