Unsupervised Relation Extraction for E‐Learning Applications from Biomedical Domain
Naveed Afzal
Outline
• Multiple Choice Questions
• Motivation
• System Architecture
• Unsupervised IE
• Surface‐based Approach
• Dependency‐based Approach
• Use of Web as a corpus
• Automatic Generation of Questions
• Automatic Generation of Distractors
• Extrinsic Evaluation
• Comparison
• Main Contributions
Multiple Choice Questions (MCQ)
•Popular assessment tool
•MCQ consists of:
• A question
• The correct answer
• A list of distractors (wrong answers)
•45%–67% of student assessments utilise MCQs
•Automatic generation of MCQ:
• An emerging area of NLP
Motivation
• Most previous approaches relied on the syntactic structure of sentences to generate MCQs
• Previous approaches were unable to automatically generate questions from complex sentences
• A complete MCQ system based on IE
• The aim is to identify the most important semantic relations in a document without assigning explicit labels to them, in order to ensure broad coverage unrestricted to predefined types of relations
• Our approach extracts semantic relations using Unsupervised Information Extraction (IE)
• Semantic relations are converted into questions
• Distractors are generated using a distributional similarity measure
System Architecture
[Architecture diagram: Unannotated corpus → Named Entity Recognition → Extraction of Candidate Patterns → Patterns Ranking → Evaluation → Semantic Relations → Question Generation (rules) → Distractors Generation (Distributional Similarity) → Output (MCQ)]
Unsupervised IE
•Unsupervised Approaches
• Surface‐based
• Dependency‐based
•Each approach can cover a potentially unrestricted 
range of semantic relations
•Other approaches (supervised and semi‐supervised) require seed patterns and learn similar patterns exemplified by the seeds
NER & POS Tagging
• The GENIA tagger is used for NER and POS tagging
• It provides base forms, POS tags and NE tags for the GENIA corpus
• The GENIA POS tagger achieves an accuracy of 96.94% on the Wall Street Journal corpus and 98.26% on the GENIA corpus
Named Entity Recognition (NER)
GENIA NER is used to recognise the following 5 main Named Entities:

Entity Type   Precision   Recall   F-score
Protein       65.82       81.41    72.79
DNA           65.64       66.76    66.20
RNA           60.45       68.64    64.29
Cell Line     56.12       59.60    57.81
Cell Type     78.51       70.54    74.31
Overall       67.45       75.78    71.37
Surface‐based Approach (Patterns Building)
• Important relations are expressed through recurrent linguistic constructions
• These constructions can be recognised by examining the sequences of words between NEs
• To discover such linguistic constructions:
• Find pairs of NEs in a document
• Extract the sequences of words between them, which are later used to learn extraction patterns
Patterns Building
• The presented approach builds candidate patterns from the content words (notional words) occurring between two named entities, both with and without prepositions
• Three types of surface‐based patterns, with and without prepositions:
• Untagged word patterns
• PoS‐tagged word patterns
• Verb‐centred word patterns
• Content words are nouns, verbs, adverbs and adjectives
Patterns Building
• A minimum of one and a maximum of three content words are extracted between two named entities
• Why?
• If there is no content word between two NEs, it is most likely that there is no relation between them
• Conversely, if two NEs are far apart, it is also most likely that they are not related
• Lemmatised word forms are used (see the sketch below)
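A minimal sketch of this candidate-pattern extraction step; the token format and all names are illustrative assumptions, not the original implementation:

```python
# Sketch: extract surface-based candidate patterns between adjacent NE
# pairs, keeping 1-3 content words (plus prepositions) in between.
# A sentence is assumed to be a list of (lemma, pos, ne_class) tuples,
# with ne_class e.g. 'PROTEIN' or None.
CONTENT_POS = {"NN", "VB", "RB", "JJ"}   # nouns, verbs, adverbs, adjectives

def candidate_patterns(sentence):
    ne_idx = [i for i, (_, _, ne) in enumerate(sentence) if ne]
    patterns = []
    for a, b in zip(ne_idx, ne_idx[1:]):              # adjacent NE pairs
        between = sentence[a + 1:b]
        content = [t for t in between if t[1][:2] in CONTENT_POS]
        if 1 <= len(content) <= 3:                    # 1-3 content words only
            middle = " ".join(t[0] for t in between
                              if t[1][:2] in CONTENT_POS or t[1] == "IN")
            patterns.append(f"{sentence[a][2]} {middle} {sentence[b][2]}")
    return patterns

sent = [("fibrinogen", "NN", "PROTEIN"), ("activate", "VBZ", None),
        ("nf-kappa b", "NN", "PROTEIN")]
print(candidate_patterns(sent))   # ['PROTEIN activate PROTEIN']
```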
Patterns Building
• Converting passive voice to active voice relieves the problem of data sparseness, e.g.
• PROTEIN be_v express_v CELL_TYPE is transformed into
• CELL_TYPE express_v PROTEIN
• We filter out patterns containing negation (e.g. not, do not)
• We also filter out patterns containing only stop words, e.g.
• DNA through PROTEIN
• PROTEIN such as PROTEIN
• PROTEIN with PROTEIN in CELL_TYPE
• PROTEIN be same in CELL_LINE
• PROTEIN against PROTEIN
(the voice conversion and both filters are sketched below)
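A sketch of the voice normalisation and filtering steps; the regular expression and word lists are illustrative assumptions:

```python
import re

NE_CLASSES = {"PROTEIN", "DNA", "RNA", "CELL_TYPE", "CELL_LINE"}
STOP_WORDS = {"be", "through", "such", "as", "with", "in", "same", "against"}
NEGATION = {"not", "no", "never"}

def passive_to_active(pattern):
    # 'X be_v VERB_v Y' -> 'Y VERB_v X'
    m = re.match(r"^(\S+) be_v (\S+_v) (\S+)$", pattern)
    return f"{m.group(3)} {m.group(2)} {m.group(1)}" if m else pattern

def keep_pattern(pattern):
    tokens = pattern.split()
    if any(t in NEGATION for t in tokens):            # drop negated patterns
        return False
    middle = [t.replace("_v", "") for t in tokens if t not in NE_CLASSES]
    return not all(t in STOP_WORDS for t in middle)   # drop stop-word-only

print(passive_to_active("PROTEIN be_v express_v CELL_TYPE"))
# -> CELL_TYPE express_v PROTEIN
print(keep_pattern("DNA through PROTEIN"))            # -> False
```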
Dependency‐based Approach
• An unsupervised dependency‐based approach
• Dependency trees are a suitable basis for semantic pattern acquisition, as they abstract away from the surface structure to represent relations between the elements of a sentence
• We assume that semantic relations hold between NEs stated in the same sentence
Dependency‐based Approach (Patterns Building)
• After NER, the next step is the extraction of candidate patterns, which consists of two main stages:
• the construction of potential patterns from an unannotated domain corpus
• their relevance ranking
• We use an adapted linked chain pattern model that combines pairs of chains in a dependency tree which share a common verb root but no direct descendants
Patterns Building
• Every NE is treated as a chain in a dependency tree if it is fewer than 5 dependencies away from the verb root and the words linking the NE to the verb root are content words (verbs, nouns, adverbs and adjectives) or prepositions
• Only those chains in the dependency tree of a sentence which contain NEs are considered (see the sketch below)
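An illustrative sketch of the linked-chain extraction; the parse format {id: (lemma, pos, head_id, ne_class)} and all names are assumptions for the example only:

```python
from itertools import combinations

CONTENT_POS = {"V", "N", "A", "ADV", "PREP"}

def chain_to_root(tid, toks, max_hops=5):
    """Token ids on the path from an NE up to the verb root, else None."""
    path = []
    while toks[tid][2] != 0:                 # climb until the root verb
        if len(path) >= max_hops or toks[tid][1] not in CONTENT_POS:
            return None                      # too far, or non-content link
        path.append(tid)
        tid = toks[tid][2]
    return path

def linked_chain_patterns(toks):
    ne = [i for i, t in toks.items() if t[3]]
    chains = {i: chain_to_root(i, toks) for i in ne}
    # pair NE chains that meet only at the shared verb root
    return [(a, b) for a, b in combinations(ne, 2)
            if chains[a] is not None and chains[b] is not None
            and not set(chains[a]) & set(chains[b])]

# 'Fibrinogen activates NF-kappa B in mononuclear phagocytes.'
toks = {1: ("fibrinogen", "N", 2, "PROTEIN"), 2: ("activate", "V", 0, None),
        3: ("nf-kappa b", "N", 2, "PROTEIN"), 4: ("in", "PREP", 2, None),
        5: ("mononuclear phagocyte", "N", 4, "CELL_TYPE")}
print(linked_chain_patterns(toks))   # [(1, 3), (1, 5), (3, 5)]
```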
Example
Fibrinogen activates NF‐kappa B in mononuclear phagocytes.
• GENIA tagger for NER:
• <protein> Fibrinogen </protein> activates <protein> NF‐
kappa B </protein> in <cell_type> mononuclear phagocytes 
</cell_type>.
• All NEs are replaced with their respective semantic classes, so the sentence above is transformed into:
• PROTEIN activates PROTEIN in CELL_TYPE.
• The sentence is then parsed with the Machinese Syntax parser
Example
[V/activate] (subj[PROTEIN] + obj[PROTEIN])
[V/activate] (obj[PROTEIN] + prep[in] + p[CELL_TYPE])
Patterns Ranking
• Patterns are ranked according to their significance in the domain corpus
• A general corpus (the BNC) is used for comparison
• We measure the strength of association of a pattern with the domain corpus as opposed to the general corpus
Patterns Ranking
• The patterns are ranked using the following ranking methods:
• Information‐theoretic concepts: Information Gain, Information Gain Ratio, Mutual Information, Normalised Mutual Information
• Statistical tests: Log‐likelihood, Chi‐Square
• Meta‐ranking
• tf‐idf
• The patterns, along with the scores obtained with the above ranking methods, are stored in a database (a sketch of some of these scores follows)
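A sketch of how such association scores can be computed from a pattern's frequency in the domain corpus versus the general corpus; the formulas below are standard textbook versions, and the exact definitions used in the thesis may differ:

```python
import math

def association_scores(k_dom, n_dom, k_gen, n_gen):
    """k_dom/k_gen: pattern frequency in the domain/general corpus;
    n_dom/n_gen: corpus sizes (token counts)."""
    p_dom = k_dom / n_dom                        # P(pattern | domain)
    p_all = (k_dom + k_gen) / (n_dom + n_gen)    # P(pattern) overall
    mi = math.log2(p_dom / p_all)                # pointwise MI with the domain
    nmi = mi / -math.log2(p_all)                 # normalised MI
    # simplified chi-square (pattern cells of the 2x2 table only)
    e_dom, e_gen = n_dom * p_all, n_gen * p_all
    chi2 = (k_dom - e_dom) ** 2 / e_dom + (k_gen - e_gen) ** 2 / e_gen
    return {"MI": mi, "NMI": nmi, "CHI": chi2}

print(association_scores(k_dom=40, n_dom=10_000, k_gen=5, n_gen=100_000))
```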
Ranking Methods
•Chi‐Square and Normalised Mutual Information are the best‐performing ranking methods in terms of precision, but recall is very low for Chi‐Square
•No statistically significant difference between Information Gain, Information Gain Ratio and Log‐likelihood
•Mutual Information is the worst‐performing ranking method
Surface‐based Patterns Ranking
[Chart: CHI vs NMI across score thresholds >0.08 to >0.5; y‐axis 0–1]
Surface‐based Patterns Ranking (CHI)
[Chart: Precision, Recall and F‐score for CHI across score thresholds >0.06 to >0.3]
Surface‐based Patterns Ranking (NMI)
[Chart: Precision, Recall and F‐score for NMI across score thresholds >0.06 to >0.3]
Dependency‐based Patterns Ranking
[Chart: CHI vs NMI across score thresholds >0.08 to >0.5; y‐axis 0–1]
Dependency‐based Patterns Ranking (CHI)
[Chart: Precision, Recall and F‐score for CHI across score thresholds >0.06 to >0.3]
Dependency‐based Patterns Ranking (NMI)
[Chart: Precision, Recall and F‐score for NMI across score thresholds >0.06 to >0.3]
Use of Web as a corpus
• Due to the small size of the GENIA corpus, we developed a large Web corpus by automatically collecting MEDLINE articles similar to the GENIA corpus from the National Library of Medicine
• To ensure that the Web corpus is sufficiently on‐topic, it is important to know how similar the two corpora are
• It is important to first determine the homogeneity of a corpus before computing its similarity to another corpus, as the judgement of similarity can become unreliable if a homogeneous corpus is compared with a heterogeneous one
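One standard way to score corpus similarity is a Kilgarriff‐style chi‐square over the most frequent words of the joint corpus; this sketch is an assumption for illustration, as the thesis may use a different similarity or homogeneity measure:

```python
from collections import Counter

def corpus_similarity_chi2(words_a, words_b, top_n=500):
    """Chi-square over the top_n most frequent words of the joint
    corpus; lower values indicate more similar corpora."""
    fa, fb = Counter(words_a), Counter(words_b)
    na, nb = len(words_a), len(words_b)
    chi2 = 0.0
    for w, _ in (fa + fb).most_common(top_n):
        exp_a = (fa[w] + fb[w]) * na / (na + nb)   # expected count in A
        exp_b = (fa[w] + fb[w]) * nb / (na + nb)   # expected count in B
        chi2 += (fa[w] - exp_a) ** 2 / exp_a + (fb[w] - exp_b) ** 2 / exp_b
    return chi2 / top_n
```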
Use of Web as a corpus
• The Web corpus is not homogeneous
• The Web corpus is not similar to the GENIA corpus
• The Web corpus is not similar to the GENIA EVENT corpus
• One possible reason is that GENIA is a very narrow‐domain corpus, and it is hard to collect relevant topical documents automatically
• Using the Web as a corpus is still unable to ensure the same level of topic relevance as achieved in manually compiled corpora
Automatic Question Generation
•An emerging area of research
•Questions ask about important concepts described in a given text
•It is well known that generating/asking good questions is a complicated task
•Semantic relations allow us to identify which parts of the learning material are important and worth testing
Automatic Question Generation
•The surface‐based approach uses a set of rules to transform semantic patterns into questions automatically
•In the dependency‐based approach, questions are generated automatically by traversing the dependency tree of a sentence matched by a semantic pattern
Automatic Question Generation
• Pattern: DNA contain_v DNA
• Step 1: Identify instantiations of the pattern in the evaluation corpus. This involves finding the template (in the above example, the verb 'contain') and the slot fillers (two specific DNAs). The pattern is matched in the evaluation corpus and the relevant sentence is extracted from it:
• Thus, the gamma 3 ECS is an inducible promoter containing cis elements that critically mediate CD40L and IL‐4‐triggered transcriptional activation of the human C gamma 3 gene.
Automatic Question Generation
• Step 2: The part of the extracted sentence that contains the template together with the slot fillers is marked with <QP> and </QP> tags, as shown below:
• Thus, the <DNA> gamma 3 ECS </DNA> is an <QP> <DNA> inducible promoter </DNA> containing <DNA> cis elements </DNA> </QP> that critically mediate <protein> CD40L </protein> and IL‐4‐triggered transcriptional activation of the <DNA> human C gamma 3 gene </DNA>.
• Step 3: We extract the semantic tags and actual names from the extracted sentence using the Machinese parser (Tapanainen and Järvinen, 1997). After parsing, the extracted semantic pattern is transformed into two types of questions (active voice and passive voice):
• Which DNA contains cis elements?
• Which DNA is contained by inducible promoter?
• For the various forms of extracted patterns, we developed a set of rules based on the semantic classes (Named Entities) and part‐of‐speech (PoS) information present in a pattern; a sketch of one such rule follows.
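An illustrative sketch of the rule for a simple 'NE verb NE' surface pattern, with naive verb inflection; the full rule set in the thesis covers more pattern forms:

```python
def which_questions(ne_class1, verb, filler1, filler2, ne_class2):
    """filler1/filler2: the actual NE strings matched in the sentence."""
    third_person = verb + "s"                              # naive inflection
    past_part = verb + ("d" if verb.endswith("e") else "ed")
    active = f"Which {ne_class1} {third_person} {filler2}?"
    passive = f"Which {ne_class2} is {past_part} by {filler1}?"
    return active, passive

print(which_questions("DNA", "contain", "inducible promoter",
                      "cis elements", "DNA"))
# ('Which DNA contains cis elements?',
#  'Which DNA is contained by inducible promoter?')
```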
Automatic Question Generation
• [V/encode] (subj[DNA] + obj[PROTEIN])
• This pattern is matched with the following sentence, which 
contains its instantiation:
• This structural similarity suggests that the pAT 133 gene encodes 
a transcription factor with a specific biological function.
• Our dependency‐based patterns always include a main verb, so to automatically generate questions:
• We traverse the whole dependency tree of the extracted sentence and
• Extract all of the words which depend on the main verb in the dependency parse of the sentence.
• That part of the sentence is then transformed into a question by selecting the subtree of the parse bounded by the two named entities present in the dependency pattern (see the sketch after the example below).
Automatic Question Generation
Which DNA encodes a transcription factor with a specific biological
function?
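A minimal sketch of this subtree‐based generation; the parse format {id: (word, head_id)} and all names are assumptions for illustration:

```python
def which_question(toks, verb_id, subj_id, subj_class):
    """Replace the subject NE with 'Which <class>' and keep the words
    in the main verb's subtree, in sentence order."""
    def under(i, target):                 # is token i in target's subtree?
        while i != 0:
            if i == target:
                return True
            i = toks[i][1]
        return False

    rest = [toks[i][0] for i in sorted(toks)
            if under(i, verb_id) and not under(i, subj_id)]
    return f"Which {subj_class} " + " ".join(rest) + "?"

toks = {1: ("pAT 133 gene", 2), 2: ("encodes", 0), 3: ("a", 4),
        4: ("transcription factor", 2), 5: ("with", 4),
        6: ("a specific biological function", 5)}
print(which_question(toks, verb_id=2, subj_id=1, subj_class="DNA"))
# Which DNA encodes a transcription factor with a specific biological function?
```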
Automatic Question Generation
•In both the surface‐based and dependency‐based approaches, we are able to automatically generate only one type of question (Which‐questions) regarding the named entities present in a semantic relation.
•Our approach is not capable of automatically generating other types of questions (e.g. Why, How and What questions); to do that, one has to look at various NLG techniques.
Distractors Generation
•Distributional similarity measure
•Alleviates the problem of data sparseness
•Corpus‐driven
•Information Radius (defined below)
•Our aim is to automatically generate plausible distractors: if the correct answer is a protein, our approach automatically generates protein distractors that are involved in similar processes or belong to the same biological category.
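For reference, the Information Radius (Jensen–Shannon divergence) between distributions p and q is commonly defined, up to a constant factor, as:

```latex
\mathrm{IRad}(p, q) = D\left(p \,\middle\|\, \frac{p+q}{2}\right) + D\left(q \,\middle\|\, \frac{p+q}{2}\right),
\qquad D(p \| r) = \sum_{x} p(x) \log_2 \frac{p(x)}{r(x)}
```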
Distractors Generation
• We build a pool of various biomedical corpora from which to generate distractors.
• After linguistic processing, we build a frequency matrix: the corpora are scanned for semantic classes (Named Entities) occurring together with a notional word, and the frequencies are recorded in a database.
• Distributional models are constructed for all candidate named entities.
• Semantic classes are compared using the distributional hypothesis: similar words appear in similar contexts.
• The distractors for a given correct answer are then generated automatically by measuring its similarity to all candidate named entities.
• The top 4 most similar candidate named entities are used as distractors (see the sketch below).
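A minimal sketch of IRad‐based distractor selection; the data structures, smoothing constant and names are illustrative assumptions:

```python
import math
from collections import Counter

def irad(p, q, eps=1e-12):
    """Information radius between two probability distributions."""
    keys = set(p) | set(q)
    m = {k: (p.get(k, eps) + q.get(k, eps)) / 2 for k in keys}
    def kl(a):
        return sum(a.get(k, eps) * math.log2(a.get(k, eps) / m[k])
                   for k in keys)
    return kl(p) + kl(q)

def to_dist(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def top_distractors(answer, candidates, context_counts, k=4):
    """candidates: NEs of the same semantic class as the answer;
    context_counts: {entity: Counter of co-occurring notional words}."""
    p = to_dist(context_counts[answer])
    scored = sorted((irad(p, to_dist(context_counts[c])), c)
                    for c in candidates if c != answer)
    return [c for _, c in scored[:k]]     # most similar = lowest IRad
```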
Extrinsic Evaluation
• In a real application, users have a vital role to play
• We evaluated both MCQ systems as a whole in a user‐centred fashion
• Two highly experienced biomedical experts (both post‐docs) took part
• We selected a score threshold (score > 0.01) for NMI, as it gives the maximum F‐score (surface‐based: 54%; dependency‐based: 65%)
• 80 surface‐based and 52 dependency‐based MCQs were evaluated
Extrinsic Evaluation
•Question and Distractors Readability
1. Incomprehensible
2. Rather Clear
3. Clear
•Usefulness of Semantic Relation
1. Incomprehensible
2. Rather Clear
3. Clear
Extrinsic Evaluation
• Question and Distractors Relevance
1. Not Relevant
2. Rather Relevant
3. Very Relevant
• Question and Distractors Acceptability
• (0 = Unacceptable, 5 = Acceptable)
• Overall MCQ Usability
1. Unusable
2. Needs Major Revision
3. Needs Minor Revision
4. Directly Usable
Surface‐based MCQ
             QR     DR     USR    QRelv  DRelv  QA     DA     Usability
             (1-3)  (1-3)  (1-3)  (1-3)  (1-3)  (0-5)  (0-5)  (1-4)
Evaluator 1  2.15   2.96   2.14   2.04   2.24   2.53   3.04   2.61
Evaluator 2  1.74   2.29   1.88   1.66   2.10   1.95   3.28   2.11
Average      1.95   2.63   2.01   1.85   2.17   2.24   3.16   2.36
(QR/DR = Question/Distractors Readability; USR = Usefulness of Semantic Relation; QRelv/DRelv = Question/Distractors Relevance; QA/DA = Question/Distractors Acceptability)
Surface‐based MCQ
•In terms of overall MCQ usability, the extrinsic evaluation results for the surface‐based MCQ system show that:
• 35% of MCQ items were considered directly usable,
• 30% needed minor revisions and 14% needed major revisions,
• while 21% of MCQ items were deemed unusable.
Dependency‐based MCQ
             QR     DR     USR    QRelv  DRelv  QA     DA     Usability
             (1-3)  (1-3)  (1-3)  (1-3)  (1-3)  (0-5)  (0-5)  (1-4)
Evaluator 1  2.42   2.98   2.38   2.37   2.31   3.25   3.73   3.37
Evaluator 2  2.25   2.15   2.46   2.23   2.06   3.27   3.15   2.79
Average      2.34   2.57   2.42   2.30   2.19   3.26   3.44   3.08
(Abbreviations as in the surface‐based table above)
Dependency‐based MCQ
•In the dependency‐based MCQ system, we found that:
•65% of MCQ items were considered directly usable,
•23% needed minor revisions and
•6% needed major revisions,
•while 6% of MCQ items were unusable.
Comparison
[Chart: surface‐based vs dependency‐based MCQ on QR, DR, USR, QRelv and DRelv (scale 1–3)]
Comparison
[Chart: surface‐based vs dependency‐based MCQ on QA and DA (scale 0–5)]
Comparison
[Chart: overall MCQ usability, surface‐based vs dependency‐based (scale 1–4)]
Statistical Significance
(p‐values per evaluator)
                                 Evaluator 1   Evaluator 2
Question Readability             0.1912        0.0011
Distractors Readability          0.5496        0.4249
Usefulness of Semantic Relation  0.2737        0.0002
Question Relevance               0.0855        0.0004
Distractors Relevance            0.1244        0.7022
Question Acceptability           0.1449        0.0028
Distractors Acceptability        0.0715        0.4123
Overall MCQ Usability            0.0026        0.0010
Main Contributions
•Fully implemented automatic MCQ generation systems based on IE
•They overcome problems faced by previous approaches
•IE is used to improve the quality of automatically generated MCQs
•Unsupervised approaches for RE intended to be deployed in an e‐Learning system for automatic generation of MCQs
•Different pattern ranking methods were explored
•Our system can be easily adapted to other domains
References
• PhD Thesis Online:
• http://clg.wlv.ac.uk/papers/afzal‐thesis.pdf
• Journal Papers:
• Afzal N. and Mitkov R. (2014). Automatic Generation of Multiple Choice Questions using Dependency‐based Semantic Relations. Soft Computing, Volume 18, Issue 7, pp. 1269‐1281. (Impact Factor 2013: 1.304) DOI: 10.1007/s00500‐013‐1141‐4
• Afzal N. and Farzindar A. (2013). Unsupervised Relation Extraction from a Corpus Automatically Collected from the Web from Biomedical Domain. International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), Vol. 2, Issue 4, pp. 315‐324.
• Conference Papers:
• Afzal N., Mitkov R. and Farzindar A. (2011). Unsupervised Relation Extraction using Dependency Trees for Automatic Generation of Multiple‐Choice Questions. In C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 32‐43. Springer, Heidelberg.
• Afzal N. and Pekar V. (2009). Unsupervised Relation Extraction for Automatic Generation of Multiple‐Choice Questions. In Proceedings of RANLP 2009, 14‐16 September 2009, Borovets, Bulgaria.