Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica

S. Ferilli, F. Leuzzi, F. Rotella
L.A.C.A.M.
http://lacam.di.uniba.it:8000

AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence
Workshop on Mining Complex Patterns (MCP 2011)
Palermo, Italy, September 17, 2011

Overview
1. Introduction & Objectives
2. Extraction of knowledge from text
3. Knowledge representation formalism
4. Identification of relevant concepts
5. Generalization of similar concepts
6. Reasoning ‘by association’
7. Conclusions & Future works


Introduction
The spread of electronic documents and document
repositories has generated the need for automatic techniques
to understand and handle document content, in order to
help users satisfy their information needs.

Full Text Understanding is not trivial, due to:
1. the intrinsic ambiguity of natural language;
2. the huge amount of common-sense and conceptual background
knowledge required.

To face these problems, lexical and/or conceptual
taxonomies are useful, even though building them manually is
very costly and error-prone.

Introduction
The resulting lack of such taxonomies is a strong
motivation towards the automatic construction
of conceptual networks by mining large amounts
of documents in natural language.

However, even assuming a correct
knowledge representation, we are
still far from simulating human abilities.


Objectives

1. Definition of a representation formalism for knowledge
extracted from natural language texts

2. Extraction of concepts and relevance assessment

3. Generalization of concepts having similar descriptions

4. Definition of a kind of reasoning by concept association that
looks for possible indirect connections between two
identified concepts


Extraction of knowledge from text
Knowledge is extracted by processing each sentence separately.

[Figure: processing pipeline, Stanford Parser [1] followed by Stanford Dependencies [2]]

The final output of the Stanford Dependencies is a typed
syntactic structure of each sentence.


Knowledge representation formalism
Among all grammatical roles played by words in a sentence,
only subject, verb and complement have been considered.
In the final conceptual graph subjects and complements will
represent concepts, while verbs will express relations between
them.
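
As a rough illustration of this formalism, the sketch below builds <subject, verb, complement> triples from typed dependencies that are already available as tuples. The tuple layout `(relation, governor, dependent)`, the function name `extract_triples`, and the choice of mapping "complement" to the `dobj` and `iobj` labels are assumptions made for the example, not details given in the slides.

```python
from collections import defaultdict

def extract_triples(dependencies):
    """Build <subject, verb, complement> triples from typed dependencies.

    `dependencies` is a list of (relation, governor, dependent) tuples.
    Assumption: subjects are marked `nsubj`, complements `dobj` or `iobj`.
    """
    subjects = defaultdict(list)
    complements = defaultdict(list)
    for relation, governor, dependent in dependencies:
        if relation == "nsubj":
            subjects[governor].append(dependent)
        elif relation in ("dobj", "iobj"):
            complements[governor].append(dependent)

    triples = []
    for verb, subjs in subjects.items():
        for subj in subjs:
            # A verb with a subject but no complement yields no triple.
            for comp in complements.get(verb, []):
                triples.append((subj, verb, comp))
    return triples

# Hypothetical dependencies for a sentence like "adults use the platform".
deps = [
    ("nsubj", "use", "adults"),
    ("dobj", "use", "platform"),
    ("det", "platform", "the"),
]
triples = extract_triples(deps)
```

In the resulting graph, `adults` and `platform` would be concept nodes and `use` the relation between them.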

[Figure: <subject, verb, complement> structure]


Identification of relevant concepts
A mix of several techniques is brought to cooperate in
identifying relevant concepts:

● Hub Words [3]: words having high frequency whose relevance is
computed as:

$W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i)$

where: $w_0$ is the initial weight; $n$ is the number of relationships;
$w(t_i)$ is the tf*idf weight of the i-th word related to t.

● Keyword extraction techniques from single documents.
● EM Clustering provided by Weka [4] based on Euclidean
distance.
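
The Hub Words weight in the first bullet can be sketched as a one-line computation. The function name, the argument layout, and the reading that the sum runs over the same n related words counted by the β term are assumptions made for this example.

```python
def hub_word_weight(w0, related_weights, alpha, beta, gamma):
    """W(t) = alpha * w0 + beta * n + gamma * sum of w(t_i).

    `related_weights` holds the tf*idf weights w(t_i) of the words related
    to t; n is taken to be the number of such words (an assumption).
    """
    n = len(related_weights)
    return alpha * w0 + beta * n + gamma * sum(related_weights)

# Hypothetical values, chosen only to show the arithmetic:
# 0.5 * 0.5 + 0.1 * 2 + 1.0 * (0.2 + 0.3) = 0.95
weight = hub_word_weight(0.5, [0.2, 0.3], alpha=0.5, beta=0.1, gamma=1.0)
```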


Identification of relevant concepts
Inspired by the Hub Words approach, we have defined a
Relevance Weight:

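$$W(\bar{c}) = \underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A} + \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B} + \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C} + \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D} + \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E}$$

where: $\alpha + \beta + \gamma + \delta + \varepsilon = 1$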
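
A minimal sketch of how the five components might be combined is given below. The dictionary-based inputs, the function name `relevance_weights`, and the choice of taking d_M as the largest distance among the given nodes are assumptions for illustration; the slides do not specify these details.

```python
def relevance_weights(w, e, neighbors, d, k, coeffs):
    """Combine components A..E into a Relevance Weight for each node.

    w: initial weight per node        e: number of edges per node
    neighbors: adjacent nodes         d: distance from the cluster center
    k: keyword-extraction score       coeffs: (alpha, beta, gamma, delta, epsilon)
    """
    alpha, beta, gamma, delta, epsilon = coeffs
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    max_w, max_e, max_k = max(w.values()), max(e.values()), max(k.values())
    d_max = max(d.values())  # assumption: d_M is the largest distance

    scores = {}
    for node in w:
        comp_a = alpha * w[node] / max_w
        comp_b = beta * e[node] / max_e
        comp_c = gamma * sum(w[x] for x in neighbors[node]) / e[node]
        comp_d = delta * (d_max - d[node]) / d_max
        comp_e = epsilon * k[node] / max_k
        scores[node] = comp_a + comp_b + comp_c + comp_d + comp_e
    return scores

# Hypothetical two-node network.
scores = relevance_weights(
    w={"a": 2.0, "b": 1.0},
    e={"a": 1, "b": 1},
    neighbors={"a": ["b"], "b": ["a"]},
    d={"a": 0.0, "b": 4.0},
    k={"a": 1.0, "b": 0.5},
    coeffs=(0.2, 0.2, 0.2, 0.2, 0.2),
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The sketch assumes every node has at least one edge and that the maxima are non-zero; a fuller version would need to guard those divisions.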

Nodes in the network are ranked by decreasing Relevance
Weight.
A suitable cut-point in the ranking is determined by choosing
the first item such that:
$W(c_k) - W(c_{k+1}) \geq p \cdot \max_{i=0,\dots,n-1} \left( W(c_i) - W(c_{i+1}) \right)$
where: $p \in [0,1]$
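
The cut-point rule can be sketched as follows, assuming the weights are already sorted in decreasing order. The function name and the convention of returning the number of retained items (positions 0 to k) are choices made for this example.

```python
def cut_point(ranked_weights, p):
    """Return how many top-ranked items to keep.

    Finds the first k whose gap W(c_k) - W(c_{k+1}) is at least p times
    the largest gap in the ranking, and keeps items 0..k.
    """
    gaps = [
        ranked_weights[i] - ranked_weights[i + 1]
        for i in range(len(ranked_weights) - 1)
    ]
    if not gaps:
        return len(ranked_weights)
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return k + 1
    return len(ranked_weights)

# Hypothetical ranking: the largest drop is between positions 1 and 2.
weights = [0.9, 0.8, 0.4, 0.35, 0.3]
kept_strict = cut_point(weights, p=0.7)  # only a large gap qualifies
kept_loose = cut_point(weights, p=0.2)   # the first, smaller gap already qualifies
```

Lower values of p make the cut fall earlier in the ranking, since smaller gaps are enough to trigger it.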

Identification of relevant concepts
Relevance Weight in detail
Definition of the Initial Weight

The whole set of triples <subject,verb,complement> is
represented in a Concepts x Attributes matrix V recalling the
classical Terms x Documents Vector Space Model.

Resembling tf*idf: $\frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|}$

Therefore component A is: $\alpha \frac{w(\bar{c})}{\max_c w(c)}$
where w(c) is the initial weight assigned to node c, computed
according to the above tf*idf schema.
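
The tf*idf-like weighting of the Concepts x Attributes matrix can be sketched as below. The sketch only computes the per-cell values; how they are aggregated into a single initial weight w(c) per node is not stated in the slides, so that step is left out. The function name and the list-of-rows layout are assumptions.

```python
import math

def tfidf_matrix(freq):
    """Weight each cell of a Concepts x Attributes frequency matrix.

    freq[i][j] is the frequency of concept i for attribute j. Each cell is
    f_ij / sum_k f_kj, multiplied by log(|A| / number of attributes in
    which concept i occurs).
    """
    n_attrs = len(freq[0])
    col_totals = [sum(row[j] for row in freq) for j in range(n_attrs)]
    weights = []
    for row in freq:
        attrs_with_concept = sum(1 for f in row if f > 0)
        idf = math.log(n_attrs / attrs_with_concept) if attrs_with_concept else 0.0
        weights.append([
            (f / col_totals[j]) * idf if col_totals[j] else 0.0
            for j, f in enumerate(row)
        ])
    return weights

# Hypothetical matrix: concept 0 occurs in one attribute, concept 1 in both.
weights = tfidf_matrix([[2, 0], [2, 4]])
```

A concept that occurs in every attribute gets weight zero, as with the idf factor in the classical Terms x Documents model.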


Connections Number
Component B considers the number of connections (edges) in
which c is involved
$\beta \frac{e(\bar{c})}{\max_c e(c)}$

Neighborhood Weight Summary
Component C takes into account the average
initial weight of all neighbors of c

$\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}$


Inverse Distance from Center
Component D represents the closeness to the center of the cluster:
$\delta \frac{d_M - d(\bar{c})}{d_M}$

KE Influence
Component E takes into account the outcome of three KE
techniques suitably weighted:
$\varepsilon \frac{k(\bar{c})}{\max_c k(c)}$
where:

$k(\bar{c}) = \varsigma\, k_{co\text{-}occurrences}(\bar{c}) + \eta\, k_{synset}(\bar{c}) + \theta\, k_{mvn}(\bar{c})$


● KE based on co-occurrences: $k_{co\text{-}occurrences} = \varsigma \frac{\chi^2}{\max_{cluster} \chi^2}$

● KE based on WordNet Synsets: $k_{synset} = \eta \frac{kw_{synset}}{\max(kw_{synset})}$

● KE by means of Multivariate Normal Distribution (MVN): $k_{mvn} = \theta \frac{kw_{mvn}}{\max(kw_{mvn})}$


Evaluations
Test #   α      β      γ      δ      ε      p
1        0.10   0.10   0.30   0.25   0.25   1.0
2        0.20   0.15   0.15   0.25   0.25   0.7
3        0.15   0.25   0.30   0.15   0.15   1.0

Test #   Concept      A         B       C        D       E       W
1        network      0.100     0.100   0.021    0.178   0.250   0.649
         access       0.001     0.001   0.154    0.239   0.250   0.646
         subset       6.32E-4   0.001   0.150    0.239   0.250   0.641
2        network      0.200     0.150   0.0105   0.178   0.250   0.789
3        network      0.150     0.250   0.021    0.146   0.150   0.717
         user         0.127     0.195   0.022    0.146   0.150   0.641
         number       0.113     0.187   0.022    0.146   0.150   0.619
         individual   0.103     0.174   0.020    0.146   0.150   0.594


Generalization of similar concepts
Pairwise clustering
Takes into account the description of each concept, consisting of
a binary vector that represents the presence or absence (1 or 0,
respectively) of a <subject,complement> relation between
the involved concepts. The Hamming distance between two such
vectors provides a similarity evaluation between the concepts.
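
A small sketch of the distance between two concept descriptions follows. Dividing by the vector length is an assumption made here, suggested by the fractional thresholds used in the evaluation; the slides do not say whether the distance is normalized.

```python
def hamming_distance(u, v, normalized=True):
    """Count the positions where two binary vectors differ.

    With normalized=True the count is divided by the vector length, so
    the result lies in [0, 1] (an assumption, see the lead-in).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diff = sum(a != b for a, b in zip(u, v))
    return diff / len(u) if normalized else diff

# Hypothetical descriptions of two concepts over four other concepts.
concept_x = [1, 0, 1, 1]
concept_y = [1, 1, 1, 0]
distance = hamming_distance(concept_x, concept_y)              # 2 of 4 differ
raw = hamming_distance(concept_x, concept_y, normalized=False)
```

Concept pairs whose distance falls below a chosen threshold would then be candidates for generalization.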


WordNet
WordNet¹ is an external resource that has some useful
properties:
1. lexical taxonomy
2. each concept is described as a set of synonyms (synset)
3. synsets are interlinked by means of conceptual-
semantic and lexical relations

We focus on hypernymy, a relation that links the
current synset to more general ones.

1. http://wordnet.princeton.edu/


Taxonomical similarity function
● More general: provides a similarity value on the basis of
common relations, without focusing on the specific path.
● More specific: provides a similarity value on the basis of
common relations, relying on the specific path.


WSD Domain Driven
One Domain per Discourse assumption: many uses of a word
in a coherent portion of text tend to share the same domain.
● Prevalent domain identification
● Extraction of all synsets for each term
● Extraction of all domains for each synset
● Choice of the synset in the prevalent domain
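
The steps above can be sketched on toy data, in one plausible order: collect synsets and their domains, find the prevalent domain, then pick for each term a synset belonging to it. The synset identifiers, domain labels, function names, the per-synset vote counting, and the fallback to the first listed synset are all hypothetical choices for this example; no real WordNet lookup is performed.

```python
from collections import Counter

def prevalent_domain(terms, synsets_of, domains_of):
    """Return the domain shared by the most candidate synsets."""
    counts = Counter()
    for term in terms:
        for synset in synsets_of.get(term, []):
            counts.update(domains_of.get(synset, []))
    return counts.most_common(1)[0][0] if counts else None

def disambiguate(terms, synsets_of, domains_of):
    """Pick, for each term, a synset in the prevalent domain."""
    domain = prevalent_domain(terms, synsets_of, domains_of)
    chosen = {}
    for term in terms:
        candidates = synsets_of.get(term, [])
        matching = [s for s in candidates if domain in domains_of.get(s, [])]
        # Assumption: fall back to the first listed synset if none matches.
        chosen[term] = matching[0] if matching else (candidates[0] if candidates else None)
    return domain, chosen

# Hypothetical synsets and domains.
synsets_of = {
    "network": ["network#computing", "network#broadcast"],
    "platform": ["platform#politics", "platform#computing"],
    "access": ["access#computing"],
}
domains_of = {
    "network#computing": ["computer_science"],
    "network#broadcast": ["telecommunication"],
    "platform#politics": ["politics"],
    "platform#computing": ["computer_science"],
    "access#computing": ["computer_science"],
}
domain, chosen = disambiguate(["network", "platform", "access"], synsets_of, domains_of)
```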


Evaluations
Two toy experiments have been performed, with the Hamming
distance threshold set to 0.001 and 0.0001 respectively,
while the taxonomical similarity function threshold was kept
equal to 0.4.


Reasoning ‘by association’
Breadth-First Search
Given two nodes (concepts), a Breadth-First Search is started
from each of them: each search looks for the other's frontier,
until the two frontiers meet at common nodes.
The path is then reconstructed by going backward to the roots
in both directions.
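
A minimal sketch of such a two-sided search is given below, assuming an undirected graph stored as an adjacency dictionary. Verb labels on the edges are omitted for brevity, and the toy graph is hypothetical. The sketch returns one connecting path as soon as the frontiers meet; it does not claim to return the shortest one.

```python
from collections import deque

def bidirectional_bfs(graph, start, goal):
    """Search from both ends; return a connecting path or None."""
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full level; report the first node seen by both searches.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    frontier.append(neighbor)
        return None

    while frontier_s and frontier_g:
        meet = expand(frontier_s, parents_s, parents_g)
        if meet is None:
            meet = expand(frontier_g, parents_g, parents_s)
        if meet is not None:
            # Walk back to the start, then forward to the goal.
            path, node = [], meet
            while node is not None:
                path.append(node)
                node = parents_s[node]
            path.reverse()
            node = parents_g[meet]
            while node is not None:
                path.append(node)
                node = parents_g[node]
            return path
    return None

# Hypothetical concept graph.
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
}
path = bidirectional_bfs(graph, "adult", "internet")
```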


Reasoning ‘by association’
Evaluations
The table below shows a sample of possible outcomes.
E.g., an interpretation of case 5 can be:
“the adults write about freedom and use platform, that is
recognized as a technology, as well as the internet”.


Conclusions
This work proposes an approach to automatically extract a
conceptual taxonomy from natural language texts.

It works by mixing different techniques in order to:
● identify relevant terms/concepts in text;
● generalize similar concepts;
● perform some kind of reasoning “by association”.

Preliminary experiments show that this approach can be viable
although extensions and refinements are needed.
A reliable outcome might help users to understand the text
content, and machines to automatically perform some kind of
reasoning on the taxonomy.

Future works
1. Extending the knowledge representation formalism to
express negation.

2. Defining a strategy to make a better choice of weights in
Relevance Weight computation.

3. Enriching the adjacency matrix to improve concept
descriptions.

4. Exploring alternatives to ODD, to overcome its limits.

5. Extending the taxonomical similarity measures, which currently
take into account only the hypernym relation, with other relations
to obtain a more accurate similarity.

6. Defining a strategy to prefer one verb rather than keeping all
of them in the reasoning ‘by association’ phase.


References
[1] Dan Klein and Christopher D. Manning. Fast exact
inference with a factored model for natural language parsing.
In Advances in Neural Information Processing Systems,
volume 15. MIT Press, 2003.
[2] Marie-Catherine de Marneffe, Bill MacCartney, and
Christopher D. Manning. Generating typed dependency parses
from phrase structure trees. In LREC, 2006.
[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing
an ontology based on hub words. In ISMIS’03, pages 93–97,
2003.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
and I.H. Witten. The WEKA data mining software: an update.
SIGKDD Explorations, 11(1):10–18, 2009.

