Università degli studi di Bari “Aldo Moro”
                                 Dipartimento di Informatica




                      Cooperating Techniques for
             Extracting Conceptual Taxonomies from Text
                                   S. Ferilli, F. Leuzzi, F. Rotella
L.A.C.A.M.
http://lacam.di.uniba.it:8000

                AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence
                             Workshop on Mining Complex Patterns (MCP 2011)
                                     Palermo, Italy, September 17, 2011
Overview
          1. Introduction & Objectives
          2. Extraction of knowledge from text
          3. Knowledge representation formalism
          4. Identification of relevant concepts
          5. Generalization of similar concepts
          6. Reasoning ‘by association’
          7. Conclusions & Future work




Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella   2
Introduction
          The spread of electronic documents and document
          repositories has created a need for automatic techniques
          that understand and handle document content, in order to
          help users satisfy their information needs.


          Full Text Understanding is not trivial, due to:
          1. the intrinsic ambiguity of natural language;
          2. the huge amount of common-sense and conceptual background
             knowledge it requires.


          Lexical and/or conceptual taxonomies help with these
          problems, but building them manually is very costly and
          error-prone.
Introduction
          The lack of such taxonomies is a strong motivation for the
          automatic construction of conceptual networks by mining
          large amounts of natural language documents.

          However, even assuming a correct knowledge representation,
          we are still far from simulating human abilities.

Objectives

          1. Definition of a representation formalism for knowledge
             extracted from natural language texts

          2. Extraction of concepts and relevance assessment

          3. Generalization of concepts having similar descriptions

          4. Definition of a kind of reasoning by concept association that
             looks for possible indirect connections between two
             identified concepts




Extraction of knowledge from text
          Knowledge is extracted by processing each sentence separately.

          Pipeline: Stanford Parser [1] -> Stanford Dependencies [2]

          The final output of the Stanford Dependencies is a typed
          syntactic structure of each sentence.



Knowledge representation formalism
          Among all grammatical roles played by words in a sentence,
          only subject, verb and complement have been considered.
          In the final conceptual graph subjects and complements will
          represent concepts, while verbs will express relations between
          them.




          [Figure: <subject, verb, complement> and <subject, complement>]
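The representation described above can be sketched as follows. This is only a minimal illustration, assuming the triples are already available as Python tuples; the function name `build_concept_graph` and the sample triples are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_concept_graph(triples):
    """Map each (subject, complement) pair to the set of verbs linking them.

    Subjects and complements play the role of concepts (nodes);
    verbs label the relations (edges) between them.
    """
    graph = defaultdict(set)
    for subject, verb, complement in triples:
        graph[(subject, complement)].add(verb)
    return dict(graph)


# Hypothetical triples, as they might come out of a dependency parse.
triples = [
    ("user", "access", "network"),
    ("user", "use", "network"),
    ("network", "connect", "individual"),
]

graph = build_concept_graph(triples)
# All concepts that appear as subject or complement.
concepts = sorted({concept for pair in graph for concept in pair})
```

Keeping every verb per concept pair mirrors the idea that verbs express relations between concepts, while the pairs alone give the plain graph structure.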




Identification of relevant concepts
       Several techniques cooperate to identify relevant concepts:

       ●   Hub Words [3]: high-frequency words whose relevance is
           computed as:

               \[ W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i) \]

           where: w_0 is the initial weight; n is the number of relationships;
                  w(t_i) is the tf*idf weight of the i-th word related to t.

       ●   Keyword extraction techniques from single documents.
       ●   EM Clustering provided by Weka [4], based on Euclidean
           distance.
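The Hub Words formula can be read as a plain weighted sum. The sketch below assumes that the sum runs over the n related words, so that n is simply the number of weights passed in; the function name and the sample numbers are illustrative only.

```python
def hub_word_weight(w0, related_weights, alpha, beta, gamma):
    """W(t) = alpha * w0 + beta * n + gamma * sum of w(t_i).

    w0              -- initial weight of the word t
    related_weights -- tf*idf weights w(t_i) of the words related to t
    """
    n = len(related_weights)  # number of relationships of t
    return alpha * w0 + beta * n + gamma * sum(related_weights)


# Hypothetical values: a word with two related words.
example = hub_word_weight(0.5, [0.2, 0.3], alpha=0.5, beta=0.1, gamma=0.4)
```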


Identification of relevant concepts
          Inspired by the Hub Words approach, we have defined a
          Relevance Weight:

          \[ W(\bar{c}) = \underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A} + \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B} + \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C} + \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D} + \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E} \]

          where: \alpha + \beta + \gamma + \delta + \varepsilon = 1

          Nodes in the network are ranked by decreasing Relevance
          Weight.
          A suitable cut-point in the ranking is determined by choosing
          the first item such that:

          \[ W(c_k) - W(c_{k+1}) \ge p \cdot \max_{i=0,\dots,n-1} \bigl( W(c_i) - W(c_{i+1}) \bigr) \]

          where: p \in [0, 1]
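The cut-point rule can be sketched as below. The slides do not say whether the item c_k itself is kept, so this sketch assumes that concepts c_0..c_k are retained; the function name `cut_point` and the sample weights are hypothetical.

```python
def cut_point(weights, p):
    """Return the index k of the first item in the ranking whose gap to the
    next item is at least p times the largest gap between consecutive items."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    ranked = sorted(weights, reverse=True)  # decreasing Relevance Weight
    if len(ranked) < 2:
        raise ValueError("at least two weights are needed")
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return k
    return len(gaps) - 1  # unreachable for p in [0, 1], kept for safety


# Hypothetical ranking: the largest drop is between the 3rd and 4th items.
weights = [0.9, 0.85, 0.8, 0.4, 0.35]
strict = cut_point(weights, p=1.0)   # waits for the largest gap
lenient = cut_point(weights, p=0.1)  # accepts the first sizeable gap
```

With p = 1 the cut falls exactly at the largest gap, while smaller values of p cut earlier, at the first gap that is a sufficient fraction of the largest one.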
Identification of relevant concepts
               Relevance Weight in detail
                          Definition of the Initial Weight

          The whole set of triples <subject,verb,complement> is
          represented in a Concepts x Attributes matrix V, recalling the
          classical Terms x Documents Vector Space Model.

          Resembling tf*idf:

          \[ \frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|} \]

          Therefore component A is:

          \[ \alpha \frac{w(\bar{c})}{\max_c w(c)} \]

          where w(c) is the initial weight assigned to node c, computed
          according to the above tf*idf scheme.
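A possible reading of the tf*idf-like scheme is sketched below. It assumes that f_{i,j} is the frequency of concept c_i for attribute a_j, that |A| is the number of attributes, and that the denominator of the logarithm counts the attributes in which c_i occurs. How the per-entry values are aggregated into a single w(c) is not stated in the slides, so the sketch stops at the weighted matrix; all names are illustrative.

```python
import math


def tfidf_like(V):
    """Weight each entry of a Concepts x Attributes frequency matrix V.

    entry(i, j) = f[i][j] / sum_k f[k][j] * log(|A| / #attributes containing c_i)
    """
    n_attrs = len(V[0])
    # Column totals: sum over all concepts k of f[k][j].
    col_sums = [sum(row[j] for row in V) for j in range(n_attrs)]
    weighted = []
    for row in V:
        n_present = sum(1 for f in row if f > 0)
        idf = math.log(n_attrs / n_present) if n_present else 0.0
        weighted.append([
            (f / col_sums[j]) * idf if col_sums[j] else 0.0
            for j, f in enumerate(row)
        ])
    return weighted


# Hypothetical 3 concepts x 3 attributes frequency matrix.
V = [
    [2, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
weighted = tfidf_like(V)
```

As with classical tf*idf, a concept that occurs with every attribute (the third row here) gets a zero weight, because it does not discriminate among attributes.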

Identification of relevant concepts
               Relevance Weight in detail
                                   Connections Number
          Component B considers the number of connections (edges) in
          which the concept \bar{c} is involved:

          \[ \beta \frac{e(\bar{c})}{\max_c e(c)} \]

                          Neighborhood Weight Summary
          Component C takes into account the average
          initial weight of all neighbors of \bar{c}:

          \[ \gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})} \]
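Components B and C only need the graph structure and the initial weights. The sketch below assumes an undirected graph without parallel edges, stored as an adjacency dictionary, so that e(c) equals the number of neighbors of c; the graph, the weights and the function names are hypothetical.

```python
def component_b(adj, node, beta):
    """beta * e(node) / max_c e(c), where e counts the edges of a node."""
    max_edges = max(len(neighbors) for neighbors in adj.values())
    return beta * len(adj[node]) / max_edges


def component_c(adj, w, node, gamma):
    """gamma * average initial weight of the neighbors of node."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0  # an isolated node has no neighborhood to average
    return gamma * sum(w[n] for n in neighbors) / len(neighbors)


# Hypothetical star-shaped graph centered on "network".
adj = {
    "network": ["user", "access", "subset"],
    "user": ["network"],
    "access": ["network"],
    "subset": ["network"],
}
w = {"network": 1.0, "user": 0.4, "access": 0.2, "subset": 0.3}

b_network = component_b(adj, "network", beta=0.1)
c_network = component_c(adj, w, "network", gamma=0.3)
c_user = component_c(adj, w, "user", gamma=0.3)
```

In this toy graph the hub scores highest on B, while its poorly connected neighbors score highest on C because their only neighbor is the heavy hub.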

Identification of relevant concepts
               Relevance Weight in detail
                            Inverse Distance from Center
          Component D represents the closeness to the center of the cluster:

          \[ \delta \frac{d_M - d(\bar{c})}{d_M} \]

                                           KE Influence
          Component E takes into account the outcome of three KE
          (keyword extraction) techniques, suitably weighted:

          \[ \varepsilon \frac{k(\bar{c})}{\max_c k(c)} \]

          where:

          \[ k(\bar{c}) = \varsigma k_{\text{co-occurrences}}(\bar{c}) + \eta k_{\text{synset}}(\bar{c}) + \theta k_{\text{mvn}}(\bar{c}) \]
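Putting the five components together, the Relevance Weight is a convex combination of normalized scores. The sketch below assumes that all raw quantities have already been computed, and that d_M denotes the maximum distance from the cluster center (the slides do not define it explicitly). The function name, parameter names and sample numbers are hypothetical.

```python
def relevance_weight(w, w_max, e, e_max, neighbor_avg_w,
                     d, d_max, k, k_max, coeffs):
    """Combine components A..E into the Relevance Weight W.

    coeffs is (alpha, beta, gamma, delta, epsilon) and must sum to 1.
    """
    alpha, beta, gamma, delta, epsilon = coeffs
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    a = alpha * w / w_max              # A: normalized initial weight
    b = beta * e / e_max               # B: normalized number of edges
    c = gamma * neighbor_avg_w         # C: average weight of the neighbors
    dd = delta * (d_max - d) / d_max   # D: closeness to the cluster center
    ee = epsilon * k / k_max           # E: normalized keyword-extraction score
    return a + b + c + dd + ee


# Hypothetical raw values with equal coefficients.
W = relevance_weight(w=2, w_max=4, e=3, e_max=6, neighbor_avg_w=0.5,
                     d=1, d_max=4, k=1, k_max=2,
                     coeffs=(0.2, 0.2, 0.2, 0.2, 0.2))
```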

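Identification of relevant concepts
               Relevance Weight in detail
          ●   KE based on co-occurrences:

              \[ k_{\text{co-occurrences}} = \varsigma \frac{\chi^2}{\max_{\text{cluster}} \chi^2} \]

          ●   KE based on WordNet Synsets:

              \[ k_{\text{synset}} = \eta \frac{kw_{\text{synset}}}{\max(kw_{\text{synset}})} \]

          ●   KE by means of Multivariate Normal Distribution (MVN):

              \[ k_{\text{mvn}} = \theta \frac{kw_{\text{mvn}}}{\max(kw_{\text{mvn}})} \]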


Identification of relevant concepts
                                         Evaluations
          Test #    α       β       γ       δ       ε       p
          1         0.10    0.10    0.30    0.25    0.25    1.0
          2         0.20    0.15    0.15    0.25    0.25    0.7
          3         0.15    0.25    0.30    0.15    0.15    1.0


          Test #    Concept       A          B        C         D        E        W
          1         network       0.100      0.100    0.021     0.178    0.250    0.649
                    access        0.001      0.001    0.154     0.239    0.250    0.646
                    subset        6.32E-4    0.001    0.150     0.239    0.250    0.641
          2         network       0.200      0.150    0.0105    0.178    0.250    0.789
          3         network       0.150      0.250    0.021     0.146    0.150    0.717
                    user          0.127      0.195    0.022     0.146    0.150    0.641
                    number        0.113      0.187    0.022     0.146    0.150    0.619
                    individual    0.103      0.174    0.020     0.146    0.150    0.594
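As a quick arithmetic check on the two tables, the coefficients of each test should sum to 1 and each W should equal A + B + C + D + E up to rounding of the displayed digits. The snippet below copies the figures from the tables; the tolerance of 0.0015 is an assumption that accounts for three-decimal rounding.

```python
# Coefficients (alpha, beta, gamma, delta, epsilon) per test, from the first table.
coefficients = {
    1: (0.10, 0.10, 0.30, 0.25, 0.25),
    2: (0.20, 0.15, 0.15, 0.25, 0.25),
    3: (0.15, 0.25, 0.30, 0.15, 0.15),
}

# (test, concept, A, B, C, D, E, W), from the second table.
rows = [
    (1, "network", 0.100, 0.100, 0.021, 0.178, 0.250, 0.649),
    (1, "access", 0.001, 0.001, 0.154, 0.239, 0.250, 0.646),
    (1, "subset", 6.32e-4, 0.001, 0.150, 0.239, 0.250, 0.641),
    (2, "network", 0.200, 0.150, 0.0105, 0.178, 0.250, 0.789),
    (3, "network", 0.150, 0.250, 0.021, 0.146, 0.150, 0.717),
    (3, "user", 0.127, 0.195, 0.022, 0.146, 0.150, 0.641),
    (3, "number", 0.113, 0.187, 0.022, 0.146, 0.150, 0.619),
    (3, "individual", 0.103, 0.174, 0.020, 0.146, 0.150, 0.594),
]


def max_discrepancy(rows):
    """Largest absolute difference between the reported W and A+B+C+D+E."""
    return max(abs(sum(row[2:7]) - row[7]) for row in rows)


coefficients_ok = all(abs(sum(c) - 1.0) < 1e-9 for c in coefficients.values())
rows_ok = max_discrepancy(rows) < 0.0015
```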


Generalization of similar concepts
                         Pairwise clustering
          Each concept is described by a binary vector that
          represents the presence or absence (1 or 0, respectively)
          of a <subject,complement> relation between it and each of
          the other concepts. The Hamming distance between two such
          vectors evaluates how similar the concepts are.
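A minimal sketch of the pairwise comparison follows. Since the thresholds reported later in the slides are far below 1, the sketch assumes a Hamming distance normalized by the vector length, and it treats two concepts as candidates for generalization when their distance does not exceed the threshold. Both choices, as well as the names and the toy vectors, are assumptions.

```python
from itertools import combinations


def hamming(u, v):
    """Normalized Hamming distance: fraction of positions where u and v differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def similar_pairs(descriptions, threshold):
    """Pairs of concepts whose descriptions are within the given distance."""
    return [
        (a, b)
        for a, b in combinations(sorted(descriptions), 2)
        if hamming(descriptions[a], descriptions[b]) <= threshold
    ]


# Hypothetical binary descriptions (1 = relation present, 0 = absent).
descriptions = {
    "user": [1, 0, 1, 1],
    "individual": [1, 0, 1, 0],
    "network": [0, 1, 0, 0],
}
pairs = similar_pairs(descriptions, threshold=0.3)
```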




Generalization of similar concepts
                                         WordNet
            WordNet (1) is an external resource with some useful
            properties:
            1. it is a lexical taxonomy;
            2. each concept is described as a set of synonyms (synset);
            3. synsets are interlinked by means of conceptual-semantic
               and lexical relations.


            We focus on hypernymy, the relation that links a
            synset to more general ones.


            (1) http://wordnet.princeton.edu/




Generalization of similar concepts
            Taxonomical similarity function
    ●   More general: provides a similarity value on the basis of
        common relations, without focusing on the specific path.
    ●   More specific: provides a similarity value on the basis of
        common relations, relying on the specific path.
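The slides do not give the formulas of the two similarity functions, so the sketch below only illustrates the distinction with two stand-in measures over a toy hypernym hierarchy: a Jaccard overlap of ancestor sets (path-agnostic) and the shared prefix of the root-to-node paths (path-dependent). These are not the authors' functions; the hierarchy and all names are hypothetical.

```python
# Toy hypernym links: child -> more general parent (not real WordNet data).
parent = {
    "dog": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal",
    "mammal": "animal", "animal": "entity",
}


def path_to_root(node, parent):
    """Hypernym chain of node, listed from the root down to the node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def ancestor_jaccard(a, b, parent):
    """Path-agnostic: overlap between the sets of nodes on the two chains."""
    sa, sb = set(path_to_root(a, parent)), set(path_to_root(b, parent))
    return len(sa & sb) / len(sa | sb)


def shared_path_ratio(a, b, parent):
    """Path-dependent: length of the common prefix of the two chains,
    relative to the average chain length."""
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return 2 * common / (len(pa) + len(pb))


general = ancestor_jaccard("dog", "cat", parent)
specific = shared_path_ratio("dog", "cat", parent)
```

In a resource such as WordNet a synset can have more than one hypernym, which is where a path-dependent measure and a set-based one start to diverge more visibly than in this single-parent toy hierarchy.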




Generalization of similar concepts
                       Domain-Driven Word Sense Disambiguation (WSD)
          One Domain per Discourse (ODD) assumption: many uses of a word
          in a coherent portion of text tend to share the same domain.

          1. Identification of the prevalent domain
          2. Extraction of all synsets for each term
          3. Extraction of all domains for each synset
          4. Choice of the synset belonging to the prevalent domain
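The domain-driven disambiguation steps can be sketched with plain dictionaries standing in for the lexical resources. The synset identifiers and domain labels below are invented for illustration and are not real WordNet entries; picking the first matching synset when several share the prevalent domain is also an assumption.

```python
from collections import Counter


def prevalent_domain(term_synsets, synset_domains):
    """Most frequent domain over all synsets of all terms in the text."""
    counts = Counter(
        domain
        for synsets in term_synsets.values()
        for synset in synsets
        for domain in synset_domains.get(synset, ())
    )
    return counts.most_common(1)[0][0]


def disambiguate(term_synsets, synset_domains):
    """For each term, choose a synset that belongs to the prevalent domain."""
    domain = prevalent_domain(term_synsets, synset_domains)
    chosen = {}
    for term, synsets in term_synsets.items():
        matching = [s for s in synsets if domain in synset_domains.get(s, ())]
        chosen[term] = matching[0] if matching else None  # None: no sense fits
    return domain, chosen


# Hypothetical synsets per term and domains per synset.
term_synsets = {
    "network": ["network.computer", "network.broadcast"],
    "platform": ["platform.computing", "platform.stage"],
    "user": ["user.person"],
}
synset_domains = {
    "network.computer": ["computer_science"],
    "network.broadcast": ["telecommunication"],
    "platform.computing": ["computer_science"],
    "platform.stage": ["theatre"],
    "user.person": ["computer_science"],
}
domain, senses = disambiguate(term_synsets, synset_domains)
```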


Generalization of similar concepts
                                     Evaluations
          Two toy experiments have been performed, with the Hamming
          distance threshold set to 0.001 and 0.0001 respectively,
          while the taxonomical similarity threshold has been kept
          at 0.4.




Reasoning ‘by association’
                      Breadth-First Search
          Given two nodes (concepts), a Breadth-First Search starts
          from each of them. Each search checks whether the nodes it
          reaches have already been reached by the other one, until
          the two frontiers meet in a common node.
          The path is then reconstructed by going backward from the
          meeting node to both starting nodes.
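The search described above is a bidirectional Breadth-First Search. The sketch below assumes an undirected graph stored as an adjacency dictionary and expands the two searches one level at a time, alternately. The toy graph is loosely modeled on the kind of concepts mentioned in the slides but is purely hypothetical.

```python
from collections import deque


def bidirectional_bfs(adj, start, goal):
    """Return a path connecting start and goal, or None if they are not connected."""
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    queue_s, queue_g = deque([start]), deque([goal])

    def expand(queue, parents, other_parents):
        # Expand one full level; return a node already reached by the other search.
        for _ in range(len(queue)):
            node = queue.popleft()
            for neighbor in adj.get(node, ()):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    queue.append(neighbor)
        return None

    def trace(parents, node):
        # Walk backward from node to the root of one search.
        path = []
        while node is not None:
            path.append(node)
            node = parents[node]
        return path

    while queue_s and queue_g:
        meet = expand(queue_s, parents_s, parents_g)
        if meet is None:
            meet = expand(queue_g, parents_g, parents_s)
        if meet is not None:
            return trace(parents_s, meet)[::-1] + trace(parents_g, meet)[1:]
    return None


# Hypothetical undirected concept graph.
adj = {
    "freedom": ["adult"],
    "adult": ["freedom", "platform"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
    "island": [],
}
path = bidirectional_bfs(adj, "freedom", "internet")
```

Searching from both ends keeps each frontier small, which matters on large concept graphs; the verbs labeling the traversed edges can then be read off to interpret the connection.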




Reasoning ‘by association’
                                     Evaluations
          The table below shows a sample of possible outcomes.
          E.g., an interpretation of case 5 can be:
          “the adults write about freedom and use platform, that is
          recognized as a technology, as well as the internet”.




Conclusions
    This work proposes an approach to automatically extract conceptual
    taxonomies from natural language texts.


    It combines different techniques in order to:
    ●   identify relevant terms/concepts in text;
    ●   generalize similar concepts;
    ●   perform some kind of reasoning “by association”.


    Preliminary experiments show that this approach can be viable,
    although extensions and refinements are needed.
    A reliable outcome might help users understand the text
    content, and help machines automatically perform some kind of
    reasoning on the taxonomy.
Future work
          1. Extending the knowledge representation formalism to
             express negation.

          2. Defining a strategy to make a better choice of weights in
             Relevance Weight computation.

          3. Enriching the adjacency matrix to improve concept
             descriptions.

          4. Exploring alternatives to ODD, to overcome its limits.

          5. Extending the taxonomical similarity measures, which
             currently take into account only the hypernym relation:
             a more accurate similarity can be obtained by adding
             other relations.

          6. Defining a strategy to prefer one verb rather than keeping
             all of them in the reasoning ‘by association’ phase.

References
          [1] Dan Klein and Christopher D. Manning. Fast exact
          inference with a factored model for natural language parsing.
          In Advances in Neural Information Processing Systems,
          volume 15. MIT Press, 2003.
          [2] Marie-Catherine de Marneffe, Bill MacCartney, and
          Christopher D. Manning. Generating typed dependency parses
          from phrase structure trees. In LREC, 2006.
          [3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing
          an ontology based on hub words. In ISMIS’03, pages 93–97,
          2003.
          [4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
          and I.H. Witten. The WEKA data mining software: an update.
          SIGKDD Explorations, 11(1):10–18, 2009.




Cooperating Techniques for Extracting Conceptual Taxonomies from Text

  • 1. Università degli studi di Bari “Aldo Moro” Dipartimento di Informatica Cooperating Techniques for Extracting Conceptual Taxonomies from Text S. Ferilli, F. Leuzzi, F. Rotella L.A.C.A.M. http://lacam.di.uniba.it:8000 AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence Workshop on Mining Complex Patterns (MCP 2011) Palermo, Italy, September 17, 2011
  • 2. Overview 1. Introduction & Objectives 2. Extraction of knowledge from text 3. Knowledge representation formalism 4. Identification of relevant concepts 5. Generalization of similar concepts 6. Reasoning ‘by association’ 7. Conclusions & Future works Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 2
  • 3. Introduction The spread of electronic documents and document repositories has generated the need for automatic techniques to understand and handle the documents content in order to help users in satisfying their information needs. Full Text Understading is not trivial, due to: 1. intrinsic ambiguity of natural language; 2. huge amount of common sense and conceptual background knowledge. For facing these problems lexical and/or conceptual taxonomies are useful, even if manually building is very costly and error prone. Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 3
  • 4. Introduction This lack is a strong motivation towards automatic construction of conceptual networks by mining large amounts of documents in natural language. However, even assuming a correct knowledge representation, we are far to simulate human abilities yet. Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 4
  • 5. Objectives 1. Definition of a representation formalism for knowledge extracted from natural language texts 2. Extraction of concepts and relevance assessment 3. Generalization of concepts having similar descriptions 4. Definition of a kind of reasoning by concept association that looks for possible indirect connections between two identified concepts Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 5
  • 6. Extraction of knowledge from text Knowledge extracted by processing each sentence separately. Stanford Stanford Parser [1] Dependencies [2] The final output of the Stanford Dependencies is a typed syntactic structure of each sentence. Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 6
  • 7. Knowledge representation formalism Among all grammatical roles played by words in a sentence, only subject, verb and complement have been considered. In the final conceptual graph subjects and complements will represent concepts, while verbs will express relations between them. subject, subject, verb, complement complement Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 7
  • 8. Identification of relevant concept A mix of several techniques are brought to cooperation for identifying relevant concepts: ● Hub Words [3]: words having high frequency whose relevance is computed as: W (t )=α w 0 +β n+γ ∑ i=1 w (t i ) where: w0 , initial weight; n, # of relationships; w(ti), tf*idf weight of i-th word related to t. ● Keyword extraction techniques from single documents. ● EM Clustering provided by Weka [4] based on Euclidean distance. Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 8
  • 9. Identification of relevant concept Inspired to the Hub Words approach we have defined a Relevance Weight: A B C D E w (̄) c e(̄)c ∑( c , ̄ ) w (c ) d M −d ( c ) c ̄ k (̄) c W ( ̄ )=α c +β +γ +δ +ε max c w( c ) max c e ( c ) e( ̄ ) c dM max c k ( c ) where: α + β+γ +δ +ε =1 Nodes in the network are ranked by decreasing Relevance Weight. A suitable cut-point in the ranking is determined by choosing the first item such that: W ( c k )-W (c k+1 )≥ p⋅ max ( W ( c i )-W (c i+1 ) ) i =0,.. . , n−1 where: p∈ [ 0,1 ] Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 9
  • 10. Identification of relevant concept Relevance Weight in details Definition of the Initial Weight The whole set of triples <subject,verb,complement> is represented in a Concepts x Attributes matrix V recalling the classical Terms x Documents Vector Space Model. f i, j ∣A∣ Resembling tf*idf: ⋅log ∑ k f k, j ∣{ j : c i ∈a j }∣ w (c ) ̄ Therefore component A is: α max c w ( c) where w(c) is the initial weight assigned to node c computed according to the above tf*idf schema. Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 10
  • 11. Identification of relevant concept Relevance Weight in details Connections Number Component B considers the number of connections (edges) in which c is involved e(̄)c β max c e ( c ) Neighborhood Weight Summary Component C takes into account the average initial weight of all neighbors of c ∑ (c,c ) ̄ w ( c) γ e( c ) ̄ Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 11
  • 12. Identification of relevant concept Relevance Weight in details Inverse Distance form Center Component D represents the closeness to center of the cluster d M −d( c ) ̄ δ dM KE Influence Component E takes into account the outcome of three KE techniques suitably weighted: k (̄ ) c ε max c k (c ) where: k ( ̄ )=ςk co−occurrences ( ̄ )+ηk synset ( ̄ )+θk mvn ( ̄ ) c c c c Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 12
  • 13. Identification of relevant concept Relevance Weight in details 2 KE based on χ k co− occurrences=ς ● 2 co-occurrences max cluster χ kw synset ● KE based on k synset =η WordNet Synsets max ( kw synset ) KE by means kw mvn ● Multivariate Normal k mvn=θ max ( kw mvn ) Distribution (MVN) Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 13
  • 14. Identification of relevant concept Evaluations Test # α β γ δ ε p 1 0.10 0.10 0.30 0.25 0.25 1.0 2 0.20 0.15 0.15 0.25 0.25 0.7 3 0.15 0.25 0.30 0.15 0.15 1.0 Test # Concept A B C D E W 1 network 0.100 0.100 0.021 0.178 0.250 0.649 access 0.001 0.001 0.154 0.239 0.250 0.646 subset 6.32E-4 0.001 0.150 0.239 0.250 0.641 2 network 0.200 0.150 0.0105 0.178 0.250 0.789 3 network 0.150 0.250 0.021 0.146 0.150 0.717 user 0.127 0.195 0.022 0.146 0.150 0.641 number 0.113 0.187 0.022 0.146 0.150 0.619 individual 0.103 0.174 0.020 0.146 0.150 0.594 Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 14
  • 15. Generalization of similar concepts: Pairwise clustering. Takes into account the description of each concept, consisting of a binary vector that represents the presence or absence (1 or 0 respectively) of a <subject,complement> relation between the involved concepts. The Hamming distance between two such vectors is used to evaluate the similarity of the corresponding concepts.
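The comparison of binary concept descriptions can be sketched as below. The distance is normalized by the vector length, which is an assumption suggested by the fractional thresholds used in the evaluation; the concept vectors and function names are hypothetical.

```python
from itertools import combinations

def hamming_distance(u, v):
    """Fraction of positions in which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def similar_pairs(descriptions, threshold):
    """Return the concept pairs whose distance does not exceed the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(descriptions), 2)
        if hamming_distance(descriptions[a], descriptions[b]) <= threshold
    ]

# Hypothetical descriptions: one position per possible related concept.
descriptions = {
    "user":       [1, 0, 1, 0],
    "individual": [1, 0, 1, 0],
    "network":    [0, 1, 1, 1],
}
pairs = similar_pairs(descriptions, threshold=0.001)
```

With a threshold this small, only concepts with practically identical descriptions are paired, so the candidates passed on to the taxonomical similarity check are few.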
  • 16. Generalization of similar concepts: WordNet. WordNet (http://wordnet.princeton.edu/) is an external resource that has some useful properties:
      1. it is a lexical taxonomy;
      2. each concept is described as a set of synonyms (synset);
      3. synsets are interlinked by means of conceptual-semantic and lexical relations.
      We focus on hypernymy, the relation that links a synset to more general ones.
  • 17. Generalization of similar concepts: Taxonomical similarity function.
      More general: provides a similarity value on the basis of common relations, without focusing on the specific path.
      More specific: provides a similarity value on the basis of common relations, relying on the specific path.
  • 18. Generalization of similar concepts: Domain-driven WSD. One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain.
      ● Prevalent domain identification
      ● Extraction of all synsets for each term
      ● Extraction of all domains for each synset
      ● Choice of the synset in the prevalent domain
  • 19. Generalization of similar concepts: Evaluations. Two toy experiments were performed with the Hamming distance threshold set to 0.001 and 0.0001 respectively, while the threshold of the taxonomical similarity function was kept at 0.4.
  • 20. Reasoning ‘by association’: Breadth-First Search. Given two nodes (concepts), a Breadth-First Search starts from both of them. Each search checks whether its newly reached nodes belong to the frontier of the other, until the two frontiers meet in a common node. The path is then reconstructed by going backward from the meeting node to the two roots.
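The two-sided search described above can be sketched as a bidirectional Breadth-First Search over an adjacency dictionary. This is an illustrative sketch, not the authors' code: the graph, its edges, and the function name are hypothetical, and the verbs labeling the edges are ignored.

```python
from collections import deque

def bidirectional_bfs(graph, start, goal):
    """Return a path of concepts connecting start and goal, or None."""
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one whole level; return a node reached by both searches, if any.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nbr in graph.get(node, ()):
                if nbr not in parents:
                    parents[nbr] = node
                    if nbr in other_parents:
                        return nbr
                    frontier.append(nbr)
        return None

    def build_path(meet):
        # Walk back from the meeting node to start, then forward to goal.
        path, node = [], meet
        while node is not None:
            path.append(node)
            node = parents_s[node]
        path.reverse()
        node = parents_g[meet]
        while node is not None:
            path.append(node)
            node = parents_g[node]
        return path

    while frontier_s and frontier_g:
        meet = expand(frontier_s, parents_s, parents_g)
        if meet is None:
            meet = expand(frontier_g, parents_g, parents_s)
        if meet is not None:
            return build_path(meet)
    return None

# Hypothetical undirected concept graph.
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
    "isolated": [],
}
path = bidirectional_bfs(graph, "adult", "internet")
```

Alternating the two searches level by level keeps both frontiers small, which is the usual reason for preferring this scheme over a single search from one endpoint.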
  • 21. Reasoning ‘by association’ Evaluations The table below shows a sample of possible outcomes. E.g., an interpretation of case 5 can be: “the adults write about freedom and use platform, that is recognized as a technology, as well as the internet”.
  • 22. Conclusions. This work proposes an approach to automatically extract a conceptual taxonomy from natural language texts. It combines different techniques in order to:
      ● identify relevant terms/concepts in the text;
      ● generalize similar concepts;
      ● perform some kind of reasoning “by association”.
      Preliminary experiments show that this approach can be viable, although extensions and refinements are needed. A reliable outcome might help users understand the text content, and help machines automatically perform some kind of reasoning on the taxonomy.
  • 23. Future works.
      1. Extending the knowledge representation formalism to express negation.
      2. Defining a strategy to make a better choice of weights in the Relevance Weight computation.
      3. Enriching the adjacency matrix to improve concept descriptions.
      4. Exploring alternatives to ODD, to overcome its limits.
      5. Extending the taxonomical similarity measures, which currently take into account only the hypernym relation, with other relations to obtain a more accurate similarity.
      6. Defining a strategy to prefer one verb rather than keeping all of them in the reasoning ‘by association’ phase.
  • 24. References.
      [1] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.
      [2] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.
      [3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words. In ISMIS’03, pages 93–97, 2003.
      [4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.