A Fuzzy Logic Intelligent Agent for Information Extraction
 A Fuzzy Logic Intelligent Agent for Information Extraction: Introducing a New Fuzzy Logic-Based Term Weighting Scheme (Jorge Ropero, Ariel Gomez, Alejandro Carrasco, Carlos Leon; Department of Electronic Technology, University of Seville, Spain; October 2011)
 A New Fuzzy Logic Based Information Retrieval Model (Slawomir Zadrozny, Janusz Kacprzyk; Polish Academy of Sciences; January 2008)
 Information Extraction in a Set of Knowledge Using a Fuzzy Logic Based Intelligent Agent (Jorge Ropero, Ariel Gomez, Alejandro Carrasco, Carlos Leon; University of Seville, Spain; August 2007)
 A method of Information Extraction (IE) in a set of knowledge is proposed to answer user consultations posed in natural language.
 The system is based on a fuzzy logic engine, which takes advantage of its flexibility for managing sets of accumulated knowledge.
 These sets can be organized in hierarchic levels by a tree structure.
 The eventual aim of this system is the implementation of an intelligent agent to manage the information contained in an Internet website (portal).
 The quantity of web information grows exponentially.
 Achieving both high recall and high precision is one of the most important objectives of IR.
 IR has been widely used for text classification, introducing approaches such as the Vector Space Model (VSM), the K-nearest-neighbor method, the Bayesian classification model, neural networks, and Support Vector Machines.
 VSM is the most frequently used model.
 We propose a fully novel method using Fuzzy Logic (FL) to extract both the required information and related information.
 The system gives answers to user consultations in natural language.
 It bears in mind that non-expert users tend not to be exact in their searches.
 Data Mining (DM): an automatic process of analyzing information in order to discover patterns and to build predictive models; some of its applications are e-commerce, text mining, e-learning, and marketing.
 Information Retrieval (IR): an automatic search for the relevant information contained in a set of knowledge.
 Information Extraction (IE): once documents have been retrieved, the challenge is to extract the required information automatically, so its task is to identify the specific fragments of a document which constitute its main semantic content.
 The main objective of the designed system must be to let users find possible answers to what they are looking for in a huge set of knowledge.
 The whole set must be classified into different objects.
 These objects are the answers to possible user consultations, organized in hierarchic groups.
 One or more standard questions are assigned to every object.
 Different index terms must then be selected from each standard question in order to differentiate one object from the others.
 Finally, term weights are assigned to every index term for every level of the hierarchy in a scheme based on VSM; these weights are the inputs to the FL system.
 The system must return to the user the object corresponding to the standard question or questions most similar to the user consultation.
 The first step is to divide the whole set of knowledge into objects. One or more questions in NL are assigned to every object; the answers to these questions must represent the desired object.
 The second step is the selection of the index terms, which are extracted from the questions. The index terms are the terms of the standard questions most related to the represented object.
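The two steps above can be sketched as a small data structure. This is a hypothetical layout (Topics containing Sections containing Objects, each Object carrying its standard questions and index terms); all names and values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KnowledgeObject:
    answer: str                                           # the content returned to the user
    standard_questions: List[str] = field(default_factory=list)
    index_terms: List[str] = field(default_factory=list)  # extracted from the questions

@dataclass
class Section:
    name: str
    objects: Dict[int, KnowledgeObject] = field(default_factory=dict)

@dataclass
class Topic:
    name: str
    sections: Dict[int, Section] = field(default_factory=dict)

# Mirrors the worked case later in the slides (Topic 12 / Section 6 / Object 2):
obj = KnowledgeObject(
    answer="Services available to virtual users",
    standard_questions=["Which services can I access as a virtual user "
                        "at the University of Seville?"],
    index_terms=["services", "virtual", "user"],
)
topic12 = Topic("Virtual University",
                sections={6: Section("Virtual User", objects={2: obj})})
```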
 We may consider mainly two methods for term weighting (TW):
1. Let an expert in the matter intuitively evaluate the importance of index terms. This method is simple, but it has the disadvantage of depending exclusively on the knowledge engineer.
2. Automate TW by means of a series of rules.
 Given the large quantity of information in a web portal, we choose the second option.
 We propose a VSM-based method for TW.
 The most widely used method for TW is the so-called TF-IDF method.
 Every index term has an associated weight. This weight has a value between 0 and 1, depending on the importance of the term in every hierarchic level.
 The greater the importance of the term in a level, the higher the weight of the term.
 The term weight need not be the same for every hierarchic level.
w(t,d) = tf(t,d) × log( |D| / |{d′ ∈ D : t ∈ d′}| )

Here tf(t,d) is the term frequency of term t in document d, and the logarithmic factor is the inverse document frequency (a global parameter): |D| is the total number of documents in the document set, and the denominator |{d′ ∈ D : t ∈ d′}| is the number of documents containing the term t.
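The TF-IDF weight defined above can be computed directly; a minimal sketch:

```python
import math

def tfidf_weight(tf, total_docs, docs_with_term):
    """w(t,d) = tf(t,d) * log(|D| / |{d' in D : t in d'}|)."""
    return tf * math.log(total_docs / docs_with_term)

# A term occurring 3 times in a document, present in 2 of 8 documents:
w = tfidf_weight(3, 8, 2)  # 3 * log(8/2)
```

A term present in every document gets weight 0, since log(|D|/|D|) = 0, which is exactly the "global parameter" behavior: terms that appear everywhere carry no discriminating information.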
 The fuzzy inference engine is the element of the intelligent agent which determines the degree of certainty with which a group of index terms belongs to each possible subset of the whole set of knowledge.
 The inference engine has several inputs: the weights of the selected index terms.
 The inference engine gives one output: the degree of certainty for a particular subset at a particular level.
 For the fuzzy engine, it is necessary to define:
1. The number of inputs to the inference engine.
2. Input fuzzy sets: input ranges, number of fuzzy sets, and shape and range of membership functions.
3. Output fuzzy sets: output ranges, number of fuzzy sets, and shape and range of membership functions.
4. Fuzzy rules, which are of the IF … THEN type.
5. The methods used for the AND and OR operations and for defuzzifying.
 All these parameters must be taken into account to find the optimal configuration for the inference engine.
 Once standard questions are defined, index terms are extracted from them. Index terms are the ones that best represent a standard question.
 Every index term is associated with its corresponding term weight. This weight has a value between 0 and 1 and depends on the importance of the term in a level.
 The higher the importance of a term in a level, the higher its term weight.
 The final aim of the intelligent agent must be to find the object or objects whose information is most similar to the user consultation.
Step 1: Web page identified by standard question(s)
- Web page: www.us.es/univirtual/internet
- Standard question: Which services can I access as a virtual user at the University of Seville?
Step 2: Locate the standard question(s) in the hierarchic structure
- Topic 12: Virtual University; Section 6: Virtual User; Object: 2
Step 3: Extract index terms
- Index terms: "services", "virtual", "user"
Step 4: Term weighting
- Discussed later
 The first goal is to check that the system makes a correct identification of standard questions with a degree of certainty higher than a certain threshold.
 This is related to the concept of recall.
 The second goal is to check whether the required standard question is among the three answers with the highest degree of certainty.
 This is related to precision.
 Test results for standard question recognition fit into five categories:
1. The correct question is the only one found, or is the one with the highest degree of certainty.
2. The correct question is among the two with the highest certainty, i.e., it has the second highest degree of certainty.
3. The correct question is among the three with the highest certainty, i.e., it has the third highest degree of certainty.
4. The correct question is found, but not among the three with the highest degree of certainty.
5. The correct question is not found.
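The five categories above can be computed mechanically from the ranked list of answers the system returns (question ids sorted by descending degree of certainty):

```python
def result_category(ranked, correct):
    """Return 1-5 as defined above: 1-3 = rank of the correct question,
    4 = found but below third place, 5 = not found at all."""
    if correct not in ranked:
        return 5
    pos = ranked.index(correct)          # 0-based rank by certainty
    return pos + 1 if pos < 3 else 4
```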
 The input range corresponds to the weight range for every index term.
 We considered three fuzzy sets, represented by the values LOW, MEDIUM and HIGH.
 All of them are triangular.
 The output, which gives the degree of certainty, is defined by the sets LOW, MEDIUM-LOW, MEDIUM-HIGH and HIGH.
 The input takes three fuzzy sets because this number is enough to produce coherent results without letting the number of rules grow too large.
 A center-of-gravity defuzzifier is used.
 The range of values for every input fuzzy set is:
 LOW, from 0.0 to 0.4, centered at 0.0.
 MEDIUM, from 0.2 to 0.8, centered at 0.5.
 HIGH, from 0.6 to 1.0, centered at 1.0.
 The range of values for every output fuzzy set is:
 LOW, from 0.0 to 0.4, centered at 0.0.
 MEDIUM-LOW, from 0.1 to 0.7, centered at 0.4.
 MEDIUM-HIGH, from 0.3 to 0.9, centered at 0.6.
 HIGH, from 0.6 to 1.0, centered at 1.0.
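The triangular sets above can be sketched with a single membership function. The (left, center, right) triples are the input ranges from this slide; LOW and HIGH are half-triangles, since their centers sit on a range endpoint.

```python
def tri(x, left, center, right):
    """Triangular membership: 1.0 at center, falling linearly to 0 at the edges."""
    if x < left or x > right:
        return 0.0
    if x < center:
        return (x - left) / (center - left)
    if x > center:
        return (right - x) / (right - center)
    return 1.0

INPUT_SETS = {
    "LOW":    (0.0, 0.0, 0.4),
    "MEDIUM": (0.2, 0.5, 0.8),
    "HIGH":   (0.6, 1.0, 1.0),
}

# A weight of 0.3 is partly LOW and partly MEDIUM, as the ranges overlap:
low = tri(0.3, *INPUT_SETS["LOW"])     # 0.25
med = tri(0.3, *INPUT_SETS["MEDIUM"])  # ~0.33
```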
Rule | Rule definition                            | Output
R1   | IF one or more inputs = HIGH               | HIGH
R2   | IF three inputs = MEDIUM                   | HIGH
R3   | IF two inputs = MEDIUM and one input = LOW | MEDIUM-HIGH
R4   | IF one input = MEDIUM and two inputs = LOW | MEDIUM-LOW
R5   | IF all inputs = LOW                        | LOW
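A crisp sketch of the rule table above. A real fuzzy engine would combine memberships with fuzzy AND/OR and defuzzify by center of gravity; here, as a simplification, each input weight is labelled by its strongest fuzzy set (the thresholds 0.29 and 0.71 approximate where the triangular sets from the previous slide intersect).

```python
def label(x):
    """Label an input weight by its dominant fuzzy set (crisp approximation)."""
    if x < 0.29:
        return "LOW"
    return "MEDIUM" if x < 0.71 else "HIGH"

def certainty(inputs):
    labels = [label(x) for x in inputs]
    if "HIGH" in labels:                 # R1: one or more inputs HIGH
        return "HIGH"
    mediums = labels.count("MEDIUM")
    if mediums == 3:                     # R2: three inputs MEDIUM
        return "HIGH"
    if mediums == 2:                     # R3: two MEDIUM, one LOW
        return "MEDIUM-HIGH"
    if mediums == 1:                     # R4: one MEDIUM, two LOW
        return "MEDIUM-LOW"
    return "LOW"                         # R5: all inputs LOW
```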
 Step 1: User query in natural language.
 "Which services can I access as a virtual user at the University of Seville?"
 Step 2: Index term extraction.
 TiW = term weight vector for Topic i.
 Step 3: Weight vectors are taken as inputs to the fuzzy engine for every Topic.
 TiO = fuzzy engine output for Topic i.
 *Only Topics 10 and 12 are considered, as they exceed the threshold (0.4 in our case).
 Step 4: Step 3 is repeated for the next hierarchic level, "Sections of the selected Topics".
 TiSjW = term weight vector for Topic i, Section j.
 TiSjO = fuzzy engine output for Topic i, Section j.
 Step 5: Step 3 is repeated for the next hierarchic level, "Objects of the selected Sections".
 TiSjOkW = term weight vector for Topic i, Section j, Object k.
 TiSjOkO = fuzzy engine output for Topic i, Section j, Object k.
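The five steps above amount to a top-down thresholded search. A hypothetical sketch, where `score` stands in for the fuzzy engine and the nested-dict tree layout (Topic → Section → Object, each node with its own weight vector) is illustrative:

```python
THRESHOLD = 0.4  # certainty threshold from the slides

def search(tree, score):
    """tree: {topic: (topic_weights, {section: (section_weights, {object: object_weights})})}
    score: weight vector -> certainty in [0, 1] (stand-in for the fuzzy engine)."""
    hits = []
    for topic, (tw, sections) in tree.items():
        if score(tw) < THRESHOLD:                # Step 3: filter Topics
            continue
        for sec, (sw, objects) in sections.items():
            if score(sw) < THRESHOLD:            # Step 4: filter Sections
                continue
            for obj, ow in objects.items():      # Step 5: score Objects
                c = score(ow)
                if c >= THRESHOLD:
                    hits.append(((topic, sec, obj), c))
    return sorted(hits, key=lambda h: -h[1])     # most certain object first

tree = {
    12: ([0.9], {6: ([0.8], {2: [0.7], 3: [0.1]})}),
    10: ([0.5], {1: ([0.2], {1: [0.9]})}),
}
results = search(tree, max)  # -> [((12, 6, 2), 0.7)]
```

Pruning whole Topics and Sections below the threshold is what keeps the number of fuzzy-engine evaluations small compared to scoring every Object directly.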
 We may consider two options to define these weights:
 An expert in the matter intuitively evaluates the importance of each index term.
 Simple, but it has the disadvantage of depending exclusively on the knowledge engineer; it is also not possible to automate.
 The generation of automated weights by means of a set of rules.
 The most widely used method is the TF-IDF method.
 A novel Fuzzy Logic-based method achieves better results in IE.
 The FL-based method has two main advantages:
1. It improves on the basic TF-IDF method by creating a table with all keywords and their corresponding weights for every object; this table is created in the phase of keyword extraction from standard questions.
2. The whole term weighting method is automated, and the level of expertise required of an operator is lower.
 The formula chosen for the tests was the one proposed by Liu et al. (2001):

Wik = ( tfik × log(N/nk + 0.001) ) / √( Σk=1..m [ tfik × log(N/nk + 0.001) ]² )

 Here, tfik is the frequency of occurrence of term i in the k-th subset (Topic/Section/Object), and nk is the number of subsets to which the term Ti is assigned in a collection of N objects.
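The formula above can be sketched directly. The slide's index notation is ambiguous about what the denominator sums over; normalising over the subset's m index terms (passed as a list of (tf, n) pairs) is an assumption of this sketch.

```python
import math

def liu_weight(tf, n, N, terms):
    """W = tf*log(N/n + 0.001), normalised by the Euclidean norm of the raw
    weights of all m index terms; terms is a list of (tf_j, n_j) pairs."""
    raw = lambda tf_j, n_j: tf_j * math.log(N / n_j + 0.001)
    denom = math.sqrt(sum(raw(t, nn) ** 2 for t, nn in terms))
    return raw(tf, n) / denom

# With a single index term the normalised weight is 1.0 by construction:
w = liu_weight(8, 3, 12, [(8, 3)])
```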
 As an example, we use the term 'virtual' from the previous example.
 At Topic level:
o 'Virtual' appears 8 times in Topic 12 (tf = 8, k = 12).
o 'Virtual' appears twice in other Topics (nk = 3).
o There are 12 Topics in total (N = 12).
o Substituting, Wik = 0.20.
 At Section level:
o 'Virtual' appears 3 times in Section 12.6 (tf = 3, k = 6).
o 'Virtual' appears 5 times in other Sections of Topic 12 (nk = 6).
o There are 6 Sections in Topic 12 (N = 6).
o Substituting, Wik = 0.17.
 At Object level:
o 'Virtual' appears once in Object 12.6.2 (tf = 1, k = 2).
o 'Virtual' appears twice in other Objects of Section 12.6 (nk = 3).
o There are 3 Objects in Section 12.6 (N = 3).
o Substituting, Wik = 0.01.
 TF-IDF has the disadvantage of not considering the degree to which an index term identifies the object by itself.
 The FL-based term weighting method is defined by four questions that must be answered to determine the TW of an index term:
1. Question 1: How often does the index term appear in other subsets? (related to IDF)
2. Question 2: How often does the index term appear in its own subset? (related to TF)
3. Question 3: Does the index term undoubtedly define an object by itself?
4. Question 4: Is the index term tied to another one?
 The answers to these questions give a series of values which are the inputs to a Fuzzy Logic system called the Weight Assigner (WA). The output of the WA is the definitive weight for the corresponding index term.
 Part of the term weight is associated with the question "How often does an index term appear in other subsets?" (Q1).
 It is given by a value between 0 and 1:
 0 if it appears many times;
 1 if it does not appear in any other subset.
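One possible realisation of the Q1 mapping: the slides give only the two endpoints (1 when a term appears in no other subset, approaching 0 when it appears in many), so the linear interpolation between them is an assumption of this sketch.

```python
def q1_value(other_subsets_with_term, total_other_subsets):
    """Map 'how often does the term appear in other subsets' to [0, 1]:
    1.0 if it appears nowhere else, 0.0 if it appears in all other subsets."""
    if total_other_subsets == 0:
        return 1.0
    return 1.0 - other_subsets_with_term / total_other_subsets

q1_value(0, 11)   # 1.0: the term is unique to its own subset
q1_value(11, 11)  # 0.0: the term appears in every other subset
```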
 Term weight values for every Topic for Q1:
 Term weight values for every Section for Q1:
 Term weight values for every Object for Q1:
 The term weight associated with "How often does an index term appear in its own subset?" (Q2) is found analogously.
 Q2 is meaningless at the Object level.
 Example term weights for every Topic and Section:
 Does a term undoubtedly define a standard question? (Q3)
 The answer is subjective, with the possible values "Yes", "Rather" and "No".
 Term weight values for Q3:
 Is an index term tied to another one? (Q4)
Rule | Rule definition                              | Output
R1   | IF Q1 = HIGH and Q2 != LOW                   | At least MEDIUM-HIGH
R2   | IF Q1 = MEDIUM and Q2 = HIGH                 | At least MEDIUM-HIGH
R3   | IF Q1 = HIGH and Q2 = LOW                    | Depends on other Questions
R4   | IF Q1 = HIGH and Q2 = LOW                    | Depends on other Questions
R5   | IF Q3 = HIGH                                 | At least MEDIUM-HIGH
R6   | IF Q4 = LOW                                  | Descends a level
R7   | IF Q4 = MEDIUM                               | If the output is MEDIUM-LOW, it descends to LOW
R8   | IF (R1 and R2) or (R1 and R5) or (R2 and R5) | HIGH
R9   | In any other case                            | MEDIUM-LOW
Method        | Cat1         | Cat2         | Cat3       | Cat4       | Cat5        | Total
TF-IDF method | 466 (50.98%) | 223 (24.40%) | 53 (5.80%) | 79 (8.64%) | 93 (10.18%) | 914
FL method     | 710 (77.68%) | 108 (11.82%) | 27 (2.95%) | 28 (3.06%) | 41 (4.49%)  | 914