A Fuzzy Logic Intelligent Agent for Information Extraction
 A Fuzzy Logic Intelligent Agent for Information Extraction: Introducing a New Fuzzy Logic-Based Term Weighting Scheme (Jorge Ropero, Ariel Gomez, Alejandro Carrasco, Carlos Leon; Department of Electronic Technology, University of Seville, Spain; October 2011)
 A New Fuzzy Logic Based Information Retrieval Model (Slawomir Zadrozny, Janusz Kacprzyk; Polish Academy of Sciences; January 2008)
 Information Extraction in a Set of Knowledge Using a Fuzzy Logic Based Intelligent Agent (Jorge Ropero, Ariel Gomez, Alejandro Carrasco, Carlos Leon; University of Seville, Spain; August 2007)
 A method of Information Extraction (IE) in a set of knowledge is proposed to answer user consultations posed in natural language.
 The system is based on a fuzzy logic engine, which takes advantage of its flexibility for managing sets of accumulated knowledge.
 These sets can be organized in hierarchic levels by a tree structure.
 The eventual aim of this system is the implementation of an intelligent agent to manage the information contained in an Internet website (portal).
 The quantity of web information grows exponentially.
 Achieving both high recall and high precision is one of the most important objectives of IR.
 IR has been widely used for text classification, introducing approaches such as the Vector Space Model (VSM), the K-nearest-neighbor method, the Bayesian classification model, neural networks, and Support Vector Machines.
 VSM is the most frequently used model.
 We propose a fully novel method using Fuzzy Logic (FL) to extract both the required information and related information.
 The system gives answers to user consultations in natural language.
 It bears in mind that non-expert users tend not to be exact in their searches.
 Data Mining (DM): an automatic process of analyzing information in order to discover patterns and to build predictive models; some of its applications are e-commerce, text mining, e-learning, and marketing.
 Information Retrieval (IR): an automatic search for the relevant information contained in a set of knowledge.
 Information Extraction (IE): once documents have been retrieved, the challenge is to extract the required information automatically, so its task is to identify the specific fragments of a document which constitute its main semantic content.
 The main objective of the designed system must be to let users find possible answers to what they are looking for in a huge set of knowledge.
 The whole set must be classified into different objects.
 These objects are the answers to possible user consultations, organized in hierarchic groups.
 One or more standard questions are assigned to every object.
 Different index terms must then be selected from each standard question in order to differentiate one object from the others.
 Finally, term weights are assigned to every index term for every level of the hierarchy in a scheme based on VSM; these weights are the inputs to the FL system.
 The system must return to the user the object corresponding to the standard question or questions most similar to the user consultation.
 The first step is to divide the whole set of knowledge into objects. One or more questions in NL are assigned to every object; the answers to these questions must represent the desired object.
 The second step is the selection of the index terms, which are extracted from the questions. The index terms are the terms of the standard questions most related to the represented object.
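The two steps above can be sketched as a small data structure. This is a hypothetical layout (Topics containing Sections containing Objects, each Object carrying its standard questions and index terms); all names and values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KnowledgeObject:
    answer: str                                           # the content returned to the user
    standard_questions: List[str] = field(default_factory=list)
    index_terms: List[str] = field(default_factory=list)  # extracted from the questions

@dataclass
class Section:
    name: str
    objects: Dict[int, KnowledgeObject] = field(default_factory=dict)

@dataclass
class Topic:
    name: str
    sections: Dict[int, Section] = field(default_factory=dict)

# Mirrors the worked case later in the slides (Topic 12 / Section 6 / Object 2):
obj = KnowledgeObject(
    answer="Services available to virtual users",
    standard_questions=["Which services can I access as a virtual user "
                        "at the University of Seville?"],
    index_terms=["services", "virtual", "user"],
)
topic12 = Topic("Virtual University",
                sections={6: Section("Virtual User", objects={2: obj})})
```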
 We may consider mainly two methods for term weighting (TW):
1. Let an expert in the matter intuitively evaluate the importance of index terms. This method is simple, but it has the disadvantage of depending exclusively on the knowledge engineer.
2. Automate TW by means of a series of rules.
 Given the large quantity of information in a web portal, we choose the second option.
 We propose a VSM-based method for TW.
 The most widely used method for TW is the so-called TF-IDF method.
 Every index term has an associated weight. This weight has a value between 0 and 1, depending on the importance of the term in every hierarchic level.
 The greater the importance of the term in a level, the higher the weight of the term.
 The term weight need not be the same for every hierarchic level.
w(t,d) = tf(t,d) × log( |D| / |{d′ ∈ D : t ∈ d′}| )

Here tf(t,d) is the term frequency of term t in document d, and the logarithmic factor is the inverse document frequency (a global parameter): |D| is the total number of documents in the document set, and the denominator |{d′ ∈ D : t ∈ d′}| is the number of documents containing the term t.
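The TF-IDF weight defined above can be computed directly; a minimal sketch:

```python
import math

def tfidf_weight(tf, total_docs, docs_with_term):
    """w(t,d) = tf(t,d) * log(|D| / |{d' in D : t in d'}|)."""
    return tf * math.log(total_docs / docs_with_term)

# A term occurring 3 times in a document, present in 2 of 8 documents:
w = tfidf_weight(3, 8, 2)  # 3 * log(8/2)
```

A term present in every document gets weight 0, since log(|D|/|D|) = 0, which is exactly the "global parameter" behavior: terms that appear everywhere carry no discriminating information.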
 The fuzzy inference engine is the element of the intelligent agent which determines the degree of certainty with which a group of index terms belongs to each possible subset of the whole set of knowledge.
 The inference engine has several inputs: the weights of the selected index terms.
 The inference engine gives one output: the degree of certainty for a particular subset at a particular level.
 For the fuzzy engine, it is necessary to define:
1. The number of inputs to the inference engine.
2. Input fuzzy sets: input ranges, number of fuzzy sets, and shape and range of membership functions.
3. Output fuzzy sets: output ranges, number of fuzzy sets, and shape and range of membership functions.
4. Fuzzy rules, which are of the IF … THEN type.
5. The methods used for the AND and OR operations and for defuzzifying.
 All these parameters must be taken into account to find the optimal configuration for the inference engine.
 Once standard questions are defined, index terms are extracted from them. Index terms are the ones that best represent a standard question.
 Every index term is associated with its corresponding term weight. This weight has a value between 0 and 1 and depends on the importance of the term in a level.
 The higher the importance of a term in a level, the higher its term weight.
 The final aim of the intelligent agent must be to find the object or objects whose information is most similar to the user consultation.
Step 1: Web page identified by standard question(s)
- Web page: www.us.es/univirtual/internet
- Standard question: Which services can I access as a virtual user at the University of Seville?
Step 2: Locate the standard question(s) in the hierarchic structure
- Topic 12: Virtual University; Section 6: Virtual User; Object: 2
Step 3: Extract index terms
- Index terms: "services", "virtual", "user"
Step 4: Term weighting
- Discussed later
 The first goal is to check that the system makes a correct identification of standard questions with a degree of certainty higher than a certain threshold.
 This is related to the concept of recall.
 The second goal is to check whether the required standard question is among the three answers with the highest degree of certainty.
 This is related to precision.
 Test results for standard question recognition fit into five categories:
1. The correct question is the only one found, or is the one with the highest degree of certainty.
2. The correct question is among the two with the highest certainty, i.e., it has the second highest degree of certainty.
3. The correct question is among the three with the highest certainty, i.e., it has the third highest degree of certainty.
4. The correct question is found, but not among the three with the highest degree of certainty.
5. The correct question is not found.
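The five categories above can be computed mechanically from the ranked list of answers the system returns (question ids sorted by descending degree of certainty):

```python
def result_category(ranked, correct):
    """Return 1-5 as defined above: 1-3 = rank of the correct question,
    4 = found but below third place, 5 = not found at all."""
    if correct not in ranked:
        return 5
    pos = ranked.index(correct)          # 0-based rank by certainty
    return pos + 1 if pos < 3 else 4
```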
 The input range corresponds to the weight range for every index term.
 We considered three fuzzy sets, represented by the values LOW, MEDIUM and HIGH.
 All of them are triangular.
 The output, which gives the degree of certainty, is defined by the sets LOW, MEDIUM-LOW, MEDIUM-HIGH and HIGH.
 The input takes three fuzzy sets because this number is enough to produce coherent results without letting the number of rules grow too large.
 A center-of-gravity defuzzifier is used.
 The range of values for every input fuzzy set is:
 LOW, from 0.0 to 0.4, centered at 0.0.
 MEDIUM, from 0.2 to 0.8, centered at 0.5.
 HIGH, from 0.6 to 1.0, centered at 1.0.
 The range of values for every output fuzzy set is:
 LOW, from 0.0 to 0.4, centered at 0.0.
 MEDIUM-LOW, from 0.1 to 0.7, centered at 0.4.
 MEDIUM-HIGH, from 0.3 to 0.9, centered at 0.6.
 HIGH, from 0.6 to 1.0, centered at 1.0.
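The triangular sets above can be sketched with a single membership function. The (left, center, right) triples are the input ranges from this slide; LOW and HIGH are half-triangles, since their centers sit on a range endpoint.

```python
def tri(x, left, center, right):
    """Triangular membership: 1.0 at center, falling linearly to 0 at the edges."""
    if x < left or x > right:
        return 0.0
    if x < center:
        return (x - left) / (center - left)
    if x > center:
        return (right - x) / (right - center)
    return 1.0

INPUT_SETS = {
    "LOW":    (0.0, 0.0, 0.4),
    "MEDIUM": (0.2, 0.5, 0.8),
    "HIGH":   (0.6, 1.0, 1.0),
}

# A weight of 0.3 is partly LOW and partly MEDIUM, as the ranges overlap:
low = tri(0.3, *INPUT_SETS["LOW"])     # 0.25
med = tri(0.3, *INPUT_SETS["MEDIUM"])  # ~0.33
```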
Rule | Rule definition                            | Output
R1   | IF one or more inputs = HIGH               | HIGH
R2   | IF three inputs = MEDIUM                   | HIGH
R3   | IF two inputs = MEDIUM and one input = LOW | MEDIUM-HIGH
R4   | IF one input = MEDIUM and two inputs = LOW | MEDIUM-LOW
R5   | IF all inputs = LOW                        | LOW
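A crisp sketch of the rule table above. A real fuzzy engine would combine memberships with fuzzy AND/OR and defuzzify by center of gravity; here, as a simplification, each input weight is labelled by its strongest fuzzy set (the thresholds 0.29 and 0.71 approximate where the triangular sets from the previous slide intersect).

```python
def label(x):
    """Label an input weight by its dominant fuzzy set (crisp approximation)."""
    if x < 0.29:
        return "LOW"
    return "MEDIUM" if x < 0.71 else "HIGH"

def certainty(inputs):
    labels = [label(x) for x in inputs]
    if "HIGH" in labels:                 # R1: one or more inputs HIGH
        return "HIGH"
    mediums = labels.count("MEDIUM")
    if mediums == 3:                     # R2: three inputs MEDIUM
        return "HIGH"
    if mediums == 2:                     # R3: two MEDIUM, one LOW
        return "MEDIUM-HIGH"
    if mediums == 1:                     # R4: one MEDIUM, two LOW
        return "MEDIUM-LOW"
    return "LOW"                         # R5: all inputs LOW
```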
 Step 1: User query in natural language.
 "Which services can I access as a virtual user at the University of Seville?"
 Step 2: Index term extraction.
 TiW = term weight vector for Topic i.
 Step 3: Weight vectors are taken as inputs to the fuzzy engine for every Topic.
 TiO = fuzzy engine output for Topic i.
 *Only Topics 10 and 12 are considered, as they exceed the threshold (0.4 in our case).
 Step 4: Step 3 is repeated for the next hierarchic level, "Sections of the selected Topics".
 TiSjW = term weight vector for Topic i, Section j.
 TiSjO = fuzzy engine output for Topic i, Section j.
 Step 5: Step 3 is repeated for the next hierarchic level, "Objects of the selected Sections".
 TiSjOkW = term weight vector for Topic i, Section j, Object k.
 TiSjOkO = fuzzy engine output for Topic i, Section j, Object k.
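The five steps above amount to a top-down thresholded search. A hypothetical sketch, where `score` stands in for the fuzzy engine and the nested-dict tree layout (Topic → Section → Object, each node with its own weight vector) is illustrative:

```python
THRESHOLD = 0.4  # certainty threshold from the slides

def search(tree, score):
    """tree: {topic: (topic_weights, {section: (section_weights, {object: object_weights})})}
    score: weight vector -> certainty in [0, 1] (stand-in for the fuzzy engine)."""
    hits = []
    for topic, (tw, sections) in tree.items():
        if score(tw) < THRESHOLD:                # Step 3: filter Topics
            continue
        for sec, (sw, objects) in sections.items():
            if score(sw) < THRESHOLD:            # Step 4: filter Sections
                continue
            for obj, ow in objects.items():      # Step 5: score Objects
                c = score(ow)
                if c >= THRESHOLD:
                    hits.append(((topic, sec, obj), c))
    return sorted(hits, key=lambda h: -h[1])     # most certain object first

tree = {
    12: ([0.9], {6: ([0.8], {2: [0.7], 3: [0.1]})}),
    10: ([0.5], {1: ([0.2], {1: [0.9]})}),
}
results = search(tree, max)  # -> [((12, 6, 2), 0.7)]
```

Pruning whole Topics and Sections below the threshold is what keeps the number of fuzzy-engine evaluations small compared to scoring every Object directly.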
 We may consider two options to define these weights:
 An expert in the matter intuitively evaluates the importance of each index term.
 Simple, but it has the disadvantage of depending exclusively on the knowledge engineer; it is also not possible to automate.
 The generation of automated weights by means of a set of rules.
 The most widely used method is the TF-IDF method.
 A novel Fuzzy Logic-based method achieves better results in IE.
 The FL-based method has two main advantages:
1. It improves on the basic TF-IDF method by creating a table with all keywords and their corresponding weights for every object; this table is created in the phase of keyword extraction from standard questions.
2. The whole term weighting method is automated, and the level of expertise required of an operator is lower.
 The formula chosen for the tests was the one proposed by Liu et al. (2001):

Wik = ( tfik × log(N/nk + 0.001) ) / √( Σk=1..m [ tfik × log(N/nk + 0.001) ]² )

 Here, tfik is the frequency of occurrence of term i in the k-th subset (Topic/Section/Object), and nk is the number of subsets to which the term Ti is assigned in a collection of N objects.
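The formula above can be sketched directly. The slide's index notation is ambiguous about what the denominator sums over; normalising over the subset's m index terms (passed as a list of (tf, n) pairs) is an assumption of this sketch.

```python
import math

def liu_weight(tf, n, N, terms):
    """W = tf*log(N/n + 0.001), normalised by the Euclidean norm of the raw
    weights of all m index terms; terms is a list of (tf_j, n_j) pairs."""
    raw = lambda tf_j, n_j: tf_j * math.log(N / n_j + 0.001)
    denom = math.sqrt(sum(raw(t, nn) ** 2 for t, nn in terms))
    return raw(tf, n) / denom

# With a single index term the normalised weight is 1.0 by construction:
w = liu_weight(8, 3, 12, [(8, 3)])
```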
 As an example, we use the term 'virtual' from the previous example.
 At Topic level:
o 'Virtual' appears 8 times in Topic 12 (tf = 8, k = 12).
o 'Virtual' appears twice in other Topics (nk = 3).
o There are 12 Topics in total (N = 12).
o Substituting, Wik = 0.20.
 At Section level:
o 'Virtual' appears 3 times in Section 12.6 (tf = 3, k = 6).
o 'Virtual' appears 5 times in other Sections of Topic 12 (nk = 6).
o There are 6 Sections in Topic 12 (N = 6).
o Substituting, Wik = 0.17.
 At Object level:
o 'Virtual' appears once in Object 12.6.2 (tf = 1, k = 2).
o 'Virtual' appears twice in other Objects of Section 12.6 (nk = 3).
o There are 3 Objects in Section 12.6 (N = 3).
o Substituting, Wik = 0.01.
 TF-IDF has the disadvantage of not considering the degree to which an index term identifies the object by itself.
 The FL-based term weighting method is defined by four questions that must be answered to determine the TW of an index term:
1. Question 1: How often does the index term appear in other subsets? (related to IDF)
2. Question 2: How often does the index term appear in its own subset? (related to TF)
3. Question 3: Does the index term undoubtedly define an object by itself?
4. Question 4: Is the index term tied to another one?
 The answers to these questions give a series of values which are the inputs to a Fuzzy Logic system called the Weight Assigner (WA). The output of the WA is the definitive weight for the corresponding index term.
 Part of the term weight is associated with the question "How often does an index term appear in other subsets?" (Q1).
 It is given by a value between 0 and 1:
 0 if it appears many times;
 1 if it does not appear in any other subset.
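One possible realisation of the Q1 mapping: the slides give only the two endpoints (1 when a term appears in no other subset, approaching 0 when it appears in many), so the linear interpolation between them is an assumption of this sketch.

```python
def q1_value(other_subsets_with_term, total_other_subsets):
    """Map 'how often does the term appear in other subsets' to [0, 1]:
    1.0 if it appears nowhere else, 0.0 if it appears in all other subsets."""
    if total_other_subsets == 0:
        return 1.0
    return 1.0 - other_subsets_with_term / total_other_subsets

q1_value(0, 11)   # 1.0: the term is unique to its own subset
q1_value(11, 11)  # 0.0: the term appears in every other subset
```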
 Term weight values for every Topic for Q1:
 Term weight values for every Section for Q1:
 Term weight values for every Object for Q1:
 The term weight associated with "How often does an index term appear in its own subset?" (Q2) is found analogously.
 Q2 is meaningless at the Object level.
 Example term weights for every Topic and Section:
 Does a term undoubtedly define a standard question? (Q3)
 The answer is subjective, with the possible values "Yes", "Rather" and "No".
 Term weight values for Q3:
 Is an index term tied to another one? (Q4)
Rule | Rule definition                              | Output
R1   | IF Q1 = HIGH and Q2 != LOW                   | At least MEDIUM-HIGH
R2   | IF Q1 = MEDIUM and Q2 = HIGH                 | At least MEDIUM-HIGH
R3   | IF Q1 = HIGH and Q2 = LOW                    | Depends on other Questions
R4   | IF Q1 = HIGH and Q2 = LOW                    | Depends on other Questions
R5   | IF Q3 = HIGH                                 | At least MEDIUM-HIGH
R6   | IF Q4 = LOW                                  | Descends a level
R7   | IF Q4 = MEDIUM                               | If the output is MEDIUM-LOW, it descends to LOW
R8   | IF (R1 and R2) or (R1 and R5) or (R2 and R5) | HIGH
R9   | In any other case                            | MEDIUM-LOW
Method        | Cat1         | Cat2         | Cat3       | Cat4       | Cat5        | Total
TF-IDF method | 466 (50.98%) | 223 (24.40%) | 53 (5.80%) | 79 (8.64%) | 93 (10.18%) | 914
FL method     | 710 (77.68%) | 108 (11.82%) | 27 (2.95%) | 28 (3.06%) | 41 (4.49%)  | 914