International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 11 | Nov 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 843
Semantic similarity measurement- A theoretical study of various
approaches
Asha Chandran T1, Sajina K2
1 Lecturer In Computer Engineering, Department of Computer Engineering, Govt. Womens’ Polytechnic College,
Kayamkulam, Kerala, India
2 Lecturer In Computer Engineering, Department of Computer Engineering, Govt. Polytechnic College Neyyattinkara,
Kerala, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Semantic similarity has a wide variety of
applications in Natural Language Processing (NLP).
Approaches to semantic similarity measurement fall into
categories such as dictionary-based, corpus-based,
knowledge-based and semantic-indexing-based. This paper studies
these approaches in detail. Comparison of short sentences or
phrases is given more emphasis since they attract more
interest than paragraphs or large documents.
Key Words: Semantic similarity, WordNet, Concept
similarity, Concept nodes, Path length, Information Content,
Least Common Subsumer (LCS)
1. INTRODUCTION
Semantic similarity between sentences is a measure of
their semantic relationship, that is, how closely they resemble
each other in meaning. Semantic similarity has a wide variety of
applications in areas such as text comparison, text
compression, dictionaries and reverse dictionaries,
plagiarism detection, ranking of search results and many more.
The accuracy of many of these applications depends
heavily on the accuracy of the semantic similarity
score. Comparison of short sentences or
phrases attracts more interest than comparison of paragraphs or
large documents.
In this article, we aim to present and compare various
methods of measuring similarity between sentences.
Sentences are not mere groupings of words; the particular
ordering of words is also significant. While comparing
sentences, methods of comparing the word order such as
parse tree depths, bipartite graphs etc. are also considered.
1.1 WordNet
For comparing semantic similarity, researchers make use of
many lexical databases. WordNet is one of the most popular
of these and is widely used as a dictionary and a thesaurus. It is a
general-purpose ontology developed by Princeton University
to model the lexical knowledge of the English language. WordNet
covers four parts of speech in English: noun, verb,
adjective and adverb. It groups the lexemes of each into
synonym sets, or synsets. A synset represents one specific meaning
of a word. Semantic similarity measurement approaches that
use WordNet take the synsets of the words or phrases under
consideration and compare the glosses of those synsets.
WordNet depicts various relationships, predominantly
hypernym-hyponym (is-a relation) and holonym-meronym
(part-of relation). Each of these hierarchies has a more general
node at its top. Among these relationship hierarchies, the ‘is-a’
relationship covers almost 70 percent of the corpus structure.
2. Approaches for Similarity measurement
Various approaches have been developed over the years to
compute semantic similarity among concepts, which are commonly
expressed as short sentences. We look at each of
these approaches in detail.
2.1 Structure-based Methods
These methods rely on the structure of a corpus such as
WordNet. To determine the semantic relatedness between
two concepts, they use the path length between
the concepts under consideration in the WordNet structure.
2.1.1 Shortest path method [1,2]
In this approach, the distance between two concepts in the
corpus is measured. There may be many
paths, via different parent nodes, between two concept nodes
in the hierarchy. Path-length-based methods take each of
these paths, compute its path length (edge count) and
select the shortest as the distance
between the two concepts. The smaller the distance, the more
semantically similar the concepts are.
The similarity between two concepts c1 and c2 is calculated
as
Sim(c1,c2) = (2*max_depth) - shortest_path_length(c1,c2)
where max_depth is the maximum depth of the hierarchy.
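The calculation can be illustrated with a minimal sketch. The tiny 'is-a' taxonomy below and all function names are hypothetical and are not taken from WordNet; a breadth-first search is assumed as the way to find the shortest path.

```python
from collections import deque

# Hypothetical toy 'is-a' taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

def neighbours(node):
    """Nodes one edge away from node: its children plus its parent."""
    out = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        out.append(PARENT[node])
    return out

def shortest_path_length(c1, c2):
    """Breadth-first search for the minimum edge count between c1 and c2."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")

def depth(node):
    """Number of edges from the root down to node."""
    d = 0
    while node in PARENT:
        node, d = PARENT[node], d + 1
    return d

MAX_DEPTH = max(depth(n) for n in PARENT)

def sim_path(c1, c2):
    """Sim(c1,c2) = (2*max_depth) - shortest_path_length(c1,c2)."""
    return 2 * MAX_DEPTH - shortest_path_length(c1, c2)

print(sim_path("car", "bicycle"), sim_path("car", "dog"))
```

In this toy hierarchy the siblings 'car' and 'bicycle' are two edges apart, whereas 'car' and 'dog' are four edges apart, so the first pair receives the higher score.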
2.1.2 Weighted edge method [3]
This method uses the sum of the weights of the edges on the path
between two concepts in the hierarchy. Each link or edge is
assigned a weight based on two factors:
(i) Depth of the hierarchy
(ii) Density of the taxonomy at a level and strength of
connotation between parent and child nodes
This method is essentially an enhanced version of the
shortest path method.
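A rough sketch of the idea follows. The weight function used here is purely an assumption made for illustration (it is not the weighting scheme of [3]): an edge gets a smaller weight when it lies deeper in the hierarchy and when its parent has more children. The strength of connotation is not modelled, and the toy taxonomy is hypothetical.

```python
# Hypothetical toy 'is-a' taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": "entity", "animal": "entity",
    "car": "vehicle", "bicycle": "vehicle", "truck": "vehicle",
}

def depth(node):
    """Number of edges from the root down to node."""
    d = 0
    while node in PARENT:
        node, d = PARENT[node], d + 1
    return d

def n_children(node):
    """Local density: how many children the node has."""
    return sum(1 for parent in PARENT.values() if parent == node)

def edge_weight(child):
    """Illustrative weight of the edge child -> parent (an assumption):
    deeper edges and edges in denser regions get smaller weights."""
    return 1.0 / (depth(child) * n_children(PARENT[child]))

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def weighted_distance(c1, c2):
    """Sum of edge weights on the path c1 -> nearest common ancestor -> c2."""
    up1, up2 = path_to_root(c1), path_to_root(c2)
    common = next(n for n in up1 if n in up2)
    edges = up1[:up1.index(common)] + up2[:up2.index(common)]
    return sum(edge_weight(child) for child in edges)

print(weighted_distance("car", "bicycle"), weighted_distance("vehicle", "animal"))
```

Both pairs are two edges apart, yet under this assumed weighting the deeper and denser pair ('car', 'bicycle') ends up closer than the shallow pair ('vehicle', 'animal'), which a plain edge count cannot distinguish.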
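2.1.3 Leacock-Chodorow method [4]
This is another method that uses the shortest path length
between two concepts and the maximum depth of the
hierarchy. The path length is divided by twice the maximum
depth, and the negative logarithm of this ratio is taken as the
similarity.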
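As a small sketch, the measure is usually written as -log(path_length / (2 * max_depth)). The function name and the convention of counting the path length in nodes (edges + 1), which avoids taking the logarithm of zero for identical concepts, are assumptions of this example.

```python
import math

def sim_lch(path_length, max_depth):
    """Leacock-Chodorow style score: -log(path_length / (2 * max_depth)).

    path_length is assumed to be counted in nodes (edges + 1), so two
    identical concepts have length 1 rather than 0.
    """
    if path_length < 1 or max_depth < 1:
        raise ValueError("path_length and max_depth must be at least 1")
    return -math.log(path_length / (2.0 * max_depth))

# Shorter paths give higher scores for the same hierarchy depth.
print(sim_lch(1, 5), sim_lch(3, 5), sim_lch(10, 5))
```

The score falls as the path gets longer and reaches zero when the path length equals twice the maximum depth.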
2.1.4 Hirst-St-Onge method [5]
Most semantic relatedness measures apply only to ‘is-a’
relations between nouns. The HSO method is not limited to nouns and is
applicable to other relations as well. This method classifies
relations between concepts into upward, downward and horizontal.
An upward movement leads to a more general concept along the ‘is-a’
relation, whereas a downward movement leads to a more
specific concept. Horizontal relations connect concepts with a similar
amount of specificity.
In this approach, the number of deviations (changes of direction) in the path connecting
the two concepts is taken into account. If two concepts are
similar, there will not be many deviations in the path
connecting them.
The similarity between two concepts c1 and c2 is calculated
in the HSO method as
SimHSO(c1,c2) = C - path_length(c1,c2) - (k * number_of_deviations)
where C and k are constants whose values are set
experimentally.
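The formula can be sketched as follows, assuming the path is given as a list of moves labelled 'up', 'down' or 'horizontal'. The default values of C and k below are placeholders chosen only for illustration; in practice they would be set experimentally.

```python
def count_deviations(moves):
    """Count changes of direction along a path of 'up', 'down' or
    'horizontal' moves."""
    return sum(1 for a, b in zip(moves, moves[1:]) if a != b)

def sim_hso(moves, C=8.0, k=1.0):
    """SimHSO = C - path_length - (k * number_of_deviations).
    C and k are placeholder constants (an assumption of this sketch)."""
    return C - len(moves) - k * count_deviations(moves)

# A path that climbs twice and then descends once has one deviation.
print(sim_hso(["up", "up", "down"]))
```

A straight path such as two upward moves is penalised only for its length, while a path of the same length that changes direction scores lower.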
2.1.5 Wu and Palmer [6]
This method calculates the path length from each of the concepts
being compared to their nearest common ancestor. The notion
behind the idea is that the nearest common ancestor is the
most specific concept shared by the concepts under
consideration. This common parent has the minimum number
of edges in the ‘is-a’ path to the concepts.
The similarity between two concepts c1 and c2, according to
this method, is
SimWP(c1,c2) = (2*N) / (N1 + N2 + 2*N)
where N is the depth of the nearest common parent from the
root node of the hierarchy, and N1 and N2 are the numbers of
edges between the nearest common parent and c1 and c2
respectively.
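A minimal sketch of this measure, using the usual form 2N / (N1 + N2 + 2N), is given below. The toy taxonomy and function names are hypothetical, and depth is counted in edges from the root, so any pair whose only common ancestor is the root scores zero.

```python
# Hypothetical toy 'is-a' taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

def path_to_root(node):
    """List of nodes from node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def sim_wup(c1, c2):
    """Wu-Palmer style similarity: 2N / (N1 + N2 + 2N)."""
    up1, up2 = path_to_root(c1), path_to_root(c2)
    common = next(n for n in up1 if n in up2)   # nearest common ancestor
    n1, n2 = up1.index(common), up2.index(common)
    n = len(path_to_root(common)) - 1           # edges from root to ancestor
    if n + n1 + n2 == 0:
        return 1.0                              # both concepts are the root
    return 2.0 * n / (n1 + n2 + 2.0 * n)

print(sim_wup("car", "bicycle"), sim_wup("car", "dog"), sim_wup("car", "car"))
```

Some implementations count depth in nodes rather than edges, which gives the root a depth of one and avoids zero scores for pairs that meet only at the root.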
2.2 Information Based Methods
Structure-based measures of semantic similarity have
the major disadvantage that concepts at the
same semantic distance may not be equally similar. This
happens because the information contained in each concept
node is not equivalent. Some nodes may contain more generic
information whereas others contain particularly specific
knowledge. Structure-based methods mostly consider the
edge count between concepts, the depth of the hierarchy etc. and
do not take into account the information contained in a
concept node.
Information content indicates the specificity of a concept. A
more specific concept carries a considerable amount of
information about a topic. A general concept cannot provide
much information.
For example, the concept ‘motor car’ gives more specific
information than the concept ‘conveyance’, which is a more
generic concept.
In this section, we compare various information-based
methods to compute semantic similarity.
2.2.1 Resnik method [7]
In this method, the information content of a node is
approximated by counting the frequency of that concept in a
large corpus. The frequency is used to determine the
probability of the concept via a maximum likelihood estimate.
The negative log of this probability is taken as a measure of
the information content of the concept. Resnik did his
experiments with the Brown Corpus.
Information content IC of a concept c is computed as
IC(c) = -log P(c)
where P(c) is the probability of the concept c.
If the corpus has sense-tagged text, each occurrence will be
associated with a unique sense. In such cases, the frequency
of the concept is readily available. For scenarios where
sense-tagging is not present, Resnik suggested counting the
number of occurrences of the word and then dividing it by
the number of different senses of that word.
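The computation of information content from corpus counts can be sketched as below. The taxonomy and the raw counts are invented for illustration. The sketch also assumes, as is usual for this measure, that every occurrence of a concept counts as an occurrence of each of its ancestors, so that the root has probability 1.

```python
import math

# Hypothetical toy 'is-a' taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

# Hypothetical raw occurrence counts of each concept in a corpus.
RAW_COUNT = {"entity": 0, "vehicle": 2, "car": 6, "bicycle": 2,
             "animal": 1, "dog": 5}

def concept_count(concept):
    """Own occurrences plus those of all descendants (assumed propagation)."""
    children = [c for c, parent in PARENT.items() if parent == concept]
    return RAW_COUNT[concept] + sum(concept_count(c) for c in children)

TOTAL = concept_count("entity")  # every occurrence is counted at the root

def information_content(concept):
    """IC(c) = -log P(c), with P(c) estimated as count(c) / TOTAL."""
    return -math.log(concept_count(concept) / TOTAL)

for name in ("entity", "vehicle", "car"):
    print(name, round(information_content(name), 3))
```

The more specific a concept is, the lower its probability and the higher its information content; the root carries no information at all.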
Semantic similarity of two concepts is proportional to the
amount of information content they share. The shared
information can be determined by the information content of
the lowest common subsumer of these two concepts in the
hierarchy.
Thus the similarity between two concepts c1 and c2, according to
the Resnik method, is
SimResnik(c1,c2) = IC(LCS(c1,c2))
where LCS(c1,c2) is the least common subsumer of the
concepts c1 and c2.
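A short sketch of this measure takes the similarity to be the information content of the least common subsumer. The taxonomy and the precomputed information content values below are hypothetical.

```python
# Hypothetical toy 'is-a' taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

# Hypothetical, precomputed information content values for each concept.
IC = {"entity": 0.0, "vehicle": 0.47, "car": 0.98,
      "bicycle": 2.08, "animal": 0.98, "dog": 1.16}

def path_to_root(node):
    """List of nodes from node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcs(c1, c2):
    """Least common subsumer: the first ancestor of c1 that also subsumes c2."""
    ancestors_2 = set(path_to_root(c2))
    return next(n for n in path_to_root(c1) if n in ancestors_2)

def sim_resnik(c1, c2):
    """Similarity is the information content of the least common subsumer."""
    return IC[lcs(c1, c2)]

print(sim_resnik("car", "bicycle"), sim_resnik("car", "dog"))
```

Note that every pair sharing the same least common subsumer receives the same score, whatever the information content of the two concepts themselves.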
2.2.2 Lin’s method [8]
This method takes into account both the information content
of the concepts under consideration and that of the
least common subsumer. For this experiment, Lin used the
sense-tagged version of SemCor.
Similarity according to Lin’s method is
SimLin(c1,c2) = (2 * IC(LCS(c1,c2))) / (IC(c1) + IC(c2))
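As a minimal sketch, Lin's measure is commonly written as twice the information content of the least common subsumer divided by the sum of the information contents of the two concepts. The function below takes the three information content values directly; its name and the handling of a zero denominator are assumptions of this example.

```python
def sim_lin(ic_c1, ic_c2, ic_lcs):
    """Lin style similarity: 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    denominator = ic_c1 + ic_c2
    if denominator == 0:
        # Both concepts carry no information (e.g. both are the root).
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return 2.0 * ic_lcs / denominator

# Hypothetical information content values for two concepts and their LCS.
print(round(sim_lin(0.98, 2.08, 0.47), 4))
```

Because the information content of the subsumer can never exceed that of the concepts it subsumes, the score lies between 0 and 1, and it equals 1 when a concept is compared with itself.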
2.2.3 Jiang-Conrath measure [9]
Lin’s method takes the semantic similarity as the ratio of the
information content of the common parent to the sum of the
information contents of the concept nodes under
consideration. Jiang and Conrath’s method also uses the information content
of the concept nodes and that of the common parent but, instead
of taking the ratio, it takes the difference of these
two values. For checking the similarity, Jiang and Conrath
also used the sense-tagged version of SemCor.
Semantic distance between two concepts c1 and c2,
according to this method, is
DistJC(c1,c2) = IC(c1) + IC(c2) - (2 * IC(LCS(c1,c2)))
Accordingly, semantic similarity between c1 and c2 is
SimJC(c1,c2) = 1 / DistJC(c1,c2)
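A small sketch of the Jiang-Conrath measure follows, assuming the usual distance IC(c1) + IC(c2) - 2 * IC(LCS) and, as one common convention, a similarity equal to the inverse of that distance. The function names and the treatment of a zero distance are assumptions of this example.

```python
import math

def dist_jc(ic_c1, ic_c2, ic_lcs):
    """Jiang-Conrath style distance: IC(c1) + IC(c2) - 2 * IC(LCS)."""
    return ic_c1 + ic_c2 - 2.0 * ic_lcs

def sim_jc(ic_c1, ic_c2, ic_lcs):
    """Similarity as the inverse of the distance (a common convention).
    Identical concepts have distance 0; this sketch returns infinity."""
    distance = dist_jc(ic_c1, ic_c2, ic_lcs)
    return math.inf if distance == 0 else 1.0 / distance

# Hypothetical information content values for two concepts and their LCS.
print(dist_jc(1.0, 2.0, 0.5), sim_jc(1.0, 2.0, 0.5))
```

Practical implementations often cap the value returned for a zero distance instead of using infinity.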
2.3 Feature-Based Methods
Feature-based methods differ from structure-based
and information-based methods in that they
do not take into account the taxonomy structure or the
information content of the nodes and their parents.
Feature-based methods assume that each concept is
associated with a set of features or properties. An example of
a feature set is the set of definitions or glosses of the concept.
Similarity between two concepts is computed as a function of
their features. The more features two concepts
share, the more similar they are.
In general, we can say that these methods rely more on
semantic properties than mere edge-counting methods do.
2.3.1 Tversky’s method [10]
This method assumes that semantic similarity is not
symmetric. That is, the similarity of a concept c1
to another concept c2 is not the same as the similarity of c2 to c1.
A classic example this model uses is the inheritance
relationship: it argues that the similarity of a subclass
to its superclass is greater than the similarity of the superclass to
its subclass.
Semantic similarity of c1 to c2 in this method is
SimTversky(c1,c2) = |F(c1) ∩ F(c2)| / (|F(c1) ∩ F(c2)| + k*|F(c1) \ F(c2)| + (1-k)*|F(c2) \ F(c1)|)
where F(c) is the feature set of concept c and k is a constant whose value is between 0 and 1 and is
obtained by observation.
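The asymmetry can be illustrated with a minimal sketch of the ratio form of Tversky's model, in which the common features are divided by the common features plus the weighted distinctive features of each concept. The feature sets, the function name and the value k = 0.2 are assumptions made for this example.

```python
def sim_tversky(features_1, features_2, k=0.5):
    """Ratio-model similarity of concept 1 to concept 2:
    |common| / (|common| + k*|only in 1| + (1-k)*|only in 2|)."""
    f1, f2 = set(features_1), set(features_2)
    common = len(f1 & f2)
    denominator = common + k * len(f1 - f2) + (1.0 - k) * len(f2 - f1)
    return common / denominator if denominator else 0.0

# Hypothetical feature sets: the subclass inherits the superclass feature
# and adds features of its own.
car = {"moves", "wheels", "engine", "doors"}
vehicle = {"moves"}

print(sim_tversky(car, vehicle, k=0.2))   # subclass to superclass
print(sim_tversky(vehicle, car, k=0.2))   # superclass to subclass
```

With these sets the subclass-to-superclass score exceeds the reverse score whenever k is below 0.5, and the measure becomes symmetric at k = 0.5.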
2.4 Hybrid Methods
Hybrid methods combine two or more of the above-mentioned
methods. The similarity is calculated by taking
each method into account, assigning a weight to each method
and computing the weighted sum of the methods. The
weights can be assigned manually through experimental
observations. Hybrid methods may also take into account the
relationship between concepts such as ‘is-a’, ‘part-of’ etc.
2.4.1 Zhou method [11]
A hybrid method proposed by Zhou combines a structure-based
measure and an information-based measure to
compute a more accurate similarity measurement. A weight k,
whose value varies between 0 and 1, is chosen manually to
set the contribution of each measure.
If k=0, this method becomes purely information-content-based
and if k=1, it becomes purely structure-based.
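The role of the weight can be sketched with a generic weighted combination. This is not the exact formula of [11]; it only illustrates how a single weight k can blend a structure-based score with an information-content-based score, and the function and parameter names are hypothetical.

```python
def sim_hybrid(structure_sim, ic_sim, k):
    """Weighted sum of two scores: k * structure + (1 - k) * information.
    Both input scores are assumed to be on comparable scales."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return k * structure_sim + (1.0 - k) * ic_sim

# k = 0 keeps only the information-based score, k = 1 only the structural one.
print(sim_hybrid(0.8, 0.4, 0.0), sim_hybrid(0.8, 0.4, 1.0), sim_hybrid(0.8, 0.4, 0.5))
```

In practice the two component scores would first be normalised to a common range so that the weight has a consistent meaning.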
2.5 Comparison of various methods
| Category | Method | Metric | Working principle |
|---|---|---|---|
| Structure-based | Shortest path method | Path length and depth of the hierarchy | Shortest path length is subtracted from twice the depth |
| | Weighted edge method | Weighted path length and depth of the hierarchy | Weight is assigned to each edge based on depth of the hierarchy and density of the taxonomy |
| | Leacock-Chodorow method | Path length and depth of the hierarchy | Shortest path length is divided by twice the depth of the hierarchy and the negative logarithm of the result is taken |
| | Hirst-St-Onge method | Path length and deviations in the path | Fewer deviations in the path indicate more similarity |
| | Wu and Palmer method | Depth of the hierarchy and distance to the nearest common parent | Depth of the nearest common parent, relative to the distances from it to the two concepts, is a measure of the similarity |
| Information-based | Resnik method | Frequency of occurrence in the corpus | Negative log of the probability of occurrence is a measure of the information content of the concept node. Information content of the nearest common parent is a measure of the similarity |
| | Lin's method | Frequency of occurrence of the concepts and that of their nearest common parent | Twice the information content of the nearest common parent divided by the sum of that of the concept nodes is a measure of the similarity |
| | Jiang-Conrath measure | Frequency of occurrence of the concepts and that of their nearest common parent | Twice the information content of the nearest common parent subtracted from the sum of that of the concept nodes gives the semantic distance; a smaller distance means more similarity |
| Feature-based | Tversky's method | Concepts' features, such as their definitions or glosses | Common characteristic features (e.g., glosses) of the concepts are a measure of similarity |
| Hybrid | Zhou's method | Path length, information content and depth of the hierarchy | Combines structure-based and information-based methods in a certain proportion to find the similarity measurement |
3. SUMMARY
This paper aims to theoretically evaluate various semantic
similarity measurement methods proposed by researchers
over time. The methods we studied include structure-based
methods, information-based methods, feature-based
methods and hybrid methods, which incorporate the ideas of
structure-based and information-based approaches. The
methods compared here are based on the popular
general-purpose ontology called WordNet. Various other
domain-specific ontologies are available. The comparison is intended to
help researchers select an appropriate measure for their
requirements. However, accurate similarity measurement
is a problem that needs further research.
REFERENCES
[1] R. Rada, H. Mili, E. Bicknell and M. Blettner, “Development and Application of a Metric on Semantic Nets”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 1, (1989) January/February, pp. 17-30.
[2] H. Bulskov, R. Knappe and T. Andreasen, “On Measuring Similarity for Conceptual Querying”, Proceedings of the 5th International Conference on Flexible Query Answering Systems, (2002) October 27-29; Copenhagen, Denmark.
[3] R. Richardson, A. Smeaton and J. Murphy, “Using WordNet as a Knowledge Base for Measuring Semantic Similarity Between Words”, Technical Report Working Paper CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland, (1994).
[4] C. Leacock and M. Chodorow, “Combining Local Context and WordNet Similarity for Word Sense Identification”, in WordNet: An Electronic Lexical Database, MIT Press, (1998), pp. 265-283.
[5] G. Hirst and D. St-Onge, “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, (1998), pp. 305-332.
[6] Z. Wu and M. Palmer, “Verb semantics and lexical selection”, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, (1994) June 27-30; Las Cruces, New Mexico.
[7] P. Resnik, “Using information content to evaluate semantic similarity”, Proceedings of the 14th International Joint Conference on Artificial Intelligence, (1995) August 20-25; Montréal, Québec, Canada.
[8] D. Lin, “An information-theoretic definition of similarity”, Proceedings of the 15th International Conference on Machine Learning, (1998) July 24-27; Madison, Wisconsin, USA.
[9] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy”, Proceedings of the International Conference on Research in Computational Linguistics, (1997) August 22-24; Taipei, Taiwan.
[10] A. Tversky, “Features of Similarity”, Psychological Review, vol. 84, no. 4, (1977).
[11] Z. Zhou, Y. Wang and J. Gu, “New Model of Semantic Similarity Measuring in Wordnet”, Proceedings of the 3rd International Conference on Intelligent System and Knowledge Engineering, (2008) November 17-19; Xiamen, China.

More Related Content

PDF
Measure Term Similarity Using a Semantic Network Approach
PDF
Information Retrieval using Semantic Similarity
DOCX
COMPUTING SEMANTIC SIMILARITY OF CONCEPTS IN KNOWLEDGE GRAPHS
PDF
Semantic Similarity Between Sentences
PDF
IRJET-Semantic Similarity Between Sentences
PDF
Context Sensitive Relatedness Measure of Word Pairs
PDF
P13 corley
PDF
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
Measure Term Similarity Using a Semantic Network Approach
Information Retrieval using Semantic Similarity
COMPUTING SEMANTIC SIMILARITY OF CONCEPTS IN KNOWLEDGE GRAPHS
Semantic Similarity Between Sentences
IRJET-Semantic Similarity Between Sentences
Context Sensitive Relatedness Measure of Word Pairs
P13 corley
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...

Similar to Semantic similarity measurement- A theoretical study of various approaches (20)

PDF
Semantic Based Document Clustering Using Lexical Chains
PDF
IRJET-Semantic Based Document Clustering Using Lexical Chains
PDF
Ijarcet vol-2-issue-7-2252-2257
PDF
Ijarcet vol-2-issue-7-2252-2257
ODP
Talk at UAB, April 12, 2013
PDF
ASSESSING SIMILARITY BETWEEN ONTOLOGIES: THE CASE OF THE CONCEPTUAL SIMILARITY
PDF
ASSESSING SIMILARITY BETWEEN ONTOLOGIES: THE CASE OF THE CONCEPTUAL SIMILARITY
PDF
A Survey on Unsupervised Graph-based Word Sense Disambiguation
PDF
Text smilarity02 corpus_based
PDF
Volume 2-issue-6-2016-2020
PDF
Volume 2-issue-6-2016-2020
PPTX
Chat bot using text similarity approach
PDF
Conceptual similarity: why, where and how
PDF
Ontology matching
PDF
IRJET - Deep Collaborrative Filtering with Aspect Information
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
Conceptual similarity measurement algorithm for domain specific ontology[
PDF
CONCEPTUAL SIMILARITY MEASUREMENT ALGORITHM FOR DOMAIN SPECIFIC ONTOLOGY
PDF
Conceptual Similarity Measurement Algorithm For Domain Specific Ontology
PDF
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
Semantic Based Document Clustering Using Lexical Chains
IRJET-Semantic Based Document Clustering Using Lexical Chains
Ijarcet vol-2-issue-7-2252-2257
Ijarcet vol-2-issue-7-2252-2257
Talk at UAB, April 12, 2013
ASSESSING SIMILARITY BETWEEN ONTOLOGIES: THE CASE OF THE CONCEPTUAL SIMILARITY
ASSESSING SIMILARITY BETWEEN ONTOLOGIES: THE CASE OF THE CONCEPTUAL SIMILARITY
A Survey on Unsupervised Graph-based Word Sense Disambiguation
Text smilarity02 corpus_based
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020
Chat bot using text similarity approach
Conceptual similarity: why, where and how
Ontology matching
IRJET - Deep Collaborrative Filtering with Aspect Information
International Journal of Engineering Research and Development (IJERD)
Conceptual similarity measurement algorithm for domain specific ontology[
CONCEPTUAL SIMILARITY MEASUREMENT ALGORITHM FOR DOMAIN SPECIFIC ONTOLOGY
Conceptual Similarity Measurement Algorithm For Domain Specific Ontology
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
August 2025 - Top 10 Read Articles in Network Security & Its Applications
PPTX
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PDF
Abrasive, erosive and cavitation wear.pdf
PPT
Total quality management ppt for engineering students
PDF
III.4.1.2_The_Space_Environment.p pdffdf
PDF
Design Guidelines and solutions for Plastics parts
PPTX
Feature types and data preprocessing steps
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PPTX
Module 8- Technological and Communication Skills.pptx
PPTX
communication and presentation skills 01
PDF
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
PDF
August -2025_Top10 Read_Articles_ijait.pdf
PPTX
AUTOMOTIVE ENGINE MANAGEMENT (MECHATRONICS).pptx
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PDF
UNIT no 1 INTRODUCTION TO DBMS NOTES.pdf
Categorization of Factors Affecting Classification Algorithms Selection
August 2025 - Top 10 Read Articles in Network Security & Its Applications
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
Visual Aids for Exploratory Data Analysis.pdf
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
Abrasive, erosive and cavitation wear.pdf
Total quality management ppt for engineering students
III.4.1.2_The_Space_Environment.p pdffdf
Design Guidelines and solutions for Plastics parts
Feature types and data preprocessing steps
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
Module 8- Technological and Communication Skills.pptx
communication and presentation skills 01
ChapteR012372321DFGDSFGDFGDFSGDFGDFGDFGSDFGDFGFD
August -2025_Top10 Read_Articles_ijait.pdf
AUTOMOTIVE ENGINE MANAGEMENT (MECHATRONICS).pptx
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
distributed database system" (DDBS) is often used to refer to both the distri...
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
UNIT no 1 INTRODUCTION TO DBMS NOTES.pdf

Semantic similarity measurement- A theoretical study of various approaches

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 11 | Nov 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 843 Semantic similarity measurement- A theoretical study of various approaches Asha Chandran T1, Sajina K2 1 Lecturer In Computer Engineering, Department of Computer Engineering, Govt. Womens’ Polytechnic College, Kayamkulam, Kerala, India 2 Lecturer In Computer Engineering, Department of Computer Engineering, Govt. Polytechnic College Neyyattinkara, Kerala, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - Semantic similarity has a wide variety of applications in Natural Language Processing [NLP]. Approaches in semantic similarity measurement falls into categories like dictionary-based, corpus-based, knowledge- based, semantic indexing based etc. This paper aims to study these approaches in detail. Comparison of short sentences or phrases is given more emphasis since they draw more attraction than paragraphs or large documents.. Key Words: Semantic similarity, WordNet, Concept similarity, Concept nodes, Path length, InformationContent, Least Common Subsumer(LCS) 1. INTRODUCTION Semantic similarity between sentences is a measurement of the semantic relationship, that is, how much they resemble in their meanings. Semantic similarity has a wide variety of applications in areas such as text comparison, text compression, dictionaries and reverse dictionaries, plagiarism detection, ranking searchresultsandmanymore. Accuracy of results of many of these applications depends heavily on the accuracy score of the semantic similarity score. It is observed that comparison of short sentences or phrases gain more interest than comparing paragraphs or even large documents. 
In this article, we aim to present and compare various methods of comparing similarity between sentences. Sentences are not mere grouping of words; the particular ordering of words is also significant. While comparing sentences, methods of comparing the word order such as parse tree depths, bipartite graphs etc. are also considered. 1.1 WordNet For comparing semantic similarity, researchers make use of many lexical databases. WordNet is one of the most popular databases widely used as a dictionary and a thesaurus. It is a general purpose ontology developed byPrincetonUniversity to model the lexical knowledge of Englishlanguage.WordNet includes four Parts-of-Speech in English such as noun, verb, adjective and adverb. It compiles each of these lexemes into synonym sets or synsets. A synset is a specific meaning of a word. The semantic similarity measurementapproachesthat use WordNet take the synsets of words or phrases under consideration and compare the glossary in their synsets. WordNet depicts various relationships predominantly hypernym-hyponym(Is-Arelation),holonym-meronym(Part- of relation) etc. These hierarchies will have a more general node. Among these relationship hierarchies, the ‘is-a’ relationshipcoversalmost70percentofthecorpusstructure. 2. Approaches for Similarity measurement Various approaches have been developed over years to compute semantic similarity among concepts commonly expressed as short sentences. We look in detail into each of these approaches. 2.1 Structure-based Methods These methods rely on the structure of the corpus like WordNet. To determine the semantic relatedness between two concepts, these methods take the path length between the concepts under consideration in the WordNet structure. 2.1.1 Shortest path method [1,2] In this approach, the distance between two concepts in the corpus is measured. Remember that there may be many paths, via different parent nodes,betweentwoconceptnodes in the hierarchy. 
The path length based methods take each of these paths, compute the path length or edge count of each path and the shortest among them is selected as the distance between the two concepts.Thesmallerthedistance,themore semantically similar the concepts will be. The similarity between two concepts c1 and c2 is calculated as Sim(c1,c2)=(2*max_depth) –(shortest_path_length(c1,c2)) where max_depth is the depth of the hierarchy. 2.1.2 Weighted edge method [3] This method uses the sum of weights of the edges in the path between two concepts in the hierarchy. A link or edge is assigned a weight based on two factors
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 11 | Nov 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 844 (i)Depth of the hierarchy (ii)Density of the taxonomy at a level and strength of connotation between parent and child nodes This method is essentially an enhanced version of the shortest path method. 2.1.3 Leacock- Chodrow method [4] This is also another method that uses shortest path length between two concepts and the maximum depth of the hierarchy. 2.1.4 Hirst - St.Onge Method [5] Most of the semantic relatedness is applicable to ‘is-a’ relations for nouns. HSO method is not limited to nouns, but applicable to other relations also. This method classifies concepts into upward, downward and horizontal relations. The upward movement indicates a more general ’is-a’ relation whereas the downward movement indicates more specific concepts. Horizontal relations share the similar amount of specificity. In thisapproach, number of deviationsinthepathconnecting the two concepts is taken into account. If two concepts are similar, there will not be many deviations in the path connecting them. The similarity between two concepts c1 and c2 is calculated in HSO method as SimHSO(c1,c2)= C- path_length(c1,c2) – (k* number_of_deviations) where C and k are constants whose values are set using experiments. 2.1.5 Wu and Palmer [6] This method calculates the path length of the concepts to be compared to their nearest common ancestor. The notion behind the idea is that the nearest common ancestor is the most specific common concept between the concepts under consideration.Thecommonparenthastheminimumnumber of edges in the ‘is-a’ path with the concepts. 
The similarity between two concepts c1 and c2, according to this method is, where N is the depth of the nearest common parent from the root node of the hierarchy and N1 and N2 is the number if edges between nearest common parent with c1 and c2 respectively. 2.2 Information Based Methods Structure-based measures to count semantic similarity have the major disadvantage that the concepts which are at the same semantic distance may not be equally similar. This happens because the information contained in each concept nodeis not equivalent.Somenodesmaycontainmoregeneric information whereas some contain particularly specific knowledge. Structure-based methods mostly consider the edgecount between concepts, depth of the hierarchyetc.and do not take into account the information contained in a concept node. Information content indicates the specificity of a concept. A more specific concept contains a considerable amount of information about a topic. A general concept cannot provide much information. For example, a concept ‘motor car’ gives much specific information than the concept ‘conveyance’ which is a more generic concept. In this section, we compare various information based methods to compute semantic similarity. 2.2.1 Resnik method [7] In this method, information content of a node is computed approximately by counting the frequency of that conceptina large corpus. The frequency is used to determine the probabilityoftheconceptviaamaximumlikelihoodestimate. Negative log of this frequency is considered as a measure of the information content of the concept. Resnik did his experiments with the Brown Corpus. Information content IC of a concept c is computed as where P(c) is the probability of the concept c. If the corpus has sense-tagged text, each concept will be associated with a unique sense. In such cases, the frequency of the concept can be easily available. 
For scenarios where sense-tagging is not present, Resnik suggested counting the number of occurrences of the word that expresses the concept and then dividing it by the number of different senses of that word.
The semantic similarity of two concepts is proportional to the amount of information content they share. The shared information can be determined by the information content of the lowest common subsumer of these two concepts in the hierarchy.
Thus, the similarity between two concepts c1 and c2, according to the Resnik method, is

$$ \mathrm{Sim}_{Resnik}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)) $$

where LCS(c1,c2) is the least common subsumer of the concepts c1 and c2.

2.2.2 Lin’s method [8]
This method takes into account both the information content of the concepts under consideration and that of their least common subsumer. For this experiment, Lin used the sense-tagged version of SemCor. Similarity according to Lin’s method is

$$ \mathrm{Sim}_{Lin}(c_1, c_2) = \frac{2\,\mathrm{IC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)} $$

2.2.3 Jiang-Conrath measure [9]
Lin’s method takes the semantic similarity as the ratio of the information content of the common parent to the sum of the information contents of the concept nodes under consideration. Jiang’s method also uses the information content of the concept nodes and that of the common parent. But, instead of taking the ratio, this method takes the difference of these two values. For checking the similarity, Jiang and Conrath also used the sense-tagged version of SemCor. The semantic distance between two concepts c1 and c2, according to this method, is

$$ \mathrm{Dist}_{JC}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}(\mathrm{LCS}(c_1, c_2)) $$

Accordingly, the semantic similarity between c1 and c2 decreases as this distance grows. It is commonly taken as the reciprocal of the distance:

$$ \mathrm{Sim}_{JC}(c_1, c_2) = \frac{1}{\mathrm{Dist}_{JC}(c_1, c_2)} $$

2.3 Feature-Based Methods
Feature-based methods differ from structure-based and information-based methods in that they do not take into account the taxonomy structure or the information content of the nodes and their parents. Feature-based methods assume that each concept is associated with a set of features or properties. An example of a feature set is the set of definitions or glosses of the concept.
The similarity between two concepts is computed as a function of their features. The more features two concepts share, the more similar they are. In general, these methods rely more on semantic properties than mere edge-counting methods do.
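To make the idea of shared features concrete, the sketch below treats the words of a gloss as the feature set of a concept and scores two concepts by the fraction of features they share (a Jaccard overlap). The glosses, the stop-word list and the function names are hypothetical illustrations, and the overlap ratio is only a generic stand-in for "a function of their features", not a measure proposed in the works surveyed here.

```python
# Hypothetical glosses; real glosses would come from a lexical resource such as WordNet.
GLOSSES = {
    "car": "a motor vehicle with four wheels used for transporting passengers",
    "truck": "a motor vehicle with wheels used for transporting goods",
    "banana": "a long curved yellow fruit",
}

# Assumed minimal stop-word list, chosen only for this example.
STOP_WORDS = {"a", "an", "the", "with", "for", "of", "used"}


def gloss_features(concept):
    """Return the set of content words in the concept's gloss."""
    return {w for w in GLOSSES[concept].lower().split() if w not in STOP_WORDS}


def feature_overlap(c1, c2):
    """Shared features divided by all features of either concept (Jaccard)."""
    f1, f2 = gloss_features(c1), gloss_features(c2)
    union = f1 | f2
    return len(f1 & f2) / len(union) if union else 0.0


print(feature_overlap("car", "truck"))   # several shared gloss words
print(feature_overlap("car", "banana"))  # no shared gloss words
```

This overlap is symmetric by construction, so it cannot express direction-dependent similarity.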
2.3.1 Tversky’s method [10]
This method assumes that semantic similarity is not symmetric. That is, the similarity of a concept c1 to another concept c2 is not the same as the similarity of c2 to c1. A classic example this model uses is the inheritance relationship: it argues that the similarity of a subclass to its superclass is greater than the similarity of the superclass to its subclass. The semantic similarity of c1 to c2 in this method is

$$ \mathrm{Sim}_{Tversky}(c_1, c_2) = \frac{|F_1 \cap F_2|}{|F_1 \cap F_2| + k\,|F_1 \setminus F_2| + (1 - k)\,|F_2 \setminus F_1|} $$

where F1 and F2 are the feature sets of c1 and c2, and k is a constant whose value is between 0 and 1 and is obtained by observation.

2.4 Hybrid Methods
Hybrid methods combine two or more of the above-mentioned methods. The similarity is calculated by taking each method into account, assigning a weight to each method and computing the weighted sum. The weights can be assigned manually through experimental observations. Hybrid methods may also take into account relationships between concepts such as ‘is-a’, ‘part-of’, etc.

2.4.1 Zhou method [11]
A hybrid method proposed by Zhou combines a structure-based measure and an information-based measure to compute a more accurate similarity measurement. A weight k, whose value varies between 0 and 1, is chosen manually to set the contribution of each measure. If k=0, this method becomes purely information content based, and if k=1, it becomes purely structure-based.
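The weighting scheme described above can be sketched as a convex combination of two scores. This is a generic illustration of a hybrid measure under the assumption that both input scores are already normalised to the range [0, 1]. It is not Zhou's exact formula, and the function name `hybrid_similarity` and the sample scores are hypothetical.

```python
def hybrid_similarity(structure_sim, ic_sim, k):
    """Weighted sum of a structure-based and an information-based score.

    k = 1 keeps only the structure-based score, k = 0 keeps only the
    information-based score; values in between blend the two.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return k * structure_sim + (1.0 - k) * ic_sim


# Hypothetical scores for one concept pair, both assumed to lie in [0, 1].
structure_score, ic_score = 0.8, 0.4
for k in (0.0, 0.5, 1.0):
    print(k, hybrid_similarity(structure_score, ic_score, k))
```

Because the weights sum to one, the combined score stays within the range of the two input scores, which keeps results comparable across different choices of k.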
2.5 Comparison of various methods

| Category | Method | Metric | Working principle |
|---|---|---|---|
| Structure-based | Shortest path method | Path length and depth of the hierarchy | Shortest path length is subtracted from twice the depth |
| Structure-based | Weighted edge method | Weighted path length and depth of the hierarchy | A weight is assigned to each edge based on the depth of the hierarchy and the density of the taxonomy |
| Structure-based | Leacock-Chodorow method | Path length and depth of the hierarchy | Shortest path length is divided by twice the depth of the hierarchy, and the negative logarithm of the ratio is taken |
| Structure-based | Hirst-St.Onge method | Path length and deviations in the path | Fewer deviations in the path indicate greater similarity |
| Structure-based | Wu and Palmer method | Depth of the hierarchy and distance to the nearest common parent | Depth of the nearest common parent, relative to its distance from the concepts, is a measure of the similarity |
| Information-based | Resnik method | Frequency of occurrence in the corpus | Negative log of the probability of occurrence is a measure of the information content of the concept node. Information content of the nearest common parent is a measure of the similarity |
| Information-based | Lin's method | Frequency of occurrence of the concepts and that of their nearest common parent | Information content of the nearest common parent divided by that of the concept nodes is a measure of the similarity |
| Information-based | Jiang's measure | Frequency of occurrence of the concepts and that of their nearest common parent | Information content of the nearest common parent subtracted from that of the concept nodes is a measure of the semantic distance; a smaller distance means greater similarity |
| Feature-based | Tversky's method | Features of the concepts, such as their definitions or glosses | Common characteristic features (e.g., glosses) of the concepts are a measure of similarity |
| Hybrid | Zhou's method | Path length, information content and depth of the hierarchy | Combines structure-based and information-based methods in a certain proportion to find the similarity measurement |
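As a rough illustration of how the structure-based and information-based rows of the table differ in practice, the sketch below evaluates several measures on a small hand-made taxonomy. The taxonomy, the probabilities and all function names are assumptions made for this example. The formulas follow the common statements of the Wu and Palmer, Resnik, Lin and Jiang-Conrath measures, with N, N1 and N2 counted in edges as in Section 2.1.5.

```python
import math

# Hypothetical 'is-a' taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "conveyance": "entity",
    "vehicle": "conveyance",
    "car": "vehicle",
    "bicycle": "vehicle",
    "food": "entity",
    "fruit": "food",
}

# Hypothetical concept probabilities, growing towards the root.
PROB = {"entity": 1.0, "conveyance": 0.5, "vehicle": 0.4, "car": 0.2,
        "bicycle": 0.1, "food": 0.5, "fruit": 0.25}


def ancestors(c):
    """Path from c up to the root, starting with c itself."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path


def lcs(c1, c2):
    """Least common subsumer: the first ancestor of c1 that also subsumes c2."""
    up2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in up2)


def depth(c):
    """Number of edges from the root to c."""
    return len(ancestors(c)) - 1


def ic(c):
    """IC(c) = -log P(c), written as log(1 / P(c)) so the root yields 0.0."""
    return math.log(1.0 / PROB[c])


def wu_palmer(c1, c2):
    """2N / (N1 + N2 + 2N), with N the depth of the nearest common parent."""
    n = depth(lcs(c1, c2))
    n1, n2 = depth(c1) - n, depth(c2) - n
    return 2 * n / (n1 + n2 + 2 * n)


def resnik(c1, c2):
    """Information content of the least common subsumer."""
    return ic(lcs(c1, c2))


def lin(c1, c2):
    """Twice the shared information divided by the total information."""
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))


def jiang_conrath_distance(c1, c2):
    """Total information minus twice the shared information (a distance)."""
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))


for pair in [("car", "bicycle"), ("car", "fruit")]:
    print(pair, round(wu_palmer(*pair), 3), round(resnik(*pair), 3),
          round(lin(*pair), 3), round(jiang_conrath_distance(*pair), 3))
```

On this toy data the pair sharing the specific parent ‘vehicle’ scores higher on the three similarity measures and lower on the Jiang-Conrath distance than the pair whose only common parent is the root. The degenerate case of comparing the root with itself, where the Wu and Palmer and Lin denominators are zero, is deliberately left unhandled.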
3. SUMMARY
This paper theoretically evaluates various semantic similarity measurement methods proposed by researchers over time. The methods we studied include structure-based methods, information-based methods, feature-based methods and hybrid methods, which combine the ideas of structure-based and information-based approaches. The methods compared here are based on the popular general-purpose ontology WordNet. Various domain-specific ontologies are also available. The comparison is intended to help researchers select an appropriate measure for their requirements. However, accurate similarity measurement remains an open problem that requires further research.

REFERENCES
[1] R. Rada, H. Mili, E. Bicknell and M. Blettner, “Development and Application of a Metric on Semantic Nets”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 1, (1989) January/February, pp. 17-30.
[2] H. Bulskov, R. Knappe and T. Andreasen, “On Measuring Similarity for Conceptual Querying”, Proceedings of the 5th International Conference on Flexible Query Answering Systems, (2002) October 27-29; Copenhagen, Denmark.
[3] R. Richardson, A. Smeaton and J. Murphy, “Using WordNet as a Knowledge Base for Measuring Semantic Similarity Between Words”, Technical Report Working Paper CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland, (1994).
[4] C. Leacock and M. Chodorow, “Combining Local Context and WordNet Similarity for Word Sense Identification”, in WordNet: An Electronic Lexical Database, MIT Press, (1998), pp. 265-283.
[5] G. Hirst and D. St-Onge, “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, (1998), pp. 305-332.
[6] Z. Wu and M. Palmer, “Verb semantics and lexical selection”, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, (1994) June 27-30; Las Cruces, New Mexico.
[7] P. Resnik, “Using information content to evaluate semantic similarity”, Proceedings of the 14th International Joint Conference on Artificial Intelligence, (1995) August 20-25; Montréal, Québec, Canada.
[8] D. Lin, “An information-theoretic definition of similarity”, Proceedings of the 15th International Conference on Machine Learning, (1998) July 24-27; Madison, Wisconsin, USA.
[9] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy”, Proceedings of International Conference on Research in Computational Linguistics, (1997) August 22-24; Taipei, Taiwan.
[10] A. Tversky, “Features of Similarity”, Psychological Review, vol. 84, no. 4, (1977).
[11] Z. Zhou, Y. Wang and J. Gu, “New Model of Semantic Similarity Measuring in Wordnet”, Proceedings of the 3rd International Conference on Intelligent System and Knowledge Engineering, (2008) November 17-19; Xiamen, China.