International Journal of Computer Applications Technology and Research
Volume 7–Issue 08, 331-340, 2018, ISSN:-2319–8656
www.ijcat.com 331
Semantic Similarity Measures between Terms in the
Biomedical Domain within frame work Unified Medical
Language System (UMLS)
Abdelhakeem M. B. Abdelrahman
Sudan University of Science and Technology
College of Graduate Studies, Khartoum, Sudan
Dr. Ahmad Kayed
Department of Computing and Information
Technology
Sohar University, Sohar, Oman
Abstract
Techniques and tests are tools used to define how to measure the goodness of an ontology or its
resources. Measuring the similarity between biomedical classes/concepts is an important task for
biomedical information extraction and knowledge discovery. Most semantic similarity techniques
can be adapted for use in the biomedical domain (UMLS), and many experiments have been
conducted to check the applicability of these measures. In this paper, we investigate measuring
the semantic similarity between two terms within a single ontology or multiple ontologies, using
ICD-10 “V1.0” as the primary source, and compare our results to human expert scores by
correlation coefficient.
Keywords: Information extraction, biomedical domain, semantic similarity techniques, Unified
Medical Language System (UMLS), and Semantic Information Retrieval (SIR).
1. INTRODUCTION
An ontology is the test bed of the Semantic Web, capturing knowledge about a certain area by
providing the relevant concepts and the relations between them. Quality metrics are essential for
evaluating ontology quality. Metrics are based on the structural and semantic levels. At present,
ontology evaluation is based only on structural metrics, which have not been very appropriate for
providing the desired results.
Semantic similarity measures are widely used in Natural Language Processing. We show how six
existing domain-independent measures can be adapted to the biomedical domain. Semantic
similarity techniques are becoming important components in most intelligent knowledge-based
and Semantic Information Retrieval (SIR) systems [1]. Measures and tests are provided to define
how the “goodness” of an ontology or its resources can be measured. Many experiments have been
conducted to check the applicability of these measures [4].
General-English similarity measures based on ontology structure can be adapted for use in the
biomedical domain within UMLS. A new approach for measuring semantic similarity between
biomedical concepts using multiple ontologies was proposed by Al-Mubaid and Nguyen [2, 3]. They
proposed a new ontology-structure-based technique for measuring semantic similarity within a
single ontology and across multiple ontologies in the biomedical domain, within the framework of
the Unified Medical Language System (UMLS). Their proposed measure is based on three
features [2]: first, the cross-modified path length between two concepts; second, a new feature of
common specificity of concepts in the ontology; third, the local granularity of ontology
clusters.
2. BIOMEDICAL DOMAIN ONTOLOGIES
Most semantic similarity work in the biomedical domain uses only one ontology
(e.g., MeSH or SNOMED-CT) for computing the similarity between biomedical terms [9].
In this work, however, we use the ICD-10 ontology as the primary source for computing the
similarity between concepts in the biomedical domain.
International Classification of Diseases (ICD): The newest edition (ICD-10) is divided into 22
chapters (Infections, Neoplasms, Blood Diseases, Endocrine Diseases, etc.) and denotes about
14,000 classes of diseases and related problems. The first character of the ICD code is a letter, and
each letter is associated with a particular chapter, except for the letter D, which is used in both
Chapter II, Neoplasms, and Chapter III, Diseases of the blood and blood-forming organs and certain
disorders involving the immune mechanism, and the letter H, which is used in both Chapter VII,
Diseases of the eye and adnexa and Chapter VIII, Diseases of the ear and mastoid process. Four
chapters (Chapters I, II, XIX and XX) use more than one letter in the first position of their codes.
Each chapter contains sufficient three-character categories to cover its content; not all available
codes are used, allowing space for future revision and expansion. Chapters I–XVII relate to
diseases and other morbid conditions, and Chapter XIX to injuries, poisoning and certain other
consequences of external causes. The remaining chapters complete the range of subject matter
nowadays included in diagnostic data. Chapter XVIII covers Symptoms, signs and abnormal
clinical and laboratory findings, not elsewhere classified. Chapter XX, External causes of
morbidity and mortality, was traditionally used to classify causes of injury and poisoning, but,
since the Ninth Revision, has also provided for any recorded external cause of diseases and other
morbid conditions. Finally, Chapter XXI, Factors influencing health status and contact with health
services, is intended for the classification of data explaining the reason for contact with health-
care services of a person not currently sick, or the circumstances in which the patient is receiving
care at that particular time or otherwise having some bearing on that person’s care [8, 10].
3. SEMANTIC SIMILARITY TECHNIQUES CHALLENGES IN THE BIOMEDICAL
DOMAIN
Most existing semantic similarity techniques that use ontology structure as the primary source
cannot measure the similarity between terms using a single ontology or multiple ontologies in the
biomedical domain within the framework of the Unified Medical Language System (UMLS). However,
some semantic similarity techniques have been adapted to the biomedical domain by
incorporating domain information extracted from clinical data or medical ontologies.
4. RELATED WORK
4.1 Rada et al. proposed semantic distance as a potential measure of semantic similarity between
two concepts in MeSH, and implemented the shortest path length measure, called CDist, based on
the shortest distance between two concept nodes in the ontology. They evaluated CDist on the UMLS
Metathesaurus (MeSH, SNOMED, ICD9), and then compared the CDist similarity scores to
human expert scores by correlation coefficient.
4.2 Caviedes and Cimino [11] implemented a shortest-path-based measure, called CDist, based on
the shortest distance between two concept nodes in the ontology. They evaluated CDist on the UMLS
Metathesaurus (MeSH, SNOMED, ICD9), and then compared the CDist similarity scores to
human expert scores by correlation coefficient.
4.3 Pedersen et al. [1] proposed measures of semantic similarity and relatedness in the biomedical
domain, applying a corpus-based context vector approach to measure similarity between concepts in
SNOMED-CT. Their context vector approach is ontology-free but requires training text, for which
they used text data from the Mayo Clinic corpus of medical notes.
4.4 Wu and Palmer similarity measure [11]: Wu and Palmer proposed a method that defines the
semantic similarity between concepts C1 and C2 as

\[ \mathrm{Sim}(C_1, C_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} \qquad (1) \]

where
N1 is the length, given as the number of nodes, of the path from C1 to C3, the least common
super-concept of C1 and C2;
N2 is the length, given as the number of nodes, of the path from C2 to C3; and
N3 is the depth of C3 in the hierarchy, that is, the length of the path from C3 to the root; it serves as the scaling factor.
Figure 1. Fragment of Intestinal infectious diseases
For example, in Figure 1, LCS(A00.1, A00.9) = A00 and LCS(A00, A01) = A00-A09, where LCS denotes
the least common subsumer of two concept nodes, and N1 and N2 are the path lengths from each concept node to the LCS, respectively.
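As a minimal sketch of Eq. 1, the Python snippet below computes the Wu and Palmer score on a hand-built fragment of the hierarchy from Figure 1. The `PARENT` map, the `ICD-10` root and `Chapter I` levels, and all function names are assumptions made for illustration only. The sketch counts N1 and N2 as links from each concept up to the LCS and N3 as the number of nodes from the LCS up to the root, which is one common reading of the measure and may differ from the counting used in the cited work.

```python
# Hypothetical child -> parent map for a small ICD-10 fragment.
# The "ICD-10" root and "Chapter I" levels are assumptions of this sketch.
PARENT = {
    "Chapter I": "ICD-10",
    "A00-A09": "Chapter I",
    "A00": "A00-A09",
    "A01": "A00-A09",
    "A00.1": "A00",
    "A00.9": "A00",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1, c2):
    """Least common subsumer: the deepest ancestor shared by c1 and c2."""
    ancestors = set(path_to_root(c1))
    for node in path_to_root(c2):
        if node in ancestors:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(c1, c2):
    """Eq. 1 with N1, N2 as links to the LCS and N3 as the node depth of the LCS."""
    c3 = lcs(c1, c2)
    n1 = path_to_root(c1).index(c3)  # links from c1 up to the LCS
    n2 = path_to_root(c2).index(c3)  # links from c2 up to the LCS
    n3 = len(path_to_root(c3))       # nodes from the LCS up to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)


print(lcs("A00.1", "A00.9"))        # A00
print(lcs("A00", "A01"))            # A00-A09
print(wu_palmer("A00.1", "A00.9"))  # 0.8
print(wu_palmer("A00", "A01"))      # 0.75
```

On this toy tree the deeper pair (A00.1, A00.9) scores higher than (A00, A01), because its LCS lies further from the root. The absolute values depend on the assumed tree and should not be compared with the tables in this paper.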
4.5 Al-Mubaid and Nguyen similarity technique [5, 11]: Their proposed measure takes into account
the depth of the least common subsumer (LCS) of the two concepts and the length of the shortest
path between them. Higher similarity arises when the two concepts are at a lower level of the
hierarchy. Their similarity measure is:

\[ \mathrm{Sim}(c_1, c_2) = \log_2\big([L(c_1, c_2) - 1] \times [D - \mathrm{depth}(\mathrm{LCS}(c_1, c_2))] + 2\big) \qquad (2) \]

where:
L(c1, c2) is the shortest distance between c1 and c2;
LCS(c1, c2) is the lowest common subsumer of c1 and c2;
depth(LCS(c1, c2)) is the depth of LCS(c1, c2) using node counting; and
D is the maximum depth of the taxonomy.
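A minimal Python sketch of Eq. 2 on a small hand-built ICD-10 fragment follows. The `PARENT` map, the `ICD-10` root and `Chapter I` levels, and the function names are hypothetical. The paper specifies node counting only for the depth; using node counting for the shortest path as well is an assumption of this sketch, chosen so that a concept compared with itself yields log2(2) = 1.

```python
import math

# Hypothetical child -> parent map for a small ICD-10 fragment.
# The "ICD-10" root and "Chapter I" levels are assumptions of this sketch.
PARENT = {
    "Chapter I": "ICD-10",
    "A00-A09": "Chapter I",
    "A00": "A00-A09",
    "A01": "A00-A09",
    "A00.1": "A00",
    "A00.9": "A00",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def eq2_measure(c1, c2):
    """Eq. 2: log2((L - 1) * (D - depth(LCS)) + 2) on the toy tree."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = next(n for n in p1 if n in p2)  # lowest common subsumer
    # Shortest path length, counting the nodes on the path (assumption).
    path_len = p1.index(common) + p2.index(common) + 1
    depth_lcs = len(path_to_root(common))    # depth by node counting
    max_depth = max(len(path_to_root(n)) for n in PARENT)  # D of the toy tree
    return math.log2((path_len - 1) * (max_depth - depth_lcs) + 2)


print(round(eq2_measure("A00", "A01"), 3))  # 2.585
print(eq2_measure("A00.1", "A00.9"))        # 2.0
print(eq2_measure("A00", "A00"))            # 1.0
```

The pair whose LCS is deeper in the tree receives the smaller value. Because D and the depths come from this toy fragment rather than the full ICD-10 hierarchy, the numbers are illustrative only.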
The measure equals 1 when the two concept nodes are in the same cluster/ontology. Its maximum
value occurs when one concept is the leftmost leaf node and the other is the rightmost leaf node
in the tree. Consider an example from the ICD-10 terminology. The category tree
“Intestinal infectious diseases” is assigned the letter A in ICD-10, version 2016
(http://apps.who.int/classifications/icd10/browse/2016/en#/A00-A09). The tree looks as follows:
Intestinal infectious diseases [A00-A09]
    Cholera [A00]+
    Typhoid and paratyphoid fevers [A01]+
    Other salmonella infections [A02]+
    Shigellosis [A03]+
    Viral and other specified intestinal infections [A08]+
    Other gastroenteritis and colitis of infectious and unspecified origin [A09]+
The similarity between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is lower
than the similarity between “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]”
and “Cholera, unspecified [A00.9]”, because this measure takes into account the depth
of the LCS of the two concepts. Table 1 lists the scores produced by the path length and
Leacock & Chodorow measures for the two pairs (A00, A01) and (A00.1, A00.9), together with the
Sim(c1, c2) measure (Eq. 2), which gives higher similarity at the lower level of the ontology
hierarchy (A00.1, A00.9).
Table 1: Measures comparison
Pair of Concepts    P.L     L.C     C.K     Al-Mubaid & Nguyen measure (Eq. 2)
A00 – A01           0.37    2.13    0.91    3.2
A00.1 – A00.9       0.33    2.15    0.91    1.6
A higher numeric value of this measure, as for (A00, A01), indicates lower semantic similarity
between the two concepts.
5. EVALUATION
5.1 Datasets:
There are no standard human rating sets for semantic similarity in the biomedical domain. Thus,
Al-Mubaid and Nguyen [3, 11] used the dataset from Pedersen et al. [1], which was annotated
by 3 physicians and 9 medical index experts, to evaluate their proposed measure in the biomedical
domain.
The symbol “+” indicates that the concept can be further expanded into a subtree
(sub-concepts). For example, “Cholera” [A00] can be expanded as follows:
Cholera [A00]
    Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]+
    Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]+
    Cholera, unspecified [A00.9]+
Table 2. Dataset 1: 30 medical term pairs sorted in order of the averaged physicians' scores [1].
Id  Concept 1                                      Concept 2                        Phys    Expert
4   Renal failure (I12.0)                          Kidney failure (I12.0)           4.0000  4.0000
5   Heart (I51.5)                                  Myocardium (I51.5)               3.3333  3.0000
1   Stroke (I64)                                   Infarct (I64)                    3.0000  2.7778
7   Abortion (O03)                                 Miscarriage (O03)                3.0000  3.3333
9   Delusion (F06.2)                               Schizophrenia (F06.2)            3.0000  2.2222
11  Congestive heart failure (I50.0)               Pulmonary edema (I50.1)          3.0000  1.4444
8   Metastasis (C77.0)                             Adenocarcinoma (C08.9)           2.6667  1.7778
17  Calcification (M61)                            Stenosis (H04.5)                 2.6667  2.0000
10  Diarrhea                                       Stomach cramps                   2.3333  1.3333
19  Mitral stenosis (I05.0)                        Atrial fibrillation (I48)        2.3333  1.3333
20  Chronic obstructive pulmonary disease (J44.9)  Lung infiltrates (J82)           2.0000  1.8889
2   Rheumatoid arthritis (M05.3)                   Lupus (L93)                      2.0000  1.1111
3   Brain tumor (G94.8)                            Intracranial hemorrhage (I69.2)  2.0000  1.3333
15  Carpal tunnel syndrome (G56.0)                 Osteoarthritis (M19.9)           2.0000  1.1111
18  Diabetes mellitus (E10-E14)                    Hypertension (I10-I15)           2.0000  1.0000
27  Acne                                           Syringe                          2.0000  1.0000
12  Antibiotic (Z88.1)                             Allergy (Z88.1)                  1.6667  1.2222
13  Cortisone                                      Total knee replacement           1.6667  1.0000
14  Pulmonary embolus                              Myocardial infarction            1.6667  1.2222
16  Pulmonary fibrosis (E84.0)                     Lung cancer (C34.1)              1.6667  1.4444
6   Cholangiocarcinoma                             Colonoscopy                      1.3333  1.0000
29  Lymphoid hyperplasia (K38.0)                   Laryngeal cancer (C32.0)         1.3333  1.0000
21  Multiple sclerosis (F06.8)                     Psychosis (F06.8)                1.0000  1.0000
22  Appendicitis (K35)                             Osteoporosis (M80)               1.0000  1.0000
23  Rectal polyp (K62.1)                           Aorta (I70.0)                    1.0000  1.0000
24  Xerostomia (K11.7)                             Alcoholic cirrhosis (K70.3)      1.0000  1.0000
25  Peptic ulcer disease (K21.0)                   Myopia (H52.1)                   1.0000  1.0000
26  Depression (F20.4)                             Cellulitis (H60.1)               1.0000  1.0000
28  Varicose vein                                  Entire knee meniscus             1.0000  1.0000
30  Hyperlipidemia (E78.0)                         Metastasis (C77.0)               1.0000  1.0000
5.2 Experiments and Results
Table 2 lists the test set of 30 medical term pairs, sorted in order of the averaged physicians' scores
(taken from Pedersen et al. 2005 [1]). Al-Mubaid and Nguyen [5, 11] found only 24 of the 30
concept pairs in ICD-10 using the browser at http://apps.who.int/classifications/icd10/browse/2016/en
(version 2010).
Another biomedical dataset, containing 36 MeSH term pairs [15], was also used. The human scores in
this dataset are the averaged scores assigned by reliable doctors. The UMLSKS browser [12] was used
for SNOMED-CT terms, and the MeSH Browser [13] for MeSH terms. Tables 3, 4, 5, and 6 show
Dataset 2 along with the human scores and the scores of the path length, Wu and Palmer,
Leacock and Chodorow, and Al-Mubaid and Nguyen techniques, calculated using the MeSH
ontology. The term pairs in bold in Tables 3, 4, 5, and 6 are those that contain
a term not found in the MeSH ontology; they were excluded from the experiments.
Table 3. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and path length
scores using the MeSH ontology.
Id   Concept 1     Concept 2           Human   Path length
1    Anemia        Appendicitis        0.031   8
2    Meningitis    Tricuspid Atresia   0.031   8
...
36   Chicken Pox   Varicella           0.968   1
Table 4. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and Wu and
Palmer scores using the MeSH ontology.
Id   Concept 1     Concept 2           Human   Wu & Palmer
1    Anemia        Appendicitis        0.031   0.364
2    Meningitis    Tricuspid Atresia   0.031   0.364
...
36   Chicken Pox   Varicella           0.968   1.000
Table 5. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and Leacock and
Chodorow scores using the MeSH ontology.
Id   Concept 1     Concept 2           Human   Leacock & Chodorow
1    Anemia        Appendicitis        0.031   1.099
2    Meningitis    Tricuspid Atresia   0.031   1.099
...
36   Chicken Pox   Varicella           0.968   3.178
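The Leacock and Chodorow scores in Table 5 appear consistent with the standard form of that measure, -ln(len / (2 × D)), applied to the path lengths in Table 3. The short Python sketch below checks this under the assumption that the maximum depth of the MeSH hierarchy is D = 12; that value is inferred here from the table entries and is not stated in the paper, and the function name is hypothetical.

```python
import math


def leacock_chodorow(path_len, max_depth):
    """Standard Leacock & Chodorow form: -ln(path_len / (2 * max_depth))."""
    return -math.log(path_len / (2.0 * max_depth))


# Assumption: D = 12, inferred from the table values, not given in the source.
ASSUMED_MAX_DEPTH = 12

print(round(leacock_chodorow(8, ASSUMED_MAX_DEPTH), 3))  # 1.099
print(round(leacock_chodorow(1, ASSUMED_MAX_DEPTH), 3))  # 3.178
```

Under this assumption, a path length of 8 gives 1.099 and a path length of 1 gives 3.178, matching the visible rows of Table 5. Unlike the path length itself, this score grows as the two concepts get closer.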
Table 6. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and the
Al-Mubaid & Nguyen measure (SemDist) using the MeSH ontology.
Id   Concept 1     Concept 2           Human   SemDist
1    Anemia        Appendicitis        0.031   4.263
2    Meningitis    Tricuspid Atresia   0.031   4.263
36   Chicken Pox   Varicella           0.968   0.000
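The paper compares each measure with human scores by correlation coefficient. As a minimal sketch of that step, the Python snippet below computes the Pearson correlation between the human scores and the SemDist values for the three rows of Table 6 that are visible here. The function name is hypothetical, and three rows are far too few for a meaningful estimate, so the result only illustrates the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The three rows of Table 6 shown in this paper (ids 1, 2, and 36).
human = [0.031, 0.031, 0.968]
sem_dist = [4.263, 4.263, 0.000]

r = pearson(human, sem_dist)
print(round(r, 3))  # -1.0
```

The sign is negative because SemDist behaves as a distance: larger values correspond to lower human similarity ratings. When comparing such a measure with similarity-style measures, it is the magnitude of the correlation that is comparable.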
6. CONCLUSIONS AND FUTURE WORK
In this paper we discussed the basics of semantic similarity techniques and the classification of
single-ontology and cross-ontology similarity measures. We gave a brief introduction to the
various semantic similarity measures in the biomedical domain. From the above, SemDist can be
used as a semantic similarity measure in the biomedical domain. In future work, we intend to
explore the semantic similarity techniques in the biomedical domain (ICD-10, MeSH, and
SNOMED-CT) within the UMLS framework. We also plan to implement a web-based user interface
for all these semantic similarity techniques and to make it freely available to researchers over
the Internet. This will be helpful for researchers interested in bioinformatics text mining.
7. REFERENCES
[1] Ted Pedersen, et al., “Measures of semantic similarity and relatedness in the biomedical
domain”, Journal of Biomedical Informatics 40 (2007) 288–299.
[2] Hisham Al-Mubaid and Hoa A. Nguyen, “A Cluster-Based Approach for Semantic Similarity
in the Biomedical Domain”, Proceedings of the 28th IEEE EMBS Annual International
Conference, New York City, USA, Aug 30-Sept 3, 2006.
[3] Hisham Al-Mubaid and Hoa A. Nguyen, “Measuring Semantic Similarity between Biomedical
Concepts within Multiple Ontologies”, IEEE Trans Syst Man Cybern Part C: Appl Rev 2009, 39.
[4] Ahmad Kayed, et al., “Ontology Evaluation: Which Test to Use”, 2013 5th International
Conference on Computer Science and Information Technology (CSIT), IEEE, pp 45-48, 2013.
[5] Hisham Al-Mubaid and Hoa A. Nguyen, “New Ontology Based Semantic Similarity for the
Biomedical Domain”, (2006) p 623–628.
[6] S. Anitha Elavarasi, et al., “A Survey on Semantic Similarity Measure”, International Journal
of Research in Advent Technology, Vol. 2, No. 3, March 2014, E-ISSN: 2321-9637.
[7] Nguyen H., Al-Mubaid H., “New Semantic Similarity Techniques of Concepts applied
in the biomedical domain and WordNet”, MS Thesis, University of Houston Clear Lake, Houston,
TX, USA, 2006.
[8] World Health Organization, “International statistical classification of diseases and related
health problems”, 10th revision, edition 2010.
[9] Hisham Al-Mubaid and Hoa A. Nguyen, “Using MEDLINE as Standard Corpus for Measuring
Semantic Similarity in the Biomedical Domain”, Sixth IEEE Symposium on BioInformatics and
BioEngineering (BIBE'06), 2006.
[10] Mirjana Ivanovic and Zoran Budimac, “An overview of ontologies and data resources in medical
domains”, Expert Systems with Applications 41 (2014) 5158–5166.
[11] Montserrat Batet Sanromà, “Ontology-based semantic clustering”, PhD Thesis, 2010.
[12] UMLSKS. Available: http://umlsks.nlm.nih.gov
[13] MeSH Browser. Available: http://www.nlm.nih.gov/mesh/MBrowser.html

More Related Content

PDF
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
PDF
Nlp based retrieval of medical information for diagnosis of human diseases
PDF
Nlp based retrieval of medical information for diagnosis of human diseases
PDF
Biomedical indexing and retrieval system based on language modeling approach
PDF
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
PDF
Visual Analytics and the Language of Web Query Logs - A Terminology Perspective
PDF
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...
PDF
Ijcet 06 10_003
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
Nlp based retrieval of medical information for diagnosis of human diseases
Nlp based retrieval of medical information for diagnosis of human diseases
Biomedical indexing and retrieval system based on language modeling approach
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
Visual Analytics and the Language of Web Query Logs - A Terminology Perspective
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...
Ijcet 06 10_003

Similar to Semantic Similarity Measures between Terms in the Biomedical Domain within frame work Unified Medical Language System (UMLS) (20)

PDF
Great model a model for the automatic generation of semantic relations betwee...
PDF
Root cause analysis of COVID-19 cases by enhanced text mining process
PDF
International Journal of Computational Engineering Research(IJCER)
PDF
www.ijerd.com
PDF
2006NIC-NLPPoster_V2
PDF
An Automatic Approach for Bilingual Tuberculosis Ontology Based on Ontology D...
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
Three Feature Based Ensemble Deep Learning Model for Pulmonary Disease Classi...
PPT
Whofic2013 postera riusagrauperamnozalmroviracgallego
PDF
Cambridge seminar april 2018
PDF
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
PDF
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
PDF
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
DOCX
Strengths and Weakness of Informatics.docx
PDF
USABILITY TESTING PROCESS WITH PEOPLE WITH DOWN SYNDROME INTREACTING WITH MOB...
PDF
Hi-C Data Analysis 1st Edition Silvio Bicciato
PDF
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
PDF
A Proposed Multi-Domain Approach for Automatic Classification of Text Documents
PDF
A Survey On Medical Health Records And AI
PDF
A Literature Review and a Case Study.pdf
Great model a model for the automatic generation of semantic relations betwee...
Root cause analysis of COVID-19 cases by enhanced text mining process
International Journal of Computational Engineering Research(IJCER)
www.ijerd.com
2006NIC-NLPPoster_V2
An Automatic Approach for Bilingual Tuberculosis Ontology Based on Ontology D...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
Three Feature Based Ensemble Deep Learning Model for Pulmonary Disease Classi...
Whofic2013 postera riusagrauperamnozalmroviracgallego
Cambridge seminar april 2018
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
E-Symptom Analysis System to Improve Medical Diagnosis and Treatment Recommen...
Strengths and Weakness of Informatics.docx
USABILITY TESTING PROCESS WITH PEOPLE WITH DOWN SYNDROME INTREACTING WITH MOB...
Hi-C Data Analysis 1st Edition Silvio Bicciato
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
A Proposed Multi-Domain Approach for Automatic Classification of Text Documents
A Survey On Medical Health Records And AI
A Literature Review and a Case Study.pdf
Ad

More from Editor IJCATR (20)

PDF
Advancements in Structural Integrity: Enhancing Frame Strength and Compressio...
PDF
Maritime Cybersecurity: Protecting Critical Infrastructure in The Digital Age
PDF
Leveraging Machine Learning for Proactive Threat Analysis in Cybersecurity
PDF
Leveraging Topological Data Analysis and AI for Advanced Manufacturing: Integ...
PDF
Leveraging AI and Principal Component Analysis (PCA) For In-Depth Analysis in...
PDF
The Intersection of Artificial Intelligence and Cybersecurity: Safeguarding D...
PDF
Leveraging AI and Deep Learning in Predictive Genomics for MPOX Virus Researc...
PDF
Text Mining in Digital Libraries using OKAPI BM25 Model
PDF
Green Computing, eco trends, climate change, e-waste and eco-friendly
PDF
Policies for Green Computing and E-Waste in Nigeria
PDF
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
PDF
Optimum Location of DG Units Considering Operation Conditions
PDF
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
PDF
Web Scraping for Estimating new Record from Source Site
PDF
A Strategy for Improving the Performance of Small Files in Openstack Swift
PDF
Integrated System for Vehicle Clearance and Registration
PDF
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
PDF
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
PDF
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
PDF
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Advancements in Structural Integrity: Enhancing Frame Strength and Compressio...
Maritime Cybersecurity: Protecting Critical Infrastructure in The Digital Age
Leveraging Machine Learning for Proactive Threat Analysis in Cybersecurity
Leveraging Topological Data Analysis and AI for Advanced Manufacturing: Integ...
Leveraging AI and Principal Component Analysis (PCA) For In-Depth Analysis in...
The Intersection of Artificial Intelligence and Cybersecurity: Safeguarding D...
Leveraging AI and Deep Learning in Predictive Genomics for MPOX Virus Researc...
Text Mining in Digital Libraries using OKAPI BM25 Model
Green Computing, eco trends, climate change, e-waste and eco-friendly
Policies for Green Computing and E-Waste in Nigeria
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Optimum Location of DG Units Considering Operation Conditions
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Web Scraping for Estimating new Record from Source Site
A Strategy for Improving the Performance of Small Files in Openstack Swift
Integrated System for Vehicle Clearance and Registration
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Ad

Recently uploaded (20)

PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Sustainable Sites - Green Building Construction
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Welding lecture in detail for understanding
PPTX
additive manufacturing of ss316l using mig welding
PDF
composite construction of structures.pdf
PDF
Well-logging-methods_new................
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Lecture Notes Electrical Wiring System Components
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPT
Mechanical Engineering MATERIALS Selection
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Sustainable Sites - Green Building Construction
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Foundation to blockchain - A guide to Blockchain Tech
bas. eng. economics group 4 presentation 1.pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Welding lecture in detail for understanding
additive manufacturing of ss316l using mig welding
composite construction of structures.pdf
Well-logging-methods_new................
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Lecture Notes Electrical Wiring System Components
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Automation-in-Manufacturing-Chapter-Introduction.pdf
Mechanical Engineering MATERIALS Selection
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx

Semantic Similarity Measures between Terms in the Biomedical Domain within frame work Unified Medical Language System (UMLS)

  • 1. International Journal of Computer Applications Technology and Research Volume 7–Issue 08, 331-340, 2018, ISSN:-2319–8656 www.ijcat.com 331 Semantic Similarity Measures between Terms in the Biomedical Domain within frame work Unified Medical Language System (UMLS) Abdelhakeem M. B. Abdelrahman Sudan University of Science and Technology Collage of Graduate Studies Khartoum, Sudan Dr. Ahmad Kayed Department of Computing and Information Technology Sohar University, Sohar, Oman Abstract The techniques and tests are tools used to define how measure the goodness of ontology or its resources. The similarity between biomedical classes/concepts is an important task for the biomedical information extraction and knowledge discovery. However, most of the semantic similarity techniques can be adopted to be used in the biomedical domain (UMLS). Many experiments have been conducted to check the applicability of these measures. In this paper, we investigate to measure semantic similarity between two terms within single ontology or multiple ontologies in ICD-10 “V1.0” as primary source, and compare my results to human experts score by correlation coefficient. Keywords: Information extraction, biomedical domain, semantic similarity techniques, Unified Medical Language System (UMLS), and Semantic Information Retrieval (SIR). 1. INTRODUCTION Ontology is test bed of semantic web, capturing knowledge about certain area via providing relevant concept and relation between them. Quality metrics are essential to evaluate the quality. Metrics are based on structure and semantic level. At the present the ontology evaluation is based only on structural metrics, which has not been very appropriate in providing desired results.
  • 2. International Journal of Computer Applications Technology and Research Volume 7–Issue 08, 331-340, 2018, ISSN:-2319–8656 www.ijcat.com 332 Semantic similarity measures are widely used in Natural Language Processing. We show how six existing domain-independent measures can be adapted to the biomedical domain. Semantic similarity techniques are becoming important components in most intelligent knowledge-based and Semantic Information Retrieval (SIR) systems [1]. Measures and tests are provided to define how we can measure the “goodness” of ontology or its resources. Many experiments have been conducted to check the applicability of these measures [4]. General English ontology based structure similarity measures can be adopted to be used into the biomedical domain within UMLS. New approach for measuring semantic similarity between biomedical concepts using multiple ontologies is proposed by Al-Mubaid and Nguyen [2, 3]. They proposed new ontology structure based technique for measuring semantic similarity between single ontology and multiple ontologies in the biomedical domain within the frame work of Unified Medical Subject Language System (UMLS). Their proposed measure based on three features [2]: first Cross modified path length between two concepts. Second, new features of common specificity of concepts in the ontology. Third Local ontology granularity of ontology cluster. 2. BIOMEDICAL DOMAIN ONTOLOGIES Most of the semantic similarity techniques work in the biomedical domain uses only ontology (e.g. MeSH, SOMED-CT) for computing the similarity between the biomedical terms[9]. However, in this work we use ICD- 10 ontology as primary source to computing the similarity between concepts in biomedical domain. International Classification of Diseases (ICD): The newest edition (ICD- 10) is divided into 22 chapters: (Infections, Neoplasm, Blood Diseases, Endocrine Diseases, etc.), and denote about 14,000 classes of diseases and related problems. 
The first character of the ICD code is a letter, and each letter is associated with a particular chapter, except for the letter D, which is used in both Chapter II, Neoplasms, and Chapter III, Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism, and the letter H, which is used in both Chapter VII, Diseases of the eye and adnexa, and Chapter VIII, Diseases of the ear and mastoid process. Four chapters (Chapters I, II, XIX and XX) use more than one letter in the first position of their codes. Each chapter contains sufficient three-character categories to cover its content; not all available codes are used, allowing space for future revision and expansion. Chapters I–XVII relate to diseases and other morbid conditions, and Chapter XIX to injuries, poisoning and certain other consequences of external causes. The remaining chapters complete the range of subject matter nowadays included in diagnostic data. Chapter XVIII covers Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified. Chapter XX, External causes of morbidity and mortality, was traditionally used to classify causes of injury and poisoning, but, since the Ninth Revision, has also provided for any recorded external cause of diseases and other morbid conditions. Finally, Chapter XXI, Factors influencing health status and contact with health services, is intended for the classification of data explaining the reason for contact with health-care services of a person not currently sick, or the circumstances in which the patient is receiving care at that particular time or otherwise having some bearing on that person’s care [8, 10].

3. SEMANTIC SIMILARITY TECHNIQUES CHALLENGES IN THE BIOMEDICAL DOMAIN
Most existing semantic similarity techniques that use ontology structure as their primary source cannot measure the similarity between terms within a single ontology or across multiple ontologies in the biomedical domain within the framework of the Unified Medical Language System (UMLS). However, some of these techniques have been adapted to the biomedical domain by incorporating domain information extracted from clinical data or medical ontologies.

4. RELATED WORK
4.1 Rada et al. proposed semantic distance as a potential measure of semantic similarity between two concepts in MeSH, and implemented the shortest path length measure, called CDist, based on the shortest distance between two concept nodes in the ontology.
They evaluated CDist on the UMLS Metathesaurus (MeSH, SNOMED, ICD9) and then compared the CDist similarity scores to human expert scores using correlation coefficients.

4.2 Caviedes and Cimino [11] implemented a shortest-path-based measure, called CDist, based on the shortest distance between two concept nodes in the ontology. They evaluated CDist on the UMLS Metathesaurus (MeSH, SNOMED, ICD9) and then compared the CDist similarity scores to human expert scores using the correlation coefficient.

4.3 Pedersen et al. [1] proposed measures of semantic similarity and relatedness for the biomedical domain, applying a corpus-based context vector approach to measure the similarity between concepts in SNOMED-CT. Their context vector approach is ontology-free but requires training text, for which they used text data from the Mayo Clinic corpus of medical notes.

4.4 Wu and Palmer Similarity Measure [11] proposed a method that defines the semantic similarity between concepts C1 and C2 as

$$ \mathrm{Sim}(C_1, C_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} \tag{1} $$

where N1 is the length, given as the number of nodes, of the path from C1 to C3, the least common super-concept of C1 and C2, and N2 is the length, given as the number of nodes, of the path from C2 to C3. N3 is the depth of C3 in the hierarchy, given as the number of nodes on the path from C3 to the root, and it serves as the scaling factor.

Figure 1. Fragment of Intestinal infectious diseases
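As a minimal sketch of Eq. (1), the function below evaluates the Wu and Palmer score from three node counts. The function name `wu_palmer` and the sample counts are illustrative assumptions, not values taken from the paper; N3 is assumed here to be the number of nodes on the path from C3 up to the root.

```python
def wu_palmer(n1: int, n2: int, n3: int) -> float:
    """Eq. (1): Sim = 2*N3 / (N1 + N2 + 2*N3).

    n1, n2: path lengths from C1 and C2 to their least common super-concept C3.
    n3: depth of C3 (assumed here: nodes on the path from C3 to the root).
    """
    if n1 < 0 or n2 < 0 or n3 <= 0:
        raise ValueError("n1 and n2 must be >= 0 and n3 must be > 0")
    return 2 * n3 / (n1 + n2 + 2 * n3)


# Hypothetical counts, for illustration only.
identical = wu_palmer(0, 0, 3)  # both concepts coincide with C3
siblings = wu_palmer(1, 1, 2)   # two children of the same parent
distant = wu_palmer(3, 4, 2)    # longer paths to the same C3

print(identical, siblings, distant)
```

Under these assumptions the score falls as the paths to C3 grow and rises toward 1 as C3 sits deeper in the hierarchy.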
For example, in Figure 1, LCS(A00.1, A00.9) = A00 and LCS(A00, A01) = A00-A09, where LCS is the least common subsumer of two concept nodes, and N1 and N2 are the path lengths from each concept node to the LCS, respectively.

4.5 Al-Mubaid and Nguyen Similarity Technique [5, 11]. The proposed measure takes into account the depth of the least common subsumer (LCS) of two concepts and the length of the shortest path between them. Higher similarity arises when the two concepts are at a lower level of the hierarchy. Their measure is:

$$ \mathrm{Sim}(c_1, c_2) = \log_2\big([L(c_1, c_2) - 1] \times [D - \mathrm{depth}(\mathrm{LCS}(c_1, c_2))] + 2\big) \tag{2} $$

where:
L(c1, c2) is the shortest distance between c1 and c2;
LCS(c1, c2) is the lowest common subsumer of c1 and c2;
depth(LCS(c1, c2)) is the depth of LCS(c1, c2) using node counting;
D is the maximum depth of the taxonomy.

The measure equals 1 when the two concept nodes coincide within the same cluster/ontology, because L(c1, c2) = 1 under node counting and the product term vanishes. The maximum value of the measure occurs when one of the concepts is the leftmost leaf node and the other is the rightmost leaf node of the tree.

As an example, consider the ICD-10 terminology. The category tree “Intestinal infectious diseases” is assigned the letter A in ICD-10 terminology version 2016 (http://apps.who.int/classifications/icd10/browse/2016/en#/A00-A09). This tree looks as follows:

Intestinal infectious diseases [A00-A09]
    Cholera [A00]+
    Typhoid and paratyphoid fevers [A01]+
    Other salmonella infections [A02]+
    Shigellosis [A03]+
    Viral and other specified intestinal infections [A08]+
    Other gastroenteritis and colitis of infectious and unspecified origin [A09]+

The similarity between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is lower than the similarity between “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]” and “Cholera, unspecified [A00.9]”.
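Eq. (2) can be evaluated directly once the path length, the LCS depth and the maximum depth are known. The sketch below is only an illustration: the function name `al_mubaid_nguyen` and the taxonomy of maximum depth 4 are assumptions, and node counting is used throughout.

```python
import math


def al_mubaid_nguyen(path_len: int, lcs_depth: int, max_depth: int) -> float:
    """Eq. (2): log2((L - 1) * (D - depth(LCS)) + 2), with node counting."""
    if path_len < 1:
        raise ValueError("path_len counts nodes, so it must be >= 1")
    if not 1 <= lcs_depth <= max_depth:
        raise ValueError("lcs_depth must lie between 1 and max_depth")
    return math.log2((path_len - 1) * (max_depth - lcs_depth) + 2)


# Hypothetical taxonomy of maximum depth D = 4.
same_node = al_mubaid_nguyen(path_len=1, lcs_depth=3, max_depth=4)
shallow_pair = al_mubaid_nguyen(path_len=3, lcs_depth=1, max_depth=4)
deep_pair = al_mubaid_nguyen(path_len=3, lcs_depth=3, max_depth=4)

print(same_node, shallow_pair, deep_pair)
```

With the same path length, the pair whose LCS lies deeper receives the smaller value, which is how the measure separates pairs that a pure path-length measure cannot distinguish.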
This measure takes into account the depth of the LCS of the two concepts. The path length (P.L) and Leacock & Chodorow (L.C) measures produce nearly the same similarity for the two pairs (A00, A01) and (A00.1, A00.9), whereas the Sim(c1, c2) measure (Eq. 2 in Table 1) gives higher similarity to the pair at the lower level of the ontology hierarchy, (A00.1, A00.9).

Table 1: Measures Comparison

Pair of Concepts   P.L    L.C    C.K    Al-Mubaid & Nguyen Measure (Eq. 2)
A00 – A01          0.37   2.13   0.91   3.2
A00.1 – A00.9      0.33   2.15   0.91   1.6

The higher numeric result for (A00, A01) under Eq. 2 means lower semantic similarity between them.

5. EVALUATION
5.1 Datasets: There are no standard human rating sets for semantic similarity in the biomedical domain. Thus, Al-Mubaid and Nguyen [3, 11] used the dataset from Pedersen et al. [1], which was annotated by 3 physicians and 9 medical index experts, to evaluate their proposed measure in the biomedical domain.

In the ICD-10 tree, the symbol “+” indicates that the concept can be further expanded into a sub-tree (sub-concepts). For example, “Cholera” [A00] can be expanded as follows:

Cholera [A00]
    Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]+
    Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]+
    Cholera, unspecified [A00.9]+
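The ICD-10 fragment can be encoded as a small child-to-parent map to show how the LCS, the node-counted path length and the measure of Eq. (2) are obtained for the two example pairs. This is a sketch under stated assumptions: the block A00-A09 is treated as the root of the tree, the maximum depth D is taken from this fragment only, and all function names are hypothetical. The resulting numbers are therefore not expected to match Table 1, which was computed on a larger tree.

```python
import math

# Assumed encoding of the fragment: child -> parent, with A00-A09 as root.
PARENT = {
    "A00-A09": None,
    "A00": "A00-A09", "A01": "A00-A09", "A02": "A00-A09",
    "A03": "A00-A09", "A08": "A00-A09", "A09": "A00-A09",
    "A00.0": "A00", "A00.1": "A00", "A00.9": "A00",
}


def ancestors(code):
    """Return the path from a code up to the root, both ends included."""
    path = []
    while code is not None:
        path.append(code)
        code = PARENT[code]
    return path


def lcs(c1, c2):
    """Least common subsumer: first ancestor of c2 that also subsumes c1."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node
    raise ValueError("concepts share no ancestor")


def path_length(c1, c2):
    """Shortest path between c1 and c2 through their LCS, counting nodes."""
    common = lcs(c1, c2)
    return ancestors(c1).index(common) + ancestors(c2).index(common) + 1


def eq2(c1, c2):
    """Eq. (2), with D taken as the deepest node of this fragment."""
    max_depth = max(len(ancestors(code)) for code in PARENT)
    lcs_depth = len(ancestors(lcs(c1, c2)))
    return math.log2((path_length(c1, c2) - 1) * (max_depth - lcs_depth) + 2)


print(lcs("A00.1", "A00.9"), lcs("A00", "A01"))
print(eq2("A00", "A01"), eq2("A00.1", "A00.9"))
```

Both pairs have the same node-counted path length, yet the deeper pair (A00.1, A00.9) receives the smaller value, consistent with the ordering reported in Table 1.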
Table 2. Dataset 1: 30 medical term pairs sorted in the order of the averaged physicians’ scores (taken from Pedersen et al. [1]).

Id  Concept 1                                      Concept 2                        Phys    Expert
4   Renal failure (I12.0)                          Kidney failure (I12.0)           4.0000  4.0000
5   Heart (I51.5)                                  Myocardium (I51.5)               3.3333  3.0000
1   Stroke (I64)                                   Infarct (I64)                    3.0000  2.7778
7   Abortion (O03)                                 Miscarriage (O03)                3.0000  3.3333
9   Delusion (F06.2)                               Schizophrenia (F06.2)            3.0000  2.2222
11  Congestive heart failure (I50.0)               Pulmonary edema (I50.1)          3.0000  1.4444
8   Metastasis (C77.0)                             Adenocarcinoma (C08.9)           2.6667  1.7778
17  Calcification (M61)                            Stenosis (H04.5)                 2.6667  2.0000
10  Diarrhea                                       Stomach cramps                   2.3333  1.3333
19  Mitral stenosis (I05.0)                        Atrial fibrillation (I48)        2.3333  1.3333
20  Chronic obstructive pulmonary disease (J44.9)  Lung infiltrates (J82)           2.0000  1.8889
2   Rheumatoid arthritis (M05.3)                   Lupus (L93)                      2.0000  1.1111
3   Brain tumor (G94.8)                            Intracranial hemorrhage (I69.2)  2.0000  1.3333
15  Carpal tunnel Syndrome (G56.0)                 Osteoarthritis (M19.9)           2.0000  1.1111
18  Diabetes mellitus (E10-E14)                    Hypertension (I10-I15)           2.0000  1.0000
27  Acne                                           Syringe                          2.0000  1.0000
12  Antibiotic (Z88.1)                             Allergy (Z88.1)                  1.6667  1.2222
13  Cortisone                                      Total knee replacement           1.6667  1.0000
14  Pulmonary embolus                              Myocardial infarction            1.6667  1.2222
16  Pulmonary Fibrosis (E84.0)                     Lung Cancer (C34.1)              1.6667  1.4444
6   Cholangiocarcinoma                             Colonoscopy                      1.3333  1.0000
29  Lymphoid hyperplasia (K38.0)                   Laryngeal Cancer (C32.0)         1.3333  1.0000
21  Multiple Sclerosis (F06.8)                     Psychosis (F06.8)                1.0000  1.0000
22  Appendicitis (K35)                             Osteoporosis (M80)               1.0000  1.0000
23  Rectal polyp (K62.1)                           Aorta (I70.0)                    1.0000  1.0000
24  Xerostomia (K11.7)                             Alcoholic cirrhosis (K70.3)      1.0000  1.0000
25  Peptic ulcer disease (K21.0)                   Myopia (H52.1)                   1.0000  1.0000
26  Depression (F20.4)                             Cellulitis (H60.1)               1.0000  1.0000
28  Varicose vein                                  Entire knee meniscus             1.0000  1.0000
30  Hyperlipidemia (E78.0)                         Metastasis (C77.0)               1.0000  1.0000

5.2 Experiments and Results
Table 2 shows the test set of 30 medical term pairs sorted in the order of the averaged physicians’ scores (taken from Pedersen et al. 2005 [1]). Al-Mubaid and Nguyen [5, 11] found only 24 of the 30 concept pairs in ICD-10 using the browser at http://apps.who.int/classifications/icd10/browse/2016/en, version 2010. Another biomedical dataset, containing 36 MeSH term pairs, was also used [15]. The human scores in this dataset are the averaged scores given by reliable doctors. The UMLSKS browser [12] was used for SNOMED-CT terms, and the MeSH Browser [13] for MeSH terms. Table 3, Table 4, Table 5, and Table 6 show Dataset 2 along with the human scores and the scores of the Path length, Wu and Palmer, Leacock and Chodorow, and Al-Mubaid & Nguyen techniques calculated using the MeSH ontology. The term pairs in bold in Table 3, Table 4, Table 5, and Table 6 are the ones that contain a term that was not found in the MeSH ontology, and they were excluded from the experiments.

Table 3. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and Path length scores using the MeSH ontology.

Id  Concept 1    Concept 2          Human  Path length
1   Anemia       Appendicitis       0.031  8
2   Meningitis   Tricuspid Atresia  0.031  8
...
36  Chicken Pox  Varicella          0.968  1

Table 4. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and Wu and Palmer’s scores using the MeSH ontology.

Id  Concept 1    Concept 2          Human  Wu & Palmer
1   Anemia       Appendicitis       0.031  0.364
2   Meningitis   Tricuspid Atresia  0.031  0.364
...
36  Chicken Pox  Varicella          0.968  1.000

Table 5. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and Leacock and Chodorow’s scores using the MeSH ontology.

Id  Concept 1    Concept 2          Human  Leacock & Chodorow
1   Anemia       Appendicitis       0.031  1.099
2   Meningitis   Tricuspid Atresia  0.031  1.099
...
36  Chicken Pox  Varicella          0.968  3.178
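The evaluation in this section rests on correlating each measure’s scores with the human scores. A minimal sketch of that step, using the Pearson correlation coefficient, is given below; the function name `pearson` and all score lists are hypothetical and are not taken from the tables.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four term pairs, for illustration only.
human = [0.1, 0.4, 0.5, 0.9]
similarity_like = [0.2, 0.5, 0.6, 1.0]  # higher means more similar
distance_like = [4.0, 2.5, 2.0, 0.0]    # higher means less similar

r_similarity = pearson(human, similarity_like)
r_distance = pearson(human, distance_like)
print(r_similarity, r_distance)
```

A distance-like measure such as path length or SemDist is expected to correlate negatively with human similarity scores, so the sign has to be read with the direction of the measure in mind.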
Table 6. Biomedical Dataset 2 (36 pairs) with human similarity scores (Human) and the Al-Mubaid & Nguyen measure (SemDist) using the MeSH ontology.

Id  Concept 1    Concept 2          Human  SemDist
1   Anemia       Appendicitis       0.031  4.263
2   Meningitis   Tricuspid Atresia  0.031  4.263
...
36  Chicken Pox  Varicella          0.968  0.000

6. CONCLUSIONS AND FUTURE WORK
In this paper we discussed the basics of semantic similarity techniques and the classification of single-ontology and cross-ontology similarity measures. We gave a brief introduction to the various semantic similarity measures used in the biomedical domain. From the above, we conclude that SemDist can be used as a semantic similarity measure in the biomedical domain. In future work, we intend to explore the semantic similarity techniques in the biomedical domain (ICD-10, MeSH, and SNOMED-CT) within the UMLS framework. We also plan to implement a web-based user interface for all these semantic similarity techniques and to make it freely available to researchers over the Internet. This will be helpful for researchers interested in bioinformatics text mining.

7. REFERENCES
[1] Ted Pedersen, et al., “Measures of semantic similarity and relatedness in the biomedical domain”, Journal of Biomedical Informatics 40 (2007) 288–299.
[2] Hisham Al-Mubaid and Hoa A. Nguyen, “A Cluster-Based Approach for Semantic Similarity in the Biomedical Domain”, Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3, 2006.
[3] Hisham Al-Mubaid and Hoa A. Nguyen, “Measuring Semantic Similarity between Biomedical Concepts within Multiple Ontologies”, IEEE Trans Syst Man Cybern Part C: Appl Rev 2009, 39.
[4] Ahmad Kayed, et al., “Ontology Evaluation: Which Test to Use”, 2013 5th International Conference on Computer Science and Information Technology (CSIT), IEEE, pp 45-48, 2013.
[5] Hisham Al-Mubaid and Hoa A. Nguyen, “New Ontology Based Semantic Similarity for the Biomedical Domain”, (2006) p 623-628.
[6] S. Anitha Elavarasi, et al., “A Survey on Semantic Similarity Measure”, International Journal of Research in Advent Technology, Vol. 2, No. 3, March 2014, E-ISSN: 2321-9637.
[7] Nguyen H., Al-Mubaid H. (2006), “New Semantic Similarity Techniques of Concepts applied in the biomedical domain and WordNet”, MS Thesis, University of Houston Clear Lake, Houston, TX, USA, 2006.
[8] World Health Organization, “International statistical classification of diseases and related health problems”, 10th revision, edition 2010.
[9] Hisham Al-Mubaid and Hoa A. Nguyen, “Using MEDLINE as Standard Corpus for Measuring Semantic Similarity in the Biomedical Domain”, Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE'06), 2006.
[10] Mirjana Ivanovic and Zoran Budimac, “An overview of ontologies and data resources in medical domains”, Expert Systems with Applications 41 (2014) 5158–5166.
[11] Montserrat Batet Sanromà, “Ontology-based semantic clustering”, PhD Thesis, 2010.
[12] UMLSKS. Available: http://umlsks.nlm.nih.gov
[13] MeSH Browser. Available: http://www.nlm.nih.gov/mesh/MBrowser.html