Clustering of Similar Values, in Spanish, for the Improvement of Search Systems

Sergio Luján-Mora & Manuel Palomar
(sergio.lujan@ua.es / @sergiolujanmora)
Department of Software and Computing Systems
University of Alicante, Spain
Published in:
Proceedings International Joint Conference, 7th IberoAmerican Conference, 15th Brazilian Symposium on AI,
IBERAMIA-SBIA 2000, Open Discussion Track Proceedings,
p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22
2000. ISBN: 85-87837-03-6.
Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1

Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)


Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions


Introduction
• Information systems → Rapid and precise access
• Databases → Find information
• Inconsistency: a term represented by different values



Introduction
• Term
– Universidad de Alicante

• Different values found in databases:
– Universidad Alicante
– Unibersidad de Alicante
– Universitat d’Alacant
– University of Alicante


Introduction
• The problem:
– Data redundancy → Inconsistency
– Integration of different databases into a common repository (e.g. data warehouses):
• different criteria → data redundancy → inconsistency


Introduction
• We use clustering within an automatic method for reducing inconsistency
1. Values that refer to the same term are clustered
2. All values are replaced by the cluster sample
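
The second step can be illustrated with a minimal sketch. It assumes that the cluster sample is the most frequent value in the cluster, which the slides do not state explicitly; the function name `replace_by_cluster_sample` and the example data are hypothetical.

```python
from collections import Counter

def replace_by_cluster_sample(values, clusters):
    """Map every value to the sample of the cluster that contains it."""
    freq = Counter(values)
    mapping = {}
    for cluster in clusters:
        # Assumption: the cluster sample is the most frequent member.
        sample = max(cluster, key=lambda v: freq[v])
        for v in cluster:
            mapping[v] = sample
    # Values that belong to no cluster are left unchanged.
    return [mapping.get(v, v) for v in values]

values = [
    "Universidad de Alicante",
    "Universidad de Alicante",
    "Unibersidad de Alicante",
    "Universidad Alicante",
]
clusters = [
    {"Universidad de Alicante", "Unibersidad de Alicante", "Universidad Alicante"},
]
cleaned = replace_by_cluster_sample(values, clusters)
```

After the replacement, every stored value in the cluster is identical, so an exact-match search for the term retrieves all records.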


Taxonomy of different values
• Omission or inclusion of the written
accent:
Asociación Astronómica
Asociacion Astronomica

• Lower-case / upper-case:
Departamento de Lenguajes y Sistemas
Departamento de lenguajes y sistemas


• Abbreviations and acronyms:
Dpto. de Derecho Civil
Departamento de Derecho Civil

• Word order:
Miguel de Cervantes Saavedra
Cervantes Saavedra, Miguel de


• Different denominations:
Unidad de Registro Sismológico
Unidad de Registro Sísmico

• Punctuation marks:
Laboratorio Multimedia (mmlab)
Laboratorio Multimedia - mmlab


• Errors (misspelling, typing or printing
errors):

Gabinete de imagen
Gavinete de imagen

• Different languages:
Universidad de Alicante
University of Alicante
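
Some of the categories above (accents, case, punctuation marks) can be neutralized by a simple normalization before any comparison, while others (misspellings, word order, abbreviations) cannot. The sketch below is only an illustration of that difference; the slides do not say that the method normalizes values this way, and the function `normalize` is hypothetical.

```python
import unicodedata

def normalize(value: str) -> str:
    # Decompose accented letters and drop the combining marks: "ó" -> "o".
    # Note: this also maps "ñ" to "n", which may be undesirable for Spanish.
    decomposed = unicodedata.normalize("NFD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then fold case and collapse whitespace.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents)
    return " ".join(cleaned.casefold().split())

# Accent, case and punctuation variants collapse to the same string.
same_accent = normalize("Asociación Astronómica") == normalize("Asociacion Astronomica")
same_case = normalize("Departamento de Lenguajes y Sistemas") == normalize(
    "Departamento de lenguajes y sistemas"
)
same_punct = normalize("Laboratorio Multimedia (mmlab)") == normalize(
    "Laboratorio Multimedia - mmlab"
)
# A misspelling survives normalization and needs a similarity measure.
same_typo = normalize("Gabinete de imagen") == normalize("Gavinete de imagen")
```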


The solution
1. Preparation
2. Reading
3. Sorting
4. Clustering (main step)
5. Checking
6. Updating



The clustering algorithm
• Similarity:
– Edit distance or Levenshtein distance (LD)
– Invariant distance from word position
(IDWP)
Universidad de Alicante
Alicante, Universidad de
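
A minimal sketch of the two similarity measures follows. `levenshtein` is the standard dynamic-programming edit distance. The slides do not define IDWP, so `idwp` is only one possible reading: words are matched greedily to their closest counterpart regardless of position, commas are ignored, and unmatched words count with their full length. All of these choices are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def idwp(a: str, b: str) -> int:
    """Word-order-invariant distance (hypothetical greedy formulation)."""
    words_a = a.replace(",", " ").split()
    words_b = b.replace(",", " ").split()
    total = 0
    for wa in words_a:
        if not words_b:
            total += len(wa)  # no word left to match against
            continue
        best = min(words_b, key=lambda wb: levenshtein(wa, wb))
        total += levenshtein(wa, best)
        words_b.remove(best)
    # Words of b that were never matched count with their full length.
    return total + sum(len(wb) for wb in words_b)

ld_typo = levenshtein("Universidad de Alicante", "Unibersidad de Alicante")
ld_order = levenshtein("Universidad de Alicante", "Alicante, Universidad de")
idwp_order = idwp("Universidad de Alicante", "Alicante, Universidad de")
```

With these definitions the reordered name is far from the original under LD but identical under the word-order-invariant measure, which is the case the example on this slide illustrates.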


• Filtering:
– Length distance (LEND)
– Transposition-invariant distance (TID)
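
The slides name the two filters without defining them. The sketch below assumes that LEND is the absolute difference in length and uses a bag-of-characters distance as a stand-in for TID, since it ignores character order entirely. In this form both values never exceed the edit distance, so they can cheaply rule out pairs before LD is computed. The original definitions may differ, for example by normalizing with the string length.

```python
from collections import Counter

def lend(a: str, b: str) -> int:
    """Length distance (assumed form): absolute difference in length."""
    return abs(len(a) - len(b))

def tid(a: str, b: str) -> int:
    """Order-insensitive distance (assumed form): compares character multisets."""
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())  # characters of a not covered by b
    only_b = sum((cb - ca).values())  # characters of b not covered by a
    return max(only_a, only_b)

lend_example = lend("Universidad de Alicante", "Universidad Alicante")
tid_transposed = tid("abc", "cba")
tid_typo = tid("Gabinete", "Gavinete")
```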



Input:
C: Strings sorted in descending order of frequency (c1…cm)
Output:
G: Set of clusters (g1…gn)
STEPS
1. Select ci, the first string in C, and insert it into the new cluster gk
2. Remove ci from C
3. For each string cj in C
   If LEND(ci, cj) < α_LEND(ci, cj) then
      If TID(ci, cj) < α_TID(ci, cj) then
         If LD(ci, cj) < α_LD(ci, cj) then
            Insert cj into cluster gk
            Remove cj from C
         Else If IDWP(ci, cj) < α_IDWP(ci, cj) then
            Insert cj into cluster gk
            Remove cj from C
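
The steps above can be sketched in Python as follows. Several points are assumptions that the slides do not confirm: the steps are repeated until C is empty, the thresholds are plain constants (with hypothetical default values of 4), and LEND, TID and IDWP use simple stand-in definitions.

```python
from collections import Counter

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lend(a, b):  # assumed: difference in length
    return abs(len(a) - len(b))

def tid(a, b):  # assumed: bag-of-characters distance
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))

def idwp(a, b):  # assumed: greedy word matching, commas ignored
    wa, wb = a.replace(",", " ").split(), b.replace(",", " ").split()
    total = 0
    for w in wa:
        if not wb:
            total += len(w)
            continue
        best = min(wb, key=lambda x: levenshtein(w, x))
        total += levenshtein(w, best)
        wb.remove(best)
    return total + sum(len(x) for x in wb)

def cluster(strings, a_lend=4, a_tid=4, a_ld=4, a_idwp=4):
    # C: distinct strings in descending order of frequency.
    remaining = [s for s, _ in Counter(strings).most_common()]
    clusters = []
    while remaining:                 # assumption: repeat until C is empty
        ci = remaining.pop(0)        # steps 1 and 2
        gk, rest = [ci], []
        for cj in remaining:         # step 3
            similar = (lend(ci, cj) < a_lend and tid(ci, cj) < a_tid
                       and (levenshtein(ci, cj) < a_ld
                            or idwp(ci, cj) < a_idwp))
            (gk if similar else rest).append(cj)
        remaining = rest
        clusters.append(gk)
    return clusters

values = (["Universidad de Alicante"] * 3
          + ["Unibersidad de Alicante", "Universidad Alicante",
             "Alicante, Universidad de"]
          + ["Gabinete de imagen"] * 2 + ["Gavinete de imagen"])
result = cluster(values)
```

Because C is ordered by frequency, the first string of each cluster is its most frequent member. The cheap filters are evaluated first so that the more expensive LD and IDWP computations are skipped for clearly dissimilar pairs.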


Results

Indexes for measuring the cluster complexity

CI: Consistency Index
FCI: File Consistency Index

$$CI = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} LD(x_i, x_j)}{\sum_{i=1}^{n} |x_i|}$$

$$FCI = \frac{\sum_{i=1}^{m} CI_i}{m}$$
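
A minimal sketch of both indexes, assuming that n is the number of strings in a cluster, that the denominator of CI is the total length of those strings, and that m is the number of clusters in the file. The function names are hypothetical.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def consistency_index(cluster):
    """CI: sum of all pairwise edit distances over the total string length."""
    total_ld = sum(levenshtein(x, y) for x in cluster for y in cluster)
    total_len = sum(len(x) for x in cluster)
    return total_ld / total_len if total_len else 0.0

def file_consistency_index(clusters):
    """FCI: average CI over the m clusters of a file."""
    if not clusters:
        return 0.0
    return sum(consistency_index(c) for c in clusters) / len(clusters)

clusters = [
    ["Gabinete de imagen", "Gavinete de imagen"],
    ["Universidad de Alicante"],
]
ci_first = consistency_index(clusters[0])
fci = file_consistency_index(clusters)
```

A cluster whose strings are all identical has CI = 0, so lower values indicate more consistent clusters.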


Results
• File A
– Without
• FCI: 0.31
– With
• FCI: 0.12

• File B
– Without
• FCI: 1.72
– With
• FCI: 1.11


Results
• Evaluation measures:
– ONC: optimal number of clusters
– NC: number of clusters generated
– NCC: number of completely correct
clusters
– NIC: number of incorrect clusters
– NES: number of erroneous strings


Results
• Precision: NCC / ONC
• Error: NIC / ONC
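
The two ratios can be computed from a generated clustering and a reference (optimal) clustering, as in the sketch below. The slides do not say how clusters are judged, so two assumptions are made: a cluster is completely correct when it equals an optimal cluster exactly, and it is incorrect when it mixes strings from different optimal clusters. The function `evaluate` and the toy data are hypothetical.

```python
def evaluate(generated, optimal):
    """Return precision (NCC / ONC) and error (NIC / ONC)."""
    onc = len(optimal)
    optimal_sets = {frozenset(c) for c in optimal}
    # Assumption: completely correct = identical to some optimal cluster.
    ncc = sum(1 for c in generated if frozenset(c) in optimal_sets)
    # Assumption: incorrect = contains strings from more than one optimal cluster.
    owner = {s: i for i, c in enumerate(optimal) for s in c}
    nic = sum(1 for c in generated if len({owner[s] for s in c}) > 1)
    return {"precision": ncc / onc, "error": nic / onc}

optimal = [["a1", "a2"], ["b1", "b2"], ["c1"]]
generated = [["a1", "a2"], ["b1"], ["b2", "c1"]]
scores = evaluate(generated, optimal)
```

Under these assumptions a cluster can be neither completely correct nor incorrect (for example `["b1"]`, which is incomplete but not mixed), so precision and error need not add up to 1.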



Results
• File A

• File B

– Without

– Without

• Precision: 70.7%


• Error: 7.6%

• Error: 8.7%

– With

– With



• Error: 0%

• Error: 6.5%



Conclusions
• The method achieves good results: it improves data quality
• Review obtained clusters
• Expansion of abbreviations
• Parameters

