Clustering of Similar Values, in Spanish,
for the Improvement of Search Systems
Sergio Luján-Mora & Manuel Palomar
(sergio.lujan@ua.es / @sergiolujanmora)
Department of Software and Computing Systems
University of Alicante, Spain
Published in:
Proceedings International Joint Conference, 7th IberoAmerican Conference, 15th Brazilian Symposium on AI,
IBERAMIA-SBIA 2000, Open Discussion Track Proceedings,
p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22
2000. ISBN: 85-87837-03-6.
Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)

Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Introduction
• Information systems → rapid and precise access
• Databases → find information
• Inconsistency: a term represented by
different values

Introduction
• Term
– Universidad de Alicante

• Different values found in databases:
– Universidad Alicante
– Unibersidad de Alicante
– Universitat d’Alacant
– University of Alicante
Introduction
• The problem:
– Data redundancy → inconsistency
– Integration of different databases into a common repository (e.g. data warehouses):
• different criteria → data redundancy → inconsistency

Introduction
• We use clustering within an automatic method for reducing inconsistency
1. Values that refer to the same term are clustered
2. All values are replaced by the cluster sample
Taxonomy of different values
• Omission or inclusion of the written
accent:
Asociación Astronómica
Asociacion Astronomica

• Lower-case / upper-case:
Departamento de Lenguajes y Sistemas
Departamento de lenguajes y sistemas
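Both variations on this slide can be neutralized before comparison. The following is a minimal sketch, assuming Unicode decomposition is an acceptable way to drop accents; the function names are illustrative and not taken from the original work. Note that this approach also maps ñ to n, which may or may not be desirable for Spanish data.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (such as the acute accent) from a string."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str) -> str:
    """Return an accent-free, lower-case form intended only for comparison."""
    return strip_accents(text).lower()


# The two accent variants and the two case variants collapse to one value each.
print(normalize("Asociación Astronómica") == normalize("Asociacion Astronomica"))
print(normalize("Departamento de Lenguajes y Sistemas")
      == normalize("Departamento de lenguajes y sistemas"))
```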
Taxonomy of different values
• Abbreviations and acronyms:
Dpto. de Derecho Civil
Departamento de Derecho Civil

• Word order:
Miguel de Cervantes Saavedra
Cervantes Saavedra, Miguel de
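Abbreviations can be handled with a lookup table applied word by word. The sketch below is only an illustration: the table `ABBREVIATIONS` and the function name are hypothetical, and a real table would have to be built for each database.

```python
# Hypothetical abbreviation table (an assumption, not taken from the slides).
ABBREVIATIONS = {
    "dpto.": "departamento",
    "univ.": "universidad",
}


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its long form (lower-cased)."""
    words = text.lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


# Both spellings of the department name end up identical.
print(expand_abbreviations("Dpto. de Derecho Civil"))
print(expand_abbreviations("Departamento de Derecho Civil"))
```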
Taxonomy of different values
• Different denominations:
Unidad de Registro Sismológico
Unidad de Registro Sísmico

• Punctuation marks:
Laboratorio Multimedia (mmlab)
Laboratorio Multimedia - mmlab
Taxonomy of different values
• Errors (misspelling, typing or printing
errors):

Gabinete de imagen
Gavinete de imagen

• Different languages:
Universidad de Alicante
University of Alicante
The solution
1. Preparation
2. Reading
3. Sorting
4. Clustering (main step)
5. Checking
6. Updating
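According to the speaker notes, the sorting step orders the strings by descending frequency of appearance, presumably so that the most frequent value is the one that seeds each cluster. A minimal sketch of that step, assuming the values have already been read into a list (the function name is illustrative):

```python
from collections import Counter


def sort_by_frequency(values):
    """Return the distinct values, most frequent first."""
    counts = Counter(values)
    # most_common() sorts by count in descending order; ties keep
    # the order in which the values were first seen.
    return [value for value, _ in counts.most_common()]


values = (["universidad de alicante"] * 3
          + ["universidad alicante"]
          + ["university of alicante"] * 2)
print(sort_by_frequency(values))
```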

The clustering algorithm
• Similarity:
– Edit distance or Levenshtein distance (LD)
– Invariant distance from word position
(IDWP)
Universidad de Alicante
Alicante, Universidad de
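The edit distance counts the insertions, deletions and substitutions needed to turn one string into the other. The sketch below implements it with the usual dynamic programming scheme. The slides do not define IDWP, so `word_order_free_distance` is only a stand-in (an assumption, not the authors' measure) that sorts the words before comparing, to show why a word-position-invariant distance helps with the example pair above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def word_order_free_distance(a: str, b: str) -> int:
    """Stand-in for IDWP (assumption): edit distance after sorting the words."""
    def key(text: str) -> str:
        return " ".join(sorted(text.replace(",", " ").split()))
    return levenshtein(key(a), key(b))


a, b = "Universidad de Alicante", "Alicante, Universidad de"
print(levenshtein(a, b))               # large: the words are in a different order
print(word_order_free_distance(a, b))  # small: word order is ignored
```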
The clustering algorithm
• Filtering:
– Length distance (LEND)
– Transposition-invariant distance (TID)
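The slide names the two filters without defining them. The sketch below assumes that LEND is the absolute difference in length and uses a character-multiset ("bag") distance as a stand-in for TID; both choices are assumptions. Each is cheap to compute and never exceeds the edit distance, so a pair that fails either filter can be skipped without computing LD.

```python
from collections import Counter


def length_distance(a: str, b: str) -> int:
    """Assumed form of LEND: absolute difference between the two lengths."""
    return abs(len(a) - len(b))


def bag_distance(a: str, b: str) -> int:
    """Stand-in for TID (assumption): compare character multisets, ignoring order."""
    count_a, count_b = Counter(a), Counter(b)
    only_in_a = sum((count_a - count_b).values())
    only_in_b = sum((count_b - count_a).values())
    return max(only_in_a, only_in_b)


print(length_distance("gabinete de imagen", "universidad de alicante"))
print(bag_distance("gabinete", "gavinete"))
```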

The clustering algorithm
Input:
C: Strings sorted in descending order by frequency (c1…cm)
Output:
G: Set of clusters (g1…gn)
Steps:
1. Select ci, the first string in C, and insert it into a new cluster gk
2. Remove ci from C
The clustering algorithm
3. For each string cj in C
    If LEND(ci, cj) < α LEND(ci, cj) then
        If TID(ci, cj) < α TID(ci, cj) then
            If LD(ci, cj) < α LD(ci, cj) then
                Insert cj into cluster gk
                Remove cj from C
            Else If IDWP(ci, cj) < α IDWP(ci, cj) then
                Insert cj into cluster gk
                Remove cj from C
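The steps above can be sketched as a greedy loop. This is an illustration under several assumptions rather than the authors' implementation: a single hypothetical threshold `alpha` is used for every test, the steps are repeated until C is empty, the `Else If` is paired with the LD test, LEND is taken as the length difference, and TID and IDWP are replaced by simple stand-ins (character-multiset distance and edit distance over sorted words).

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def lend(a, b):
    """Assumed LEND: difference in length."""
    return abs(len(a) - len(b))


def tid(a, b):
    """Stand-in for TID (assumption): character-multiset distance."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def idwp(a, b):
    """Stand-in for IDWP (assumption): edit distance after sorting the words."""
    def key(s):
        return " ".join(sorted(s.replace(",", " ").split()))
    return levenshtein(key(a), key(b))


def cluster(strings, alpha=3):
    """Greedy clustering; strings must already be sorted by descending frequency."""
    remaining = list(strings)
    clusters = []
    while remaining:
        seed = remaining.pop(0)           # steps 1 and 2: seed a new cluster
        group, rest = [seed], []
        for candidate in remaining:       # step 3: compare every other string
            similar = False
            # Cheap filters first; LD and IDWP run only when both pass.
            if lend(seed, candidate) < alpha and tid(seed, candidate) < alpha:
                similar = (levenshtein(seed, candidate) < alpha
                           or idwp(seed, candidate) < alpha)
            (group if similar else rest).append(candidate)
        remaining = rest
        clusters.append(group)
    return clusters


strings = ["universidad de alicante", "unibersidad de alicante",
           "alicante, universidad de", "gabinete de imagen", "gavinete de imagen"]
for group in cluster(strings):
    print(group)
```

With these inputs the misspelled and reordered variants join the cluster seeded by "universidad de alicante", while the two "gabinete" spellings form a second cluster.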

Results

Indexes for measuring the cluster complexity

CI: Consistency Index
FCI: File Consistency Index

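$$\mathrm{CI} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{LD}(x_i, x_j)}{\sum_{i=1}^{n} |x_i|}$$

$$\mathrm{FCI} = \frac{\sum_{i=1}^{m} \mathrm{CI}_i}{m}$$

Here x_1…x_n are the strings of a cluster, |x_i| is taken to be the length of x_i, and m is the number of clusters in the file.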
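The two indexes can be computed directly from their definitions. The sketch below assumes that the denominator of CI is the total length of the strings in the cluster and that FCI is the plain average of the CI values over all clusters, as the speaker notes describe; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def consistency_index(cluster):
    """CI: sum of all pairwise edit distances over the total string length."""
    total_distance = sum(levenshtein(x, y) for x in cluster for y in cluster)
    total_length = sum(len(x) for x in cluster)
    return total_distance / total_length


def file_consistency_index(clusters):
    """FCI: average CI over all clusters of a file."""
    return sum(consistency_index(c) for c in clusters) / len(clusters)


clusters = [["gabinete", "gavinete"], ["abc"]]
print(consistency_index(clusters[0]))    # 2 / 16
print(file_consistency_index(clusters))  # (0.125 + 0) / 2
```

A cluster whose strings are all identical has CI = 0; the more the strings differ, the larger the value.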

Results
• File A
– Without expansion of abbreviations
• FCI: 0.31
– With expansion of abbreviations
• FCI: 0.12

• File B
– Without expansion of abbreviations
• FCI: 1.72
– With expansion of abbreviations
• FCI: 1.11

Results
• Evaluation measures:
– ONC: optimal number of clusters
– NC: number of clusters generated
– NCC: number of completely correct
clusters
– NIC: number of incorrect clusters
– NES: number of erroneous strings
Results
• Precision: NCC / ONC
• Error: NIC / ONC
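The two ratios can be computed by comparing the generated clusters with hand-crafted optimal ones. The slides do not spell out how NCC and NIC are counted, so the sketch below makes two assumptions: a generated cluster is "completely correct" when it matches an optimal cluster exactly, and "incorrect" when it mixes strings from more than one optimal cluster. The data are made up for illustration.

```python
def evaluate(generated, optimal):
    """Compare generated clusters against hand-crafted optimal clusters."""
    optimal_sets = [frozenset(c) for c in optimal]
    # Map every string to the index of the optimal cluster it belongs to.
    owner = {s: i for i, c in enumerate(optimal_sets) for s in c}
    ncc = sum(1 for c in generated if frozenset(c) in optimal_sets)
    nic = sum(1 for c in generated if len({owner[s] for s in c}) > 1)
    onc = len(optimal_sets)
    return {"ONC": onc, "NC": len(generated), "NCC": ncc, "NIC": nic,
            "precision": ncc / onc, "error": nic / onc}


# Hypothetical clusters: "b1"/"b2" were split and "b2" was merged with "c1".
optimal = [["a1", "a2"], ["b1", "b2"], ["c1"], ["d1"]]
generated = [["a1", "a2"], ["b1"], ["b2", "c1"], ["d1"]]
print(evaluate(generated, optimal))
```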

Results
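• File A
– Without expansion of abbreviations
• Precision: 70.7%
• Error: 7.6%
– With expansion of abbreviations
• Precision: 84.8%
• Error: 0%

• File B
– Without expansion of abbreviations
• Precision: 67.4%
• Error: 8.7%
– With expansion of abbreviations
• Precision: 72.8%
• Error: 6.5%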

Conclusions
• The method achieves good results: it improves data quality
• The obtained clusters still need to be reviewed
• Expanding abbreviations improves the results
• Parameters




Editor's notes

Title
Perhaps we should begin. Good morning everyone. Thanks for coming. I am a member of the Department of Software and Computing Systems at the University of Alicante in Spain. The work I'm going to present is Clustering of Similar Values, in Spanish, for the Improvement of Search Systems.

Contents
First of all, I would like to outline the main points of my talk. I have structured my talk into six sections. Firstly, I will give an introduction to my research and I will make a few observations about the inconsistency problem. Then I will go on to explain the main causes of the problem. Next, I will talk about our proposed method for reducing inconsistency in databases. And then, I will present our clustering algorithm and the distance metrics it uses. Finally, I will highlight the main results of our method and the conclusions of our research. Let's move on to the first part of the presentation.

Introduction
Existing information systems provide rapid and precise access to information stored in databases. One of the main uses of databases is to find information. If a database is badly designed, it is very likely to suffer from inconsistency: the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on.

For example, if we consult a database that stores information about university research, we may easily find that there are different values for the same university:
- Universidad Alicante (without the preposition)
- Unibersidad de Alicante (with misspelling)
- Universitat d’Alacant (in Catalan)
- and even University of Alicante (in English)
If a database suffers from inconsistency, a search using a given value will not provide all the available information about the term.

The problem of inconsistency in the values stored in databases may have two origins:
- Different people may insert the same term with different values in a database.
- When we try to integrate different databases, they may use different values for representing the same term.

We present an automatic method for reducing the inconsistency found in existing databases and thus improving data quality. All the values that refer to the same term are clustered by measuring their degree of similarity. The clustered values can be assigned to a common value which, in principle, could be substituted for the original values.

Taxonomy of different values
After analysing several databases with information in Spanish, we have noticed that the different values that appear for a given term are due to a combination of the following causes:
- The omission or inclusion of the written accent.
- The use of lower-case and upper-case letters.
- The use of abbreviations and acronyms.
- Different word order.
- Different denominations.
- Punctuation marks: hyphens, commas, semi-colons, brackets, exclamation marks and so on.
- Errors: misspelling, typing or printing errors.
- Use of different languages.

The solution
The method we propose can be divided into six steps.
1. Preparation. It may be necessary to prepare the strings before applying the clustering algorithm.
2. Reading. The following process is repeated for each of the strings contained in the input file: read a string, expand abbreviations and acronyms, remove accents, shift the string to lower-case, store the string.
3. Sorting. The strings are sorted, in descending order, by frequency of appearance.

The clustering algorithm
The standard method of detecting exact duplicates in a table is to sort the table and then to check if neighboring tuples are identical. This approach can be extended to detect approximate duplicates.

The similarity between two strings must be evaluated. We use the edit distance and the invariant distance from word position. The edit distance of two strings is a measure of similarity given by the minimal number of simple edit operations needed to transform one string into the other. The simple editing operations considered are: the insertion of a character, the deletion of a character and the substitution of one character with another.

The filtering distances speed up the clustering: the expensive computation of LD and IDWP can be avoided.

Results
We have used two files for evaluating our method. They contain data from two databases with inconsistency problems. We have developed a coefficient named consistency index that permits the evaluation of the complexity of a cluster: the greater the value of the coefficient, the more different the strings that form the cluster are. The file consistency index is defined as the average of the consistency indexes of all the existing clusters in the file.

As we can see, the clusters of file B are more complex than those of file A. In both files, the FCI is reduced when expanding the abbreviations.

We have evaluated the clusters obtained by using four measures that are obtained by comparing the clusters produced with the optimal clusters (handcrafted).

From the previous measures, we obtain Precision (NCC divided by ONC) and Error (NIC divided by ONC).

In both files, the expansion of abbreviations produces improvements: it increases the precision and reduces the error. For file A, the maximum precision is 70.7% without and 84.8% with expansion of abbreviations. For file B, the maximum precision is 67.4% without and 72.8% with expansion of abbreviations.

Conclusions
This method achieves good results in all the experiments done, but it does not eliminate the need to review the obtained clusters. The expansion of abbreviations improves the results.