Document Clustering in Amharic
   for information browsing and retrieval
             Yalemisew Mintesinot Abgaz

              Yabgaz@computing.dcu.ie




                     Dec 1, 2011
Introduction
• The rate of information production is growing exponentially
• Documents produced in the Amharic language are increasingly
    – available in digital format
    – accessible online
• More Amharic web documents than before
• Growing number of Amharic language users
• Increasing number of applications available in Amharic
Introduction
• Challenges ahead
    – Searching and accessing the information in Amharic is difficult
        • From the language perspective
        • From the knowledge perspective
        • Availability of tools
    – Identifying the relevant documents from the available ones is challenging
        • Searching and Search results
    – Browsing the documents in a concept map is not available
• The challenges call for a solution
Agenda Items
•   Introduction
•   Document clustering
•   Document clustering process
•   Experimental results
•   Conclusion
•   Future work
Document clustering
• Document clustering is the process of identifying groups (clusters) of
  documents with common features.
• Groups documents based on the similarity of their contents
• Used for information organization and information retrieval
• Makes it possible to design a retrieval mechanism that searches through the clusters
• Can be
   – Hierarchical
   – Non-hierarchical
• Is different from document classification
Document clustering
• Hierarchical document clustering
   – Is a widely used method
   – Generates hierarchical classes with generalization at the top and
     specialization at the bottom
• Clustering algorithms
   – Divisive
   – Agglomerative
       • Single link, complete link, group average link and Ward's method
       • Frequent itemset-based hierarchical clustering (FIHC)
Document clustering process
[Figure: clustering pipeline. Document collection → document text collection → indexing → index words → stemming → stemmed index words → vector representation → document term vectors → clustering → cluster representation. Supporting inputs shown alongside: stop word list, suffix list. At query time: query → query processing → query-cluster matching → output documents.]
Document clustering process
1. Document collection
   -   Amharic news documents collected from Walta Information Centre
   -   Similar documents were selected by previous researchers
   -   The documents cover various domains such as
       -   Governance
       -   Market
       -   Politics
       -   Sport
       -   Education etc.
Document clustering process
2. Document pre-processing
   - Indexing the documents
      -   Word identification (Amharic word separators are taken into account)
      -   Smoothing (characters with the same sound are mapped to a single character)
      -   e.g. ጸሃይ፣ ጸኅይ፣ ጸሀይ፣ ፀሃይ፣ ፀኃይ፣ ፀሀይ… → ፀሐይ
   - Stop word removal
      -   Non-content-bearing words such as [ለ፣ ወደ] = to, [ከ] = from and [የ] are removed
      -   News-domain stop words such as [ገልጿል] = disclosed, [አመልክቷል], etc.
   - Stop words are validated against their frequency in the document
     collection [a threshold of 100 is used]
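The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the homophone table is a small subset chosen to cover the slide's example, the separator set is assumed to be the Ethiopic punctuation range plus whitespace, and the frequency check reflects one possible reading of the threshold (a candidate is kept as a stop word only if it occurs at least 100 times in the collection). All function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset of homophone mappings, covering only the slide's
# example; a complete table would include more character series.
HOMOPHONES = str.maketrans({"ጸ": "ፀ", "ሀ": "ሐ", "ሃ": "ሐ", "ኃ": "ሐ", "ኅ": "ሐ"})

# Assumed separators: whitespace plus Ethiopic punctuation U+1361..U+1368
# (wordspace, full stop, comma, and so on).
SEPARATORS = re.compile(r"[\s\u1361-\u1368]+")


def normalize(word):
    """Map characters with the same sound to one canonical character."""
    return word.translate(HOMOPHONES)


def tokenize(text):
    """Split text on the assumed separators and normalize each token."""
    return [normalize(tok) for tok in SEPARATORS.split(text) if tok]


def validated_stop_words(candidates, docs, threshold=100):
    """Keep candidates whose collection frequency reaches the threshold."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {w for w in candidates if counts[w] >= threshold}


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]
```

Prefixed forms such as [ለ] or [የ] attach to the following word in running text, so a real system would need morphological handling for them; the sketch only removes stand-alone tokens.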
Document clustering process
3. Stemming of indexed terms
    -     The Amharic language is morphologically complex
    -     Nouns are inflected [prefix and suffix]
    -     e.g. አስተማሩ፣ አስተማረ፣ አስተማረች፣ አስተማርኩ → አስተማረ
    -     Verbs are inflected [prefix, suffix and infix]
    -     e.g. ሰበረ፣ ሰበረች፣ ሰበርክ፣ ሰበርሽ፣ አሰበረ → ሰበር → ስብር
    -     Stemming reduces the word variants to a common form
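A toy suffix stripper gives a feel for how such variants can be conflated. It is a deliberate simplification and not the stemmer used in this work: the four-entry suffix list is hypothetical, prefixes and infixes are ignored, and the final character is folded to its sixth-order form using the regular eight-character layout of the Ethiopic Unicode block. The stems it produces are therefore not necessarily the forms shown on the slide.

```python
# Hypothetical, very small suffix list (person/gender markers from the
# slide's examples); a real suffix list would be far larger.
SUFFIXES = ("ች", "ኩ", "ክ", "ሽ")


def to_sixth_order(ch):
    """Fold an Ethiopic syllable to the sixth order of its series.

    Assumes the regular layout in which each consonant series occupies
    eight consecutive code points starting at a multiple of 8 from U+1200.
    """
    cp = ord(ch)
    if 0x1200 <= cp <= 0x135A:
        return chr(cp - (cp - 0x1200) % 8 + 5)
    return ch


def stem(word, min_len=3):
    """Strip one known suffix, then fold the last character."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[: -len(suffix)]
            break
    if not word:
        return word
    return word[:-1] + to_sixth_order(word[-1])


for w in ["ሰበረ", "ሰበረች", "ሰበርክ", "ሰበርሽ"]:
    print(w, "->", stem(w))
```

A prefixed form such as አሰበረ is left with its prefix by this sketch, which is one reason stemming quality has a visible effect on the resulting clusters.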
Document clustering process
4. Representing documents using document vectors
   -  Term weighting is applied to the raw term frequencies
   -  Weight(d_i, j) = tf_ij × (log N − log n) + 1
      • tf_ij is the frequency of term j in document i
      • N is the number of documents in the collection and
      • n is the number of documents containing the term
   -  The result is a weighted term frequency for each index term
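The weighting formula can be written out as a short function. The sketch follows the formula exactly as it appears on the slide; the logarithm base is not stated there, so base 2 is an assumption, and the function name and the dictionary-based vector layout are illustrative choices.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} vector per tokenized document.

    weight = tf * (log N - log n) + 1, as written on the slide;
    log base 2 is an assumption.
    """
    n_docs = len(docs)
    # n: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: freq * (math.log2(n_docs) - math.log2(doc_freq[term])) + 1
            for term, freq in tf.items()
        })
    return vectors


vectors = term_weights([["a", "b", "a"], ["b", "c"]])
print(vectors[0])
```

A term that occurs in every document gets weight 1 regardless of its frequency, because log N − log n is then zero.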
Document clustering process
5. Clustering the documents
   -   Constructing the initial clusters
       -   Following the FIHC algorithm, initial clusters are constructed by setting the
           global support threshold to a value between 0 and 1
       -   Each initial cluster groups similar documents together; a new cluster is created
           whenever a document does not fit an existing one
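As FIHC is usually described, a global frequent itemset is a set of terms that occur together in at least a minimum fraction (the global support) of all documents, and one initial cluster is formed for each such itemset from the documents that contain it. The sketch below illustrates that idea with a brute-force search limited to itemsets of size one and two; the function name, the `max_size` cap and the toy data are assumptions, and a real implementation would use a proper frequent itemset miner.

```python
from itertools import combinations


def global_frequent_itemsets(docs, min_support, max_size=2):
    """Map each global frequent itemset to the documents containing it.

    Brute force over term combinations; only suitable for tiny examples.
    """
    doc_sets = [set(doc) for doc in docs]
    vocab = sorted(set().union(*doc_sets))
    n_docs = len(doc_sets)
    clusters = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(vocab, size):
            members = [i for i, d in enumerate(doc_sets) if d.issuperset(itemset)]
            if len(members) / n_docs >= min_support:
                clusters[itemset] = members
    return clusters


docs = [
    ["sport", "goal"],
    ["sport", "team"],
    ["market", "price"],
    ["market", "sport"],
]
# Initial clusters overlap: document 3 contains both frequent terms.
print(global_frequent_itemsets(docs, min_support=0.5))
```

Lowering `min_support` admits more and longer itemsets, which yields more initial clusters and, later, a deeper tree.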
Document clustering process
5. Clustering the documents
   -   Making the clusters disjoint
       -   A score function is used to measure how well a cluster fits the documents at
           hand.
   -   Hierarchical tree construction
       -   The cluster tree is built using inter-cluster similarity
       -   Centroid calculation
   -   Tree pruning
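The slides do not spell out the score function. The sketch below follows the form commonly given for FIHC: a document's global frequent terms that are also frequent inside the candidate cluster add to the score (weighted by their cluster support), and the remaining global frequent terms subtract from it (weighted by their global support). Each document is then kept only in its best-scoring cluster. The names, the `min_cluster_support` parameter and the data layout are assumptions.

```python
def cluster_support(term, members, doc_sets):
    """Fraction of the cluster's documents that contain the term."""
    return sum(term in doc_sets[i] for i in members) / len(members)


def score(doc_vec, members, doc_sets, global_sup, min_cluster_support):
    """Score how well one document fits one initial cluster."""
    total = 0.0
    for term, weight in doc_vec.items():
        if term not in global_sup:
            continue  # only global frequent terms contribute
        support = cluster_support(term, members, doc_sets)
        if support >= min_cluster_support:
            total += weight * support           # term is cluster frequent
        else:
            total -= weight * global_sup[term]  # penalty for a foreign term
    return total


def make_disjoint(initial_clusters, doc_vecs, global_sup, min_cluster_support=0.6):
    """Assign every document to its single best-scoring initial cluster."""
    doc_sets = [set(vec) for vec in doc_vecs]
    best = {}
    for label, members in initial_clusters.items():
        for i in members:
            s = score(doc_vecs[i], members, doc_sets, global_sup, min_cluster_support)
            if i not in best or s > best[i][1]:
                best[i] = (label, s)
    disjoint = {label: [] for label in initial_clusters}
    for i, (label, _) in sorted(best.items()):
        disjoint[label].append(i)
    return disjoint
```

After this step the cluster tree would be built and pruned; those stages are not sketched here.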
Experimental result
 • Tuning the global support to obtain a document hierarchy
        – More than 10% global support gives a flat hierarchy
        – Less than 1% global support gives a single vertical hierarchy
        – 5% global support gives better performance
Global support   Width    Depth   Remark
>= 20%           <= 9     0       Flat hierarchy
10%              61       2       1-level hierarchy (only for 2 classes)
5%               92       10      10-level hierarchy for two classes, 5-level hierarchy for five classes
<= 1%            >= 120   25      25-level hierarchy (took too much time to cluster)
Experimental result
[Figure: recall-precision curves for global support of 10% and 5%; x-axis: recall (0.1 to 1), y-axis: precision (0 to 0.9)]
Discussion of results
• Tuning the global support threshold plays a significant role in
  creating the required clusters
• Stemming affects the clusters and creates overlapping clusters
• High precision can be achieved if only frequent items (terms) are used
• High recall can be achieved when all index terms are used,
  but this greatly affects precision
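For reference, precision and recall for a single query can be computed as below. This is the standard definition, not code from the experiments; the function name and the use of document ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of the three relevant ones among them.
print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```

Returning more documents can only keep recall the same or raise it, while precision usually drops, which is the trade-off noted in the discussion.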
Future directions
• Developing a standard corpus collection
• Using ontologies as a concept map
• Standardization of Amharic language resources, such as a standard
  stop word list
• Further research in stemming [cross domain research]
• Comparison with other document clustering algorithms
• Comparison with other information retrieval methods
Thank you!




             Questions?
