
Exploring the Challenges of Natural Language Processing (NLP) in Ethiopian Languages
Submitted to: - Dr. Tibebe Beshahesta
DECEMBER 27, 2023
HiLCoE School of Computer Science & Technology
Group Members ID Number
1. Abel Hailemariam QF7953
2. Alazar Kebede FX4743
3. Anobie Tesfaye LD1791
4. Dagim Ashenafi XI9378
5. Enat Desta UZ9773
6. Kalkidan Abebe VJ9284
Information Research Method: - Research Proposal

Table of Contents
Background/Overview
Problem Statement
  Research Questions
Objective of the Research
  a) General Objective
  b) Specific Objectives
Approach/Methodology
  General Approach
  Study Population
  Data Collection Methods
  Data Analysis
  Design/Experiment Methods
  Procedures/Tools and Techniques
Literature Review
Scope and limitations of the research
  Scope
  Limitations
Significance of the research
References
Annex

Background/Overview
Natural Language Processing (NLP) is a subfield of artificial intelligence that aims
to enable computers to understand, analyze, and generate human language. It has attracted
significant attention and advanced considerably in recent years, particularly for major
languages with extensive research and resources available. However, non-English
languages, particularly those with a smaller digital presence and limited resources, often face
considerable challenges when it comes to NLP. Ethiopian languages, with their rich linguistic
diversity and unique characteristics, present a compelling case for investigating the challenges
and possibilities in the realm of NLP.
The general area that this research proposal focuses on is the exploration of NLP in Ethiopian
languages. These languages play a vital role in Ethiopia's culture, society, and communication,
yet their inclusion within the realm of NLP has been relatively limited. By addressing the
challenges specific to Ethiopian languages, we aim to contribute to the broader field of NLP and
foster linguistic diversity and inclusion.
Key concepts include:
❖ Morphological Complexity: Ethiopian languages are known for their intricate
morphology, involving complex word formations and extensive morphological processes.
This presents challenges in developing effective morphological analyzers, segmenters,
and stemmers, which are essential components of NLP systems.
❖ Limited Linguistic Resources: Ethiopian languages have relatively fewer linguistic
resources available compared to more widely spoken languages. These include corpora,
lexicons, annotated data, and language models. The scarcity of such resources poses
difficulties in training and evaluating NLP models, hindering the progress of language-
specific applications.
❖ Orthographic Variation: Ethiopian languages exhibit diverse orthographic conventions,
with variations in script usage, character encoding, and writing systems. These variations
impact text normalization, tokenization, and other preprocessing tasks crucial for NLP
applications, requiring robust and adaptable techniques to handle them effectively.
❖ Named Entity Recognition (NER): NER is a fundamental task in NLP, and its accurate
implementation in Ethiopian languages is an ongoing challenge. The lack of labeled
datasets, ambiguous semantics, and the absence of standardized conventions hinder the
development of robust NER models for these languages.
❖ Machine Translation and Language Generation: Enabling machine translation and
language generation capabilities in Ethiopian languages would greatly facilitate
communication, knowledge sharing, and information access. However, the scarcity of
parallel corpora, translation models, and language models poses substantial obstacles in
developing effective systems.
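As a concrete illustration of the orthographic variation described above, Amharic written in the Ethiopic script has several sets of characters that are pronounced alike and are often used interchangeably (for example ሀ, ሐ and ኀ). The following is a minimal sketch of one possible normalization step, not a method taken from the cited literature. It relies on the layout of the Unicode Ethiopic block, in which each syllable series starts on a multiple of eight code points. The function name `normalize_homophones` and the choice of which variant counts as canonical are assumptions made for illustration.

```python
# Row starts (first-order code points) in the Unicode Ethiopic block.
# Which variant is treated as canonical is an assumption of this sketch.
HOMOPHONE_ROWS = {
    0x1210: 0x1200,  # ሐ series -> ሀ series
    0x1280: 0x1200,  # ኀ series -> ሀ series
    0x1220: 0x1230,  # ሠ series -> ሰ series
    0x12D0: 0x12A0,  # ዐ series -> አ series
    0x1340: 0x1338,  # ፀ series -> ጸ series
}


def normalize_homophones(text: str) -> str:
    """Fold homophonic Ethiopic characters onto one canonical series.

    Only the seven basic vowel orders (offsets 0-6 within a row) are
    remapped; every other character is returned unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        row, order = cp - cp % 8, cp % 8
        if row in HOMOPHONE_ROWS and order < 7:
            cp = HOMOPHONE_ROWS[row] + order
        out.append(chr(cp))
    return "".join(out)
```

Whether such folding is appropriate depends on the task, since it discards spelling distinctions that some writers treat as meaningful.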
In conclusion, this research proposal aims to dive into the challenges faced in applying NLP
techniques to Ethiopian languages. By addressing the complexities of morphology, limited
linguistic resources, orthographic variations, named entity recognition, machine translation, and
language generation, we seek to provide insights and solutions that can empower the NLP
community to tackle these challenges effectively. Through this exploration, we strive to promote
the inclusion and advancement of Ethiopian languages in the broader field of NLP.
Problem Statement
The problem at hand is the limited progress and utilization of Ethiopian languages in NLP. The
field of Natural Language Processing (NLP) has made significant strides in processing and
understanding various languages. However, research and development in NLP have primarily
focused on widely spoken languages, leaving Ethiopian languages largely understudied and
neglected. This lack of attention creates a significant problem as it hampers the development of
robust NLP applications for Ethiopian languages, hindering communication, access to
information, and technological advancements within the Ethiopian context.
The problem is twofold:
Firstly, there is a scarcity of resources necessary for NLP in Ethiopian languages. This scarcity
includes linguistic corpora, lexicons, annotated data, and language models, which are crucial for
training and evaluating NLP systems. Without adequate resources, researchers and developers
face significant challenges in building effective and accurate NLP models for these languages
(Alemu et al., 2019). Additionally, the limited availability of parallel corpora and translation
models inhibits progress in machine translation and language generation tasks tailored to
Ethiopian languages (Tamirat and van Zaanen, 2017).
Secondly, Ethiopian languages exhibit complex morphological structures, orthographic
variations, and unconventional script usage, which complicate the application of existing NLP
techniques (Worku et al., 2021). For instance, the intricate morphology of Ethiopian languages
poses challenges in developing reliable morphological analyzers, segmenters, and stemmers
(Beyene et al., 2013). Furthermore, orthographic variations in script usage and character
encoding call for adaptable preprocessing techniques to handle these
complexities effectively (Abebe et al., 2020).
Research Questions
1. What are the specific challenges of NLP in under-resourced Ethiopian languages, such as
Amharic, Afaan Oromoo and Tigrinya?
2. How do the lexical and morphological complexities of Ethiopian languages impact NLP tasks,
such as part-of-speech tagging, named entity recognition, and machine translation?
3. What are the implications of the lack of annotated data for developing robust NLP models in
Ethiopian languages?
4. How can the dialectal and regional variations of Ethiopian languages be addressed to establish
a standardized form for NLP applications?

5. What are the potential solutions offered by cross-lingual transfer learning techniques to
overcome the scarcity of resources in Ethiopian languages and improve NLP capabilities?
Objective of the Research
a) General Objective
To address the challenges of NLP in Ethiopian languages and contribute towards the
development of robust NLP models and resources, enabling effective communication,
information retrieval, and technological advancements within the Ethiopian context.
b) Specific Objectives
• To investigate and identify the specific challenges faced in developing NLP applications
for Ethiopian languages due to limited linguistic resources, such as corpora, lexicons,
annotated data, and language models.
• To propose and develop innovative techniques and methodologies for handling the
complex morphological structures, orthographic variations, and unconventional script
usage in Ethiopian languages, thereby enhancing the accuracy and performance of NLP
models.
• To explore and evaluate strategies for addressing the scarcity of parallel corpora and
translation models, with a focus on developing machine translation and language
generation systems tailored to Ethiopian languages.
Approach/Methodology
General Approach
In this research proposal, a qualitative approach will be employed to explore the challenges of
Natural Language Processing (NLP) in Ethiopian languages. The specific method chosen for this
study is the Case Study method.
Study Population
The study population will consist of native speakers of Ethiopian languages and experts in the
field of NLP. Native speakers will provide valuable insights into the unique linguistic
characteristics, cultural nuances, and challenges of Ethiopian languages. NLP experts will
provide technical expertise and guidance in identifying and addressing the challenges faced in
developing NLP applications for these languages.
Data Collection Methods
1. Literature Review: A comprehensive review of existing literature on NLP in Ethiopian
languages will be conducted, analyzing previous studies, research papers, and relevant resources
to gain an understanding of the current state and challenges in this field.
2. Interviews: Semi-structured interviews will be conducted with native speakers of Ethiopian
languages who possess expertise in linguistics, computational linguistics, or NLP. These
interviews will provide insights into the challenges, needs, and aspirations for NLP in Ethiopian
languages.

The sample size will be determined based on achieving data saturation, where new information
or perspectives no longer emerge from additional participants. This saturation point will be used
to limit the sample and ensure thorough coverage of the research topic while optimizing
available resources. The specific saturation point, such as the number of participants,
will be determined during the research process based on the evolving nature of the data.
3. Multilingual NLP Systems: Existing multilingual NLP systems, such as language models or
named entity recognition tools, will be utilized to analyze the performance and limitations when
processing Ethiopian languages. The outputs of these systems will be evaluated, and any errors
or difficulties encountered will be documented.
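One way to document such difficulties is to measure how a system's tokenizer handles Ethiopian-language text. The sketch below is illustrative only. It assumes the tokenizer output is already available as lists of token strings, and that tokens outside the vocabulary are marked `[UNK]`, as in BERT-style WordPiece vocabularies. The function names `unknown_rate` and `fertility` are hypothetical.

```python
from typing import Sequence

UNK = "[UNK]"  # unknown-token marker used by BERT-style WordPiece vocabularies


def unknown_rate(tokenized: Sequence[Sequence[str]], unk: str = UNK) -> float:
    """Fraction of all produced tokens that are the unknown marker."""
    total = sum(len(tokens) for tokens in tokenized)
    unknown = sum(tok == unk for tokens in tokenized for tok in tokens)
    return unknown / total if total else 0.0


def fertility(sentences: Sequence[str], tokenized: Sequence[Sequence[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = sum(len(sentence.split()) for sentence in sentences)
    pieces = sum(len(tokens) for tokens in tokenized)
    return pieces / words if words else 0.0
```

A high unknown rate suggests the script or vocabulary is poorly covered, while high fertility suggests words are being split into many small pieces. Both would be recorded alongside the qualitative error notes.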
Data Analysis
Thematic analysis will be used to analyze the qualitative data collected from interviews and
observations. The data will be transcribed, coded, and categorized into themes and patterns.
These themes will provide insights into the challenges and potential solutions for NLP in
Ethiopian languages.
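The coding step, together with the saturation criterion used to set the sample size, can be sketched as follows. This is a minimal illustration under stated assumptions: each interview is represented by the set of codes assigned to its transcript, and saturation is declared once a chosen number of consecutive interviews (`patience`, a hypothetical parameter) adds no new code. The actual threshold would be decided during the research.

```python
from typing import Iterable, Optional, Set


def saturation_point(codes_per_interview: Iterable[Set[str]],
                     patience: int = 2) -> Optional[int]:
    """Return the 1-based interview index at which saturation is reached.

    Saturation here means `patience` consecutive interviews contributed
    no code that had not already been seen. Returns None if never reached.
    """
    seen: Set[str] = set()
    quiet = 0  # consecutive interviews without a new code
    for index, codes in enumerate(codes_per_interview, start=1):
        new_codes = set(codes) - seen
        seen |= new_codes
        quiet = 0 if new_codes else quiet + 1
        if quiet >= patience:
            return index
    return None
```

For example, with made-up code sets such as `[{"data scarcity", "morphology"}, {"morphology", "script"}, {"script"}, {"data scarcity"}]` and `patience=2`, the third and fourth interviews add nothing new, so the function reports saturation at the fourth interview.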
Design/Experiment Methods
The research design will involve one or more case studies focusing on specific Ethiopian
languages. The case studies will include the development and evaluation of prototypes and NLP
systems for targeted language(s), incorporating the identified challenges and potential solutions.
Performance metrics such as accuracy, precision, recall, and linguistic coverage will be
used to evaluate the effectiveness of these systems.
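For reference, accuracy, precision, and recall can be computed from paired gold and predicted labels as in the sketch below. It assumes a token-level labeling task (such as part-of-speech tagging or named entity recognition) and reports precision and recall for a single label of interest. The function name `evaluate` and the example labels are hypothetical. Linguistic coverage is not defined in this proposal and is therefore left out.

```python
from typing import Dict, Sequence


def evaluate(gold: Sequence[str], pred: Sequence[str], positive: str) -> Dict[str, float]:
    """Compute accuracy overall, and precision and recall for one label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    correct = sum(g == p for g, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Reporting precision and recall per label, rather than accuracy alone, helps when one label (for example the "outside" tag in NER) dominates the data.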
Procedures/Tools and Techniques
1. Purposive Sampling: Participants for interviews will be selected through purposive sampling,
ensuring a diverse range of expertise and perspectives among native speakers of Ethiopian
languages.
2. Transcription and Translation: Interviews will be audio-recorded, transcribed, and translated
from the local language to English for analysis and interpretation purposes.
3. Qualitative Data Analysis Software: Specialized software, such as Atlas.ti, will be employed to
facilitate the coding, organization, and analysis of qualitative data.
4. Ethical Considerations: All necessary ethical approvals and consent procedures will be
followed to ensure the privacy and confidentiality of the participants. Informed consent will be
obtained from all participants, and their identities will be anonymized in the research findings.
Literature Review

Scope and limitations of the research
Scope
• Language Focus: The research will focus on three specific Ethiopian languages, namely
Amharic, Afaan Oromoo, and Tigrinya. These languages are widely spoken in Ethiopia
and represent a diverse linguistic landscape. By focusing on multiple languages, the
research aims to capture a broader spectrum of challenges and explore language-specific
nuances in NLP.
• Multilingual NLP System: The research will employ a multilingual NLP system,
specifically BERT (Bidirectional Encoder Representations from Transformers). BERT is
a pre-trained language model whose multilingual variant covers many languages, although
its coverage may be limited or absent for some of the languages
chosen for this study. By utilizing BERT, the research aims to evaluate its effectiveness
and adaptability in addressing the challenges of NLP in Ethiopian languages.
Limitations
This research has certain limitations. Primarily, the
chosen languages may not fully represent the linguistic diversity of Ethiopia, as there are
numerous other languages within the country, so the findings may not be applicable
to all Ethiopian languages. Their generalizability may also be restricted
to the specific research context. Furthermore, the
research is constrained by time, resources, and the scope of coverage, which may limit the
comprehensiveness of the study. The use of BERT as a multilingual NLP system carries its own
limitations, as its effectiveness may vary across different languages. Lastly, qualitative research
is subjective in nature and can be influenced by the researcher's interpretation and biases, despite
efforts to ensure objectivity and rigor.
Significance of the research
The significance of research on natural language processing (NLP) for Ethiopian languages lies
in its potential to address challenges related to digital inclusion and language preservation.
Ethiopia's linguistic diversity, with over 80 distinct languages, necessitates the development of
NLP technologies to meet the needs of the local population. By overcoming barriers to digital
access, these technologies can empower individuals and bridge the digital divide, benefiting
sectors like education, healthcare, governance, and commerce.
Ethiopia is a linguistically diverse country, yet the development of NLP technologies has
primarily focused on widely spoken languages, leaving speakers of Ethiopian languages with
limited access to digital resources in their mother tongues. This research aims to fill this gap by
exploring innovative approaches and techniques to develop NLP technologies specifically for
Ethiopian languages. Understanding the unique linguistic characteristics of these languages
allows for the design of algorithms, models, and tools tailored to their analysis and processing.
This, in turn, enables the creation of applications such as machine translation, sentiment analysis,
speech recognition, and information retrieval in Ethiopian languages.
The significance of this research extends to several areas. Firstly, it promotes digital inclusion by
ensuring individuals who primarily communicate in Ethiopian languages can fully participate in
the digital era. Access to technology and digital content in one's native language empowers
individuals to engage in online activities, access information, and communicate effectively,
leading to enhanced social and economic opportunities.
Secondly, the research contributes to language preservation and revitalization. By developing
NLP technologies for Ethiopian languages, it aids in the documentation and archiving of these
languages. Digital preservation of language resources safeguards Ethiopia's linguistic heritage
and provides valuable resources for future generations to study, analyze, and revive endangered
or under-represented languages.

References
• Demeke, G., & Getachew, M. (2006). Manual annotation of Amharic news items with
part-of-speech tags and its challenges. Ethiopian Languages Research Center Working
Papers, 2(1), 16.
• Abebe, A., van Zaanen, M., & Bosch, A. (2020). The Role of Script in Under-resourced NLP:
Empirical and Computational Approaches for Ethiopic Script. In Proceedings of the 1st
International Workshop on Solutions for Automatic Gleaning of Multilingual Endangered Texts
(SAGE) (pp. 1-10).
• Alemu, H. H., Abate, S. S., & Worku, A. G. (2019). An initiative on Ethiopian languages
localization: A case study approach. In Proceedings of the 4th ACM Workshop on African
Network Information Center (pp. 10-16).
• Beyene, A., Abebe, A., & Bosch, A. (2013). Challenges in Computational Analysis of Amharic
Language Texts: Allomorph in Amharic Verb Inflection. In Proceedings of the 2013 AFNLP
Conference (pp. 23-30).
• Diab, M., Hacioglu, K., & Jurafsky, D. (2004). Automatic tagging of Arabic text: From
raw text to base phrase chunks. In Proceedings of NAACL-HLT (pp. 149-152).
• Gambäck, B., Olsson, F., Argaw, A. A., & Asker, L. (2009). Methods for Amharic part-of-
speech tagging. In Proceedings of AfLaT.
• Gasser, M. (2009). HornMorpho: A system for morphological processing of Amharic,
Oromo, and Tigrinya. In Proceedings of the 14th Meeting of Computational Linguistics
in Africa (AfLaT).
• Gasser, M. (2011). Computational morphology and the teaching of Semitic languages.
Proceedings of the Second Workshop on Speech and Language Processing for Assistive
Technologies (pp. 126–131).
• Getachew, M. (2001). Automatic part of speech tagging for Amharic: An experiment
using stochastic hidden Markov approach (master’s thesis).
• Habash, N. & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and
morphological disambiguation in one fell swoop. In Proceedings of ACL (pp. 573-580).
• LFormat, K. D., Wang, L., & Wale, A. (2019). BiLSTM-CRF for Amharic part-of-speech
tagging. Computing and Communications Workshop and Conference (CCWC), 2019
IEEE 9th Annual (pp. 660-663).
• Mansur, N., Abraham, B., & Yaregal, A. (2009). Amharic verb lexicon in the context of
machine translation. In Proceedings of the International Conference on Machine Learning
and Cybernetics (pp. 1664–1671).
• Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In
Proceedings of the Empirical Methods in Natural Language Processing (EMNLP).
• Tachbelie, M. Y., & Menzel, W. (2009). Amharic part-of-speech tagger for factored
language model. In Proceedings of the International Conference on Machine Learning
and Cybernetics (pp. 1711–1716).
• Yimam, S. M. (2007). Amharic grammar. Addis Ababa, Ethiopia: Yimam Publishers.
• Yimam, S. M. (2010). Automatic processing of Amharic: Tokenization, POS tagging, IR
and MT (master’s thesis).

❖ To be customized for actual usage.
Interview Questions:
1. Can you please introduce yourself and your background in Amharic, Afaan Oromoo, and Tigrinya
language processing or related fields?
2. In your experience, what are the key challenges faced when working with large language
models in Amharic, Afaan Oromoo, and Tigrinya language processing?
3. How do you perceive the current performance of existing large language models in processing
texts in Ethiopian languages (Amharic, Afaan Oromoo, and Tigrinya)? Are there any specific
limitations or areas where they struggle?
4. What are the potential implications or consequences of these challenges in various domains,
such as natural language understanding, machine translation, or sentiment analysis in Amharic,
Afaan Oromoo, and Tigrinya?
5. In your opinion, what specific linguistic or cultural characteristics of Ethiopian languages
(Amharic, Afaan Oromoo, and Tigrinya) make them challenging for large language models to
accurately process and understand?
6. Have you come across any notable instances where large language models have produced
incorrect or inappropriate results when processing Amharic, Afaan Oromoo, and Tigrinya text? If
yes, could you provide some examples?
7. Based on your expertise, what improvements or advancements do you think are necessary to
enhance the performance of large language models in Amharic, Afaan Oromoo, and Tigrinya
language processing?
8. Are there any specific strategies or methodologies that you would recommend for addressing the
challenges faced by current language models in Amharic, Afaan Oromoo, and Tigrinya
processing?
