Thamizhi-Language Processing Tools
Kengatharaiyer Sarveswaran (Sarves)
sarves@cse.mrt.ac.lk
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka.
December 12, 2020
Sarves, NLPC, University of Moratuwa Thamizhi-LPTs December 12, 2020 1 / 10
Overview
Thamizhi-Preprocessor
ThamizhiPOSt: Tamil POS Tagger
ThamizhiMorph: Tamil Morphological Analyser/Generator
ThamizhiUDp: Tamil Universal Dependency Parser
ThamizhiLFG: Computational Grammar for Tamil using LFG
What we need
Acknowledgement
Thamizhi-Preprocessor
Validate words using Nanool grammar
Normalise Unicode points
க ,ெ ,ா, க, ் ,க ,ு -> க , ொ, க ,், க ,ு
Home page:
http://nlp-tools.uom.lk/thamizhi-preprocessor/
How to use:
Download the script from the site, then run:
python3 thamizhi-preprocessor.py -validate word-to-be-validated
python3 thamizhi-preprocessor.py -normalise file-to-be-normalised
ThamizhiPOSt: Tamil POS Tagger
Harmonised BIS[1] - Amrita[2] - UPOS[3] tagsets
Used Universal POS Tagset
Trained the POS tagger using Stanza
Trained using Amrita data (mapped to UPOS)
F1 score: 93.27 (Nov 2020)
Trained models and POS tagged data are available for download
Home page:
http://nlp-tools.uom.lk/thamizhi-pos/
How to use:
python3 thamizhi-post.py "input-file"
[1] tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf
[2] www.amrita.edu/publication/tamil-pos-tagging-using-linear-programming
[3] universaldependencies.org/u/pos/
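Mapping one tagset onto UPOS comes down to a lookup table applied to each tagged token. The sketch below uses a handful of BIS tags for illustration. The table `BIS_TO_UPOS`, the function `map_tags`, and the fallback to `X` are assumptions made for this example; the harmonised BIS-Amrita-UPOS mapping actually used for ThamizhiPOSt may differ.

```python
# Illustrative mapping from a few BIS tags to UPOS tags.
# This table is an assumption for demonstration; it is not the
# harmonised mapping shipped with ThamizhiPOSt.
BIS_TO_UPOS = {
    "N_NN": "NOUN",
    "N_NNP": "PROPN",
    "PR_PRP": "PRON",
    "V_VM": "VERB",
    "V_VAUX": "AUX",
    "JJ": "ADJ",
    "RB": "ADV",
    "PSP": "ADP",
}


def map_tags(tagged, mapping=BIS_TO_UPOS, fallback="X"):
    """Map (word, tag) pairs to (word, UPOS) pairs; unknown tags become `fallback`."""
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged]


if __name__ == "__main__":
    sentence = [("அவன்", "PR_PRP"), ("வந்தான்", "V_VM")]
    for word, upos in map_tags(sentence):
        print(word, upos)
```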
ThamizhiMorph: Morphological Analyser/Generator
Rule-based (Finite-State Transducer) implementation
Implemented using foma[4]
Handles Verbs, Nouns, and other particles
Generates all analyses
Can be used for morph segmentation
வந்தான் வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)
All the models, data and scripts are available
Home page:
http://nlp-tools.uom.lk/thamizhi-morph/
How to use:
python3 thamizhi-morph.py "input-file"
[4] fomafst.github.io/
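An analysis string like the one shown on this slide can be split into a lemma, its tags, and the surface segment attached to each tag, which is what makes it usable for morph segmentation. The sketch below infers the format (`lemma|+tag|+tag=surface`) from that single example only. `Morph` and `parse_analysis` are hypothetical names, and the real ThamizhiMorph output may contain cases this does not handle.

```python
from typing import List, NamedTuple, Optional, Tuple


class Morph(NamedTuple):
    tag: str                # e.g. "past"
    surface: Optional[str]  # surface segment after "=", or None if absent


def parse_analysis(analysis: str) -> Tuple[str, List[Morph]]:
    """Split 'lemma|+tag|+tag=surface|...' into a lemma and a list of Morph items.

    The field layout is an assumption based on one sample analysis.
    """
    lemma, *fields = analysis.split("|")
    morphs = []
    for field in fields:
        tag, sep, surface = field.partition("=")
        morphs.append(Morph(tag.lstrip("+"), surface if sep else None))
    return lemma, morphs


if __name__ == "__main__":
    lemma, morphs = parse_analysis("வா|+verb|+fin|+sim|+strong|+past=(ந்)த்|+3sgm=ஆன்)")
    print(lemma)
    for m in morphs:
        print(m.tag, m.surface)
```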
ThamizhiUDp: Universal Dependency Parser 1/2
Hybrid approach
Multilingual Learning (with Hindi/Turkish/Telugu) for Parsing
Labelled Attachment Score (LAS): 62.39
All the data, models and scripts are available
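For readers unfamiliar with the metric: the labelled attachment score is the percentage of tokens whose predicted head and dependency label both match the gold annotation, while the unlabelled score checks the head only. A minimal sketch of that arithmetic follows. The function `attachment_scores` is hypothetical and is not the evaluation script behind the figure reported above; standard evaluation scripts also handle details such as tokenisation mismatches that are ignored here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for parallel lists of (head, deprel) per token."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
    pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
    uas, las = attachment_scores(gold, pred)
    print(uas, las)  # 75.0 50.0
```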
Step                   Tool           Dataset
Tokenisation           Stanza         Tamil UDT
Multi-word tokeniser   Stanza         Tamil UDT
Lemmatisation          Stanza         Tamil UDT
POS tagging            ThamizhiPOSt   Amrita Data
Morphological tagging  ThamizhiMorph  Rule-based
Dependency parsing     uuparser       UDT Hindi/Tamil
Home page:
http://nlp-tools.uom.lk/thamizhi-udp/
How to use:
./parse.sh "input-file"
Note: Input file should be in CoNLL-U format.
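CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), uses an underscore for unspecified values, and ends each sentence with a blank line. The sketch below writes a pre-tokenised sentence in that shape. The helper `to_conllu` is hypothetical, and whether parse.sh expects the annotation columns to be empty or already filled is an assumption not confirmed by the slides.

```python
def to_conllu(tokens, text=None):
    """Render a list of word forms as one CoNLL-U sentence with unfilled annotation columns."""
    lines = []
    if text is not None:
        lines.append(f"# text = {text}")  # optional sentence-level comment
    for idx, form in enumerate(tokens, start=1):
        # Columns: ID, FORM, then LEMMA..MISC left as "_" (unspecified).
        lines.append("\t".join([str(idx), form] + ["_"] * 8))
    return "\n".join(lines) + "\n\n"  # a blank line terminates the sentence


if __name__ == "__main__":
    sentence = to_conllu(["அவன்", "வந்தான்", "."], text="அவன் வந்தான்.")
    # "input.conllu" is a placeholder file name for this example.
    with open("input.conllu", "w", encoding="utf-8") as handle:
        handle.write(sentence)
```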
ThamizhiUDp: Universal Dependency Parser 2/2
Modern Written Tamil Treebank (MWTT):
https://github.com/UniversalDependencies/UD_Tamil-MWTT/tree/master
Joint work with Dr. K. Parameswari
ThamizhiLFG: Computational Grammar for Tamil
An initial version covering 160 sentences (ParGram[5] + Grade-1 Tamil textbook) is available
Covers simple intransitive, transitive, and ditransitive sentences, and conjunctions
Vocabulary is limited; ThamizhiMorph will be integrated
Hosted on the INESS site
How to use: https://clarino.uib.no/iness/xle-web
[5] https://pargram.w.uib.no/
What we need:
People with linguistic knowledge to review tools/annotated data
Benchmark datasets for evaluation
Acknowledgement
Supervisors:
Prof. Gihan Dias, University of Moratuwa
Prof. Miriam Butt, University of Konstanz
Collaborators:
Dr. K. Parameswari, University of Hyderabad
Ms. S. Rajamathangi, Jawaharlal Nehru University
Scholars who have provided valuable inputs:
Prof. S. Rajendren, Prof. S. Ramesh, colleagues at NLPC
Most of this work was supported by the Accelerating Higher
Education Expansion and Development (AHEAD) Operation of the
Ministry of Higher Education, Sri Lanka, funded by the World Bank, and
by the DAAD (German Academic Exchange Service).
