Implementing a Corpus for
Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. De Silva
OUTLINE
• Introduction
• Resource Gathering
• Data Storage
• User Interface Design and Implementation
• Application Programming Interface
• Limitations
INTRODUCTION
What is a Language Corpus?
▪ A collection of authentic texts of the
language
▪ Stored electronically
▪ Contains actual usage patterns of the
language
▪ Covers a wide context of the language
▪ Can be used to discover information about
language that may not have been noticed
through intuition alone
ARCHITECTURE OF THE
CORPUS
IDENTIFIED SINHALA RESOURCES
▪ News: News Paper, News Items
▪ Academic: Text books, Religious, Wikipedia, Mahawansa
▪ Creative Writing: Fiction, Blogs, Magazine
▪ Spoken: Subtitle
▪ Gazette: Gazette
COMPOSITION OF CORPUS
SINHALA SOURCES
DATA REPRESENTATION
▪ Raw data is stored separately as XML-formatted
files and converted into the required formats
according to each information need
▪ Primary Metadata
▪ Article Name
▪ Author
▪ URL
▪ Date (Year, Month, Day)
▪ Genre
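A raw record carrying the primary metadata above could look roughly like the following. This is only a sketch: the element names (`article`, `name`, `author`, `url`, `date`, `genre`, `content`) and the sample values are assumptions made for illustration, not the actual Sinmin XML schema.

```python
import xml.etree.ElementTree as ET


def build_article(name, author, url, year, month, day, genre, text):
    """Build one XML record; all element names here are hypothetical."""
    article = ET.Element("article")
    ET.SubElement(article, "name").text = name
    ET.SubElement(article, "author").text = author
    ET.SubElement(article, "url").text = url
    date = ET.SubElement(article, "date")
    ET.SubElement(date, "year").text = str(year)
    ET.SubElement(date, "month").text = str(month)
    ET.SubElement(date, "day").text = str(day)
    ET.SubElement(article, "genre").text = genre
    ET.SubElement(article, "content").text = text
    return article


def read_metadata(article):
    """Extract the primary metadata fields from a parsed record."""
    return {
        "name": article.findtext("name"),
        "author": article.findtext("author"),
        "url": article.findtext("url"),
        "date": (
            int(article.findtext("date/year")),
            int(article.findtext("date/month")),
            int(article.findtext("date/day")),
        ),
        "genre": article.findtext("genre"),
    }


# Made-up sample record, serialised and parsed back again.
record = build_article(
    "Sample article", "Unknown", "http://example.com/a1",
    2014, 5, 12, "News", "මෙය උදාහරණයකි",
)
xml_text = ET.tostring(record, encoding="unicode")
meta = read_metadata(ET.fromstring(xml_text))
print(meta)
```

Keeping the raw XML separate from the query store means the same records can be converted again if the target format changes.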
DATA STORAGE SYSTEM
Considerations
▪ Relational Databases
▪ Oracle DB
▪ H2 Database
▪ Indexed File Systems
▪ Apache Solr
▪ Column Store Databases
▪ Cassandra
▪ Graph Databases
▪ Neo4j
EVALUATION CRITERIA
We compared the candidate systems on the time
taken to insert data and to answer 12 different
information needs.
Data Insertion time comparison
Information Retrieval Performance Comparison - Part 1
Information Retrieval Performance Comparison - Part 2
• Cassandra performed better than the
others in most scenarios, and its
insertion time increased linearly.
• We therefore chose it to implement
the corpus.
• Cassandra uses query-based
data modeling.
• The corpus uses a separate column
family for each information need.
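The idea of one column family per information need can be illustrated without a database. The sketch below emulates two such column families with plain Python dictionaries. The two information needs chosen (word frequency by year and by genre), the names, and the sample words are assumptions for illustration; this is not the Sinmin schema and it does not use Cassandra itself.

```python
from collections import defaultdict

# One "column family" per information need, emulated with dictionaries.
# Need A: how often does a word occur in a given year?   key = (word, year)
word_year_frequency = defaultdict(int)
# Need B: how often does a word occur in a given genre?  key = (word, genre)
word_genre_frequency = defaultdict(int)


def insert_article(words, year, genre):
    """Write every word into each table, since each table serves one query."""
    for word in words:
        word_year_frequency[(word, year)] += 1
        word_genre_frequency[(word, genre)] += 1


def frequency_in_year(word, year):
    """Answer need A with a single key lookup."""
    return word_year_frequency.get((word, year), 0)


def frequency_in_genre(word, genre):
    """Answer need B with a single key lookup."""
    return word_genre_frequency.get((word, genre), 0)


# Made-up sample data.
insert_article(["ගස", "මල", "ගස"], 2014, "News")
insert_article(["ගස"], 2015, "Creative Writing")

print(frequency_in_year("ගස", 2014))
print(frequency_in_genre("ගස", "Creative Writing"))
```

Data is duplicated on write so that each read is a direct lookup. The cost is that a query of a new shape needs another table and another pass over the raw data.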
User Interface Design and
Implementation
● The Sinmin web interface is designed for
users who prefer a visualised, summarised
view of the corpus statistics.
● The visual design lets users with no prior
experience of the interface meet their
information requirements with little effort.
Searching for the frequency of a word
Searching for the most probable words
that come after a word
Comparing two words
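The three queries listed above can be sketched over a toy corpus with unigram and bigram counts. The function names, the whitespace tokenisation, and the transliterated sample sentences are assumptions for illustration; how Sinmin actually computes these statistics is not described here.

```python
from collections import Counter, defaultdict


def build_counts(sentences):
    """Count single words and, for each word, the words that follow it."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenisation
        unigrams.update(tokens)
        for first, second in zip(tokens, tokens[1:]):
            bigrams[first][second] += 1
    return unigrams, bigrams


def most_probable_next(bigrams, word, k=3):
    """Return up to k (next_word, probability) pairs, most likely first."""
    followers = bigrams.get(word, Counter())
    total = sum(followers.values())
    return [(nxt, count / total) for nxt, count in followers.most_common(k)]


def compare_words(unigrams, first, second):
    """Return the raw frequencies of two words side by side."""
    return {first: unigrams[first], second: unigrams[second]}


# Made-up, transliterated sample sentences.
sentences = [
    "mama gedara yanawa",
    "mama gedara innawa",
    "mama pota kiyawanawa",
]
unigrams, bigrams = build_counts(sentences)

print(unigrams["mama"])
print(most_probable_next(bigrams, "mama"))
print(compare_words(unigrams, "gedara", "pota"))
```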
APPLICATION PROGRAMMING
INTERFACE (API)
• REST API to expose corpus services
• More complex and customizable data retrieval
and filtering
• Interface for third-party applications to
consume
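As a rough illustration of how a REST-style resource with filtering might behave, the sketch below maps a request path and query string to a JSON response. The `/wordFrequency` route, the `word` and `year` parameters, the response fields, and the toy data are all hypothetical and are not taken from the Sinmin API. No network server is started; the handler is a plain function.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Toy data keyed by (word, year); stands in for the corpus store.
FREQUENCIES = {("ගස", 2014): 2, ("ගස", 2015): 1, ("මල", 2014): 1}


def handle_get(request_path):
    """Return (status_code, json_body) for a hypothetical GET request."""
    parsed = urlparse(request_path)
    if parsed.path != "/wordFrequency":  # hypothetical route name
        return 404, json.dumps({"error": "unknown resource"})
    params = parse_qs(parsed.query)
    word = params.get("word", [None])[0]
    if word is None:
        return 400, json.dumps({"error": "missing 'word' parameter"})
    # Optional filter: repeat 'year' to restrict the count to those years.
    years = [int(year) for year in params.get("year", [])]
    total = sum(
        count
        for (stored_word, stored_year), count in FREQUENCIES.items()
        if stored_word == word and (not years or stored_year in years)
    )
    body = {"word": word, "years": years, "frequency": total}
    return 200, json.dumps(body, ensure_ascii=False)


# A third-party client would build a request such as this one.
path = "/wordFrequency?" + urlencode({"word": "ගස", "year": 2014})
status, body = handle_get(path)
print(status, body)
```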
LIMITATIONS OF THE CORPUS
● Words are not annotated with Part of
Speech (POS) tags or lemmas.
● If a new information need arises, a new
column family may need to be created for
it and the data has to be inserted again.
QUESTIONS?
THANK YOU!
