Segmentation in Sanskrit texts
Unsegmented:
देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा ।
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति ॥
Segmented:
देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा
तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति
राम
रामेभ्यः
राममय
wi, ti
PMI Matrix of the un-segmentable token lemmas
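As a rough illustration of how such a PMI matrix could be computed, the sketch below derives pointwise mutual information from sentence-level co-occurrence counts. The toy lemma lists and the function name `pmi_matrix` are assumptions made for illustration; they are not the deck's data or code.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_matrix(sentences):
    """Return {(a, b): PMI} for lemma pairs that co-occur in a sentence."""
    n = len(sentences)
    single = Counter()  # number of sentences containing each lemma
    pair = Counter()    # number of sentences containing each lemma pair
    for sent in sentences:
        lemmas = sorted(set(sent))
        single.update(lemmas)
        pair.update(combinations(lemmas, 2))
    pmi = {}
    for (a, b), c in pair.items():
        # PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), estimated over sentences
        value = math.log((c / n) / ((single[a] / n) * (single[b] / n)))
        pmi[(a, b)] = pmi[(b, a)] = value
    return pmi

# Toy, hypothetical lemma lists (one list per sentence).
toy = [["rAma", "vana", "gam"], ["rAma", "gam"], ["sItA", "vana"], ["sItA", "gam"]]
scores = pmi_matrix(toy)
```

Pairs that never co-occur are simply absent from the result; a real system would need a policy (for example smoothing) for those cells.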
P(w1, w2, w3, w4) = P(w1 | <s>) P(w2 | w1) P(w3 | w2) P(w4 | w3) P(</s> | w4)
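The bigram factorisation above can be sketched in a few lines. The probability table `toy_probs` is made up for illustration, and the function does no smoothing, so an unseen bigram would raise a `KeyError`; both are assumptions of this sketch rather than details from the slides.

```python
import math

def sentence_log_prob(words, bigram_prob):
    """Sum log P(w_i | w_{i-1}) over the sentence padded with <s> and </s>."""
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log(bigram_prob[(prev, cur)])
        for prev, cur in zip(padded, padded[1:])
    )

# Hypothetical bigram probabilities, keyed as (previous word, current word).
toy_probs = {
    ("<s>", "w1"): 0.5,
    ("w1", "w2"): 0.4,
    ("w2", "w3"): 0.5,
    ("w3", "w4"): 0.25,
    ("w4", "</s>"): 0.8,
}
log_p = sentence_log_prob(["w1", "w2", "w3", "w4"], toy_probs)
p = math.exp(log_p)  # product of the five conditional probabilities
```

Working in log space avoids underflow when sentences get long.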
Set (size in sentences)    Micro accuracy    Macro accuracy
Training set (1700)        87.76 %           92.56 %
Testing set (150)          87.82 %           93.56 %
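The slides do not define the two measures. One common reading, assumed here, is that micro accuracy pools all words across sentences while macro accuracy averages the per-sentence accuracies. The counts below are invented purely to show how the two can differ.

```python
def micro_macro_accuracy(per_sentence):
    """per_sentence: list of (correct_words, total_words), one pair per sentence."""
    total_correct = sum(c for c, _ in per_sentence)
    total_words = sum(t for _, t in per_sentence)
    micro = total_correct / total_words  # pooled over all words
    macro = sum(c / t for c, t in per_sentence) / len(per_sentence)  # mean per sentence
    return micro, macro

# Hypothetical counts for three sentences.
micro, macro = micro_macro_accuracy([(8, 10), (2, 2), (3, 4)])
```

Under this reading, short sentences that are segmented perfectly pull the macro figure above the micro figure.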
• Treat segmentation as a query expansion problem.
• Start with the unsegmented tokens.
• At each step, a new candidate word is selected and added to the query.
• Query expansion iterates until a complete sentence is output.
Sentence as chunks: S = c1 + c2 + c3 + c4
Candidate words for a chunk: w1, w2, ..., wk, ...
For each chunk cj, Cj = set of words wi that are candidates for a semantically correct segmentation of cj.
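The query expansion loop can be sketched as a greedy procedure. This is a simplification under stated assumptions: each chunk contributes exactly one word, the scoring function `toy_score` and the `compat` table are hypothetical, and the transliterated words are toy data rather than output of the deck's system.

```python
def expand_query(query, candidates, score):
    """Greedily pick one candidate per chunk until every chunk is resolved.

    query      -- list of words already in the query
    candidates -- dict: chunk id -> list of candidate words for that chunk
    score      -- hypothetical function (query, word) -> float
    """
    query = list(query)
    pending = dict(candidates)
    chosen = {}
    while pending:
        # Pick the single best (chunk, word) pair given the current query.
        chunk, word = max(
            ((c, w) for c, words in pending.items() for w in words),
            key=lambda cw: score(query, cw[1]),
        )
        chosen[chunk] = word
        query.append(word)  # the query grows by one word per step
        del pending[chunk]
    return chosen

# Made-up pairwise compatibility scores.
compat = {
    ("na", "tatra"): 0.9, ("na", "tatas"): 0.2,
    ("na", "dhIraH"): 0.5, ("na", "dhI"): 0.1,
    ("tatra", "dhIraH"): 0.8, ("tatra", "dhI"): 0.3,
}

def toy_score(query, word):
    return sum(compat.get((q, word), 0.0) for q in query)

result = expand_query(
    ["na"], {"c1": ["dhIraH", "dhI"], "c2": ["tatra", "tatas"]}, toy_score
)
```

Because each selected word joins the query, later choices are conditioned on earlier ones, which is the point of framing segmentation as query expansion.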
• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights accommodate heterogeneous information.
• Learn a weight for each random walk path with supervised methods.
• The weighted sum over all the random walks gives the most suitable candidate.
• Note: we use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.
Language Model (LM) with word lemmas
LM with morphological types
Verb specific Expectancy
Compound word formation patterns
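A personalised random walk from the query nodes can be sketched as a random walk with restart. The graph, the restart probability of 0.15, and the function name `personalised_walk` are all assumptions for illustration; the slides do not give these details.

```python
def personalised_walk(graph, query_nodes, restart=0.15, iterations=50):
    """Random walk with restart to the query nodes on a weighted directed graph.

    graph -- dict: node -> dict of neighbour -> non-negative edge weight
             (every neighbour must also appear as a key of `graph`)
    """
    nodes = list(graph)
    restart_dist = {
        n: (1.0 / len(query_nodes) if n in query_nodes else 0.0) for n in nodes
    }
    rank = dict(restart_dist)
    for _ in range(iterations):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for src, edges in graph.items():
            total = sum(edges.values())
            if total == 0:
                # Dangling node: send its mass back to the query nodes.
                for n in nodes:
                    nxt[n] += (1 - restart) * rank[src] * restart_dist[n]
                continue
            for dst, w in edges.items():
                nxt[dst] += (1 - restart) * rank[src] * w / total
        rank = nxt
    return rank

# Toy graph: query node "q" with two candidate nodes "a" and "b".
graph = {"q": {"a": 3.0, "b": 1.0}, "a": {"q": 1.0}, "b": {"q": 1.0}}
rank = personalised_walk(graph, {"q"})
best = max(("a", "b"), key=rank.get)
```

Candidates that are more strongly connected to the query nodes accumulate more probability mass, so they rank higher.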
PCRW (Path Constrained Random Walks) - Unifying Framework
Language Model with words - LMw
LM with morphological types - LMt
Verb specific Expectancy - ViE
Compound word formation patterns
• Handles free word order.
• Incorporates heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
  • LMw -> LMt -> LMw
  • LMt -> V1E -> LMt
  • LMt -> VkE -> LMt
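Path enumeration and the weighted combination of path-constrained walks can be sketched as follows. Only two edge types are used, and the graphs and the path weights are invented for illustration; in the deck the weights are learned by supervised methods, which this sketch does not attempt.

```python
from itertools import product

def enumerate_paths(edge_types, max_len):
    """All sequences of edge types of length 1..max_len."""
    return [p for n in range(1, max_len + 1) for p in product(edge_types, repeat=n)]

def path_walk(graphs, path, start):
    """Distribution over nodes after following one edge of each type in `path`.

    graphs -- dict: edge type -> {node -> {neighbour -> weight}}
    """
    dist = {start: 1.0}
    for edge_type in path:
        nxt = {}
        for node, mass in dist.items():
            edges = graphs[edge_type].get(node, {})
            total = sum(edges.values())
            for dst, w in edges.items():
                # Move mass along edges of this type, proportionally to weight.
                nxt[dst] = nxt.get(dst, 0.0) + mass * w / total
        dist = nxt
    return dist

def combined_score(graphs, path_weights, start, candidate):
    """Weighted sum of path-constrained walk probabilities for one candidate."""
    return sum(
        theta * path_walk(graphs, path, start).get(candidate, 0.0)
        for path, theta in path_weights.items()
    )

# Toy graphs for two edge types; "q" is a query node, "x" and "y" candidates.
graphs = {
    "LMw": {"q": {"x": 1.0, "y": 1.0}, "x": {"y": 1.0}},
    "LMt": {"q": {"x": 3.0, "y": 1.0}, "x": {"y": 1.0}},
}
paths = enumerate_paths(["LMw", "LMt"], 3)
path_weights = {("LMw",): 0.5, ("LMt",): 0.5}  # hypothetical, not learned
score_x = combined_score(graphs, path_weights, "q", "x")
```

With two edge types and l = 3 there are 2 + 4 + 8 = 14 candidate paths, which shows how quickly the path set grows as edge types are added.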