Segmentation in Sanskrit texts

देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा ।
तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति
देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा
तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

राम
रामेभ्यः
राममय
Figure: PMI matrix of the unsegmentable token lemmas (labels: w_i, t_i)
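A PMI matrix of this kind can be sketched as follows. This is only an illustration: the slides do not say how co-occurrence is counted, so sentence-level co-occurrence, the function name `pmi_matrix`, and the toy lemmas are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_matrix(sentences):
    """PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as the fraction of sentences containing the lemma(s).
    Assumption: two lemmas co-occur if they appear in the same sentence."""
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for sent in sentences:
        lemmas = sorted(set(sent))          # count each lemma once per sentence
        single.update(lemmas)
        pair.update(combinations(lemmas, 2))
    pmi = {}
    for (a, b), c in pair.items():
        value = math.log2((c / n) / ((single[a] / n) * (single[b] / n)))
        pmi[(a, b)] = pmi[(b, a)] = value   # the matrix is symmetric
    return pmi

# Toy corpus of lemma lists (hypothetical data, not from the slides).
toy = [["rama", "gam"], ["rama", "vana", "gam"], ["sita", "vana"], ["sita", "pas"]]
pmi = pmi_matrix(toy)
```

Pairs that never co-occur are simply absent from the dictionary; a real system would need to decide how to treat them.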
P(w1, w2, w3, w4) = P(w1 | <s>) P(w2 | w1) P(w3 | w2) P(w4 | w3) P(</s> | w4)
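The bigram factorisation above can be sketched in a few lines. This is a minimal maximum-likelihood version with no smoothing; the function names and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts, padding with <s> and </s>."""
    context = Counter()
    bigram = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram

def sentence_prob(sent, context, bigram):
    """P(w1..wn) = P(w1 | <s>) * ... * P(</s> | wn), unsmoothed."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context[prev] == 0:
            return 0.0                      # unseen context
        prob *= bigram[(prev, cur)] / context[prev]
    return prob

# Toy corpus (hypothetical): two sentences over the words a, b, c.
corpus = [["a", "b"], ["a", "c"]]
context, bigram = train_bigram(corpus)
p = sentence_prob(["a", "b"], context, bigram)
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, which is why practical language models add smoothing or back-off.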

Set (size in sentences)   Micro accuracy   Macro accuracy
Training set (1700)       87.76 %          92.56 %
Testing set (150)         87.82 %          93.56 %
• Treat the problem as a query expansion problem.
• Start with the unsegmented tokens as the initial query.
• At each step, a new candidate word is selected and added to the query.
• The query expansion iterates until a complete sentence is output.
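The steps above can be sketched as a greedy loop. This is a simplification under stated assumptions: one candidate word is chosen per chunk, and `score` is a hypothetical stand-in for whatever ranks a candidate against the current query (here a toy co-occurrence table).

```python
def expand_query(seed_words, candidates, score):
    """Greedy query expansion.
    seed_words: tokens that need no segmentation (the initial query).
    candidates: dict mapping chunk id -> list of candidate words.
    score(query, word): hypothetical scoring function, higher is better."""
    query = list(seed_words)
    chosen = {}
    remaining = set(candidates)
    while remaining:
        # Pick the best (chunk, word) pair given everything in the query so far.
        chunk, word = max(
            ((c, w) for c in sorted(remaining) for w in candidates[c]),
            key=lambda cw: score(query, cw[1]),
        )
        chosen[chunk] = word
        query.append(word)                  # the query grows by one word per step
        remaining.remove(chunk)
    return chosen

# Toy co-occurrence strengths (hypothetical values).
cooc = {("rama", "gacchati"): 3.0, ("rama", "vanam"): 1.0,
        ("gacchati", "vanam"): 2.0, ("rama", "vana"): 0.5}

def score(query, word):
    return sum(cooc.get((q, word), 0.0) + cooc.get((word, q), 0.0) for q in query)

result = expand_query(
    ["rama"],
    {"c1": ["vanam", "vana"], "c2": ["gacchati", "gacchat"]},
    score,
)
```

In this toy run the loop first resolves chunk c2 (the strongest link to the seed word), and that choice then helps resolve c1.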
Figure: a sentence S split into chunks, S = c1 + c2 + c3 + c4.
Candidate words for a chunk: w1, w2, ..., wk, ..., wl.
C2 = set of wi that are candidates for a semantically correct segmentation.
Similarly for c2 and c3.

• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights: accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum of all the random walk methods gives the most suitable candidate.
• PS: We use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.
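The weighted combination of random walks described above might look like the following sketch. The graph, the edge weights, and the path weights are toy values chosen for illustration; in the approach described here the path weights are learned by supervised methods, which this sketch does not do.

```python
def normalize(rows):
    """Turn raw edge weights {src: {dst: weight}} into transition probabilities."""
    out = {}
    for src, nbrs in rows.items():
        total = sum(nbrs.values())
        out[src] = {dst: w / total for dst, w in nbrs.items()} if total else {}
    return out

def walk(start, path, transitions):
    """Distribution over nodes after following the edge types in `path`."""
    dist = dict(start)
    for edge_type in path:
        nxt = {}
        for node, p in dist.items():
            for dst, t in transitions[edge_type].get(node, {}).items():
                nxt[dst] = nxt.get(dst, 0.0) + p * t
        dist = nxt
    return dist

def rank(start, paths, weights, transitions):
    """Score each node as the weighted sum of its path-specific walk probabilities."""
    scores = {}
    for path, w in zip(paths, weights):
        for node, p in walk(start, path, transitions).items():
            scores[node] = scores.get(node, 0.0) + w * p
    return scores

# Toy graph: query node "q", candidate nodes "a" and "b" (hypothetical weights).
edges = {"LMw": {"q": {"a": 3, "b": 1}}, "LMt": {"q": {"a": 1, "b": 1}}}
transitions = {name: normalize(rows) for name, rows in edges.items()}
scores = rank({"q": 1.0}, [("LMw",), ("LMt",)], [0.6, 0.4], transitions)
best = max(scores, key=scores.get)
```

Each edge type gets its own transition table, which is how heterogeneous information sources can share one graph.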
Language Model (LM) with word lemmas
LM with morphological types
Verb specific Expectancy
Compound word formation patterns

Language Model with words - LMw
LM with morphological types - LMt
Verb specific Expectancy – ViE
Compound word formation patterns
PCRW (Path Constrained Random Walks) - Unifying Framework
• Handle Free Word Order
• Incorporate heterogeneous types of information
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
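Enumerating such paths is a small combinatorial exercise, sketched below. The single label "ViE" is a placeholder for the verb-specific edge types V1E ... VkE, and the function name is an assumption.

```python
from itertools import product

def relational_paths(edge_types, max_len):
    """All sequences of edge types with length 1..max_len."""
    paths = []
    for length in range(1, max_len + 1):
        paths.extend(product(edge_types, repeat=length))
    return paths

# "ViE" stands in for the verb-specific edge types V1E ... VkE.
edge_types = ["LMw", "LMt", "ViE"]
paths = relational_paths(edge_types, 3)
```

With three edge types and l = 3 this gives 3 + 9 + 27 = 39 paths; the count grows quickly once every verb contributes its own edge type, so in practice the path set would likely need pruning.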
