Linguistic component: Lemmatizer for the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info
This document describes the technical details of the lemmatizer for the Russian language.
The input text is assumed to have been preprocessed with the Tokenizer component before it is passed to this component (see the corresponding Technical Description).
The demo package, sent upon request, contains the following:
• Java library of the tokenizer in binary form
• run_lemmatizer.sh script for quickly checking the functionality of the module
• messages_to_lemmatize.txt file containing examples of generic text and tweets to be lemmatized with the run_lemmatizer.sh script
The algorithm is based on a combination of the following:
• dictionary search
• an algorithm that calculates the morphological properties of unknown words
• compound word analyzer
• analyzer of numbers
• rule-based analyzer
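One way to picture such a combination is a chain of analyzers tried one after another. The sketch below is only an illustration under that assumption: this document does not state how the components are ordered or combined, and the class name, the toy dictionary and the ending-stripping heuristic are all hypothetical, not part of the library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: try several analyzers in turn until one returns a lemma. */
final class AnalyzerChainSketch {

    // Toy dictionary (assumption); a real one would hold the full lexicon.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("вечер", "вечер");
        DICTIONARY.put("прогулка", "прогулка");
        DICTIONARY.put("набережной", "набережная");
    }

    /** Dictionary search: exact lookup of the lower-cased word form. */
    static Optional<String> dictionaryLookup(String token) {
        return Optional.ofNullable(DICTIONARY.get(token.toLowerCase(Locale.ROOT)));
    }

    /** Analyzer of numbers: a token made only of digits is returned unchanged. */
    static Optional<String> numberAnalyzer(String token) {
        return token.matches("\\d+") ? Optional.of(token) : Optional.empty();
    }

    /** Toy stand-in for the unknown-word step: swaps a known adjective ending for "ый". */
    static Optional<String> unknownWordGuess(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        for (String ending : Arrays.asList("ого", "ому", "ыми", "ая", "ое", "ые", "ым", "ых")) {
            if (lower.length() > ending.length() + 2 && lower.endsWith(ending)) {
                return Optional.of(lower.substring(0, lower.length() - ending.length()) + "ый");
            }
        }
        return Optional.empty();
    }

    // Order of the chain is an assumption made for this sketch.
    private static final List<Function<String, Optional<String>>> CHAIN = Arrays.asList(
            AnalyzerChainSketch::dictionaryLookup,
            AnalyzerChainSketch::numberAnalyzer,
            AnalyzerChainSketch::unknownWordGuess);

    /** Returns the first lemma produced by the chain, or the lower-cased token itself. */
    static String lemmatize(String token) {
        for (Function<String, Optional<String>> analyzer : CHAIN) {
            Optional<String> lemma = analyzer.apply(token);
            if (lemma.isPresent()) {
                return lemma.get();
            }
        }
        return token.toLowerCase(Locale.ROOT);
    }
}
```

With this toy setup, AnalyzerChainSketch.lemmatize("Набережной") is resolved by the dictionary, while a word missing from the dictionary falls through to the later steps.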
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
Throughput: 5037 characters/ms (880 tokens/ms)
Tests were conducted in a single thread.
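For rough capacity planning, the figures above can be turned into per-second rates and processing-time estimates. The sketch below only does this arithmetic; the class and method names are hypothetical, and the numbers apply to the single-threaded test on the hardware listed above, so results on other machines would differ.

```java
/** Hypothetical helper doing back-of-the-envelope arithmetic on the reported throughput. */
final class ThroughputEstimate {

    // Figures reported in this document for a single thread.
    static final double CHARS_PER_MS = 5037.0;
    static final double TOKENS_PER_MS = 880.0;

    /** Estimated seconds needed to process the given number of characters. */
    static double secondsForChars(long chars) {
        return chars / CHARS_PER_MS / 1000.0;
    }

    /** Average token length, in characters, implied by the two reported rates. */
    static double avgCharsPerToken() {
        return CHARS_PER_MS / TOKENS_PER_MS;
    }

    public static void main(String[] args) {
        // 5037 characters/ms corresponds to 5,037,000 characters per second.
        System.out.printf("chars/s: %.0f%n", CHARS_PER_MS * 1000.0);
        System.out.printf("avg chars per token: %.2f%n", avgCharsPerToken());
        System.out.printf("seconds for 1e9 chars: %.1f%n", secondsForChars(1_000_000_000L));
    }
}
```

The two rates imply an average of about 5.7 characters per token in the test data.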
Format of the messages_to_lemmatize.txt file
This file describes input data for the lemmatizer module for demo purposes.
Format:
Text\tText type
Text contains textual data in Russian for lemmatization
\t – tab symbol
Text type: supported values are GENERAL_TEXT and TWITTER.
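A line in this format could be read as in the minimal sketch below. It assumes the text type is the last tab-separated field on the line; the class, enum and field names are hypothetical and are not part of the library's API.

```java
/** Hypothetical reader for one line of messages_to_lemmatize.txt. */
final class InputLineParser {

    /** The two text types named in this document. */
    enum TextType { GENERAL_TEXT, TWITTER }

    /** Simple value holder for a parsed line. */
    static final class Message {
        final String text;
        final TextType type;

        Message(String text, TextType type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Splits the line at the last tab: everything before it is the text,
     * everything after it is the text type.
     */
    static Message parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected <text><TAB><text type>: " + line);
        }
        String text = line.substring(0, tab);
        // valueOf throws IllegalArgumentException for unsupported type names.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new Message(text, type);
    }
}
```

Splitting at the last tab rather than the first keeps any tabs that occur inside the message text itself.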
Examples of lemmatization
The run_lemmatizer.sh script generates the following file: messages_to_lemmatize.out.
For the following input file messages_to_lemmatize.txt:
Прекрасный вечер))) прогулка по Набережной - самое то;) только маккафе подпортило настроение(
TWITTER
This output gets generated:
Прекрасный, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[а, ая, ее, ей, о, ого, ое, ой, ом, ому, ою, ую, ы, ые, ым, ыми, ых],lemma=прекрасный,pos=ADJECTIVE,weight=14317,stem=прекрасн]
вечер, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[а, ам, ами, ах, е, ов, ом, у],lemma=вечер,pos=NOUN,weight=39101,stem=вечер]
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
emopostkn, type: ALPHANUM
прогулка, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=а,endings=[ам, ами, ах, е, и, ой, ою, у],lemma=прогулка,pos=NOUN,weight=3054,stem=прогулк]
по, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=по,pos=PREPOSITION,weight=573564,stem=по]
Набережной, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ая,endings=[ой, ою, ую, ые, ым, ыми, ых],lemma=набережная,pos=NOUN,weight=2908,stem=набережн]
-, type: PUNCT
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=-,pos=NUMERAL,weight=0,stem=-]
самое, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ый,endings=[ая, ого, ое, ой, ом, ому, ою, ую, ые, ым, ыми, ых],lemma=самый,pos=ADJECTIVE,weight=0,stem=сам]
то, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=CONJUNCTION,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=то,pos=ADVERB,weight=0,stem=то]
MorphDesc[removeNum=0,lemmaEnding=от,endings=[а, е, ем, еми, ех, о, ого, ой, ом, ому, ою, у],lemma=тот,pos=PRONOUN_ADJECTIVE,weight=1139844,stem=т]
MorphDesc[removeNum=0,lemmaEnding=о,endings=[е, ем, еми, ех, ого, ом, ому],lemma=то,pos=NOUN,weight=0,stem=т]
emopostkn, type: ALPHANUM
только, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=PARTICLE,weight=0,stem=только]
MorphDesc[removeNum=0,lemmaEnding=,endings=[],lemma=только,pos=ADVERB,weight=0,stem=только]
маккафе, type: ALPHANUM
подпортило, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=ть,endings=[в, вший, вшего, вшему, вшим, вшем, вшая, вшей, вшую, вшею, вшее, вшие, вших, вшими, вши, л, ла, ли, ло],lemma=подпортить,pos=VERB,weight=190,stem=подпорти]
настроение, type: ALPHANUM
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, ю, я],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
MorphDesc[removeNum=0,lemmaEnding=е,endings=[ем, и, й, ю, я, ям, ями, ях],lemma=настроение,pos=NOUN,weight=8416,stem=настроени]
emonegtkn, type: ALPHANUM
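In every example above, the lemma equals the stem followed by lemmaEnding (for instance набережн + ая = набережная), and the surface form Набережной equals the stem plus one of the listed endings. The sketch below parses one MorphDesc line of the .out file, once it is joined onto a single line, and rebuilds candidate word forms on that basis. It is inferred from the printed examples only: the class name and regular expression are hypothetical, and the removeNum and weight fields are parsed but not interpreted, since this document does not define them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for one MorphDesc[...] line as printed in the examples. */
final class MorphDescLine {

    // Field order follows the printed output; the endings list may be empty.
    private static final Pattern FORMAT = Pattern.compile(
            "MorphDesc\\[removeNum=(\\d+),lemmaEnding=([^,]*),endings=\\[([^\\]]*)\\],"
            + "lemma=([^,]*),pos=([A-Z_]+),weight=(\\d+),stem=([^\\]]*)\\]");

    final int removeNum;
    final String lemmaEnding;
    final List<String> endings = new ArrayList<>();
    final String lemma;
    final String pos;
    final long weight;
    final String stem;

    private MorphDescLine(Matcher m) {
        removeNum = Integer.parseInt(m.group(1));
        lemmaEnding = m.group(2);
        if (!m.group(3).isEmpty()) {
            for (String ending : m.group(3).split(",\\s*")) {
                endings.add(ending);
            }
        }
        lemma = m.group(4);
        pos = m.group(5);
        weight = Long.parseLong(m.group(6));
        stem = m.group(7);
    }

    /** Parses a single-line MorphDesc entry; throws if the line does not match. */
    static MorphDescLine parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a MorphDesc line: " + line);
        }
        return new MorphDescLine(m);
    }

    /** Candidate word forms: stem + lemmaEnding first, then stem + each listed ending. */
    List<String> wordForms() {
        List<String> forms = new ArrayList<>();
        forms.add(stem + lemmaEnding);
        for (String ending : endings) {
            forms.add(stem + ending);
        }
        return forms;
    }
}
```

Applied to the entry for Набережной, wordForms() starts with набережная and includes набережной.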
Example of using the library from Java code
// Load the default morphological analyzer.
MorphAnalyzer morphAnalyzer = MorphAnalyzerLoader.loadDefault();
// Print the best analysis for the word form "русского".
System.out.println(morphAnalyzer.analyzeBest("русского"));
Output:
MorphDesc[removeNum=0,lemmaEnding=ий,endings=[ая, ие, им, ими, их, ого, ое, ой, ом, ому, ою, ую],lemma=русский,pos=ADJECTIVE,weight=36739,stem=русск]
