Combining Knowledge and CRF-based Approach to Named Entity Recognition in Russian
Mozharova V. A.,
Loukachevitch N. V.
Lomonosov Moscow State University
Named entity recognition task
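A named entity is a word or phrase that denotes a specific object or event and distinguishes it from other similar objects.
1. Президент [Владимир Путин] PER 17 декабря провел традиционную пресс-конференцию перед Новым Годом. (President [Vladimir Putin] PER held his traditional pre-New Year press conference on December 17.)
2. Студенты и Татьяны получат эксклюзивный пропуск на Главный каток страны. (Students and women named Tatyana will receive an exclusive pass to the country's Main Skating Rink.)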
Methods:
1. Machine learning
2. Rule-based approach
3. Combination
2
Related works
• English
  • Many works
  • Evaluations (MUC, CoNLL, ACE …)
• Russian
  • Machine learning (CRF)
    • (Antonova, Soloviev, 2013; Podobryaev, 2013; Gareev, 2013)
    • Most works use their own collections
  • Rule-based
    • (Trofimov, 2014)
    • Open collection “Persons-1000”
3
Outline
Our approach: CRF-based named entity recognition
Features:
• Token-based
• Lexicon-based
• Context-based
Labeling representation
• IO-scheme
• BIO-scheme
Experiments on open collections
• Persons-1000
• Persons-1111F (Eastern names)
4
CRF-based machine learning
A conditional random field (CRF) is a probabilistic model for labeling sequential data.
• CRF++ (open source implementation)
Preprocessing
• Morphological analyzer (POS-tagging, lemmatization,
gender and grammatical case tagging)
5
Scheme of text processing
Text → Feature extraction (token-based, lexicon-based, context-based) → CRF → Name extraction
6
Token features
Most traditional features
1. Token initial form (lemma)
2. Number of symbols in a token
3. Letter case: BigBig, BigSmall, SmallSmall, Fence
4. Token type
• part of speech
• type of punctuation
5. The presence of a vowel (a binary feature)
6. Whether a token contains a known letter n-gram from a predefined set:
• Кузнецов, Матвиенко, Джугашвили
• Госдепартамент, Газпром
7
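The token features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact definitions of the letter-case classes, the extra `NoLetters` class, the vowel list, and the n-gram set `KNOWN_NGRAMS` are hypothetical and chosen to fit the slide's examples. The lemma and token type are assumed to come from the morphological analyzer.

```python
VOWELS = set("аеёиоуыэюяaeiouy")

# Hypothetical n-gram set, picked only to match the example words on the slide.
KNOWN_NGRAMS = ("ов", "енко", "швили", "гос", "пром")


def letter_case(token: str) -> str:
    """Classify letter case; these class definitions are assumptions."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "NoLetters"  # digits or punctuation (extra class, not on the slide)
    if all(ch.islower() for ch in letters):
        return "SmallSmall"
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "BigBig"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "BigSmall"
    return "Fence"  # any other mix of upper and lower case


def token_features(token: str, lemma: str, token_type: str) -> dict:
    """Build the token-level feature dictionary for one token."""
    lowered = token.lower()
    return {
        "lemma": lemma,
        "length": len(token),
        "case": letter_case(token),
        "type": token_type,
        "has_vowel": any(ch in VOWELS for ch in lowered),
        "known_ngram": any(ng in lowered for ng in KNOWN_NGRAMS),
    }


features = token_features("Газпром", "ГАЗПРОМ", "Noun")
```

Here `features["case"]` is `"BigSmall"` and `features["known_ngram"]` is true because the token contains the n-gram "пром".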
Features based on lexicons
We used vocabularies that store lists of useful expressions (words or phrases)
Sources:
• Phonebook
• Wikipedia
• Thesaurus (РуТез)
A single feature for each lexicon
Example:
«Набережные[geo2] Челны[geo2]»
8
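A lexicon feature of this kind can be sketched as a longest-match lookup over the token sequence. The sketch assumes that the number in a value such as `geo2` is the length in tokens of the matched lexicon entry, which is an interpretation of the example and not something the slides state. The function name `lexicon_feature`, the toy lexicon `GEO`, and matching on lowercased lemmas are all hypothetical choices.

```python
def lexicon_feature(lemmas, lexicon, name):
    """Tag each token with name + matched phrase length, or "False" if unmatched.

    lemmas  : list of token lemmas for one sentence
    lexicon : set of tuples of lowercased lemmas (one tuple per entry)
    name    : feature prefix, e.g. "geo"
    """
    lowered = [lemma.lower() for lemma in lemmas]
    max_len = max((len(entry) for entry in lexicon), default=0)
    tags = ["False"] * len(lemmas)
    i = 0
    while i < len(lemmas):
        matched = 0
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                tags[j] = f"{name}{matched}"
            i += matched
        else:
            i += 1
    return tags


# Toy lexicon for illustration only.
GEO = {("набережные", "челны"), ("россия",)}

tags = lexicon_feature(["Набережные", "Челны"], GEO, "geo")
```

With this toy lexicon both tokens of «Набережные Челны» receive the value `geo2`, and a single-token match such as «Россия» receives `geo1`.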
Lexicons
Vocabulary | Size, objects | Clarification | Examples
Famous persons | 31482 | Famous people | Владимир Путин
First names | 2773 | First names | Василий, Анна, Том
Surnames | 66108 | Surnames | Кузнецов, Грибоедов
Verbs of informing | 1729 | Verbs that usually occur with persons | высказать, признаться
Companies | 33380 | Organization names | Сбербанк
Company types | 6774 | Organization types | организация, авиафирма
Geography | 8969 | Geographical objects | Балтийское море
Equipment | 44094 | Devices, equipment, tools | устройство, телефон
9
Context features and example
Token | Lemma | Register | Token Type | Second Name | Geo | Label
В | В | Small | Auxiliary | False | False | NO
России | РОССИЯ | BigSmall | Noun | False | Geo1 | GEOPOLIT
Алиев | АЛИЕВ | BigSmall | Noun | Sname1 | False | PER
третий | ТРЕТИЙ | Small | Numeral | False | False | NO
раз | РАЗ | Small | Auxiliary | False | False | NO
10
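A table like this maps naturally onto the column format that CRF++ reads: one token per line, whitespace-separated feature columns, the label in the last column, and an empty line between sentences. The sketch below writes the example rows in that format and generates a feature template in CRF++'s `U..:%x[row,col]` syntax, which is how context features (values of neighboring tokens) are usually declared. The window of ±2 tokens, the function names, and the use of every column in the template are assumptions, not the authors' actual configuration.

```python
ROWS = [
    # token, lemma, register, token type, second name, geo, label
    ("В", "В", "Small", "Auxiliary", "False", "False", "NO"),
    ("России", "РОССИЯ", "BigSmall", "Noun", "False", "Geo1", "GEOPOLIT"),
    ("Алиев", "АЛИЕВ", "BigSmall", "Noun", "Sname1", "False", "PER"),
    ("третий", "ТРЕТИЙ", "Small", "Numeral", "False", "False", "NO"),
    ("раз", "РАЗ", "Small", "Auxiliary", "False", "False", "NO"),
]


def to_crfpp(sentences):
    """Serialize sentences: tab-separated columns, blank line between sentences."""
    blocks = ["\n".join("\t".join(row) for row in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def make_template(n_feature_columns, window=2):
    """Generate unigram templates over a +/- window for every feature column.

    The window size is an assumption made for illustration.
    """
    lines = []
    idx = 0
    for col in range(n_feature_columns):
        for offset in range(-window, window + 1):
            lines.append(f"U{idx:02d}:%x[{offset},{col}]")
            idx += 1
    lines.append("B")  # bigram template over adjacent output labels
    return "\n".join(lines) + "\n"


train_text = to_crfpp([ROWS])
template_text = make_template(n_feature_columns=6)
```

Written to files, these two strings are the kind of training data and template that the CRF++ `crf_learn` tool takes as input.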
Expert labeling. Brat annotation tool
11
Labeling representation
IO-scheme (Inside-Outside)
• I - belongs to named entity
• O - does not belong to named entity
|C| + 1 classes
BIO-scheme (Begin-Inside-Outside)
• B - named entity beginning
• I - named entity continuation
• O - not named entity
2*|C| + 1 classes
Token | IO-labels | BIO-labels
Владимир | I-PER | B-PER
Путин | I-PER | I-PER
посетил | OUTSIDE | OUTSIDE
Англию | I-GEOPOLIT | B-GEOPOLIT
12
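The two schemes can be compared with a minimal encoder that turns annotated entity spans into per-token labels. The span representation `(start, end_exclusive, type)` and the function names are assumptions made for this sketch; the `OUTSIDE` label follows the example table.

```python
def encode(tokens, entities, scheme="BIO"):
    """Convert entity spans into per-token labels.

    entities : list of (start, end_exclusive, entity_type) token spans
    scheme   : "IO" or "BIO"
    """
    labels = ["OUTSIDE"] * len(tokens)
    for start, end, etype in entities:
        for i in range(start, end):
            # Only BIO marks the first token of an entity differently.
            prefix = "B" if scheme == "BIO" and i == start else "I"
            labels[i] = f"{prefix}-{etype}"
    return labels


def n_classes(n_types, scheme="BIO"):
    """Number of label classes: |C| + 1 for IO, 2*|C| + 1 for BIO."""
    return (2 if scheme == "BIO" else 1) * n_types + 1


tokens = ["Владимир", "Путин", "посетил", "Англию"]
entities = [(0, 2, "PER"), (3, 4, "GEOPOLIT")]
io_labels = encode(tokens, entities, scheme="IO")
bio_labels = encode(tokens, entities, scheme="BIO")
```

For the five entity types used in the experiments (PER, ORG, MEDIA, LOC, GEOPOLIT), `n_classes` gives 6 classes under IO and 11 under BIO.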
IO-labeling: aggregation of tokens into named entities
[Figure: examples of aggregating adjacent I-PER tokens into Person entities, including a case with the repeated token Петр]
13
IO-labeling: aggregation of tokens into named entities
[Figure: aggregation examples for I-ORG tokens (Organization) and for a token X1 that occurs with both I-PER and OUTSIDE labels (Person)]
14
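The simplest way to aggregate IO labels into entities is to merge each maximal run of adjacent tokens that share the same I-label. The sketch below implements only that baseline rule; it is not the authors' rule set, and the function name `aggregate_io` is hypothetical.

```python
def aggregate_io(tokens, labels):
    """Merge maximal runs of identical I-<type> labels into (type, text) entities."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        etype = label[2:] if label.startswith("I-") else None
        # A change of type (or leaving an entity) closes the current run.
        if etype != current_type and current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
            current_tokens = []
        current_type = etype
        if etype is not None:
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


found = aggregate_io(
    ["Владимир", "Путин", "посетил", "Англию"],
    ["I-PER", "I-PER", "OUTSIDE", "I-GEOPOLIT"],
)
```

A known limitation of this baseline is that two different entities of the same type standing next to each other are merged into one, which is the ambiguity that the B- label of the BIO-scheme removes.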
Target metric
intersectionCount is the number of named entities labeled by both the classifier and the expert;
classifierCount is the number of named entities labeled by only the classifier;
expertCount is the number of named entities labeled by only the expert.
15
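Assuming the target metric is the usual F-score built from these counts (an assumption, since the formula itself is not shown here), it can be computed as below. The sketch treats `classifier_count` and `expert_count` as the total numbers of entities produced by the classifier and by the expert. If the counts are meant to be exclusive, add `intersection_count` to each of them before calling the function.

```python
def f_score(intersection_count, classifier_count, expert_count):
    """F1 from entity counts, assuming the last two arguments are totals.

    precision = intersection / classifier total
    recall    = intersection / expert total
    """
    if intersection_count == 0:
        return 0.0
    precision = intersection_count / classifier_count
    recall = intersection_count / expert_count
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
score = f_score(intersection_count=8, classifier_count=10, expert_count=16)
```

In this made-up example precision is 0.8 and recall is 0.5, so the F-score is 8/13, or about 61.5%.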
Text collections
• "Persons-1000" (1000 news documents)
  • Russian names: Александр Игнатенко, Алексей Волков
• "Persons-1111F" (1111 news documents)
  • Eastern names: Абдалла Халаф, Иттё Ито
We additionally labeled:
• Organizations (ORG)
• Media organizations, whose specific function is providing information (MEDIA)
• Locations (LOC)
• States and capitals in the role of a state (GEOPOLIT)
16
Experiments on Collection “Persons-1000”
F-score, %
NE Type | IO | IO + rules | BIO
PER | 94.95 | 95.09 | 96.08
ORG | 80.03 | 80.23 | 83.84
LOC | 92.60 | 92.60 | 94.57
Average | 89.54 | 89.67 | 91.71
F-score, %
NE Type | IO | IO + rules | BIO
PER | 94.95 | 95.01 | 95.63
ORG | 75.90 | 76.16 | 80.06
MEDIA | 87.95 | 87.95 | 87.99
LOC | 84.53 | 84.53 | 86.91
GEOPOLIT | 94.65 | 94.65 | 94.50
Average | 88.21 | 88.37 | 89.93
Cross-validation 3:1
17
Experiments on collection with Eastern names (Persons-1111F)
Person name extraction
“Persons-1000”: cross-validation 3:1
“Persons-1111F”: training on “Persons-1000”
F-score, %
Collection | Rule-based (Trofimov, 2014) | Our system
Persons-1000 | 96.62 | 96.08
Persons-1111F | 64.43 | 81.68
18
Conclusion
• We presented a system for Russian named entity recognition that combines a knowledge-based approach with a CRF classifier
• We tested our system on two open text collections, “Persons-1000” and “Persons-1111F”, and compared our results with a rule-based system
• We compared two labeling schemes for Russian texts: the IO-scheme and the BIO-scheme
19
