SALES RELAUNCH F&Q SESSION
Multi-lingual data processing
The CIS and Georgia
Olga Rink, director general
Content
Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
• Business environment
• Main stages of processing multi-lingual business data
o Naming convention
o Transliteration
o Matching
• Seeding and verifying objects in media coverage
Official languages, population (mn) and Russian as a second language (est.)
Multi-lingual environment
Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population (est.)
--- | --- | --- | --- | --- | ---
Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | † 100%
Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%
Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%
Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | † 100%
Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%
Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | † 100%
Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | † 100%
Moldova | Romanian | 3.6 | Latin | Russian is widely used | † 90%
Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%
Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%
Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | † 100%
Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%

* The Constitution of Dagestan defines "Russian and the languages of the peoples of Dagestan" as the state languages
† The bulk of newly registered businesses is available in Cyrillic or Latin
• For Slavic languages we use the ISO 9:1995 standard with one exception: a combination of Latin characters is used instead of a Latin character with a diacritic. Example: Ч is rendered as Ch (no diacritic) instead of Č (with diacritic)
• ISO 9985 is used for Armenian
• ISO 9984 is used for Georgian
• ООО «Ъ» (Trade style: OOO TVERDY ZNAK; the transliterated name is OOO “”, so it cannot be found by the original name)
• Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered when records are updated
• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the charter is used as the primary name, and the legal form indicator (required by law to be part of the name) is moved to the end after a comma
• The second name is the transliterated full legal name
• The trade style contains the official name in English/Latin or trade marks
• We use rule-based and machine learning approaches in areas including collecting data, identifying objects, developing credit scorings and digesting media coverage
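The transliteration and naming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production convention: the slide confirms only the Ч to Ch substitution, so every other digraph in the table, the dropping of Ъ and Ь, the list of legal forms and the function names `transliterate` and `normalize_name` are hypothetical.

```python
import re

# Cyrillic-to-Latin table in the spirit of ISO 9:1995, using letter
# combinations instead of diacritics. Only "Ч" -> "Ch" comes from the slide;
# the remaining digraphs and the empty mapping for "Ъ"/"Ь" are assumptions.
TRANSLIT = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Е": "E", "Ё": "Yo",
    "Ж": "Zh", "З": "Z", "И": "I", "Й": "J", "К": "K", "Л": "L", "М": "M",
    "Н": "N", "О": "O", "П": "P", "Р": "R", "С": "S", "Т": "T", "У": "U",
    "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Ch", "Ш": "Sh", "Щ": "Shch",
    "Ъ": "", "Ы": "Y", "Ь": "", "Э": "E", "Ю": "Yu", "Я": "Ya",
}

# Hypothetical, incomplete list of legal-form prefixes.
LEGAL_FORMS = ("ООО", "ЗАО", "ОАО", "ПАО", "АО")


def transliterate(text: str) -> str:
    """Replace Cyrillic letters one by one; other characters pass through."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.upper())
        if latin is None:
            out.append(ch)  # digits, Latin letters, punctuation
        else:
            out.append(latin if ch.isupper() else latin.lower())
    return "".join(out)


def normalize_name(brief_name: str) -> str:
    """Build the primary name: transliterated name, then ', LEGAL FORM'."""
    name = re.sub(r"[«»\"“”]", " ", brief_name)  # drop quotation marks
    name = " ".join(name.split())               # collapse whitespace
    legal_form = ""
    for form in LEGAL_FORMS:
        if name == form or name.startswith(form + " "):
            legal_form, name = form, name[len(form):].strip()
            break
    latin = transliterate(name).upper()
    if legal_form:
        return f"{latin}, {transliterate(legal_form)}"
    return latin


print(transliterate("Чехов"))
print(normalize_name("ООО «3Дньюс»"))
```

Under this table a name made up only of Ъ transliterates to an empty string, which is the search problem the ООО «Ъ» bullet describes and the reason a trade style is recorded alongside it.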
Natural Language Processing and Machine Learning
The SCAN engine is leveraging vast amounts of text data to enable the next generation of Interfax data products
Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train,
and deploy credit and reputation risk models with minimal effort
• Tagging documents
• Classifying documents by text type (media release, forecast, feature, etc.)
Detecting and Disambiguating Named Entities
A Support Vector Machine (SVM) or a Bayes classifier is used, depending on configuration
• An SVM represents a text as a vector and compares it with a pattern (prototype); the closeness defines the type
• The Bayes rule applies when probabilities are calculated from predetermined assumptions (a range of known “symptoms”)
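As a rough illustration of the Bayes option for classifying by text type, here is a minimal multinomial naive Bayes classifier in plain Python. It is a sketch, not the SCAN engine: the class name, the four toy training sentences and the whitespace tokenizer are all invented for the example, and the words of a text play the role of the known “symptoms”.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior + sum of log likelihoods of the observed words
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
texts = [
    "company announces launch of new product",
    "company announces appointment of new director",
    "analysts expect growth next year",
    "bank expects inflation to slow next quarter",
]
labels = ["media-release", "media-release", "forecast", "forecast"]

clf = NaiveBayesTextClassifier().fit(texts, labels)
print(clf.predict("analysts expect inflation next year"))
```

A real configuration would train on a labelled corpus and use proper tokenization for each language; the point here is only how prior class frequencies and per-class word frequencies combine into a score.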
Rule-based fact extraction and sentiment analysis
At the initial phase, for seeding named persons:
• Mostly a rule-based approach
• Context analysis and statistics for entity disambiguation
Refinement of Named Entity Detection by learning from a semi-automatically labelled corpus:
• Support Vector Machine (SVM)
• A neural network built on the existing rule-based structure is being considered for the future
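The rule-based seeding and context-based disambiguation steps might look roughly like the sketch below. Everything in it is an assumption made for illustration: the title cue words, the regular expression, the candidate identifiers and their keyword profiles are hypothetical and are not taken from SCAN.

```python
import re
from typing import Dict, List, Optional, Set

# Rule: a job-title cue followed by two or more capitalized words is treated
# as a person name. The cue list is a hypothetical example.
PERSON_RULE = re.compile(
    r"\b(?:CEO|Director|Chairman|President)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"
)


def seed_persons(text: str) -> List[str]:
    """Return person-name candidates found by the title rule."""
    return PERSON_RULE.findall(text)


def disambiguate(context: str, candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the candidate whose keyword profile overlaps the context most."""
    words = set(re.findall(r"\w+", context.lower()))
    scores = {cid: len(words & profile) for cid, profile in candidates.items()}
    best = max(scores, key=scores.get, default=None)
    # No overlap at all means the context gives no evidence for any candidate.
    return best if best is not None and scores[best] > 0 else None


text = ("Director Ivan Petrov said the bank would expand its credit portfolio. "
        "Chairman Anna Sidorova agreed.")
print(seed_persons(text))

# Two hypothetical people who share the same name.
candidates = {
    "petrov-banker": {"bank", "loan", "credit"},
    "petrov-athlete": {"football", "club", "match"},
}
print(disambiguate(text, candidates))
```

Counting shared context words is the simplest form of the "context analysis and statistics" idea; a labelled corpus built from such rule output is what a later SVM or neural model could learn from.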
An intellectual WOW effect, or what only SCAN can do: moving on to “verifying” media coverage
• Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn have been identified with Spark
• 2 mn persons were generated (seeded); 75 thousand of them have been verified
• 300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
• 13 thousand trade marks (“Trade style”)
• 24 thousand sources in Russian
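To put the counts above in proportion, the shares can be worked out directly. The snippet uses only the rounded figures quoted on the slide, so the percentages are approximate; the variable and function names are illustrative.

```python
# Rounded figures as quoted on the slide.
companies_generated = 3_000_000
companies_verified = 22_000
companies_in_spark = 500_000
persons_generated = 2_000_000
persons_verified = 75_000


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


print(f"Companies verified:        {share(companies_verified, companies_generated):.2f}%")
print(f"Companies matched in Spark: {share(companies_in_spark, companies_generated):.2f}%")
print(f"Persons verified:          {share(persons_verified, persons_generated):.2f}%")
```

That is roughly 0.7% of generated companies verified, about 16.7% identified with Spark, and 3.75% of seeded persons verified.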
Thank you
Interfax – Dun & Bradstreet
www.dnb.ru


Processing multi-lingual business data
