Cognate or False Friend? Ask the Web!
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences
Workshop on Acquisition and Management of Multilingual Lexicons

Introduction
Cognates and false friends:
- Cognates are pairs of words in different languages that sound similar and are translations of each other.
- False friends are pairs of words in two languages that sound similar but differ in their meanings.
The problem: design an algorithm that can distinguish between cognates and false friends.

Cognates and False Friends
Examples of cognates:
- ден in Bulgarian = день in Russian (day)
- idea in English = идея in Bulgarian (idea)
Examples of false friends:
- майка in Bulgarian (mother) ≠ майка in Russian (vest)
- prost in German (cheers) ≠ прост in Bulgarian (stupid)
- gift in German (poison) ≠ gift in English (present)

The Paper in One Slide
Measuring semantic similarity:
- Analyze the words' local contexts, using the Web as a corpus.
- Similar contexts → similar words.
- Context translation → cross-lingual similarity.
Evaluation:
- 200 pairs of words: 100 cognates and 100 false friends.
- 11pt average precision: 95.84%.

Contextual Web Similarity
What is local context?
- A few words before and after the target word.
- The words in the local context of a given word are semantically related to it.
- Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded: they appear in all contexts.
- A sufficiently big corpus is needed.
Example: "Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers." A sketch of this extraction step follows.
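
To make the extraction step concrete, here is a minimal Python sketch. It assumes the snippets have already been fetched (e.g., search-result excerpts); the tiny stop-word list and the window size of 3 are illustrative only, not the exact resources used in the paper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the real lists have hundreds of entries.
STOP_WORDS = {"of", "and", "the", "for", "by", "from", "our"}

def context_counts(snippets, target, window=3):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(w for w in left + right if w not in STOP_WORDS)
    return counts

snippets = ["Same day delivery of fresh flowers, roses, and unique gift "
            "baskets from our online boutique. Flower delivery online by "
            "local florists for birthday flowers."]
print(context_counts(snippets, "flowers"))
# e.g. Counter({'delivery': 1, 'fresh': 1, 'roses': 1, 'unique': 1, ...})
```
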
Contextual Web Similarity
Web as a corpus:
- The Web can be used as a corpus to extract the local context of a given word.
- The Web is the largest possible corpus and contains big corpora in any language.
- Searching for a word in Google returns up to 1,000 text excerpts; the target word comes with its local context: a few words before and after it.
- The target language can be specified.

Contextual Web Similarity
Web as a corpus. Example: Google query for "flower":
- "Flowers, plants, roses, & gifts. Flowers delivery with fewer ... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again."
- "Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable."
- "Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years."

Contextual Web Similarity
Measuring semantic similarity:
- For two given words, their local contexts are extracted from the Web: a set of words and their frequencies.
- Semantic similarity is measured as similarity between these local contexts.
- The local contexts are represented as frequency vectors over a given set of words.
- The cosine between the frequency vectors in Euclidean space is calculated, as sketched below.
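
A minimal sketch of the cosine measure over sparse context-frequency vectors; the two example vectors are made-up fragments of the tables on the next slides.

```python
import math
from collections import Counter

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine between two sparse frequency vectors stored as Counters."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

flower = Counter({"rose": 183, "delivery": 165, "gift": 124, "fresh": 217})
computer = Counter({"technology": 252, "internet": 291, "pc": 286, "order": 185})
print(cosine(flower, computer))  # near zero: the contexts barely overlap
```
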
Contextual Web Similarity
Example of context word frequencies:

  word: flower            word: computer
  word       count        word         count
  rose       183          technology   252
  delivery   165          order        185
  gift       124          new          174
  welcome     98          Web          159
  fresh      217          Internet     291
  order      204          PC           286
  red         87          site         146
  ...        ...          ...          ...

Contextual Web Similarity
Example of frequency vectors (one coordinate per context word, # = 0 ... 5000):

  v1: flower              v2: computer
  word        freq.       word        freq.
  amateur     0           amateur     8
  apple       5           apple       133
  ...         ...         ...         ...
  alias       3           alias       7
  alligator   2           alligator   0
  ...         ...         ...         ...
  zap         0           zap         3
  zoo         6           zoo         0

Similarity = cosine(v1, v2)

Cross-Lingual Similarity
- We are given two words in different languages L1 and L2.
- We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}.
Measuring cross-lingual similarity:
- We extract the local contexts of the target words from the Web: C1 in L1 and C2 in L2.
- We translate the context C1 through the glossary G, obtaining C1*.
- We measure the distance between C1* and C2, as in the sketch below.
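
A sketch of the translation step C1 → C1*, under the assumption that the glossary is a simple word-to-word mapping; the Bulgarian/Russian entries here are hypothetical, for illustration only.

```python
from collections import Counter

def translate_context(c1: Counter, glossary: dict) -> Counter:
    """Project an L1 context vector into L2 through the glossary (C1 -> C1*).

    Context words missing from the glossary are simply dropped.
    """
    c1_star = Counter()
    for word, freq in c1.items():
        if word in glossary:
            c1_star[glossary[word]] += freq
    return c1_star

# Hypothetical Bulgarian -> Russian glossary entries.
glossary = {"цвете": "цветок", "доставка": "доставка"}
c1 = Counter({"цвете": 40, "доставка": 25, "букет": 7})
c1_star = translate_context(c1, glossary)
# similarity = cosine(c1_star, c2), with cosine() as in the sketch above
```
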
Reverse Context Lookup
- Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.: such Internet terms appear on any Web page.
- Such words are not likely to be genuinely associated with the target word.
- Example (for the word flowers): "send flowers online", "flowers here", "order flowers here". Will the word "flowers" appear in the local context of "send", "online" and "here"?

Reverse Context Lookup
- If two words are semantically related, both should appear in the local contexts of each other.
- Let #(x, y) = the number of occurrences of x in the local context of y.
- For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as:
  p(w, wc) = min{ #(w, wc), #(wc, w) }
- We use p(w, wc) as vector coordinates when measuring semantic similarity, as in the sketch below.
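
A sketch of the reverse-lookup weighting. Here `count_in_context` stands for any function returning #(x, y), e.g. one built on the `context_counts()` sketch above; the interfaces are assumptions, not the paper's exact code.

```python
from collections import Counter

def association_vector(w, context_of_w, count_in_context):
    """p(w, wc) = min{ #(w, wc), #(wc, w) } for every wc in w's context.

    context_of_w:     Counter mapping wc -> #(wc, w) (wc's frequency near w)
    count_in_context: callable (x, y) -> #(x, y)     (x's frequency near y)
    """
    vec = Counter()
    for wc, forward in context_of_w.items():   # forward  = #(wc, w)
        backward = count_in_context(w, wc)      # backward = #(w, wc)
        vec[wc] = min(forward, backward)
    return vec
```

Parasite words like "online" score high in one direction only, so the min drives their coordinates toward zero.
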
Web Similarity Using Seed Words
- An adaptation of the Fung & Yee '98 algorithm*.
- We have a bilingual glossary G: L1 → L2 of translation pairs, and target words w1, w2.
- We search in Google for the co-occurrences of the target words with the glossary entries.
- We compare the co-occurrence vectors: for each {p, q} ∈ G, compare max(google#("w1 p"), google#("p w1")) with max(google#("w2 q"), google#("q w2")), as sketched below.

* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414-420, 1998.
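
A sketch of the seed-words variant; `hit_count(phrase)` is a hypothetical wrapper around a search-engine page-count query, passed in as a parameter since no such API is specified here.

```python
def seed_vector(word, seeds, hit_count):
    """Co-occurrence strength of `word` with each seed word, taking the
    stronger of the two word orders, per the formula above."""
    return [max(hit_count(f'"{word} {seed}"'), hit_count(f'"{seed} {word}"'))
            for seed in seeds]

# v1 = seed_vector(w1, [p for p, q in G], hit_count)   # L1 side of glossary G
# v2 = seed_vector(w2, [q for p, q in G], hit_count)   # L2 side
# The two vectors are then compared, e.g. with the cosine measure above.
```
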
Evaluation Data Set
We use 200 Bulgarian/Russian pairs of words: 100 cognates and 100 false friends:
- manually assembled by a linguist;
- manually checked in several large monolingual and bilingual dictionaries;
- limited to nouns only.

Experiments
We tested several modifications of our contextual Web similarity algorithm:
- use of TF.IDF weighting;
- preserving the stop words;
- lemmatization of the context words;
- different context sizes (2, 3, 4 and 5);
- small and large bilingual glossaries.
We compared it with the seed words algorithm, and with traditional orthographic similarity measures: LCSR and MEDR.

Experiments
- BASELINE: random
- MEDR: minimum edit distance ratio
- LCSR: longest common subsequence ratio
- SEED: the "seed words" algorithm
- WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop-word filtering, no lemmatization, no reverse context lookup, no TF.IDF weighting
- NO-STOP: WEB3 without stop-word removal
- WEB1, WEB2, WEB4 and WEB5: WEB3 with context sizes of 1, 2, 4 and 5
- LEMMA: WEB3 with lemmatization
- HUGEDICT: WEB3 with the huge glossary
- REVERSE: the "reverse context lookup" algorithm
- COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup
The two orthographic baselines are sketched below.
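
For reference, sketches of the two orthographic baselines, assuming the usual definitions (similarity normalized by the length of the longer word); the paper's exact normalization may differ.

```python
def medr(s1: str, s2: str) -> float:
    """Minimum edit distance ratio: 1 - MED(s1, s2) / max(|s1|, |s2|)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))  # substitution
    return 1 - d[m][n] / max(m, n)

def lcsr(s1: str, s2: str) -> float:
    """Longest common subsequence ratio: |LCS(s1, s2)| / max(|s1|, |s2|)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n] / max(m, n)
```
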
Resources
We used the following resources:
- a bilingual Bulgarian/Russian glossary: 3,794 translation pairs;
- a huge bilingual glossary: 59,583 word pairs;
- a list of 599 Bulgarian stop words;
- a list of 508 Russian stop words;
- a Bulgarian lemma dictionary: 1,000,000 wordforms and 70,000 lemmata;
- a Russian lemma dictionary: 1,500,000 wordforms and 100,000 lemmata.

Evaluation
- We order the pairs of words from the test dataset by the calculated similarity.
- The false friends are expected to appear at the top and the cognates at the bottom.
- We evaluate the 11pt average precision of the obtained ordering, as sketched below.
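
A sketch of 11-point interpolated average precision over the ranked list, treating false friends as the "relevant" class; the interpolation at recall levels 0.0, 0.1, ..., 1.0 follows the standard IR definition.

```python
def eleven_pt_avg_precision(is_relevant):
    """is_relevant: list of booleans in ranked order (best candidate first)."""
    total = sum(is_relevant)
    precisions, recalls = [], []
    hits = 0
    for i, rel in enumerate(is_relevant, start=1):
        hits += rel
        precisions.append(hits / i)
        recalls.append(hits / total)
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0:
    # the max precision at any point with recall >= the level.
    points = []
    for level in (i / 10 for i in range(11)):
        ps = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(ps) if ps else 0.0)
    return sum(points) / 11

print(eleven_pt_avg_precision([True, True, False, True, False]))  # ~0.909
```
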
Results (11pt Average Precision) Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms
Results (11pt Average Precision) Comparing different context sizes; keeping the stop words
Results (11pt Average Precision) Comparing different improvements of the WEB3 algorithm
Results (Precision-Recall Graph) Comparing the recall-precision graphs of the evaluated algorithms
Results: The Ordering for WEB3

  r   | Candidate          | BG Sense  | RU Sense | Sim.   | Cogn.? | P@r     | R@r
  200 | красота            | beauty    | beauty   | 0.9684 | yes    | 50.00%  | 100.00%
  199 | флора              | flora     | flora    | 0.9171 | yes    | 50.25%  | 100.00%
  198 | наука              | science   | science  | 0.9028 | yes    | 50.51%  | 100.00%
  197 | сребро / серебро   | silver    | silver   | 0.8916 | yes    | 50.76%  | 100.00%
  196 | финанси / финансы  | finance   | finance  | 0.8017 | yes    | 51.28%  | 100.00%
  ... | ...                | ...       | ...      | ...    | ...    | ...     | ...
  101 | бут                | leg       | rubble   | 0.2130 | no     | 82.18%  | 83.00%
  100 | година             | year      | time     | 0.2101 | no     | 82.00%  | 82.00%
   99 | вулкан             | volcano   | volcano  | 0.2099 | yes    | 81.82%  | 81.00%
  ... | ...                | ...       | ...      | ...    | ...    | ...     | ...
    5 | плет / плеть       | hedge     | whip     | 0.0182 | no     | 100.00% | 5.00%
    4 | мраз / мразь       | chill     | crud     | 0.0175 | no     | 100.00% | 4.00%
    3 | добитък / добыток  | livestock | income   | 0.0143 | no     | 100.00% | 3.00%
    2 | багрене / багренье | mottle    | gaff     | 0.0130 | no     | 100.00% | 2.00%
    1 | муфта              | gratis    | muff     | 0.0085 | no     | 100.00% | 1.00%

Discussion
Our approach is original because it:
- introduces a semantic similarity measure, not an orthographic or phonetic one;
- uses the Web as a corpus and does not rely on any preexisting corpora;
- uses reverse-context lookup, a significant improvement in quality;
- is applied to an original problem: classification of almost identically spelled true/false friends.

Discussion
Very good accuracy: over 95%. Still, it is not 100% accurate:
- Typical mistakes are synonyms, hyponyms, and words influenced by cultural, historical and geographical differences.
- The Web as a corpus introduces noise: Google returns only the first 1,000 results, and ranks news portals, travel agencies and retail sites higher than books, articles and forum posts.
- The local context can contain noise.

Conclusion and Future Work
Conclusion:
- An algorithm that can distinguish between cognates and false friends.
- It analyzes the words' local contexts, using the Web as a corpus.
Future work:
- Better glossaries.
- Automatically augmenting the glossary.
- Different language pairs.

Questions? Cognate or False Friend? Ask the Web!
