Bilingual Term Extraction
with Big Data.
Research project with the Zurich University of Applied Sciences (Switzerland).
Conducted by Mark Cieliebak.
Rémy Blättler
Developer.
Co-Founder.
Chief of the System.
In Zurich, Berlin and Los Angeles.
In English, Spanish, Portuguese, German,
Chinese, Japanese and 80 other languages.
Naïve approaches
• Simple word frequencies
• Using dictionaries to find “phrases”
GENSIM word2vec
• Shallow, two-layer neural networks trained to reconstruct linguistic contexts of words
• Free & open-source
GENSIM word2vec
• Odd one out: breakfast cereal dinner lunch => cereal
• Analogy: man => boy, woman => girl
• Most similar to Sweden: Norway 0.76, Finland 0.71, Estonia 0.54
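The queries on this slide (odd one out, nearest neighbours) come down to cosine similarity between word vectors. The following is a minimal sketch of that idea, not the project's code: the three-dimensional vectors are hand-made toy values chosen for illustration, and the helper names `unit`, `cosine`, `odd_one_out` and `most_similar` are hypothetical. On a trained Gensim model the corresponding calls would be `model.wv.doesnt_match(...)` and `model.wv.most_similar(...)`.

```python
import math

# Toy, hand-made vectors (an assumption for illustration only; real
# word2vec vectors are learned from a corpus and have many more dimensions).
VECTORS = {
    "breakfast": (0.90, 0.10, 0.00),
    "lunch":     (0.80, 0.20, 0.10),
    "dinner":    (0.85, 0.15, 0.05),
    "cereal":    (0.20, 0.90, 0.10),
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def odd_one_out(words):
    """Return the word least similar to the mean of the group."""
    units = [unit(VECTORS[w]) for w in words]
    mean = tuple(sum(col) / len(units) for col in zip(*units))
    return min(words, key=lambda w: cosine(VECTORS[w], mean))


def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(w, cosine(VECTORS[word], VECTORS[w]))
              for w in VECTORS if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(odd_one_out(["breakfast", "cereal", "dinner", "lunch"]))
print(most_similar("breakfast"))
```

With these toy values "cereal" is the odd one out because its vector points away from the three meal words; with real embeddings the same logic runs over vectors learned from text.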
GENSIM Phrases
• New York City
• Terms and Conditions
• Never Follow (Audi)
• Just do it (Nike)
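Phrase detection of this kind is usually based on how often two words appear together compared with how often they appear on their own. The sketch below implements the bigram score from Mikolov et al. (2013), which, to my understanding, is also the default scorer in Gensim's `Phrases`. The function name `find_phrases`, the tiny corpus and the parameter values are assumptions for illustration.

```python
from collections import Counter


def find_phrases(sentences, min_count=1, threshold=1.0):
    """Score every adjacent word pair and keep those above `threshold`.

    score(a, b) = (count(a b) - min_count) / (count(a) * count(b)) * vocab_size
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    phrases = {}
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


# Invented toy corpus, already tokenised and lower-cased.
corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
    ["the", "city", "is", "large"],
]
print(find_phrases(corpus))
```

Running the detector a second time on text where "new york" has been merged into one token is the usual way to pick up three-word phrases such as "New York City".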
Term extraction 1
1. Average occurrence of a term over all corpora
2. Average occurrence of a term for one client
3. Same for the other language
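One way to read these three steps is as a frequency comparison: a term that is far more common in one client's texts than across all corpora is probably client terminology. The slide does not say how the two averages were combined, so the ratio used below is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def relative_freq(docs):
    """Share of all tokens that each term accounts for."""
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def client_specific_terms(client_docs, all_docs, min_ratio=2.0):
    """Terms whose relative frequency for the client is at least
    `min_ratio` times their relative frequency over all corpora.
    `all_docs` is assumed to include the client's documents."""
    client = relative_freq(client_docs)
    overall = relative_freq(all_docs)
    ratios = {t: f / overall[t] for t, f in client.items() if t in overall}
    hits = [t for t, r in ratios.items() if r >= min_ratio]
    return sorted(hits, key=lambda t: -ratios[t])


# Invented toy data: one client's documents plus everybody else's.
client_docs = [["mycloud", "stores", "files"], ["mycloud", "app"]]
other_docs = [["the", "app", "stores", "files"],
              ["files", "and", "app"],
              ["stores", "open"]]
print(client_specific_terms(client_docs, client_docs + other_docs))
```

For step 3, the same function would be run once on the source-language corpora and once on the target-language corpora.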
Term extraction 2
Detect the specific phrase in the source & target:
Geänderte Segmentberichterstattung erhöht Aussagekraft.
Improvements gained from changes in segment reporting.
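When a segment contains several candidate phrases on each side, a single segment cannot tell which source phrase belongs to which target phrase; looking at other segments that share a phrase can. The sketch below counts co-occurrences over aligned segments and pairs each source term with the target term of highest Dice coefficient. The Dice measure, the function name `pair_terms` and the toy segments are assumptions; the deck does not name the measure it used.

```python
from collections import Counter
from itertools import product


def pair_terms(segments):
    """segments: list of (source_terms, target_terms) sets, one per
    aligned translation-memory unit. Returns {source: (target, dice)}."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in segments:
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        joint.update(product(src_terms, tgt_terms))

    best = {}
    for (src, tgt), together in joint.items():
        # Dice coefficient: 1.0 when the two terms always appear together.
        dice = 2 * together / (src_count[src] + tgt_count[tgt])
        if src not in best or dice > best[src][1]:
            best[src] = (tgt, dice)
    return best


# Invented toy segments holding already-extracted candidate terms.
segments = [
    ({"Abo", "KMU"}, {"subscription", "SME"}),
    ({"Abo"}, {"subscription"}),
    ({"KMU", "App"}, {"SME", "app"}),
    ({"App"}, {"app"}),
]
print(pair_terms(segments))
```

The first segment alone is ambiguous ("Abo" could map to "subscription" or "SME"); the second segment resolves it.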
Plug in additional algorithms
Works well for small fixes too
Data corpus => GENSIM, Uppercase, Entity extraction, Remove cities => Weighted output
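The pluggable design on this slide can be sketched as a list of small scoring functions whose results are combined with weights. Everything below is a hypothetical illustration: the two scorers, the `CITIES` stop-list, the weights and the function names are assumptions, and the deck does not say how the real weights were chosen.

```python
def uppercase_score(term):
    """1.0 if the term starts with a capital letter, else 0.0."""
    return 1.0 if term[:1].isupper() else 0.0


CITIES = {"Zurich", "Berlin"}  # hypothetical stop-list


def not_a_city_score(term):
    """0.0 for known city names, 1.0 for everything else."""
    return 0.0 if term in CITIES else 1.0


def weighted_output(terms, scorers):
    """scorers: list of (function, weight) pairs.
    Returns (term, score) pairs, best first, with scores in [0, 1]."""
    total_weight = sum(weight for _, weight in scorers)
    scored = {
        term: sum(weight * fn(term) for fn, weight in scorers) / total_weight
        for term in terms
    }
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


candidates = ["myCloud", "Swisscom", "Berlin", "website"]
plugins = [(uppercase_score, 1.0), (not_a_city_score, 2.0)]
for term, score in weighted_output(candidates, plugins):
    print(f"{term}: {score:.2f}")
```

Adding another algorithm then means appending one more `(function, weight)` pair, which is what makes this layout convenient for small fixes.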
German                    English
myCloud                   myCloud
Wie kann ich              How can I
Swisscom Broadcast        Swisscom Broadcast
HomepageTool              HomepageTool
Abo                       subscription
Swisscom (Schweiz) AG     Swisscom (Switzerland) Ltd
Swisscom Sharespace       Swisscom Sharespace
App                       app
Healthi                   Healthi
KMU                       SME
SharePoint                SharePoint
Endresultat               Swisscom.”
Ihre Webseite             your website
KMU                       SMEs
Dateien                   Files
Swisscom myCloud          Swisscom myCloud
generelles Rauchverbot    cablex."
Dateien                   Photos
inOne mobile              inOne mobile
Business Apps             Business Apps
The big picture: fast & easy client onboarding
Webspider (with PhantomJS)
(N)MT-based alignment
Term extraction
Structured text in multiple languages
Translation memory
Terminology database
Problems
• Speed (tests take multiple hours)
• Insufficient data (>50k TM units helps)
• Bad source data (HTML, JavaScript, etc.)
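For the bad-source-data problem, one option is to strip markup and script content before the text reaches the extraction step. This is a minimal sketch using only Python's standard `html.parser`; the class and function names are hypothetical, and it is not a description of how the project cleaned its data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def clean_segment(raw):
    """Return the plain text of one translation-memory segment."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


print(clean_segment("<p>Ihre <b>Webseite</b></p><script>var x = 1;</script>"))
```

Joining the pieces with a space is a simplification: markup inside a word (for example bold applied to half a word) would be split into two tokens.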
Budget
$1,000 KTI Research Project
$20,000 SpinningBytes & University Cooperation
=> $20,000 more until online & 80h+ internal
Questions?
@remyblaettler (Twitter)
remy@supertext.ch

Editor's Notes

  • #7: Cluster documents and classify them by topic. Sentiment analysis. Recommendations, e.g. CRM.
  • #10: It gets complicated if a segment has multiple extracted phrases. => Find other segments with the same phrases.
  • #11: Next step: replace the weighted output with a neural network. What's stopping us? Not enough terminology data to train with. Maybe a public system would help.
  • #12: Examples. DON’T PROOFREAD.