NLP Project Full Cycle
Vsevolod Dyomkin
10/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Plan
* Overview of NLP
* NLP Data
* Common NLP problems
and approaches
* Example NLP application:
text language identification
What Is NLP?
Transforming free-form text
into structured data and back
What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* ML, Stats, Information Theory
Natural Language
* ambiguous
* noisy
* evolving
Roles
linguist [noun]
1. A specialist in linguistics
linguist [noun]
1. A specialist in linguistics
linguistics [noun]
1. The scientific study of
language.
NLP Project Full Cycle
NLP Project Full Cycle
NLP Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* Internet/user Data
Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog
engineer [noun]
5. A person skilled in the
design and programming of
computer systems
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…(){}\[\]|⟨⟩‒–—«»“”‘’―-]
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…(){}\[\]|⟨⟩‒–—―«»“”‘’']
|[.!?]+
|-+
In fact, it works:
https://github.com/lang-uk/ner-uk/blob/master/doc
/tokenization.md
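As a quick illustration (a minimal sketch in Python, not from the original slides), even the simplest pattern above already tokenizes, though punctuation stays glued to the words, which is exactly what the more advanced patterns address:

import re

# Whitespace tokenization with the simplest regex above.
TOKEN = re.compile(r"[^\s]+")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("This is a test that isn't so simple: 1.23."))
# ['This', 'is', 'a', 'test', 'that', "isn't", 'so', 'simple:', '1.23.']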
Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
NLP Project Full Cycle
researcher [noun]
1. One who researches
researcher [noun]
1. One who researches
research [noun]
1. Diligent inquiry or
examination to seek or revise
facts, principles, theories,
applications, etc.; laborious
or continued search after
truth
Models
Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a
sequence of words/sentence?
Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply the Markov assumption: each word depends
only on the N previous words (in practice
N=1..4, which gives bigram to fivegram models,
since the current word is counted too).
If n=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
By the definition of conditional probability:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
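To make the arithmetic concrete, here is a toy maximum-likelihood bigram model (my sketch, not from the slides); P(w1|w0) is estimated as count(w0 w1) / count(w0), exactly the ratio above:

from collections import Counter

def train_bigrams(sentences):
    # count unigrams and bigrams, with <s> marking sentence starts
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for w0, w1 in zip(["<s>"] + words, words):
            unigrams[w0] += 1
            bigrams[(w0, w1)] += 1
    return unigrams, bigrams

def prob(w1, w0, unigrams, bigrams):
    return bigrams[(w0, w1)] / unigrams[w0] if unigrams[w0] else 0.0

unigrams, bigrams = train_bigrams([["the", "cat", "sat"],
                                   ["the", "dog", "sat"]])
print(prob("cat", "the", unigrams, bigrams))  # 0.5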
Spam Filtering
A 2-class classification problem with a
bias towards minimizing false positives (FPs).
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features
Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
http://www.paulgraham.com/spam.html - A Plan for Spam
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
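For concreteness, a minimal bag-of-words feature extractor (an illustration, not Graham's actual code):

from collections import Counter

def bag_of_words(tokens):
    # word order is discarded; a message becomes a multiset of counts
    return Counter(w.lower() for w in tokens)

print(bag_of_words("Buy now , buy cheap".split()))
# Counter({'buy': 2, 'now': 1, ',': 1, 'cheap': 1})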
Naive Bayes
Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|X)
Naive step:
P(Y|X) ∝ P(Y) * prod(P(x|Y))
for all x in X
(P(X) is marginalized out because it's the
same for all Y)
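Putting the two slides together, a toy end-to-end sketch (made-up training data, not from the talk): Naive Bayes over bag-of-words features, with add-one smoothing and log probabilities to avoid underflow:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.priors = Counter(labels)        # class counts
        self.counts = defaultdict(Counter)   # per-class word counts
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def log_posterior(y):
            total = sum(self.counts[y].values()) + len(self.vocab)
            # raw class counts are fine: the shared normalizer
            # doesn't change the argmax
            return (math.log(self.priors[y])
                    + sum(math.log((self.counts[y][w] + 1) / total)
                          for w in doc))
        return max(self.priors, key=log_posterior)

nb = NaiveBayes().fit([["buy", "cheap", "pills"],
                       ["meeting", "at", "noon"]],
                      ["spam", "ham"])
print(nb.predict(["cheap", "pills"]))  # spam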
Machine Learning
Approach
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fas
t-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
Shift-reduce Parsing
Averaged Perceptron
def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # reward the correct tag, penalize the wrong guess
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        # reshuffle so example order doesn't bias the next pass
        random.shuffle(examples)
    # (the "averaged" part - keeping a running average of the
    # weights - is assumed to live inside the model object)
ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def parse(words, tags):
    n = len(words)
    deps = init_deps(n)  # no dependencies recorded yet
    idx = 1              # buffer index
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        # greedily pick the highest-scoring valid action
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
The Hierarchy of
ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests, Boosted
Trees
* Artificial Neural networks
Semantics
Question: how to model relationships
between words?
Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia
Word Similarity
Next question: now, how do we measure those
relations?
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
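A quick worked example (toy probabilities; log base 2 is an arbitrary choice here): if two words occur with probabilities 0.01 and 0.002 but co-occur with probability 0.001, they appear together 50x more often than chance:

import math

def pmi(p_xy, p_x, p_y):
    # pointwise mutual information: how much more often x and y
    # co-occur than they would under independence
    return math.log2(p_xy / (p_x * p_y))

print(pmi(p_xy=0.001, p_x=0.01, p_y=0.002))  # log2(50) ~ 5.64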
Distributional
Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe, …)
* Hierarchical representation (Brown clusters)
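For dense representations, relatedness is typically measured by cosine similarity; here is a self-contained sketch with made-up 4-dimensional vectors (real word2vec/GloVe embeddings have hundreds of dimensions):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

cat = [0.2, 0.8, 0.1, 0.4]
dog = [0.3, 0.7, 0.2, 0.5]
car = [0.9, 0.1, 0.8, 0.0]
print(cosine(cat, dog), cosine(cat, car))  # cat is closer to dog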
Steps to Develop
an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
frameworks where possible
* Set up and perform a proper
experiment (or series of experiments)
* Optimize the system for production
Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Don't trust research results
* Value pre- and post-processing
* Gather user feedback
Text Language
Identification
Not an unsolved problem:
* https://github.com/CLD2Owners/cld2 - C++
* https://github.com/saffsd/langid.py - Python
* https://github.com/shuyo/language-detection/ - Java
To read:
https://blog.twitter.com/2015/evaluating-language-identifi
cation-performance
http://blog.mikemccandless.com/2011/10/accuracy-and-perfor
mance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
WILD Challenges
NLP Project Full Cycle
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages,
always evolving
* Wanted to do it in Lisp
WILD Linguistics
* Scripts vs languages
http://www.omniglot.com/writing/langalph.htm
* Languages distribution
https://en.wikipedia.org/wiki/Languages_used_o
n_the_Internet#Content_languages_for_websites
* Frequency word lists
https://invokeit.wordpress.com/frequency-word-
lists/
* Word segmentation?
WILD Data
Wiktionary → Wikipedia data:
used abstracts, ~175 languages
- download & store
- process (SAX parsing)
- setup learning & test data sets
10,778,404 unique words
481,581 unique character trigrams
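Character trigrams of this kind can be extracted along these lines (a Python sketch for illustration; the actual WILD code is in Lisp, and the boundary markers are my assumption):

from collections import Counter

def char_trigrams(word):
    padded = "^" + word + "$"  # mark word boundaries
    return Counter(padded[i:i+3] for i in range(len(padded) - 2))

print(char_trigrams("language"))
# Counter({'^la': 1, 'lan': 1, 'ang': 1, 'ngu': 1, ...})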
WILD Engineering
* Initial model size ~1GB -
script hacks & Huffman coding
to the rescue
* Model pruning
* Proper probability calculations
* Efficient testing
* Properly saving the model
* Library & public API