Text mining and natural
language processing
Florian Leitner

Technical University of Madrid (UPM), Spain

Tyba

Madrid, ES, 12th of June, 2015
License:
Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”
  Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”
  IBM’s bet on the future: Datastreams, Mainframes & AI
CRUSH: Criminal Reduction Utilizing Statistical History (IBM, reality), “predict crimes before they happen”
Precogs (Minority Report, movie)
if? when?
cognitive computing: “processing information more like a human than a machine”
Examples of text mining and natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
[Figure labels: Text Mining, Language Processing]
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
Concepts & Terminology
Document and text classification/clustering
[Figure: documents plotted against the 1st and 2nd principal components; one panel marks the distance between documents, the other a cluster and its centroid.]
Supervised (“Learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“Exploratory grouping”, e.g., topic modeling)
Libraries: LIBSVM
Words, Tokens, and N-Grams/Shingles
Text: This is a sentence.
Tokens or shingles (unigrams): This | is | a | sentence | .
2-shingles (bigrams): This is | is a | a sentence | sentence .
3-shingles (trigrams): This is a | is a sentence | a sentence .
NB: “tokenization”. Splitting: character-based, regular expressions, probabilistic, …
“k-shingling”: token n-grams or character n-grams,
e.g. all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
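
The token and character n-grams on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `tokenize` and `ngrams` are made up for illustration, and the regular expression is one simple splitting rule among many (real tokenizers handle abbreviations, hyphens, and so on).

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens (regex-based)."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(items, n):
    """Return all contiguous n-grams (k-shingles) of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = tokenize("This is a sentence.")
bigrams = ngrams(tokens, 2)    # token 2-shingles
trigrams = ngrams(tokens, 3)   # token 3-shingles
# Character n-grams: the same function applied to a string instead of a token list.
char_trigrams = ["".join(gram) for gram in ngrams("sentence", 3)]

print(tokens)
print(bigrams)
print(trigrams)
print(char_trigrams)
```

The same `ngrams` function serves both cases because a string is just a sequence of characters; only the unit being shingled changes.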
Lemmatization, Part-of-Speech (PoS) tagging, and
Named Entity Recognition (NER)
Linguistic annotations of tokens (used to train automated classifiers).

Token         Lemma         PoS  NER
Constitutive  constitutive  JJ   O
binding       binding       NN   O
to            to            TO   O
the           the           DT   O
peri-κ        peri-kappa    NN   B-DNA
B             B             NN   I-DNA
site          site          NN   I-DNA
is            be            VBZ  O
seen          see           VBN  O
in            in            IN   O
monocytes     monocyte      NNS  B-cell
.             .             .    O

PoS: the Penn Treebank tagset {NN, JJ, DT, VBZ, …} is the de facto standard.
NER: B-I-O (Begin-Inside-Outside) chunk encoding; a chunk is a run of (relevant) tokens, such as the three DNA tokens above.
Common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token; W: (unigram) Word).
Libraries: Stanford CoreNLP, FACTORIE, FreeLing, and many more…
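
To make the B-I-O encoding concrete, the sketch below groups the tagged tokens from the table back into chunks. The function name `bio_chunks` is hypothetical, and the kappa token is spelled out as "peri-kappa" for readability; treat this as an illustration of the encoding, not as any particular library's decoder.

```python
def bio_chunks(tokens, tags):
    """Group tokens into (label, text) chunks according to their B-I-O tags."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            # Close any open chunk, then begin a new one with this token.
            if current:
                chunks.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Inside the chunk that is currently open.
            current.append(token)
        else:
            # "O": outside any chunk, so close the open chunk if there is one.
            if current:
                chunks.append((label, " ".join(current)))
            current, label = [], None
    if current:
        chunks.append((label, " ".join(current)))
    return chunks

tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]
print(bio_chunks(tokens, tags))
```

The explicit B tag is what lets two adjacent chunks of the same type be told apart, which the simpler I-O scheme cannot do.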
Word vectors and inverted indices
Text vectorization: Inverted index (“Search engine basics”)
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

Each token/word is a dimension!
Comparing text vectors: e.g., cosine similarity
Similarity(T₁, T₂) := cos(T₁, T₂)
[Figure: Text1 and Text2 drawn as vectors in a space with axes count(Word1), count(Word2), and count(Word3), with angles α, β, γ.]
Libraries: INDRI
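
As a rough illustration of text vectorization and cosine similarity, the sketch below counts tokens for the two example texts and compares the resulting vectors. The `LEMMAS` dictionary is a hypothetical stand-in for a real lemmatizer (it only maps "wills" to "will", as the table does), and tokens are lowercased, so "Moses" is counted as "moses". Unlike the slide's table, the code also counts "neither" in Text 1.

```python
import math
import re
from collections import Counter

LEMMAS = {"wills": "will"}  # hypothetical stand-in for a real lemmatizer

def vectorize(text):
    """Map a text to a sparse count vector: {token: count}."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(LEMMAS.get(token, token) for token in tokens)

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

text1 = vectorize("He that not wills to the end neither wills to the means.")
text2 = vectorize("If the mountain will not go to Moses, then Moses must go to the mountain.")

print(sorted(text1.items()))
print(sorted(text2.items()))
print(cosine(text1, text2))
```

Only the tokens shared by both texts ("not", "the", "to", "will") contribute to the dot product, which is why sparse dictionaries are enough here and no dense matrix is needed.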
Inverted indices and the central dogma of machine learning
y = h_θ(X)
[Figure: Xᵀ × θ = y]
• Xᵀ: inverted index (transposed), with “texts” (n) as instances/observations and n-grams (p) as variables/features
• θ: parameters, per feature (“nonparametric”: per instance)
• y: rank, class, expectation, probability, descriptor*, …
(Hyperparameters are settings that control the learning algorithm.)
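
One possible reading of this picture is a linear hypothesis, where each text's prediction is the dot product of its feature counts with the parameter vector. The sketch below assumes exactly that; the slide does not say that h has to be linear, and the parameter values are made up for illustration. The counts are taken from the "not", "the", and "will" rows of the inverted index shown earlier.

```python
def h(theta, X):
    """Linear hypothesis: one value per text, the dot product of its row with theta."""
    return [sum(x_j * theta_j for x_j, theta_j in zip(row, theta)) for row in X]

# Rows are texts (n = 2), columns are features (p = 3): counts of "not", "the", "will".
X = [
    [1, 2, 2],  # Text 1
    [1, 2, 1],  # Text 2
]
theta = [0.5, 0.0, -1.0]  # made-up parameters, one per feature

y = h(theta, X)
print(y)  # one prediction per text
```

Learning then means choosing theta so that the predictions match known targets; hyperparameters such as a regularization strength steer that search but are not part of theta.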
The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, outside) than to other instances.
✓ Remedy: (feature) dimensionality reduction
The “blessing of non-uniformity.”
• feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), …
• feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
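
The hypercube claim can be checked with a small simulation. The sketch below draws uniformly random points in the unit cube and measures how many lie closer to the nearest face than to their nearest neighbour. The function name, the sample size of 50, and the uniform distribution are all assumptions made for illustration; real text vectors are sparse counts, not uniform samples.

```python
import math
import random

def fraction_closer_to_face(dim, n_points=50, seed=0):
    """Fraction of random points whose nearest cube face is closer than their nearest neighbour."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    closer = 0
    for i, point in enumerate(points):
        # Distance to the nearest face of the unit cube [0, 1]^dim.
        to_face = min(min(x, 1.0 - x) for x in point)
        # Euclidean distance to the nearest other point.
        to_neighbour = min(math.dist(point, other)
                           for j, other in enumerate(points) if j != i)
        if to_face < to_neighbour:
            closer += 1
    return closer / n_points

for dim in (1, 2, 10, 100, 1000):
    print(dim, fraction_closer_to_face(dim))
```

With more dimensions, at least one coordinate of every point is almost surely near 0 or 1, while distances between points keep growing, so the fraction approaches 1.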
Applications
Google’s review summaries: Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)
Polarity of sentiment keywords in IMDB.
“not good”
Christopher Potts. On the negativity of negation. 2011
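
The “not good” example shows why bag-of-words sentiment features break under negation. A common preprocessing trick is to mark every token between a negation word and the next clause-level punctuation mark, so that “good” and negated “good” become different features. The sketch below illustrates that general idea only; it is not necessarily the procedure used in the cited work, and the negation list and the `_NEG` suffix are illustrative choices.

```python
import re

NEGATIONS = {"not", "no", "never", "cannot"}  # small illustrative list
CLAUSE_END = {".", ",", ";", ":", "!", "?"}

def mark_negation(text):
    """Append _NEG to tokens that follow a negation, up to the next clause-level punctuation."""
    tokens = re.findall(r"[\w']+|[^\w\s]", text.lower())
    marked, negated = [], False
    for token in tokens:
        if token in CLAUSE_END:
            negated = False          # punctuation ends the negated scope
            marked.append(token)
        elif token in NEGATIONS or token.endswith("n't"):
            negated = True           # the negation word itself is left unmarked
            marked.append(token)
        else:
            marked.append(token + "_NEG" if negated else token)
    return marked

print(mark_negation("This movie is not good, but the score is great."))
```

A classifier trained on the marked tokens can then learn separate weights for the plain and the negated form of a word.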
Language understanding: Parsing and semantic analysis.
[Figure: a parsed sentence annotated with Named Entity Recognition, Entity Grounding, and Coreference (Anaphora) Resolution, each marked “disambiguation!”. Pictured: L. Tesnière, N. Chomsky.]
Example: Apple Siri
Libraries: Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
Automatic text summarization…
…is very difficult because…
• Variance/human agreement: When is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Libraries: Lex[Page]Rank (JUNG), sumy, TextTeaser
the author got hired by Google…
Machine translation: Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das).
‣ have different verb placements (es ↔ de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en ↔ de).
‣ have different word orders (Latin, Arabic, CJK).
Libraries: DL4J
Question answering: The champions league of TM & NLP.
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
Biggest issue: statistical inference (cognitive computing)
All men are mortal. Socrates probably is a man… Therefore, Socrates might be mortal.
Systems: IBM Watson, WolframAlpha
Information extraction: Knowledge mining for molecular biology.
[Figure: information extraction workflow. Labels: WWW, Articles, Article Classification, Named Entity Recognition, Entity Mapping (Grounding), Entity Associations, Relationship Extraction, Relationship Annotations, Binary Interactions, Experimental Methods, Biological Model, Biological Repositories, Ontologies & Thesauri, Short Factoid Question Answering. Grounding example: Cdk5, Rat: TaxID 10116, UniProt Q03114.]
Libraries: MITIE, OpenDMAP, ClearTK
Text mining and language processing is all about resolving ambiguities.
Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech tagging: The robot wheels out the iron.
Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
Entity recognition & grounding: Is Princeton really good for you?
