Natural Language Processing
+ Python
by Ann C. Tan-Pohlmann

February 22, 2014

Outline
• NLP Basics
• NLTK
– Text Processing

• Gensim (really, really short)
– Text Classification

Natural Language Processing
• computer science, artificial intelligence, and
linguistics
• human–computer interaction
• natural language understanding
• natural language generation
- Wikipedia

Star Trek's Universal Translator

http://www.youtube.com/watch?v=EaeSKUV2zp0

Spoken Dialog Systems

NLP Basics
• Morphology
– study of word formation
– how word forms vary in a sentence

• Syntax
– branch of grammar
– how words are arranged in a sentence to show
connections of meaning

• Semantics
– study of meaning of words, phrases and sentences

NLTK: Getting Started
• Natural Language Toolkit
– for symbolic and statistical NLP
– teaching tool, study tool and as a platform for prototyping

• Python 2.7 is a prerequisite
>>> import nltk
>>> nltk.download()

Some NLTK methods
• text.similar(str)
• text.concordance(str)
• len(text)
• len(set(text))
• lexical_diversity
– len(text) / len(set(text))
• text.collocations()
– sequences of words that often occur together

Frequency Distribution
• fd = FreqDist(text)
• fd.inc(str)
• fd[str]
• fd.N()
• fd.max()

MORPHOLOGY > Syntax > Semantics
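The lexical diversity measure listed above can be sketched in plain Python, with no NLTK install needed. This is a minimal sketch that assumes Python 3 (where `/` is float division) and uses a made-up sentence as a stand-in for a real `nltk.Text`.

```python
def lexical_diversity(tokens):
    """Ratio of total tokens to distinct tokens, as written on the slide."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(tokens) / len(set(tokens))

# Hypothetical sample text standing in for a real corpus
tokens = "the cat sat on the mat and the dog sat too".split()

print(len(tokens))                # total number of tokens
print(len(set(tokens)))           # number of distinct tokens
print(lexical_diversity(tokens))  # average number of uses per distinct token
```

Under Python 2.7 the same expression performs integer division unless `from __future__ import division` is in effect.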

Frequency Distribution
• fd = FreqDist(text)
• fd.inc(str) – increments the count for sample str
• fd[str] – returns the number of occurrences of sample str
• fd.N() – total number of samples
• fd.max() – sample with the greatest count
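The same operations can be tried without NLTK by using `collections.Counter` from the standard library. This is only a rough equivalent, not NLTK's own code. Note that NLTK 3 builds `FreqDist` on `Counter` and dropped `inc()`, so `fd[sample] += 1` is the spelling to use there. The token list is a made-up example.

```python
from collections import Counter

# Hypothetical token list standing in for an nltk.Text
tokens = ["to", "be", "or", "not", "to", "be"]

fd = Counter(tokens)            # counterpart of FreqDist(text)
fd["be"] += 1                   # counterpart of fd.inc("be")
count_to = fd["to"]             # counterpart of fd["to"]
total = sum(fd.values())        # counterpart of fd.N()
top_word, top_count = fd.most_common(1)[0]  # counterpart of fd.max()

print(count_to, total, top_word, top_count)
```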

Corpus
• large collection of raw or categorized text on
one or more domains
• Examples: Gutenberg, Brown, Reuters, Web and
Chat Text
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> adventure_text = brown.words(categories='adventure')

Corpora in Other Languages
>>> import nltk
>>> from nltk import FreqDist
>>> from nltk.corpus import udhr
>>> languages = udhr.fileids()
>>> languages.index('Filipino_Tagalog-Latin1')
>>> tagalog = udhr.raw('Filipino_Tagalog-Latin1')
>>> tagalog_words = udhr.words('Filipino_Tagalog-Latin1')
>>> tagalog_tokens = nltk.word_tokenize(tagalog)
>>> tagalog_text = nltk.Text(tagalog_tokens)
>>> fd = FreqDist(tagalog_text)
>>> for sample in fd:
...     print(sample)

Using Corpus from Palito
Corpus
– large collection of raw or categorized text
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_dir = '/Users/ann/Downloads'
>>> tagalog = PlaintextCorpusReader(corpus_dir,
...                                 'Tagalog_Literary_Text.txt')
>>> raw = tagalog.raw()
>>> sentences = tagalog.sents()
>>> words = tagalog.words()
>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_text = nltk.Text(tokens)

Spoken Dialog Systems

MORPHOLOGY > Syntax > Semantics

Tokenization
– breaking a string up into words and punctuation marks

>>> tokens = nltk.word_tokenize(raw)
>>> tagalog_tokens = nltk.Text(tokens)
>>> tagalog_tokens = set(sample.lower() for sample in tagalog_tokens)

MORPHOLOGY > Syntax > Semantics
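To see what tokenization does without downloading NLTK data, here is a minimal regular-expression sketch. It is much cruder than `nltk.word_tokenize` (it splits contractions and hyphenated words, for instance), and `simple_tokenize` is a hypothetical helper name, not an NLTK function.

```python
import re

def simple_tokenize(text):
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

raw = "Hello, world! NLP is fun."
tokens = simple_tokenize(raw)

# Same normalization step as on the slide: lowercase and de-duplicate
vocabulary = set(token.lower() for token in tokens)

print(tokens)
print(sorted(vocabulary))
```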

Stemming
– normalizes a word to a base form; the result may not be the actual 'root' word
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'

MORPHOLOGY > Syntax > Semantics
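One cheap way to soften over-stemming such as 'moment' becoming 'mo' is to refuse very short stems. The sketch below is an illustration only. The `min_stem_length` parameter is an assumption introduced here; it is not part of the slide's stemmer or of NLTK.

```python
SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

def stem(word, min_stem_length=3):
    """Strip the first matching suffix, unless the remainder is too short."""
    for suffix in SUFFIXES:
        candidate = word[:-len(suffix)]
        if word.endswith(suffix) and len(candidate) >= min_stem_length:
            return candidate
    return word  # no suffix could be stripped safely

print(stem('reading'))  # 'ing' is stripped
print(stem('moment'))   # 'mo' would be too short, so the word is kept
print(stem('ponies'))   # 'ies' is stripped; the result is not a real word
```

The last call shows that a length guard does not make the result a dictionary word. That is the gap lemmatization addresses.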

Lemmatization
– uses a vocabulary list and morphological analysis (including the POS of a word)
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix) and word[:-len(suffix)] in brown.words():
...             return word[:-len(suffix)]
...     return word
...
>>> stem('reading')
'read'
>>> stem('moment')
'moment'

MORPHOLOGY > Syntax > Semantics
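The vocabulary check above scans the whole corpus on every call, because `brown.words()` behaves like a list. Building a set once makes each lookup constant time. A minimal sketch, in which the tiny `VOCABULARY` set is a hypothetical stand-in for `set(brown.words())`:

```python
SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

# Hypothetical vocabulary; with NLTK this could be set(brown.words())
VOCABULARY = {"read", "walk", "quick", "moment"}

def stem(word, vocabulary=VOCABULARY):
    """Strip a suffix only if what remains is a known word."""
    for suffix in SUFFIXES:
        candidate = word[:-len(suffix)]
        if word.endswith(suffix) and candidate in vocabulary:
            return candidate
    return word

print(stem('reading'))
print(stem('quickly'))
print(stem('moment'))
```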

NLTK Stemmers & Lemmatizer
• Porter Stemmer and Lancaster Stemmer
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(w) for w in brown.words()[:100]]

• WordNet Lemmatizer
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(w) for w in brown.words()[:100]]

• Comparison
>>> [wnl.lemmatize(w) for w in ['investigation', 'women']]
>>> [porter.stem(w) for w in ['investigation', 'women']]
>>> [lancaster.stem(w) for w in ['investigation', 'women']]

MORPHOLOGY > Syntax > Semantics

Using Regular Expression
Operator    Behavior
.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches one of a set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+           One or more of previous item, e.g. a+, [a-z]+
?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}         Exactly n repeats where n is a non-negative integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses that indicate the scope of the operators

MORPHOLOGY > Syntax > Semantics
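A few of the operators above, tried against a small made-up word list with Python's `re` module. The `matching` helper and the word list are illustrative assumptions.

```python
import re

# Hypothetical word list for trying out the operators
words = ["abc", "abcabc", "xabc", "abcbc", "walked", "walking", "walks", "book", "bk"]

def matching(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r"^abc"))         # anchored at the start of the string
print(matching(r"abc$"))         # anchored at the end of the string
print(matching(r"(ed|ing|s)$"))  # disjunction of suffixes
print(matching(r"^bo{2}k$"))     # exactly two repeats of 'o'
print(matching(r"^bo*k$"))       # zero or more repeats of 'o'
print(matching(r"^a(b|c)+$"))    # parentheses set the scope of '+'
```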

Using Regular Expression
>>> import re
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'reading')
[('read', 'ing')]
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> stem('reading')
'read'
>>> stem('moment')
'mo'

MORPHOLOGY > Syntax > Semantics
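The `?` after `.*` in the pattern above matters: it makes the first group non-greedy, so the suffix group gets a chance to match. A short comparison, using the same suffix list and a few sample words chosen for illustration:

```python
import re

SUFFIX_GROUP = r"(ing|ly|ed|ious|ies|ive|es|s|ment)"

# Greedy: '.*' swallows the whole word and leaves nothing for the suffix
greedy = re.findall(r"^(.*)" + SUFFIX_GROUP + r"?$", "processes")

# Non-greedy: '.*?' stops as early as possible, so the suffix is split off
non_greedy = re.findall(r"^(.*?)" + SUFFIX_GROUP + r"?$", "processes")

# A word with no listed suffix comes back whole, with an empty suffix
no_suffix = re.findall(r"^(.*?)" + SUFFIX_GROUP + r"?$", "language")

print(greedy)
print(non_greedy)
print(no_suffix)
```

The non-greedy pattern still strips 'ment' from 'moment', since it has no vocabulary check.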

Spoken Dialog Systems

Morphology > SYNTAX > Semantics

Lexical Resources
• collections of words with associated information (annotations)
• Ex: stopwords – high-frequency words with little lexical
content
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
>>> stopwords.words('german')

MORPHOLOGY > Syntax > Semantics
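A typical use of a stopword list is to filter tokens before counting or classifying them. The sketch below uses a small hand-written set as a stand-in. With NLTK installed, `set(stopwords.words('english'))` could be passed in instead.

```python
# Hypothetical mini stop list; NLTK's English list is much longer
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def content_words(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The study of meaning is a branch of linguistics".split()
print(content_words(tokens))
```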

Part-of-Speech (POS) Tagging
• the process of labeling each word with its
part of speech, based on the word's
definition and context

Morphology > SYNTAX > Semantics

NLTK's POS Tag Sets* – 1/2
Tag    Meaning         Examples
ADJ    adjective       new, good, high, special, big, local
ADV    adverb          really, already, still, early, now
CNJ    conjunction     and, or, but, if, while, although
DET    determiner      the, a, some, most, every, no
EX     existential     there, there's
FW     foreign word    dolce, ersatz, esprit, quo, maitre
MOD    modal verb      will, can, would, may, must, should
N      noun            year, home, costs, time, education
NP     proper noun     Alison, Africa, April, Washington

*simplified
Morphology > SYNTAX > Semantics

NLTK's POS Tag Sets* – 2/2
Tag    Meaning               Examples
NUM    number                twenty-four, fourth, 1991, 14:24
PRO    pronoun               he, their, her, its, my, I, us
P      preposition           on, of, at, with, by, into, under
TO     the word to           to
UH     interjection          ah, bang, ha, whee, hmpf, oops
V      verb                  is, has, get, do, make, see, run
VD     past tense            said, took, told, made, asked
VG     present participle    making, going, playing, working
VN     past participle       given, taken, begun, sung
WH     wh determiner         who, which, when, what, where, how

*simplified
Morphology > SYNTAX > Semantics

NLTK POS Tagger (Brown)
>>> nltk.pos_tag(brown.words()[:30])
[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'),
('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'),
('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'JJ'), ('recent', 'JJ'),
('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBN'), ('``', '``'), ('no',
'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'WDT'), ('any', 'DT'),
('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The',
'DT'), ('jury', 'NN'), ('further', 'RB'), ('said', 'VBD'), ('in', 'IN')]
>>> brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

Morphology > SYNTAX > Semantics

NLTK POS Tagger (German)
>>> german = nltk.corpus.europarl_raw.german
>>> nltk.pos_tag(german.words()[:30])
[(u'Wiederaufnahme', 'NNP'), (u'der', 'NN'), (u'Sitzungsperiode', 'NNP'),
(u'Ich', 'NNP'), (u'erkl\xe4re', 'NNP'), (u'die', 'VB'), (u'am', 'NN'),
(u'Freitag', 'NNP'), (u',', ','), (u'dem', 'NN'), (u'17.', 'CD'), (u'Dezember', 'NNP'),
(u'unterbrochene', 'NN'), (u'Sitzungsperiode', 'NNP'), (u'des', 'VBZ'),
(u'Europ\xe4ischen', 'JJ'), (u'Parlaments', 'NNS'), (u'f\xfcr', 'JJ'),
(u'wiederaufgenommen', 'NNS'), (u',', ','), (u'w\xfcnsche', 'NNP'), (u'Ihnen', 'NNP'),
(u'nochmals', 'NNS'), (u'alles', 'VBZ'), (u'Gute', 'NNP'), (u'zum', 'NN'),
(u'Jahreswechsel', 'NNP'), (u'und', 'NN'), (u'hoffe', 'NN'), (u',', ',')]

\xe4 = ä, \xfc = ü
!!! DOES NOT WORK FOR GERMAN (nltk.pos_tag uses a tagger trained on English text)

Morphology > SYNTAX > Semantics

NLTK POS Dictionary
>>> pos = nltk.defaultdict(lambda:'N')
>>> pos['eat']
'N'
>>> pos.items()
[('eat', 'N')]
>>> for (word, tag) in brown.tagged_words(simplify_tags=True):
...     if word in pos:
...         if isinstance(pos[word], str):
...             new_list = [pos[word]]
...             pos[word] = new_list
...         if tag not in pos[word]:
...             pos[word].append(tag)
...     else:
...         pos[word] = [tag]
...
>>> pos['eat']
['N', 'V']
Morphology > SYNTAX > Semantics
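The same word-to-tags dictionary can be built with less bookkeeping by making a list the default value. This is a sketch, not the slide's code: the tagged pairs are invented, and unlike the version above, unseen words get an empty list rather than 'N'.

```python
from collections import defaultdict

# Hypothetical (word, tag) pairs standing in for brown.tagged_words(...)
tagged_words = [
    ("eat", "V"), ("the", "DET"), ("eat", "N"),
    ("eat", "V"), ("run", "V"), ("run", "N"),
]

pos = defaultdict(list)  # every new word starts with an empty tag list
for word, tag in tagged_words:
    if tag not in pos[word]:
        pos[word].append(tag)  # record each tag once, in order first seen

print(pos["eat"])
print(pos["run"])
```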

What else can you do with NLTK?
• Other Taggers
– Unigram Tagging
• nltk.UnigramTagger()
• train tagger using tagged sentence data

– N-gram Tagging

• Text classification using machine learning
techniques
– decision trees
– naïve Bayes classification (supervised)
– Markov Models
Morphology > SYNTAX > SEMANTICS
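The idea behind unigram tagging can be sketched in a few lines: remember the most frequent tag seen for each word in the training data, and fall back to a default for unknown words. This is a toy illustration under those assumptions, not the code behind `nltk.UnigramTagger`. The function names and training sentences are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="N"):
    """Tag known words from the model; unknown words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

# Hypothetical training data using the simplified tag set
training = [
    [("the", "DET"), ("run", "N"), ("was", "V"), ("long", "ADJ")],
    [("they", "PRO"), ("run", "V"), ("fast", "ADV")],
    [("we", "PRO"), ("run", "V"), ("home", "N")],
]

model = train_unigram_tagger(training)
print(tag(model, ["they", "run", "marathons"]))
```

Because a unigram tagger ignores context, 'run' is always tagged the same way. N-gram taggers use the preceding tags to address this.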

Gensim
• Tool that extracts the semantic structure of
documents by examining statistical word co-occurrence patterns within a corpus of
training documents.
• Algorithms:
1. Latent Semantic Analysis (LSA)
2. Latent Dirichlet Allocation (LDA)
3. Random Projections
Morphology > Syntax > SEMANTICS

Gensim
• Features
– memory independent
– wrappers/converters for several data formats

• Vector
– representation of the document as an array of features or
question-answer pairs
1. (word occurrence, count)
2. (paragraph, count)
3. (font, count)

• Model
– transformation from one vector to another
– learned from a training corpus without supervision
Morphology > Syntax > SEMANTICS
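The (word occurrence, count) representation can be sketched in plain Python: give each distinct word an integer id, then describe a document as a sorted list of (id, count) pairs. This mimics the bag-of-words idea only. The helper names and documents are hypothetical and this is not Gensim's API.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each distinct word an integer id, in order of first appearance."""
    word_ids = {}
    for doc in tokenized_docs:
        for word in doc:
            word_ids.setdefault(word, len(word_ids))
    return word_ids

def to_vector(tokens, word_ids):
    """Sparse bag-of-words vector: sorted (word id, count) pairs."""
    counts = Counter(word_ids[t] for t in tokens if t in word_ids)
    return sorted(counts.items())

# Hypothetical, already tokenized documents
docs = [["human", "computer", "interface"], ["computer", "system", "system"]]

word_ids = build_dictionary(docs)
print(word_ids)
print(to_vector(docs[1], word_ids))
print(to_vector(["system", "unknown", "human"], word_ids))  # unknown words are skipped
```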

Wiki document classification

http://radimrehurek.com/gensim/wiki.html

Other NLP tools for Python
• TextBlob
– part-of-speech tagging, noun phrase extraction,
sentiment analysis, classification, translation
– https://pypi.python.org/pypi/textblob

• Pattern
– part-of-speech taggers, n-gram search, sentiment
analysis, WordNet, machine learning
– http://www.clips.ua.ac.be/pattern

Star Trek technology that became a reality

http://www.youtube.com/watch?v=sRZxwRIH9RI

Installation Guides
• NLTK
– http://www.nltk.org/install.html
– http://www.nltk.org/data.html

• Gensim
– http://radimrehurek.com/gensim/install.html

• Palito
– http://ccs.dlsu.edu.ph:8086/Palito/find_project.jsp

Using IPython
• http://ipython.org/install.html
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]
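Before documents like these go into a topic model they are usually tokenized and pruned. Here is a minimal pure-Python sketch of that step. The stop list and the "keep words that occur more than once" rule are assumptions chosen for illustration; the slide does not specify either.

```python
from collections import Counter

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Hypothetical stop list of very common words
STOPLIST = set("for a of the and to in".split())

# Lowercase, split on whitespace, and drop stop words
texts = [[w for w in doc.lower().split() if w not in STOPLIST]
         for doc in documents]

# Keep only words that occur more than once in the whole collection
frequency = Counter(w for text in texts for w in text)
texts = [[w for w in text if frequency[w] > 1] for text in texts]

for text in texts:
    print(text)
```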

References
• Natural Language Processing with Python, by
Steven Bird, Ewan Klein, and Edward Loper
• http://www.nltk.org/book/
• http://radimrehurek.com/gensim/tutorial.html

Thank You!
• For questions and comments:
- ann at auberonsolutions dot com

