Week 8
The Natural Language Toolkit
(NLTK)
Except where otherwise noted, this work is licensed under:
http://creativecommons.org/licenses/by-nc-sa/3.0
List methods
• Getting information about a list
– list.index(item)
– list.count(item)
• These modify the list in place, unlike str methods, which return a new string
– list.append(item)
– list.insert(index, item)
– list.remove(item)
– list.extend(list2)
• same as list += list2
– list.sort()
– list.reverse()
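A minimal sketch of the methods above, written for Python 3. The sample tokens are made up for illustration; the point is that the in-place methods change the list itself, while a str method such as lower() returns a new string.

```python
tokens = ["the", "cat", "saw", "the", "dog"]

# Getting information about a list (the list is not changed)
first_the = tokens.index("the")   # position of the first "the"
the_count = tokens.count("the")   # how many times "the" occurs

# In-place modification: each call changes `tokens` and returns None
tokens.append("run")
tokens.insert(0, "yesterday")
tokens.remove("cat")              # removes only the first match
tokens.extend(["very", "fast"])   # same effect as tokens += ["very", "fast"]
tokens.sort()
tokens.reverse()

# By contrast, str methods return a new string and leave the original alone
word = "Booyah"
lowered = word.lower()

print(first_the, the_count)
print(tokens)
print(word, lowered)
```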
List exercise
• Write a script to print the most frequent token in a text file.
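One possible solution sketch for Python 3, assuming "token" simply means a whitespace-separated chunk (so punctuation stays attached). The function name and the use of collections.Counter are choices made here, not something the slides prescribe.

```python
from collections import Counter

def most_frequent_token(path):
    """Return (token, count) for the most common whitespace-separated token."""
    with open(path, encoding="utf-8") as f:
        tokens = f.read().split()
    if not tokens:
        return None  # empty file: nothing to report
    return Counter(tokens).most_common(1)[0]

# Usage (hypothetical file name):
# print(most_frequent_token("speech.txt"))
```

The same result can be had with only list methods by calling tokens.count(t) for each distinct token, at the cost of rescanning the list each time.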
And now for something completely different
Programming tasks?
• So far, we've studied programming syntax and techniques
• What about tasks for programming?
– Homework
– Mathematics, statistics (Sage)
– Biology (Biopython)
– Animation (Blender)
– Website development (Django)
– Game development (PyGame)
– Natural language processing (NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistical approaches are the more accurate and popular choice
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net (the project is now hosted at https://www.nltk.org)
– The slides assume Python 2.4 or higher; current NLTK releases require Python 3
– Click 'Download' and follow instructions for your OS
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?
Tokenization (cont.)
• How do we split his speech into tokens?
>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and', 'kick', 'him',
'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be',
'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best',
'ever.', 'Booyah!']
• Now, how often does he use the word "booyah"?
>>> martysSpeech.split().count("booyah")
0
>>> # What the! split() keeps punctuation attached ('Booyah.', 'Booyah!') and count() is case-sensitive
Tokenization (cont.)
• We could lowercase the speech
• We could write our own function to split on ".", ",", "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer (removed in NLTK 3; nltk.word_tokenize is the usual replacement)
– trained algorithm to statistically split on words
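A standard-library sketch of the first two ideas, lowercasing and splitting on punctuation, so it runs without NLTK installed. The regular expression is, as far as I know, the same pattern WordPunctTokenizer is built on (runs of word characters, or runs of punctuation), so WordPunctTokenizer().tokenize(text) should give equivalent tokens; the tokenize helper itself is a name chosen here.

```python
import re

speech = ("You know what I hate? Anybody who drives an S.U.V. I'd really "
          "like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him "
          "square in the teeth. Booyah. Be like, I'm Marty Stepp, the best "
          "ever. Booyah!")

def tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = tokenize(speech.lower())   # lowercase first so "Booyah" == "booyah"
booyah_count = tokens.count("booyah")

print(tokens[:6])     # ['you', 'know', 'what', 'i', 'hate', '?']
print(booyah_count)   # 2
```

Splitting on every punctuation mark also breaks up "S.U.V." and "I'd", which is the kind of case a trained tokenizer is meant to handle better.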
Part-of-speech (POS) tagging
• Knowing a token's POS helps answer:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:
>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
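dir() also returns many double-underscore names, which is what the leading "..." hides. A small sketch for Python 3 that filters them out; public_names is a helper name made up here, and the exact list you see depends on your Python version (for example, str has no decode method in Python 3).

```python
def public_names(obj):
    """Names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

names = public_names("hello world!")
print(names)
print("split" in names, "lower" in names)
```

The same helper works on any object, so public_names(nltk.corpus.treebank) is one way to look for tagged_words() once NLTK and its Treebank sample are installed.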
Tuples
• tagged_words() gives us a list of tuples
– tuple: like a list, but immutable (you can't change it once it is created)
– in this case, the tuples are (word, tag) pairs
>>> # Get the (word, tag) pair at list index 0
...
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print(word, tag)
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print(word, tag)
Pierre NNP
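Tuple unpacking in a loop is enough to attack the proper-noun exercise. The sketch below uses a small hand-made list in place of nltk.corpus.treebank.tagged_words() so it runs without NLTK; the data and the function name are assumptions for illustration, and NNP is the Penn Treebank tag for a singular proper noun.

```python
from collections import Counter

# Hand-made stand-in for nltk.corpus.treebank.tagged_words() (hypothetical data)
tagged_words = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), ("Mr.", "NNP"), ("Vinken", "NNP"),
    ("is", "VBZ"), ("chairman", "NN"),
]

def most_frequent_with_tag(pairs, wanted_tag):
    """Return (word, count) for the most common word carrying wanted_tag."""
    # `for word, tag in pairs` unpacks each (word, tag) tuple as it goes
    counts = Counter(word for word, tag in pairs if tag == wanted_tag)
    return counts.most_common(1)[0] if counts else None

top_proper_noun = most_frequent_with_tag(tagged_words, "NNP")
print(top_proper_noun)   # ('Vinken', 2)
```

With NLTK and the Treebank sample installed, passing nltk.corpus.treebank.tagged_words() as the first argument should answer the exercise directly.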
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)
– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method
>>> tagged_sentences = nltk.corpus.treebank.tagged_sents()
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
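Why does 'hate' come back as None? A unigram tagger, in essence, remembers the most frequent tag for each word it saw in training and has nothing to say about unseen words. The sketch below illustrates that idea with the standard library only; it is not NLTK's implementation, and the training sentences are hand-made rather than real Treebank data.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carried most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}

def tag(model, tokens):
    """Tag each token; words never seen in training get None."""
    return [(token, model.get(token)) for token in tokens]

# Tiny hand-made training set (hypothetical, not real Treebank data)
training = [
    [("You", "PRP"), ("know", "VB"), ("what", "WP"), ("I", "PRP"),
     ("like", "VBP"), ("?", ".")],
    [("I", "PRP"), ("know", "VBP"), ("you", "PRP"), (".", ".")],
    [("You", "PRP"), ("know", "VB"), ("me", "PRP"), (".", ".")],
]

model = train_unigram(training)
result = tag(model, ["You", "know", "what", "I", "hate", "?"])
print(result)
```

Here "know" is tagged VB because that tag wins two to one in the training data, and "hate" gets None because it never appears there. More training data, or a fallback tagger for unknown words, narrows that gap.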
POS tagging (cont.)
• Exercise: Mad Libs
– I have a passage I want filled with the right parts of speech
– Let's use random picks from our own data!
– This code will print it out:
print(properNoun1, "has always been a", adjective1,
      singularNoun, "unlike the", adjective2,
      properNoun2, "who I", pastVerb, "as he was",
      ingVerb, "yesterday.")
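One way to fill the blanks, sketched for Python 3 with random.choice over words grouped by tag. The (word, tag) pairs are hand-made stand-ins for a tagged corpus, and the pick helper is a name chosen here; the tags follow the Penn Treebank convention (NNP proper noun, JJ adjective, NN singular noun, VBD past-tense verb, VBG -ing verb).

```python
import random

# Hand-made (word, tag) pairs standing in for a tagged corpus (hypothetical data)
tagged_words = [
    ("Marty", "NNP"), ("Pierre", "NNP"), ("happy", "JJ"), ("old", "JJ"),
    ("chairman", "NN"), ("truck", "NN"), ("kicked", "VBD"), ("joined", "VBD"),
    ("driving", "VBG"), ("running", "VBG"),
]

def pick(pairs, wanted_tag, rng):
    """Return a random word that carries the wanted tag."""
    return rng.choice([word for word, tag in pairs if tag == wanted_tag])

rng = random.Random(8)  # fixed seed so repeated runs give the same passage
proper_noun1 = pick(tagged_words, "NNP", rng)
adjective1 = pick(tagged_words, "JJ", rng)
singular_noun = pick(tagged_words, "NN", rng)
adjective2 = pick(tagged_words, "JJ", rng)
proper_noun2 = pick(tagged_words, "NNP", rng)
past_verb = pick(tagged_words, "VBD", rng)
ing_verb = pick(tagged_words, "VBG", rng)

passage = " ".join([
    proper_noun1, "has always been a", adjective1, singular_noun,
    "unlike the", adjective2, proper_noun2, "who I", past_verb,
    "as he was", ing_verb, "yesterday.",
])
print(passage)
```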
Eliza (NLG)
• Eliza simulates a Rogerian psychotherapist
• With while loops and tokenization, you can make a chat bot!
– Try:
• nltk.chat.eliza.eliza_chat()
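To show how little is needed for a toy version, here is a sketch of a pattern-and-response chat loop in Python 3. It is not how NLTK's Eliza is written; the rules, the default reply, and the function names are all made up for illustration.

```python
import re

# A few hypothetical (pattern, response template) rules, tried in order
RULES = [
    (re.compile(r"i need (.*)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r".*\bmother\b.*", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT = "Please go on."

def respond(text):
    """Return the reply for the first rule whose pattern matches the input."""
    text = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Reuse the captured part of the user's sentence in the reply
            return template.format(*match.groups())
    return DEFAULT

def chat():
    """Keep replying until the user types 'quit'."""
    while True:
        line = input("> ")
        if line.strip().lower() == "quit":
            break
        print(respond(line))
```

Call chat() to try it interactively; respond("I need a vacation.") returns "Why do you need a vacation?".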
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Recovering the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo() (in current NLTK releases: nltk.app.rdparser())
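The "rd" in rdparser stands for recursive descent. As a rough illustration of that technique, the sketch below parses sentences against a toy grammar using only the standard library. The grammar, the lexicon, and the function names are invented for this example and are not taken from NLTK.

```python
# Toy grammar (hypothetical): each nonterminal maps to alternative expansions
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
# Word classes (hypothetical): each maps to the words it covers
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "cat"},
    "Name": {"marty"},
    "V": {"saw", "sleeps"},
}

def parse(symbol, tokens, pos=0):
    """Yield (tree, next_pos) for every way symbol can match tokens[pos:]."""
    if symbol in LEXICON:
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            yield (symbol, tokens[pos]), pos + 1
        return
    for expansion in GRAMMAR[symbol]:
        for children, end in parse_sequence(expansion, tokens, pos):
            yield (symbol, children), end

def parse_sequence(symbols, tokens, pos):
    """Yield (children, next_pos) for every way the symbols match in order."""
    if not symbols:
        yield [], pos
        return
    for tree, middle in parse(symbols[0], tokens, pos):
        for children, end in parse_sequence(symbols[1:], tokens, middle):
            yield [tree] + children, end

def parse_sentence(sentence):
    """Return all parse trees that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens) if end == len(tokens)]

print(parse_sentence("Marty saw the dog"))
print(parse_sentence("dog the saw"))   # [] because no parse exists
```

The toy grammar avoids left-recursive rules such as NP -> NP PP on purpose: a plain recursive descent parser would loop forever on them.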
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even with these tools, NLP still has many difficult open problems!
• Also saw:
– List methods
– dir()
– Tuples
