Python for NLP and the Natural
Language Toolkit
Introductions
 Anirudh K Menon
Software Engineer at IGate, working as part of the Big Data & Analytics team.
I am a computer science engineer with experience in web development and big data.
My e-mail : animenon@mail.com
 You?
Natural Language Processing
 Fundamental goal: deep understanding of broad language
 Not just string processing or keyword matching!
 End systems that we want to build:
 Ambitious: speech recognition, machine translation, question answering…
 Modest: spelling correction, text categorization…
Example: Machine Translation
NLP applications
 Text Categorization
 Spelling & Grammar Corrections
 Information Extraction
 Speech Recognition
 Information Retrieval
 Synonym Generation
 Summarization
 Machine Translation
 Question Answering
 Dialog Systems
 Language generation
Why NLP is difficult
 An NLP system needs to answer the question “who did what to whom”
 Language is ambiguous
 At all levels: lexical, phrase, semantic
 Iraqi Head Seeks Arms
 Word sense is ambiguous (head, arms)
 Stolen Painting Found by Tree
 Thematic role is ambiguous: is the tree the agent or the location?
 Ban on Nude Dancing on Governor’s Desk
 Syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?
 Hospitals Are Sued by 7 Foot Doctors
 Semantics is ambiguous: seven foot doctors (podiatrists), or doctors who are seven feet tall?
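The attachment ambiguity in the “Ban on Nude Dancing on Governor’s Desk” headline can be made concrete with a toy grammar. The sketch below is only an illustration, assuming NLTK is installed; the grammar and the shortened headline are invented for this example and are not part of the original slides. It uses nltk.CFG.fromstring and nltk.ChartParser to enumerate every parse the grammar allows.

```python
import nltk

# Toy grammar (an assumption for illustration): a noun phrase may be
# extended by any number of prepositional phrases.
grammar = nltk.CFG.fromstring("""
NP -> N | NP PP
PP -> P NP
N -> 'ban' | 'dancing' | 'desk'
P -> 'on'
""")

# Shortened version of the headline.
tokens = ["ban", "on", "dancing", "on", "desk"]

parser = nltk.ChartParser(grammar)
parses = list(parser.parse(tokens))

# One tree attaches "on desk" to "ban", the other attaches it to "dancing".
for tree in parses:
    print(tree)
```

Even this five-word phrase has more than one structure, and the number of readings grows quickly as more prepositional phrases are added.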
Why NLP is difficult
 Language is flexible
 New words, new meanings
 Different meanings in different contexts
 Language is subtle
 He arrived at the lecture
 He chuckled at the lecture
 He chuckled his way through the lecture
 *He arrived his way through the lecture (ungrammatical)
 Language is complex!
Corpus-based statistical approaches to
tackle NLP problems
 How can a machine understand these differences?
 Decorate the cake with the frosting
 Decorate the cake with the kids
 Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:
 The verb decorate requires an animate being as agent
 The object cake is made from inanimate entities (cream, dough,
frosting…)
 Such approaches have been shown to be time-consuming to build, to scale poorly,
and to be very brittle when faced with new, unusual, or metaphorical uses of language
 To swallow requires an animate being as agent/subject and a physical object as object
 I swallowed his story, or the actor swallowed his lines (the object is not physical)
 The supernova swallowed the planet (the agent is not animate)
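The brittleness of hand-coded rules can be seen in a few lines of plain Python. This is a minimal sketch, not how any real system encodes its rules: the word lists and the function rule_accepts_swallow are hypothetical names made up for this example.

```python
# Hand-coded selectional restrictions for the verb "swallow".
# The word lists are toy assumptions, not a real lexicon.
ANIMATE = {"I", "actor", "child"}
PHYSICAL_OBJECTS = {"pill", "planet", "sandwich"}


def rule_accepts_swallow(subject, obj):
    """Return True if the rule 'animate agent + physical object' holds."""
    return subject in ANIMATE and obj in PHYSICAL_OBJECTS


examples = [
    ("child", "pill"),        # literal use
    ("I", "story"),           # metaphorical object
    ("actor", "lines"),       # metaphorical object
    ("supernova", "planet"),  # inanimate agent
]

results = {pair: rule_accepts_swallow(*pair) for pair in examples}
for (subject, obj), accepted in results.items():
    verdict = "accepted" if accepted else "rejected"
    print(subject, "swallowed", obj, "->", verdict)
```

Only the literal sentence passes; every perfectly natural metaphorical use is rejected, and patching the rule for each new case is what makes this approach hard to scale.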
Corpus-based statistical approaches
to tackle NLP problems
 Feature extraction (usually linguistically motivated)
 Statistical models
 Data (corpora, labels, linguistic resources)
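The three ingredients above (features, a statistical model, labelled data) can be sketched with the “decorate the cake with …” example. This is a toy illustration, assuming NLTK is installed: the six labelled nouns and the role names "material" and "companion" are invented for the example, and a real system would learn from a large corpus with much richer features.

```python
import nltk


def features(noun):
    # Feature extraction: a single feature, the noun in the "with" phrase.
    return {"noun": noun}


# Data: toy labelled examples (invented for illustration).
labelled = [
    ("frosting", "material"),
    ("cream", "material"),
    ("sprinkles", "material"),
    ("kids", "companion"),
    ("friends", "companion"),
    ("grandma", "companion"),
]
train = [(features(noun), role) for noun, role in labelled]

# Statistical model: Naive Bayes estimated from the labelled examples.
classifier = nltk.NaiveBayesClassifier.train(train)

frosting_role = classifier.classify(features("frosting"))
kids_role = classifier.classify(features("kids"))
print("with the frosting ->", frosting_role)
print("with the kids ->", kids_role)
```

No rule about animacy was written by hand; the distinction comes entirely from the labelled examples.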
Intro to NLTK
 The NLTK is a set of Python modules to carry out many common natural
language tasks.
 NLTK defines an infrastructure that can be used to build NLP programs in
Python.
 It provides basic classes for representing data relevant to natural language
processing.
 NLTK runs on Windows, OS X, Unix and Linux. Detailed instructions are on the
Installation tab of the NLTK website.
 Install the package (any platform):
$ pip install --upgrade nltk
 Then download the corpora and models from a Python shell:
>>> import nltk
>>> nltk.download('all')
NLTK: Top-Level Organization
 NLTK is organized as a flat hierarchy of packages
and modules.
 Each module provides the tools necessary to
address a specific task
 Modules contain two types of classes:
 Data-oriented classes are used to represent information
relevant to natural language processing.
 Task-oriented classes encapsulate the resources and
methods needed to perform a specific task.
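The distinction between the two kinds of classes can be illustrated with two classes from current NLTK releases. Treating FreqDist as data-oriented and RegexpTokenizer as task-oriented is this example's reading of the slide, not an official NLTK classification.

```python
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer

# Task-oriented: a tokenizer bundles a pattern with the method that applies it.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("my dog likes his dog")

# Data-oriented: a frequency distribution stores counts of what it was given.
freq = FreqDist(tokens)

print(tokens)
print(freq["dog"])
```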
Modules
 The NLTK modules include (names as in early NLTK releases; the current
package name is given in parentheses where it differs):
 token (nltk.tokenize): classes for representing and processing individual elements of
text, such as words and sentences
 probability: classes for representing and processing probabilistic
information.
 tree: classes for representing and processing hierarchical information
over text.
 cfg (nltk.grammar): classes for representing and processing context-free grammars.
 tagger (nltk.tag): tagging each word with a part-of-speech, a sense, etc.
 parser (nltk.parse): building trees over text (includes chart, chunk and probabilistic
parsers)
 classifier (nltk.classify): classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
 draw: visualize NLP structures and processes
 corpus: access (tagged) corpus data
 We will cover some of these explicitly as we reach topics.
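As a small taste of one of these modules, the sketch below uses the Tree class from nltk.tree in current NLTK releases. The bracketed parse is written by hand for illustration; nothing here is produced by a parser.

```python
from nltk.tree import Tree

# Hand-written bracketed parse (an assumption for illustration).
tree = Tree.fromstring("(S (NP my dog) (VP likes (NP his dog)))")

print(tree.label())   # label of the root node
print(tree.leaves())  # the words, left to right

# Collect the words under every NP node.
noun_phrases = [t.leaves() for t in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```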
 Standard interfaces for performing tasks such as part-of-speech tagging,
syntactic parsing, and text classification.
 Standard implementations for each task can be combined to solve
complex problems.
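The idea of a shared interface with combinable implementations can be sketched with two taggers from nltk.tag, which both expose the same tag() method. The suffix patterns are toy assumptions chosen for this example and are far too crude for real part-of-speech tagging.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Fallback tagger: labels every token as a noun.
default_tagger = DefaultTagger("NN")

# Suffix rules (toy patterns, an assumption for illustration); tokens that
# match no pattern are passed on to the backoff tagger.
suffix_tagger = RegexpTagger(
    [(r".*ing$", "VBG"), (r".*s$", "NNS")],
    backoff=default_tagger,
)

tokens = ["running", "dogs", "dog"]
tagged_default = default_tagger.tag(tokens)
tagged_combined = suffix_tagger.tag(tokens)
print(tagged_default)
print(tagged_combined)
```

Because both taggers follow the same interface, one can serve as the backoff of the other without any glue code.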
Example
 The most basic natural language processing technique is tokenization.
 Tokenization means converting a text from a single string into a list of tokens.
Eg: Word tokenization by splitting on whitespace –
Input : “Hey there, How are you all?”
Output : “Hey”, “there,”, “How”, “are”, “you”, “all?”
Note that punctuation stays attached to the words here; NLTK's tokenizers
split it off into separate tokens.
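The example sentence can be tokenized in a Python session as follows. This is a minimal sketch, assuming NLTK is installed; it uses wordpunct_tokenize because that function needs no downloaded data, whereas the more commonly used word_tokenize relies on models fetched with nltk.download.

```python
from nltk.tokenize import wordpunct_tokenize

text = "Hey there, How are you all?"

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = text.split()

# wordpunct_tokenize separates runs of word characters from punctuation.
nltk_tokens = wordpunct_tokenize(text)

print(whitespace_tokens)
print(nltk_tokens)
```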
Tokens and Types
 The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
 For example, the sentence “my dog likes his dog”
contains five occurrences of words, but four vocabulary
items.
 To avoid confusion, use more precise terminology:
1. Word token: an occurrence of a word
2. Word type: a vocabulary item
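The token/type distinction for the example sentence can be checked in plain Python. This sketch splits on whitespace, which is enough here because the sentence has no punctuation.

```python
sentence = "my dog likes his dog"

tokens = sentence.split()  # every occurrence counts
types = set(tokens)        # each distinct word counts once

num_tokens = len(tokens)
num_types = len(types)
print(num_tokens, "tokens,", num_types, "types")
```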
Examples on Python shell
 Tokenization
 Sentence Detection
 Common Usages, etc.
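The session below is one possible version of such a demo, not the one given in the talk. NLTK's usual sentence detector is sent_tokenize, which needs the Punkt models from nltk.download; to stay runnable without downloads, this sketch substitutes a naive regular-expression splitter built with RegexpTokenizer, and the sample text is made up.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Hey there. How are you all? NLTK is fun!"

# Naive sentence splitter: break on whitespace that follows ., ? or !
# (a rough stand-in; it would wrongly split after abbreviations like "Dr.").
sentence_splitter = RegexpTokenizer(r"(?<=[.?!])\s+", gaps=True)
sentences = sentence_splitter.tokenize(text)

# Word-tokenize each detected sentence.
tokenized = [wordpunct_tokenize(sentence) for sentence in sentences]

for sentence, words in zip(sentences, tokenized):
    print(sentence, "->", words)
```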
References
1. CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
2. nltk.sourceforge.net/tutorial/introduction/index.html
3. Applied Natural Language Processing, Fall 2009, by Barbara Rosario
Thank You for your patient listening!
Contact : animenon@mail.com
