Named Entity Recognition (NER) with NLTK
Copyright @ 2019 Learntek. All Rights Reserved. 3
Named Entity Recognition with NLTK:
Natural language processing (NLP) is a sub-area of computer science, information
engineering, and artificial intelligence concerned with the interactions between
computers and human (natural) languages. In particular, it deals with how to program
computers to process and analyse large amounts of natural language data.
NLP = Computer Science + AI + Computational Linguistics
Put another way, natural language processing is the capability of computer software
to understand human language as it is spoken. NLP is one of the components of
artificial intelligence (AI).
About NLTK
• The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries
and programs for symbolic and statistical natural language processing (NLP) for
English, written in the Python programming language.
• It was developed by Steven Bird and Edward Loper in the Department of Computer
and Information Science at the University of Pennsylvania.
• It is a software package for manipulating linguistic data and performing NLP tasks.
Named Entity Recognition (NER)
Named Entity Recognition is used in many areas of Natural Language Processing
(NLP), and it can help answer many real-world questions.
Named entity recognition (NER) is usually the first step towards information
extraction. It seeks to locate and classify named entities in text into pre-defined
categories such as the names of persons, organizations, and locations, expressions
of time, quantities, monetary values, percentages, etc.
Information comes in many shapes and sizes.
One important form is structured data, where there is a regular and predictable
organization of entities and relationships.
For example, we might be interested in the relation between companies and
locations.
Given a company, we would like to be able to identify the locations where it does
business; conversely, given a location, we would like to discover which companies
do business in that location. If our data is in tabular form, then answering these
queries is straightforward.
Org Name    Location Name
TCS         PUNE
INFOCEPT    PUNE
WIPRO       PUNE
AMAZON      HYDERABAD
INTEL       HYDERABAD
If this location data were stored in Python as a list of (entity, relation, entity)
tuples, then the question "Which organizations operate in HYDERABAD?" could be
answered as follows:
>>> import nltk
>>> loc = [('TCS', 'IN', 'PUNE'),
...        ('INFOCEPT', 'IN', 'PUNE'),
...        ('WIPRO', 'IN', 'PUNE'),
...        ('AMAZON', 'IN', 'HYDERABAD'),
...        ('INTEL', 'IN', 'HYDERABAD'),
...        ]
>>> query = [e1 for (e1, rel, e2) in loc if e2 == 'HYDERABAD']
>>> print(query)
['AMAZON', 'INTEL']
>>> query = [e1 for (e1, rel, e2) in loc if e2 == 'PUNE']
>>> print(query)
['TCS', 'INFOCEPT', 'WIPRO']
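The reverse question ("Where does a given company operate?") can be answered from the same list of tuples. The sketch below is only illustrative: the helper names `locations_of` and `index_by_location` are made up for this example and are not part of NLTK. It also shows how the tuples could be grouped by location once, so that repeated queries do not rescan the list.

```python
from collections import defaultdict

# Same (entity, relation, entity) tuples as in the example above.
loc = [('TCS', 'IN', 'PUNE'),
       ('INFOCEPT', 'IN', 'PUNE'),
       ('WIPRO', 'IN', 'PUNE'),
       ('AMAZON', 'IN', 'HYDERABAD'),
       ('INTEL', 'IN', 'HYDERABAD')]


def locations_of(org, triples):
    """Return every location e2 such that (org, rel, e2) is in triples."""
    return [e2 for (e1, rel, e2) in triples if e1 == org]


def index_by_location(triples):
    """Group organization names by location for repeated lookups."""
    index = defaultdict(list)
    for e1, rel, e2 in triples:
        index[e2].append(e1)
    return dict(index)


print(locations_of('INTEL', loc))      # ['HYDERABAD']
by_location = index_by_location(loc)
print(by_location['PUNE'])             # ['TCS', 'INFOCEPT', 'WIPRO']
```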
Information Extraction has many applications, including business intelligence,
resume harvesting, media analysis, sentiment detection, patent search, and email
scanning. A particularly important area of current research involves the attempt to
extract structured data out of electronically available scientific literature, especially
in the domain of biology and medicine.
Information Extraction Architecture
The following figure shows the architecture of an information extraction system.
The above system takes the raw text of a document as its input and produces a list
of (entity, relation, entity) tuples as its output. For example, given a document that
indicates that the company INTEL is in HYDERABAD, it might generate the tuple
([ORG: 'INTEL'] 'in' [LOC: 'HYDERABAD']). The steps in the information extraction
system are as follows.
STEP 1: The raw text of the document is split into sentences using a sentence
segmenter.
STEP 2: Each sentence is further subdivided into words using a tokenizer.
STEP 3: Each sentence is tagged with part-of-speech tags, which will prove very
helpful in the next step, named entity detection.
STEP 4: We search for mentions of potentially interesting entities in each
sentence.
STEP 5: We use relation detection to search for likely relations between different
entities in the text.
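As a rough illustration of steps 4 and 5, the sketch below starts from a sentence that has already been tokenized and tagged by hand (steps 1 to 3 would normally use `nltk.sent_tokenize`, `nltk.word_tokenize` and `nltk.pos_tag`, which need downloaded models and are therefore skipped here). Treating any run of proper nouns as an entity, the chunk label `ENT`, and the helper `extract_in_relations` are all simplifying assumptions made for this example; this is not a real named entity detector or relation extractor.

```python
import nltk

# A hand-tagged sentence standing in for the output of steps 1 to 3.
tagged = [("INTEL", "NNP"), ("is", "VBZ"), ("in", "IN"), ("HYDERABAD", "NNP")]

# Step 4 (toy version): treat every run of proper nouns as an entity mention.
entity_chunker = nltk.RegexpParser("ENT: {<NNP>+}")
tree = entity_chunker.parse(tagged)


def extract_in_relations(tree):
    """Step 5 (toy version): emit (entity, 'IN', entity) when a preposition
    tagged IN appears between two consecutive entity chunks."""
    relations = []
    last_entity = None
    seen_preposition = False
    for node in tree:
        if isinstance(node, nltk.Tree) and node.label() == "ENT":
            name = " ".join(word for word, tag in node.leaves())
            if last_entity is not None and seen_preposition:
                relations.append((last_entity, "IN", name))
            last_entity = name
            seen_preposition = False
        elif not isinstance(node, nltk.Tree) and node[1] == "IN":
            # Only remember the preposition if an entity came before it.
            seen_preposition = last_entity is not None
    return relations


relations = extract_in_relations(tree)
print(relations)   # [('INTEL', 'IN', 'HYDERABAD')]
```

The output has the same (entity, relation, entity) shape as the location tuples used earlier, so it could be queried in the same way.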
Chunking
The basic technique that we use for entity detection is chunking, which segments
and labels multi-token sequences.
The following figure shows segmentation and labelling at both the token and chunk
levels. The smaller boxes show the word-level tokenization and part-of-speech
tagging, while the larger boxes show higher-level chunking. Each of these larger
boxes is called a chunk. Like tokenization, which omits whitespace, chunking
usually selects a subset of the tokens. Also like tokenization, the pieces
produced by a chunker do not overlap in the source text.
Noun Phrase Chunking
In noun phrase chunking, or NP-chunking, we search for chunks
corresponding to individual noun phrases. For example, here is some Wall Street
Journal text with NP-chunks marked using brackets:
[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [
Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT
giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB
there/RB ./.
NP-chunks are often smaller pieces than complete noun phrases.
One of the most useful sources of information for NP-chunking is part-of-speech
tags.
This is one of the motivations for performing part-of-speech tagging in our
information extraction system. We demonstrate this approach using an example
sentence. In order to create an NP-chunker, we first define a chunk grammar,
consisting of rules that indicate how sentences should be chunked. In this case, we
define a simple grammar with a single regular-expression rule. This rule says
that an NP chunk should be formed whenever the chunker finds an optional
determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).
Using this grammar, we create a chunk parser and test it on our example sentence.
The result is a tree, which we can either print or display graphically.
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # optional determiner, any adjectives, a noun
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>> result.draw()  # opens a window showing the tree graphically
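Besides printing or drawing the tree, the chunks can be pulled out of it programmatically. The following sketch rebuilds the same parser and collects the text of each NP chunk using `Tree.subtrees()` and `Tree.leaves()`; the variable name `noun_phrases` is chosen only for this example.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

# Same single-rule grammar as above.
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
result = cp.parse(sentence)

# Keep only the subtrees labelled NP and join their words back into strings.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)   # ['the little yellow dog', 'the cat']
```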
Chunking with Regular Expressions
To find the chunk structure for a given sentence, the RegexpParser chunker starts
with a flat structure in which no tokens are chunked. The chunking rules are applied
in turn, successively updating the chunk structure. Once all the rules have been
invoked, the resulting chunk structure is returned. The following simple chunk
grammar consists of two rules. The first rule matches an optional determiner or
possessive pronoun, zero or more adjectives, then a noun. The second rule matches
one or more proper nouns. We also define an example sentence to be chunked and run
the chunker on this input.
>>> import nltk
>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
...     {<NNP>+}                # sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
... ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
OUTPUT:
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
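A chunk tree can also be flattened into one tag per token, where B- marks the beginning of a chunk, I- a token inside a chunk, and O a token outside any chunk. The sketch below uses `nltk.chunk.tree2conlltags` on the same sentence; note the escaped `\$` in the grammar, which is needed because an unescaped `$` would be read as a regular-expression anchor rather than as part of the `PP$` tag.

```python
import nltk

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
    {<NNP>+}                # sequences of proper nouns
"""
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

tree = nltk.RegexpParser(grammar).parse(sentence)

# Convert the tree into (word, part-of-speech tag, chunk tag) triples.
iob_tags = nltk.chunk.tree2conlltags(tree)
for word, pos, chunk_tag in iob_tags:
    print(word, pos, chunk_tag)
```

Each printed line has the form `word POS chunk-tag`, for example `Rapunzel NNP B-NP`.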
chunk.conllstr2tree() Function:
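The conversion function chunk.conllstr2tree() builds a tree representation from a
multi-line string in which each line holds a word, its part-of-speech tag, and its
chunk tag (B- for the beginning of a chunk, I- for inside a chunk, O for outside).
Moreover, it permits us to choose any subset of the three chunk types (NP, VP, PP)
to use, here just NP chunks: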
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
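To see the effect of `chunk_types` without opening a drawing window, the tree can simply be printed. The sketch below uses a shortened version of the string above (the variable names are chosen for this example) and compares the NP-only tree with the default behaviour, which keeps NP, VP and PP chunks.

```python
import nltk

# A shortened version of the string above; lines must not be indented.
conll_text = """he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
"""

# Keep only NP chunks; tokens from VP and PP chunks stay in the tree unchunked.
np_only = nltk.chunk.conllstr2tree(conll_text, chunk_types=["NP"])
print(np_only)

# With the default chunk_types, NP, VP and PP chunks are all built.
all_types = nltk.chunk.conllstr2tree(conll_text)
print(all_types)

np_only_labels = [t.label() for t in np_only if isinstance(t, nltk.Tree)]
all_labels = [t.label() for t in all_types if isinstance(t, nltk.Tree)]
print(np_only_labels)   # ['NP', 'NP', 'NP']
print(all_labels)       # ['NP', 'VP', 'NP', 'PP', 'NP']
```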
For more training information, contact us:
Email : info@learntek.org
USA : +1734 418 2465
INDIA : +40 4018 1306
+7799713624

