Artificial Intelligence
Lecture 12
Python for Text Processing
Inteligencia artificial 12
text wrangling?
>>> inputstring = ' This is an example sent. The sentence splitter will split on sent markers. Ohh really !!'
>>> from nltk.tokenize import sent_tokenize
>>> all_sent = sent_tokenize(inputstring)
>>> print(all_sent)
[' This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !!']
>>> # The Punkt sentence tokenizer can also be instantiated directly
>>> import nltk.tokenize.punkt
>>> tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
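To make "splitting on sentence markers" concrete, here is a minimal sketch that uses only the standard library. It is a naive stand-in for illustration, not how Punkt works internally; the function name `naive_sent_split` and the splitting rule are assumptions made for this example.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the next sentence starts with an uppercase letter."""
    # The lookbehind keeps the terminator attached to its sentence; the
    # lookahead requires an uppercase letter right after the whitespace.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

sentences = naive_sent_split(
    "This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
)
print(sentences)

# Abbreviations break the naive rule, which is why a trained model is useful.
print(naive_sent_split("Dr. Smith arrived."))
```

The second call splits after "Dr.", which is the kind of mistake a trained tokenizer such as Punkt is designed to avoid.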
Stemming
Snowball stemmers are available for Dutch, English, French, German, Italian, Portuguese, Romanian, Russian, and other languages.
The Snowball stemmer, which is based on the Snowball stemming algorithm, can be used in NLTK like this:
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'
>>> snowball_stemmer.stem('multiply')
'multipli'
>>> snowball_stemmer.stem('provision')
'provis'
>>> snowball_stemmer.stem('owed')
'owe'
>>> snowball_stemmer.stem('ear')
'ear'
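For contrast, the sketch below strips a few suffixes with a hand-written rule. It is deliberately much cruder than Snowball and is not the Snowball algorithm; the suffix list, the `min_stem` parameter, and the name `naive_stem` are all assumptions made for illustration.

```python
# Hypothetical, hand-picked suffix list; Snowball uses far richer rules.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["watching", "owed", "presumably", "ear"]:
    print(w, "->", naive_stem(w))
```

With these rules "owed" is left unchanged and "presumably" becomes "presumab", whereas the Snowball stemmer above returns "owe" and "presum". The gap shows how much the extra rules in a real stemmer matter.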
Lemmatization
Stop words
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')  # pass the language name
>>> # NLTK ships stop word lists for multiple languages (22 at the time of these slides)
>>> text = "This is just a test"
>>> # Lowercase first: the entries in the stop word list are lowercase
>>> cleanwordlist = [word for word in text.lower().split() if word not in stoplist]
>>> cleanwordlist  # every word except 'test' is a stop word
['test']
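The filtering step itself does not depend on NLTK. The following sketch uses a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer) to show why case folding matters when comparing against a lowercase list.

```python
# Hypothetical mini stop list, written by hand for this example.
STOPLIST = {"this", "is", "just", "a", "the", "of"}

def remove_stopwords(text, stoplist=STOPLIST):
    """Keep only the words whose lowercase form is not in the stop list."""
    return [w for w in text.split() if w.lower() not in stoplist]

text = "This is just a test"
print(remove_stopwords(text))

# Without lowercasing, the capitalized "This" slips through the filter.
print([w for w in text.split() if w not in STOPLIST])
```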
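Rare words
>>> import nltk
>>> # tokens is a list of all tokens in the corpus
>>> freq_dist = nltk.FreqDist(tokens)
>>> # most_common() sorts by descending count, so the last 50 entries are the rarest
>>> rarewords = {word for word, count in freq_dist.most_common()[-50:]}
>>> after_rare_words = [word for word in tokens if word not in rarewords]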
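A variation on the same idea, sketched with `collections.Counter` from the standard library: instead of cutting off a fixed number of entries, drop every token that occurs fewer than a chosen number of times. The toy corpus and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus, made up for this example.
tokens = "the cat sat on the mat the cat ran".split()

counts = Counter(tokens)
min_count = 2  # hypothetical threshold: keep tokens seen at least twice

rarewords = {word for word, count in counts.items() if count < min_count}
after_rare_words = [word for word in tokens if word not in rarewords]

print(sorted(rarewords))
print(after_rare_words)
```

A count threshold avoids the arbitrary tie-breaking that happens when many words share the same low frequency and only some of them fall inside the last-N slice.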
What is part-of-speech tagging?
>>> import nltk
>>> from nltk import word_tokenize
>>> s = "I was watching TV"
>>> print(nltk.pos_tag(word_tokenize(s)))
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
>>> from nltk.corpus import brown
>>> import nltk
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> # most_common(8) lists the eight most frequent tags with their counts
>>> print(nltk.FreqDist(tags).most_common(8))
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392)]
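Tagged text is a list of (word, tag) pairs, so counting and filtering tags needs nothing beyond plain Python. The sketch below works on a short hand-tagged sentence; the sentence and its tags were written by hand for illustration and are not tagger output.

```python
from collections import Counter

# Hand-tagged toy sentence (an assumption for this example).
tagged = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
]

# Count how often each tag occurs, as FreqDist does for the Brown corpus.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts)

# Select words by tag, for example all nouns.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```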
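# Hierarchical clustering on petal length and width (columns 3 and 4 of iris)
clusters <- hclust(dist(iris[, 3:4]))  # default linkage is "complete"
plot(clusters)                         # draw the dendrogram
clusterCut <- cutree(clusters, 3)      # cut the tree into 3 clusters
# Same data with average linkage
clusters <- hclust(dist(iris[, 3:4]), method = 'average')
plot(clusters)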
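The idea behind `hclust` followed by `cutree` can be sketched in a few lines of plain Python: start with one cluster per point and repeatedly merge the two closest clusters until the desired number remains. This is a minimal, unoptimized illustration of complete-linkage agglomerative clustering, not R's implementation; the function names and the toy points are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

def complete_linkage(points, a, b):
    """Largest pairwise distance between two clusters (lists of point indices)."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        _, x, y = min(
            (complete_linkage(points, clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x].extend(clusters[y])
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy (petal length, petal width) style points, made up for illustration.
points = [(1.4, 0.2), (1.5, 0.2), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5), (5.9, 2.3)]
print(agglomerate(points, 3))
```

Replacing `max` with a mean over the pairwise distances in `complete_linkage` would give average linkage, the variant selected by `method = 'average'` above.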
K-Means
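library(ggplot2)
# Large translucent points show the true species; the overlaid points show the cluster assignment
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point(alpha = 0.4, size = 3.5) +
  geom_point(col = clusterCut) +
  scale_color_manual(values = c('black', 'red', 'green'))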
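The plot compares cluster assignments with the true species visually. The same comparison can be made numerically with a cross-tabulation, sketched here in plain Python. The two label lists are made up for illustration; in the R session they would correspond to `iris$Species` and `clusterCut`.

```python
from collections import Counter

# Made-up labels for illustration only.
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
cluster = [1, 1, 2, 2, 2, 3]

# Count how many points fall into each (cluster, species) combination.
crosstab = Counter(zip(cluster, species))
for (c, s), n in sorted(crosstab.items()):
    print(f"cluster {c}  {s:<10}  {n}")
```

A cluster that contains more than one species, such as cluster 2 in this toy data, marks where the clustering disagrees with the true labels.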
