Text Analytics
with Python
TD Workshop 2
Nhi Nguyen & Michelle Purnama
Pre-Workshop Checklist
⬡ 1. This is pretty obvious… but do you have your laptop with
you? If you don’t… perhaps go grab it?
⬡ 2. Did you download Anaconda?
⬡ 3. Do you have access to the TD WS 2 Shared Folder?
⬡ 4. If you said “no” to question 2 or 3 → go to the AIS website for
the instructions!
AIS Upcoming Events
⬡ No Speaker Series, Next Monday, 10/21
⬡ EY Office Visit - Next Thursday, 10/24, 9:00AM - 12:00PM
∙ Find the signup in the newsletter
⬡ PD Meeting: Friday, 10/25, 12:00 – 12:50
∙ Talking Tech with Ilya Rogov
Hello!
I am Michelle Purnama
I hope you’re all excited to learn
Python with us! Don’t be scared -
this Python won’t bite :)
1.
What is Python?
Python 101 starts now!
Python
⬡ Python is an interpreted, high-
level, general-purpose
programming language
⬡ It supports the use of modules
and packages
⬡ Code can be reused in a variety
of projects by importing and
exporting these modules
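As a small illustration of reusing code through imports, the sketch below borrows two tools from Python's standard library instead of writing them by hand. The sample sentence and variable names are made up for the example.

```python
# Reuse existing code by importing it instead of rewriting it.
from collections import Counter   # import one name from a module
import statistics                 # import a whole module

words = "the cat saw the other cat".split()

word_counts = Counter(words)                             # count each distinct word
average_length = statistics.mean(len(w) for w in words)  # mean word length

print(word_counts["cat"])
print(average_length)
```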
This Python?
Or this Python?
Python Packages
2. Anaconda &
Jupyter Notebook
What are they again?
Anaconda
⬡ Free and open-source
distribution of Python and R
programming languages that
aims to simplify package
management & deployment
⬡ In this workshop, we are using
Anaconda to install Python and
Jupyter Notebook
Jupyter Notebook
⬡ Open-source web application that
allows you to create & share
documents that contain live code,
equations, visualization and narrative
text
⬡ Powerful way to write and iterate on our Python
code: write lines of code and run them one
at a time
Text Analytics -
Main Phases
Let’s start coding!
Text Analytics & NLP
⬡ Most text generated day to day is unstructured
⬡ NLP - Natural Language Processing
⬡ NLP enables computers to interact with humans in a
natural manner
⬡ Example: analyzing movie reviews
Text Analytics Operations using NLTK
⬡ NLTK - Natural Language Toolkit
⬡ Python package that provides a diverse set of
natural language algorithms
⬡ Free, open source, easy to use, well documented
⬡ Helps computers analyze, preprocess, and understand
written text
Phase 1: Tokenization
Phase 2: Stop words Removal
Phase 3: Lexicon Normalization
Phase 4: Understand POS Tag
Phase 5: Sentiment Analysis
Phase 1: Tokenization
Tokenization
⬡ First step in text analytics
⬡ The process of breaking down a text
paragraph into smaller chunks such as
words or sentences
⬡ Token - a single entity that is a building
block of a sentence or paragraph
⬡ nltk.tokenize - a module inside the NLTK
package
Sentence & Word Tokenization
Sentence Tokenization
⬡ Breaks a text paragraph
into sentences
⬡ Import sent_tokenize
Word Tokenization
⬡ Breaks a text paragraph
into words
⬡ Import word_tokenize
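To make the two kinds of tokenization concrete, here is a minimal sketch that uses only Python's built-in re module, so it runs without NLTK or its tokenizer data. The regular expressions and the sample text are assumptions made for illustration. They are not how NLTK tokenizes, and NLTK's results can differ (for example in how it splits contractions). The workshop's own imports are shown in the closing comment.

```python
import re

text = "Python won't bite. It is easy to learn! Are you ready?"

# Sentence tokenization: split after ., ! or ? when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of letters, digits, or apostrophes are words;
# any other non-space character (punctuation) becomes its own token.
words = re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

print(sentences)
print(words)

# With NLTK and its tokenizer data installed, the workshop imports are:
#   from nltk.tokenize import sent_tokenize, word_tokenize
#   sent_tokenize(text)
#   word_tokenize(text)
```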
Frequency Distribution
⬡ Frequency of occurrence
of each word in a text
⬡ Import FreqDist from
nltk.probability module
⬡ Import matplotlib
package to plot the
chart
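A frequency distribution is essentially a word-count table. The sketch below builds one with collections.Counter from the standard library, which FreqDist in nltk.probability extends, so it runs without NLTK. The token list is invented, and a text-only bar chart stands in for the matplotlib plot.

```python
from collections import Counter

tokens = ["boo", "said", "the", "ghost", "boo", "said", "the", "cat", "boo"]

# Count how often each token occurs.
freq = Counter(tokens)

# The most frequent tokens first, as (token, count) pairs.
top_two = freq.most_common(2)
print(top_two)

# A text-only bar chart, standing in for the matplotlib plot.
for token, count in freq.most_common():
    print(f"{token:<6}{'#' * count}")
```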
Do It
Yourself!
Choose any story from the
Funny Halloween Stories link
and plot a frequency
distribution using Python! Boo!
Phase 2: Stop
words Removal
Stopwords
⬡ Noise in the text
⬡ Examples: is, am, are, this, a, an, the
⬡ We need to create a list of stopwords and filter these
words out of our list of tokens
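The filtering step can be sketched in a few lines. The stopword set below is just the examples from the slide, and the token list is invented. NLTK ships a much longer English list (nltk.corpus.stopwords, after downloading its "stopwords" corpus), which this sketch does not use.

```python
# A short hand-written stopword list (illustrative, not NLTK's list).
stop_words = {"is", "am", "are", "this", "a", "an", "the"}

tokens = ["This", "is", "a", "spooky", "story", "about", "the", "ghost"]

# Compare in lower case so that "This" matches "this".
filtered = [t for t in tokens if t.lower() not in stop_words]

print(filtered)
```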
Wow,
that’s a
mouthful
Do It
Yourself!
Use the same story you picked
in Phase 1 and remove the
stopwords from that text. Let’s
do it!
Phase 3: Lexicon
Normalization
Lexicon Normalization
⬡ Reduces derivationally related forms of a word to a
common root word
⬡ For example, the words connection, connected, and connecting
all reduce to the common word “connect”
Stemming & Lemmatization
Stemming
⬡ Reduces words to their
root word / chops off the
derivational affixes
⬡ Works on a single word
without knowledge of its
context
Lemmatization
⬡ More sophisticated than
stemming
⬡ Reduces words to their
base word - linguistically
correct lemmas
⬡ Considers the context of
the word
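The difference between the two can be seen in a toy sketch. crude_stem is a deliberately simple suffix chopper, not the Porter algorithm behind NLTK's PorterStemmer. lookup_lemma uses a tiny hand-made dictionary, whereas NLTK's WordNetLemmatizer consults WordNet. All names and word lists here are hypothetical.

```python
def crude_stem(word):
    """Chop a few common suffixes; no dictionary, no context."""
    for suffix in ("ion", "ing", "ed", "s"):
        # Keep at least three letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A tiny hand-made lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "adjective"): "good",
    ("was", "verb"): "be",
    ("mice", "noun"): "mouse",
}

def lookup_lemma(word, pos):
    """Return the dictionary form if known, else the word itself."""
    return LEMMAS.get((word, pos), word)

stems = [crude_stem(w) for w in ["connection", "connected", "connecting"]]
print(stems)

# Stemming leaves "better" untouched; the dictionary look-up finds "good".
print(crude_stem("better"), lookup_lemma("better", "adjective"))
```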
Phase 4: POS Tag
POS Tagging
⬡ Part-of-Speech (POS) tagging looks to identify the
grammatical group of a given word based on the
context
⬡ For example: noun, pronoun, adjective, verb, adverb,
etc.
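To show what "based on the context" means, here is a toy tagger in which the word "book" is tagged as a verb after a pronoun and as a noun otherwise. The lexicon, the tag names, and the single context rule are all hypothetical. NLTK's nltk.pos_tag uses a trained model and Penn Treebank tags instead, and it needs its tagger data downloaded first.

```python
# Hypothetical mini-lexicon: each word lists the tags it can take.
LEXICON = {
    "i": ["PRONOUN"], "we": ["PRONOUN"],
    "the": ["DETERMINER"], "a": ["DETERMINER"],
    "scary": ["ADJECTIVE"],
    "book": ["NOUN", "VERB"], "read": ["VERB"], "flight": ["NOUN"],
}

def tag(tokens):
    """Tag each token, using the previous tag to settle ambiguous words."""
    tagged = []
    previous = None
    for token in tokens:
        options = LEXICON.get(token.lower(), ["NOUN"])  # unknown word -> NOUN
        if len(options) > 1 and previous == "PRONOUN" and "VERB" in options:
            choice = "VERB"      # "I book ..." -> verb after a pronoun
        else:
            choice = options[0]  # otherwise take the first listed tag
        tagged.append((token, choice))
        previous = choice
    return tagged

print(tag(["I", "book", "a", "flight"]))
print(tag(["We", "read", "the", "scary", "book"]))
```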
Do It
Yourself!
Choose a sentence from the
Halloween Story and apply
POS tags to the tokenized
sentence!
Phase 5: Sentiment
Analysis
Text Classification
⬡ Important task in text mining
⬡ Identifying the category/class of a given text, such as a blog,
book, web page, or tweet
⬡ Various applications: spam detection, classifying
website content for a search engine, sentiments of
customer feedback, etc.
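One common way to build such a classifier is a bag-of-words model with Naive Bayes. The sketch below implements it from scratch with the standard library, as an illustration rather than the workshop's own code. The four training texts are invented, and only two classes are used to keep it short.

```python
import math
from collections import Counter, defaultdict

# Invented, pre-labeled training texts (illustration only).
training = [
    ("loved this movie great acting", "positive"),
    ("great fun and a great cast", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("terrible movie what a waste", "negative"),
]

# Bag-of-words: per class, count how often each word appears.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest Naive Bayes log-score."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # Log prior: share of training texts carrying this label.
        score = math.log(class_counts[label] / len(training))
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("great acting"))
print(classify("terrible boring waste"))
```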
Sentiment Analysis
⬡ Quantifying users’ content, ideas, beliefs, and opinions
⬡ Sentiment is a combination of words, tone, and writing
style
⬡ Analyzes user messages and classifies
underlying sentiment as positive, negative,
or neutral
⬡ Two approaches:
∙ Lexicon-based
∙ Machine learning-based approach
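The lexicon-based approach can be sketched as a simple word count: tally the positive and negative words, and let the larger tally decide. The two word lists below are hypothetical and far smaller than any real sentiment lexicon.

```python
# Hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "fun", "excellent"}
NEGATIVE = {"bad", "boring", "hate", "terrible", "awful"}

def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = text.lower().split()
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # a tie, or no lexicon words at all

print(lexicon_sentiment("great cast and a fun story"))
print(lexicon_sentiment("boring plot and terrible pacing"))
print(lexicon_sentiment("the movie starts at nine"))
```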
Dataset - sentimentanalysis.tsv
What We’ve Learned Today
⬡ Break down paragraphs into smaller chunks
⬡ Remove punctuation and stopwords to eliminate noise
⬡ Use Stemming & Lemmatization to reduce words to their
base words
⬡ Understand Part-of-Speech tagging
⬡ Create simple graphs in Python
⬡ Scratch a bit of the surface of Sentiment Text Analysis!
5.
Extra Resources
More Python?
Additional Learning Resources
⬡ To read more about Text Analysis
∙ https://monkeylearn.com/text-analysis/
⬡ More advanced Text Analysis tutorial
∙ https://www.dataquest.io/blog/tutorial-text-analysis-python-test-hypothesis
⬡ Bootcamp course on Python
∙ https://www.udemy.com/course/complete-python-bootcamp/
Thanks for coming!
http://bit.ly/TD-SAT2
Suitable Code Exit Code: Nychella

Technical Development Workshop - Text Analytics with Python

Editor's Notes

  • #2: Nhi
  • #5: Michelle
  • #6: M
  • #7: Python is an interpreted, high-level, general-purpose programming language. Python supports the use of modules and packages, which means that programs can be designed in a modular style and code can be reused across a variety of projects. Once you've developed a module or package you need, it can be scaled for use in other projects, and it's easy to import or export these modules.
  • #10: What is Anaconda? Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. In this workshop, we will use Anaconda to install Python and Jupyter Notebook, since Anaconda also includes other commonly used packages for scientific computing and data science (and in this case, for text analytics!).
  • #11: M What is Jupyter Notebook? The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Jupyter Notebooks are a powerful way to write and iterate on your Python code for data analysis. Rather than writing and re-writing an entire program, you can write lines of code and run them one at a time.
  • #12: N
  • #13: NLP enables the computer to interact with humans in a natural manner. It helps the computer to understand human language and derive meaning from it. Analyzing movie reviews is one of the classic examples for demonstrating a simple NLP bag-of-words model.
  • #14: N NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is free, open source, easy to use, well documented, and backed by a large community. NLTK helps the computer to analyze, preprocess, and understand written text. Going back to the phase slide, NLTK includes the most common algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.
  • #16: M
  • #17: Talk about package —> module —> class (draw venn diagram on white board maybe?)
  • #19: # Frequency Distribution Plot
      import matplotlib.pyplot as plt
      # fdist is assumed to be the FreqDist built in the previous step
      fdist.plot(30, cumulative=False)
      plt.show()
    pyplot tutorial: https://matplotlib.org/tutorials/introductory/pyplot.html#pyplot-tutorial
  • #22: M
  • #24: N
  • #25: Lexicon normalization addresses another type of noise in the text. It reduces derivationally related forms of a word to a common root word. For example, the words connection, connected, and connecting all reduce to the common word "connect".
  • #26: Stemming is a process of linguistic normalization that reduces words to their root word, or chops off the derivational affixes. Lemmatization is more sophisticated than stemming: it reduces words to their base word, which is a linguistically correct lemma. A lemma is a word that stands at the head of a definition in a dictionary; all the head words in a dictionary are lemmas. Technically, it is "a base word and its inflections". A stemmer works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming misses this because it requires a dictionary look-up.
  • #27: N
  • #28: The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on the context. POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word. List of POS tags: https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
  • #31: Text classification is one of the important tasks of text mining: identifying the category or class of a given text, such as a blog, book, web page, news article, or tweet. It has various applications in today's computer world, such as spam detection, task categorization in CRM services, categorizing products on e-retailer websites, classifying the content of websites for a search engine, and analyzing the sentiment of customer feedback.
  • #33: What do users and the general public think about the latest feature? You can quantify such information with reasonable accuracy using sentiment analysis. Quantifying users' content, ideas, beliefs, and opinions is known as sentiment analysis. Human communication is not limited to words; sentiment is a combination of words, tone, and writing style. There are two approaches. Lexicon-based: count the positive and negative words in a given text; the larger count determines the sentiment of the text. Machine learning-based: develop a classification model trained on a pre-labeled dataset of positive, negative, and neutral examples. In this tutorial, you will use the second approach (machine learning-based). This is how you learn sentiment and text classification with a single example.
  • #35: Break down paragraphs into smaller chunks like sentences or words. Remove punctuation and stopwords to increase the accuracy of our analysis. Use Stemming or Lemmatization to reduce words to their base words. Understand Part-of-Speech tagging. Create simple graphs in Python. Scratch a bit of the surface of Sentiment Text Analysis!