Technical Development Workshop - Text Analytics with Python

Text Analytics
with Python
TD Workshop 2
Nhi Nguyen & Michelle Purnama

Pre-Workshop Checklist
⬡ 1. This is pretty obvious... but do you have your laptop with
you? If you don’t, perhaps go grab it?
⬡ 2. Did you download Anaconda?
⬡ 3. Do you have access to the TD WS 2 Shared Folder?
⬡ 4. If you said “no” to question 2 or 3 → go to the AIS website for
instructions!

AIS Upcoming Events
⬡ No Speaker Series, Next Monday, 10/21
⬡ EY Office Visit - Next Thursday, 10/24, 9:00AM - 12:00PM
∙ Find the signup in the newsletter
⬡ PD Meeting: Friday, 10/25, 12:00 – 12:50
∙ Talking Tech with Ilya Rogov

Hello!
I am Michelle Purnama
I hope you’re all excited to learn
Python with us! Don’t be scared -
this Python won’t bite :)

1.
What is Python?
Python 101 starts now!

Python
⬡ Python is an interpreted, high-
level, general-purpose
programming language
⬡ It supports the use of modules
and packages
⬡ Code can be reused in a variety
of projects by importing and
exporting these modules
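A minimal sketch of what importing a module looks like in practice. The two modules used here (math and collections) are our own picks from Python's standard library for illustration; they are not part of the workshop files.

```python
# Import a whole module, then use something from it with module.name
import math

# Import just one name from a module
from collections import Counter

circle_area = math.pi * 2 ** 2        # area of a circle with radius 2
letter_counts = Counter("anaconda")   # count how often each letter appears

print(round(circle_area, 2))
print(letter_counts.most_common(1))   # the single most frequent letter
```

The same import lines work in any project, which is what makes code in modules reusable.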
This Python?
Or this Python?

2. Anaconda &
Jupyter Notebook
What are they again?

Anaconda
⬡ Free and open-source
distribution of Python and R
programming languages that
aims to simplify package
management & deployment
⬡ In this workshop, we are using
Anaconda to install Python and
Jupyter Notebook

Jupyter Notebook
⬡ Open-source web application that
allows you to create & share
documents that contain live code,
equations, visualizations and narrative
text
⬡ Powerful way to iterate on our Python
code by writing lines of code and
running them one at a time

Text Analytics -
Main Phases
Let’s start coding!

Text Analytics & NLP
⬡ Most text generated day to day is unstructured
⬡ NLP - Natural Language Processing
⬡ NLP enables computers to interact with humans in a
natural manner
⬡ Example: analyzing movie reviews

Text Analytics Operations using NLTK
⬡ NLTK - Natural Language Toolkit
⬡ Python package that provides a diverse set of
natural language algorithms
⬡ Free, open source, easy to use, well documented
⬡ Helps computers analyze, preprocess, and understand
written text

Phase 1: Tokenization
Phase 2: Stopwords Removal
Phase 3: Lexicon Normalization
Phase 4: Understand POS Tag
Phase 5: Sentiment Analysis

Tokenization
⬡ First step in text analytics
⬡ The process of breaking down a text
paragraph into smaller chunks such as
words or sentences
⬡ Token - a single entity that serves as a
building block of a sentence or paragraph
⬡ nltk.tokenize - a module inside the NLTK
package

Sentence & Word Tokenization
Sentence Tokenization
⬡ Breaks text paragraph
into sentences
⬡ Import sent_tokenize
Word Tokenization
⬡ Breaks text paragraph
into words
⬡ Import word_tokenize
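To make the two kinds of tokenization concrete, here is a dependency-free sketch that uses only Python's re module. The helper names split_sentences and split_words are hypothetical stand-ins written for this example. NLTK's sent_tokenize and word_tokenize are more robust (for instance with abbreviations and contractions), so their output may differ from this simplified version.

```python
import re

text = "Python is fun. This Python won't bite! Are you ready?"

def split_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def split_words(paragraph):
    """Return words (keeping inner apostrophes) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", paragraph)

sentences = split_sentences(text)      # paragraph -> list of sentences
words = split_words(sentences[0])      # first sentence -> list of tokens

print(sentences)
print(words)
```

Each item in either list is a token: a sentence in the first case, a word or punctuation mark in the second.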

Frequency Distribution
⬡ Frequency of occurrence
of each word in a text
⬡ Import FreqDist from
nltk.probability module
⬡ Import matplotlib
package to plot the
chart
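The idea behind a frequency distribution can be sketched with the standard library's Counter, used here as a simple stand-in for FreqDist. The token list is made up for illustration, and the text bar chart replaces the matplotlib plot so the sketch runs without extra packages.

```python
from collections import Counter

# Made-up tokens, as if they came out of word tokenization
tokens = ["boo", "ghost", "boo", "pumpkin", "ghost", "boo"]

# Count how many times each word occurs
freq = Counter(tokens)

# Print a tiny text bar chart, most frequent word first
for word, count in freq.most_common():
    print(f"{word:<10}{'#' * count}")
```

In the workshop notebook, the same counts would come from FreqDist and the bars from a matplotlib chart.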

Do It
Yourself!
Choose any story from the
Funny Halloween Stories link
and plot a frequency
distribution using Python! Boo!

Phase 2: Stopwords Removal

Stopwords
⬡ Noise in the text
⬡ Examples: is, am, are, this, a, an, the
⬡ We need to create a list of stopwords and filter those
words out of our list of tokens
Wow, that’s a mouthful
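In code the filtering step is short. This sketch assumes a hand-written stopword set built from the examples on the slide and a made-up token list; NLTK also ships a longer English list in its stopwords corpus.

```python
# Stopwords taken from the examples above (a real list would be longer)
stop_words = {"is", "am", "are", "this", "a", "an", "the"}

# Made-up tokens, as if they came out of word tokenization
tokens = ["This", "is", "a", "spooky", "story", "about", "the", "ghost"]

# Keep only the tokens that are not stopwords (compare in lowercase)
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]

print(filtered_tokens)
```

Lowercasing before the comparison matters: without it, "This" would slip through because the stopword list only contains "this".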

Do It
Yourself!
Use the same story you picked
in Phase 1 and remove the
stopwords from that text. Let’s
do it!

Phase 3: Lexicon Normalization

Lexicon Normalization
⬡ Reduces derivationally related forms of a word to a
common root word
⬡ For example, the words connection, connected, and connecting
reduce to the common word “connect”

Stemming & Lemmatization
Stemming
⬡ Reduces words to their
root word by chopping off
derivational affixes
⬡ Does not take the word’s
context into account
Lemmatization
⬡ More sophisticated
⬡ Reduces words to their
base word - linguistically
correct lemmas
⬡ Considers the context of the
word
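The difference between the two can be sketched with toy code. Everything below is hypothetical and written only for illustration: toy_stem is a naive suffix chopper (not the Porter algorithm), and TOY_LEMMAS is a five-entry dictionary standing in for a real lexical database. NLTK provides real implementations such as PorterStemmer and WordNetLemmatizer.

```python
def toy_stem(word):
    """Chop off the first matching suffix; knows nothing about context."""
    for suffix in ("ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("flies", "noun"): "fly",
    ("flies", "verb"): "fly",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
    ("better", "adjective"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word used as a given part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

stems = [toy_stem(w) for w in ["connection", "connected", "connecting"]]
print(stems)                                 # all three share one stem
print(toy_stem("flies"))                     # a stem need not be a real word
print(toy_lemmatize("meeting", "noun"))      # lemma depends on how the word is used
print(toy_lemmatize("meeting", "verb"))
```

The stemmer gets "connect" right but turns "flies" into a non-word, while the lemmatizer returns a real word and gives different answers for "meeting" depending on its part of speech.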

POS Tagging
⬡ Part-of-Speech (POS) tagging looks to identify the
grammatical group of a given word based on the
context
⬡ For example: noun, pronoun, adjective, verb, adverb,
etc.
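To show what "based on the context" means, here is a toy tagger. The lexicon, the tag names, and the single context rule are all assumptions made up for this sketch; NLTK's pos_tag function uses a trained model instead of hand-written rules.

```python
# Hypothetical mini-lexicon: word -> tag
LEXICON = {"we": "PRON", "they": "PRON", "the": "DET", "a": "DET",
           "scary": "ADJ", "read": "VERB"}

# Words that can be a noun or a verb depending on context
AMBIGUOUS = {"book"}

def toy_pos_tag(tokens):
    """Tag each token, using the previous tag to resolve ambiguous words."""
    tagged = []
    for token in tokens:
        word = token.lower()
        previous_tag = tagged[-1][1] if tagged else None
        if word in AMBIGUOUS:
            # After a determiner or adjective, treat it as a noun; otherwise a verb
            tag = "NOUN" if previous_tag in ("DET", "ADJ") else "VERB"
        else:
            tag = LEXICON.get(word, "NOUN")  # unknown words default to noun
        tagged.append((token, tag))
    return tagged

tagged = toy_pos_tag(["We", "book", "the", "scary", "book"])
print(tagged)
```

The word "book" appears twice and gets two different tags, because the words around it differ.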

Do It
Yourself!
Choose a sentence from the
Halloween Story and apply
POS tags to the tokenized
sentence!

Phase 5: Sentiment Analysis

Text Classification
⬡ Important task in text mining
⬡ Identifying the category/class of a given text, such as a blog,
book, web page, or tweet
⬡ Various applications: spam detection, classifying
website content for a search engine, analyzing sentiment in
customer feedback, etc.

Sentiment Analysis
⬡ Quantifying users’ content, ideas, beliefs, and opinions
⬡ Combination of words, tone, and writing
style
⬡ Analyzes user messages and classifies
underlying sentiment as positive, negative,
or neutral
⬡ Two approaches:
∙ Lexicon-based
∙ Machine learning-based approach
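The lexicon-based approach can be sketched in a few lines. The two word lists and the function name lexicon_sentiment are made up for this example; real sentiment lexicons are far larger and also handle things like negation, which this sketch ignores.

```python
import re

# Hypothetical mini-lexicons of opinion words
POSITIVE = {"good", "great", "love", "fun", "excellent"}
NEGATIVE = {"bad", "boring", "hate", "terrible", "awful"}

def lexicon_sentiment(text):
    """Classify text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this movie, it is great fun"))
print(lexicon_sentiment("The plot was boring and the acting was terrible"))
print(lexicon_sentiment("I watched it yesterday"))
```

A machine learning-based approach would instead learn which words signal which sentiment from labeled examples.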

Dataset - sentimentanalysis.tsv

What We’ve Learned Today...
⬡ Break down paragraphs into smaller chunks
⬡ Remove punctuation and stopwords to eliminate noise
⬡ Use Stemming & Lemmatization to reduce words to their
base words
⬡ Understand Part-of-Speech tagging
⬡ Create simple graphs in Python
⬡ Scratch the surface of Sentiment Analysis!

Phase 1: Tokenization
Phase 2: Stopwords Removal
Phase 3: Lexicon Normalization
Phase 4: Understand POS Tag
Phase 5: Sentiment Analysis

5.
Extra Resources
More Python?

Additional Learning Resources
⬡ To read more about Text Analysis
∙ https://monkeylearn.com/text-analysis/
⬡ More advanced Text Analysis tutorial
∙ https://www.dataquest.io/blog/tutorial-text-analysis-python-test-hypothesis
⬡ Bootcamp course on Python
∙ https://www.udemy.com/course/complete-python-bootcamp/

Thanks for coming!
http://bit.ly/TD-SAT2
Suitable Code Exit Code: Nychella
