Text Mining_big_data_machine_learning.pptx

HOW THIS COURSE IS
DELIVERED
Text mining is basically data mining on unstructured text. It’s an
implementation of the many techniques explained in previous sessions.
Combining the basic theory by focusing on implementation using Python.
The basic explanation of the theory and algorithm is not explained using in
depth mathematical and statistical approach, to make it easier for
participants who do not have statistics or mathematics background

AGENDA
Definition of text mining and common application
Text mining and NLP preprocessing technique
How to extract important information from text
Understand how text is handled in Python

CHAPTER 1
INTRODUCTION
BIG DATA ANALYTICS CE

TEXT MINING
DEFINITION
Broad umbrella terms describing a range of technologies for analyzing and processing
semi structured and unstructured text data.
The unifying theme behind each of these technologies is the need to “turn unstructured text
into structured data” so powerful algorithms can be applied to large document
databases.
Converting text into a structured, numerical format and applying analytical algorithms
require knowing how to both use and combine techniques for handling text, ranging from
individual words to documents to entire document databases
In building a statistical language system, it is best to devise a model that can make good
use of available data, even if the model seems overly simplistic.

TEXT MINING
PURPOSE
Turn text data into high-quality information and/or actionable knowledge
 Minimizes human effort on consuming text data
 Supplies knowledge for optimal decision making actionable knowledge
→

TEXT MINING
APPLICATIONS
Summary: search for the most important information from a text.
Chat Bot: an automatic question and answer application between machines and humans.
Text categorization: determine the topic of a particular document
Keyword Tag: keyword tags are selected keywords that represent text.
Sentiment Analysis: determine the sentiment or value of opinion in a text, these sentiments can be
negative, neutral, or positive sentiments.
Speech-to-text and text-to-speech conversions: Turns sound into text and vice versa
Translator Machine: Translation of text from one language to another
Spelling Checker

HUMANS AS SUBJECTIVE
SENSORS
Real World
Weather
Locations
Network
Sensor
Thermometer
Geo Sensor
Network Sensor
3o
C, 15o
F, ...
41o
N 120o
W ...
10110101010101
Sense Report
Real World Human Sensor
Perceive Express
Data
Text Data

TEXT MINING
LANDSCAPE
Real World
Perceive Express
2. Mining knowledge
about language : word
mining and association
3. Mining content of
text data : topic mining
and analysis
4. Mining knowledge about
the observer : opinion mining
& sentiment analysis
5. Infer other real world
variables : predictive
analysis
1. NLP and text
representation
Text Data

NLP
DEFINITION
NLP is a research area of computer science, artificial intelligence, and computational
linguistics, concerned with the interactions between computers and human natural
languages.
Helps computers understand, interpret, and manipulate human language. Not only
understand the word, but also how these words are interconnected into a meaningful
information

NLP
BASIC CONCEPTS
a dog is chasing a boy on the playground
det noun aux verb det noun prep det noun
Noun Phrase
Complex Verb Noun Phrase Noun Phrase
Verb Phrase Prep Phrase
Verb Phrase
Sentence
Lexical
Analysis (POS
Tagging)
Syntactic
Analysis
(Parsing)
A person saying this may
be reminding another
person to get the dog back.
Pragmatic Analysis
(speech act)
Semantic Analysis :
Dog(d1).
Boy(b1).
Playground(p1).
Chasing(d1,b1,p1)

NLP
CHALLENGES
Language is ambiguous need context to explain
→
 The same word can mean something else (homograph)
 Bank - Sloping land (especially the slope beside a body of water)
 Bank - A financial institution that accepts deposits and channels the money into lending activities
 Different words means the same (synonyms)
Human errors - misspellings, typos, abbreviations, social languages, etc.
Each language is different in terms of structure, vocabulary, etc
Special requirements: regulation and privacy related to legal, diplomatic,
medical

NLP
IMPLEMENTATION PACKAGES
Apache OpenNLP: Machine learning toolkit that provides tokenization, sentence
segmentation, part of speech tagging, named entity extraction, chunking, parsing,
coreference resolution, and so on.
Natural Language Toolkit (NLTK): the most popular python library for NLP, consisting of:
classification, tokenization, stemming, tagging, parsing, and others.
Stanford NLP: used for part of speech tagging, named entity recognizer, coreference
resolution, sentiment analysis, etc.
MALLET: is a JAVA package consists of Latent Dirichlet Allocation, document classification,
clustering, topic modeling, information extraction, and others.
etc.

LAB PREPARATION
PYTON LIBRARY REQUIRED
In this training we will use below python library
nltk: the most popular NLP library in the python ecosystem
beautifulsoup4: library for extracting data from HTML and XML documents
pandas: library for data manipulation and analysis
scikit-learn: python machine learning library
matplotlib: library for 2-dimensional plotting
sastrawi: stemmer for Indonesian Language, ported from PHP
gensim: python library focused on analyzing plain-text documents for semantic structure

LAB 01
INSTALLATION AND CONFIG

LAB DESCRIPTION
What you will learn:
Anaconda installation
Create and manage anaconda
environment
Install python and the required
packages
Run Jupyter Notebook
Requirement :
Anaconda
Jupyter Notebook
Python 3 library
 numpy
 pandas
 scikit-learn
 nltk
 matplotlib
 beautifulsoup4
 sastrawi

STEP 01
INSTALL ANACONDA
Install Anaconda according to your OS.
Installer can be downloaded at
www.anaconda.com/download

STEP 02
CREATE ENVIRONMENT
Conda environment is basically a certain directory that contains all the packages we
install. We can have several conda environments with different package versions. For
example we need to run python 2 in one environment, and python 3 in another.
By default, Anaconda creates one main environment called base (root).
There are two ways to create and manage the environment: by using Anaconda Navigator
(Graphical User Interface) or Anaconda prompt (Command Line Interface)
To create an environment through the GUI, run Anaconda Navigator

STEP 02
CREATE ENVIRONMENT
1. Choose Environment, and click Create
2. Name the environment, for example training1
3. Choose python version

STEP 03
INSTALL PACKAGE
1. In the drop down list, select Not
Installed, then check the packages that
we will install
2. Click Apply button

STEP 03
INSTALL PACKAGE
Sastrawi package installed through CLI, by using pip
Follow this steps :
 Open Anaconda Prompt
 Activate environment with command activate <environment name>
 Install Sastrawi by using command pip install Sastrawi

STEP 04
RUN JUPYTER NOTEBOOK
Install Jupyter Notebook through
Anaconda Navigator in Home
menu, and click Launch

STEP 04
RUN JUPYTER NOTEBOOK
Jupyter notebook will be opened as a tab in
your browser
You can create new folder or notebook by
clicking New

STEP 05
CHECK PACKAGE VERSION
We can check the version of all package
installed in our current environment, with
the following code :
import pkg_resources
dists = [d for d in pkg_resources.working_set] for i in
dists:
print(i)
Click Run to execute

CHAPTER 02
TEXT MINING WORKFLOW

COMMON
TEXT MINING WORKFLOW
UNSTRUCTURED DATA: TEXT MINING
Feature
Extraction
Feature extraction
and selection. In
the process might
include
exploration and
visualization to
find the suitable
features.
Collection
Obtaining or
preparing corpus.
Corpus can be
emails, web
pages, social
media contents,
wikipedia, journal
collection,
documents, etc.
Preprocessing
Common tasks in
data preprocessing
are tokenization and
segmentation,
normalization and
noise removal.
Model
Building
Build, train and
test the model.
Model
Evaluation
Evaluate model
performance. The
metrics can vary
depends on the
type of model
used and the NLP
task performed.

CHAPTER 03
TEXT PREPROCESSING

TEXT PREPROCESSING
COMMON TASKS
Some of the most common tasks in the preprocessing stage are:
Tokenization
Noise Removal
 Stop Words Removal
Normalization
 Stemming & Lemmatization
 Object Standardization

TOKENIZATION
The process of cutting text into smaller units, called tokens.
Tokens can be words, keywords, phrases, symbols, or even sentences.
Challenges in tokenization depends on the type of language.
Languages such as English is referred to as space-delimited as most of the words are
separated from each other by white spaces.
Languages such as Chinese are referred to as unsegmented as words do not have clear
boundaries. Tokenizing unsegmented language sentences requires additional lexical and
morphological information.

NOISE REMOVAL
Every piece of text that is not relevant to the context and final output can be considered as noise. For example:
Stopword: commonly used words from a language - for example: are, me, from, in, etc.
URL, link, tag
social media entities (mention, hashtag)
punctuation
etc.
Approach:
Preparing a noise dictionary, and iterate each token (word), eliminating tokens that appear in the dictionary.
For example: stop word list.
Using regular expression (regex)
Combination of both

LAB 02
TOKENIZATION &
CLEANSING

LAB DESCRIPTION
Tokenization and noise removal using nltk.
In this lab, elements considered as noise are non alphabetical token, such as numbers and
punctuations.

STEP 01
TOKENIZATION AND CLEANSING
Type the following code into your
notebook and click Run
import nltk
import re
def
tokenize_clean(te
xt):
#tokenizatio
n and change
to lowercase
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for
word in nltk.word_tokenize(sent)]
#clean token from number and non alphabetical character such as
punctuation, etc.
filtered_tokens = []
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
return filtered_tokens
word_1 = tokenize_clean("Budi dan Badu
bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka
berjumpa di persimpangan jalan?")

STOP WORDS REMOVAL
Stop words are words in a sentence that are considered unimportant, which if omitted will
not change the meaning or value of the sentence.
In most cases, stop word needs to be removed / cleaned so that the results of the analysis
are more accurate.
Stop words are usually words that don’t have meaning on its own, for example conjunctions
such as 'and', 'then', 'or', etc.
Stop words depend on the language and domain of the problem to be resolved. There is
no universal criteria in determining stop words.

LAB 03
STOP WORDS REMOVAL

LAB DESCRIPTION
Show stop words list in nltk
Remove stop words from text

STEP 01
SHOW STOP WORDS LIST
Type the following code and click Run
Import nltk
stopwords =
nltk.corpus.stopwords.words('indonesian')
stopwords

STEP 02
STOP WORDS REMOVAL
Type the following code into
your notebook and click Run
import nltk import
re
def
tokenize_clean(text)
:
[..script seperti pada labs 01]
#clean stop words
stopwords = nltk.corpus.stopwords.words('indonesian')
cleaned_token = []
for token in filtered_tokens:
if token not in stopwords:
cleaned_token.append(token)
return cleaned_token
word_1 = tokenize_clean("Budi dan Badu bermain
bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa
di persimpangan jalan?")
print(word_1)
print(word_2)

NORMALIZATION
A series of tasks to process text into a form with certain standards
The aim is to improve the quality of the text so that the next process can perform better
For example: changing all letters to lowercase, changing numbers into letters, stemming,
changing abbreviations into their original words, etc.
Normalization tries to make the same token/word represented in the same form, so that
the next process can run better
Important processes in text normalization are stemming and lemmatization

STEMMING &
LEMMATIZATION
Stemming and Lemmatization is the process of changing the word into a common base
form (stem).
Stemming is done by cutting off the end or the beginning of the word, taking into account
a list of common prefixes and suffixes in an inflected word. For example:
 studying study
→
 studies studi
→

STEMMING &
LEMMATIZATION
Lemmatization takes into consideration the morphological analysis of the words. To do so, it
is necessary to have detailed dictionaries which the algorithm can look through to link the
form back to its lemma. For example:
 studying study
→
 studies study
→
Lemmatization is usually done for languages that have a change of word form, for
example in English: go-went-gone, etc. In bahasa Indonesia, stemming and lemmatization
are usually considered the same process.

LAB 04
STEMMING &
LEMMATIZATION
BIG DATA ANALYTICS CERTIFICATION

LAB DESCRIPTION
Using Sastrawi to do stemming and lemmatization

STEP 01
INSERT CELL
Insert new cell by choosing Insert Insert Cell
→

STEP 02
STEMMING CODE
import nltk
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
def tokenize_and_stem(text): [..script
seperti pada labs 01] [..script
seperti pada labs 02]
#stem using Sastrawi StemmerFactory factory
= StemmerFactory()
stemmer = factory.create_stemmer()
stems = []
for token in cleaned_token:
stems.append(stemmer.stem(token))
return stems
word_1 = tokenize_and_stem("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_and_stem("Apakah Romi dan Julia saling mencintai saat mereka berjumpa
di persimpangan jalan?")
print(word_1)
print(word_2)

CHAPTER 04
FEATURE EXTRACTION

FEATURE EXTRACTION
The process of converting text into a set of features prior to analysis
The type of features depend on the model and what method will be used in the
mining/machine learning process
Some feature engineering techniques in NLP:
 Syntactic parsing
 Entity parsing
 Vectorization

FEATURE EXTRA
SYNTACTIC PARSING
Syntactic parsing is the process of determining sentence structure based on a certain
grammar and lexicon.
The structure of the sentence includes word level, word class level, phrase level, element
level, and clausal level.
Some important attributes of text syntax are :
 Dependency Grammar and
 Part of Speech Tags.

SYNTACTIC PARSING
PART OF SPEECH TAGGING
POS (Part-of-Speech) Tags are a way of categorizing word classes, such as nouns, verbs,
adjectives, etc.
POS Tagger is an application that is capable of automatically performing POS tag
annotations for each word in a document.
POS tagging produces a list of tuples, where each tuple is in the form of (words, tags)
pairs.
Tags are labels that indicate whether a word is a noun, adjective, verb, and so on.
There are several POS tagset or POS tag naming methods, the most popular tagset is Penn
TreeBank tagset.

PART OF SPEECH
TAGGING SIMPLE EXAMPLE
Paul Pogba scored a late penalty
Paul Pogba
scored a late penalty
sentence
noun phrase
verb phrase
scored
verb
a
article
late
adjective
penalty
noun
Paul Pogba
named entity

PART OF SPEECH TAGGING
USAGE
POS tagging is usually done before the chunking process, or phrases extraction from a
sentence.
POS Tagging is also used for sentence structure analysis and word sense disambiguation.
For example:
 can - We can help you
 can - It kept in a can
By knowing the word class in a sentence, it’s easier to determine its meaning.

ENTITY PARSING
NAMED ENTITY RECOGNITION
The task of identifying the names of all the people, organizations and geographic
locations in a text, as well as time, currency and percentage expressions
Build knowledge from text, by extracting information such as
 Names (people, organizations, locations, objects, etc.)
 Temporal expression (calendar dates, times of day, durations, etc.)
 Numerical expressions (money, percentage, etc.)
Knowledge base built with NER is widely used in technologies such as smart assistants,
machine translation, indexing in information retrieval, classification, automatic
summarization, etc.

NAMED ENTITY RECOGNITION
METHODS
Rule Based
 Using a data dictionary consisting of the name of the country, city, company, etc
 Uses predefined language dependent rules based on linguistics which helps in the identification of named
entities in a document.
 Constraints: requires the ability to define the rules that are usually carried out by linguists and have a large
dependence on the language used.
Machine Learning
 Using statistical classification models and machine learning algorithms
 Constraint : require annotated corpora for the domain of interest. The construction of the annotated corpora
for a new domain is a time-consuming task and requires effort by the human experts to produce it.
Hybrid
 Combine both methods by taking advantage of each method used.

VECTORIZATION
Transforming text into representations that are 'understood' by machines, which is numeric
vector (or array), so that they can be used by various analytics and machine learning
algorithms
There are 2 types of text vectorization, which are:
 Bag of words (or bag of n-grams), represents words as a discrete element of a vector (or
array) element of a bag
→
 Word embeddings : represent (or embed) words in a continuous vector space in which words
with similar meanings are mapped closer to each other. New words in application texts that
were missing in training texts can still be classified through similar words.

VECTORIZATION
METHODS
Type Vectorization Method Function Considerations
Bag of Words
Frequency Counts term
frequencies
Most frequent words not
always most informative
One-Hot Encoding Binarizes term
occurrence (0, 1)
All words equidistant, so
normalization extra
important
TF–IDF Normalizes term
frequencies across
documents
Moderately frequent terms
may not be representative of
document topics
Word
Embeddings
Distributed
Representations
Context-based,
continuous term
similarity encoding
Performance intensive;
difficult to scale without
additional tools (e.g.,
Tensorflow)

BAG OF WORDS
DEFINITION
Text representation that indicate the appearance of a token / word in a document.
Called a bag because it does not care about the structure or sequence in the text. *
The main components of BoW are:
 Vocabulary or a collection of known words based on text input
 A measure of the presence of the known words
The complexity of BoW techniques depends on how to build the vocabulary and the
scoring method

BAG OF WORDS
SCORING METHOD
Binary: one-hot encoding vector
Counts: count the number of times a word appears in each document
Frequency: number of occurrences of words in a document to the total number of words in
the document (count / total words)
TF-IDF: frequency and relevance of words in a corpus

BAG OF WORDS
EXAMPLE
It was the best of times,
it was the worst of times,
it was the age of wisdom
The vocabulary:
{“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”}
If we treat each sentence as separate document, the BoW vectors are:
“it was the best of times” [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

ONE-HOT ENCODING
DEFINITION
A representation of categorical variables as binary vectors.
Each integer value is represented as a binary vector that is all zero values except the
index of the integer, which is marked with a 1.
For example, if we have words {boy, chase, dog, playground} :
boy {1, 0, 0, 0}
chase {0, 1, 0, 0}
dog {0, 0, 1, 0}
playground {0, 0, 0, 1}

TF-IDF
TF-IDF : Term Frequency - Inverse Document Frequency
A way to determine the topic of a document based on the words or terms in the document
TF-IDF calculates relevance, not just frequency
The weight calculation in TF-IDF uses a statistical method, which evaluates how important a
term is to a document
The greater the TF-IDF value of a word or term, the rarer the word, the more relevant a
word to a document

TF-IDF
USAGE
What is the use of TF-IDF?
 Categorize text, automatically create tags or keywords for a document.
 Determine the order of documents in search results (document relevance to a term)
 Fix / add stop-word lists
Difference between TF-IDF and sentiment analysis?
 Sentiment analysis classifies text based on 'positive', 'negative' or 'neutral' opinion values.
 TF-IDF classifies text based on its contents.

TF-IDF
TERM FREQUENCY
Term Frequency (TF) calculates the frequency of occurrence of a word or term (T) in a document
(D).
There’s a possibility that a word has a greater occurrence value on a different document, because
the length of the document is different
TF calculation formula:
TF (t) = (Number of occurrences of the word t) / (Total number of words)

TF-IDF
INVERSE DOCUMENT FREQUENCY
IDF calculates how important a word or term and document is
It is known that certain terms, such as "are", "from", and "that", may appear many times but does not have a
large influence. Therefore it is necessary to reduce the weight for those words and increase the weight for the
rare ones, with the following calculations:
IDF (t) = log_e (Number of documents / Number of documents containing the term t)

TF-IDF
EXAMPLE
Suppose a document has 100 words with the appearance of the word cat 3 times
TF for cats is:
3/100 = 0.03
If there are 10 million documents and the word cat appears in 1000 documents, then the IDF:
log (10,000,000 / 1,000) = 4
The TF-IDF weight for the word cat is
0.03 x 4 = 0.12

TF-IDF
UPDATE DAN MAINTENANCE
In most cases, the processed document grow continuously, so the value of tf-idf needs to be
updated to include the new documents.
However, TF-IDF is calculated against a certain corpus, so the tf-idf matrix cannot be
updated incrementally.
Several approaches can be taken to overcome this, including:
 Perform tf-idf calculations when needed. If there is a new document, the terms in the documents
are calculated for the tf-idf value
 Update regularly, when new documents reach a certain amount / time, the drawback is that
there may be terms that will be ignored because they are not yet included in the vocabulary

ISSUES IN BOW
Vocabulary: requires good design, especially for managing size because it will affect the
sparsity of document representation
Sparsity: The vector formed is a sparse vector, which is a vector with majority elements null
or 0. This sparse representation is more difficult and less efficient to model, both in terms
of computational (storage complexity and computation time) as well as information (model
a little information in a very large space)
Meaning: eliminating the word order results in the loss of context and meaning of words in
the text (semantics). Context and meaning are very useful in modeling, for instance to
distinguish different meaning of words due to different arrangement, to determine
synonyms, and so on

LAB 05
TF-IDF

LAB DESCRIPTION
Run TF-IDF function
Requirement :
tokenize_and_stem function from previous labs

STEP 01
DATA INPUT
Create dataset
from sklearn.feature_extraction.text import TfidfVectorizer
#we will use dummy document for input, with 1 sentence per document
files = []
files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam
bambu yang ia genggam di tangan.")
files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu saja untuk meratapi
erupsi.")
files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang
berwisata.")
files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi
sendiri dan orang lain.")
files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen siap membantu
dengan siaga 24 jam")

STEP 02
CORPUS PREPARATION
#prepare corpus, load it into dictionary
token_dict = {}
i = 0
for t in files:
filename = "file" + str(i)
token_dict[filename] = t
i = i + 1
#use stop words bahasa indonesia from nltk corpus
#perform tf-idf vectorization, use tokenize_and_stem we create in previous lab
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words=stopwords)
tfs = tfidf.fit_transform(token_dict.values())

STEP 03
TF-IDF TRANSFORMATION
We test by using a new sentence, what are the tokens produced and what is the tf-idf
value
Show how many token produced dan tf-idf value
str1 = 'Di kejauhan tampak seorang relawan pria dari Lombok sedang berjalan.'
response = tfidf.transform([str1])
#show result
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
print (feature_names[col], ' - ', response[0, col])

WORD EMBEDDINGS
DEFINITION
A distributed word representation, which is dense, low-dimensional, and real-valued
representation of word. A word representation is a mathematical object associated with
each word, often a vector.

WORD EMBEDDINGS
WHY
Word embeddings overcomes BoW problems
 Represents words in dense real number vectors
 Includes sentence context, determines the meaning of words by looking at the context
 Includes information on the similarity of words in their representation: similar words are
represented by similar vectors
Vector values are learned using neural networks, so this word embedding method is often
associated with deep learning
Popular word embedding algorithms: Word2Vec, GloVe

WORD EMBEDDINGS
WORD2VEC
Word2Vec is a neural network with 2 layers, text as input and vectors as output
Developed by Mikolov et. al. at Google in 2013
Determine the meaning of a word by using other words around it (its context)
When a word w appears in a text, the context of w is the words before and after w (usually in a
specified window size)
For example:
This context is called local context
…Menlu yang menghadiri dan membuka konferensi Afro-Asia mengharapkan kerjasama yang baik...
...bahwa tema yang diusung dalam konferensi tahun ini adalah penguatan Ekonomi...
...Wagub membuka Seminar Nasional dan Konferensi Daerah Ikatan Apoteker Indonesia…

WOR2VEC
WORD SIMILARITY

WORD2VEC
ADVANTAGE
The key benefit of the approach is that high-quality word embeddings can be learned
efficiently (low space and time complexity), allowing larger embeddings to be learned
(more dimensions) from much larger corpora of text (billions of words).

GOOGLE WORD2VEC
Pre-trained model
300 dimensional vector
3 million words and phrases
Dataset : Google News (300 billion words)
Further info:
https://code.google.com/archive/p/word2vec/

OTHER WORD
EMBEDDING
GloVe
FastText
LDA2Vec
StarSpace
Poincare embeddings

CHAPTER 05
OPINION MINING &
SENTIMENT ANALYSIS

DEFINITION
Information types in text : facts and opinions.
 Facts : objective expressions about something.
 Opinions : subjective expressions that describe people’s sentiments, appraisals, and feelings
toward a subject or topic.
Sentiment analysis : analysis process to obtain subjective information of a topic.

SENTIMENT ANALYSIS
USE CASE EXAMPLES
Opinions in the social and geopolitical context
Business and e-commerce applications, such as product reviews and movie ratings
Predicting stock prices based on people opinion about the companies and resources
Determine areas of product that need to be improved by summarizing product reviews
Customer preference

OPINION
REPRESENTATION
Opinion holder: Whose opinion is this?
Opinion target: What is this opinion about? e.g., a product, a service, an individual, an
organization, an event, or a topic also called entity. An entity can have many feature
→
(aspect).
Opinion context: Under what situation (e.g., time, location) was the opinion expressed?
Opinion sentiment: What does the opinion tell us about the opinion holder’s feeling ?
Positive, negative and neutral are called opinion orientation (also called sentiment
orientation or polarity)
Liu (2012) formulated the formal definition : An opinion is a quadruple, ( g, s, h, t), where
g is the opinion (or sentiment) target, s is the sentiment about the target, h is the opinion
holder, and t is the time when the opinion was expressed.

SENTIMENT ANALYSIS
LEVEL
Document-level Sentiment Analysis : determine whether a whole document, message, etc, is
overall positive or negative
Sentence-level Sentiment Analysis : determine the sentiment of each sentence within the
document
Aspect or Topic based Sentiment Analysis : identify not only positive or negative sentence,
but also the specific topic/feature that is being referred as positive or negative. There
may be more than 1 aspects in a sentence :
 e.g : I love the display of the new phone but the battery life is terrible.

SENTIMENT ANALYSIS
PROCESS
Opinion Mining
 Entity extraction and categorization
 Aspect extraction and categorization
 Opinion holder extraction and categorization
 Time extraction and standardization
 Sentiment classification
 Opinion quadruple generation: Produce all opinion (g, s, h, t) expressed in
document d based on the results of the above tasks.
Opinion Summarization
 Opinions are subjective. An opinion from a single person (unless a VIP) is often not
sufficient for action. We need opinions from many people, and thus the need for
opinion summarization.

DOCUMENT SENTIMENT
CLASSIFICATION
TECHNIQUES
Supervised learning : any existing supervised learning methods can be applied; e.g.
Bayesian classifications, Support Vector Machine, etc.
Unsupervised learning : using opinion words and phrases. Liu (1992) explain the algorithm
which contains 3 steps:
 Extract phrase containing adjective or adverbs
 Estimate the semantic orientation/polarity
 Given a review, the algorithm computes the average opinion orientation of all phrases in the
review, and classies the review as recommended if the average is positive, not recommended
otherwise.

SENTIMENT ANALYSIS
IMPORTANT FEATURES
Terms and their frequency : Individual words or n-grams and their frequency counts. Word
positions may also be considered. The TF-IDF weighting scheme may be applied too. These
features have been shown quite effective in sentiment classification
Part of speech : adjectives may be treated as special features
Opinion words and phrases : words that are commonly used to express positive or
negative sentiments. For example, beautiful, wonderful, good are positive opinion words,
and bad, poor, and terrible are negative opinion words.
Negations : important because their appearances often change the opinion orientation
Syntactic dependency : word dependency based features generated from parsing or
dependency trees

SENTIMENT ANALYSIS -
CHALLENGES
A positive or negative sentiment word may have opposite orientations in different
application domains.
A sentence containing sentiment words may not express any sentiment. Question sentences
and conditional sentences are two important types, e.g., “Can you tell me which camera is
good?” and “If I can find a good camera in the shop, I will buy it.”
Not all conditional or interrogative sentences express no sentiments, e.g., “Does anyone
know how to repair this terrible printer” and “If you are looking for a good car, get
Toyota.”
Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a
great car! It stopped working in two days.”
Many sentences without sentiment words can also imply opinions. e.g. “This washer uses a
lot of water” implies a negative sentiment.

CHAPTER 06
TOPIC MODELLING
BIG DATA ANALYTICS CERTIFICATION

TOPIC MODELLING
Topic modeling is an unsupervised machine learning way to organize text information such
that related pieces of text can be identified.
Topic Modelling is basically a document clustering where documents and words are
clustered simultaneously
Topic modelling problem :
 Known : Text/document collections (corpus) and The number of topics
 Unknown : The actual topics and topic distribution in each document
Topic modelling used in:
 Discovering hidden topical patterns that are present across the collection
 Annotating documents according to these topics
 Using these annotations to organize, search and summarize texts

TOPIC MODELLING
Basic assumptions:
 A document consists of a mixture of topics
 A topic is a collection of words
Topic = latent semantic concepts
Different Approaches
 Latent Semantic Analysis/Indexing (LSA/LSI) linear algebra
→
 Probabilistic Latent Semantic Analysis (PLSA) probabilistics
→
 Latent Dirichlet Allocation (LDA) probabilistics
→

LATENT SEMANTIC
ANALYSIS
Decomposing documents-words matrix into documents-topics and topics-words by using
Singular Value Decomposition (SVD)
Given m documents and n words in our vocabulary, we can construct an m-by-n matrix A
sparse word-document co-occurrence matrix
→
 Simplest form of LSA uses raw count, where ai-j is the number of times the j-th word appeared
in the i-th document
 More advanced LSA often uses TF-IDF to for ai-j value
SVD decompose matrix A into 3 matrices where:
 A is an m × n matrix
 U is an m × n orthogonal matrix
 S is an n × n diagonal matrix
 V is an n × n orthogonal matrix

LATENT SEMANTIC
ANALYSIS
Since A most likely sparse, we need to perform dimensionality reduction using truncated
SVD
This will keep the t most significant dimensions in the transformed space.
LSA is quick and efficient, but has some shortcomings:
 Lack of interpretable embeddings
 Need for really large set of documents and vocabulary to get accurate results
 Less efficient representation

PROBABILISTIC LATENT
SEMANTIC ANALYSIS
PLSA uses probabilistic method instead of SVD
The basic idea : find probabilistic model P(D,W) such that for any document d and word w, P(d,w)
corresponds to that entry in the document-term matrix.
PLSA assumptions:
 given a document d, topic z is present in that document with probability P(z|d)
 given a topic z, word w is drawn from z with probability P(w|z)
As its name implies, PLSA just adds a probabilistic treatment of topics and words on top of LSA.

PLSA
LIMITATIONS
PLSA is more flexible than LSA, but still has some limitations :
 The number of parameters grows linearly with the size of training documents The model is
→
prone to overfitting
 Not a well-defined generative model - no way of generalizing to new, unseen documents

LATENT DIRICHLET
ALLOCATION
LDA is a Bayesian version of pLSA. It uses dirichlet priors for the document-topic and
word-topic distributions, leading to better generalization.
Dirichlet : a probability distribution but it is not sampling from the space of real numbers.
Instead it is sampling over a probability simplex.
Probability simplex : a group of numbers that add up to 1. For example:
 (0.6, 0.4)
 (0.1, 0.1, 0.8)
 (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
The numbers represent probabilities over K distinct categories. In the above examples, K is
2, 3, and 6 respectively.

LATENT DIRICHLET
ALLOCATION MODEL
From a dirichlet distribution Dir( ), draw a random sample representing the topic
α
distribution of a particular document.
θ
From , we select a particular topic Z based on the distribution.
θ
From another dirichlet distribution Dir( ), select a random sample representing the word
𝛽
distribution of the topic Z. From , we choose the word w.
φ φ
LDA typically works better than pLSA because it can generalize to new documents easily.
Some limitations:
Needs relatively large memory and processing time.
The model is difficult to explain

CHAPTER 07
WRAPPING IT ALL TOGETHER

TEXT CLUSTERING
PROCESS FLOW
We will demonstrate the end-to-end process by performing document clustering.
The process flow that will be used are as follow:
 Text preprocessing, including text cleanup and text normalization
 Vector Representation / Feature Extraction : using TF-IDF
 Building model : using K-Means
 Visualization
 Model evaluation

LAB 08
TEXT CLUSTERING

LAB DESCRIPTION
How to create a document clustering program by using real dataset
Implement tokenization, stemming and cleansing
K-Means implementation
Visualization by using matplotlib

STEP 01
LIBRARY
Import all required library and click Run
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.externals import joblib
from sklearn.manifold import MDS

STEP 02
DATA INPUT
Type the following code and click Run
Show sample data
#load titles
titles = open('Judul Berita.txt').read().split('n')
#load articles
article = open('Berita.txt', encoding="utf8").read().split('BERHENTI
DISINI')
len(titles) titles[:5]
article[:5] len(article)

STEP 03
PARSING ARTICLES
Parsing articles from html format using beautifulsoup package
article_clean = [] for text in article:
text = BeautifulSoup(text, 'html.parser').getText()
article_clean.append(text)
article = article_clean
print(article)

STEP 04
TOKENIZATION DAN STEMMING
Do tokenization, stemming and cleansing, like in the Lab 03
def tokenize_and_stem(text):
#tokenization and change to lowercase
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word
in nltk.word_tokenize(sent)]
#clean token from number and non alphabetical character such as
punctuation, etc.
filtered_tokens = []
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
#clean stop words
cleaned_token = []
for token in filtered_tokens:
if token not in stopwords:
cleaned_token.append(token)
...

STEP 05
TOKENIZATION DAN STEMMING (CONT.)
Do tokenization, stemming and cleansing, like in the Lab 03 (cont.)
Show sample data
...
#stem using Sastrawi StemmerFactory
factory = StemmerFactory()
stemmer = factory.create_stemmer()
stems = [stemmer.stem(t) for t in cleaned_token]
return stems
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index =
totalvocab_stemmed)
print('ada ' + str(vocab_frame.shape[0]) + ' kata di vocab_frame')
print(vocab_frame.head())

STEP 06
TF-IDF
Calculate TF-IDF matrix
Show matrix
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
max_features=200000,min_df=0.2,
use_idf=True, tokenizer=tokenize_and_stem,
ngram_range=(1,3))
#fit the vectorizer to article
tfidf_matrix = tfidf_vectorizer.fit_transform(article)
print(tfidf_matrix.shap)
print(tfidf_matrix)

STEP 07
K-MEANS MODELLING
Do K-Means Modeling, in this case we use the number of clusters = 3
Create DataFrame with the format: sequence - title - cluster
num_clusters = 3
km = KMeans(n_clusters=num_clusters, random_state=1000)
km.fit(tfidf_matrix)
#urutan
ranks = [i for i in range(1, len(titles)+1)]
#cluster with k-means
clusters = km.labels_.tolist()
news = { 'title': titles, 'rank': ranks, 'article': article, 'cluster': clusters }
frame = pd.DataFrame(news, index = [clusters] , columns = ['rank', 'title', 'cluster'])
#show dataframe
print(frame)
frame['cluster'].value_counts()

STEP 08
DATA EXPLORATION
Displays the results of clustering and top term per cluster to determine the label
print("Top terms per cluster:")
#sort cluster centers based on its proximity to its centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
print("Cluster %d words:" % i, end='')
for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0]
[0].encode('utf-8', 'ignore'), end=',')
print() #add whitespace
print("Cluster %d titles:" % i, end='')
for title in frame.ix[i]['title'].values.tolist():
print(' %s,' % title, end='')
print()

STEP 09
VISUALIZATION
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
mds = MDS(n_components=2, dissimilarity="precomputed",
random_state=1) mds.fit_transform(similarity_distance)
# shape (n_components, n_samples)
xs, ys = pos[:, 0], pos[:, 1]
#set color with dictionary
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}
#dictionary for cluster name (chart legend)
cluster_names = {0: 'Olahraga', 1: 'Ekonomi', 2: 'Kriminal'}
Visualize of the results of clustering with MDS
From step 06 it can be seen that the 3 clusters formed are: economy, sports and crime.
Color set and label cluster

STEP 09
VISUALIZATION
Set matplotlib to display charts inline
Type the following code
matplotlib inline
#create data frame that has the result of the MDS plus the cluster
numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')
# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
# ax.margins(0.05) # Optional, just adds 5% padding to the
autoscaling

Text Mining_big_data_machine_learning.pptx

More Related Content

Similar to Text Mining_big_data_machine_learning.pptx (20)

Recently uploaded (20)

Text Mining_big_data_machine_learning.pptx

Editor's Notes