2. HOW THIS COURSE IS
DELIVERED
Text mining is essentially data mining on unstructured text. It applies many of the techniques explained in previous sessions, combining the basic theory with a focus on implementation in Python.
The theory and algorithms are explained without an in-depth mathematical or statistical treatment, to make the material accessible to participants without a statistics or mathematics background.
3. AGENDA
Definition of text mining and common applications
Text mining and NLP preprocessing techniques
How to extract important information from text
Understand how text is handled in Python
5. TEXT MINING
DEFINITION
A broad umbrella term describing a range of technologies for analyzing and processing semi-structured and unstructured text data.
The unifying theme behind these technologies is the need to "turn unstructured text into structured data" so that powerful algorithms can be applied to large document databases.
Converting text into a structured, numerical format and applying analytical algorithms requires knowing how to use and combine techniques for handling text, from individual words to documents to entire document databases.
In building a statistical language system, it is best to devise a model that can make good use of the available data, even if the model seems overly simplistic.
6. TEXT MINING
PURPOSE
Turn text data into high-quality information and/or actionable knowledge
Minimize the human effort needed to consume text data
Supply actionable knowledge for optimal decision making
7. TEXT MINING
APPLICATIONS
Summarization: extract the most important information from a text.
Chatbot: an automatic question-and-answer application between machines and humans.
Text categorization: determine the topic of a particular document.
Keyword tagging: keyword tags are selected keywords that represent a text.
Sentiment analysis: determine the sentiment or opinion value in a text; the sentiment can be negative, neutral, or positive.
Speech-to-text and text-to-speech conversion: turn sound into text and vice versa.
Machine translation: translate text from one language to another.
Spelling checker
8. HUMANS AS SUBJECTIVE
SENSORS
Diagram: just as physical sensors observe the real world and report structured readings (a thermometer reports 3°C or 15°F, a geo sensor reports 41°N 120°W, a network sensor reports a bit stream), a human acts as a subjective sensor who perceives the real world and expresses what they perceive as text data.
9. TEXT MINING
LANDSCAPE
Diagram: humans perceive the real world and express what they observe as text data. The text mining landscape covers:
1. NLP and text representation
2. Mining knowledge about language: word mining and association
3. Mining the content of text data: topic mining and analysis
4. Mining knowledge about the observer: opinion mining and sentiment analysis
5. Inferring other real-world variables: predictive analysis
10. NLP
DEFINITION
NLP is a research area of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human natural languages.
It helps computers understand, interpret, and manipulate human language: not only the individual words, but also how those words are connected into meaningful information.
11. NLP
BASIC CONCEPTS
Example sentence: "a dog is chasing a boy on the playground"
Lexical analysis (POS tagging): a/det dog/noun is/aux chasing/verb a/det boy/noun on/prep the/det playground/noun
Syntactic analysis (parsing): groups the tagged words into noun phrases, a complex verb, a prepositional phrase, and verb phrases, up to the full sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1)
Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back
12. NLP
CHALLENGES
Language is ambiguous → context is needed to resolve the ambiguity
The same word can mean different things (homographs):
Bank: sloping land (especially the slope beside a body of water)
Bank: a financial institution that accepts deposits and channels the money into lending activities
Different words can mean the same thing (synonyms)
Human errors: misspellings, typos, abbreviations, social-media language, etc.
Each language is different in terms of structure, vocabulary, etc.
Special requirements: regulation and privacy in legal, diplomatic, and medical domains
13. NLP
IMPLEMENTATION PACKAGES
Apache OpenNLP: a machine learning toolkit that provides tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution, and so on.
Natural Language Toolkit (NLTK): the most popular Python library for NLP, covering classification, tokenization, stemming, tagging, parsing, and others.
Stanford NLP: used for part-of-speech tagging, named entity recognition, coreference resolution, sentiment analysis, etc.
MALLET: a Java package that includes Latent Dirichlet Allocation, document classification, clustering, topic modeling, information extraction, and others.
etc.
14. LAB PREPARATION
PYTHON LIBRARIES REQUIRED
In this training we will use the following Python libraries:
nltk: the most popular NLP library in the Python ecosystem
beautifulsoup4: library for extracting data from HTML and XML documents
pandas: library for data manipulation and analysis
scikit-learn: Python machine learning library
matplotlib: library for 2-dimensional plotting
Sastrawi: stemmer for the Indonesian language, ported from PHP
gensim: Python library focused on analyzing plain-text documents for semantic structure
18. STEP 02
CREATE ENVIRONMENT
A conda environment is basically a directory that contains all the packages we install. We can have several conda environments with different package versions; for example, we may need to run Python 2 in one environment and Python 3 in another.
By default, Anaconda creates one main environment called base (root).
There are two ways to create and manage environments: Anaconda Navigator (graphical user interface) or the Anaconda Prompt (command-line interface).
To create an environment through the GUI, run Anaconda Navigator
19. STEP 02
CREATE ENVIRONMENT
1. Choose Environment, and click Create
2. Name the environment, for example training1
3. Choose the Python version
20. STEP 03
INSTALL PACKAGE
1. In the drop-down list, select Not Installed, then check the packages we want to install
2. Click Apply button
21. STEP 03
INSTALL PACKAGE
The Sastrawi package is installed through the CLI, using pip.
Follow these steps:
Open the Anaconda Prompt
Activate the environment with the command activate <environment name>
Install Sastrawi with the command pip install Sastrawi
22. STEP 04
RUN JUPYTER NOTEBOOK
Install Jupyter Notebook through Anaconda Navigator in the Home menu, then click Launch
23. STEP 04
RUN JUPYTER NOTEBOOK
Jupyter Notebook will open as a tab in your browser
You can create a new folder or notebook by clicking New
24. STEP 05
CHECK PACKAGE VERSION
We can check the versions of all packages installed in our current environment with the following code:

import pkg_resources
dists = [d for d in pkg_resources.working_set]
for i in dists:
    print(i)
Click Run to execute
26. COMMON
TEXT MINING WORKFLOW
1. Collection: obtaining or preparing the corpus. A corpus can be emails, web pages, social media content, Wikipedia articles, journal collections, documents, etc.
2. Preprocessing: common tasks in data preprocessing are tokenization and segmentation, normalization, and noise removal.
3. Feature Extraction: feature extraction and selection. The process may include exploration and visualization to find suitable features.
4. Model Building: build, train, and test the model.
5. Model Evaluation: evaluate model performance. The metrics vary depending on the type of model used and the NLP task performed.
28. TEXT PREPROCESSING
COMMON TASKS
Some of the most common tasks in the preprocessing stage are:
Tokenization
Noise Removal
Stop Words Removal
Normalization
Stemming & Lemmatization
Object Standardization
29. TOKENIZATION
The process of cutting text into smaller units, called tokens.
Tokens can be words, keywords, phrases, symbols, or even sentences.
Challenges in tokenization depend on the type of language.
Languages such as English are referred to as space-delimited, as most words are separated from each other by white space.
Languages such as Chinese are referred to as unsegmented, as words do not have clear boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and morphological information.
30. NOISE REMOVAL
Every piece of text that is not relevant to the context and the final output can be considered noise. For example:
Stop words: commonly used words in a language, for example: are, me, from, in, etc.
URLs, links, tags
Social media entities (mentions, hashtags)
Punctuation
etc.
Approaches:
Prepare a noise dictionary and iterate over each token (word), eliminating tokens that appear in the dictionary (for example, a stop word list)
Use regular expressions (regex)
A combination of both
32. LAB DESCRIPTION
What you will learn:
Tokenization and noise removal using nltk.
In this lab, the elements considered noise are non-alphabetical tokens, such as numbers and punctuation.
33. STEP 01
TOKENIZATION AND CLEANSING
Type the following code into your
notebook and click Run
import nltk
import re

nltk.download('punkt')  #download the sentence tokenizer models once, if not already installed

def tokenize_clean(text):
    #tokenization and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    #clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
34. STOP WORDS REMOVAL
Stop words are words in a sentence that are considered unimportant; omitting them does not change the meaning or value of the sentence.
In most cases, stop words need to be removed so that the results of the analysis are more accurate.
Stop words are usually words that have no meaning on their own, for example conjunctions such as 'and', 'then', 'or', etc.
Stop words depend on the language and on the domain of the problem to be solved. There are no universal criteria for determining stop words.
36. LAB DESCRIPTION
What you will learn:
Show stop words list in nltk
Remove stop words from text
37. STEP 01
SHOW STOP WORDS LIST
Type the following code and click Run
import nltk
nltk.download('stopwords')  #download the stop words corpus once, if not already installed
stopwords = nltk.corpus.stopwords.words('indonesian')
stopwords
38. STEP 02
STOP WORDS REMOVAL
Type the following code into
your notebook and click Run
import nltk
import re

def tokenize_clean(text):
    [.. same script as in Lab 01]
    #clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
    return cleaned_token

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
39. NORMALIZATION
A series of tasks that process text into a form that meets certain standards.
The aim is to improve the quality of the text so that subsequent processing performs better.
For example: changing all letters to lowercase, converting numbers into words, stemming, expanding abbreviations into their original words, etc.
Normalization tries to make the same token/word be represented in the same form, so that the next process can run better.
Important processes in text normalization are stemming and lemmatization.
40. STEMMING &
LEMMATIZATION
Stemming and lemmatization are processes that reduce a word to a common base form (stem).
Stemming is done by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes of inflected words. For example:
studying → study
studies → studi
41. STEMMING &
LEMMATIZATION
Lemmatization takes into consideration the morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can look through to link an inflected form back to its lemma. For example:
studying → study
studies → study
Lemmatization is usually needed for languages whose words change form, for example in English: go-went-gone, etc. In Bahasa Indonesia, stemming and lemmatization are usually considered the same process.
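For illustration only, a minimal sketch (not part of the labs) contrasting stemming and lemmatization in English with nltk; it assumes the WordNet data has been downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  #one-time download of the lemmatizer dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studying", "studies", "went"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))

#stemming cuts affixes (studies -> studi), while lemmatization maps each word to
#its dictionary form (studies -> study, went -> go)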
45. STEP 02
STEMMING CODE
import nltk
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

def tokenize_and_stem(text):
    [.. same script as in Lab 01]
    [.. same script as in Lab 02]
    #stem using the Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = []
    for token in cleaned_token:
        stems.append(stemmer.stem(token))
    return stems

word_1 = tokenize_and_stem("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_and_stem("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
47. FEATURE EXTRACTION
The process of converting text into a set of features prior to analysis.
The types of features depend on the model and on the method that will be used in the mining/machine learning process.
Some feature engineering techniques in NLP:
Syntactic parsing
Entity parsing
Vectorization
48. FEATURE EXTRACTION
SYNTACTIC PARSING
Syntactic parsing is the process of determining sentence structure based on a certain
grammar and lexicon.
The structure of the sentence includes word level, word class level, phrase level, element
level, and clausal level.
Some important attributes of text syntax are: dependency grammar and part-of-speech tags.
49. SYNTACTIC PARSING
PART OF SPEECH TAGGING
POS (part-of-speech) tags categorize word classes, such as nouns, verbs, adjectives, etc.
A POS tagger is an application that can automatically annotate each word in a document with its POS tag.
POS tagging produces a list of tuples, where each tuple is a (word, tag) pair.
Tags are labels that indicate whether a word is a noun, adjective, verb, and so on.
There are several POS tagsets (POS tag naming schemes); the most popular is the Penn Treebank tagset.
50. PART OF SPEECH
TAGGING SIMPLE EXAMPLE
Example sentence: "Paul Pogba scored a late penalty"
sentence = noun phrase + verb phrase
noun phrase: "Paul Pogba" (a named entity)
verb phrase: "scored a late penalty", with scored/verb, a/article, late/adjective, penalty/noun
51. PART OF SPEECH TAGGING
USAGE
POS tagging is usually done before chunking, the extraction of phrases from a sentence.
POS tagging is also used for sentence structure analysis and word sense disambiguation. For example:
can - We can help you
can - It is kept in a can
Knowing the word class in a sentence makes it easier to determine its meaning.
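A minimal sketch of POS tagging with nltk on the 'can' example above; the tags follow the Penn Treebank tagset, and the 'punkt' and 'averaged_perceptron_tagger' resources are assumed to be downloaded:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

print(nltk.pos_tag(nltk.word_tokenize("We can help you")))
#here 'can' should be tagged as a modal verb (MD)
print(nltk.pos_tag(nltk.word_tokenize("It is kept in a can")))
#here 'can' should be tagged as a noun (NN), which disambiguates its meaning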
52. ENTITY PARSING
NAMED ENTITY RECOGNITION
The task of identifying the names of all the people, organizations, and geographic locations in a text, as well as time, currency, and percentage expressions.
Builds knowledge from text by extracting information such as:
Names (people, organizations, locations, objects, etc.)
Temporal expressions (calendar dates, times of day, durations, etc.)
Numerical expressions (money, percentages, etc.)
Knowledge bases built with NER are widely used in technologies such as smart assistants, machine translation, indexing in information retrieval, classification, automatic summarization, etc.
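As a small illustration of NER (English only), nltk's built-in chunker can label person, organization, and location entities; this is just a sketch, assuming the required nltk resources ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words') are downloaded:

import nltk
for pkg in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(pkg)

sentence = "Paul Pogba joined Manchester United in August 2016"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = nltk.ne_chunk(tags)
print(tree)
#named entities such as PERSON and ORGANIZATION appear as labelled subtrees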
53. NAMED ENTITY RECOGNITION
METHODS
Rule-based
Uses a data dictionary containing names of countries, cities, companies, etc.
Uses predefined language-dependent rules based on linguistics that help identify named entities in a document.
Constraint: requires the ability to define the rules, which is usually done by linguists, and depends heavily on the language used.
Machine learning
Uses statistical classification models and machine learning algorithms.
Constraint: requires annotated corpora for the domain of interest. Building an annotated corpus for a new domain is time-consuming and requires effort from human experts.
Hybrid
Combines both methods, taking advantage of the strengths of each.
54. VECTORIZATION
Transforming text into representations 'understood' by machines, i.e., numeric vectors (or arrays), so that they can be used by various analytics and machine learning algorithms.
There are two types of text vectorization:
Bag of words (or bag of n-grams): represents each word as a discrete element of a vector (or array) → an element of a 'bag'
Word embeddings: represent (or embed) words in a continuous vector space in which words with similar meanings are mapped close to each other. New words in application texts that were missing from the training texts can still be classified through similar words.
55. VECTORIZATION
METHODS
Type | Vectorization Method | Function | Considerations
Bag of Words | Frequency | Counts term frequencies | The most frequent words are not always the most informative
Bag of Words | One-Hot Encoding | Binarizes term occurrence (0, 1) | All words are equidistant, so normalization is extra important
Bag of Words | TF-IDF | Normalizes term frequencies across documents | Moderately frequent terms may not be representative of document topics
Word Embeddings | Distributed Representations | Context-based, continuous term similarity encoding | Performance intensive; difficult to scale without additional tools (e.g., TensorFlow)
56. BAG OF WORDS
DEFINITION
A text representation that indicates the appearance of tokens/words in a document.
Called a 'bag' because it does not care about structure or word order in the text.
The main components of BoW are:
A vocabulary, or collection of known words, based on the input text
A measure of the presence of the known words
The complexity of BoW techniques depends on how the vocabulary is built and on the scoring method
57. BAG OF WORDS
SCORING METHOD
Binary: one-hot encoded vector
Counts: the number of times a word appears in each document
Frequency: the number of occurrences of a word in a document divided by the total number of words in the document (count / total words)
TF-IDF: the frequency and relevance of words in a corpus
58. BAG OF WORDS
EXAMPLE
It was the best of times,
it was the worst of times,
it was the age of wisdom
The vocabulary:
{“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”}
If we treat each sentence as a separate document, the BoW vectors are:
"it was the best of times" → [1, 1, 1, 1, 1, 1, 0, 0, 0]
"it was the worst of times" → [1, 1, 1, 0, 1, 1, 1, 0, 0]
"it was the age of wisdom" → [1, 1, 1, 0, 1, 0, 0, 1, 1]
59. ONE-HOT ENCODING
DEFINITION
A representation of categorical variables as binary vectors.
Each value is represented as a binary vector that is all zeros except for the index of that value, which is marked with a 1.
For example, if we have the vocabulary {boy, chase, dog, playground}:
boy → [1, 0, 0, 0]
chase → [0, 1, 0, 0]
dog → [0, 0, 1, 0]
playground → [0, 0, 0, 1]
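A small sketch building these one-hot vectors directly in Python from a fixed vocabulary:

vocab = ["boy", "chase", "dog", "playground"]
one_hot = {word: [1 if i == idx else 0 for i in range(len(vocab))]
           for idx, word in enumerate(vocab)}

print(one_hot["dog"])  #[0, 0, 1, 0]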
60. TF-IDF
TF-IDF : Term Frequency - Inverse Document Frequency
A way to determine the topic of a document based on the words or terms it contains
TF-IDF measures relevance, not just frequency
The weight calculation in TF-IDF uses a statistical method that evaluates how important a term is to a document within a corpus
The greater the TF-IDF value of a word or term, the rarer it is across the corpus and the more relevant it is to that document
61. TF-IDF
USAGE
What is TF-IDF used for?
Categorizing text, automatically creating tags or keywords for a document
Determining the order of documents in search results (document relevance to a term)
Fixing or extending stop-word lists
What is the difference between TF-IDF and sentiment analysis?
Sentiment analysis classifies text based on 'positive', 'negative', or 'neutral' opinion values.
TF-IDF classifies text based on its contents.
62. TF-IDF
TERM FREQUENCY
Term frequency (TF) measures how often a word or term (t) occurs in a document (d).
A word may have a higher raw count in one document simply because that document is longer, which is why the count is normalized by the document length.
TF calculation formula:
TF(t) = (number of occurrences of the term t in the document) / (total number of terms in the document)
63. TF-IDF
INVERSE DOCUMENT FREQUENCY
IDF measures how important a term is across the whole document collection.
Certain terms, such as "are", "from", and "that", may appear many times but have little influence. Therefore it is necessary to reduce the weight of those words and increase the weight of the rare ones, with the following calculation (the example on the next slide uses a base-10 logarithm):
IDF(t) = log(total number of documents / number of documents containing the term t)
64. TF-IDF
EXAMPLE
Suppose a document has 100 words and the word cat appears 3 times.
The TF for cat is:
3 / 100 = 0.03
If there are 10 million documents and the word cat appears in 1,000 of them, then the IDF is:
log10(10,000,000 / 1,000) = 4
The TF-IDF weight for the word cat is:
0.03 x 4 = 0.12
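The arithmetic above can be checked in a few lines of Python (using a base-10 logarithm, as in the example):

import math

tf = 3 / 100                          #term frequency of "cat"
idf = math.log10(10000000 / 1000)     #inverse document frequency
print(tf, idf, tf * idf)              #0.03 4.0 0.12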
65. TF-IDF
UPDATE AND MAINTENANCE
In most cases, the set of processed documents grows continuously, so the TF-IDF values need to be updated to include the new documents.
However, TF-IDF is calculated against a specific corpus, so the TF-IDF matrix cannot be updated incrementally.
Several approaches can be taken to deal with this, including:
Perform the TF-IDF calculation on demand: when a new document arrives, compute the TF-IDF values of its terms
Update regularly, when new documents reach a certain amount or age; the drawback is that some terms may be ignored because they are not yet in the vocabulary
66. ISSUES IN BOW
Vocabulary: requires good design, especially for managing its size, because the size affects the sparsity of the document representation
Sparsity: the resulting vectors are sparse, i.e., the majority of elements are null or 0. Sparse representations are harder and less efficient to model, both computationally (storage and computation time) and in terms of information (the model encodes little information in a very large space)
Meaning: discarding word order loses the context and meaning of words in the text (semantics). Context and meaning are very useful in modeling, for instance to distinguish different meanings of words caused by different arrangements, to determine synonyms, and so on
68. LAB DESCRIPTION
What you will learn:
Run the TF-IDF function
Requirement:
the tokenize_and_stem function from previous labs
69. STEP 01
DATA INPUT
Create dataset
from sklearn.feature_extraction.text import TfidfVectorizer

#we will use dummy documents for input, with 1 sentence per document
files = []
files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam bambu yang ia genggam di tangan.")
files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu saja untuk meratapi erupsi.")
files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang berwisata.")
files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi sendiri dan orang lain.")
files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen siap membantu dengan siaga 24 jam")
70. STEP 02
CORPUS PREPARATION
import nltk  #for the stop words corpus (already imported in the earlier labs)

#prepare the corpus, load it into a dictionary
token_dict = {}
i = 0
for t in files:
    filename = "file" + str(i)
    token_dict[filename] = t
    i = i + 1

#use Bahasa Indonesia stop words from the nltk corpus
stopwords = nltk.corpus.stopwords.words('indonesian')

#perform TF-IDF vectorization, using the tokenize_and_stem function we created in the previous lab
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words=stopwords)
tfs = tfidf.fit_transform(token_dict.values())
71. STEP 03
TF-IDF TRANSFORMATION
We test with a new sentence to see which tokens are produced and what their TF-IDF values are
Show the tokens produced and their TF-IDF values
str1 = 'Di kejauhan tampak seorang relawan pria dari Lombok sedang berjalan.'
response = tfidf.transform([str1])

#show result
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
72. WORD EMBEDDINGS
DEFINITION
A distributed word representation: a dense, low-dimensional, real-valued representation of words. A word representation is a mathematical object associated with each word, often a vector.
73. WORD EMBEDDINGS
WHY
Word embeddings overcome the BoW problems
Represent words as dense real-valued vectors
Include sentence context: the meaning of a word is determined by looking at its context
Encode word similarity in the representation: similar words are represented by similar vectors
Vector values are learned using neural networks, so word embedding methods are often associated with deep learning
Popular word embedding algorithms: Word2Vec, GloVe
74. WORD EMBEDDINGS
WORD2VEC
Word2Vec is a two-layer neural network: text as input, vectors as output
Developed by Mikolov et al. at Google in 2013
Determines the meaning of a word by using the other words around it (its context)
When a word w appears in a text, the context of w is the words before and after w (usually within a specified window size)
For example, the word "konferensi" appears in the following contexts:
…Menlu yang menghadiri dan membuka konferensi Afro-Asia mengharapkan kerjasama yang baik...
...bahwa tema yang diusung dalam konferensi tahun ini adalah penguatan Ekonomi...
...Wagub membuka Seminar Nasional dan Konferensi Daerah Ikatan Apoteker Indonesia…
This kind of context is called local context
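A minimal sketch of training Word2Vec with gensim on a toy corpus (each document is a list of tokens, for example the output of tokenize_clean); the sentences and parameter values here are illustrative only, and in gensim 4.0 and later the size parameter is named vector_size:

from gensim.models import Word2Vec

sentences = [
    ["menlu", "membuka", "konferensi", "afro-asia"],
    ["tema", "konferensi", "tahun", "ini", "penguatan", "ekonomi"],
    ["wagub", "membuka", "seminar", "dan", "konferensi", "daerah"],
]

#window controls the size of the local context, size the embedding dimension
model = Word2Vec(sentences, size=100, window=5, min_count=1)

print(model.wv["konferensi"])               #the learned vector for "konferensi"
print(model.wv.most_similar("konferensi"))  #words with the most similar vectors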
76. WORD2VEC
ADVANTAGE
The key benefit of the approach is that high-quality word embeddings can be learned
efficiently (low space and time complexity), allowing larger embeddings to be learned
(more dimensions) from much larger corpora of text (billions of words).
77. GOOGLE WORD2VEC
Pre-trained model
300-dimensional vectors
3 million words and phrases
Dataset: part of the Google News corpus (about 100 billion words)
Further info:
https://code.google.com/archive/p/word2vec/
80. DEFINITION
Types of information in text: facts and opinions.
Facts: objective expressions about something.
Opinions: subjective expressions that describe people's sentiments, appraisals, and feelings toward a subject or topic.
Sentiment analysis: the process of analyzing text to obtain subjective information about a topic.
81. SENTIMENT ANALYSIS
USE CASE EXAMPLES
Opinions in social and geopolitical contexts
Business and e-commerce applications, such as product reviews and movie ratings
Predicting stock prices based on people's opinions about companies and resources
Determining which areas of a product need to be improved by summarizing product reviews
Customer preference
82. OPINION
REPRESENTATION
Opinion holder: whose opinion is this?
Opinion target: what is this opinion about? e.g., a product, a service, an individual, an organization, an event, or a topic, also called an entity. An entity can have many features (aspects).
Opinion context: under what situation (e.g., time, location) was the opinion expressed?
Opinion sentiment: what does the opinion tell us about the opinion holder's feelings? Positive, negative, and neutral are called the opinion orientation (also called sentiment orientation or polarity).
Liu (2012) formulated the formal definition: an opinion is a quadruple (g, s, h, t), where g is the opinion (or sentiment) target, s is the sentiment about the target, h is the opinion holder, and t is the time when the opinion was expressed.
83. SENTIMENT ANALYSIS
LEVEL
Document-level sentiment analysis: determine whether a whole document, message, etc., is overall positive or negative
Sentence-level sentiment analysis: determine the sentiment of each sentence within the document
Aspect- or topic-based sentiment analysis: identify not only positive or negative sentences, but also the specific topic/feature being referred to as positive or negative. There may be more than one aspect in a sentence:
e.g.: I love the display of the new phone but the battery life is terrible.
84. SENTIMENT ANALYSIS
PROCESS
Opinion mining
Entity extraction and categorization
Aspect extraction and categorization
Opinion holder extraction and categorization
Time extraction and standardization
Sentiment classification
Opinion quadruple generation: produce all opinion quadruples (g, s, h, t) expressed in document d based on the results of the above tasks
Opinion summarization
Opinions are subjective. An opinion from a single person (unless a VIP) is often not sufficient for action. We need opinions from many people, hence the need for opinion summarization.
85. DOCUMENT SENTIMENT
CLASSIFICATION
TECHNIQUES
Supervised learning: any existing supervised learning method can be applied, e.g., naive Bayes classification, Support Vector Machines, etc.
Unsupervised learning: using opinion words and phrases. Liu (2012) describes an algorithm with three steps:
Extract phrases containing adjectives or adverbs
Estimate the semantic orientation/polarity of each phrase
Given a review, compute the average opinion orientation of all phrases in the review, and classify the review as recommended if the average is positive, and not recommended otherwise.
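A toy sketch of the averaging idea in the unsupervised approach; the hard part, estimating phrase orientation (step 2), is replaced here by fixed example word lists, so this only illustrates the final averaging step, not the actual algorithm:

#illustrative word lists only; the real algorithm estimates phrase polarity statistically
positive = {"good", "beautiful", "wonderful"}
negative = {"bad", "poor", "terrible"}

def classify_review(tokens):
    scores = [1 if t in positive else -1
              for t in tokens if t in positive or t in negative]
    if not scores:
        return "neutral"
    return "recommended" if sum(scores) / len(scores) > 0 else "not recommended"

print(classify_review("the camera is good but the battery is terrible".split()))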
86. SENTIMENT ANALYSIS
IMPORTANT FEATURES
Terms and their frequency: individual words or n-grams and their frequency counts. Word positions may also be considered, and the TF-IDF weighting scheme may be applied. These features have been shown to be quite effective in sentiment classification
Part of speech: adjectives may be treated as special features
Opinion words and phrases: words commonly used to express positive or negative sentiment. For example, beautiful, wonderful, and good are positive opinion words; bad, poor, and terrible are negative opinion words
Negations: important because their appearance often flips the opinion orientation
Syntactic dependency: word-dependency features generated from parsing or dependency trees
87. SENTIMENT ANALYSIS -
CHALLENGES
A positive or negative sentiment word may have opposite orientations in different
application domains.
A sentence containing sentiment words may not express any sentiment. Question sentences
and conditional sentences are two important types, e.g., “Can you tell me which camera is
good?” and “If I can find a good camera in the shop, I will buy it.”
Not all conditional or interrogative sentences express no sentiments, e.g., “Does anyone
know how to repair this terrible printer” and “If you are looking for a good car, get
Toyota.”
Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a
great car! It stopped working in two days.”
Many sentences without sentiment words can also imply opinions. e.g. “This washer uses a
lot of water” implies a negative sentiment.
89. TOPIC MODELLING
Topic modeling is an unsupervised machine learning technique for organizing text information so that related pieces of text can be identified.
Topic modeling is basically document clustering in which documents and words are clustered simultaneously.
The topic modeling problem:
Known: the text/document collection (corpus) and the number of topics
Unknown: the actual topics and the topic distribution of each document
Topic modeling is used for:
Discovering hidden topical patterns present across the collection
Annotating documents according to these topics
Using these annotations to organize, search, and summarize texts
90. TOPIC MODELLING
Basic assumptions:
A document consists of a mixture of topics
A topic is a collection of words
Topic = latent semantic concept
Different approaches:
Latent Semantic Analysis/Indexing (LSA/LSI) → linear algebra
Probabilistic Latent Semantic Analysis (PLSA) → probabilistic
Latent Dirichlet Allocation (LDA) → probabilistic
91. LATENT SEMANTIC
ANALYSIS
Decomposes the document-word matrix into document-topic and topic-word matrices using Singular Value Decomposition (SVD)
Given m documents and n words in our vocabulary, we can construct an m-by-n matrix A → a sparse word-document co-occurrence matrix
The simplest form of LSA uses raw counts, where a_ij is the number of times the j-th word appears in the i-th document
More advanced LSA often uses TF-IDF for the a_ij values
SVD decomposes the matrix A into three matrices, A = U S V^T, where:
A is an m × n matrix
U is an m × n orthogonal matrix
S is an n × n diagonal matrix
V is an n × n orthogonal matrix
92. LATENT SEMANTIC
ANALYSIS
Since A is most likely sparse, we perform dimensionality reduction using truncated SVD.
This keeps only the t most significant dimensions in the transformed space.
LSA is quick and efficient, but has some shortcomings:
Lack of interpretable embeddings
Needs a really large set of documents and a large vocabulary to get accurate results
Less efficient representation
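A minimal LSA sketch on a few illustrative toy documents: build a TF-IDF matrix and decompose it with truncated SVD, keeping t = 2 latent dimensions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["harga saham naik di bursa",
        "bursa saham melemah karena ekonomi",
        "tim sepak bola menang di final",
        "pemain bola mencetak gol di final"]

A = TfidfVectorizer().fit_transform(docs)  #m x n document-term matrix

svd = TruncatedSVD(n_components=2)         #keep the 2 most significant dimensions
doc_topics = svd.fit_transform(A)          #m x t document-topic matrix

print(doc_topics)
print(svd.components_)                     #t x n topic-word matrix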
93. PROBABILISTIC LATENT
SEMANTIC ANALYSIS
PLSA uses a probabilistic method instead of SVD.
The basic idea: find a probabilistic model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.
PLSA assumptions:
given a document d, topic z is present in that document with probability P(z|d)
given a topic z, word w is drawn from z with probability P(w|z)
As its name implies, PLSA just adds a probabilistic treatment of topics and words on top of LSA.
94. PLSA
LIMITATIONS
PLSA is more flexible than LSA, but still has some limitations:
The number of parameters grows linearly with the number of training documents → the model is prone to overfitting
It is not a well-defined generative model: there is no way to generalize to new, unseen documents
95. LATENT DIRICHLET
ALLOCATION
LDA is a Bayesian version of PLSA. It uses Dirichlet priors for the document-topic and word-topic distributions, leading to better generalization.
Dirichlet: a probability distribution that does not sample from the space of real numbers; instead it samples over a probability simplex.
Probability simplex: a group of numbers that add up to 1. For example:
(0.6, 0.4)
(0.1, 0.1, 0.8)
(0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
The numbers represent probabilities over K distinct categories; in the examples above, K is 2, 3, and 6 respectively.
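A small illustration with numpy (installed as a dependency of the libraries above): samples drawn from a Dirichlet distribution live on the probability simplex, i.e. each sample is a vector of K probabilities that sums to 1:

import numpy as np

alpha = [0.5, 0.5, 0.5]                       #K = 3 categories
samples = np.random.dirichlet(alpha, size=3)  #draw 3 samples

print(samples)
print(samples.sum(axis=1))                    #every row sums to 1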
96. LATENT DIRICHLET
ALLOCATION MODEL
From a Dirichlet distribution Dir(α), draw a random sample θ representing the topic distribution of a particular document.
From θ, select a particular topic Z according to that distribution.
From another Dirichlet distribution Dir(β), draw a random sample φ representing the word distribution of topic Z. From φ, choose the word w.
LDA typically works better than pLSA because it can generalize to new documents easily.
Some limitations:
Needs relatively large memory and processing time.
The model is difficult to explain
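A minimal LDA sketch with gensim on an illustrative toy corpus (in a real run, the preprocessing from the earlier labs would be applied first):

from gensim import corpora
from gensim.models import LdaModel

texts = [["saham", "bursa", "ekonomi", "harga"],
         ["ekonomi", "harga", "saham", "inflasi"],
         ["bola", "gol", "final", "pemain"],
         ["pemain", "bola", "menang", "final"]]

dictionary = corpora.Dictionary(texts)           #vocabulary
corpus = [dictionary.doc2bow(t) for t in texts]  #bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)                       #top words per topic
print(lda.get_document_topics(corpus[0]))        #topic mixture of the first document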
98. TEXT CLUSTERING
PROCESS FLOW
We will demonstrate the end-to-end process by performing document clustering.
The process flow is as follows:
Text preprocessing, including text cleanup and text normalization
Vector representation / feature extraction: using TF-IDF
Model building: using K-Means
Visualization
Model evaluation
100. LAB DESCRIPTION
What you will learn:
How to create a document clustering program using a real dataset
Implement tokenization, stemming, and cleansing
K-Means implementation
Visualization using matplotlib
101. STEP 01
LIBRARY
Import all required libraries and click Run
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.externals import joblib
from sklearn.manifold import MDS
102. STEP 02
DATA INPUT
Type the following code and click Run
Show sample data
#load titles
titles = open('Judul Berita.txt').read().split('\n')

#load articles
article = open('Berita.txt', encoding="utf8").read().split('BERHENTI DISINI')

print(len(titles))
print(titles[:5])
print(len(article))
print(article[:5])
103. STEP 03
PARSING ARTICLES
Parse the articles from HTML format using the beautifulsoup package
article_clean = []
for text in article:
    text = BeautifulSoup(text, 'html.parser').getText()
    article_clean.append(text)

article = article_clean
print(article)
104. STEP 04
TOKENIZATION AND STEMMING
Perform tokenization, stemming, and cleansing, as in Lab 03
def tokenize_and_stem(text):
    #tokenization and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    #clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    #clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
    ...
105. STEP 05
TOKENIZATION AND STEMMING (CONT.)
Perform tokenization, stemming, and cleansing, as in Lab 03 (cont.)
Show sample data
    ...
    #stem using the Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = [stemmer.stem(t) for t in cleaned_token]
    return stems

#totalvocab_tokenized and totalvocab_stemmed are assumed to have been built by
#applying the tokenizer (without and with stemming) to every article; see the sketch below
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' words in vocab_frame')
print(vocab_frame.head())
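The snippet above uses two lists, totalvocab_tokenized and totalvocab_stemmed, whose construction is not shown in the deck. A sketch of how they could be built from the articles, assuming a hypothetical tokenize_only helper that is identical to tokenize_and_stem but returns cleaned_token without the stemming step (this would run before the vocab_frame line above):

#sketch only: build aligned stemmed and unstemmed vocabularies over all articles
totalvocab_stemmed = []
totalvocab_tokenized = []
for a in article:
    totalvocab_stemmed.extend(tokenize_and_stem(a))  #stemmed tokens
    totalvocab_tokenized.extend(tokenize_only(a))    #same tokens, unstemmed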
106. STEP 06
TF-IDF
Calculate TF-IDF matrix
Show matrix
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

#fit the vectorizer to the articles
tfidf_matrix = tfidf_vectorizer.fit_transform(article)

print(tfidf_matrix.shape)
print(tfidf_matrix)
107. STEP 07
K-MEANS MODELLING
Perform K-Means modeling; in this case we use number of clusters = 3
Create a DataFrame with the format: rank - title - cluster
num_clusters = 3
km = KMeans(n_clusters=num_clusters, random_state=1000)
km.fit(tfidf_matrix)

#rank / order of each article
ranks = [i for i in range(1, len(titles)+1)]

#cluster labels from k-means
clusters = km.labels_.tolist()

news = {'title': titles, 'rank': ranks, 'article': article, 'cluster': clusters}
frame = pd.DataFrame(news, index=[clusters], columns=['rank', 'title', 'cluster'])

#show dataframe
print(frame)
frame['cluster'].value_counts()
108. STEP 08
DATA EXPLORATION
Display the clustering results and the top terms per cluster to determine the cluster labels
print("Top terms per cluster:")
#sort cluster centers based on its proximity to its centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
print("Cluster %d words:" % i, end='')
for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0]
[0].encode('utf-8', 'ignore'), end=',')
print() #add whitespace
print() #add whitespace
print("Cluster %d titles:" % i, end='')
for title in frame.ix[i]['title'].values.tolist():
print(' %s,' % title, end='')
print() #add whitespace
print()
109. STEP 09
VISUALIZATION
Visualize the clustering results with MDS (multidimensional scaling)
From the data exploration step it can be seen that the 3 clusters formed are: economy, sports, and crime
Set the colors and labels for each cluster

similarity_distance = 1 - cosine_similarity(tfidf_matrix)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(similarity_distance)  #shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]

#set colors with a dictionary
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}

#dictionary for cluster names (chart legend)
cluster_names = {0: 'Olahraga', 1: 'Ekonomi', 2: 'Kriminal'}
110. STEP 09
VISUALIZATION
Set matplotlib to display charts inline
Type the following code
%matplotlib inline

#create a data frame with the MDS result plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9))  # set size
# ax.margins(0.05)  # Optional, just adds 5% padding to the autoscaling
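The deck stops before the plotting loop itself; a sketch of how the scatter plot is typically finished with the objects defined above (groups, cluster_colors, cluster_names), following the standard matplotlib groupby pattern:

#sketch: one scatter series per cluster, colored and labelled via the dictionaries above
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name])

ax.legend(numpoints=1)

#annotate each point with its news title
for i in range(len(df)):
    ax.text(df.iloc[i]['x'], df.iloc[i]['y'], df.iloc[i]['title'], size=8)

plt.show()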
Editor's Notes
#8: Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.