UNSTRUCTURED DATA
TEXT MINING
HOW THIS COURSE IS DELIVERED
Text mining is essentially data mining applied to unstructured text. It is an
implementation of many of the techniques explained in previous sessions,
combining the basic theory with a focus on implementation in Python.
The theory and algorithms are explained without an in-depth mathematical or
statistical treatment, to make the material accessible to participants who do
not have a statistics or mathematics background.
AGENDA
Definition of text mining and common applications
Text mining and NLP preprocessing techniques
How to extract important information from text
Understand how text is handled in Python
CHAPTER 1
INTRODUCTION
TEXT MINING
DEFINITION
A broad umbrella term describing a range of technologies for analyzing and processing
semi-structured and unstructured text data.
The unifying theme behind each of these technologies is the need to “turn unstructured text
into structured data” so powerful algorithms can be applied to large document
databases.
Converting text into a structured, numerical format and applying analytical algorithms
requires knowing how to use and combine techniques for handling text, ranging from
individual words to documents to entire document databases.
In building a statistical language system, it is best to devise a model that can make good
use of available data, even if the model seems overly simplistic.
TEXT MINING
PURPOSE
Turn text data into high-quality information and/or actionable knowledge
 Minimizes human effort in consuming text data
 Supplies knowledge for optimal decision making → actionable knowledge
TEXT MINING
APPLICATIONS
Summarization: extract the most important information from a text.
Chatbot: an automatic question-and-answer application between machines and humans.
Text categorization: determine the topic of a particular document.
Keyword tagging: select keywords that represent a text.
Sentiment analysis: determine the sentiment or opinion value in a text; the sentiment can be
negative, neutral, or positive.
Speech-to-text and text-to-speech conversion: turn sound into text and vice versa.
Machine translation: translate text from one language to another.
Spell checking.
HUMANS AS SUBJECTIVE
SENSORS
[Diagram] Physical sensors sense the real world and report data: a thermometer senses the weather (3°C, 15°F, ...), a geo sensor senses locations (41°N 120°W, ...), and a network sensor senses the network (10110101010101...). By analogy, a human acts as a subjective sensor: perceiving the real world and expressing it as text data.
TEXT MINING
LANDSCAPE
[Diagram] Real World → perceive → express → Text Data, analyzed in five layers:
1. NLP and text representation
2. Mining knowledge about language: word mining and association
3. Mining content of text data: topic mining and analysis
4. Mining knowledge about the observer: opinion mining & sentiment analysis
5. Inferring other real-world variables: predictive analysis
NLP
DEFINITION
NLP is a research area of computer science, artificial intelligence, and computational
linguistics, concerned with the interactions between computers and human natural
languages.
Helps computers understand, interpret, and manipulate human language: not only
understanding individual words, but also how those words are interconnected to form
meaningful information.
NLP
BASIC CONCEPTS
Example: "a dog is chasing a boy on the playground"
Lexical analysis (POS tagging): det noun aux verb det noun prep det noun
Syntactic analysis (parsing): [a dog] is a noun phrase, [is chasing] a complex verb, [a boy] and [the playground] are noun phrases, [on the playground] is a prepositional phrase; together these form the verb phrase and, finally, the sentence.
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1, b1, p1).
Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back.
NLP
CHALLENGES
Language is ambiguous → context is needed to resolve it
 The same word can mean something else (homograph)
 Bank - sloping land (especially the slope beside a body of water)
 Bank - a financial institution that accepts deposits and channels the money into lending activities
 Different words can mean the same thing (synonyms)
Human errors - misspellings, typos, abbreviations, social-media language, etc.
Each language is different in terms of structure, vocabulary, etc.
Special requirements: regulation and privacy constraints in legal, diplomatic, and
medical domains
NLP
IMPLEMENTATION PACKAGES
Apache OpenNLP: Machine learning toolkit that provides tokenization, sentence
segmentation, part of speech tagging, named entity extraction, chunking, parsing,
coreference resolution, and so on.
Natural Language Toolkit (NLTK): the most popular python library for NLP, consisting of:
classification, tokenization, stemming, tagging, parsing, and others.
Stanford NLP: used for part-of-speech tagging, named entity recognition, coreference
resolution, sentiment analysis, etc.
MALLET: a Java package that includes Latent Dirichlet Allocation, document classification,
clustering, topic modeling, information extraction, and others.
etc.
LAB PREPARATION
PYTHON LIBRARIES REQUIRED
In this training we will use the Python libraries below:
nltk: the most popular NLP library in the python ecosystem
beautifulsoup4: library for extracting data from HTML and XML documents
pandas: library for data manipulation and analysis
scikit-learn: python machine learning library
matplotlib: library for 2-dimensional plotting
sastrawi: stemmer for Indonesian Language, ported from PHP
gensim: python library focused on analyzing plain-text documents for semantic structure
LAB 01
INSTALLATION AND CONFIG
LAB DESCRIPTION
What you will learn:
Anaconda installation
Create and manage anaconda
environment
Install python and the required
packages
Run Jupyter Notebook
Requirement :
Anaconda
Jupyter Notebook
Python 3 library
 numpy
 pandas
 scikit-learn
 nltk
 matplotlib
 beautifulsoup4
 sastrawi
STEP 01
INSTALL ANACONDA
Install Anaconda according to your OS.
Installer can be downloaded at
www.anaconda.com/download
STEP 02
CREATE ENVIRONMENT
A conda environment is basically a directory that contains all the packages we
install. We can have several conda environments with different package versions; for
example, we may need to run Python 2 in one environment and Python 3 in another.
By default, Anaconda creates one main environment called base (root).
There are two ways to create and manage environments: using Anaconda Navigator
(graphical user interface) or the Anaconda Prompt (command-line interface)
To create an environment through the GUI, run Anaconda Navigator
STEP 02
CREATE ENVIRONMENT
1. Choose Environments, and click Create
2. Name the environment, for example training1
3. Choose the Python version
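Alternatively, the same environment can be created from the Anaconda Prompt; a minimal sketch (the name training1 mirrors the GUI example, and older conda versions use activate without the conda prefix):

conda create --name training1 python=3
conda activate training1
conda install numpy pandas scikit-learn nltk matplotlib beautifulsoup4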
STEP 03
INSTALL PACKAGE
1. In the drop-down list, select Not Installed, then check the packages we will install
2. Click the Apply button
STEP 03
INSTALL PACKAGE
The Sastrawi package is installed through the CLI, using pip.
Follow these steps:
 Open the Anaconda Prompt
 Activate the environment with the command activate <environment name>
 Install Sastrawi with the command pip install Sastrawi
STEP 04
RUN JUPYTER NOTEBOOK
Install Jupyter Notebook through Anaconda Navigator: in the Home menu, click Launch
STEP 04
RUN JUPYTER NOTEBOOK
Jupyter Notebook will open as a tab in your browser
You can create a new folder or notebook by clicking New
STEP 05
CHECK PACKAGE VERSION
We can check the versions of all packages installed in our current environment with
the following code:

import pkg_resources
dists = [d for d in pkg_resources.working_set]
for i in dists:
    print(i)
Click Run to execute
CHAPTER 02
TEXT MINING WORKFLOW
COMMON
TEXT MINING WORKFLOW
1. Collection: obtaining or preparing the corpus. A corpus can be emails, web pages, social media content, Wikipedia, journal collections, documents, etc.
2. Preprocessing: common tasks are tokenization and segmentation, normalization, and noise removal.
3. Feature Extraction: feature extraction and selection; the process might include exploration and visualization to find suitable features.
4. Model Building: build, train, and test the model.
5. Model Evaluation: evaluate model performance. The metrics vary depending on the type of model used and the NLP task performed.
CHAPTER 03
TEXT PREPROCESSING
TEXT PREPROCESSING
COMMON TASKS
Some of the most common tasks in the preprocessing stage are:
Tokenization
Noise Removal
 Stop Words Removal
Normalization
 Stemming & Lemmatization
 Object Standardization
TOKENIZATION
The process of cutting text into smaller units, called tokens.
Tokens can be words, keywords, phrases, symbols, or even sentences.
Challenges in tokenization depend on the type of language.
Languages such as English are referred to as space-delimited, as most words are
separated from each other by white space.
Languages such as Chinese are referred to as unsegmented, as words do not have clear
boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and
morphological information.
NOISE REMOVAL
Every piece of text that is not relevant to the context and the final output can be considered noise. For example:
Stop words: commonly used words in a language - for example: are, me, from, in, etc.
URLs, links, tags
Social media entities (mentions, hashtags)
Punctuation
etc.
Approaches:
Prepare a noise dictionary (for example, a stop-word list), iterate over each token (word), and eliminate tokens that appear in the dictionary.
Use regular expressions (regex)
A combination of both
LAB 02
TOKENIZATION &
CLEANSING
LAB DESCRIPTION
What you will learn:
Tokenization and noise removal using nltk.
In this lab, the elements considered noise are non-alphabetical tokens, such as numbers and
punctuation.
STEP 01
TOKENIZATION AND CLEANSING
Type the following code into your
notebook and click Run
import nltk
import re

def tokenize_clean(text):
    # tokenize and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    # clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
STOP WORDS REMOVAL
Stop words are words in a sentence that are considered unimportant: if omitted, they do
not change the meaning or value of the sentence.
In most cases, stop words need to be removed so that the results of the analysis
are more accurate.
Stop words are usually words that have no meaning on their own, for example conjunctions
such as 'and', 'then', 'or', etc.
Stop words depend on the language and the domain of the problem to be solved. There are
no universal criteria for determining stop words.
LAB 03
STOP WORDS REMOVAL
LAB DESCRIPTION
What you will learn:
Show stop words list in nltk
Remove stop words from text
STEP 01
SHOW STOP WORDS LIST
Type the following code and click Run
import nltk
stopwords = nltk.corpus.stopwords.words('indonesian')
stopwords
STEP 02
STOP WORDS REMOVAL
Type the following code into
your notebook and click Run
import nltk
import re

def tokenize_clean(text):
    [..script as in labs 01]
    # clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
    return cleaned_token

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
NORMALIZATION
A series of tasks to process text into a form that meets certain standards
The aim is to improve the quality of the text so that subsequent processing performs better
For example: changing all letters to lowercase, changing numbers into letters, stemming,
expanding abbreviations into their original words, etc.
Normalization tries to make the same token/word be represented in the same form, so that
subsequent processing can run better
Important processes in text normalization are stemming and lemmatization
STEMMING &
LEMMATIZATION
Stemming and lemmatization are processes that reduce a word to a common base
form (stem).
Stemming is done by cutting off the end or the beginning of the word, taking into account
a list of common prefixes and suffixes of inflected words. For example:
 studying → study
 studies → studi
STEMMING &
LEMMATIZATION
Lemmatization takes into consideration the morphological analysis of the words. To do so, it
needs detailed dictionaries that the algorithm can look through to link the inflected
form back to its lemma. For example:
 studying → study
 studies → study
Lemmatization is usually done for languages whose words change form, for example in
English: go-went-gone, etc. In Bahasa Indonesia, stemming and lemmatization
are usually considered the same process.
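For English, the contrast can be demonstrated with NLTK; a minimal sketch (the WordNet data used by the lemmatizer is downloaded once; newer NLTK versions may also require the omw-1.4 resource):

import nltk
nltk.download('wordnet', quiet=True)  # dictionary data needed by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))                    # 'studi' - crude suffix stripping
print(lemmatizer.lemmatize("studies"))            # 'study' - dictionary lookup
print(lemmatizer.lemmatize("studying", pos="v"))  # 'study' - verb lemma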
LAB 04
STEMMING &
LEMMATIZATION
LAB DESCRIPTION
What you will learn:
Using Sastrawi to do stemming and lemmatization
STEP 01
INSERT CELL
Insert a new cell by choosing Insert → Insert Cell
STEP 02
STEMMING CODE
import nltk
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

def tokenize_and_stem(text):
    [..script as in labs 01]
    [..script as in labs 02]
    # stem using Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = []
    for token in cleaned_token:
        stems.append(stemmer.stem(token))
    return stems

word_1 = tokenize_and_stem("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_and_stem("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
CHAPTER 04
FEATURE EXTRACTION
FEATURE EXTRACTION
The process of converting text into a set of features prior to analysis
The types of features depend on the model and the method that will be used in the
mining/machine learning process
Some feature engineering techniques in NLP:
 Syntactic parsing
 Entity parsing
 Vectorization
FEATURE EXTRACTION
SYNTACTIC PARSING
Syntactic parsing is the process of determining sentence structure based on a certain
grammar and lexicon.
The structure of the sentence includes word level, word class level, phrase level, element
level, and clausal level.
Some important attributes of text syntax are:
 Dependency Grammar
 Part-of-Speech Tags
SYNTACTIC PARSING
PART OF SPEECH TAGGING
POS (Part-of-Speech) Tags are a way of categorizing word classes, such as nouns, verbs,
adjectives, etc.
POS Tagger is an application that is capable of automatically performing POS tag
annotations for each word in a document.
POS tagging produces a list of tuples, where each tuple is a (word, tag) pair.
Tags are labels that indicate whether a word is a noun, adjective, verb, and so on.
There are several POS tagsets (tag naming conventions); the most popular is the Penn
Treebank tagset.
PART OF SPEECH
TAGGING SIMPLE EXAMPLE
Sentence: Paul Pogba scored a late penalty
The sentence parses into a noun phrase, "Paul Pogba" (a named entity), and a verb phrase, "scored a late penalty": "scored" (verb), "a" (article), "late" (adjective), "penalty" (noun).
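A minimal sketch of POS tagging this sentence with NLTK's default tagger, which uses Penn Treebank tags (the tagger model is downloaded once; the output shown is indicative):

import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tokens = nltk.word_tokenize("Paul Pogba scored a late penalty")
print(nltk.pos_tag(tokens))
# e.g. [('Paul', 'NNP'), ('Pogba', 'NNP'), ('scored', 'VBD'),
#       ('a', 'DT'), ('late', 'JJ'), ('penalty', 'NN')]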
PART OF SPEECH TAGGING
USAGE
POS tagging is usually done before chunking, the extraction of phrases from a
sentence.
POS tagging is also used for sentence structure analysis and word sense disambiguation.
For example:
 can - We can help you
 can - It is kept in a can
Knowing the word class in a sentence makes it easier to determine its meaning.
ENTITY PARSING
NAMED ENTITY RECOGNITION
The task of identifying the names of all the people, organizations, and geographic
locations in a text, as well as time, currency, and percentage expressions
NER builds knowledge from text by extracting information such as:
 Names (people, organizations, locations, objects, etc.)
 Temporal expressions (calendar dates, times of day, durations, etc.)
 Numerical expressions (money, percentages, etc.)
Knowledge bases built with NER are widely used in technologies such as smart assistants,
machine translation, indexing in information retrieval, classification, automatic
summarization, etc.
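As an illustration, NLTK ships a pre-trained (English) named-entity chunker; a minimal sketch, with an illustrative sentence:

import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Paul Pogba joined Manchester United in August 2016"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)  # PERSON and ORGANIZATION subtrees mark the recognized entities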
NAMED ENTITY RECOGNITION
METHODS
Rule Based
 Uses a data dictionary consisting of names of countries, cities, companies, etc.
 Uses predefined, language-dependent rules based on linguistics, which help in the identification of named entities in a document.
 Constraint: requires the ability to define the rules, which is usually done by linguists, and depends heavily on the language used.
Machine Learning
 Uses statistical classification models and machine learning algorithms
 Constraint: requires annotated corpora for the domain of interest. Constructing an annotated corpus for a new domain is a time-consuming task and requires effort by human experts.
Hybrid
 Combines both methods, taking advantage of each.
VECTORIZATION
Transforming text into representations that are 'understood' by machines, i.e. numeric
vectors (or arrays), so that it can be used by various analytics and machine learning
algorithms
There are 2 types of text vectorization:
 Bag of words (or bag of n-grams): represents each word as a discrete element of a vector (or array) → an element of a bag
 Word embeddings: represent (or embed) words in a continuous vector space in which words
with similar meanings are mapped close to each other. New words in application texts that
were missing in training texts can still be classified through similar words.
VECTORIZATION
METHODS
Type: Bag of Words
 Frequency - counts term frequencies. Consideration: the most frequent words are not always the most informative.
 One-Hot Encoding - binarizes term occurrence (0, 1). Consideration: all words are equidistant, so normalization is extra important.
 TF-IDF - normalizes term frequencies across documents. Consideration: moderately frequent terms may not be representative of document topics.
Type: Word Embeddings
 Distributed Representations - context-based, continuous term-similarity encoding. Consideration: performance intensive; difficult to scale without additional tools (e.g., TensorFlow).
BAG OF WORDS
DEFINITION
A text representation that indicates the appearance of a token/word in a document.
Called a bag because it does not care about structure or sequence in the text.
The main components of BoW are:
 A vocabulary, or collection of known words, based on the text input
 A measure of the presence of the known words
The complexity of BoW techniques depends on how the vocabulary is built and on the
scoring method
BAG OF WORDS
SCORING METHOD
Binary: one-hot encoded vector
Counts: the number of times a word appears in each document
Frequency: the number of occurrences of a word in a document divided by the total number of
words in the document (count / total words)
TF-IDF: the frequency and relevance of words in a corpus
BAG OF WORDS
EXAMPLE
It was the best of times,
it was the worst of times,
it was the age of wisdom
The vocabulary:
{“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”}
If we treat each sentence as a separate document, the BoW vectors (one position per vocabulary word, in the order above) are:
“it was the best of times” → [1, 1, 1, 1, 1, 1, 0, 0, 0]
"it was the worst of times" → [1, 1, 1, 0, 1, 1, 1, 0, 0]
"it was the age of wisdom" → [1, 1, 1, 0, 1, 0, 0, 1, 1]
ONE-HOT ENCODING
DEFINITION
A representation of categorical variables as binary vectors.
Each integer value is represented as a binary vector that is all zeros except at the
index of the integer, which is marked with a 1.
For example, if we have the words {boy, chase, dog, playground}:
boy {1, 0, 0, 0}
chase {0, 1, 0, 0}
dog {0, 0, 1, 0}
playground {0, 0, 0, 1}
TF-IDF
TF-IDF: Term Frequency - Inverse Document Frequency
A way to determine the topic of a document based on the words or terms it contains
TF-IDF calculates relevance, not just frequency
The weight calculation in TF-IDF uses a statistical method that evaluates how important a
term is to a document
The greater the TF-IDF value of a word or term, the rarer the word and the more relevant it
is to the document
TF-IDF
USAGE
What is TF-IDF used for?
 Categorizing text; automatically creating tags or keywords for a document
 Determining the order of documents in search results (document relevance to a term)
 Fixing or extending stop-word lists
What is the difference between TF-IDF and sentiment analysis?
 Sentiment analysis classifies text based on 'positive', 'negative', or 'neutral' opinion values.
 TF-IDF classifies text based on its contents.
TF-IDF
TERM FREQUENCY
Term Frequency (TF) calculates the frequency of occurrence of a word or term (t) in a document (d).
A word may have a different occurrence count in a different document simply because document
lengths differ, so the count is normalized by document length.
TF calculation formula:
TF(t) = (number of occurrences of the term t) / (total number of words in the document)
TF-IDF
INVERSE DOCUMENT FREQUENCY
IDF calculates how important a word or term is across documents.
Certain terms, such as "are", "from", and "that", may appear many times but do not have much
influence. Therefore we reduce the weight of those words and increase the weight of the
rare ones, with the following calculation:
IDF(t) = log(number of documents / number of documents containing the term t)
The logarithm base is a convention; the worked example below uses base 10.
TF-IDF
EXAMPLE
Suppose a document has 100 words and the word cat appears 3 times.
The TF for cat is:
3/100 = 0.03
If there are 10 million documents and the word cat appears in 1,000 of them, the IDF is:
log10(10,000,000 / 1,000) = 4
The TF-IDF weight for the word cat is:
0.03 × 4 = 0.12
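The same arithmetic in Python, assuming the base-10 logarithm used in the example:

import math

tf = 3 / 100                            # term frequency of "cat"
idf = math.log10(10_000_000 / 1_000)    # base-10 logarithm, as in the example
print(tf * idf)                         # 0.12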
TF-IDF
UPDATE AND MAINTENANCE
In most cases the processed document collection grows continuously, so the TF-IDF values
need to be updated to include the new documents.
However, TF-IDF is calculated against a specific corpus, so the TF-IDF matrix cannot be
updated incrementally.
Several approaches can be taken to overcome this, including:
 Perform the TF-IDF calculation when needed: if there is a new document, compute the TF-IDF
values for the terms in that document
 Update regularly, when new documents reach a certain amount or age; the drawback is that
some terms may be ignored because they are not yet included in the vocabulary
ISSUES IN BOW
Vocabulary: requires good design, especially for managing size, because it affects the
sparsity of the document representation
Sparsity: the vector formed is a sparse vector, i.e. a vector whose elements are mostly null
or 0. Sparse representations are more difficult and less efficient to model, both
computationally (storage complexity and computation time) and informationally (the model
carries little information in a very large space)
Meaning: eliminating word order loses the context and meaning of words in
the text (semantics). Context and meaning are very useful in modeling, for instance to
distinguish different meanings of words due to different arrangements, to determine
synonyms, and so on
LAB 05
TF-IDF
LAB DESCRIPTION
What you will learn:
Run TF-IDF function
Requirement :
tokenize_and_stem function from previous labs
STEP 01
DATA INPUT
Create dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# we will use dummy documents as input, with 1 sentence per document
files = []
files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam bambu yang ia genggam di tangan.")
files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu saja untuk meratapi erupsi.")
files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang berwisata.")
files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi sendiri dan orang lain.")
files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen siap membantu dengan siaga 24 jam")
STEP 02
CORPUS PREPARATION
# prepare corpus, load it into a dictionary
token_dict = {}
i = 0
for t in files:
    filename = "file" + str(i)
    token_dict[filename] = t
    i = i + 1

# use Bahasa Indonesia stop words from the nltk corpus
stopwords = nltk.corpus.stopwords.words('indonesian')

# perform tf-idf vectorization, using the tokenize_and_stem we created in the previous lab
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words=stopwords)
tfs = tfidf.fit_transform(token_dict.values())
STEP 03
TF-IDF TRANSFORMATION
We test with a new sentence: which tokens are produced and what are their TF-IDF values?
Show the tokens produced and their TF-IDF values

str1 = 'Di kejauhan tampak seorang relawan pria dari Lombok sedang berjalan.'
response = tfidf.transform([str1])

# show result
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
WORD EMBEDDINGS
DEFINITION
A distributed word representation: a dense, low-dimensional, real-valued
representation of a word. A word representation is a mathematical object associated with
each word, often a vector.
WORD EMBEDDINGS
WHY
Word embeddings overcome BoW problems:
 They represent words as dense real-number vectors
 They include sentence context, determining the meaning of words by looking at their context
 They include information on the similarity of words in their representation: similar words are
represented by similar vectors
Vector values are learned using neural networks, so word embedding methods are often
associated with deep learning
Popular word embedding algorithms: Word2Vec, GloVe
WORD EMBEDDINGS
WORD2VEC
Word2Vec is a neural network with 2 layers: text as input and vectors as output
Developed by Mikolov et al. at Google in 2013
It determines the meaning of a word by using the other words around it (its context)
When a word w appears in a text, the context of w is the set of words before and after w
(usually within a specified window size)
For example:
…Menlu yang menghadiri dan membuka konferensi Afro-Asia mengharapkan kerjasama yang baik...
...bahwa tema yang diusung dalam konferensi tahun ini adalah penguatan Ekonomi...
...Wagub membuka Seminar Nasional dan Konferensi Daerah Ikatan Apoteker Indonesia…
This context is called local context
WORD2VEC
WORD SIMILARITY
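A minimal training sketch with gensim (listed among the required libraries); the tokenized sentences are toy stand-ins, and the parameter names follow the gensim 4.x API (older versions use size and iter):

from gensim.models import Word2Vec

# toy tokenized corpus; a real corpus would contain many more sentences
sentences = [
    ["menlu", "membuka", "konferensi", "afro-asia"],
    ["tema", "konferensi", "tahun", "ini", "penguatan", "ekonomi"],
    ["wagub", "membuka", "seminar", "dan", "konferensi", "daerah"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)
print(model.wv.most_similar("konferensi", topn=3))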
WORD2VEC
ADVANTAGE
The key benefit of the approach is that high-quality word embeddings can be learned
efficiently (low space and time complexity), allowing larger embeddings to be learned
(more dimensions) from much larger corpora of text (billions of words).
GOOGLE WORD2VEC
Pre-trained model
300-dimensional vectors
3 million words and phrases
Dataset: Google News (about 100 billion words)
Further info:
https://code.google.com/archive/p/word2vec/
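A sketch of loading the pre-trained vectors with gensim's KeyedVectors; it assumes the GoogleNews-vectors-negative300.bin file has already been downloaded:

from gensim.models import KeyedVectors

# assumes GoogleNews-vectors-negative300.bin has been downloaded beforehand
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(vectors.most_similar("king", topn=3))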
OTHER WORD
EMBEDDING
GloVe
FastText
LDA2Vec
StarSpace
Poincare embeddings
CHAPTER 05
OPINION MINING &
SENTIMENT ANALYSIS
DEFINITION
Information types in text: facts and opinions.
 Facts: objective expressions about something.
 Opinions: subjective expressions that describe people’s sentiments, appraisals, and feelings
toward a subject or topic.
Sentiment analysis: the analysis process used to obtain subjective information about a topic.
SENTIMENT ANALYSIS
USE CASE EXAMPLES
Opinions in the social and geopolitical context
Business and e-commerce applications, such as product reviews and movie ratings
Predicting stock prices based on people's opinions about companies and resources
Determining areas of a product that need to be improved by summarizing product reviews
Customer preference
OPINION
REPRESENTATION
Opinion holder: whose opinion is this?
Opinion target: what is this opinion about? E.g., a product, a service, an individual, an
organization, an event, or a topic → also called an entity. An entity can have many features
(aspects).
Opinion context: under what situation (e.g., time, location) was the opinion expressed?
Opinion sentiment: what does the opinion tell us about the opinion holder's feeling?
Positive, negative, and neutral are called the opinion orientation (also called sentiment
orientation or polarity)
Liu (2012) formulated the formal definition: an opinion is a quadruple (g, s, h, t), where
g is the opinion (or sentiment) target, s is the sentiment about the target, h is the opinion
holder, and t is the time when the opinion was expressed.
SENTIMENT ANALYSIS
LEVEL
Document-level sentiment analysis: determine whether a whole document, message, etc., is
overall positive or negative
Sentence-level sentiment analysis: determine the sentiment of each sentence within the
document
Aspect- or topic-based sentiment analysis: identify not only positive or negative sentences,
but also the specific topic/feature being referred to as positive or negative. There
may be more than 1 aspect in a sentence:
 e.g.: I love the display of the new phone but the battery life is terrible.
SENTIMENT ANALYSIS
PROCESS
Opinion Mining
 Entity extraction and categorization
 Aspect extraction and categorization
 Opinion holder extraction and categorization
 Time extraction and standardization
 Sentiment classification
 Opinion quadruple generation: produce all opinion quadruples (g, s, h, t) expressed in
document d, based on the results of the above tasks.
Opinion Summarization
 Opinions are subjective. An opinion from a single person (unless a VIP) is often not
sufficient for action. We need opinions from many people, hence the need for
opinion summarization.
DOCUMENT SENTIMENT
CLASSIFICATION
TECHNIQUES
Supervised learning: any existing supervised learning method can be applied, e.g.
Bayesian classification, Support Vector Machines, etc.
Unsupervised learning: using opinion words and phrases. Liu (2012) describes an algorithm
with 3 steps:
 Extract phrases containing adjectives or adverbs
 Estimate the semantic orientation/polarity of each phrase
 Given a review, compute the average opinion orientation of all phrases in the
review, and classify the review as recommended if the average is positive, and not
recommended otherwise.
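A minimal supervised-learning sketch with scikit-learn; the labeled reviews are illustrative assumptions, not a real training set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy labeled reviews; real training data would be much larger
texts = ["great product, works wonderfully",
         "terrible quality, very poor",
         "good value and beautiful design",
         "bad experience, broke in two days"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["what a wonderful camera"]))  # expected: ['positive']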
SENTIMENT ANALYSIS
IMPORTANT FEATURES
Terms and their frequency: individual words or n-grams and their frequency counts. Word
positions may also be considered, and the TF-IDF weighting scheme may be applied. These
features have been shown to be quite effective in sentiment classification
Part of speech: adjectives may be treated as special features
Opinion words and phrases: words that are commonly used to express positive or
negative sentiments. For example, beautiful, wonderful, and good are positive opinion words;
bad, poor, and terrible are negative opinion words.
Negations: important because their appearance often changes the opinion orientation
Syntactic dependency: word-dependency-based features generated from parsing or
dependency trees
SENTIMENT ANALYSIS -
CHALLENGES
A positive or negative sentiment word may have opposite orientations in different
application domains.
A sentence containing sentiment words may not express any sentiment. Question sentences
and conditional sentences are two important types, e.g., “Can you tell me which camera is
good?” and “If I can find a good camera in the shop, I will buy it.”
Not all conditional or interrogative sentences express no sentiments, e.g., “Does anyone
know how to repair this terrible printer” and “If you are looking for a good car, get
Toyota.”
Sarcastic sentences with or without sentiment words are hard to deal with, e.g., “What a
great car! It stopped working in two days.”
Many sentences without sentiment words can also imply opinions. e.g. “This washer uses a
lot of water” implies a negative sentiment.
CHAPTER 06
TOPIC MODELLING
TOPIC MODELLING
Topic modeling is an unsupervised machine learning approach to organizing text information
such that related pieces of text can be identified.
Topic modelling is basically document clustering in which documents and words are
clustered simultaneously
The topic modelling problem:
 Known: the text/document collection (corpus) and the number of topics
 Unknown: the actual topics and the topic distribution in each document
Topic modelling is used for:
 Discovering hidden topical patterns that are present across the collection
 Annotating documents according to these topics
 Using these annotations to organize, search, and summarize texts
TOPIC MODELLING
Basic assumptions:
 A document consists of a mixture of topics
 A topic is a collection of words
Topic = latent semantic concepts
Different approaches:
 Latent Semantic Analysis/Indexing (LSA/LSI) → linear algebra
 Probabilistic Latent Semantic Analysis (PLSA) → probabilistic
 Latent Dirichlet Allocation (LDA) → probabilistic
LATENT SEMANTIC
ANALYSIS
Decomposing the document-word matrix into document-topic and topic-word matrices using
Singular Value Decomposition (SVD)
Given m documents and n words in our vocabulary, we can construct an m-by-n matrix A
→ a sparse word-document co-occurrence matrix
 The simplest form of LSA uses raw counts, where a_ij is the number of times the j-th word
appears in the i-th document
 More advanced LSA often uses TF-IDF for the a_ij values
SVD decomposes matrix A into 3 matrices, A = U S Vᵀ, where:
 A is an m × n matrix
 U is an m × n orthogonal matrix
 S is an n × n diagonal matrix
 V is an n × n orthogonal matrix
LATENT SEMANTIC
ANALYSIS
Since A is most likely sparse, we perform dimensionality reduction using truncated
SVD
This keeps the t most significant dimensions in the transformed space.
LSA is quick and efficient, but has some shortcomings:
 Lack of interpretable embeddings
 Needs a really large set of documents and vocabulary to get accurate results
 Less efficient representation
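A minimal LSA sketch with scikit-learn's TruncatedSVD on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["a dog is chasing a boy on the playground",
        "the boy plays with the dog",
        "stock markets fell sharply today",
        "investors watch the stock market"]

A = TfidfVectorizer().fit_transform(docs)  # m-by-n document-term matrix
lsa = TruncatedSVD(n_components=2)         # keep the t = 2 most significant dimensions
doc_topic = lsa.fit_transform(A)           # document-topic representation
print(doc_topic.shape)                     # (4, 2)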
PROBABILISTIC LATENT
SEMANTIC ANALYSIS
PLSA uses a probabilistic method instead of SVD
The basic idea: find a probabilistic model P(D, W) such that, for any document d and word w,
P(d, w) corresponds to that entry in the document-term matrix.
PLSA assumptions:
 given a document d, topic z is present in that document with probability P(z|d)
 given a topic z, word w is drawn from z with probability P(w|z)
As its name implies, PLSA just adds a probabilistic treatment of topics and words on top of LSA.
PLSA
LIMITATIONS
PLSA is more flexible than LSA, but still has some limitations:
 The number of parameters grows linearly with the number of training documents → the model is
prone to overfitting
 It is not a well-defined generative model - there is no way to generalize to new, unseen documents
LATENT DIRICHLET
ALLOCATION
LDA is a Bayesian version of PLSA. It uses Dirichlet priors for the document-topic and
word-topic distributions, leading to better generalization.
Dirichlet: a probability distribution that does not sample from the space of real numbers;
instead, it samples over a probability simplex.
Probability simplex: a group of numbers that add up to 1. For example:
 (0.6, 0.4)
 (0.1, 0.1, 0.8)
 (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
The numbers represent probabilities over K distinct categories; in the above examples, K is
2, 3, and 6 respectively.
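A draw from a Dirichlet distribution is exactly such a simplex; a minimal sketch with NumPy:

import numpy as np

alpha = [0.5, 0.5, 0.5]              # concentration parameters for K = 3 categories
sample = np.random.dirichlet(alpha)  # one draw = one probability simplex
print(sample, sample.sum())          # e.g. [0.13 0.05 0.82] 1.0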
LATENT DIRICHLET
ALLOCATION MODEL
From a Dirichlet distribution Dir(α), draw a random sample θ representing the topic
distribution of a particular document.
From θ, we select a particular topic Z based on that distribution.
From another Dirichlet distribution Dir(β), draw a random sample φ representing the word
distribution of the topic Z. From φ, we choose the word w.
LDA typically works better than PLSA because it can generalize to new documents easily.
Some limitations:
It needs relatively large memory and processing time.
The model is difficult to explain.
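A minimal LDA sketch with gensim on a toy tokenized corpus:

from gensim import corpora
from gensim.models import LdaModel

texts = [["dog", "chases", "boy", "playground"],
         ["boy", "plays", "dog"],
         ["stock", "market", "prices", "fall"],
         ["investors", "watch", "stock", "market"]]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words corpus
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)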
CHAPTER 07
WRAPPING IT ALL TOGETHER
TEXT CLUSTERING
PROCESS FLOW
We will demonstrate the end-to-end process by performing document clustering.
The process flow is as follows:
 Text preprocessing, including text cleanup and text normalization
 Vector representation / feature extraction: using TF-IDF
 Model building: using K-Means
 Visualization
 Model evaluation
LAB 08
TEXT CLUSTERING
LAB DESCRIPTION
What you will learn:
How to create a document clustering program using a real dataset
Implement tokenization, stemming, and cleansing
K-Means implementation
Visualization using matplotlib
STEP 01
LIBRARY
Import all required libraries and click Run
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.externals import joblib  # note: recent scikit-learn versions removed sklearn.externals; use `import joblib` instead
from sklearn.manifold import MDS
STEP 02
DATA INPUT
Type the following code and click Run
Show the sample data

#load titles
titles = open('Judul Berita.txt').read().split('\n')

#load articles
article = open('Berita.txt', encoding="utf8").read().split('BERHENTI DISINI')

len(titles)
titles[:5]
len(article)
article[:5]
STEP 03
PARSING ARTICLES
Parse the articles from HTML format using the beautifulsoup package

article_clean = []
for text in article:
    text = BeautifulSoup(text, 'html.parser').getText()
    article_clean.append(text)
article = article_clean
print(article)
STEP 04
TOKENIZATION AND STEMMING
Do tokenization, stemming, and cleansing, as in Lab 03

def tokenize_and_stem(text):
    # tokenize and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    # clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    # clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
...
STEP 05
TOKENIZATION AND STEMMING (CONT.)
Do tokenization, stemming, and cleansing, as in Lab 03 (cont.)
Show the sample data

    ...
    # stem using Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = [stemmer.stem(t) for t in cleaned_token]
    return stems

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('ada ' + str(vocab_frame.shape[0]) + ' kata di vocab_frame')
print(vocab_frame.head())
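Note: the snippet above assumes totalvocab_stemmed and totalvocab_tokenized were built beforehand; the deck does not show that step. A minimal sketch of one common way to build them, using the imports from Step 01 (tokenize_only is a hypothetical helper mirroring tokenize_and_stem without the stemming step, so the two lists stay aligned):

def tokenize_only(text):
    # same tokenization and cleansing as tokenize_and_stem, but no stemming
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered = [t for t in tokens if re.search('[a-zA-Z]', t)]
    stopwords = nltk.corpus.stopwords.words('indonesian')
    return [t for t in filtered if t not in stopwords]

totalvocab_stemmed = []
totalvocab_tokenized = []
for doc in article:
    totalvocab_stemmed.extend(tokenize_and_stem(doc))  # stems, used as the index
    totalvocab_tokenized.extend(tokenize_only(doc))    # original words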
STEP 06
TF-IDF
Calculate the TF-IDF matrix
Show the matrix

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

# fit the vectorizer to the articles
tfidf_matrix = tfidf_vectorizer.fit_transform(article)
print(tfidf_matrix.shape)
print(tfidf_matrix)
STEP 07
K-MEANS MODELLING
Do K-Means modeling; in this case we use number of clusters = 3
Create a DataFrame with the format: rank - title - cluster

num_clusters = 3
km = KMeans(n_clusters=num_clusters, random_state=1000)
km.fit(tfidf_matrix)

# rank / order
ranks = [i for i in range(1, len(titles)+1)]

# cluster labels from k-means
clusters = km.labels_.tolist()

news = {'title': titles, 'rank': ranks, 'article': article, 'cluster': clusters}
frame = pd.DataFrame(news, index=[clusters], columns=['rank', 'title', 'cluster'])

# show the dataframe
print(frame)
frame['cluster'].value_counts()
STEP 08
DATA EXPLORATION
Display the clustering results and the top terms per cluster to determine the labels

print("Top terms per cluster:")

# the feature names come from the fitted vectorizer (this line is not shown in the deck)
terms = tfidf_vectorizer.get_feature_names()

# sort cluster centers by proximity to the centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# note: DataFrame.ix is removed in recent pandas; use .loc instead
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]:  # replace 6 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print()  # add whitespace
    print()  # add whitespace
    print("Cluster %d titles:" % i, end='')
    for title in frame.ix[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print()  # add whitespace
    print()
STEP 09
VISUALIZATION
Visualize the clustering results with MDS
From the data exploration step (Step 08) it can be seen that the 3 clusters formed are: sports, economy, and crime.
Set the colors and cluster labels

similarity_distance = 1 - cosine_similarity(tfidf_matrix)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(similarity_distance)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]

# set colors with a dictionary
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}

# dictionary for cluster names (chart legend)
cluster_names = {0: 'Olahraga', 1: 'Ekonomi', 2: 'Kriminal'}
STEP 09
VISUALIZATION
Set matplotlib to display charts inline
Type the following code

%matplotlib inline

# create a data frame with the MDS results plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')

# set up the plot
fig, ax = plt.subplots(figsize=(17, 9))  # set size
# ax.margins(0.05)  # optional, just adds 5% padding to the autoscaling
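The deck ends here; a minimal sketch of the plotting loop that typically follows, using the cluster_colors and cluster_names dictionaries defined above:

# plot each cluster with its own color and legend label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name])
ax.legend(numpoints=1)
plt.show()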
More Related Content

PPTX
NLP Introduction - Natural Language Processing and Artificial Intelligence Ov...
PPTX
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
PPTX
Natural Language Processing ktu syllabus module 1
PDF
INTRODUCTION TO Natural language processing
PPTX
Technical Development Workshop - Text Analytics with Python
PPTX
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
PDF
AIS Technical Development Workshop 2: Text Analytics with Python
PPTX
Natural Language processing using nltk.pptx
NLP Introduction - Natural Language Processing and Artificial Intelligence Ov...
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Natural Language Processing ktu syllabus module 1
INTRODUCTION TO Natural language processing
Technical Development Workshop - Text Analytics with Python
AcademyNaturalLanguageProcessing-EN-ILT-M02-Introduction.pptx
AIS Technical Development Workshop 2: Text Analytics with Python
Natural Language processing using nltk.pptx

Similar to Text Mining_big_data_machine_learning.pptx (20)

PDF
AM4TM_WS22_Practice_01_NLP_Basics.pdf
PDF
Natural Language Processing with Python
PPTX
LONGSEM2024-25_CSE3015_ETH_AP2024256000125_Reference-Material-I.pptx
PPTX
Natural Language Processing (NLP).pptx
PDF
Natural language processing (NLP) introduction
PPTX
Fake news detection
PPT
NLP Introduction.ppt machine learning presentation
PPTX
Introduction to Text Mining
PPTX
UNIT-1 and 2 Text and image classification .pptx
PDF
Capitalizing on Machine Reading to Engage Bigger Data
PDF
Machine Learning for Natural Language Processing| ashokveda . pdf
PPT
Lecture1 Natural Language Processing for
PPTX
NLP edmund retrievel system presentation.pptx
PDF
NLP for Everyday People
PDF
Natural Language Processing (NLP)
PPTX
Natural Language Processing Advancements By Deep Learning - A Survey
PDF
Analysing Demonetisation through Text Mining using Live Twitter Data!
PDF
Module 8: Natural language processing Pt 1
PDF
Natural Language Processing, Techniques, Current Trends and Applications in I...
PPTX
Weekairtificial intelligence 8-Module 7 NLP.pptx
AM4TM_WS22_Practice_01_NLP_Basics.pdf
Natural Language Processing with Python
LONGSEM2024-25_CSE3015_ETH_AP2024256000125_Reference-Material-I.pptx
Natural Language Processing (NLP).pptx
Natural language processing (NLP) introduction
Fake news detection
NLP Introduction.ppt machine learning presentation
Introduction to Text Mining
UNIT-1 and 2 Text and image classification .pptx
Capitalizing on Machine Reading to Engage Bigger Data
Machine Learning for Natural Language Processing| ashokveda . pdf
Lecture1 Natural Language Processing for
NLP edmund retrievel system presentation.pptx
NLP for Everyday People
Natural Language Processing (NLP)
Natural Language Processing Advancements By Deep Learning - A Survey
Analysing Demonetisation through Text Mining using Live Twitter Data!
Module 8: Natural language processing Pt 1
Natural Language Processing, Techniques, Current Trends and Applications in I...
Weekairtificial intelligence 8-Module 7 NLP.pptx
Ad

Recently uploaded (20)

PDF
1 - Historical Antecedents, Social Consideration.pdf
PDF
sbt 2.0: go big (Scala Days 2025 edition)
PDF
Credit Without Borders: AI and Financial Inclusion in Bangladesh
PDF
How ambidextrous entrepreneurial leaders react to the artificial intelligence...
PPT
Module 1.ppt Iot fundamentals and Architecture
PPTX
MicrosoftCybserSecurityReferenceArchitecture-April-2025.pptx
PDF
A contest of sentiment analysis: k-nearest neighbor versus neural network
PPTX
Build Your First AI Agent with UiPath.pptx
PPTX
Modernising the Digital Integration Hub
PPTX
The various Industrial Revolutions .pptx
PDF
Getting started with AI Agents and Multi-Agent Systems
PDF
NewMind AI Weekly Chronicles – August ’25 Week III
PDF
How IoT Sensor Integration in 2025 is Transforming Industries Worldwide
PPTX
TEXTILE technology diploma scope and career opportunities
PPTX
Microsoft Excel 365/2024 Beginner's training
PPTX
2018-HIPAA-Renewal-Training for executives
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
OpenACC and Open Hackathons Monthly Highlights July 2025
PDF
Enhancing plagiarism detection using data pre-processing and machine learning...
PPTX
Benefits of Physical activity for teenagers.pptx
1 - Historical Antecedents, Social Consideration.pdf
sbt 2.0: go big (Scala Days 2025 edition)
Credit Without Borders: AI and Financial Inclusion in Bangladesh
How ambidextrous entrepreneurial leaders react to the artificial intelligence...
Module 1.ppt Iot fundamentals and Architecture
MicrosoftCybserSecurityReferenceArchitecture-April-2025.pptx
A contest of sentiment analysis: k-nearest neighbor versus neural network
Build Your First AI Agent with UiPath.pptx
Modernising the Digital Integration Hub
The various Industrial Revolutions .pptx
Getting started with AI Agents and Multi-Agent Systems
NewMind AI Weekly Chronicles – August ’25 Week III
How IoT Sensor Integration in 2025 is Transforming Industries Worldwide
TEXTILE technology diploma scope and career opportunities
Microsoft Excel 365/2024 Beginner's training
2018-HIPAA-Renewal-Training for executives
Zenith AI: Advanced Artificial Intelligence
OpenACC and Open Hackathons Monthly Highlights July 2025
Enhancing plagiarism detection using data pre-processing and machine learning...
Benefits of Physical activity for teenagers.pptx
Ad

Text Mining_big_data_machine_learning.pptx

  • 2. HOW THIS COURSE IS DELIVERED Text mining is basically data mining on unstructured text. It’s an implementation of the many techniques explained in previous sessions. Combining the basic theory by focusing on implementation using Python. The basic explanation of the theory and algorithm is not explained using in depth mathematical and statistical approach, to make it easier for participants who do not have statistics or mathematics background
  • 3. AGENDA Definition of text mining and common application Text mining and NLP preprocessing technique How to extract important information from text Understand how text is handled in Python
  • 5. TEXT MINING DEFINITION Broad umbrella terms describing a range of technologies for analyzing and processing semi structured and unstructured text data. The unifying theme behind each of these technologies is the need to “turn unstructured text into structured data” so powerful algorithms can be applied to large document databases. Converting text into a structured, numerical format and applying analytical algorithms require knowing how to both use and combine techniques for handling text, ranging from individual words to documents to entire document databases In building a statistical language system, it is best to devise a model that can make good use of available data, even if the model seems overly simplistic.
  • 6. TEXT MINING PURPOSE Turn text data into high-quality information and/or actionable knowledge  Minimizes human effort on consuming text data  Supplies knowledge for optimal decision making actionable knowledge →
  • 7. TEXT MINING APPLICATIONS Summary: search for the most important information from a text. Chat Bot: an automatic question and answer application between machines and humans. Text categorization: determine the topic of a particular document Keyword Tag: keyword tags are selected keywords that represent text. Sentiment Analysis: determine the sentiment or value of opinion in a text, these sentiments can be negative, neutral, or positive sentiments. Speech-to-text and text-to-speech conversions: Turns sound into text and vice versa Translator Machine: Translation of text from one language to another Spelling Checker
  • 8. HUMANS AS SUBJECTIVE SENSORS Real World Weather Locations Network Sensor Thermometer Geo Sensor Network Sensor 3o C, 15o F, ... 41o N 120o W ... 10110101010101 Sense Report Real World Human Sensor Perceive Express Data Text Data
  • 9. TEXT MINING LANDSCAPE Real World Perceive Express 2. Mining knowledge about language : word mining and association 3. Mining content of text data : topic mining and analysis 4. Mining knowledge about the observer : opinion mining & sentiment analysis 5. Infer other real world variables : predictive analysis 1. NLP and text representation Text Data
  • 10. NLP DEFINITION NLP is a research area of computer science, artificial intelligence, and computational linguistics, concerned with the interactions between computers and human natural languages. Helps computers understand, interpret, and manipulate human language. Not only understand the word, but also how these words are interconnected into a meaningful information
  • 11. NLP BASIC CONCEPTS a dog is chasing a boy on the playground det noun aux verb det noun prep det noun Noun Phrase Complex Verb Noun Phrase Noun Phrase Verb Phrase Prep Phrase Verb Phrase Sentence Lexical Analysis (POS Tagging) Syntactic Analysis (Parsing) A person saying this may be reminding another person to get the dog back. Pragmatic Analysis (speech act) Semantic Analysis : Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1)
  • 12. NLP CHALLENGES Language is ambiguous need context to explain →  The same word can mean something else (homograph)  Bank - Sloping land (especially the slope beside a body of water)  Bank - A financial institution that accepts deposits and channels the money into lending activities  Different words means the same (synonyms) Human errors - misspellings, typos, abbreviations, social languages, etc. Each language is different in terms of structure, vocabulary, etc Special requirements: regulation and privacy related to legal, diplomatic, medical
  • 13. NLP IMPLEMENTATION PACKAGES Apache OpenNLP: Machine learning toolkit that provides tokenization, sentence segmentation, part of speech tagging, named entity extraction, chunking, parsing, coreference resolution, and so on. Natural Language Toolkit (NLTK): the most popular python library for NLP, consisting of: classification, tokenization, stemming, tagging, parsing, and others. Stanford NLP: used for part of speech tagging, named entity recognizer, coreference resolution, sentiment analysis, etc. MALLET: is a JAVA package consists of Latent Dirichlet Allocation, document classification, clustering, topic modeling, information extraction, and others. etc.
  • 14. LAB PREPARATION PYTON LIBRARY REQUIRED In this training we will use below python library nltk: the most popular NLP library in the python ecosystem beautifulsoup4: library for extracting data from HTML and XML documents pandas: library for data manipulation and analysis scikit-learn: python machine learning library matplotlib: library for 2-dimensional plotting sastrawi: stemmer for Indonesian Language, ported from PHP gensim: python library focused on analyzing plain-text documents for semantic structure
  • 15. LAB 01 INSTALLATION AND CONFIG BIG DATA ANALYTICS CE
  • 16. LAB DESCRIPTION What you will learn: Anaconda installation Create and manage anaconda environment Install python and the required packages Run Jupyter Notebook Requirement : Anaconda Jupyter Notebook Python 3 library  numpy  pandas  scikit-learn  nltk  matplotlib  beautifulsoup4  sastrawi
  • 17. STEP 01 INSTALL ANACONDA Install Anaconda according to your OS. Installer can be downloaded at www.anaconda.com/download
  • 18. STEP 02 CREATE ENVIRONMENT Conda environment is basically a certain directory that contains all the packages we install. We can have several conda environments with different package versions. For example we need to run python 2 in one environment, and python 3 in another. By default, Anaconda creates one main environment called base (root). There are two ways to create and manage the environment: by using Anaconda Navigator (Graphical User Interface) or Anaconda prompt (Command Line Interface) To create an environment through the GUI, run Anaconda Navigator
  • 19. STEP 02 CREATE ENVIRONMENT 1. Choose Environment, and click Create 2. Name the environment, for example training1 3. Choose python version
  • 20. STEP 03 INSTALL PACKAGE 1. In the drop down list, select Not Installed, then check the packages that we will install 2. Click Apply button
  • 21. STEP 03 INSTALL PACKAGE Sastrawi package installed through CLI, by using pip Follow this steps :  Open Anaconda Prompt  Activate environment with command activate <environment name>  Install Sastrawi by using command pip install Sastrawi
  • 22. STEP 04 RUN JUPYTER NOTEBOOK Install Jupyter Notebook through Anaconda Navigator in Home menu, and click Launch
  • 23. STEP 04 RUN JUPYTER NOTEBOOK Jupyter notebook will be opened as a tab in your browser You can create new folder or notebook by clicking New
  • 24. STEP 05 CHECK PACKAGE VERSION We can check the version of all package installed in our current environment, with the following code : import pkg_resources dists = [d for d in pkg_resources.working_set] for i in dists: print(i) Click Run to execute
  • 25. CHAPTER 02 TEXT MINING WORKFLOW BIG DATA ANALYTICS CE
  • 26. COMMON TEXT MINING WORKFLOW UNSTRUCTURED DATA: TEXT MINING Feature Extraction Feature extraction and selection. In the process might include exploration and visualization to find the suitable features. Collection Obtaining or preparing corpus. Corpus can be emails, web pages, social media contents, wikipedia, journal collection, documents, etc. Preprocessing Common tasks in data preprocessing are tokenization and segmentation, normalization and noise removal. Model Building Build, train and test the model. Model Evaluation Evaluate model performance. The metrics can vary depends on the type of model used and the NLP task performed.
  • 28. TEXT PREPROCESSING COMMON TASKS Some of the most common tasks in the preprocessing stage are:
Tokenization
Noise Removal  Stop Words Removal
Normalization  Stemming & Lemmatization  Object Standardization
  • 29. TOKENIZATION The process of cutting text into smaller units, called tokens. Tokens can be words, keywords, phrases, symbols, or even sentences. The challenges in tokenization depend on the type of language. Languages such as English are referred to as space-delimited, as most words are separated from each other by white space. Languages such as Chinese are referred to as unsegmented, as words do not have clear boundaries. Tokenizing sentences in an unsegmented language requires additional lexical and morphological information.
  • 30. NOISE REMOVAL Every piece of text that is not relevant to the context and final output can be considered noise. For example: stop words (commonly used words of a language, e.g. are, me, from, in, etc.), URLs and links, social media entities (mentions, hashtags), punctuation, etc. Approaches (a small sketch follows below): prepare a noise dictionary and iterate over each token, eliminating tokens that appear in the dictionary (for example, a stop-word list); use regular expressions (regex); or combine both.
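As an illustration of the regex approach, the snippet below strips URLs and social media entities before tokenization. The patterns and the sample sentence are our own illustrative choices, not part of the lab material:

import re

def remove_noise(text):
    # drop URLs, then mentions/hashtags, then collapse leftover whitespace
    text = re.sub(r'https?://\S+', ' ', text)
    text = re.sub(r'[@#]\w+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(remove_noise("Info erupsi di https://example.com @bnpb #siaga"))
# -> 'Info erupsi di'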
  • 32. LAB DESCRIPTION What you will learn: tokenization and noise removal using nltk. In this lab, the elements considered noise are non-alphabetical tokens, such as numbers and punctuation.
  • 33. STEP 01 TOKENIZATION AND CLEANSING Type the following code into your notebook and click Run

import nltk
import re

def tokenize_clean(text):
    #tokenization and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    #clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
  • 34. STOP WORDS REMOVAL Stop words are words in a sentence that are considered unimportant: omitting them does not change the meaning or value of the sentence. In most cases, stop words need to be removed/cleaned so that the results of the analysis are more accurate. Stop words are usually words that have no meaning on their own, for example conjunctions such as 'and', 'then', 'or', etc. Stop words depend on the language and the domain of the problem to be solved; there are no universal criteria for determining stop words.
  • 35. LAB 03 STOP WORDS REMOVAL
  • 36. LAB DESCRIPTION What you will learn: show the stop words list in nltk; remove stop words from text
  • 37. STEP 01 SHOW STOP WORDS LIST Type the following code and click Run

import nltk
# the first use may require: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('indonesian')
stopwords
  • 38. STEP 02 STOP WORDS REMOVAL Type the following code into your notebook and click Run

import nltk
import re

def tokenize_clean(text):
    [..same code as in the previous lab..]
    #clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
    return cleaned_token

word_1 = tokenize_clean("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_clean("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
  • 39. NORMALIZATION A series of tasks to process text into a form with certain standards. The aim is to improve the quality of the text so that subsequent processing performs better. Examples: changing all letters to lowercase, changing numbers into letters, stemming, expanding abbreviations into their original words, etc. Normalization tries to make the same token/word be represented in the same form, so that subsequent processing runs better. Important processes in text normalization are stemming and lemmatization.
  • 40. STEMMING & LEMMATIZATION Stemming and lemmatization are processes that reduce a word to a common base form (stem). Stemming is done by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes of inflected words. For example:
 studying → study
 studies → studi
  • 41. STEMMING & LEMMATIZATION Lemmatization takes into consideration the morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can look through to link a word form back to its lemma. For example:
 studying → study
 studies → study
Lemmatization is usually done for languages whose words change form, for example in English: go-went-gone, etc. In Indonesian, stemming and lemmatization are usually considered the same process.
  • 42. LAB 04 STEMMING & LEMMATIZATION
  • 43. LAB DESCRIPTION What you will learn: Using Sastrawi to do stemming and lemmatization
  • 44. STEP 01 INSERT CELL Insert a new cell by choosing Insert → Insert Cell
  • 45. STEP 02 STEMMING CODE

import nltk
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

def tokenize_and_stem(text):
    [..same code as in Lab 02..]
    [..same code as in Lab 03..]
    #stem using the Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = []
    for token in cleaned_token:
        stems.append(stemmer.stem(token))
    return stems

word_1 = tokenize_and_stem("Budi dan Badu bermain bola di sekolah")
word_2 = tokenize_and_stem("Apakah Romi dan Julia saling mencintai saat mereka berjumpa di persimpangan jalan?")
print(word_1)
print(word_2)
  • 47. FEATURE EXTRACTION The process of converting text into a set of features prior to analysis. The types of features depend on the model and on the method that will be used in the mining/machine learning process. Some feature engineering techniques in NLP:
 Syntactic parsing
 Entity parsing
 Vectorization
  • 48. FEATURE EXTRACTION SYNTACTIC PARSING Syntactic parsing is the process of determining sentence structure based on a certain grammar and lexicon. The structure of the sentence includes the word level, word class level, phrase level, element level, and clausal level. Some important attributes of text syntax are:
 Dependency grammar
 Part-of-speech tags
  • 49. SYNTACTIC PARSING PART OF SPEECH TAGGING POS (part-of-speech) tags are a way of categorizing word classes, such as nouns, verbs, adjectives, etc. A POS tagger is an application that automatically performs POS tag annotation for each word in a document. POS tagging produces a list of tuples, where each tuple is a (word, tag) pair. Tags are labels that indicate whether a word is a noun, adjective, verb, and so on. There are several POS tagsets, or POS tag naming conventions; the most popular is the Penn Treebank tagset.
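A minimal POS tagging sketch with nltk, which by default uses the Penn Treebank tagset; the sentence is taken from the example on the next slide, and the tags in the comment are what the pretrained tagger typically returns:

import nltk
# one-time downloads may be required:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Paul Pogba scored a late penalty")
print(nltk.pos_tag(tokens))
# e.g. [('Paul', 'NNP'), ('Pogba', 'NNP'), ('scored', 'VBD'),
#       ('a', 'DT'), ('late', 'JJ'), ('penalty', 'NN')]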
  • 50. PART OF SPEECH TAGGING SIMPLE EXAMPLE
Sentence: Paul Pogba scored a late penalty
Tags: Paul Pogba (named entity, noun phrase) | scored (verb) | a (article) | late (adjective) | penalty (noun); "scored a late penalty" forms the verb phrase.
  • 51. PART OF SPEECH TAGGING USAGE POS tagging is usually done before the chunking process, i.e. extracting phrases from a sentence. POS tagging is also used for sentence structure analysis and word sense disambiguation. For example:
 can - We can help you (verb)
 can - It is kept in a can (noun)
By knowing the word class in a sentence, it is easier to determine its meaning.
  • 52. ENTITY PARSING NAMED ENTITY RECOGNITION The task of identifying the names of all the people, organizations, and geographic locations in a text, as well as time, currency, and percentage expressions. It builds knowledge from text by extracting information such as:
 Names (people, organizations, locations, objects, etc.)
 Temporal expressions (calendar dates, times of day, durations, etc.)
 Numerical expressions (money, percentages, etc.)
A knowledge base built with NER is widely used in technologies such as smart assistants, machine translation, indexing in information retrieval, classification, automatic summarization, etc.
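A minimal NER sketch using nltk's pretrained chunker (a machine-learning approach, in the terms of the next slide); the sentence is an invented example, and the entity labels in the comment are typical outputs:

import nltk
# one-time downloads may be required:
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Paul Pogba joined Manchester United in August"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)
# named entities appear as subtrees, e.g. (PERSON Paul/NNP Pogba/NNP)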
  • 53. NAMED ENTITY RECOGNITION METHODS
Rule based
 Uses a data dictionary consisting of names of countries, cities, companies, etc.
 Uses predefined, language-dependent rules based on linguistics, which help identify named entities in a document.
 Constraint: requires the ability to define the rules, which is usually done by linguists, and has a large dependence on the language used.
Machine learning
 Uses statistical classification models and machine learning algorithms.
 Constraint: requires annotated corpora for the domain of interest. Constructing an annotated corpus for a new domain is a time-consuming task and requires effort by human experts.
Hybrid
 Combines both methods, taking advantage of each.
  • 54. VECTORIZATION Transforming text into representations that are 'understood' by machines, i.e. numeric vectors (or arrays), so that they can be used by various analytics and machine learning algorithms. There are 2 types of text vectorization:
 Bag of words (or bag of n-grams): represents words as discrete elements of a vector (or array) → elements of a bag
 Word embeddings: represent (or embed) words in a continuous vector space in which words with similar meanings are mapped close to each other. New words in application texts that were missing from the training texts can still be classified through similar words.
  • 55. VECTORIZATION METHODS
Bag of Words
 Frequency: counts term frequencies. Consideration: the most frequent words are not always the most informative.
 One-Hot Encoding: binarizes term occurrence (0, 1). Consideration: all words are equidistant, so normalization is extra important.
 TF-IDF: normalizes term frequencies across documents. Consideration: moderately frequent terms may not be representative of document topics.
Word Embeddings
 Distributed Representations: context-based, continuous term similarity encoding. Consideration: performance intensive; difficult to scale without additional tools (e.g., TensorFlow).
  • 56. BAG OF WORDS DEFINITION A text representation that indicates the appearance of a token/word in a document. It is called a bag because it does not care about the structure or sequence of the text. The main components of BoW are:
 A vocabulary, or collection of known words, based on the input text
 A measure of the presence of the known words
The complexity of a BoW technique depends on how the vocabulary is built and on the scoring method.
  • 57. BAG OF WORDS SCORING METHODS
Binary: one-hot encoded vector
Counts: count the number of times a word appears in each document
Frequency: the number of occurrences of a word in a document relative to the total number of words in the document (count / total words)
TF-IDF: frequency and relevance of words in a corpus
  • 58. BAG OF WORDS EXAMPLE It was the best of times, it was the worst of times, it was the age of wisdom
The vocabulary: {"it", "was", "the", "best", "of", "times", "worst", "age", "wisdom"}
If we treat each sentence as a separate document, the BoW vectors (one element per vocabulary word, in the order above) are:
"it was the best of times" → [1, 1, 1, 1, 1, 1, 0, 0, 0]
"it was the worst of times" → [1, 1, 1, 0, 1, 1, 1, 0, 0]
"it was the age of wisdom" → [1, 1, 1, 0, 1, 0, 0, 1, 1]
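The same example can be reproduced with scikit-learn's CountVectorizer; a minimal sketch (note that scikit-learn orders vocabulary columns alphabetically, not in order of first appearance as on the slide):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["it was the best of times",
        "it was the worst of times",
        "it was the age of wisdom"]

# binary=True records presence/absence instead of raw counts
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
print(bow.toarray())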
  • 59. ONE-HOT ENCODING DEFINITION A representation of categorical variables as binary vectors. Each integer value is represented as a binary vector of all zero values except at the index of the integer, which is marked with a 1. For example, if we have the words {boy, chase, dog, playground}:
boy → [1, 0, 0, 0]
chase → [0, 1, 0, 0]
dog → [0, 0, 1, 0]
playground → [0, 0, 0, 1]
  • 60. TF-IDF TF-IDF: Term Frequency - Inverse Document Frequency. A way to determine the topic of a document based on the words or terms in the document. TF-IDF measures relevance, not just frequency: the weight calculation uses a statistical method that evaluates how important a term is to a document. The higher the TF-IDF value of a word or term, the more frequent it is in the document and the rarer it is across the corpus, and hence the more relevant it is to that document.
  • 61. TF-IDF USAGE What is TF-IDF used for?
 Categorizing text, automatically creating tags or keywords for a document
 Determining the order of documents in search results (document relevance to a term)
 Fixing or extending stop-word lists
Difference between TF-IDF and sentiment analysis?
 Sentiment analysis classifies text based on 'positive', 'negative', or 'neutral' opinion values.
 TF-IDF characterizes text based on its contents.
  • 62. TF-IDF TERM FREQUENCY Term frequency (TF) measures the frequency of occurrence of a word or term (t) in a document (d). Raw counts are not comparable across documents, because documents have different lengths, so the count is normalized by the document length. TF calculation formula:
TF(t) = (number of occurrences of the term t in the document) / (total number of terms in the document)
  • 63. TF-IDF INVERSE DOCUMENT FREQUENCY IDF measures how informative a word or term is across the document collection. Certain terms, such as "are", "from", and "that", may appear many times but carry little information. Therefore it is necessary to reduce the weight of those words and increase the weight of the rare ones, with the following calculation:
IDF(t) = log(number of documents / number of documents containing the term t)
(The logarithm base varies by implementation; the natural log is common, and the worked example on the next slide uses base 10.)
  • 64. TF-IDF EXAMPLE Suppose a document has 100 words, in which the word cat appears 3 times.
TF for cat is: 3/100 = 0.03
If there are 10 million documents and the word cat appears in 1,000 of them, then the IDF is: log10(10,000,000 / 1,000) = 4
The TF-IDF weight for the word cat is 0.03 x 4 = 0.12
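The same arithmetic, written out as a small Python check (base-10 logarithm, matching the example above):

import math

tf = 3 / 100                           # 3 occurrences of "cat" in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # = 4.0
print(tf * idf)                        # 0.12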
  • 65. TF-IDF UPDATE AND MAINTENANCE In most cases the document collection grows continuously, so the TF-IDF values need to be updated to include the new documents. However, TF-IDF is calculated against a fixed corpus, so the TF-IDF matrix cannot be updated incrementally. Several approaches can overcome this, including:
 Performing the TF-IDF calculation on demand: when a new document arrives, its terms are recalculated for TF-IDF values
 Updating regularly, when new documents reach a certain amount or age; the drawback is that some terms may be ignored because they are not yet included in the vocabulary
  • 66. ISSUES IN BOW
Vocabulary: requires careful design, especially of its size, because it affects the sparsity of the document representation.
Sparsity: the resulting vector is a sparse vector, i.e. a vector in which the majority of elements are 0. Sparse representations are harder and less efficient to model, both computationally (storage and computation time) and informationally (the model carries little information in a very large space).
Meaning: discarding word order loses the context and meaning of words in the text (semantics). Context and meaning are very useful in modeling, for instance to distinguish the different meanings of words in different arrangements, to determine synonyms, and so on.
  • 67. LAB 05 TF-IDF
  • 68. LAB DESCRIPTION What you will learn: run the TF-IDF function. Requirement: the tokenize_and_stem function from the previous labs
  • 69. STEP 01 DATA INPUT Create the dataset

from sklearn.feature_extraction.text import TfidfVectorizer

#we will use dummy documents as input, with 1 sentence per document
files = []
files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam bambu yang ia genggam di tangan.")
files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu saja untuk meratapi erupsi.")
files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang berwisata.")
files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi sendiri dan orang lain.")
files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen siap membantu dengan siaga 24 jam")
  • 70. STEP 02 CORPUS PREPARATION

#prepare the corpus, load it into a dictionary
token_dict = {}
i = 0
for t in files:
    filename = "file" + str(i)
    token_dict[filename] = t
    i = i + 1

#use the Indonesian stop words from the nltk corpus
stopwords = nltk.corpus.stopwords.words('indonesian')

#perform tf-idf vectorization, using the tokenize_and_stem we created in the previous lab
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words=stopwords)
tfs = tfidf.fit_transform(token_dict.values())
  • 71. STEP 03 TF-IDF TRANSFORMATION We test with a new sentence to see which tokens are produced and their TF-IDF values.

str1 = 'Di kejauhan tampak seorang relawan pria dari Lombok sedang berjalan.'
response = tfidf.transform([str1])

#show the result (in newer scikit-learn, use get_feature_names_out())
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
  • 72. WORD EMBEDDINGS DEFINITION A distributed word representation: a dense, low-dimensional, real-valued representation of a word. A word representation is a mathematical object associated with each word, often a vector.
  • 73. WORD EMBEDDINGS WHY Word embeddings overcome the BoW problems:
 They represent words as dense real-number vectors
 They include sentence context, determining the meaning of words by looking at their context
 They include information on the similarity of words in their representation: similar words are represented by similar vectors
The vector values are learned using neural networks, so word embedding methods are often associated with deep learning. Popular word embedding algorithms: Word2Vec, GloVe
  • 74. WORD EMBEDDINGS WORD2VEC Word2Vec is a neural network with 2 layers, taking text as input and producing vectors as output. Developed by Mikolov et al. at Google in 2013. It determines the meaning of a word by using the other words around it (its context). When a word w appears in a text, the context of w is the words before and after w (usually within a specified window size). For example (see the training sketch after this slide):
…Menlu yang menghadiri dan membuka konferensi Afro-Asia mengharapkan kerjasama yang baik...
...bahwa tema yang diusung dalam konferensi tahun ini adalah penguatan Ekonomi...
...Wagub membuka Seminar Nasional dan Konferensi Daerah Ikatan Apoteker Indonesia…
This context is called the local context.
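A minimal gensim training sketch on a toy tokenized corpus; the token lists below are shortened from the example snippets above, purely for illustration. The vector_size parameter name is from gensim 4.x (older 3.x releases call it size):

from gensim.models import Word2Vec

# each document is a list of tokens; in practice, use the tokenized corpus from the labs
sentences = [
    ["menlu", "membuka", "konferensi", "afro-asia", "kerjasama"],
    ["tema", "konferensi", "tahun", "ini", "penguatan", "ekonomi"],
    ["wagub", "membuka", "seminar", "konferensi", "apoteker"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)
print(model.wv["konferensi"].shape)          # (50,) dense vector
print(model.wv.most_similar("konferensi"))   # context-based neighbors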
  • 76. WORD2VEC ADVANTAGE The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings (more dimensions) to be learned from much larger corpora of text (billions of words).
  • 77. GOOGLE WORD2VEC Pre-trained model; 300-dimensional vectors; 3 million words and phrases; dataset: Google News (about 100 billion words). Further info: https://code.google.com/archive/p/word2vec/
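The pretrained vectors can be loaded with gensim; a minimal sketch, assuming the archive's binary file has been downloaded locally (the file name below is the usual release name, an assumption on our part):

from gensim.models import KeyedVectors

# file from the archive page above; several GB uncompressed
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(wv["computer"].shape)            # (300,)
print(wv.most_similar("king", topn=5)) # nearest neighbors in embedding space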
  • 79. CHAPTER 05 OPINION MINING & SENTIMENT ANALYSIS
  • 80. DEFINITION There are two types of information in text: facts and opinions.
 Facts: objective expressions about something.
 Opinions: subjective expressions that describe people's sentiments, appraisals, and feelings toward a subject or topic.
Sentiment analysis: the analysis process for obtaining subjective information about a topic.
  • 81. SENTIMENT ANALYSIS USE CASE EXAMPLES
Opinions in the social and geopolitical context
Business and e-commerce applications, such as product reviews and movie ratings
Predicting stock prices based on people's opinions about companies and resources
Determining product areas that need improvement by summarizing product reviews
Customer preference
  • 82. OPINION REPRESENTATION
Opinion holder: whose opinion is this?
Opinion target: what is this opinion about? E.g., a product, a service, an individual, an organization, an event, or a topic; also called an entity. An entity can have many features (aspects).
Opinion context: under what situation (e.g., time, location) was the opinion expressed?
Opinion sentiment: what does the opinion tell us about the opinion holder's feeling? Positive, negative, and neutral are called the opinion orientation (also called sentiment orientation or polarity).
Liu (2012) formulated the formal definition: an opinion is a quadruple (g, s, h, t), where g is the opinion (or sentiment) target, s is the sentiment about the target, h is the opinion holder, and t is the time when the opinion was expressed.
  • 83. SENTIMENT ANALYSIS LEVELS
Document-level sentiment analysis: determine whether a whole document, message, etc., is overall positive or negative.
Sentence-level sentiment analysis: determine the sentiment of each sentence within the document.
Aspect- or topic-based sentiment analysis: identify not only positive or negative sentences, but also the specific topic/feature being referred to as positive or negative. There may be more than one aspect in a sentence:
 e.g.: I love the display of the new phone but the battery life is terrible.
  • 84. SENTIMENT ANALYSIS PROCESS
Opinion mining
 Entity extraction and categorization
 Aspect extraction and categorization
 Opinion holder extraction and categorization
 Time extraction and standardization
 Sentiment classification
 Opinion quadruple generation: produce all opinions (g, s, h, t) expressed in document d based on the results of the above tasks.
Opinion summarization
 Opinions are subjective. An opinion from a single person (unless a VIP) is often not sufficient for action. We need opinions from many people, hence the need for opinion summarization.
  • 85. DOCUMENT SENTIMENT CLASSIFICATION TECHNIQUES
Supervised learning: any existing supervised learning method can be applied, e.g. Bayesian classification, Support Vector Machines, etc.
Unsupervised learning: using opinion words and phrases. Liu (2012) describes an algorithm with 3 steps:
 Extract phrases containing adjectives or adverbs
 Estimate the semantic orientation/polarity of each phrase
 Given a review, compute the average opinion orientation of all phrases in the review, and classify the review as recommended if the average is positive, not recommended otherwise.
  • 86. SENTIMENT ANALYSIS IMPORTANT FEATURES
Terms and their frequency: individual words or n-grams and their frequency counts. Word positions may also be considered, and the TF-IDF weighting scheme may be applied. These features have been shown to be quite effective in sentiment classification.
Part of speech: adjectives may be treated as special features.
Opinion words and phrases: words that are commonly used to express positive or negative sentiments. For example, beautiful, wonderful, and good are positive opinion words; bad, poor, and terrible are negative opinion words.
Negations: important because their appearance often flips the opinion orientation.
Syntactic dependency: word-dependency-based features generated from parsing or dependency trees.
  • 87. SENTIMENT ANALYSIS - CHALLENGES
A positive or negative sentiment word may have opposite orientations in different application domains.
A sentence containing sentiment words may not express any sentiment. Question sentences and conditional sentences are two important types, e.g., "Can you tell me which camera is good?" and "If I can find a good camera in the shop, I will buy it." Yet not all conditional or interrogative sentences express no sentiment, e.g., "Does anyone know how to repair this terrible printer?" and "If you are looking for a good car, get a Toyota."
Sarcastic sentences, with or without sentiment words, are hard to deal with, e.g., "What a great car! It stopped working in two days."
Many sentences without sentiment words can also imply opinions, e.g., "This washer uses a lot of water" implies a negative sentiment.
  • 88. CHAPTER 06 TOPIC MODELLING
  • 89. TOPIC MODELLING Topic modelling is an unsupervised machine learning way to organize text information such that related pieces of text can be identified. Topic modelling is basically document clustering in which documents and words are clustered simultaneously.
The topic modelling problem:
 Known: the text/document collection (corpus) and the number of topics
 Unknown: the actual topics and the topic distribution in each document
Topic modelling is used for:
 Discovering hidden topical patterns present across the collection
 Annotating documents according to these topics
 Using these annotations to organize, search, and summarize texts
  • 90. TOPIC MODELLING Basic assumptions:
 A document consists of a mixture of topics
 A topic is a collection of words
Topics = latent semantic concepts. Different approaches:
 Latent Semantic Analysis/Indexing (LSA/LSI) → linear algebra
 Probabilistic Latent Semantic Analysis (PLSA) → probabilistic
 Latent Dirichlet Allocation (LDA) → probabilistic
  • 91. LATENT SEMANTIC ANALYSIS Decomposes the document-word matrix into document-topic and topic-word matrices using Singular Value Decomposition (SVD). Given m documents and n words in our vocabulary, we can construct an m-by-n matrix A → a sparse document-word co-occurrence matrix.
 The simplest form of LSA uses raw counts, where a_ij is the number of times the j-th word appears in the i-th document
 More advanced LSA often uses the TF-IDF weight for a_ij
SVD decomposes A into 3 matrices, A = U S Vᵀ, where:
 A is an m × n matrix
 U is an m × n orthogonal matrix
 S is an n × n diagonal matrix (the singular values)
 V is an n × n orthogonal matrix
  • 92. LATENT SEMANTIC ANALYSIS Since A is most likely sparse, we perform dimensionality reduction using truncated SVD, which keeps only the t most significant dimensions in the transformed space (see the sketch below). LSA is quick and efficient, but has some shortcomings:
 Lack of interpretable embeddings
 Needs a really large set of documents and vocabulary to get accurate results
 Less efficient representation
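A minimal LSA sketch with scikit-learn's TruncatedSVD applied to a TF-IDF matrix; the four toy documents are invented for illustration:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["harga saham bank naik",
        "bank sentral menahan suku bunga",
        "tim sepak bola menang di final",
        "pemain bola mencetak gol"]

A = TfidfVectorizer().fit_transform(docs)  # m x n document-term matrix

svd = TruncatedSVD(n_components=2)         # keep t = 2 latent dimensions ("topics")
doc_topic = svd.fit_transform(A)           # m x t document-topic matrix
print(doc_topic.round(2))
print(svd.components_.shape)               # t x n topic-term matrix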
  • 93. PROBABILISTIC LATENT SEMANTIC ANALYSIS PLSA uses a probabilistic method instead of SVD. The basic idea: find a probabilistic model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry of the document-term matrix. PLSA assumptions:
 given a document d, topic z is present in that document with probability P(z|d)
 given a topic z, word w is drawn from z with probability P(w|z)
Together these give P(d, w) = P(d) Σ_z P(z|d) P(w|z). As its name implies, PLSA just adds a probabilistic treatment of topics and words on top of LSA.
  • 94. PLSA LIMITATIONS PLSA is more flexible than LSA, but still has some limitations:
 The number of parameters grows linearly with the number of training documents → the model is prone to overfitting
 It is not a well-defined generative model: there is no way to generalize to new, unseen documents
  • 95. LATENT DIRICHLET ALLOCATION LDA is a Bayesian version of PLSA. It uses Dirichlet priors for the document-topic and word-topic distributions, leading to better generalization. A Dirichlet is a probability distribution that does not sample from the space of real numbers; instead it samples over a probability simplex. A probability simplex is a group of numbers that add up to 1. For example:
 (0.6, 0.4)
 (0.1, 0.1, 0.8)
 (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)
The numbers represent probabilities over K distinct categories; in the examples above, K is 2, 3, and 6 respectively.
  • 96. LATENT DIRICHLET ALLOCATION MODEL
From a Dirichlet distribution Dir(α), draw a random sample θ representing the topic distribution of a particular document.
From θ, select a particular topic Z based on that distribution.
From another Dirichlet distribution Dir(β), draw a random sample φ representing the word distribution of topic Z.
From φ, choose the word w.
LDA typically works better than PLSA because it can generalize to new documents easily. Some limitations: it needs relatively large memory and processing time, and the model is difficult to explain.
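A minimal LDA sketch with gensim (listed among the lab libraries); the toy tokenized corpus is invented for illustration:

from gensim import corpora, models

texts = [["bola", "gol", "pemain"],
         ["saham", "bank", "bunga"],
         ["pemain", "bola", "menang"],
         ["bank", "saham", "ekonomi"]]

dictionary = corpora.Dictionary(texts)           # vocabulary
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=1, passes=10)
for topic_id, words in lda.print_topics(num_words=3):
    print(topic_id, words)                       # top words per latent topic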
  • 97. CHAPTER 07 WRAPPING IT ALL TOGETHER
  • 98. TEXT CLUSTERING PROCESS FLOW We will demonstrate the end-to-end process by performing document clustering. The process flow is as follows:
 Text preprocessing, including text cleanup and text normalization
 Vector representation / feature extraction: using TF-IDF
 Building the model: using K-Means
 Visualization
 Model evaluation
  • 99. LAB 08 TEXT CLUSTERING
  • 100. LAB DESCRIPTION What you will learn: how to create a document clustering program using a real dataset; implementing tokenization, stemming and cleansing; K-Means implementation; visualization using matplotlib
  • 101. STEP 01 LIBRARY Import all required libraries and click Run

import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.externals import joblib  # note: removed in recent scikit-learn; use "import joblib" there
from sklearn.manifold import MDS
  • 102. STEP 02 DATA INPUT Type the following code and click Run, then show sample data

#load titles
titles = open('Judul Berita.txt').read().split('\n')
#load articles
article = open('Berita.txt', encoding="utf8").read().split('BERHENTI DISINI')
len(titles)
titles[:5]
article[:5]
len(article)
  • 103. STEP 03 PARSING ARTICLES Parse the articles from HTML format using the beautifulsoup package

article_clean = []
for text in article:
    text = BeautifulSoup(text, 'html.parser').getText()
    article_clean.append(text)
article = article_clean
print(article)
  • 104. STEP 04 TOKENIZATION AND STEMMING Do tokenization, stemming and cleansing, as in Lab 03

def tokenize_and_stem(text):
    #tokenization and change to lowercase
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    #clean tokens from numbers and non-alphabetical characters such as punctuation, etc.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    #clean stop words
    stopwords = nltk.corpus.stopwords.words('indonesian')
    cleaned_token = []
    for token in filtered_tokens:
        if token not in stopwords:
            cleaned_token.append(token)
    ...
  • 105. STEP 05 TOKENIZATION AND STEMMING (CONT.) Do tokenization, stemming and cleansing, as in Lab 03 (cont.), then show sample data

    ...
    #stem using the Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    stems = [stemmer.stem(t) for t in cleaned_token]
    return stems

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' words in vocab_frame')
print(vocab_frame.head())
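The slides use totalvocab_tokenized and totalvocab_stemmed without showing how they are built. A plausible reconstruction (an assumption on our part, including the hypothetical tokenize_only helper, which must mirror tokenize_and_stem minus the stemming so the two lists stay aligned one-to-one):

#assumption: these lists are built over the whole corpus before creating vocab_frame
def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered = [t for t in tokens if re.search('[a-zA-Z]', t)]
    stopwords = nltk.corpus.stopwords.words('indonesian')
    return [t for t in filtered if t not in stopwords]

totalvocab_stemmed = []
totalvocab_tokenized = []
for a in article:
    totalvocab_stemmed.extend(tokenize_and_stem(a))   # stems, used as the index
    totalvocab_tokenized.extend(tokenize_only(a))     # surface words, same order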
  • 106. STEP 06 TF-IDF Calculate the TF-IDF matrix and show it

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))
#fit the vectorizer to the articles
tfidf_matrix = tfidf_vectorizer.fit_transform(article)
print(tfidf_matrix.shape)
print(tfidf_matrix)
  • 107. STEP 07 K-MEANS MODELLING Do the K-Means modelling; in this case we use number of clusters = 3. Create a DataFrame with the format: rank - title - cluster

num_clusters = 3
km = KMeans(n_clusters=num_clusters, random_state=1000)
km.fit(tfidf_matrix)

#rank order
ranks = [i for i in range(1, len(titles)+1)]
#cluster labels from k-means
clusters = km.labels_.tolist()

news = { 'title': titles, 'rank': ranks, 'article': article, 'cluster': clusters }
frame = pd.DataFrame(news, index = [clusters], columns = ['rank', 'title', 'cluster'])

#show the dataframe
print(frame)
frame['cluster'].value_counts()
  • 108. STEP 08 DATA EXPLORATION Display the clustering results and the top terms per cluster to determine the labels

print("Top terms per cluster:")
#sort cluster center coordinates in descending order (terms closest to each centroid first)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
#the vectorizer's vocabulary; this assignment was implied but not shown on the slide
terms = tfidf_vectorizer.get_feature_names()
#.ix is removed in recent pandas; .loc is used below instead
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print() #add whitespace
    print() #add whitespace
    print("Cluster %d titles:" % i, end='')
    for title in frame.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print() #add whitespace
    print()
  • 109. STEP 09 VISUALIZATION Visualize the clustering results with MDS. From step 08 it can be seen that the 3 clusters formed are: economy, sports, and crime. Set the colors and cluster labels.

similarity_distance = 1 - cosine_similarity(tfidf_matrix)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(similarity_distance)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]

#set colors with a dictionary
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3'}
#dictionary for cluster names (chart legend)
cluster_names = {0: 'Olahraga', 1: 'Ekonomi', 2: 'Kriminal'}
  • 110. STEP 09 VISUALIZATION Set matplotlib to display charts inline and type the following code

%matplotlib inline

#create a data frame with the MDS results plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
groups = df.groupby('label')

# set up the plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
# ax.margins(0.05) # optional, just adds 5% padding to the autoscaling
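The slide stops at the plot setup; a plausible completion (our sketch, following the color and name dictionaries defined in the previous step) draws each cluster in its own color and adds a legend:

#plot each cluster with its own color and legend entry
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name])
ax.legend(numpoints=1)

#optionally annotate each point with its news title
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8)

plt.show()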
