Data mining techniques

Data Mining Techniques
MUMTAZ KHAN
MS (SEMANTIC WEB)

TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency.
• It is an important weight for search: it identifies which terms are most
relevant to a document.
Term frequency (TF): measures how often a word occurs in a document.
◦ A word that occurs frequently is probably important to that
document's meaning.
• Mathematical form:
TF = (Number of occurrences of the keyword in that particular
document) / (Total number of keywords in the document)

TF-IDF (Continued)
Inverse document frequency (IDF): measures the rarity of a term across the
whole corpus of documents.
◦ Let N denote the total number of documents and df the number of documents
that contain term T. The inverse document frequency of T is defined as
IDF = log(N / df), or, in the variant used in this example,
IDF = 1 + log_e(Total number of documents / Number of documents with
that term in it), so
TF-IDF = TF × IDF
Practical Example of TF-IDF
◦ Suppose we have three documents d1, d2 and d3:
◦ d1 = The game of life is a game of everlasting learning
◦ d2 = The unexamined life is not worth living
◦ d3 = Never stop learning

Steps for TF-IDF
TF and normalized TF for d1, d2 and d3:
◦ Note: d1 contains 10 terms/words, d2 contains 7
terms/words and d3 contains 3 terms/words.
d1:
Term/Word     Frequency   Normalized TF
the           1           1/10 = 0.1
game          2           2/10 = 0.2
of            2           2/10 = 0.2
life          1           1/10 = 0.1
is            1           1/10 = 0.1
a             1           1/10 = 0.1
everlasting   1           1/10 = 0.1
learning      1           1/10 = 0.1
d2:
Term/Word     Frequency   Normalized TF
the           1           1/7 = 0.1429
unexamined    1           1/7 = 0.1429
life          1           1/7 = 0.1429
is            1           1/7 = 0.1429
not           1           1/7 = 0.1429
worth         1           1/7 = 0.1429
living        1           1/7 = 0.1429
d3:
Term/Word     Frequency   Normalized TF
never         1           1/3 = 0.3333
stop          1           1/3 = 0.3333
learning      1           1/3 = 0.3333
Calculation of IDF for each term/word:
IDF = 1 + log_e(Total number of documents / Number
of documents with that term in it)

 Let us Compute the IDF for the
term "unexamined":
IDF = 1 + log_e(Total number of documents / Number of documents with the term "unexamined" in it)
◦ There are 3 documents in all (d1, d2, d3), but the term "unexamined" appears only in
document d2.
IDF(unexamined) = 1 + log_e(3/1) = 2.098612289, and similarly for the other terms:
Terms         IDF
the           1.405465108
game          2.098612289
of            2.098612289
life          1.405465108
is            1.405465108
a             2.098612289
everlasting   2.098612289
learning      1.405465108
unexamined    2.098612289
not           2.098612289
worth         2.098612289
living        2.098612289
never         2.098612289
stop          2.098612289

 Let us calculate the TF-IDF and the
relevant documents for the query:
life learning
• Note: for each term, the TF-IDF is calculated by multiplying its normalized term
frequency by its IDF, i.e.
TF-IDF = TF × IDF
Terms      d1            d2            d3
life       0.140546511   0.200780730   0
learning   0.140546511   0             0.468488369
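The TF-IDF steps of this worked example can be reproduced with a short script. This is a minimal sketch, assuming lowercase whitespace tokenization and the IDF variant 1 + log_e(N/df) used on these slides; the function names are illustrative and not taken from any library.

```python
import math
from collections import Counter

docs = {
    "d1": "The game of life is a game of everlasting learning",
    "d2": "The unexamined life is not worth living",
    "d3": "Never stop learning",
}

def tokenize(text):
    # Assumption: lowercase, split on whitespace, no stop-word removal
    return text.lower().split()

def term_frequency(tokens):
    # Normalized TF: occurrences of a term / total number of terms in the document
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def inverse_document_frequency(tokenized_docs):
    # IDF = 1 + ln(N / df), where df counts the documents containing the term
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs.values() for term in set(tokens))
    return {term: 1 + math.log(n_docs / n) for term, n in df.items()}

tokenized = {name: tokenize(text) for name, text in docs.items()}
idf = inverse_document_frequency(tokenized)
tfidf = {
    name: {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
    for name, tokens in tokenized.items()
}

for term in ("life", "learning"):
    row = [round(tfidf[name].get(term, 0.0), 4) for name in ("d1", "d2", "d3")]
    print(term, row)
# life [0.1405, 0.2008, 0.0]
# learning [0.1405, 0.0, 0.4685]
```

A term that does not occur in a document simply has no entry in that document's dictionary, which is why `.get(term, 0.0)` is used when printing the table.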

 Vector Space Model-Cosine Similarity
◦ For each document we derive a vector.
◦ The set of documents in a collection is viewed as a set of
vectors in a vector space.
◦ Each term has its own axis.
• Formula: to find the similarity between any two documents d1 and d2:
$$\text{Cosine Similarity}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$
$$d_1 \cdot d_2 = d_1[0]\,d_2[0] + d_1[1]\,d_2[1] + \dots + d_1[n]\,d_2[n]$$
$$\lVert d_1 \rVert = \sqrt{d_1[0]^2 + d_1[1]^2 + \dots + d_1[n]^2}$$
$$\lVert d_2 \rVert = \sqrt{d_2[0]^2 + d_2[1]^2 + \dots + d_2[n]^2}$$
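The cosine formula can be sketched in a few lines of Python. The example below is an illustration, not the author's code: it assumes the two-term query "life learning" from this deck, the axes (life, learning), and the IDF variant 1 + log_e(N/df); `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Both "life" and "learning" occur in 2 of the 3 documents
idf = 1 + math.log(3 / 2)

# Vectors on the axes (life, learning): normalized TF times IDF
query = [0.5 * idf, 0.5 * idf]
documents = {
    "d1": [(1 / 10) * idf, (1 / 10) * idf],
    "d2": [(1 / 7) * idf, 0.0],
    "d3": [0.0, (1 / 3) * idf],
}

scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
for name, score in scores.items():
    print(name, round(score, 4))
# d1 1.0
# d2 0.7071
# d3 0.7071
```

Because cosine similarity ignores vector length, d1 matches the query perfectly even though its weights are much smaller than the query's.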

 Vector Space Model-Cosine Similarity

 The TF-IDF for the query : life learning
 Let us calculate the cosine similarity between query and document(d1)
Query term weights (TF × IDF):
Term       TF    IDF           TF-IDF
life       0.5   1.405465108   0.702732554
learning   0.5   1.405465108   0.702732554
$\text{Cosine Similarity}(\text{Query}, \text{Document1}) = \dfrac{\text{Query} \cdot \text{Document1}}{\lVert \text{Query} \rVert \, \lVert \text{Document1} \rVert}$
$\text{Query} \cdot \text{Document1} = (0.702732554 \times 0.140546511) + (0.702732554 \times 0.140546511) = 0.197533217$
$\lVert \text{Query} \rVert = \sqrt{0.702732554^2 + 0.702732554^2} = 0.993813909$
$\lVert \text{Document1} \rVert = \sqrt{0.140546511^2 + 0.140546511^2} = 0.198762782$
$\text{Cosine Similarity}(\text{Query}, \text{Document1}) = \dfrac{0.197533217}{0.993813909 \times 0.198762782} = \dfrac{0.197533217}{0.197533217} = 1$

• The similarity scores of the query with d1, d2 and d3 are:
Document            d1   d2            d3
Cosine similarity   1    0.707106781   0.707106781
• Some demerits of TF-IDF
◦ It is based on the bag-of-words model
  - therefore it does not capture position in text, semantics, co-occurrences in
different documents, etc.
  - for this reason, TF-IDF is only useful as a lexical-level feature
◦ It cannot capture semantics (e.g. as compared to topic models and
word embeddings)

References
1. Van Rijsbergen, Cornelis J., Stephen Edward Robertson, and Martin F. Porter. New
models in probabilistic information retrieval. London: British Library Research and
Development Department, 1980.
2. http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html
3. http://www.tfidf.com/
4. Wang, Wei, and Yongxin Tang. "Improvement and Application of TF-IDF Algorithm in
Text Orientation Analysis." (2016).
5. Wu, Ho Chung, et al. "Interpreting tf-idf term weights as making relevance
decisions." ACM Transactions on Information Systems (TOIS) 26.3 (2008): 13.
6. Sparck Jones, Karen. "A statistical interpretation of term specificity and its application in
retrieval." Journal of Documentation 28.1 (1972): 11-21.
7. Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text
retrieval." Information Processing & Management 24.5 (1988): 513-523.

LDA (Latent Dirichlet Allocation)
• Outline of LDA
◦ Introduction
◦ Model Definition
◦ Variational Inference
◦ Example Output and Simulation
◦ References

LDA (Continued)
• Introduction
◦ As more information becomes available, it becomes more
difficult to find and discover what we need.
◦ We need tools to help us organize, search and understand this
vast amount of information.
◦ Topic modeling provides methods for automatically organizing,
understanding, searching and summarizing large electronic
archives.

 Introduction (Goal of Topic Model)
◦ A document has several topics
◦ Topics are associated with words
◦ Words are expressed in documents through the topics
Documents (observed) → Topics (latent) → Words (observed)

 LDA (Continue….)
◦ LDA is a generative probabilistic model of a corpus.
◦ LDA essentially turns pLSA into a fully generative model by imposing a Dirichlet
prior on the model parameters.
◦ LDA is a Bayesian version of pLSA, so its parameters are more
strongly regularized.
◦ LDA breaks down the collection of documents into topics.
◦ It discovers the hidden themes in the collection.
◦ Each document is represented as a mixture of topics with a probability
distribution over them.
◦ Each topic is represented as a mixture of words, with a probability
representing the importance of each word for the topic.

 Twitter Using LDA
◦ Fetch the tweet data using the "twitteR" package.
◦ Load the data into the R environment.
◦ Clean the data to remove re-tweet information, links, special characters,
emoticons and frequent words such as "is", "as", "this", etc.
◦ Create a Term Document Matrix (TDM) using the "tm" package.
◦ Calculate TF-IDF, i.e. Term Frequency-Inverse Document Frequency, for
all the words in the TDM.
◦ Exclude all words with TF-IDF <= 0.1, to remove the words that
are less frequent.
◦ Calculate the optimal number of topics (k) in the corpus using the
log-likelihood function for the calculated TDM.
◦ Apply the LDA method using the "topicmodels" package to discover topics.
◦ Evaluate the model.
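The cleaning step of this pipeline can be illustrated in Python with the standard library only. This is a sketch under stated assumptions, not the R workflow itself: it does not use the twitteR or tm packages, and `STOP_WORDS` and `clean_tweet` are hypothetical names with a deliberately tiny stop-word list.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real pipeline would use a fuller one
STOP_WORDS = {"is", "as", "this", "a", "an", "the", "and", "of", "to", "in"}

def clean_tweet(text):
    text = re.sub(r"^RT\s+@\w+:?\s*", "", text)  # drop the re-tweet header
    text = re.sub(r"https?://\S+", " ", text)    # drop links
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # drop special characters, digits, emoticons
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "RT @user: This is a topic model demo!! :) https://example.com/x #LDA"
print(clean_tweet(tweet))
# ['topic', 'model', 'demo', 'lda']
```

The cleaned token lists would then feed the term-document matrix and the TF-IDF filter described in the later steps.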

 LDA – generative process
◦ Setting up a generative model
  - We have D documents using a vocabulary of V word types.
  - Each document contains (up to) N word tokens.
  - We assume K topics.
  - Each document has a K-dimensional multinomial $\theta_d$ over topics,
with a common Dirichlet prior $\mathrm{Dir}(\alpha)$.
  - Each topic has a V-dimensional multinomial $\beta_k$ over words,
with a common symmetric Dirichlet prior $\mathrm{Dir}(\eta)$.

 The Generative process
◦ For each topic $k = 1, \dots, K$:
  - Draw a word distribution (multinomial) $\beta_k \sim \mathrm{Dir}(\eta)$
◦ For each document $d = 1, \dots, D$:
  - Draw a topic distribution (multinomial) $\theta_d \sim \mathrm{Dir}(\alpha)$
  - For each word $w_{d,n}$:
    - Draw a topic $z_{d,n} \sim \mathrm{Mult}(\theta_d)$, with $z_{d,n} \in \{1, \dots, K\}$
    - Draw a word $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$
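The generative process can be simulated directly. The sketch below uses only the Python standard library and assumes symmetric priors and a fixed number of words per document; `sample_dirichlet` and `generate_corpus` are illustrative names, and the Dirichlet draw is built from normalized Gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    # Normalized Gamma(alpha_i, 1) draws are distributed as Dir(alpha)
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, num_topics, vocab, words_per_doc, alpha, eta, seed=0):
    rng = random.Random(seed)
    # For each topic k: draw a word distribution beta_k ~ Dir(eta)
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # For each document d: draw a topic distribution theta_d ~ Dir(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(words_per_doc):
            # Draw a topic z ~ Mult(theta_d), then a word w ~ Mult(beta_z)
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=beta[z])[0]
            doc.append(w)
        corpus.append(doc)
    return beta, corpus

vocab = ["gene", "dna", "brain", "neuron", "data", "model"]
beta, corpus = generate_corpus(num_docs=3, num_topics=2, vocab=vocab,
                               words_per_doc=8, alpha=0.5, eta=0.5, seed=42)
for doc in corpus:
    print(" ".join(doc))
```

Inference in LDA runs this process in reverse: only the words are observed, and the topics, proportions and assignments have to be recovered.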

 Graphical Model of LDA

 LDA Joint Distribution

 LDA Joint Distribution defines a posterior p(θ,z,β|w)
◦ From a collection of documents we have to infer:
  - Per-word topic assignments $z_{d,n}$
  - Per-document topic proportions $\theta_d$
  - Per-corpus topic distributions $\beta_k$
Note:

• Why does $w_{d,n}$ depend on $z_{d,n}$ and $\beta$?

 LDA Graphical Model with working Procedure

 LDA inputs.
◦ The set of words per document, for each document in the corpus
• LDA outputs.
◦ Corpus-wide topic vocabulary distributions
◦ Topic assignments per word
◦ Topic proportions per document
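One way to connect these inputs and outputs is through the joint distribution of the model. The block below writes out the standard form of the LDA joint and the posterior it induces, in the notation of these slides ($\beta$ for topics, $\theta$ for proportions, $z$ for assignments, $w$ for words). Using the same number of word positions $N$ in every document is a simplifying assumption made here for brevity.

```latex
% Joint distribution of topics, proportions, assignments and words
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D} \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
                    p(w_{d,n} \mid \beta_{1:K}, z_{d,n})

% Posterior over the latent variables given the observed words
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}, \alpha, \eta)
  = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D} \mid \alpha, \eta)}
         {p(w_{1:D} \mid \alpha, \eta)}
```

The words $w$ are the input; the three outputs listed above correspond to $\beta$, $z$ and $\theta$ in the posterior. The denominator is intractable to compute exactly, which is why approximate methods such as variational inference are used.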

Topic models: probabilistic generative process vs. statistical inference
3 latent variables:
◦ Word distribution per topic (word-topic matrix)
◦ Topic distribution per document (topic-document matrix)
◦ Topic assignment per word

 Dirichlet distribution is Conjugate prior of multinomial
distribution
• The parameter α controls the mean, shape and sparsity of θ
◦ High α = uniform θ; small α = sparse θ
• In LDA each topic is drawn from a V-dimensional Dirichlet and each set of
topic proportions is drawn from a K-dimensional Dirichlet
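The effect of α on sparsity can be checked empirically. The following sketch (standard library only, symmetric Dirichlet, illustrative function names) compares the average size of the largest component of θ for a small and a large α; a value near 1 means most of the mass sits on one topic, and a value near 1/K means the mass is spread evenly.

```python
import random

def sample_dirichlet(alpha, k, rng):
    # Symmetric Dir(alpha, ..., alpha) via normalized Gamma draws
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(alpha, k=5, n_samples=2000, seed=0):
    # Average of max(theta) over many draws
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)) / n_samples

sparse = mean_largest_component(alpha=0.1)    # small alpha: sparse theta
uniform = mean_largest_component(alpha=10.0)  # large alpha: near-uniform theta
print(round(sparse, 2), round(uniform, 2))
```

With K = 5, the second number should land not far above 1/K = 0.2, while the first should be much closer to 1.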

 The Geometric intuition (Simplex)

 The Dirichlet is a “dice factory”
◦ Multivariate generalization of the Beta distribution ("coin factory")
◦ The parameters α determine the form of the prior
• The Dirichlet is defined over the (K-1)-simplex
◦ i.e. over K non-negative values that sum to one

 To which topics does a given document belong to? Thus want
to compute the posterior distribution of the hidden variables
given a document.

◦ Variational Inference

 LDA Summary
◦ LDA can:
  - Visualize the hidden thematic structure in large corpora
  - Generalize to new data by fitting it into that structure
  - Be used for feature reduction and in bioinformatics
  - Be used for sentiment analysis, object localization and automatic harmonic analysis of
music
◦ Note: the main goals of LDA
  - In each document, allocate its words to a few topics
  - In each topic, assign high probability to a few terms
◦ Both follow from the joint distribution
  - Sparse proportions come from the 1st term
  - Sparse topics come from the 2nd term
◦ Limitations:
  - The number of topics k must be known in advance
  - The Dirichlet topic distribution cannot capture correlations among topics

References
1. Jelodar, Hamed, et al. "Latent Dirichlet Allocation (LDA) and Topic modeling: models,
applications, a survey." arXiv preprint arXiv:1711.04305 (2017).
2. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
3. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of
Machine Learning Research 3.Jan (2003): 993-1022.
4. Video lectures of David Blei on videolectures.net: http://videolectures.net/mlss09uk_blei_tm/
5. Campr, Michal, and Karel Ježek. "Comparing semantic models for evaluating automatic
document summarization." International Conference on Text, Speech, and Dialogue. Springer
International Publishing, 2015.
6. Hu, Diane J. "Latent Dirichlet allocation for text, images, and music." University of California, San
Diego. Retrieved April 26 (2009): 2013.
7. Jayapal, Arun, and Martin Emms. "Topic Models-Latent Dirichlet Allocation." (2014).
8. Wang, Y. Distributed Gibbs sampling of latent topic models: The gritty details. Technical report,
2008.
9. https://cs.stanford.edu/~ppasupat/a9online/1140.html

Latent Semantic Indexing (LSI)
Problems in Lexical Matching
Motivation
Introduction
How LSI Works
LSI Procedure
SVD
Example
Applications
Demerits

LSI (Continued)
Problems in Lexical Matching
◦ Synonymy
- widespread use of synonyms
- decreases recall
◦ Polysemy
- retrieval of irrelevant documents
- poor precision
◦ Noise
- Boolean search on specific words
- retrieval of documents with unrelated content

Motivation for LSI
◦ To find and fit a useful model of the relationships
between terms and documents.
◦ To find out which terms are "really" implied by a query.
◦ LSI allows the user to search for concepts rather than
specific words.
◦ Terms and documents are stored in a concept space.
◦ LSI can retrieve documents related to a user's query
even when the query and the documents do not share
any common terms.
◦ Mathematical model
  - relates documents and concepts
◦ LSI tries to overcome the problems of lexical matching.

Introduction
◦ LSI is a technique that projects queries and documents
into a space with “latent” semantic dimensions
◦ It places all documents and terms in a multidimensional
vector space.
◦ Each dimension in that space corresponds to a concept
existing in the collection.
◦ Related terms that a document and a query have in common pull
the document and query vectors close to each other.

Concepts in Documents

How does LSI work?
• Given a set of documents,
  - how do we determine the similar ones?
  - examine the documents
  - try to find concepts in common
  - classify the documents
• This is also how LSI works.
• LSI represents terms and documents in a high-dimensional space,
allowing relationships between terms and documents to be exploited
during searching.
• It converts the high-dimensional space to a lower-dimensional space, throwing out
noise and keeping the good stuff.

LSI Procedure
◦ Obtain the term-document matrix.
◦ Compute the SVD.
◦ Truncate the SVD into a reduced k-dimensional LSI space.
- k-dimensional semantic structure
- similarity in the reduced space:
  - term-term
  - term-document
  - document-document
• Query Procedure
◦ Map the query into the reduced k-space:
$q' = q^{T} U_k S_k^{-1}$
◦ Retrieve documents or terms within a proximity of the query.
- cosine similarity
- best m matches

Singular Value Decomposition (SVD)
◦ LSI uses SVD, a linear analysis method.
◦ SVD decomposes the original matrix into three matrices:
  - Document eigenvector matrix (V)
  - Eigenvalue (singular value) matrix (Σ)
  - Term eigenvector matrix (U)
◦ The SVD of a rectangular matrix A is given by:
$A = U \Sigma V^{T}$

Singular Value Decomposition (SVD) (Continued)
◦ For an m × n matrix A of rank r there exists a factorization
(Singular Value Decomposition = SVD) as follows:
$A = U \Sigma V^{T}$
◦ The columns of U are orthogonal eigenvectors of $AA^{T}$.
◦ The columns of V are orthogonal eigenvectors of $A^{T}A$.
◦ The eigenvalues $\lambda_1, \dots, \lambda_r$ of $AA^{T}$ are also the eigenvalues of $A^{T}A$.
$AA^{T} = U \Sigma \Sigma^{T} U^{T}$
$A^{T}A = V \Sigma^{T} \Sigma V^{T}$

Example
◦ Suppose we have three documents:
  - d1: Shipment of gold damaged in a fire.
  - d2: Delivery of silver arrived in a silver truck.
  - d3: Shipment of gold arrived in a truck.
◦ Problem: use Latent Semantic Indexing (LSI) to rank these documents
for the query "gold silver truck".
Step 1: Set the term weights and construct the term-document matrix A and the query matrix q.

Step 2: Decompose matrix A and find the U, S and V matrices, where $A = U S V^{T}$.

Step 3: Implement a rank-2 approximation by keeping the first two columns of U and V
and the first two columns and rows of S.

Step 4: Find the new document vector coordinates in this reduced 2-dimensional space.
• The rows of V hold the eigenvector values. These are the coordinates of the individual document
vectors, hence
  - d1(-0.4945, 0.6492)
  - d2(-0.6458, -0.7194)
  - d3(-0.5817, 0.2469)
Step 5: Find the new query vector coordinates in the reduced 2-dimensional space, using $q' = q^{T} U_k S_k^{-1}$.
• Note: these are the new coordinates of the query vector in two dimensions. Note how
this vector now differs from the original query matrix q given in Step 1.

Step 6: Rank the documents in decreasing order of query-document cosine similarity.

Graphical Representation
• We can see that document d2 scores higher than d3 and d1. Its vector is
closer to the query vector than the other vectors. Also note that term
vector theory is still used at the beginning and at the end of LSI.
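The six steps can be reproduced with a small script. This is a minimal sketch, assuming raw term counts as weights and using only the standard library: instead of a library SVD routine it obtains V and the singular values from the eigenvectors of $A^{T}A$ by power iteration with deflation, which is adequate for this 3-document example but is not how production LSI is computed. All function names are illustrative.

```python
import math

TERMS = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
DOCS = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
    "d3": "shipment of gold arrived in a truck",
}
QUERY = "gold silver truck"

def count_vector(text):
    # Step 1: raw term counts over the fixed vocabulary
    words = text.lower().split()
    return [words.count(term) for term in TERMS]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_eigenpairs(sym, k, iterations=500):
    # Power iteration with deflation on a small symmetric matrix
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for i in range(k):
        v = [1.0 + 0.1 * (j + i) for j in range(n)]  # arbitrary non-zero start
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in m])
        pairs.append((lam, v))
        # Remove the component just found before looking for the next one
        m = [[m[r][c] - lam * v[r] * v[c] for c in range(n)] for r in range(n)]
    return pairs

names = list(DOCS)
columns = [count_vector(DOCS[name]) for name in names]  # columns of A
q = count_vector(QUERY)

# Steps 2-3: eigenpairs of A^T A give V_k and the squared singular values
gram = [[dot(ci, cj) for cj in columns] for ci in columns]
pairs = top_eigenpairs(gram, k=2)

# Step 4: document coordinates are the rows of V_k
doc_coords = [[vec[i] for _, vec in pairs] for i in range(len(names))]

# Step 5: q' = q^T U_k S_k^{-1} = (q^T A) V_k S_k^{-2}, because U_k = A V_k S_k^{-1}
q_times_a = [dot(q, col) for col in columns]
q_coords = [dot(q_times_a, vec) / lam for lam, vec in pairs]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Step 6: rank documents by cosine similarity to the query
scores = {name: cosine(q_coords, coords) for name, coords in zip(names, doc_coords)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
# ['d2', 'd3', 'd1']
```

The signs of the computed coordinates may be flipped relative to the values listed in Step 4, since eigenvectors are only defined up to sign; the cosine similarities and the ranking are unaffected because the query coordinates flip with them.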

LSI (Continued)
Applications of LSI
◦ Information retrieval
◦ Information filtering
◦ Relevance feedback
◦ Improving the performance of search engines
  - in ranking pages
◦ Cross-language retrieval
◦ Automated essay grading
◦ Optimizing the link profile of a web page
◦ Modelling of human cognitive function
◦ Dynamic advertisements placed on pages, e.g. Google's
AdSense

LSI (Continued)
Demerits of LSI
◦ Storage: the truncated SVD factors are dense, unlike the sparse
term-document matrix
◦ Computing the truncated SVD for the LSI model is costly
◦ Its execution efficiency lags far behind the execution
efficiency of the simpler Boolean models, especially on
large data sets.
◦ The number of latent topic dimensions cannot be chosen arbitrarily.
It is bounded by the rank of the matrix, so it cannot go
beyond that.
◦ Bad for millions of words or documents
◦ Hard to incorporate new words or documents

References
◦ http://www.bluebit.gr/matrix-calculator/
◦ Rosario, Barbara. "Latent semantic indexing: An overview." Techn. rep. INFOSYS 240 (2000): 1-16.
◦ Ding, Chris HQ. "A probabilistic model for latent semantic indexing." Journal of the Association for
Information Science and Technology 56.6 (2005): 597-608.
◦ Dumais, Susan T. "Latent semantic indexing (LSI) and TREC-2." NIST Special Publication SP (1994):
105-105.
◦ Alter, Orly, Patrick O. Brown, and David Botstein. "Singular value decomposition for genome-wide
expression data processing and modeling." Proceedings of the National Academy of
Sciences 97.18 (2000): 10101-10106.
◦ Golub, Gene H., and Charles F. Van Loan. Matrix computations. Vol. 3. JHU Press, 2012.
◦ http://www-db.deis.unibo.it/courses/SI-M/
◦ http://web.eecs.utk.edu/research/lsi/
◦ http://lsi.research.telcordia.com/

Word2Vec
◦ It is used to generate representation vectors from
words
◦ It maps words to continuous vector representations,
  - i.e. points in an N-dimensional space
◦ It learns the vectors from training data (generalization)
◦ It gives a numeric representation for each word
  - that makes it possible to capture relationships between words, such as
synonyms and analogies
Word2Vec (Continued)
Continuous Bag of Words (CBOW)
◦ It predicts a missing word from a window of context words
  - Suppose we are given the words "Latent Dirichlet"; the CBOW
model then predicts the missing word "Allocation", giving "Latent Dirichlet
Allocation"
◦ It is useful for identifying a missing word in a sentence
◦ It can help identify sentiment orientation
◦ Randomly initialize the input and output weight matrices, of sizes
V×N and N×V, where V is the vocabulary size and N is the vector size
(a parameter)
◦ Update the weight matrices using SGD, backpropagation and cross-entropy
loss over the corpus
◦ The hidden layer size corresponds to the word vector
dimensionality.
Word2Vec (Continued)
Skip-Gram
◦ The method is very similar, except that we now predict a window of
context words given a single word vector
◦ It predicts the context words given the target word
  - Suppose we are given the word "Dirichlet"; the Skip-Gram model
then predicts the context words "Latent" and "Allocation".
◦ It boils down to maximizing the dot-product similarity of
context words and target word
◦ Skip-gram typically outperforms CBOW on semantic
and syntactic accuracy (Mikolov et al.)
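The difference between CBOW and Skip-Gram shows up clearly in the training pairs each one builds from the same text. The sketch below only constructs those pairs for the "Latent Dirichlet Allocation" example; it does not train any vectors, and `training_pairs` is a hypothetical helper, not part of a Word2Vec library.

```python
def training_pairs(tokens, window=2):
    cbow, skip_gram = [], []
    for i, target in enumerate(tokens):
        # Context: up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not context:
            continue
        # CBOW: the whole context predicts the target word
        cbow.append((context, target))
        # Skip-Gram: the target word predicts each context word separately
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram

cbow, skip_gram = training_pairs(["latent", "dirichlet", "allocation"], window=2)
for context, target in cbow:
    print("CBOW:", context, "->", target)
for target, context_word in skip_gram:
    print("Skip-Gram:", target, "->", context_word)
```

Among the CBOW pairs is `(['latent', 'dirichlet'], 'allocation')`, and among the Skip-Gram pairs are `('dirichlet', 'latent')` and `('dirichlet', 'allocation')`, matching the two examples on the slides.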

Word2Vec (Continued)
Demerits
• Quality depends on the input data, the number of samples and the
size of the vectors (possibly long computation time!)
◦ But Google released 3 million word vectors trained on 100 billion words!
• Averaging vectors does not work well (in my experience) on
large text (> tweet level)
• Word2Vec cannot provide fixed-length feature vectors for
variable-length text (pretty much everything!)

Doc2Vec
◦ It generalizes Word2Vec to whole documents (phrases,
sentences, etc.)
◦ It provides a fixed-length vector
◦ It learns via two models: Distributed Memory (DM) and Distributed Bag of
Words (DBOW)

Doc2Vec (Continued)
Distributed Memory (DM)
◦ Assign and randomly initialize a paragraph vector for each
document
◦ Predict the next word using the context words + the paragraph
vector
◦ Slide the context window across the document but keep the paragraph
vector fixed (hence distributed memory)
◦ Updates are done via SGD and backpropagation.

Doc2Vec (Continued)
Distributed Bag of Words (DBOW)
◦ Uses only the paragraph vector (no word vectors!)
◦ Takes a window of words in the paragraph and randomly
samples which one to predict using the paragraph vector
(ignores word ordering)
◦ Simpler and more memory-efficient
◦ DM typically outperforms DBOW
  - but DM + DBOW is even better!

Doc2Vec (Example)
Doc2Vec on Wikipedia¹
◦ LDA vs. Doc2Vec for nearest neighbors to "Machine learning" (bold = unrelated to Machine learning)
¹ Document Embedding with Paragraph Vectors, A. Dai et al., 2014

Word2Vec (Continued)
Applications
◦ Information retrieval
◦ Document classification
◦ Recommendation algorithms
◦ etc.

Doc2Vec (Continued)
Conclusion
◦ Doc2Vec is more efficient and robust than other
methods such as LSI, LDA and TF-IDF
Thank you!
