Text mining

TEXT MINING
BY
REVATHY S
KOSHY G

INTRODUCTION
• Text mining is a knowledge discovery process applied to text.
• Also referred to as Text Data Mining (TDM) and Knowledge
Discovery in Textual Databases (KDT).
• Its goal is to extract relevant information, knowledge, or patterns
from different sources that are in unstructured or semi-structured
form.

INPUT OUTPUT MODEL FOR TEXT MINING

STEPS FOR TEXT MINING
• Preprocessing the text
• Applying text mining techniques
➢ Summarization
➢ Classification
➢ Clustering
➢ Visualization
➢ Information extraction
• Analyzing the text

TEXT DATABASES & INFORMATION
RETRIEVAL
• Text databases (document databases)
➢ Large collections of documents from various sources: news articles,
research papers, books, digital libraries, e-mail messages, Web
pages, library databases, etc.
➢ Stored data is usually semi-structured
• Information retrieval
➢ A field developed in parallel with database systems
➢ Information is organized into (a large number of) documents
➢ Information retrieval problem: locating relevant documents based on
user input, such as keywords or example documents

TYPICAL INFORMATION RETRIEVAL
PROBLEM
• To locate relevant documents in a document collection based on a
user’s query
• The query is often some keywords describing an information need
• For an ad hoc (i.e., short-term) information need, the user takes the initiative to
“pull” the relevant information out of the collection
• For a long-term information need, a retrieval system may also take the
initiative to “push” information relevant to the user’s need
• Such an information access process is called information filtering
• The corresponding systems are called filtering systems or recommender
systems

INFORMATION RETRIEVAL
• Typical IR systems
– Online library catalogs
– Online document management systems
• Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update, transaction
management, complex objects
– Some IR problems are not addressed well in DBMS, e.g.,
unstructured documents, approximate search using keywords and
relevance

BASIC MEASURES FOR TEXT RETRIEVAL
• Suppose that a text retrieval system has just retrieved a number of documents based
on a query
• Let the set of documents relevant to the query be denoted as {Relevant}
• Let the set of documents retrieved be denoted as {Retrieved}
• The set of documents that are both relevant and retrieved is denoted as {Relevant} ∩
{Retrieved}

• Precision: percentage of retrieved documents that are in fact relevant
to the query (i.e., “correct” responses):
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
• Recall: percentage of documents relevant to the query that
were, in fact, retrieved:
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
• An information retrieval system often needs to trade off recall for
precision or vice versa.
• F-score: the harmonic mean of recall and precision:
F-score = 2 × precision × recall / (precision + recall)
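The three measures can be computed directly from the two sets. The sketch below is illustrative only: the function name and the document ids are made up for the example, and the zero-division handling is one reasonable choice, not something the slides specify.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, which shows how a system can score differently on the two measures.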

INFORMATION RETRIEVAL CONCEPTS
• Basic Concepts
– A document can be described by a set of representative
keywords called index terms.
– Different index terms have varying relevance when used to
describe document contents.
– This effect is captured by assigning a numerical
weight to each index term of a document (e.g., frequency, tf-idf).
• DBMS Analogy
– Index Terms → Attributes
– Weights → Attribute Values
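As a rough sketch of index-term weighting, the code below computes one common tf-idf variant: relative term frequency times the log of inverse document frequency. Many other variants exist, and the slides do not say which one is intended. The function name and the tiny collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: doc_id -> list of index terms. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical collection, for illustration only.
docs = {
    "d1": ["text", "mining", "text"],
    "d2": ["data", "mining"],
    "d3": ["text", "retrieval"],
}
weights = tf_idf(docs)
```

In the DBMS analogy, each term plays the role of an attribute and each computed weight plays the role of that attribute's value for the document.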

TEXT RETRIEVAL METHODS
• Document selection methods
– Knowledge-based retrieval
– The query specifies constraints for selecting relevant documents.
– Boolean retrieval model: a document is represented by a set of
keywords
– The user provides a Boolean expression of keywords, such as “car and
repair shops,” “tea or coffee,” or “database systems but not Oracle.”
– The system returns documents that satisfy the Boolean expression.
– It is difficult to prescribe a user’s information need exactly with a
Boolean query.
– Works well only when the user knows a lot about the document collection and
can formulate a good query
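A minimal sketch of Boolean document selection, assuming each document is already reduced to a set of keywords. Python set operations stand in for the and / or / not connectives. The collection and the helper name `containing` are made up for the example.

```python
# Hypothetical collection: each document is represented by a set of keywords.
docs = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "coffee"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems", "postgres"},
}


def containing(term):
    """Ids of the documents whose keyword set contains the term."""
    return {doc_id for doc_id, keywords in docs.items() if term in keywords}


# "car and repair"
car_and_repair = containing("car") & containing("repair")
# "tea or coffee"
tea_or_coffee = containing("tea") | containing("coffee")
# "database systems but not Oracle"
db_not_oracle = (containing("database") & containing("systems")) - containing("oracle")
```

Every document either satisfies the expression or does not, so the result is an unranked set.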

• Document ranking methods
– Similarity-based retrieval
– Use the query to rank all documents in order of relevance.
– Present a ranked list of documents in response to a user’s keyword
query.
– Ranking methods are based on mathematical foundations, including
algebra, logic, probability, and statistics.
– Match the keywords in a query with those in the documents and score
each document based on how well it matches the query.
– Degree of relevance: a score computed from information such as
the frequency of words in the document.

SIMILARITY BASED RETRIEVAL
• Finds similar documents based on a set of common keywords
• The answer should be ranked by degree of relevance, based on the
nearness of the keywords, the relative frequency of the keywords, etc.
• Basic techniques
➢ Stop list
• Set of words that are deemed “irrelevant”, even though they may
appear frequently
• E.g., a, the, of, for, to, with, etc.
• Stop lists may vary when the document set varies
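Stop-list filtering can be sketched in a few lines. The stop list below contains only the example words given above, the whitespace tokenizer is a simplifying assumption, and the function name is hypothetical.

```python
# Only the example words from the slide; a real stop list depends on the document set.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop stop-list words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_list]


tokens = remove_stop_words("The repair of a car with the tools")
```

Passing a different `stop_list` argument reflects the point that stop lists may vary from one document set to another.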

SIMILARITY BASED RETRIEVAL
➢ Word stem
• Several words are small syntactic variants of each other since
they share a common word stem
• E.g., drug, drugs, drugged
➢ A term frequency table
• Each entry frequent_table(i, j) = number of occurrences of the word t_i
in document d_j
• Usually, the ratio instead of the absolute number of occurrences
is used
➢ Similarity metrics: measure the closeness of a document to a query
(a set of keywords)
• Relative term occurrences
• Cosine distance: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
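The cosine measure compares two term vectors by the angle between them: the dot product divided by the product of the vector lengths. The sketch below uses relative term frequencies as the vector entries. The function names and sample tokens are illustrative, and the tokens are assumed to be already stemmed.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of all term occurrences in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical, already-stemmed tokens.
doc = relative_frequencies(["drug", "drug", "trial", "result"])
query = relative_frequencies(["drug", "trial"])
score = cosine_similarity(query, doc)
```

Because the measure normalizes by vector length, a long document is not favored simply for containing more words.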

INFORMATION RETRIEVAL MODELS
• Information Retrieval Models:
➢ Boolean Model
➢ Vector Model
➢ Probabilistic Model

BOOLEAN MODEL
• Consider that index terms are either present or absent in a document
• As a result, the index term weights are all binary (0 or 1)
• A query is composed of index terms linked by three connectives: not,
and, and or
– e.g.: car and repair, plane or airplane
• The Boolean model predicts that each document is either relevant or
non-relevant based on the match of a document to the query

THE VECTOR SPACE MODEL
• Represent a document and a query both as vectors in a high-
dimensional space corresponding to all the keywords
• Use an appropriate similarity measure to compute the similarity between
the query vector and the document vector.
• The similarity values can then be used for ranking documents.

MODEL A DOCUMENT
• Starting with a set of d documents and a set of t terms, model each
document as a vector v in the t-dimensional space R^t, i.e., the vector-space
model.
• Let the term frequency be the number of occurrences of term t in the
document d, that is, freq(d, t).
• The (weighted) term-frequency matrix TF(d, t) measures the association
of a term t with respect to the given document d:
• It is defined as 0 if the document does not contain the term, and nonzero
otherwise.
• Simplest choice: TF(d, t) = 1 if the term t occurs in the document d
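A small sketch of the binary weighting: every document becomes a 0/1 vector with one dimension per term. The documents are made up, and `freq` and `tf` are illustrative helpers named after the notation above.

```python
# Hypothetical tokenized documents.
docs = {
    "d1": ["text", "mining", "text"],
    "d2": ["data", "mining"],
}

# The t dimensions of the space: one per distinct term, in a fixed order.
terms = sorted({term for tokens in docs.values() for term in tokens})


def freq(d, t):
    """Number of occurrences of term t in document d."""
    return docs[d].count(t)


def tf(d, t):
    """Binary weighting: 1 if the term occurs in the document, else 0."""
    return 1 if freq(d, t) > 0 else 0


vectors = {d: [tf(d, t) for t in terms] for d in docs}
```

Swapping the body of `tf` for a frequency-based formula would give a weighted matrix without changing the rest of the sketch.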

TEXT INDEXING TECHNIQUES
• Inverted index
➢ Maintains two hash- or B+-tree indexed tables:
• document_table: a set of document records <doc_id, postings_list>
• term_table: a set of term records <term, postings_list>
➢ Answers queries such as: find all documents associated with one or a set of terms
• Easy to implement
• Does not handle synonymy and polysemy well, and posting lists could be too long
(storage could be very large)
• Signature file
➢ Associates a signature with each document
➢ A signature is a representation of an ordered list of terms that describe the
document
➢ The order is obtained by frequency analysis, stemming, and stop lists
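The two tables of an inverted index can be sketched with plain dictionaries standing in for the hash or B+-tree indexes. This assumes the document table lists the terms of each document and the term table lists the documents containing each term. Function names and data are hypothetical.

```python
from collections import defaultdict


def build_tables(docs):
    """docs: doc_id -> list of terms. Returns (document_table, term_table)."""
    # document_table: doc_id -> postings list of the terms in that document.
    document_table = {doc_id: sorted(set(terms)) for doc_id, terms in docs.items()}
    # term_table: term -> postings list of the documents containing that term.
    term_table = defaultdict(list)
    for doc_id in sorted(docs):
        for term in document_table[doc_id]:
            term_table[term].append(doc_id)
    return document_table, dict(term_table)


def docs_with_all(term_table, terms):
    """Documents associated with every term in the given set of terms."""
    postings = [set(term_table.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical collection, for illustration only.
docs = {
    "d1": ["text", "mining", "text"],
    "d2": ["data", "mining"],
    "d3": ["text", "retrieval"],
}
document_table, term_table = build_tables(docs)
matches = docs_with_all(term_table, ["text", "mining"])
```

A lookup by exact term also shows the limitation noted above: a synonym of a query term has its own separate postings list and is never matched.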

QUERY PROCESSING TECHNIQUES
• Once an inverted index is created for a document collection,
a retrieval system can answer a keyword query quickly by looking up
which documents contain the query keywords.
• Maintain a score accumulator for each document and update these
accumulators as we go through each query term.
• For each query term, fetch all of the documents that match the term and
increase their scores.
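The accumulator scheme can be sketched as follows. It assumes an index that stores, for each term, the matching documents with their term frequencies, and it uses raw frequency as the score increment. Both are simplifying assumptions, and the names are illustrative.

```python
from collections import defaultdict


def rank(query_terms, index):
    """index: term -> {doc_id: term frequency}. Returns [(doc_id, score)], best first."""
    accumulators = defaultdict(float)  # one score accumulator per document
    for term in query_terms:
        # Fetch all documents that match this term and increase their scores.
        for doc_id, term_frequency in index.get(term, {}).items():
            accumulators[doc_id] += term_frequency
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index, for illustration only.
index = {
    "text": {"d1": 2, "d3": 1},
    "mining": {"d1": 1, "d2": 1},
}
ranking = rank(["text", "mining"], index)
```

Only documents that match at least one query term ever get an accumulator, which is why the lookup stays fast on large collections.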

• Relevance feedback
– When examples of relevant documents are available,
the system can learn from such examples to improve retrieval
performance.
– Effective in improving retrieval performance.
• Pseudo-feedback or blind feedback
– When no such relevant examples are available,
a system can assume the top few retrieved documents in some initial
retrieval results to be relevant and extract more related keywords to
expand the query.
– A process of mining useful keywords from the top retrieved documents.
– Often leads to improved retrieval performance.
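Pseudo-feedback can be sketched as a query expansion step. The code assumes an initial ranking already exists, treats the top `top_k` documents as relevant, and adds their most frequent new terms to the query. The function name, parameters, and data are hypothetical, and frequency counting is only one way to pick expansion terms.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """ranked_docs: token lists from an initial retrieval, best match first."""
    counts = Counter()
    for tokens in ranked_docs[:top_k]:  # assume the top few results are relevant
        counts.update(token for token in tokens if token not in query_terms)
    # Most frequent terms first; ties are broken alphabetically.
    candidates = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [term for term, _ in candidates[:n_new_terms]]
    return list(query_terms) + new_terms


# Hypothetical initial retrieval results for the query "jaguar".
ranked_docs = [
    ["jaguar", "cat", "habitat"],
    ["jaguar", "cat", "prey"],
    ["jaguar", "car", "dealer"],
]
expanded = expand_query(["jaguar"], ranked_docs)
```

If the top results are not actually relevant, the added terms can pull the query off topic, which is the risk of feedback without user judgments.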

TEXT MINING APPROACHES
• Approaches differ in the kind of data they take as input:
• The keyword-based approach
– The input is a set of keywords or terms in the documents
– May only discover relationships at a relatively shallow level,
such as the rediscovery of compound nouns (e.g., “database” and “systems”) or
co-occurring patterns of less significance (e.g., “terrorist” and
“explosion”)
– May not bring much deep understanding of the text

• The tagging approach
– The input is a set of tags
– Relies on tags obtained by manual tagging or by some automated
categorization algorithm
• The information-extraction approach
– A more advanced, challenging knowledge discovery task
– Takes as input semantic information, such as events, facts, or entities
uncovered by information extraction
– May lead to the discovery of some deep knowledge
– Requires semantic analysis of text by natural language
understanding and machine learning methods

• Various text mining tasks can be performed on the extracted
keywords, tags, or semantic information
• These include document clustering, classification, information
extraction, association analysis, and trend analysis.

TYPES OF TEXT DATA MINING
• Keyword-based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a common source
• Link analysis: unusual correlation between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: find information that violates usual patterns
• Hypertext analysis
– Patterns in anchors/links
• Anchor text correlations with linked objects

KEYWORD BASED ASSOCIATION
ANALYSIS
• Motivation
– Collect sets of keywords or terms that occur frequently together and
then find the association or correlation relationships among them
• Association Analysis Process
– Preprocess the text data by parsing, stemming, removing stop words,
etc.
– Invoke association mining algorithms
• Consider each document as a transaction
• View a set of keywords in the document as a set of items in the
transaction
– Term-level association mining

KEYWORD BASED ASSOCIATION
ANALYSIS
• Collects sets of keywords or terms that occur frequently together
• Finds the association or correlation relationships among them.
• Preprocesses the text data by parsing, stemming, removing stop words,
and so on, and then invokes association mining algorithms.
• In a document database,
– Each document can be viewed as a transaction,
– A set of keywords in the document can be viewed as a set of items in the transaction:
{document_id, a_set_of_keywords}
• The problem of keyword association mining is thus mapped to item association
mining in transaction databases
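To make the transaction view concrete, the sketch below treats each document as a transaction of keywords and counts how often each keyword pair occurs together. It is a brute-force pair count rather than a full association mining algorithm such as Apriori. The names, data, and support threshold are illustrative.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """transactions: document_id -> set of keywords. Returns {pair: support}."""
    n = len(transactions)
    pair_counts = Counter()
    for keywords in transactions.values():
        # Sorting gives each unordered pair one canonical form.
        pair_counts.update(combinations(sorted(keywords), 2))
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical {document_id, a_set_of_keywords} records.
transactions = {
    "d1": {"database", "systems", "query"},
    "d2": {"database", "systems"},
    "d3": {"text", "mining"},
    "d4": {"database", "systems", "mining"},
}
pairs = frequent_pairs(transactions, min_support=0.5)
```

The one pair that survives here is a compound noun, the kind of shallow rediscovery that keyword-based analysis tends to produce.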

TEXT CLASSIFICATION
• Motivation
– Automatic classification for the large number of on-line text documents
(Web pages, e-mails, corporate intranets, etc.)
• Classification Process
– Data preprocessing
– Definition of training and test sets
– Creation of the classification model using the selected classification
algorithm
– Classification model validation
– Classification of new/unknown text documents
• Text document classification differs from the classification of relational
data
– Document databases are not structured according to attribute-value pairs

TEXT CLASSIFICATION
• Classification Algorithms:
– Support Vector Machines
– K-Nearest Neighbors
– Naïve Bayes
– Neural Networks
– Decision Trees
– Association rule-based
– Boosting
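The classification process (training set, test set, model creation, validation) can be sketched with one of the listed algorithms. Below is a minimal multinomial Naïve Bayes with add-one smoothing, written from scratch for illustration. The labels, documents, and function names are made up, and a real system would normally use a library implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(training_set):
    """training_set: list of (tokens, label) pairs. Returns the model as a tuple."""
    class_counts = Counter(label for _, label in training_set)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocabulary = set()
    for tokens, label in training_set:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log posterior for the token list."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training and test sets.
training_set = [
    (["cheap", "pills", "offer"], "spam"),
    (["limited", "offer", "cheap"], "spam"),
    (["meeting", "agenda", "attached"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
test_set = [(["cheap", "offer"], "spam"), (["meeting", "notes"], "ham")]

model = train_naive_bayes(training_set)
# Model validation: the fraction of test documents labeled correctly.
accuracy = sum(classify(model, tokens) == label for tokens, label in test_set) / len(test_set)
```

The documents enter the model as bags of words rather than attribute-value pairs, which is the difference from relational classification noted above.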

DOCUMENT CLUSTERING
• Motivation
– Automatically group related documents based on their contents
– No predetermined training sets or taxonomies
– Generate a taxonomy at runtime
• Clustering Process
– Data preprocessing: remove stop words, stem, feature extraction,
lexical analysis, etc.
– Hierarchical clustering: compute similarities applying clustering
algorithms.
– Model-Based clustering (Neural Network Approach): clusters are
represented by “exemplars”. (e.g.: SOM)
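As a sketch of the hierarchical option, the code below performs single-link agglomerative clustering over keyword sets, using Jaccard similarity and a stopping threshold. Both the similarity measure and the threshold are assumptions chosen for the example, and the names and data are hypothetical.

```python
def jaccard(a, b):
    """Shared keywords divided by all keywords across the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative(docs, threshold):
    """docs: doc_id -> set of terms. Merge the two most similar clusters
    until the best single-link similarity falls below the threshold."""
    clusters = [[doc_id] for doc_id in sorted(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    jaccard(docs[a], docs[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical preprocessed documents (stop words removed, terms stemmed).
docs = {
    "d1": {"text", "mining", "cluster"},
    "d2": {"text", "mining", "index"},
    "d3": {"football", "league", "goal"},
    "d4": {"football", "goal", "match"},
}
clusters = agglomerative(docs, threshold=0.3)
```

No training set or predefined taxonomy is involved: the groups emerge from the document contents alone.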

APPLICATIONS OF TEXT MINING
• Digital libraries
• Academic and research field
• Life science
• Social media
• Business intelligence
