TEXT MINING
BY REVATHY S AND KOSHY G
INTRODUCTION
• Text mining is a discovery process.
• It is also referred to as Text Data Mining (TDM) and Knowledge Discovery in Textual Databases (KDT).
• Its goal is to extract relevant information, knowledge, or patterns from different sources that are in unstructured or semi-structured form.
DATA MINING VS. TEXT MINING
INPUT OUTPUT MODEL FOR TEXT MINING
STEPS FOR TEXT MINING
• Preprocessing the text
• Applying text mining techniques
– Summarization
– Classification
– Clustering
– Visualization
– Information extraction
• Analyzing the text
TEXT DATABASES & INFORMATION RETRIEVAL
• Text databases (document databases)
– Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
– The data stored is usually semi-structured
• Information retrieval
– A field developed in parallel with database systems
– Information is organized into (a large number of) documents
– Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
TYPICAL INFORMATION RETRIEVAL PROBLEM
• To locate relevant documents in a document collection based on a user’s query.
• The query is usually some keywords describing an information need.
• For an ad hoc (i.e., short-term) information need, the user takes the initiative to “pull” the relevant information out of the collection.
• For a long-term information need, a retrieval system may also take the initiative to “push” relevant information to the user.
• Such an information access process is called information filtering.
• The corresponding systems are called filtering systems or recommender systems.
INFORMATION RETRIEVAL
• Typical IR systems
– Online library catalogs
– Online document management systems
• Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update, transaction
management, complex objects
– Some IR problems are not addressed well in DBMS, e.g.,
unstructured documents, approximate search using keywords and
relevance
BASIC MEASURES FOR TEXT RETRIEVAL
• Suppose that a text retrieval system has just retrieved a number of documents based on a query.
• Let the set of documents relevant to the query be denoted as {Relevant}.
• Let the set of documents retrieved be denoted as {Retrieved}.
• The set of documents that are both relevant and retrieved is denoted as {Relevant} ∩ {Retrieved}.
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses): precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|.
• Recall: the percentage of documents relevant to the query that were, in fact, retrieved: recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|.
• An information retrieval system often needs to trade off recall for precision or vice versa.
• F-score: the harmonic mean of recall and precision: F = 2 × precision × recall / (precision + recall).
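The three measures can be computed directly from the two sets. The following is a minimal sketch; the function name `precision_recall_f` and the document ids are made up for illustration.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # size of {Relevant} ∩ {Retrieved}
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).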
INFORMATION RETRIEVAL CONCEPTS
• Basic Concepts
– A document can be described by a set of representative keywords called index terms.
– Different index terms have varying relevance when used to describe document contents.
– This effect is captured by assigning a numerical weight to each index term of a document (e.g., frequency, tf-idf).
• DBMS Analogy
– Index terms → Attributes
– Weights → Attribute values
TEXT RETRIEVAL METHODS
• Document selection methods
– Knowledge-based retrieval
– The query is regarded as specifying constraints for selecting relevant documents.
– Boolean retrieval model: a document is represented by a set of keywords.
– The user provides a Boolean expression of keywords, such as “car and repair shops,” “tea or coffee,” or “database systems but not Oracle.”
– The system returns the documents that satisfy the Boolean expression.
– It is difficult to prescribe a user’s information need exactly with a Boolean query.
– Used when the user knows a lot about the document collection and can formulate a good query.
• Document ranking methods
– Similarity-based retrieval
– Use the query to rank all documents in order of relevance.
– Present a ranked list of documents in response to a user’s keyword query.
– Ranking methods are based on mathematical foundations, including algebra, logic, probability, and statistics.
– Match the keywords in a query with those in the documents and score each document based on how well it matches the query.
– Degree of relevance: a score computed from information such as the frequency of words in the document.
SIMILARITY BASED RETRIEVAL
• Finds similar documents based on a set of common keywords.
• The answer should be based on the degree of relevance, determined by the nearness of the keywords, the relative frequency of the keywords, etc.
• Basic techniques
– Stop list
  • A set of words that are deemed “irrelevant”, even though they may appear frequently
  • E.g., a, the, of, for, to, with, etc.
  • Stop lists may vary when the document set varies
– Word stem
  • Several words are small syntactic variants of each other since they share a common word stem
  • E.g., drug, drugs, drugged
– A term frequency table
  • Each entry frequent_table(i, j) = number of occurrences of the word t_i in document d_j
  • Usually, the ratio instead of the absolute number of occurrences is used
– Similarity metrics: measure the closeness of a document to a query (a set of keywords)
  • Relative term occurrences
  • Cosine measure: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
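The preprocessing techniques above (stop list, word stems, relative term frequencies) can be sketched in a few lines. The stop list is the one given on the slide; `crude_stem` is a hypothetical toy suffix stripper tuned only to the drug/drugs/drugged example, where a real system would use an established stemmer.

```python
from collections import Counter

STOP_LIST = {"a", "the", "of", "for", "to", "with"}  # example stop list from the slide


def crude_stem(word):
    """Toy suffix stripper (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ged", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def relative_term_frequencies(text):
    """Map each stem to its share of the document's non-stop-word tokens."""
    tokens = [crude_stem(w) for w in text.lower().split() if w not in STOP_LIST]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


# Hypothetical two-document collection.
docs = {
    "d1": "the drug for the patient",
    "d2": "drugs drugged the patients",
}
# Term frequency table holding ratios rather than absolute counts.
table = {doc_id: relative_term_frequencies(text) for doc_id, text in docs.items()}
print(table)
```

After stop-word removal and stemming, both documents are described by the same two stems, `drug` and `patient`, with different relative frequencies.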
INFORMATION RETRIEVAL MODELS
• Information Retrieval Models:
– Boolean Model
– Vector Model
– Probabilistic Model
BOOLEAN MODEL
• Index terms are considered to be either present or absent in a document.
• As a result, the index term weights are all binary.
• A query is composed of index terms linked by three connectives: not, and, or.
– e.g., car and repair, plane or airplane
• The Boolean model predicts that each document is either relevant or non-relevant, based on whether the document matches the query.
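With binary weights, the three connectives map onto set operations over the documents that contain each term. The sketch below assumes a tiny hypothetical collection in which each document is already reduced to its set of index terms; the helper name `matching` is illustrative.

```python
# Hypothetical collection: document id -> set of index terms.
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"car", "rental"},
    "d3": {"airplane", "repair"},
}


def matching(term):
    """Documents whose keyword set contains the term (binary weight 1)."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


all_docs = set(docs)
car_and_repair = matching("car") & matching("repair")                # and: intersection
plane_or_airplane = matching("plane") | matching("airplane")         # or: union
car_not_rental = matching("car") & (all_docs - matching("rental"))   # not: complement

print(car_and_repair, plane_or_airplane, car_not_rental)
```

Each document either satisfies the expression or does not; there is no ranking among the matches.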
THE VECTOR SPACE MODEL
• Represent a document and a query both as vectors in a high-dimensional space corresponding to all the keywords.
• Use an appropriate similarity measure to compute the similarity between the query vector and the document vector.
• The similarity values can then be used for ranking documents.
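A minimal sketch of vector-space ranking follows. It assumes raw term counts as vector components and the cosine measure as the similarity; both are common choices rather than anything the slides prescribe, and the documents are made up.

```python
import math
from collections import Counter


def to_vector(text):
    """Raw term-count vector, stored sparsely as term -> count."""
    return Counter(text.lower().split())


def cosine(v1, v2):
    """(v1 . v2) / (|v1| |v2|); 0.0 if either vector is empty."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = to_vector(query)
    scores = {doc_id: cosine(q, to_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection.
docs = {
    "d1": "text mining finds patterns in text",
    "d2": "data mining finds patterns in tables",
    "d3": "cooking recipes",
}
ranking = rank("text mining", docs)
print(ranking)
```

Unlike the Boolean model, every document receives a score, so a partially matching document such as `d2` still appears in the list, below `d1`.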
MODEL A DOCUMENT
• Starting with a set of d documents and a set of t terms, model each document as a vector v in the t-dimensional space R^t; this is the vector-space model.
• The term frequency is the number of occurrences of term t in document d, that is, freq(d, t).
• The (weighted) term-frequency matrix TF(d, t) measures the association of a term t with respect to the given document d.
• It is generally defined as 0 if the document does not contain the term, and nonzero otherwise.
• In the simplest (binary) weighting, TF(d, t) = 1 if the term t occurs in the document d, and 0 otherwise.
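The slides mention tf-idf as a weighting scheme without giving a formula. The sketch below is one common formulation, stated here as an assumption: the raw count freq(d, t) serves as the term-frequency part and idf(t) = log(N / df(t)) as the inverse document frequency, where N is the number of documents and df(t) the number of documents containing t.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using raw counts and idf = log(N / df)."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for c in counts.values() for term in c)
    return {
        doc_id: {term: freq * math.log(n_docs / df[term]) for term, freq in c.items()}
        for doc_id, c in counts.items()
    }


# Hypothetical two-document collection.
docs = {
    "d1": "text mining text",
    "d2": "data mining",
}
weights = tf_idf(docs)
print(weights)
```

A term that occurs in every document (here `mining`) gets weight 0 under this formulation, reflecting that it does not help to tell the documents apart.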
TEXT INDEXING TECHNIQUES
• Inverted index
– Maintains two hash- or B+-tree-indexed tables:
  • document_table: a set of document records <doc_id, postings_list>
  • term_table: a set of term records <term, postings_list>
– Answers queries such as: find all documents associated with one or a set of terms
  • Easy to implement
  • Does not handle synonymy and polysemy well, and posting lists can be too long (storage can be very large)
• Signature file
– Associates a signature with each document
– A signature is a representation of an ordered list of terms that describe the document
– The order is obtained by frequency analysis, stemming, and stop lists
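The two tables of an inverted index can be sketched with plain dictionaries. This is an in-memory illustration only: Python dicts stand in for the hash- or B+-tree-indexed tables, and postings are kept as unordered sets rather than sorted lists. The function names are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Build document_table (doc_id -> terms) and term_table (term -> doc_ids)."""
    document_table = {}
    term_table = defaultdict(set)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].add(doc_id)
    return document_table, term_table


def docs_with_all(term_table, terms):
    """Find the documents associated with every given term by intersecting postings."""
    postings = [term_table.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical collection.
docs = {
    "d1": "text mining and retrieval",
    "d2": "data mining",
    "d3": "text retrieval",
}
document_table, term_table = build_index(docs)
print(docs_with_all(term_table, ["text", "retrieval"]))
```

Because the lookup is purely by term string, a query for `car` will not find a document that only says `automobile`, which is the synonymy limitation noted above.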
QUERY PROCESSING TECHNIQUES
• Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords.
• Maintain a score accumulator for each document and update these accumulators while going through each query term.
• For each query term, fetch all of the documents that match the term and increase their scores.
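The accumulator scheme can be sketched as follows. The scoring rule used here (add the raw term count for each matching query term) is an assumption chosen for simplicity; the slides do not specify how much each match should increase a score.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """term -> {doc_id: term count}; a minimal inverted index with counts."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            postings[term][doc_id] = count
    return postings


def score(query, postings):
    """Go through the query terms and update one accumulator per matching document."""
    accumulators = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in postings.get(term, {}).items():
            accumulators[doc_id] += count  # assumed scoring rule: raw term count
    # Highest score first; ties broken by document id.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical collection.
docs = {
    "d1": "text mining mines text",
    "d2": "data mining",
    "d3": "cooking recipes",
}
results = score("text mining", build_postings(docs))
print(results)
```

Only documents that appear in some posting list are ever touched, so documents sharing no term with the query (here `d3`) cost nothing at query time.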
• Relevance feedback
– When examples of relevant documents are available, the system can learn from such examples to improve retrieval performance.
– Effective in improving retrieval performance.
• Pseudo-feedback (blind feedback)
– Used when no such relevant examples are available.
– The system assumes the top few retrieved documents in some initial retrieval results to be relevant and extracts more related keywords to expand the query.
– This is a process of mining useful keywords from the top retrieved documents.
– It often leads to improved retrieval performance.
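Pseudo-feedback can be sketched as a simple query-expansion step. The function `expand_query`, its parameters `top_k` and `extra_terms`, and the ranked list are all hypothetical; the sketch picks expansion terms by raw frequency, which is only one of many possible selection rules.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, extra_terms=2):
    """Add the most frequent new terms from the top_k documents to the query."""
    counts = Counter()
    for text in ranked_docs[:top_k]:  # blindly assume these documents are relevant
        counts.update(text.lower().split())
    for term in query_terms:  # keep only terms not already in the query
        counts.pop(term, None)
    # Most frequent first; ties broken alphabetically for a deterministic result.
    new_terms = [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return list(query_terms) + new_terms[:extra_terms]


# Hypothetical initial retrieval result for the query "jaguar", best match first.
ranked_docs = [
    "jaguar car engine repair",
    "jaguar car dealer",
    "jaguar cat habitat",
]
expanded = expand_query(["jaguar"], ranked_docs)
print(expanded)
```

The expanded query would then be run again. If the top documents are not actually relevant, the added terms can pull the results further off topic, which is the risk of feedback without user judgments.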
TEXT MINING APPROACHES
• Approaches can be distinguished by the kinds of data they take as input.
• The keyword-based approach
– The input is a set of keywords or terms in the documents.
– May only discover relationships at a relatively shallow level, such as the rediscovery of compound nouns (e.g., “database” and “systems”) or co-occurring patterns of less significance (e.g., “terrorist” and “explosion”).
– May not bring much deep understanding of the text.
• The tagging approach
– The input is a set of tags.
– Relies on tags obtained by manual tagging or by some automated categorization algorithm.
• The information-extraction approach
– A more advanced and challenging knowledge discovery task.
– Takes as input semantic information, such as events, facts, or entities uncovered by information extraction.
– Can lead to the discovery of some deep knowledge.
– Requires semantic analysis of text by natural language understanding and machine learning methods.
• Various text mining tasks can be performed on the extracted
keywords, tags, or semantic information
• These include document clustering, classification, information
extraction, association analysis, and trend analysis.
TYPES OF TEXT DATA MINING
• Keyword-based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a common source
• Link analysis: unusual correlation between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: find information that violates usual patterns
• Hypertext analysis
– Patterns in anchors/links
  • Anchor text correlations with linked objects
KEYWORD BASED ASSOCIATION ANALYSIS
• Motivation
– Collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them.
• Association analysis process
– Preprocess the text data by parsing, stemming, removing stop words, etc.
– Invoke association mining algorithms.
  • Consider each document in the document database as a transaction: {document_id, a_set_of_keywords}
  • View the set of keywords in the document as the set of items in the transaction.
– Term-level association mining
• The problem of keyword association mining is thereby mapped to item association mining in transaction databases.
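Treating each document as a transaction of keywords, frequently co-occurring keyword pairs can be found by counting support. The sketch below uses brute-force pair counting rather than a full association mining algorithm such as Apriori; the function name, the threshold, and the data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return keyword pairs whose support (fraction of documents) >= min_support."""
    pair_counts = Counter()
    for keywords in transactions.values():
        # Sort so that each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(keywords), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical document database: document_id -> a_set_of_keywords.
transactions = {
    "d1": {"database", "systems", "query"},
    "d2": {"database", "systems"},
    "d3": {"database", "mining"},
    "d4": {"text", "mining"},
}
pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```

The only pair meeting the threshold is `database` and `systems`, which illustrates the earlier remark that keyword-based mining tends to rediscover compound nouns.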
TEXT CLASSIFICATION
• Motivation
– Automatic classification for the large number of on-line text documents
(Web pages, e-mails, corporate intranets, etc.)
• Classification Process
– Data preprocessing
– Definition of training and test sets
– Creation of the classification model using the selected classification
algorithm
– Classification model validation
– Classification of new/unknown text documents
• Text document classification differs from the classification of relational
data
– Document databases are not structured according to attribute-value pairs
• Classification Algorithms:
– Support Vector Machines
– K-Nearest Neighbors
– Naïve Bayes
– Neural Networks
– Decision Trees
– Association rule-based
– Boosting
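As an illustration of the classification process with one of the listed algorithms, the following is a minimal multinomial Naïve Bayes sketch with Laplace (add-one) smoothing. The training and test documents, labels, and function names are made up, and preprocessing is reduced to lowercasing and whitespace splitting.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Pick the class with the highest log-probability under add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training and test sets.
training_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
test_set = [("buy cheap pills", "spam"), ("monday project meeting", "ham")]

model = train(training_set)
# Model validation: fraction of test documents classified correctly.
accuracy = sum(classify(text, model) == label for text, label in test_set) / len(test_set)
print(accuracy)
```

The block follows the process steps in miniature: define training and test sets, build the model, validate it on the test set, and then apply `classify` to new documents.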
DOCUMENT CLUSTERING
• Motivation
– Automatically group related documents based on their contents
– No predetermined training sets or taxonomies
– Generate a taxonomy at runtime
• Clustering Process
– Data preprocessing: remove stop words, stem, extract features, perform lexical analysis, etc.
– Hierarchical clustering: compute similarities between documents and apply clustering algorithms.
– Model-based clustering (neural network approach): clusters are represented by “exemplars” (e.g., SOM).
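A minimal sketch of hierarchical (agglomerative) document clustering follows. It assumes raw term-count vectors, cosine similarity, and single-link merging, and it stops when `k` clusters remain; these are illustrative choices, not the only way to perform the clustering step.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(v1, v2):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def cluster(docs, k):
    """Single-link agglomerative clustering: merge the closest pair until k clusters remain."""
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    def link(pair):
        # Single link: similarity of the two most similar members.
        a, b = pair
        return max(cosine(vectors[x], vectors[y]) for x in a for y in b)

    clusters = [{doc_id} for doc_id in docs]  # start with one cluster per document
    while len(clusters) > max(k, 1):
        a, b = max(combinations(clusters, 2), key=link)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Hypothetical collection with two obvious topics.
docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise",
    "d3": "football match result",
    "d4": "football cup match",
}
result = cluster(docs, 2)
print(result)
```

No labels or training set are involved: the grouping emerges from the similarities alone, and recording the order of merges would give the taxonomy mentioned above.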
APPLICATIONS OF TEXT MINING
• Digital libraries
• Academic and research field
• Life science
• Social media
• Business intelligence
THANKS

Editor's Notes

  • #5: Most information is stored electronically in the form of text databases. Data stored in most text databases are semistructured: a document may contain a few structured fields, such as title, authors, publication date, and category, alongside largely unstructured text components, such as the abstract and contents.
  • #7: Database systems focus on query and transaction processing of structured data, whereas information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. Typical IR systems include on-line library catalog systems, on-line document management systems, and Web search engines.
  • #21: Two text indexing techniques are inverted indices and signature files. An inverted index is an index structure that maintains two hash-indexed or B+-tree-indexed tables: document table and term table. The document table consists of a set of document records, each containing two fields: doc id and posting list, where the posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure. The term table consists of a set of term records, each containing two fields: term id and posting list, where the posting list is a list of identifiers of the documents in which the term appears. It is therefore easy to answer queries like “Find all of the documents associated with a given set of terms,” or “Find all of the terms associated with a given set of documents.” For example, to find all of the documents associated with a set of terms, we can first find the list of document identifiers in the term table for each term, and then intersect these lists to obtain the set of relevant documents. A signature file is a file that stores a signature record for each document in the database. Each signature has a fixed size of b bits representing terms. Each bit of a document signature is initialized to 0, and a bit is set to 1 if the term it represents appears in the document. A signature S1 matches another signature S2 if each bit that is set in signature S2 is also set in S1.
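The signature matching rule in note #21 can be sketched with integers used as bit strings. The signature size `SIGNATURE_BITS` and the term-to-bit mapping `bit_for` are assumptions made for illustration; the note only says that each bit represents terms.

```python
SIGNATURE_BITS = 16  # the fixed size b; this value is an arbitrary assumption


def bit_for(term):
    """Hypothetical term-to-bit mapping; distinct terms may share a bit."""
    return sum(term.encode("utf-8")) % SIGNATURE_BITS


def signature(terms):
    """Start from all zeros and set the bit of every term that appears."""
    sig = 0
    for term in terms:
        sig |= 1 << bit_for(term)
    return sig


def matches(s1, s2):
    """S1 matches S2 if each bit that is set in S2 is also set in S1."""
    return (s1 & s2) == s2


# Hypothetical document and queries.
doc_sig = signature({"text", "mining", "retrieval"})
query_sig = signature({"text", "mining"})
other_sig = signature({"cooking"})

print(matches(doc_sig, query_sig), matches(doc_sig, other_sig))
```

Because several terms can map to the same bit, a match only means the document may contain the query terms, so candidate documents still need to be checked against their actual text.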