TEXT MINING
By
Yashvi Babariya
INTRODUCTION
• Text mining is a discovery process.
• It is also referred to as Text Data Mining (TDM) and Knowledge Discovery in Textual Databases (KDT).
• Its goal is to extract relevant information, knowledge, or patterns from different sources that are in unstructured or semi-structured form.
DATA MINING VS. TEXT MINING
Data Mining | Text Mining
Processes data directly | Linguistic processing or natural language processing (NLP)
Identifies causal relationships | Discovers heretofore unknown information
Structured data | Semi-structured & unstructured data (text)
Structured numeric transaction data residing in a relational data warehouse | Applications deal with much more diverse and eclectic collections of systems & formats
INPUT OUTPUT MODEL FOR TEXT MINING
STEPS FOR TEXT MINING
• Pre-processing the text
• Applying text mining techniques
  o Summarization
  o Classification
  o Clustering
  o Visualization
  o Information extraction
• Analyzing the text
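The first and last steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the method from the slides: the stop-word list, the function names `preprocess` and `term_counts`, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "from"}

def preprocess(text):
    """Pre-processing: lowercase, tokenize on runs of letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_counts(text):
    """A very basic analysis step: count how often each remaining term occurs."""
    return Counter(preprocess(text))

sample = "Text mining is the discovery of patterns in text."
print(preprocess(sample))
print(term_counts(sample).most_common(2))
```

A real pipeline would place one of the listed techniques (classification, clustering, and so on) between these two steps.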
TEXT DATABASES & INFORMATION RETRIEVAL
• Text databases (document databases)
  o Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, web pages, library databases, etc.
  o The stored data is usually semi-structured
• Information retrieval
  o A field developed in parallel with database systems
  o Information is organized into a large number of documents
  o Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
TYPICAL INFORMATION RETRIEVAL PROBLEM
• To locate relevant documents in a document collection based on a user’s query
  o The query is typically some keywords describing an information need
• For an ad hoc information need, the user takes the initiative to “pull” the relevant information out of the collection
• For a long-term information need, a retrieval system may also take the initiative to “push” relevant information to the user
  o Such an information access process is called information filtering
  o The corresponding systems are called filtering systems or recommender systems
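The difference between "pull" and "push" access can be illustrated with a small Python sketch. Everything here is hypothetical: the functions `matches`, `pull`, and `push`, the keyword-list profiles, and the sample documents are assumptions for illustration, and real filtering systems use far richer matching than exact keyword containment.

```python
def matches(document, keywords):
    """True if the document contains every keyword (case-insensitive)."""
    words = set(document.lower().split())
    return all(k.lower() in words for k in keywords)

def pull(collection, query_keywords):
    """Ad hoc need: the user issues a query against the stored collection."""
    return [d for d in collection if matches(d, query_keywords)]

def push(new_documents, user_profiles):
    """Long-term need: route each incoming document to every user whose
    standing profile (a keyword list) it matches."""
    deliveries = {user: [] for user in user_profiles}
    for doc in new_documents:
        for user, profile in user_profiles.items():
            if matches(doc, profile):
                deliveries[user].append(doc)
    return deliveries

collection = ["text mining overview", "database transaction management"]
pulled = pull(collection, ["mining"])

profiles = {"user_a": ["retrieval"], "user_b": ["database"]}
pushed = push(["new text retrieval survey"], profiles)
print(pulled)
print(pushed)
```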
INFORMATION RETRIEVAL
• Typical IR systems
  o Online library catalogs
  o Online document management systems
• Information retrieval vs. database systems
  o Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  o Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
BASIC MEASURES FOR TEXT RETRIEVAL
• Suppose that a text retrieval system has just retrieved a number of documents based on a query
• Let the set of documents relevant to the query be denoted as {Relevant}
• Let the set of documents retrieved be denoted as {Retrieved}
• The set of documents that are both relevant and retrieved is denoted as {Relevant} ∩ {Retrieved}
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

• Recall: the percentage of documents relevant to the query that were in fact retrieved

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

• An information retrieval system often needs to trade off recall for precision, or vice versa
• F-score: the harmonic mean of recall and precision

$$F\text{-score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision})/2}$$
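As a worked example of these three measures, the following Python sketch computes them from two sets of document identifiers. The function name `precision_recall_f`, the identifiers `d1` to `d5`, and the choice to return 0.0 when a denominator is empty are assumptions for illustration, not part of the slides.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Assumption: treat an empty denominator as a score of 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    # Same value as recall * precision / ((recall + precision) / 2).
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2), which gives an F-score of 4/7.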
INFORMATION RETRIEVAL CONCEPT
• Basic concept
  o A document can be described by a set of representative keywords called index terms
  o Different index terms have varying relevance when used to describe document content
  o This effect is captured by assigning a numerical weight to each index term of a document
• DBMS analogy
  o Index terms → Attributes
  o Weights → Attribute values
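One way to picture the analogy is a mapping from index terms to weights, much like a record mapping attributes to values. The sketch below is an assumption-laden illustration: the slides do not say how weights are assigned, so relative term frequency is used here purely as a stand-in, and the name `index_term_weights` is hypothetical.

```python
from collections import Counter

def index_term_weights(tokens):
    """Map each index term to a weight.

    Assumption: the weight is the term's relative frequency in the document;
    other weighting schemes are equally possible.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

doc_tokens = ["text", "mining", "text", "retrieval"]
weights = index_term_weights(doc_tokens)
print(weights)
```

In the DBMS analogy, the keys of `weights` play the role of attributes and the numbers play the role of attribute values.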
TEXT RETRIEVAL METHODS
• Document selection method
  o Knowledge base retrieval
  o The query specifies constraints for selecting relevant documents
  o Boolean retrieval model: a document is represented by a set of keywords
  o The user provides a Boolean expression of keywords, such as “car and repair shops” or “tea or coffee”
  o The system returns the documents that satisfy the Boolean expression
  o It is difficult to express a user’s information need exactly with a Boolean query
  o Used when the user knows a lot about the document collection and can formulate a good query
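The Boolean retrieval model maps naturally onto set operations. The following Python sketch assumes a hypothetical three-document collection and a helper named `docs_with`; it evaluates an AND query and an OR query in the spirit of the examples above.

```python
# Hypothetical collection: each document is represented by a set of keywords.
docs = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "coffee"},
    "d3": {"car", "tea"},
}

def docs_with(term):
    """Return the ids of all documents whose keyword set contains the term."""
    return {doc_id for doc_id, keywords in docs.items() if term in keywords}

# AND corresponds to set intersection, OR to set union.
car_and_repair = docs_with("car") & docs_with("repair")
tea_or_coffee = docs_with("tea") | docs_with("coffee")

print(sorted(car_and_repair))
print(sorted(tea_or_coffee))
```

A document is either returned or not; the model has no notion of partial match or ranking, which is part of why an exact information need is hard to express this way.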