Text mining: Predictive models for text
Document Classification Document classification/categorization is a problem in text mining. The task is to assign a document to one or more categories, based on its contents.
Classification Techniques: nearest neighbor, decision rules, probabilistic models, linear models
K-nearest neighbor algorithm Given a distance metric, assign a new example the same class as its nearest neighbor. All training data is retained and used at classification time, and the method extends naturally to multi-class decisions.
K-nearest neighbor algorithm The simple algorithm is slow: for each training example x_i, if dist(x, x_i) < min, then nearest = x_i and min = dist(x, x_i). Use data structures (e.g., k-d trees) to speed up the search.
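The linear scan described above can be sketched as follows; this is a minimal brute-force 1-NN in Python, with a toy two-class dataset ("spam"/"ham") invented for illustration:

```python
import math

def nearest_neighbor(x, training_data):
    """Brute-force 1-NN: scan every training example and keep the closest.

    training_data is a list of (feature_vector, label) pairs; x is a
    feature vector. Returns the label of the nearest training example.
    """
    nearest_label, min_dist = None, float("inf")
    for xi, label in training_data:
        d = math.dist(x, xi)  # Euclidean distance
        if d < min_dist:
            min_dist, nearest_label = d, label
    return nearest_label

# Hypothetical training set for illustration
data = [((0.0, 0.0), "spam"), ((5.0, 5.0), "ham"), ((6.0, 4.0), "ham")]
print(nearest_neighbor((4.5, 5.5), data))  # -> ham
```

For large training sets, replacing this O(n) scan with a k-d tree or ball tree gives sub-linear query time in low dimensions.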
Advantages KNN is well suited to multi-modal classes because its classification decision is based on a small neighborhood of similar objects (the majority class in that neighborhood). So even if the target class is multi-modal (i.e., consists of objects whose independent variables have different characteristics in different subsets), it can still achieve good accuracy.
Drawbacks A major drawback of the similarity measure used in KNN is that it weights all features equally when computing similarities. This can lead to poor similarity measures and classification errors when only a small subset of the features is useful for classification.
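One common mitigation, not covered in the slides, is to weight each feature's contribution to the distance; a minimal sketch, with hypothetical weights chosen by hand:

```python
import math

def weighted_dist(x, y, w):
    """Euclidean distance with per-feature weights: features known to be
    irrelevant get a small weight so they contribute little to similarity."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

# Hypothetical weights: only the first feature matters for the class,
# so a large difference in the second feature barely affects the distance.
w = [1.0, 0.01]
print(weighted_dist((0, 0), (1, 10), w))  # -> sqrt(2) ~= 1.414
```

In practice the weights would be learned (e.g., from feature relevance scores) rather than set manually.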
Decision trees Decision trees are popular for pattern recognition because the models they produce are easy to understand. [Diagram: a tree with a root node, internal nodes, branches (decision points), and leaves (terminal nodes).]
Binary decision trees Each inequality used to split the input space is based on only one input variable, so each node draws a boundary that can be geometrically interpreted as a hyperplane perpendicular to that variable's axis.
Linear decision trees Linear decision trees are similar to binary decision trees, except that the inequality computed at each node takes an arbitrary linear form that may depend on multiple variables.
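The contrast between the two node types can be sketched as two test functions; the thresholds and weights below are hypothetical, chosen only to make the geometry concrete:

```python
def axis_split(x, feature=0, threshold=2.5):
    """Binary-tree node: compares a single input variable to a threshold,
    so the decision boundary is a hyperplane perpendicular to that axis."""
    return "left" if x[feature] < threshold else "right"

def linear_split(x, weights=(1.0, -0.5), bias=0.0):
    """Linear-tree node: an arbitrary linear form over several variables,
    giving an oblique (non-axis-parallel) decision boundary."""
    return "left" if sum(w * xi for w, xi in zip(weights, x)) + bias < 0 else "right"

print(axis_split((3.0, 1.0)))    # right: 3.0 >= 2.5
print(linear_split((1.0, 4.0)))  # left: 1.0 - 2.0 < 0
```

A full tree simply chains such tests from the root down to a leaf that stores a class label.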
Probabilistic model: Naive Bayes classifier A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Example: a fruit may be considered an apple if it is red, round, and about 4 inches in diameter. Even though these features may depend on one another, a naive Bayes classifier treats all of them as contributing independently to the probability that the fruit is an apple.
Parameter estimation All model parameters can be approximated with relative frequencies from the training set. If a given class and feature value never occur together in the training set, the frequency-based probability estimate will be zero. This is problematic because it wipes out the information in the other probabilities when they are multiplied. It is therefore often desirable to incorporate a small-sample correction (e.g., Laplace smoothing) in all probability estimates, so that no probability is ever exactly zero.
Constructing a classifier from the probability model The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable. This is known as the maximum a posteriori or MAP decision rule. The corresponding classifier is the function: Classify(f1, f2, …, fn) = argmax_c p(C = c) ∏_i p(F_i = f_i | C = c)
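The estimation and decision steps above can be combined into a minimal sketch; the fruit data below is a toy example in the spirit of the apple illustration, and alpha is the Laplace (add-alpha) correction mentioned under parameter estimation:

```python
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Estimate p(C=c) and p(F_i=f_i | C=c) by relative frequency,
    with add-alpha (Laplace) smoothing so no estimate is exactly zero."""
    class_counts = Counter(c for _, c in examples)
    feat_counts = defaultdict(Counter)  # (i, c) -> Counter of feature values
    feat_values = defaultdict(set)      # i -> set of observed values
    for feats, c in examples:
        for i, f in enumerate(feats):
            feat_counts[(i, c)][f] += 1
            feat_values[i].add(f)
    return class_counts, feat_counts, feat_values, alpha

def classify(model, feats):
    """MAP rule: argmax_c p(C=c) * prod_i p(F_i=f_i | C=c)."""
    class_counts, feat_counts, feat_values, alpha = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, nc in class_counts.items():
        p = nc / n  # prior p(C=c)
        for i, f in enumerate(feats):
            p *= (feat_counts[(i, c)][f] + alpha) / (nc + alpha * len(feat_values[i]))
        if p > best_p:
            best, best_p = c, p
    return best

# Toy fruit data: (colour, shape) -> label
data = [(("red", "round"), "apple"), (("red", "round"), "apple"),
        (("yellow", "long"), "banana"), (("yellow", "round"), "apple")]
model = train_nb(data)
print(classify(model, ("red", "round")))  # -> apple
```

Because of the smoothing, even a (colour, shape) pair never seen with a class still gets a small nonzero likelihood rather than zeroing out the whole product.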
Performance Evaluation Given n test documents and m classes under consideration, a classifier makes n × m binary decisions. A two-by-two contingency table can be computed for each class:

             truly yes   truly no
System yes       a           b
System no        c           d
Performance Evaluation Recall = a/(a+c) where a + c > 0 (otherwise undefined). Did we find all of the documents that belonged in the class? Precision = a/(a+b) where a + b > 0 (otherwise undefined). Of the times we predicted "in class", how often were we correct?
Performance Evaluation Accuracy = (a + d) / n. When one class is overwhelmingly in the majority, this may not paint an accurate picture. Other measures: miss rate, false alarm rate (fallout), error rate, F-measure, break-even point, ...
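The formulas above can be computed directly from the contingency-table cells; the counts in the example call are made up to show the majority-class caveat (accuracy looks high while recall is mediocre):

```python
def metrics(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    recall = a / (a + c) if a + c > 0 else None      # otherwise undefined
    precision = a / (a + b) if a + b > 0 else None   # otherwise undefined
    accuracy = (a + d) / n
    f1 = (2 * precision * recall / (precision + recall)
          if precision is not None and recall is not None
          and precision + recall > 0 else None)
    return recall, precision, accuracy, f1

# 8 correct positives, 2 false alarms, 4 misses, 86 correct negatives
print(metrics(8, 2, 4, 86))  # recall 2/3, precision 0.8, accuracy 0.94
```

Here accuracy is 0.94 largely because the "no" class dominates, while recall is only 2/3, which is why precision, recall, and the F-measure are preferred for skewed classes.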
Applications Web pages organized into category hierarchies. Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.). Census Bureau survey responses coded by occupation. Patents archived using the International Patent Classification. Patient records coded using international insurance categories. E-mail message filtering. News events tracked and filtered by topic.
Conclusion In this presentation we learned about document classification, classification techniques, performance evaluation, and applications.
Visit more self-help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free and self-guiding and does not involve any additional support. Visit us at www.dataminingtools.net
