Text mining: Introduction and data preparation

Overview of Text Mining
What is text mining? Text mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."

Need for Text Mining
We can better understand the need for text mining through a practical example from the biotech industry.
- About 80% of biological knowledge exists only in research papers (unstructured data).
- If a scientist manually reads 50 research papers per week and only 10% of them are useful, he or she gains only 5 useful papers per week.
- Meanwhile, online databases such as Medline add more than 10,000 abstracts per month.
With text mining, the rate at which relevant data can be gathered increases dramatically, which shows the need for it.

Challenges in Text Mining
- Information is in unstructured textual form.
- Textual databases are large: almost all publications are also available in electronic form.
- Very high number of possible "dimensions" (but sparse): all possible word and phrase types in the language.
- Complex and subtle relationships between concepts in text, e.g. "AOL merges with Time-Warner" and "Time-Warner is bought by AOL".
- Word ambiguity and context sensitivity:
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
- Noisy data, e.g. spelling mistakes.

Text Mining Process
- Text preprocessing: syntactic/semantic text analysis
- Feature generation: bag of words
- Feature selection: simple counting, statistics
- Text/data mining: classification, clustering, associations
- Analyzing results
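
The feature generation and "simple counting" steps of the process can be sketched in base R as follows. This is only an illustration: the three documents, the whitespace tokenizer and the names docs, vocab and dtm are assumptions made for the example and do not come from the slides.

```r
# Toy corpus, invented for illustration.
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data",
          "mining text is fun")

# Tokenize: lower-case, then split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: the sorted set of distinct terms across all documents.
vocab <- sort(unique(unlist(tokens)))

# Bag of words: one row per document, one column per term,
# each cell counting how often the term occurs in the document.
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Simple counting: total occurrences of each term in the corpus.
term_totals <- colSums(dtm)
print(dtm)
```

Word order is discarded in this representation; only the counts survive, which is why most cells of a realistic matrix are zero (the sparse "dimensions" mentioned under Challenges).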

Applications
The potential applications are countless:
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracks
- News story classification
- Web search
- etc.

Tokenization
Convert a sentence into a sequence of tokens, i.e. words.
Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters.
Tokenizing general English sentences is relatively straightforward:
- Use spaces as the boundaries.
- Use some heuristics to handle exceptions.

Tokenization issues
- Separate possessive endings or abbreviated forms from preceding words:
  Mary's -> Mary 's
  Mary's -> Mary is
  Mary's -> Mary has
- Separate punctuation marks and quotes from words:
  Mary. -> Mary .
  "new" -> " new "
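
A minimal sketch of such heuristics in base R is given below. The function name tokenize and the two rules it applies are assumptions chosen for illustration; it handles ASCII apostrophes and quotes only, leaves the 's ending undecided between possessive, "is" and "has", and would wrongly split abbreviations such as "U.S.".

```r
tokenize <- function(sentence) {
  # Heuristic 1: split the 's ending from the preceding word
  # ("Mary's" becomes "Mary 's").
  s <- gsub("(\\w)('s)\\b", "\\1 \\2", sentence, perl = TRUE)
  # Heuristic 2: put spaces around punctuation marks and double quotes.
  s <- gsub("([.,!?;:\"])", " \\1 ", s, perl = TRUE)
  # Finally use whitespace as the token boundary.
  tok <- strsplit(trimws(s), "\\s+")[[1]]
  tok[nzchar(tok)]
}

tokenize("Mary's book.")
tokenize("\"new\"")
```

Deciding whether 's stands for a possessive, "is" or "has" needs context, so a tokenizer of this kind usually only separates the ending and leaves the interpretation to later analysis.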

Dictionary creation
- A dictionary is used to locate the occurrences of a particular term in the documents.
- It reduces the retrieval time of an algorithm.
- The occurrence lists are stored as linked lists.

Example
Brutus    -> 1 2 4 11 31 45 173 174
Caesar    -> 1 2 4 5 6 16 57 132 ...
Calpurnia -> 2 31 54 101
The example above lists the documents in which the terms Brutus, Caesar and Calpurnia occur.
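
A dictionary of this kind can be sketched in base R as a named list that maps each term to the ids of the documents containing it. The three toy documents and the name build_index are assumptions for illustration, and an R integer vector stands in for the linked list mentioned on the slide.

```r
# Toy documents, invented for illustration; ids are their positions 1, 2, 3.
docs <- c("brutus killed caesar",
          "caesar praised brutus and calpurnia",
          "calpurnia warned caesar")

build_index <- function(docs) {
  # Distinct terms of each document.
  uniq <- lapply(strsplit(tolower(docs), "\\s+"), unique)
  # One (term, document id) pair per distinct term per document.
  terms   <- unlist(uniq)
  doc_ids <- rep(seq_along(uniq), lengths(uniq))
  # Group the ids by term: a named list of increasing document ids.
  split(doc_ids, terms)
}

index <- build_index(docs)
index[["caesar"]]                                    # documents containing "caesar"
intersect(index[["brutus"]], index[["calpurnia"]])   # documents containing both terms
```

Looking up a term is then a single list access instead of a scan over every document, which is where the reduction in retrieval time comes from.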

Feature generation and selection
Importance of feature selection:
- Machine learning: it improves the efficiency of many machine learning algorithms.
- Overfitting problem: overfitting is the problem of fitting a model so closely to the training data that, when actual data is presented, it behaves well only to an extent and then starts to fail.
- Improved efficiency of training.

Feature selection methods for classification
- Filter method: a score is computed for each feature in a pre-processing step, and features are then selected according to the score.
- Wrapper method: the wrapper uses the learning algorithm as a black box to score subsets of features.
- Embedded method: feature selection is performed within the process of training the algorithm.
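
As a rough illustration of the filter method, the sketch below scores each term independently of any classifier and keeps the top k. The toy count matrix, the labels and the score itself (absolute difference between the two class means) are assumptions made for this example; other scores, such as chi-square or information gain, fit the same pattern.

```r
# Toy document-term counts (rows = documents, columns = terms); invented data.
dtm <- matrix(c(2, 0, 1, 1,
                3, 0, 1, 0,
                0, 2, 1, 1,
                0, 3, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("goal", "vote", "the", "today")))
labels <- c("sport", "sport", "politics", "politics")

# Filter score: absolute difference between the mean count of a term
# in the first class and its mean count in the second class.
filter_scores <- function(dtm, labels) {
  classes <- unique(labels)
  stopifnot(length(classes) == 2)
  abs(colMeans(dtm[labels == classes[1], , drop = FALSE]) -
      colMeans(dtm[labels == classes[2], , drop = FALSE]))
}

scores <- filter_scores(dtm, labels)

# Select the k highest-scoring features.
k <- 2
selected <- names(sort(scores, decreasing = TRUE))[seq_len(k)]
print(scores)
print(selected)
```

A wrapper method would instead train and evaluate a classifier on candidate subsets of columns, which is usually more expensive than scoring each column once.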

Parsing tasks
- Separate words from spaces and punctuation.
- Clean up:
  - Remove redundant words.
  - Remove words with no content.
- The cleaned-up list of words is referred to as tokens.
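
The parsing tasks can be sketched in a few lines of base R. The function name clean_tokens and the very short stop-word list are assumptions for illustration, and "redundant words" is read here as repeated words, which is only one possible interpretation of the slide.

```r
# Tiny illustrative stop-word list; practical lists are much longer.
stop_words <- c("a", "an", "the", "of", "and", "is", "in", "to")

clean_tokens <- function(text, stop_words) {
  # Separate words from punctuation and spaces.
  text <- tolower(gsub("[[:punct:]]+", " ", text))
  tok  <- strsplit(trimws(text), "\\s+")[[1]]
  tok  <- tok[nzchar(tok)]
  # Remove words with no content (stop words).
  tok  <- tok[!(tok %in% stop_words)]
  # Remove redundant (repeated) words, keeping the first occurrence.
  unique(tok)
}

tokens <- clean_tokens("The cause of the fire is unknown; the fire spread.",
                       stop_words)
print(tokens)
```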

Simple Algorithm for parsing
    # Initialize: Description holds the entire text, one record per element
    charcount <- nchar(Description)
    # Number of records of text
    Linecount <- length(Description)
    MaxWords <- 6                       # at most 6 terms are kept per record
    Num <- Linecount * MaxWords
    # Array to hold the location of spaces
    Position <- rep(0, Num)
    dim(Position) <- c(Linecount, MaxWords)
    # Array for Terms
    Terms <- rep("", Num)
    dim(Terms) <- c(Linecount, MaxWords)
    wordcount <- rep(0, Linecount)

Search for Spaces
    for (i in seq_len(Linecount)) {
      k <- 1
      for (j in seq_len(charcount[i])) {
        Char <- substring(Description[i], j, j)
        # is.all.white() exists only in S-PLUS; in R, compare with a space
        if (Char == " " && k <= MaxWords) { Position[i, k] <- j; k <- k + 1 }
      }
      # words found = spaces found + 1, capped at the array width
      wordcount[i] <- min(k, MaxWords)
    }

Get Words
    # Parse out terms (assumes words are separated by single spaces)
    for (i in seq_len(Linecount)) {
      for (j in seq_len(wordcount[i])) {
        # the first word starts at character 1, later words after the previous space
        start <- if (j == 1) 1 else Position[i, j - 1] + 1
        # the last word runs to the end of the record
        end <- if (Position[i, j] > 0) Position[i, j] - 1 else charcount[i]
        Terms[i, j] <- substring(Description[i], start, end)
      }
    }
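
For comparison, the same six-column table of terms can be built without explicit character loops by letting strsplit find the spaces. This is a sketch under assumptions: the three sample records are invented, the cap of six terms is taken from the array width used in the slides (named MaxWords here for convenience), and words are assumed to be separated by single spaces.

```r
# Sample records, invented for illustration.
Description <- c("water damage to kitchen",
                 "fire",
                 "wind blew off roof shingles and siding")
MaxWords <- 6   # at most six terms are kept per record

# Split every record at single spaces.
words <- strsplit(Description, " ", fixed = TRUE)

# Truncate or pad each record to exactly MaxWords terms,
# then arrange the result with one row per record.
Terms <- t(vapply(words, function(w) {
  w <- w[seq_len(min(length(w), MaxWords))]
  c(w, rep("", MaxWords - length(w)))
}, character(MaxWords)))

print(Terms)
```

The seventh word of the last record is dropped because of the fixed width; a list of character vectors, as returned by strsplit, avoids that limit when a rectangular array is not required.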

Conclusion
This presentation covered an overview of text mining, tokenization, dictionary creation, feature selection and parsing.

Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
