Textmining Introduction

Text mining: Introduction and data preparation

Overview of Text Mining
What is text mining? Text mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."

Need for Text Mining
The need for text mining is easiest to see through a practical example, the biotech industry:
- 80% of biological knowledge exists only in research papers (unstructured data).
- If a scientist manually reads 50 research papers per week and only 10% of them are useful, he or she ends up with only 5 useful papers per week.

Need for Text Mining
Online databases such as Medline, however, add more than 10,000 abstracts per month. Text mining dramatically increases the rate at which relevant data can be gathered from such a volume, which is why it is needed.

Challenges in Text Mining
- Information is in unstructured textual form.
- Textual databases are large, and almost all publications are also available in electronic form.
- The number of possible “dimensions” is very high (but sparse): all possible word and phrase types in the language.
- Relationships between concepts in text are complex and subtle.

Challenges in Text Mining
- Complex relationships between concepts, for example: “AOL merges with Time-Warner” and “Time-Warner is bought by AOL”.
- Word ambiguity and context sensitivity:
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
- Noisy data, for example spelling mistakes.

Text Mining Process
- Text preprocessing: syntactic/semantic text analysis
- Feature generation: bag of words
- Feature selection: simple counting, statistics

Text Mining Process
- Text/data mining: classification, clustering, associations
- Analyzing results
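The preprocessing, bag-of-words and simple-counting steps of the process can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the two documents and the names docs, tokens, vocab and bag are invented here and are not part of the original slides.

```r
# Hypothetical documents, invented purely for illustration.
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data")

# Text preprocessing: lower-case each document and split it on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Feature generation: the vocabulary is every distinct word, sorted.
vocab <- sort(unique(unlist(tokens)))

# Simple counting: one row per document, one column per word,
# each cell holding the number of times the word occurs.
bag <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(bag) <- vocab
print(bag)
```

The resulting matrix is what a classification or clustering step would typically take as input; word order is discarded, which is the defining simplification of the bag-of-words representation.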

Applications
The potential applications are countless:
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracks
- News stories classification
- Web search

Tokenization
Convert a sentence into a sequence of tokens, i.e. words.
Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters.
Tokenizing general English sentences is relatively straightforward:
- Use spaces as the boundaries.
- Use some heuristics to handle exceptions.
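Using spaces as boundaries can be sketched with base R's strsplit. The sentence below is a hypothetical example, not taken from the slides.

```r
# Hypothetical sentence, used only for illustration.
sentence <- "Text mining extracts knowledge from text"

# Use runs of whitespace as the token boundaries.
tokens <- strsplit(sentence, "\\s+")[[1]]
print(tokens)

# The sentence is now a short sequence of words instead of a long
# sequence of characters.
length(tokens)
nchar(sentence)
```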

Tokenization issues
- Separate possessive endings or abbreviated forms from preceding words:
  - Mary’s --> Mary ’s
  - Mary’s --> Mary is
  - Mary’s --> Mary has
- Separate punctuation marks and quotes from words:
  - Mary. --> Mary .
  - “new” --> “ new ”
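A rough sketch of these heuristics in base R follows. The function name tokenize is hypothetical, and the sketch assumes straight ASCII apostrophes and double quotes. It only splits off the 's ending; deciding whether it stands for a possessive, "is" or "has" needs context that a tokenizer alone does not have.

```r
# Hypothetical helper, not part of the original slides.
tokenize <- function(sentence) {
  # Put a space before a trailing 's so that it becomes its own token.
  s <- gsub("'s\\b", " 's", sentence, perl = TRUE)
  # Surround punctuation marks and double quotes with spaces.
  s <- gsub("([.,;:!?\"])", " \\1 ", s)
  # Split on whitespace and drop any empty strings.
  tok <- strsplit(s, "\\s+")[[1]]
  tok[tok != ""]
}

print(tokenize("Mary's book is \"new\"."))
```

A rule this simple also splits abbreviations such as "U.S." at every period, which is one reason real tokenizers need further exception handling.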

Dictionary creation
A dictionary is used to locate occurrences of a particular term in the documents. It reduces the retrieval time of an algorithm. The entries are stored as linked lists.

Example
Brutus --> 1 2 4 11 31 45 173 174
Caesar --> 1 2 4 5 6 16 57 132 . . .
Calpurnia --> 2 31 54 101
The example above lists the documents in which the terms Brutus, Caesar and Calpurnia occur.
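A dictionary of this kind can be sketched in base R as a named list that maps each term to the IDs of the documents containing it. The three documents are hypothetical, and an R list stands in for the linked lists mentioned in the slides.

```r
# Hypothetical documents; the position in the vector is the document ID.
docs <- c("brutus killed caesar",
          "caesar and calpurnia",
          "brutus and caesar")

# Tokenize each document on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Pair every term with the ID of the document it came from.
terms   <- unlist(tokens)
doc_ids <- rep(seq_along(tokens), lengths(tokens))

# Dictionary: each term maps to its sorted, de-duplicated document IDs.
index <- lapply(split(doc_ids, terms), function(ids) sort(unique(ids)))

print(index[["brutus"]])
print(index[["caesar"]])

# Documents containing both terms: intersect the two ID lists.
print(intersect(index[["brutus"]], index[["caesar"]]))
```

Looking up a term then costs one list access rather than a scan over every document, which is the retrieval-time saving the dictionary is meant to provide.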

Feature generation and selection
Importance of feature selection:
- Machine learning: feature selection improves the efficiency of many machine learning methods.
- Overfitting: overfitting is the problem of training a model so closely on the training data that, when actual data is presented, it behaves well only up to a point and then starts to fail.
- Improved efficiency of training.

Feature selection methods for classification
- Filter method: a score is computed for each feature in a pre-processing step, and features are then selected according to their scores.
- Wrapper method: the learning algorithm is used as a black box to score subsets of features.
- Embedded method: feature selection is performed within the process of training the algorithm.
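The filter method is the simplest of the three to illustrate. The sketch below assumes a tiny hypothetical term-count matrix with two classes and scores each term by the absolute difference between its mean count in the two classes. That score is an assumption chosen for brevity; the slides do not name a particular score.

```r
# Hypothetical document-term counts: rows are documents, columns are terms.
counts <- matrix(c(2, 1, 0, 1,
                   3, 1, 0, 0,
                   0, 1, 2, 0,
                   0, 1, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("goal", "the", "bank", "game")))
labels <- c("sport", "sport", "finance", "finance")

# Filter method: score every feature before, and independently of, training.
# Assumed score: absolute difference of the per-class mean counts.
scores <- abs(colMeans(counts[labels == "sport", , drop = FALSE]) -
              colMeans(counts[labels == "finance", , drop = FALSE]))

# Select the k highest-scoring features.
k <- 2
selected <- names(sort(scores, decreasing = TRUE))[seq_len(k)]
reduced  <- counts[, selected, drop = FALSE]
print(scores)
print(selected)
```

A wrapper method would instead retrain a classifier on candidate subsets of columns and keep the subset that performs best, which is more expensive but takes feature interactions into account.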

Parsing tasks
- Separate words from spaces and punctuation.
- Clean up:
  - Remove redundant words.
  - Remove words with no content.
- The cleaned-up list of words is referred to as tokens.
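These parsing tasks can be sketched compactly in base R. The record and the stop-word list are hypothetical, and "redundant words" is read here as repeated occurrences of the same word, which is an assumption about what the slide means.

```r
# Hypothetical record and stop-word list, both invented for illustration.
description <- "The driver hit the rear of the parked car, the car was damaged."
stop_words  <- c("the", "of", "was", "a", "an")

# Separate words from spaces and punctuation.
words <- strsplit(tolower(description), "[[:space:][:punct:]]+")[[1]]
words <- words[words != ""]

# Remove words with no content (stop words).
words <- words[!(words %in% stop_words)]

# Remove redundant words, read here as repeated occurrences.
tokens <- unique(words)
print(tokens)
```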

Simple Algorithm for parsing
# Initialize; Description holds the entire text, one record per element
charcount<-nchar(Description)
# number of records of text
Linecount<-length(Description)
Num<-Linecount*6
# Array to hold location of spaces (up to 6 per record)
Position<-rep(0,Num)
dim(Position)<-c(Linecount,6)

Simple Algorithm for parsing
# Array for Terms
Terms<-rep("",Num)
dim(Terms)<-c(Linecount,6)
wordcount<-rep(0,Linecount)

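Search for Spaces
for (i in 1:Linecount) {
  n<-charcount[i]
  k<-1
  for (j in seq_len(n)) {
    Char<-substring(Description[i],j,j)
    # record the position of each space; a plain comparison is used
    # because is.all.white is not available in base R
    if (Char==" ") {Position[i,k]<-j; k<-k+1}
  }
  # k-1 spaces were found, so the record holds k words
  wordcount[i]<-k
}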

Get Words
# parse out terms
for (i in 1:Linecount) {
  # first word: the whole record if it contains no space
  if (Position[i,1]==0) Terms[i,1]<-Description[i]
  else Terms[i,1]<-substring(Description[i],1,Position[i,1]-1)
  # remaining words lie between consecutive spaces;
  # the last word runs to the end of the record
  if (wordcount[i]>1) for (j in 2:wordcount[i]) {
    if (Position[i,j]>0) last<-Position[i,j]-1 else last<-charcount[i]
    Terms[i,j]<-substring(Description[i],Position[i,j-1]+1,last)
  }
}
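For comparison, base R's strsplit can produce the same kind of term matrix without scanning character by character. This is an alternative sketch, not the algorithm from the slides; the three records standing in for Description are hypothetical.

```r
# Hypothetical records standing in for the Description vector.
Description <- c("rear ended other car", "hit deer", "theft")

# Split every record on single spaces.
words <- strsplit(Description, " ", fixed = TRUE)
wordcount <- lengths(words)

# Store the terms in a matrix with one row per record, padded with "".
Terms <- matrix("", nrow = length(Description), ncol = max(wordcount))
for (i in seq_along(words)) {
  Terms[i, seq_len(wordcount[i])] <- words[[i]]
}
print(Terms)
```

Sizing the matrix from the longest record avoids the fixed limit of six columns used in the loop-based version.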

Conclusion
This presentation covered the following topics:
- Overview of text mining
- Tokenization
- Dictionary creation
- Feature selection
- Parsing

Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
