WACHEMO UNIVERSITY
COLLEGE OF ENGINEERING AND TECHNOLOGY
SCHOOL OF COMPUTING AND INFORMATICS
PROJECT TITLE: NETWORK DESIGN FOR VISION ACADEMY SCHOOL
BY
Name                      ID
1. FraholFeyera           R/ET-5526/07
2. FujetaHusen            R/ET-5528/07
3. TerufatHezkel          R/ET-5526/07
4. AdugnaAdmasu           R/ET-5420/07
5. ShagituSeid            R/ET-3299/07
6. MihiretMerkabu         R/ET-5597/07
Submitted to: Mr.Girma
Wednesday, January 3, 2018
Contents
1. INTRODUCTION
1.1 Text classification process
1.2 Documents Collection
1.3 Pre-Processing
1.4 Feature Selection
1.5 Automatic Classification
1.6 Performance Evaluations
2. Architecture of Text Classification with Machine Learning
2.1 Supervised Learning
2.2 Starting to sketch the Architecture
3.0 Document classification approach
3.1 Manual document classification
3.2 Automatic document classification
5 Conclusion
1. INTRODUCTION
Text mining is gaining importance because of the growing number of electronic documents available from a variety of sources, which include
unstructured and semi-structured information. The main goal of text mining is to enable users to extract information from textual resources; it
deals with operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization. Natural Language
Processing (NLP), Data Mining, and Machine Learning techniques work together to automatically classify and discover patterns in different types of
documents [1].
Text classification (TC) is an important part of text mining. One approach is to build automatic TC systems manually by means of
knowledge-engineering techniques, i.e. by manually defining a set of logical rules that encode expert knowledge on how to classify documents under
the given set of categories. A typical example is automatically labelling each incoming news story with a topic such as “sports”, “politics”, or
“art”. A data mining classification task starts with a training set D = (d1, …, dn) of documents that are already labelled with a class C1, C2
(e.g. sport, politics). The task is then to determine a classification model that is able to assign the correct class to a new document d of the
domain. Text classification has two flavors: single-label and multi-label. A single-label document belongs to only one class, whereas a
multi-label document may belong to more than one class. In this paper we consider only single-label document classification.
Text classification is the task of classifying documents into predefined classes.
Text Classification is also called:
• Text Categorization
• Document Classification
• Document Categorization
• Two approaches:
▪ manual classification
▪ automatic classification
1.1 Text classification process
The stages of TC are discussed in the following points.
Fig. 1 Document Classification Process (stages shown: Documents, Preprocessing, Indexing, Feature Selection, Classification Algorithms, Performance measure)
1.2 Documents Collection
This is the first step of the classification process, in which we collect documents of different types (formats), such as .html, .pdf, .doc and web content.
1.3 Pre-Processing
Pre-processing presents the text documents in a clear word format. The documents prepared for the next step in text classification are represented
by a great number of features. The steps commonly taken are:
Tokenization: A document is treated as a string and then partitioned into a list of tokens.
Removing stop words: Stop words such as “the”, “a” and “and” occur frequently but carry little meaning, so these insignificant words are removed.
Stemming: A stemming algorithm converts different word forms into a similar canonical form. This step conflates tokens to their root form,
e.g. connection to connect, computing to compute.
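The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the two suffix rules are hypothetical and chosen just to reproduce the examples in the text. A real stemmer, such as Porter's algorithm, uses many more rules and will not always produce the same roots.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

# Hypothetical suffix rules, just enough for the examples in the text.
SUFFIX_RULES = [("ion", ""), ("ing", "e")]


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Apply the first matching suffix rule; leave very short stems untouched."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The connection and the computing"))  # ['connect', 'compute']
```

The rules are deliberately naive: the "ing" rule would turn "running" into "runne", which is why production systems rely on a tested stemming library instead.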
Indexing
Document representation is a pre-processing technique used to reduce the complexity of the documents and make them easier to handle: each document
has to be transformed from its full-text version into a document vector. Perhaps the most commonly used document representation is the vector space
model (SMART) [55]. In the vector space model, documents are represented by vectors of words. Usually, a collection of documents is represented by
a word-by-document matrix. The bag-of-words (BoW)/VSM representation scheme has its own limitations, among them the high dimensionality of the
representation, the loss of correlation with adjacent words and the loss of the semantic relationships that exist among the terms in a document. To
overcome these problems, term weighting methods are used to assign appropriate weights to the terms, as shown in the following matrix:
      T1    T2    …    Tt    Ci
D1    w11   w21   …    wt1   c1
D2    w12   w22   …    wt2   c2
:     :     :          :     :
Dn    w1n   w2n   …    wtn   cn
Each entry represents the occurrence of a word in a document: wtn is the weight of word t in document n. Since not every word appears in every
document, most entries are zero. There are several ways of determining the weights, such as Boolean weighting, word frequency weighting, tf-idf and
entropy. The major drawback of this model is that it results in a huge sparse matrix, which raises the problem of high dimensionality. Various other
methods are presented in [56]:
1) An ontology representation of a document, which keeps the semantic relationships between the terms in the document.
2) A sequence of symbols (bytes, characters or words) called N-grams, extracted from a long string in a document. It is, however, very difficult to
decide the number of grams to consider for an effective document representation.
3) Multiword terms as vector components. This method requires sophisticated automatic term extraction algorithms to extract the terms from a
document.
4) Latent Semantic Indexing (LSI), which preserves the most representative features of a document rather than the most discriminating ones.
5) Locality Preserving Indexing (LPI), which addresses this limitation of LSI by discovering the local semantic structure of a document, but is not
efficient in time and memory.
6) A representation for modelling web documents, in which HTML tags are used to build the document representation.
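To make the term weighting mentioned in this section concrete, here is a minimal sketch of tf-idf weighting that builds the document-by-term weight matrix. It assumes one common variant (raw term count multiplied by log(N / document frequency)); other variants exist, and the function name and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf_matrix(documents):
    """Return (vocabulary, weights); weights[n][t] is the weight of term t in document n."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        # tf (raw count) times idf (log of N over document frequency).
        row = [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        weights.append(row)
    return vocabulary, weights


vocabulary, weights = tf_idf_matrix(["goal match goal", "vote match"])
for term, column in zip(vocabulary, zip(*weights)):
    print(term, [round(w, 3) for w in column])
```

A term that occurs in every document ("match" here) receives weight zero, while a term concentrated in one document is weighted up. Most entries are zero, which illustrates the sparsity discussed above.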
1.4 Feature Selection
After pre-processing and indexing, the next important step of text classification is feature selection [2], which constructs the vector space and
improves the scalability, efficiency and accuracy of a text classifier. The main idea of Feature Selection (FS) is to select a subset of the features
of the original documents. FS is performed by keeping the words with the highest scores according to a predetermined measure of word importance,
because a major problem in text classification is the high dimensionality of the feature space. Many feature evaluation metrics have been proposed,
notable among which are information gain (IG), term frequency, Chi-square, expected cross entropy, Odds Ratio, the weight of evidence of text, mutual
information and the Gini index. Other methods have also been presented, such as the sampling method of [58], which randomly samples features and
then builds the matrix for classification. To address the problem of high dimensionality, [59] presents a new FS method that uses genetic
algorithm (GA) optimization.
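As an illustration of keeping the highest-scoring words, the sketch below scores each term with the Chi-square statistic for one target class and keeps the top k. It assumes documents are given as sets of tokens and uses the usual 2x2 contingency counts; the function names and the toy data are hypothetical, not taken from the cited works.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic between the presence of `term` and class `target`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, document in the class
        elif has_term:
            b += 1  # term present, document outside the class
        elif in_class:
            c += 1  # term absent, document in the class
        else:
            d += 1  # term absent, document outside the class
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs, labels, target, k):
    """Keep the k terms with the highest Chi-square score for `target`."""
    vocabulary = sorted({t for tokens in docs for t in tokens})
    scored = [(chi_square(docs, labels, t, target), t) for t in vocabulary]
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "match"}, {"vote", "law"}]
labels = ["sport", "sport", "politics", "politics"]
print(select_features(docs, labels, "sport", 2))  # ['goal', 'vote']
```

Note that "vote" scores as high as "goal": the statistic measures dependence between term and class, so a term that reliably signals the absence of the class is also informative.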
1.5 Automatic Classification
The automatic classification of documents into predefined categories has attracted active attention. Documents can be classified in three ways: by
unsupervised, supervised and semi-supervised methods. Over the last few years, the task of automatic text classification has been studied
extensively and rapid progress has been made in this area, including machine learning approaches such as the Bayesian classifier, Decision Tree, etc.
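As one example of the Bayesian classifier mentioned above, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. The class name, the smoothing choice and the toy training data are assumptions made for illustration; the source does not specify a particular variant.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        """Count documents per class and term occurrences per class."""
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        """Return the class with the highest log posterior for the tokens."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.term_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the smoothed log likelihood of each known token.
            score = math.log(self.class_counts[label] / total_docs)
            for token in tokens:
                if token in self.vocabulary:
                    score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["goal", "match", "team"], ["match", "goal"], ["vote", "law"], ["election", "vote"]]
train_labels = ["sport", "sport", "politics", "politics"]
model = NaiveBayesClassifier().fit(train_docs, train_labels)
print(model.predict(["goal", "team"]))  # sport
```

Tokens that never occurred in training are ignored at prediction time, which is one simple way to handle unseen words.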
1.6 Performance Evaluations
This is the last stage of text classification. The evaluation of text classifiers is typically conducted experimentally rather than analytically.
The experimental evaluation of classifiers usually tries to evaluate the effectiveness of a classifier, i.e. its capability of taking the right
categorization decisions, rather than concentrating on issues of efficiency. An important issue in text categorization is how to measure the
performance of the classifiers. Many measures have been used, such as precision and recall [54], fallout, error and accuracy; they are given below.
Precision wrt ci (Pri) is defined as the probability that, if a random document dx is classified under ci, this decision is correct. Analogously,
Recall wrt ci (Rei) is defined as the conditional probability that, if a random document dx ought to be classified under ci, this decision is taken.
TPi - The number of documents correctly assigned to this category
FPi - The number of documents incorrectly assigned to this category
FNi - The number of documents incorrectly rejected from this category
TNi - The number of documents correctly rejected from this category
Precision: Pri = TPi / (TPi + FPi)
Recall: Rei = TPi / (TPi + FNi)
Fallout = FPi / (FPi + TNi)
Error = (FPi + FNi) / (TPi + FPi + FNi + TNi)
Accuracy = (TPi + TNi) / (TPi + FPi + FNi + TNi)
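The measures above can be computed directly from the four counts for a category. In this sketch, tp and tn count documents correctly assigned to and correctly rejected from the category, fp counts documents incorrectly assigned to it, and fn counts documents incorrectly rejected from it. The function names and the example counts are illustrative assumptions.

```python
def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, fn, tn):
    """Compute per-category effectiveness measures from the contingency counts."""
    total = tp + fp + fn + tn
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fallout": ratio(fp, fp + tn),
        "error": ratio(fp + fn, total),
        "accuracy": ratio(tp + tn, total),
    }


# Hypothetical counts for one category out of 100 test documents.
for name, value in evaluate(tp=40, fp=10, fn=20, tn=30).items():
    print(f"{name}: {value:.3f}")
```

Error and accuracy always sum to one, which is a quick sanity check on the counts.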
Relevant technologies
Text Clustering – Create clusters of documents without any external information.
Information Retrieval (IR) – Retrieve a set of documents relevant to a query.
Information Filtering – Filter out irrelevant documents through interactions.
Information Extraction (IE) – Extract fragments of information, e.g., person names, dates, and places, in documents.
2. Architecture of Text Classification with Machine Learning
One of the most common tasks in Machine Learning is text classification, which simply means teaching a machine how to read and interpret a text and
predict what kind of text it is.
The purpose of this section is to describe an architecture for supervised text classification that is simple and generic enough to be reused. The
interesting point of this architecture is that you can use it as a basic, initial model for many classification tasks.
2.1 Supervised Learning
Supervised learning means that you first have to train your model with an existing labeled dataset. It is just like teaching a kid how to
differentiate between a car and a motorcycle: you have to show the differences, the similarities and so on. Unsupervised learning, in contrast, is
about learning and predicting without a pre-labeled dataset.
2.2 Starting to sketch the Architecture
With the dataset in hand, we start to think about what our architecture should look like to achieve the given goal. We can summarize the steps as:
1. Cleaning the dataset
2. Partitioning the dataset
3. Feature Engineering
4. Choosing the right Algorithms, Mathematical Models and Methods
5. Wrapping everything up
1. Cleaning the dataset
Cleaning the dataset is a crucial initial step in Machine Learning. Many toy datasets don’t need to be cleaned because they are already clean,
peer-reviewed and published in a form you can use directly with the learning algorithms.
The problem is that the real world is full of painful and noisy datasets.
If there’s one thing I learned while working with Machine Learning, it is that there’s no such thing as a shiny and perfect dataset in the real
world, so we have to deal with this beforehand. Situations with many empty fields, wrong and non-homogeneous formats or broken characters are very
common. Techniques for handling them are not covered here.
2. Partitioning the Dataset
We always need to split the dataset into at least two partitions: the training dataset and the test/validation dataset. Why?
Suppose we feed the learning algorithm a training pair (X, Y), where X is a text and Y is its already known classification; the algorithm will
learn it. If we then evaluated the model on the same pairs, it could score well simply by having memorized them, so a held-out partition is needed
to estimate how it behaves on unseen text.
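A minimal sketch of such a partition is shown below. The 80/20 ratio, the fixed seed and the function name are assumptions chosen for illustration; the text only requires that there be at least two partitions.

```python
import random


def split_dataset(pairs, test_ratio=0.2, seed=0):
    """Shuffle (X, Y) pairs and split them into training and test partitions."""
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled dataset: (text, label) pairs.
pairs = [(f"text {i}", i % 2) for i in range(10)]
train, test = split_dataset(pairs)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids a partition that accidentally contains only one class when the original data is ordered by label.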
3.0 Document classification approach:
3.1 Manual document classification: users interpret the meaning of text, identify the relationships between concepts and categorize documents.
Or (rule-based approach)
▪ write a set of rules that classify documents
Machine learning-based approach
▪ using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training
data.
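The rule-based approach described above can be sketched as a set of hand-written keyword rules. The categories, keywords and function name here are hypothetical, and real rule sets are usually far richer than keyword matching.

```python
# Hypothetical hand-written rules: category -> keywords that trigger it.
RULES = {
    "sports": {"goal", "match", "team"},
    "politics": {"vote", "election", "parliament"},
}


def classify_by_rules(text, rules=RULES, default="unknown"):
    """Return the category whose keywords match the most tokens in the text."""
    tokens = text.lower().split()
    hits = {category: sum(t in keywords for t in tokens)
            for category, keywords in rules.items()}
    best = max(hits, key=hits.get)
    # Fall back to the default when no rule fires at all.
    return best if hits[best] > 0 else default


print(classify_by_rules("The team scored a late goal"))  # sports
print(classify_by_rules("Stock prices fell"))            # unknown
```

The second call shows the coverage problem of hand-written rules: any wording the rule author did not anticipate falls through to the default.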
Comparison of two approaches (1)
Rule-based classification
Pros:
– very accurate when rules are written by experts
– classification criteria can be easily controlled when the number of rules is small
Cons:
– sometimes, rules conflict with each other
– maintenance of rules becomes more difficult as the number of rules increases
– the rules have to be reconstructed when the target domain changes
– low coverage because of the wide variety of expressions
Comparison of two approaches (2)
3.2 Automatic document classification: applies machine learning or other technologies to automatically classify documents. This results in faster,
more scalable and more objective classification.
Or (machine learning-based approach)
Pros:
– domain independent
– high predictive performance
Cons:
– not accountable for classification results
– training data required
In general there are three approaches:
1 Supervised method: The classifier is trained on a manually tagged set of documents. It can then predict the category of new documents and can
also provide a confidence indicator.
2 Unsupervised method: Documents are mathematically organized based on similar words and phrases.
3 Rules-based method: This method consists of leveraging the natural language understanding capability of a system and writing linguistic rules
that instruct the system to act like a person in classifying a document. This means using the semantically relevant elements of a text to drive
the automatic categorization.
5 Conclusion
In this paper we have presented a thorough evaluation of different approaches to document classification. We have confirmed recent results about
the superiority of Support Vector Machines on a new test set. Furthermore, we have shown that feature subset selection or dimensionality reduction
is essential for document classification, not only to make learning and classification tractable but also to avoid overfitting. It is important
to see that this also holds for Support Vector Machines. Last but not least, we have shown that linguistic preprocessing does not necessarily
improve classification quality. We are convinced that in the long run linguistic preprocessing such as morphological analysis pays off for document
classification as well as for document retrieval. However, this linguistic preprocessing probably has to be more sophisticated than our simple
morphological analysis. A big advantage of linguistic preprocessing compared to n-gram features is that the integration of thesauri, concept nets
and domain models becomes possible. Besides linguistic sophistication, statistics can also help to produce good features. For the future we plan
to evaluate different methods for finding topic-relevant collocations and multi-word phrases.
12 | P a g e

More Related Content

PDF
Paper id 25201435
PDF
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
PDF
DOMAIN BASED CHUNKING
PDF
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
PDF
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
PDF
Feature selection, optimization and clustering strategies of text documents
PDF
An in-depth exploration of Bangla blog post classification
PDF
An exhaustive font and size invariant classification scheme for ocr of devana...
Paper id 25201435
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
DOMAIN BASED CHUNKING
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Feature selection, optimization and clustering strategies of text documents
An in-depth exploration of Bangla blog post classification
An exhaustive font and size invariant classification scheme for ocr of devana...

What's hot (15)

DOCX
Mi0034 –database management systems
DOC
Lecture Notes in Computer Science:
DOC
Statistical Named Entity Recognition for Hungarian – analysis ...
PDF
A systematic study of text mining techniques
PDF
A Dialogue System for Telugu, a Resource-Poor Language
PDF
Multi label classification of
PDF
Legal Document
PDF
Suitability of naïve bayesian methods for paragraph level text classification...
PDF
Summarization using ntc approach based on keyword extraction for discussion f...
PDF
AUTOMATED SHORT ANSWER GRADER USING FRIENDSHIP GRAPHS
DOC
4th sem
PDF
Profile Analysis of Users in Data Analytics Domain
PDF
A NOVEL FEATURE SET FOR RECOGNITION OF SIMILAR SHAPED HANDWRITTEN HINDI CHARA...
PDF
03 fauzi indonesian 9456 11nov17 edit septian
PDF
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Mi0034 –database management systems
Lecture Notes in Computer Science:
Statistical Named Entity Recognition for Hungarian – analysis ...
A systematic study of text mining techniques
A Dialogue System for Telugu, a Resource-Poor Language
Multi label classification of
Legal Document
Suitability of naïve bayesian methods for paragraph level text classification...
Summarization using ntc approach based on keyword extraction for discussion f...
AUTOMATED SHORT ANSWER GRADER USING FRIENDSHIP GRAPHS
4th sem
Profile Analysis of Users in Data Analytics Domain
A NOVEL FEATURE SET FOR RECOGNITION OF SIMILAR SHAPED HANDWRITTEN HINDI CHARA...
03 fauzi indonesian 9456 11nov17 edit septian
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
Ad

Similar to Group4 doc (20)

DOCX
Text Categorizationof Multi-Label Documents For Text Mining
PDF
IRJET- Automated Document Summarization and Classification using Deep Lear...
PDF
Text Document categorization using support vector machine
PDF
Survey on Text Classification
DOC
Team G
PPTX
Seminar dm
PDF
Text Document Classification System
PDF
Mapping Subsets of Scholarly Information
PDF
IRE Semantic Annotation of Documents
PDF
Context Driven Technique for Document Classification
PDF
Text Classification, Sentiment Analysis, and Opinion Mining
PDF
Machine learning in automated text categorization
PDF
Automatic Text Classification Of News Blog using Machine Learning
PDF
Text classification supervised algorithms with term frequency inverse documen...
PDF
Text categorization with Lucene and Solr
DOC
INTRODUCTION
DOC
INTRODUCTION
PPTX
Text categorization
DOC
Indian Language Text Representation and Categorization Using Supervised Learn...
PDF
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
Text Categorizationof Multi-Label Documents For Text Mining
IRJET- Automated Document Summarization and Classification using Deep Lear...
Text Document categorization using support vector machine
Survey on Text Classification
Team G
Seminar dm
Text Document Classification System
Mapping Subsets of Scholarly Information
IRE Semantic Annotation of Documents
Context Driven Technique for Document Classification
Text Classification, Sentiment Analysis, and Opinion Mining
Machine learning in automated text categorization
Automatic Text Classification Of News Blog using Machine Learning
Text classification supervised algorithms with term frequency inverse documen...
Text categorization with Lucene and Solr
INTRODUCTION
INTRODUCTION
Text categorization
Indian Language Text Representation and Categorization Using Supervised Learn...
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
Ad

Recently uploaded (20)

PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Sports Quiz easy sports quiz sports quiz
PDF
Computing-Curriculum for Schools in Ghana
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
01-Introduction-to-Information-Management.pdf
PDF
Basic Mud Logging Guide for educational purpose
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
Pharma ospi slides which help in ospi learning
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Classroom Observation Tools for Teachers
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PPTX
Lesson notes of climatology university.
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PPTX
GDM (1) (1).pptx small presentation for students
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Microbial disease of the cardiovascular and lymphatic systems
Supply Chain Operations Speaking Notes -ICLT Program
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Sports Quiz easy sports quiz sports quiz
Computing-Curriculum for Schools in Ghana
VCE English Exam - Section C Student Revision Booklet
01-Introduction-to-Information-Management.pdf
Basic Mud Logging Guide for educational purpose
O7-L3 Supply Chain Operations - ICLT Program
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Pharma ospi slides which help in ospi learning
Abdominal Access Techniques with Prof. Dr. R K Mishra
Classroom Observation Tools for Teachers
102 student loan defaulters named and shamed – Is someone you know on the list?
Lesson notes of climatology university.
TR - Agricultural Crops Production NC III.pdf
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
GDM (1) (1).pptx small presentation for students

Group4 doc

  • 1. WACHEMO UNIVERSITY COLLAGE OF ENGINEERING AND TECHNOLOGY SCHOOL OF COMPUTING AND INFORMATICS PROJECT TITLE: NETWORK DESIGN FOR VISION ACADEMY SCHOOL BY Name ID 1. FraholFeyera…………………………………..........R/ET-5526/07 2. FujetaHusen………………………………………. R/ET-5528/07 3. TerufatHezkel………………………………………. R/ET-5526/07 4. AdugnaAdmasu…………………………….………. R/ET-5420/07 5. ShagituSeid………………………………………….R/ET-3299/07 6. MihiretMerkabu……………………………………. R/ET-5597/07
  • 2. ii | P a g e Submitted to: Mr.Girma Wednesday, January 3, 2018
  • 3. I | P a g e Contents 1. INTRODUCTION .......................................................................................................................................................................................................................1 1.1 Textclassification process...........................................................................................................................................................................................................2 1.2 Documents Collection...........................................................................................................................................................................................................3 1.3 Pre-Processing.....................................................................................................................................................................................................................3 1.4 Feature Selection..................................................................................................................................................................................................................5 1.5 Automatic Classification .............................................................................................................................................................................................................5 1.6 Performance Evaluations.............................................................................................................................................................................................................6 2. 
Architecture of Text Classification with Machine Learning ...............................................................................................................................................................7 2.1 Supervised Learning................................................................................................................................................................................................................8 2.2 Starting to sketch the Architecture.............................................................................................................................................................................................8 3.0 Document classification approach:..........................................................................................................................................................................................9 3.1 manual document classifications:..............................................................................................................................................................................................9 3.2 Automatic document classification:.........................................................................................................................................................................................10 5 Conclusion.................................................................................................................................................................................................................................11
  • 4. 1 | P a g e 1. INTRODUCTION The text mining studies are gaining more importance recently because of the availability of the increasing number of the electronic documents from a variety of sources. Which include unstructured and semi structured information. The main goal of text mining is to enable users to extract information from textual resources and deals with the operations like, retrieval, classification (supervised, unsupervised and semi supervised) and summarization Natural Language Processing (NLP), Data Mining, and Machine Learning techniques work together to automatically classify and discover patterns from the different types of the documents [1]. Text classification (TC) is an important part of text mining, looked to be that of manually building automatic TC systems by means of knowledge- engineering techniques, i.e. manually defining a set of logical rules that convert expert knowledge on how to classify documents under the given set of categories. For example would be to automatically label each incoming news story with a topic like “sports”, “politics”, or “art”. a data mining classification task starts with a training set D = (d1….. dn) of documents that are already labelled with a class C1,C2 (e.g. sport, politics). The task is then to determine a classification model which is able to assign the correct class to a new document d of the domain Text classification has two flavors as single label and multi-label .single label document is belongs to only one class and multi label document may be belong to more than one classes In this paper we are consider only single label document classification. Text Classification is the task: to classify documents into predefined classes
  • 5. 2 | P a g e Text Classification is also called Text Categorization Document Classification Document Categorization • Two approaches:-  manual classification and  automatic classification 1.1 Text classification process The stages of TC are discussing as following points. Performance measure Documents
  • 6. 3 | P a g e Fig. 1 Document Classification Process 1.2 Documents Collection This is first step of classification process in which we are collecting the different types (format) of document like html, .pdf, .doc, web content etc. 1.3 Pre-Processing The first step of pre-processing which is used to presents the text documents into clear word format. The documents prepared for next step in text classification are represented by a great amount of features. Commonly the steps taken are: Tokenization: A document is treated as a string, and then partitioned into a list of tokens. Removing stop words: Stop words such as “the”, “a”, “and”, etc are frequently occurring, so the insignificant words need to be removed. Stemming word: Applying the stemming algorithm that converts different word form into similar canonical form. This step is the process of conflating tokens to their root form, e.g. connection to connect, computing to compute Preprocessing Indexing Classification Algorithms Feature Selection
Indexing
Document representation is one of the pre-processing techniques used to reduce the complexity of documents and make them easier to handle: each document has to be transformed from its full-text version into a document vector. Perhaps the most commonly used document representation is the vector space model (SMART) [55], in which documents are represented by vectors of words. Usually, a collection of documents is represented by a word-by-document matrix. The BoW/VSM representation scheme has its own limitations, among them the high dimensionality of the representation, the loss of correlation with adjacent words, and the loss of the semantic relationships that exist among the terms in a document. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms, as shown in the following matrix:

      T1    T2    …    Tt    Ci
D1    w11   w21   …    wt1   c1
D2    w12   w22   …    wt2   c2
:     :     :          :     :
Dn    w1n   w2n   …    wtn   cn

Each entry represents the occurrence of a word in a document: wij is the weight of word i in document j, so wtn is the weight of word t in document n. Since not every word normally appears in each document, many of the entries are zero. There are several ways of determining the weight w11, such as Boolean weighting, word frequency weighting, tf-idf, and entropy. The major drawback of this model is that it results in a huge sparse matrix, which raises the problem of high dimensionality.
Various other methods are presented in [56]:
1) An ontology representation of a document, which keeps the semantic relationships between the terms in the document.
2) N-grams: sequences of symbols (bytes, characters, or words) extracted from a long string in a document. It is very difficult, however, to decide the number of grams to consider for an effective document representation.
3) Multiword terms as vector components. This method requires sophisticated automatic term extraction algorithms to extract the terms from a document.
4) Latent Semantic Indexing (LSI), which preserves the most representative features of a document rather than the most discriminating ones.
5) Locality Preserving Indexing (LPI), which overcomes this problem by discovering the local semantic structure of a document, but is not efficient in time and memory.
6) A new representation for modeling web documents, in which HTML tags are used to build the web document representation.
1.4 Feature Selection
After pre-processing and indexing, the next important step of text classification is feature selection [2], which constructs the vector space and improves the scalability, efficiency, and accuracy of a text classifier. The main idea of feature selection (FS) is to select a subset of features from the original documents. FS is performed by keeping the words with the highest scores according to a predetermined measure of word importance, because a major problem for text classification is the high dimensionality of the feature space. Many feature evaluation metrics have been proposed, notable among which are information gain (IG), term frequency, chi-square, expected cross entropy, odds ratio, the weight of evidence of text, mutual information, and the Gini index. Other methods have also been presented, such as the sampling method of [58], which randomly samples features and then builds a matrix for classification.
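As a rough illustration of the indexing and feature selection steps, the sketch below builds a tf-idf weight matrix and then ranks terms for one category with the chi-square metric. It is a sketch under stated assumptions, not the procedure of [2] or [55]: the tf-idf variant used is tf × log(N / df), the chi-square score comes from the usual 2×2 term/category contingency table, and the function names and the four toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocab, rows); rows[n][t] is the tf-idf weight of term t in document n."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def chi_square(docs, labels, term, category):
    """Chi-square score of a term for a category from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in the category
        elif has_term:
            b += 1  # term present, document outside the category
        elif in_category:
            c += 1  # term absent, document in the category
        else:
            d += 1  # term absent, document outside the category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denominator == 0 else n * (a * d - c * b) ** 2 / denominator


def select_features(docs, labels, category, k):
    """Keep the k terms with the highest chi-square score for the category."""
    vocab = sorted({term for doc in docs for term in doc})
    ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t, category), reverse=True)
    return ranked[:k]


# Toy, already pre-processed documents (invented for illustration).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["vote", "party", "election"],
    ["election", "vote", "team"],
]
labels = ["sport", "sport", "politics", "politics"]

vocab, weights = tf_idf_matrix(docs)
print(select_features(docs, labels, "sport", 3))
```

Most entries of `weights` are zero, which shows on a small scale the sparsity problem described above; keeping only the top-scoring terms is one way to shrink the matrix.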
To address the problem of high dimensionality, [59] presents a new FS method that uses genetic algorithm (GA) optimization.
1.5 Automatic Classification
The automatic classification of documents into predefined categories has received active attention. Documents can be classified in three ways: by unsupervised, supervised, and semi-supervised methods. Over the last few years, the task of automatic text classification has been studied extensively and rapid progress has been made in this area, including machine learning approaches such as the Bayesian classifier, the decision tree, etc.
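To make the Bayesian classifier mentioned above concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The class name, its methods, and the toy training data are assumptions made for this illustration; they are not taken from the report or from any particular library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over tokenized documents (illustrative sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.term_counts = defaultdict(Counter)   # term occurrences per class
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {term for doc in docs for term in doc}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior of the class plus the log likelihood of each known term.
            score = math.log(n_docs / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for term in doc:
                if term not in self.vocab:
                    continue  # terms never seen in training are ignored
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (invented for illustration).
train_docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["vote", "party", "election"],
    ["election", "vote", "team"],
]
train_labels = ["sport", "sport", "politics", "politics"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["goal", "win"]))    # sport
print(model.predict(["vote", "party"]))  # politics
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen term from forcing a class probability to zero.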
1.6 Performance Evaluations
This is the last stage of text classification, in which the evaluation of text classifiers is typically conducted experimentally rather than analytically. Rather than concentrating on issues of efficiency, the experimental evaluation of a classifier usually tries to measure its effectiveness, i.e. its capability of taking the right categorization decisions. An important issue in text categorization is how to measure the performance of the classifiers. Many measures have been used, such as precision and recall [54], fallout, error, and accuracy; they are given below.
Precision wrt ci (Pri) is defined as the conditional probability that, if a random document dx is classified under ci, this decision is correct. Analogously, recall wrt ci (Rei) is defined as the conditional probability that, if a random document dx ought to be classified under ci, this decision is taken.
TPi - The number of documents correctly assigned to this category.
FPi - The number of documents incorrectly assigned to this category.
FNi - The number of documents incorrectly rejected from this category.
TNi - The number of documents correctly rejected from this category.
$Pr_i = TP_i / (TP_i + FP_i)$
$Re_i = TP_i / (TP_i + FN_i)$
$Fallout_i = FP_i / (FP_i + TN_i)$
$Error_i = (FP_i + FN_i) / (TP_i + FP_i + FN_i + TN_i)$
$Accuracy_i = (TP_i + TN_i) / (TP_i + FP_i + FN_i + TN_i)$
Relevant technologies
Text Clustering: Create clusters of documents without any external information.
Information Retrieval (IR): Retrieve a set of documents relevant to a query.
Information Filtering: Filter out irrelevant documents through interactions.
Information Extraction (IE): Extract fragments of information, e.g. person names, dates, and places, in documents.
2. Architecture of Text Classification with Machine Learning
One of the most common tasks in machine learning is text classification, which is simply teaching your machine how to read and interpret a text and predict what kind of text it is. The purpose of this essay is to describe an architecture for supervised text classification that is simple and generic enough to serve as a basic, initial model for many classification tasks.
2.1 Supervised Learning
Supervised learning is when you first have to train your model with an already existing labeled dataset, just like teaching a kid how to differentiate between a car and a motorcycle: you have to expose their differences, similarities, and so on. Unsupervised learning, by contrast, is about learning and predicting without a pre-labeled dataset.
2.2 Starting to sketch the Architecture
With the dataset in hand, we start to think about what our architecture should look like to achieve the given goal. We can summarize the steps as:
1. Cleaning the dataset
2. Partitioning the dataset
3. Feature engineering
4. Choosing the right algorithms, mathematical models, and methods
5. Wrapping everything up
1. Cleaning the dataset
Cleaning the dataset is a crucial initial step in machine learning. Many toy datasets don't need to be cleaned, because they are already clean, peer-reviewed, and published in a way that lets you use them directly to work on the learning algorithms. The problem is that the real world is full of painful and noisy datasets. If there's one thing I learned while working with machine learning, it is that there's no such thing as a shiny and perfect dataset in the real world, so we have to deal with this beforehand. Situations with many empty fields, wrong and non-homogeneous formats, and broken characters are very common. I won't talk about such techniques now, but I will write something about them in another post.
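Although the cleaning techniques themselves are left for another post, the kinds of problems named above (empty fields, non-homogeneous formats, broken characters) can be illustrated with a short sketch. The record layout with "text" and "label" keys and the function names are assumptions made for this example only; real datasets will need rules of their own.

```python
import unicodedata


def clean_record(record):
    """Return a cleaned (text, label) pair, or None if the record is unusable."""
    text, label = record.get("text"), record.get("label")
    if not text or not label:
        return None  # empty or missing field
    # Unify equivalent Unicode forms (e.g. non-breaking spaces become plain spaces).
    text = unicodedata.normalize("NFKC", str(text))
    # U+FFFD is the replacement character left behind by failed decoding.
    text = text.replace("\ufffd", " ")
    text = " ".join(text.split())        # collapse runs of whitespace
    label = str(label).strip().lower()   # make label formats homogeneous
    if not text or not label:
        return None
    return text, label


def clean_dataset(records):
    """Clean every record and drop the ones that cannot be used."""
    cleaned = (clean_record(r) for r in records)
    return [c for c in cleaned if c is not None]


raw = [
    {"text": "  Great\u00a0match \ufffd today ", "label": " Sport "},
    {"text": "", "label": "sport"},
    {"text": "Vote now", "label": None},
]
print(clean_dataset(raw))  # [('Great match today', 'sport')]
```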
2. Partitioning the Dataset
We always need to split the dataset into at least two partitions: the training dataset and the test/validation dataset. Why? Suppose we feed the learning algorithm training data X for which the output Y is already known (because it is a training pair (X, Y): for a given text X, Y is its classification). The algorithm will learn that pair, so evaluating it on the same data would overstate its performance; a separate partition is needed to check how it behaves on texts it has not seen.
3.0 Document classification approach
3.1 Manual document classification (or rule-based approach): users interpret the meaning of text, identify the relationships between concepts, and categorize documents.
▪ Rule-based approach: write a set of rules that classify documents.
▪ Machine learning-based approach: using a set of sample documents that are already classified into the classes (training data), automatically create classifiers based on the training data.
Comparison of Two Approaches (1)
Rule-based classification
Pros:
▪ Very accurate when rules are written by experts.
▪ Classification criteria can be easily controlled when the number of rules is small.
Cons:
▪ Sometimes rules conflict with each other.
▪ Maintenance of rules becomes more difficult as the number of rules increases.
▪ The rules have to be reconstructed when the target domain changes.
▪ Low coverage because of the wide variety of expressions.
Comparison of Two Approaches (2)
3.2 Automatic document classification (or machine learning-based approach): applies machine learning or other technologies to classify documents automatically. This results in faster, more scalable, and more objective classification.
Pros:
▪ Domain independent.
▪ High predictive performance.
Cons:
▪ Not accountable for classification results.
▪ Training data required.
In general, there are three approaches:
1 Supervised method: The classifier is trained on a manually tagged set of documents. The classifier can predict new categories and can also provide a confidence indicator.
2 Unsupervised method: Documents are mathematically organized based on similar words and phrases.
3 Rules-based method: This method consists of leveraging the natural language understanding capability of a system and writing linguistic rules that instruct the system to act like a person in classifying a document. This means using the semantically relevant elements of a text to drive the automatic categorization.
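A minimal sketch of the rule-based approach, scored with the precision and recall measures from Section 1.6, is given below. The keyword rules, the example texts, and the function names are all hypothetical and were chosen only to illustrate the idea; real rule sets rely on much richer linguistic conditions than keyword matching.

```python
# Hypothetical hand-written rules: category -> keywords that trigger it.
RULES = {
    "sport": {"match", "goal", "team"},
    "politics": {"election", "vote", "party"},
}


def classify(text, rules=RULES, default="unknown"):
    """Assign the category whose keywords match the most words in the text."""
    words = set(text.lower().split())
    hits = {category: len(words & keywords) for category, keywords in rules.items()}
    best = max(hits, key=hits.get)  # on a tie, the first rule listed wins
    return best if hits[best] > 0 else default


def precision_recall(predicted, actual, category):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) for one category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labeled texts (invented for illustration).
texts = [
    "the team scored a late goal",
    "vote in the election",
    "the party lost the match",  # matches one keyword from each rule
    "a quiet day",               # matches no rule at all
]
actual = ["sport", "politics", "politics", "sport"]

predicted = [classify(t) for t in texts]
print(predicted)                                      # ['sport', 'politics', 'sport', 'unknown']
print(precision_recall(predicted, actual, "sport"))   # (0.5, 0.5)
```

The third text shows two rules conflicting and the fourth shows low coverage, the two weaknesses of rule-based classification listed above; both errors lower the scores for the "sport" category.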
5 Conclusion
In this paper we have presented a thorough evaluation of different approaches to document classification. We have confirmed recent results about the superiority of Support Vector Machines on a new test set. Furthermore, we have shown that feature subset selection or dimensionality reduction is essential for document classification, not only to make learning and classification tractable but also to avoid overfitting. It is important to see that this also holds for Support Vector Machines. Last but not least, we have shown that linguistic preprocessing does not necessarily improve classification quality. Nevertheless, we are convinced that in the long run linguistic preprocessing such as morphological analysis pays off for document classification as well as for document retrieval. However, this linguistic preprocessing probably has to be more sophisticated than our simple morphological analysis. A big advantage of linguistic preprocessing over n-gram features is that it makes the integration of thesauri, concept nets, and domain models possible. Besides linguistic sophistication, statistics can also help to produce good features. In the future we plan to evaluate different methods for finding topic-relevant collocations and multi-word phrases.