Group4 doc

WACHEMO UNIVERSITY
COLLEGE OF ENGINEERING AND TECHNOLOGY
SCHOOL OF COMPUTING AND INFORMATICS
PROJECT TITLE: NETWORK DESIGN FOR VISION ACADEMY SCHOOL
BY
Name                 ID
1. FraholFeyera      R/ET-5526/07
2. FujetaHusen       R/ET-5528/07
3. TerufatHezkel     R/ET-5526/07
4. AdugnaAdmasu      R/ET-5420/07
5. ShagituSeid       R/ET-3299/07
6. MihiretMerkabu    R/ET-5597/07

Submitted to: Mr.Girma
Wednesday, January 3, 2018

Contents
1. INTRODUCTION
1.1 Text classification process
1.2 Documents Collection
1.3 Pre-Processing
1.4 Feature Selection
1.5 Automatic Classification
1.6 Performance Evaluations
2. Architecture of Text Classification with Machine Learning
2.1 Supervised Learning
2.2 Starting to sketch the Architecture
3.0 Document classification approach
3.1 Manual document classification
3.2 Automatic document classification
5 Conclusion

1. INTRODUCTION
Text mining is gaining importance because of the growing number of electronic documents available from a variety of sources, which include unstructured and semi-structured information. The main goal of text mining is to enable users to extract information from textual resources; it deals with operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization. Natural Language Processing (NLP), Data Mining, and Machine Learning techniques work together to automatically classify documents of different types and to discover patterns in them [1].
Text classification (TC) is an important part of text mining. One approach is to build TC systems manually by means of knowledge-engineering techniques, i.e. by manually defining a set of logical rules that encode expert knowledge on how to classify documents under the given set of categories. A typical example is to automatically label each incoming news story with a topic like “sports”, “politics”, or “art”. A data mining classification task starts with a training set D = (d1, …, dn) of documents that are already labelled with a class C1, C2, … (e.g. sport, politics). The task is then to determine a classification model that is able to assign the correct class to a new document d of the domain. Text classification has two flavors, single-label and multi-label: a single-label document belongs to only one class, whereas a multi-label document may belong to more than one class. In this paper we consider only single-label document classification.
Text classification is the task of classifying documents into predefined classes. It is also called:
• Text Categorization
• Document Classification
• Document Categorization
There are two approaches:
• manual classification
• automatic classification
1.1 Text classification process
The stages of TC are discussed in the following points.
Fig. 1 Document Classification Process (stages shown: Documents, Preprocessing, Indexing, Feature Selection, Classification Algorithms, Performance measure)
1.2 Documents Collection
This is the first step of the classification process, in which documents of different types (formats) are collected, such as .html, .pdf, .doc and web content.
1.3 Pre-Processing
Pre-processing converts the text documents into a clean word format. The documents prepared for the next step in text classification are represented by a large number of features. The steps commonly taken are:
Tokenization: A document is treated as a string and then partitioned into a list of tokens.
Removing stop words: Stop words such as “the”, “a”, “and”, etc. occur frequently but carry little meaning, so these insignificant words are removed.
Stemming: A stemming algorithm converts different word forms into a similar canonical form. This step conflates tokens to their root form, e.g. connection to connect, computing to compute.
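The three pre-processing steps above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the suffix list and the function names are assumptions made for the example, and the suffix stripping is a deliberately naive stand-in for a real stemming algorithm.

```python
import re

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}

# Illustrative suffixes for a naive stemmer (assumption, not a published algorithm).
SUFFIXES = ("ion", "ing", "s")


def tokenize(text):
    """Treat the document as a string and split it into lower-case word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop frequently occurring words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The connection is connecting the networks"))
```

Note that this naive stemmer maps "computing" to "comput" rather than "compute"; stems only need to be consistent across word forms, not dictionary words.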
Indexing
Document representation is one of the pre-processing techniques used to reduce the complexity of the documents and make them easier to handle: each document has to be transformed from its full-text version into a document vector. Perhaps the most commonly used document representation is the vector space model (SMART) [55]. In the vector space model, documents are represented by vectors of words, and a collection of documents is usually represented by a word-by-document matrix. The bag-of-words (BoW)/VSM representation scheme has its own limitations, among them the high dimensionality of the representation, the loss of correlation with adjacent words and the loss of the semantic relationships that exist among the terms in a document. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms, as shown in the following matrix:
      T1    T2    …    Tt    Ci
D1    w11   w21   …    wt1   c1
D2    w12   w22   …    wt2   c2
:     :     :          :     :
Dn    w1n   w2n   …    wtn   cn
Each entry represents the occurrence of a word in a document: wtn is the weight of word t in document n. Since most words do not appear in every document, most entries are zero. There are several ways of determining the weight w11, such as Boolean weighting, word frequency weighting, tf-idf and entropy. The major drawback of this model is that it results in a huge sparse matrix, which raises the problem of high dimensionality. Various other methods are presented in [56]:
1) An ontology representation of a document, which keeps the semantic relationships between the terms in the document.
2) N-grams, i.e. sequences of symbols (bytes, characters or words) extracted from a long string in a document. It is, however, very difficult to decide the number of grams to consider for an effective document representation.
3) Multiword terms as vector components. This method requires sophisticated term extraction algorithms to extract the terms automatically from a document.
4) Latent Semantic Indexing (LSI), which preserves the most representative features of a document rather than the most discriminating ones.
5) Locality Preserving Indexing (LPI), which overcomes this problem by discovering the local semantic structure of a document, but is not efficient in time and memory.
6) A new representation for modelling web documents, in which HTML tags are used to build the web document representation.
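To make the term weighting in the document matrix concrete, the following sketch builds a small tf-idf matrix. It assumes one common variant, raw term count multiplied by log(N / document frequency); other tf-idf formulations exist, and the function name and sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf_matrix(documents):
    """Return (vocabulary, rows); rows[n][t] is the weight of term t in document n.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append(
            [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        )
    return vocabulary, rows


docs = [["goal", "match", "goal"], ["vote", "match"]]
vocabulary, rows = tf_idf_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

A term such as "match" that occurs in every document receives weight zero, and absent terms are zero as well, which illustrates why the matrix becomes sparse as the vocabulary grows.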
1.4 Feature Selection
After pre-processing and indexing, the next important step of text classification is feature selection [2], which constructs the vector space and improves the scalability, efficiency and accuracy of a text classifier. The main idea of feature selection (FS) is to select a subset of the features of the original documents. FS is performed by keeping the words with the highest score according to a predetermined measure of the importance of the word, because a major problem for text classification is the high dimensionality of the feature space. Many feature evaluation metrics have been proposed, notable among which are information gain (IG), term frequency, Chi-square, expected cross entropy, odds ratio, the weight of evidence of text, mutual information and the Gini index. Other methods have also been presented, such as the sampling method of [58], which randomly samples features and then builds the matrix for classification. To address the problem of high dimensionality, [59] presents a new FS method that uses genetic algorithm (GA) optimization.
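As an illustration of score-based feature selection, the sketch below ranks terms by the Chi-square statistic for one target class and keeps the top k. The contingency-table form of the statistic is the usual one for term/category independence; the function names, the alphabetical tie-breaking and the sample data are assumptions made for the example.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class containing the term
    b: documents outside the class containing the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(documents, labels, target, k):
    """Keep the k terms with the highest chi-square score for the target class."""
    vocabulary = {term for doc in documents for term in doc}
    scores = {}
    for term in vocabulary:
        a = b = c = d = 0
        for doc, label in zip(documents, labels):
            has_term = term in doc
            in_class = label == target
            if has_term and in_class:
                a += 1
            elif has_term:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(vocabulary, key=lambda t: (-scores[t], t))[:k]


docs = [["goal", "match"], ["goal", "team"], ["vote", "match"], ["vote", "law"]]
labels = ["sport", "sport", "politics", "politics"]
print(select_features(docs, labels, "sport", 2))
```

The word "match" appears equally often in both classes, so it scores zero and is dropped, while words that separate the classes are kept.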
1.5 Automatic Classification
The automatic classification of documents into predefined categories has attracted active attention. Documents can be classified in three ways: by unsupervised, supervised and semi-supervised methods. Over the last few years, the task of automatic text classification has been extensively studied and rapid progress has been made in this area, including machine learning approaches such as the Bayesian classifier, the Decision Tree, etc.
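As a small example of the Bayesian classifier mentioned above, the following sketch implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing over tokenized documents. The class name, method names and training data are illustrative assumptions; it is a sketch of the general technique, not a tuned implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.term_counts[label].update(doc)
        self.vocabulary = {term for doc in documents for term in doc}
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior of the class.
            score = math.log(count / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for term in document:
                if term not in self.vocabulary:
                    continue  # terms never seen in training are ignored
                # Smoothed log likelihood of the term given the class.
                score += math.log(
                    (self.term_counts[label][term] + 1)
                    / (total_terms + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["goal", "match", "team"], ["team", "goal"],
              ["vote", "law"], ["law", "election", "vote"]]
train_labels = ["sport", "sport", "politics", "politics"]
model = NaiveBayesClassifier().fit(train_docs, train_labels)
print(model.predict(["goal", "team", "stadium"]))
```

Working in log space avoids numerical underflow when documents contain many terms.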

1.6 Performance Evaluations
This is the last stage of text classification. The evaluation of text classifiers is typically conducted experimentally rather than analytically. The experimental evaluation of a classifier, rather than concentrating on issues of efficiency, usually tries to evaluate its effectiveness, i.e. its capability of taking the right categorization decisions. An important issue in text categorization is how to measure the performance of the classifiers. Many measures have been used, such as precision and recall [54], fallout, error and accuracy; they are given below.
Precision wrt ci (Pri) is defined as the conditional probability that, if a random document dx is classified under ci, this decision is correct. Analogously, recall wrt ci (Rei) is defined as the conditional probability that, if a random document dx ought to be classified under ci, this decision is taken.
TPi - The number of documents correctly assigned to this category
FPi - The number of documents incorrectly assigned to this category
FNi - The number of documents incorrectly rejected from this category
TNi - The number of documents correctly rejected from this category
Precision (Pri) = TPi / (TPi + FPi)
Recall (Rei) = TPi / (TPi + FNi)
Fallout = FPi / (FPi + TNi)
Error = (FNi + FPi) / (TPi + FNi + FPi + TNi)
Accuracy = (TPi + TNi) / (TPi + FNi + FPi + TNi)
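The measures can be computed directly from the four counts for a category. The sketch below uses the standard convention that a false positive (FP) is a document incorrectly assigned to the category and a false negative (FN) is a document incorrectly rejected from it; the function name and the sample counts are made up for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Per-category effectiveness measures from contingency counts.

    tp: correctly assigned, fp: incorrectly assigned,
    fn: incorrectly rejected, tn: correctly rejected.
    """
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fallout": fp / (fp + tn) if fp + tn else 0.0,
        "error": (fp + fn) / total if total else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical counts for one category.
for name, value in evaluate(tp=40, fp=10, fn=20, tn=30).items():
    print(f"{name}: {value:.3f}")
```

Error and accuracy always sum to 1, so in practice only one of the two needs to be reported.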
Relevant technologies
Text Clustering – Create clusters of documents without any external information.
Information Retrieval (IR) – Retrieve a set of documents relevant to a query.
Information Filtering – Filter out irrelevant documents through interactions.
Information Extraction (IE) – Extract fragments of information, e.g., person names, dates, and places, in documents.
2. Architecture of Text Classification with Machine Learning
One of the most common tasks in Machine Learning is text classification, which is simply teaching your machine how to read and interpret a text and predict what kind of text it is.
The purpose of this essay is to describe an architecture for supervised text classification that is simple and generic enough. The interesting point of this architecture is that you can use it as a basic, initial model for many classification tasks.

2.1 Supervised Learning
Supervised learning is when you first have to train your model with an existing labeled dataset. It is just like teaching a kid how to differentiate between a car and a motorcycle: you have to expose their differences, similarities and so on. Unsupervised learning, by contrast, is about learning and predicting without a pre-labeled dataset.
2.2 Starting to sketch the Architecture
With the dataset in hand, we start to think about what our architecture should look like to achieve the given goal. We can summarize the steps as:
1. Cleaning the dataset
2. Partitioning the dataset
3. Feature Engineering
4. Choosing the right Algorithms, Mathematical Models and Methods
5. Wrapping everything up
1. Cleaning the dataset
Cleaning the dataset is a crucial initial step in Machine Learning. Many toy datasets don’t need to be cleaned, because they are already clean, peer-reviewed and published in a form you can use directly to work on the learning algorithms.
The problem is that the real world is full of painful and noisy datasets.
If there’s one thing I learned while working with Machine Learning, it is that there’s no such thing as a shiny and perfect dataset in the real world, so we have to deal with this beforehand. Situations where there are many empty fields, wrong and non-homogeneous formats, or broken characters are very common. I won’t talk about such techniques now, but I will write something about them in another post.

2. Partitioning the Dataset
We always need to partition the dataset into at least 2 partitions: the training dataset and the test/validation dataset. Why?
Suppose we feed the learning algorithm with training data X for which the output Y is already known (because it is a training data pair (X, Y)); that is, for a given text X, Y is its classification. The algorithm will learn it.
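A minimal sketch of such a partition is shown below. The 80/20 ratio, the fixed random seed and the function name are assumptions chosen for the example; the point is only that the test pairs are held back and never shown to the learning algorithm during training.

```python
import random


def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle (X, Y) pairs and split them into training and test partitions.

    Assumes at least two pairs; the test partition always gets at least one.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled dataset: (text, class) pairs.
dataset = [(f"text {i}", "sport" if i % 2 else "politics") for i in range(10)]
train, test = train_test_split(dataset)
print(len(train), "training pairs,", len(test), "test pairs")
```

Evaluating on the held-out pairs shows how the model behaves on texts it has not already memorized.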
3.0 Document classification approach:
3.1 Manual document classification: users interpret the meaning of text, identify the relationships between concepts and categorize documents.
Or (rule-based approach)
▪ write a set of rules that classify documents
– machine learning-based approach
▪ using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data.
Comparison of Two Approaches (1)
– Rule-based Classification
Pros:
– very accurate when rules are written by experts
– classification criteria can be easily controlled when the number of rules is small
Cons:
– rules sometimes conflict with each other
– maintenance of rules becomes more difficult as the number of rules increases
– the rules have to be reconstructed when the target domain changes
– low coverage because of a wide variety of expressions
Comparison of Two Approaches (2)
3.2 Automatic document classification: applies machine learning or other technologies to automatically classify documents. This results in faster, more scalable and more objective classification.
Or (Machine Learning-based approach)
Pros:
– domain independent
– high predictive performance
Cons:
– not accountable for classification results
– training data required
In general there are three approaches:
1 Supervised method: The classifier is trained on a manually tagged set of documents. The classifier can then predict the categories of new documents and can also provide a confidence indicator.
2 Unsupervised method: Documents are mathematically organized based on similar words and phrases.
3 Rules-based method: This method consists of leveraging the natural language understanding capability of a system and writing linguistic rules that instruct the system to act like a person in classifying a document. This means using the semantically relevant elements of a text to drive the automatic categorization.
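For contrast with the learned classifiers, a rules-based method can be sketched as an ordered list of hand-written rules. The keyword rules, category names and function name below are purely hypothetical and far simpler than the linguistic rules a real system would use.

```python
# Hypothetical hand-written rules: (category, keywords that trigger it).
RULES = [
    ("sports", {"goal", "match", "team"}),
    ("politics", {"vote", "election", "law"}),
]


def classify_by_rules(tokens, rules=RULES, default="unknown"):
    """Return the category of the first rule whose keywords match the tokens."""
    token_set = set(tokens)
    for category, keywords in rules:
        if token_set & keywords:
            return category
    return default


print(classify_by_rules(["the", "election", "result"]))
print(classify_by_rules(["weather", "report"]))
print(classify_by_rules(["team", "vote"]))
```

The last call matches both rules and is resolved only by rule order, and the second call matches none. These are small instances of the rule conflicts and low coverage that make rule sets hard to maintain.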

5 Conclusion
In this paper we have presented a thorough evaluation of different approaches to document classification. We have confirmed recent results about the superiority of Support Vector Machines on a new test set. Furthermore, we have shown that feature subset selection or dimensionality reduction is essential for document classification, not only to make learning and classification tractable but also to avoid overfitting. It is important to see that this also holds for Support Vector Machines. Last but not least, we have shown that linguistic preprocessing does not necessarily improve classification quality. We are convinced that in the long run linguistic preprocessing such as morphological analysis pays off for document classification as well as for document retrieval. However, this linguistic preprocessing probably has to be more sophisticated than our simple morphological analysis. A big advantage of linguistic preprocessing compared to n-gram features is that the integration of thesauri, concept nets and domain models becomes possible. Besides linguistic sophistication, statistics can also help to produce good features. For the future we plan to evaluate different methods for finding topic-relevant collocations and multi-word phrases.
