IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 02 Issue: 12 | Dec-2013, Available @ http://www.ijret.org 574
SURVEY ON SEMI SUPERVISED CLASSIFICATION METHODS AND
FEATURE SELECTION
Neethu Innocent 1
, Mathew Kurian 2
1
PG student, Department of Computer Science and Technology, Karunya University, Tamil Nadu, India,
neethucii@gmail.com
2
Assistant Professor, computer science and technology ,Karunya University, Tamil Nadu, India,
mathewk80@karunya.edu
Abstract
Data mining, also called knowledge discovery, is the process of analyzing data from several perspectives and summarizing it into useful information. It has wide application in classification tasks such as pattern recognition, disease subtype discovery, medical image analysis, speech recognition, biometric identification, and drug discovery. This is a survey of several semi-supervised classification methods, in which both labeled and unlabeled data can be used for classification; such methods are less expensive than fully supervised classification. The techniques surveyed in this paper are the low density separation approach, the transductive SVM, a semi-supervised logistic discriminant procedure, and the self-training nearest neighbour rule using cut edges. Along with the classification methods, a review of various feature selection methods is also presented. Feature selection is performed to reduce the dimension of large datasets; after reducing the attributes, the data is given to the classifier, so the accuracy and performance of the classification system can be improved. The feature selection methods covered include consistency based feature selection, fuzzy entropy measure feature selection with a similarity classifier, signal-to-noise ratio, and positive approximation. Each method has its own benefits.
Index Terms: Semi-supervised classification, Transductive support vector machine, Feature selection, Unlabeled samples
--------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
Data mining is the process of extracting information from large datasets and converting it into an understandable format. It plays a major role in classification: various data mining techniques can be used to classify several diseases, which helps with better diagnosis and treatment. There are several classification methods available to a classifier. A supervised classification method uses only labeled data, which is expensive and difficult to obtain. This paper concentrates on semi-supervised classification methods, which use both labeled and unlabeled data [8]. Semi-supervised classification methods include the transductive support vector machine, the recursive partitioning model, the cut edge and nearest neighbour rule, and the low density separation approach.
One major problem in classification is the large number of features in a dataset: with thousands of features, the performance of the classifier suffers. Selecting a useful subset of features is one of the challenges in machine learning and is called feature selection [1], [7]. With feature selection, only the selected features are used for classification, so the dimension of the data is reduced, the computational cost falls, and classification performance can be improved. The feature selection methods covered in this paper are consistency based feature selection, signal-to-noise ratio, fuzzy entropy measure feature selection with a similarity classifier, and positive approximation based on rough set theory.
2. SEMI-SUPERVISED CLASSIFICATION
Semi-supervised classification methods use both labeled and unlabeled samples. Unlabeled samples can be obtained very easily, but labeled samples are very expensive to obtain.
2.1. Transductive SVM
The transductive SVM (TSVM) is an iterative algorithm that includes unlabeled samples in the training phase. Ujjwal Maulik et al. proposed a transductive procedure in which transductive samples are selected through a filtering process over the unlabeled data [1].
The algorithm takes both labeled and unlabeled samples as input. It starts by training an SVM classifier on a working set T(0), which is initially equal to the labeled set. The unlabeled data that fall inside the margin carry the most information; those falling on the negative side of the decision boundary are called negative transductive samples and those on the positive side are called positive transductive samples. Accurately labeled, informative samples near the margin are selected, with samples on the upper and lower side assigned +1 and -1 respectively. The selected transductive samples are added to the training set. Using this method, classification accuracy can be increased and cost reduced.
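The iterative working-set procedure described above can be sketched as follows, using a scikit-learn linear SVC as the base classifier; the margin threshold, round count, and function names are illustrative choices of this sketch, not taken from [1]:

```python
import numpy as np
from sklearn.svm import SVC

def transductive_svm(X_lab, y_lab, X_unlab, rounds=5, margin=1.0):
    """Iteratively move informative unlabeled points into the working set."""
    clf = SVC(kernel="linear")
    X_work, y_work = X_lab.copy(), y_lab.copy()  # working set T(0) = labeled set
    X_pool = X_unlab.copy()
    for _ in range(rounds):
        clf.fit(X_work, y_work)
        if len(X_pool) == 0:
            break
        # Unlabeled points inside the margin (|f(x)| < margin) are the informative ones
        scores = clf.decision_function(X_pool)
        inside = np.abs(scores) < margin
        if not inside.any():
            break
        # Positive side -> positive transductive sample (+1), negative side -> -1
        pseudo = np.where(scores[inside] >= 0, 1, -1)
        X_work = np.vstack([X_work, X_pool[inside]])
        y_work = np.concatenate([y_work, pseudo])
        X_pool = X_pool[~inside]
    clf.fit(X_work, y_work)
    return clf
```

The filtering step keeps only the in-margin unlabeled points, which mirrors the idea that samples far from the boundary add little information to the working set.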
2.2.Semi-supervised based logistic discriminant
procedure
Shuichi kawano et.al proposed a non linear semi-supervised
logistic discriminant procedure which is based on Gaussian
basis expansions with regularization based on graph [9]
.Graph laplacian is used in regularization term is one of the
major technique in graph based regularization method .It is
based on degree matrix and weighted adjacency matrix .
Weighted matrix M will be a n x n matrix. To select values
of several tuning parameters they derive a model for the
selection criteria from Bayesian and information theoretic
approach . Laplacian graph are applied easily to analyze
high dimensional or complex dataset in both labeled and
unlabeled data set. This method also reduce error rate of
prediction.
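The graph Laplacian used in the regularization term is built from the degree matrix and the weighted adjacency matrix as L = D − W; a minimal sketch (the example weight values are illustrative):

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    D = np.diag(W.sum(axis=1))  # degree matrix: row sums of W on the diagonal
    return D - W

# Three nodes; nodes 0-1 and 1-2 are connected with weight 1.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = graph_laplacian(W)
```

Every row of L sums to zero, and for a symmetric W the Laplacian is symmetric positive semi-definite, which is what makes it usable as a smoothness penalty.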
2.3. Self-training nearest neighbour rule using cut edges
One of the common techniques in semi-supervised learning is self-training. The classifier is first trained with labeled samples; it then labels unlabeled samples and adds them to the training set, teaching itself from its own predictions. This method can cause several problems, such as misclassification, noise in the labeled data, and error reinforcement. Y. Wang et al. proposed a method to solve these problems, the self-training nearest neighbour rule using cut edges [2]. The method pools the testing and training samples in an iterative way. There are two aspects:
• First, as many testing samples as possible must be classified as positive or negative, and the resulting output must be of low risk. Once the majority class is obtained, a label modification mechanism is constructed that utilizes cut edges in the relative neighbourhood graph.
• Second, cut edge weights are employed in the semi-supervised classification technique. Using cut edges reduces classification error and error reinforcement, and improves performance.
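The cut-edge idea can be sketched as follows: an edge of the neighbourhood graph joining two samples with different labels is a cut edge, and a pseudo-labeled sample whose edges are mostly cut is treated as suspect. This sketch uses a plain k-NN graph rather than the relative neighbourhood graph of [2], and all function names and thresholds are illustrative:

```python
import numpy as np

def cut_edge_filter(X, y, k=3):
    """Flag samples whose neighbourhood graph edges are mostly cut edges."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]     # k nearest neighbours (excluding self)
        cut = np.sum(y[nn] != y[i])     # number of cut edges at node i
        if cut > k / 2:                 # majority of edges are cut -> likely mislabeled
            keep[i] = False
    return keep
```

In a self-training loop this check would run after each round of pseudo-labeling, so that suspect labels are rejected or modified before they can reinforce errors.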
2.4. Low density separation approach
The low density separation (LDS) algorithm uses the cluster assumption together with both labeled and unlabeled data. Under the cluster assumption, the decision boundary must lie in a low density region and should not cross high density regions [1], [4]. Two procedures keep the decision boundary in the low density regions between clusters. First, a graph-based distance is computed that emphasizes low density regions. Second, high density regions are avoided and the decision boundary is obtained by optimizing the transductive SVM objective function. These two procedures are then combined. LDS can thus achieve higher accuracy than traditional semi-supervised methods and the SVM.
Table 1: Comparison of the surveyed semi-supervised classification methods
3. FEATURE SELECTION
Most datasets are of huge size, so feature selection is used to reduce the dimensionality of large data [10]. After reducing the features, the sample data is given to the classifier. For example, a gene expression dataset is a large dataset with thousands of genes; feature selection reduces the size of the data and can increase the performance of the classifier. Several feature selection methods are explained in this section.
3.1. Consistency based feature selection
In this method only relevant features are selected, and an inconsistency measure is calculated [1], [5]. For example, a pattern with more than one matching instance residing in different classes is considered inconsistent, and inconsistent features are removed. The forward greedy algorithm is used; its steps are shown in Fig 1 [6]:
• The next candidate feature subset is generated by the generation procedure.
• The candidate subset is evaluated with an evaluation function.
• A stopping criterion decides when to stop.
Method                                                Benefit
Transductive SVM                                      Accuracy of classification increased
Semi-supervised logistic discriminant procedure       Error rate reduced
Self-training nearest neighbour rule using cut edges  Performance improved
Low density separation approach                       Accuracy of classifier increased
Fig 1. Forward Greedy Algorithm for attribute reduction
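The forward greedy loop of Fig 1 can be sketched as follows, with the inconsistency rate of [5] serving as the evaluation function; the function names and the discrete-valued-data assumption are illustrative:

```python
from collections import defaultdict

def inconsistency_rate(rows, labels, features):
    """Fraction of instances that match on `features` but disagree on class."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[tuple(row[f] for f in features)].append(lab)
    # For each matched pattern, every instance outside the majority class is inconsistent
    bad = sum(len(labs) - max(labs.count(c) for c in set(labs))
              for labs in groups.values())
    return bad / len(rows)

def forward_greedy(rows, labels, threshold=0.0):
    selected, remaining = [], list(range(len(rows[0])))
    while remaining:
        # Generation procedure: candidate subsets formed by adding one feature
        best = min(remaining,
                   key=lambda f: inconsistency_rate(rows, labels, selected + [f]))
        selected.append(best)
        remaining.remove(best)
        # Stopping criterion: inconsistency at or below the threshold
        if inconsistency_rate(rows, labels, selected) <= threshold:
            break
    return selected
```

On a toy dataset where feature 0 alone determines the class, the loop stops after one step and returns just that feature.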
3.2. Fuzzy entropy measure feature selection with similarity classifier
In this method features are selected by a fuzzy entropy measure with the help of a similarity classifier [11]. The main principle behind the similarity classifier is to create an ideal vector V_j = (v_j(f_1), ..., v_j(f_n)) that represents class j. This ideal vector is calculated from the samples Y = (y(f_1), ..., y(f_n)) belonging to class C_j. After the ideal vector is calculated, the similarity between the vector V and a sample Y is computed, denoted S(Y, V). In the ideal case the similarity value is 1 when the sample belongs to class j. From this, entropy values are calculated to select features: if the similarity value is high, the entropy value is very low, and vice versa. The fuzzy entropy value of each feature is calculated using the sample vectors and the ideal vector. Features with the highest entropy values are removed and those with the lowest entropy values are selected. Classification accuracy is increased and computational time is reduced.
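A rough sketch of the per-feature fuzzy entropy score, using the class mean as the ideal vector and a simple 1 − |difference| similarity; the data is assumed scaled to [0, 1], and these particular parameter choices are simplifications of this sketch rather than the exact formulation of [11]:

```python
import numpy as np

def fuzzy_entropy_scores(X, y):
    """Per-feature De Luca-Termini fuzzy entropy; X is scaled to [0, 1]."""
    H = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        v = Xc.mean(axis=0)          # ideal vector for class c (one possible choice)
        S = 1.0 - np.abs(Xc - v)     # similarity of each sample to the ideal vector
        S = np.clip(S, 1e-12, 1 - 1e-12)
        # Entropy is low when similarities are near 0 or 1, highest near 0.5
        H += -(S * np.log(S) + (1 - S) * np.log(1 - S)).sum(axis=0)
    return H  # drop the features with the highest entropy, keep the lowest
```

A feature whose values cluster tightly around the class ideal gets similarity near 1 and hence low entropy, matching the selection rule stated above.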
3.3. Signal to noise ratio
The signal to noise ratio is calculated using the equation [1], [3]

SNR = (m1 - m2) / (σ1 + σ2)    (1)

where m1 and m2 are the class means and σ1 and σ2 are the standard deviations. For example, for gene expression data, the SNR value of each gene is calculated from its expression levels, the genes are arranged in descending order of SNR, and the top ten features are selected. This method enhances the accuracy of classification.
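Equation (1) applied per feature, followed by the descending sort and top-k selection described above; the small epsilon guarding a zero denominator is an addition of this sketch:

```python
import numpy as np

def snr_ranking(X, y, top=10):
    """Rank features by |m1 - m2| / (s1 + s2) between the two classes."""
    c1, c2 = X[y == 1], X[y == 0]
    m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
    s1, s2 = c1.std(axis=0), c2.std(axis=0)
    snr = np.abs(m1 - m2) / (s1 + s2 + 1e-12)
    # Indices of the top-ranked features, in descending SNR order
    return np.argsort(snr)[::-1][:top]
```

A feature whose class means are far apart relative to the class spreads dominates the ranking, which is exactly why it works for picking discriminative genes.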
3.4.Positive approximation
Existing heuristic attribute reduction has several
limitations . Yuhua Qian proposed a method called positive
approximation based on rough set theory[6]. The main
objective of this method is to select some property of
original data without any redundancy. There will be more
than one reduct. But only one reduced attribute is needed so
a heuristic algorithm is proposed based on significance
measure of the attribute. This method is somewhat similar to
greedy search algorithm. But some modification is
proposed here significance measure is calculated . Until the
reduct set is obtained the attribute with maximum
significance value is added at each stage. The result of
positive approximation of attribute reduction shows that it
is an effective accelerator. There are three speed up factor
in positive approximation based feature selection :
• One attribute can select more than one ineach
loop. So this will helps to provide a restriction in
the result of the reduction algorithm.
• Reduced computational time due to attribute
significance measure.
• Another important factor in this algorithm is size
of the data is reduced and time taken for the
computation of stopping criteria is also reduced
to minimum.
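The significance measure that drives the greedy loop can be sketched via the rough-set positive region: at each stage, the attribute whose addition grows the positive region the most is added. Function names and the discrete-valued-data assumption are illustrative:

```python
from collections import defaultdict

def positive_region_size(rows, labels, attrs):
    """Count objects whose equivalence class under `attrs` is pure in decision label."""
    blocks = defaultdict(list)
    for row, lab in zip(rows, labels):
        blocks[tuple(row[a] for a in attrs)].append(lab)
    return sum(len(labs) for labs in blocks.values() if len(set(labs)) == 1)

def significance(rows, labels, attrs, a):
    """How much attribute `a` grows the positive region beyond `attrs`."""
    return (positive_region_size(rows, labels, attrs + [a])
            - positive_region_size(rows, labels, attrs))
```

An attribute that fully determines the class has maximal significance relative to the empty set, while an uninformative attribute scores zero and is never added to the reduct.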
4. CONCLUSION
This paper reviewed various semi-supervised classification methods, each with its own advantages and disadvantages. The low density separation approach can overcome the problems of the traditional SVM. The transductive SVM can increase classification accuracy. The logistic discriminant procedure and the self-training nearest neighbour rule using cut edges can reduce the error rate and misclassification. Several feature selection methods were also analyzed. Consistency based feature selection removes inconsistent features and increases performance. Fuzzy entropy measure feature selection with a similarity classifier can increase accuracy and reduce computational time. In the signal-to-noise ratio method, features are selected based on the SNR values of each attribute, arranged in descending order, and the top ten features are chosen. With the positive approximation reduction method, the selected attributes have no redundant values.
REFERENCES
[1] Ujjwal Maulik, Anirban Mukhopadhyay, and Debasis Chakraborty, "Gene-expression-based cancer subtypes prediction through feature selection and transductive SVM," IEEE Transactions on Biomedical Engineering, vol. 60, no. 4, April 2013.
[2] Yu Wang, Xiaoyan Xu, Haifeng Zhao, and Zhongsheng Hua, "Semi-supervised learning based on nearest neighbor rule and cut edges," Knowledge-Based Systems, vol. 23, pp. 547–554, 2010.
[3] Debahuti Mishra and Barnali Sahu, "Feature selection for cancer classification: a signal-to-noise ratio approach," International Journal of Scientific & Engineering Research, vol. 2, issue 4, April 2011.
[4] O. Chapelle and A. Zien, "Semi-supervised classification by low-density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[5] M. Dash and H. Liu, "Consistency based search in feature selection," Artif. Intell., vol. 151, pp. 155–176, 2003.
[6] Y. Qian, J. Liang, W. Pedrycz, and C. Dang, "Positive approximation: An accelerator for attribute reduction in rough set theory," Artif. Intell., vol. 174, pp. 597–618, 2010.
[7] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, "An improved algorithm for clustering gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.
[8] O. Chapelle, V. Sindhwani, and S. S. Keerthi, "Optimization techniques for semi-supervised support vectors," J. Mach. Learn. Res., vol. 9, pp. 203–233, 2008.
[9] Shuichi Kawano, Toshihiro Misumi, and Sadanori Konishi, "Semi-supervised logistic discrimination via graph-based regularization," Neural Process. Lett., vol. 36, pp. 203–216, 2012.
[10] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, no. 1/2, pp. 245–271, 1997.
[11] Pasi Luukka, "Feature selection using fuzzy entropy measures with similarity classifier," Expert Systems with Applications, vol. 38, pp. 4600–4607, 2011.
BIOGRAPHIES
Neethu Innocent is pursuing her M.Tech in Software Engineering at Karunya University, Tamil Nadu, India. She received her Bachelor's degree in Computer Science and Engineering from MG University, Kerala.
Mathew Kurian finished his M.E in Computer Science and Engineering at Jadavpur University, Kolkata, and is currently working as Assistant Professor in the Department of Computer Science and Engineering at Karunya University. Previously, he worked as a Software Engineer with Aricent Technologies. He is currently pursuing his PhD in Data Mining.

More Related Content

PDF
Survey on semi supervised classification methods and feature selection
PDF
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
PDF
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
PDF
G046024851
PDF
The International Journal of Engineering and Science (The IJES)
PDF
M43016571
PDF
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
PDF
An efficient feature selection in
Survey on semi supervised classification methods and feature selection
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
G046024851
The International Journal of Engineering and Science (The IJES)
M43016571
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
An efficient feature selection in

What's hot (18)

PDF
[IJET-V1I3P11] Authors : Hemangi Bhalekar, Swati Kumbhar, Hiral Mewada, Prati...
PDF
A Survey on Machine Learning Algorithms
PDF
An unsupervised feature selection algorithm with feature ranking for maximizi...
PDF
A Threshold fuzzy entropy based feature selection method applied in various b...
PDF
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...
PDF
IRJET- A Detailed Study on Classification Techniques for Data Mining
PDF
IRJET - A Survey on Machine Learning Intelligence Techniques for Medical ...
PDF
Feature selection for multiple water quality status: integrated bootstrapping...
PDF
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
A02610104
PDF
A novel medical image segmentation and classification using combined feature ...
PPS
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
PDF
IRJET- Supervised Learning Approach for Flower Images using Color, Shape and ...
PDF
Paper id 42201618
PDF
IRJET - Survey on Clustering based Categorical Data Protection
PDF
IDENTIFICATION AND CLASSIFICATION OF POWDER MICROSCOPIC IMAGES OF INDIAN HERB...
PDF
Decision Tree Classifiers to determine the patient’s Post-operative Recovery ...
[IJET-V1I3P11] Authors : Hemangi Bhalekar, Swati Kumbhar, Hiral Mewada, Prati...
A Survey on Machine Learning Algorithms
An unsupervised feature selection algorithm with feature ranking for maximizi...
A Threshold fuzzy entropy based feature selection method applied in various b...
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET - A Survey on Machine Learning Intelligence Techniques for Medical ...
Feature selection for multiple water quality status: integrated bootstrapping...
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
International Journal of Engineering Research and Development (IJERD)
A02610104
A novel medical image segmentation and classification using combined feature ...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
IRJET- Supervised Learning Approach for Flower Images using Color, Shape and ...
Paper id 42201618
IRJET - Survey on Clustering based Categorical Data Protection
IDENTIFICATION AND CLASSIFICATION OF POWDER MICROSCOPIC IMAGES OF INDIAN HERB...
Decision Tree Classifiers to determine the patient’s Post-operative Recovery ...
Ad

Viewers also liked (20)

PDF
Data mining techniques a survey paper
PDF
Research issues and priorities in the field of
PDF
Enhancement of power quality by unified power quality conditioner with fuzzy ...
PDF
Experimental behavior of circular hsscfrc filled steel
PDF
Devlopement of the dynamic resistance measurement (drm) method for condition ...
PDF
Design of a modified leaf spring with an integrated damping system for added ...
PDF
Experimenal investigation of performance and
PDF
Pounding problems in urban areas
PDF
Preliminary study of multi view imaging for accurate
PDF
Computerized spoiled tomato detection
PDF
Performance analysis of new proposed window for
PDF
A novel tool for stereo matching of images
PDF
Effect of fiber distance on various sac ocdma detection techniques
PDF
Behavior of r.c.c. beam with rectangular opening
PDF
An analytical study on test standards for assessment
PDF
Image segmentation based on color
PDF
Design and development of mechanical power amplifier
PDF
Migration of application schema to windows azure
PDF
A novel graphical password approach for accessing cloud & data verification
PDF
Matlab simulink based digital protection of
Data mining techniques a survey paper
Research issues and priorities in the field of
Enhancement of power quality by unified power quality conditioner with fuzzy ...
Experimental behavior of circular hsscfrc filled steel
Devlopement of the dynamic resistance measurement (drm) method for condition ...
Design of a modified leaf spring with an integrated damping system for added ...
Experimenal investigation of performance and
Pounding problems in urban areas
Preliminary study of multi view imaging for accurate
Computerized spoiled tomato detection
Performance analysis of new proposed window for
A novel tool for stereo matching of images
Effect of fiber distance on various sac ocdma detection techniques
Behavior of r.c.c. beam with rectangular opening
An analytical study on test standards for assessment
Image segmentation based on color
Design and development of mechanical power amplifier
Migration of application schema to windows azure
A novel graphical password approach for accessing cloud & data verification
Matlab simulink based digital protection of
Ad

Similar to Survey on semi supervised classification methods and (20)

PDF
IRJET-Improvement and Enhancement in Emergency Medical Services using IOT
PPTX
Semi-supervised Learning Survey - 20 years of evaluation
PDF
Supervised Machine Learning: A Review of Classification ...
PPTX
UNIT 3: Data Warehousing and Data Mining
PDF
A survey of modified support vector machine using particle of swarm optimizat...
PDF
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...
PPTX
Data mining
PDF
Single to multiple kernel learning with four popular svm kernels (survey)
PDF
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
PDF
Study on Relavance Feature Selection Methods
PPT
Text categorization
PPT
Introduction to Machine Learning Aristotelis Tsirigos
PDF
IJCSI-10-6-1-288-292
PDF
IRJET- Supervised Learning Classification Algorithms Comparison
PDF
IRJET- Supervised Learning Classification Algorithms Comparison
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
Classification Techniques: A Review
PDF
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
PPT
[ppt]
PPT
[ppt]
IRJET-Improvement and Enhancement in Emergency Medical Services using IOT
Semi-supervised Learning Survey - 20 years of evaluation
Supervised Machine Learning: A Review of Classification ...
UNIT 3: Data Warehousing and Data Mining
A survey of modified support vector machine using particle of swarm optimizat...
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...
Data mining
Single to multiple kernel learning with four popular svm kernels (survey)
A Kernel Approach for Semi-Supervised Clustering Framework for High Dimension...
Study on Relavance Feature Selection Methods
Text categorization
Introduction to Machine Learning Aristotelis Tsirigos
IJCSI-10-6-1-288-292
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms Comparison
International Journal of Engineering Research and Development (IJERD)
Classification Techniques: A Review
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
[ppt]
[ppt]

More from eSAT Publishing House (20)

PDF
Likely impacts of hudhud on the environment of visakhapatnam
PDF
Impact of flood disaster in a drought prone area – case study of alampur vill...
PDF
Hudhud cyclone – a severe disaster in visakhapatnam
PDF
Groundwater investigation using geophysical methods a case study of pydibhim...
PDF
Flood related disasters concerned to urban flooding in bangalore, india
PDF
Enhancing post disaster recovery by optimal infrastructure capacity building
PDF
Effect of lintel and lintel band on the global performance of reinforced conc...
PDF
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
PDF
Wind damage to buildings, infrastrucuture and landscape elements along the be...
PDF
Shear strength of rc deep beam panels – a review
PDF
Role of voluntary teams of professional engineers in dissater management – ex...
PDF
Risk analysis and environmental hazard management
PDF
Review study on performance of seismically tested repaired shear walls
PDF
Monitoring and assessment of air quality with reference to dust particles (pm...
PDF
Low cost wireless sensor networks and smartphone applications for disaster ma...
PDF
Coastal zones – seismic vulnerability an analysis from east coast of india
PDF
Can fracture mechanics predict damage due disaster of structures
PDF
Assessment of seismic susceptibility of rc buildings
PDF
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
PDF
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Likely impacts of hudhud on the environment of visakhapatnam
Impact of flood disaster in a drought prone area – case study of alampur vill...
Hudhud cyclone – a severe disaster in visakhapatnam
Groundwater investigation using geophysical methods a case study of pydibhim...
Flood related disasters concerned to urban flooding in bangalore, india
Enhancing post disaster recovery by optimal infrastructure capacity building
Effect of lintel and lintel band on the global performance of reinforced conc...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Shear strength of rc deep beam panels – a review
Role of voluntary teams of professional engineers in dissater management – ex...
Risk analysis and environmental hazard management
Review study on performance of seismically tested repaired shear walls
Monitoring and assessment of air quality with reference to dust particles (pm...
Low cost wireless sensor networks and smartphone applications for disaster ma...
Coastal zones – seismic vulnerability an analysis from east coast of india
Can fracture mechanics predict damage due disaster of structures
Assessment of seismic susceptibility of rc buildings
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...

Recently uploaded (20)

PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
Geodesy 1.pptx...............................................
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
additive manufacturing of ss316l using mig welding
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Lecture Notes Electrical Wiring System Components
PDF
PPT on Performance Review to get promotions
PPTX
bas. eng. economics group 4 presentation 1.pptx
PDF
Well-logging-methods_new................
PPTX
web development for engineering and engineering
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Sustainable Sites - Green Building Construction
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
573137875-Attendance-Management-System-original
Geodesy 1.pptx...............................................
Operating System & Kernel Study Guide-1 - converted.pdf
Internet of Things (IOT) - A guide to understanding
additive manufacturing of ss316l using mig welding
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Embodied AI: Ushering in the Next Era of Intelligent Systems
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Arduino robotics embedded978-1-4302-3184-4.pdf
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Lecture Notes Electrical Wiring System Components
PPT on Performance Review to get promotions
bas. eng. economics group 4 presentation 1.pptx
Well-logging-methods_new................
web development for engineering and engineering
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Sustainable Sites - Green Building Construction

Survey on semi supervised classification methods and

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 02 Issue: 12 | Dec-2013, Available @ http://www.ijret.org 574 SURVEY ON SEMI SUPERVISED CLASSIFICATION METHODS AND FEATURE SELECTION Neethu Innocent 1 , Mathew Kurian 2 1 PG student, Department of Computer Science and Technology, Karunya University, Tamil Nadu, India, neethucii@gmail.com 2 Assistant Professor, computer science and technology ,Karunya University, Tamil Nadu, India, mathewk80@karunya.edu Abstract Data mining also called knowledge discovery is a process of analyzing data from several perspective and summarize it into useful information. It has tremendous application in the area of classification like pattern recognition, discovering several disease type, analysis of medical image, recognizing speech, for identifying biometric, drug discovery etc. This is a survey based on several semisupervised classification method used by classifiers , in this both labeled and unlabeled data can be used for classification purpose.It is less expensive than other classification methods . Different techniques surveyed in this paper are low density separation approach, transductive SVM, semi-supervised based logistic discriminate procedure, self training nearest neighbour rule using cut edges, self training nearest neighbour rule using cut edges. Along with classification methods a review about various feature selection methods is also mentioned in this paper. Feature selection is performed to reduce the dimension of large dataset. 
After reducing attribute the data is given for classification hence the accuracy and performance of classification system can be improved .Several feature selection method include consistency based feature selection, fuzzy entropy measure feature selection with similarity classifier, Signal to noise ratio,Positive approximation. So each method has several benefits. Index Terms: Semisupervised classification, Transductive support vector machine, Feature selection, unlabeled samples --------------------------------------------------------------------***---------------------------------------------------------------------- 1.INTRODUCTION Data mining is a process of extracting information from large datasets and converts it into understandable format. It has major role in the area classification .Various data mining technique can be used for classification of several diseases It will helps for better diagnosis and treatment. There are several classification methods to used by classifier. By using supervised classification method, only labeled data can be used but it is very expensive and difficult to obtain. This paper mainly concentrated on semi supervised classification method which use both labeled and unlabeled data[8]. Several semi supervised classification methods are transductive support vector machine, recursively partition model, cut edge and nearest neighbour rule, low density separation approach. One major problem faced in the classification is due to large number of features of dataset. if there is thousands of features then it will affect the performance of classifier. This is one of the challenge in machine learning technique and is called feature selection[1],[7].By using this method only selected features will taken for classification so dimension of data will reduced, computational cost can be reduced, performance of classification can be increased . 
Several feature selection methods included in this paper are consistency-based feature selection, signal-to-noise ratio, positive approximation based on rough set theory, and fuzzy entropy measure feature selection with a similarity classifier.

2. SEMI-SUPERVISED CLASSIFICATION
In semi-supervised classification, both labeled and unlabeled samples are used. Unlabeled samples are obtained very easily, but labeled samples are very expensive.

2.1. Transductive SVM
The TSVM is an iterative algorithm that includes unlabeled samples in the training phase. Ujjwal Maulik et al. proposed a transductive procedure in which transductive samples are selected through a filtering process of the unlabeled data [1]. The algorithm takes both labeled and unlabeled samples as input. It starts by training an SVM classifier on a working set T(0), which is initially equal to the labeled set. Unlabeled data that fall into the margin carry more information; data falling on the negative side are called negative transductive samples, and unlabeled data falling on the positive side are called positive
transductive samples. Samples with accurate labels, informative samples, and samples near the margin are selected; those residing on the upper side are assigned +1 and those on the lower side -1. The selected transductive samples are added to the training set. By using this method the accuracy of classification is increased and the cost is also reduced.

2.2. Semi-Supervised Logistic Discriminant Procedure
Shuichi Kawano et al. proposed a nonlinear semi-supervised logistic discriminant procedure based on Gaussian basis expansions with graph-based regularization [9]. The graph Laplacian used in the regularization term is one of the major techniques in graph-based regularization; it is built from the degree matrix and the weighted adjacency matrix, where the weight matrix M is an n x n matrix. To select the values of the several tuning parameters, the authors derive model selection criteria from Bayesian and information-theoretic approaches. Graph Laplacians are easily applied to high-dimensional or complex datasets containing both labeled and unlabeled data, and this method also reduces the prediction error rate.

2.3. Self-Training Nearest Neighbour Rule Using Cut Edges
One common technique in semi-supervised learning is self-training. First the classifier is trained with labeled samples; it then labels unlabeled samples, adds them to the training set, and so teaches itself through its own predictions. This method can cause several problems, such as misclassification, noise in the labeled data, and error reinforcement.
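The generic self-training loop just described can be sketched as follows. This is a toy 1-D nearest-neighbour version, not the cut-edge method of [2]; the helper names and the confidence threshold are assumptions:

```python
# Toy self-training: a 1-nearest-neighbour classifier repeatedly
# pseudo-labels the unlabeled points it is most confident about
# (closest to an already-labeled point) and adds them to the training set.

def nn_predict(train, point):
    """Return (label, distance) of the closest labeled example (1-D)."""
    x, y = min(train, key=lambda xy: abs(xy[0] - point))
    return y, abs(x - point)

def self_train(labeled, unlabeled, threshold=1.0):
    labeled, unlabeled = list(labeled), list(unlabeled)
    changed = True
    while changed and unlabeled:
        changed = False
        for p in list(unlabeled):
            label, dist = nn_predict(labeled, p)
            if dist <= threshold:          # confident enough: pseudo-label it
                labeled.append((p, label))
                unlabeled.remove(p)
                changed = True
    return labeled

# Two 1-D clusters around 0 and 10, with only one seed label per cluster.
model = self_train([(0.0, "neg"), (10.0, "pos")], [0.5, 1.2, 9.4, 8.9])
```

The error-reinforcement problem mentioned above is visible here: one wrong pseudo-label early in the loop would drag later predictions with it, which is what the cut-edge mechanism is designed to correct.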
Y. Wang et al. proposed a method to solve these problems, called the self-training nearest neighbour rule using cut edges [2]. The idea is to pool testing samples and training samples in an iterative way, with two aspects:
• First, the maximum number of testing samples must be classified as positive or negative, and the extreme output must carry lower risk. Once the majority class is obtained, a label modification mechanism is constructed that utilizes cut edges in the relative neighborhood graph.
• Second, cut edge weights are employed for the semi-supervised classification technique.
By using cut edges, this method reduces classification error and error reinforcement, and also improves performance.

2.4. Low Density Separation Approach
The low density separation (LDS) algorithm uses the cluster assumption, and both labeled and unlabeled data are used. Under the cluster assumption, the decision boundary must lie in a low density region and should not cross a high density region [1],[4]. Two procedures keep the decision boundary in the low density regions between clusters. First, a graph-based distance is computed that gives importance to low density regions. Second, high density regions are avoided and the decision boundary is obtained by optimizing the transductive SVM objective function. These two procedures are then combined. LDS can thus achieve higher accuracy than traditional semi-supervised methods and the SVM.

Table 1: Comparison table for semi-supervised classifications

Method | Benefit
Transductive SVM | Accuracy of classification increased
Semi-supervised logistic discriminant procedure | Reduced error rate
Self-training nearest neighbour rule using cut edges | Improved performance
Low density separation approach | Accuracy of classifier increased

3. FEATURE SELECTION
Most datasets are of huge size, so feature selection is used to reduce the dimensionality of large data [10]. After reducing the features, the sample data is given to the classifier. For example, a gene expression dataset is large, with data for thousands of genes; feature selection reduces the size of the data and can increase the performance of the classifier. Several feature selection methods are explained in this section.

3.1. Consistency-Based Feature Selection
In this method only relevant features are selected, and an inconsistency measure is calculated [1],[5]. For example, a pattern that has more than one matching instance, with the instances residing in different classes, is considered inconsistent, and inconsistent features are removed. The forward greedy algorithm is used in this method; its steps, shown in Fig. 1 [6], are:
• Generate the next candidate feature subset with the generation procedure.
• Evaluate the candidate subset with the evaluation function.
• Decide on a stopping criterion for when to stop.
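To make the inconsistency measure concrete, a minimal sketch on toy data might look like the following; the helper names are assumptions, not from [5]:

```python
# Sketch of an inconsistency rate: project samples onto a candidate
# feature subset, group identical patterns, and count every instance
# beyond the majority class of its pattern as inconsistent.
from collections import defaultdict

def inconsistency_rate(samples, labels, subset):
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    inconsistent = sum(len(ys) - max(ys.count(c) for c in set(ys))
                       for ys in groups.values())
    return inconsistent / len(samples)

X = [[0, 1], [0, 1], [1, 0], [1, 0]]
y = ["a", "b", "c", "c"]
# Pattern (0, 1) matches two instances with different classes, so it is
# inconsistent even with all features; dropping feature 1 leaves the
# rate unchanged, which marks feature 1 as redundant on this toy data.
rate_full = inconsistency_rate(X, y, [0, 1])
rate_f0 = inconsistency_rate(X, y, [0])
```

A forward greedy search would then grow the subset one feature at a time, stopping once the subset's inconsistency rate matches that of the full feature set.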
Fig 1. Forward greedy algorithm for attribute reduction

3.2. Fuzzy Entropy Measure Feature Selection with Similarity Classifier
In this method, features are selected with a fuzzy entropy measure and with the help of a similarity classifier [11]. The main principle behind the similarity classifier is to create an ideal vector Vj = (Vj(f1), ..., Vj(fn)) that represents class j. This ideal vector is calculated from the samples Y = (Y(f1), ..., Y(fn)) belonging to class Cj. After the ideal vector is calculated, the similarity between vector V and sample Y is calculated, denoted S(Y,V); the similarity value is 1 in the ideal case where the sample belongs to class j. From the similarity values the entropy is calculated in order to select features: if the similarity value is high, the entropy value is very low, and vice versa. The fuzzy entropy value of each feature is calculated using the sample vector and the ideal vector. Features with the highest entropy values are removed and those with the lowest entropy values are selected. Classification accuracy is increased and computational time is reduced.

3.3. Signal-to-Noise Ratio
The signal-to-noise ratio is calculated using the equation [1],[3]

SNR = (m1 - m2) / (σ1 + σ2)     (1)

where m1 and m2 are the class means and σ1 and σ2 are the class standard deviations. For example, taking gene expression data, the SNR value of each gene is calculated from its expression levels, the genes are arranged in descending order of SNR, and the top ten features are selected. This method enhances the accuracy of classification.

3.4. Positive Approximation
Existing heuristic attribute reduction methods have several limitations. Yuhua Qian proposed a method called positive approximation, based on rough set theory [6].
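Returning briefly to the SNR score of Sec. 3.3, its feature-ranking use can be sketched on toy two-class data; the helper names are assumptions, not from [3]:

```python
# Toy SNR-based feature ranking: score each feature by
# (m1 - m2) / (s1 + s2) across the two classes, then order features
# by the magnitude of that score, best first.
import statistics

def snr(values_a, values_b):
    """Signal-to-noise score for one feature's per-class values."""
    m1, m2 = statistics.mean(values_a), statistics.mean(values_b)
    s1, s2 = statistics.stdev(values_a), statistics.stdev(values_b)
    return (m1 - m2) / (s1 + s2)

def rank_by_snr(class_a, class_b):
    """class_a, class_b: per-class lists of feature vectors.
    Returns feature indices sorted by |SNR|, most discriminative first."""
    cols_a, cols_b = list(zip(*class_a)), list(zip(*class_b))
    scores = [abs(snr(a, b)) for a, b in zip(cols_a, cols_b)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

pos = [[5.0, 1.0], [6.0, 1.2], [7.0, 0.8]]
neg = [[1.0, 1.1], [2.0, 0.9], [3.0, 1.0]]
order = rank_by_snr(pos, neg)   # feature 0 separates the classes best
```

For gene expression data, one would then keep the top-ranked genes (the paper keeps the top ten) and pass only those columns to the classifier.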
The main objective of this method is to select some properties of the original data without any redundancy. There can be more than one reduct, but only one reduced attribute set is needed, so a heuristic algorithm is proposed based on a significance measure of the attributes. This method is similar to a greedy search algorithm, with a modification: a significance measure is calculated, and until the reduct set is obtained, the attribute with the maximum significance value is added at each stage. The results show that positive approximation is an effective accelerator for attribute reduction. There are three speed-up factors in positive-approximation-based feature selection:
• More than one attribute can be selected in each loop, which helps restrict the result of the reduction algorithm.
• Computational time is reduced due to the attribute significance measure.
• The size of the data is reduced, so the time taken to compute the stopping criterion is also reduced to a minimum.

4. CONCLUSION
This paper reviewed various semi-supervised classification methods. Each classification method has its own advantages and disadvantages. The low density separation approach can overcome the problems of the traditional SVM. The transductive SVM can increase classification accuracy. The logistic discriminant procedure and the self-training nearest neighbour rule using cut edges can reduce the error rate and misclassification. A survey of several feature selection methods was also analyzed. Consistency-based feature selection removes inconsistent features and increases performance. Fuzzy entropy measure feature selection with a similarity classifier can increase accuracy and reduce computational time. In the signal-to-noise ratio method, features are selected based on the SNR value of each attribute, arranged in descending order, and the top ten features are selected.
By using the positive approximation reduction method, the selected attributes will not have redundant values.

REFERENCES
[1] Ujjwal Maulik, Anirban Mukhopadhyay, and Debasis Chakraborty, "Gene-expression-based cancer subtypes prediction through feature selection and transductive SVM," IEEE Transactions on Biomedical Engineering, vol. 60, no. 4, April 2013.
[2] Yu Wang, Xiaoyan Xu, Haifeng Zhao, and Zhongsheng Hua, "Semi-supervised learning based on nearest neighbor rule and cut edges," Knowledge-Based Systems, vol. 23, pp. 547–554, 2010.
[3] Debahuti Mishra and Barnali Sahu, "Feature selection for cancer classification: a signal-to-noise ratio approach," International Journal of Scientific & Engineering Research, vol. 2, issue 4, April 2011.
[4] O. Chapelle and A. Zien, "Semi-supervised classification by low-density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[5] M. Dash and H. Liu, "Consistency based search in feature selection," Artif. Intell., vol. 151, pp. 155–176, 2003.
[6] Y. Qian, J. Liang, W. Pedrycz, and C. Dang, "Positive approximation: An accelerator for attribute reduction in rough set theory," Artif. Intell., vol. 174, pp. 597–618, 2010.
[7] S. Bandyopadhyay, A. Mukhopadhyay, and U. Maulik, "An improved algorithm for clustering gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2859–2865, 2007.
[8] O. Chapelle, V. Sindhwani, and S. S. Keerthi, "Optimization techniques for semi-supervised support vector machines," J. Mach. Learn. Res., vol. 9, pp. 203–233, 2008.
[9] Shuichi Kawano, Toshihiro Misumi, and Sadanori Konishi, "Semi-supervised logistic discrimination via graph-based regularization," Neural Processing Letters, vol. 36, pp. 203–216, 2012.
[10] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, no. 1/2, pp.
245–271, 1997.
[11] Pasi Luukka, "Feature selection using fuzzy entropy measures with similarity classifier," Expert Systems with Applications, vol. 38, pp. 4600–4607, 2011.

BIOGRAPHIES
Neethu Innocent is pursuing her M.Tech in Software Engineering at Karunya University, Tamil Nadu, India. She received her Bachelor's degree in Computer Science and Engineering from MG University, Kerala.
Mathew Kurian finished his M.E in Computer Science and Engineering at Jadavpur University, Kolkata, and is currently working as an Assistant Professor in the Department of Computer Science and Engineering at Karunya University. Previously, he worked as a Software Engineer with Aricent Technologies. He is currently pursuing his PhD in Data Mining.