Poonam Sharma, Int. Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 5, Issue 9 (Part 1), September 2015, pp. 50-55, www.ijera.com
A Combined Approach for Feature Subset Selection and Size
Reduction for High Dimensional Data
Anurag Dwivedi, Poonam Sharma
Assistant Professor, Dept. of Computer Science and Engineering, SRGI, Jhansi
M.Tech Scholar, Dept. of Information Technology, SATI, Vidisha (M.P.)
Abstract: Selection of relevant features from a given feature set is one of the important issues in the fields of data mining and classification. In general a dataset may contain a number of features, but it is not necessary that the whole feature set is important for a particular analysis or decision, because features may share common information and may even be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation, or because of incomplete information about the observed system. In either case the data will contain features that merely increase the processing burden and may ultimately distort the outcome of the analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach that not only removes the unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to measure the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier using 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
Keywords: feature selection, data reduction, clustering, fuzzy clustering, minimum spanning tree.
I. Introduction
Feature selection involves identifying a subset of the most useful features from a large dataset, one that produces results compatible with those of the original whole feature set [18]. Feature selection is a critical subject in data mining, particularly in high-dimensional applications. The selection of relevant features is a complex problem, and finding the ideal subset of variables is viewed as NP-hard [3]. Feature selection can be an extremely useful approach for reducing dimensionality, removing irrelevant data and improving learning accuracy. Large high-dimensional datasets are generally sparse and contain numerous classes/groups. For instance, large text collections in the vector space model frequently contain numerous classes of documents, each described by a huge number of features. This property has become the rule rather than the exception: the clustering of high-dimensional data mostly happens in subspaces of the data, so subspace clustering techniques are needed for high-dimensional data clustering. Numerous subspace clustering techniques have been proposed to handle high-dimensional data; they find clusters in subspaces of the data rather than in the whole data space. These techniques can be broadly categorized into two groups: hard subspace clustering, which searches for the exact subset of features, and soft subspace clustering, which assigns weights to the features.
Numerous high-dimensional datasets are mixtures of data extracted from different perspectives, which introduces features that are unwanted for any specific analysis. In this paper, we propose a new dimension and length reduction method for high-dimensional datasets. The proposed algorithm uses entropy and joint entropy estimation to measure the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset.
II. Literature Review
This section presents a brief review of the related literature on the topic. R. Ruiz et al. [6] proposed hybrid approaches that make it possible to efficiently apply any subset evaluator, with a wrapper model, to the feature subset selection problem for classification tasks. Alexandros Kalousis et al. [10] studied the stability of feature selection algorithms based on the stability of the feature preferences they express, in the form of weight scores, ranks, or a selected feature subset; they proposed a number of measures to quantify the stability of feature preferences and an empirical way to estimate them.
Guangtao Wang et al. [2] proposed a propositional FOIL-rule-based algorithm, FRFS, for selecting feature subsets for high-dimensional data; it not only retains relevant features and excludes irrelevant and redundant ones, but also considers feature interaction. FRFS first combines the features appearing in the antecedents of all FOIL rules, obtaining a candidate feature subset which excludes redundant features and preserves interactive ones. It then identifies and removes irrelevant features by evaluating the features in the candidate subset with a new metric, Cover Ratio, to obtain the final feature subset. Pablo Bermejo et al. [1] addressed supervised wrapper-based feature subset selection in datasets with a very large number of attributes. A fast correlation-based filter solution is presented by Lei Yu et al. [13]: they introduce the novel concept of predominant correlation and propose a fast filter technique which can identify relevant features, as well as redundancy among relevant features, without pairwise correlation analysis. The efficiency and effectiveness of their method are demonstrated through extensive comparisons with other methods on real-world data of high dimensionality. The same authors also proposed a relevance- and redundancy-based technique in [17], showing that feature relevance alone is insufficient for efficient feature selection on high-dimensional data; they define feature redundancy and propose performing explicit redundancy analysis in feature selection.
III. Terminology Explanations
This section explains the terms and operations used in the proposed algorithm.
A. Symmetric Uncertainty:
It is derived from the mutual information by normalizing it with the entropies of the variables, and can be used as a measure of correlation either between two features or between a feature and the target classes. Mathematically it is defined as

$$ SU(X, Y) = \frac{2 \times Gain(X \mid Y)}{H(X) + H(Y)}, \qquad (3.1) $$

where $H(X)$ is the entropy of the variable $X$, calculated as follows:
$$ H(X) = - \sum_{x \in X} p(x) \log_2 p(x), \qquad (3.2) $$
Here $p(x)$ is the probability of occurrence of value $x$ of a feature $f$ with domain $X$, and can be calculated as

$$ p(x) = \frac{\text{occurrences of } x}{\text{size of the dataset}}. \qquad (3.3) $$
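For illustration, the entropy of equation (3.2), with probabilities estimated as in equation (3.3), can be computed in Python as follows. This is our own sketch, not code from the paper, and the function name `entropy` is ours:

```python
import numpy as np

def entropy(x):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a discrete feature column."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()        # p(x) = occurrences of x / size of dataset
    return -np.sum(p * np.log2(p))
```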
B. Information Gain
It represents the mutual information that one variable gains by observing the other. Practically it is measured as the reduction in the entropy of one variable given knowledge of the other; for example, the information gain about a variable $Y$ provided by another variable $X$ is represented as $Gain(Y \mid X)$ and is calculated as

$$ Gain(Y \mid X) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \qquad (3.5) $$
Here $H(Y \mid X)$ is the conditional entropy, interpreted as the entropy remaining in variable $Y$ once the value of another variable $X$ is known; in terms of probabilities it is given as

$$ H(Y \mid X) = - \sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x), \qquad (3.6) $$
where $p(y \mid x)$ is the conditional probability of value $y$ of feature $f_i$ with domain $Y$ given value $x$ of feature $f_j$ with domain $X$. As equation (3.5) shows, information gain is a symmetrical measure, hence $Gain(Y \mid X) = Gain(X \mid Y)$; consequently, by equation (3.1), the symmetric uncertainty is symmetrical as well. The value of symmetric uncertainty varies in the interval [0, 1]: '1' represents complete correlation between the two variables while '0' indicates complete irrelevance.
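Continuing the sketch above, conditional entropy, information gain and symmetric uncertainty follow directly from equations (3.6), (3.5) and (3.1); again, these helper names are ours, not from a published implementation:

```python
import numpy as np

def conditional_entropy(y, x):
    """H(Y|X) = -sum_x p(x) sum_y p(y|x) log2 p(y|x), per equation (3.6)."""
    return sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def information_gain(y, x):
    """Gain(Y|X) = H(Y) - H(Y|X); symmetric in its two arguments."""
    return entropy(y) - conditional_entropy(y, x)

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain / (H(X) + H(Y)), bounded in [0, 1], per equation (3.1)."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(y, x) / denom if denom > 0 else 0.0
```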
C. Fuzzy C-Means Clustering
Clustering is the process of grouping data according to a specific measure, and is generally categorized as hard clustering or soft clustering. Fuzzy clustering falls in the second category: each data point may belong to more than one cluster, and its attachment to each cluster is given by a membership value. The Fuzzy C-Means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. FCM attempts to partition a finite collection of elements $X = \{x_1, x_2, x_3, \dots, x_n\}$ into a collection of $C$ fuzzy clusters with respect to some given criterion. Given a finite set of data, the algorithm returns a list of $C$ cluster centers $V$, such that

$$ V = \{ v_i \}, \quad i = 1, 2, 3, \dots, C $$

and a partition matrix $U$ such that

$$ U = [u_{ij}], \quad i = 1, 2, \dots, C, \; j = 1, 2, \dots, n $$

where $u_{ij}$ is a numerical value in [0, 1] giving the degree to which the element $x_j$ belongs to the $i$-th cluster. The following is a step-by-step description of the FCM algorithm.
Step 1: Select the number of clusters $C$ ($2 \le C \le n$), the exponential weight $\mu$ ($1 < \mu < \infty$), an initial partition matrix $U^0$, and the termination criterion $\epsilon$. Also, set the iteration index $l$ to 0.
Step 2: Calculate the fuzzy cluster centers $\{ v_i^l \mid i = 1, 2, \dots, C \}$ using $U^l$.
Step 3: Calculate the new partition matrix $U^{l+1}$ using $\{ v_i^l \mid i = 1, 2, \dots, C \}$.
Step 4: Calculate the change $\Delta = \lVert U^{l+1} - U^l \rVert = \max_{ij} \lvert u_{ij}^{l+1} - u_{ij}^l \rvert$. If $\Delta > \epsilon$, set $l = l + 1$ and go to Step 2; if $\Delta \le \epsilon$, stop.
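As an illustration of these steps, here is a compact NumPy sketch of the FCM loop (our own illustrative code, with the exponential weight written as `m`); it is not the authors' implementation:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal FCM: X is (n_points, n_features); returns centers V, memberships U."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                             # initial partition matrix U^0
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)         # Step 2: centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                      # guard against zero distances
        U_new = d ** (-2.0 / (m - 1))              # Step 3: membership update
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() <= eps:         # Step 4: termination test
            return V, U_new
        U = U_new
    return V, U
```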
IV. Proposed Algorithm
The proposed algorithm can be explained as follows. Let $D$ be the high-dimensional dataset, expressed together with its target vector $T$ as

$$ D = \begin{bmatrix} d_{11} & d_{12} & d_{13} & \cdots & d_{1n} \\ d_{21} & d_{22} & d_{23} & \cdots & d_{2n} \\ \vdots & & & & \vdots \\ d_{m1} & d_{m2} & d_{m3} & \cdots & d_{mn} \end{bmatrix}, \quad T = \begin{bmatrix} t_c \\ t_c \\ \vdots \\ t_c \end{bmatrix} \qquad (4.1) $$
Hence the dataset has $n$ dimensions and $m$ entries, each mapped to a target class $t_c \in T$, $T = \{t_1, t_2, \dots, t_C\}$, where $C$ is the total number of target classes. The objective of the problem is to find a subset $D'$ of the data $D$ such that $D'$ has dimensions $m' \times n'$ with $m' < m$ and $n' < n$.
In the first step of the algorithm, the relation between each feature and the target class is estimated by calculating the symmetric uncertainty. Let the symmetric uncertainty between the $i$-th feature and the target classes be $SU(F_i, C)$. Since $SU(F_i, C)$ indicates how well feature $F_i$ predicts the target classes, it can be used as a first measure to remove unwanted features, by defining that a feature $F_i$ is important if and only if it satisfies

$$ SU(F_i, C) > \theta, \qquad (4.2) $$

where $\theta$ is a user-defined constant that can be seen as the minimum required relation between a feature and the target class.
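A minimal sketch of this filtering step, reusing the `symmetric_uncertainty` helper from Section III (both function names are ours, not from a published implementation):

```python
import numpy as np

def filter_by_class_relevance(D, t, theta):
    """D: (m, n) data matrix; t: (m,) target classes. Keep features with SU > theta."""
    su_fc = np.array([symmetric_uncertainty(D[:, i], t) for i in range(D.shape[1])])
    keep = np.flatnonzero(su_fc > theta)    # indices satisfying equation (4.2)
    return keep, su_fc[keep]
```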
After performing this operation the number of features changes; let it be $n_1$, so that

$$ D_1 = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1 n_1} \\ d_{21} & d_{22} & \cdots & d_{2 n_1} \\ \vdots & & & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{m n_1} \end{bmatrix}, \quad T = \begin{bmatrix} t_c \\ t_c \\ \vdots \\ t_c \end{bmatrix}, \quad n_1 < n \qquad (4.3) $$
In the second step of the algorithm, features which share the same information are detected by calculating the symmetric uncertainty between each pair of features, $SU(F_i, F_j)$, $i \ne j$. The idea is similar to the first step: features with higher values of $SU(F_i, F_j)$ may be considered identical. An $SU_{FF}$ matrix is then calculated as

$$ SU_{FF} = \begin{bmatrix} \sim & SU(F_1, F_2) & \cdots & SU(F_1, F_{n_1}) \\ SU(F_2, F_1) & \sim & \cdots & SU(F_2, F_{n_1}) \\ \vdots & & & \vdots \\ SU(F_{n_1}, F_1) & SU(F_{n_1}, F_2) & \cdots & \sim \end{bmatrix} \qquad (4.4) $$
The $SU_{FF}$ matrix is used to construct the Minimum Spanning Tree (MST), with every element of $SU_{FF}$ taken as the bond strength between the corresponding features. However, the MST connects every pair of features whose relation strength is greater than zero. To eliminate loosely connected features, another condition is applied: if a feature has a greater relation with the target classes ($SU_{FC}$) than with another feature ($SU_{FF}$), the link between these features is removed by modifying $SU_{FF}$ as below:

$$ SU(F_i, F_j) = 0 \quad \text{if} \quad SU(F_i, C) > SU(F_i, F_j) \qquad (4.5) $$

After modifying $SU_{FF}$ according to equation (4.5), the MST is reconstructed; the features still found connected are considered similar and are replaced by a single representative feature, selected according to their target-class relation ($SU_{FC}$) as follows.
Let $F_t = \{F_a, F_b, F_c\}$ be a set of connected features in the MST; then the representative feature $F_r$ is selected as

$$ F_r = F_{t_i}, \quad i = \operatorname{argmax}_i \{ SU(F_a, C), SU(F_b, C), SU(F_c, C) \} \qquad (4.6) $$
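A sketch of this grouping step using SciPy (illustrative only; the paper does not publish code). Since `minimum_spanning_tree` minimizes total edge weight, we assume the conventional $1 - SU$ distance between features:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def group_similar_features(su_ff, su_fc):
    """su_ff: (n, n) feature-feature SU matrix; su_fc: (n,) feature-class SU values."""
    dist = 1.0 - su_ff                 # assumed distance: high SU -> short edge
    np.fill_diagonal(dist, 0.0)        # zero entries are treated as absent edges
    mst = minimum_spanning_tree(dist).toarray()
    # Equation (4.5): cut edges where the class relation dominates the pairwise one.
    for a, b in zip(*np.nonzero(mst)):
        if su_fc[a] > su_ff[a, b]:
            mst[a, b] = 0.0
    # Features still connected form one group (eq. 4.6): keep the member whose SU
    # with the class is largest; isolated features become singleton groups.
    _, labels = connected_components(mst, directed=False)
    reps = [np.flatnonzero(labels == g)[np.argmax(su_fc[labels == g])]
            for g in np.unique(labels)]
    return np.array(reps)
```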
We now have the most useful feature set, and the new dataset can be presented as $D_{m \times n''}$, where $n'' < n$; the feature dimension is thus reduced, although the number of entries in the dataset is still the same and needs to be reduced as well. In the proposed system the third step serves this purpose: it groups similar data points using fuzzy c-means clustering, and the points having a large membership value in any one cluster are replaced by that cluster's centroid. Let fuzzy c-means cluster the given data into $k$ groups; after clustering, the data can be described by
$$ \text{Membership Matrix } M = \begin{bmatrix} M_{11} & M_{12} & M_{13} & \cdots & M_{1m} \\ M_{21} & M_{22} & M_{23} & \cdots & M_{2m} \\ \vdots & & & & \vdots \\ M_{k1} & M_{k2} & M_{k3} & \cdots & M_{km} \end{bmatrix} \qquad (4.7) $$

where $M_{ij}$ is the membership of the $j$-th entry (data point) in the $i$-th cluster. Each entry is then replaced by the centroid of the cluster in which it has the highest membership:

$$ D_1(j) = C_{i^*}, \quad i^* = \operatorname{argmax}_{1 \le i \le k} M_{ij} \qquad (4.8) $$
The number of points substituted depends upon the user-defined minimum merging similarity.
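A sketch of this merging step, built on the `fuzzy_c_means` function sketched in Section III-C (the threshold name `min_sim` is ours for the user-defined minimum merging similarity):

```python
import numpy as np

def merge_similar_points(D, t, k, min_sim=0.9):
    """Replace strongly-clustered rows of D by their centroid, then deduplicate."""
    V, U = fuzzy_c_means(D, c=k)            # U: (k, m) membership matrix
    best = U.argmax(axis=0)                 # per-point argmax of equation (4.8)
    strong = U.max(axis=0) >= min_sim       # points eligible for merging
    D_out = D.copy().astype(float)
    D_out[strong] = V[best[strong]]
    # Rows that collapsed onto the same centroid (and class) become duplicates,
    # so deduplicating shrinks the number of entries.
    rows = np.unique(np.column_stack([D_out, t.astype(float)]), axis=0)
    return rows[:, :-1], rows[:, -1]
```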
The complete algorithm can be described in the following steps:
Step 1: Calculate the $SU_{FF}$ matrix and the $SU_{FC}$ matrix for the given dataset $D$ with target classes $T$.
Step 2: On the basis of the $SU_{FC}$ values, reject the features whose $SU_{FC}$ value is less than the threshold $\theta$.
Step 3: Recalculate the $SU_{FF}$ matrix for the reduced feature set, $SU_{FF}'$.
Step 4: Construct the minimum spanning tree (MST).
Step 5: Remove the branches of the MST having $SU_{FC}(i, C) > SU_{FF}(i, j)$.
Step 6: Select the isolated features from the MST and generate a representative for each group of non-isolated features.
Step 7: Form the new dataset with only the features selected in Step 6.
Step 8: Perform fuzzy c-means clustering and take each cluster centroid as the representative for the data points having high membership in it.
Step 9: Train the classifier on the dataset obtained and test its accuracy.
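Putting the sketches together, a hypothetical driver for Steps 1-8 could look like the following; all helper names come from the illustrative snippets above, and the defaults for `theta`, `k` and `min_sim` are arbitrary placeholders, not values from the paper:

```python
import numpy as np

def reduce_dataset(D, t, theta=0.1, k=50, min_sim=0.9):
    # Steps 1-2: class-relevance filtering via SU_FC.
    keep, su_fc = filter_by_class_relevance(D, t, theta)
    D1 = D[:, keep]
    # Step 3: recompute the feature-feature SU matrix on the surviving features.
    n1 = D1.shape[1]
    su_ff = np.zeros((n1, n1))
    for i in range(n1):
        for j in range(i + 1, n1):
            su_ff[i, j] = su_ff[j, i] = symmetric_uncertainty(D1[:, i], D1[:, j])
    # Steps 4-7: MST grouping and representative selection.
    reps = group_similar_features(su_ff, su_fc)
    D2 = D1[:, reps]
    # Step 8: fuzzy c-means based row reduction.
    return merge_similar_points(D2, t, k, min_sim)
```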
V. Simulation Results
In this section we present the experimental results in terms of the proportion of selected features, the time to obtain the feature subset, and the classification accuracy. The dataset used has 36 features, all with a domain size of 2, and the target class also has a domain size of 2. The total number of entries in the dataset is 3196. To test the quality of the selected features, a classification test is performed using a probabilistic neural network.
Table 1: Dataset size vs. processing time.

Dataset Size (%)    Processing Time (seconds)
40                  3.6514
60                  4.6715
80                  5.5949
100                 6.5218
Figure 1: Plot of the Table 1 data (impact of dataset size on processing time).
Table 2: Dataset size vs. reduced data size for the previous and proposed methods.

Dataset Size (%)    Previous Method    Proposed Method
40                  1278               1151
60                  1918               1726
80                  2557               2301
100                 3196               2876
Figure 2: Plot of the Table 2 comparison of data size reduction for the previous and proposed methods.
Table 3: Dataset size vs. number of selected features for the proposed method.

Dataset Size (%)    Number of Features
40                  14
60                  17
80                  15
100                 13
Figure 3: Plot of dataset size vs. number of selected features (Table 3).
Table 4: Dataset size vs. classification accuracy (%) for the previous and proposed methods.

Dataset Size (%)    Previous Method    Proposed Method
40                  74.5               77.8
60                  80.2               82.4
80                  77.6               83.1
100                 81.9               83.3
VI. Conclusion
In this paper we presented a novel feature subset selection algorithm based on information theory and fuzzy clustering, combined with data size reduction, which is applicable especially to high-dimensional data. The algorithm is developed not only to identify and remove irrelevant and redundant features, but also to deal with interactive features. We first defined relevant, redundant and interactive features based on symmetric uncertainty; based on these definitions we then presented the feature selection algorithm, which involves four steps: (1) redundant feature exclusion and interactive feature preservation, (2) irrelevant feature identification, (3) minimum spanning tree formation for grouping similar features, and (4) fuzzy c-means clustering for data size reduction. We also explained the concepts behind redundant and irrelevant features, and the preservation of interactive features, with the appropriate expressions. Finally, the tests with real-world datasets show that the proposed algorithm has a moderate reduction capability; meanwhile it also reduces the data size and obtains the best average accuracies among the neural-network-based classification algorithms considered.
Figure 4: Plot of the Table 4 data (classification accuracy of the previous and proposed methods).
Figure 5: Final minimum spanning tree generated for 100% of the data samples.
References
[1] Pablo Bermejo, Luis de la Ossa, José A. Gámez, José M. Puerta, "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking", Knowledge-Based Systems, Vol. 25, Issue 1, February 2012, pp. 35-44, Special Issue on New Trends in Data Mining.
[2] Guangtao Wang, Qinbao Song, Baowen Xu, Yuming Zhou, "Selecting feature subset for high dimensional data via the propositional FOIL rules", Pattern Recognition 46 (2013) 199-214.
[3] Sebastián Maldonado, Richard Weber, Fazel Famili, "Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines", Information Sciences 286 (2014) 228-246.
[4] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Jennifer Spielman, Lorraine O. Ramig, "Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease", IEEE Transactions on Biomedical Engineering, Vol. 59, No. 5, May 2012, pp. 1264-1271.
[5] Alok Sharma, Seiya Imoto, Satoru Miyano, "A top-r Feature Selection Algorithm for Microarray Gene Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, Issue 3, March 2012.
[6] R. Ruiz, J.C. Riquelme, J.S. Aguilar-Ruiz, M. García-Torres, "Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches", Expert Systems with Applications 39 (2012) 11094-11102.
[7] Xiaojun Chen, Yunming Ye, Xiaofei Xu, Joshua Zhexue Huang, "A feature group weighting method for subspace clustering of high-dimensional data", Pattern Recognition 45 (2012) 434-446.
[8] Yuzong Liu, Kai Wei, Katrin Kirchhoff, Yisong Song, Jeff Bilmes, "Submodular Feature Selection for High-Dimensional Acoustic Score Spaces", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 26-31 May 2013.
[9] Mohak Shah, Mario Marchand, Jacques Corbeil, "Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 1, November 2011.
[10] Alexandros Kalousis, Julien Prados, Melanie Hilario, "Stability of Feature Selection Algorithms: A Study on High Dimensional Spaces", Knowledge and Information Systems, Vol. 12, Issue 1, May 2007.
[11] Qiang Cheng, Hongbo Zhou, Jie Cheng, "The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multi-Class Classification with Applications to High-Dimensional Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, Issue 6, April 2011.
[12] Yongjun Piao, Minghao Piao, Kiejung Park, Keun Ho Ryu, "An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data", Vol. 28, No. 24, 2012, pp. 3306-3315.
[13] Lei Yu, Huan Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[14] Lance Parsons, Ehtesham Haque, Huan Liu, "Subspace Clustering for High Dimensional Data: A Review", ACM SIGKDD Explorations Newsletter, Vol. 6, Issue 1, June 2004.
[15] Daphne Koller, Mehran Sahami, "Toward Optimal Feature Selection", Technical Report, Stanford InfoLab.
[16] Isabelle Guyon, André Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research 3 (2003) 1157-1182.
[17] Lei Yu, Huan Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", Journal of Machine Learning Research 5 (2004) 1205-1224.
[18] Qinbao Song, Jingjie Ni, Guangtao Wang, "A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, January 2013.