Poonam Sharma, Int. Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 5, Issue 9 (Part 1), September 2015, pp. 50-55, www.ijera.com
A Combined Approach for Feature Subset Selection and Size
Reduction for High Dimensional Data
Anurag Dwivedi, Poonam Sharma
Assistant Professor, Dept. of Computer Science and Engineering, SRGI, Jhansi
M.Tech Scholar, Dept. of Information Technology, SATI, Vidisha (M.P.)
Abstract: Selection of relevant features from a given feature set is one of the important issues in the fields of data mining and classification. In general a dataset may contain a number of features, but it is not necessary that the whole feature set is important for a particular analysis or decision, because features may share common information and may even be completely irrelevant to the processing at hand. This generally happens because of improper selection of features during dataset formation, or because of incomplete information about the observed system. In either case the data will contain features that merely increase the processing burden and may ultimately distort the outcome of the analysis. For these reasons, methods are required to detect and remove such features; hence in this paper we present an efficient approach that not only removes the unimportant features but also reduces the size of the complete dataset. The proposed algorithm uses information theory to measure the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier using 35 publicly available real-world high-dimensional datasets, and the results show that the presented algorithm not only reduces the feature set and data length but also improves the performance of the classifier.
Keywords: feature selection, data reduction, clustering, fuzzy clustering, minimum spanning tree.
I. Introduction
Feature selection involves identifying a subset of the most useful features from a large dataset, one that produces results compatible with those of the original whole feature set [18]. Feature selection is a critical subject in data mining, particularly in high-dimensional applications. The selection of relevant features is a complex problem, and finding the ideal subset of variables is viewed as NP-hard [3]. Feature selection can be an extremely useful approach for reducing dimensionality, removing irrelevant data and improving learning accuracy. Large high-dimensional datasets are generally sparse and contain numerous classes/groups. For instance, large text collections in the vector space model frequently contain numerous classes of documents, each described by a huge number of features. This property has become the rule rather than the exception: the clustering of high-dimensional data mostly happens in subspaces of the data, so subspace clustering techniques are needed for high-dimensional data clustering. Numerous subspace clustering techniques have been proposed to handle high-dimensional data; they find clusters in subspaces of the data rather than in the whole data space. These techniques can be broadly categorized into two groups: hard subspace clustering, which searches for the exact subset of features, and soft subspace clustering, which assigns weights to the features.
Numerous high-dimensional datasets are mixtures of data extracted from different perspectives, which introduces features that are unwanted for any specific analysis. In this paper, we propose a new dimension and length reduction method for high-dimensional datasets. The proposed algorithm uses entropy and joint entropy estimation to measure the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset.
II. Literature Review
This section presents a brief review of the related literature on the topic. R. Ruiz et al. [6] proposed hybrid approaches that make it possible to efficiently apply any subset evaluator, with a wrapper model, to the feature subset selection problem for classification tasks. Alexandros Kalousis et al. [10] studied the stability of feature selection algorithms based on the stability of the feature preferences they express, in the form of weight scores, ranks, or a selected feature subset; they proposed a number of measures to quantify the stability of feature preferences and an empirical way to estimate them.
Guangtao Wang et al. [2] proposed a propositional FOIL-rule-based algorithm, FRFS, for selecting feature subsets for high-dimensional data; it not only retains relevant features and excludes irrelevant and redundant ones, but also considers feature interaction. FRFS first combines the features appearing in the antecedents of all FOIL rules, obtaining a candidate feature subset which excludes redundant features and preserves interactive ones. It then identifies and removes irrelevant features by evaluating the features in the candidate subset with a new metric, Cover Ratio, to obtain the final feature subset. Pablo Bermejo et al. [1] addressed supervised wrapper-based feature subset selection in datasets with a very large number of attributes. A fast correlation-based filter solution is presented by Lei Yu et al. [13]: they introduce the novel concept of predominant correlation and propose a fast filter technique which can identify relevant features, as well as redundancy among relevant features, without pairwise correlation analysis. The efficiency and effectiveness of their method are demonstrated through extensive comparisons with other methods on real-world data of high dimensionality. The same authors also proposed a relevance- and redundancy-based technique in [17], showing that feature relevance alone is insufficient for efficient feature selection on high-dimensional data; they define feature redundancy and propose performing explicit redundancy analysis in feature selection.
III. Terminology Explanations
This section explains the terms and operations used in the proposed algorithm.
A. Symmetric Uncertainty:
It is derived from the mutual information by normalizing it with the entropies of the variables, and can be used as a measure of correlation either between two features or between a feature and the target classes. Mathematically it is defined as

$$ SU(X, Y) = \frac{2 \times Gain(X \mid Y)}{H(X) + H(Y)}, \qquad (3.1) $$

where $H(X)$ is the entropy of the variable $X$, calculated as follows:
$$ H(X) = - \sum_{x \in X} p(x) \log_2 p(x), \qquad (3.2) $$
Here $p(x)$ is the probability of occurrence of value $x$ of a feature $f$ with domain $X$, and can be calculated as

$$ p(x) = \frac{\text{occurrences of } x}{\text{size of the dataset}}. \qquad (3.3) $$
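For illustration, the entropy of equation (3.2), with probabilities estimated as in equation (3.3), can be computed in Python as follows. This is our own sketch, not code from the paper, and the function name `entropy` is ours:

```python
import numpy as np

def entropy(x):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a discrete feature column."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()        # p(x) = occurrences of x / size of dataset
    return -np.sum(p * np.log2(p))
```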
B. Information Gain
It represents the mutual information that one variable gains by observing the other. Practically it is measured as the reduction in the entropy of one variable given knowledge of the other; for example, the information gain about a variable $Y$ provided by another variable $X$ is represented as $Gain(Y \mid X)$ and is calculated as

$$ Gain(Y \mid X) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \qquad (3.5) $$
Here $H(Y \mid X)$ is the conditional entropy, interpreted as the entropy remaining in variable $Y$ once the value of another variable $X$ is known; in terms of probabilities it is given as

$$ H(Y \mid X) = - \sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x), \qquad (3.6) $$
where $p(y \mid x)$ is the conditional probability of value $y$ of feature $f_i$ with domain $Y$ given value $x$ of feature $f_j$ with domain $X$. As equation (3.5) shows, information gain is a symmetrical measure, hence $Gain(Y \mid X) = Gain(X \mid Y)$; consequently, by equation (3.1), the symmetric uncertainty is symmetrical as well. The value of symmetric uncertainty varies in the interval [0, 1]: '1' represents complete correlation between the two variables while '0' indicates complete irrelevance.
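Continuing the sketch above, conditional entropy, information gain and symmetric uncertainty follow directly from equations (3.6), (3.5) and (3.1); again, these helper names are ours, not from a published implementation:

```python
import numpy as np

def conditional_entropy(y, x):
    """H(Y|X) = -sum_x p(x) sum_y p(y|x) log2 p(y|x), per equation (3.6)."""
    return sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def information_gain(y, x):
    """Gain(Y|X) = H(Y) - H(Y|X); symmetric in its two arguments."""
    return entropy(y) - conditional_entropy(y, x)

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain / (H(X) + H(Y)), bounded in [0, 1], per equation (3.1)."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(y, x) / denom if denom > 0 else 0.0
```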
C. Fuzzy C-Means Clustering
Clustering is the process of grouping data according to a specific measure, and is generally categorized as hard clustering or soft clustering. Fuzzy clustering falls in the second category: each data point may belong to more than one cluster, and its attachment to each cluster is given by a membership value. The Fuzzy C-Means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. FCM attempts to partition a finite collection of elements $X = \{x_1, x_2, x_3, \dots, x_n\}$ into a collection of $C$ fuzzy clusters with respect to some given criterion. Given a finite set of data, the algorithm returns a list of $C$ cluster centers $V$, such that

$$ V = \{ v_i \}, \quad i = 1, 2, 3, \dots, C $$

and a partition matrix $U$ such that

$$ U = [u_{ij}], \quad i = 1, 2, \dots, C, \; j = 1, 2, \dots, n $$

where $u_{ij}$ is a numerical value in [0, 1] giving the degree to which the element $x_j$ belongs to the $i$-th cluster. The following is a step-by-step description of the FCM algorithm.
Step 1: Select the number of clusters $C$ ($2 \le C \le n$), the exponential weight $\mu$ ($1 < \mu < \infty$), an initial partition matrix $U^0$, and the termination criterion $\epsilon$. Also, set the iteration index $l$ to 0.
Step 2: Calculate the fuzzy cluster centers $\{ v_i^l \mid i = 1, 2, \dots, C \}$ using $U^l$.
Step 3: Calculate the new partition matrix $U^{l+1}$ using $\{ v_i^l \mid i = 1, 2, \dots, C \}$.
Step 4: Calculate the change $\Delta = \lVert U^{l+1} - U^l \rVert = \max_{ij} \lvert u_{ij}^{l+1} - u_{ij}^l \rvert$. If $\Delta > \epsilon$, set $l = l + 1$ and go to Step 2; if $\Delta \le \epsilon$, stop.
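As an illustration of these steps, here is a compact NumPy sketch of the FCM loop (our own illustrative code, with the exponential weight written as `m`); it is not the authors' implementation:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal FCM: X is (n_points, n_features); returns centers V, memberships U."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                             # initial partition matrix U^0
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)         # Step 2: centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                      # guard against zero distances
        U_new = d ** (-2.0 / (m - 1))              # Step 3: membership update
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() <= eps:         # Step 4: termination test
            return V, U_new
        U = U_new
    return V, U
```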
IV. Proposed Algorithm
The proposed algorithm can be explained as follows. Let $D$ be the high-dimensional dataset, expressed together with its target vector $T$ as

$$ D = \begin{bmatrix} d_{11} & d_{12} & d_{13} & \cdots & d_{1n} \\ d_{21} & d_{22} & d_{23} & \cdots & d_{2n} \\ \vdots & & & & \vdots \\ d_{m1} & d_{m2} & d_{m3} & \cdots & d_{mn} \end{bmatrix}, \quad T = \begin{bmatrix} t_c \\ t_c \\ \vdots \\ t_c \end{bmatrix} \qquad (4.1) $$
Hence the dataset has $n$ dimensions and $m$ entries, each mapped to a target class $t_c \in T$, $T = \{t_1, t_2, \dots, t_C\}$, where $C$ is the total number of target classes. The objective of the problem is to find a subset $D'$ of the data $D$ such that $D'$ has dimensions $m' \times n'$ with $m' < m$ and $n' < n$.
In the first step of the algorithm, the relation between each feature and the target class is estimated by calculating the symmetric uncertainty. Let the symmetric uncertainty between the $i$-th feature and the target classes be $SU(F_i, C)$. Since $SU(F_i, C)$ indicates how well feature $F_i$ predicts the target classes, it can be used as a first measure to remove unwanted features, by defining that a feature $F_i$ is important if and only if it satisfies

$$ SU(F_i, C) > \theta, \qquad (4.2) $$

where $\theta$ is a user-defined constant that can be seen as the minimum required relation between a feature and the target class.
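A minimal sketch of this filtering step, reusing the `symmetric_uncertainty` helper from Section III (both function names are ours, not from a published implementation):

```python
import numpy as np

def filter_by_class_relevance(D, t, theta):
    """D: (m, n) data matrix; t: (m,) target classes. Keep features with SU > theta."""
    su_fc = np.array([symmetric_uncertainty(D[:, i], t) for i in range(D.shape[1])])
    keep = np.flatnonzero(su_fc > theta)    # indices satisfying equation (4.2)
    return keep, su_fc[keep]
```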
After performing this operation the number of features changes; let it be $n_1$, so that

$$ D_1 = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1 n_1} \\ d_{21} & d_{22} & \cdots & d_{2 n_1} \\ \vdots & & & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{m n_1} \end{bmatrix}, \quad T = \begin{bmatrix} t_c \\ t_c \\ \vdots \\ t_c \end{bmatrix}, \quad n_1 < n \qquad (4.3) $$
In the second step of the algorithm, features which share the same information are detected by calculating the symmetric uncertainty between each pair of features, $SU(F_i, F_j)$, $i \ne j$. The idea is similar to the first step: features with higher values of $SU(F_i, F_j)$ may be considered identical. An $SU_{FF}$ matrix is then calculated as

$$ SU_{FF} = \begin{bmatrix} \sim & SU(F_1, F_2) & \cdots & SU(F_1, F_{n_1}) \\ SU(F_2, F_1) & \sim & \cdots & SU(F_2, F_{n_1}) \\ \vdots & & & \vdots \\ SU(F_{n_1}, F_1) & SU(F_{n_1}, F_2) & \cdots & \sim \end{bmatrix} \qquad (4.4) $$
The $SU_{FF}$ matrix is used to construct the Minimum Spanning Tree (MST), with every element of $SU_{FF}$ taken as the bond strength between the corresponding features. However, the MST connects every pair of features whose relation strength is greater than zero. To eliminate loosely connected features, another condition is applied: if a feature has a greater relation with the target classes ($SU_{FC}$) than with another feature ($SU_{FF}$), the link between these features is removed by modifying $SU_{FF}$ as below:

$$ SU(F_i, F_j) = 0 \quad \text{if} \quad SU(F_i, C) > SU(F_i, F_j) \qquad (4.5) $$

After modifying $SU_{FF}$ according to equation (4.5), the MST is reconstructed; the features still found connected are considered similar and are replaced by a single representative feature, selected according to their target-class relation ($SU_{FC}$) as follows.
Let $F_t = \{F_a, F_b, F_c\}$ be a set of connected features in the MST; then the representative feature $F_r$ is selected as

$$ F_r = F_{t_i}, \quad i = \operatorname{argmax}_i \{ SU(F_a, C), SU(F_b, C), SU(F_c, C) \} \qquad (4.6) $$
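A sketch of this grouping step using SciPy (illustrative only; the paper does not publish code). Since `minimum_spanning_tree` minimizes total edge weight, we assume the conventional $1 - SU$ distance between features:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def group_similar_features(su_ff, su_fc):
    """su_ff: (n, n) feature-feature SU matrix; su_fc: (n,) feature-class SU values."""
    dist = 1.0 - su_ff                 # assumed distance: high SU -> short edge
    np.fill_diagonal(dist, 0.0)        # zero entries are treated as absent edges
    mst = minimum_spanning_tree(dist).toarray()
    # Equation (4.5): cut edges where the class relation dominates the pairwise one.
    for a, b in zip(*np.nonzero(mst)):
        if su_fc[a] > su_ff[a, b]:
            mst[a, b] = 0.0
    # Features still connected form one group (eq. 4.6): keep the member whose SU
    # with the class is largest; isolated features become singleton groups.
    _, labels = connected_components(mst, directed=False)
    reps = [np.flatnonzero(labels == g)[np.argmax(su_fc[labels == g])]
            for g in np.unique(labels)]
    return np.array(reps)
```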
We now have the most useful feature set, and the new dataset can be presented as $D_{m \times n''}$, where $n'' < n$; the feature dimension is thus reduced, although the number of entries in the dataset is still the same and needs to be reduced as well. In the proposed system the third step serves this purpose: it groups similar data points using fuzzy c-means clustering, and the points having a large membership value in any one cluster are replaced by that cluster's centroid. Let fuzzy c-means cluster the given data into $k$ groups; after clustering, the data can be described by
$$ \text{Membership Matrix } M = \begin{bmatrix} M_{11} & M_{12} & M_{13} & \cdots & M_{1m} \\ M_{21} & M_{22} & M_{23} & \cdots & M_{2m} \\ \vdots & & & & \vdots \\ M_{k1} & M_{k2} & M_{k3} & \cdots & M_{km} \end{bmatrix} \qquad (4.7) $$

where $M_{ij}$ is the membership of the $j$-th entry (data point) in the $i$-th cluster. Each entry is then replaced by the centroid of the cluster in which it has the highest membership:

$$ D_1(j) = C_{i^*}, \quad i^* = \operatorname{argmax}_{1 \le i \le k} M_{ij} \qquad (4.8) $$
The number of points substituted depends upon the user-defined minimum merging similarity.
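A sketch of this merging step, built on the `fuzzy_c_means` function sketched in Section III-C (the threshold name `min_sim` is ours for the user-defined minimum merging similarity):

```python
import numpy as np

def merge_similar_points(D, t, k, min_sim=0.9):
    """Replace strongly-clustered rows of D by their centroid, then deduplicate."""
    V, U = fuzzy_c_means(D, c=k)            # U: (k, m) membership matrix
    best = U.argmax(axis=0)                 # per-point argmax of equation (4.8)
    strong = U.max(axis=0) >= min_sim       # points eligible for merging
    D_out = D.copy().astype(float)
    D_out[strong] = V[best[strong]]
    # Rows that collapsed onto the same centroid (and class) become duplicates,
    # so deduplicating shrinks the number of entries.
    rows = np.unique(np.column_stack([D_out, t.astype(float)]), axis=0)
    return rows[:, :-1], rows[:, -1]
```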
The complete algorithm can be described in the following steps:
Step 1: Calculate the $SU_{FF}$ matrix and the $SU_{FC}$ matrix for the given dataset $D$ with target classes $T$.
Step 2: On the basis of the $SU_{FC}$ values, reject the features whose $SU_{FC}$ value is less than the threshold $\theta$.
Step 3: Recalculate the $SU_{FF}$ matrix for the reduced feature set, $SU_{FF}'$.
Step 4: Construct the minimum spanning tree (MST).
Step 5: Remove the branches of the MST having $SU_{FC}(i, C) > SU_{FF}(i, j)$.
Step 6: Select the isolated features from the MST and generate a representative for each group of non-isolated features.
Step 7: Form the new dataset with only the features selected in Step 6.
Step 8: Perform fuzzy c-means clustering and take each cluster centroid as the representative for the data points having high membership in it.
Step 9: Train the classifier on the dataset obtained and test its accuracy.
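Putting the sketches together, a hypothetical driver for Steps 1-8 could look like the following; all helper names come from the illustrative snippets above, and the defaults for `theta`, `k` and `min_sim` are arbitrary placeholders, not values from the paper:

```python
import numpy as np

def reduce_dataset(D, t, theta=0.1, k=50, min_sim=0.9):
    # Steps 1-2: class-relevance filtering via SU_FC.
    keep, su_fc = filter_by_class_relevance(D, t, theta)
    D1 = D[:, keep]
    # Step 3: recompute the feature-feature SU matrix on the surviving features.
    n1 = D1.shape[1]
    su_ff = np.zeros((n1, n1))
    for i in range(n1):
        for j in range(i + 1, n1):
            su_ff[i, j] = su_ff[j, i] = symmetric_uncertainty(D1[:, i], D1[:, j])
    # Steps 4-7: MST grouping and representative selection.
    reps = group_similar_features(su_ff, su_fc)
    D2 = D1[:, reps]
    # Step 8: fuzzy c-means based row reduction.
    return merge_similar_points(D2, t, k, min_sim)
```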
V. Simulation Results
In this section we present the experimental results in terms of the proportion of selected features, the time to obtain the feature subset, and the classification accuracy. The dataset used has 36 features, all with a domain size of 2, and the target class also has a domain size of 2. The total number of entries in the dataset is 3196. To test the quality of the selected features, a classification test is performed using a probabilistic neural network.
Table 1: Dataset size vs. processing time.

Dataset Size (%)    Processing Time (seconds)
40                  3.6514
60                  4.6715
80                  5.5949
100                 6.5218
Figure 1: Plot of the Table 1 data (impact of dataset size on processing time).
Table 2: Dataset size vs. reduced data size for the previous and proposed methods.

Dataset Size (%)    Previous Method    Proposed Method
40                  1278               1151
60                  1918               1726
80                  2557               2301
100                 3196               2876
Figure 2: Plot of the Table 2 comparison of data size reduction for the previous and proposed methods.
Table 3: Dataset size vs. number of selected features for the proposed method.

Dataset Size (%)    Number of Features
40                  14
60                  17
80                  15
100                 13
Figure 3: Plot of dataset size vs. number of selected features (Table 3).
Table 4: Dataset size vs. classification accuracy (%) for the previous and proposed methods.

Dataset Size (%)    Previous Method    Proposed Method
40                  74.5               77.8
60                  80.2               82.4
80                  77.6               83.1
100                 81.9               83.3
VI. Conclusion
In this paper we presented a novel feature subset selection algorithm based on information theory and fuzzy clustering, combined with data size reduction, which is applicable especially to high-dimensional data. The algorithm is developed not only to identify and remove irrelevant and redundant features, but also to deal with interactive features. We first defined relevant, redundant and interactive features based on symmetric uncertainty; based on these definitions we then presented the feature selection algorithm, which involves four steps: (1) redundant feature exclusion and interactive feature preservation, (2) irrelevant feature identification, (3) minimum spanning tree formation for grouping similar features, and (4) fuzzy c-means clustering for data size reduction. We also explained the concepts behind redundant and irrelevant features, and the preservation of interactive features, with the appropriate expressions. Finally, the tests with real-world datasets show that the proposed algorithm has a moderate reduction capability; meanwhile it also reduces the data size and obtains the best average accuracies among the neural-network-based classification algorithms considered.
Figure 4: Plot of the Table 4 data (classification accuracy of the previous and proposed methods).
Figure 5: Final minimum spanning tree generated for 100% of the data samples.
References
[1] Pablo Bermejo, Luis de la Ossa, José A. Gámez, José M. Puerta, "Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking", Knowledge-Based Systems, Vol. 25, Issue 1, February 2012, pp. 35-44, Special Issue on New Trends in Data Mining.
[2] Guangtao Wang, Qinbao Song, Baowen Xu, Yuming Zhou, "Selecting feature subset for high dimensional data via the propositional FOIL rules", Pattern Recognition 46 (2013) 199-214.
[3] Sebastián Maldonado, Richard Weber, Fazel Famili, "Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines", Information Sciences 286 (2014) 228-246.
[4] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Jennifer Spielman, Lorraine O. Ramig, "Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease", IEEE Transactions on Biomedical Engineering, Vol. 59, No. 5, May 2012, pp. 1264-1271.
[5] Alok Sharma, Seiya Imoto, Satoru Miyano, "A top-r Feature Selection Algorithm for Microarray Gene Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, Issue 3, March 2012.
[6] R. Ruiz, J.C. Riquelme, J.S. Aguilar-Ruiz, M. García-Torres, "Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches", Expert Systems with Applications 39 (2012) 11094-11102.
[7] Xiaojun Chen, Yunming Ye, Xiaofei Xu, Joshua Zhexue Huang, "A feature group weighting method for subspace clustering of high-dimensional data", Pattern Recognition 45 (2012) 434-446.
[8] Yuzong Liu, Kai Wei, Katrin Kirchhoff, Yisong Song, Jeff Bilmes, "Submodular Feature Selection for High-Dimensional Acoustic Score Spaces", Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 26-31 May 2013.
[9] Mohak Shah, Mario Marchand, Jacques Corbeil, "Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, Issue 1, November 2011.
[10] Alexandros Kalousis, Julien Prados, Melanie Hilario, "Stability of Feature Selection Algorithms: A Study on High Dimensional Spaces", Knowledge and Information Systems, Vol. 12, Issue 1, May 2007.
[11] Qiang Cheng, Hongbo Zhou, Jie Cheng, "The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multi-Class Classification with Applications to High-Dimensional Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, Issue 6, April 2011.
[12] Yongjun Piao, Minghao Piao, Kiejung Park, Keun Ho Ryu, "An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data", Vol. 28, No. 24, 2012, pp. 3306-3315.
[13] Lei Yu, Huan Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[14] Lance Parsons, Ehtesham Haque, Huan Liu, "Subspace Clustering for High Dimensional Data: A Review", ACM SIGKDD Explorations Newsletter, Vol. 6, Issue 1, June 2004.
[15] Daphne Koller, Mehran Sahami, "Toward Optimal Feature Selection", Technical Report, Stanford InfoLab.
[16] Isabelle Guyon, André Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research 3 (2003) 1157-1182.
[17] Lei Yu, Huan Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", Journal of Machine Learning Research 5 (2004) 1205-1224.
[18] Qinbao Song, Jingjie Ni, Guangtao Wang, "A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, January 2013.