International Journal of Computer Networks & Communications (IJCNC), Vol. 6, No. 3, May 2014
DOI: 10.5121/ijcnc.2014.6315
CORRELATION BASED FEATURE SELECTION (CFS) TECHNIQUE TO PREDICT STUDENT PERFORMANCE

Mital Doshi (1) and Dr. Setu K. Chaturvedi (2)
(1) M.Tech. Research Scholar, Technocrats Institute of Technology, Bhopal, India
(2) Professor & HOD (Dept. of CSE), Technocrats Institute of Technology, Bhopal, India
ABSTRACT
Educational data mining is an emerging field in which academic data are mined to solve various kinds of problems. One such problem is the selection of a proper academic track: the admission of a student to an engineering college depends on many factors. In this paper we implement classification techniques to help students predict their chances of success on admission to an engineering stream. We analyse a data set containing students' academic and socio-demographic variables, with attributes such as family pressure, interest, gender, XII marks and CET rank in the entrance examinations, together with historical data of a previous batch of students. Feature selection is a process for removing irrelevant and redundant features, which helps improve the predictive accuracy of classifiers. We first use the attribute selection algorithms Chi-square, InfoGain and GainRatio to identify the relevant features, then apply the fast correlation-based filter (FCBF) to the given features. Classification is then performed using NBTree, MultilayerPerceptron, NaiveBayes and instance-based k-nearest neighbour (IBK). Results show a reduction in computational cost and time and an increase in predictive accuracy for the student model.
KEYWORDS
Chi-square, Correlation feature selection, IBK, InfoGain, GainRatio, Multilayer perceptron, NaiveBayes, NBTree
1. INTRODUCTION
Feature selection is a preprocessing step in machine learning. Feature selection algorithms fall into three main categories: wrapper, filter and embedded [1]. The filter model selects features without the help of any learning algorithm, while the wrapper model uses a predetermined learning algorithm to identify the relevant features and test them. The wrapper model is more expensive than the filter model because it requires more computation, so when the number of features is large the filter model is generally preferred. In this paper we use the filter model; our aim is to improve the accuracy of recommending a stream to a student, helping him build a bright future according to his choice by predicting success as early as possible. The fast correlation-based filter (FCBF) is an algorithm that is very successful at removing redundant and irrelevant features from a dataset, so that computation time is decreased and predictive accuracy is increased.
2. CLASSIFICATION TECHNIQUES
2.1 NBTree
NBTree is a hybrid of a decision tree and Naïve Bayes. The basic recursive-partitioning scheme remains the same, but the leaf nodes are Naïve Bayes categorizers rather than nodes predicting a single class [2].
2.2 Naïve Bayes
The Naïve Bayes classifier is used when the dimensionality of the inputs is high. It is a simple algorithm but often gives better output than more elaborate methods. We use it to predict student dropout by calculating the probability of each input for a predictable state. It trains on the weighted training data and also helps prevent overfitting.
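The sketch below is illustrative only: the paper's experiments use WEKA's NaiveBayes, and scikit-learn's CategoricalNB is assumed here as a comparable implementation for nominal attributes; the arrays are hypothetical encodings of student records, not the study's data.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical integer-coded student attributes (rows) and admit/no-admit labels.
X = np.array([[0, 1, 2], [1, 0, 1], [0, 2, 0], [1, 1, 2]])
y = np.array([1, 0, 1, 0])

model = CategoricalNB()            # scores each class as P(class) * prod P(attribute | class)
model.fit(X, y)
print(model.predict_proba(X[:1]))  # posterior probability of each class for one student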
2.3 Instance-based k-nearest neighbour (IBK)
In this technique a new item is classified by comparing it with memorized data items using a distance measure, which requires storing the dataset. Items are matched by finding the stored items closest to the new one. The number of nearest neighbours can be chosen by cross-validation, either automatically or manually.
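As a hedged sketch of this idea, scikit-learn's KNeighborsClassifier is assumed below as a stand-in for WEKA's IBk, with GridSearchCV playing the role of IBk's cross-validated choice of k; the generated data are placeholders.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=380, n_features=5, random_state=0)  # stand-in data
search = GridSearchCV(KNeighborsClassifier(),         # distance-based matching
                      {"n_neighbors": [1, 3, 5, 7]},  # candidate neighbourhood sizes
                      cv=10)                          # choose k by 10-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))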
2.4 Multilayer Perceptron
The multilayer perceptron is one of the most widely used and popular neural networks. Its network consists of a set of sensory elements forming the input layer, one or more hidden layers of processing elements, and an output layer of processing elements. A back-propagation ANN can be used to predict both continuous and discrete data. An ANN represents each cluster by a neuron, modelled on the neural structure of the brain; each connection has an associated weight, which is adjusted adaptively during learning. The main limitation of ANNs is their long training times, so they are more suitable for applications where long training is feasible. Here we use the multilayer perceptron form of ANN [3].
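A minimal sketch, assuming scikit-learn's MLPClassifier (trained by backpropagation) as a stand-in for the WEKA MultilayerPerceptron used in the experiments; the synthetic data stand in for the student records.

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=380, n_features=5, random_state=0)  # placeholder for the 380-instance dataset
mlp = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer of processing elements
                    max_iter=1000, random_state=0)
mlp.fit(X, y)                                   # connection weights adapted during learning
print(round(mlp.score(X, y), 3))                # training-set accuracy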
3. RELATED WORK
Pumpuang et al. [4] proposed classifier algorithms for building a course registration planning model from a historical dataset. The model used four classifiers: Bayesian Network, C4.5, Decision Forest and NBTree. Results showed that NBTree appeared to be the best for predicting a student's GPA.
Tanna [5] implemented a decision support system for admission to engineering colleges based on entrance exam marks. The system returns colleges and streams categorized as Ambitious, Best Bargain and Safe using an offset value.
In [6], Borah et al. used a knowledge-based decision technique to guide students toward the proper branch of engineering. They compared two algorithms, a decision tree and an ANN, to find out which is more accurate for decision making. Results showed that the accuracy of the MLP algorithm was better, reaching up to 86% with a training partition of size 50 and a testing partition of size 50.
Al-Radaideh et al. [7] proposed a simple classification model to provide a guideline that helps students and school management choose the right track of study for a student. A decision tree using the C4.5 algorithm (J48 in WEKA) was built by selecting the best attributes using the information gain measure. The classification rules were based on more than one factor, such as the ratio, the average of the student's marks in the 10th class (AVERAGE), and the average of the student's marks in the 8th, 9th and 10th classes (AVG89_10). Results show that the accuracy of the model was 87.9%, with 218 of 248 students correctly classified.
Hany et al. [8] applied six classifiers to the ASSISTments dataset, which has 15 features. They used VF1, IBK, NaiveBayesUpdateable, OneR, J48 and k-means clustering to rank the features. Results showed that k-means clustering was the best at ranking features, while Naïve Bayes gave better prediction accuracy.
Yu and Liu [9] proposed a feature selection algorithm designed for high-dimensional data, called the fast correlation-based filter (FCBF), which removes irrelevant and redundant data. They applied FCBF, ReliefF, CorrSF and ConsSF to four datasets and recorded the running time and the number of features selected, then applied C4.5 and NBC classification to the data.
Bharadwaj and Pal [10] conducted an experiment to predict performance at the end of the semester using student data such as attendance, class test, seminar and assignment marks taken from students' previous database results.
Hijazi and Naqvi [11] studied the performance of 300 students from a group of colleges of Punjab University. Results showed that students' attitude towards class attendance depends on the time they spend in college studying after college hours. Simple linear regression analysis found that other factors, such as the mother's age and education, are also related to student performance.
Khan [12] conducted an experiment on 200 boys and 200 girls of the secondary school of Aligarh Muslim University. The main aim was to find the variables that determine success in higher education in the science stream, so demographic variables and personality measures were used as input. A cluster sampling technique divided the population into groups, and a random sample of clusters was used for further analysis. Results showed that girls with high socio-economic status had relatively higher academic achievement in science, whereas boys with low socio-economic status had higher academic achievement in general.
Z. J. Kovacic [13] presented a case study on educational data mining to identify the extent to which enrolment data can be used to predict student success. CHAID and CART were applied to students of a diploma college in New Zealand; the results yielded two decision trees with classifier accuracies of 59.4% and 60.5%.
4. CORRELATION FEATURE SELECTION
Feature selection is a preprocessing step for machine learning that is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility [14].
4.1 STEPS OF FEATURE SELECTION
A feature subset is good if its features are highly correlated with the class but not highly correlated with one another [15].

Steps:
a. Subset generation: we used the attribute selectors to rank all the features of the data set, then took the top 3, 4 and 5 features for classification.
b. Subset evaluation: each classifier is applied to the generated subset.
c. Stopping criterion: the testing process continues until 5 features of the subset are selected.
d. Result validation: we used 10-fold cross-validation to test each classifier's accuracy; a minimal sketch of the whole loop follows this list.
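The following sketch illustrates steps (a)-(d) under stated assumptions: scikit-learn's mutual_info_classif is used as an InfoGain-style ranker, GaussianNB and KNeighborsClassifier stand in for two of the WEKA classifiers, and the data are synthetic placeholders for the 380-instance student set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for the 380-instance, 17-attribute student set.
X, y = make_classification(n_samples=380, n_features=17, random_state=0)

# Step (a): rank features once (mutual information as an InfoGain analogue).
order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

for k in (3, 4, 5):                                      # top 3, 4 and 5 features
    Xk = X[:, order[:k]]
    for clf in (GaussianNB(), KNeighborsClassifier()):   # step (b): evaluate each subset
        acc = cross_val_score(clf, Xk, y, cv=10).mean()  # step (d): 10-fold cross-validation
        print(k, type(clf).__name__, round(acc, 3))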
4.2 CORRELATION-BASED MEASURES
Here we discuss the measures used to judge the goodness of a feature for classification. We regard a feature as good if it is highly relevant to the class and not redundant with any other feature of the class; in short, a feature should be highly correlated with the class and not highly correlated with any other feature. For this we use information-theoretic measures based on entropy.
Entropy is a measure of the uncertainty of a random variable and is defined by equation (1) as

H(X) = - \sum_i P(x_i) \log_2 P(x_i)                                   (1)

The entropy of X after observing the values of another variable Y is defined in equation (2) as

H(X|Y) = - \sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)           (2)
Here, P(x_i) denotes the prior probability of each value of X, and P(x_i|y_j) the posterior probability of X given the values of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called information gain, given by equation (3) as

IG(X|Y) = H(X) - H(X|Y)                                                (3)

We can conclude that feature Y is more correlated to feature X than to feature Z if IG(X|Y) > IG(Z|Y).
A further measure, symmetrical uncertainty (SU), also expresses the correlation between features and is defined by equation (4) as

SU(X, Y) = 2 \, \frac{IG(X|Y)}{H(X) + H(Y)}                            (4)

SU compensates for information gain's bias toward features with more values and normalizes the value to the range [0, 1], where 1 indicates that knowledge of either variable completely predicts the value of the other and 0 indicates that X and Y are independent. It treats a pair of features symmetrically.
Entropy-based measures require nominal features, but they can also be applied to measure correlations between continuous features if these are discretized properly.
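As a minimal sketch of equations (1)-(4), computed from empirical frequencies of two nominal arrays (the helper names are our own, not from the paper):

import numpy as np

def entropy(x):                                   # equation (1)
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):                    # equation (2)
    return sum(np.mean(y == v) * entropy(x[y == v]) for v in np.unique(y))

def information_gain(x, y):                       # equation (3)
    return entropy(x) - conditional_entropy(x, y)

def symmetrical_uncertainty(x, y):                # equation (4)
    return 2.0 * information_gain(x, y) / (entropy(x) + entropy(y))

x = np.array([0, 0, 1, 1]); y = np.array([0, 0, 1, 1])
print(symmetrical_uncertainty(x, y))              # 1.0: Y completely predicts X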
5. ALGORITHM
Based on the methodology presented above, we use the following algorithm, named FCBF (Fast Correlation-Based Filter) [9].
Input:  S(F1, F2, ..., FN, C)   // training data set
        δ                       // predefined threshold value
Output: Sbest                   // an optimal subset

begin
  for i = 1 to N do begin
    calculate SU(i,c) for Fi;
    if (SU(i,c) ≥ δ)
      append Fi to S'list;
  end;
  order S'list in descending SU(i,c) value;
  Fp = getFirstElement(S'list);
  do begin
    Fq = getNextElement(S'list, Fp);
    if (Fq <> NULL)
      do begin
        F'q = Fq;
        if (SU(p,q) ≥ SU(q,c)) begin
          remove Fq from S'list;
          Fq = getNextElement(S'list, F'q);
        end
        else
          Fq = getNextElement(S'list, Fq);
      end until (Fq == NULL);
    Fp = getNextElement(S'list, Fp);
  end until (Fp == NULL);
  Sbest = S'list;
end;
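For reference, the following is a hedged, runnable Python rendering of this pseudocode. It is a sketch under our own naming, not the authors' Java/NetBeans implementation; the entropy helpers from the Section 4.2 sketch are repeated so the block is self-contained.

import numpy as np

def _entropy(x):
    _, c = np.unique(x, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def _su(x, y):  # symmetrical uncertainty, equation (4)
    ig = _entropy(x) - sum(np.mean(y == v) * _entropy(x[y == v]) for v in np.unique(y))
    denom = _entropy(x) + _entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0

def fcbf(features, labels, delta=0.0):
    """features: dict of name -> 1-D nominal array; returns selected names."""
    su_c = {f: _su(x, labels) for f, x in features.items()}
    s_list = sorted((f for f in features if su_c[f] >= delta),   # relevance step
                    key=lambda f: su_c[f], reverse=True)
    best = []
    while s_list:
        fp = s_list.pop(0)                  # most class-relevant remaining feature
        best.append(fp)
        # redundancy step: drop Fq when SU(Fp, Fq) >= SU(Fq, class)
        s_list = [fq for fq in s_list
                  if _su(features[fp], features[fq]) < su_c[fq]]
    return best

Calling fcbf({"f1": x1, "f2": x2}, labels) on encoded attribute arrays would return the retained feature names in decreasing class relevance.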
6. PROPOSED SYSTEM
6.1 DATA PREPARATION
We collected data on students of a Mumbai college who were to enrol in 2014; this training dataset contains information about students admitted to the first year. The data are in Excel format and hold each student's personal and academic record: the student's name, admission type, sex, marks in 12th standard, marks in mathematics, physics and chemistry and their average, common entrance test marks, and personal details such as the father's occupation and qualification, the mother's qualification and occupation, and the student's interest.
6.2 DATA PROCESSING
The student data warehouse contains 380 instances with 32 attributes. From this list we selected the 17 attributes that we judged relevant to our work. Table 1 lists the reduced set of attributes.
Table 1: List of Attributes
6.3 IMPLEMENTATION OF MODEL
WEKA is freely available open-source data-mining software that implements a large collection of mining algorithms. It accepts data in various formats and ships with converters, so we converted the student dataset into an ARFF file and loaded it into the WEKA Explorer. The Classify panel is used for classification: to estimate the accuracy of the resulting predictive model, to visualize erroneous predictions, or to inspect the model itself. NetBeans is used to implement FCBF. For good results we need to know the weight each variable carries in the success of a student's admission to engineering, so we applied the feature selection algorithms InfoGain, Chi-squared and GainRatio. Table 2 shows the features ranked by each algorithm.
Table 2: Rank of features and Average rank.
Rather than trusting any single attribute selector, we took the average of their ranks and then selected the features; the resulting ranking is 17, 8, 4, 16, 10. From the table we conclude that family pressure is the most important factor for predicting admission to engineering, followed by admission type, the student's interest, the mother's occupation, and residence in a hostel.
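A sketch of this rank-averaging step, assuming scikit-learn's chi2 and mutual_info_classif as stand-ins for the WEKA attribute evaluators and synthetic data in place of the 17-attribute student set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif

# Placeholder data standing in for the 17-attribute student set.
X, y = make_classification(n_samples=380, n_features=17, random_state=0)
X = np.abs(X)                                         # chi2 requires non-negative values

scores = [chi2(X, y)[0],                              # Chi-squared scores
          mutual_info_classif(X, y, random_state=0)]  # InfoGain-style scores
ranks = np.array([np.argsort(np.argsort(-s)) for s in scores])  # rank 0 = best
avg_rank = ranks.mean(axis=0)                         # average rank across selectors
print(np.argsort(avg_rank)[:5])                       # indices of the top 5 attributes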
Next we applied the classification algorithms NBTree, MultilayerPerceptron, Naïve Bayes and IBK to the selected features. We start from a subset of 3 features and add one feature at a time to observe each algorithm's accuracy. Table 3 shows the evaluation of the classifiers on these feature subsets.
TABLE 3: EVALUATION OF CLASSIFIERS USING SUBSETS OF 3, 4 AND 5 FEATURES (PA = predictive accuracy)
From the table we can see that the highest PA of NBTree is 65%, obtained with three features. For MLP the highest PA is 65.83%, obtained with four features. For Naïve Bayes the highest accuracy is 61.66%, with four features, and for IBK we get a PA of 75%, also with four features. Among all the classifiers, IBK is the best, and it also takes the minimum time.
Next we perform the classification using the FCBF algorithm, which is implemented in Java using NetBeans, since FCBF is not supported by WEKA. The algorithm selects attributes together with their symmetrical uncertainty values; the most important factor found in this way is family income, followed by the father's qualification and the all-India rank in the common entrance test. We then apply the classifiers to the selected attributes. Table 4 shows the classification using 3, 4 and 5 features.
TABLE 4: EVALUATION OF CLASSIFIERS USING FCBF ALGORITHM SUBSETS OF 3, 4 AND 5 FEATURES
(PA = predictive accuracy in %; time in seconds)

No. of     NBTree         MLP            Naïve Bayes    IBK
features   PA      time   PA      time   PA      time   PA     time
3          65.83   0.05   75      0.82   65.83   0      75     0
4          65.83   0.08   81.6    1.64   66.6    0      100    0
5          75      0.26   87.5    1.89   65.83   0.01   100    0
MAX        75             87.5           66.6           100
The results show that with FCBF the maximum accuracy, 100%, is obtained with the IBK classifier. Beyond that, the table shows that the PA of NBTree is 75%, that of MLP is 87.5% and that of Naïve Bayes is 66.6%. We also conclude that time is saved and accuracy is increased.
7. CONCLUSION
From the above results we conclude that feature selection techniques can improve the accuracy and efficiency of classification algorithms by removing irrelevant and redundant features. By averaging the InfoGain, GainRatio and Chi-square scores we obtain the most relevant attributes. Four classifiers were applied to the selected attributes. From the results we conclude that family pressure and the student's interest are the most important factors in predicting a student's admission to engineering, giving a predictive indication of whether or not the student should take admission in engineering. We also conclude that, among all the selection techniques used, FCBF gives the best assessment of feature relevancy. In future, other feature selection techniques can be applied to the dataset.
REFERENCES
[1] Ladha L. and Deepa T., "Feature Selection Methods and Algorithms", International Journal on
Computer Science and Engineering (IJCSE), 2011.
[2] R. Kohavi. “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining,
1996
[3] Baker, R.S.J.D. (2010). Data Mining for Education. In B. McGaw, P. Peterson, E. Baker (eds.),
International Encyclopaedia of Education (3rd edition), (pp. 112-118). Oxford, UK: Elsevier
[4] Pathom Pumpuang, Anongnart Srivihok , Prasong Praneetpolgrang, “Comparisons of Classifier
Algorithms: Bayesian Network, C4.5, Decision Forest and NBTree for Course Registration
Planning Model of Undergraduate Students”, 1-4244-2384-2/08/ 2008 IEEE
[5] Miren Tanna, "Decision Support System for Admission in Engineering Colleges based on Entrance Exam Marks", IJCA (0975-8887), Vol. 52, No. 11, August 2012
[6] Malaya Dutta Borah, Rajni Jindal, Daya Gupta Ganesh Chandra Deka, “Application of knowledge
based decision technique to predict student enrollment decision”, 978-1-4577-0792-6/11 2011
IEEE
[7] Qasem A. Al-Radaideh, Ahmad Al Ananbeh, and Emad M. Al-Shawakfa, “A classification model
for predicting the suitable study track for school students”, Vol8 Issue2/IJRRAS_8_2_15.pdf,
August 2011
[8] Hany M. Harb and Malaka A. Moustafa, "Selecting optimal subset of features for student performance model", IJCSI, Vol. 9, Issue 5, No. 1, September 2012, ISSN 1694-0814
[9] Lei Yu and Huan Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[10] B. K. Bharadwaj and S. Pal. "Mining Educational Data to Analyze Students' Performance",
International Journal of Advance Computer Science and Applications (IJACSA), Vol. 2, No. 6,
pp.63-69, 2011.
[11] S. T. Hijazi, and R. S. M. M. Naqvi, "Factors affecting student's performance: A Case of Private
Colleges", Bangladesh e-Journal of Sociology, Vol. 3, No. 1, 2006.
[12] Z. N. Khan, "Scholastic achievement of higher secondary students in science stream", Journal of Social Sciences, Vol. 1, No. 2, pp. 84-87, 2005.
[13] Z. J. Kovacic, “Early prediction of student success: Mining student enrollment data”,Proceedings
of Informing Science & IT Education Conference 2010
[14] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning", Artificial Intelligence, 1997; R. Kohavi and G. H. John, "Wrappers for feature subset selection", Artificial Intelligence, 1997.
[15] Hall, M. (1999). Correlation-based feature selection for machine learning. Doctoral dissertation, University of Waikato, Dept. of Computer Science.
[16] WEKA, http://www.cs.waikato.ac.nz/ml/weka, last accessed 8 April 2008.
Authors
Mital Mehta, B.E. in Computer Engineering, is pursuing an M.Tech. in Software Systems at T.I.T. College, Bhopal.