Student Alcohol Consumption Prediction:
Data Mining Approach

1 Hind Almayyan, 2 Waheeda Almayyan

1 Computer Department, Institute of Secretarial Studies, PAAET, Kuwait, hi.almayyan@paaet.edu.kw
2 Computer Information Department, College of Business Studies, PAAET, Kuwait, wi.almayyan@paaet.edu.kw
Abstract
Alcohol consumption in higher education institutes is not a new problem, but excessive drinking by
underage students is a serious health concern. Excessive drinking among students is associated with a number
of life-threatening consequences, including serious injuries, alcohol poisoning, temporary loss of
consciousness, academic failure, violence, unplanned pregnancy, sexually transmitted diseases, trouble with
authorities, property damage, and vocational and criminal consequences that could jeopardize future job
prospects. This article describes a learning technique for improving the prediction of academic performance in
educational institutions for students who consume alcohol. Such a technique can help in identifying the students
who need special advising or counselling to understand the dangers of consuming alcohol. The work was carried
out in two major phases: a feature selection phase that applies diverse feature selection algorithms,
such as Information Gain Ratio attribute evaluation, Correlation-based Feature Selection, Symmetrical Uncertainty and
Particle Swarm Optimization, after which a subset of features is chosen for the classification
phase; and a classification phase in which several machine-learning methods are used to estimate the possibility
of a teenager's alcohol addiction. Experimental results demonstrated that the proposed approach could improve
accuracy and achieve promising results with a limited number of features.
Keywords
Data mining; Classification; Student performance; Feature selection; Particle swarm
optimization; Alcohol consumption prediction.
1. INTRODUCTION
Globally, heavy alcohol drinking is associated with premature death, a weaker probability of
employment, more absence from work, lost productivity and lower wages. Moreover,
alcohol consumption results in approximately 3.3 million deaths each year [1] and is the third largest risk
factor for alcohol-related hospitalizations, deaths and disability in the world. Approximately one in four
children younger than 18 years old in the United States is exposed to alcohol abuse or alcohol dependence
in the family [2]. Alcohol consumption has consequences for the health and well-being of those who
drink and, by extension, the lives of those around them.
The relationship between problematic alcohol consumption and academic performance is a concern
for decision makers in education [3]. Alcohol consumption has been associated with poor
academic performance [4], and heavy drinking has been proposed as a probable contributor to student
attrition from school [5].
Traditional methods for monitoring adolescent alcohol consumption are based on surveys, which have
many limitations and are difficult to scale. Therefore, several approaches using conventional and
artificial intelligence techniques have been investigated to evaluate teenage alcohol consumption.
In Crutzen et al. [6], a group of Dutch researchers studied the association between parental reports,
teenager perception and parenting practices to identify binge drinkers. They designed a binary classifier
using alternating decision trees to establish the effectiveness of exploring nonlinear
relationships in the data. Montaño et al. [7] proposed an analysis of psychosocial and personality variables
related to nicotine consumption in teenagers. They applied several classification techniques, such as
artificial neural networks (multi-layer perceptrons, radial basis function networks and probabilistic networks),
decision trees, logistic regression and discriminant analysis. They successfully discriminated 78.20% of the subjects,
which indicates that this approach can be used to predict and prevent similar addictive behavior.
Pang et al. [8] applied a multimodal analysis to identify alcohol consumption among minors,
specifically users of the Instagram social network. The analysis is based on facial recognition of selfie
photos and on exploring the tags assigned to each image, with the objective of finding consumption patterns
in terms of time, frequency and location. They also measured the penetration of alcohol
brands to establish their influence on the consumption behavior of their followers. Experimental results
were satisfactory and consistent with polls of the same audience, which suggests that this
approach can be extended to other domains of public health.
Bi et al. [9] presented a study using two machine learning methods to effectively identify daily dynamic
alcohol consumption and its associated risk factors. They proposed a Support Vector
Machine (SVM) classifier to establish a function of stress, state of mind and consumption expectancy
that differentiates drinking patterns. A fusion of clustering analysis and feature
classification was then used to identify consumption patterns based on the daily behavior of average intake and
to detect the risk factors associated with each pattern. Zuba et al. [10] proposed a machine learning approach that
uses a feature selection method with 1-norm support vector machines (SVM) to classify college
students into high-risk and low-risk alcohol drinkers and to identify the risk factors associated with heavy
drinkers. This approach could be used to help detect early signs of alcohol addiction and dependence
in students.
In this article, we address the prediction of teenagers' alcohol addiction using past school
records, demographic, family and other student-related data. This article extends the research
conducted by Cortez and Silva in 2008 [11] and seeks to establish the correlation between poor
academic performance and the use of alcohol among teenagers. We applied several data mining tools,
and the evaluation shows the potential for better results. This article suggests a new classification
technique that enhances student performance prediction using fewer attributes than the ones
used in the original research. The aim is to obtain better prediction results using fewer parameters in the
process.
The rest of the article is organized as follows. The suggested approach is presented in Section 2. Section 3
describes the experimental steps and the dataset involved. Section 4 presents the experimental results. The
article ends with conclusions and further research plans in Section 5.
2. THE PROPOSED APPROACH
Initially, several machine-learning classification methods, which are considered very robust
in solving non-linear problems, are chosen to estimate the class possibility. These methods
include a feed-forward artificial Neural Network (MLP), the Simple Logistic multinomial logistic
model, the Rotation Forest and Random Forest ensemble learning methods, the C4.5 decision tree and the
Fuzzy Unordered Rule Induction Algorithm (FURIA) classifiers. We carried out extensive
experimentation to prove the worth of the proposed approach, analyzing the results on the
dataset from the perspectives of Accuracy, ROC and Cohen's kappa coefficient.
Feature extraction has played a significant role in many classification systems [12]. On this
basis, the focus of this section is on the applied feature selection techniques.
Figure 1 The proposed methodology: the full dataset passes through a feature selection procedure (IGR, CFS, SU and PSO); the selected features feed the classification procedure (MLP, Simple Logistic, Rotation Forest, Random Forest and C4.5); and performance is measured by Accuracy, ROC and Kappa.
2.1 Particle swarm optimization (PSO)
The PSO technique is a population-based stochastic optimization technique first introduced in
1995 by Kennedy and Eberhart [13]. In PSO, a candidate solution is encoded as a
finite-length string called a particle pi in the search space. Each particle makes use of its
own memory and of the knowledge gained by the swarm as a whole to find the best solution. To
discover the optimal solution, each particle adjusts its searching direction
according to two factors: its own best previous experience (pbest) and the best experience of its
companions (gbest). Each particle moves around the n-dimensional search
space $S \subseteq \mathbb{R}^n$ with objective function $f : S \to \mathbb{R}$. Each particle has a position $x_{i,t}$ (t represents the
iteration counter), a fitness value $f(x_{i,t})$, and ``flies'' through the problem space with a
velocity $v_{i,t}$. A new position $z_1 \in S$ is called better than $z_2 \in S$ iff $f(z_1) < f(z_2)$.
Particles evolve simultaneously based on knowledge shared with neighbouring particles; they
make use of their own memory and knowledge gained by the swarm as a whole to find the best
solution. The best search space position particle i has visited until iteration t is its previous
experience pbest. To each particle, a subset of all particles is assigned as its neighbourhood. The
best previous experience of all neighbours of particle i is called gbest. Each particle additionally
keeps a fraction of its old velocity. The particle updates its velocity and position with the
following equations in continuous PSO [14]:

$$v_{pd}^{new} = w \cdot v_{pd}^{old} + C_1 \cdot rand() \cdot (pbest_{pd} - x_{pd}^{old}) + C_2 \cdot rand() \cdot (gbest_{d} - x_{pd}^{old}) \qquad (1)$$

$$x_{pd}^{new} = x_{pd}^{old} + v_{pd}^{new} \qquad (2)$$
The first part of Equation 1 represents the previous flying velocity of the particle. The
second part is the ``cognition'' part, the private thinking of the particle itself,
where C1 is the individual factor. The third part is the ``social'' part, which
represents the collaboration amongst the particles, where C2 is the societal factor. The
acceleration coefficients C1 and C2 are constants that represent the weighting of the stochastic
acceleration terms pulling each particle toward the pbest and gbest positions. Therefore,
adjusting these acceleration coefficients changes the amount of `tension' in the system. In
the original algorithm, the value of (C1 + C2) is usually limited to 4 [14]. Particles' velocities
are restricted to a maximum velocity, Vmax. If Vmax is too small, particles could
become trapped in local optima. In contrast, if Vmax is too high, particles might fly past good
solutions. According to Equation 1, the particle's new velocity is calculated from its
previous velocity and the distances of its current position from its own best experience and the
group's best experience. The particle then flies toward a new position according to
Equation 2. The performance of each particle is measured by a pre-defined fitness
function.
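To make the mechanics concrete, the following is a minimal binary-PSO feature-selection sketch in Python. It is an illustration under assumed parameters, not the authors' Weka setup: positions are bit masks over the attributes, velocities follow Equation 1, a sigmoid transfer binarizes positions, and the toy fitness function stands in for a cross-validated classifier score.

```python
import numpy as np

def binary_pso(fitness, n_features, n_particles=20, n_iters=20,
               w=0.33, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_particles, n_features))   # bit-mask positions
    V = rng.uniform(-1.0, 1.0, size=(n_particles, n_features))
    pbest = X.copy()
    pbest_fit = np.array([fitness(x) for x in X])
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)  # Eq. 1
        V = np.clip(V, -v_max, v_max)                              # velocity limit
        # Binary PSO: a bit turns on with probability sigmoid(velocity).
        X = (rng.random(X.shape) < 1.0 / (1.0 + np.exp(-V))).astype(int)
        fit = np.array([fitness(x) for x in X])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = X[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return gbest, pbest_fit.max()

# Toy fitness: prefer masks close to a hidden "ideal" subset (a stand-in for
# a cross-validated accuracy score over the selected columns).
ideal = np.zeros(30)
ideal[[1, 12, 25, 26, 27, 28]] = 1
mask, score = binary_pso(lambda x: -float(np.abs(x - ideal).sum()), 30)
print("selected features:", np.flatnonzero(mask))
```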
2.2 Information Gain Ratio (IGR) attribute evaluation
The IGR measure was developed by Quinlan [15] within the C4.5 algorithm, based on
the Shannon entropy, to select the test attribute at each node of the decision tree. It represents
how precisely an attribute predicts the classes of the test dataset, so that the `best'
attribute can be used as the root of the decision tree.
The expected IGR needed to classify a given sample s from a set of data samples C, denoted
IGR(s,C), is calculated as follows:

$$IGR(s,C) = \frac{gain(s,C)}{split\_info(C)} \qquad (4)$$

with

$$gain(s,C) = entropy(s,C) - entropy_p(s,C),$$

$$entropy(s,C) = -p(s,C)\log_2 p(s,C) - (1 - p(s,C))\log_2 (1 - p(s,C)), \quad p(s,C) = freq(s,C)/|C|,$$

$$entropy_p(s,C) = \sum_i \frac{|C_i|}{|C|}\, entropy(s,C_i), \qquad split\_info(C) = -\sum_i \frac{|C_i|}{|C|}\log\frac{|C_i|}{|C|},$$

where freq(s,C), $C_i$ and $|C_i|$ are the frequency of the sample s in C, the ith class of C and the
number of samples in $C_i$, respectively.
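As a concrete reference, here is a small Python sketch of the standard attribute-level gain-ratio computation that Eq. 4 instantiates; the helper names and array inputs are assumptions, not part of the original study.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, labels):
    """Information gain of `attribute` divided by its split information."""
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[attribute == v])
               for v, w in zip(values, weights))     # conditional entropy
    split_info = float(-(weights * np.log2(weights)).sum())
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0

# e.g. gain_ratio(goout_column, alcoholic_labels)  # hypothetical arrays
```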
2.3 Symmetrical Uncertainty
The symmetric uncertainty correlation-based measure (SU) can be used to evaluate the goodness
of features by calculating the correlation between each feature and the target class [16]. Features
with a greater SU value receive higher importance. SU is defined as

$$SU(X,Y) = \frac{2\, IG(X|Y)}{H(X) + H(Y)}, \qquad (5)$$

with

$$IG(X|Y) = H(X) - H(X|Y),$$

where H(X), H(Y) and H(X|Y) are the entropy of X, the entropy of Y and the conditional entropy
of X given Y, and IG is the information gain.
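Reusing the entropy helper from the previous sketch, Eq. 5 translates directly to code; again this is a sketch, assuming discrete-valued inputs.

```python
import numpy as np

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG(X|Y) = H(X) - H(X|Y)."""
    hx, hy = entropy(x), entropy(y)
    values, counts = np.unique(y, return_counts=True)
    h_x_given_y = sum(c / len(y) * entropy(x[y == v])
                      for v, c in zip(values, counts))  # H(X|Y)
    ig = hx - h_x_given_y
    return 2.0 * ig / (hx + hy) if (hx + hy) > 0 else 0.0
```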
2.4 Genetic Algorithms (GAs)
The basic idea behind evolutionary algorithms (EAs) is derived from the theory of biological evolution
developed by Charles Darwin and others. EAs have been used as computational models and as adaptive
search strategies for solving optimization problems. Genetic algorithms, developed in 1975 by
Holland, are a class of EAs [17]. A GA evolves a population of artificial organisms, or so-called
agents. Every agent comprises a genotype, often called a binary string or chromosome, which
encodes a solution to the problem at hand, and a phenotype, which is the solution itself. At the start,
the population of agents is randomly generated, representing candidate solutions to the problem.
The GA implementation relies on an appropriate formulation of the fitness function; here, its main
objective is to maximize the recognition rate. Every agent is
evaluated in each iteration in order to produce new, fitter offspring and to replace the weaker
members of the last generation. Thus, the core of this class of evolutionary algorithms lies in selectively
breeding new genetic structures in the course of finding solutions to the problem at hand [18]. We
have adopted the algorithm described by Goldberg [19]. The flowchart of GA-based feature selection is
shown in Figure 2 below [20].
Figure 2 Feature Selection using GA [20]
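The loop in Figure 2 can be sketched as follows: a Goldberg-style GA with tournament selection, single-point crossover and bit-flip mutation. The rates are illustrative assumptions, since the paper does not report its GA settings.

```python
import numpy as np

def ga_select(fitness, n_features, pop_size=20, n_gens=30,
              crossover_rate=0.6, mutation_rate=0.033, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_features))  # chromosomes = bit masks
    for _ in range(n_gens):
        fit = np.array([fitness(c) for c in pop])
        # Tournament selection: keep the fitter of two random chromosomes.
        i, j = rng.integers(0, pop_size, (2, pop_size))
        parents = np.where((fit[i] > fit[j])[:, None], pop[i], pop[j])
        # Single-point crossover on consecutive pairs.
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n_features)
                children[k, cut:], children[k + 1, cut:] = \
                    parents[k + 1, cut:].copy(), parents[k, cut:].copy()
        # Bit-flip mutation.
        flip = rng.random(children.shape) < mutation_rate
        pop = np.where(flip, 1 - children, children)
    fit = np.array([fitness(c) for c in pop])
    return pop[np.argmax(fit)]  # best feature mask found
```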
2.5 Simple random sampling
Real-world databases usually suffer from class imbalance problems, in which one class is
represented by a considerably larger number of instances than the others. Consequently, classification
algorithms tend to ignore the minority classes. Simple random sampling has been advised as a good
means of increasing the sensitivity of the classifier to the minority class by rescaling the class distribution.
An empirical study in which the authors used twenty datasets from the UCI repository showed
quantitatively that classifier accuracy can be increased with a progressive sampling algorithm [21];
in that study, Weiss and Provost used decision trees to evaluate classification performance under
different sampling strategies. Another important study used sampling to rescale the class distribution,
focusing mainly on biomedical datasets [22]; its authors measured the effect of the suggested sampling strategy
using nearest neighbor and decision tree classifiers. In simple random sampling, a sample is
randomly selected from the population so that the obtained sample is representative of the population;
the technique therefore provides an unbiased sample of the original data.
There are two approaches to making the random selection. In the first, samples are selected with
replacement, so a sample can be selected repeatedly, each time with an equal chance of selection. In the
other, samples are selected without replacement, so that each sample in the dataset has
an equal chance of being selected but, once selected, cannot be chosen again [23].
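A per-class simple random sampling step might look as follows. This is a sketch: the data frame and column names are hypothetical, and drawing more rows than a class contains requires sampling with replacement.

```python
import pandas as pd

def srs_rescale(df, label_col, target_counts, replace=False, seed=0):
    """Draw target_counts[c] rows from each class c; without replacement,
    each row can be selected at most once."""
    parts = [df[df[label_col] == c].sample(n=n, replace=replace, random_state=seed)
             for c, n in target_counts.items()]
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows

# e.g. srs_rescale(students, "alcoholic", {0: 1677, 1: 411}, replace=True)
```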
3. Dataset and Evaluation Procedure
3.1 Dataset
The dataset used in this research was collected through a customized questionnaire and school
reports during the 2005-2006 academic year from two public schools in the Alentejo region of
Portugal [11]. The school reports included a few attributes, such as the three period grades and the
number of school absences. The researchers designed a questionnaire with closed questions
to extract further socio-demographic information expected to affect student performance,
including demographic data (e.g. mother's education, family income), social-emotional attributes
(e.g. alcohol consumption) (Pritchard and Wilson 2003) and academic learning attributes
(e.g. number of past class failures). The questionnaire was first reviewed by school professionals and tested
on a small set of 15 students in order to get feedback. Eventually, 788 students completed the
customized questionnaire. The dataset has 33 attributes, variables, or features for each student.
The academic status, or final student performance, has two possible values: Pass (G3 ≥ 10) or Fail.
To quantify alcohol consumption, there are two alcohol-related attributes: workday alcohol
consumption (D_alc) and weekend alcohol consumption (W_alc). The total alcohol consumption
of a specific student over a whole week was therefore estimated using the following formula [24]:

$$Alcohol\ consumption = (W\_alc \times 2 + D\_alc \times 5)/7 \qquad (6)$$

The new attribute varies between one and five. The dataset is divided into two classes
according to this alcohol consumption column, which is set to 1 if the alcohol consumption is
greater than 3 and to 0 otherwise. The 30 features are listed with descriptions in Table 1.
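As an illustration, the target attribute of Eq. 6 can be derived as follows; this sketch assumes the UCI file layout, where the workday and weekend attributes are named Dalc and Walc.

```python
import pandas as pd

students = pd.read_csv("student-por.csv", sep=";")          # hypothetical file path
weekly = (students["Walc"] * 2 + students["Dalc"] * 5) / 7  # Eq. 6, ranges over [1, 5]
students["alcoholic"] = (weekly > 3).astype(int)            # 1 = high consumption
print(students["alcoholic"].value_counts())
```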
Table 1: The dataset description of attributes [11]

No. | Attribute and description | Type | Possible values
1 | School - student's school | Binary | "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira
2 | Gender - student's gender | Binary | "F" - female or "M" - male
3 | Age - student's age | Numeric | from 15 to 22
4 | Address - student's home address type | Binary | "U" - urban or "R" - rural
5 | Famsize - family size | Binary | "LE3" - less or equal to 3 or "GT3" - greater than 3
6 | Pstatus - parent's cohabitation status | Binary | "T" - living together or "A" - apart
7 | Medu - mother's education | Numeric | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
8 | Fedu - father's education | Numeric | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
9 | Mjob - mother's job | Nominal | "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other"
10 | Fjob - father's job | Nominal | "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other"
11 | Reason - reason to choose this school | Nominal | close to "home", school "reputation", "course" preference or "other"
12 | Guardian - student's guardian | Nominal | "mother", "father" or "other"
13 | Traveltime - home to school travel time | Numeric | 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour
14 | Studytime - weekly study time | Numeric | 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours
15 | Failures - number of past class failures | Numeric | n if 1<=n<3, else 4
16 | Schoolsup - extra educational support | Binary | yes or no
17 | Famsup - family educational support | Binary | yes or no
18 | Paid - extra paid classes within the course subject | Binary | yes or no
19 | Activities - extra-curricular activities | Binary | yes or no
20 | Nursery - attended nursery school | Binary | yes or no
21 | Higher - wants to take higher education | Binary | yes or no
22 | Internet - Internet access at home | Binary | yes or no
23 | Romantic - with a romantic relationship | Binary | yes or no
24 | Famrel - quality of family relationships | Numeric | from 1 - very bad to 5 - excellent
25 | Freetime - free time after school | Numeric | from 1 - very low to 5 - very high
26 | Goout - going out with friends | Numeric | from 1 - very low to 5 - very high
27 | Health - current health status | Numeric | from 1 - very bad to 5 - very good
28 | Absences - number of school absences | Numeric | from 0 to 93
29 | G3 - final grade | Numeric | from 0 to 20, output target
30 | Alcohol consumption - target class | Binary | 1 = yes or 0 = no
3.2 Performance Analysis
The performance of the suggested technique was evaluated using three threshold- and rank-based
performance metrics: Accuracy, ROC and Cohen's kappa coefficient. The main
formulations are defined in Equations 7-9, based on the confusion matrix shown
in Table 2. In the confusion matrix of a two-class problem, TP is the number of true positives,
which in our case represents the Pass cases that were classified correctly. FN is the number of
false negatives, which represents the Pass cases that were incorrectly classified as Fail. TN is the
number of true negatives, which represents the Fail cases that were classified as Fail. FP is the
number of false positives, which represents the Fail cases that were incorrectly classified as Pass.
Table 2: The confusion matrix

Hypothesis | Classified Pass | Classified Fail
Hypothesis positive (Pass) | True Positive (TP) | False Negative (FN)
Hypothesis negative (Fail) | False Positive (FP) | True Negative (TN)
Consequently, we can define Precision as:

$$Precision = \frac{TP}{TP + FP} \times 100\% \qquad (7)$$
Precision measures how many of the points predicted as significant are in fact significant.
The Receiver Operating Characteristic (ROC) curve is another commonly used measure for evaluating
two-class decision problems in machine learning. It is a standard tool for
summarizing classifier performance over a range of trade-offs between TP and FP error rates
[25]. The area under the ROC curve usually takes values between 0.5 for random guessing and 1.0
for perfect classifier performance.
The Kappa error, or Cohen's kappa statistic, is another recommended measure for comparing the
performances of different classifiers and hence the quality of the selected features. The Kappa
error value lies in [-1, 1]; the closer the value calculated for a classifier is to 1, the more realistic
the classifier's performance is assumed to be [26]. The Kappa error
measure can be calculated using the following formula:

$$Kappa\ error = \frac{P(A) - P(E)}{1 - P(E)} \qquad (8)$$

where P(A) is the total agreement probability and P(E) is the hypothetical probability of chance
agreement.
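For concreteness, Eqs. 7 and 8 compute as follows from raw confusion-matrix counts; the counts below are illustrative, not taken from the experiments.

```python
def precision(tp, fp):
    return tp / (tp + fp) * 100.0                       # Eq. 7

def cohen_kappa(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p_a = (tp + tn) / n                                 # total agreement P(A)
    # Chance agreement P(E): product of row and column marginals per class.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return (p_a - p_e) / (1 - p_e)                      # Eq. 8

print(precision(90, 10))            # 90.0
print(cohen_kappa(90, 10, 10, 90))  # 0.8
```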
In order to obtain reliable estimates of the classification accuracy on each task, every
experiment was performed using 10-fold cross-validation. Cross-validation is a method
designed for estimating the generalization error based on "resampling" [27], and it allows the
whole dataset to be used for both training and testing. In the k-fold cross-validation
procedure, the dataset is partitioned randomly into k parts (folds) of approximately equal size
and the model is trained k times, each time leaving one of the folds out of the training process
and using only the omitted fold to compute the error criterion. The average error across all
k trials is then taken as the mean error rate, defined as:

$$E = \frac{1}{k}\sum_{i=1}^{k} e_i \qquad (9)$$

where $e_i$ is the error rate of the ith experiment.

Figure 3 Data partitioning using k-fold cross-validation.
The whole dataset is divided into K folds; as Figure 3 illustrates, one fold (k = 3 in this
example) is set aside to validate the model while the remaining K-1 folds are used for training,
and the entire procedure is repeated for each of the K folds. A number of studies have found
that a value of 10 for k leads to adequate and accurate classification results [28].
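The experiments in this paper run inside Weka; as a rough stand-in, the same 10-fold protocol and the mean error rate of Eq. 9 look like this in scikit-learn (synthetic data and assumed settings).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)  # stand-in data
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10) # 10 folds
print("mean error rate:", 1.0 - acc.mean())                                # Eq. 9
```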
4. Results and discussion
Predicting student alcohol consumption is a challenging classification application of data
mining, whose basic idea is to extract hidden knowledge from data. The system suggested in
this study for predicting student alcohol consumption is carried out in three major phases. The
process starts by applying simple random sampling to rescale the imbalanced class
distribution. In the second phase, the feature space is searched to reduce the number of features
and prepare the conditions for the next step; this task is carried out using four feature selection
techniques, namely the IGR, SU, GA and PSO algorithms, and at the end of this step a subset of
features is chosen for the next round. Afterwards, the selected features are used as the inputs to
the classifiers. As mentioned previously, five classifiers are used to estimate the class
possibility: MLP, Simple Logistic, Random Forest, the C4.5 decision tree and FURIA.
All the experiments were carried out in the Waikato Environment for Knowledge Analysis (Weka),
a popular suite of data mining algorithms written in Java. The Random Forest ensemble classifier
was built with 150 trees and 10 random features per tree, while the C4.5 classifier
was applied with a confidence factor for pruning of 0.25 and a minimum of 2 instances
per leaf. The suggested algorithm is trained using a 10-fold cross-validation strategy to
evaluate the classification accuracy on the dataset. The PSO feature selection was
applied with a population size of 20, 20 iterations, an individual weight of 0.34 and an
inertia weight of 0.33.
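For readers without Weka, a rough scikit-learn analogue of these settings is sketched below. It is an approximation: C4.5's confidence-factor pruning has no exact sklearn equivalent, so an entropy-based tree with a 2-instance leaf minimum stands in.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(n_estimators=150, max_features=10, random_state=0)
c45_like = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2,
                                  random_state=0)
scoring = {"accuracy": "accuracy", "roc": "roc_auc",
           "kappa": make_scorer(cohen_kappa_score)}
# results = cross_validate(rf, X, y, cv=10, scoring=scoring)  # X, y: prepared dataset
```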
Table 3 depicts the class distribution before and after applying the simple random
sampling technique. The unbalanced distribution of the two classes makes this dataset suitable
for testing the effect of the simple random sampling strategy, and we therefore used simple random
sampling to rescale the class distribution of the dataset.
The experimental results of the multiple classifiers before and after applying SRS are
shown in Table 4. The best overall performance is associated with the Random Forest classifier,
with an accuracy of 92.2%, ROC of 94.5% and Kappa value of 70.2% before applying SRS, and an
accuracy of 98.5%, ROC of 99.7% and Kappa value of 95.2% after applying SRS, in both cases
with all features. As for the classifiers used to perform predictions based on the extracted
features, we observed a clear difference in performance, which underlines the importance of
the SRS step.
Table 3: Class distribution of the student dataset before and after SRS

Index | Class | Before SRS | After SRS
1 | Alcoholic | 188 | 411
2 | Not Alcoholic | 856 | 1677
Table 4: Performance measures of selected classifiers before feature selection

Classifier | Performance index | Before SRS | After SRS
MLP | Accuracy | 0.846 | 0.929
MLP | ROC | 0.829 | 0.776
MLP | Kappa | 0.479 | 0.776
Simple Logistic | Accuracy | 0.830 | 0.861
Simple Logistic | ROC | 0.828 | 0.874
Simple Logistic | Kappa | 0.370 | 0.523
Random Forest | Accuracy | 0.922 | 0.985
Random Forest | ROC | 0.945 | 0.997
Random Forest | Kappa | 0.702 | 0.952
C4.5 | Accuracy | 0.828 | 0.946
C4.5 | ROC | 0.732 | 0.933
C4.5 | Kappa | 0.412 | 0.829
FURIA | Accuracy | 0.868 | 0.967
FURIA | ROC | 0.824 | 0.967
FURIA | Kappa | 0.476 | 0.885
The second phase involves searching the feature vector to reduce the number of features and prepare
the conditions for the next step. This task is carried out using four feature selection techniques:
the IGR, SU, GA and PSO algorithms. The optimal features of these techniques are summarized
in Table 5. As can be noted from Table 5, the dimensionality of the features is noticeably reduced,
so less storage space is required for the execution of the classification algorithms. This step
helped in reducing the size of the dataset to only 6 to 15 attributes.
Table 5: Selected features of the student dataset

FS technique | Number of selected features | Selected features
IGR | 13 | 2, 10, 11, 13, 14, 15, 17, 21, 25, 26, 27, 28, 29
SU | 15 | 2, 5, 10, 11, 13, 14, 15, 17, 20, 21, 25, 26, 27, 28, 29
GA | 7 | 2, 13, 25, 26, 27, 28, 29
PSO | 6 | 2, 13, 26, 27, 28, 29
The experimental results of the multiple classifiers with the reduced number of features are
shown in Table 6. The highest performance for the IGR-based feature selection technique is
associated with the Random Forest classifier, with 97.9%, 98.7% and 93.3% for Accuracy,
ROC and Kappa, respectively, using 13 features. The highest performance for the SU-based
technique is also associated with the Random Forest classifier, with 99.5%, 99.7% and 98.2%
for Accuracy, ROC and Kappa, respectively, using 15 features. The highest performance for
the GA-based technique is associated with the Random Forest classifier, with 96.2%, 97.2%
and 88.1% for Accuracy, ROC and Kappa, respectively, using 7 features. The highest
performance for the PSO-based technique is associated with the Random Forest classifier,
with 95.1%, 97% and 84.5% for Accuracy, ROC and Kappa, respectively, using only 6 features.
Table 6: Performance measures of selected features after SRS

Classifier | Performance index | IGR | SU | GA | PSO
MLP | Accuracy | 0.888 | 0.915 | 0.861 | 0.855
MLP | ROC | 0.842 | 0.873 | 0.839 | 0.837
MLP | Kappa | 0.641 | 0.716 | 0.537 | 0.521
Simple Logistic | Accuracy | 0.845 | 0.857 | 0.840 | 0.841
Simple Logistic | ROC | 0.861 | 0.874 | 0.838 | 0.839
Simple Logistic | Kappa | 0.476 | 0.491 | 0.456 | 0.461
Random Forest | Accuracy | 0.979 | 0.995 | 0.962 | 0.951
Random Forest | ROC | 0.987 | 0.997 | 0.972 | 0.970
Random Forest | Kappa | 0.933 | 0.982 | 0.881 | 0.845
C4.5 | Accuracy | 0.936 | 0.984 | 0.925 | 0.906
C4.5 | ROC | 0.932 | 0.986 | 0.919 | 0.900
C4.5 | Kappa | 0.797 | 0.947 | 0.761 | 0.698
FURIA | Accuracy | 0.962 | 0.961 | 0.911 | 0.890
FURIA | ROC | 0.970 | 0.969 | 0.913 | 0.848
FURIA | Kappa | 0.8752 | 0.8718 | 0.7046 | 0.625
Figure 4 visualizes the feature selection agreements between the IGR, SU, GA and PSO
models. The Venn diagram shows that the suggested models share six features: student's gender,
home to school travel time, going out with friends, current health status, number of school
absences and final grade, all of which were selected by the PSO model.
Figure 4 PSO feature selection agreement in the student alcohol consumption dataset
PSO is a well-known optimization method with a strong search capability that is often
used for fine-tuning the feature space. Our proposed technique based on PSO succeeded in
significantly improving the classification performance with a limited number of features
compared to the other techniques. The PSO-selected features demonstrated
accuracies between 84.1% and 95.1% in the various DM models, which is quite a high performance
for predicting student performance [29]. Deploying PSO in feature selection therefore clearly
helped in reducing the size of the dataset from 33 to only 6 attributes. It is worth noting that as the
number of features is reduced, less storage space and classification complexity are
required. Moreover, the results demonstrated that these features are adequate to represent the
dataset's class information. The outcomes of the suggested feature selection techniques show
better results compared to the unprocessed dataset and also to the cases where these attribute
selection techniques are used independently. As can be seen from the above results, the proposed
technique based on PSO has produced very promising results in classifying the dataset for
predicting student alcohol consumption.
As we can comprehend from the graphs and tables, there are significant gender differences
in drinking habits. Compared to men, women are more likely to be responsive to health
concerns and are less likely to engage in risky health behaviours [10,11]. Commonly, men
smoke and drink more than women across different societies and cultures, and women have a higher
expectation of self-control than men do [12,13]. Gender can interact with other features, such as
free time with friends, high travel time between school and home and the number of school
absences, and eventually with poor academic performance. Our study shows that drinking is the
product of many factors working together, which suggests that educational professionals can
consider these features for further analysis in the future.
5. CONCLUSION
Underage drinking, or adolescent alcohol misuse, is a major public health concern. The
proposed machine learning approach could improve accuracy and achieve
promising results in identifying risk or protective factors for high-risk drinking, which can be used
to help detect and address the early developmental signs of alcohol abuse and dependence
in adolescent students. The experimental results have shown that PSO helped in reducing
the feature space, whereas adjusting the original data with simple random sampling helped in
enlarging the region of the minority class, in favour of handling the imbalanced nature of
the data.
ACKNOWLEDGEMENT
The authors gratefully acknowledge the UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml] and Mr. Paulo Cortez [11] for providing the
student performance dataset.
REFERENCES
1. World Health Organization Management of Substance Abuse Team. Global Status Report on Alcohol and Health. World Health
Organization, Geneva, Switzerland; 2011:85.
2. Grant, B.F. Estimates of U.S. children exposed to alcohol abuse and dependence in the family. American Journal of Public
Health 90(1):112–115, 2000.
3. Aertgeerts B, Buntinx F. The relation between alcohol abuse or dependence and academic performance in first-year college
students. J Adolesc Health. 2002;31:223–5.
4. Berkowitz AD, Perkins HW. Problem drinking among college students: A review of recent research. J Am Coll
Health. 1986;35:21–8.
5. Martinez JA, Sher KJ, Wood PK. Is heavy drinking really associated with attrition from college? The alcohol-attrition
paradox. Psychol Addict Behav. 2008;22:450–6.
6. Crutzen, R., P.J. Giabbanelli, A. Jander, L. Mercken and H. de Vries, 2015. Identifying binge drinkers based on parenting
dimensions and alcohol-specific parenting practices: Building classifiers on adolescent-parent paired data. BMC Public Health,
15(1): 747.
7. Montaño, J.J., E. Gervilla, B. Cajal and A. Palmer, 2014. Data mining classification techniques: An application to tobacco
consumption in teenagers. An. Psicol., 30(2): 633-641.
8. Pang, R., A. Baretto, H. Kautz and J. Luo, 2015. Monitoring adolescent alcohol use via multimodal analysis in social multimedia.
Proceeding of the IEEE International Conference on Big Data (Big Data), pp: 1509-1518.
9. Bi, J., J. Sun, Y. Wu, H. Tennen and S. Armeli, 2013. A machine learning approach to college drinking prediction and risk factor
identification. ACM T. Intell. Syst. Technol., 4(4).
10. Zuba, M., J. Gilbert, Y. Wu, J. Bi, H. Tennen and S. Armeli, 2012. 1-norm support vector machine for college drinking risk
factor identification. Proceeding of the 2nd ACM SIGHIT International Health Informatics Symposium, pp: 651-660.
11. Cortez, P. and Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance. In Proceedings of 5th
Future Business Technology Conference. pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
12. Guyon, I., Gunn, S., Nikravesh, M. and Zadeh, L.A. (2006). Feature Extraction, Foundations and Applications. Springer, Berlin.
13. Kennedy, J. and Eberhart, R. (2001). Swarm intelligence. Morgan Kaufmann.
14. Kennedy, J. and Eberhart, R. (1997). A discrete binary version of the particle swarm algorithm. Proceedings of the IEEE
International Conference on Systems, Man, and Cybernetics, Vol.5, pp. 4104–4108.
15. Quinlan J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
16. Fayyad, U., and Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning.
Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1022–1027). Morgan Kaufmann.
17. Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.
18. Haupt, R.L. and Haupt, S.E. (1998). Practical Genetic Algorithms. John Wiley and Sons.
19. David E. Goldberg (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.
20. Hall, Mark A. Correlation-Based Feature Selection for Machine Learning, 1999.
21. G. Weiss and F. Provost, "Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction," J.
Artificial Intelligence Research, vol.19,315-354,2003.
22. Park, B.-H., Ostrouchov, G., Samatova, N.F., Geist, A.: Reservoir-based random sampling with replacement from data stream.
In: SDM 2004 , 492-496, (2004)
23. Mitra SK and Pathak PK. The nature of simple random sampling. Ann. Statist., 1984, 12:1536-1542.
24. Pagnotta, F. and H.M. Amran, 2016. Using data mining to predict secondary school student alcohol consumption. Department
of Computer Science, University of Camerino.
25. Fawcett, T. and Provost, F. (1996). Combining data mining and machine learning for effective user profiling. In Proceedings
of KDD-96, 8-13. Menlo Park, CA: AAAI Press.
26. Fleiss, J.L. and Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of
reliability. Educational and Psychological Measurement 33: 613–619.
27. Devijver P.A., and Kittler J. (1982). Pattern Recognition: A Statistical Approach. London, GB: Prentice-Hall.
28. Gupta G.K. (2006). Introduction to Data Mining with Case Studies. Prentice-Hall of India.
29. Liao, S., Chu, P. and Hsiao, P. (2012). Data mining techniques and applications: A decade review from 2000 to 2011. Expert
Systems with Applications, 39.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 16, No. 4, April 2018
42 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

PDF
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
PDF
Novel modelling of clustering for enhanced classification performance on gene...
PPTX
Predicting Response Mode Preferences of Survey Respondents
PDF
Predicting Success : An Application of Data Mining Techniques to Student Outc...
PPT
Advances in computer aided drug design
PDF
A comprehensive study on disease risk predictions in machine learning
PDF
Doctor onlineneeds
PPT
Peter Embi's 2011 AMIA CRI Year-in-Review
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
Novel modelling of clustering for enhanced classification performance on gene...
Predicting Response Mode Preferences of Survey Respondents
Predicting Success : An Application of Data Mining Techniques to Student Outc...
Advances in computer aided drug design
A comprehensive study on disease risk predictions in machine learning
Doctor onlineneeds
Peter Embi's 2011 AMIA CRI Year-in-Review

What's hot (16)

PDF
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
PPT
Drug discovery Using Bioinformatic
PDF
Data Mining for Education. Ryan S.J.d. Baker, Carnegie Mellon University
PPTX
Environmental scanning
PPTX
Environmental scanning
PDF
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
PDF
Identifying Structures in Social Conversations in NSCLC Patients through the ...
PDF
Classification with No Direct Discrimination
PPT
Mazda Presentation Topic
PDF
INVESTIGATING THE DETERMINANTS OF COLLEGE STUDENTS INFORMATION SECURITY BEHAV...
PDF
A Survey on Classification of Feature Selection Strategies
PPTX
Role of bioinformatics in drug designing
PDF
A conceptual design of analytical hierachical process model to the boko haram...
DOC
research proposal 2
PDF
Choosing Optimization Process In The Event Of Fligh Plan Interruption With Th...
PDF
59832462
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Drug discovery Using Bioinformatic
Data Mining for Education. Ryan S.J.d. Baker, Carnegie Mellon University
Environmental scanning
Environmental scanning
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSION
Identifying Structures in Social Conversations in NSCLC Patients through the ...
Classification with No Direct Discrimination
Mazda Presentation Topic
INVESTIGATING THE DETERMINANTS OF COLLEGE STUDENTS INFORMATION SECURITY BEHAV...
A Survey on Classification of Feature Selection Strategies
Role of bioinformatics in drug designing
A conceptual design of analytical hierachical process model to the boko haram...
research proposal 2
Choosing Optimization Process In The Event Of Fligh Plan Interruption With Th...
59832462
Ad

Similar to Student Alcohol Consumption Prediction: Data Mining Approach (20)

PDF
DIAGNOSIS OF OBESITY LEVEL BASED ON BAGGING ENSEMBLE CLASSIFIER AND FEATURE S...
PDF
A Review Of Electronic Interventions For Prevention And Treatment Of Overweig...
PDF
Scoring and predicting risk preferences
PDF
CLASS IMBALANCE HANDLING TECHNIQUES USED IN DEPRESSION PREDICTION AND DETECTION
PDF
Class Imbalance Handling Techniques used in Depression Prediction and Detection
PDF
Our kids and the digital utilities
PDF
Scoring and Predicting Risk Preferences
DOCX
METHODS1Sampling and MethodologyStuden
PDF
Protocol on burnout
PPTX
AI Trends in healthcare | Bio Medical Eng
PPTX
AI Trends in healthcare ppt for bio medial
PPTX
AI Trends in healthcare PPT for bio medical eng
PDF
Intelligent analysis of the effect of internet
DOCX
Review Journal 1A simplified mathematical-computational model of .docx
PDF
Qualitative Data Analysis Paper
DOC
Proposal-Example-4.doc
DOC
Proposal-Example-4.doc
PDF
Diabetes Mellitus Prediction System Using Data Mining
PDF
Toxicokinetics and Risk Assessment 1st Edition C. Lipscomb John
DOCX
Evaluation of the Bully-Proofing Your School Program in Colo.docx
DIAGNOSIS OF OBESITY LEVEL BASED ON BAGGING ENSEMBLE CLASSIFIER AND FEATURE S...
A Review Of Electronic Interventions For Prevention And Treatment Of Overweig...
Scoring and predicting risk preferences
CLASS IMBALANCE HANDLING TECHNIQUES USED IN DEPRESSION PREDICTION AND DETECTION
Class Imbalance Handling Techniques used in Depression Prediction and Detection
Our kids and the digital utilities
Scoring and Predicting Risk Preferences
METHODS1Sampling and MethodologyStuden
Protocol on burnout
AI Trends in healthcare | Bio Medical Eng
AI Trends in healthcare ppt for bio medial
AI Trends in healthcare PPT for bio medical eng
Intelligent analysis of the effect of internet
Review Journal 1A simplified mathematical-computational model of .docx
Qualitative Data Analysis Paper
Proposal-Example-4.doc
Proposal-Example-4.doc
Diabetes Mellitus Prediction System Using Data Mining
Toxicokinetics and Risk Assessment 1st Edition C. Lipscomb John
Evaluation of the Bully-Proofing Your School Program in Colo.docx
Ad

Recently uploaded (20)

PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Electronic commerce courselecture one. Pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Big Data Technologies - Introduction.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Machine learning based COVID-19 study performance prediction
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Approach and Philosophy of On baking technology
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Encapsulation theory and applications.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
KodekX | Application Modernization Development
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
“AI and Expert System Decision Support & Business Intelligence Systems”
Electronic commerce courselecture one. Pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
MYSQL Presentation for SQL database connectivity
Reach Out and Touch Someone: Haptics and Empathic Computing
Big Data Technologies - Introduction.pptx
The AUB Centre for AI in Media Proposal.docx
Machine learning based COVID-19 study performance prediction
Digital-Transformation-Roadmap-for-Companies.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Chapter 3 Spatial Domain Image Processing.pdf
Approach and Philosophy of On baking technology
20250228 LYD VKU AI Blended-Learning.pptx
Encapsulation theory and applications.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
KodekX | Application Modernization Development
MIND Revenue Release Quarter 2 2025 Press Release
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf

Student Alcohol Consumption Prediction: Data Mining Approach

  • 1. Student Alcohol Consumption Prediction: Data Mining Approach 1 Hind Almayyan, 2 Waheeda Almayyan 1 Computer Department, Institute of Sectary Studies, PAAET, Kuwait hi.almayyan@paaet.edu.kw 2 Computer Information Department, Collage of Business Studies, PAAET, Kuwait wi.almayyan@paaet.edu.kw Abstract Alcohol consumption in higher education institutes is not a new problem; but excessive drinking by underage students is a serious health concern. Excessive drinking among students is associated with a number of life-threatening consequences that include serious injuries; alcohol poisoning; temporary loss of consciousness; academic failure; violence, unplanned pregnancy; sexually transmitted diseases, troubles with authorities, property damage; and vocational and criminal consequences that could jeopardize future job prospects. This article describes a learning technique to improve the efficiency of academic performance in the educational institutions for students who consume alcohol. This move can help in identifying the students who need special advising or counselling to understand the danger of consuming alcohol. This was carried out in two major phases: feature selection which aims at constructing diverse feature selection algorithms such as Gain Ratio attribute evaluation, Correlation based Feature Selection, Symmetrical Uncertainty and Particle Swarm Optimization Algorithms. Afterwards, a subset of features is chosen for the classification phase. Next, several machine-learning classification methods are chosen to estimate the teenager’s alcohol addiction possibility. Experimental results demonstrated that the proposed approach could improve the accuracy performance and achieve promising results with a limited number of features. Keywords Data mining; Data mining; Classification; Student’s performance; Feature selection; Particle swarm optimization; Alcohol consumption prediction. 1. INTRODUCTION Globally, heavy alcohol drinking is associated with premature death, weaker probability of employment, more absence from work, in addition to lost productivity and lower wages. Moreover, alcohol consumption results in approximately 3.3 million deaths each year [1]. It is the third largest risk factor for alcohol-related hospitalizations, deaths and disability in the world. Approximately one in four children younger than 18 years old in the United States is exposed to alcohol abuse or alcohol dependence in the family [2]. Alcohol consumption has consequences for the health and well-being of those who drink and, by extension the lives of those around them. The relationship between problematic alcohol consumption and academic performance is a concern for decision makers in education. [3] Alcohol consumption has been negatively associated with poor academic performance, [4] and heavy drinking has been proposed as a probable contributor to student attrition from school. [5] International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 4, April 2018 29 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 2. Traditional methods for monitoring adolescent alcohol consumption are based on surveys, which have many limitations and are difficult to scale. Therefore, several approaches have been investigated using conventional and artificial intelligence techniques in order to evaluate the teenage alcohol consumption. In Crutzen et al. [6] a group of Dutch researchers studied the association between parental reports, teenager perception and parenting practices to identify binge drinkers. They designed a binary classifier using alternating decision trees to establish the effectiveness of the results of exploring nonlinear relationships of data. Montaño et al. [7] proposed an analysis of psychosocial and personality variables about nicotine consumption in teenagers. They applied several classification techniques such as RNA Multi-layer perceptron, radial basis functions and probabilistic networks, decision trees, logistic regression model and discriminant analysis. They discriminated successfully 78.20% of the subjects, which indicates that this approach can be used to predict and prevent similar addictive behavior. Pang et al. [8] applies a multimodal study to identify alcohol consumption in an audience of minors, specifically the users of the Instagram social network. The analysis is based on facial recognition of selfie photos and exploring the tags assigned to each image with the objective of finding consumption patterns in terms of time, frequency and location. In the same way, they measured the penetration of alcohol brands to establish their influence in the consumption behavior of their followers. Experimental results were satisfactory and compliant with the polls made in the same audience, which can lead to use this approach to other domains of public health. In Bi et al. [9], a study using two machine learning methods to identify effectively the daily dynamic alcohol consumption and the risk factors associated to it. For this, they proposed a Support Vector Machine (SVM) as classifier to establish a function for stress, state of mind and consumption expectancy, differentiating drinking patterns. After that, a fusion between clustering analysis and feature classification was made to identify consumption patterns based on daily behavior of average intake and detect risk factors associated to each pattern. Zuba et al. [10] proposed machine learning approach that use a feature selection method with 1-norm support vector machines (SVM) to help classify college students between high risk and low risk alcohol drinkers and the risk factors associated to the heavy drinkers. This approach could be used to help to detect early signs of addiction and dependence to alcohol in students. In this article, we are addressing the prediction of teenager’s alcohol addiction by using past school records, demographic, family and other data related to student. This article extends the research conducted by Cortez and Silvain in 2008 [11]. This study seeks to establish the correlation between poor academic performance and the use of alcohols among teenagers. We applied several data mining tools and ends of evaluation shows potential of better results. This article suggests a new classification technique that enhances the student performance prediction using less number of attributes then the ones used in the original research. The aim is getting better prediction results using less parameters in the process. The article starts the suggested approach is presented in Section 3. 
Section 4 describes the experiment steps and the involved dataset. Section 5 shows the experiment result. The article concluded with conclusion and further research plan. International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 4, April 2018 30 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 3. 2. THE PROPOSED APPROACH Initially, several machine-learning classification methods, which are considered very robust in solving non-linear problems, are chosen to estimate the class possibility. These methods include feed-forward artificial Neural Network with MLP, Simple Logistic multinomial logistic model, Rotation Forest, Random Forest ensemble learning methods and C4.5 decision tree and Fuzzy Unordered Rule Induction Algorithm (FURIA) classifiers. We carried out extensive experimentation to prove the worth of the proposed approach. We analyze the results of the dataset from each of the perspectives of, Accuracy, ROC and Cohen's kappa coefficient. Feature extraction has played a significant role in many classification systems [12]. On this basis, the focus of this section is on the applied feature selection techniques. Figure 1 The proposed methodology 2.1 Particle swarm optimization (PSO) The PSO technique is a population-based stochastic optimization technique first introduced in 1995 by Kennedy and Eberhart [13]. In PSO, a possible candidate solution is encoded as a finite-length string called a particle pi in the search space. All of the particles make use of its own memory and knowledge gained by the swarm as a whole to find the best solution. With the purpose of discovering the optimal solution, each particle adjusts its searching direction according to two features, its own best previous experience (pbest) and the best experience of its companions flying experience (gbest). Each particle is moving around the n-dimensional search Full Dataset Set Feature Selection Procedure IGR CFS PSO Selected Features Classification Procedure MLP Simple Logistic Rotation Forest Performance Measures Accuracy ROC Kappa Rotation Forest C4.5 SU International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 4, April 2018 31 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 4. space with objective function . Each particle has a position (t represents the iteration counter), a fitness function and ‘‘flies’’ through the problem space with a velocity . A new position is called better than iff . Particles evolve simultaneously based on knowledge shared with neighbouring particles; they make use of their own memory and knowledge gained by the swarm as a whole to find the best solution. The best search space position particle i has visited until iteration t is its previous experience pbest. To each particle, a subset of all particles is assigned as its neighbourhood. The best previous experience of all neighbours of particle i is called gbest. Each particle additionally keeps a fraction of its old velocity. The particle updates its velocity and position with the following equation in continuous PSO [14]: 1 2 The first part in Equation 1 represents the previous flying velocity of the particle. While the second part represents the ‘‘cognition” part, which is the private thinking of the particle itself, where C1 is the individual factor. The third part of the equation is the ‘‘social” part, which represents the collaboration amongst the particles, where C2 is the societal factor. The acceleration coefficients (C1) and (C2) are constants represent the weighting of the stochastic acceleration terms that pull each particle toward the pbest and gbest positions. Therefore, the adjustment of these acceleration coefficients changes the amount of ‘tension’ in the system. In the original algorithm, the value of (C1 + C2) is usually limited to 4 [14]. Particles’ velocities are restricted to a maximum velocity, Vmax. If Vmax is too small, particles in this case could become trapped in local optima. In contrast, if Vmax is too high particles might fly past fine solutions. According to Equation 1, the particle’s new velocity is calculated according to its previous velocity and the distances of its current position from its own best experience and the group’s best experience. Afterwards, the particle flies toward a new position according to Equation 2. The performance of each particle is measured according to a pre-defined fitness function. 2.2 Information Gain Ratio (IGR) attribute evaluation IGR measure was generally developed by Quinlan [15] within the C4.5 algorithm and based on the Shannon entropy to select the test attribute at each node of the decision tree. It represents how precisely the attributes predict the classes of the test dataset in order to use the ‘best’ attribute as the root of the decision tree. S ®ÂÍ n Sf : tix , )( ,tixf tiv , Sz Î1 Sz Î2 )()( 21 zfzf < )(*()*)(*()** 2211 old pdd old pdpd old pd new pd xgbestrandCxpbestrandCvv d -+-+= w new pd old pd new pd vxx += International Journal of Computer Science and Information Security (IJCSIS), Vol. 16, No. 4, April 2018 32 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
2.2 Information Gain Ratio (IGR) attribute evaluation

The IGR measure was originally developed by Quinlan [15] within the C4.5 algorithm; it is based on the Shannon entropy and is used to select the test attribute at each node of the decision tree. It expresses how precisely an attribute predicts the classes of the test dataset, so that the 'best' attribute can be used as the root of the decision tree. The expected IGR needed to classify a given sample $s$ from a set of data samples $C$, $IGR(s,C)$, is calculated as follows:

$$IGR(s,C) = \frac{gain(s,C)}{split\_info(s,C)} \qquad (4)$$

where

$$gain(s,C) = entropy(s,C) - entropy_p(s,C),$$
$$entropy(s,C) = -p(s,C)\log_2 p(s,C) - (1 - p(s,C))\log_2 (1 - p(s,C)),$$
$$p(s,C) = freq(s,C)/|C|,$$
$$entropy_p(s,C) = \sum_i \frac{|C_i|}{|C|}\, entropy(s,C_i),$$
$$split\_info(s,C) = -\sum_i \frac{|C_i|}{|C|} \log_2 \frac{|C_i|}{|C|},$$

and $freq(s,C)$, $C_i$ and $|C_i|$ are the frequency of the sample $s$ in $C$, the $i$th class of $C$ and the number of samples in $C_i$, respectively.

2.3 Symmetrical Uncertainty

The symmetrical uncertainty correlation-based measure (SU) evaluates the goodness of a feature by measuring its correlation with the target class [16]; features with a greater SU value receive higher importance. SU is defined as

$$SU(X,Y) = \frac{2\,IG(X|Y)}{H(X) + H(Y)}, \qquad IG(X|Y) = H(X) - H(X|Y) \qquad (5)$$

where $H(X)$ and $H(Y)$ are the entropies of $X$ and $Y$, $H(X|Y)$ is the entropy of the posterior probability of $X$ given $Y$, and $IG$ is the information gain.
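To make the two measures concrete, the following is a small illustrative Python sketch of the formulas in Sections 2.2 and 2.3 for discrete attributes; it is not the Weka attribute evaluators actually used in the experiments.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(feature, target):
    """IG = H(target) - H(target | feature); symmetric, as it equals
    the mutual information between feature and target."""
    values, counts = np.unique(feature, return_counts=True)
    h_cond = sum((c / len(feature)) * entropy(target[feature == v])
                 for v, c in zip(values, counts))
    return entropy(target) - h_cond

def gain_ratio(feature, target):
    """Information gain normalized by the feature's split information."""
    split_info = entropy(feature)
    return info_gain(feature, target) / split_info if split_info > 0 else 0.0

def symmetrical_uncertainty(feature, target):
    """SU = 2*IG / (H(feature) + H(target)), bounded in [0, 1]."""
    return 2 * info_gain(feature, target) / (entropy(feature) + entropy(target))
```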
2.4 Genetic Algorithms (GAs)

The basic idea behind evolutionary algorithms (EAs) is derived from the theory of biological evolution developed by Charles Darwin and others, and EAs have been used both as computational models and as adaptive search strategies for solving optimization problems. Genetic algorithms, developed by Holland in 1975, are one class of EAs [17]. A GA evolves a population of artificial organisms, or so-called agents. Every agent comprises a genotype, often a binary string or chromosome, which encodes a solution to the problem at hand, and a phenotype, which is the solution itself. At the start, the population of agents is randomly generated to represent candidate solutions to the problem. The GA implementation relies on an appropriate formulation of the fitness function; for closed identification, the main objective of the fitness function is to maximize the recognition rate. Every agent is evaluated in each iteration in order to produce new, fitter offspring and to replace weaker members of the last generation. Thus, the core of this class of evolutionary algorithms lies in selectively breeding new genetic structures in the course of finding solutions for the problem at hand [18]. We have adopted the algorithm described by Goldberg [19]; the flowchart of GA-based feature selection is shown in Figure 2 [20].

Figure 2 Feature Selection using GA [20]

2.5 Simple random sampling

Real-world databases commonly suffer from class imbalance, in which one class is represented by a considerably larger number of instances than the others; as a consequence, classification algorithms tend to ignore the minority classes. Simple random sampling has been advised as a good means of increasing the sensitivity of the classifier to the minority class by scaling the class distribution. An empirical study using twenty datasets from the UCI repository showed quantitatively that classifier accuracy can be increased with a progressive sampling algorithm [21]; in that study, Weiss and Provost deployed decision trees to evaluate classification performance under the sampling strategy. Another important study used sampling to scale the class distribution with a focus on biomedical datasets [22]; its authors measured the effect of the suggested sampling strategy using nearest-neighbour and decision tree classifiers.

In simple random sampling, a sample is randomly selected from the population so that the obtained sample is representative of the population; the technique therefore provides an unbiased sample of the original data. There are two ways of making the random selection, as illustrated in the sketch below. In the first, samples are drawn with replacement: a sample can be selected repeatedly, each time with an equal chance of selection. In the second, samples are drawn without replacement: each sample in the dataset has an equal chance of being selected, but once selected it cannot be chosen again [23].
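A minimal sketch of the two selection modes, and of how simple random sampling can be used to rescale a class distribution, follows; the helper names and the target counts are hypothetical and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_random_sample(X, y, n, replace=False):
    """Draw n examples uniformly at random, with or without replacement."""
    idx = rng.choice(len(X), size=n, replace=replace)
    return X[idx], y[idx]

def rescale_class_distribution(X, y, target_counts, replace=False):
    """Resample each class to a requested size (hypothetical helper).

    Without replacement, each requested size must not exceed the class size.
    """
    Xs, ys = [], []
    for label, n in target_counts.items():
        mask = (y == label)
        Xc, yc = simple_random_sample(X[mask], y[mask], n, replace=replace)
        Xs.append(Xc)
        ys.append(yc)
    return np.vstack(Xs), np.concatenate(ys)

# Example: rescale_class_distribution(X, y, {0: 500, 1: 500}, replace=True)
```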
3. Dataset and Evaluation Procedure

3.1 Dataset

The dataset used in this research was collected through a customized questionnaire and school reports during the 2005-2006 academic year from two public schools in the Alentejo region of Portugal [11]. The school reports contributed a few attributes, such as the three period grades and the number of school absences. The researchers designed a questionnaire with closed questions to extract further socio-demographic information expected to affect student performance, including demographic data (e.g. mother's education, family income), social-emotional attributes (e.g. alcohol consumption) (Pritchard and Wilson 2003) and academic learning attributes (e.g. number of past class failures). The questionnaire was first reviewed by school professionals and tested on a small set of 15 students in order to get feedback; eventually, 788 students completed the customized questionnaire.

The original dataset has 33 attributes (variables, or features) for each student. The academic status, or final student performance, has two possible values: Pass (G3 ≥ 10) or Fail. To quantify alcohol consumption, there are two alcohol-related attributes: workday alcohol consumption (D_alc) and weekend alcohol consumption (W_alc). The total alcohol consumption of a specific student over a whole week was therefore estimated using the following formula [24]:

$$Alcohol\ consumption = (W\_alc \times 2 + D\_alc \times 5)/7 \qquad (6)$$

The new attribute varies between one and five. The dataset is then divided into two classes according to this alcohol consumption value: the target class is set to 1 when the alcohol consumption is greater than 3, and 0 otherwise. The 30 attributes used, along with their descriptions, are listed in Table 1, after the short preprocessing sketch below.
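The sketch below illustrates Equation 6 and the two derived targets in Python with pandas; the attribute names Dalc, Walc and G3 are assumed from the public UCI copy of this dataset, and the file name is hypothetical.

```python
import pandas as pd

# Hypothetical file name; Dalc, Walc and G3 follow the attribute naming of the
# public UCI copy of this dataset (alcohol scales run 1 = very low ... 5 = very high).
df = pd.read_csv("student-por.csv", sep=";")

# Equation 6: weekend consumption weighted by 2 days, workday by 5 days.
df["alc"] = (df["Walc"] * 2 + df["Dalc"] * 5) / 7

# Binary target class: 1 if weekly consumption is greater than 3, else 0.
df["alc_class"] = (df["alc"] > 3).astype(int)

# Academic status: Pass if the final grade G3 is at least 10, otherwise Fail.
df["status"] = (df["G3"] >= 10).map({True: "Pass", False: "Fail"})
```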
Table 1: The dataset description of attributes [11]

No. | Attribute (description) | Type | Possible values
1 | School - student's school | Binary | "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira
2 | Gender - student's gender | Binary | "F" - female or "M" - male
3 | Age - student's age | Numeric | from 15 to 22
4 | Address - student's home address type | Binary | "U" - urban or "R" - rural
5 | Famsize - family size | Binary | "LE3" - less or equal to 3 or "GT3" - greater than 3
6 | Pstatus - parent's cohabitation status | Binary | "T" - living together or "A" - apart
7 | Medu - mother's education | Numeric | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
8 | Fedu - father's education | Numeric | 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
9 | Mjob - mother's job | Nominal | "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other"
10 | Fjob - father's job | Nominal | "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other"
11 | Reason - reason to choose this school | Nominal | close to "home", school "reputation", "course" preference or "other"
12 | Guardian - student's guardian | Nominal | "mother", "father" or "other"
13 | Traveltime - home to school travel time | Numeric | 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour
14 | Studytime - weekly study time | Numeric | 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours
15 | Failures - number of past class failures | Numeric | n if 1 <= n < 3, else 4
16 | Schoolsup - extra educational support | Binary | yes or no
17 | Famsup - family educational support | Binary | yes or no
18 | Paid - extra paid classes within the course subject | Binary | yes or no
19 | Activities - extra-curricular activities | Binary | yes or no
20 | Nursery - attended nursery school | Binary | yes or no
21 | Higher - wants to take higher education | Binary | yes or no
22 | Internet - Internet access at home | Binary | yes or no
23 | Romantic - with a romantic relationship | Binary | yes or no
24 | Famrel - quality of family relationships | Numeric | from 1 - very bad to 5 - excellent
25 | Freetime - free time after school | Numeric | from 1 - very low to 5 - very high
26 | Goout - going out with friends | Numeric | from 1 - very low to 5 - very high
27 | Health - current health status | Numeric | from 1 - very bad to 5 - very good
28 | Absences - number of school absences | Numeric | from 0 to 93
29 | G3 - final grade | Numeric | from 0 to 20 (output target)
30 | Alcohol consumption - target class | Binary | 1 = yes or 0 = no

3.2 Performance Analysis

The performance of the suggested technique was evaluated using three threshold- and rank-based performance metrics: Accuracy, ROC and Cohen's kappa coefficient. The main formulations are defined in Equations 7-9, according to the confusion matrix shown in Table 2. In the confusion matrix of a two-class problem, TP is the number of true positives, in our case the Pass cases that were classified correctly; FN is the number of false negatives, the Pass cases that were incorrectly classified as Fail; TN is the number of true negatives, the Fail cases that were classified as Fail; and FP is the number of false positives, the Fail cases that were incorrectly classified as Pass.

Table 2: The confusion matrix

Hypothesis | Classified Pass | Classified Fail
Hypothesis positive (Pass) | True Positive (TP) | False Negative (FN)
Hypothesis negative (Fail) | False Positive (FP) | True Negative (TN)

Consequently, we can define Precision as:

$$Precision = \frac{TP}{TP + FP} \times 100\% \qquad (7)$$
Precision measures how many of the points predicted as significant are in fact significant. The Receiver Operating Characteristic (ROC) curve is another commonly used measure for evaluating two-class decision problems in machine learning. The ROC curve is a standard tool for summarizing classifier performance over a range of trade-offs between TP and FP error rates [25]; the area under the curve usually takes values between 0.5, for random guessing, and 1.0, for perfect classifier performance.

The Kappa error, or Cohen's kappa statistic, is another recommended measure for comparing the performances of different classifiers and hence the quality of the selected features. In general, the Kappa value lies in the interval [-1, 1]; the closer the value calculated for a classifier is to 1, the more realistic its performance is assumed to be [26]. The Kappa error can be calculated using the following formula:

$$Kappa = \frac{P(A) - P(E)}{1 - P(E)} \qquad (8)$$

where P(A) is the total agreement probability and P(E) is the hypothetical probability of chance agreement.

In order to obtain reliable estimates of classification accuracy on each classification task, every experiment was performed using 10-fold cross-validation. Cross-validation is a method for estimating the generalization error based on resampling [27]; it allows the whole dataset to be used for both training and testing. In the k-fold cross-validation procedure, the dataset is randomly partitioned into k approximately equal-sized parts, called folds, and the model is trained k times, each time leaving one of the folds out of the training process and using only the omitted fold to compute the error criterion. The average error across all k trials is then reported as the mean error rate, defined as

$$E = \frac{1}{k} \sum_{i=1}^{k} e_i \qquad (9)$$

where $e_i$ is the error rate of the $i$th experiment. Figure 3 depicts the concept behind k-fold cross-validation.

Figure 3 Data partitioning using k-fold cross-validation. The whole dataset is divided into K folds; one fold (k = 3 in the original illustration) is set aside to validate the model and the remaining K-1 folds are used for training, and the entire procedure is repeated for each of the K folds.

A number of studies have found that a value of k = 10 leads to adequate and accurate classification results [28].
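As an illustration of this evaluation protocol, the sketch below computes 10-fold cross-validated Accuracy, ROC area and Cohen's kappa with scikit-learn; the classifier choice is arbitrary, and X and y are assumed to be numeric NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

def evaluate(clf, X, y, k=10):
    """Mean Accuracy, ROC area and Cohen's kappa over k stratified folds."""
    acc, roc, kappa = [], [], []
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
    for train, test in folds.split(X, y):
        clf.fit(X[train], y[train])
        pred = clf.predict(X[test])            # hard labels for Accuracy/Kappa
        score = clf.predict_proba(X[test])[:, 1]  # scores for the ROC area
        acc.append(accuracy_score(y[test], pred))
        roc.append(roc_auc_score(y[test], score))
        kappa.append(cohen_kappa_score(y[test], pred))
    return np.mean(acc), np.mean(roc), np.mean(kappa)

# Example: evaluate(RandomForestClassifier(n_estimators=150), X, y)
```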
4. Results and discussion

Predicting student alcohol consumption is a challenging application of data mining, whose basic idea is to extract hidden knowledge from data using data mining techniques. The suggested system for predicting student alcohol consumption is carried out in three major phases. The process starts by applying simple random sampling to rescale the imbalanced class distribution. In the second phase, the feature space is searched to reduce the number of features and prepare the conditions for the next step; this task is carried out using four feature selection techniques, namely the IGR, SU, GA and PSO algorithms, and at the end of this step a subset of features is chosen for the next round. Finally, the selected features are used as the inputs to the classifiers. As mentioned previously, five classifiers are used to estimate the class probability: MLP, Simple Logistic, Random Forest, the C4.5 decision tree and FURIA.

All experiments were carried out in the Waikato Environment for Knowledge Analysis (Weka), a popular suite of data mining algorithms written in Java. The Random Forest ensemble classifier was built with 150 trees and 10 random features per tree, while the C4.5 classifier was applied with a pruning confidence factor of 0.25 and a minimum of 2 instances per leaf. The suggested algorithm is trained using a 10-fold cross-validation strategy to evaluate classification accuracy on the dataset. The PSO feature selection was applied with a population size of 20, 20 iterations, an individual weight of 0.34 and an inertia weight of 0.33.

Table 3 depicts the class distribution before and after applying the simple random sampling technique. The unbalanced distribution of the two classes makes this dataset suitable for testing the effect of the simple random sampling strategy; we therefore used a simple random sampling approach without replacement to rescale the class distribution of the dataset. The experimental results of the multiple classifiers before and after applying SRS are shown in Table 4. The best overall performance is associated with the Random Forest classifier, with an accuracy of 92.2%, ROC of 94.5% and Kappa of 70.4% before applying SRS, and an accuracy of 98.5%, ROC of 99.7% and Kappa of 95.2% after applying SRS, in both cases using all features. Across the classifiers used to perform the predictions, we observed a marked improvement in performance after resampling, which underlines the importance of the SRS step.
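For readers who wish to approximate this setup outside Weka, the following sketch maps the stated hyper-parameters onto scikit-learn estimators. This is only a rough analogue: Weka's J48, SimpleLogistic and MLP differ in detail from these classes, and FURIA has no scikit-learn counterpart, so it is omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Approximate scikit-learn analogues of the Weka configurations described above.
classifiers = {
    "Random Forest": RandomForestClassifier(
        n_estimators=150,       # 150 trees, as in the reported setup
        max_features=10),       # 10 random features considered per split
    "C4.5-like tree": DecisionTreeClassifier(
        min_samples_leaf=2),    # minimum of 2 instances per leaf
    "Simple Logistic": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500),
}
```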
Table 3 Class distribution of the Student dataset before and after SRS

Index | Class | Before SRS | After SRS
1 | Alcoholic | 188 | 411
2 | Not Alcoholic | 856 | 1677

Table 4 Performance measures of selected classifiers before feature selection

Classifier | Performance index | Before SRS | After SRS
MLP | Accuracy | 0.846 | 0.929
MLP | ROC | 0.829 | 0.776
MLP | Kappa | 0.479 | 0.776
Simple Logistic | Accuracy | 0.830 | 0.861
Simple Logistic | ROC | 0.828 | 0.874
Simple Logistic | Kappa | 0.370 | 0.523
Random Forest | Accuracy | 0.922 | 0.985
Random Forest | ROC | 0.945 | 0.997
Random Forest | Kappa | 0.702 | 0.952
C4.5 | Accuracy | 0.828 | 0.946
C4.5 | ROC | 0.732 | 0.933
C4.5 | Kappa | 0.412 | 0.829
FURIA | Accuracy | 0.868 | 0.967
FURIA | ROC | 0.824 | 0.967
FURIA | Kappa | 0.476 | 0.885

The second phase involves searching the feature space to reduce the number of features and prepare the conditions for the next step. This task is carried out using four feature selection techniques: IGR, SU, GA and PSO. The optimal feature subsets found by these techniques are summarized in Table 5. As Table 5 shows, the dimensionality of the feature set is noticeably reduced, so less storage space is required for the execution of the classification algorithms; this step reduced the dataset to only 6 to 15 attributes.

Table 5 Selected features of the student dataset

FS technique | Number of selected features | Selected features
IGR | 13 | 2, 10, 11, 13, 14, 15, 17, 21, 25, 26, 27, 28, 29
SU | 15 | 2, 5, 10, 11, 13, 14, 15, 17, 20, 21, 25, 26, 27, 28, 29
GA | 7 | 2, 13, 25, 26, 27, 28, 29
PSO | 6 | 2, 13, 26, 27, 28, 29

The experimental results of the multiple classifiers with the reduced feature sets are shown in Table 6. The highest performance for the IGR-based feature selection technique is associated with the Random Forest classifier, with 97.9%, 98.7% and 93.3% for Accuracy, ROC and Kappa, respectively, using 13 features. The highest performance for the SU-based technique is likewise associated with the Random Forest classifier, with 99.5%, 99.7% and 98.2% for Accuracy, ROC and Kappa, respectively, using 15 features. The highest performance for the GA-based technique is associated with the Random Forest classifier, with 96.2%, 97.2% and 88.1% for Accuracy, ROC and Kappa, respectively,
with 7 features. The highest performance for the PSO-based technique is associated with the Random Forest classifier, with 95.1%, 97.0% and 84.5% for Accuracy, ROC and Kappa, respectively, using only 6 features.

Table 6 Performance measures of selected features after SRS

Classifier | Performance index | IGR | SU | GA | PSO
MLP | Accuracy | 0.888 | 0.915 | 0.861 | 0.855
MLP | ROC | 0.842 | 0.873 | 0.839 | 0.837
MLP | Kappa | 0.641 | 0.716 | 0.537 | 0.521
Simple Logistic | Accuracy | 0.845 | 0.857 | 0.840 | 0.841
Simple Logistic | ROC | 0.861 | 0.874 | 0.838 | 0.839
Simple Logistic | Kappa | 0.476 | 0.491 | 0.456 | 0.461
Random Forest | Accuracy | 0.979 | 0.995 | 0.962 | 0.951
Random Forest | ROC | 0.987 | 0.997 | 0.972 | 0.970
Random Forest | Kappa | 0.933 | 0.982 | 0.881 | 0.845
C4.5 | Accuracy | 0.936 | 0.984 | 0.925 | 0.906
C4.5 | ROC | 0.932 | 0.986 | 0.919 | 0.900
C4.5 | Kappa | 0.797 | 0.947 | 0.761 | 0.698
FURIA | Accuracy | 0.962 | 0.961 | 0.911 | 0.890
FURIA | ROC | 0.970 | 0.969 | 0.913 | 0.848
FURIA | Kappa | 0.875 | 0.872 | 0.705 | 0.625

Figure 4 visualizes the agreement between the feature subsets selected by the IGR, SU, GA and PSO models. The Venn diagram shows that the models share student's gender, home-to-school travel time, going out with friends, current health status, number of school absences and final grade, all of which were also selected by the PSO model.

Figure 4 PSO feature selection agreement in the student alcohol consumption dataset

PSO is a well-known optimization method with a strong search capability that is often used for fine-tuning the feature space. Our PSO-based technique succeeded in significantly improving the classification performance with a limited number of features compared to the other techniques: the PSO-selected features yielded accuracies between 84.1% and 95.1% across the various DM models, which is quite high for this prediction task [29]. Deploying PSO in feature selection therefore clearly helped reduce the dataset from 33 to only 6 attributes. It is worth noting that, as the number of features is reduced, less storage space and lower classification complexity are required. Moreover, the results demonstrate that these features are adequate to represent the
dataset's class information. The outcomes of the suggested feature selection techniques are better than the results obtained on data that is not pre-processed, and also better than when these attribute selection techniques are used independently. As can be seen from the above results, the proposed PSO-based technique produced very promising results in classifying the dataset for predicting student alcohol consumption.

As we can see from the graphs and tables, there are significant gender differences in drinking habits. Compared to men, women are more likely to be responsive to health concerns and are less likely to engage in risky health behaviours [10,11]. Commonly, men smoke and drink more than women across different societies and cultures, and women have a higher expectation of self-control than men do [12,13].
Heavy drinking can in turn drive other features, such as free time spent with friends, long travel time between school and home and the number of school absences, and eventually poor academic performance. Our study shows that drinking is the product of many factors working together, which suggests that educational professionals can consider these features for further analysis in the future.
5. CONCLUSION

Underage drinking, or adolescent alcohol misuse, is a major public health concern. The proposed machine learning approach improved the accuracy performance and achieved promising results in identifying risk and protective factors for high-risk drinking, which can be used to help detect and address the early developmental signs of alcohol abuse and dependence in adolescent students. The experimental results showed that PSO helped reduce the feature space, while adjusting the original data with simple random sampling enlarged the region of the minority class and thus helped handle the imbalanced nature of the data.

ACKNOWLEDGEMENT

The authors gratefully acknowledge the UCI Machine Learning Repository [http://archive.ics.uci.edu/ml] and Mr. Paulo Cortez [11] for providing the student performance dataset.

REFERENCES

1. World Health Organization Management of Substance Abuse Team. Global Status Report on Alcohol and Health. World Health Organization, Geneva, Switzerland; 2011: 85.
2. Grant, B.F. Estimates of U.S. children exposed to alcohol abuse and dependence in the family. American Journal of Public Health 90(1): 112-115, 2000.
3. Aertgeerts B, Buntinx F. The relation between alcohol abuse or dependence and academic performance in first-year college students. J Adolesc Health. 2002; 31: 223-5.
4. Berkowitz AD, Perkins HW. Problem drinking among college students: A review of recent research. J Am Coll Health. 1986; 35: 21-8.
5. Martinez JA, Sher KJ, Wood PK. Is heavy drinking really associated with attrition from college? The alcohol-attrition
paradox. Psychol Addict Behav. 2008; 22: 450-6.
6. Crutzen, R., Giabbanelli, P.J., Jander, A., Mercken, L. and de Vries, H. (2015). Identifying binge drinkers based on parenting dimensions and alcohol-specific parenting practices: Building classifiers on adolescent-parent paired data. BMC Public Health, 15(1): 747.
7. Montaño, J.J., Gervilla, E., Cajal, B. and Palmer, A. (2014). Data mining classification techniques: An application to tobacco consumption in teenagers. An. Psicol., 30(2): 633-641.
8. Pang, R., Baretto, A., Kautz, H. and Luo, J. (2015). Monitoring adolescent alcohol use via multimodal analysis in social multimedia. Proceedings of the IEEE International Conference on Big Data (Big Data), pp. 1509-1518.
9. Bi, J., Sun, J., Wu, Y., Tennen, H. and Armeli, S. (2013). A machine learning approach to college drinking prediction and risk factor identification. ACM Trans. Intell. Syst. Technol., 4(4).
10. Zuba, M., Gilbert, J., Wu, Y., Bi, J., Tennen, H. and Armeli, S. (2012). 1-norm support vector machine for college drinking risk factor identification. Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pp. 651-660.
11. Cortez, P. and Silva, A. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference, pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.
12. Guyon, I., Gunn, S., Nikravesh, M. and Zadeh, L.A. (2006). Feature Extraction: Foundations and Applications. Springer, Berlin.
13. Kennedy, J. and Eberhart, R. (2001). Swarm Intelligence. Morgan Kaufmann.
14. Kennedy, J. and Eberhart, R. (1997). A discrete binary version of the particle swarm algorithm. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 5, pp. 4104-4108.
15. Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
16. Fayyad, U. and Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022-1027. Morgan Kaufmann.
17. Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.
18. Haupt, R.L. and Haupt, S.E. (1998). Practical Genetic Algorithms. John Wiley and Sons.
19. Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
20. Hall, M.A. (1999). Correlation-Based Feature Selection for Machine Learning.
21. Weiss, G. and Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. J. Artificial Intelligence Research, vol. 19, pp. 315-354.
22. Park, B.-H., Ostrouchov, G., Samatova, N.F. and Geist, A. (2004). Reservoir-based random sampling with replacement from data stream. In SDM 2004, pp. 492-496.
23. Mitra, S.K. and Pathak, P.K. (1984). The nature of simple random sampling. Ann. Statist., 12: 1536-1542.
24. Pagnotta, F. and Amran, H.M. (2016). Using data mining to predict secondary school student alcohol consumption. Department of Computer Science, University of Camerino.
25. Fawcett, T. and Provost, F. (1996). Combining data mining and machine learning for effective user profiling. In Proceedings of KDD-96, pp. 8-13. Menlo Park, CA: AAAI Press.
26. Fleiss, J.L. and Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33: 613-619.
27. Devijver, P.A. and Kittler, J.
(1982). Pattern Recognition: A Statistical Approach. London: Prentice-Hall.
28. Gupta, G.K. (2006). Introduction to Data Mining with Case Studies. Prentice-Hall of India.
29. Liao, S., Chu, P. and Hsiao, P. (2012). Data mining techniques and applications: A decade review from 2000 to 2011. Expert Systems with Applications, 39.