International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 3, June 2021, pp. 2407~2413
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i3.pp2407-2413
Journal homepage: http://ijece.iaescore.com
Prediction model of algal blooms using logistic regression and
confusion matrix
Hongwon Yun
Department of Computer Software Engineering, Silla University, Busan, Republic of Korea
Article Info
Article history:
Received Jul 31, 2020
Revised Sep 22, 2020
Accepted Oct 8, 2020
Keywords:
Algal blooms
Confusion matrix
Ensemble method
Logistic regression
Prediction model
ABSTRACT
Algal bloom data are collected and refined as experimental data for algal bloom prediction. The
refined algal blooms dataset is analyzed by logistic regression, and statistical tests and
regularization are performed to find the marine environmental factors affecting algal blooms. The
predicted value of algal bloom occurrence is obtained through logistic regression analysis using
these factors. The actual values and the predicted values of the algal blooms dataset are applied to
the confusion matrix. By improving the decision boundary of the existing logistic regression, the
accuracy, sensitivity, and precision of algal bloom prediction are improved. In this paper, the algal
blooms prediction model is established by the ensemble method using logistic regression and the
confusion matrix. Algal bloom prediction is improved, and this is verified through big data analysis.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Hongwon Yun
Department of Computer Software Engineering
Silla University
Busan 46958, Republic of Korea
Email: hwyun@silla.ac.kr
1. INTRODUCTION
Logistic regression is a special case of the generalized linear model and is similar to linear regression, but it
differs in the relationship between the dependent and independent variables. The dependent variable of
logistic regression can be binary or continuous, and it is used as a model for classification or prediction when
the dependent variable is binary [1, 2]. If the dependent variable of logistic regression is binary, its value is
limited to two outcomes and the conditional probability follows the Bernoulli
distribution. Logistic regression keeps the value of the dependent variable between 0 and 1 regardless of the
range of the independent variables, so for given input data it is possible to assign the result to a specific
class and to predict the likelihood of an event occurring [3-5].
In logistic regression, where the dependent variable is binary, the predicted value can be calculated
using a linear combination of the independent variables. However, since the value of the dependent variable
is classified as pass or fail around the decision boundary, the value close to the decision boundary may be
less accurate [6-8]. In binary logistic regression, since the actual value of the dependent variable is present
and the predicted value can be calculated, the predicted value can be applied to a confusion matrix that can be
compared to the target value [9, 10]. Sensitivity and precision can be obtained from the confusion matrix
using the actual and predicted values of the logistic regression, and applying this to algal blooms yields a
summary of indicators such as accuracy, sensitivity, and precision [11-13].
Sensitivity and precision are as important as accuracy in predicting algal bloom occurrence, because
high sensitivity and precision provide indicators that can help prevent massive property damage
[14-17]. The elements of the marine environment that cause algal blooms are generally known, but no study
can be found that analyzes the influence of each element on algal blooms and predicts algal blooms. In this
study, the predicted value of logistic regression is calculated by machine learning. The actual value used in
logistic regression analysis and the prediction calculated through machine learning are applied to the
confusion matrix to create a prediction model for algal blooms.
This paper is organized as follows. The logistic regression and confusion matrix that form the background
theory of this study are described in section 2. In section 3, we describe the algal blooms prediction model
using the ensemble method of the logistic regression and confusion matrix proposed in this study. There we
describe the process of extracting marine environmental elements using logistic regression, obtaining red tide
prediction values, applying improved decision boundaries to logistic regression, and improving
accuracy, sensitivity, and precision through the confusion matrix. In section 4, we verify the proposed algal
blooms prediction model using the algal blooms dataset, and conclusions are given in section 5.
2. LOGISTIC REGRESSION AND CONFUSION MATRIX
2.1. Logistic regression
Linear regression is a model that estimates a regression coefficient that can linearly express the
relationship between independent variables X and dependent variables Y with continuous values. If the
dependent variable Y is a binary variable, logistic regression is used because linear regression cannot be
applied directly. Some regression algorithms can be used for classification, and logistic regression is widely
used to estimate the probability that a sample belongs to a particular class. If the estimated probability
exceeds 0.5, the sample is predicted to belong to the class; if it is less than 0.5, the sample is predicted
not to belong to the class, which makes logistic regression a binary classifier [18, 19]. To estimate the
probability, logistic regression calculates the weighted sum of the input features, but instead of outputting
the result directly as linear regression does, it outputs the logistic of the result. The logistic is a sigmoid
function that outputs a value between 0 and 1 [20]. For any numerical value x, the logistic function limits
the output to the range between 0 and 1, and it is defined as follows.
$$y = \frac{1}{1 + e^{-f(x)}} \tag{1}$$
In (1), 𝑓(𝑥) can be either a simple linear function or a multiple linear function. For classification
problems with two categories, a sample with 𝑓(𝑥) > 0 is classified as 𝑦 → 1 and a sample with 𝑓(𝑥) < 0 as
𝑦 → 0. The decision boundary of the logistic regression model is the hyperplane 𝑓(𝑥) = 0, on which 𝑦 = 0.5.
Errors in prediction usually occur around the decision boundary [21, 22].
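The relationship between (1) and the decision boundary can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `logistic` and `classify` are hypothetical, and treating a probability of exactly 0.5 as the positive class is an assumption (it matches the "0.5 or higher" convention used later in Table 6).

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) function of equation (1): maps any real z into (0, 1)."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z: float, boundary: float = 0.5) -> int:
    """Return 1 when the estimated probability reaches the boundary, else 0.

    Assumption: a probability exactly equal to the boundary counts as class 1.
    """
    return 1 if logistic(z) >= boundary else 0


# f(x) < 0 gives y < 0.5, f(x) = 0 gives y = 0.5, f(x) > 0 gives y > 0.5.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 4), classify(z))
```

Values of 𝑓(𝑥) close to zero produce probabilities close to 0.5, which is why small estimation errors near the boundary can flip the predicted class.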
2.2. Confusion matrix
The confusion matrix is a tool that easily and effectively shows the performance of a classifier, and
its results are easy to interpret. A confusion matrix can be used to evaluate the
performance of any classification model or algorithm. As shown in Table 1, the rows in the confusion matrix
represent the values of the predicted class and the columns represent the values of the actual class. Each cell
is one of the possible combinations of prediction and actuality. In the 2×2 confusion matrix, there are true
positive (TP), false positive (FP), false negative (FN), and true negative (TN) cells [23].
A perfect model has values only on the diagonal, with all other cells equal to zero, whereas
a bad model is evenly distributed across all cells. The error matrix also shows in what way a model is
bad: the value of each cell identifies a pattern of misclassification [24].
Table 1. Confusion matrix
| Hypothesized class (Predicted) | True class (Actual): P | True class (Actual): N |
| Y | True Positives | False Positives |
| N | False Negatives | True Negatives |
Methods for summarizing the results of the confusion matrix include accuracy, precision, and recall.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
The accuracy is obtained by dividing the accurately predicted number (TP+TN) by the total number
of samples, and is represented by (2). Among the methods for summarizing the results in the confusion
matrix, the most frequently used precision and sensitivity are as shown in (3) and (4), respectively.
$$Precision = \frac{TP}{TP + FP} \tag{3}$$
Precision is a positive predictive value that measures how many of the samples (TP+FP) predicted
to be positive are true positives (TP). Precision is used as a performance indicator when the goal is to reduce
the number of false positives (FP).
$$Sensitivity = \frac{TP}{TP + FN} \tag{4}$$
Sensitivity measures how many of the total positive samples (TP+FN) are classified as positive classes (TP).
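Equations (2) to (4) can be checked on a small example. The following Python sketch counts the four cells of a 2×2 confusion matrix from lists of actual and predicted labels and then evaluates the three indicators. The label lists are made up for illustration and are not data from this study; the function names and the choice to return 0.0 when a denominator is empty are assumptions.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Equation (2): share of all samples that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Equation (3); returns 0.0 when nothing was predicted positive (assumption)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Equation (4); returns 0.0 when there are no actual positives (assumption)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("precision:", precision(tp, fp))
print("sensitivity:", sensitivity(tp, fn))
```

In this toy case the accuracy is higher than the sensitivity, which illustrates why accuracy alone can hide missed positive cases.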
3. PREDICTION MODEL
After the algal blooms dataset was collected from the National Institute of Fisheries Science, it was cleaned
and refined. A first multiple logistic regression analysis was performed on the refined algal blooms dataset,
and some attributes were removed through a statistical test. A second multiple logistic regression analysis
was performed without the removed attributes, and regularization was then applied. After
regularization, a third multiple logistic regression analysis was performed and the results were
applied to the confusion matrix. Figure 1 shows this process.
Figure 1. Prediction process of algal blooms
The probability of occurrence of harmful algal blooms with two or more independent variables is
defined as $p(x)$ and the odds as $odds = \frac{p}{1-p}$. Since $p(x)$ lies in $[0, 1]$, the logit
transformation is performed to extend the range of output values to $(-\infty, \infty)$, resulting in
$\log(odds) = \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$ [25].
Therefore, for multiple independent variables that affect harmful algal blooms,
the multiple logistic function that keeps the dependent variable within [0, 1] is as shown in (5).
Equation (5) calculates the effect of each element of the ocean observation data, which is an independent
variable, on the occurrence of harmful algal blooms as the dependent variable. This is the basic model for
estimating the probability of occurrence of harmful algal blooms.
$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i)}} \tag{5}$$
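The step from the log-odds expression to (5) can be written out explicitly. In the sketch below, the shorthand $\eta$ for the linear combination is introduced here only for brevity and does not appear in the original notation.

```latex
\begin{align*}
\log\frac{p(x)}{1-p(x)} &= \eta, \qquad \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i \\
\frac{p(x)}{1-p(x)} &= e^{\eta} \\
p(x) &= e^{\eta}\bigl(1 - p(x)\bigr) \\
p(x)\bigl(1 + e^{\eta}\bigr) &= e^{\eta} \\
p(x) &= \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\end{align*}
```

The last line is (5): whatever real value the linear combination takes, the resulting probability stays strictly between 0 and 1.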
Maximum likelihood estimation is used to estimate the parameters 𝛽 in the regression expression
𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑖𝑥𝑖 obtained by the logit transformation. The log likelihood function can be obtained
from the likelihood function [26], which is expressed as a product of Bernoulli probability functions, and is
given in (6). The parameter that maximizes the log likelihood function in (6) is determined from multiple
independent variables that affect the harmful algal blooms.
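$$\ln L = \sum_{i} y_i(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i) - \sum_{i} \ln\left(1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i}\right) \tag{6}$$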
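As a rough illustration of what maximum likelihood estimation optimizes, the Python sketch below evaluates the standard Bernoulli log-likelihood of a logistic model, in which each sample contributes its label times the linear predictor minus the logarithm of one plus the exponential of that predictor. The function name `log_likelihood`, the toy data, and the two candidate parameter vectors are hypothetical and are not taken from the algal blooms dataset.

```python
import math


def log_likelihood(beta, rows, labels):
    """Bernoulli log-likelihood of a logistic regression model.

    beta   : [b0, b1, ..., bk], with b0 the intercept
    rows   : list of feature lists, each of length k
    labels : list of 0/1 outcomes
    """
    total = 0.0
    for x, y in zip(rows, labels):
        eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
        # ln(1 + e^eta), rearranged so that exp() never overflows
        log_one_plus_exp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y * eta - log_one_plus_exp
    return total


# Toy data with a single feature (illustration only).
rows = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]

ll_zero = log_likelihood([0.0, 0.0], rows, labels)   # all coefficients zero
ll_fit = log_likelihood([-3.0, 2.0], rows, labels)   # a better-fitting candidate
print(ll_zero, ll_fit)
```

The candidate that separates the two classes better yields the larger (less negative) log-likelihood, which is the quantity an estimator would push upward.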
The L1 regularization [27] used to eliminate low-impact independent variables among multiple independent
variables that affect harmful algal blooms is shown in (7).
$$\arg\min_{\beta} \sum_{i=1}^{n} \Bigl(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{7}$$
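To see why the penalty term in (7) favors dropping low-impact variables, the following Python sketch evaluates the objective of (7) for two candidate coefficient vectors. The function name `l1_objective`, the toy data, and the values of λ are assumptions chosen only for illustration.

```python
def l1_objective(beta, rows, targets, lam):
    """Value of the L1-penalized squared-error objective written in (7)."""
    squared_error = sum(
        (y - sum(x_j * b_j for x_j, b_j in zip(x, beta))) ** 2
        for x, y in zip(rows, targets)
    )
    l1_penalty = lam * sum(abs(b_j) for b_j in beta)
    return squared_error + l1_penalty


# Toy data: the second feature has only a small effect on the target.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, 0.1, 2.1]

full = [2.0, 0.1]     # keeps the weak second coefficient
sparse = [2.0, 0.0]   # sets the weak coefficient to zero

for lam in (0.0, 0.5):
    print(lam, l1_objective(full, rows, targets, lam),
          l1_objective(sparse, rows, targets, lam))
```

Without a penalty (λ = 0) the full vector has the lower objective, but with λ = 0.5 the sparse vector wins because the small gain in fit no longer pays for the extra coefficient.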
The properties in the marine environment observation dataset are shown in Table 2 and used as
independent variables in logistic regression.
Table 2. Multiple independent variables for logistic regression
| Variables | Comments |
| T | Temperature |
| S | Salinity |
| DO | Dissolved Oxygen |
| P | Phosphate Phosphorus |
| NA | Nitrous Acid Nitrogen |
| N | Nitric Acid Nitrogen |
| SA | Silicic Acid Silicon |
Equation (8) is obtained by applying the seven independent variables (water temperature, salinity,
dissolved oxygen, phosphate phosphorus, nitrous acid nitrogen, nitric acid nitrogen, and silicic acid silicon)
to the basic model of multiple logistic regression (5).
$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 \mathit{DO}_i + \beta_4 P_i + \beta_5 \mathit{NA}_i + \beta_6 N_i + \beta_7 \mathit{SA}_i \tag{8}$$
P-value is used to determine if any independent variable was statistically significant in the results of
multiple logistic regression analysis on the training dataset, and independent variables with a P-value of 0.05
or higher are excluded. The parameters for statistically significant independent variables are as shown in (9).
$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 P_i + \beta_4 N_i \tag{9}$$
The regularization in (7) drives the coefficients of low-impact independent variables to zero so that those
variables can be removed. Applying (7) to the result of (9) gives (10).
$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 P_i \tag{10}$$
Equation (11) is the logistic regression model for algal blooms prediction obtained by applying the above
process to the algal blooms dataset.
$$p(x) = \frac{1}{1 + e^{-(-5.89 + 0.34 T_i - 0.12 S_i + 0.35 P_i)}} \tag{11}$$
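As a usage sketch, (11) can be evaluated directly in Python. The coefficients below are the ones in (11); the function name `bloom_probability` and the two sets of input values are hypothetical, since the measurement units of the variables are not stated here.

```python
import math

# Coefficients of equation (11).
INTERCEPT = -5.89
COEF_T = 0.34    # temperature
COEF_S = -0.12   # salinity
COEF_P = 0.35    # phosphate phosphorus


def bloom_probability(temperature, salinity, phosphate):
    """Estimated probability of an algal bloom according to equation (11)."""
    eta = INTERCEPT + COEF_T * temperature + COEF_S * salinity + COEF_P * phosphate
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical observations (illustrative values, units assumed).
cool = bloom_probability(temperature=15.0, salinity=33.0, phosphate=0.5)
warm = bloom_probability(temperature=28.0, salinity=30.0, phosphate=0.5)
print(f"cool water: {cool:.3f}, warm water: {warm:.3f}")
```

Because the temperature coefficient is positive and the salinity coefficient is negative, the warmer and less saline hypothetical sample receives the higher estimated probability.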
The normalization process from (8) to (11) corresponds to Step 2 to Step 5 of the algorithm in
Table 3, respectively. The algal blooms prediction model was normalized while performing experiments
based on the algorithm in Table 3. The detailed experimental process is described in section 4. The
decision boundary used to increase the sensitivity and precision is defined in (12).
$$\Delta = \left| 0.5 \pm \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} \right) \right| \tag{12}$$
Table 3 shows the algorithm for establishing the algal blooms prediction model. This algorithm shows the
process of first performing multiple logistic regression on the refined dataset, then running statistical tests
on the results, then removing low-weight independent variables, setting up the logistic regression model, and
finally finding the decision boundary.
Table 3. Algorithm for establishing algal blooms prediction model
| Step | Statements |
| 1 | Extraction, transformation, and loading from the collected dataset. Prepare the training dataset. |
| 2 | Perform multiple regression analysis using (1) on the training dataset. Output regression coefficients and statistical tests. |
| 3 | Perform a statistical significance test. Attributes with P-value >= 0.05 are excluded from the training dataset. |
| 4 | Perform multiple regression analysis on the training dataset with the attributes whose P-value < 0.05. Output regression coefficients and statistical tests for the attributes whose P-value < 0.05. |
| 5 | Regularize the regression coefficients from step 4 using $\arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$. Perform multiple regression analysis using the regularized regression formula. Output test dataset. |
| 6 | Input test dataset from step 5. Predict probability using the decision boundary $\Delta = \left| 0.5 \pm \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} \right) \right|$ based on the confusion matrix. |
4. EXPERIMENT
Multiple logistic regression analysis (8) is performed on the training dataset to obtain the
results shown in Table 4. In Table 4, the p-value is used to determine whether each independent variable is
statistically significant, and independent variables with a p-value of 0.05 or higher are excluded. The
parameters 𝛽 are then determined for the statistically significant independent variables in Table 4 and L1
regularization is applied, giving the results shown in Table 5. The logistic regression model for algal
blooms prediction in (11) is obtained from the coefficients in Table 5.
Table 4. 1st multiple logistic regression analysis on training dataset
| Input variables | Coefficient | Std. error | P-value |
| Constant | -5.35 | 1.49 | 0.00 |
| Temperature | 0.33 | 0.02 | 0.00 |
| Salinity | -0.12 | 0.03 | 0.00 |
| Dissolved Oxygen | -0.05 | 0.08 | 0.54 |
| Phosphate Phosphorus | 0.38 | 0.14 | 0.01 |
| Nitrous Acid Nitrogen | -0.07 | 0.12 | 0.58 |
| Nitric Acid Nitrogen | -0.06 | 0.02 | 0.00 |
| Silicic Acid Silicon | -0.02 | 0.01 | 0.16 |
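The significance filter applied to Table 4 can be sketched as follows. The rows are copied from Table 4 (the constant term is omitted because it is not a candidate for removal), and 0.05 is the threshold stated in the text; the names `TABLE_4`, `ALPHA`, and `significant` are hypothetical.

```python
# (variable, coefficient, std. error, P-value) rows copied from Table 4.
TABLE_4 = [
    ("Temperature", 0.33, 0.02, 0.00),
    ("Salinity", -0.12, 0.03, 0.00),
    ("Dissolved Oxygen", -0.05, 0.08, 0.54),
    ("Phosphate Phosphorus", 0.38, 0.14, 0.01),
    ("Nitrous Acid Nitrogen", -0.07, 0.12, 0.58),
    ("Nitric Acid Nitrogen", -0.06, 0.02, 0.00),
    ("Silicic Acid Silicon", -0.02, 0.01, 0.16),
]

ALPHA = 0.05  # variables with a P-value of 0.05 or higher are excluded


def significant(rows, alpha=ALPHA):
    """Return the names of the variables whose P-value is below alpha."""
    return [name for name, _coef, _std_err, p_value in rows if p_value < alpha]


kept = significant(TABLE_4)
print(kept)
```

The four variables that survive this filter (temperature, salinity, phosphate phosphorus, and nitric acid nitrogen) are the ones that appear in (9).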
Table 5. Coefficients of logistic regression model for algal blooms prediction
| Input variables | Coefficient | Std. error | P-value |
| Constant | -5.89 | 1.21 | 0.00 |
| Temperature | 0.34 | 0.01 | 0.00 |
| Salinity | -0.12 | 0.03 | 0.00 |
| Phosphate Phosphorus | 0.35 | 0.14 | 0.01 |
Predicting the occurrence of algal blooms from (11) gives 91.84% accuracy. Accuracy alone may
not be sufficient to assess the prediction performance for algal blooms, because it does not reveal the
numbers of false negatives or false positives; we therefore use the confusion matrix. The confusion matrix
for algal blooms shown in Table 6 is obtained from the algal blooms dataset.
Table 6. Confusion matrix (error matrix) for algal blooms
| Predicted values of algal blooms | Actual values: P (Occurrence) | Actual values: N (Not occurrence) |
| Y (0.5 or higher) | True Positive, tp=222 | False Positive, fp=205 |
| N (less than 0.5) | False Negative, fn=599 | True Negative, tn=8828 |
Table 7 shows the sensitivity, specificity, and precision obtained with the decision boundary
0.5 using the values of the confusion matrix in Table 6. The sensitivity is only 27.04%, which means that
most actual algal blooms are missed (false negatives), and the precision is only 51.99%, which means that
nearly half of the predicted blooms are false positives. Since the sensitivity and precision are low when the
decision boundary is 0.5, we apply the proposed decision boundary (12) to address these problems, and the
results are shown in Table 8.
When the decision boundary proposed in this paper is applied, the decision boundary becomes
𝛥 = |0.5 ± 0.25|. With this decision boundary, TP=494, TN=9026, FN=327, and FP=7, so the
sensitivity is 60.17% and the precision is 98.6%, as shown in Table 8.
Table 7. Resulting confusion matrix based on decision boundary 0.5 (unit: %)
| TPR (Sensitivity) | TNR (Specificity) | PPV (Precision) | FPR (Fallout) | ACC (Accuracy) | F1 (F1 Score) |
| 27.04 | 97.73 | 51.99 | 2.27 | 91.84 | 35.58 |
Table 8. Resulting confusion matrix based on proposed decision boundary 𝛥 (unit: %)
| TPR (Sensitivity) | TNR (Specificity) | PPV (Precision) | FPR (Fallout) | ACC (Accuracy) | F1 (F1 Score) |
| 60.17 | 99.92 | 98.6 | 0.08 | 96.61 | 74.74 |
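The indicators in Tables 7 and 8 can be recomputed from the four confusion-matrix counts. The Python sketch below uses the counts from Table 6 for the boundary 0.5 and the counts stated in the text for the proposed boundary; the function name `summarize` is hypothetical, and the F1 score is computed with the usual definition 2TP / (2TP + FP + FN), which is assumed here because its formula is not written out above.

```python
def summarize(tp, fp, fn, tn):
    """Return (TPR, TNR, PPV, FPR, ACC, F1) in percent, rounded to two decimals."""
    sensitivity = tp / (tp + fn)                 # TPR
    specificity = tn / (tn + fp)                 # TNR
    precision = tp / (tp + fp)                   # PPV
    fallout = fp / (fp + tn)                     # FPR
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # ACC
    f1 = 2 * tp / (2 * tp + fp + fn)             # F1 (assumed standard definition)
    values = (sensitivity, specificity, precision, fallout, accuracy, f1)
    return tuple(round(100 * v, 2) for v in values)


baseline = summarize(tp=222, fp=205, fn=599, tn=8828)   # decision boundary 0.5
proposed = summarize(tp=494, fp=7, fn=327, tn=9026)     # proposed decision boundary
print("boundary 0.5:", baseline)
print("proposed:    ", proposed)
```

Under these assumptions the two tuples agree with the rows of Table 7 and Table 8, and both sets of counts add up to the same total of 9,854 samples.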
5. CONCLUSION
In this paper, logistic regression and confusion matrix were used to predict the occurrence of algal
blooms. Algal blooms datasets were collected and refined for experimental analysis of algal blooms
prediction. Logistic regression analysis was performed on refined algal blooms dataset and main marine
environmental factors affecting algal blooms were found through statistical test and regularization processes.
Logistic regression was performed using the marine environmental factors that were influential on algal
blooms and the accuracy of algal bloom prediction was obtained. The values of the confusion matrix were
obtained using the dataset for algal blooms prediction and the predicted values obtained from logistic
regression. Although the sensitivity and precision for the occurrence of algal blooms can be obtained from
the values of the confusion matrix, the sensitivity and precision were low when the existing decision
boundary was 0.5. Sensitivity and precision were improved by using the decision boundary proposed in this
study. In this paper, the algal blooms prediction model was established by the ensemble method using logistic
regression analysis and confusion matrix. Also, the accuracy, sensitivity, and precision for algal blooms
prediction were improved, and these were verified through big data analysis.
REFERENCES
[1] Hosmer Jr., et al., “Applied logistic regression,” John Wiley & Sons, vol. 398, 2013.
[2] M. Chang, et al., “Selection of Transformations of Continuous Predictors in Logistic Regression,” Information
Technology-New Generations, pp. 443-447, 2018.
[3] J. W. Osborne, “Simple Linear Models with Categorical Dependent Variables: Binary Logistic Regression,” SAGE,
pp. 97-132, 2017.
[4] J. Tolles and J. M. William, “Logistic regression: relating patient characteristics to outcomes,” Jama, vol. 316,
no. 5, pp. 533-534, 2016.
[5] A. D. Caigny, et al., “A new hybrid classification algorithm for customer churn prediction based on logistic
regression and decision trees,” European Journal of Operational Research, vol. 269, no. 2, pp. 760-772, 2018.
[6] X. Wan, “The Influence of Polynomial Order in Logistic Regression on Decision Boundary,” IOP Conference
Series: Earth and Environmental Science, vol. 267, no. 4, pp. 1-4, 2019.
[7] S. Ghazaal and A. Hakan, “Using the Distance in Logistic Regression Models for Predictor Ranking in Diabetes
Detection,” International Conference on Medical and Biological Engineering, 2019, pp. 665-670.
[8] J. Friedman, et al., “Additive logistic regression: A statistical view of boosting,” Annals of statistics, vol. 28, no. 2,
pp. 337-374, 2000.
[9] H. M. Ramos, et al., “A new explanatory index for evaluating the binary logistic regression based on the sensitivity
of the estimated model,” Statistics and Probability Letters, vol. 120, pp. 135-140, 2017.
[10] M. Ohsaki, et al., “Confusion-matrix-based kernel logistic regression for imbalanced data classification,” IEEE
Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1806-1819, 2017.
[11] J. A. McGowan, et al., “Predicting coastal algal blooms in southern California,” Ecology, vol. 98, no. 5,
pp. 1419-1433, 2017.
[12] N. F. Manning, et al., “Extending the forecast model: Predicting Western Lake Erie harmful algal blooms at
multiple spatial scales,” Journal of Great Lakes Research, vol. 45, no. 3 pp. 587-595, 2019.
[13] N. Mellios, et al., “Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern
European Lakes,” Water, vol. 12, no. 4, p. 1191, 2020.
[14] L. Wang, et al., “An approach of improved Multivariate Timing-Random Deep Belief Net modelling for algal
bloom prediction,” Biosystems engineering, vol. 177, pp. 130-138, 2019.
[15] Ghatkar, et al., “Classification of algal bloom species from remote sensing data using an extreme gradient boosted
decision tree model,” International Journal of Remote Sensing, vol. 40, no. 24, pp. 9412-9438, 2019.
[16] S. Lee and D. Lee, “Improved prediction of harmful algal blooms in four Major South Korea’s Rivers using deep
learning models,” International journal of environmental research and public health, vol. 15, no. 7, p. 1322, 2018.
[17] X. Sun, et al., “A Bayesian structural model for predicting algal blooms,” Journal of Forecasting, vol. 38, no. 8,
pp. 788-802, 2019.
[18] F. Thabtah et al., “A machine learning autism classification based on logistic regression analysis,” Health
information science and systems, vol. 7, no. 1, p. 12, 2019.
[19] D. Menezes, et al., “Data classification with binary response through the Boosting algorithm and logistic
regression,” Expert Systems with Applications, vol. 69, pp. 62-73, 2017.
[20] J. M. Hilbe, “Practical guide to logistic regression,” CRC Press, 2016.
[21] C. Fernández and F. Provost, “Causal Classification: Treatment Effect vs. Outcome Prediction,” Outcome
Prediction, 2019.
[22] K. Lee, et al., “Unbalanced data, type II error, and nonlinearity in predicting M&A failure,” Journal of Business
Research, vol. 109, pp. 271-287, 2010.
[23] S. V. Stehman, “Selecting and interpreting measures of thematic classification accuracy,” Remote Sensing of
Environment, vol. 62, no. 1, pp. 77-89, 1997.
[24] D. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness &
Correlation,” Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.
[25] J. S. Cramer, “The origins and development of the logit model,” Cambridge UP, pp. 1-19, 2003.
[26] I. M. Myung, “Tutorial on Maximum Likelihood Estimation,” Journal of Mathematical Psychology, vol. 47, no. 1,
pp. 90-100, 2003.
[27] F. Santosa and W. Symes, “Linear inversion of band-limited reflection seismograms,” SIAM Journal on Scientific
and Statistical Computing, SIAM, vol. 7, no. 4, pp. 1307-1330, 1986.
BIOGRAPHY OF AUTHOR
Hongwon Yun is a Professor with the Department of Computer Software Engineering at Silla
University, Busan, South Korea. He received his B.S. and the Ph.D. degrees at the Department of
Computer Science from Pusan National University, South Korea. His research interests include
database system, big data analysis, and machine learning.

Well-logging-methods_new................
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
bas. eng. economics group 4 presentation 1.pptx
UNIT 4 Total Quality Management .pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Lesson 3_Tessellation.pptx finite Mathematics
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Operating System & Kernel Study Guide-1 - converted.pdf
Arduino robotics embedded978-1-4302-3184-4.pdf
Construction Project Organization Group 2.pptx
Model Code of Practice - Construction Work - 21102022 .pdf
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Structs to JSON How Go Powers REST APIs.pdf
web development for engineering and engineering

Prediction model of algal blooms using logistic regression and confusion matrix

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 11, No. 3, June 2021, pp. 2407~2413
ISSN: 2088-8708, DOI: 10.11591/ijece.v11i3.pp2407-2413
Journal homepage: http://ijece.iaescore.com

Hongwon Yun
Department of Computer Software Engineering, Silla University, Busan, Republic of Korea

Article history: Received Jul 31, 2020; Revised Sep 22, 2020; Accepted Oct 8, 2020

ABSTRACT
Algal blooms data are collected and refined as experimental data for algal blooms prediction. The refined algal blooms dataset is analyzed by logistic regression, and statistical tests and regularization are performed to find the marine environmental factors affecting algal blooms. The predicted value of algal blooms is obtained through logistic regression analysis using these marine environmental factors. The actual values and the predicted values of the algal blooms dataset are applied to the confusion matrix. By improving the decision boundary of the existing logistic regression, the accuracy, sensitivity, and precision of algal blooms prediction are improved. In this paper, the algal blooms prediction model is established by the ensemble method using logistic regression and the confusion matrix. Algal blooms prediction is improved, and this is verified through big data analysis.

Keywords: Algal blooms, Confusion matrix, Ensemble method, Logistic regression, Prediction model

This is an open access article under the CC BY-SA license.

Corresponding Author:
Hongwon Yun
Department of Computer Software Engineering
Silla University
Busan 46958, Republic of Korea
Email: hwyun@silla.ac.kr

1. INTRODUCTION
Logistic regression is a special case of a typical model and is similar to linear regression; however, it differs in the relationship between the dependent and independent variables.
The dependent variable of logistic regression can be binary or continuous, and it is used as a model for classification or prediction when the dependent variable is binary [1, 2]. If the dependent variable of logistic regression is binary, its value is limited to two outcomes and the conditional probability follows the Bernoulli distribution. Logistic regression keeps the value of the dependent variable between 0 and 1 regardless of the range of the independent variables, so given input data it is possible to assign the result to a specific class and to predict the likelihood of an event occurring [3-5].

In logistic regression where the dependent variable is binary, the predicted value can be calculated using a linear combination of the independent variables. However, since the value of the dependent variable is classified as pass or fail around the decision boundary, predictions close to the decision boundary may be less accurate [6-8]. In binary logistic regression, since the actual value of the dependent variable is available and the predicted value can be calculated, the predicted values can be entered into a confusion matrix and compared with the target values [9, 10]. Sensitivity and precision can be obtained from the confusion matrix using the actual and predicted values of the logistic regression, and applying this to algal blooms yields a summary of indicators such as accuracy, sensitivity, and precision [11-13]. Sensitivity and precision are as important as accuracy in predicting algal bloom occurrence, because high sensitivity and precision provide indicators that can help prevent massive property damage [14-17].

The elements of the marine environment that cause algal blooms are generally known, but no study
can be found that analyzes the influence of each element on algal blooms and predicts algal blooms. In this study, the predicted value of logistic regression is calculated by machine learning. The actual values used in the logistic regression analysis and the predictions calculated through machine learning are applied to the confusion matrix to create a prediction model for algal blooms.

This paper is organized as follows. Logistic regression and the confusion matrix, the background theory of this study, are described in section 2. In section 3, we describe the algal blooms prediction model using the ensemble method of logistic regression and the confusion matrix proposed in this study. There we describe the process of extracting marine environmental elements using logistic regression, obtaining red tide prediction values, applying improved decision boundaries to logistic regression, and improving accuracy, sensitivity, and precision through the confusion matrix. In section 4, we verify the proposed algal blooms prediction model using the algal blooms dataset, and conclusions are given in section 5.

2. LOGISTIC REGRESSION AND CONFUSION MATRIX
2.1. Logistic regression
Linear regression is a model that estimates regression coefficients that linearly express the relationship between independent variables X and a dependent variable Y with continuous values. If the dependent variable Y is binary, logistic regression is used because linear regression cannot be applied directly. Some regression algorithms can be used for classification, and logistic regression is widely used to estimate the probability that a sample belongs to a particular class. If the estimated probability exceeds 0.5, the sample is predicted to belong to the class, and if it is less than 0.5, the sample is predicted not to belong to the class; in this way logistic regression serves as a binary classifier [18, 19].
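The 0.5 threshold rule described above can be sketched in Python. This is a minimal illustration rather than the paper's implementation: the names `sigmoid` and `classify` and the example scores are assumptions, and a probability of exactly 0.5 is assigned to the positive class here, a case the rule above leaves open.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0.

    Assumption: a probability equal to the threshold counts as positive.
    """
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical linear scores for three samples (not taken from the paper).
scores = [-2.0, 0.0, 1.5]
probabilities = [sigmoid(z) for z in scores]
labels = [classify(z) for z in scores]

for z, p, label in zip(scores, probabilities, labels):
    print(f"score={z:+.1f}  probability={p:.3f}  class={label}")
```

A score of zero maps to a probability of exactly 0.5, so the sign of the score alone decides the class.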
To estimate the probability, logistic regression calculates the weighted sum of the input features, but instead of outputting the result directly as linear regression does, it outputs the logistic of the result. The logistic is a sigmoid function that outputs a value between 0 and 1 [20]. The logistic function limits the output to lie between 0 and 1 for any numerical value x, and is defined as follows.

$$y = \frac{1}{1+e^{-f(x)}} \tag{1}$$

In (1), $f(x)$ can be either a simple linear function or a multiple linear function. For classification problems with two categories, a sample is classified as $y \to 1$ if $f(x) > 0$ and as $y \to 0$ if $f(x) < 0$. The decision boundary of the logistic regression model is the hyperplane $f(x) = 0$, where $y = 0.5$. Errors in prediction usually occur around the decision boundary [21, 22].

2.2. Confusion matrix
The confusion matrix is a tool that easily and effectively shows the performance of a classifier, and its results are easy to interpret. A confusion matrix can be used to evaluate the performance of any model or algorithm. As shown in Table 1, the rows of the confusion matrix represent the predicted class and the columns represent the actual class. Each cell is one of the possible combinations of prediction and actuality. In the 2×2 confusion matrix, there are true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) [23]. A perfect model has values only on the diagonal, with all other cells zero, while a bad model has its values spread evenly across all cells. The confusion matrix (error matrix) shows how a bad model fails, since the value of each cell identifies a pattern of misclassification [24].

Table 1. Confusion matrix
| Hypothesized class (Predicted) | True class (Actual): P | True class (Actual): N |
|---|---|---|
| Y | True Positives | False Positives |
| N | False Negatives | True Negatives |

Methods for summarizing the results of the confusion matrix include accuracy, precision, and recall.

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \tag{2}$$

The accuracy is obtained by dividing the number of correct predictions (TP+TN) by the total number of samples, and is given by (2). Among the methods for summarizing the results in the confusion
matrix, the most frequently used, precision and sensitivity, are given by (3) and (4), respectively.

$$Precision = \frac{TP}{TP+FP} \tag{3}$$

Precision is the positive predictive value: it measures how many of the samples predicted to be positive (TP+FP) are true positives (TP). Precision is used as a performance indicator when the goal is to reduce the number of false positives (FP).

$$Sensitivity = \frac{TP}{TP+FN} \tag{4}$$

Sensitivity measures how many of the total positive samples (TP+FN) are classified as positive (TP).

3. PREDICTION MODEL
The algal blooms dataset was collected from the National Institute of Fisheries Science and then cleaned and refined. The first multiple logistic regression analysis was performed on the refined algal blooms dataset, and some attributes were removed through a statistical test. A second multiple logistic regression analysis was performed without the removed attributes, and then regularization was applied. After applying the regularization, a third multiple logistic regression analysis is performed and the results are applied to the confusion matrix. Figure 1 shows this process.

Figure 1. Prediction process of algal blooms

The probability of occurrence of harmful algal blooms with two or more independent variables is defined as $p(x)$ and the odds as $odds = \frac{p}{1-p}$. When the range of input values is [0, 1], the logit transformation is performed to adjust the range of output values to $(-\infty, \infty)$, resulting in $\log(odds) = \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$ [25]. Therefore, for multiple independent variables that affect harmful algal blooms, the multiple logistic function that keeps the dependent variable in the range [0, 1] is as shown in (5).
Equation (5) calculates the effect of each element of the ocean observation data, the independent variables, on the occurrence of harmful algal blooms, the dependent variable. This is the basic model for estimating the probability of occurrence of harmful algal blooms.

$$p(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_i x_i)}} \tag{5}$$

Maximum likelihood estimation is used to estimate the parameters $\beta$ in the regression expression $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_i x_i$ obtained by the logit transformation. The log likelihood function can be obtained from the likelihood function [26], which is the product of Bernoulli probability functions, and is expressed as (6).
The parameter that maximizes the log likelihood function in (6) is determined from the multiple independent variables that affect harmful algal blooms.

$$\ln L = \sum_{i} y_i\left(\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_i x_i\right) - \sum_{i} \ln\left(1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_i x_i}\right) \tag{6}$$

The L1 regularization [27] used to eliminate low-impact independent variables among the multiple independent variables that affect harmful algal blooms is shown in (7).

$$\operatorname{argmin} \sum_{i=1}^{n}\left(y_i-\sum_{j=1}^{p} x_{ij}\beta_j\right)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \tag{7}$$

The attributes of the marine environment observation dataset are shown in Table 2 and are used as independent variables in logistic regression.

Table 2. Multiple independent variables for logistic regression
| Variable | Comment |
|---|---|
| T | Temperature |
| S | Salinity |
| DO | Dissolved Oxygen |
| P | Phosphate Phosphorus |
| NA | Nitrous Acid Nitrogen |
| N | Nitric Acid Nitrogen |
| SA | Silicic Acid Silicon |

Equation (8) is obtained by applying the seven independent variables (water temperature, salinity, dissolved oxygen, phosphate phosphorus, nitrous acid nitrogen, nitric acid nitrogen, and silicic acid silicon) to the basic multiple logistic regression model (5).

$$\log\frac{p(x)}{1-p(x)} = \beta_0+\beta_1 T_i+\beta_2 S_i+\beta_3 DO_i+\beta_4 P_i+\beta_5 NA_i+\beta_6 N_i+\beta_7 SA_i \tag{8}$$

The P-value is used to determine whether each independent variable is statistically significant in the results of the multiple logistic regression analysis on the training dataset, and independent variables with a P-value of 0.05 or higher are excluded. The parameters for the statistically significant independent variables are as shown in (9).

$$\log\frac{p(x)}{1-p(x)} = \beta_0+\beta_1 T_i+\beta_2 S_i+\beta_3 P_i+\beta_4 N_i \tag{9}$$

The regularization that sets coefficients close to zero to exactly zero, and so removes the corresponding independent variables, is given by (7). The result is as shown in (10) when (7) is applied to the result of (9).
$$\log\frac{p(x)}{1-p(x)} = \beta_0+\beta_1 T_i+\beta_2 S_i+\beta_3 P_i \tag{10}$$

Equation (11) is the logistic regression model for algal blooms prediction obtained by applying the above process to the algal blooms dataset.

$$p(x) = \frac{1}{1+e^{-(-5.89+0.34T_i-0.12S_i+0.35P_i)}} \tag{11}$$

The normalization process from (8) to (11) corresponds to Step 2 through Step 5 of the algorithm in Table 3, respectively. The algal blooms prediction model was normalized while performing experiments based on the algorithm in Table 3. The detailed experimental process is described in section 4. The equation for obtaining a decision boundary that increases the sensitivity and precision is defined in (12).

$$\Delta = \left\lvert 0.5 \pm \frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TP}{TP+FP}\right)\right\rvert \tag{12}$$

Table 3 shows the algorithm for establishing the algal blooms prediction model. This algorithm shows the process of first performing multiple logistic regression on the refined dataset, then running statistical tests on the results,
and then removing low-weight independent variables, and finally setting up the logistic regression model and finding the decision boundary.

Table 3. Algorithm for establishing algal blooms prediction model
| Step | Statements |
|---|---|
| 1 | Extraction, transformation, and loading from the collected dataset. Prepare the training dataset. |
| 2 | Perform multiple regression analysis using (1) on the training dataset. Output regression coefficients and statistical tests. |
| 3 | Perform a statistical significance test: attributes with P-value > 0.05 are excluded from the training dataset. |
| 4 | Perform multiple regression analysis on the training dataset with attributes whose P-value <= 0.05. Output regression coefficients and statistical tests for attributes whose P-value <= 0.05. |
| 5 | Regularize the regression coefficients from step 4 using $\operatorname{argmin} \sum_{i=1}^{n}\left(y_i-\sum_{j=1}^{p} x_{ij}\beta_j\right)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$. Perform multiple regression analysis using the regularized regression formula. Output the test dataset. |
| 6 | Input the test dataset from step 5. Predict the probability using the decision boundary $\Delta = \left\lvert 0.5 \pm \frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TP}{TP+FP}\right)\right\rvert$ based on the confusion matrix. |

4. EXPERIMENT
Multiple logistic regression analysis (8) is performed on the training dataset to obtain the results shown in Table 4. In Table 4, the P-value is used to determine whether each independent variable is statistically significant, and independent variables with a P-value of 0.05 or higher are excluded. The parameters $\beta$ are determined for the statistically significant independent variables in Table 4, L1 regularization is applied, and the results shown in Table 5 are obtained. Equation (11), the logistic regression model for algal blooms prediction, is obtained from the coefficients in Table 5.

Table 4. 1st multiple logistic regression analysis on training dataset
| Input variables | Coefficient | Std. error | P-value |
|---|---|---|---|
| Constant | -5.35 | 1.49 | 0.00 |
| Temperature | 0.33 | 0.02 | 0.00 |
| Salinity | -0.12 | 0.03 | 0.00 |
| Dissolved Oxygen | -0.05 | 0.08 | 0.54 |
| Phosphate Phosphorus | 0.38 | 0.14 | 0.01 |
| Nitrous Acid Nitrogen | -0.07 | 0.12 | 0.58 |
| Nitric Acid Nitrogen | -0.06 | 0.02 | 0.00 |
| Silicic Acid Silicon | -0.02 | 0.01 | 0.16 |

Table 5. Coefficients of logistic regression model for algal blooms prediction
| Input variables | Coefficient | Std. error | P-value |
|---|---|---|---|
| Constant | -5.89 | 1.21 | 0.00 |
| Temperature | 0.34 | 0.01 | 0.00 |
| Salinity | -0.12 | 0.03 | 0.00 |
| Phosphate Phosphorus | 0.35 | 0.14 | 0.01 |

Predicting the occurrence of algal blooms from (11) gives 91.84% accuracy. Accuracy alone may not be sufficient to assess the predictive performance for algal blooms. We utilize the confusion matrix because accuracy does not reveal the false negatives or false positives of algal blooms. The confusion matrix for algal blooms shown in Table 6 is obtained from the algal blooms dataset.

Table 6. Confusion matrix for algal blooms (error matrix)
| Predicted values of algal blooms | Actual: P (Occurrence) | Actual: N (Not occurrence) |
|---|---|---|
| Y (0.5 or higher) | True Positive, tp=222 | False Positive, fp=205 |
| N (less than 0.5) | False Negative, fn=599 | True Negative, tn=8828 |
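As a check on the arithmetic, the summary indicators defined in (2) to (4) can be recomputed from the counts in Table 6, and model (11) can be evaluated for a single observation. The following Python sketch is illustrative only: the function names are assumptions, the fallout and F1 formulas are the standard definitions rather than ones stated in this paper, and the sample temperature, salinity, and phosphate values are hypothetical because the dataset and its units are not given here.

```python
import math

# Coefficients of model (11), taken from Table 5.
INTERCEPT = -5.89
COEF_TEMPERATURE = 0.34
COEF_SALINITY = -0.12
COEF_PHOSPHATE = 0.35


def bloom_probability(temperature, salinity, phosphate):
    """Evaluate the logistic model (11) for one observation."""
    z = (INTERCEPT
         + COEF_TEMPERATURE * temperature
         + COEF_SALINITY * salinity
         + COEF_PHOSPHATE * phosphate)
    return 1.0 / (1.0 + math.exp(-z))


def summarize(tp, fp, fn, tn):
    """Return confusion-matrix indicators as percentages rounded to 2 places."""
    sensitivity = tp / (tp + fn)                 # equation (4)
    precision = tp / (tp + fp)                   # equation (3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # equation (2)
    specificity = tn / (tn + fp)                 # standard definition (assumed)
    fallout = fp / (fp + tn)                     # standard definition (assumed)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    indicators = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "fallout": fallout,
        "accuracy": accuracy,
        "f1": f1,
    }
    return {name: round(100 * value, 2) for name, value in indicators.items()}


# Counts from Table 6 (decision boundary 0.5).
table6 = summarize(tp=222, fp=205, fn=599, tn=8828)
print(table6)

# Hypothetical observation; the values and units are assumptions.
p = bloom_probability(temperature=25.0, salinity=32.0, phosphate=0.5)
print("predicted class:", 1 if p >= 0.5 else 0)
```

The same `summarize` function can be applied to any other set of counts, for example those obtained with a different decision boundary.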
Table 7 shows the sensitivity, specificity, and precision obtained with the decision boundary 0.5 using the values of the confusion matrix in Table 6. The sensitivity is only 27.04%, so most actual algal blooms are missed (false negatives), and the precision is only 51.99%, so nearly half of the predicted blooms are false positives. Since the sensitivity and precision are low when the decision boundary is 0.5, we apply the proposed decision boundary (12) to address these problems, and the results are shown in Table 8. When the decision boundary proposed in this paper is applied, it becomes $\Delta = \lvert 0.5 \pm 0.25 \rvert$. With this decision boundary, TP=494, TN=9026, FN=327, and FP=7, the sensitivity is 60.17%, and the precision is 98.6%, as shown in Table 8.

Table 7. Resulting confusion matrix based on decision boundary 0.5 (unit: %)
| TPR (Sensitivity) | TNR (Specificity) | PPV (Precision) | FPR (Fallout) | ACC (Accuracy) | F1 (F1 Score) |
|---|---|---|---|---|---|
| 27.04 | 97.73 | 51.99 | 2.27 | 91.84 | 35.58 |

Table 8. Resulting confusion matrix based on proposed decision boundary $\Delta$ (unit: %)
| TPR (Sensitivity) | TNR (Specificity) | PPV (Precision) | FPR (Fallout) | ACC (Accuracy) | F1 (F1 Score) |
|---|---|---|---|---|---|
| 60.17 | 99.92 | 98.6 | 0.08 | 96.61 | 74.74 |

5. CONCLUSION
In this paper, logistic regression and the confusion matrix were used to predict the occurrence of algal blooms. Algal blooms datasets were collected and refined for experimental analysis of algal blooms prediction. Logistic regression analysis was performed on the refined algal blooms dataset, and the main marine environmental factors affecting algal blooms were found through statistical tests and regularization. Logistic regression was then performed using the marine environmental factors that influence algal blooms, and the accuracy of predicting algal bloom occurrence was obtained.
The values of the confusion matrix were obtained using the dataset for algal blooms prediction and the predicted values obtained from logistic regression. Although the sensitivity and precision for the occurrence of algal blooms can be obtained from the values of the confusion matrix, the sensitivity and precision were low when the existing decision boundary of 0.5 was used. Sensitivity and precision were improved by using the decision boundary proposed in this study. In this paper, the algal blooms prediction model was established by the ensemble method using logistic regression analysis and the confusion matrix. The accuracy, sensitivity, and precision of algal blooms prediction were also improved, and these improvements were verified through big data analysis.

REFERENCES
[1] Hosmer Jr., et al., “Applied logistic regression,” John Wiley & Sons, vol. 398, 2013.
[2] M. Chang, et al., “Selection of Transformations of Continuous Predictors in Logistic Regression,” Information Technology-New Generations, pp. 443-447, 2018.
[3] J. W. Osborne, “Simple Linear Models with Categorical Dependent Variables: Binary Logistic Regression,” SAGE, pp. 97-132, 2017.
[4] J. Tolles and J. M. William, “Logistic regression: relating patient characteristics to outcomes,” Jama, vol. 316, no. 5, pp. 533-534, 2016.
[5] A. D. Caigny, et al., “A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees,” European Journal of Operational Research, vol. 269, no. 2, pp. 760-772, 2018.
[6] X. Wan, “The Influence of Polynomial Order in Logistic Regression on Decision Boundary,” IOP Conference Series: Earth and Environmental Science, vol. 267, no. 4, pp. 1-4, 2019.
[7] S. Ghazaal and A. Hakan, “Using the Distance in Logistic Regression Models for Predictor Ranking in Diabetes Detection,” International Conference on Medical and Biological Engineering, 2019, pp. 665-670.
[8] J. Friedman, et al., “Additive logistic regression: A statistical view of boosting,” Annals of statistics, vol. 28, no. 2, pp. 337-374, 2000.
[9] H. M. Ramos, et al., “A new explanatory index for evaluating the binary logistic regression based on the sensitivity of the estimated model,” Statistics and Probability Letters, vol. 120, pp. 135-140, 2017.
[10] M. Ohsaki, et al., “Confusion-matrix-based kernel logistic regression for imbalanced data classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1806-1819, 2017.
[11] J. A. McGowan, et al., “Predicting coastal algal blooms in southern California,” Ecology, vol. 98, no. 5, pp. 1419-1433, 2017.
[12] N. F. Manning, et al., “Extending the forecast model: Predicting Western Lake Erie harmful algal blooms at multiple spatial scales,” Journal of Great Lakes Research, vol. 45, no. 3, pp. 587-595, 2019.
[13] N. Mellios, et al., “Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes,” Water, vol. 12, no. 4, p. 1191, 2020.
[14] L. Wang, et al., “An approach of improved Multivariate Timing-Random Deep Belief Net modelling for algal bloom prediction,” Biosystems engineering, vol. 177, pp. 130-138, 2019.
[15] Ghatkar, et al., “Classification of algal bloom species from remote sensing data using an extreme gradient boosted decision tree model,” International Journal of Remote Sensing, vol. 40, no. 24, pp. 9412-9438, 2019.
[16] S. Lee and D. Lee, “Improved prediction of harmful algal blooms in four Major South Korea’s Rivers using deep learning models,” International journal of environmental research and public health, vol. 15, no. 7, p. 1322, 2018.
[17] X. Sun, et al., “A Bayesian structural model for predicting algal blooms,” Journal of Forecasting, vol. 38, no. 8, pp. 788-802, 2019.
[18] F. Thabtah et al., “A machine learning autism classification based on logistic regression analysis,” Health information science and systems, vol. 7, no. 1, p. 12, 2019.
[19] D. Menezes, et al., “Data classification with binary response through the Boosting algorithm and logistic regression,” Expert Systems with Applications, vol. 69, pp. 62-73, 2017.
[20] J. M. Hilbe, “Practical guide to logistic regression,” CRC Press, 2016.
[21] C. Fernández and F. Provost, “Causal Classification: Treatment Effect vs. Outcome Prediction,” Outcome Prediction, 2019.
[22] K. Lee, et al., “Unbalanced data, type II error, and nonlinearity in predicting M&A failure,” Journal of Business Research, vol. 109, pp. 271-287, 2010.
[23] S. V. Stehman, “Selecting and interpreting measures of thematic classification accuracy,” Remote Sensing of Environment, vol. 62, no. 1, pp. 77-89, 1997.
[24] D. Powers, “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation,” Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.
[25] J. S. Cramer, “The origins and development of the logit model,” Cambridge UP, pp. 1-19, 2003.
[26] I. M. Myung, “Tutorial on Maximum Likelihood Estimation,” Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90-100, 2003.
[27] F. Santosa and W. Symes, “Linear inversion of band-limited reflection seismograms,” SIAM Journal on Scientific and Statistical Computing, SIAM, vol. 7, no. 4, pp. 1307-1330, 1986.

BIOGRAPHY OF AUTHOR
Hongwon Yun is a Professor with the Department of Computer Software Engineering at Silla University, Busan, South Korea. He received his B.S. and Ph.D. degrees from the Department of Computer Science at Pusan National University, South Korea. His research interests include database systems, big data analysis, and machine learning.