Predicting Football Match Results with Data Mining Techniques

O. I. Aladesote, O. Agbelusi & M. Ganiyu
Abstract- Data mining techniques are effective and useful for forecasting in many domains. In this research,
Spanish La Liga football match outcomes are predicted using five data mining techniques (Multilayer Perceptron,
Decision Tables, Random Forest, REPTree and Meta Bagging) to determine the most accurate among them. The
experiments, carried out with Weka 3.9, show that all the techniques performed well in terms of accuracy, but
Multilayer Perceptron was the most successful, with an average accuracy of 100%.
I. INTRODUCTION
Football is a fast-growing sport that is becoming one of the most viewed and richest sports. The drive to be more
than just a spectator has led to this research into predicting the final outcome of a match, which would also make
sports betting easier. One of the reasons football is the most popular sport on the planet is its unpredictability.
Every day, fans around the world argue over which team is going to win the next game or the next competition.
Many of these fans also put their money where their mouths are, by betting large sums on their predictions. Due to
the large number of factors that can affect the result of a football match, it is very difficult to predict the
probabilities of its outcomes correctly. With the growing amount of money invested in sports betting markets, it is
important to verify how far data mining techniques can bring value to this area [9].
To solve this problem, we propose building data-driven solutions designed through a data mining process. Data
mining is an aspect of computing used to extract hidden information and to automate the detection of relevant
patterns in a database. The data mining process allows us to build models that give predictions according to the
data fed into the system. This study aims to use data mining techniques to predict football match results. Every
sport has its own rules, number of players and styles, that is, its own set of features. For a beginner, building a
predictive model from scratch with a sizeable dataset can be challenging. With the outcome of this research, anyone,
especially football fans, should be able to predict match results based on the identified factors.
We summarize the contributions of this paper as follows:
• Forecasting of La Liga football match outcomes using data from five previous seasons.
• Comparative analysis to determine the most accurate technique.
The remainder of the paper is organized as follows: Section 2 presents the literature review. Section 3 presents the
method used to generate the results. The experimental results for each data mining technique are presented and
discussed in Section 4. Comparative analysis is done in Section 5 and, finally, conclusion and future work are
presented in Section 6.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 18, No. 6, June 2020
46 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

II. LITERATURE REVIEW
Data mining is an important tool in event prediction. The studies selected and discussed in this section are those
most closely related to match result prediction.
A match result prediction system was proposed using four data mining techniques. The author used basketball results
of four seasons (from 2005/2006 to 2009/2010) as training data and, in order to assess the models, used the results
of the 2010/2011 season as test data. The models achieved comparable classification accuracy, with 67.8% as the
highest [4].
The authors proposed the use of ANN and logistic regression techniques to forecast the outcome of 2014-2015
English Premier League matches, in order to address the complexity and inaccurate predictions of statistical
approaches. Nine significant features were randomly selected from the records. The experimental results show that
logistic regression performed better than ANN and that the techniques improve prediction accuracy [17].
[13] carried out a preliminary investigation into forecasting the results of the National Football League (NFL)
using an artificial neural network (ANN). Five variables randomly extracted from the first eight rounds of the
competition were used for the prediction. Teams were classified as either strong or weak using clustering methods.
The paper proposes data mining techniques to overcome the limitations of the numeric prediction approach. Eight
years of data were used. To evaluate the performance of these techniques, both classification and regression models
were used. The experimental results clearly show that the accuracy of the classification model exceeds that of the
regression model [15].
The researchers carried out a performance evaluation using three classification models (naive Bayes, artificial neural
networks (ANNs) and decision trees) [16]. The models were built using different variables of NBA matches. The
experimental results show that the accuracy of the proposed model is very reliable and that defensive rebounds is
the most significant variable. Three other variables were also found to be significant.
The researchers adopted three data mining approaches to build models of game outcome from historical data. The
purpose was to counter the idea of ranking the likely winner of a game based on experience. At the end of the
modeling process, all three models were capable of forecasting the winner of the game, and the decision tree
produced the highest accuracy [5].
A reliable tennis match outcome prediction model is proposed in which numerous factors are systematically
prioritized to determine the match outcome. The results show that the proposed model, which combines data and
judgement, predicts the outcome of a match with 85.1% accuracy [7].
A machine learning method was adopted to forecast the results of future soccer matches based on a dataset of past
matches. In this research, two important ideas emerged from challenges encountered while modeling the 2017 soccer
match results. These two ideas led to new feature engineering methods (recency and rating extraction) for match
result forecasting. The authors concluded that good forecasting should be based on the knowledge of machine
learning [3].

The authors developed predictive models to forecast the outcome of football matches for the 2008/2009 and 2015/2016
seasons. Techniques such as artificial neural network (ANN), Random Forest (RF) and Support Vector Machine
(SVM) were used to develop the models. A comparative analysis shows that the models are capable of predicting
correctly when compared with the results obtained from the experience of football match analysts [8].
This paper proposes machine learning methods to determine the result of NBA matches. The forecasting was based on
historical data, and the performance of the developed models was evaluated. The results show that the defensive
rebounds feature was important for optimal prediction of the game result across all the methods. Further research
will use models such as function-based techniques and deep learning [16].
III. METHODOLOGY
This section describes the dataset, classification techniques and performance analysis. The experiment is done using
Weka 3.9.2 with five algorithms: Multilayer Perceptron, Decision Tables, Random Forest, REPTree and Meta Bagging.
In Weka, 10-fold cross-validation is adopted as the classifier evaluation option.
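As a rough illustration of the cross-validation protocol, the sketch below splits the instances of one 380-match season into ten folds in plain Python, so that each instance is used for testing exactly once. It is not Weka's implementation: the helper name k_fold_indices, the shuffling seed and the round-robin fold assignment are assumptions made for this example, and Weka's own fold construction may differ.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Assign instances round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training set: every instance that is not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# One season of 380 matches split into 10 folds.
splits = list(k_fold_indices(380, k=10))
```

A classifier would be trained on each train_idx and scored on the matching test_idx, and the ten scores averaged to give the reported accuracy.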
A. Dataset
The dataset used for the implementation covers the Spanish La Liga seasons from 2014/2015 to 2018/2019 [18].
The league consists of twenty teams that play each other both home and away, giving 380 matches per season and
1900 matches over the five seasons. The data consists of 61 features, of which 22 are match statistics such as
full-time and half-time result, home and away team shots, etc., while the remaining 39 are football betting details.
Out of the 22 statistical features, 10 were randomly selected as predictors, with the full-time result (Home Win,
Away Win or Draw) as the target.
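The match counts quoted above follow from the double round-robin format: every ordered (home, away) pair of distinct teams plays once. A minimal check, with matches_per_season as a hypothetical helper name:

```python
def matches_per_season(n_teams):
    # Each ordered (home, away) pair of distinct teams meets exactly once.
    return n_teams * (n_teams - 1)

per_season = matches_per_season(20)   # 20 * 19
total = per_season * 5                # five seasons
print(per_season, total)              # prints: 380 1900
```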
B. Performance Analysis
The performance of these classification algorithms was measured based on accuracy. Accuracy is the rate at which
the classifier assigns the correct target class, that is, the proportion of data instances correctly classified [2].
$$\text{Accuracy} = \frac{\text{number of correctly predicted match results}}{\text{total number of match results}} \times 100\% \qquad (1)$$
The number of correctly predicted match results is the sum of the correctly predicted Home Win, Away Win and
Draw results.
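The accuracy measure can be sketched in a few lines of Python. The function name, the label codes and the toy labels below are assumptions for illustration only and are not taken from the dataset.

```python
def accuracy(actual, predicted):
    """Percentage of instances whose predicted label equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# Hypothetical label codes for the three result classes.
actual    = ["H", "D", "A", "H", "A"]
predicted = ["H", "D", "H", "H", "A"]
score = accuracy(actual, predicted)   # 4 of 5 correct
```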
IV. RESULT AND DISCUSSION
The results of the experiment carried out on the five classification techniques are presented and analysed based on
the percentage accuracy of each technique. 10-fold cross-validation was adopted because of the small size of the
data.
A. Multilayer Perceptron
Multilayer Perceptron is a type of artificial neural network, which has proved to be a valuable alternative to
traditional statistical techniques and makes no prior assumptions about the data distribution [6].
Multilayer Perceptron is applied to the La Liga datasets using Weka 3.9.2. The percentage accuracy for each of the
five seasons is 100%, as shown in Table 1, and the detailed output of Multilayer Perceptron for the 2018/2019
season is shown in Figure 1.

TABLE I
PERCENTAGE ACCURACY OF THE MULTILAYER PERCEPTRON FOR FIVE SEASONS
Season Accuracy (%)
2018/2019 Season 100
2017/2018 Season 100
2016/2017 Season 100
2015/2016 Season 100
2014/2015 Season 100
FIGURE 1: DETAILED OUTPUT OF MULTILAYER PERCEPTRON OF 2018/2019 SEASON
B. Decision Tables
A decision table is a set of rules that indicates the actions to be taken when certain conditions are met [12]. The
datasets are imported into Weka 3.9.2 and run using the Decision Tables technique. The percentage accuracy is
97.38% for the 2018/2019 season, 91.58% for the 2017/2018 season, 94.74% for the 2016/2017 season, 98.95% for the
2015/2016 season and 96.84% for the 2014/2015 season. The percentage accuracy for the seasons using Decision
Tables is presented in Table 2.

TABLE II
PERCENTAGE ACCURACY OF THE DECISION TABLES FOR FIVE SEASONS
Season Accuracy (%)
2018/2019 Season 97.38
2017/2018 Season 91.58
2016/2017 Season 94.74
2015/2016 Season 98.95
2014/2015 Season 96.84
FIGURE 2: DETAILED OUTPUT OF DECISION TABLE OF 2017/2018 SEASON
C. Random Forest
Random Forest is a statistical learning model: a tree-based ensemble in which each tree depends on a collection of
random variables. It performs well with small or medium datasets and can outperform more recent algorithms [1],
[11]. The datasets are imported into Weka 3.9.2 and run using the Random Forest technique. The percentage
accuracy is 98.42% for the 2018/2019 season, 98.95% for the 2017/2018 season, 98.16% for the 2016/2017 season,
97.63% for the 2015/2016 season and 99.47% for the 2014/2015 season. The percentage accuracy for the seasons using
Random Forest is presented in Table 3 below.

TABLE III
PERCENTAGE ACCURACY OF THE RANDOM FOREST FOR FIVE SEASONS
Season Accuracy (%)
2018/2019 Season 98.42
2017/2018 Season 98.95
2016/2017 Season 98.16
2015/2016 Season 97.63
2014/2015 Season 99.47
FIGURE 3: DETAILED OUTPUT OF RANDOM FOREST OF 2016/2017 SEASON
D. REPTree
Reduced Error Pruning Tree (REPTree) is a fast decision tree learner that builds a decision or regression tree using
information gain or variance reduction as the splitting criterion and prunes it with reduced-error pruning [10]. The
La Liga datasets for the 2014/2015 to 2018/2019 seasons are loaded into Weka 3.9.2 for the prediction. The
percentage accuracy is 98.68% for the 2018/2019 season, 98.16% for the 2017/2018 season, 98.42% for the 2016/2017
season, 97.89% for the 2015/2016 season and 98.95% for the 2014/2015 season. The percentage accuracy for the
seasons is presented in Table 4.
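To make the information gain criterion concrete, the sketch below computes the entropy reduction achieved by one candidate split. It illustrates the splitting measure only, not Weka's REPTree code (pruning is not shown), and the function names and toy labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction from splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy labels (hypothetical): a split that separates two classes perfectly.
parent = ["H", "H", "A", "A"]
gain = information_gain(parent, ["H", "H"], ["A", "A"])
```

A tree learner of this kind evaluates many candidate splits at each node and keeps the one with the largest gain.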

TABLE IV
PERCENTAGE ACCURACY OF THE REPTREE FOR FIVE SEASONS
Season Accuracy (%)
2018/2019 Season 98.68
2017/2018 Season 98.16
2016/2017 Season 98.42
2015/2016 Season 97.89
2014/2015 Season 98.95
FIGURE 4: DETAILED OUTPUT OF REPTREE OF 2015/2016 SEASON
E. Meta Bagging
Meta Bagging (bootstrap aggregating) is an ensemble machine learning algorithm developed to improve the accuracy
of the classification and regression algorithms it is applied to [14]. The La Liga datasets for the 2014/2015 to
2018/2019 seasons are loaded into Weka 3.9.2 for the prediction. The percentage accuracy is 99.74% for the
2018/2019 season, 98.42% for the 2017/2018 season, 98.16% for the 2016/2017 season, 98.42% for the 2015/2016 season
and 99.47% for the 2014/2015 season. The percentage accuracy for the seasons is presented in Table 5.
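The idea behind bagging can be sketched as follows: train several copies of a base learner on bootstrap samples (drawn with replacement) and combine their predictions by majority vote. This is a minimal illustration under stated assumptions, not the configuration used in the experiments: the 1-nearest-neighbour base learner, the single numeric feature, the toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_neighbour(sample):
    """Stand-in base learner: 1-nearest-neighbour on one numeric feature."""
    def predict(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return predict

def bagging(data, n_models=25, seed=1):
    """Train n_models base learners on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = [nearest_neighbour(bootstrap_sample(data, rng)) for _ in range(n_models)]
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Toy data (hypothetical): (feature value, result label).
data = [(1, "A"), (2, "A"), (3, "A"), (7, "H"), (8, "H"), (9, "H")]
ensemble = bagging(data)
```

Averaging over many resampled models tends to reduce the variance of an unstable base learner, which is the usual motivation for the technique.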

TABLE V
PERCENTAGE ACCURACY OF THE META BAGGING FOR FIVE SEASONS
Season Accuracy (%)
2018/2019 Season 99.74
2017/2018 Season 98.42
2016/2017 Season 98.16
2015/2016 Season 98.42
2014/2015 Season 99.47
FIGURE 5: DETAILED OUTPUT OF META BAGGING OF 2014/2015 SEASON
V. COMPARATIVE ANALYSIS
The comparative analysis of the results shows that Multilayer Perceptron has the best overall average accuracy at
100%, followed by Meta Bagging with 98.84%, Random Forest with 98.53% and REPTree with 98.42%, while Decision
Tables has the lowest average accuracy at 95.90%, as presented in Table 6 and Figure 6 below.

TABLE VI
COMPARISON OF AVERAGE PERCENTAGE ACCURACY
Season              Multilayer Perceptron   Decision Tables   Random Forest   REPTree   Meta Bagging
2018/2019 Season    100%                    97.38%            98.42%          98.68%    99.74%
2017/2018 Season    100%                    91.58%            98.95%          98.16%    98.42%
2016/2017 Season    100%                    94.74%            98.16%          98.42%    98.16%
2015/2016 Season    100%                    98.95%            97.63%          97.89%    98.42%
2014/2015 Season    100%                    96.84%            99.47%          98.95%    99.47%
Average Accuracy    100%                    95.90%            98.53%          98.42%    98.84%
FIGURE 6: GRAPHICAL REPRESENTATION OF AVERAGE ACCURACY
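The averages in Table VI can be reproduced from the per-season figures. The short script below copies the season accuracies from the table (ordered from 2018/2019 to 2014/2015), averages them and ranks the techniques; the variable names are illustrative.

```python
from statistics import mean

# Per-season accuracies (%) copied from Table VI, 2018/2019 down to 2014/2015.
season_accuracy = {
    "Multilayer Perceptron": [100, 100, 100, 100, 100],
    "Decision Tables": [97.38, 91.58, 94.74, 98.95, 96.84],
    "Random Forest": [98.42, 98.95, 98.16, 97.63, 99.47],
    "REPTree": [98.68, 98.16, 98.42, 97.89, 98.95],
    "Meta Bagging": [99.74, 98.42, 98.16, 98.42, 99.47],
}

# Average over the five seasons, rounded to two decimals as in the table.
averages = {name: round(mean(values), 2) for name, values in season_accuracy.items()}

# Techniques ordered from highest to lowest average accuracy.
ranking = sorted(averages, key=averages.get, reverse=True)
```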
VI. CONCLUSION AND FUTURE WORK
This work compared five data mining algorithms on Spanish La Liga football match outcomes. The experimental
results show that Multilayer Perceptron was the most successful, which makes it the best of the five data mining
techniques for predicting La Liga match outcomes, with 100% accuracy as against Decision Tables with 95.90%,
Random Forest with 98.53%, REPTree with 98.42% and Meta Bagging with 98.84%. In future work, other data mining
techniques can be applied, and the rating of each team can be considered as one of the variables.
References
[1] A. Cutler, D. R. Cutler, and J. R. Stevens, “Ensemble Machine Learning,” Ensemble Mach. Learn., no. January, 2012, doi:
10.1007/978-1-4419-9326-7.
[2] C. M. F. Che Mohd Rosli, M. Z. Saringat, N. Razali, and A. Mustapha, “A Comparative Study of Data Mining Techniques on Football
Match Prediction,” J. Phys. Conf. Ser., vol. 1020, no. 1, 2018, doi: 10.1088/1742-6596/1020/1/012003.
[3] D. Berrar, P. Lopes, and W. Dubitzky, “Incorporating domain knowledge in machine learning for soccer outcome prediction,” Mach.
Learn., vol. 108, no. 1, pp. 97–126, 2019, doi: 10.1007/s10994-018-5747-8.
[4] C. Cao, “Sports data mining technology used in basketball outcome prediction,” Dublin Inst. Technol., pp. 1–86, 2012.
[5] D. Delen, D. Cogdell, and N. Kasap, “A comparative analysis of data mining methods in predicting NCAA bowl outcomes,” Int. J.
Forecast., vol. 28, no. 2, pp. 543–552, 2012, doi: 10.1016/j.ijforecast.2011.05.002.
[6] M. W. Gardner and S. R. Dorling, “Artificial neural networks (the multilayer perceptron) - a review of applications in the atmospheric
sciences,” Atmos. Environ., vol. 32, no. 14–15, pp. 2627–2636, 1998, doi: 10.1016/S1352-2310(97)00447-0.
[7] W. Gu and T. L. Saaty, “Predicting the Outcome of a Tennis Tournament: Based on Both Data and Judgments,” J. Syst. Sci. Syst. Eng.,
vol. 28, no. 3, pp. 317–343, 2019, doi: 10.1007/s11518-018-5395-3.
[8] H. Chen, “Neural Network Algorithm in Predicting Football Match Outcome Based on Player Ability Index,” Adv. Phys. Educ., vol. 09,
no. 04, pp. 215–222, 2019, doi: 10.4236/ape.2019.94015.
[9] J. J. Zhang, E. Kim, B. Marstromartino, T. Y. Qian, and J. Nauright, “The sport industry in growing economies: critical issues and
challenges,” Int. J. Sport. Mark. Spons., vol. 19, no. 2, pp. 110–126, 2018, doi: 10.1108/IJSMS-03-2018-0023.
[10] S. Kalmegh, “Analysis of WEKA Data Mining Algorithm REPTree , Simple Cart and RandomTree for Classification of Indian News,”
Int. J. Innov. Sci. Eng. Technol., vol. 2, no. 2, pp. 438–446, 2015.
[11] R. Kohavi, “The power of decision tables,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes
Bioinformatics), vol. 912, pp. 174–189, 1995, doi: 10.1007/3-540-59286-5_57.
[12] D. Tables, F. Definition, and S. P. Decision, “(cf. (cf.,” pp. 68–80, 1991.
[13] A. Reso- and K. Self-, “Different Training Methods Perform in Calling the Games,” pp. 9–15, 1996.
[14] P. Shrivastava and M. Shukla, “Uses the Bagging Algorithm of Classification Method Learning and Forest Fire Data,” Int. J. Adv.
Comput. Eng. Netw., vol. 01, no. 12, pp. 91–95, 2014.
[15] S. J. Lee and K. Siau, “A review of data mining techniques,” Ind. Manag. Data Syst., vol. 101, no. 1, pp. 41–46, 2001, doi:
10.1108/02635570110365989.
[16] F. Thabtah, L. Zhang, and N. Abdelhamid, “NBA Game Result Prediction Using Feature Analysis and Machine Learning,” Ann. Data
Sci., vol. 6, no. 1, pp. 103–116, 2019, doi: 10.1007/s40745-018-00189-x.
[17] C. P. Igiri and E. O. Nwachukwu, “An Improved Prediction System for Football Match Result,” IOSR J. Eng., vol. 04, no. 12,
pp. 12–20, 2014, doi: 10.9790/3021-04124012020.
[18] Spanish La Liga (football) dataset [Online]. Available:
https://datahub.io/sports-data/spanish-la-liga#data. [Accessed on 17 December, 2019].