Baseline Models
Outline
Before a final neural network model is trained, it is useful to establish accuracy benchmarks against which the final model can be compared. The following baseline models are designed to 'set the bar' for the prediction accuracy achievable with the same WIGGO or FIFA dataset. In this case, all three models are fit on the full FIFA dataset, as WIGGO is expected to outperform the FIFA rankings when used with the same model(s). Furthermore, using the acquired data with a predictive model helps to assess the feasibility of using the data in the first place: if the added predictors give the models more usable information, then the data is indeed predictive of the outcomes of football matches, at least to some extent.
I. Trivial Model - Predict All Games As Home Wins
The first baseline model is, beyond random guessing of match outcomes, the most primitive possible prediction model. This trivial model simply predicts the designated 'Home' team to win each match, since a plurality of 2018 World Cup matches ended with a win for the 'Home' team. With this prediction rule, the model achieves a 40.6% prediction accuracy. While this is more than 7 percentage points above random guessing (33.33%), it is still poor in terms of overall predictive ability.
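The prediction rule above can be sketched in a few lines of Python. This is only an illustration: the `outcomes` list is made-up placeholder data rather than the actual 2018 World Cup results, and the helper name `majority_class_baseline` is hypothetical.

```python
from collections import Counter

# Hypothetical match outcomes for illustration only (NOT the real 2018 data).
# Each label is the result from the designated 'Home' team's point of view.
outcomes = ["home_win", "draw", "away_win", "home_win",
            "away_win", "home_win", "draw", "home_win"]

def majority_class_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

predicted_label, accuracy = majority_class_baseline(outcomes)

# Uniform random guessing over the distinct outcomes (three here).
chance_level = 1 / len(set(outcomes))

print(f"always predict {predicted_label!r}: accuracy = {accuracy:.3f}")
print(f"margin over uniform guessing = {accuracy - chance_level:.3f}")
```

On the real data, the same calculation would be expected to reproduce the 40.6% figure, since the trivial model's accuracy is simply the share of matches won by the 'Home' team.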
II. Multinomial Logistic Regression
Beyond the trivial model, the next simplest prediction method is a regression algorithm. Since the prediction problem in this project (i.e. predicting World Cup outcomes) boils down to a classification problem, where the model must choose among three possible outcomes for each game, the simplest suitable regression model is multinomial logistic regression. Although a logistic model is fairly bare-bones relative to more complex models like neural networks, it remains a strong choice for classification. This is supported by its 56.3% classification accuracy, which is more than 15 percentage points above the trivial model and nearly 23 percentage points above random guessing.
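To make the idea of a three-class logistic model concrete, the following is a minimal NumPy sketch of multinomial (softmax) logistic regression trained by plain gradient descent. It is not the implementation used in this project: the data is synthetic, the feature count of 31 is an assumption, and the function names, learning rate, and step count are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes, lr=0.1, n_steps=500):
    """Fit multinomial logistic regression by gradient descent on cross-entropy."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]            # one-hot encoded targets
    for _ in range(n_steps):
        P = softmax(X @ W + b)          # predicted class probabilities
        grad = (P - Y) / n_samples      # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Pick the class with the highest predicted probability."""
    return softmax(X @ W + b).argmax(axis=1)

# Synthetic stand-in data (assumption): 3 classes, 31 features, 300 samples.
rng = np.random.default_rng(0)
n_classes, n_features = 3, 31
centers = rng.normal(size=(n_classes, n_features))
y = rng.integers(0, n_classes, size=300)
X = centers[y] + rng.normal(size=(300, n_features))

W, b = fit_softmax_regression(X, y, n_classes)
train_accuracy = (predict(X, W, b) == y).mean()
print(f"training accuracy on synthetic data: {train_accuracy:.3f}")
```

The synthetic classes are far easier to separate than real match outcomes, so the accuracy printed here says nothing about the 56.3% reported above; the sketch only shows the mechanics of the model.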
III. Neural Network (Multilayer Perceptron)
The final baseline model is a simple neural network (or multilayer perceptron). This network has only a single hidden layer and narrows from 31 neurons to 3 (as shown in the diagram below).
The hidden layer uses a ReLU activation, while the output layer uses a softmax activation function to give probabilistic values for the three classes. The class whose output neuron has the highest activation is chosen as the prediction. While the exact accuracy of the neural network is subject to noise because of the random initialization of the network's weights, this network achieved an accuracy of approximately 56.3%, which matches the classification accuracy of the logistic regression model.
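The forward pass described above (ReLU hidden layer, softmax output, highest activation chosen as the class) can be sketched in NumPy as follows. The layer sizes are assumptions: 31 inputs and 3 outputs follow one reading of the description, while the hidden width of 16, the weight scale, and all function names are hypothetical. The weights are untrained, so the output only illustrates the mechanics and the dependence on the random seed.

```python
import numpy as np

def relu(z):
    """Rectified linear unit applied element-wise."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_inputs, n_hidden, n_outputs, seed):
    """Random initial weights; a different seed gives a different starting network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_hidden, n_outputs)),
        "b2": np.zeros(n_outputs),
    }

def forward(X, params):
    """One hidden ReLU layer followed by a softmax output layer."""
    hidden = relu(X @ params["W1"] + params["b1"])
    return softmax(hidden @ params["W2"] + params["b2"])

# Assumed sizes: 31 inputs, 3 outputs; the hidden width of 16 is hypothetical.
params = init_params(n_inputs=31, n_hidden=16, n_outputs=3, seed=0)
X = np.random.default_rng(1).normal(size=(5, 31))  # placeholder feature rows

probs = forward(X, params)               # one probability per class, per row
predicted_class = probs.argmax(axis=1)   # highest activation wins
print(probs.round(3))
print(predicted_class)
```

Calling `init_params` with a different `seed` changes the starting weights, which is the source of the run-to-run noise in accuracy mentioned above.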
IV. Conclusion
Overall, the high accuracy of the logistic regression and neural network models compared to the trivial model confirms that the dataset is indeed useful for predicting the outcome of football matches. Furthermore, the fact that the accuracy of the neural network was comparable to that of the logistic regression model provides hope that a better-structured neural network will be able to outperform non-neural-network models. Moving forward, it seems reasonable to expect that further fine-tuning and cross-validation of the neural network's hyperparameters and architecture may yield a higher prediction accuracy.