Data Science with Football Made Easy
The 15-Algorithm Playbook for Beginners
Naïve Bayes
Unpacking Premier League Scoring Chances with Bukayo Saka
@MartinOnData
version 1.1 [26-04-2024]
Copyright © 2024 Antoine Martin
All rights reserved.
Naïve Bayes is a simple yet remarkably powerful algorithm for predictive modelling, particularly
well-suited for classification tasks. It is based on Bayes' Theorem, a foundational principle in
probability theory that describes the probability of an event, based on prior knowledge of
conditions related to the event. The "naïve" aspect comes from the algorithm's assumption that
the features it uses to make predictions are independent of each other, a simplification that might
not always hold true in real-world data but often works surprisingly well in practice.
In this short guide, you will learn everything you need to get started with Naïve Bayes using Bukayo
Saka’s Premier League shots. We begin by unpacking the formal definition of Naïve Bayes
and the underlying Bayes’ Theorem (section 1), after which we translate this into practical terms:
how likely is Bukayo Saka to score a goal from the penalty box with his left foot (section 2)? We
proceed with a step-by-step guide on how to compute this probability from the ground up in R,
using real-world data and without installing any software on our computer (section 3). We conclude
with a discussion of the general applications of Naïve Bayes and potential future exercises
involving similar shooting data (section 4). Let’s dive in.
1. Formal Definition
At its core, Naïve Bayes leverages Bayes' Theorem to calculate the probability that a given data
point belongs to a certain class, given the data point's features. Bayes' Theorem is expressed as
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
where:
- 𝑃(𝐴|𝐵) is the probability of event A happening given that B is true (posterior probability).
- 𝑃(𝐵|𝐴) is the probability of observing event B given that A is true (likelihood).
- 𝑃(𝐴) is the probability of observing event A (prior probability).
- 𝑃(𝐵) is the probability of observing event B (evidence).
Now let's repeat this definition, only this time using Bukayo Saka's probability of scoring a goal,
given that he shoots with his left foot from the edge of the penalty box (a distance of 16.5 meters
from the goal).
2. Practical Definition
Here is how Bayes' Theorem is applied to calculate the probability of an event (Bukayo Saka
scoring a goal) based on specific conditions (shot taken with his left foot from 16.5 meters):
$$P(\text{Goal} \mid \text{Shot Features}) = \frac{P(\text{Shot Features} \mid \text{Goal})\, P(\text{Goal})}{P(\text{Shot Features})}$$
where:
- 𝑃(𝐺𝑜𝑎𝑙| 𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the probability of scoring given the shot's characteristics.
- 𝑃(𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠| 𝐺𝑜𝑎𝑙) is the likelihood of observing these specific shot features (left
foot, 16.5 meters) when a goal is scored.
- 𝑃(𝐺𝑜𝑎𝑙) is the prior probability of scoring a goal, before considering the shot's specifics.
- 𝑃(𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the probability of observing these specific shot characteristics,
irrespective of the scoring outcome.
Feature Independence: Note that Naïve Bayes simplifies the probability calculation by assuming
that the shot features (distance to goal and foot used) are independent of each other once the
outcome (goal or no goal) is known. This lets us estimate how each factor (shooting from 16.5
meters, shooting with the left foot) relates to Saka's likelihood of scoring separately and then
multiply the results, without modelling how the two interact. Although simplistic, the assumption
keeps the computation efficient and often yields robust results.
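Written out, the independence assumption looks as follows. This is only a sketch of the standard formulation; the symbols d (distance) and f (foot) are shorthand introduced here and do not appear elsewhere in the guide.

```latex
% Naive assumption: given the outcome, the features factorize
P(d, f \mid \text{Goal}) = P(d \mid \text{Goal}) \, P(f \mid \text{Goal})

% Plugging this into Bayes' Theorem
P(\text{Goal} \mid d, f) =
  \frac{P(\text{Goal}) \, P(d \mid \text{Goal}) \, P(f \mid \text{Goal})}{P(d, f)}

% The denominator sums the same product over both outcomes
P(d, f) = \sum_{c \,\in\, \{\text{Goal},\, \text{No Goal}\}}
  P(c) \, P(d \mid c) \, P(f \mid c)
```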
Practical Calculation for Saka’s Shot: Imagine historical data shows:
- 𝑃(𝐺𝑜𝑎𝑙): The overall probability of scoring a goal. Let's say Saka scores on 15% of his
shots.
- 𝑃(𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠| 𝐺𝑜𝑎𝑙): The likelihood of the shot's specifics given that a goal is scored.
For our example, suppose 40% of Saka's goals come from shots 16.5 meters out
(𝑃(𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒 < 16.5 𝑚| 𝐺𝑜𝑎𝑙)) and 70% of Saka's goals are scored with the left foot
(𝑃(𝐹𝑜𝑜𝑡 = 𝐿𝑒𝑓𝑡| 𝐺𝑜𝑎𝑙)). Under the assumption of feature independence, we obtain
𝑃(𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠| 𝐺𝑜𝑎𝑙) by multiplying these two individual likelihoods:
0.40 × 0.70 = 0.28 or 28%. This is the combined likelihood of the specific shot features
(16.5 m distance with the left foot) given that a goal is scored.
- 𝑃(𝑆ℎ𝑜𝑡 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑠): The probability of observing these shot features (left foot and 16.5
meters distance) in the general context of all shots taken. This factor is crucial as it helps
normalize our posterior probability, ensuring we are accounting for how common these
features are among all shots, not just the successful ones. Suppose 20% of all shots are from
16.5 meters out with the left foot.
So, based on the Naïve Bayes calculation with the given data, Saka has a 21% probability of scoring
a goal when he takes a shot with his left foot from 16.5 meters out.
$$P(\text{Goal} \mid \text{Shot Features}) = \frac{0.28 \times 0.15}{0.2} = \frac{0.042}{0.2} = 0.21$$
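The same arithmetic can be checked in a few lines of base R. This is a minimal sketch using only the illustrative figures above; the variable names are made up for this example.

```r
# Illustrative figures from the text (not real data)
p_goal                <- 0.15  # prior: share of shots that end in a goal
p_distance_given_goal <- 0.40  # likelihood of the distance feature, given a goal
p_left_given_goal     <- 0.70  # likelihood of a left-footed shot, given a goal
p_features            <- 0.20  # evidence: share of all shots with these features

# Naive independence: multiply the per-feature likelihoods
p_features_given_goal <- p_distance_given_goal * p_left_given_goal

# Bayes' Theorem: likelihood x prior / evidence
p_goal_given_features <- p_features_given_goal * p_goal / p_features

print(round(p_features_given_goal, 2))  # 0.28
print(round(p_goal_given_features, 2))  # 0.21
```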
Fantastic, now that you have grasped the concept and mechanics of Naïve Bayes, let's explore how
to compute the actual probability using real-world data in R (the figures used previously were for
illustrative purposes only).
3. Naïve Bayes with R
Before we start programming, here are a couple of important notes.
The complete version of the code is easily downloadable here.
To make the code accessible for individuals who do not have or prefer not to install
RStudio, all programming has been conducted using Google Colab. Google Colab is a free,
cloud-based platform that enables you to write and execute R (and Python) code directly
from your browser without any prior software installation. It is an ideal platform for
beginners owing to its simplicity and ease of use. For instructions on how to quickly get
started using it, please refer to Section 5 at the very end of this guide.
Once you have set up your account, we can proceed with some programming. We begin by
importing the tidyverse library for data wrangling, the worldfootballR package for football data
scraping, and e1071 for our Naïve Bayes algorithm functions. Keep in mind that two of these
packages must be installed before they can be loaded, as they are not included by default on the
Google Colab platform.
Feel free to comment out the install.packages lines (#) after installation (as illustrated below) if
you plan to rerun the entire code. This will prevent multiple installations and save you
approximately 19 seconds each time you run this part.
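The original setup cell is not reproduced in this text, so here is a minimal sketch of what it might look like. It assumes that worldfootballR and e1071 are the two packages missing from Colab's R runtime and that both install from CRAN; if the CRAN install of worldfootballR fails, the package's GitHub repository is the usual fallback.

```r
# First run only: install the two packages assumed to be missing on Colab.
# Put a # in front of these two lines once they have been installed.
install.packages("worldfootballR")
install.packages("e1071")

# Load the libraries used throughout the guide
library(tidyverse)       # data wrangling
library(worldfootballR)  # football data scraping
library(e1071)           # Naive Bayes functions
```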
Before we conduct the analysis, we first need to collect our data. For the purposes of the analysis
in this guide, we will use Fbref.com, which hosts (among tons of other useful information) all
shots taken by each Premier League team. In the example below (the 2022/23 game between
Crystal Palace and Arsenal), we can easily observe that Saka took a total of three shots in the game
– all with his right foot – two of which were off target and one was blocked. In addition to the
body part with which the shot was taken and its outcome, we can also easily get the distance to
the goal in meters. So how do we get this kind of data for all Arsenal games? That is where
worldfootballR comes into play.
Source: https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League
Jason Zivkovic’s worldfootballR package facilitates data collection from FBref by providing
clean wrappers for scraping various types of data from the FBref website. For a comprehensive
overview of the package’s capabilities, you may refer to the documentation here. To gather all of
Bukayo Saka’s shots, it is necessary to download data on all shots taken by Arsenal players. To
accomplish this, we download all links to Arsenal games using the following code snippet. It
utilizes the fb_match_urls function by specifying country = "ENG" for England, gender = "M"
for the male championship, season_end_year = 2023 for the 2022/23 season, and tier = "1st" for
the English Premier League. The second line filters all the links to include only Arsenal games.
Finally, we review the first few observations and the total number of fb_match_urls_arsenal
entries to ensure we have captured all 38 games of Arsenal's 2022/23 season. This code executes
in 4 seconds.
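The snippet described above might look roughly like the sketch below. The fb_match_urls arguments and the fb_match_urls_arsenal name come from the text; the two helper functions are hypothetical and exist only so that nothing is scraped until you call them. The filter assumes that each match URL contains the team name, as in the Crystal Palace example.

```r
# Keep only the match URLs that mention a given team (base R, no scraping)
filter_team_urls <- function(urls, team = "Arsenal") {
  urls[grepl(team, urls, fixed = TRUE)]
}

# Download all 2022/23 Premier League match URLs, then keep Arsenal's games.
# Requires the worldfootballR package and an internet connection.
get_arsenal_match_urls <- function() {
  all_urls <- worldfootballR::fb_match_urls(
    country = "ENG",
    gender = "M",
    season_end_year = 2023,
    tier = "1st"
  )
  filter_team_urls(all_urls, "Arsenal")
}

# In the notebook:
# fb_match_urls_arsenal <- get_arsenal_match_urls()
# head(fb_match_urls_arsenal)
# length(fb_match_urls_arsenal)  # the text expects 38 games
```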
Once we have our links to all of Arsenal’s 2022/23 games, we loop over each one and extract
all the shots taken by the two teams using the fb_match_shooting function. This
code executes in about 2 minutes.
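A sketch of that loop is shown below. fb_match_shooting and df_raw are named in the text; the collect_shots helper and its fetch argument are hypothetical, added so that a single failed page does not abort the whole run and so that the loop can be tried out without touching the network.

```r
# Fetch the shot table for every match URL and stack the results.
# `fetch` defaults to the scraper from worldfootballR, but any function that
# takes one URL and returns a data frame will do.
collect_shots <- function(urls, fetch = worldfootballR::fb_match_shooting) {
  results <- lapply(urls, function(u) {
    # A page that fails to download yields NULL instead of stopping the loop
    tryCatch(fetch(u), error = function(e) NULL)
  })
  results <- Filter(Negate(is.null), results)  # drop the failed pages
  do.call(rbind, results)                      # one row per shot
}

# In the notebook:
# df_raw <- collect_shots(fb_match_urls_arsenal)
```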
Next, we refine our df_raw data to retain only the shooting data from Arsenal players. We rename
the columns to lowercase, convert numeric variables to numeric types (if this hasn't been done
already), and finally, select only a few variables of interest. These include the date of the game, the
squad, the player name, the shot’s distance to the goal, the body part used for the shot, and the
outcome.
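The cleaning step could be sketched as follows. The rows below are made up, and the raw column names (Date, Squad, Player, Distance, Body Part, Outcome) are an assumption about what the scraper returns, so check names(df_raw) in your own notebook before relying on them.

```r
library(dplyr)

# Made-up stand-in for df_raw, only to show the shape of the transformation
df_raw <- data.frame(
  Date = c("2022-08-05", "2022-08-05", "2022-08-05"),
  Squad = c("Arsenal", "Arsenal", "Crystal Palace"),
  Player = c("Bukayo Saka", "Gabriel Martinelli", "Wilfried Zaha"),
  xG = c("0.05", "0.40", "0.10"),
  Outcome = c("Off Target", "Goal", "Saved"),
  Distance = c("18", "6", "14"),
  `Body Part` = c("Right Foot", "Head", "Right Foot"),
  check.names = FALSE
)

df <- df_raw %>%
  rename_with(~ gsub(" ", "_", tolower(.x))) %>%   # "Body Part" becomes "body_part"
  filter(squad == "Arsenal") %>%                   # keep Arsenal shots only
  mutate(distance = as.numeric(distance)) %>%      # text to number, units as reported
  select(date, squad, player, distance, body_part, outcome)

head(df)
```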
Below is what the head of the data should look like.
Next, we prepare the data for our analysis by focusing on shots taken by Bukayo Saka using either
his left or right foot. We also create a dummy variable to indicate whether each shot resulted in a
goal. Additionally, we convert the body_part variable into a factor, which is necessary for running
our Naïve Bayes function shortly.
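A sketch of this preparation step, again on made-up rows. The labels "Left Foot", "Right Foot" and "Goal" are assumptions about how the source data codes body part and outcome; df_analysis is the name used later in the guide.

```r
library(dplyr)

# Made-up rows in the shape of the cleaned shooting data
df <- data.frame(
  player = c("Bukayo Saka", "Bukayo Saka", "Bukayo Saka", "Bukayo Saka", "Martin Odegaard"),
  distance = c(18, 11, 9, 7, 20),
  body_part = c("Right Foot", "Left Foot", "Left Foot", "Head", "Left Foot"),
  outcome = c("Off Target", "Goal", "Saved", "Goal", "Blocked")
)

df_analysis <- df %>%
  # Saka's shots only, and only those taken with a foot
  filter(player == "Bukayo Saka", body_part %in% c("Left Foot", "Right Foot")) %>%
  mutate(
    goal = if_else(outcome == "Goal", 1, 0),  # dummy variable: 1 = goal, 0 = no goal
    body_part = factor(body_part)             # categorical feature for the model
  )

head(df_analysis)
```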
Below is what the head of the data should look like.
The next code snippet fits the naïve bayes model using the naiveBayes function from the e1071
package.
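The model fit might look like the sketch below. It uses ten made-up shots so that it runs on its own, which means its printed numbers will not match the ones interpreted in this guide. Storing goal as a factor is a choice made here to keep the two classes explicit.

```r
library(e1071)

# Ten made-up shots: outcome, distance to goal, and foot used
df_analysis <- data.frame(
  goal = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)),
  distance = c(22, 18, 16, 25, 14, 19, 8, 11, 13, 12),
  body_part = factor(c("Left Foot", "Right Foot", "Left Foot", "Left Foot", "Right Foot",
                       "Left Foot", "Left Foot", "Left Foot", "Right Foot", "Left Foot"))
)

# Fit Naive Bayes: goal explained by distance and body part
model <- naiveBayes(goal ~ distance + body_part, data = df_analysis)

# Prints the a-priori probabilities and one conditional table per feature
print(model)
```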
Here is how to interpret the model’s results.
A-priori Probabilities: These are the model's initial guesses or "prior" probabilities about the
likelihood of each outcome (goal = 1, no goal = 0) before considering the specific features
(distance and body part) of the shots.
- 0 (No Goal): 0.8554217 chance. This means, based on the data provided to the model, about
85.54% of Bukayo Saka's shots did not result in a goal.
- 1 (Goal): 0.1445783 chance. Conversely, about 14.46% of his shots resulted in a goal.
These percentages reflect the overall distribution of goals and no-goals in the dataset used to train
the model.
Conditional Probabilities for `distance`: These show how the model views the relationship
between the distance of a shot and the outcome (goal or no goal). Because distance is numeric,
the model summarizes it for each outcome with a mean (average distance) and a standard deviation
(variation in distance), treating it as normally distributed within each outcome.
- For No Goal (0): The average distance is about 16.77 meters, with a standard deviation of
5.58 meters. This means most shots that did not result in a goal were taken from around this
distance, give or take about 5.58 meters.
- For Goal (1): The average distance for goals is shorter, about 12.08 meters, with a standard
deviation of 5.52 meters. This suggests goals tend to come from closer shots, with similar
variability in distance as no-goal shots.
Conditional Probabilities for `body_part`: These indicate the likelihood of using a specific body
part for shots, given the outcome.
- Left Foot and No Goal: About 72% of Saka's no-goal shots were taken with the left foot.
- Right Foot and No Goal: About 28% of his no-goal shots used the right foot.
- Left Foot and Goal: For shots resulting in goals, approximately 75% were taken with the
left foot.
- Right Foot and Goal: Around 25% of goal shots used the right foot.
These proportions show a slight preference for the left foot in goal-scoring shots compared to
shots that did not result in goals.
Overall, the Naïve Bayes model tells us two main things about Bukayo Saka's shots:
1. General Chances: Before looking at where or how he shoots, he is more likely not to score
(85.54%) than to score (14.46%) based on past data.
2. Impact of Distance and Body Part:
a. Distance: Goals are generally scored from closer (about 12.08 meters on average),
while missed shots tend to come from a bit further away (about 16.77 meters on
average).
b. Body Part: Whether Saka scores or not, he predominantly uses his left foot, but
the model suggests a slightly higher proportion of left-footed shots among
successful goals than unsuccessful attempts.
And finally, here are the final calculations.
For a shot taken with the left foot from a distance of 16.5 meters, the model predicts there is an
approximately 11.5% probability that it will result in a goal (prob_goal). For a shot taken with the
right foot from the same distance, the probability of scoring is correspondingly lower, at
approximately 9.9%.
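To see where these figures come from, the calculation can be redone by hand from the rounded values reported in the model output above. This is a sketch of the Naïve Bayes arithmetic rather than the guide's own code: the helper posterior_goal is hypothetical, and because the inputs are rounded, the results land close to, but not exactly on, the quoted probabilities. With the fitted model at hand, the equivalent step would be a call to predict on new data with type = "raw".

```r
# Rounded values copied from the model output discussed above
prior   <- c(no_goal = 0.8554217, goal = 0.1445783)  # a-priori probabilities
d_mean  <- c(no_goal = 16.77, goal = 12.08)          # mean distance per outcome
d_sd    <- c(no_goal = 5.58,  goal = 5.52)           # standard deviation per outcome
p_left  <- c(no_goal = 0.72,  goal = 0.75)           # share of left-footed shots per outcome
p_right <- 1 - p_left                                # share of right-footed shots per outcome

# Posterior probability of a goal for one shot
posterior_goal <- function(distance, p_foot) {
  # For each outcome: prior x normal density of the distance x share of this foot
  joint <- prior * dnorm(distance, mean = d_mean, sd = d_sd) * p_foot
  # Normalize so that the two outcomes sum to 1
  unname(joint["goal"] / sum(joint))
}

prob_goal_left  <- posterior_goal(16.5, p_left)
prob_goal_right <- posterior_goal(16.5, p_right)

print(round(c(left = prob_goal_left, right = prob_goal_right), 3))
```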
4. Conclusion
This tutorial demonstrated how to build a Naïve Bayes model to predict the likelihood of scoring
a goal based on the distance to the goal and the foot used to hit the ball. It is clear that we cannot
cover all aspects of Naïve Bayes in just ten pages. However, this provides a solid introduction to
a simple yet powerful model. If you are interested in further exploration of our football illustration,
you can rerun the analysis with different players, teams, leagues, or seasons. All these modifications
are easily achievable using the worldfootballR package with just a few adjustments in the function
specification parameters. Why not try rerunning the analysis using Jude Bellingham’s data from
the 2023/24 season? To do this, modify the fb_match_urls parameters by setting the country to
“ESP” (instead of “ENG”) and the season_end_year to 2024. Next, filter the URLs to include
only those for “Real-Madrid” (instead of Arsenal). When preparing the df data frame, ensure the
squad is filtered to “Real Madrid”. Lastly, for df_analysis, retain only the data pertaining to
Jude Bellingham and you will be ready to go.
Overall, despite its simplicity, Naïve Bayes can be incredibly effective in applications such as spam
detection and document classification. Additionally, it finds utility in financial modelling for credit
scoring and fraud detection, as well as in preliminary image classification tasks like facial
recognition. Despite the simplistic assumption that input features are independent—an
assumption often not held in real-world data—Naïve Bayes can deliver surprisingly accurate
results, particularly when the data is pre-processed appropriately. Its computational efficiency and
the minimal data requirement for reasonable predictions make it an invaluable tool in the data
scientist's toolkit, especially suitable for scenarios requiring quick decision-making and for use as
a baseline in complex predictive modelling.
5. Getting started with Google Colab for R programming
Google Colab is a free, cloud-based service that allows you to write and execute R (and Python)
code through your browser without any setup required. It is an ideal platform for beginners due
to its simplicity and accessibility. While experienced R programmers might use dedicated interfaces
like RStudio, Google Colab offers a straightforward alternative that is perfect for those just starting
their programming journey. Here is how to get set up:
Opening a New R Notebook in Google Colab
1. Navigate to Google Colab: Go to https://colab.research.google.com and sign in with
your Google account.
2. Create a New Notebook: Once you are on the Colab dashboard, click on `File` in the
top menu, then select `New notebook`. By default, this will create a Python notebook.
3. Switch to R: To change the notebook to R, click on the small arrow next to the `Connect`
button in the top-right corner. Then click on ‘Change runtime type’.
Then select `R` from the dropdown menu under `Runtime type`, and save. The notebook will
refresh, and you can now start writing R code.
4. Write Your First R Code: Start with something simple to ensure everything is set up
correctly. For example, type `print("Hello, R in Colab!")` and press `Shift + Enter` to run
the cell. You should see the output directly below the code.
To add a new piece of code, you can either continue writing in the same cell or create a new one
by clicking on the ‘+ Code’ button in the upper-left corner.