
Towards AI-Powered Video Assistant Referee System (VARS) for Association Football


arXiv:2407.12483v2 [cs.CV] 18 Jul 2024

Jan Held1*, Anthony Cioppa1,2*, Silvio Giancola2*, Abdullah Hamdi3, Christel Devue1, Bernard Ghanem2 and Marc Van Droogenbroeck1

1 University of Liège (ULiège), Belgium.
2 King Abdullah University of Science and Technology (KAUST), Saudi Arabia.
3 University of Oxford, United Kingdom.

*Corresponding author(s). E-mail(s): [email protected]; [email protected]; [email protected]

Abstract

Over the past decade, the technology used by referees in football has improved substantially, enhancing the fairness and accuracy of decisions. This progress has culminated in the implementation of the Video Assistant Referee (VAR), an innovation that enables backstage referees to review incidents on the pitch from multiple points of view. However, the VAR is currently limited to professional leagues due to its expensive infrastructure and the lack of referees worldwide. In this paper, we present the Video Assistant Referee System (VARS) that leverages the latest findings in multi-view video analysis. Our VARS achieves a new state-of-the-art on the SoccerNet-MVFoul dataset by recognizing the type of foul in 50% of instances and the appropriate sanction in 46% of cases. Finally, we conducted a comparative study to investigate human performance in classifying fouls and their corresponding severity and compared these findings to our VARS. The results of our study highlight the potential of our VARS to reach human performance and support football refereeing across all levels of professional and amateur federations.

Keywords: Football, Soccer, Artificial Intelligence, Computer Vision, Video Recognition, Automated Decision, Video Assistant Referee, Referee Success Rate, Fouls evaluation

1 Introduction

In recent years, technology has played an increasing role in football, revolutionizing how the game is played, coached, and officiated. This transformation extends into the domain of sports video analysis, which encompasses a diverse range of challenging tasks, including player detection and tracking [1–4], spotting actions in untrimmed videos [5–12], pass feasibility and prediction [13, 14], summarizing [15–18], camera calibration [19], player re-identification in occluded scenarios [20], or dense video captioning for football broadcast commentaries [21, 22]. Solving these tasks has been taken to a higher level thanks to the emergence of deep learning techniques [23–25]. Similar to many other fields in which deep learning has been used, the advancements in sports video understanding heavily rely on the availability of large-scale datasets [26–30]. SoccerNet [31–39] stands among the largest and most comprehensive sports datasets, with extensive annotations for video understanding in football.

In refereeing, the biggest revolution was introduced by the Video Assistant Referee (VAR) in 2016 [40]. The system involves a team of referees located in a video operation room outside the stadium. These referees have access to all available camera views and check all decisions taken by the on-field referee. If the VAR indicates a probable "clear and obvious error" (e.g., when the referee misses a penalty or a red card, gives a yellow card to the wrong player, etc.), it is communicated to the on-field referee, who can then review his decision in the referee review area before taking a final decision. The VAR helps to ensure greater fairness in the game by reducing the impact of incorrect decisions on the outcome of games. Notably, in 8% of the matches, the VAR has a decisive impact on the result of the game [41], and it slightly reduces the unconscious bias of referees towards home teams [42]. On average, away teams now score more goals and receive fewer yellow cards [43]. Controversial referee mistakes like the famous "hand of God" goal by Diego Maradona during the quarter-final match Argentina versus England of the 1986 FIFA World Cup, Josip Šimunić getting three yellow cards in a single game at the 2006 FIFA World Cup, or Thierry Henry's handball preventing the Republic of Ireland from qualifying for the World Cup could have been avoided with the VAR and would have changed football history.

Despite its potential benefits, the use of the VAR technology remains limited to professional leagues. The infrastructure of the VAR is expensive, including multiple cameras to analyze the incident from different angles, video operation rooms in various locations, and VAR officials hired to analyze the footage. Leagues with financial limitations cannot afford the necessary infrastructure to operate the VAR. In addition to the upfront costs of the infrastructure, there is also an ongoing expense associated with using the VAR. The officials who serve as Video Assistant Referees require specialized training [44] and monetary compensation following each game. Given the implementation and operational costs of VAR, its use is currently restricted to professional leagues. A further obstacle is the shortage of referees worldwide. In Germany, there were only 50,241 active referees during the 2020/2021 season, whereas the number of games played each weekend was around 90,000 [45, 46]. The introduction of the VAR requires an additional team of referees per game, which is not feasible for semi-professional or amateur leagues. Finally, each referee interprets the Laws of the Game [47] slightly differently, resulting in different decisions for similar actions. Given that the video assistant referee (VAR) changes from one game to another, inconsistencies may arise, with the VAR making different decisions for similar actions across different matches.

In this paper, we present the "Video Assistant Referee System" (VARS), which could support or extend the current VAR. Our VARS fulfills the same objectives and tasks as the VAR. By analyzing fouls from a single or a multi-camera video feed, it indicates a probable "clear and obvious error" and can communicate this information to the referee, who will then decide whether to initiate a "review". The proposed VARS automatically analyzes potential incidents that can then be shown to the referee in the referee review area. Just like the regular VAR, our VARS serves as a support system for the referee and only alerts him in the case of potential game-changing mistakes, but the final decision remains in the hands of the main referee. The main benefit of our VARS is that it no longer requires additional referees, making it the perfect tool for leagues that do not have enough financial or human resources.

Contributions. We summarize our contributions and novelties as follows: (i) We propose an upgraded version of the VARS presented by Held et al. [35]. We introduce an attention mechanism on the different views and calculate an importance score to allocate more attention to more informative views before aggregating the views. (ii) We present a thorough study on the influence of using multiple views and different types of camera views on the performance of our VARS. (iii) We present a comprehensive human study where we compare the performance of human referees, football players, and our VARS on the tasks of type of foul classification and offense severity classification. Our human study also illustrates the subjectivity of refereeing decisions by examining the inter-rater agreement among referees.

Fig. 1: Architecture of our Video Assistant Referee System. From multi-view video clips input, our system encodes per-view video features (E), aggregates the view features (A), and classifies different properties (C_foul and C_off). The figure is inspired by [35].

2 Methodology

We propose an upgraded version of the Video Assistant Referee System, which adds an advanced pooling technique to combine the information from multiple views, extracting the most relevant information based on our attention mechanism. The architecture is shown in Figure 1. Formally, the VARS takes multiple video clips v = {v_i}_{i=1}^{n} as input. Each video clip shows the same action from n different perspectives. Each clip v_i is fed into a video encoder E to extract a spatio-temporal feature vector f_i of dimension d. All feature vectors f_i are then stored in a matrix f as follows:

    f = [f_1, f_2, ..., f_n]^T .    (1)

An aggregation block A takes f as input and outputs a single multi-view representation R. A multi-head classifier, C_foul and C_off, simultaneously predicts the fine-grained type of foul class and the offense severity class. For each task, the VARS selects the value with the highest confidence from the respective confidence vector as the final prediction, following:

    VARS_t ← argmax C_t^{θ_{C_t}}(R), ∀t ∈ {foul, off},    (2)

where θ_{C_t} corresponds to the parameters of the classification head for task t ∈ {foul, off}. The model is trained by minimizing the unweighted summation of both task losses L_foul and L_off.

Video Encoder E. Based on the work presented in [35], the best performance is obtained with a video encoder that extracts spatial and temporal features. In the following, we use the state-of-the-art video encoder MViT [48, 49] pretrained on Kinetics [50], which incorporates a transformer-based architecture with a multiscale feature representation, allowing it to capture spatial and temporal information from video clips.
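To make the encoder concrete, the sketch below shows one way to obtain such a Kinetics-pretrained MViT backbone and use it as a per-clip feature extractor. The paper does not state which implementation is used, so the torchvision model zoo (mvit_v2_s, available from torchvision 0.14 onward), its Kinetics-400 weights, and the 768-dimensional embedding are assumptions of this illustration, not the authors' exact setup.

import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Load an MViT backbone pretrained on Kinetics-400 and drop its classification head,
# so the network returns a pooled spatio-temporal embedding for each clip.
backbone = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
backbone.head = torch.nn.Identity()
backbone.eval()

# One clip: batch of 1, 3 RGB channels, 16 frames, 224 x 224 pixels.
clip = torch.randn(1, 3, 16, 224, 224)
with torch.no_grad():
    f_i = backbone(clip)      # per-view feature vector f_i (d = 768 for this variant)
print(f_i.shape)              # torch.Size([1, 768])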
Multi-view aggregation block A. The original paper [35] used simple mean or max pooling operations to gather the multi-view information into a unique representation. A major drawback of these pooling approaches is that the combination of the feature vectors is fixed and ignores the relationship between the views. Instead, we propose a new aggregation technique based on an attention mechanism to model such relationships.

Our approach is inspired by the "Integrating Block" presented in [51], where each view is associated with an attention score. However, instead of aggregating multi-view images, we extend the operation to multi-view videos. Technically, we assign an attention score to each view and then calculate the final representation by a weighted sum of the feature vectors. There exist several strategies to assign an attention score to a view. A first naive approach consists of passing each feature vector individually into a learned function. However, this would neglect the relationships between the views and would not provide a relative attention score of the views. A better approach consists of determining the attention score of each view based on its relationships with the other views. To do so, we first take the dot product (denoted by ·) of f multiplied by a matrix W ∈ R^{d×d} of trainable weights and its transpose:

    S = fW · (fW)^T .    (3)

By multiplying the matrix f with its transpose, we compute the dot product between each pair of feature vectors, which measures the similarity between two vectors. The obtained symmetric similarity matrix S is of dimension n × n, where the value at row i and column j corresponds to the similarity score between view i and view j. A higher score indicates a higher similarity between the vectors, while a lower score suggests a lower similarity. Next, we normalize the similarity scores to get a probability-like distribution by passing the matrix S through a ReLU layer and dividing it by the sum of the matrix S, following:

    N = ReLU(S) / ( Σ_{i=1}^{n} Σ_{j=1}^{n} ReLU(S_{i,j}) ) .    (4)

To obtain the attention score for each view, we sum the values in each row of the normalized similarity matrix N. The attention score for a view i represents the sum of its normalized similarity scores with all other views. By summing the values in each row of the normalized similarity matrix N, we aggregate the normalized similarity scores for each view. This aggregation reflects how similar a particular view is to all other views collectively. Consequently, the resulting attention score captures a view's overall relevance within the set of views. The reasoning behind this approach is that if a view is highly similar to many other views, it is considered important because it shares visual content with multiple views. On the other hand, if a view is dissimilar to other views, it might be considered less important since it does not contribute significantly to the collective visual information. Formally, we take the sum per row to obtain the attention score A per view:

    A_j = Σ_{i=1}^{n} N_{i,j} ,    (5)

where A is a vector of size n, whose value j corresponds to the attention score of the view j regarding all other views and itself. The final representation is given by the sum of the extracted feature vectors weighted by their calculated attention scores, following:

    R = Σ_{j=1}^{n} A_j f_j .    (6)

Fig. 2: Architecture of the attention block. "MatMul" represents matrix multiplication, "T" denotes transpose, "Norm" signifies normalization, and "SumRow" indicates the process of summing each row.
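For concreteness, the following minimal PyTorch sketch implements the aggregation block directly from Eqs. (3)-(6). It is written from the equations rather than taken from the released code (see the Declarations for the official repository); the feature dimension and the small constant guarding the division in Eq. (4) are assumptions.

import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Multi-view aggregation block A of Eqs. (3)-(6)."""

    def __init__(self, d: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)          # trainable matrix W in R^{d x d}

    def forward(self, f: torch.Tensor):
        # f: (n, d) matrix stacking the n per-view feature vectors, Eq. (1)
        fw = self.W(f)                                 # fW
        S = fw @ fw.T                                  # Eq. (3): symmetric n x n similarity matrix
        relu_s = torch.relu(S)
        N = relu_s / relu_s.sum().clamp(min=1e-8)      # Eq. (4): probability-like normalization
        A = N.sum(dim=1)                               # Eq. (5): one attention score per view (row sums)
        R = (A.unsqueeze(1) * f).sum(dim=0)            # Eq. (6): attention-weighted sum of the view features
        return R, A

# Example with n = 3 views and d = 768 features per view.
aggregate = AttentionAggregation(d=768)
R, A = aggregate(torch.randn(3, 768))
print(R.shape, A)                                      # torch.Size([768]) and the three attention scores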

Classification heads C. A multi-task classification approach is used to classify simultaneously the type of foul, whether it is an offense or not, and its severity. As both tasks are related, learning them together can lead to improved generalization and a better understanding of each task. The model can leverage the relationships between the two tasks to make better predictions. Each classification head consists of two dense layers and takes as input the aggregated representation. The output is a vector whose dimensions correspond to the number of classes in each of the classification problems.
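A hedged sketch of the two classification heads and of the prediction rule of Eq. (2) is given below. The paper only specifies that each head has two dense layers; the hidden width of 256 units and the ReLU non-linearity are illustrative assumptions.

import torch
import torch.nn as nn

class ClassificationHeads(nn.Module):
    """Two task heads C_foul and C_off, each made of two dense layers on top of R."""

    def __init__(self, d: int = 768, hidden: int = 256,
                 n_foul_classes: int = 8, n_severity_classes: int = 4):
        super().__init__()
        self.c_foul = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, n_foul_classes))
        self.c_off = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_severity_classes))

    def forward(self, R: torch.Tensor):
        return self.c_foul(R), self.c_off(R)

heads = ClassificationHeads()
logits_foul, logits_off = heads(torch.randn(1, 768))
# Eq. (2): the prediction for each task is the class with the highest confidence.
pred_foul = logits_foul.argmax(dim=-1)
pred_off = logits_off.argmax(dim=-1)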
3 Experiments

3.1 Experimental setup

Tasks. We test our VARS on the two classification tasks introduced by the SoccerNet-MVFoul dataset [35]: Fine-grained foul classification, which is the task of classifying a foul into one of 8 fine-grained foul classes (i.e., "Standing tackling", "Tackling", "High leg", "Pushing", "Holding", "Elbowing", "Challenge", and "Dive/Simulation"), and Offence severity classification, which is the task of classifying whether an action is an offence, as well as the severity of the foul, defined by four classes: "No offence", "Offence + No card", "Offence + Yellow card", and "Offence + Red card".

Data. The SoccerNet-MVFoul dataset contains 3,901 actions, each composed of at least two videos: the live action and at least one replay, see Figure 1. The views were manually synchronized by a human and no pre-processing of the video clips is necessary. Our VARS is trained on clips of 16 frames, mostly 8 frames before the foul and 8 after the foul, spanning one second temporally, with a spatial dimension re-scaled to 224 × 224 pixels. This approach was chosen because of the high computational cost associated with using a larger number of frames. Future research could explore whether an increase in frame rate or a larger temporal context enhances performance.
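The clip preparation described above can be sketched as follows. The helper below is hypothetical (it is not the dataset's actual loading code): it simply cuts a 16-frame window around the annotated foul frame and rescales it to 224 × 224 pixels.

import torch
import torch.nn.functional as F

def make_clip(frames: torch.Tensor, foul_idx: int, n_frames: int = 16, size: int = 224) -> torch.Tensor:
    """Cut a 16-frame window around the annotated foul frame and rescale it to 224 x 224."""
    # frames: (T, C, H, W) decoded frames of one camera view
    start = max(0, foul_idx - n_frames // 2)           # roughly 8 frames before the foul ...
    clip = frames[start:start + n_frames].float()      # ... and 8 frames after it
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    return clip.permute(1, 0, 2, 3)                    # (C, T, H, W), the layout expected by the encoder

clip = make_clip(torch.randint(0, 256, (40, 3, 180, 320)), foul_idx=20)
print(clip.shape)                                       # torch.Size([3, 16, 224, 224])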
Training details. The encoder E is pre-trained as detailed in the methodology, and the classifier C is trained from scratch, both being trained in an end-to-end fashion. We use a cross-entropy loss, optimized with Adam on a batch size of 6 samples. The learning rate starts at 5e−5 and is multiplied by 0.3 every 3 steps. To artificially increase the dataset size, we use data augmentation and a random temporal shift to have a flexible number of frames used before and after the foul frame annotation during training. The model begins to overfit after 7 epochs and requires approximately 8 hours of training time on a single NVIDIA V100 GPU.
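The optimization recipe reported above can be summarized in code. The sketch below reuses the hypothetical backbone, aggregate, and heads objects from the earlier sketches, maps the "multiplied by 0.3 every 3 steps" schedule to a StepLR scheduler stepped once per epoch, and processes a single action for brevity (batching over 6 samples and the data augmentation are omitted).

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
model_params = list(backbone.parameters()) + list(aggregate.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adam(model_params, lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.3)  # x0.3 every 3 epochs

def training_step(views, y_foul, y_off):
    # views: list of (3, 16, 224, 224) clips of the same action; y_*: class-index labels of shape (1,)
    f = torch.stack([backbone(v.unsqueeze(0)).squeeze(0) for v in views])   # (n, d) per-view features
    R, _ = aggregate(f)
    logits_foul, logits_off = heads(R.unsqueeze(0))
    loss = criterion(logits_foul, y_foul) + criterion(logits_off, y_off)    # unweighted sum of both task losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# scheduler.step() would be called once per epoch.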
Evaluation metrics. To evaluate the performance of the VARS, SoccerNet-MVFoul uses the classification accuracy, which is the ratio of correctly classified actions regarding the total number of actions. As SoccerNet-MVFoul [35] is unbalanced, the authors also suggest a balanced accuracy, which is defined as follows:

    Balanced Accuracy (BA) = (1/N) Σ_{i=1}^{N} TP_i / P_i ,    (7)

with N being the number of classes, TP_i the number of True Positives, and P_i the number of Positives for class i. To ensure a fair comparison, we use the same training, validation, and test sets as those used in the original paper [35].
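Equation (7) corresponds to the average per-class recall, as in the small sketch below; scikit-learn's balanced_accuracy_score computes the same quantity.

import numpy as np

def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Eq. (7): average over the classes of TP_i / P_i (the per-class recall)."""
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 0])
print(balanced_accuracy(y_true, y_pred))   # (2/3 + 1/1 + 1/2) / 3 ≈ 0.72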
Feat. extr.   Pooling     Type of Foul (Acc. / BA.)   Offence Severity (Acc. / BA.)
ResNet        Mean        0.30 / 0.27                 0.34 / 0.25
ResNet        Max         0.32 / 0.27                 0.32 / 0.24
R(2+1)D       Mean        0.32 / 0.34                 0.34 / 0.30
R(2+1)D       Max         0.34 / 0.33                 0.39 / 0.31
MViT          Mean        0.44 / 0.40                 0.38 / 0.31
MViT          Max         0.45 / 0.39                 0.43 / 0.34
MViT          Attention   0.50 / 0.41                 0.46 / 0.34

Table 1: Multi-task classification. Attention pooling sets a new benchmark on the SoccerNet-MVFoul dataset for all the evaluation metrics and tasks. Type of foul classification accuracy increased by 5% while the balanced accuracy (BA) increased by 1%. We have an increment of 3% for the offense severity classification, while the balanced accuracy stays the same.

3.2 Main results

Table 1 shows the results obtained for the fine-grained foul and the offense severity classification tasks. Compared to the fixed combination of the feature vectors (mean or max pooling), our novel attention mechanism enhances the model's ability to identify and classify the type of foul by 5% and the balanced accuracy by 1%. This demonstrates the effectiveness of combining the feature vectors of the different views based on their importance compared to max or mean pooling. Similarly, the attention mechanism improves the model's performance to determine if an action is a foul and the corresponding severity by 3%, while the balanced accuracy remains the same compared to max pooling. One might argue that the performance increase is based on the supplementary parameters introduced by the attention mechanism. However, the attention mechanism only adds an extra 0.1% of parameters to the model compared to when using max and mean pooling. This suggests that the performance increase derives from the use of the attention mechanism rather than the introduction of additional parameters.
3.3 Detailed analysis

Sensitivity analysis. We first investigate the impact of the training dataset size on the performance of our two classification tasks. Figure 3 shows the evolution of the accuracy regarding different training dataset sizes. For each dataset size, we independently trained and tested the model 10 times to avoid any epistemic uncertainty bias. The tests were all performed on the same test set. As expected, we observe that increasing the dataset size improves the accuracy of our VARS. For the type of foul classification, we notice a significant improvement in accuracy with increasing dataset size, especially at the beginning. However, the accuracy reached a plateau between 40% and 80% of the data. Interestingly, we observed a sudden increase in accuracy when we increased the dataset size from 80% to 100%. This may be attributable to our unbalanced dataset. The dataset contains numerous "Standing tacklings" and "Tacklings", while many of the other labels are underrepresented. Increasing the dataset size from 40% to 80% may not have improved accuracy if the model still struggles to generalize to certain actions due to a limited number of training samples. However, increasing the dataset size to 100% could have provided the model with the additional data necessary to better generalize actions. Moreover, Figure 3 reveals that our VARS is significantly more prone to epistemic uncertainty for smaller datasets, as indicated by the high standard deviation.

Fig. 3: Performance evaluation for different dataset sizes. 100% of the dataset corresponds to 2,319 actions. For each dataset size, we independently trained and tested the model 10 times. The tests were all performed on the same test set. The error bar corresponds to the standard deviation. For 0% of the dataset, we indicate the accuracy obtained by taking a random decision.

In contrast, the offense severity curve in Figure 3 initially shows a sharp increase, but later demonstrates a slower growth. Yet, with each increase in the dataset size, the accuracy improves, which confirms that more data would further improve the performance. The reason for this lies in the significant variability in the visual appearance of an offense with "No card", "Yellow card", or "Red card". For instance, a yellow card can be the outcome of a tackle, or it can be the result of a player holding an opponent's shirt. Although both instances may result in a yellow card, their visual representations differ significantly. To accurately determine whether an action is an offense or not and the corresponding severity, the model needs plenty of examples to learn the underlying distribution.

Qualitative results. Figure 4 shows the prediction of our VARS on two examples with a 3-view setup. In both examples, the VARS correctly determines the type of foul and correctly classifies both actions as a foul with the correct severity. Furthermore, the attention scores offer valuable insights into the contribution of different views or camera angles to the decision-making process of the model. In both cases, the "live action clips" have the lowest attention score, confirming our intuition that they were filmed from too far away to make an accurate decision. Both replays have a similar attention score, as they both offer a lot of information to the model. However, we can see that the most informative view has a slightly higher attention score. The attention score provides insight into which views contribute the most to classifications and helps us better understand how the model processes the visual data. This interpretability is especially important when the VARS is used in practice, as it is essential for fans, players, and referees to understand the reasoning behind decisions and feel confident that the technology is improving the fairness and integrity of the sport. Finally, the attention scores assigned to each view can assist broadcasters in automatically selecting the optimal camera angle for broadcasting purposes. Furthermore, it can support the VAR and help speed up the review process by automatically proposing the most informative camera perspective. This is particularly useful at a professional level, where the VAR can have up to 30 different camera perspectives at their disposal, making finding the optimal camera a challenge on its own. The attention scores would provide valuable information by highlighting the views that are more likely to provide crucial details, to accelerate the decision-making process during the VAR review.

Fig. 4: Qualitative results. VARS prediction on two examples where the attention score of each view is given in percentage. The ground truth is given in bold and the model prediction with the confidence is given in italic.

4 Human study

In contrast to classical classification tasks that involve well-defined and easily separable classes, determining whether an action in football constitutes a foul may be subjective. Despite the definitions and regulations provided by the Laws of the Game [47], the rule book published by the IFAB, regarding when an action in football is considered a foul and its corresponding severity, these guidelines are still open to interpretation, leading to differing opinions about the same action. In practice, many actions fall into this gray area where both interpretations, foul or no foul, could be considered correct. In this study, we first analyze whether and how the performance of our VARS aligns with human performance (i.e., referees and football players) by comparing the accuracy of the type of foul and offense severity classifications between VARS and our human participants. Secondly, we conduct an inter-rater agreement analysis of human decisions to quantify the extent of agreement among our human participants.

Experimental setup. The study involves two distinct groups of participants with different expertise in football: "Players" and "Referees". The first group consisted of 15 male individuals aged 18 or older (with a mean M = 23.06 and a standard deviation SD = 3.49 years), who had been playing football for a minimum of three years (M = 8.71 and SD = 3.32 years). The second group consisted of 15 male individuals aged 18 or older (M = 25.33 and SD = 4.51 years), who are certified football referees and have officiated in at least 200 official games (from 223 to 1150 games). Both groups analyzed 77 actions, each presented with three different camera perspectives simultaneously. The participants could review the clips several times and watch the actions in slow motion or frame-by-frame, without any time restriction. To reduce bias, the actions were shown in a different random order to each participant. For each action, we measured the time taken by the participants to make their decision. This time was measured from the moment the participants started the video until they clicked on the 'Next video' button. For each action, the participants had the same classification task as presented in Section 3.1. Specifically, they had to determine the type of foul, if the action was a foul or not, and the corresponding severity. For each action, we use the annotations from the SoccerNet-MVFoul dataset as ground truth to determine the accuracy for each participant. An important note is that the participants have a clear advantage over our VARS as they view clips lasting 5 seconds, with a frame rate of 25 fps, while our model gets a 1-second clip at 16 fps as input. Finally, let us note that our study was approved by the local university's ethics committee (2223-080/5624). All analyses were performed using the JASP software.

            Type of Foul (Acc. / Conf.)   Offence Severity (Acc. / Conf.)   Time (s)
Players     75% / 3.6                     58% / 3.3                         41.53
Referees    70% / 3.7                     60% / 3.6                         38.01
VARS        60% / -                       51% / -                           0.12

Table 2: Accuracy comparison between referees, players, and our VARS. The survey was performed on a subset of the test set of size 77. The time is given in seconds and represents the average time needed to make a decision. Acc. stands for accuracy and Conf. for confidence. A rating of 5 indicates high confidence, while a rating of 1 indicates low confidence.

4.1 Comparison to human performance

Table 2 shows the average accuracy compared to the ground truth of players, referees, and our VARS, respectively. These results align with similar studies [52–54], where the referees had an overall decision accuracy ranging from 45% to 80%.

In terms of the type of foul categorization, players (M = 0.752, SD = 0.055) were numerically more accurate than referees (M = 0.704, SD = 0.120), but this difference was not statistically significant, as shown by an independent samples Student t-test, t(28) = 1.421, p = 0.166, d = 0.519, 95% CI = [−0.214, 1.243]. Mean confidence levels in these categorizations were comparable between players (M = 3.64, SD = 0.28) and referees (M = 3.71, SD = 0.32), t(28) < 1.
As for determining if an action corresponds to a foul and the corresponding severity, referees were slightly more accurate (M = 0.594, SD = 0.091) than players (M = 0.582, SD = 0.061). However, this difference was not statistically significant, t(28) = −0.401, p = 0.691, d = −0.147, 95% CI = [−0.862, 0.571]. Although the accuracy of players and referees was comparable, referees were more confident in their severity judgments (M = 3.67, SD = 0.36) than players (M = 3.33, SD = 0.39), t(28) = −2.3, p = 0.029, d = −0.839, 95% CI = [−1.581, −0.084]. Referees' higher confidence might be due to their specific experience in assessing fouls and their severity on the field.

Overall, our results suggest that the accuracy of players and referees is comparable. The Bayesian version of the Student t-test provides support for this null hypothesis with Bayes factors BF10 of 0.732 and 0.366 for the type of foul and offense severity task, respectively. There is a possibility that this lack of difference between groups is due to power issues, i.e., the sample size being too small. Replication studies conducted on larger groups would be valuable in revealing potential differences between the two human groups.

As we do not have a standard deviation for the VARS, we conducted two One-Sample t-tests to compare its performance against humans (players and referees were grouped as their accuracy was comparable). For action categorization, humans (M = 0.728, SD = 0.095) were significantly more accurate than our VARS (M = 0.597), t(29) = 7.556, p < .001, d = 1.379, 95% CI = [0.870, 1.876]. Humans were also more accurate (M = 0.588, SD = 0.081) than our VARS (M = 0.508) for offense severity judgments, t(29) = 5.492, p < .001, d = 1.003, 95% CI = [0.556, 1.437]. This difference in performance might be due to differences in training between our VARS and humans. Players and referees have accumulated an extensive amount of experience in football, through officiating, playing, and watching the game for countless hours. In contrast, our VARS has only been trained on an unbalanced training set of 2,916 actions, where some types of labels only occur a few times. For example, there are only 27 fouls with a red card in the training set, making it difficult for the model to precisely learn the difference between a foul with a yellow card and one with a red card. Considering the difficulty of the task and the significant experience disadvantage of our VARS compared to humans, the current results are promising. Further, it is notable that our VARS only requires 120 ms to reach a decision, which is more than 300 times faster than humans. Both referees and players require a similar amount of time to make a decision. On average, players take around 41.53 seconds and referees 38.01 seconds, which is similar to the average time of 46 seconds taken for the VAR to make a decision as reported by López [55].
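The statistical analyses in this section were run in JASP; purely as an illustration, the sketch below reproduces the two kinds of tests with SciPy on placeholder accuracy arrays (the per-participant data are not available here).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
players = rng.normal(0.752, 0.055, size=15)     # per-participant type-of-foul accuracies (placeholders)
referees = rng.normal(0.704, 0.120, size=15)

# Independent-samples Student t-test between the two human groups.
t_ind, p_ind = stats.ttest_ind(players, referees)

# One-sample t-test of the pooled human accuracies against the single VARS accuracy.
humans = np.concatenate([players, referees])
t_one, p_one = stats.ttest_1samp(humans, popmean=0.597)

# Cohen's d for the two-group comparison, using the pooled standard deviation.
pooled_sd = np.sqrt((players.var(ddof=1) + referees.var(ddof=1)) / 2)
cohen_d = (players.mean() - referees.mean()) / pooled_sd
print(t_ind, p_ind, t_one, p_one, cohen_d)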

4.2 Inter-rater agreement

In this subsection, we investigate the reliability and consistency of humans in determining whether an action constitutes a foul and its severity. To assess the level of consensus among humans, we calculated inter-rater agreement in each group for the severity classification task. Since determining if an action is a foul and assessing its severity is the most important task, we only focus on evaluating inter-rater agreement for this aspect. To quantify the inter-rater agreement, we calculated the unweighted average Cohen's kappa, which measures the agreement between multiple individuals. The referees achieved an unweighted average Cohen's kappa of 0.213, indicating weak agreement. Similarly, players' agreement was weak, with a score of 0.223. This suggests limited consistency among both groups in their assessments.

Among our 15 referees, 7 are officiating at a high level (in the highest league of their country). These referees are called "high-level referees" in the following. All other referees are called "referee talents". Table 3 shows the consensus in each subgroup for the offense severity classification task.

Nb. of different decisions    1     2     3     4
High-level referees           16%   56%   28%   0%
Referee talents               2%    60%   38%   0%

Table 3: Similarity analysis of the results for Offense Severity classification. Among high-level referees, 28% of cases result in three different decisions being made for the same action. For referee talents, this percentage even increases to 38%. These results show the significant challenge involved in determining whether an action should be classified as a foul and assessing its corresponding severity.

As can be seen, high-level referees and referee talents reached a consensus between themselves for only 16% and 2% of the actions, respectively. In the majority of cases, multiple decisions were made for the same action, indicating the difficulty in determining whether an action should be classified as a foul and assessing its severity. Particularly among referee talents, 38% of actions resulted in three different decisions (out of four possible decisions to take) for the same action.

Fig. 5: Example of the subjectivity of human choices. Decisions taken by our participants: "No offense", "Offense + No card", and "Offense + Yellow card".

Figure 5 shows an example of an action where all three decisions "No offense", "Offense + No card", and "Offense + Yellow card" were taken among the referees. For certain referees, the fact that the defender plays the ball is considered enough to not award a free-kick in this situation. However, other referees believe that even if the defender plays the ball, he disregards the danger to, or consequences for, an opponent, and award a yellow card. These findings underscore the complexity and subjectivity inherent in refereeing decisions, highlighting the potential for further research to improve consistency and fairness in officiating.
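Assuming that the "unweighted average Cohen's kappa" used above refers to the mean of Cohen's kappa over all pairs of raters (sometimes called Light's kappa), the agreement analysis can be sketched as follows; the ratings used here are random placeholders, not the study data.

from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(ratings: np.ndarray) -> float:
    """Mean of Cohen's kappa over all pairs of raters (Light's kappa)."""
    # ratings: (n_raters, n_actions) matrix of categorical severity decisions
    pairs = combinations(range(ratings.shape[0]), 2)
    kappas = [cohen_kappa_score(ratings[i], ratings[j]) for i, j in pairs]
    return float(np.mean(kappas))

referee_ratings = np.random.randint(0, 4, size=(15, 77))   # placeholder labels for 77 actions
print(average_pairwise_kappa(referee_ratings))              # near 0 for random ratings; 0.213 was observed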
5 Conclusion

Distinguishing between a foul and no foul and determining its severity is a complex and subjective task that relies entirely on the interpretation of the Laws of the Game [47] by each individual. Despite the challenges posed by this complex task and an unbalanced training dataset, our solution demonstrates promising results. While we have not reached human-level performance yet, we believe that VARS holds the potential to assist and support referees across all levels of professionalism in the future.

Acknowledgement. This work was partly supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI). J. Held and A. Cioppa are funded by the F.R.S.-FNRS. The present research benefited from computational resources made available on Lucia, the Tier-1 supercomputer of the Walloon Region, infrastructure funded by the Walloon Region under the grant agreement n°1910247.

6 Declarations

Availability of data and code. The data and code are available at https://github.com/SoccerNet/sn-mvfoul
Conflict of interest. The authors declare no conflict of interest.
Open access.

References

[1] Cioppa, A., Deliège, A., Ul Huda, N., Gade, R., Van Droogenbroeck, M., Moeslund, T.B.: Multimodal and multiview distillation for real-time player detection on a football field. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA, pp. 3846–3855 (2020). https://doi.org/10.1109/CVPRW50498.2020.00448
[2] Maglo, A., Orcesi, A., Pham, Q.-C.: Efficient tracking of team sport players with few game-specific annotations. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3460–3470. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPRW56347.2022.00390

[3] Vandeghen, R., Cioppa, A., Van Droogenbroeck, M.: Semi-supervised training to improve player and ball detection in soccer. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, pp. 3480–3489. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPRW56347.2022.00392

[4] Somers, V., Joos, V., Giancola, S., Cioppa, A., Ghasemzadeh, S.A., Magera, F., Standaert, B., Mansourian, A.M., Zhou, X., Kasaei, S., Ghanem, B., Alahi, A., Van Droogenbroeck, M., De Vleeschouwer, C.: SoccerNet game state reconstruction: End-to-end athlete tracking and identification on a minimap. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)

[5] Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M., Gade, R., Moeslund, T.B.: A context-aware loss function for action spotting in soccer videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 13123–13133. Inst. Electr. Electron. Eng. (IEEE), Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.01314

[6] Giancola, S., Ghanem, B.: Temporally-aware feature pooling for action spotting in soccer broadcasts. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, pp. 4490–4499 (2021). https://doi.org/10.1109/CVPRW53098.2021.00506

[7] Hong, J., Zhang, H., Gharbi, M., Fisher, M., Fatahalian, K.: Spotting temporally precise, fine-grained events in video. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 13695, pp. 33–51. Springer, Tel Aviv, Israël (2022). https://doi.org/10.1007/978-3-031-19833-5_3

[8] Soares, J.V.B., Shah, A., Biswas, T.: Temporally precise action spotting in soccer videos using dense detection anchors. In: IEEE Int. Conf. Image Process. (ICIP), pp. 2796–2800. Inst. Electr. Electron. Eng. (IEEE), Bordeaux, France (2022). https://doi.org/10.1109/ICIP46576.2022.9897256

[9] Soares, J.V.B., Shah, A.: Action spotting using dense detection anchors revisited: Submission to the SoccerNet challenge 2022. arXiv abs/2206.07846 (2022). https://doi.org/10.48550/arXiv.2206.07846

[10] Giancola, S., Cioppa, A., Georgieva, J., Billingham, J., Serner, A., Peek, K., Ghanem, B., Van Droogenbroeck, M.: Towards active learning for action spotting in association football videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5098–5108. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/CVPRW59228.2023.00538

[11] Cabado, B., Cioppa, A., Giancola, S., Villa, A., Guijarro-Berdiñas, B., Padrón, E., Ghanem, B., Van Droogenbroeck, M.: Beyond the Premier: Assessing action spotting transfer capability across diverse domains. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)

[12] Kassab, E.J., Solberg, H.M., Gautam, S., Sabet, S.S., Torjusen, T., Riegler, M., Halvorsen, P., Midoglu, C.: Tacdec. In: Proceedings of the ACM Multimedia Systems Conference (MMSys) 2024 (2024). https://doi.org/10.1145/3625468.3652166

[13] Arbués Sangüesa, A., Martín, A., Fernández, J., Ballester, C., Haro, G.: Using player's body-orientation to model pass feasibility in soccer. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3875–3884. Inst. Electr. Electron. Eng. (IEEE), Seattle, WA, USA (2020). https://doi.org/10.1109/CVPRW50498.2020.00451

[14] Honda, Y., Kawakami, R., Yoshihashi, R., Kato, K., Naemura, T.: Pass receiver prediction in soccer using video and players' trajectories. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3502–3511. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPRW56347.2022.00394

[15] Gautam, S., Midoglu, C., Shafiee Sabet, S., Kshatri, D.B., Halvorsen, P.: Soccer game summarization using audio commentary, metadata, and captions. In: Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos (2022). https://doi.org/10.1145/3552463.3557019

[16] Midoglu, C., Sabet, S.S., Sarkhoosh, M.H., Majidi, M., Gautam, S., Solberg, H.M., Kupka, T., Halvorsen, P.: AI-based sports highlight generation for social media. In: Proceedings of the 3rd Mile-High Video Conference (2024). https://doi.org/10.1145/3638036.3640799

[17] Gautam, S., Midoglu, C., Sabet, S.S., Kshatri, D.B., Halvorsen, P.: Assisting soccer game summarization via audio intensity analysis of game highlights. Unpublished (2022). https://doi.org/10.13140/RG.2.2.34457.70240/1

[18] Midoglu, C., Hicks, S., Thambawita, V., Kupka, T., Halvorsen, P.: MMSys'22 grand challenge on AI-based video production for soccer. In: ACM Multimedia Systems Conference (MMSys), Athlone, Ireland, pp. 1–6 (2022). https://doi.org/

[19] Magera, F., Hoyoux, T., Barnich, O., Van Droogenbroeck, M.: A universal protocol to benchmark camera calibration for sports. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)

[20] Somers, V., De Vleeschouwer, C., Alahi, A.: Body part-based representation learning for occluded person Re-Identification. In: IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), pp. 1613–1623. Inst. Electr. Electron. Eng. (IEEE), Waikoloa, HI, USA (2023). https://doi.org/10.1109/WACV56688.2023.00166

[21] Mkhallati, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-caption: Dense video captioning for soccer broadcasts commentaries. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5074–5085. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/CVPRW59228.2023.00536

[22] Andrews, P., Nordberg, O.E., Zubicueta Portales, S., Borch, N., Guribye, F., Fujita, K., Fjeld, M.: AiCommentator: A multimodal conversational agent for embedded visualization in football viewing. In: Int. Conf. Intell. User Interfaces, pp. 14–34. ACM, Greenville, SC, USA (2024). https://doi.org/10.1145/3640543.3645197

[23] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 945–953. Inst. Electr. Electron. Eng. (IEEE), Santiago, Chile (2015). https://doi.org/10.1109/iccv.2015.114

[24] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv abs/1409.0473 (2014). https://doi.org/10.48550/arXiv.1409.0473

[25] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv abs/1706.03762 (2017). https://doi.org/10.48550/arXiv.1706.03762

[26] Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D., Giannotti, F.: A public data set of spatio-temporal match events in soccer competitions. Sci. Data 6(1), 1–15 (2019). https://doi.org/10.1038/s41597-019-0247-7

[27] Yu, J., Lei, A., Song, Z., Wang, T., Cai, H., Feng, N.: Comprehensive dataset of broadcast soccer videos. In: IEEE Conf. Multimedia Inf. Process. Retr. (MIPR), pp. 418–423. Inst. Electr. Electron. Eng. (IEEE), Miami, FL, USA (2018). https://doi.org/10.1109/MIPR.2018.00090

[28] Scott, A., Uchida, I., Onishi, M., Kameda, Y., Fukui, K., Fujii, K.: SoccerTrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3568–3578. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPRW56347.2022.00401

[29] Jiang, Y., Cui, K., Chen, L., Wang, C., Xu, C.: SoccerDB: A large-scale database for comprehensive video understanding. In: Int. ACM Work. Multimedia Content Anal. Sports (MMSports), pp. 1–8. ACM, Seattle, WA, USA (2020). https://doi.org/10.1145/3422844.3423051

[30] Van Zandycke, G., Somers, V., Istasse, M., Don, C.D., Zambrano, D.: DeepSportradar-v1: Computer vision dataset for sports understanding with high quality annotations. In: Int. ACM Work. Multimedia Content Anal. Sports (MMSports), pp. 1–8. ACM, Lisbon, Port. (2022). https://doi.org/10.1145/3552437.3555699

[31] Giancola, S., Amine, M., Dghaily, T., Ghanem, B.: SoccerNet: A scalable dataset for action spotting in soccer videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 1792–179210. Inst. Electr. Electron. Eng. (IEEE), Salt Lake City, UT, USA (2018). https://doi.org/10.1109/cvprw.2018.00223

[32] Deliège, A., Cioppa, A., Giancola, S., Seikavandi, M.J., Dueholm, J.V., Nasrollahi, K., Ghanem, B., Moeslund, T.B., Van Droogenbroeck, M.: SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Nashville, TN, USA, pp. 4508–4519 (2021). https://doi.org/10.1109/CVPRW53098.2021.00508. http://hdl.handle.net/2268/253781

[33] Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: Scaling up SoccerNet with multi-view spatial localization and re-identification. Sci. Data 9(1), 1–9 (2022). https://doi.org/10.1038/s41597-022-01469-1

[34] Cioppa, A., Giancola, S., Deliege, A., Kang, L., Zhou, X., Cheng, Z., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, pp. 3490–3501. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/CVPRW56347.2022.00393

[35] Held, J., Cioppa, A., Giancola, S., Hamdi, A., Ghanem, B., Van Droogenbroeck, M.: VARS: Video assistant referee system for automated soccer decision making from multiple views. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5086–5097. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/CVPRW59228.2023.00537

[36] Cioppa, A., Giancola, S., Somers, V., Magera, F., Zhou, X., Mkhallati, H., Deliège, A., Held, J., Hinojosa, C., Mansourian, A.M., Miralles, P., Barnich, O., De Vleeschouwer, C., Alahi, A., Ghanem, B., Van Droogenbroeck, M., Kamal, A., Maglo, A., Clapés, A., Abdelaziz, A., Xarles, A., Orcesi, A., Scott, A., Liu, B., Lim, B., Chen, C., Deuser, F., Yan, F., Yu, F., Shitrit, G., Wang, G., Choi, G., Kim, H., Guo, H., Fahrudin, H., Koguchi, H., Ardö, H., Salah, I., Yerushalmy, I., Muhammad, I., Uchida, I., Be'ery, I., Rabarisoa, J., Lee, J., Fu, J., Yin, J., Xu, J., Nang, J., Denize, J., Li, J., Zhang, J., Kim, J., Synowiec, K., Kobayashi, K., Zhang, K., Habel, K., Nakajima, K., Jiao, L., Ma, L., Wang, L., Wang, L., Li, M., Zhou, M., Nasr, M., Abdelwahed, M., Liashuha, M., Falaleev, N., Oswald, N., Jia, Q., Pham, Q.-C., Song, R., Hérault, R., Peng, R., Chen, R., Liu, R., Baikulov, R., Fukushima, R., Escalera, S., Lee, S., Chen, S., Ding, S., Someya, T., Moeslund, T.B., Li, T., Shen, W., Zhang, W., Li, W., Dai, W., Luo, W., Zhao, W., Zhang, W., Yang, X., Ma, Y., Joo, Y., Zeng, Y., Gan, Y., Zhu, Y., Zhong, Y., Ruan, Z., Li, Z., Huangi, Z., Meng, Z.: SoccerNet 2023 challenges results. arXiv abs/2309.06006 (2023). https://doi.org/10.48550/arXiv.2309.06006

[37] Leduc, A., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-Depth: a scalable dataset for monocular depth estimation in sports videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)

[38] Held, J., Itani, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: X-VARS: Introducing explainability in football refereeing with multi-modal large language models. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)

[39] Gautam, S., Sarkhoosh, M.H., Held, J., Midoglu, C., Cioppa, A., Giancola, S., Thambawita, V., Riegler, M.A., Halvorsen, P., Shah, M.: SoccerNet-echoes: A soccer game audio commentary dataset. arXiv abs/2405.07354 (2024). https://doi.org/10.48550/arXiv.2405.07354

[40] Spitz, J., Wagemans, J., Memmert, D., Williams, A.M., Helsen, W.F.: Video assistant referees (VAR): The impact of technology on decision making in association football referees. Journal of Sports Sciences 39(2), 147–153 (2020). https://doi.org/10.1080/02640414.2020.1809163

[41] De Dios Crespo, J.: The Contribution of VARs to Fairness in Sport, pp. 23–35. Routledge, New York City, NY, USA (2021). Chap. 2. https://doi.org/10.4324/9780429455551-2

[42] Holder, U., Ehrmann, T., König, A.: Monitoring experts: insights from the introduction of video assistant referee (VAR) in elite football. Journal of Business Economics 92(2), 285–308 (2021). https://doi.org/10.1007/s11573-021-01058-5

[43] Dufner, A.-L., Schütz, L.-M., Hill, Y.: The introduction of the video assistant referee supports the fairness of the game - an analysis of the home advantage in the German Bundesliga. Psychology of Sport and Exercise 66, 1–5 (2023). https://doi.org/10.1016/j.psychsport.2023.102386

[44] Armenteros, M., Webb, T.: Educating International Football Referees: The Importance of Uniformity, pp. 301–327. Routledge, New York City, NY, USA (2021). Chap. 16. https://doi.org/10.4324/9780429455551-16

[45] Deutscher Fußball-Bund (DFB): Anzahl aktiver Schiedsrichter/-innen bis 2022. https://www.dfb.de/verbandsstruktur/dfb-zentrale/ (2022)

[46] Zeppenfeld, B.: Anzahl aktiver Schiedsrichter / Schiedsrichterinnen des Deutschen Fußball Bundes (DFB) von 2018/2019 bis 2022/2023. https://de.statista.com/statistik/daten/studie/1243626/umfrage/dfb-anzahl-aktiver-schiedsrichter/ (2023)

[47] IFAB: Laws of the Game. Technical report, The International Football Association Board, Zurich, Switzerland (2022)

[48] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 6804–6815. Inst. Electr. Electron. Eng. (IEEE), Montréal, Can. (2021). https://doi.org/10.1109/iccv48922.2021.00675

[49] Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: MViTv2: Improved multiscale vision transformers for classification and detection. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4794–4804. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvpr52688.2022.00476

[50] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The Kinetics human action video dataset. arXiv abs/1705.06950 (2017). https://doi.org/10.48550/arXiv.1705.06950

[51] Yang, Z., Wang, L.: Learning relationships for multi-view 3D object recognition. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 7504–7513. Inst. Electr. Electron. Eng. (IEEE), Seoul, South Korea (2019). https://doi.org/10.1109/iccv.2019.00760

[52] MacMahon, C., Helsen, W.F., Starkes, J.L., Weston, M.: Decision-making skills and deliberate practice in elite association football referees. Journal of Sports Sciences 25(1), 65–78 (2007). https://doi.org/10.1080/02640410600718640

[53] Spitz, J., Put, K., Wagemans, J., Williams, A.M., Helsen, W.F.: Visual search behaviors of association football referees during assessment of foul play situations. Cognitive Research: Principles and Implications 1(1) (2016). https://doi.org/10.1186/s41235-016-0013-8

[54] Pizzera, A., Marrable, J., Raab, M.: The video review system in association football: implementation and effectiveness for match officials and referee education. Managing Sport and Leisure, 1–17 (2022). https://doi.org/10.1080/23750472.2022.2147856

[55] López, A.M.: Average time needed for a video assistant referee (VAR) intervention in Brazil in 2019 and 2020. https://www.statista.com/statistics/1010093/average-time-video-assistant-referee-checking-brazil/ (2023)
