Action Recognition based Industrial Safety Violation Detection
ABSTRACT
Proper use of personal protective equipment (PPE) can save the lives of industry workers, and PPE monitoring is a widely used application of computer vision in large manufacturing industries. However, most of the deployed applications generate a large number of false alarms (violations) because they tend to generalize PPE requirements across the industry and across tasks. The key to resolving this issue is to understand the action being performed by the worker and customize the inference for the specific PPE requirements of that action. In this paper, we propose a system that employs activity recognition models to first understand the action being performed and then uses object detection techniques to check for violations. This leads to a 23% improvement in the F1-score compared to the PPE-based approach on our test dataset of 109 videos.

Figure 1: Sample Data with Multi-actor, Multi-action Industrial Scenario. (a) Real Time Surveillance Feed (b) Multi Actions - Multi People (c) Welding with Occlusion (d) Multi Person Walking

CCS CONCEPTS
• Computing methodologies → Activity recognition and understanding; Object detection.
KEYWORDS
Action Recognition, PPE Detection, Object Detection

ACM Reference Format:
Surya N Reddy, Vaibhav Kurrey, Mayank Nagar, and Gagan Raj Gupta. 2024. Action Recognition based Industrial Safety Violation Detection. In Proceedings of ACM Conference (CODS-COMAD 2024). ACM, New York, NY, USA, 10 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CODS-COMAD 2024, December 18-21, 2024, IIT-Jodhpur, India
© 2024 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Accidents in construction sites and industrial environments can turn fatal for workers if they do not wear proper Personal Protective Equipment (PPE). The right usage of PPE not only saves lives but also reduces the severity of injuries. Despite regulatory requirements and safety protocols, ensuring compliance with PPE requirements remains challenging for employers and safety officers. For large industries, monitoring all activities and employees is manpower-intensive and time-consuming. Video analytics-based solutions can ensure better compliance at low cost and enable the automatic detection of violations in the workplace.
During the root cause analysis of accidents, the common questions asked are: i) Was the worker wearing the PPE designated for that activity? ii) Was the worker following any unsafe practice or violating any laid-down procedure during the activity? iii) Was the worker working in an unsafe environment? When multiple activities are performed simultaneously by multiple people inside a large manufacturing complex (shop floor), detection of activity-specific PPE for each person is difficult. For example, a person walking inside the shop floor might need only a safety helmet and safety shoes, whereas a person handling materials or sharp objects also needs safety gloves. Similarly, a person doing welding needs all of the above PPE along with safety glasses. These variations in the PPE requirements (see Table 1) make automated violation detection difficult. Our goal in this paper is to address questions (i) and (ii) by building a system that can understand the activity being performed by a worker from industrial surveillance camera videos and check for any violation of the PPE designated for those activities.
Understanding human actions within the shop floor environment is crucial for developing an effective violation detection system. Sometimes, a worker may wear the necessary PPE but still engage in unsafe workflows, posing risks not only to themselves but also to surrounding workers. While action recognition and classification are well-established tasks in computer vision, there are only limited published works, such as InHARD [8], HRI30 [19], and LAMIS [41], that focus on industrial actions, either in terms of model architectures or datasets. Even then, these datasets often lack the realism of an actual shop floor environment, typically featuring one person per video or focusing on a single action, failing to capture the complexities of real-world conditions (see Figure 1).
Most benchmark action recognition datasets are sourced from the internet or controlled laboratory settings, predominantly featuring sports-related or household actions performed by a single actor. In real-world industrial settings, however, multiple individuals often need to be monitored simultaneously by a single camera. The existing datasets usually consist of well-curated, high-quality videos, which do not fully capture the dynamic and chaotic nature of real-world industrial environments. Therefore, there is a pressing need for a comprehensive dataset that authentically represents industrial actions. This paper proposes the creation of such a dataset, sourced from surveillance and process monitoring cameras within a large-scale manufacturing complex.
Integrating Human Action Recognition (HAR) models with traditional object detection systems creates a robust solution for detecting PPE violations. These models can identify specific tasks and check for compliance with PPE requirements, thereby reducing computational costs and minimizing false alarms. For effective PPE detection in real-world conditions, the dataset must be diverse, covering a wide range of tasks and PPE types. In typical industrial settings, where dedicated high-quality cameras are uncommon, the models must adapt to process monitoring or surveillance feeds, which may have poor lighting, blurry images, occlusions, and multiple individuals. These feeds often capture various activities happening simultaneously, making accurate violation detection challenging. Figure 1 shows examples of images from real industrial settings.
In this study, we train a SlowFast network [14] (a state-of-the-art model) for the task of video action recognition and a YOLOv9 model for PPE detection to detect industrial safety violations at the clip level. The action recognition-based PPE detection approach is compared with traditional PPE-based approaches. We also present a human study to compare the performance of our approach.
To summarize, our contributions are as follows:
• We propose a novel dataset for understanding human actions in industrial settings.
• We propose a novel approach for detecting task-specific PPE requirements using action recognition and object detection models, which can catch violations comparably to humans and much better than PPE-based approaches.

2 RELATED WORKS
Related works in this area can be broadly classified into three areas: video-based action recognition, publicly available datasets for HAR in an industrial context, and PPE detection in industrial areas.

2.1 Action Recognition
Action recognition is an area in computer vision that involves identifying and categorizing human actions in video sequences. Unlike static image classification, human action recognition must account for the temporal dynamics and sequential nature of actions, which significantly increases the complexity of the task [65]. Historically, frame-based action recognition has typically involved two key steps: action representation [29, 42, 52, 63] and action classification [26, 36, 53]. Recent works have merged both of these into an end-to-end learning framework, significantly improving action classification performance.
To leverage information from all frames and model the inter-frame correlation, Tran et al. [58] proposed 3DCNN to learn features in both the spatial and temporal domains, but with high computational costs. Carreira and Zisserman [6] introduced I3D, which builds upon existing image classification architectures, making training easier. Feichtenhofer et al. [14] proposed an efficient network, SlowFast, with both slow and fast pathways that can adapt to different scenarios by adjusting channel capacities, greatly enhancing overall efficiency. Additionally, various 3DCNN variants [13, 59, 73] have been proposed, further improving recognition efficiency and reducing the limitations of the initial architecture. ViT [11] and self-attention mechanisms [10, 60] have been adapted to action recognition tasks [3, 38] and have been shown to achieve good performance. Spiking neural networks (SNNs) have also been used for action recognition. However, due to the non-differentiability of discrete pulse signals, training SNNs poses challenges. Several effective training methods have been proposed to address this challenge [30, 44], but their effectiveness in the industrial context remains to be investigated.

2.2 Action Recognition Datasets
Most popular action recognition datasets (see Table 4), such as Weizmann [4], Hollywood-2 [40], HMDB [28] and UCF101 [55], consist of manually trimmed short clips that capture a single action. Unfortunately, these datasets do not represent real-world applications, where multiple actors work on multiple tasks and action recognition always occurs in an untrimmed environment. Video classification datasets, such as TRECVID multimedia event detection [47], Sports-1M [23] and YouTube-8M [1], have focused on video classification at a large scale by automating label generation, thereby introducing a large number of noisy annotations.
Another line of work in HAR is the temporal localization of tasks. ActivityNet [17], THUMOS [18], MultiTHUMOS [69] and Charades [54] use large numbers of untrimmed videos, each containing multiple actions, obtained either from YouTube (ActivityNet, THUMOS, MultiTHUMOS) or from crowdsourced actors (Charades). These datasets cover the temporal localization aspect; however, they do not address the spatial part.
Spatio-temporal action detection datasets, such as CMU [24], MSR Actions [71], UCF Sports [50], JHMDB [22], UCF101-24 [55], AVA [16], AVA-Kinetics [31] and MultiSports [34], typically evaluate spatio-temporal action detection for short videos with frame-level action annotations. These benchmarks pay more attention to spatial information with frame-level and clip-level detectors, which are limited in fully utilizing temporal information.
Very few published datasets and works are available for action recognition in the industrial context. HRI30 [19], InHARD [8], and LAMIS [41] are some of the existing datasets available in the industrial setting. However, most of these datasets are not complex in nature and involve one person performing a specific action. In our proposed dataset, we tried to capture the diversity of actions and the complexity of interactions between multiple actors in the video using surveillance videos. Our dataset differs from the above in terms of both content and annotation: we label a diverse collection of industrial actions and provide spatio-temporal annotations for each subject performing an action in a large set of sampled frames.

2.3 PPE Detection on Surveillance Videos
Traditional approaches to PPE detection use Object Detection (OD) models to identify the safety appliances. Isailovic et al. [21] and Vukicevic et al. [61] use a two-stage approach, employing a keypoint detector to detect regions and then passing these regions to an object detection model for further PPE detection. Wu et al. [66] used the Single Shot Detector (SSD) [37] architecture to identify hardhats of different colors on construction sites, and the model was benchmarked on the GDUT-HWD [66] dataset. Otgonbold et al. [46] benchmarked the performance of multiple OD models for detecting 6 different classes, including person, helmet, head, and face, on the novel SHEL5K dataset [46]. Chen and Demachi [7] introduced a method using OpenPose [5] for body landmark detection and the YOLOv3 OD model for PPE detection. They used the geometric relationships between the key points to detect PPE and assess compliance. Zhafran et al. [72] used the Fast R-CNN architecture and observed a decrease in accuracy with changes in distance and lighting conditions.
Many existing publicly available datasets are focused on the construction industry, and the manufacturing industry is largely unexplored. Most existing datasets (see Table 2) focus on hard hats and safety clothing. GDUT-HWD [66] is very noisy due to the crowd-sourced nature of the data, and SHW [45] is sourced from search engines. The CHV [64] dataset has no additional data and is simply a curated version of the GDUT-HWD and SHW datasets. SHEL5K [46] and Pictor-PPE [25] focus only on the clothing and hard-hat aspects of PPE. These are also crowd-sourced from the web. Only TCRSF [70] has a dataset collected from a real industrial setup, in a chemical plant, focusing on hard hats and clothing. However, it is closed-source. The recent SH17 [2] has a comprehensive collection of PPE, including gloves and earmuffs. This is again largely collected from the internet and crowd-sourced. The available datasets do not accurately reflect the environmental conditions, noise, occlusion, and lighting of a manufacturing setup and do not have a full set of PPE instances. Similar to SH17, we aim to bridge this gap by proposing a novel dataset collected from surveillance videos of large manufacturing industries.

3 DATASET EXPLANATION
This paper addresses the main limitation of existing datasets in understanding industrial actions, i.e., they do not capture the dynamic environment and action classes that a real industrial environment presents. Our goal is to build a large-scale, high-quality dataset with fine-grained action classes and dense annotations that capture most of the commonly performed industrial actions. Also, our proposed dataset is collected from surveillance feeds of a real manufacturing setup, and it tries to address the lack of variety in the existing datasets discussed above.

3.1 Dataset Preparation
3.1.1 Video Collection Process. Video is obtained from surveillance or process monitoring cameras (2 PTZ cameras and 1 bullet camera) captured at a frame rate of 12 FPS with a 1920x1080 pixel resolution. In the current version, a total of 320 hours of footage is collected and cleaned for further processing and annotation.
The process involved the following key steps:
(1) Video Segmentation: We first divided the videos into 15-second clips to standardize the data for subsequent analysis. Human Detection: Using a pre-trained person detection model, we filtered out any clips that did not contain human subjects, significantly reducing the dataset size.
(2) Manual Review: The remaining clips were manually reviewed to eliminate those with poor visibility, unclear content, or other factors that made them unsuitable for further use.
(3) Duplicate Removal: To ensure the uniqueness of the dataset, we applied a hash-based method using the hashlib library to detect and remove duplicate clips (a minimal sketch follows this list). This step involved calculating the MD5 hash of each video file and eliminating any files with matching hashes.
(4) Final Dataset: After these steps, we were left with approximately 1,600 high-quality clips, which were subsequently used for the training, testing, and validation phases of our project.
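The duplicate-removal step (3) can be realized with a few lines of Python. The sketch below assumes the cleaned clips sit in a flat clips/ directory; the paper does not specify the exact layout.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk: int = 1 << 20) -> str:
    """Compute the MD5 digest of a video file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

seen = {}
for clip in sorted(Path("clips").glob("*.mp4")):
    h = md5_of(clip)
    if h in seen:
        clip.unlink()        # byte-identical duplicate of an earlier clip
    else:
        seen[h] = clip
```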
Clip duration: In action recognition, 15-second video clips are chosen to provide a comprehensive temporal context for capturing and understanding actions. This duration ensures that most actions are fully represented and can be analyzed effectively by the model. To annotate these clips, the process involves extracting 15 frames from each video. From these frames, the first 2 and last 2 are removed, leaving 11 frames for annotation. This approach helps focus on the core part of the action, reducing noise from the beginning and end of the clip, where actions may be less defined or transitional.
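A minimal sketch of this frame selection is given below, assuming OpenCV is used to decode the clips (the paper does not name the tooling).

```python
import cv2
import numpy as np

def annotation_frames(clip_path: str, n_samples: int = 15, trim: int = 2):
    """Sample n_samples evenly spaced frames and keep the middle ones."""
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # 15 evenly spaced indices, then drop the first and last `trim` frames.
    indices = np.linspace(0, total - 1, n_samples).astype(int)[trim:-trim]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # 11 frames covering the core of the action
```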
Dataset            Classes   Images   Instances   Method
Pictor-PPE [25]     3         784      -           Web
SHW [45]            1         7581     120558      Web
CHV [64]            6         1330     -           Web
TCRSF [70]          7         12373    50558       Industrial
GDUT-HWD [66]       5         3174     18893       Web
SHEL5K [46]         5         5000     75570       Web
SH17 [2]            17        8099     75994       Web
Our Data            7         3000     19954       Industrial
Table 2: Existing Datasets for PPE detection
3.1.2 Action Taxonomy. Based on the clips collected, an action taxonomy was prepared by selecting the most commonly performed actions in the videos and in real operations. Based on these inputs, the actions are fine-grained enough that each action can be understood clearly, while avoiding many repetitive actions. In the action dictionary provided, the actions and the micro-actions associated with each action are defined separately, and each action was mapped to its respective classes. The actions defined are: Crane Movement; Welding; Observing / Interaction on Shop Floor; Walking; Moving on a Bike / Bicycle; Person Lifting / Carrying / Handing Over / Pushing or Pulling an Object; and Interacting with a Machine/Equipment on the Shop Floor. The micro-action definitions provided to human annotators are shown in Table 3.
3.1.3 Annotation Process. We followed the AVA [16][68] annotation process for action labeling, as this method incorporates micro-actions for understanding the entire sequence of physical activities. In this approach, the entire annotation process is divided into three parts: person bounding box annotation, person link annotation, and action annotation. Person localization is done through a bounding box. We utilized the VIA tool to annotate the activity being performed by the detected humans.
When multiple subjects are present in a selected frame, the annotator evaluates each subject separately for action annotation, because the action labels for each person can be different. Since manual bounding box annotation is labour-intensive, a hybrid annotation approach was followed: an initial set of bounding boxes is first generated using a Faster R-CNN person detector [49]. Annotators are supplied with these proposal files to manually correct the bounding boxes generated. An annotator either removes incorrect bounding boxes or creates the bounding boxes missed by the person detector.
Along the lines of the annotations done for the AVA [16] dataset, bounding boxes over short periods are linked to obtain ground-truth person tracks. The action labels (see Table 3) are generated by crowd-sourced annotators using a custom-designed interface. Each annotator goes through each frame and each person in the frame and selects the corresponding micro-action for the selected person. In the bottom panel, a choice of actions is provided to the annotator for selection. A key frame can contain multiple persons, and a person can be associated with one micro-action in any given frame. Annotators have the choice not to provide any action or bounding box for persons who are distant in the frame, to reduce noise. On average, annotators take between 60 and 180 seconds per video, depending on the number of persons present in the key frame. Finally, the output files from the VIA tool were converted into training-ready files for use with the action model.
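For illustration, one training-ready record could look like the AVA-style CSV row sketched below: one line per person per keyframe, with a normalized box, a micro-action id, and a person/track id. The exact schema of the authors' converted files is not given in the paper, so the field names and values here are assumptions.

```python
import csv

# One annotation record: video id, keyframe timestamp (seconds within the
# clip), a person box normalized to [0, 1], the micro-action id from the
# taxonomy, and a person/track id linking boxes across keyframes.
row = ["shopfloor_cam1_00042", 7, 0.41, 0.22, 0.58, 0.87, 3, 1]

with open("train_annotations.csv", "a", newline="") as f:
    csv.writer(f).writerow(row)
```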
3.2 Dataset Statistics
Our proposed dataset is an ongoing effort towards building a large-scale dataset for understanding industrial actions. A total of more than 60,000 clips are generated from a total of 320 video hours. For this paper, we present a total of 2900 clips annotated with micro-action categories, comprising 45652 instances distributed among 12 micro-action categories. Consequently, the number of instances in our sample dataset is as high as 15.74 per video and 3511 per category. AVA-Kinetics [32], which is a standard dataset for action recognition, annotates only one keyframe for a 10-second clip, which is much lower than our 10-12 keyframes per clip. As shown in Figure 2, the distribution of action instances is not balanced. This distribution increases the difficulty of accurately classifying the action for detection models. To the best of our understanding, a one-to-one comparison cannot be made to any of the existing action recognition datasets, as those datasets are not oriented towards industrial actions. Also, the existing industrial action recognition databases HRI30 [19], LAMIS [41], and InHARD [8] do not reflect the real-world setting and are very limited in scope. Even in comparison to existing non-industrial action recognition databases, our clips are longer (15 s vs an average of 7-8 s) and contain many more instances per clip (15.74 vs 5 on average).

Figure 2: Distribution of Action Labels in Proposed Dataset
3.3 Dataset Characteristics
One of the important goals of this work is to build a diverse and rich dataset (see Figure 2) for industrial action recognition. Besides variation in bounding box size and in the size of the persons or objects in the frame, many categories require discriminating fine-grained differences, such as "observing" versus "interacting", or lifting an object versus transporting an object. Even within an action class, the appearance varies with vastly different contexts: an object may simply be lifted and handed over to another person, or an object may be transported by two people. Similarly, when we detect the motion of the crane, the distance of the crane from the camera is only understood through the size of the hook. Also, it is difficult to accurately estimate the distance between persons in the surrounding area of crane movement. These wide intra-class varieties will allow us to learn features that identify the critical spatio-temporal parts of an action, such as whether, in a given frame, a person is welding or simply observing the welding process from very near.

4 MODEL ARCHITECTURE AND METRICS
4.1 Action Recognition
In this study, we employ the SlowFast network [14] for the task of video action recognition to detect industrial safety violations. The SlowFast network is a state-of-the-art model known for its ability to capture both slow and fast visual information, making it well-suited for recognizing complex actions in videos. The SlowFast network has two pathways: slow and fast. The slow pathway processes video frames at a lower frame rate to capture high-resolution spatial details and static content. The fast pathway processes video frames at a higher frame rate to capture dynamic, rapid movements. These pathways are fused through lateral connections, combining static and dynamic features for effective spatio-temporal pattern recognition in video data.
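The two pathways consume the same clip at different temporal resolutions. The sketch below shows the standard way of preparing the two inputs by temporal subsampling; the speed ratio alpha = 4 is an illustrative default, not a value reported in the paper.

```python
import torch

def pack_pathways(clip: torch.Tensor, alpha: int = 4):
    """Split a clip tensor of shape (C, T, H, W) into SlowFast inputs.

    The fast pathway keeps all T frames; the slow pathway keeps every
    alpha-th frame, trading temporal resolution for channel capacity.
    """
    fast = clip
    slow_idx = torch.linspace(0, clip.shape[1] - 1, clip.shape[1] // alpha).long()
    slow = torch.index_select(clip, 1, slow_idx)
    return [slow, fast]

clip = torch.randn(3, 32, 224, 224)          # a 32-frame RGB clip
slow, fast = pack_pathways(clip, alpha=4)
print(slow.shape, fast.shape)                # (3, 8, 224, 224) and (3, 32, 224, 224)
```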
4.2 PPE Detection
To establish the baseline metric on the proposed dataset, we trained the RetinaNet and Fast R-CNN models from Detectron2, along with the YOLOv9 model, on the dataset of 3522 images. The distribution of PPE classes is given in Table 5.

Class              Instances
no-safety-glove    245
no-safety-helmet   2905
no-safety-shoes    3341
safety-glove       1973
safety-helmet      5289
safety-shoes       6066
welding-helmet     135
Table 5: Class distribution for PPE detection

RetinaNet [35] is a one-stage object detection model that uses a focal loss function to address class imbalance during training. The focal loss applies a modulating term to the cross-entropy loss, focusing learning on hard negative examples. Fast R-CNN [15] processes an entire image and a set of object proposals. For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map, which is fed into a sequence of fully connected layers producing the softmax probabilities and refined bounding-box positions for each class. YOLOv9 [62] combines two neural network architectures, CSPNet and ELAN, designed with gradient path planning in mind. Its Generalized Efficient Layer Aggregation Network (GELAN) enhances lightweight design, inference speed, and accuracy.
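As a concrete example of how such a baseline can be set up, the sketch below fine-tunes a Detectron2 RetinaNet on a COCO-style export of the PPE annotations. The dataset name, file paths, and backbone choice are illustrative assumptions; only the seven classes and the 20,000-iteration budget come from the paper (see Section 5.2).

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the PPE dataset exported in MS-COCO format (paths are assumptions).
register_coco_instances("ppe_train", {}, "annotations/ppe_train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("ppe_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.RETINANET.NUM_CLASSES = 7     # the seven PPE classes of Table 5
cfg.SOLVER.MAX_ITER = 20000             # RetinaNet iteration budget from Section 5.2
cfg.OUTPUT_DIR = "./output_ppe"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```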
4.3 Evaluation Metrics
PPE detection models are evaluated using various metrics to accurately measure their performance. The Microsoft Common Objects in Context (MS-COCO) benchmark employs several common metrics, including Precision (P) and Recall (R).
Additionally, Mean Average Precision (mAP) assesses the detection accuracy across all classes. It is calculated by determining Precision (P) and Recall (R) for each class and then averaging these values to provide an overall score. The Intersection over Union (IoU) metric is used to measure the accuracy of object localization. IoU calculates the overlap between the ground truth bounding box (b_g) and the model's predicted bounding box (b_pred) as follows:

IoU = Area(b_pred ∩ b_g) / Area(b_pred ∪ b_g)

where b_g represents the ground truth bounding box and b_pred denotes the bounding box predicted by the model. AP50 (Average Precision at 50% IoU) is a metric used in object detection to evaluate the precision and recall of a model at a single Intersection over Union (IoU) threshold of 50%. This means that a detected object's bounding box is considered a true positive if its IoU with the ground-truth bounding box is at least 50%. AP50-95 evaluates the model's performance across multiple IoU thresholds. It averages the Average Precision (AP) scores calculated at ten different IoU thresholds: 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, and 95%. This range of thresholds provides a broader view of the model's ability to detect objects with varying degrees of overlap, from relatively loose (50%) to very strict (95%). The AP50-95 score is the mean of these AP values.
For the evaluation of the overall approach of PPE detection through action recognition, a clip-level metric is defined. In a 15-second clip, even if only 1 of the 15 sampled frames has any PPE violation detected, the entire clip is considered a clip having a violation.
Let N be the total number of frames in a clip. Let p_i be a boolean indicator for frame i, where p_i = 1 if there is at least one person detected without the required PPE in frame i, and p_i = 0 otherwise. Finally, V is a boolean indicator for the clip, where V = 1 if the clip is considered as having a violation, and V = 0 otherwise. The metric is defined as follows:

V = 1 if ∑_{i=1}^{N} p_i ≥ 1, and V = 0 otherwise.

This means that V = 1 (violation in the clip) if at least one frame i has p_i = 1 (indicating a detected PPE violation), and V = 0 (no violation in the clip) if all frames i have p_i = 0 (indicating no detected PPE violations).

4.4 Proposed Approach for Action Recognition-Based PPE Detection
To effectively monitor and ensure compliance with industrial safety protocols, we integrate our action recognition and PPE detection models into a single framework, presented in Algorithm 1. First, an action recognition model (SlowFast network) takes the input video feed (V) and is trained to identify actions and indicate the location of activities in the video.

Algorithm 1: Integrated Action Recognition and PPE Detection Framework
Input: Video footage V
Output: Detection of PPE violations and safety certification

Function PPEComplianceCheck_Train({F_first, F_middle, F_last}, B_info, PPE_List):
    for each frame F_i in {F_first, F_middle, F_last} do
        for each B_info,j do
            required_PPE ← PPE_List[a_j]
            for each p in required_PPE do
                if p not in B_PPE then
                    Mark as violation
                else
                    Certify safety compliance
            end
        end
    end

Function PPECheck_Inference(V):
    frames_info ← ActionRecognitionModel(V)
    frames ← {F_first, F_middle, F_last}
    for frame in frames do
        PPEDetectionModel(frame, frames_info)
    end
    PPEComplianceCheck_Train(frames, frames_info, PPE_List)

This information is stored in frames_info. These locations are fed into PPEDetectionModel (YOLOv9), which was trained on a custom dataset to identify PPE. Afterwards, a new module, PPEComplianceCheck_Train, has been designed to check PPE compliance. This module takes three frames F_first, F_middle, and F_last, and for each frame it checks for PPE in each B_info generated from the action recognition model. The detected PPE items are then checked against the list of PPE requirements, required_PPE, from the PPE_List dictionary. If a required PPE item is not among the detected PPE, the instance is marked as a violation; otherwise, it is certified as safety compliant.
PPECheck_Inference has been designed to take the video feed (V) as input, run the inference, and generate the compliance information. The YOLOv9 model determines whether the individual in the indicated area is wearing the appropriate PPE or not. If the person possesses the appropriate PPE for the activity, the system displays a message certifying their safety. If not, the system indicates that the individual is missing certain safety equipment. This strategy allows us to reduce false positives by looking only for the PPE designated for the detected activity.
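A compact Python rendering of this flow is sketched below. The model wrappers, the action-to-PPE mapping, and the detector output format are illustrative assumptions; only the overall control flow (pick three frames, detect PPE per person box, compare against the PPE required for the recognized action, and flag the clip if any frame is in violation) follows Algorithm 1 and the clip-level metric of Section 4.3.

```python
# Hypothetical action-to-required-PPE mapping; entries mirror the examples
# given in the introduction, not an exhaustive plant-specific policy.
PPE_LIST = {
    "walking": {"safety-helmet", "safety-shoes"},
    "material_handling": {"safety-helmet", "safety-shoes", "safety-glove"},
    "welding": {"safety-helmet", "safety-shoes", "safety-glove", "welding-helmet"},
}

def ppe_compliance_check(frames, boxes_info, ppe_detector):
    """Return (clip-level indicator V, per-person violations) for key frames."""
    violations = []
    for frame in frames:
        detected = ppe_detector(frame)                 # {person_id: {ppe classes}}
        for person_id, action in boxes_info.items():
            missing = PPE_LIST.get(action, set()) - detected.get(person_id, set())
            if missing:
                violations.append((person_id, action, missing))
    return int(bool(violations)), violations           # V = 1 if any frame violates

def ppe_check_inference(video, action_model, ppe_detector):
    boxes_info, key_frames = action_model(video)       # person boxes + actions
    frames = [key_frames[0], key_frames[len(key_frames) // 2], key_frames[-1]]
    return ppe_compliance_check(frames, boxes_info, ppe_detector)

# Toy usage with stand-in models: the welder below is missing a welding helmet.
fake_action_model = lambda v: ({"p1": "welding"}, ["f0", "f1", "f2"])
fake_detector = lambda f: {"p1": {"safety-helmet", "safety-shoes", "safety-glove"}}
print(ppe_check_inference("clip.mp4", fake_action_model, fake_detector)[0])  # 1
```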
5 EXPERIMENTS AND RESULTS
We conducted action recognition experiments using the SlowFast network on a real-life industry environment dataset, for which no previous datasets are available. We benchmarked our dataset against existing state-of-the-art models. All experiments were performed on a single RTX A6000 GPU with 48 GB VRAM and an Intel Xeon Gold 5318Y CPU @ 2.10 GHz × 96.

5.1 Action Recognition
For the SlowFast network, we utilized the implementation from the official repository [12] and trained the model for 50 epochs with fine-tuned hyperparameters. The results are summarized in Table 6 and Table 7. We observe that the Recall@Top-k metrics are high.
Figure 3: Combined action recognition and PPE detection for real-time safety compliance.

Clip-level violations are obtained from Algorithm 1. The test dataset of 109 videos, consisting of 54 videos with violations, is used. These clips were chosen such that the field of view was sufficient to get a good focus on the workers and activities. We also ensured coverage of different types of actions and violations. The results of these approaches are presented in Table 10.
Approach                        Precision   Recall   F1 Score
Common PPE                      0.62        0.55     0.59
All PPE                         0.51        0.56     0.54
Activity based PPE (1 frame)    0.60        0.93     0.73
Activity based PPE (2 frame)    0.64        0.83     0.72
Table 10: Precision, Recall and F1 Score for all three approaches.
While state-of-the-art models like YOLO are effective at detecting PPE, they tend to generate a high number of false positives. This is because not all PPE is required at all times; the necessity depends on the specific action being performed. For example, a person walking on a workshop floor does not need a welding helmet. We observe better precision when detecting common PPE (e.g., safety shoes, helmets) compared to detecting all PPE. This issue can be addressed through an action-based approach, which identifies the necessary PPE based on the activity being performed and checks only for those specific items. As shown in Table 10, the activity-based approach achieves a high recall rate of 93% by focusing on the relevant PPE for each action. By requiring at least two frames to determine a violation, false positives are further reduced, leading to an increase in precision.
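For reference, the 23% figure quoted in the abstract is consistent with the F1 scores in Table 10, taking the common-PPE baseline as the point of comparison: (0.73 - 0.59) / 0.59 ≈ 0.237, i.e., roughly a 23% relative improvement of the activity-based approach (1 frame) over the Common PPE baseline.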
In an industrial environment, safety must be the top priority. Therefore, our primary goal is to maximize the detection of videos with violations. As a result, we focus on optimizing the recall metric. All violations detected by the algorithm would be verified by the safety officer before initiating disciplinary action.

5.2 PPE Detection
Although there are datasets available for PPE detection, we are not aware of any real-life industry environment datasets. We benchmark the dataset with existing state-of-the-art models. The dataset for PPE detection mentioned in Table 5 is formatted in MS COCO style.
We utilized the implementations of RetinaNet and Fast R-CNN from Detectron2 [67] and trained them on our dataset for 20,000 and 27,000 iterations, respectively, using the default hyperparameters. For the YOLOv9 model [62], we used the official implementation and trained it for 120 epochs with default hyperparameters.

Classes            RetinaNet   FastRCNN (R101)   YOLOv9
no-safety-glove    56.9        66.1              67.1
no-safety-helmet   64.0        67.3              76.9
no-safety-shoes    49.4        54.4              66.4
safety-glove       37.1        39.9              51.5
safety-helmet      68.9        65.8              75.8
safety-shoes       56.2        58.5              70.8
welding-helmet     46.3        55.6              62.7
Table 8: Average Precision (AP50-95) score of various models on our dataset

6 DISCUSSION
The inference runs with an average time of 117.25 ms per frame and a total inference time of 1.76 seconds to process a 15-second video clip. During the process, CPU usage peaks at 2%, memory usage is 5.6%, and GPU usage peaks at 12%. The CCTV cameras used in the study record data at 12 fps, and we have tested that the above system (see Section 5) can process 25 video streams concurrently in real time.
One of the limitations of our dataset is having only a 2D RGB video feed. The dataset lacks depth and 3D video feeds, which could be used for a more comprehensive understanding of the working environment and context. For example, a person assisting the crane movement should not be directly below the crane and should maintain a certain distance from its path. This distance between the worker and the crane is difficult to measure accurately because camera positioning impacts the accuracy. We plan to deploy depth sensors at designated sites and collect the data to further augment the dataset.
The other issue is the field of view (FOV). The usual camera setup in an industrial environment is to cover the maximum area possible. Due to this, the field of view becomes large, and small objects such as gloves, glasses, and shoes on distant subjects are difficult to detect. One solution is to increase the number of cameras. From analyzing the videos, we have observed that a FOV of 20 metres is sufficient to get good accuracy. Practically, this approach also makes sense, as many large plants have shop floors covering huge areas and complete coverage might not be financially feasible.

7 CONCLUSION
In this paper, we present a novel dataset with dense spatio-temporal annotations designed to recognize industrial actions. This dataset sets itself apart from existing action detection datasets by offering a diverse and realistic collection of industrial environment clips, as well as comprehensive annotations for commonly performed industrial tasks. We also introduce an innovative approach for detecting PPE violations by integrating action recognition with object detection models. Our approach achieves a high recall at the clip level. Our work highlights the need for further research in Industrial Action Recognition (IAR) and aims to inspire continued exploration in application areas such as human workflow analysis.
REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. ArXiv (2016).
[2] Hafiz Mughees Ahmad and Afshin Rahimi. 2024. SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry.
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? ArXiv (2021).
[4] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. 2005. Actions as space-time shapes. Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 (2005).
[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2016. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 1302–1310.
[6] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 4724–4733. https://api.semanticscholar.org/CorpusID:206596127
[7] Shi Chen and Kazuyuki Demachi. 2020. A Vision-Based Approach for Ensuring Proper Use of Personal Protective Equipment (PPE) in Decommissioning of Fukushima Daiichi Nuclear Power Station. Applied Sciences (2020).
[8] Mejdi Dallel, Vincent Havard, David Baudry, and Xavier Savatier. 2020. InHARD - Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. In 2020 IEEE International Conference on Human-Machine Systems (ICHMS).
[9] Mejdi Dallel, Vincent Havard, David Baudry, and Xavier Savatier. 2020. InHARD - Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. 2020 IEEE International Conference on Human-Machine Systems (ICHMS) (2020).
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv (2020).
[12] Haoqi Fan, Yanghao Li, Bo Xiong, Wan-Yen Lo, and Christoph Feichtenhofer. 2020. PySlowFast. https://github.com/facebookresearch/slowfast.
[13] Christoph Feichtenhofer. 2020. X3D: Expanding Architectures for Efficient Video Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
[14] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2018. SlowFast Networks for Video Recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018).
[15] Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[16] Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A. Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2017. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017).
[17] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
[18] Haroon Idrees, Amir Zamir, Yu-Gang Jiang, Alexander N. Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2016. The THUMOS challenge on action recognition for videos "in the wild". ArXiv (2016).
[19] Francesco Iodice, Elena De Momi, and Arash Ajoudani. 2022. HRI30: An Action Recognition Dataset for Industrial Human-Robot Interaction. In 2022 26th International Conference on Pattern Recognition (ICPR).
[20] Francesco Iodice, Elena De Momi, and Arash Ajoudani. 2022. HRI30: An Action Recognition Dataset for Industrial Human-Robot Interaction. 2022 26th International Conference on Pattern Recognition (ICPR) (2022).
[21] Velibor Isailović, Aleksandar Peulić, Marko Djapan, Marija Savković, and Arso M. Vukicevic. 2022. The compliance of head-mounted industrial PPE by using deep learning object detectors. Scientific Reports (2022).
[22] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. 2013 IEEE International Conference on Computer Vision (2013).
[23] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014).
[24] Yan Ke, Rahul Sukthankar, and Martial Hebert. 2005. Efficient visual event detection using volumetric features. Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 (2005).
[25] Kyunghwan Kim, Kangeun Kim, and Soyoon Jeong. 2023. Application of YOLO v5 and v8 for Recognition of Safety Risk Factors at Construction Sites. Sustainability 15 (10 2023), 15179.
[26] Yu Kong and Yun Raymond Fu. 2018. Human Action Recognition and Prediction: A Survey. International Journal of Computer Vision 130 (2018), 1366–1401. https://api.semanticscholar.org/CorpusID:49551723
[27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV).
[28] Hilde Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision (2011).
[29] Ivan Laptev. 2005. On Space-Time Interest Points. International Journal of Computer Vision 64 (2005), 107–123. https://api.semanticscholar.org/CorpusID:2619278
[30] Luziwei Leng, Kaiwei Che, Kaixuan Zhang, Jianguo Zhang, Qinghu Meng, Jie Cheng, Qinghai Guo, and Jianxing Liao. 2022. Differentiable hierarchical and surrogate gradient search for spiking neural networks. In Advances in Neural Information Processing Systems.
[31] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. ArXiv (2020).
[32] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. arXiv:2005.00214 [cs.CV]
[33] W. Li, Zhengyou Zhang, and Zicheng Liu. 2010. Action recognition based on a bag of 3D points. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (2010).
[34] Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. 2021. MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021).
[35] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[36] Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. CVPR 2011 (2011), 3337–3344. https://api.semanticscholar.org/CorpusID:9119671
[37] W. Liu, Dragomir Anguelov, D. Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2015. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision.
[38] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
[39] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. 2015. The KIT whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR).
[40] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. 2009. Actions in context. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009).
[41] Naval Kishore Mehta, Shyam Sunder Prasad, Sumeet Saurav, Ravi Saini, and Sanjay Singh. 2024. IAR-Net: A Human–Object Context Guided Action Recognition Network for Industrial Environment Monitoring. IEEE Transactions on Instrumentation and Measurement 73 (2024), 1–8.
[42] Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. 2007. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition. 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007), 1–8. https://api.semanticscholar.org/CorpusID:7117722
[43] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber. 2007. Documentation Mocap Database HDM05. Technical Report CG-2007-2. Universität Bonn.
[44] Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. 2019. Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine 36 (2019), 51–63.
[45] njvisionpower. 2024. Safety-Helmet-Wearing-Dataset. GitHub (2024).
[46] Munkh-Erdene Otgonbold, Munkhjargal Gochoo, Fady S. Alnajjar, Luqman Ali, Tan-Hsu Tan, Jun-Wei Hsieh, and Ping-Yang Chen. 2022. SHEL5K: An Extended Dataset and Benchmarking for Safety Helmet Detection. Sensors (Basel, Switzerland) 22 (2022).
[47] Paul Over, Jon Fiscus, Gregory Sanders, David Joy, Martial Michel, George Awad, Alan Smeaton, Wessel Kraaij, and Georges Quénot. 2014. TRECVID 2014 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics.
[48] Qinhan Xiao, Qiankun Xiao, and Junfeng Li. 2013. Human Motion Capture Data Retrieval Based on Quaternion and EMD. In International Conference on Intelligent Human-Machine Systems and Cybernetics.
[49] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137–1149.
[50] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008).
[51] M. S. Ryoo and L. Matthies. 2013. First-Person Activity Recognition: What Are They Doing to Me?. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. Proceedings of the 15th ACM International Conference on Multimedia (2007). https://api.semanticscholar.org/CorpusID:1087061
[53] Javen Qinfeng Shi, Li Cheng, Li Wang, and Alex Smola. 2011. Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models. International Journal of Computer Vision 93 (2011), 22–32. https://api.semanticscholar.org/CorpusID:9054863
[54] Gunnar A. Sigurdsson, Gül Varol, X. Wang, Ali Farhadi, Ivan Laptev, and Abhinav Kumar Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In European Conference on Computer Vision.
[55] Khurram Soomro, Amir Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. ArXiv (2012).
[56] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv:1212.0402
[57] Moritz Tenorth, Jan Bandouch, and Michael Beetz. 2009. The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition. 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops (2009).
[58] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2014. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV) (2014), 4489–4497. https://api.semanticscholar.org/CorpusID:1122604
[59] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2017. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017).
[60] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Neural Information Processing Systems.
[61] Arso M. Vukicevic, Marko Djapan, Velibor Isailović, Danko Z. Milasinovic, Marija Savković, and Pavle Miloević. 2022. Generic compliance of industrial PPE by using deep learning techniques. Safety Science (2022).
[62] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. 2024. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv preprint arXiv:2402.13616 (2024).
[63] Heng Wang, Dan Oneaţă, Jakob J. Verbeek, and Cordelia Schmid. 2015. A Robust and Efficient Video Representation for Action Recognition. International Journal of Computer Vision 119 (2015), 219–238. https://api.semanticscholar.org/CorpusID:11491197
[64] Zijian Wang, Yimin Wu, Lichao Yang, Arjun Thirunavukarasu, Colin Evison, and Yifan Zhao. 2021. Fast Personal Protective Equipment Detection for Real Construction Sites Using Deep Learning Approaches. Sensors (Basel, Switzerland) (2021).
[65] Yuyang Wanyan, Xiaoshan Yang, Weiming Dong, and Changsheng Xu. 2024. A Comprehensive Review of Few-shot Action Recognition.
[66] Jixiu Wu, Nian Cai, Wenjie Chen, Huiheng Wang, and Guotian Wang. 2019. Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Automation in Construction (2019).
[67] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https://github.com/facebookresearch/detectron2.
[68] Fan Yang. 2022. CustomAva. https://github.com/Whiffe/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset.
[69] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2015. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. International Journal of Computer Vision (2015).
[70] Fusheng Yu, Xiaoping Wang, Jiang Li, Shaojin Wu, Junjie Zhang, and Zhigang Zeng. 2023. Towards Complex Real-World Safety Factory Inspection: A High-Quality Dataset for Safety Clothing and Helmet Detection. ArXiv (2023).
[71] Junsong Yuan, Zicheng Liu, and Ying Wu. 2009. Discriminative subvolume search for efficient action detection. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009).
[72] Faishal Zhafran, Endah Suryawati Ningrum, Mohamad Nasyir Tamara, and Eny Kusumawati. 2019. Computer Vision System Based for Personal Protective Equipment Detection, by Using Convolutional Neural Network. 2019 International Electronics Symposium (IES) (2019).
[73] Sijie Zhu, Taojiannan Yang, Matías Mendieta, and Chen Chen. 2020. A3D: Adaptive 3D Networks for Video Action Recognition. ArXiv (2020).