Action Recognition based Industrial Safety Violation Detection
ABSTRACT
Proper use of personal protective equipment (PPE) can save the lives of industry workers, and PPE monitoring is a widely used application of computer vision in large manufacturing industries. However, most of the deployed applications generate a large number of false alarms (violations) because they tend to generalize PPE requirements across the industry and across tasks. The key to resolving this issue is to understand the action being performed by the worker and customize the inference for the specific PPE requirements of that action. In this paper, we propose a system that employs activity recognition models to first understand the action being performed and then uses object detection techniques to check for violations. This leads to a 23% improvement in the F1-score compared to the PPE-based approach on our test dataset of 109 videos.

Figure 1: Sample Data with Multi-actor, Multi-action Industrial Scenario. (a) Real Time Surveillance Feed (b) Multi Actions - Multi People (c) Welding with Occlusion (d) Multi Person Walking

CCS CONCEPTS
• Computing methodologies → Activity recognition and understanding; Object detection.
KEYWORDS
Action Recognition, PPE Detection, Object Detection

ACM Reference Format:
Surya N Reddy, Vaibhav Kurrey, Mayank Nagar, and Gagan Raj Gupta. 2024. Action Recognition based Industrial Safety Violation Detection. In Proceedings of ACM Conference (CODS-COMAD 2024). ACM, New York, NY, USA, 10 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CODS-COMAD 2024, December 18-21, 2024, IIT-Jodhpur, India
© 2024 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Accidents in construction sites and industrial environments can turn fatal for workers if they do not wear proper Personal Protective Equipment (PPE). The right usage of PPE not only saves lives but also reduces the severity of injuries. Despite regulatory requirements and safety protocols, ensuring compliance with PPE requirements remains challenging for employers and safety officers. For large industries, monitoring all activities and employees is manpower-intensive and time-consuming. Video analytics-based solutions can ensure better compliance at low cost and enable the automatic detection of violations in the workplace.
During the root cause analysis of accidents, the common questions asked are: i) Was the worker wearing the PPE designated for that activity? ii) Was the worker following any unsafe practice or violating any laid-down procedure during the activity? iii) Was the worker working in an unsafe environment? When multiple activities are performed simultaneously by multiple people inside a large manufacturing complex (shop floor), detection of activity-specific PPE for each person is difficult. For example, a person walking inside the shop floor might need only a safety helmet and safety shoes, whereas a person handling materials or sharp objects also needs safety gloves. Similarly, a person doing welding needs all of the above PPE along with safety glasses. These variations in the PPE requirements (see Table 1) make automated violation detection difficult. Our goal in this paper is to address questions (i) and (ii) by building a system that can understand the activity being performed by a worker from industrial surveillance camera videos and check for any violation of the PPE designated for those activities.
Understanding human actions within the shop floor environment is crucial for developing an effective violation detection system. Sometimes, a worker may wear the necessary PPE but still engage in unsafe workflows, posing risks not only to themselves but also to surrounding workers. While action recognition and classification are well-established tasks in computer vision, there are only limited published works, such as InHARD [8], HRI30 [19], and LAMIS [41], that focus on industrial actions, either in terms of model architectures or datasets. Even then, these datasets often lack the realism of an actual shop floor environment, typically featuring one person per video or focusing on a single action, failing to capture the complexities of real-world conditions (see Figure 1).
Most benchmark action recognition datasets are sourced from the internet or controlled laboratory settings, predominantly featuring sports-related or household actions performed by a single actor. In real-world industrial settings, however, multiple individuals often need to be monitored simultaneously by a single camera. The existing datasets usually consist of well-curated, high-quality videos, which do not fully capture the dynamic and chaotic nature of real-world industrial environments. Therefore, there is a pressing need for a comprehensive dataset that authentically represents industrial actions. This paper proposes the creation of such a dataset, sourced from surveillance and process monitoring cameras within a large-scale manufacturing complex.
Integrating Human Action Recognition (HAR) models with traditional object detection systems creates a robust solution for detecting PPE violations. These models can identify specific tasks and check for compliance with PPE requirements, thereby reducing computational costs and minimizing false alarms. For effective PPE detection in real-world conditions, the dataset must be diverse, covering a wide range of tasks and PPE types. In typical industrial settings, where dedicated high-quality cameras are uncommon, the models must adapt to process monitoring or surveillance feeds, which may have poor lighting, blurry images, occlusions, and multiple individuals. These feeds often capture various activities happening simultaneously, making accurate violation detection challenging. Figure 1 shows examples of images from real industrial settings.
In this study, we train a SlowFast network [14] (a state-of-the-art model) for the task of video action recognition and a YOLOv9 model for PPE detection to detect industrial safety violations at the clip level. The action recognition-based PPE detection approach is compared with traditional PPE-based approaches. We also present a human study to compare the performance of our approach.
To summarize, our contributions are as follows:
• We propose a novel dataset for understanding human actions in industrial settings.
• We propose a novel approach for detecting task-specific PPE requirements using action recognition and object detection models, which can catch violations comparably to humans and much better than PPE-based approaches.

2 RELATED WORKS
Related works in this area can be broadly classified into three areas: video-based action recognition, publicly available datasets for HAR in an industrial context, and PPE detection in industrial areas.

2.1 Action Recognition
Action recognition is an area in computer vision that involves identifying and categorizing human actions in video sequences. Unlike static image classification, human action recognition must account for the temporal dynamics and sequential nature of actions, which significantly increases the complexity of the task [65]. Historically, frame-based action recognition has typically involved two key steps: action representation [29, 42, 52, 63] and action classification [26, 36, 53]. Recent works have merged both of these into an end-to-end learning framework, significantly improving action classification performance.
To leverage information from all frames and model the inter-frame correlation, Tran et al. [58] proposed 3DCNN to learn features in both the spatial and temporal domains, but with high computational costs. Carreira and Zisserman [6] introduced I3D, which builds upon existing image classification architectures, making training easier. Feichtenhofer et al. [14] proposed an efficient network, SlowFast, with both slow and fast pathways that can adapt to different scenarios by adjusting channel capacities, greatly enhancing overall efficiency. Additionally, various 3DCNN variants [13, 59, 73] have been proposed, further improving recognition efficiency and reducing the limitations of the initial architecture. ViT [11] and self-attention mechanisms [10, 60] have been adapted to action recognition tasks [3, 38] and have been shown to achieve good performance. Spiking neural networks (SNNs) have also been used for action recognition. However, due to the non-differentiability of discrete pulse signals, training SNNs poses challenges. Several effective training methods have been proposed to address this challenge [30, 44], but their effectiveness in the industrial context remains to be investigated.

2.2 Action Recognition Datasets
Most popular action recognition datasets (see Table 4), such as Weizmann [4], Hollywood-2 [40], HMDB [28] and UCF101 [55], consist of manually trimmed short clips that capture a single action. Unfortunately, these datasets do not represent real-world applications, where multiple actors work on multiple tasks and action recognition always occurs in an untrimmed environment. Video classification datasets, such as TRECVID multimedia event detection [47], Sports-1M [23] and YouTube-8M [1], have focused on video classification at a large scale by automating label generation, thereby introducing a large number of noisy annotations.
Another line of work in HAR is the temporal localization of tasks. ActivityNet [17], THUMOS [18], MultiTHUMOS [69] and Charades [54] use large numbers of untrimmed videos, each containing multiple actions, obtained either from YouTube (ActivityNet, THUMOS, MultiTHUMOS) or from crowdsourced actors (Charades). These datasets cover the temporal localization aspect; however, they do not address the spatial part.
Spatio-temporal action detection datasets, such as CMU [24], MSR Actions [71], UCF Sports [50], JHMDB [22], UCF101-24 [55], AVA [16], AVA-Kinetics [31] and MultiSports [34], typically evaluate spatio-temporal action detection for short videos with frame-level action annotations. These benchmarks pay more attention to spatial information with frame-level and clip-level detectors, which are limited in fully utilizing temporal information.
Very few published datasets and works are available for action recognition in the industrial context. HRI30 [19], InHARD [8], and LAMIS [41] are some of the existing datasets available in the industrial setting. However, most of these datasets are not complex in nature and involve one person performing a specific action. In our proposed dataset, we tried to capture the diversity of actions and the complexity of interactions between multiple actors in the video using surveillance videos. Our dataset differs from the above in terms of both content and annotation: we label a diverse collection of industrial actions and provide spatio-temporal annotations for each subject performing an action in a large set of sampled frames.

2.3 PPE Detection on Surveillance Videos
Traditional approaches to PPE detection use Object Detection (OD) models to identify the safety appliances. Isailovic et al. [21] and Vukicevic et al. [61] use a two-stage approach, employing a keypoint detector to detect regions and then passing these regions to an object detection model for further PPE detection. Wu et al. [66] used the Single Shot Detector (SSD) [37] architecture to identify hardhats of different colors on construction sites, and the model was benchmarked on the GDUT-HWD [66] dataset. Otgonbold et al. [46] benchmarked the performance of multiple OD models for detecting 6 different classes, including person, helmet, head, and face, on the novel SHEL5K dataset [46]. Chen and Demachi [7] introduced a method using OpenPose [5] for body landmark detection and the YOLOv3 OD model for PPE detection. They used the geometric relationships between the key points to detect PPE and assess compliance. Zhafran et al. [72] used the Fast R-CNN architecture and observed a decrease in accuracy with changes in distance and lighting conditions.
Many existing publicly available datasets are focused on the construction industry, and the manufacturing industry is largely unexplored. Most existing datasets (see Table 2) focus on hard hats and safety clothing. GDUT-HWD [66] is very noisy due to the crowd-sourced nature of the data, and SHW [45] is sourced from search engines. The CHV [64] dataset has no additional data and is simply a curated version of the GDUT-HWD and SHW datasets. SHEL5K [46] and Pictor-PPE [25] focus only on the clothing and hard-hat aspects of PPE. These are also crowd-sourced from the web. Only TCRSF [70] has a dataset collected from a real industrial setup, in a chemical plant, focusing on hard hats and clothing. However, it is closed-source. The recent SH17 [2] has a comprehensive collection of PPE, including gloves and earmuffs. This is again largely collected from the internet and crowd-sourced. The available datasets do not accurately reflect the environmental conditions, noise, occlusion, and lighting of a manufacturing setup and do not have a full set of PPE instances. Similar to SH17, we aim to bridge this gap by proposing a novel dataset collected from surveillance videos of large manufacturing industries.

3 DATASET EXPLANATION
This paper addresses the main limitation of existing datasets in understanding industrial actions, i.e., they do not capture the dynamic environment and action classes that a real industrial environment presents. Our goal is to build a large-scale, high-quality dataset with fine-grained action classes and dense annotations that capture most of the commonly performed industrial actions. Also, our proposed dataset is collected from surveillance feeds of a real manufacturing setup, and it tries to address the lack of variety in the existing datasets discussed above.

3.1 Dataset Preparation
3.1.1 Video Collection Process. Video is obtained from surveillance or process monitoring cameras (2 PTZ cameras and 1 bullet camera) captured at a frame rate of 12 FPS with a 1920x1080 pixel resolution. In the current version, a total of 320 hours of footage is collected and cleaned for further processing and annotation.
The process involved the following key steps:
(1) Video Segmentation: We first divided the videos into 15-second clips to standardize the data for subsequent analysis. Human Detection: Using a pre-trained person detection model, we filtered out any clips that did not contain human subjects, significantly reducing the dataset size.
(2) Manual Review: The remaining clips were manually reviewed to eliminate those with poor visibility, unclear content, or other factors that made them unsuitable for further use.
(3) Duplicate Removal: To ensure the uniqueness of the dataset, we applied a hash-based method using the hashlib library to detect and remove duplicate clips (a minimal sketch follows this list). This step involved calculating the MD5 hash of each video file and eliminating any files with matching hashes.
(4) Final Dataset: After these steps, we were left with approximately 1,600 high-quality clips, which were subsequently used for the training, testing, and validation phases of our project.
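The duplicate-removal step (3) can be realized with a few lines of Python. The sketch below assumes the cleaned clips sit in a flat clips/ directory; the paper does not specify the exact layout.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk: int = 1 << 20) -> str:
    """Compute the MD5 digest of a video file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

seen = {}
for clip in sorted(Path("clips").glob("*.mp4")):
    h = md5_of(clip)
    if h in seen:
        clip.unlink()        # byte-identical duplicate of an earlier clip
    else:
        seen[h] = clip
```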
Clip duration: In action recognition, 15-second video clips are chosen to provide a comprehensive temporal context for capturing and understanding actions. This duration ensures that most actions are fully represented and can be analyzed effectively by the model. To annotate these clips, the process involves extracting 15 frames from each video. From these frames, the first 2 and last 2 are removed, leaving 11 frames for annotation. This approach helps focus on the core part of the action, reducing noise from the beginning and end of the clip, where actions may be less defined or transitional.
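A minimal sketch of this frame selection is given below, assuming OpenCV is used to decode the clips (the paper does not name the tooling).

```python
import cv2
import numpy as np

def annotation_frames(clip_path: str, n_samples: int = 15, trim: int = 2):
    """Sample n_samples evenly spaced frames and keep the middle ones."""
    cap = cv2.VideoCapture(clip_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # 15 evenly spaced indices, then drop the first and last `trim` frames.
    indices = np.linspace(0, total - 1, n_samples).astype(int)[trim:-trim]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # 11 frames covering the core of the action
```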
Dataset            Classes   Images   Instances   Method
Pictor-PPE [25]     3         784      -           Web
SHW [45]            1         7581     120558      Web
CHV [64]            6         1330     -           Web
TCRSF [70]          7         12373    50558       Industrial
GDUT-HWD [66]       5         3174     18893       Web
SHEL5K [46]         5         5000     75570       Web
SH17 [2]            17        8099     75994       Web
Our Data            7         3000     19954       Industrial
Table 2: Existing Datasets for PPE detection
3.1.2 Action Taxonomy. Based on the clips collected, an action taxonomy was prepared by selecting the most commonly performed actions in the videos and in real operations. Based on these inputs, the actions are fine-grained enough that each action can be understood clearly, while avoiding many repetitive actions. In the action dictionary provided, the actions and the micro-actions associated with each action are defined separately, and each action was mapped to its respective classes. The actions defined are: Crane Movement; Welding; Observing / Interaction on Shop Floor; Walking; Moving on a Bike / Bicycle; Person Lifting / Carrying / Handing Over / Pushing or Pulling an Object; and Interacting with a Machine/Equipment on the Shop Floor. The micro-action definitions provided to human annotators are shown in Table 3.
3.1.3 Annotation Process. We followed the AVA [16][68] annotation process for action labeling, as this method incorporates micro-actions for understanding the entire sequence of physical activities. In this approach, the entire annotation process is divided into three parts: person bounding box annotation, person link annotation, and action annotation. Person localization is done through a bounding box. We utilized the VIA tool to annotate the activity being performed by the detected humans.
When multiple subjects are present in a selected frame, the annotator evaluates each subject separately for action annotation, because the action labels for each person can be different. Since manual bounding box annotation is labour-intensive, a hybrid annotation approach was followed: an initial set of bounding boxes is first generated using a Faster R-CNN person detector [49]. Annotators are supplied with these proposal files to manually correct the bounding boxes generated. An annotator either removes incorrect bounding boxes or creates the bounding boxes missed by the person detector.
Along the lines of the annotations done for the AVA [16] dataset, bounding boxes over short periods are linked to obtain ground-truth person tracks. The action labels (see Table 3) are generated by crowd-sourced annotators using a custom-designed interface. Each annotator goes through each frame and each person in the frame and selects the corresponding micro-action for the selected person. In the bottom panel, a choice of actions is provided to the annotator for selection. A key frame can contain multiple persons, and a person can be associated with one micro-action in any given frame. Annotators have the choice not to provide any action or bounding box for persons who are distant in the frame, to reduce noise. On average, annotators take between 60 and 180 seconds per video, depending on the number of persons present in the key frame. Finally, the output files from the VIA tool were converted into training-ready files for use with the action model.
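For illustration, one training-ready record could look like the AVA-style CSV row sketched below: one line per person per keyframe, with a normalized box, a micro-action id, and a person/track id. The exact schema of the authors' converted files is not given in the paper, so the field names and values here are assumptions.

```python
import csv

# One annotation record: video id, keyframe timestamp (seconds within the
# clip), a person box normalized to [0, 1], the micro-action id from the
# taxonomy, and a person/track id linking boxes across keyframes.
row = ["shopfloor_cam1_00042", 7, 0.41, 0.22, 0.58, 0.87, 3, 1]

with open("train_annotations.csv", "a", newline="") as f:
    csv.writer(f).writerow(row)
```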
3.2 Dataset Statistics
Our proposed dataset is an ongoing effort towards building a large-scale dataset for understanding industrial actions. A total of more than 60,000 clips are generated from a total of 320 video hours. For this paper, we present a total of 2900 clips annotated with micro-action categories, comprising 45652 instances distributed among 12 micro-action categories. Consequently, the number of instances in our sample dataset is as high as 15.74 per video and 3511 per category. AVA-Kinetics [32], which is a standard dataset for action recognition, annotates only one keyframe for a 10-second clip, which is much lower than our 10-12 keyframes per clip. As shown in Figure 2, the distribution of action instances is not balanced. This distribution increases the difficulty of accurately classifying the action for detection models. To the best of our understanding, a one-to-one comparison cannot be made to any of the existing action recognition datasets, as those datasets are not oriented towards industrial actions. Also, the existing industrial action recognition databases HRI30 [19], LAMIS [41], and InHARD [8] do not reflect the real-world setting and are very limited in scope. Even in comparison to existing non-industrial action recognition databases, our clips are longer (15 s vs an average of 7-8 s) and contain many more instances per clip (15.74 vs 5 on average).

Figure 2: Distribution of Action Labels in Proposed Dataset
3.3 Dataset Characteristics
One of the important goals of this work is to build a diverse and rich dataset (see Figure 2) for industrial action recognition. Besides variation in bounding box size and in the size of the persons or objects in the frame, many categories require discriminating fine-grained differences, such as "observing" versus "interacting", or lifting an object versus transporting an object. Even within an action class, the appearance varies with vastly different contexts: an object may simply be lifted and handed over to another person, or an object may be transported by two people. Similarly, when we detect the motion of the crane, the distance of the crane from the camera is only understood through the size of the hook. Also, it is difficult to accurately estimate the distance between persons in the surrounding area of crane movement. These wide intra-class varieties will allow us to learn features that identify the critical spatio-temporal parts of an action, such as whether, in a given frame, a person is welding or simply observing the welding process from very near.

4 MODEL ARCHITECTURE AND METRICS
4.1 Action Recognition
In this study, we employ the SlowFast network [14] for the task of video action recognition to detect industrial safety violations. The SlowFast network is a state-of-the-art model known for its ability to capture both slow and fast visual information, making it well-suited for recognizing complex actions in videos. The SlowFast network has two pathways: slow and fast. The slow pathway processes video frames at a lower frame rate to capture high-resolution spatial details and static content. The fast pathway processes video frames at a higher frame rate to capture dynamic, rapid movements. These pathways are fused through lateral connections, combining static and dynamic features for effective spatio-temporal pattern recognition in video data.
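The two pathways consume the same clip at different temporal resolutions. The sketch below shows the standard way of preparing the two inputs by temporal subsampling; the speed ratio alpha = 4 is an illustrative default, not a value reported in the paper.

```python
import torch

def pack_pathways(clip: torch.Tensor, alpha: int = 4):
    """Split a clip tensor of shape (C, T, H, W) into SlowFast inputs.

    The fast pathway keeps all T frames; the slow pathway keeps every
    alpha-th frame, trading temporal resolution for channel capacity.
    """
    fast = clip
    slow_idx = torch.linspace(0, clip.shape[1] - 1, clip.shape[1] // alpha).long()
    slow = torch.index_select(clip, 1, slow_idx)
    return [slow, fast]

clip = torch.randn(3, 32, 224, 224)          # a 32-frame RGB clip
slow, fast = pack_pathways(clip, alpha=4)
print(slow.shape, fast.shape)                # (3, 8, 224, 224) and (3, 32, 224, 224)
```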
4.2 PPE Detection
To establish the baseline metric on the proposed dataset, we trained the RetinaNet and Fast R-CNN models from Detectron2, along with the YOLOv9 model, on the dataset of 3522 images. The distribution of PPE classes is given in Table 5.

Class              Instances
no-safety-glove    245
no-safety-helmet   2905
no-safety-shoes    3341
safety-glove       1973
safety-helmet      5289
safety-shoes       6066
welding-helmet     135
Table 5: Class distribution for PPE detection

RetinaNet [35] is a one-stage object detection model that uses a focal loss function to address class imbalance during training. The focal loss applies a modulating term to the cross-entropy loss, focusing learning on hard negative examples. Fast R-CNN [15] processes an entire image and a set of object proposals. For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map, which is fed into a sequence of fully connected layers producing the softmax probabilities and refined bounding-box positions for each class. YOLOv9 [62] combines two neural network architectures, CSPNet and ELAN, designed with gradient path planning in mind. Its Generalized Efficient Layer Aggregation Network (GELAN) enhances lightweight design, inference speed, and accuracy.
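As a concrete example of how such a baseline can be set up, the sketch below fine-tunes a Detectron2 RetinaNet on a COCO-style export of the PPE annotations. The dataset name, file paths, and backbone choice are illustrative assumptions; only the seven classes and the 20,000-iteration budget come from the paper (see Section 5.2).

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the PPE dataset exported in MS-COCO format (paths are assumptions).
register_coco_instances("ppe_train", {}, "annotations/ppe_train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("ppe_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.RETINANET.NUM_CLASSES = 7     # the seven PPE classes of Table 5
cfg.SOLVER.MAX_ITER = 20000             # RetinaNet iteration budget from Section 5.2
cfg.OUTPUT_DIR = "./output_ppe"

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```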
4.3 Evaluation Metrics
PPE detection models are evaluated using various metrics to accurately measure their performance. The Microsoft Common Objects in Context (MS-COCO) benchmark employs several common metrics, including Precision (P) and Recall (R).
Additionally, Mean Average Precision (mAP) assesses the detection accuracy across all classes. It is calculated by determining Precision (P) and Recall (R) for each class and then averaging these values to provide an overall score. The Intersection over Union (IoU) metric is used to measure the accuracy of object localization. IoU calculates the overlap between the ground truth bounding box (b_g) and the model's predicted bounding box (b_pred) as follows:

IoU = Area(b_pred ∩ b_g) / Area(b_pred ∪ b_g)

where b_g represents the ground truth bounding box and b_pred denotes the bounding box predicted by the model. AP50 (Average Precision at 50% IoU) is a metric used in object detection to evaluate the precision and recall of a model at a single Intersection over Union (IoU) threshold of 50%. This means that a detected object's bounding box is considered a true positive if its IoU with the ground-truth bounding box is at least 50%. AP50-95 evaluates the model's performance across multiple IoU thresholds. It averages the Average Precision (AP) scores calculated at ten different IoU thresholds: 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, and 95%. This range of thresholds provides a broader view of the model's ability to detect objects with varying degrees of overlap, from relatively loose (50%) to very strict (95%). The AP50-95 score is the mean of these AP values.
For the evaluation of the overall approach of PPE detection through action recognition, a clip-level metric is defined. In a 15-second clip, even if only 1 of the 15 sampled frames has any PPE violation detected, the entire clip is considered a clip having a violation.
Let N be the total number of frames in a clip. Let p_i be a boolean indicator for frame i, where p_i = 1 if there is at least one person detected without the required PPE in frame i, and p_i = 0 otherwise. Finally, V is a boolean indicator for the clip, where V = 1 if the clip is considered as having a violation, and V = 0 otherwise. The metric is defined as follows:

V = 1 if ∑_{i=1}^{N} p_i ≥ 1, and V = 0 otherwise.

This means that V = 1 (violation in the clip) if at least one frame i has p_i = 1 (indicating a detected PPE violation), and V = 0 (no violation in the clip) if all frames i have p_i = 0 (indicating no detected PPE violations).

4.4 Proposed Approach for Action Recognition-Based PPE Detection
To effectively monitor and ensure compliance with industrial safety protocols, we integrate our action recognition and PPE detection models into a single framework, presented in Algorithm 1. First, an action recognition model (SlowFast network) takes the input video feed (V) and is trained to identify actions and indicate the location of activities in the video.

Algorithm 1: Integrated Action Recognition and PPE Detection Framework
Input: Video footage V
Output: Detection of PPE violations and safety certification

Function PPEComplianceCheck_Train({F_first, F_middle, F_last}, B_info, PPE_List):
    for each frame F_i in {F_first, F_middle, F_last} do
        for each B_info,j do
            required_PPE ← PPE_List[a_j]
            for each p in required_PPE do
                if p not in B_PPE then
                    Mark as violation
                else
                    Certify safety compliance
            end
        end
    end

Function PPECheck_Inference(V):
    frames_info ← ActionRecognitionModel(V)
    frames ← {F_first, F_middle, F_last}
    for frame in frames do
        PPEDetectionModel(frame, frames_info)
    end
    PPEComplianceCheck_Train(frames, frames_info, PPE_List)

This information is stored in frames_info. These locations are fed into PPEDetectionModel (YOLOv9), which was trained on a custom dataset to identify PPE. Afterwards, a new module, PPEComplianceCheck_Train, has been designed to check PPE compliance. This module takes three frames F_first, F_middle, and F_last, and for each frame it checks for PPE in each B_info generated from the action recognition model. The detected PPE items are then checked against the list of PPE requirements, required_PPE, from the PPE_List dictionary. If a required PPE item is not among the detected PPE, the instance is marked as a violation; otherwise, it is certified as safety compliant.
PPECheck_Inference has been designed to take the video feed (V) as input, run the inference, and generate the compliance information. The YOLOv9 model determines whether the individual in the indicated area is wearing the appropriate PPE or not. If the person possesses the appropriate PPE for the activity, the system displays a message certifying their safety. If not, the system indicates that the individual is missing certain safety equipment. This strategy allows us to reduce false positives by looking only for the PPE designated for the detected activity.
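A compact Python rendering of this flow is sketched below. The model wrappers, the action-to-PPE mapping, and the detector output format are illustrative assumptions; only the overall control flow (pick three frames, detect PPE per person box, compare against the PPE required for the recognized action, and flag the clip if any frame is in violation) follows Algorithm 1 and the clip-level metric of Section 4.3.

```python
# Hypothetical action-to-required-PPE mapping; entries mirror the examples
# given in the introduction, not an exhaustive plant-specific policy.
PPE_LIST = {
    "walking": {"safety-helmet", "safety-shoes"},
    "material_handling": {"safety-helmet", "safety-shoes", "safety-glove"},
    "welding": {"safety-helmet", "safety-shoes", "safety-glove", "welding-helmet"},
}

def ppe_compliance_check(frames, boxes_info, ppe_detector):
    """Return (clip-level indicator V, per-person violations) for key frames."""
    violations = []
    for frame in frames:
        detected = ppe_detector(frame)                 # {person_id: {ppe classes}}
        for person_id, action in boxes_info.items():
            missing = PPE_LIST.get(action, set()) - detected.get(person_id, set())
            if missing:
                violations.append((person_id, action, missing))
    return int(bool(violations)), violations           # V = 1 if any frame violates

def ppe_check_inference(video, action_model, ppe_detector):
    boxes_info, key_frames = action_model(video)       # person boxes + actions
    frames = [key_frames[0], key_frames[len(key_frames) // 2], key_frames[-1]]
    return ppe_compliance_check(frames, boxes_info, ppe_detector)

# Toy usage with stand-in models: the welder below is missing a welding helmet.
fake_action_model = lambda v: ({"p1": "welding"}, ["f0", "f1", "f2"])
fake_detector = lambda f: {"p1": {"safety-helmet", "safety-shoes", "safety-glove"}}
print(ppe_check_inference("clip.mp4", fake_action_model, fake_detector)[0])  # 1
```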
5 EXPERIMENTS AND RESULTS
We conducted action recognition experiments using the SlowFast network on a real-life industry environment dataset, for which no previous datasets are available. We benchmarked our dataset against existing state-of-the-art models. All experiments were performed on a single RTX A6000 GPU with 48 GB VRAM and an Intel Xeon Gold 5318Y CPU @ 2.10 GHz × 96.

5.1 Action Recognition
For the SlowFast network, we utilized the implementation from the official repository [12] and trained the model for 50 epochs with fine-tuned hyperparameters. The results are summarized in Table 6 and Table 7. We observe that the Recall@Top-k metrics are high.
Figure 3: Combined action recognition and PPE detection for real-time safety compliance.

Clip-level violations are obtained from Algorithm 1. The test dataset of 109 videos, consisting of 54 videos with violations, is used. These clips were chosen such that the field of view was sufficient to get a good focus on the workers and activities. We also ensured coverage of different types of actions and violations. The results of these approaches are presented in Table 10.
Approach                        Precision   Recall   F1 Score
Common PPE                      0.62        0.55     0.59
All PPE                         0.51        0.56     0.54
Activity based PPE (1 frame)    0.60        0.93     0.73
Activity based PPE (2 frame)    0.64        0.83     0.72
Table 10: Precision, Recall and F1 Score for all three approaches.
While state-of-the-art models like YOLO are effective at detecting PPE, they tend to generate a high number of false positives. This is because not all PPE is required at all times; the necessity depends on the specific action being performed. For example, a person walking on a workshop floor does not need a welding helmet. We observe better precision when detecting common PPE (e.g., safety shoes, helmets) compared to detecting all PPE. This issue can be addressed through an action-based approach, which identifies the necessary PPE based on the activity being performed and checks only for those specific items. As shown in Table 10, the activity-based approach achieves a high recall rate of 93% by focusing on the relevant PPE for each action. By requiring at least two frames to determine a violation, false positives are further reduced, leading to an increase in precision.
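For reference, the 23% figure quoted in the abstract is consistent with the F1 scores in Table 10, taking the common-PPE baseline as the point of comparison: (0.73 - 0.59) / 0.59 ≈ 0.237, i.e., roughly a 23% relative improvement of the activity-based approach (1 frame) over the Common PPE baseline.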
In an industrial environment, safety must be the top priority. Therefore, our primary goal is to maximize the detection of videos with violations. As a result, we focus on optimizing the recall metric. All violations detected by the algorithm would be verified by the safety officer before initiating disciplinary action.

5.2 PPE Detection
Although there are datasets available for PPE detection, we are not aware of any real-life industry environment datasets. We benchmark the dataset with existing state-of-the-art models. The dataset for PPE detection mentioned in Table 5 is formatted in MS COCO style.
We utilized the implementations of RetinaNet and Fast R-CNN from Detectron2 [67] and trained them on our dataset for 20,000 and 27,000 iterations, respectively, using the default hyperparameters. For the YOLOv9 model [62], we used the official implementation and trained it for 120 epochs with default hyperparameters.

Classes            RetinaNet   FastRCNN (R101)   YOLOv9
no-safety-glove    56.9        66.1              67.1
no-safety-helmet   64.0        67.3              76.9
no-safety-shoes    49.4        54.4              66.4
safety-glove       37.1        39.9              51.5
safety-helmet      68.9        65.8              75.8
safety-shoes       56.2        58.5              70.8
welding-helmet     46.3        55.6              62.7
Table 8: Average Precision (AP50-95) score of various models on our dataset

6 DISCUSSION
The inference runs with an average time of 117.25 ms per frame and a total inference time of 1.76 seconds to process a 15-second video clip. During the process, CPU usage peaks at 2%, memory usage is 5.6%, and GPU usage peaks at 12%. The CCTV cameras used in the study record data at 12 fps, and we have tested that the above system (see Section 5) can process 25 video streams concurrently in real time.
One of the limitations of our dataset is having only a 2D RGB video feed. The dataset lacks depth and 3D video feeds, which could be used for a more comprehensive understanding of the working environment and context. For example, a person assisting the crane movement should not be directly below the crane and should maintain a certain distance from its path. This distance between the worker and the crane is difficult to measure accurately because camera positioning impacts the accuracy. We plan to deploy depth sensors at designated sites and collect the data to further augment the dataset.
The other issue is the field of view (FOV). The usual camera setup in an industrial environment is to cover the maximum area possible. Due to this, the field of view becomes large, and small objects such as gloves, glasses, and shoes on distant subjects are difficult to detect. One solution is to increase the number of cameras. From analyzing the videos, we have observed that a FOV of 20 metres is sufficient to get good accuracy. Practically, this approach also makes sense, as many large plants have shop floors covering huge areas and complete coverage might not be financially feasible.

7 CONCLUSION
In this paper, we present a novel dataset with dense spatio-temporal annotations designed to recognize industrial actions. This dataset sets itself apart from existing action detection datasets by offering a diverse and realistic collection of industrial environment clips, as well as comprehensive annotations for commonly performed industrial tasks. We also introduce an innovative approach for detecting PPE violations by integrating action recognition with object detection models. Our approach achieves a high recall at the clip level. Our work highlights the need for further research in Industrial Action Recognition (IAR) and aims to inspire continued exploration in application areas such as human workflow analysis.
REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. ArXiv (2016).
[2] Hafiz Mughees Ahmad and Afshin Rahimi. 2024. SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry.
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? ArXiv (2021).
[4] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. 2005. Actions as space-time shapes. Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 (2005).
[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2016. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 1302–1310.
[6] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 4724–4733. https://api.semanticscholar.org/CorpusID:206596127
[7] Shi Chen and Kazuyuki Demachi. 2020. A Vision-Based Approach for Ensuring Proper Use of Personal Protective Equipment (PPE) in Decommissioning of Fukushima Daiichi Nuclear Power Station. Applied Sciences (2020).
[8] Mejdi Dallel, Vincent Havard, David Baudry, and Xavier Savatier. 2020. InHARD - Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. In 2020 IEEE International Conference on Human-Machine Systems (ICHMS).
[9] Mejdi Dallel, Vincent Havard, David Baudry, and Xavier Savatier. 2020. InHARD - Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. 2020 IEEE International Conference on Human-Machine Systems (ICHMS) (2020).
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv (2020).
[12] Haoqi Fan, Yanghao Li, Bo Xiong, Wan-Yen Lo, and Christoph Feichtenhofer. 2020. PySlowFast. https://github.com/facebookresearch/slowfast.
[13] Christoph Feichtenhofer. 2020. X3D: Expanding Architectures for Efficient Video Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
[14] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2018. SlowFast Networks for Video Recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018).
[15] Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[16] Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A. Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. 2017. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017).
[17] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
[18] Haroon Idrees, Amir Zamir, Yu-Gang Jiang, Alexander N. Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2016. The THUMOS challenge on action recognition for videos "in the wild". ArXiv (2016).
[19] Francesco Iodice, Elena De Momi, and Arash Ajoudani. 2022. HRI30: An Action Recognition Dataset for Industrial Human-Robot Interaction. In 2022 26th International Conference on Pattern Recognition (ICPR).
[20] Francesco Iodice, Elena De Momi, and Arash Ajoudani. 2022. HRI30: An Action Recognition Dataset for Industrial Human-Robot Interaction. 2022 26th International Conference on Pattern Recognition (ICPR) (2022).
[21] Velibor Isailović, Aleksandar Peulić, Marko Djapan, Marija Savković, and Arso M. Vukicevic. 2022. The compliance of head-mounted industrial PPE by using deep learning object detectors. Scientific Reports (2022).
[22] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards Understanding Action Recognition. 2013 IEEE International Conference on Computer Vision (2013).
[23] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014).
[24] Yan Ke, Rahul Sukthankar, and Martial Hebert. 2005. Efficient visual event detection using volumetric features. Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 (2005).
[25] Kyunghwan Kim, Kangeun Kim, and Soyoon Jeong. 2023. Application of YOLO v5 and v8 for Recognition of Safety Risk Factors at Construction Sites. Sustainability 15 (10 2023), 15179.
[26] Yu Kong and Yun Raymond Fu. 2018. Human Action Recognition and Prediction: A Survey. International Journal of Computer Vision 130 (2018), 1366–1401. https://api.semanticscholar.org/CorpusID:49551723
[27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV).
[28] Hilde Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso A. Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision (2011).
[29] Ivan Laptev. 2005. On Space-Time Interest Points. International Journal of Computer Vision 64 (2005), 107–123. https://api.semanticscholar.org/CorpusID:2619278
[30] Luziwei Leng, Kaiwei Che, Kaixuan Zhang, Jianguo Zhang, Qinghu Meng, Jie Cheng, Qinghai Guo, and Jianxing Liao. 2022. Differentiable hierarchical and surrogate gradient search for spiking neural networks. In Advances in Neural Information Processing Systems.
[31] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. ArXiv (2020).
[32] Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The AVA-Kinetics Localized Human Actions Video Dataset. arXiv:2005.00214 [cs.CV]
[33] W. Li, Zhengyou Zhang, and Zicheng Liu. 2010. Action recognition based on a bag of 3D points. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (2010).
[34] Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, and Limin Wang. 2021. MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021).
[35] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[36] Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. CVPR 2011 (2011), 3337–3344. https://api.semanticscholar.org/CorpusID:9119671
[37] W. Liu, Dragomir Anguelov, D. Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2015. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision.
[38] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
[39] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. 2015. The KIT whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR).
[40] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. 2009. Actions in context. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009).
[41] Naval Kishore Mehta, Shyam Sunder Prasad, Sumeet Saurav, Ravi Saini, and Sanjay Singh. 2024. IAR-Net: A Human–Object Context Guided Action Recognition Network for Industrial Environment Monitoring. IEEE Transactions on Instrumentation and Measurement 73 (2024), 1–8.
[42] Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. 2007. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition. 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007), 1–8. https://api.semanticscholar.org/CorpusID:7117722
[43] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber. 2007. Documentation Mocap Database HDM05. Technical Report CG-2007-2. Universität Bonn.
[44] Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. 2019. Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-based Optimization to Spiking Neural Networks. IEEE Signal Processing Magazine 36 (2019), 51–63.
[45] njvisionpower. 2024. Safety-Helmet-Wearing-Dataset. GitHub (2024).
[46] Munkh-Erdene Otgonbold, Munkhjargal Gochoo, Fady S. Alnajjar, Luqman Ali, Tan-Hsu Tan, Jun-Wei Hsieh, and Ping-Yang Chen. 2022. SHEL5K: An Extended Dataset and Benchmarking for Safety Helmet Detection. Sensors (Basel, Switzerland) 22 (2022).
[47] Paul Over, Jon Fiscus, Gregory Sanders, David Joy, Martial Michel, George Awad, Alan Smeaton, Wessel Kraaij, and Georges Quénot. 2014. TRECVID 2014 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics.
[48] Qinhan Xiao, Qiankun Xiao, and Junfeng Li. 2013. Human Motion Capture Data Retrieval Based on Quaternion and EMD. In International Conference on Intelligent Human-Machine Systems and Cybernetics.
[49] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137–1149.
[50] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008).
[51] M. S. Ryoo and L. Matthies. 2013. First-Person Activity Recognition: What Are They Doing to Me?. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional SIFT descriptor and its application to action recognition. Proceedings of the 15th ACM International Conference on Multimedia (2007). https://api.semanticscholar.org/CorpusID:1087061
[53] Javen Qinfeng Shi, Li Cheng, Li Wang, and Alex Smola. 2011. Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models. International Journal of Computer Vision 93 (2011), 22–32. https://api.semanticscholar.org/CorpusID:9054863
[54] Gunnar A. Sigurdsson, Gül Varol, X. Wang, Ali Farhadi, Ivan Laptev, and Abhinav Kumar Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In European Conference on Computer Vision.
[55] Khurram Soomro, Amir Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. ArXiv (2012).
[56] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv:1212.0402
[57] Moritz Tenorth, Jan Bandouch, and Michael Beetz. 2009. The TUM Kitchen Data Set of everyday manipulation activities for motion tracking and action recognition. 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops (2009).
[58] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2014. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV) (2014), 4489–4497. https://api.semanticscholar.org/CorpusID:1122604
[59] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2017. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017).
[60] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Neural Information Processing Systems.
[61] Arso M. Vukicevic, Marko Djapan, Velibor Isailović, Danko Z. Milasinovic, Marija Savković, and Pavle Miloević. 2022. Generic compliance of industrial PPE by using deep learning techniques. Safety Science (2022).
[62] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. 2024. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv preprint arXiv:2402.13616 (2024).
[63] Heng Wang, Dan Oneaţă, Jakob J. Verbeek, and Cordelia Schmid. 2015. A Robust and Efficient Video Representation for Action Recognition. International Journal of Computer Vision 119 (2015), 219–238. https://api.semanticscholar.org/CorpusID:11491197
[64] Zijian Wang, Yimin Wu, Lichao Yang, Arjun Thirunavukarasu, Colin Evison, and Yifan Zhao. 2021. Fast Personal Protective Equipment Detection for Real Construction Sites Using Deep Learning Approaches. Sensors (Basel, Switzerland) (2021).
[65] Yuyang Wanyan, Xiaoshan Yang, Weiming Dong, and Changsheng Xu. 2024. A Comprehensive Review of Few-shot Action Recognition.
[66] Jixiu Wu, Nian Cai, Wenjie Chen, Huiheng Wang, and Guotian Wang. 2019. Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Automation in Construction (2019).
[67] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https://github.com/facebookresearch/detectron2.
[68] Fan Yang. 2022. CustomAva. https://github.com/Whiffe/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset.
[69] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2015. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. International Journal of Computer Vision (2015).
[70] Fusheng Yu, Xiaoping Wang, Jiang Li, Shaojin Wu, Junjie Zhang, and Zhigang Zeng. 2023. Towards Complex Real-World Safety Factory Inspection: A High-Quality Dataset for Safety Clothing and Helmet Detection. ArXiv (2023).
[71] Junsong Yuan, Zicheng Liu, and Ying Wu. 2009. Discriminative subvolume search for efficient action detection. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009).
[72] Faishal Zhafran, Endah Suryawati Ningrum, Mohamad Nasyir Tamara, and Eny Kusumawati. 2019. Computer Vision System Based for Personal Protective Equipment Detection, by Using Convolutional Neural Network. 2019 International Electronics Symposium (IES) (2019).
[73] Sijie Zhu, Taojiannan Yang, Matías Mendieta, and Chen Chen. 2020. A3D: Adaptive 3D Networks for Video Action Recognition. ArXiv (2020).