
A Lidar-based computer vision system for

monitoring of patients in medical scenes


Xuan Wu
Beijing Institute of Technology
Haiyang Zhang (  [email protected] )
Beijing Institute of Technology
Chunxiu Kong
Beijing Institute of Technology
Yuanze Wang
Beijing Institute of Technology
Yezhao Ju
Beijing Institute of Technology
Changming Zhao
Beijing Institute of Technology

Research Article

Keywords: Patient monitoring, computer vision, Lidar, privacy-safe

Posted Date: April 6th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2760999/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


A Lidar-based computer vision system for monitoring of patients in medical scenes


Xuan Wu1,2, Haiyang Zhang1,2,3*, Chunxiu Kong1,2, Yuanze Wang1,2, Yezhao Ju1,2, Changming Zhao1,2,3
1 School of Optics and Photonics, Beijing Institute of Technology, Beijing, 100081, China.
2 Key Laboratory of Photoelectronic Imaging Technology and System, Beijing Institute of Technology, Beijing, 100081, China.
3 Key Laboratory of Information Photon Technology, Ministry of Industry and Information Technology, Beijing, 100000, China.
* Corresponding author(s). E-mail(s): [email protected]
Contributing authors: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Abstract: Long-term monitoring of patients can improve patient safety and help doctors diagnose and evaluate the clinical situation. Limited manpower in hospitals makes continuous and nuanced monitoring difficult to achieve. In this paper, we classify the patient's posture as standing, sitting, lying and falling. Using a non-intrusive, privacy-compliant lidar, a medical human pose dataset is collected in the First Affiliated Hospital, Sun Yat-Sen University, and a novel computer vision-based approach is presented to continuously detect patients' poses and provide timely information to health care workers. The recognition accuracy reaches 93.46% and the recognition speed reaches 42 FPS on an RTX 3080 Ti. Experimental results show that the algorithm performs well on the medical human pose dataset and can effectively solve the problem of human pose recognition in medical scenes.
Keywords: Patient monitoring, computer vision, Lidar, privacy-safe

Statements and Declarations


Competing Interests: The authors declare no conflicts of interest.
Ethical Approval and Consent to Participate: All data were gathered and analyzed after obtaining approval from the First Affiliated Hospital, Sun Yat-Sen University, and consent from the volunteers who participated.
Consent for publication: All the authors of this paper have consented to its publication in the Journal of Medical Systems.
Acknowledgments: The authors appreciate the First Affiliated Hospital, Sun Yat-Sen University
for providing the testing environment. We also appreciate Zhen Zhang and Jianfei Lai of Huawei
Inspiration Lab for their technical support.
Funding: This work is supported by the fund of Huawei Inspiration Lab.
Author contributions: All authors contributed to the study conception and design. Material
preparation, data collection and labeling were performed by Xuan Wu, Chunxiu Kong, Yuanze
Wang and Yezhao Ju. Technical guidance and agreement support were provided by Haiyang Zhang
and Changming Zhao. The first draft of the manuscript was written by Xuan Wu and all authors
commented on previous versions of the manuscript. All authors read and approved the final
manuscript.

1. Introduction
Health monitoring is an important basis for evaluating patients' health status, and timely acquisition of patient activity is conducive to doctors' judgment of the illness and formulation of a rehabilitation plan. Traditional detection methods such as human observation [1] require a lot of manpower and time, yet healthcare workers are often overworked, and hospitals understaffed and resource-limited [2, 3]. Computer vision technology is therefore of great significance for reducing the burden on the medical system and improving the quality of medical services.
Human action recognition technology based on computer vision can quickly and accurately obtain human activity information, and has been applied in many fields [4–7]. Some methods combine RGB cameras and deep learning algorithms to recognize human movements [8–12], but because of privacy concerns these approaches are more suitable for public settings and cannot be applied in hospitals. With the development of sensing technology, depth sensors are increasingly applied in production and daily life. Compared with an RGB camera, a depth sensor can obtain 3D information about the scene, is not sensitive to lighting changes or cluttered backgrounds, and can well protect people's privacy. Combining depth cameras with motion recognition algorithms has been used in ICU patient activity monitoring [13, 14], home health monitoring for seniors [15], and so on, but these prior works require additional sensors to ensure accurate results. Lidar-acquired 3D point clouds provide higher spatial accuracy and have promising applications in computer vision tasks, offering simple and precise solutions for monitoring in medical scenarios, and lidar-based human action recognition is gaining attention [16–19]. So far, an automated, vision-based action monitoring system that is simple to use and provides timely feedback and reliable results in the medical scene has not been developed.
In this paper, we propose a monitoring system based on Lidar and a 3D human action recognition algorithm, as shown in Fig.1. The Lidar is deployed in the ward to obtain the patient's posture and movement information in real time and to alert the medical staff promptly when danger occurs, so as to improve the quality of medical services. Human poses are complex and diverse, and the occlusion caused by ward facilities as well as the practical deployment constraints need to be considered. Aiming at these problems, this paper classifies human actions in medical scenes into four categories: standing, sitting, lying and falling. The 3D human pose estimation and action recognition network consists of two stages: an improved anchor-to-joint regression network is used to estimate the keypoints of the human body, and an SVM classifier is then used to classify the action based on manually extracted skeleton features. The method uses the information of multi-scale feature maps to achieve stronger depth perception and better keypoint estimation performance. The manually designed skeleton features are well suited to the characteristics of human actions in medical scenes, which greatly reduces the amount of calculation and shortens the inference time while maintaining recognition accuracy; the recognition speed is 42 FPS on an NVIDIA GeForce RTX 3080 Ti and the recognition accuracy is 93.46%. Compared with existing human action recognition methods, the proposed method can better describe human poses in medical scenes and is suitable for deployment.

Fig.1 Lidar-based patient monitoring system in medical scene

2. Methods
The proposed method consists of a pose estimation module and an action recognition module. Fig.2 shows the overall network framework; the network input is a 288×288 depth image. The 3D pose estimation module consists of a backbone network and two branches, the keypoints of the human body are located based on the idea of anchor-weighted voting, and the obtained 3D human skeleton is composed of 15 keypoints (as shown in Fig.3). The action recognition module extracts and classifies the skeleton features according to the 3D human skeleton and finally generates the action recognition results.

Fig. 2 Proposed human pose estimation and action recognition network

Fig.3 3D skeleton of human body

2.1 Data preprocessing and scene parameter calculation

Intel Realsense L515 is used as the data acquisition device. L515 is a small lidar camera for
indoor scenes, which can collect point clouds, infrared images and visible light images. The point
clouds collected by L515 are mapped to depth images as experimental samples, and point clouds
are used for keypoint annotation in order to obtain more accurate data. Point cloud labeling is labor-
intensive and error-prone, mapping the same frame of point cloud into depth maps of different views
can greatly reduce the labeling time and expand the dataset. labelimg is used to mark the boundary
box of the depth map, the part surrounded by the boundary box is cropped out, and it is filled into a
1:1 size image with pixel 0, so that the human body keeps the original proportion, and prevents
excessive deformation of the human body when the image is reshaped in the network, which can
enhance the robustness of the network for different human postures.
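As an illustration of this cropping-and-padding step, a minimal sketch is given below (NumPy; the function name and the centering of the crop are our assumptions, not the authors' code):

```python
import numpy as np

def crop_and_pad_square(depth: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the body region given by box = (x_min, y_min, x_max, y_max) and
    pad it with zero-valued pixels into a 1:1 image, preserving the original
    aspect ratio of the human body."""
    x_min, y_min, x_max, y_max = box
    crop = depth[y_min:y_max, x_min:x_max]
    h, w = crop.shape
    side = max(h, w)
    square = np.zeros((side, side), dtype=depth.dtype)
    # Center the crop inside the square canvas; the remaining pixels stay 0.
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop
    return square  # later resized to 288x288 before entering the network
```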

Fig.4 Deployment schematic for the ward scenario

The location of the lidar deployment should avoid occlusion by medical equipment and medical personnel as far as possible; it is not feasible to shoot from the ideal frontal position, and tilted shooting may affect the judgment of movement. Fig.4 shows the schematic diagram of the deployment in the ward: the L515 lidar is fixed on the wall with the optical axis tilted downward, the origin of the lidar coordinate system is the center of the camera, and the direction of the z axis is consistent with that of the optical axis. When the human body appears in an oblique attitude in the lidar camera coordinate system, the world coordinate system can express the human body posture more accurately. A transformation matrix is therefore calculated to realize the transformation from the camera coordinate system to the world coordinate system.
Random Sample Consensus (RANSAC) is used to fit the ground plane of the captured scene, and the plane equation of the ground can be expressed as Eq. (1):

$ax + by + cz + d = 0$  (1)

Then, the unit normal vector of the ground in the lidar coordinate system is expressed as Eq. (2):

$\mathbf{r}_{lidar} = \dfrac{(a,\, b,\, c)}{\sqrt{a^2 + b^2 + c^2}}$  (2)

The normal vector of the ground in the world coordinate system is $\mathbf{r}_{ground} = (0, 0, 1)$ and $\mathbf{r} = \mathbf{r}_{lidar} \times \mathbf{r}_{ground}$; the transformation matrix $R$ is expressed as Eq. (3):

$R = I\cos\theta + (1 - \cos\theta)\,\mathbf{r}\mathbf{r}^{T} + \sin\theta \begin{pmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{pmatrix}$  (3)

where $I$ represents the identity matrix and $\theta$ is the angle between $\mathbf{r}_{lidar}$ and $\mathbf{r}_{ground}$. The keypoints of the human body output by the 3D pose estimation module are transformed into the world coordinate system through this matrix, which is more convenient for feature extraction.

2.2 3D pose estimation module

The 3D pose estimation module locates the 3D coordinates of 15 keypoints of the human body to form a skeleton model, which is used as the input of action classification. Referring to the work [20], we set anchor points evenly over the image. We first estimate the offsets between the anchor coordinates and the keypoint coordinates as well as the response of each anchor, and then convert the anchor responses into anchor weights; the coordinate offsets and weights are combined by weighted voting to calculate the coordinates of the keypoints. The size of the input depth image is 288×288, ResNet-50 [21] is used as the backbone network, and there are three branches that estimate the in-plane coordinate offsets of the keypoints, the depth coordinates, and the anchor responses, respectively.
Most existing human pose depth map datasets have relatively ideal shooting angles, with few cases of occlusion or self-occlusion, an empty background, and a clear human body outline, and the work [20] achieved good results on the ITOP dataset. However, the space of a ward is limited and the medical instruments and equipment are numerous; the occlusion and complex background interfere with human pose recognition, and the human body point cloud obtained from an oblique shooting angle loses some information compared with frontal shooting. The original network therefore falls short when applied to the medical scene, and this paper makes the following improvements: 1) When the depth difference between the human body and the background is limited, the original network structure has a large deviation in predicting the depth coordinates of the keypoints. Merging the in-plane offset estimation branch and the depth estimation branch into a single 3D offset branch makes full use of the 3D characteristics of the depth map, reducing errors as well as the amount of calculation. 2) The human body is highly flexible, poses are complex and diverse, and the lack of color detail in the depth map is not conducive to accurately predicting keypoint coordinates in complex poses. The shallow feature map contains more low-level information such as target shape and texture; concatenating the shallow feature map obtained from convolutional layer 3 of the backbone with the feature map obtained from convolutional layer 4 enhances the ability to locate keypoints of the human body in different positions, and also improves the positioning accuracy for the slender parts of the limbs. The improved network structure is shown in Fig.5.
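As a rough illustration of improvement 2), the following PyTorch sketch fuses the layer-3 and layer-4 feature maps of a torchvision ResNet-50 (the module name, the bilinear upsampling choice, and the replication of the depth channel are our assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn
import torchvision

class FusedBackbone(nn.Module):
    """ResNet-50 backbone that concatenates the shallow layer3 feature map
    (stride 16) with the upsampled layer4 feature map (stride 32)."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                  r.layer1, r.layer2)
        self.layer3 = r.layer3   # 1024 channels, 1/16 resolution
        self.layer4 = r.layer4   # 2048 channels, 1/32 resolution

    def forward(self, x):
        # x: (B, 3, 288, 288) depth map replicated to 3 channels (an assumption)
        x = self.stem(x)
        f3 = self.layer3(x)                      # (B, 1024, 18, 18)
        f4 = self.layer4(f3)                     # (B, 2048, 9, 9)
        f4 = nn.functional.interpolate(f4, size=f3.shape[-2:],
                                       mode='bilinear', align_corners=False)
        return torch.cat([f3, f4], dim=1)        # (B, 3072, 18, 18) fused features
```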

Fig.5 3D pose estimation network

The work [20] constructs two loss functions based on the positioning of anchor points and keypoints. The informative anchor point surrounding loss is used to locate the anchor points around the keypoints, and then the weights of the anchor points and the coordinate offsets between the keypoints and the anchors are used to calculate the loss of the keypoints. The expression of the informative anchor point surrounding loss is given by Eq. (4):

$\mathrm{loss}_1 = \sum_{j \in J} L_\tau\!\left(\sum_{a \in A} \tilde{P}_j(a)\,S(a) - T_j\right)$  (4)

where $\tilde{P}_j(a)$ is the weight of anchor point $a$ for keypoint $j$, $S(a)$ is the coordinate of anchor point $a$, $T_j$ is the ground-truth position of keypoint $j$, and $L_\tau(x)$ is the smooth-L1-like loss function [22] given by Eq. (5):

$L_\tau(x) = \begin{cases} \dfrac{1}{2\tau}x^2, & |x| \le \tau \\ |x| - \dfrac{\tau}{2}, & |x| > \tau \end{cases}$  (5)

where $\tau$ is 1. When calculating the keypoint locating loss, the depth value is expanded to 50 times the actual depth value (m), so that the plane coordinates and depth coordinates are of the same order of magnitude. Because the wrist, ankle and other limb parts occupy only a small number of pixels, their positioning accuracy is limited, which makes the loss difficult to decrease further in the later stage of training; moreover, the action classification mainly depends on the information of the trunk keypoints. We therefore add a weight factor to the keypoint positioning loss: the keypoints of the trunk are assigned a higher weight and the other keypoints a lower weight. The improved keypoint positioning loss is given in Eq. (6):

$\mathrm{loss}_2 = \sum_{j \in J} \omega_j L_\tau\!\left(\sum_{a \in A} \tilde{P}_j(a)\,\big(S(a) + O_j(a)\big) - T_j\right)$  (6)

where $\omega_j$ is the weight factor of keypoint $j$, which is set to 1.5 for keypoints 0, 1, 8, 9, 10, to 1.2 for keypoints 2, 3, 11, 12, and to 0.8 for the other keypoints. $O_j(a)$ represents the coordinate offset from anchor point $a$ to keypoint $j$. In order to further constrain the positions of keypoints, a bone length loss is introduced: the difference between the skeleton length formed by the predicted keypoints and the actual skeleton length is optimized to reduce abnormal keypoint locations. $L_{m,n}$ is defined as the bone length formed by keypoints $m$ and $n$, and the bone length loss is expressed as Eq. (7):

$\mathrm{loss}_3 = \sum_{m,n \in \{0,1,\dots,14\}} \omega_l \left| L^{truth}_{m,n} - L^{pred}_{m,n} \right|$  (7)

where $\omega_l$ is the weight factor of the bone length, which is set to 1.2 for $L_{0,1}$, $L_{1,2}$, $L_{1,3}$, $L_{1,8}$, $L_{8,9}$, $L_{8,10}$, and 0.8 for the rest. The total loss of the pose estimation network is expressed as Eq. (8):

$\mathrm{loss} = \mathrm{loss}_1 + \lambda\,\mathrm{loss}_2 + \mathrm{loss}_3$  (8)

where $\lambda = 1.5$ is the factor that balances the three losses.
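For concreteness, a compact PyTorch sketch of the loss terms in Eqs. (5)–(7) is given below (the tensor layout, helper names and the omission of the anchor-response normalization are our assumptions):

```python
import torch

TAU = 1.0  # tau in Eq. (5)

# w_j in Eq. (6): 1.5 for keypoints 0,1,8,9,10; 1.2 for 2,3,11,12; 0.8 otherwise
JOINT_W = torch.tensor([1.5, 1.5, 1.2, 1.2, 0.8, 0.8, 0.8, 0.8,
                        1.5, 1.5, 1.5, 1.2, 1.2, 0.8, 0.8])

def smooth_l1_like(x, tau=TAU):
    """L_tau(x) from Eq. (5)."""
    absx = x.abs()
    return torch.where(absx <= tau, x ** 2 / (2 * tau), absx - tau / 2)

def keypoint_loss(weights, anchors, offsets, targets):
    """Eq. (6): anchor-weighted keypoint positioning loss.
    weights: (B, A, J) normalized anchor responses P~_j(a)
    anchors: (A, 3)    anchor coordinates S(a)
    offsets: (B, A, J, 3) predicted anchor-to-keypoint offsets O_j(a)
    targets: (B, J, 3) ground-truth keypoints T_j
    """
    pred = (weights.unsqueeze(-1) *
            (anchors[None, :, None, :] + offsets)).sum(dim=1)    # (B, J, 3)
    err = smooth_l1_like(pred - targets).sum(dim=-1)             # sum over x, y, z
    return (JOINT_W.to(err.device) * err).sum(dim=-1).mean()

def bone_length_loss(pred, target, bones, bone_w):
    """Eq. (7): weighted |L_truth - L_pred| over a list of (m, n) bone pairs."""
    m, n = map(list, zip(*bones))
    l_pred = (pred[:, m] - pred[:, n]).norm(dim=-1)
    l_true = (target[:, m] - target[:, n]).norm(dim=-1)
    return (bone_w.to(l_pred.device) * (l_true - l_pred).abs()).sum(dim=-1).mean()

# total loss, Eq. (8): loss = loss1 + 1.5 * loss2 + loss3
```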

2.3 Action classification module

Keypoints are important for describing human posture and behavior. When people assume different postures, the keypoints and joint vectors show different spatial information; tracking the coordinates of the joints and exploring the spatio-temporal geometric angle information of the joint vectors can provide very direct and reliable information for human action recognition. In the action classification module, we use hand-designed skeleton features to train the classifier and complete the action recognition.

$B = (J, V)$ is used to represent the skeleton information of the human body, where $J$ is the set of joint positions and $V$ is the set of joint vectors. According to the transformation matrix in Section 2.1, the coordinates of the keypoints are converted into the world coordinate system as Eq. (9):

$J = J_{lidar} \cdot R + H_{lidar}$  (9)

where $H_{lidar}$ represents the vertical distance between the lidar and the ground, and the world coordinate of the $k$-th keypoint of the human body is expressed as Eq. (10):

$J_k = (x, y, z)$  (10)

Coordinate $z$ represents the height of the keypoint above the ground. The joint vectors of the effective limb parts are extracted, and the joint vector composed of keypoints $n$ and $m$ is expressed as Eq. (11):

$V_{n,m} = J_n - J_m$  (11)

When the human body in the scene assumes a certain posture, each joint vector forms a certain position and angle with the scene. The angle between two joint vectors connected by the same joint also provides important information for describing human behavior, so computing the direction cosine features and the angle cosine features of the joints is conducive to describing the human posture. The trunk of the human body changes most obviously between postures, while the movement of the limbs is more flexible and does not clearly distinguish different postures. We therefore extract eight groups of features based on the trunk keypoints to describe the posture of the human body, including:

1) Four joint-angle cosine features: $(V_{1,8}, V_{9,11})$, $(V_{1,8}, V_{10,12})$, $(V_{9,11}, V_{11,13})$, $(V_{10,12}, V_{12,14})$

2) Three direction cosine features with respect to the ground normal: $(V_{1,8}, r_{ground})$, $(V_{9,11}, r_{ground})$, $(V_{10,12}, r_{ground})$

3) A height feature:

$H_{8,9,10} = \dfrac{J_8(z) + J_9(z) + J_{10}(z)}{3}$
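A minimal sketch of computing this eight-dimensional feature vector from the 15 world-frame keypoints is given below (NumPy; the function names are ours, and the keypoint indices follow the pairs listed above):

```python
import numpy as np

GROUND_NORMAL = np.array([0.0, 0.0, 1.0])   # r_ground in the world frame

def cos_angle(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def skeleton_features(J):
    """Build the 8-dimensional feature vector from world-frame keypoints
    J: (15, 3) array, J[k] = (x, y, z) with z = height above the ground."""
    V = lambda n, m: J[n] - J[m]            # joint vector V_{n,m}, Eq. (11)
    feats = [
        # 1) four joint-angle cosine features
        cos_angle(V(1, 8), V(9, 11)),
        cos_angle(V(1, 8), V(10, 12)),
        cos_angle(V(9, 11), V(11, 13)),
        cos_angle(V(10, 12), V(12, 14)),
        # 2) three direction cosine features w.r.t. the ground normal
        cos_angle(V(1, 8), GROUND_NORMAL),
        cos_angle(V(9, 11), GROUND_NORMAL),
        cos_angle(V(10, 12), GROUND_NORMAL),
        # 3) height feature H_{8,9,10}
        (J[8, 2] + J[9, 2] + J[10, 2]) / 3.0,
    ]
    return np.array(feats, dtype=np.float32)
```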
After extracting the skeleton features to form samples, an appropriate classifier must be selected to train on them. SVM is a binary classifier proposed by Vapnik [23], which has a well-developed mathematical theory and is often used for data classification and regression prediction. In this paper, the nonlinear multidimensional Support Vector Classifier (SVC) is selected according to the actual situation. SVC allows a more robust model to be established at the cost of a small number of classification errors, which yields more stable classification performance for complex human poses.
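A corresponding training sketch using scikit-learn's nonlinear SVC is shown below (the RBF-kernel hyperparameters and the feature/label file names are illustrative placeholders, not the authors' settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# X: (N, 8) skeleton feature vectors; y: labels {0: stand, 1: sit, 2: lie, 3: fall}
X, y = np.load('features.npy'), np.load('labels.npy')   # hypothetical files
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```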

3. Results

3.1 Dataset and experimental settings

The data acquisition equipment used in the experiment is the Intel RealSense L515 lidar. Standing, sitting, lying and falling postures of people of different genders, ages and body types are collected in the simulated ward of the laboratory and in a ward of the First Affiliated Hospital of Sun Yat-sen University, respectively. After labeling and data augmentation, a total of 3552 samples are obtained to form the medical human pose dataset; 80% of them are used as training samples and 20% as testing samples.
The experimental system environment is Windows 10 64-bit, with an Intel Core i7-10700 CPU and an NVIDIA GeForce RTX 3080 Ti. The implementation is based on the PyTorch 1.10 framework, with CUDA 11.3.1 and cuDNN 8.2.1, and the size of the input image is 288×288. Training is carried out by fine-tuning for a total of 300 epochs: the first 10 epochs are used to warm up the learning rate, the remaining 290 epochs adjust the learning rate with the Adam optimizer, and the model with the best performance during training is saved.

3.2 Pose estimation results

The mean average precision (mAP) [24, 25] with a 10 cm threshold is used as the evaluation metric to compare the performance of our proposed method with several mainstream 3D pose estimation methods on the medical human pose dataset. Tab.1 compares the accuracy of the different methods for estimating human keypoints, and the percentage of successful frames over different error thresholds is given in Fig. 6. The method proposed in this paper achieves good results. The use of shallow feature maps provides more information about shape, texture and so on, which significantly improves the positioning accuracy of the human limbs; the mAP of each part is no less than 70%. Merging the plane offset estimation branch and the depth branch makes full use of the 3D information of the depth image and improves the depth prediction of the keypoints, and the improved loss function further improves the accuracy of the trunk keypoints; the mAP reaches 83.37%, which is 3.9% higher than the original A2J. The improved method achieves an inference speed of 184.69 frames/s on the 3080 Ti, far faster than the other methods; compared with the original A2J, merging the network branches reduces the amount of calculation and increases the inference speed by 9.54 FPS. Experiments show that the proposed method has better performance in both accuracy and speed.
Fig. 7 shows the estimation results of the proposed 3D pose estimation network for different human poses in the ward scene. In the field of view of the lidar, there is more self-occlusion when people are standing, sitting or lying on their side; when people fall or lie down, the depth difference between their limbs and the background is small and the outline of the body is not clear; when people are lying on the bed covered with a quilt, the positions of their limbs are even more difficult to identify, which brings great challenges to pose estimation. Our proposed method fully considers the above problems and can effectively estimate the positions of keypoints in various human poses. Experiments show that the proposed method can be applied to human pose estimation in medical scenarios.

Tab.1 Comparison of results of different pose estimation methods on the medical human pose dataset (mAP, %)

Keypoint    Towards [26]  Integral [27]  V2V [28]  GAST [29]  A2J [20]  Ours
Head           80.08         91.01        88.68     87.14      88.29    91.14
Neck           81.22         90.30        92.37     89.71      88.86    92.43
Shoulders      74.43         89.52        89.60     83.58      83.00    87.86
Elbows         42.65         71.13        79.86     76.16      75.14    81.71
Wrists         29.09         66.24        67.63     66.72      66.29    71.72
Torso          82.54         86.60        91.84     88.28      91.43    90.29
Hips           77.61         72.89        80.92     84.86      78.14    85.57
Knees          68.87         68.36        81.19     80.71      75.01    84.55
Ankles         56.49         59.95        70.92     71.57      68.29    77.43
Average        65.89         74.94        80.88     79.47      78.11    83.37

Fig.6 The percentage of success frames over different error thresholds.

Fig.7 Results of pose estimation. The red part represents the right side of the body and the blue
part represents the left side

3.3 Action recognition results

The human skeleton obtained by the pose estimation module is used as input to test the
recognition accuracy of the proposed action recognition method for human standing, sitting, lying,
and falling. As shown in Fig. 8, successful pose estimation is beneficial for action classification, and
the action classification module can still output correct recognition results when some keypoints are
incorrectly located.

Fig. 8 Action recognition results on the medical human pose dataset

According to the confusion matrix of recognition results on the medical human pose test set shown in Fig. 9, the precision, recall and F1 score are calculated as the performance indicators of the algorithm; the calculation expressions are given in Eqs. (12), (13) and (14) respectively.

$\mathrm{precision} = \dfrac{TP}{TP + FP}$  (12)

$\mathrm{recall} = \dfrac{TP}{TP + FN}$  (13)

$F1 = 2 \times \dfrac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (14)
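These per-class metrics can be computed directly from the confusion matrix; a minimal NumPy sketch follows (the function name and the class ordering are ours):

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """cm[i, j] = number of samples with true class i predicted as class j.
    Returns per-class precision, recall and F1, following Eqs. (12)-(14)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```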

The calculation results are shown in Tab.2. The average precision and recall of the four types of actions are both above 90%, and the F1 score is 0.93. The precision for standing reaches 95% thanks to less occlusion of the human body and a large difference from the background, and the recall for falling reaches 96.75%, indicating that the skeleton features we design capture the patients' posture characteristics well. The experiments prove that the method proposed in this paper, which combines hand-designed features with an SVM classifier, is simple and effective, and is suitable for action recognition in medical scenes.

Fig. 9 Confusion matrix of test results for action recognition

Tab.2 Recognition performance of the proposed method for different poses

Pose          Stand    Sit      Lie      Fall     Average
precision(%)  95.00    92.57    91.95    94.30    93.46
recall(%)     95.00    89.01    93.37    96.75    93.53
F1 score      0.95     0.91     0.93     0.96     0.93

The training of the SVM is fast and its inference time is on the order of microseconds. The whole algorithm achieves a running speed of about 42 frames per second on the 3080 Ti; it takes only 24 ms to process a depth image, including 15 ms for image reading and preprocessing. The algorithm achieves a balance between accuracy and speed and is ready to be deployed in medical scenarios.

4. Conclusion
In this paper, we propose a lidar-based patient monitoring system for medical scenes. Lidar realizes non-intrusive and privacy-safe sensing. Using a dataset we collected at the First Affiliated Hospital, Sun Yat-Sen University, the proposed 3D human pose estimation and action recognition method can recognize standing, sitting, lying and falling with 93.46% accuracy at 42 FPS. The experimental results show that the proposed method achieves better results than other mainstream algorithms on the medical human pose dataset and is suitable for deployment in medical scenarios, which can help meet the challenge of the increasing burden on the healthcare system.

References
[1] Berney, S.C., Rose, J.W., Bernhardt, J., Denehy, L.: Prospective observation of physical activity in
critically ill patients who were intubated for more than 48 hours. Journal of critical care 30(4), 658–663
(2015)
[2] Patel, R.S., Bachu, R., Adikey, A., Malik, M., Shah, M.: Factors related to physician burnout and its
consequences: a review. Behavioral sciences 8(11), 98 (2018)
[3] Lyon, M., Sturgis, L., Lendermon, D., Kuchinski, A.M., Mueller, T., Loeffler, P., Xu, H., Gibson, R.:
Rural ed transfers due to lack of radiology services. The American journal of emergency medicine 33(11),
1630–1634 (2015)
[4] Kong, Y., Fu, Y.: Human action recognition and prediction: A survey. International Journal of
Computer Vision 130(5), 1366–1401 (2022)
[5] Ozcan, T., Basturk, A.: Human action recognition with deep learning and structural optimization using
a hybrid heuristic algorithm. Cluster Computing 23(4), 2847–2860 (2020)
[6] Prati, A., Shan, C., Wang, K.I.-K.: Sensors, vision and networks: From video surveillance to activity
recognition and health monitoring. Journal of Ambient Intelligence and Smart Environments 11(1), 5–
22 (2019)
[7] Wang, L., Huynh, D.Q., Koniusz, P.: A comparative review of recent kinect-based action recognition
algorithms. IEEE Transactions on Image Processing 29, 15–28 (2019)
[8] Jaouedi, N., Boujnah, N., Bouhlel, M.S.: A new hybrid deep learning model for human action
recognition. Journal of King Saud University-Computer and Information Sciences 32(4), 447–453 (2020)
[9] Muhammad, K., Ullah, A., Imran, A.S., Sajjad, M., Kiran, M.S., Sannino, G., Albuquerque, V.H.C.,
et al.: Human action recognition using attention based lstm network with dilated cnn features. Future
Generation Computer Systems 125, 820–830 (2021)
[10] Wang, L., Tong, Z., Ji, B., Wu, G.: Tdn: Temporal difference networks for efficient action
recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 1895–1904 (2021)
[11] Jhuang, Y.-Y., Tsai, W.-J.: Deeppear: Deep pose estimation and action recognition. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7119–7125 (2021). IEEE
[12] Islam, M.S., Bakhat, K., Khan, R., Naqvi, N., Islam, M.M., Ye, Z.: Applied human action recognition network based on snsp features. Neural Processing Letters 54(3), 1481–1494 (2022)
[13] Ma, A.J., Rawat, N., Reiter, A., Shrock, C., Zhan, A., Stone, A., Rabiee, A., Griffin, S., Needham,
D.M., Saria, S.: Measuring patient mobility in the icu using a novel noninvasive sensor. Critical care
medicine 45(4), 630 (2017)
[14] Yeung, S., Rinaldo, F., Jopling, J., Liu, B., Mehra, R., Downing, N.L., Guo, M., Bianconi, G.M.,
Alahi, A., Lee, J., et al.: A computer vision system for deep learning-based detection of patient
mobilization activities in the icu. NPJ digital medicine 2(1), 11 (2019)
[15] Luo, Z., Hsieh, J.-T., Balachandar, N., Yeung, S., Pusiol, G., Luxenberg, J., Li, G., Li, L.-J., Downing,
N.L., Milstein, A., et al.: Computer vision-based descriptive analytics of seniors’ daily activities for long-
term health monitoring. Machine Learning for Healthcare (MLHC) 2(1) (2018)
[16] Min, Y., Zhang, Y., Chai, X., Chen, X.: An efficient pointlstm for point clouds based gesture
recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 5761–5770 (2020)
[17] Fan, H., Yang, Y., Kankanhalli, M.: Point 4d transformer networks for spatio-temporal modeling in point cloud videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14204–14213 (2021)
[18] Katayama, H., Mizomoto, T., Rizk, H., Yamaguchi, H.: You work we care: Sitting posture
assessment based on point cloud data. In: 2022 IEEE International Conference on Pervasive Computing
and Communications Workshops and Other Affiliated Events (PerCom Workshops), pp. 121–123 (2022).
IEEE
[19] Xu, Y., Jung, C., Chang, Y.: Head pose estimation using deep neural networks and 3d point clouds.
Pattern Recognition 121, 108210 (2022)
[20] Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J.T., Yuan, J.: A2j: Anchor-to-joint regression
network for 3d articulated pose estimation from a single depth image. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp. 793–802 (2019)
[21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[22] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
[23] Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20, 273–297 (1995)
[24] Haque, A., Peng, B., Luo, Z., Alahi, A., Yeung, S., Fei-Fei, L.: Towards viewpoint invariant 3d
human pose estimation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 160–177 (2016). Springer
[25] Wang, K., Zhai, S., Cheng, H., Liang, X., Lin, L.: Human pose estimation from depth images via
inference embedded multi-task learning. In: Proceedings of the 24th ACM International Conference on
Multimedia, pp. 1227–1236 (2016)
[26] Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3d human pose estimation in the wild: a
weakly-supervised approach. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 398–407 (2017)
[27] Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the
European Conference on Computer Vision (ECCV), pp. 529–545 (2018)
[28] Moon, G., Chang, J.Y., Lee, K.M.: V2v-posenet: Voxel-to-voxel prediction network for accurate 3d
hand and human pose estimation from a single depth map. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5079–5088 (2018)
[29] Liu, J., Rojas, J., Li, Y., Liang, Z., Guan, Y., Xi, N., Zhu, H.: A graph attention spatio-temporal
convolutional network for 3d human pose estimation in video. In: 2021 IEEE International Conference
on Robotics and Automation (ICRA), pp. 3374–3380 (2021). IEEE
