Multimedia Tools and Applications (2021) 80:35973–35991

https://doi.org/10.1007/s11042-020-10160-9
1171: REAL-TIME 2D/3D IMAGE PROCESSING WITH DEEP LEARNING

Learning similarity and dissimilarity in 3D faces with triplet network

Anagha R. Bhople1 · Surya Prakash1

Received: 30 April 2020 / Revised: 18 August 2020 / Accepted: 10 November 2020 / Published online: 23 February 2021
© Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Face is the most preferred biometric trait used to recognize a person. The 2D face can
be considered a promising biometric trait; however, it may be affected by changes in
age, skin color, and texture, or by environmental factors such as illumination variations,
occlusion, and low image resolution. The 3D face is an emerging biometric trait, which
has recently been used for human recognition. The discriminating power of the 3D face is
highly motivating for many tasks such as security, surveillance, and many other technological
applications in day-to-day life. Although there are many techniques available for 3D face
recognition, most of them are based on volumetric or depth/range images. The conversion
of 3D face data, which is originally in point cloud format, to a volumetric representation
makes the data bulkier. Further, some of the geometric properties may be lost when
3D data is converted to a representation of lower dimensions such as depth/range images.
The driving objective behind this research is to perform 3D face recognition by directly
using faces represented in the form of a 3D point cloud. We propose a novel approach for
3D face recognition by learning the similarity and dissimilarity in 3D faces, and for this
purpose, introduce a triplet network. The network combines our proposed Convolutional
PointNet (CPN), used for feature extraction, with a triplet loss. The proposed network maps
3D face data to a Euclidean space where distance based scores represent the similarity
among the 3D faces. We also introduce a new evaluation approach for computing the
dissimilarity between highly similar 3D face biometric data. Experimentation has been
carried out on two databases, namely the IIT Indore 3D face database (our in-house database)
and the Bosphorus 3D face database. To handle the training issues due to the limited
availability of samples for each subject in both databases, we propose a technique for 3D
data augmentation. We perform various experiments using the proposed network and report
the performance in terms of verification rate and ROC curve. Our point cloud based triplet
network shows encouraging performance as compared to other state-of-the-art techniques.
Keywords 3D face recognition · Point cloud · Deep learning · PointNet · Biometrics

Anagha R. Bhople
[email protected]

Surya Prakash
[email protected]

1 Indian Institute of Technology Indore, Indore 453552, India



1 Introduction

Biometrics has become a widely used technique to recognize a person. It provides an automated
way of recognition based on the physiological or behavioral traits of a person. A few
examples of widely used physiological traits include the face, fingerprint, iris, and ear, whereas
signature, voice, and gait are commonly used behavioral traits. Out of the various physiological
traits, the face provides a promising way for person recognition. Recently, it has been
shown in [1] that the face is a robust biometric trait for human recognition. There exist many
algorithms for 2D and 3D face recognition. The popularity of the face for recognition is
due to the vital biometric information it possesses and the ease of its acquisition [18].
Recently, many techniques have been proposed for 2D face recognition using deep
learning approaches. The deciding factors behind the enormous success of 2D face recognition
using deep learning are the availability of substantial labeled training databases and the
applicability of convolutional neural networks (CNN) to face recognition in a constrained
environment [26]. There exist various conventional methods such as Eigenface analysis
[47], Fisherface analysis [37], Linear discriminant analysis [9], and their extensions for 2D
face recognition. DeepFace [46] and FaceNet [40] are two recently proposed deep learning
based networks that achieve performance on par with humans. DeepFace uses a Siamese
network, which consists of a twin network for calculating the similarity of 2D face data.
The FaceNet model is a well-established tool for 2D face recognition; it introduced the
concept of triplet loss and online triplet mining for sampling triplets of faces.
Though there exist many 2D face recognition systems that provide very robust performance,
they mostly fail to achieve reasonable accuracy in an unconstrained environment
where image quality is affected by lighting conditions, poses, occlusion, etc. To handle
these challenges, the use of 3D data for face recognition has been proposed. Comparatively,
3D data is more reliable and is able to retain more geometrical and topological
information [2] than the corresponding 2D images. It is also able to overcome many other
drawbacks of 2D face recognition. It is seen that 3D data is insensitive and robust to
changes in pose, illumination, occlusion, skin color, texture, and facial deformation [1].
Moreover, collecting 3D face data has become more convenient and cheaper than before
with economically viable sensors like Microsoft's Kinect®, Intel RealSense®, and Lidar
sensors, which has opened an exciting research area for various recognition tasks using 3D data [2].
There are various ways of representing 3D data, such as volumetric grids, projections,
descriptors, multi-view representations, RGB-D data, meshes, graphs, and point clouds [2].
Depth/range images are also widely used for 3D face representation and recognition. For
example, in [20], the authors have proposed a deep neural network with depth images. However,
this approach has some limitations, as the conversion of 3D data to depth images is
computationally expensive and does not make full use of the spatial information in 3D data.
Voxelization is performed to convert 3D data to a volumetric representation such as voxel
grids [42] or occupancy grids [14, 31], making the data unnecessarily bulky, which adversely
affects storage requirements and processing time. To eliminate this, we present a
technique for 3D face recognition that uses 3D point cloud data, a collection of
3-dimensional coordinates representing the outer surface of a 3D object. Point cloud
based representation is an efficient way of representing all minute facial features. Moreover,
as each scanner generates a basic set of (x, y, z) points for any 3D object, the point cloud
provides an easy and compact way to represent a 3D object. Due to this, the requirements of
resources and computational time reduce drastically.

Most of the CNN based models for 3D image recognition need input in a regular format
like a voxel grid or 2D images, which are arrays of pixels. Since raw 3D point cloud
data is unstructured and noisy, it is difficult to feed the data directly to a CNN. To solve
this, we have proposed a dedicated deep neural network, which can directly take a point
cloud as input. Our proposed model is based on PointNet [35], which is the first deep
learning based architecture proposed for the direct use of a 3D point cloud as input. This
architecture is also invariant to both permutation and geometrical transformation of 3D point
cloud data, which makes it most suitable for processing unordered point cloud data. We
modify the original PointNet by making use of a CNN with it, and we name our proposed
model Convolutional PointNet (CPN). Instead of using hand-engineered features with
descriptors, we use the proposed CPN for automatic feature generation. The point cloud data
of each face sample is fed to the proposed CPN network, and it produces an output feature
vector for each 3D facial scan.
While implementing face recognition in an unconstrained environment, it is essential to
reduce intra-personal variations and increase inter-personal variations. To achieve this,
we have designed a triplet network based on the triplet loss [40] and have incorporated
the proposed CPN in it. Overall, the proposed triplet network based model is a shared network
consisting of three parallel and symmetric CPN networks. The input to the triplet network is
a set of triplets < anchor, positive, negative > where the anchor and positive 3D scans
are of the same person, whereas the anchor and negative 3D scans are of different persons.
The use of triplet loss reduces the intra-personal distances and helps in maximizing the
inter-personal distances.
In summary, the following are our major contributions:
– One of the main challenges in 3D face recognition is to develop a neural network which
can directly take point cloud data as an input. For this, we propose an efficient network,
called CPN, using the PointNet architecture and CNN.
– We propose a triplet network, the first of its kind, for 3D face recognition where CPN is
the core feature extraction module. Triplet loss is introduced for learning the dissimilarity
and the similarity between 3D face pairs such as < anchor, negative > and
< anchor, positive >.
– We have also proposed data augmentation to cater to the huge data requirements
of deep learning based models for training. We also introduce preprocessing to make the
data compatible with our network and have shown the possibility of achieving better
results with raw point cloud data without any data conversion.
– We have introduced a new matching technique for validating the performance of the
triplet network. We perform extensive experimentation on two databases, viz. the Bosphorus
3D face database [39] and the IIT Indore 3D face database, and have shown encouraging
results.
The rest of the paper is organized as follows. In the next section, existing related work
is reviewed. In Section 3, the proposed triplet network is described, whereas in the next
section, outcomes of the experimental analysis and performance evaluation are discussed.
The paper is concluded in the last section.

2 Related work

Face recognition is a highly researched problem in the area of biometrics [8, 34, 45, 51].
In 2D, there exist many techniques for face recognition which show excellent performance,
and most of these techniques are based on deep neural networks. 3D face recognition is a
relatively less explored topic; however, it has the potential to provide promising results as
compared to 2D in an unconstrained environment. In the case of 2D, even if two images
belong to the same person, illumination, poses, occlusion, and texture may affect the appearance
and quality of the images and make them unrecognizable by the system. However,
3D data preserves the structural and anatomical information about 3D faces, which
increases the distinguishing power and robustness of face recognition systems. This
motivates us to use 3D data for face recognition. While presenting the review of the existing
techniques in 3D face recognition, we divide them into two categories: classical 3D face
recognition techniques and deep learning based 3D face recognition techniques.

2.1 Classical techniques for 3D face recognition

In this section, a detailed literature survey of the most recent 3D face recognition methods
is presented. Patil et al. have given a detailed overview of 3D face recognition
regarding 3D features, available 3D databases, various algorithms for 3D face recognition,
and challenges associated with 3D data in [34]. The conventional methods for 3D
face recognition involve extracting hand-engineered features with descriptors, which can be
used for various identification/verification tasks [10, 17, 28]. Some of the methods are based
on the fusion of 2D and 3D face recognition algorithms. Existing solutions include
local and global descriptor based approaches, local region based approaches,
model based approaches, facial curve based approaches, and point cloud based approaches [41].
Local descriptors typically extract useful features from a particular sub-region of the
face, while global descriptors treat the whole face as a set of features. Mian et al. [33] have
proposed a keypoint detection technique that can identify keypoints even if there is
high shape variation among 3D faces. They have used Scale-Invariant Feature
Transform (SIFT) features and employed a fusion of both 2D and 3D features. Mian et al. [32]
have also presented a multimodal technique, where both 2D and 3D features are used and
hybrid matching is performed to gain robust face recognition. Different poses and
textures of 3D faces are corrected using the Hotelling transform. A rejection classifier is
formed with a 3D spherical face representation (SFR) and a SIFT descriptor.
Gupta et al. have proposed a technique for detecting facial fiducial points in [16], which
uses anthropometric features. A fully automatic face recognition method is developed in
which 3D Euclidean and geodesic distances are calculated for 3D face recognition. The
method proposed by Berretti et al. in [4] is an entirely 3D based approach which involves
keypoint detection and matching. This approach takes care of 3D scans captured in an
unconstrained environment, even with missing portions. Here, the MeshDOG keypoint
detection algorithm is used, and the Random Sample Consensus (RANSAC) algorithm
is employed for avoiding outliers. Soltanpour et al. [44] have suggested multimodal 2D
and 3D face recognition with local descriptors based on a pyramidal shape map and structural
context histogram. Yu et al. [48] have proposed a rigid registration method with the help
of surface resampling and denoising, which minimizes the impact of noise and sampling
differences on registration residuals.
In local region based approaches, features are extracted from several local sub-regions,
and local information is gathered. Li et al. [28] have extended matching methods
like SIFT to mesh data and proposed 3D keypoint descriptors. They have proposed three
keypoint descriptors based on surface differential quantities. Reji et al. [38] have
proposed a region based method for expression robust 3D face recognition. Contour
based image registration is used for calculating the matching score. In [25], Lei et al. have
presented a technique for 3D face representation known as Keypoint-based Multiple Triangle
Statistics (KMTS), where 3D keypoints are detected based on the Hotelling transform, and four types
of geometric features are selected from the keypoint area to form the descriptor. A Two-Phase
Weighted Collaborative Representation Classification (TPWCRC) method is used for face
recognition.
In [6], Blanz et al. have proposed a model based technique for 3D face recognition where
textured 3D faces are used, which are obtained by transforming the shape and texture present in
the 3D scans. New 3D faces are synthesized by manipulating linear combinations
of existing samples. Gilani et al. [15] have presented an algorithm which establishes dense
3D correspondences between numerous 3D face scans. A model (K3DM) is built from the
densely corresponding faces, and the technique performs 3D face recognition by morphing the
model to fit unseen 3D face data. Drira et al. in [12] have proposed a method that uses curvature
analysis of the 3D facial surfaces. Radial curves in the nose tip area are extracted
and matched for face recognition. ICP (Iterative Closest Point) and Local ICP are iterative
approaches; the basic principle of ICP is to find the best translation and rotation parameters
that align one model to another. These algorithms are used for shape matching, as given in
[17]. Li et al. [29] have proposed an expression invariant 3D face recognition system with
the help of many subject-specific curves. The combination of such expression-insensitive
curves provides a set of features for 3D face recognition, and matching of the 3D faces
is performed using the ICP algorithm.

2.2 Deep learning based techniques for 3D face recognition

The 3D face recognition technique proposed by Kim et al. in [20] is purely based on deep
learning. They have performed face recognition in 3D with deep convolutional neural networks
(DCNN). Moreover, they have proposed new augmentation techniques. In this paper,
transfer learning is used to train on 3D face data, and it is shown that a CNN trained on
2D data can successfully work with 3D depth maps. Gilani et al. [52] have designed their own
CNN model for 3D face recognition, which is trained on 3.1 million 3D face scans, where new
3D faces are generated by introducing non-linearity in the 3D face data.
Leo et al. have proposed an expression invariant 3D face recognition system in [27].
The proposed technique is a combination of SVM (support vector machine) and PCA
(principal component analysis). In [11], Dorofeev et al. have proposed a CNN based approach
where various filtering algorithms are used for the reconstruction of 3D face surfaces. An
augmentation technique is utilized to form data with different facial expressions from a
single 3D face scan. In [13], Feng et al. have utilized the depth information in the 3D data
to enhance the robustness of the face recognition system. A DCNN, along with a softmax
activation function, is used to perform 3D face recognition.
In [49], Zhang et al. have proposed a 3D face recognition technique based on point cloud
data where the modified PointNet++ model is used to train the augmented 3D point cloud
data. Also, the Face-based loss and multi-label loss are used to enhance the feature discrim-
ination. Zhang et al. [50] have also proposed a data-free 3D face recognition method where
training is performed with synthesized unreal data obtained from statistical 3D Morphable
Models. The training is performed on the generated data, and testing is performed on a few
samples obtained from the real data. The curvature-aware point sampling (CPS) is used to
obtain the feature sensitive points. The PointNet++ network is used to obtain the features
from the face point cloud.
Our technique differs from existing 3D face recognition techniques, as we perform
3D face recognition with a deep neural network based on the point cloud, unlike other
traditional approaches that have used 2D/3D descriptors [30] or 2.5D data [3] for training
a deep neural network. We have provided a deep learning solution for 3D face recognition
by training on the 3D face point cloud from scratch and have optimized the network for better
resource utilization. We have also addressed the research gap by introducing a new matching
technique based on the triplet network, where our proposed network shows the ability
to identify even highly similar 3D faces with the triplet loss technique.

3 Proposed approach

The driving idea behind this work is to directly make use of point cloud data for 3D face
recognition. Most of the present work in the domain of 3D face recognition utilizes depth maps
(or range images). These images are computed by converting 3D data to lower
dimensions such as 2.5D data [43]; however, this representation is view dependent and does
not give a global representation of the 3D face data. Further, the 2.5D representation is more
inclined towards the physical appearance of the face rather than its geometric structure and
can easily be affected by environmental conditions. It is also affected by self-occlusions
in case of pose variations. It is found that it is not easy to express the features of a 3D object
such as the face using these representations when the object has complex features. We attempt
to overcome this limitation, and for this purpose, we propose the use of point cloud data for
3D face recognition. The point cloud represents the exterior of a 3D object surface in the
form of Cartesian coordinates. In this representation, each point Pi = (xi, yi, zi) gives the
geometric coordinates of a point on the surface of the object in 3-dimensional space.
We propose a deep learning based solution for 3D face recognition and construct a triplet
network with CPN. An overview of the proposed triplet network is shown in Fig. 4. The
main building blocks of the triplet network are the CPN and triplet loss. The input to this
network is in the form of point cloud where the CPN used in the network captures both local
and global features from the point cloud data. The main advantage of the proposed approach
over other existing approaches is that it is purely based on point cloud data. This avoids
the extra cost of data conversion, and at the same time, preserves the precise information
about the size and the shape of the 3D face data. The network is also capable of learning the
similarity and the dissimilarity between two given 3D faces efficiently.
Before using the point cloud data for training or testing in the network, we perform a
few necessary preprocessing steps on it. Important preprocessing steps include subsampling
of the point cloud data to reduce file size, conversion of the input data to the HDF5/H5 file
format, and data augmentation. The various steps involved in the proposed approach
are explained below.

3.1 Preprocessing and augmentation

As the point cloud retains the geometric properties of the 3D face efficiently,
it is a preferred choice for 3D face recognition. The 3D facial point cloud data used in this work
contains only geometrical or spatial information in terms of x, y, and z coordinates. In the
proposed work, the main emphasis is given to the geometric information rather than color
and texture. As in 2D data, the color and texture captured by 3D scanners vary
depending upon the environmental conditions, whereas the geometric information remains
unaffected. Initially, each 3D face scan may contain a huge and variable number of points.
However, all those points may not be required. Moreover, they make the input data bulky and
difficult to process. Due to this, preprocessing becomes essential [21] to reduce the number
of points in the input file. For this purpose, we use a subsampling method, which randomly
selects 2048 points from the point cloud data of each 3D face. Subsampling is performed in
such a way that the resultant point cloud is able to retain the structure of the original face.
We have also preprocessed the 3D face data using filtering and diffusion techniques [7] to
remove noise.
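A minimal sketch of this subsampling step is given below. It assumes the scan is already loaded as an (N×3) NumPy array; the function name and the synthetic example are illustrative only, not the authors' original code.

```python
import numpy as np

def subsample_point_cloud(points, num_points=2048, seed=None):
    """Randomly select `num_points` points from an (N, 3) point cloud.

    If the scan has fewer points than requested, points are sampled
    with replacement so that the output size stays fixed.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    idx = rng.choice(n, size=num_points, replace=n < num_points)
    return points[idx]

# Example usage with a synthetic scan of 50,000 points
scan = np.random.rand(50000, 3).astype(np.float32)
face_2048 = subsample_point_cloud(scan, 2048)
print(face_2048.shape)  # (2048, 3)
```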
We have used the HDF5/H5 file format [23] for input to our network. Originally, the Bosphorus
database exists in the .BNT data format, whereas the IIT Indore database contains the 3D face
data in ASCII format. We convert the data files of both databases into the H5 file format, where
data is saved in a hierarchical way. It is an abstract way of saving sizeable floating point
data that preserves the relationship between the point cloud data and labels by utilizing a
multidimensional array structure.
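The conversion to H5 can be sketched with the h5py library as follows. The dataset names "data" and "label" follow the common PointNet convention and are an assumption here; the actual layout used for the two databases may differ.

```python
import h5py
import numpy as np

def save_h5(path, clouds, labels):
    """Save point clouds and their subject labels to an HDF5 file.

    clouds: array of shape (num_scans, 2048, 3)
    labels: integer subject IDs of shape (num_scans,)
    """
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=clouds, compression="gzip")
        f.create_dataset("label", data=labels, compression="gzip")

def load_h5(path):
    with h5py.File(path, "r") as f:
        return f["data"][:], f["label"][:]

# Example: three subjects, one synthetic scan each
clouds = np.random.rand(3, 2048, 3).astype(np.float32)
labels = np.array([0, 1, 2], dtype=np.int64)
save_h5("faces.h5", clouds, labels)
data, lab = load_h5("faces.h5")
print(data.shape, lab.shape)  # (3, 2048, 3) (3,)
```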
As there are limited samples available for each subject in both databases, 3D face
recognition performance tends to saturate if only the available samples are used for training
and inference. Hence, to handle this problem, we perform 3D data augmentation. It helps in
increasing the number of samples per subject to ensure robust training of the proposed
network. It also improves the generalization of the network and prevents overfitting.
We apply augmentation at the point cloud level [24, 35, 36], where augmentation is performed
by rotating the point cloud data of available samples by 90° along a fixed direction,
by randomly perturbing the points by a small rotation with angle-sigma 0.06 and angle-clip
0.18, and by jittering the position of each point in the cloud with Gaussian noise of
mean zero and standard deviation 0.02.
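The following sketch illustrates these three augmentations with NumPy. The axis chosen for the 90° rotation and the clipping value used for the jitter noise are assumptions made for illustration; the angle-sigma, angle-clip, and noise parameters are the ones listed above.

```python
import numpy as np

def rotate_y(points, angle):
    """Rotate an (N, 3) cloud about the y-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return points @ rot.T

def perturb_rotation(points, angle_sigma=0.06, angle_clip=0.18):
    """Apply a small random rotation about each axis (PointNet-style)."""
    ax, ay, az = np.clip(angle_sigma * np.random.randn(3),
                         -angle_clip, angle_clip)
    rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return points @ (rz @ ry @ rx).T

def jitter(points, sigma=0.02, clip=0.05):
    """Add clipped Gaussian noise (mean 0, std `sigma`) to every point."""
    noise = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points + noise

cloud = np.random.rand(2048, 3)
augmented = [rotate_y(cloud, np.pi / 2), perturb_rotation(cloud), jitter(cloud)]
```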

3.2 Triplet loss

Triplet loss [40] has various applications in recognition, verification, and clustering tasks.
The input to a triplet network is a triplet of three 3D facial scans, denoted as
< positive, anchor, negative >, where the positive and anchor face samples belong
to the same class, whereas the negative face sample belongs to a different class than that
of the anchor. The idea behind triplet loss is to distinguish 3D faces according to the distances
among an anchor 3D sample, a positive 3D sample, and a negative 3D sample in an
embedding space. The task of the triplet loss is to minimize the squared distance between
the anchor and positive 3D face samples and maximize the distance between the anchor and
negative 3D face samples.
Let anc be an anchor 3D face scan, pos be a positive 3D face scan, and neg be a negative
3D face scan. Let the squared distance between the anchor and positive scans be given
by d(anc, pos), and the squared distance between the anchor and negative scans by
d(anc, neg). Then these distances should follow the inequality given below.
d(anc, pos) < d(anc, neg) (1)

d(anc, pos) − d(anc, neg) < 0 (2)


In some cases, the network stops learning and outputs zero for everything. To get rid
of such situations, a margin (m) is introduced, which acts as a decision boundary between
d(anc, pos) and d(anc, neg). With this, the above inequality can be written as follows.
d(anc, pos) − d(anc, neg) + m < 0 (3)
With this, the triplet loss function is given as follows.
triplet loss = max(d(anc, pos) − d(anc, neg) + m, 0) (4)

Now, the objective is to minimize the triplet loss. This indirectly means that the dissimilarity
between the anchor and the positive scans should be minimized, whereas it should be maximized
between the anchor and the negative scans. To do this, we calculate the gradients
of the triplet loss function and use them to update the weights and biases in the triplet
network to move towards the objective.
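A direct TensorFlow implementation of the loss in Eq. (4), operating on batches of feature embeddings, could look as follows; the margin value is left as a parameter (0.5 is used later in Section 4.3).

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss of Eq. (4) on batches of feature embeddings.

    Each argument has shape (batch, embedding_dim); the distances are
    squared Euclidean distances, and the loss is averaged over the batch.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

# Toy check: identical anchor/positive and a distant negative give zero loss
a = tf.constant([[0.0, 0.0]]); p = tf.constant([[0.0, 0.0]])
n = tf.constant([[3.0, 4.0]])
print(float(triplet_loss(a, p, n)))  # 0.0, since 0 - 25 + 0.5 < 0
```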

3.3 Proposed network architecture

We propose a triplet network for learning the similarity and the dissimilarity in 3D faces.
It is a shared network consisting of three parallel and symmetric CPNs, where the proposed
CPN is an enhanced version of the PointNet [35] architecture. All three networks have the
same architecture and use the same set of parameters. As it is a shared network, during
training the weights are updated simultaneously for all three parallel CPNs. For the task of
3D face recognition, the objective is to build the triplet network while keeping the CPN as the
core feature extraction module. Below, we first discuss the working of our proposed CPN
architecture and then explain the triplet network.
The proposed CPN architecture consists of three important segments, viz. the Input Transformation
Network, the Feature Transformation Network, and the Forward Network. Its flow chart is
displayed in Fig. 1. As stated above, it is an enhanced version of PointNet [35]. In PointNet,
the implementation is carried out using a multi-layer perceptron (MLP), which is a fully
connected neural network where each node is connected to every other node. Due to the huge
number of connections, it requires too many parameters to be trained and may compromise
spatial information. Owing to the robust feature extraction and network optimization
offered by convolutional layers, we use a CNN to enhance the PointNet [35] architecture.
We name the enhanced network CPN. With a CNN, local connectivity
results in weight and parameter sharing and provides some sort of built-in regularization.
All the layers are sparsely connected rather than fully connected, and the weights and parameters
are less redundant. In a CNN, filters look for a particular pattern irrespective of its location.
As a result, the network identifies patterns while retaining spatial information.
In the proposed CPN, each convolutional layer is implemented as a 1-D convolution
with a kernel size of one. Further, the input to the CPN is directly the point cloud
data in the form of a matrix of size (N×3), where N is the number of points on the input 3D
face surface.
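The key point is that a 1-D convolution with kernel size one applies the same weights to every point independently, so it behaves as a shared per-point transformation. A minimal Keras sketch of such a Conv-BN-ReLU block (the layer size here is illustrative) is given below.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A Conv1D layer with kernel_size=1 maps every point independently with the
# same weights, i.e. a shared per-point transformation over the (N x 3) cloud.
inputs = tf.keras.Input(shape=(2048, 3))       # N = 2048 points, 3 coordinates
x = layers.Conv1D(64, kernel_size=1)(inputs)   # per-point 64-dim features
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
per_point_block = tf.keras.Model(inputs, x)
per_point_block.summary()                      # output shape: (None, 2048, 64)
```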
Initially, the (N×3) point cloud data is introduced to the Input Transformation Network,
as shown in Fig. 2. This network makes the input data immune to rigid geometric
transformations and performs pose normalization. Original point cloud data does not have global

Fig. 1 The proposed CPN network takes a set of (N×3) points as input and passes it through a series of
convolutional layers along with batch normalization and the ReLU activation function to learn local and global
features. Global features are extracted through max pooling. Further, a fully connected layer is applied along
with a sigmoid function, and the output is produced in the form of a feature vector of size (1×4096). In the figure,
N, Conv, BN, and ReLU stand for the number of 3D points, convolutional layer, batch normalization, and
rectified linear unit, respectively


Fig. 2 Architecture of Input Transformation Network. In the figure, N, Conv, BN, ReLU, mat-mul stand for
number of 3D points, convolutional layer, batch normalization, rectified linear unit and matrix multiplication
operation respectively

parametrization and exists in a scattered form; hence this particular subnetwork aligns the
unordered point cloud in a canonical space to make it invariant to such deformations. The
architecture of the Input Transformation Network is inspired by the overall CPN architecture.
Similar to CPN, it also contains a series of convolutional, max pooling, and dense
layers. In the Input Transformation Network, convolutional layers are used to perform feature
learning, where the sizes of the layers are (64, 128, 2048). Further, a max pooling layer
is used for global information encoding. Finally, the transformation matrix of size (3×3)
is generated by combining the globally trainable weights and biases with the dense layer of
size (256). The input point cloud (N×3) is transformed by performing matrix multiplication
between the input point cloud and the transformation matrix. Batch normalization (BN)
and the rectified linear unit (ReLU) activation function are used at each layer. We optimize
the network by applying a smaller number of fully connected layers to make it
cost-efficient.
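A sketch of this subnetwork in Keras is given below, following the layer sizes stated above; the weight initializations and minor wiring details are assumptions, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    # 1-D convolution with kernel size 1, followed by BN and ReLU
    x = layers.Conv1D(filters, kernel_size=1)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def input_transformation_network(num_points=2048):
    """Input Transformation Network of Fig. 2: convolutional layers of size
    (64, 128, 2048), max pooling for global encoding, a dense layer of size
    256, a (3x3) transformation matrix, and multiplication with the input."""
    inp = tf.keras.Input(shape=(num_points, 3))
    x = conv_bn_relu(inp, 64)
    x = conv_bn_relu(x, 128)
    x = conv_bn_relu(x, 2048)
    x = layers.GlobalMaxPooling1D()(x)          # (1 x 2048) global code
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    # (256 x 9) weights and (1 x 9) biases produce the 9 entries of the
    # transform; the original PointNet initializes this to the identity,
    # which is omitted here for brevity.
    x = layers.Dense(9)(x)
    transform = layers.Reshape((3, 3))(x)
    # (N x 3) x (3 x 3) matrix multiplication aligns the input cloud
    aligned = layers.Dot(axes=(2, 1))([inp, transform])
    return tf.keras.Model(inp, aligned, name="input_transform")

input_transformation_network().summary()
```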
The input to Forward Network-1 is a well-ordered set of points (N×3), as shown
in Fig. 1. It contains convolutional layers of size (64, 64), and the output of the network
is a set of local features with dimensions (N×64). Since these features are required to be
invariant to any geometric deformation, they are fed as input to the Feature Transformation
Network to achieve this. The architecture of the Feature Transformation Network is shown in
Fig. 3. Its working is quite similar to that of the Input Transformation Network. The only
differences are the dimensions of the trainable weights and biases, which are
(256×4096) and (1×4096) respectively, and the dimension of the output transformation
matrix, which is (64×64) here. Further, the transformed features, which are the
output of this subnetwork, are obtained by multiplying the feature transformation matrix of
size (64×64) with the input feature matrix of size (N×64). The obtained transformed features
act as the input to Forward Network-2. This subnetwork consists of three convolutional
layers of size (64, 128, 2048), and each convolutional layer is combined with a batch
normalization function and a ReLU activation function. Forward Network-2 results in an
output feature matrix of dimension (N×2048).


Fig. 3 Architecture of Feature Transformation Network. In the figure, N, Conv, BN, ReLU, mat-mul stand for
number of 3D points, convolutional layer, batch normalization, rectified linear unit and matrix multiplication
operation respectively

The last max pooling operation in the proposed network (Fig. 1) ensures the permutation
invariance of the network by aggregating the local features to obtain the global
features of the network. Two fully connected layers of size (512, 256) are used with the ReLU
activation function, and the output is flattened. Unlike the original PointNet,
which uses a softmax activation function in the final layer, we use a
sigmoid activation function in the final fully connected layer. The end output of the network
is a feature vector of size (1×4096).
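The tail of the CPN described above can be sketched as follows; the per-point feature input shape and the size (4096) of the final sigmoid layer are taken from Fig. 1, while the rest of the wiring is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cpn_head(point_features):
    """Tail of the CPN: global max pooling over the (N x 2048) per-point
    features, dense layers of size (512, 256) with BN/ReLU, and a final
    sigmoid layer giving the (1 x 4096) face embedding."""
    x = layers.GlobalMaxPooling1D()(point_features)   # permutation-invariant
    for units in (512, 256):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.Dense(4096, activation="sigmoid")(x)

# Example: attach the head to placeholder per-point features of shape (N, 2048)
feat_in = tf.keras.Input(shape=(2048, 2048))
cpn_tail = tf.keras.Model(feat_in, cpn_head(feat_in))
print(cpn_tail.output_shape)   # (None, 4096)
```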
The triplet network is a shared network which consists of three CPN networks that
share the weights and the parameters. As shown in Fig. 4, the input to the triplet network is
a triplet of three 3D facial scans, denoted by < positive, anchor, negative >.
Once the input is received, the triplet network computes the feature embedding for all three
3D facial scans presented to it using the CPN. This produces a feature vector of dimension
(1×4096) for each 3D facial scan. The triplet network is trained with many triplets until the
network reaches convergence. The most important constituent of the triplet network is the
triplet loss function. This function works in such a way that similar 3D scans are pulled
close to each other by decreasing the distance between them, whereas dissimilar 3D scans are
moved away from each other as the distance between them is maximized. In the triplet loss,
dissimilarity is learned by calculating the squared distance between each pair of 3D facial
scans, viz., < anchor, positive > and < anchor, negative >. In this way, the network
learns the intra/inter-class differences for each genuine (where both 3D scans belong to the
same subject) and imposter (where the two 3D scans belong to different subjects) pair.
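A possible Keras wiring of the shared triplet network is sketched below. Calling a single CPN model on the three inputs is what shares the weights; compiling the model so that its output is the per-triplet loss is one common way of training with a triplet objective and is an assumption here, not the authors' exact training code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_triplet_network(cpn, num_points=2048, margin=0.5):
    """Wire three shared copies of a CPN model into the triplet network.

    `cpn` maps an (N x 3) cloud to a (1 x 4096) embedding; calling the same
    model instance on all three inputs is what shares the weights. The model
    outputs the per-triplet loss of Eq. (4), so training simply minimizes it.
    """
    anchor = tf.keras.Input(shape=(num_points, 3), name="anchor")
    positive = tf.keras.Input(shape=(num_points, 3), name="positive")
    negative = tf.keras.Input(shape=(num_points, 3), name="negative")
    emb_a, emb_p, emb_n = cpn(anchor), cpn(positive), cpn(negative)

    def _triplet_loss(embeddings):
        a, p, n = embeddings
        d_pos = tf.reduce_sum(tf.square(a - p), axis=-1)
        d_neg = tf.reduce_sum(tf.square(a - n), axis=-1)
        return tf.maximum(d_pos - d_neg + margin, 0.0)

    loss = layers.Lambda(_triplet_loss, name="triplet_loss")([emb_a, emb_p, emb_n])
    model = tf.keras.Model([anchor, positive, negative], loss)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=lambda y_true, y_pred: y_pred)
    return model

# Stand-in embedding model used only to show the wiring; a real CPN goes here
inp = tf.keras.Input(shape=(2048, 3))
h = layers.Conv1D(64, kernel_size=1, activation="relu")(inp)
h = layers.GlobalMaxPooling1D()(h)
out = layers.Dense(4096, activation="sigmoid")(h)
triplet_model = build_triplet_network(tf.keras.Model(inp, out))
triplet_model.summary()
```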
For evaluating the performance of the proposed network, pairwise samples are created,
which consist of several genuine and imposter pairs. The CPN in the proposed triplet network
obtains a (1×4096)-dimensional feature embedding for each sample, and the squared distance
is calculated for each pair. This distance determines the similarity of two 3D faces, where a
smaller distance represents similar 3D face samples, whereas a larger distance shows that


Fig. 4 Proposed triplet network with shared weights where features are being learnt with the help of CPN.
Input to the network is a triplet of 3D facial point clouds where each point cloud is of size (N×3). Output of
the CPN is a feature embedding of dimension (1×4096) for each 3D face scan in the triplet

the two 3D faces belong to two different subjects. The intuition behind this is that the feature
embeddings of 3D facial scans of the same person would be similar, whereas there will be
a difference in feature embeddings if the 3D scans in the input pair belong to two different
subjects. Hence, distance based scores for genuine and imposter pairs would be different.
The proposed model can correctly discriminate between different identities or subjects and
performs the standard verification task of 3D face recognition.

4 Experimental analysis

We have performed different experiments to evaluate the proposed architecture for 3D
face recognition. Information about the databases used in the analysis, data preparation,
parameter analysis, and performance evaluation is given below.

4.1 Databases used

The proposed network has been evaluated on the Bosphorus 3D face database [39]
and the IIT Indore 3D face database. Details about both databases are given in Table 1. The
Bosphorus 3D face database includes images with unique and occluded backgrounds, which
are affected by different imaging factors like light, pose, and scale variations. The database
consists of 4666 samples of 105 subjects, where the maximum and minimum available samples
per subject are 54 and 31, respectively. It contains images with 35
variations in expression, variations in pose (13 yaw and pitch rotations), and four types of
occlusion (beard & mustache, hair, hand, eyeglasses). The face data has been captured in a
controlled environment, and most of the subjects are professional actors and actresses. The
Inspeck Mega Capturor II 3D scanner was used for capturing the 3D faces. Some
of the samples from the Bosphorus database are shown in Fig. 5.
Our in-house database, the IIT Indore 3D face database, has been captured in outdoor
settings and contains challenging data. Due to the outdoor setting, samples have been captured
in different lighting conditions and contain noise. All the facial samples have been
captured under a neutral expression. The Artec-Eva® scanner was used for the collection
of the 3D face scans from the subjects. All the subjects in the database are students,
faculty, and staff members of IIT Indore, with ages ranging from 18 to 45 years.
The data has been captured in three phases, and the minimum gap between the data acquisitions
of two subsequent phases is one year. We have used the data of phase-2 and phase-3
for our experimentation. The total number of available samples for each subject is three,
where there are 90 subjects in the phase-2 database and 99 subjects in the phase-3 database.
Some of the samples from the IIT Indore 3D face database are shown in Fig. 6.

Table 1 Details of databases

Database             IDs   Scans   Expression   Pose    Occlusion   Scanner
IIT Indore phase-2   90    270     –            –       –           Artec-Eva
IIT Indore phase-3   99    297     –            –       –           Artec-Eva
Bosphorus            105   4666    35           ±90°    4 types     Inspeck Mega Capturor II 3D

Fig. 5 A few samples of a subject from Bosphorus 3D face database with different expressions (Texture
mapping and lighting effects are used to achieve better rendering/visibility)

Since the 3D facial data in this database contains noise due to poor and varying lighting
conditions, we have preprocessed it with diffusion and filtering techniques [7] before its use.
Further, the facial scans in a few cases also contain holes, which have been filled using
an interpolation based technique.

4.2 Data preparation

To maintain consistency, we sample the training database in such a way that it contains an
equal number of samples per subject. For example, in the Bosphorus database, the number
of samples per subject ranges from 31 to 54; however, we have selected only thirty samples
randomly per subject to make the number of samples per subject the same for the experimental
evaluation. The randomly selected samples of a subject cover samples with expression
changes, occluded backgrounds, and neutral ones. In the IIT Indore 3D face database, we
have three 3D scans per subject for both phase-2 and phase-3 data, and we have considered
all of them in the experimental evaluation. As the data requirement of deep learning models
is huge, to compensate, we have generated additional samples per subject for both
databases using our proposed augmentation techniques. The overall database obtained after
augmentation is divided into three sets: a training set, a validation set, and a testing set.
We keep 70%, 15%, and 15% of the total samples per subject in the training, validation, and
testing sets, respectively.
Fig. 6 A few samples from IIT Indore 3D face database

Further, we generate triplets for training and validation. To get one triplet, we first select
an anchor sample; with reference to that, we select a positive sample which belongs
to the same subject as the anchor and a negative sample which belongs to a different subject.
In the proposed approach, we randomly select positive and negative
3D faces [19], where the positive and anchor 3D face scans belong to the same person, and the
negative sample is the 3D face scan of a person different from the one in the anchor. Other
possible approaches for sampling the negative 3D face scans are based on the use of easy
negatives, semi-hard negatives, and hard negatives, where easy negatives are the ones which
can easily satisfy the inequality in the triplet loss, and semi-hard and hard negatives have
a much greater loss value. Though hard and semi-hard triplets help the network learn,
such triplets are found to be very sensitive to bad data, and selecting only hard negatives
may result in bad local minima and may lead to a collapsed model [40]. Hence, in this work,
we prefer to select negative samples randomly. For testing, two sets are created, where one
set contains all genuine pairs (i.e., both samples in the pair belong to the same subject),
and the other set contains imposter pairs (i.e., the samples in the pair belong to different
subjects). The samples involved in testing are unseen by the network, and we ensure that
there are no duplicate pairs.
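The random triplet sampling described above can be sketched as follows; the function operates on the array of subject labels of the training scans, and the names and example data are illustrative.

```python
import numpy as np

def sample_triplets(labels, num_triplets, rng=None):
    """Randomly build (anchor, positive, negative) index triplets.

    `labels` holds the subject ID of each training scan. For every triplet,
    the positive is another scan of the anchor's subject and the negative is
    a scan of a randomly chosen different subject.
    """
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    by_subject = {s: np.flatnonzero(labels == s) for s in np.unique(labels)}
    subjects = list(by_subject)
    triplets = []
    for _ in range(num_triplets):
        s = rng.choice(subjects)
        a, p = rng.choice(by_subject[s], size=2, replace=False)
        s_neg = rng.choice([t for t in subjects if t != s])
        n = rng.choice(by_subject[s_neg])
        triplets.append((a, p, n))
    return np.array(triplets)

# Example: 4 subjects with 5 scans each (indices 0..19)
labels = np.repeat(np.arange(4), 5)
print(sample_triplets(labels, 3))
```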

4.3 Parameter setting

The proposed triplet network is implemented using the Keras and TensorFlow frameworks. To
tune the hyperparameters, we have used the validation database. The proposed
triplet network, which consists of three CPN networks, has been trained as follows.
The weights are initialized with a mean of 0.0 and a standard deviation of 0.01 for the
convolutional and fully connected layers. Biases are initialized with a mean of 0.5 and a
standard deviation of 0.01, as mentioned in [22]. The learning rate is set to 0.0001 in our
experiments. The Adam (Adaptive Moment Estimation) optimizer with β = 0.9 is used for
optimization. It plays a vital role in updating the weights and in minimizing the cost
function [35]. The loss function [40] is the most important part of the proposed triplet
network. We have used the triplet loss function (explained in the previous section) with a
margin value of 0.5 for this purpose. The network is trained for 500 epochs with a batch size
of 64. Plots of loss vs. epoch are shown in Fig. 7 for all the experiments. These plots clearly
show that the value of the triplet loss decreases as the epochs proceed. Dropout is used to
avoid overfitting, where nodes are randomly dropped with a probability of 0.5. Batch
normalization, which normalizes the activation values to zero mean and unit variance, is
used to accelerate the training process. For the activation function, we have utilized both
ReLU and sigmoid functions at different layers.
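The settings listed above translate into Keras objects roughly as follows; this is a configuration sketch, not the full training script.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Gaussian weight initialization (mean 0.0, std 0.01), bias initialization
# (mean 0.5, std 0.01), dropout with probability 0.5, and Adam with a
# learning rate of 0.0001; training runs for 500 epochs with batch size 64.
weight_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
bias_init = tf.keras.initializers.RandomNormal(mean=0.5, stddev=0.01)

dense = layers.Dense(256, kernel_initializer=weight_init,
                     bias_initializer=bias_init)
dropout = layers.Dropout(0.5)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9)
EPOCHS, BATCH_SIZE = 500, 64
```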

4.4 Performance evaluation

The output of the proposed triplet network is a feature embedding (or a feature vector) of
size (1×4096) for each 3D face scan. We obtain distance based scores by computing the
squared distances in the embedding space to match two 3D face samples and, based on this,
perform the evaluation.

Fig. 7 Loss vs. epoch plots for different databases during the training of the proposed network: (a) IIT Indore
(phase-2) (b) IIT Indore (phase-3) (c) Bosphorus

Table 2 Performance of the proposed network in terms of verification rate on the Bosphorus and IIT Indore 3D face databases

Database                        Verification rate (at 0.1% FAR)
IIT Indore 3D face (phase-2)    90.00
IIT Indore 3D face (phase-3)    92.75
Bosphorus 3D face               96.25

The performance of the proposed network for 3D face recognition
is evaluated with respect to standard evaluation metrics such as FAR, FRR, verification
rate, and the Receiver Operating Characteristic (ROC) curve. FAR is the percentage of
times a face recognition system classifies an imposter as a legitimate person, whereas FRR
is the percentage of times a genuine user is perceived as an imposter by the face recognition
system. The verification rate determines the percentage of correctly classified genuine
users, which is calculated at a fixed FAR value of 0.1% in our experiments. The ROC curve
determines how well a biometric recognition system can distinguish between two 3D
face scans. It shows the relationship between the FAR and the verification rate and summarizes
the performance of the recognition system.
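Given the distance based scores of the genuine and imposter pairs, the verification rate at a fixed FAR can be computed as sketched below; the thresholding logic assumes that smaller distances indicate a better match, as described above.

```python
import numpy as np

def verification_rate_at_far(genuine_dist, imposter_dist, target_far=0.001):
    """Verification rate at a fixed FAR from distance-based scores.

    Smaller distances mean a better match. The decision threshold is the
    imposter-distance quantile at which the false accept rate equals
    `target_far`; the verification rate is the fraction of genuine pairs
    accepted at that threshold (i.e., 1 - FRR).
    """
    genuine_dist = np.asarray(genuine_dist)
    imposter_dist = np.asarray(imposter_dist)
    threshold = np.quantile(imposter_dist, target_far)   # FAR = target_far
    return np.mean(genuine_dist <= threshold)

# Toy example with well-separated genuine and imposter distance distributions
gen = np.abs(np.random.normal(0.2, 0.1, 10000))
imp = np.abs(np.random.normal(1.0, 0.2, 10000))
print(verification_rate_at_far(gen, imp, 0.001))
```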
Table 2 shows the performance of the proposed network in terms of verification rate at
0.1% FAR for different databases. The proposed network has achieved 96.25%, 90.00%,
and 92.75% verification rates for the Bosphorus, IIT Indore (phase-2), and IIT Indore (phase-3)
databases, respectively. The performance in the case of the IIT Indore (phase-2) and IIT
Indore (phase-3) databases is found to be slightly lower than that for the
Bosphorus database. However, this is due to the availability of a limited number of samples
per subject and the challenging nature of the data. The ROC curves of the proposed network
are shown in Fig. 8 for the different databases. All our experiments have been performed on
a computer system with an Intel Xeon processor, an NVIDIA Tesla V100 GPU card with
32GB memory, 600GB of system RAM, and the Xubuntu operating system.
We have followed standard protocols for evaluating the proposed network for 3D face
recognition. We have computed the verification rate at 0.1% FAR to enable us to compare
the performance of the proposed network with other existing state-of-the-art techniques
which have reported their verification rates at 0.1% FAR. The comparative results are shown
in Table 3. These results show that the proposed network is clearly superior to other existing
state-of-the-art methods in performing the 3D face recognition task. Moreover, the proposed
network has shown the possibility of achieving better performance even with the limited
availability of 3D data. Another advantage of the proposed network lies in its ability to
consume input data in point cloud format, which is the raw data format directly available
from 3D scanners. This removes the requirement of converting the point cloud data to


Fig. 8 ROC curves of the proposed network for different databases: (a) IIT Indore (phase-2) (b) IIT Indore
(phase-3) (c) Bosphorus

Table 3 Comparison of the verification rate of the proposed network at 0.1% FAR with state-of-the-art techniques on the Bosphorus database (the result of our proposed technique is highlighted in bold)

Method                    Verification rate (%)
Reji et al. [38]          95.3
Soltanpour et al. [44]    95.8
Yu et al. [48]            95.7
Our Proposed Method       96.25

other formats and helps in preserving the spatial and geometric properties of the 3D face
data.
We have proposed a triplet network with triplet loss, which is a variant of the Siamese
network, for determining the similarity and dissimilarity in 3D faces. The triplet network
has more discriminating power and is able to learn higher-quality facial feature embeddings
than the Siamese network, which also makes use of a similarity learning technique
along with a standard binary cross-entropy loss or contrastive loss. In the triplet network,
more information is captured during training as the input is in the form of a triplet and training
is performed in a comparative manner. This is because, at any instance during training,
both positive and negative samples are involved in the triplet network. Moreover,
the proposed triplet network is a shared network, which helps in reducing the required
number of weight parameters and makes the network lightweight. We proposed the
use of a Siamese network in our prior work [5], where we used the Siamese network
along with PointNet-CNN (a PointNet implementation with CNN) for feature learning;
a standard binary cross-entropy loss was used in that network. To show the superiority
of the triplet network over the Siamese network, we have compared the performance
of the two networks in terms of recognition rates on the Bosphorus and IIT Indore databases.
The obtained results are shown in Table 4, which clearly demonstrate the superiority of the
triplet network.
In the original PointNet architecture, an MLP is used for feature learning. It is a deep neural
network formed by a combination of fully connected layers. Due to the dense connections
among nodes in the layers, the number of parameters increases, which results in redundancy
and inefficiency. In the proposed network, a CNN is used in place of the MLP, which enables
weight and parameter sharing and retains spatial information. In a CNN, the required number
of parameters is smaller and the weights are fewer, and hence it is easier to train than
an MLP. It is also computationally more efficient to use a CNN as compared to an MLP. We
have measured the difference in processing time between the original PointNet architecture
and the CPN architecture in terms of their training and inference times. The comparative
results on the Bosphorus and IIT Indore 3D face databases for both the original PointNet
and CPN are given in Table 5.

Table 4 Comparison in terms of recognition rate (%) of the proposed triplet network with the previously proposed Siamese network based technique for 3D face recognition on the Bosphorus and IIT Indore databases

Database                        Recognition rate for     Recognition rate for
                                Siamese network          triplet network
IIT Indore 3D Face (Phase-2)    87.31                    93.10
IIT Indore 3D Face (Phase-3)    92.19                    94.26
Bosphorus Database              98.91                    97.55

Table 5 Time analysis on the Bosphorus and IIT Indore 3D face databases for the triplet network with the original PointNet and the triplet network with CPN

                                Proposed network with PointNet       Proposed network with CPN
                                (processing time in seconds)         (processing time in seconds)
Databases used                  Training time    Inference time      Training time    Inference time
                                per sample       per sample          per sample       per sample
IIT Indore 3D Face (Phase-2)    8.58             0.28                8.40             0.25
IIT Indore 3D Face (Phase-3)    8.65             0.28                8.46             0.25
Bosphorus Database              8.78             0.28                8.51             0.26

The implementation of CPN with 1D convolutions is computationally more efficient than
other existing methods which use 2D CNNs or 3D CNNs for 3D data and result in O(N^2)
or O(N^3) complexity. Since a 1D CNN is used in the proposed network, its time
complexity is just O(N), where N is the number of points present in the input sample. The
proposed triplet network with CPN is a robust model for feature extraction and similarity
learning and provides weight sharing, which reduces the required number of weight parameters.
Overall, the cost of storage, data conversion, preprocessing, and computational time is
substantially reduced in the proposed network.

5 Conclusions

In this paper, we have proposed a novel technique for 3D face recognition using 3D point
cloud data as an input. We have proposed a triplet network which is made up of CPN and
triplet loss. CPN, an enhanced version of PointNet, is a deep learning-based architecture,
specially designed to process the unorganized point cloud data. We have proposed it for
robust feature extraction from input 3D face data. The triplet loss used in the proposed
network is a loss function used in deep neural networks where a baseline (anchor) input
is compared to a positive input and a negative input. Numerous techniques exist for the 3D face
recognition task; however, these techniques mostly use volumetric or range/depth images
and pre-trained networks, which can be less practical in some instances. Moreover, these
techniques are computationally expensive, as they require the conversion of 3D face data
(which is originally in the form of a point cloud) to volumetric or range/depth images before
being used for the recognition task. The input to the proposed network is facial scans in
a point cloud representation, which incurs no extra cost of data conversion. Further, the
point cloud has lower memory requirements as compared to volumetric representations
such as voxel and occupancy grids. We are the first to propose a triplet network for 3D face
recognition by directly using the point cloud data. We have utilized the proposed network
to learn the similarities and dissimilarities in the 3D face data. The triplet loss used in the
network minimizes the distances between the 3D face scans in a genuine pair, whereas it
maximizes the distances between the 3D face scans in an imposter pair. The essence of
the idea is that if two facial embeddings (feature space) are close to each other, then the
corresponding 3D scans belong to the same person. We have also proposed a new matching
technique by computing distance scores based on the squared distances between the 3D face
pairs. We have demonstrated the performance of the proposed network on Bosphorus and
IIT Indore 3D face databases using metrics such as verification rate and ROC plots. We have
achieved a verification rate of 96.25% on the Bosphorus 3D face database, 90.00% on the IIT
Indore 3D face database (phase-2), and 92.75% on the IIT Indore 3D face database (phase-3).
We have shown the possibility of achieving good results even with limited and challenging
data. Compared with other state-of-the-art techniques, the proposed network delivers
a new 3D face recognition technique with encouraging performance improvements. Due to
the huge data requirements and complexity of deep neural networks, it is important to propose
a network which takes less computational time and fewer resources. In our future work, we shall
make an attempt to enhance the proposed network by incorporating fully convolutional
neural networks into PointNet to speed up the calculations.

References

1. Abate AF, Nappi M, Riccio D, Sabatino G (2007) 2D and 3D face recognition: a survey. Pattern Recogn
Lett 28(14):1885–1906
2. Ahmed E, Saint A, Shabayek AER, Cherenkova K, Das R, Gusev G, Aouada D, Ottersten B (2018) Deep
learning advances on different 3D data representations: a survey. arXiv:1808.01462
3. Bagchi P, Bhattacharjee D, Nasipuri M (2016) A robust analysis, detection and recognition of facial
features in 2.5D images. Multimed Tools and Appl 75(18):11059–11096
4. Berretti S, Werghi N, Del Bimbo A, Pala P (2013) Matching 3D face scans using interest points and local
histogram descriptors. Comput Graph 37(5):509–525
5. Bhople AR, Srivastava AM, Prakash S (2020) Point cloud based deep convolutional neural network for 3D face
recognition. Multimed Tools Appl. https://doi.org/10.1007/s11042-020-09008-z
6. Blanz V, Vetter T et al (1999) A morphable model for the synthesis of 3D faces. In: Proceedings of
SIGGRAPH, pp 187–194
7. Borisenko G, Denisov A, Krylov A (2004) A diffusion filtering method for image processing. Program
Comput Softw 30(5):273–277
8. Bowyer KW, Chang K, Flynn P (2006) A survey of approaches and challenges in 3D and multi-modal
3D + 2D face recognition. Comput Vis Image Underst 101(1):1–15
9. Chelali FZ, Djeradi A, Djeradi R (2009) Linear discriminant analysis for face recognition. In:
Proceedings of international conference on multimedia computing and systems, pp 1–10
10. Chouchane A, Ouamane A, Boutellaa E, Belahcene M, Bourennane S (2018) 3D face verification across
pose based on euler rotation and tensors. Multimed Tools Appl 77(16):20697–20714
11. Dorofeev K, Ruchay A, Kober A, Kober V (2019) 3D face recognition using depth filtering and deep con-
volutional neural network. In: Proceedings of applications of digital image processing XLII, vol 11137,
p 111371Y
12. Drira H, Amor BB, Srivastava A, Daoudi M, Slama R (2013) 3D face recognition under expressions,
occlusions, and pose variations. IEEE Trans Pattern Anal Mach Intell 35(9):2270–2283
13. Feng J, Guo Q, Guan Y, Wu M, Zhang X, Ti C (2019) 3D face recognition method based on deep
convolutional neural network. In: Proceedings of smart innovations in communication and computational
sciences, pp 123–130
14. Garcia-Garcia A, Gomez-Donoso F, Garcia-Rodriguez J, Orts-Escolano S, Cazorla M, Azorin-Lopez
J (2016) Pointnet: a 3D convolutional neural network for real-time object class recognition. In:
Proceedings of international joint conference on neural networks, pp 1578–1584
15. Gilani SZ, Mian A, Shafait F, Reid I (2017) Dense 3D face correspondence. IEEE Trans Pattern Anal
Mach Intell 40(7):1584–1598
16. Gupta S, Markey MK, Bovik AC (2010) Anthropometric 3D face recognition. Int J Comput Vis
90(3):331–349
17. He Y, Liang B, Yang J, Li S, He J (2017) An iterative closest points algorithm for registration of 3D
laser scanner point clouds with geometric features. Sensors 17(8):1862
35990 Multimedia Tools and Applications (2021) 80:35973–35991

18. Jayaraman U, Gupta P, Gupta S, Arora G, Tiwari K (2020) Recent development in face recognition.
Neurocomputing. https://doi.org/10.1016/j.neucom.2019.08.110
19. Kang BN, Kim Y, Kim D (2017) Deep convolutional neural network using triplets of faces, deep ensem-
ble, and score-level fusion for face recognition. In: Proceedings of the IEEE conference on computer
vision and pattern recognition workshops, pp 109–116
20. Kim D, Hernandez M, Choi J, Medioni G (2017) Deep 3D face identification. In: Proceedings of IEEE
international joint conference on biometrics, pp 133–142
21. Kingkan C, Owoyemi J, Hashimoto K (2018) Point attention network for gesture recognition using point
cloud data. In: Proceedings of British machine vision conference, pp 118–130
22. Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In:
Proceedings of ICML deep learning workshop, vol 2
23. Koziol Q (2011) HDF5, Encyclopedia of parallel computing. In: Encyclopedia of parallel computing,
pp 827–833
24. Le T, Duan Y (2018) Pointgrid: a deep network for 3D shape understanding. In: Proceedings of IEEE
conference on computer vision and pattern recognition, pp 9204–9214
25. Lei Y, Guo Y, Hayat M, Bennamoun M, Zhou X (2016) A two-phase weighted collaborative
representation for 3D partial face recognition with single sample. Pattern Recogn 52:218–237
26. Leng B, Liu Y, Yu K, Xu S, Yuan Z, Qin J (2016) Cascade shallow CNN structure for face verification
and identification. Neurocomputing 215:232–240
27. Leo MJ, Suchitra S (2018) SVM based expression-invariant 3D face recognition system. Procedia
Comput Sci 143:619–625
28. Li H, Huang D, Morvan JM, Wang Y, Chen L (2014) Towards 3D face recognition in the real: a
registration-free approach using fine-grained matching of 3D keypoint descriptors. Int J Comput Vis
113:128–142
29. Li Y, Wang Y, Liu J, Hao W (2018) Expression-insensitive 3D face recognition by the fusion of multiple
subject-specific curves. Neurocomputing 275:1295–1307
30. Marcolin F, Vezzetti E (2017) Novel descriptors for geometrical 3D face analysis. Multimedia Tools and
Applications 76(12):13805–13834
31. Maturana D, Scherer S (2015) Voxnet: a 3D convolutional neural network for real-time object
recognition. In: Proceedings of IEEE/RSJ international conference on intelligent robots and systems,
pp 922–928
32. Mian A, Bennamoun M, Owens R (2007) An efficient multimodal 2D-3D hybrid approach to automatic
face recognition. IEEE Trans Pattern Anal Mach Intell 29(11):1927–1943
33. Mian AS, Bennamoun M, Owens R (2008) Keypoint detection and local feature matching for textured
3D face recognition. Int J Comput Vis 79(1):1–12
34. Patil H, Kothari A, Bhurchandi K (2015) 3D Face recognition: features, databases, algorithms and
challenges. Artif Intell Rev 44(3):393–441
35. Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: deep learning on point sets for 3D classification and
segmentation. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 652–
660
36. Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: deep hierarchical feature learning on point sets in a
metric space. In: Proceedings of advances in neural information processing systems, pp 5099–5108
37. Rahim R, Afriliansyah T, Winata H, Nofriansyah D, Aryza S et al (2018) Research of face recogni-
tion with fisher linear discriminant. In: Proceedings of IOP conference series: materials science and
engineering, pp 012–037
38. Reji R, SojanLal P (2017) Region based 3D face recognition. In: Proceedings of IEEE international
conference on computational intelligence and computing research, pp 1–6
39. Savran A, Alyüz N, Dibeklioğlu H, Çeliktutan O, Gökberk B, Sankur B, Akarun L (2008) Bospho-
rus database for 3D face analysis. In: Proceedings of European workshop on biometrics and identity
management, pp 47–56
40. Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and
clustering. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 815–823
41. Sharma PB, Goyani MM (2012) 3D face recognition techniques-a review. Int J Eng Res Appl 2(1):787–
793
42. Sharma S, Kumar V (2020) Voxel-based 3D face reconstruction and its application to
face recognition using sequential deep learning. Multimed Tools Appl 79:17303–17330.
https://doi.org/10.1007/s11042-020-08688-x
43. Singh RD, Mittal A, Bhatia RK (2019) 3D convolutional neural network for object recognition: a review.
Multimed Tools Appl 78(12):15951–15995
Multimedia Tools and Applications (2021) 80:35973–35991 35991

44. Soltanpour S, Wu QJ (2016) Multimodal 2D–3D face recognition using local descriptors: pyramidal
shape map and structural context. IET Biometrics 6(1):27–35
45. Soltanpour S, Boufama B, Wu QJ (2017) A survey of local feature methods for 3D face recognition.
Pattern Recogn 72:391–406
46. Taigman Y, Yang M, Ranzato M, Wolf L (2014) Deepface: closing the gap to human-level performance
in face verification. In: Proceedings of IEEE conference on computer vision and pattern recognition,
pp 1701–1708
47. Turk MA, Pentland AP (1991) Face recognition using eigenfaces. In: Proceedings of IEEE computer
society conference on computer vision and pattern recognition, pp 586–591
48. Yu Y, Da F, Guo Y (2018) Sparse icp with resampling and denoising for 3D face verification. IEEE
Trans on Inf Forensics Secur 14(7):1917–1927
49. Zhang Z, Da F, Wang C, Yu J, Yu Y (2019) Face recognition on 3D point clouds. In: Proceedings
of seventh international conference on optical and photonic engineering (icOPEN 2019), vol 11205,
pp 350–355
50. Zhang Z, Da F, Yu Y (2019) Data-free point cloud network for 3D face recognition. arXiv:1911.04731
51. Zhou H, Mian A, Wei L, Creighton D, Hossny M, Nahavandi S (2014) Recent advances on singlemodal
and multimodal face recognition: a survey. IEEE Trans Hum-Mach Syst 44(6):701–716
52. Zulqarnain Gilani S, Mian A (2018) Learning from millions of 3D scans for large-scale 3D face recog-
nition. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 1896–1905

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
