
tend to saturate rather quickly as the volume of the training
set grows significantly.
Recently, there has been a surge of interest in neural networks [18, 20]. In particular, deep and large networks have exhibited impressive results once (1) they have been applied to large amounts of training data and (2) scalable computational resources such as thousands of CPU cores [11] and/or GPUs [18] have become available. Most notably, Krizhevsky et al. [18] showed that very large and deep convolutional networks [20] trained by standard backpropagation [24] can achieve excellent recognition accuracy when trained on a large dataset.
Face recognition state of the art Face recognition error rates have decreased over the last twenty years by three orders of magnitude [12] when recognizing frontal faces in still images taken in consistently controlled (constrained) environments. Many vendors deploy sophisticated systems for border control and smart biometric identification. However, these systems have been shown to be sensitive to various factors, such as lighting, expression, occlusion and aging, which substantially deteriorate their performance when recognizing people in unconstrained settings.
Most current face verification methods use hand-crafted features. Moreover, these features are often combined to improve performance, even in the earliest LFW contributions. The systems that currently lead the performance charts employ tens of thousands of image descriptors [5, 7, 2]. In contrast, our method is applied directly to RGB pixel values, producing a very compact and even sparse descriptor.
Deep neural nets have also been applied in the past to face detection [23], face alignment [26] and face verification [8, 15]. In the unconstrained domain, Huang et al. [15] used LBP features as input and showed improvement when combining them with traditional methods. In our method we use raw images as our underlying representation, and to emphasize the contribution of our work, we avoid combining our features with engineered descriptors. We also provide a new architecture that pushes further the limit of what is achievable with these networks by incorporating 3D alignment, customizing the architecture for aligned inputs, scaling the network by almost two orders of magnitude, and demonstrating a simple knowledge transfer method once the network has been trained on a very large labeled dataset.
Metric learning methods are used heavily in face verification. In several cases existing methods are successfully employed, but this is often coupled with task-specific innovation [25, 28, 6]. Currently, the most successful system that uses a large dataset of labeled faces [5] employs a clever transfer learning technique which adapts a Joint Bayesian model [6] learned on a dataset containing 99,773 images from 2,995 different subjects to the LFW image domain.
main. Here, in order to demonstrate the effectiveness of the
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 1. Alignment pipeline. (a) The detected face, with 6 initial fidu-
cial points. (b) The induced 2D-aligned crop. (c) 67 fiducial points on
the 2D-aligned crop with their corresponding Delaunay triangulation, we
added triangles on the contour to avoid discontinuities. (d) The reference
3D shape transformed to the 2D-aligned crop image-plane. (e) Triangle
visibility w.r.t. to the fitted 3D-2D camera; black triangles are less visible.
(f) The 67 fiducial points induced by the 3D model that are using to direct
the piece-wise affine warpping. (g) The final frontalized crop. (h) A new
view generated by the 3D model (not used in this paper).
features, we keep the distance learning step trivial.
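To make "trivial" concrete, one natural parameter-free choice for such a step is the inner product of L2-normalized feature vectors; the sketch below illustrates that idea and is not necessarily the exact metric used in this system.

```python
import numpy as np

def trivial_similarity(f1, f2):
    """Inner product of L2-normalized descriptors: a deliberately
    simple, parameter-free similarity between two face representations."""
    f1 = np.asarray(f1, dtype=np.float64)
    f2 = np.asarray(f2, dtype=np.float64)
    return float(np.dot(f1 / np.linalg.norm(f1), f2 / np.linalg.norm(f2)))

# Verification then reduces to thresholding this score, with the
# threshold chosen on held-out data (illustrative, not the paper's setup).
```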
2. Face Alignment
Existing aligned versions of several face databases (e.g., LFW-a [28]) help to improve recognition algorithms by providing a normalized input [25]. However, aligning faces in the unconstrained scenario is still considered a difficult problem that has to account for many factors, such as pose (due to the non-planarity of the face) and non-rigid expressions, which are hard to decouple from a person's identity-bearing facial morphology. Recent methods have shown successful ways to compensate for these difficulties by using sophisticated alignment techniques. These methods use one or more of the following: (1) employing an analytical 3D model of the face [27], (2) searching for similar fiducial-point configurations in an external dataset to infer from [4], and (3) unsupervised methods that find a similarity transformation for the pixels [16, 14].
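As an illustration of approach (3), a similarity transformation can be recovered from as few as two fiducial points. The sketch below assumes OpenCV and NumPy; the template eye coordinates and output size are hypothetical placeholders, not values taken from any of the cited methods.

```python
import numpy as np
import cv2

def similarity_align(image, left_eye, right_eye, out_size=152):
    """Warp a face crop with a 2D similarity transform (rotation, scale,
    translation) that moves the detected eye centers onto fixed template
    locations. Template coordinates below are illustrative."""
    # Where the eyes should land in the output crop (hypothetical template).
    dst = np.float32([[0.35 * out_size, 0.40 * out_size],
                      [0.65 * out_size, 0.40 * out_size]])
    src = np.float32([left_eye, right_eye])

    # Fit the 4-DoF similarity transform from the two point pairs.
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(image, M, (out_size, out_size))
```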
While alignment is widely employed, no complete physically correct solution currently exists in the context of unconstrained face verification. 3D models have fallen out of favor in recent years, especially in unconstrained environments. However, since faces are 3D objects, we believe that 3D alignment, done correctly, is the right way. In this paper, we describe a system that uses analytical 3D modeling of the face, based on fiducial points, to warp a detected facial crop to a 3D frontal mode (frontalization).
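The piece-wise affine warping named in Figure 1(f) can be sketched generically: each Delaunay triangle of fiducial points is mapped by its own affine transform. The code below is a minimal sketch of that standard technique, assuming SciPy and OpenCV; it is not the authors' implementation.

```python
import numpy as np
import cv2
from scipy.spatial import Delaunay

def piecewise_affine_warp(src_img, src_pts, dst_pts, out_shape):
    """Warp src_img so that fiducial points src_pts land on dst_pts,
    applying one affine map per Delaunay triangle of the target points."""
    triangles = Delaunay(dst_pts).simplices  # triangulate in the target frame
    out = np.zeros(out_shape, dtype=src_img.dtype)
    for tri in triangles:
        s = np.float32(src_pts[tri])
        d = np.float32(dst_pts[tri])
        # Affine map taking this source triangle to its target triangle.
        M = cv2.getAffineTransform(s, d)
        warped = cv2.warpAffine(src_img, M, (out_shape[1], out_shape[0]))
        # Copy only the pixels that fall inside the target triangle.
        mask = np.zeros(out_shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, np.int32(np.round(d)), 1)
        out[mask == 1] = warped[mask == 1]
    return out
```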
Similar to much of the recent alignment literature, our
alignment solution is based on using fiducial point detectors
to direct the alignment process. Localizing fiducial points
in unconstrained images is known to be a difficult problem,