0% found this document useful (0 votes)
44 views16 pages

Ear Recognition Models: Performance Analysis

This paper evaluates and analyzes ear recognition models, focusing on both descriptor-based and deep-learning techniques, to understand their performance, complexity, and resource requirements. The study reveals that while deep-learning methods show promise for improved recognition, descriptor-based techniques still hold advantages in terms of training efficiency. Additionally, a new dataset, AWE Extended, is introduced, containing 4104 images for further research in this area.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views16 pages

Ear Recognition Models: Performance Analysis

This paper evaluates and analyzes ear recognition models, focusing on both descriptor-based and deep-learning techniques, to understand their performance, complexity, and resource requirements. The study reveals that while deep-learning methods show promise for improved recognition, descriptor-based techniques still hold advantages in terms of training efficiency. Additionally, a new dataset, AWE Extended, is introduced, containing 4104 images for further research in this area.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Neural Computing and Applications

https://doi.org/10.1007/s00521-018-3530-1 (0123456789().,-volV)(0123456789().,-volV)

S.I.: ADVANCES IN BIO-INSPIRED INTELLIGENT SYSTEMS

Evaluation and analysis of ear recognition models: performance,


complexity and resource requirements
Žiga Emeršič1 • Blaž Meden1 • Peter Peer1 • Vitomir Štruc2

Received: 22 December 2017 / Accepted: 4 May 2018


 The Natural Computing Applications Forum 2018

Abstract
Ear recognition technology has long been dominated by (local) descriptor-based techniques due to their formidable
recognition performance and robustness to various sources of image variability. While deep-learning-based techniques
have started to appear in this field only recently, they have already shown potential for further boosting the performance of
ear recognition technology and dethroning descriptor-based methods as the current state of the art. However, while
recognition performance is often the key factor when selecting recognition models for biometric technology, it is equally
important that the behavior of the models is understood and their sensitivity to different covariates is known and well
explored. Other factors, such as the train- and test-time complexity or resource requirements, are also paramount and need
to be consider when designing recognition systems. To explore these issues, we present in this paper a comprehensive
analysis of several descriptor- and deep-learning-based techniques for ear recognition. Our goal is to discover weak points
of contemporary techniques, study the characteristics of the existing technology and identify open problems worth
exploring in the future. We conduct our analysis through identification experiments on the challenging Annotated Web
Ears (AWE) dataset and report our findings. The results of our analysis show that the presence of accessories and high
degrees of head movement significantly impacts the identification performance of all types of recognition models, whereas
mild degrees of the listed factors and other covariates such as gender and ethnicity impact the identification performance
only to a limited extent. From a test-time-complexity point of view, the results suggest that lightweight deep models can be
equally fast as descriptor-based methods given appropriate computing hardware, but require significantly more resources
during training, where descriptor-based methods have a clear advantage. As an additional contribution, we also introduce a
novel dataset of ear images, called AWE Extended (AWEx), which we collected from the web for the training of the deep
models used in our experiments. AWEx contains 4104 images of 346 subjects and represents one of the largest and most
challenging (publicly available) datasets of unconstrained ear images at the disposal of the research community.

Keywords Ear recognition  Covariate analysis  Convolutional neural networks  Feature extraction

1 Introduction ranging from geometric and holistic tech-


niques [2, 4, 11, 54, 56] to more recent descriptor-
Automatic ear recognition represents a subproblem of [6, 8, 33, 36, 42] and deep-learning-based [16–18, 23, 55]
biometrics with important applications in security, methods. While descriptor-based methods have dominated
surveillance and forensics. Many techniques have been the field over the last years, research is moving away from
proposed in the literature for ear recognition systems these methods and is now focusing increasingly on deep-

Vitomir Štruc
& Žiga Emeršič
vitomir.struc@fe.uni-lj.si
ziga.emersic@fri.uni-lj.si
1
Blaž Meden Faculty of Computer and Information Science, University of
blaz.meden@fri.uni-lj.si Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
2
Peter Peer Faculty of Electrical Engineering, University of Ljubljana,
peter.peer@fri.uni-lj.si Tržaška 25, 1000 Ljubljana, Slovenia

123
Neural Computing and Applications

learning-based models, which recently brought about convolutional neural networks (CNNs), are starting to
considerable advancements in various areas of computer dominate the field due to their excellent performance and
vision and beyond. ability to learn image representations (descriptors) directly
When developing ear recognition technology, it is of from the training data. These two groups of techniques
paramount importance to understand how different recog- approach the ear recognition in fundamentally different
nition models behave when applied on challenging data ways.
captured in unconstrained environments where the tech- Descriptor-based techniques, for example, extract iden-
nology is ultimately deployed. Variations across pose, tity cues from local image areas and use the extracted
gender, ethnicity, occlusions and alike are common in these information for identity inference. As emphasized by
environments and may influence the choice of recognition Emeršič et al. [24], two groups of techniques can in general
approach for a particular application. While one approach be considered descriptor based: (1) techniques that first
may result is lower performance on established ear detect interest points in the image and then compute
recognition benchmarks, it may still exhibit robustness to descriptors for the detected interest points and (2) tech-
certain covariates and, hence, be favored over another in niques that compute descriptors densely over the entire
specific circumstances. Similarly, if computing resources images based on a sliding-window approach (with or
are limited or the run-time complexity is important, one without overlap). Examples of techniques from the first
recognition model may be preferred over the other even at group include [3, 9] or more recently [46]. A common
the cost of somewhat lower performance. To facilitate characteristic of these techniques is the description of the
informed choices during the R&D work, it is therefore interest points independently one from the other, which
crucial to have empirical evidence about the properties of makes it possible to design matching techniques with
the available recognition approaches and have insight into robustness to partial occlusions of the ear area. Examples of
their characteristics. techniques from the second group include [5, 10, 31, 52].
While most of the research on ear recognition is focused These techniques also capture the global properties of the
on new detection and enrollment techniques, more elabo- ear in addition to the local characteristics which commonly
rate recognition models and more discriminative repre- results in a higher recognition performance, but the dense
sentations of ear images, studies focusing on the strengths descriptor computation procedure comes at the expense of
and weaknesses of existing techniques are largely missing the robustness to partial occlusions. Nonetheless, recent
from the literature. In this paper, we try to fill this gap and trends in ear recognition favor dense descriptor-based
present a comprehensive analysis of several ear recognition techniques primarily due to their computational simplicity
techniques considered state-of-the-art today. Specifically, and high recognition performance.
we experiment with eight (dense) descriptor-based ear Deep-learning-based methods, on the other had, typi-
recognition techniques and three recent deep-learning- cally process the input images in a holistic manner and
based recognition models and analyze their characteristics learn image representations (features, descriptor) directly
through comprehensive experiments on a challenging from the training data by minimizing some suitable loss at
dataset of ear images, captured in unconstrained settings the output of the recognition model. The most popular
(a.k.a in the wild). We aim at identifying which factors (or deep-learning models, CNNs, commonly process the data
covariates) influence the recognition techniques the most through a hierarchy of convolutional and pooling layers
and, hence, contribute the greatest to recognition errors, that can be seen as stacked feature extractors and once
what kind of computational complexity is induced by the fully trained can be used to derive highly discriminative
recognition techniques during train and test time and what data representations from the input images that can be
resources are required to run the selected techniques. A exploited for identity inference. While these representa-
detailed understanding of these factors is extremely tions commonly ensure formidable recognition perfor-
important not only because it allows us to devise more mance, the CNN-training procedure typically requires a
effective recognition techniques, but because it helps to large amount of training data, which may not always be
identify future research trends in this area as well. available and is not needed with descriptor-based meth-
ods. In the field of ear recognition, deep-learning-based
1.1 State-of-the-art recognition models methods appeared only recently [16–18, 23, 55], but were
already shown to outperform local descriptor-based
Most of the existing surveys on ear recognition, methods (see [23]).
e.g., [1, 24, 41] identify descriptor-based recognition We contribute to a better understanding of these meth-
techniques as the state of the art in this field. However, as ods in this work by conducting an analysis of some of the
shown by more recent group evaluations and challenges, key characteristics of both groups of techniques.
e.g, [23] deep-learning methods, and in particular

123
Neural Computing and Applications

1.2 Contributions and paper organization required? Answers to questions like these make it possible
to target weak points of the existing technology and pro-
We make several important contributions in this paper. A vide directions for future research.
short list with brief summaries is given below: In the field of biometric ear recognition, some of the
questions outlined above are (partially) discussed in recent
• Experimental evaluation We conduct a comprehensive
survey papers, such as [1, 24, 41, 47], where structured
experimental evaluation of several state-of-the-art ear
comparisons of existing ear recognition techniques are
recognition techniques on a challenging dataset of ear
presented. The comparisons in these papers are based on
images gathered from web with the goal of studying
previously reported results and summarize recognition
unconstrained ear recognition. As part of the evalua-
experiments on different datasets with different experi-
tion, we perform a comparative assessment of recent
mental protocols. While general trends about the
descriptor- and deep-learning-based ear recognition
advancement of ear recognition technology over the years
techniques and investigate their robustness by studying
are presented and some of the strengths and weaknesses are
the impact of various covariates, such as ethnicity, head
identified, no detailed information about the performance
rotation (in terms of yaw, roll and tilt angles), gender
of the existing techniques with respect to different
and presence of occlusions and accessories.
covariates is given.
• Dataset To be able to train the deep-learning-based
Similarly to our work, the survey by Emeršič et al. [24]
recognition models, we gather a new dataset of
also presents a comparison of some descriptor-based fea-
unconstrained ear images from the web and make it
ture extraction techniques from the literature using a
publicly available to the research community. The new
challenging dataset and predefined experimental protocol.
dataset, named Annotated Web Ears Extended
However, here we focus on the impact of different
(AWEx), contains 4104 images of 346 subjects and to
covariates on the recognition performance of the tested
the best of our knowledge represents one of the most
techniques and include analysis of CNN-based approaches
challenging and largest datasets of this kind available
as well. Also, in this paper we provide the analysis of
for research purposes. The dataset can be downloaded
important model characteristics such as time and space
from: http://awe.fri.uni-lj.si/.
complexity.
• Analysis We present an extensive analysis of the
Pflug and Busch [41] compare the performance of var-
evaluated recognition techniques in terms of recogni-
ious texture and surface descriptors for ear biometrics, but
tion performance, computationally complexity and
different from our work uses the descriptors in combination
resource requirements and thus contribute to new
with subspace projection techniques. The reported experi-
knowledge in the field and a better understanding of
ments are conducted on a dataset of ear images with lab-
their characteristics.
oratory-like quality, but no ablation study is presented.
The rest of the paper is structured as follows. In Sect. 2, we The study from [44] is likely the closest to our work as
review existing work related to our paper and further far the evaluation of descriptor-based ear recognition
motivate our analysis. We describe our experimental setup techniques with respect to different covariates is con-
and the ear recognition techniques considered in this work cerned. However, the focus here is on only on image
in Sect. 3 and introduce the experimental dataset and characteristics, such as noise and blurring, and not on ear-
protocol in Sect. 4. We present the results of our analysis related covariates and other factors such as in our work.
and discuss its implications in Sect. 5. We conclude the Furthermore, since deep-learning models were not yet
paper with some final comments and directions for future applied to ear recognition at the time of writing of [44], the
work in Sect. 6. study does not include the most recent and promising ear
recognition models.
The analysis presented in this work builds on the pre-
2 Motivation and related work liminary version from [19], but extends the study to deep-
learning models, new model characteristics and novel
Understanding the characteristics of biometric recognition aspects that were not considered before.
technology is of considerable importance to the advance-
ment of the field and key for researchers in this area. What
properties of the input data make the recognition process 3 Methodology
difficult? What properties make it is easy? Are certain
techniques better suited for specific data characteristics In this section, we present the methodology used for our
than others? What is the computational complexity of the analysis. We start the section with a description of the
recognition techniques? What kind of resources are experimental setup used and then describe the descriptor-

123
Neural Computing and Applications

and deep-learning-based recognition techniques then both setups produce image representations (feature
considered. vectors) y 2 Rd from the input images as:
y ¼ f ðxÞ; ð1Þ
3.1 Experimental setup
where f ðÞ is a feature extraction function.
To analyze the characteristics of different recognition For this work, we consider eight descriptor-based
models, we use an identification pipeline as illustrated in recognition approaches and three deep CNN models for our
Fig. 1. Here, the input images are first subjected to a fea- experiments based on either their reported performance for
ture extraction technique that converts the input RGB ear ear recognition [24, 41] or their popularity within the
images into a discriminative representation by exploiting research community [25]. We describe the considered
either a deep CNN model (marked as scenario A in Fig. 1) approaches in the following two sections.
or a descriptor computation procedure (marked as scenario
B in Fig. 1). Once the representation is computed for a 3.2 Descriptor-based ear recognition techniques
given test images, identity inference is conducted based on
the cosine similarities with a predefined set of gallery For the descriptor-based methods, we consider dense
images. descriptor computation and generate the d-dimensional
The main difference between the experimental setups feature vectors needed for recognition from grayscale
(i.e., scenarios A and B) for the deep-learning and converted input images (see Fig. 1—setup B). Specifically,
descriptor-based pipelines is that for setup A we start by we implement methods based on local binary patterns
using train and validation sets to learn the parameters of (LBPs [7, 24, 26, 43, 45]), (Rotation Invariant) Local
our CNN models, while for setup B no training is needed. Phase Quantization Features (RILPQ and LPQ [39, 40]),
If we denote the input images into our pipeline as x 2 Rn , binarized statistical image features (BSIF [24, 30, 43]),
Histograms of Oriented Gradients (HOGs, [12, 13, 24, 43]),

Fig. 1 The identification pipeline used in our experiments. We procedure involving train and validation sets is required (case A),
employ the same pipeline for descriptor-based and deep-learning- whereas for the descriptor-based technique no training is needed (case
based recognition techniques. In both cases, features are extracted B). After the feature extraction step, the procedure is the same for
from the input images using either dense descriptor computation or both scenarios, (A) and (B). Each extracted test feature vector is
CNN models. The main difference between the two approaches is in matched against all gallery feature vectors using the cosine similarity,
the experimental setup (A or B), where for the CNN models a training and the ID of the most similar gallery vector is returned as the output

123
Neural Computing and Applications

the Dense Scale-Invariant Feature Transform 3.2.2 (Rotation Invariant) Local Phase Quantization
(DSIFT, [15, 24, 31]), Gabor wavelets [14, 24, 33, 34,
49–51] and Patterns of Oriented Edge Magnitudes Local Phase Quantization (LPQ) features [39] are very
(POEM, [24, 52]) for the analysis. similar in essence to LBPs, as the local image texture is
again encoded using binary strings, and histograms are
3.2.1 Local binary patterns again computed from the binary strings of local image
blocks and concatenated into the final representation of the
Local binary patterns (LBPs) represent powerful texture given image. LPQ features are computed from the Fourier
descriptors that achieved competitive recognition perfor- phase spectrum of an image and are known to be invariant
mance in various areas of computer vision [45]. The use of to blurring under certain conditions. This feature makes
the LBP descriptor for ear recognition is mainly motivated LPQs an attractive alternative for ear recognition (see,
by its computational simplicity and the fact that the texture e.g., [43]), where blurred and low-resolution images rep-
of the ear is highly discriminative. Many successful ear resent a problem for the existing technology.
recognition techniques have been presented in the literature With LPQ, the local neighborhoods of every pixel in the
exploiting LBPs either as stand-alone texture representa- image are first transformed into the frequency domain
tions or in combination with other techniques, using a short-term Fourier transform. Local Fourier coef-
e.g., [7, 26, 43]. ficients are computed at four selected frequency points, and
LBPs encode the local texture of an image by generating the local phase information contained in these (complex)
binary strings from circular neighborhoods of points coefficients is then encoded. Here, a similar quantization
thresholded at the gray-level value of their center pixels. scheme is used as in iris recognition systems, where every
The generated binary strings are interpreted as decimal complex Fourier coefficient contributes two bits to the final
numbers and assigned to the center pixels of the neigh- binary string. The result of this coding procedure is a 8-bit
borhoods. The number of sampling points P used to gen- binary string for every pixel in the image from which the
erate the binary strings depends on the radii R of the local 256-bin histograms are computed and later concate-
circular neighborhoods and results in the following nated into a global descriptor of the image.
encoding [45]: An extension of this technique to Rotation Invariant
X
P1 Local Phase Quantization (RILPQ) features was presented
LBPP;R ¼ 2p sðgp  gc Þ; ð2Þ in [40]. The idea here is similar to the original LPQ tech-
p¼0 nique with the difference that a characteristic orientation is
first estimated for the given local neighborhood, and then,
where LBPP;R stands for the computed binary pattern of
this orientation is used to compute a directed version of the
some center pixel, gc and gp denote the gray-level values of
binary descriptor. The binary descriptor is computed with
the center pixel and the pth pixel from the neighborhood,
the same procedure as the original LPQ, but every local
respectively, and the thresholding function sðÞ stands for:
neighborhood is first rotated in accordance with its char-

1 if x  0; acteristic orientation. RILPQ descriptors are not only blur
sðxÞ ¼ ð3Þ
0 otherwise: invariant, but also exhibit a certain degree of robustness
toward image rotation.
In practice, not all binary patterns returned by Eq. (2) are
useful for texture representation. Typically, only binary 3.2.3 Binarized statistical image features
strings with at most two bitwise transitions from 0 to 1 (or
vice versa) are considered in the final descriptor. For a 8- Binarized statistical image features (BSIF) [30] represent a
pixel neighborhood and a consequent 8-bit binary string, more recent tool for texture description. Here, binary
for example, exactly 58 such patterns (called uniform strings (encoding texture information) are again con-
patterns) can be computed. Most methods exploiting LBPs structed for each pixel in the image, but this time by pro-
with a 8-pixel neighborhood for texture description, jecting image patches onto a subspace, whose basis vectors
therefore, compute 59-bin histograms from local image are learned from natural images. The subspace coefficients
blocks and then concatenate the computed histograms over are then binarized using simple thresholding. This proce-
all blocks into a global texture descriptor (our d-dimen- dure is equivalent to filtering the input image with a
sional feature vector x) that can be used for recognition. A number of pre-learned filters and binarizing the filter
similar procedure is also used in our experiments in responses at each pixel location. Each filter contributes 1
Section 5. bit to the binary string of a pixel making the length of the
binary string dependent on the number of filter used.
Similar to LBP and LPQ, the binary string of each pixel is

123
Neural Computing and Applications

interpreted in decimal form and a global histogram-based represent the local neighborhood around the detected
representation (our d-dimensional feature vector x) is keypoints. As indicated in the introductory section, early
constructed for the given images by concatenating his- techniques to ear recognition relied on the SIFT keypoint
tograms constructed from smaller image blocks. detector as well as the SIFT descriptor, e.g., [3, 15]
The main characteristic that makes BSIF features so and [9], and, therefore, demonstrated a high degree of
appealing is the fact that the binary strings are not con- robustness toward partial occlusions.
structed based on heuristic operations, but on the basis of More recent techniques, on the other hand, compute
statistics of natural images. The idea behind BSIF-based dense SIFT (DSIFT) representations from the images and
texture description is in line with recent feature learning do not rely on the keypoint detector. Here, the keypoints
approaches, which produced competitive results for many are simply arranged uniformly into a grid that is placed
computer vision problems in recent years. The use of BSIF over the image. Techniques based on DSIFT
features for ear recognition was advocated by Pflug (e.g., [31, 37]) have reported excellent recognition perfor-
et al. [43], where excellent performance was reported. mance as well as robustness to partial occlusions similar to
techniques based on the original SIFT formulation. We
3.2.4 Histograms of Oriented Gradients evaluate a DSIFT-based technique in the experimental
section and, thus, discuss here only the SIFT keypoint
Descriptors exploiting Histograms of Oriented Gradients descriptor. The reader is referred to [35] for a detailed
(HOGs) were originally introduced for the problem of description of the keypoint detection procedure.
human detection by Dalal and Triggs [12], but have since The SIFT descriptor shares similarities with the HOG
been successfully applied to various fields of computer descriptor. For every point of interest, SIFT considers a
vision, including ear recognition [13, 43]. HOG descriptors local neighborhood of 16  16 pixels. This neighborhood
have excellent texture description properties and are con- is partitioned into subregions of 4  4 pixels, and for each
sidered robust toward moderate illumination changes. This subregion, an 8-bin histogram is computed based on the
fact makes them highly suitable for problems, such as ear orientations and magnitudes of the image gradient in that
recognition, where illumination-induced variability is one subregion. The gradients are also weighted by a Gaussian
of the major problems. function to give more importance to image gradients closer
HOGs are computed based on a simple procedure. The to the point of interest and normalized by the dominant
computation starts by calculating the gradient of the image gradient orientation to achieve rotation invariance. The
using 1-dimensional convolutional masks, i.e., ½1; 0; 1 final dimensionality of the SIFT descriptor is 128 for a
and ½1; 0; 1T . In the next step, the image is divided into a single keypoint, so care needs to be taken when computing
number of cells and compact histograms of quantized DSIFT representations from the image. The dimensionality
gradient orientations are computed for each cell. Here, a of final feature vector can easily become computationally
voting procedure is used during histogram construction, so prohibitive if too many grid points are chosen for DSIFT
that pixels with higher gradient magnitudes contribute calculation.
more to the histogram bins than pixels with lower magni-
tudes. Neighboring cells are then grouped into larger 3.2.6 Gabor wavelets
blocks and normalized to account for potential changes in
contrast and illumination. This normalization procedure is 2D Gabor wavelets were originally introduced by Daug-
applied in a sliding-window manner over the entire image man [14] for the problem of iris coding, but due to their
with some overlap between neighboring blocks. Ulti- ability to analyze images at multiple scales and orienta-
mately, all normalized histograms are concatenated into the tions, they have been successfully employed in other
final HOG descriptor (our feature vector x) that can be used problem areas as well. In the spatial domain, Gabor
for matching and recognition. wavelets are defined with the following expression[49, 50]:
2 
f2
fu2  cu2 x02 þgu2 y02 j2pfu x0
f

3.2.5 Dense Scale-Invariant Feature Transform wu;v ðx; yÞ ¼ e e ; ð4Þ


pcg
The Scale-Invariant Feature Transform (SIFT), introduced where
by Lowe in [35], represents one of the most successful
techniques for image description in computer vision. The x0 ¼ x cos hv þ y sin hv ;
ð5Þ
original approach to SIFT calculation includes both a y0 ¼ x sin hv þ y cos hv
keypoint detector, capable of finding points of interest in an
and the parameters fu and hv represent the center frequency
image, as well as a local descriptor that can effectively
and orientation of the complex sinusoidal from Eq. (4),

123
Neural Computing and Applications

respectively. c and g define the ratio between the center 3.3 Deep-learning-based ear recognition models
frequency and the size of the Gaussian and ensure that all
generated wavelets share some specific properties [51]. For We use the deep-learning-based recognition models as
feature extraction, a family of wavelets is typically created black-box feature extractors in this work and exploit the
and used to extract features from the processed image. This image representations produced by one of the final layers
family commonly consist of wavelets of 5 scales and 8 (i.e., one of the layers before the softmax) of the models as
orientations, i.e., f0 ; f1 ; :::; f7 and h0 ; h1 ; :::; h4 . the d-dimensional feature vectors (descriptors) of the input
To extract Gabor features from an image, the image is images needed for recognition (see Fig. 1—setup A). We
convolved with the entire family of Gabor wavelets (fil- consider three different CNN models for our analysis,
ters), and the magnitude responses of the convolution which cover some of the most popular architectures for
outputs are retained (the phase responses are discarded), recognition networks from the literature, i.e., ResNet [27],
down-sampled and concatenated into a global feature SqueezeNet [29] and the VGG network from [48].
vector encoding multi-resolution, orientation-dependent
texture information of the input image. 3.3.1 VGG network
Techniques based on the outlined procedure and its
modifications (e.g., using log-Gabor wavelets) are among The VGG network (or model) [48] is a representative of
the most popular techniques for ear recogni- so-called very deep CNN models and in the most common
tion [33, 34, 36, 38, 53]. Their advantages lie in their configuration comprises a total of 16 network layers (VGG-
excellent discriminative properties; however, Gabor fea- 16). The VGG model has been successfully applied to
tures are computational relatively complex to compute, as numerous recognition problems, including ear recognition
the input image needs to be filtered with an entire family of (see, e.g., [17, 18, 22, 23]), and has been shown to ensure
filters. state-of-the-art performance of challenging ear datasets.
The main characteristic of the model is the use of sev-
3.2.7 Patterns of Oriented Edge Magnitudes eral consecutive convolutional layers with small 3  3 fil-
ters. The consecutive stacks of 3  3 convolutional layers
Patterns of Oriented Edge Magnitudes (POEM) [52] rep- are able to capture the same information as the larger filters
resent another popular approach to texture description that used in older model architectures, such as AlexNet [32],
combines ideas from LBP and HOG descriptors as well as but require significantly less parameters that need to be
Gabor wavelets. estimated during training. The 3  3 filter stacks are
The POEM construction procedure starts by computing interspersed with max-pooling layers which reduce the
the gradient of the input image and building magnitude- dimensionality of the activation maps produced by the
weighted histograms of gradient orientations for every model layers. The convolutional part of the VGG model is
pixel in the image. This histogram is computed from local followed by three fully connected layers with 4096, 4096
pixel neighborhoods referred to by the authors as cells. In and 1000 channels, respectively. Finally, a softmax layer is
this regard, POEM shares similarities with the HOG used at the top of the model to facilitate training. For our
descriptor, which also relies on gradient directions to experiments, we perform network surgery on the VGG
encode an image, but different from HOG, POEM com- model and use the d ¼ 4096 dimensional output of the
putes the histograms densely in a sliding-window manner penultimate fully connected layer (fc7) as the feature vector
over the entire image. After this step, every pixel in the of the VGG model.
image is represented by a local histogram of quantized
gradient orientations, or in other words, the image is 3.3.2 SqueezeNet
decomposed into m oriented gradient images, where m is
the number of discrete orientations of the local histograms. SqueezeNet (SNet in the experimental section) represents a
Each of these images is then encoded using the LBP recent CNN model which was shown to ensure AlexNet-
operator, and a global image descriptor is constructed by level accuracy with 50 fewer parameters [29]. The model
concatenating all block histograms computed from the was first introduced to the field of ear recognition in [22]
oriented gradient images. with highly competitive results.
The POEM descriptor has demonstrated impressive SqueezeNet builds on ideas from residual net-
performance for face recognition [52] and exhibits desir- works [27, 28], but additionally introduces so-called
able properties, such as orientational selectivity, robustness squeeze layers (i.e., convolutional layers with 1  1 con-
to moderate illumination changes and low-computational volutions) that serve as bottlenecks of the CNN architec-
complexity, which make it appealing for image represen- ture and aim at further reducing the parameter space of the
tation in ear recognition systems.

123
Neural Computing and Applications

overall model. The network exploits a few additional semiautomatic two-step procedure [20, 21, 24]. In the first
design principles: (1) replacement of part of the 3  3 fil- step, candidate images for the dataset were collected from
ters in the convolutional layers with 1  1 filters, (2) the web using web crawlers that looked for appropriately
postponing the down-sampling steps to later stages in the tagged imagery on Flickr and Google’s image search. The
network so that convolutional layers have large activation candidate images were then manually screened and curated
maps [29] and (3) network pruning. The result of these in the second step to ensure that ears were indeed present in
design choices is a model that has significantly less all images. This approach ensured that the appearance
parameters to tune than competing models and can be variability of the images was not artificially reduced
trained on relatively small amounts of data. This means through automatic ear detection techniques and resulted in
that the model has a small data footprint and that the a challenging dataset of ear images captured in uncon-
training is much faster in comparison with the original strained settings [24]. This real-life variability is also, to
AlexNet implementation. For our experiments, we use the the best of our knowledge, the biggest advantage compared
output of the model layer preceding softmax as the d ¼ to the other ear datasets available. This ensures that the
86; 527 dimensional feature vector of the SqueezeNet results of the experiments based on this dataset are to the
model. large extent applicable to real-life scenarios.
The images of the AWE dataset contain ground truth
3.3.3 Residual network: ResNet50 annotations in terms of gender, extent of head pitch, roll
and yaw rotations, ethnicity and presence of occlusions and
Residual networks (ResNets or RNets, [27]) belong to a thus provide a perfect starting point for our analysis. The
recent class of deep models that introduced shortcut (or labels/annotations were assigned to the images by a trained
skip) connections into CNN models. These connections annotator and validated by the authors of the dataset.
represent identity shortcuts that bypass some of the con- Because the image acquisition procedure was not con-
volutional layers and forward information from the lower trolled, each image from the dataset typically exhibits
to the higher model layers. This ensures that no informa- variations across several attributes (e.g., large pitch, roll
tion is lost along the network, but also facilitates training of and yaw angles at the same time) and is annotated with
deeper models, i.e., models with a larger number of layers. multiple labels, so attribute cross-talk effects need to be
The added value of the shortcut connections during model taken into account when interpreting the results presented
training is that they also serve as shortcuts for the back- in the results section. The distribution of the individual
propagation algorithm used to learn the model parameters label categories is presented in Fig. 2.
and hence make sure that gradients do not vanish down the To assess the impact of the different covariates, we
network. From an architectural point of view, the ResNets conduct identification experiments with the AWE dataset
are similar to the VGG model and exploit small 3  3 filter and report the rank 1 recognition rate (Rank-1) and the
stacks in their convolutional layers to keep the number of (normalized) area under the cumulative match score curves
parameters that need to be learned during training low. In (AUC) when presenting results. For each of the experi-
general, residual networks may feature several hundreds of ments, the probes consist of all images with a specific label
model layers; however, for this work we use the standard (e.g., severe head yaw), while the galleries represent all
ResNet50 architecture. images from the AWE dataset. With this setup, the gallery
Residual networks have, to the best of our knowledge, size is fixed for all experiments, while the number of
not been used before in the field of ear recognition, but probes (and consequently number of conducted identifica-
have due to their performance in other areas been selected tion experiments) depends on the label distribution (shown
for our analysis as well. Similar to VGG and SqueezeNet, in Fig. 2) and differs from experiment to experiment.
we use the output of the last layer before the softmax as the Related labels are merged for the experiments to ensure
d ¼ 2048 dimensional feature vector of the ResNet model. sufficient numbers of samples for each experiment, e.g.,
mild head yaw from both left and right, are merged into
one group of mild yaw, the same for the severe yaw rota-
4 Experimental dataset and protocol tion and the other head rotations (roll and pitch).
For the descriptor-based feature extraction methods, we
For our analysis, we conduct identification experiments on use the implementations that ship with AWE toolbox and
the introduced Annotated Web Ears Extended (AWEx) make no change to the default parameters, which are
dataset in accordance with the pipeline described in described in detail in [24].
Sect. 3.1 (see Fig. 1 for details). Our experimental dataset Since the considered deep models need to be trained
contains 1000 ear images of 100 subjects (with 10 images before they can be exploited for feature extraction, we
per subject) and was gathered from the web with a collect a novel dataset of ear images from the web, using

123
Neural Computing and Applications

9% 7% 7%
11%

28%
Gender Ethnicity Occlusions
21%
61%
65%

91%

Female Male Caucasian Asian Black Hispanic None Mild Severe

2% 2% 3% 2%
15% 14%
15% 18% 16%
25%

Head Roll Head Pitch Head Yaw


25%
31%

63% 54% 9% 6%

To Right ++ To Right + Up ++ Up + Frontal Le Profile Le


Neutral Neutral Down + Down ++ Profile Right Middle Right Frontal Right

Fig. 2 The graphs show the distribution of covariates (labels) of the image in the dataset. Accessories are not shown explicitly here, but
images of the AWE dataset. The dataset contains 1000 images of 100 from the 1000 AWE images, 91% have no accessories, 8% have some
subjects. Gender and ethnicity are labeled on a per subject basis, accessories, and 1% (or 9 images) has a significant amount of
whereas occlusions, head roll, head pitch and head yaw vary for each accessories

the same procedure as used for the AWE dataset. In total, • scale increase/decrease of up to 30%.
we gather 3104 additional images belonging to 246 sub- As an additional contribution, we merge the newly col-
jects that were not present in the original AWE data. We lected images with the original AWE dataset of ear images
split these images into training (2173 images) and valida- into the Annotated Web Ears Extended (AWEx) dataset,
tion sets (931 images) and use the data splits to train our which now contains a total of 4104 images of 346 subjects
deep models. The models are trained on a desktop PC with captured in completely unconstrained environments and
Intel Core i7-6700K CPU, 32 GB of RAM and Nvidia makes the dataset publicly available to the research com-
Titan Xp until convergence. We set the learning rate to munity through: http://awe.fri.uni-lj.si/. Some example
0.001 and the weights decay rate to 0.001 divided by the images from the new dataset are shown in Fig. 3.
number of epochs. To avoid over-fitting, we use a high
dropout rate of 0.1 and introduce random perturbations of
100-fold the training data (each image resulted in new 100 5 Experiments and results
perturbated images), where the data transformations are
performed (or not) with a 50% chance. Below is a list of In this section, we present the results of our analysis. We
the perturbations. first describe experiments aimed at analyzing the sensitiv-
• horizontal flipping, ity of the recognition approaches toward various covari-
• trimming 0–10% of images on each side, ates, then present a comparative assessment of the tested
• Gaussian blurring with r 0–3.0, methods and finally explore the time and space complexity
• addition of Gaussian noise with scale 0–0.2, of the recognition models.
• brightness reduction/increase of pixel intensities by a
value of 10 (over all color channels or over a single 5.1 Sensitivity analysis
channel),
• contrast increase/decrease of up to 50% (over all color In our first experiments, we evaluate the sensitivity of all
channels or over a single channel), 11 recognition approaches to the following covariate fac-
• rotation of up to 20 in both directions, tors: gender, presence of accessories, occlusions, ethnicity,

123
Neural Computing and Applications

subgroup. Surprisingly, occlusions which consist mostly of


hair have a limited impact on performance. The reason for
this, we argue, is that the occlusions are more or less
consistent throughout all ear images for a selected subject.
The presence of accessories, on the other hand, has a
considerable (negative) effect on the recognition perfor-
mance of all techniques, which again is reasonable, as this
type of occlusion varies significantly from image to image.
Although the impact of accessories requires a more in-
depth analysis, we presume that the performance drop can
be attributed to the fact that samples that fall into this
category contain large hearing aids, headphones or some
large ornaments, which may not be present in the gallery
images. The Rank-1 recognition rates of 0, 4, 11.1 and
22.2% for DSIFT, HOG, LPQ, RILPQ, Gabor wavelets,
LBPs, POEM, VGG and RNet need to be interpreted with
reservation since only 8 samples were available for this
experiment. Nevertheless, we believe that the low perfor-
mance still shows to a trend with respect to the perfor-
mance of ear recognition models in the presence of large
Fig. 3 Sample images from the AWEx dataset
accessories.
head pitch, head roll and head yaw. The goal of these
5.2 Comparative evaluation
experiments is to establish how the recognition models
behave in general when confronted with data of different
In the second series of experiments, we compare all con-
characteristics. We are not interested in the performance
sidered techniques and explore how different covariates
and sensitivity of individual models, but in general trends
effect the performance of individual recognition models. In
that can be seen over all tested techniques, the sensitivity of
Fig. 5, a comparison of the Rank-1 recognition rates is
individual methods will be the focus of the next section. A
presented for all assessed techniques and all considered
comparison of the Rank-1 recognition rates for this series
covariates in the form of radar graphs. Here a larger area
of experiments together with the corresponding AUC val-
covered by the graphs suggests a better performance.
ues is presented in Fig. 4 and quantitatively in Table 1.
The graphs show that among the descriptor-based
Note that the AUC values in Fig. 4 are size-encoded, where
methods DSIF, LPQ and RILPQ are overall slightly infe-
a bigger circle indicates a higher AUC value.
rior to the other methods, which all perform similarly. In
The results show that severe head rotations, especially
terms of robustness, the POEM-based approach seems to
roll, negatively impact the identification performance.
be a little more stable than the other techniques as the
Large pitch rotations also have a detrimental effect on
outline of the graph is most stable with this approach. All
performance, whereas yaw angles seem to be less crucial
in all, the descriptor-based methods exhibit similar
for the performance of the tested methods. These results
behavior when confronted with different covariates point-
can be explained through the geometrical properties of the
ing to the need to improve the recognition technology in
ears, which can are mostly flat (with wrinkles). Since all
specific areas (e.g., in the presence of large accessories,
ear images are usually resized to a fixed input size prior to
under severe head rotations).
feature extraction, this pre-procedure (partially) compen-
Among the deep-learning-based model, SNet is clearly
sates for head rotations that only change the viewing angle
the top performer and also the most robust among the
of the ears (e.g., yaw), but do not result in orientation
CNN-based approaches. The worst-performing deep-
changes. To compensate for other head rotations (e.g., roll),
learning model is RNet, which is outperformed by the all
additional alignment steps would need to be considered in
other tested models. These results can be attributed to the
the recognition pipeline.
number of parameters that need to learned for the deep
Among the considered covariates, gender and ethnicity
models and the still relatively modest (from the per-
have the smallest impact on identification performance—
spective of deep learning) amount of ear images available
the results for all subgroups of these covariates are very
for training. Here, SNet has the clear advantage, as it
close, while the minor performance differences are likely a
contains the lowest number of open parameters among the
consequence of the different number of probes in each
deep models. The CNN models show similar sensitivity

123
Neural Computing and Applications

Fig. 4 The performance plots show a comparison of eleven state-of- comparison. The small number of images also contributes to the
the-art ear recognition techniques with respect to different covariates. higher Rank-1 recognition rates for women compared to men. Large
The plots show Rank-1 recognition rates in percent [%] (y-axis) and pitch, roll and yaw (head) rotations show a negative impact on the
relative AUC values (circle size: where the smallest AUC value performance of all assessed techniques. The biggest impact is
among the AUC values was set as the smallest, still visible dot, and observed with large accessories, but the results for this test are
the largest value was set as the largest, visually acceptable circle). generated with a small number of probe images. The results are best
Due to the small number of subjects for the Hispanic ethnicity viewed in color; the darker tones in each plot denote deep-learning-
subgroup, the results for this subgroup were omitted from the based approaches

characteristics as the descriptor-based methods, but the rotations, we can see that small differences can be
sensitivity is less pronounced. For example, if we look at observed, but these are minimal when compared to the
the performance differences with respect to head local techniques.

123
123
Table 1 Comparative assessment of the eight descriptor-based and three deep-learning-based techniques considered in this work
Perf. metric Rank-1 (%) AUC (%)
Method BSIF DSI Gab HOG LBP LPQ PO RIL RNet SNet VGG BSIF DSI Gab HOG LBP LPQ PO RIL RNet SNet VGG

Female 55.6 45.6 52.2 47.8 54.4 50.0 47.8 43.3 37.6 62.4 60.0 90.3 88.8 93.4 89.8 92.8 90.6 90.8 90.1 91.8 93.6 91.8
Male 43.1 39.9 42.4 46.2 39.5 38.5 46.8 38.1 29.5 52.6 43.7 89.5 87.5 93.6 92.2 87.8 88.8 89.9 86.3 84.8 89.4 86.7
Asian 45.2 46.2 42.9 46.2 41.0 41.4 48.6 43.8 30.1 54.2 43.8 92.4 89.6 94.1 91.9 89.8 91.3 91.5 87.9 88.6 90.7 87.4
Caucasian 43.4 38.0 42.5 45.9 40.2 38.2 45.7 36.9 30.7 53.6 44.9 88.4 86.9 92.9 92.1 87.4 87.7 89.5 86.5 85.2 89.4 86.7
Black 40.9 38.2 46.4 42.7 37.3 40.9 42.7 32.7 28.4 47.1 45.5 89.1 85.9 94.5 90.1 86.5 88.1 86.7 83.7 85.2 88.2 88.2
Accessories / 44.1 41.1 43.7 47.0 40.4 39.7 47.4 39.3 31.0 52.9 44.1 89.7 87.9 93.8 92.1 88.4 89.1 90.1 86.7 85.5 89.8 86.9
Accessories ? 47.4 37.2 42.3 43.6 48.7 42.3 44.9 34.6 33.5 62.9 61.0 88.4 86.3 91.8 91.4 87.6 87.7 89.3 87.4 87.7 93.8 92.6
Accessories ?? 22.2 0.0 11.1 0.0 11.1 0.0 11.1 0.0 4.0 29.1 0.0 82.9 72.3 82.8 80.5 82.0 85.7 83.6 80.3 69.7 76.9 74.8
Pitch / 43.8 41.8 43.1 50.5 42.0 41.1 49.2 40.5 30.3 55.8 45.8 90.2 88.6 93.8 93.4 88.5 89.5 90.5 87.0 83.6 88.7 86.2
Pitch ? 46.1 40.6 45.3 43.1 41.4 39.2 46.3 37.9 29.9 54.2 44.9 89.2 87.3 93.6 91.1 88.5 88.8 90.3 86.6 83.6 88.7 86.3
Pitch ?? 33.3 23.5 29.4 27.5 23.5 25.5 27.5 23.5 29.8 43.5 37.7 85.5 79.3 90.5 84.0 83.5 84.3 81.7 83.4 86.9 93.9 89.0
Roll / 48.5 43.7 47.0 50.6 45.0 42.1 50.6 42.9 30.3 54.3 46.1 91.3 89.7 94.6 94.0 90.1 90.9 91.8 87.9 87.1 89.8 88.0
Roll ? 38.9 37.1 39.5 41.3 35.2 37.7 43.1 32.8 29.8 51.4 43.5 88.0 85.2 92.2 89.7 86.4 86.9 88.8 85.8 82.1 89.8 86.4
Roll ?? 23.3 18.6 18.6 23.3 23.3 16.3 23.3 20.9 31.3 57.5 45.6 76.5 75.6 88.0 79.4 75.8 76.3 72.0 75.2 87.4 90.2 82.5
Yaw / 46.0 35.3 44.0 44.7 42.0 38.7 42.7 40.7 29.3 60.0 48.7 89.6 87.7 94.7 94.1 88.7 90.0 88.6 85.7 89.3 92.2 88.0
Yaw ? 46.4 42.7 47.7 49.1 43.6 41.6 49.1 42.9 30.4 52.0 44.6 90.7 88.8 94.4 92.3 89.5 89.8 91.0 88.0 85.3 89.3 87.3
Yaw ?? 38.9 38.5 34.4 41.7 34.7 35.8 44.8 29.2 30.3 53.1 44.7 87.3 85.2 91.3 90.3 85.6 86.8 88.7 84.5 84.0 89.8 86.7
Occlusion / 42.2 39.5 45.9 45.2 38.3 37.9 46.4 38.6 28.9 53.8 46.0 89.1 87.3 93.8 91.9 87.6 88.5 89.5 85.6 86.7 89.3 86.4
Occlusion ? 47.6 43.3 39.6 50.6 45.5 42.9 48.4 39.6 30.4 54.0 41.8 90.4 87.8 93.1 92.5 88.9 89.4 90.6 88.8 81.7 90.9 88.4
Occlusion ?? 48.7 37.8 33.8 40.5 46.0 40.5 46.0 35.1 40.9 49.1 51.1 90.9 89.3 92.8 91.1 91.2 91.1 92.3 88.6 88.7 91.2 89.7
Results were generated on the whole AWE dataset of 1000 images of 100 subjects in the form of Rank-1 recognition rates, and the area under the cumulative match score curve (AUC). All
results are given in percentages. RIL., PO., DSI., Gab., RNet and SNet denote RILPQ, POEM, DSIFT, Gabor, ResNet and SqueezeNet, respectively
Neural Computing and Applications
Neural Computing and Applications

Fig. 5 The radar graphs show a comparison of the Rank-1 recognition Caucasian, respectively. For a description of the labels, please refer
rates of the evaluated recognition methods with respect to different to Fig. 2 and the description in [24]. The graphs are best viewed in
covariates. The axes show values from 0 to 60%. Acc., Ro., Occ., color
Fem. and Cau. denote accuracy, roll, occlusion, female and

When comparing descriptor-based methods to deep- the descriptor-based methods, the processing time was
learning-based models, we can see the overall winner of computed by running the experiments on the CPU of our
the comparison i SNet, but the worst-performing model is experimental hardware, whereas for the deep-learning-
gain a CNN suggesting that deep models are competitive based models the experiments were conducted on the GPU.
but need sufficient data to be trained effectively or feature a Ideally, a fast and efficient method should be as close as
sufficiently small number of parameters that need to be possible to the x-axis and as far away from the y-axis as
learned. possible. As we can see, the average time for all methods is
similar; a clear outlier here is the Gabor wavelet technique,
5.3 Time and space complexity which takes significantly more time on average than the
competing methods. SNet is again the best approach in this
comparison, as it is reasonably fast, but with the highest
Last but not least, we compare and analyze the time and performance.
space complexity of the tested recognition techniques in In Fig. 6, we see a similar comparison, but here the
Fig. 6 and Table 2. Here, Fig. 6 (left) shows a comparison feature vector length is plotted against the Rank-1 recog-
of the average time needed to process one image vs. the nition performance. As we can see, the clear outlier this
achieved recognition performance. The average time was time is SNet, which generates a significantly larger feature
computed over the entire test set of 1000 AWE images. For vector than the remaining techniques. Thus, when this is

123
Neural Computing and Applications

Fig. 6 Time and space complexity vs. the recognition performance. The feature vector size is plotted against the Rank-1 recognition rate.
Left: The average test time is plotted versus the Rank-1 recognition The closer a methods is located to the lower right corner, the better
rate. The closer a methods is located to the lower right corner, the the characteristics with respect to the space complexity. Best viewed
better the characteristics with respect to the time complexity. Right: in color

Table 2 Time and space complexity


Method Model size (in MB) # Parameters to train Feature vector size Training time (in h) Average test time—per image (in ms)

BSIF 0 0 9216 0 8
DSIFT 0 0 12,800 0 8
Gabor 0 0 5760 0 270
HOG 0 0 8712 0 4
LBP 0 0 9971 0 18
LPQ 0 0 9216 0 6
POEM 0 0 11,328 0 25
RILPQ 0 0 9216 0 25
RNet 96.8 25,636,712 2047  18 11
SNet 3.5 1,235,496 86,527 8 3
VGG 541.1 117,479,232 4095  12 9
The table shows a comparison of all considered techniques with respect to different characteristics such as the model size, number of parameters
to train, feature vector size, training time and average test time

Thus, when this is problematic due to the limited availability of resources, alternative models would need to be sought, despite the best overall performance of SNet so far.

When we look at the information presented in Table 2, we notice a striking difference between descriptor-based and deep-learning-based methods, i.e., the size of the model that needs to be stored in RAM is in the range of megabytes, with the biggest model, VGG, requiring a total of 541.1 MB just to load the model. The cost for the descriptor-based methods, on the other hand, is 0 as far as memory requirements are concerned. Descriptor-based methods also do not require any training procedure (their training time is 0) and can be applied to ear recognition problems without the need for enormous amounts of training data. Hence, if training data are scarce, descriptor-based methods may still have an edge over deep-learning-based models, where typically millions of parameters need to be learned during training (which takes several hours or even days, depending on the hardware and the amount of training data). As we already pointed out above, the feature vector size is comparable among all methods (except for SNet), but serves here only for illustrative purposes to show the approximate magnitude of the vector sizes. In general, the sizes can vary depending on the choice of open parameters of the recognition techniques considered.
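Parameter counts and model sizes like those reported for the CNN models in Table 2 can be derived directly from the network definitions. The sketch below (assuming PyTorch is available; the toy network is purely illustrative and not one of the architectures evaluated here) shows how such numbers are obtained; descriptor-based methods have no corresponding cost.

```python
# Sketch of how parameter counts and approximate in-memory model sizes
# (as in Table 2) can be computed for a CNN. The toy network below is
# illustrative only; the same two helpers apply to SNet-, RNet- or
# VGG-style architectures.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def model_size_mb(model: nn.Module, bytes_per_param: int = 4) -> float:
    """Approximate model footprint assuming float32 weights."""
    return count_parameters(model) * bytes_per_param / (1024 ** 2)

toy_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 100),
)

print(f"trainable parameters: {count_parameters(toy_cnn):,}")
print(f"approximate size:     {model_size_mb(toy_cnn):.3f} MB")
```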


6 Conclusion

We have evaluated eight popular dense descriptor-based feature extraction methods for ear recognition and three popular approaches based on convolutional neural networks and analyzed their performance with respect to different covariates. The results show that gender and ethnicity, with some exceptions, do not impact identification performance significantly. However, severe head rotations and heavy use of accessories both negatively impact recognition performance. Furthermore, we showed that hair occlusions negatively impact performance to a much more limited extent than other factors. The reason for this, we argue, is that the hair belonging to a specific person is similar throughout all (or most) of that person's ear images. We found that the tested methods differ significantly in terms of time and space complexity and that in situations where resources are scarce, descriptor-based methods may be favored over CNN models, despite their slightly inferior performance.
Acknowledgements This research was supported in part by the ARRS (Slovenian Research Agency) Research Program P2-0250 (B) Metrology and Biometric Systems and the ARRS Research Program P2-0214 (A) Computer Vision. The authors thank NVIDIA for donating the Titan Xp GPU that was used in the experiments.

Compliance with ethical standards

Conflict of interest We warrant that the article has not received prior publication, is not under consideration for publication elsewhere and is an original work. On behalf of all co-authors, the corresponding author bears full responsibility for the submission. The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest.
