Figure 1: A ProtoTree is a globally interpretable model faithfully explaining its entire reasoning (left, partially shown). Additionally, the decision-making process for a single prediction can be followed (right): the presence of a red chest and a black wing, and the absence of a black stripe near the eye, identify a Scarlet Tanager. A pruned ProtoTree learns roughly 200 prototypes for CUB (a dataset with 200 bird species), making only 8 local decisions on average for one test image.
Abstract

Prototype-based methods use interpretable representations to address the black-box nature of deep learning models, in contrast to post-hoc explanation methods that only approximate such models. We propose the Neural Prototype Tree (ProtoTree), an intrinsically interpretable deep learning method for fine-grained image recognition. ProtoTree combines prototype learning with decision trees, and thus results in a globally interpretable model by design. Additionally, ProtoTree can locally explain a single prediction by outlining a decision path through the tree. Each node in our binary tree contains a trainable prototypical part. The presence or absence of this learned prototype in an image determines the routing through a node. Decision making is therefore similar to human reasoning: Does the bird have a red throat? And an elongated beak? Then it's a hummingbird! We tune the accuracy-interpretability trade-off using ensemble methods, pruning and binarizing. We apply pruning without sacrificing accuracy, resulting in a small tree with only 8 learned prototypes along a path to classify a bird from 200 species. An ensemble of 5 ProtoTrees achieves competitive accuracy on the CUB-200-2011 and Stanford Cars data sets. Code is available at github.com/M-Nauta/ProtoTree.

1. Introduction

There is an ongoing scientific dispute between simple, interpretable models and complex black boxes, such as Deep Neural Networks (DNNs). DNNs have achieved superior performance, especially in computer vision, but their complex architectures and high-dimensional feature spaces have led to an increasing demand for transparency, interpretability and explainability [1], particularly in domains with high-stakes decisions [43]. In contrast, decision trees are easy to understand and interpret [14, 19], because they transparently arrange decision rules in a hierarchical structure. Their predictive performance is, however, far from competitive for computer vision tasks. We address this so-called 'accuracy-interpretability trade-off' [1, 35] by combining the expressiveness of deep learning with the interpretability of decision trees.

We present the Neural Prototype Tree, ProtoTree in short, an intrinsically interpretable method for fine-grained image recognition. A ProtoTree has the representational power of a neural network, and contains a built-in binary decision tree structure, as shown in Fig. 1 (left). Each internal node in the tree contains a trainable prototype. Our prototypes are prototypical parts learned with backpropagation, as introduced in the Prototypical Part Network (ProtoPNet) [9], where a prototype is a trainable tensor that can be visualized as a patch of a training sample. The extent to which this prototype is present in an input image determines the routing of the image through the corresponding node. Leaves of the ProtoTree learn class distributions. The paths from root to leaves represent the learned classification rules. The reasoning of our model is thus similar to the …

[Figure 2: ProtoPNet-style local explanation of a test image. Columns: Test image, Prototype, Similarity score, Weight, Points (e.g. 4.17 × 1.28 = 5.34); weighted similarity scores are summed per class, e.g. total points Lazuli Bunting = 36.16 vs. Indigo Bunting = 21.86.]
2. Related Work

2.1. Prototypes

… fine-grained image classification. ProtoPNet learns a predetermined number of prototypical parts (prototypes) per class. To classify an image, the similarity between a prototype and a patch in the image is calculated by measuring the distance in latent space. The resulting similarity scores are weighted by values learned by a fully-connected layer. The explanation of ProtoPNet shows the reasoning process for a single image by visualizing all prototypes together with their weighted similarity scores. Summing the weighted similarity scores per class gives a final score for the image belonging to each class, as shown in Fig. 2. We improve upon ProtoPNet by showing an easy-to-interpret global explanation by means of a decision tree. Such a hierarchical, logical model aids interpretability [14, 43], since a tree has various conceptual advantages compared to a linear bag-of-prototypes: a tree enforces a sequence of steps and supports negative associations (i.e. the absence of a prototype), thereby reducing the number of prototypes and better mimicking human reasoning. The hierarchical structure therefore enhances interpretability and could also lead to more insights w.r.t. clusters in the data. Instead of multiplying similarity scores with weights, our local explanation shows the routing of a sample through the tree. Furthermore, we do not have class-specific prototypes, we do not need to learn weights for similarity scores, and we use a simple cross-entropy loss function.
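For concreteness, ProtoPNet's linear "bag-of-prototypes" aggregation can be sketched in a few lines of PyTorch (our illustration with hypothetical tensor names, not ProtoPNet's actual code):

```python
import torch

def protopnet_class_scores(similarity: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Sum of weighted similarity scores per class (cf. Fig. 2).

    similarity: (M,) similarity of each of M prototypes to its best image patch.
    weight:     (M, K) fully-connected weights from prototypes to K classes.
    Returns (K,) total points per class; the class with the highest total wins.
    """
    return similarity @ weight

# toy example with 4 prototypes and 2 classes
sim = torch.tensor([4.17, 3.98, 2.65, 1.78])
w = torch.tensor([[1.28, 0.0], [1.29, 0.0], [0.0, 1.35], [0.0, 1.02]])
print(protopnet_class_scores(sim, w))  # ≈ tensor([10.47, 5.39])
```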
2.2. Neural Soft Decision Trees

Soft Decision Trees (SDTs) have been shown to be more accurate than traditional hard decision trees [23, 24, 46]. Only recently have deep neural networks been integrated into binary SDTs. The Deep Neural Decision Forest (DNDF) [29] is an ensemble of neural SDTs: a neural network learns a latent representation of the input, and each node learns a routing function. Adaptive Neural Trees (ANTs) [47] are a generalization of DNDFs. Each node can transform and route its input with a small neural network. In contrast to most SDTs (including ours), which require a fixed tree structure, ANTs greedily learn a binary tree topology. Such greedy algorithms, however, can lead to suboptimal trees [40], and are only applied to simple classification problems such as MNIST. Furthermore, the above methods lose the main attractive property of decision trees: interpretability. DNDFs can be locally interpreted by visualizing a path of saliency maps [33], as shown in Fig. 3a. Frosst & Hinton [15] train a perceptron for each node and visualize the learned weights (Fig. 3b). The limited representational power of perceptrons, however, leads to suboptimal classification results. The approach in [22] makes SDTs deterministic at test time, and its linear split parameters can be visualized and enhanced with a spatial regularization term (Fig. 3c). In contrast to these interpretable methods, we apply our method to natural images for fine-grained image recognition, and our visualizations are sharp and full-color, thereby improving interpretability (Fig. 3d). Instead of image recognition, Neural-Backed Decision Trees for Segmentation [52] use visual decision rules with saliency maps for segmentation.

Figure 3: Visualized root node from soft decision trees: (a) DNDF [33], (b) SDT [15], (c) SDT [22], (d) ours, applied to CIFAR10, MNIST, FashionMNIST and CUB, respectively. Republished with permission from the authors (a-c).

Other tree approaches for image classification apply post-hoc explanation techniques: showing example images per node [2, 55], visualizing typical CNN filters of each node that can be manually labelled [55], showing class activation maps [25], or manually inspecting leaf labels and the meaning of internal nodes [51]. We extend prior work by including prototypes in a tree structure, thereby obtaining a globally explainable, intrinsically interpretable model with only one decision per node. Additionally, similar to ProtoPNet [9], a ProtoTree does not require manual labelling and is therefore self-explanatory. Our work differs from hierarchical image classification (e.g., a gibbon is an animal and a primate), such as [20], since we do not require hierarchical labels or a predefined taxonomy.

3. Neural Prototype Tree

A Neural Prototype Tree (ProtoTree) hierarchically routes an image through a binary tree for interpretable image recognition. We now formalise the definition of a ProtoTree for supervised learning. Consider a classification problem with training set T containing N labelled images {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)} ∈ X × Y. Given an input image x, a ProtoTree predicts the class probability distribution over K classes, denoted as ŷ. We use y to denote the one-hot encoded ground-truth label, such that we can train a ProtoTree by minimizing the cross-entropy loss between y and ŷ. A ProtoTree can also be trained with soft labels from a trained model for knowledge distillation, similar to [15].

Figure 4: Decision making process of a ProtoTree to predict the class probability distribution ŷ of input image x. During training, prototypes pn ∈ P, the leaves' class distributions c, and CNN parameters ω are learned. Probabilities pe (shown with example values) depend on the similarity between a patch in the latent input image and a prototype.
A ProtoTree T is a combination of a convolutional neural network (CNN) f with a soft neural binary decision tree structure. As shown in Fig. 4, an input image is first forwarded through f. The resulting convolutional output z = f(x; ω) consists of D two-dimensional (H × W) feature maps, where ω denotes the trainable parameters of f. Secondly, the latent representation z ∈ R^(H×W×D) serves as input for a binary tree. This tree consists of a set of internal nodes N, a set of leaf nodes L, and a set of edges E. Each internal node n ∈ N has exactly two child nodes: n.left, connected by edge e(n, n.left) ∈ E, and n.right, connected by e(n, n.right) ∈ E. Each internal node n ∈ N corresponds to a trainable prototype pn ∈ P. We follow the prototype definition of ProtoPNet [9], where each prototype is a trainable tensor of shape H1 × W1 × D (with H1 ≤ H, W1 ≤ W, and in our implementation H1 = W1 = 1), such that the prototype's depth corresponds to the depth of the convolutional output z.

We use a form of generalized convolution without bias [17], where each prototype pn ∈ P acts as a kernel that 'slides' over z of shape H × W × D and computes the Euclidean distance between pn and its current receptive field z̃ (called a patch). We apply a minimum pooling operation to select the patch in z of shape H1 × W1 × D that is closest to prototype pn:

z̃* = argmin_{z̃ ∈ patches(z)} ||z̃ − pn||.   (1)

The distance between the nearest latent patch z̃* and prototype pn determines to what extent the prototype is present anywhere in the input image, which influences the routing of z through the corresponding node n. In contrast to traditional decision trees, where an internal node routes sample z either right or left, our node n ∈ N is soft and routes z to both children, each with a fuzzy weight within [0, 1], giving it a probabilistic interpretation [15, 23, 29, 47]. Following this probabilistic terminology, we define the similarity between z̃* and pn, and therefore the probability of routing sample z through the right edge, as

pe(n,n.right)(z) = exp(−||z̃* − pn||),   (2)

such that pe(n,n.left) = 1 − pe(n,n.right). Thus, the similarity between prototype pn and the nearest patch in the convolutional output, z̃*, determines to what extent z is routed to the right child of node n. Because of the soft routing, z is traversed through all edges and ends up in each leaf node ℓ ∈ L with a certain probability. Path Pℓ denotes the sequence of edges from the root node to leaf ℓ. The probability of sample z arriving in leaf ℓ, denoted as πℓ, is the product of the probabilities of the edges in path Pℓ:

πℓ(z) = ∏_{e ∈ Pℓ} pe(z).   (3)

Each leaf node ℓ ∈ L carries a trainable parameter cℓ, denoting the distribution over the K classes in that leaf, which needs to be learned. The softmax function σ(cℓ) normalizes cℓ to get the class probability distribution of leaf ℓ. To obtain the final predicted class probability distribution ŷ for input image x, latent representation z = f(x; ω) is traversed through all edges in T such that all leaves contribute to the final prediction ŷ. An example prediction is shown on the right of Fig. 4. The contribution of leaf ℓ is weighted by its path probability πℓ, such that

ŷ(x) = Σ_{ℓ ∈ L} σ(cℓ) · πℓ(f(x; ω)).   (4)
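The routing and prediction pipeline of Eqs. 1–4 can be summarized in a short PyTorch sketch. This is our illustration under simplifying assumptions — a perfect binary tree stored in breadth-first arrays and 1×1 prototypes — not the released implementation at github.com/M-Nauta/ProtoTree:

```python
import torch
import torch.nn as nn

class ProtoTreeSketch(nn.Module):
    """Perfect binary tree of height h: internal nodes 0..2^h-2 hold one
    1x1xD prototype each; node i has children 2i+1 (left) and 2i+2 (right)."""

    def __init__(self, f: nn.Module, num_classes: int, h: int, D: int):
        super().__init__()
        self.f = f                                   # CNN backbone: (B,3,·,·) -> (B,D,H,W)
        self.n_internal = 2 ** h - 1
        self.prototypes = nn.Parameter(torch.rand(self.n_internal, D))
        # leaf logits c_l; learned derivative-free (Sec. 4), hence no gradient
        self.leaf_logits = nn.Parameter(torch.zeros(2 ** h, num_classes),
                                        requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.f(x)                                # (B, D, H, W)
        B, D, H, W = z.shape
        patches = z.permute(0, 2, 3, 1).reshape(B, H * W, D)
        # Eq. 1: Euclidean distance of every prototype to every patch, min-pooled
        dist = torch.cdist(patches, self.prototypes.unsqueeze(0).expand(B, -1, -1))
        d_min = dist.min(dim=1).values               # (B, n_internal)
        p_right = torch.exp(-d_min)                  # Eq. 2; p_left = 1 - p_right
        # Eq. 3: arrival probability of each node; the root has probability 1
        arrive = [torch.ones(B, device=z.device)]
        for i in range(self.n_internal):
            arrive.append(arrive[i] * (1 - p_right[:, i]))   # left child 2i+1
            arrive.append(arrive[i] * p_right[:, i])         # right child 2i+2
        pi = torch.stack(arrive[self.n_internal:], dim=1)    # (B, 2^h) leaf probs
        # Eq. 4: pi-weighted mixture of softmax-normalized leaf distributions
        return pi @ torch.softmax(self.leaf_logits, dim=1)   # (B, K), rows sum to 1

# toy usage with a tiny hypothetical backbone
backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 32, 1), nn.Sigmoid())
tree = ProtoTreeSketch(backbone, num_classes=5, h=3, D=32)
print(tree(torch.randn(2, 3, 32, 32)).sum(dim=1))  # ≈ tensor([1., 1.])
```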
4. Training a ProtoTree

Training a ProtoTree requires learning the parameters ω of CNN f for informative feature maps, the nodes' prototypes P for routing, and the leaves' class distribution logits c for prediction. The number of prototypes to be learned, i.e. |P|, depends on the tree size. A binary tree structure is initialized by defining a maximum height h, which creates 2^h leaves and 2^h − 1 prototypes (e.g., h = 9 gives 512 leaves and 511 prototypes). The computational complexity of learning P therefore grows exponentially with h.

We require a pre-trained CNN f (e.g. pre-trained on ImageNet, or trained on a specific prediction task first). During training, the prototypes in P are trainable tensors. Parameters ω and P are learned simultaneously with back-propagation by minimizing the cross-entropy loss between the predicted class probability distribution ŷ and the ground truth y. The learned prototypes should end up near a latent patch of a training image, such that they can be visualized as an image patch representing a prototypical part (cf. Sec. 5).
Learning leaves' distributions. In a classical decision tree, a leaf label is learned from the samples ending up in that leaf. Since we use a soft tree, learning the leaves' distributions is a global learning problem. Although it is possible to learn c with back-propagation together with ω and P, we found that this gives inferior classification results. We hypothesize that including c in the loss term leads to an overly complex optimization problem. Kontschieder et al. [29] noted that solely optimizing leaf parameters is a convex optimization problem and proposed a derivative-free strategy. Translating their approach to our methodology gives the following update scheme for cℓ for all ℓ ∈ L:

cℓ^(t+1) = Σ_{x,y ∈ T} (σ(cℓ^(t)) ⊙ y ⊙ πℓ) ⊘ ŷ,   (5)

where t indexes a training epoch, ⊙ denotes element-wise multiplication and ⊘ is element-wise division. The result is a vector of size K representing the class distribution in leaf ℓ. This learning scheme is however computationally expensive: at each epoch, first cℓ^(t+1) is computed to update the leaves, and then all other parameters are trained by looping through the data again, meaning that ŷ is computed twice. We propose to do this more efficiently and intertwine mini-batch gradient descent optimization for ω and P with a derivative-free update to learn c, as shown in Alg. 1. Our algorithm has the advantage that each mini-batch update of ω and P is taken into account for updating c^(t+1), which aids convergence. Moreover, computing ŷ only once for each batch roughly halves the training time.

Algorithm 1: Training a ProtoTree
  Input: Training set T, max height h, nEpochs
  1  initialize ProtoTree T with height h and ω, P, c^(1);
  2  for t ∈ {1, ..., nEpochs} do
  3      randomly split T into B mini-batches;
  4      for (x_b, y_b) ∈ {T_1, ..., T_b, ..., T_B} do
  5          ŷ_b = T(x_b);
  6          compute loss(ŷ_b, y_b);
  7          update ω and P with gradient descent;
  8          for ℓ ∈ L do
  9              cℓ^(t+1) −= (1/B) · cℓ^(t);
 10              cℓ^(t+1) += Eq. 5 evaluated on (x_b, y_b);
 11  prune T (optional);
 12  replace each prototype pn ∈ P with its nearest latent patch z̃n* and visualize;
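In PyTorch-style code, one epoch of Alg. 1 could look as follows. This is a sketch under our assumptions — a hypothetical `tree(x)` that also returns the leaf path probabilities π, and `tree.leaf_logits` holding c — not the authors' released code:

```python
import torch
import torch.nn.functional as F

def train_epoch(tree, loader, optimizer):
    """One epoch of Alg. 1 (sketch). Assumes tree(x) returns (y_hat, pi):
    predictions (B, K) and leaf path probabilities (B, L); tree.leaf_logits
    is an (L, K) parameter with requires_grad=False holding c."""
    c_old = tree.leaf_logits.detach().clone()         # c^(t), fixed this epoch
    n_batches = len(loader)                           # B in Alg. 1
    for x, y in loader:
        y_hat, pi = tree(x)                           # line 5
        loss = F.nll_loss(torch.log(y_hat + 1e-8), y) # line 6: cross-entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # line 7: update omega and P
        with torch.no_grad():                         # lines 8-10: Eq. 5 per batch
            sigma_c = torch.softmax(c_old, dim=1)                 # (L, K)
            y1h = F.one_hot(y, sigma_c.shape[1]).float()          # (B, K)
            # batch term of Eq. 5: sum_b pi_{b,l} * (y_b / y_hat_b) * sigma(c_l)
            update = sigma_c * torch.einsum('bl,bk->lk', pi, y1h / y_hat)
            tree.leaf_logits -= c_old / n_batches     # line 9
            tree.leaf_logits += update                # line 10
```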
Figure 5: Pruning removes a subtree T′ in which all leaves have a (nearly) uniform distribution, together with its now-superfluous parent.

5. Interpretability and Visualization

To foster global model interpretability, we prune ineffective prototypes, visualize the learned latent prototypes, and convert soft to hard decisions.

5.1. Pruning

Interpretability can be quantified by explanation size [11, 45]. In a ProtoTree T, the explanation size is related to the number of prototypes. To reduce the explanation size, we analyse the learned class probability distributions in the leaves and remove leaves with a nearly uniform distribution, i.e. little discriminative power. Specifically, we define a threshold τ and prune all leaves where max(σ(cℓ)) ≤ τ, with τ slightly greater than 1/K, where K is the number of classes. If all leaves in a full subtree T′ ⊂ T are pruned, T′ (and its prototypes) can be removed. As visualized in Fig. 5, ProtoTree T can then be reorganized by additionally removing the now-superfluous parent of the root of T′.
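A recursive sketch of this pruning step, using hypothetical Node objects rather than the authors' API:

```python
import torch
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    c: Optional[torch.Tensor] = None       # leaf logits c_l (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def prune(node: Node, tau: float) -> Optional[Node]:
    """Return the pruned (sub)tree, or None if every leaf below `node`
    has max class probability <= tau, i.e. a (nearly) uniform distribution."""
    if node.left is None:                  # leaf node
        return node if torch.softmax(node.c, dim=0).max() > tau else None
    left, right = prune(node.left, tau), prune(node.right, tau)
    if left is None and right is None:     # whole subtree T' is prunable
        return None
    if left is None:                       # drop T' = left subtree; the now-
        return right                       # superfluous node is removed and
    if right is None:                      # the sibling takes its place (Fig. 5)
        return left
    node.left, node.right = left, right
    return node

# for K classes, choose tau slightly greater than 1/K, e.g. tau = 1.01 / K
```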
5.2. Prototype Visualization

Learned latent prototypes need to be mapped to pixel space to enable interpretability. Similar to ProtoPNet [9], we replace each prototype pn ∈ P with its nearest latent patch present in the training data, z̃n*. Thus,

pn ← z̃n*,  z̃n* = argmin_{z̃ ∈ patches(z), z ∈ {f(x), ∀x ∈ T}} ||z̃ − pn||,   (6)

such that prototype pn is equal to latent representation z̃n*. Whereas ProtoPNet replaces its prototypes every 10th epoch during training, prototype replacement after training is sufficient for a ProtoTree, since our routing mechanism implicitly optimizes prototypes to represent a certain patch. This reduces computational complexity and simplifies the training process.

We denote by xn* the training image corresponding to the nearest patch z̃n*. Prototype pn can now be visualized as a patch of xn*. We forward xn* through f to create a two-dimensional similarity map containing the similarity score between pn and all patches in z = f(xn*):

Sn^(i,j) = exp(−||z̃^(i,j) − pn||).   (7)
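Eqs. 6 and 7 translate into a straightforward search over the training images. A sketch with hypothetical helpers — `f`, the 1×1×D prototype `p_n`, and the image shapes are our assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visualize_prototype(f, p_n, train_images):
    """Replace prototype p_n by its nearest latent training patch (Eq. 6)
    and return the similarity map S_n over that image (Eq. 7)."""
    best_img, best_dist, best_ij = None, float("inf"), None
    for x in train_images:                   # each x: (1, 3, 224, 224)
        z = f(x)                             # (1, D, H, W)
        d = torch.linalg.vector_norm(z - p_n.view(1, -1, 1, 1), dim=1)[0]  # (H, W)
        if d.min() < best_dist:
            best_img, best_dist = x, d.min().item()
            best_ij = divmod(int(d.argmin()), d.shape[1])
    z_star = f(best_img)
    p_new = z_star[0, :, best_ij[0], best_ij[1]]         # nearest patch (Eq. 6)
    S = torch.exp(-torch.linalg.vector_norm(z_star - p_new.view(1, -1, 1, 1), dim=1))
    # upsample the HxW similarity map to pixel space to locate the part in x*_n
    S_img = F.interpolate(S.unsqueeze(1), size=best_img.shape[-2:], mode="bilinear")
    return p_new, best_img, S_img
```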
5.3. Deterministic Reasoning

In a soft decision tree, all nodes contribute to a prediction. In contrast, in a hard, deterministic tree, only the nodes along a path account for the final prediction, making hard decision trees easier to interpret than soft trees [2]. Whereas a ProtoTree is soft during training, we propose two possible strategies to convert a ProtoTree to a hard tree at test time:
1. select the path to the leaf with the highest path probability: argmax_{ℓ ∈ L} πℓ;
2. greedily traverse the tree, i.e. go right at internal node n if pe(n,n.right) > 0.5 and left otherwise.
Sec. 6.2 evaluates to what extent these deterministic strategies influence accuracy compared to soft reasoning; both strategies are sketched below.
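Both strategies are a few lines each, here sketched on the breadth-first array layout assumed earlier (not the released code):

```python
import torch

def greedy_path(p_right: torch.Tensor, h: int):
    """Strategy 2: follow the most likely edge at each internal node.
    p_right: (2^h - 1,) right-edge probabilities for one sample, where
    node i has children 2i+1 (left) and 2i+2 (right)."""
    i, path, n_internal = 0, [], 2 ** h - 1
    while i < n_internal:
        go_right = bool(p_right[i] > 0.5)
        path.append((i, "right" if go_right else "left"))
        i = 2 * i + 2 if go_right else 2 * i + 1
    return path, i - n_internal        # decisions taken, leaf index in [0, 2^h)

def max_path_leaf(pi: torch.Tensor) -> int:
    """Strategy 1: pick the leaf with the highest path probability pi_l."""
    return int(torch.argmax(pi))
```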
6. Experiments

We evaluate the accuracy-interpretability trade-off of a ProtoTree, and compare our ProtoTrees with ProtoPNet [9] and state-of-the-art models in terms of accuracy and interpretability. We evaluate on CUB-200-2011 [50] with 200 bird species (CUB) and Stanford Cars [30] with 196 car types (CARS), since both were used by ProtoPNet [9].

6.1. Experimental Setup

We implemented ProtoTree in PyTorch. Our CNN f contains the convolutional layers of ResNet50 [21], pretrained on ImageNet [10] for CARS. For CUB, ResNet50 is pretrained on iNaturalist2017 [49], which contains plants and animals and is therefore a suitable source domain [32], using the backbone of [59]. Backbone f is frozen for 30 epochs, after which f is optimized jointly with the prototypes using Adam [28]. For a fair comparison with ProtoPNet [9], we resize all images to 224 × 224 such that the resulting feature maps are 7 × 7. The CNN architecture is extended with a 1 × 1 convolutional layer¹ to reduce the dimensionality of the latent output z to D, the prototype depth. Based on cross-validation over {128, 256, 512}, we used D=256 for CUB and D=128 for CARS. Similar to ProtoPNet, H1=W1=1 to provide well-sized patches, such that a prototype is of size 1 × 1 × 256 for CUB. We use ReLU as the activation function, except for the last layer, which has a Sigmoid function that acts as a form of normalization. We initialize the prototypes by sampling from N(0.5, 0.1). The initial leaf distributions σ(cℓ^(1)) are uniform, obtained by initializing cℓ^(1) with zeros for all ℓ ∈ L. See Suppl. for all details.

¹ ProtoPNet [9] appends two 1×1 convolutional layers, but in our model this gave varying (and lower) accuracy across runs.

Table 1: Mean accuracy and standard deviation of our ProtoTree (5 runs) and of ensembles of 3 or 5 ProtoTrees, compared with the self-reported accuracy of uninterpretable state-of-the-art models² (-), attention-based models (o) and the interpretable ProtoPNet (+, with ResNet34 backbone).

Dataset | Method | Interpretability | Top-1 Accuracy | #Prototypes
… | Triplet Model [34] | - | 87.5 | n.a.
… | TranSlider [58] | - | 85.8 | n.a.
… | … | … | … | …

² Using higher-resolution images (e.g. 448 × 448) has been shown to give better results [48, 57], with e.g. accuracy up to 90.4% [16] for CUB.

6.2. Accuracy and Interpretability

Table 1 compares the accuracy of ProtoTrees with state-of-the-art methods. Our ProtoTree outperforms ProtoPNet on both datasets. We also evaluated the accuracy of ProtoTree ensembles by averaging the predictions of 3 or 5 individual ProtoTrees, all trained on the same dataset. An ensemble of ProtoTrees outperforms a ProtoPNet ensemble, and approximates the accuracy of uninterpretable or attention-based methods, while providing intrinsically interpretable global and faithful local explanations.
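Ensembling reduces to averaging the predicted class probability distributions (a one-line sketch):

```python
import torch

def ensemble_predict(trees, x):
    """Average the class probability distributions of several ProtoTrees."""
    return torch.stack([tree(x) for tree in trees]).mean(dim=0)
```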
Dataset | K | h | Initial Acc | Acc pruned | Acc pruned+repl. | # Prototypes | % Pruned | Distance z̃n*
CUB | 200 | 9 | 82.206 ± 0.723 | 82.192 ± 0.723 | 82.199 ± 0.726 | 201.6 ± 1.9 | 60.5 | 0.0020 ± 0.0068
CARS | 196 | 11 | 86.584 ± 0.250 | 86.576 ± 0.245 | 86.576 ± 0.245 | 195.4 ± 0.5 | 90.5 | 0.0005 ± 0.0016

Table 2: Impact of pruning and prototype replacement: 1) accuracy before pruning and replacement, 2) accuracy after pruning, 3) accuracy after pruning and replacement, 4) number of prototypes after pruning, 5) fraction of prototypes that is pruned, and 6) Euclidean distance from each latent prototype to its nearest latent training patch (after pruning). Showing mean and standard deviation across 5 runs.
Strategy | Accuracy | Fidelity | Path length
Soft | 82.20 ± 0.01 | n.a. | n.a.
… | … | … | …

[Figure: plot with Accuracy on the y-axis (80%–100%).]
Figure 8: Subtree of an automatically visualized ProtoTree trained on CUB, h=8 (middle). Each internal node contains
a prototype (left) and the training image from which it is extracted (right). A ProtoTree faithfully shows its reasoning and
clusters similar classes (e.g. birds with a white chest). Top left: maximum values of all leaf distributions. Top right: ProtoTree
reveals biases learned by the model: e.g. classifying a Gray Catbird based on the presence of a leaf. Best viewed in color.
Figure 9: Local explanation that shows the greedy path when classifying a White-breasted Nuthatch. Prototypes found in the
test image are: a dark-colored tail, a contrastive eye, sky background (learned bias), a white chest and a blue-grey wing.
References

[1] Amina Adadi and Mohammed Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018.
[2] Stephan Alaniz and Zeynep Akata. Explainable observer-classifier for explainable binary decisions. arXiv preprint arXiv:1902.01780, 2019.
[3] David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.
[4] Plamen Angelov and Eduardo Soares. Towards deep machine reasoning: a prototype-based deep neural network with decision tree inference. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2092–2099. IEEE, 2020.
[5] Sercan O. Arik and Tomas Pfister. ProtoAttend: Attention-based prototypical learning. arXiv preprint arXiv:1902.06292, 2019.
[6] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015.
[7] Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115, 1987.
[8] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019. OpenReview.net.
[9] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. This looks like that: Deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems 32, pages 8930–8941. Curran Associates, Inc., 2019.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[12] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[14] Alex A. Freitas. Comprehensible classification models: A position paper. SIGKDD Explorations Newsletter, 15(1):1–10, March 2014.
[15] Nicholas Frosst and Geoffrey Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[16] Weifeng Ge, Xiangru Lin, and Yizhou Yu. Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[17] Kamaledin Ghiasi-Shirazi. Generalizing the convolution operator in convolutional neural networks. Neural Processing Letters, 50(3):2627–2646, 2019.
[18] Riccardo Guidotti, Anna Monreale, Stan Matwin, and Dino Pedreschi. Black box explanation by learning image exemplars in the latent feature space. In Machine Learning and Knowledge Discovery in Databases, pages 189–205, Cham, 2020. Springer International Publishing.
[19] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
[20] Peter Hase, Chaofan Chen, Oscar Li, and Cynthia Rudin. Interpretable image recognition with hierarchical prototypes. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 32–40, 2019.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Thomas M. Hehn, Julian F. P. Kooij, and Fred A. Hamprecht. End-to-end learning of decision trees and forests. International Journal of Computer Vision, pages 1–15, 2019.
[23] Ozan Irsoy, Olcay Taner Yıldız, and Ethem Alpaydın. Soft decision trees. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), pages 1819–1822. IEEE, 2012.
[24] Ozan Irsoy, Olcay Taner Yildiz, and Ethem Alpaydin. Budding trees. In 2014 22nd International Conference on Pattern Recognition, pages 3582–3587. IEEE, 2014.
[25] Ruyi Ji, Longyin Wen, Libo Zhang, Dawei Du, Yanjun Wu, Chen Zhao, Xianglong Liu, and Feiyue Huang. Attention convolutional binary neural tree for fine-grained visual categorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10468–10477, 2020.
[26] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of Machine Learning Research, volume 80, pages 2668–2677. PMLR, 2018.
[27] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (Un)reliability of Saliency Methods, pages 267–280. Springer International Publishing, Cham, 2019.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[29] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[30] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[31] Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. The dangers of post-hoc interpretability: Unjustified counterfactual explanations. arXiv preprint arXiv:1907.09294, 2019.
[32] Hao Li, Pratik Chaudhari, Hao Yang, Michael Lam, Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Rethinking the hyperparameters for fine-tuning. arXiv preprint arXiv:2002.11770, 2020.
[33] Shichao Li and Kwang-Ting Cheng. Visualizing the decision-making process in deep neural decision forest. In CVPR Workshops, pages 114–117, 2019.
[34] J. Liang, J. Guo, Y. Guo, and S. Lao. Adaptive triplet model for fine-grained visual categorization. IEEE Access, 6:76776–76786, 2018.
[35] Zachary C. Lipton. The mythos of model interpretability. Queue, 16(3):30:31–30:57, June 2018.
[36] X. Ma and A. Boukerche. An AI-based visual attention model for vehicle make and model recognition. In 2020 IEEE Symposium on Computers and Communications (ISCC), pages 1–6, 2020.
[37] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
[38] Meike Nauta, Annemarie Jutte, Jesper Provoost, and Christin Seifert. This looks like that, because ... explaining prototypes for interpretable image recognition, 2020.
[39] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems 29, pages 3387–3395. Curran Associates, Inc., 2016.
[40] Mohammad Norouzi, Maxwell Collins, Matthew A. Johnson, David J. Fleet, and Pushmeet Kohli. Efficient non-greedy optimization of decision trees. In Advances in Neural Information Processing Systems, pages 1729–1737, 2015.
[41] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[42] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717, 2017.
[43] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[44] Sascha Saralajew, Lars Holdijk, Maike Rees, Ebubekir Asan, and Thomas Villmann. Classification-by-components: Probabilistic modeling of reasoning over a set of components. In Advances in Neural Information Processing Systems 32, pages 2792–2803. Curran Associates, Inc., 2019.
[45] Wilson Silva, Kelwin Fernandes, Maria J. Cardoso, and Jaime S. Cardoso. Towards complementary explanations using deep neural networks. In Understanding and Interpreting Machine Learning in Medical Image Computing Applications, pages 133–140, Cham, 2018. Springer International Publishing.
[46] Alberto Suárez and James F. Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, 1999.
[47] Ryutaro Tanno, Kai Arulkumaran, Daniel Alexander, Antonio Criminisi, and Aditya Nori. Adaptive neural trees. In Proceedings of Machine Learning Research, volume 97, pages 6166–6175. PMLR, 2019.
[48] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. In Advances in Neural Information Processing Systems 32, pages 8252–8262. Curran Associates, Inc., 2019.
[49] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[50] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[51] Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, and Joseph E. Gonzalez. NBDT: Neural-backed decision trees. arXiv preprint arXiv:2004.00221, 2020.
[52] Alvin Wan, Daniel Ho, Younjin Song, Henk Tillman, Sarah Adel Bargal, and Joseph E. Gonzalez. SegNBDT: Visual decision rules for segmentation. arXiv preprint arXiv:2006.06868, 2020.
[53] Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K. Ravikumar. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems 31, pages 9291–9301. Curran Associates, Inc., 2018.
[54] Matthew D. Zeiler and Rob Fergus. Visualizing and under-
standing convolutional networks. In David Fleet, Tomas Pa-
jdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer
Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer
International Publishing.
[55] Quanshi Zhang, Yu Yang, Haotian Ma, and Ying Nian Wu.
Interpreting CNNs via decision trees. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 6261–6270, 2019.
[56] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learn-
ing multi-attention convolutional neural network for fine-
grained image recognition. In Proceedings of the IEEE Inter-
national Conference on Computer Vision (ICCV), Oct 2017.
[57] Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo.
Looking for the devil in the details: Learning trilinear atten-
tion sampling network for fine-grained image recognition. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), June 2019.
[58] Kuo Zhong, Ying Wei, Chun Yuan, Haoli Bai, and Junzhou
Huang. TranSlider: Transfer ensemble learning from ex-
ploitation to exploration. In Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discov-
ery & Data Mining, KDD ’20, page 368–378, New York,
NY, USA, 2020. Association for Computing Machinery.
[59] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min
Chen. BBN: Bilateral-branch network with cumulative learn-
ing for long-tailed visual recognition. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9719–9728, 2020.
[60] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,
and Antonio Torralba. Learning deep features for discrimi-
native localization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June
2016.
[61] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba.
Interpretable basis decomposition for visual explanation. In
Proceedings of the European Conference on Computer Vi-
sion (ECCV), September 2018.