Random Kernel Forests
ABSTRACT Random forests of axis-parallel decision trees still show competitive accuracy in various
tasks; however, they have drawbacks that limit their applicability. Namely, they perform poorly for mul-
tidimensional sparse data. A straightforward solution is to create forests of decision trees with oblique splits;
however, most training approaches have low performance. Besides, those ensembles appear unstable and
easily overfit, so they should be regularized to find the trade-off between complexity and generalization.
This paper proposes an algorithm to train kernel decision trees and random forests. At the top level, this
algorithm follows a common greedy procedure to train decision trees; however, it produces quasi-optimal
oblique and kernel splits at the decision stump level. At each stump, the algorithm finds a quasi-optimal
distribution of classes to subtrees and trains this stump via optimization of an SVM-like loss function with
a margin re-scaling approach, which helps optimize the margin between subtree data and arbitrary impurity
criteria. We also try to reveal uniform stability-based generalization bounds for those forests and apply them
to select the regularization technique. The obtained bounds explicitly consider primary hyperparameters
of forests, trees, and decision stumps. The bounds also show a relationship between outliers in decision
tree structure, stability and generalization, and illustrate how forests smooth these outliers. We performed
an experimental evaluation on several tasks, such as studying the reaction of social media users, image
recognition, and bank scoring. The experiments show that the proposed algorithm trains ensembles that
outperform other oblique and kernel forests on many commonly recognized datasets. Namely, the proposed
algorithm shows 99.1% accuracy on MNIST and 58.1% on CIFAR-10. It has been confirmed that the selected
regularization technique helps reduce overfitting on several datasets. Therefore, the proposed algorithm
may be considered a small step toward customized and regularized ensembles of kernel trees that keep
reasonable training performance on large datasets. We believe that the proposed algorithm can be utilized as
a construction element of approaches to training context-aware models on complex problems with integer
inputs and outputs, such as text mining and image recognition.
INDEX TERMS Generalization bounds, kernel splits, kernel forests, random forests, random uniform
stability, regularization, slack re-scaling.
I. INTRODUCTION
Decision trees [1], their ensembles [2], and random forests [3] are the cornerstone of much intelligent software. They are still winners in many data mining competitions. Binary decision trees consist of nodes that are called decision stumps. They divide the data into two subsets according to a splitting (impurity) criterion, such as Gini impurity or Information gain, which corresponds to a discrete optimization problem. Optimizing axis-parallel decision stumps is straightforward because one can test all possible thresholds for each feature and then select the best parameters according to the split criterion. However, that type of split has limited discriminative strength. For example, such splits perform poorly on multidimensional sparse data, a frequent input in many text and image processing tasks.
One of the approaches to tackle the problem is to train trees with more complex decision rules, for example, with oblique (linear) splits. Unfortunately, oblique decision trees have low generalization ability. Besides, most approaches to training oblique splits have low performance or are hard to tune; they often presume optimization of a non-convex or
even non-smooth loss. Promising directions of oblique and kernel tree development are bagging, sub-bagging, or random forest ensembles, which have better stability than their underlying classifiers. However, even those ensembles appear unstable and easily overfit, so they should be regularized to reveal the trade-off between complexity and generalization.

TABLE 1. Important notations in this paper.
Several studies show that the margin between subtree data and the splitting criteria affect tree generalization [4], [11]. Therefore, kernel or oblique tree decision stumps should optimize those parameters simultaneously. Further improvement of oblique decision tree ensembles requires a detailed estimate of the generalization ability that considers the values of the training hyperparameters, the properties of the tree structures, and the features of the training algorithms. However, studies on generalization consider single trees or randomized ensembles of arbitrary estimators but do not focus on the features of combined concepts like bagging on decision trees or Random Forests.
This paper proposes decision trees with kernel splits and random forests of those trees, i.e., trees whose splits are kernel classifiers with arbitrary kernel functions. The main contributions of our paper are as follows.
• We propose a greedy algorithm to train kernel trees and obtain their kernel splits via optimization of an SVM-like loss function. This function helps improve both arbitrary impurity criteria and the margin between sub-trees. Thereby, the method should train trees with sufficient generalization, and existing efficient linear-time solvers can be used to perform the training on large datasets.
• The experiments show that it makes sense to complicate the decision splits further. Namely, kernel forests with non-linear (radial basis function and polynomial) kernels outperform oblique forests on several datasets.
• We also try to obtain uniform stability-based generalization bounds for randomized tree ensembles and apply them to select and modify the regularization technique. Namely, we determined that it makes sense to focus on improving uniform stability, which affects variance, and to choose a pruning method that follows this idea. The experiments show that the proposed regularization can significantly improve the generalization of the kernel forests.
We studied the performance of kernel tree ensembles on several well-recognized datasets, mostly related to the UCI (University of California Irvine) set. In addition, we tested the applicability of those ensembles in psycholinguistics. Namely, we used them to assess the direction and dynamics of the reaction of social media users to significant events based on multi-level features of network discourse.
This paper is structured as follows. Section 2 provides important notations and definitions used in the paper. Section 3 discusses the current state of the art in oblique and kernel trees and forests, approaches to estimating the generalization error of trees and ensembles, and approaches to regularizing them. Section 4 contains the proposed algorithm to train kernel trees and forests. Sections 5 and 6 present the obtained uniform stability-based generalization bounds and the application of those bounds to choose the forest regularization. Sections 7 and 8 present the dataset description and the obtained experimental results. Finally, Section 9 provides some results on the training time and complexity of the proposed algorithm.

II. NOTATION AND DEFINITIONS
Let us introduce some definitions. We consider a decision tree as a set of decision stump chains. Each chain represents a path from the tree root to a leaf. Each chain contains from one to several binary decision stump estimators that should return a positive label if a classified object belongs to that chain and zero otherwise. The chain also includes a leaf estimator, which returns empirical probabilities that the classified object belongs to particular classes.
We modified the notation from [4] to define the trees more formally (Table 1). Let X be a random variable that takes values from R^f, U = {u_1, u_2, ..., u_C} be a set of class labels with size C, and Y be a random variable that takes values from U. Let P_XY be a distribution over X and Y, and let D_m be a training set of size m whose elements are i.i.d. with P_XY: D_m ∼ P_XY. Let X_m be the set of training objects from D_m and Y_m be the label set for those objects.
The set H = {h_1, h_2, ..., h_Γ} of length Γ is a family of leaf functions s = h_i(x, y) that map objects and their labels onto a subset of real numbers: R^f × U → [0..1].
Let all the functions
h_i^{D_{m_i}}(x, y) = \frac{1}{|D_{m_i}|} \big|\{(\tilde{x}, \tilde{y}) \in D_{m_i} \mid \tilde{y} = y\}\big| \in H    (1)
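For illustration, the leaf estimator in Eq. (1) is simply the vector of empirical class frequencies over the training samples that reach leaf i. A minimal sketch (the function and variable names are ours, not from the paper):

```python
import numpy as np

def leaf_estimator(leaf_labels, class_labels):
    """h_i(x, y) from Eq. (1): the share of the leaf's training samples
    whose label equals y, computed for every class label y."""
    counts = np.array([(leaf_labels == u).sum() for u in class_labels], dtype=float)
    return counts / max(len(leaf_labels), 1)

# Toy usage: a leaf that received five samples from a 3-class problem.
print(leaf_estimator(np.array([0, 0, 1, 2, 0]), class_labels=[0, 1, 2]))  # [0.6 0.2 0.2]
```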
φ : Rf → Xtr , (6)
With probability at least 1 − δ, for each set D_m containing m i.i.d. samples:
R(f_D) \le \hat{R}(f_D) + 2\gamma_m + (4m\gamma_m + M)\sqrt{\frac{\log(1/\delta)}{2m}}    (9)
However, this approach does not consider the features of randomized algorithms for ensemble training. Paper [10] proposes a random uniform stability score β_m. Let r be a random variable that controls the training of the particular estimators in the ensemble. Then the random uniform stability is bounded as follows:
\forall D_m \sim P_{XY}, \; \forall (x, y) \in D_m: \quad \sup_{D_m, x} \big| E_r\big(l(f_{D_m(r)}(x), y)\big) - E_r\big(l(f_{D_m^{\setminus i}(r)}(x), y)\big) \big| \le \beta_m    (10)
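To give a feel for how the uniform-stability bound (9) behaves, the following snippet evaluates its right-hand side for purely illustrative values of the stability γ_m, the loss range M, the sample size m, and the confidence δ (the numbers are ours, not taken from the paper):

```python
import math

def stability_bound(emp_risk, gamma_m, M, m, delta):
    """Right-hand side of Eq. (9): empirical risk plus the uniform-stability penalty."""
    return emp_risk + 2 * gamma_m + (4 * m * gamma_m + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# Illustrative values only; gamma_m is assumed to decay as O(1/m).
for m in (1_000, 10_000, 100_000):
    print(m, round(stability_bound(emp_risk=0.05, gamma_m=1.0 / m, M=1.0, m=m, delta=0.05), 4))
```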
Breiman proposed an approach to estimating the upper bound of the generalization error for Random Forests and bagging [3]. He considers an ensemble as a set of trees, and each tree returns a label of the predicted class. He defines an ensemble margin function that is the difference between the probability of true labels and the maximum probability of labels that are not true. Therefore, the larger the margin function is, the more samples are correctly classified. Eventually, let the ensemble strength be an expectation of the margin function on all the data. Then the generalization error depends on the strength and the margin correlation of individual trees (the bias/variance ratio). The higher the strength and the lower the correlation, the lower the error. However, that estimation cannot be applied to guide tree building and covers only binary classification problems.
Tibshirani and colleagues imply that pure impurity criteria do not properly capture the spatial distribution of training data. They use an optimal margin classifier at each split [11] to tackle the problem. They revealed that the proposed method has competitive accuracy and provides additional interpretability in its grouping of the classes. Following the same motivation, Manwani and Sastry [27] proposed a strategy to assess the hyperplanes in such a way that the geometric structure in the data is taken into account. At each stump of the decision tree, they find the clustering hyperplanes for both classes and use their angle bisectors as the split rule at that stump. Experimental results show that the strategy helps train small decision trees and improve performance. Papers [7] and [8] introduce a budget-aware classifier and a pruning strategy to minimize overfitting. The papers propose a sparse coding tree framework for multi-label annotation problems to deal with ultrahigh label dimensions. All the considered studies focused on optimizing margin or purity criteria; however, as we learned from theoretical results, both those factors affect overfitting [4], [11]. Norouzi et al. [17] proposed CO2-trees (Continuously Optimized Oblique trees) with oblique decision stumps. This work defines decision stump training as a structured prediction problem with latent variables [18]. The researchers propose a convex-concave loss function (convex-concave, i.e., the difference of several convex functions) and also apply a regularization approach when training such trees. Note that the convex-concave optimization problem can be effectively solved by the gradient method proposed in [28]. Although that function is an upper bound of the empirical loss, the ‘‘smoothness’’ of that function and the tightness of the bound heavily depend on the hyperparameters and the scale of the features, limiting the applicability of the method.
Paper [16] presents composite decision trees whose leaf estimators are chosen from a set of hypotheses consisting of subfamilies with different complexity. They also propose a data-dependent overfitting estimate for this family. That overfit score is used to select classification algorithms for each leaf. It is noteworthy that the score depends on the Rademacher complexity of the subfamily of algorithms and the proportion of training set objects correctly classified in each leaf [29]. Random composite decision trees are also introduced, and overfitting estimates are obtained for them, which are also applied to random forests. The authors propose an algorithm for training random composite trees (Random Composite Forest, RCF) based on the generation of a set of random composite trees. However, non-terminal decision stumps of such trees still use axis-parallel splits.
Another promising direction of research is the end-to-end training of oblique decision trees and forests. Paper [23] presents an expectation-maximization training approach, where the trees are fully probabilistic at train time but deterministic at test time. The experiments on image datasets show that the approach can learn more complex splits than common oblique ones and facilitates interpretability through spatial regularization. However, the expectation-maximization algorithm converges slowly on large datasets with many features. Similarly, the study [20] proposes to use the backpropagation method, which is usually used to train multilayer neural networks [30], to train oblique decision trees with a fixed structure. However, fixing the tree structure can lead to the training of overly complex classifiers with a large number of parameters and, as a result, worse generalization.
The study [22] notes that algorithms to build oblique decision trees require higher computation than axis-parallel trees. Besides, the training results depend on the initialization of the splitting hyper-planes. The researchers proposed a Weighted Oblique Decision Tree (WODT) based on continuous optimization with random initialization. They assign weights to the training samples of the child nodes at all internal nodes and then train a decision stump by optimizing the weighted information entropy loss. It is worth noting that the WODT loss is still non-convex; therefore, the global optimum may not be
found with pure gradient methods. At the same time, the experimental results show competitive accuracy scores on many datasets.
Several papers are focused on techniques to improve trained decision trees or forests. Paper [25] proposes an algorithm that improves a given tree and produces a new tree with the same or smaller structure but new parameter values that lower or leave unchanged the misclassification error. A hybrid SVM-based decision tree is proposed in [26]. The researchers have focused on reducing the number of test samples that need the SVM's help in getting classified. The central idea is to approximate the decision boundary of the SVM using decision trees. The resulting tree is a hybrid tree in the sense that it has both univariate and SVM nodes. The hybrid tree takes the SVM's help only in classifying crucial samples lying near the decision boundary; the remaining, less crucial samples are classified by univariate nodes. Paper [21] focuses on dual incremental learning strategies for oblique random forests when new inputs from the existing classes or unseen classes come. They propose a batch multiclass algorithm that uses a broad learning system and a multi-to-binary method to obtain an optimal oblique hyperplane in a higher dimensional space and then separate the samples into two supervised clusters at each node. The incremental strategy consists in analyzing the parameters of all stumps on the classification route for the increment of input samples and the increment of input classes.
We propose a method for training and regularizing a random forest of decision trees with kernel splits (with linear, polynomial, Gaussian, and other kernels), i.e., allowing one to create estimators with arbitrary complexity and generalizing ability. Similarly to CO2 Forest [17], we consider decision stump training as a structured learning problem; however, we define it in a simpler, explicit form. Similarly to WODT [22], we weigh training samples and minimize an impurity criterion, although we optimize this criterion together with a margin between data in sub-trees, which should lead to better generalization. In addition, the splitting criterion is planted into the optimization problem explicitly; therefore, a user can define their own criteria without re-creating the loss function from scratch. Still, in our case, training of decision stumps is reduced to solving the SVM with slack re-scaling problem, which is convex optimization with inequality constraints, for which a large number of computationally efficient methods have been previously proposed. We initially refused to find a global optimum to keep high training speed. Instead, we implemented a greedy stump optimization, presuming that the tree training is greedy itself.

C. REGULARIZATION OF BAGGING AND RANDOM FORESTS
According to the strength/correlation concept [3], the most obvious way to achieve better generalization of a tree ensemble is to add more randomness and reduce the correlation. One can randomly choose the splitting criterion [31] or randomly define how many features should be tested to build a stump [32]. The paper [33] proposes to discard the class empirical probabilities stored in all tree leaves of a pre-trained random forest and relearn them through a ‘‘global’’ refinement: all tree leaves are simultaneously optimized by explicitly minimizing a global loss function defined on all training samples, according to the averaging rule of random forests. We are going to use that approach as the basis for our experiments on ensemble regularization, so we will consider it more formally. Let \Phi : R^f \to \{0, 1\}^{T\Gamma} be a function that for any sample x returns a binary vector whose elements are 1 if x goes to the corresponding decision stump chain and 0 otherwise:
\Phi(x) = (\phi_1(x), \phi_2(x), \dots, \phi_{T\Gamma}(x))    (14)
Matrix W contains the corresponding class weights for all the decision chains in the ensemble:
W = (w_1, w_2, \dots, w_{T\Gamma})    (15)
Ren and colleagues define the refined tree classifier as the following linear function [33]:
y = W^* \Phi(x)    (16)
where W^* can be found with the following SVM-like optimization:
W^* = \arg\min_W \frac{1}{2}\|W\|_F^2 + \frac{C}{m}\sum_{i=1}^{m} l(y_i, \hat{y}_i), \quad \text{s.t. } \hat{y}_i = W\Phi(x_i), \; \forall i \in [1, m]    (17)
The proposed global refinement can be efficiently solved with a linear SVM or a ridge regression method. As a result, the complementary information between trees is exploited, and the fitting power is significantly improved. However, the global optimization in training might cause over-fitting for many tree leaves. Paper [33] proposes a global pruning method that alternates between refining all tree leaves and merging the insignificant leaves to reduce the risk of over-fitting and the model size. The method is to join two adjacent leaves if the norm of their leaf vectors is close to zero.
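As a rough illustration of this global refinement, the sketch below rebuilds Φ(x) from the leaf indices of a scikit-learn random forest and relearns all leaf weights with one regularized linear model. It is our own approximation of the procedure from [33] rather than the authors' code; the dataset, the ridge penalty, and the forest size are placeholders.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Train an ordinary random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# 2) Phi(x): one-hot indicators of the leaf (decision stump chain) reached in every tree.
encoder = OneHotEncoder(handle_unknown="ignore")
Phi_tr = encoder.fit_transform(forest.apply(X_tr))
Phi_te = encoder.transform(forest.apply(X_te))

# 3) Global refinement: relearn all leaf weights W jointly with one regularized
#    linear model, in the spirit of Eq. (16)-(17).
refined = RidgeClassifier(alpha=1.0).fit(Phi_tr, y_tr)
print("forest accuracy :", forest.score(X_te, y_te))
print("refined accuracy:", refined.score(Phi_te, y_te))
```

A ridge model is used here only because it has a cheap closed-form solution; a linear SVM over the same Φ features would fit the description in the text equally well.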
IV. TRAINING OF THE KERNEL TREES
We apply a commonly used greedy procedure to build the trees (see Algorithm 1). At each step, we train a decision stump, which splits the training data; then we recursively repeat this for the ‘‘left’’ and ‘‘right’’ partitions of the data until the requested tree height is reached.
First (steps 5-10), we set the target distribution of the classes in the sub-trees. We consider this distribution as a binary classification problem where the labels H = {−1, +1} represent the left and right sub-trees. That does not contradict the definition from the ‘‘Notation and definitions’’ section (II), since there is a one-to-one mapping between the sets {0, 1} and {−1, +1}. Usually, fitting of decision stumps in decision trees is done by optimization of some purity criterion like Gini impurity or information gain. In our case, the target distribution
c_s : U \to H should minimize a decision stump impurity criterion (Gini, information gain, etc.). Some criteria split classes, but we overlook that for the sake of speed performance.
s_{best} = \arg\min_s P_L^s g(p_L^s) + P_R^s g(p_R^s)    (18)
where for the Gini impurity we have:
g(p) = \sum_{i=1}^{N} (1 - p_i) p_i    (19)
and for information gain:
g(p) = -\sum_{i=1}^{N} p_i \log(p_i)    (20)
If the number of classes |U| is relatively small (we use a threshold value J to reveal that fact), the algorithm enumerates all possible distributions and assesses the value of the related impurity criterion (Steps 6-7). Then the obtained distribution list is sorted according to the impurity criterion, and the target distribution c_s is selected randomly from the top of the sorted list (we use a hyperparameter N to define the size of the top). If the number of classes |U| is greater than J, the algorithm applies a greedy procedure to reveal the target distribution (Step 9). The procedure starts from generating a random distribution of classes to subtrees. Then the algorithm iteratively changes the generated distribution and keeps the swaps that improve the impurity criterion. The algorithm repeats this procedure several times and outputs the best distribution c_s. As a result, we add some randomization to the decision stump training, which should reduce the correlation between trees. Then the algorithm uses c_s to define a subtree for each training sample from the dataset D_m: H_{best} = c_s(Y_m).

Algorithm 1 Kernel Tree
Input: dataset D_m; regularization term C; J, the threshold that determines whether the class-subtree distribution is found exactly or greedily; N, the size of the top best splittings from which the class-subtree distribution is randomly chosen.
1: call BuildTree(D_m)
2: BuildTree(D):
3: if D has only one class then return
4: else
5:   if |U| < J then
6:     s := all_distributions(U, H = {+1, −1})
7:     H_best := sort(Impurity(s))[rnd(1..N)]   ▷ U → H
8:   else
9:     H_best := greedy_find_best(D)   ▷ U → H
10:  end if
11:  L_1^*, ..., L_m^* := L(h_i, −h_i), h_i ∈ H_best
12:  w^*, ξ_1^*, ..., ξ_m^* := optimize_node(C, X, L_1^*, ..., L_m^*)   ▷ Solve an SVM with slack re-scaling: ξ_1/L_1^*, ..., ξ_m/L_m^*
13:  D_l := D[classify(D, w^*) ≥ 0]
14:  D_r := D[classify(D, w^*) < 0]
15:  BuildTree(D_l)
16:  BuildTree(D_r)
17: end if
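To make steps 5-12 concrete, here is a compressed sketch of training one stump: classes are assigned to the left/right subtree by enumerating assignments and scoring them with the Gini criterion of Eq. (18)-(19), and the split itself is then fitted with a linear SVM whose per-sample weights stand in for the slack re-scaling factors L(h_i, −h_i). This is our own simplification (exhaustive enumeration only, scikit-learn's LinearSVC instead of the modified solver, helper names are ours), not the authors' implementation.

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def split_impurity(y, side):
    """Weighted Gini of the two subtrees, Eq. (18)-(19); `side` holds +1/-1 per sample."""
    total = len(y)
    return sum((side == s).sum() / total * gini(y[side == s]) for s in (-1, 1) if (side == s).any())

def train_stump(X, y, C=1000.0):
    classes = np.unique(y)
    # Steps 5-7: enumerate class-to-subtree assignments and keep the purest one.
    best = min(
        (dict(zip(classes, assign)) for assign in itertools.product((-1, 1), repeat=len(classes))
         if len(set(assign)) == 2),
        key=lambda c_s: split_impurity(y, np.array([c_s[c] for c in y])),
    )
    h = np.array([best[c] for c in y])                      # target subtree per sample
    base = split_impurity(y, h)
    # Step 11: L_i ~ impurity growth if sample i lands in the wrong subtree.
    L = np.array([split_impurity(y, np.where(np.arange(len(y)) == i, -h, h)) - base for i in range(len(y))])
    L = np.clip(L, 1e-3, None)
    # Step 12: weighted linear SVM as a stand-in for the slack re-scaling solver.
    return LinearSVC(C=C).fit(X, h, sample_weight=L)

# Toy usage on random data with four classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 4, size=200)
stump = train_stump(X, y)
print("left/right sizes:", np.bincount((stump.predict(X) > 0).astype(int)))
```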
The next steps of the algorithm (Steps 11-12) are related to the training of a decision stump classifier; therefore, it should optimize the margin between sub-tree data and the impurity criterion simultaneously. In univariate trees, the optimization of the impurity criterion can be done quite effectively by selecting some features and testing thresholds by brute force. However, such an approach is not applicable in the case of oblique or more complex splits because it leads to deficient training speed. Suppose one wants to keep the original idea of purity maximization but make the fitting more effective. In this case, they need to construct some continuous objective function that would reflect the dependence between the parameters of the decision stump and the purity of the data. Thus, this problem could be solved via existing gradient optimization techniques.
Our method uses the general idea that a misclassification with a high purity decrease should be penalized more than one with only a slight decrease of purity. One of the implementations of this idea for SVM training is called ‘‘slack re-scaling’’ [18]. It was originally introduced to classify complex structures like syntax trees. The benefit of that implementation is that it is invariant regarding the scale of the features or the separating hyperplane. In contrast to other methods, we explicitly define the distribution of samples to subtrees.
The training of the decision stump can be defined as the following:
w^*, \xi^* = \arg\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} \xi_i,    (21)
\text{s.t.: } \forall i, \; w^T x_i h_i \ge 1 - \frac{\xi_i}{L(h_i, -h_i)}    (22)
where w^* are the parameters of the splitting hyperplane, \xi^* are the slack variables, h_i \in H_{best} is the predicted subtree for the data sample with index i, and L(h_i, -h_i) reflects the growth of the impurity criterion if the sample with index i is classified to the wrong sub-tree.
If one applies the KKT conditions to that problem, they will get the dual problem:
\tilde{a} = \arg\max_a -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} a_i a_j K(x_i, x_j) + \sum_{i=1}^{m} a_i,    (23)
\text{s.t.: } \sum_{i=1}^{m} \frac{a_i}{L(h_i, -h_i)} \le \frac{C}{m}    (24)
where a_i is the weight of the training sample i (non-zero for the support vectors), L(h_i, -h_i) reflects the growth of the impurity criterion in case of misclassification, and K(x_i, x_j) : R^f \times R^f \to R is a positively defined kernel function.
In contrast to the classification of structures [18], this problem can be efficiently solved in an explicit form since
improve the impurity criteria. The algorithm repeats this In contrast to the classification of structures [18], this
procedure several times and outputs the best distribution cs . problem can be efficiently solved in an explicit form since
As a result, we add some randomization to the decision stump there are only two classes. Regularization hyper-parameter
training, which should reduce correlation between trees. Then C should be found empirically for each dataset. In both
i=1 j=1 PD (j) of the elements from the training set D(rij−1 ) for the stump
j − 1 that are also presented in the training set D(rij ) for
The latter means that decision classifier stumps themselves
the next decision stump j, d(rij ) is the number of distinct
are uniformly stable.
samples in D(rij ), and Pr(d(rij )=k) returns the probability that
For the kernel support vector machine, the uniform stability
the sampled training datasets of an ensemble estimator have
depends on the regularization parameter C: γSVM = θ2mC
2
k distinct samples.
(where θ 2 ≥ K (x, x) reflects the range of the kernel function). If γk = O k1 , then the sum for βm reaches minimum if the
Therefore, the uniform stability of an algorithm to train kernel probability distribution defined for all decision stumps with
decision stump chain is upper-bounded as follows: the expression lt=1 PD(rij ) (t) is uniform. It is worth noting
Q
n
X θ 2C that Theorem 5 does not demand the tree leaves distribution
γSVM = φ(B, n) Qi (28) to be uniform; it just benefits the trees with balanced decision
i=1 2m j=1 PD (j) stumps. Conversely, in the case of uniform leaves, the number
Unfortunately, when assessing randomized uniform sta- of chains reaches the maximum and increases the stability
bility, the tree ensembles cannot be considered as sets of bound.
0
X random sampling parameters r. The results are consistent
βm ≤ 0.632B Eri γPr (i, 0.632m) (30)
with the ‘‘bias/variance’’ study from [36], although we also
i=1
show the effect of tree structures on stability and variance.
That is, for each ‘‘group’’ of independent decision chains On the one hand, the obtained expression for generaliza-
in the ensemble, the average probability of selection will tend tion error contains all the hyperparameters of the bagging
to some expected value. As a result, probability ‘‘outliers’’ in algorithm, trees, and decision stumps. On the other hand,
individual trees are smoothed out.
Q (See fig. 5). it explicitly integrates the ‘‘bias/variance’’ part [3] and the
In the theorem 5 the value lt=1 PD(rij ) (t) is bounded as ‘‘nonlinear perturbations’’ part of the random uniform stabil-
follows: [ m1 .. m−1
m ]. It makes no sense to consider the case ity [9]. Although direct optimization of the expression from
when the probability of transition along the chain is 0, since Theorem 6 does not make much sense, it provides some clues
such a chain can be reduced to a shorter one. Otherwise, after that can be used to construct an approach to regularize tree
going through the chain, at least one training sample remains, ensembles. Namely,we can reduce random uniform stability
which gives the probability m1 . Following similar reasoning, and optimize tree structures since stability affects the vari-
this value will not exceed m−1m .
ance. The latter can be done via pruning of extra decision
Then the change in the uniform stability of the decision stump chains that have low probability and do not affect R̂
stump chain training algorithm is finite and does not exceed much.
the value |γk/m − γk(m−1)/m | ≤ ck .Therefore, the random
uniform stability of the bagging on decision trees converges VI. REGULARIZATION OF BAGGING ON DECISION TREES
to its expected value (30) with an exponential rate. The study [33] provides a general optimization-based refine-
ment and pruning method that helps minimize empirical risk
T 0t 2
P(|βm − µ| ≤ t) ≤ 2 exp − 2 better, while not worsening the generalization. The pruning
γk/m − γk(m−1)/m helps reduce the number of tree leaves, which results in
achieving better stability of the tree ensemble (Theorem 5)
1
=O (31) without a significant increase in the empirical risk. However,
exp(T )
such an approach makes the ensemble estimators not indepen-
This result is consistent with the repeatedly empirically dent anymore. We believe some improvement can be achieved
confirmed thesis about the pointlessness of building bagging here if one reveals a trade-off between the global optimization
for 1000 or more trees. of leaf vectors and the dependency of the trees. We propose
to modify the feature generation process of the refinement as
B. GENERALIZATION OF BAGGING ON DECISION TREES
follows (33):
The theorem 3 can only be applied to assess the generalization
error of ensembles built by randomized algorithms, in which 8r (x) = (φ1 (x)r1 , φ2 (x)r2 , . . . , φT 0 (x)rT0 ), (33)
the training parameters are independent for each model in
the ensemble. We represent decision tree ensembles as sets where r1 , r2 , . . . , rT 0 are i.i.d. random binary variables taken
of decision stump chains (7); therefore, that condition is values {0, 1}.
When one generates the features for the refinement with TABLE 3. The properties of the standard machine learning datasets used
in the experiments.
our approach, they can control the ratio of the samples that
will be excluded from the dataset for particular decision
stump chains. The rest of the refinement and pruning is the
same as in the [33]. We believe this would help covariance
not increase so much during the refinement (see theorem 6).
One can argue that SVM itself is a well-regularized algo-
rithm that can hardly be disturbed by minor changes of
features if the regularization term C is correctly defined.
However, our experiments show slight but steady improve-
ment in classification accuracy, then we add a portion of
noise (10-30%) as described above (33). We have also imple-
mented a baseline tree pruning approach. In that approach,
we perform refinement and remove decision stump chains train-test split with respect to governmental and oppositional
with low weights (i.e., small l 2 -norm of leaf vectors) instead classes to keep proportion balance for 3-class classification.
of merging. The dataset contains features that were used to analyze the
reaction of the population to COVID-19 lockdown in Pikabu
VII. DATASETS social network [41]. These features can be divided into sev-
We conduct experiments on five UCI multi-class datasets, eral categories.
Kaggle’s BNP Paribas (Banque Nationale de Paris and 1) Psycholinguistic markers: the set of linguistic proper-
Paribas) credit scoring dataset, CIFAR-10 image dataset, ties of texts, including proportions of post-tags used
and ‘‘Youtube channels’’ psycholinguistic dataset. We used in the text, psycholinguistic coefficients, and stylistic
SatImage, USPS, Letter, MNIST [37] from the UCI. Table 3 features. For example, these markers include:
presents their primary parameters. Most datasets are devoted
• the ratio of the number of verbs to the number of
to image recognition problems. For example, the MNIST
adjectives (Tragger Coefficient);
and USPS datasets contain handwritten images of digits,
• the ratio of the number of infinitives to the total
while the Letter dataset contains Latin letters. The CIFAR-10
number of verbs;
dataset is also related to image recognition [38]. It contains
• frequency of use of first-person plural pronouns;
32 by 32 colored images of 10 classes (airplane, horse,
• frequency of first-person verbs in the past tense.
bird, etc.) with eight gray levels. We apply a simple pre-
processing technique to all the image recognition datasets. 2) Topical groups of words: these features were retrieved
Namely, we perform feature-level normalization of the data in the form of occurrence frequency of terms from
with ‘‘MinMaxScaler’’ and ‘‘Normalize’’ tools from Scikit- various dictionaries: topical (e. g. healthcare terms,
Learn [39]. No other complex processing is used. environmental terms, political terms), sentiment/
Kaggle’s BNP Paribas dataset is devoted to predicting moodbased (e.g. motivation vocabulary, anxiety vocab-
the category of a claim based on features available early in ulary, hostile vocabulary).
the process. The dataset contains categorical and numeric 3) Basic emotions dictionary (BE): the set of dictionaries
features available when BNP Paribas Cardiff received the consists of terms related to the expression of basic
claims. All string features are categorical. We utilize one-hot human emotions in the text: disgust, shame, anger, fear,
encoding to deal with categorical features. Thus, we obtain sadness, happiness, and wonder.
the data with pretty high-dimensional and sparse features. It is 4) Sentiment: a single item feature computed with the
a tool to test the ensemble methods on such data types. As in help of the Linis-Crowd sentiment dictionary. For each
the previous case, we applied Scikit-Learn to perform simple video, all comments and replies were concatenated into
normalization. a single text. Combined texts were processed to retrieve
‘‘Youtube channels’’ [40] is an example of a mid-sized text features.
dataset with i.i.d. samples. It does not contain complex
structural dependencies between features. It provides some VIII. EXPERIMENTAL RESULTS
real-valued markers measured on discussions from different Two versions of the kernel forests that utilize the proposed
Russian Youtube channels that belong to three categories: algorithm have been implemented for the tests. The first one
political-governmental, political-oppositional, non-political. was used primarily to perform experiments on small and
A period of 1 year between April 30, 2020, and April 30, medium-sized datasets like USPS or Youtube channels, or in
2021, was chosen as the targeted time span for data down- the large datasets, but with the oblique trees only. That version
load. In total, comments were processed for 4807 videos: is based on the Scikit-learn implementation of Liblinear SVM
2629 videos with political discussions (5 million messages) solver. The second version is devoted to processing large
and 2178 videos for the non-political subcorpus (1.2 mil- datasets with non-linear kernels, and it uses the Thunder SVM
lion messages). To perform classification, we used 70%-30% library, which provides GPU acceleration. It is also worth
noting that we added simple modifications to both solvers TABLE 4. Accuracy of considered randomized tree ensembles
(non-regularized).
because they do not consider sample weights (the version of
Scikit-learn we used has a parameter for the sample weights,
but does not handle it anyhow).
We applied a commonly recognized cross-validation tech-
nique to estimate the ensemble hyperparameters: the num-
ber of trees T = {30, 100, 300}, stump regularization
C = {100, 1000, 3000, 5000}, maximum tree depth n =
{3, 4, 5, 6, 7}, the proportion of features to be considered
at each stump f = {0.08, 0.1, 0.2, 0.3, 0.4, 0.5}, noise
(up to 0.3) and pruning (up to 0.9) ratios, and kernel parame-
ters (gamma = {10, 100} for the Gaussian kernel, degree = 3 TABLE 5. Precision of considered randomized tree ensembles
for the Polynomial kernel). Random uniform stability βm (non-regularized).
achieved with C > 1000, when the loss becomes ‘‘steep’’ One can define a probability space (, A, P) and the discrete
and hard to optimize. Therefore, future research is needed to random variable 9 : (, A) → (R, B) on this space, so that
create fast, presumably GPU-based, algorithms to train the 9(ω) = ω, ∀ω ∈ , where = {1, 2, . . . , n}, A is an
decision stumps. It seems that the stability and generalization algebra on , and B is a Borel algebra. That random variable
of decision trees depend on ‘‘outliers’’ on leave probability. defines a position βi on the right hand side of the (43).
Decision stump chains with low probability are unstable and Because of the product commutativity for all the values i ∈ B
randomized ensembles smooth them. As we revealed, tree the probability is:
refinement and pruning method [33] also helps make the 1
trees more stable, while not decreasing the discriminative P(i ∈ 9) = (44)
n
power much. We tried to modify that method to reduce the
correlation between estimators, and we obtained a steady Evidently, P satisfies all the properties of probability. Now
improvement in accuracy on some datasets. we can obtain the expectation of the right hand side of the
Many classification and regression problems presume to expression (43):
analyze discrete processes that could be done better by dis- n−2 n
X B B B X
crete models. We believe future development of random- γn ≤ ( )i P(i) + ( )n−1 P(n−1)+( )n−1 P(n) γi
ized tree ensembles should be focused on utilizing them 2 2 2
i=1 i=1
as a part of more sophisticated models that can catch n−2 n
1 X B i B n−1 X
complex relationships between features and data samples, ≤ ( ) +2 γj , (45)
i.e., [44], [45]. n 2 2
i=1 j=1
chain, an event A1 that is the training sample number i is Now we consider the following functions:
presented in the training data for the root decision stump
classifier, A2 that is the training sample number i is presented g1 (d, k) = ED,r (K (D, r)|Dk−1 , Dk = d)
in the training data for the next decision stump after root, An is − E Dk−1 (K (D, r)|Dk−1 ) (52)
presented in the training data for the i-th decision stump n.
and
Recall that according to the properties of stump chains, when
training stumps in a chain, each next stump is trained on a g2 (r, k) = Er (K (D, r)|D, Rk−1 rk = r)
subset of the data on which the previous stump was trained: − Er (K (D, r)|D, Rk−1 ) (53)
Dn ⊂ · · ·S⊂ D2 ⊂ D1 . Therefore, An ⊂ · · · ⊂ A2 ⊂
A1 and P( ni=1 Ai ) = P(A1 ) is the desired probability and Let value
P(A1 ) = mk . m
X T0
X
Now substituting the expression for the uniform stability v̂ = sup var( g1 (d, k)) + var(g2 (r, k))
of the algorithm to build decision stump classifier chain (27) r1 ,...,rk −1, k=1 k=1
d1 ,...,dk−1 ,
we have the desired expression. x
m
X
APPENDIX C = sup var( g1 (d, k))
d1 ,...,dk−1 ,
PROOF OF THE THEOREM 6 x k=1
Proof: For simplicity we will use flat indexing for T0
X
decision stumps. Suppose our ensemble has T 0 decision + sup var(g2 (r, k)) (54)
stumps. r1 ,...,rk −1, k=1
d1 ,...,dk−1 ,
Let define a difference between generalization error and x
empirical risk as the following random function: is the upper bound on the sum of variances of those functions
g1 and g2 . Namely, those expressions reflect variance of
K (D, r) = R(FD(r) ) − R̂(FD(r) ), (48)
the generalization error, which genesis are the training data
where D = {D1 , . . . , Dm } is a training dataset with size m (for g1 ) and random parameters of the training (for g2 ).
and r = {r1 , . . . , rT 0 } is a set of random vectors that are For the variance of g1 we have the following.
parameters for data sampling in decision stump nodes. It is Xm m
X
worth noting that some of vectors in r are interdependent. var( g1 (d, k)) = var(ED,r (K (D, r)|Dk−1 , dk = d)
We need to assess upper bound of that random function k=1 k=1
K (D, r) to obtain the upper bound of the generalization error.
We use MacDiarmid’s inequality (Theorem 3.7 from [46]) to − ED,r (K (D, r)|Dk−1 )) (55)
assess the deviation of values K from its expected value on r
and D. Let’s represent K according to definitions of R̂ and R. The expected value ED,r (K (D, R)|Dk−1 ) for a fixed Dk−1
does not affect the variance. We also know that D1 , . . . , Dm
m
1X are independent, besides r does not depend on D and vise
K (D, r) = EXY (l(FD(r) (X ), Y )) − l(FD(r) (xi ), yi ) versa.
m
i=1 Therefore, for the fixed Dk−1 we can put:
m
1X
≤ EXY (l(FD(r) (X ), Y )) + l(FD(r) (xi ), yi ) , ED,r (K (D, r)|Dk−1 , dk = d) = ED,r (K (D\k∪d , r)) (56)
m
i=1
(49) In accordance with Theorem 3.4 from [36] we have the
following.
where xi , yi ∈ D. m
Since l(FD(r) (x), y) is B-Lipshitz we can put:
X
sup var( g1 (d, k))
k=1 d1 ,...,d
x
k−1 ,
K (D, r) ≤ 2B sup FD(r) (X ) (50) m
X
X
2
≤ 4B sup Ed Ed (ED,r (FD\k∪d (r) (x, y)))
By the definition of the decision stump chain FD(r) (x) > 0, k=1 d1 ,...,d
x
k−1 ,
thus: 2
≤ B m βm
4
K (D, r) ≤ 2B sup FD(r) (X ) − ED,r (FD\k∪d (r) (x, y)) (57)
(51)
X
Let Rk−1 denotes an event when r1 = r1 , . . . , rk−1 = Now consider g2 (r, k). Combining (51) and the expression
P 0
rk−1 , and Dk−1 is an event when D1 = d1 , . . . , Dk−1 = FD(r) (x, y) = Ti=1 fD(r) (x, y) we have:
dk−1 , R denotes RT 0 and D denotes Dm . Let D = T0
d1 , . . . , dk−1 , Dk , . . . , Dm and D\k∪d = d1 , . . . , dk−1 , d,
X
var(g2 (r, k))
Dk+1 , . . . , Dm . k=1
T0
2 X T0
2B X Then substitute it to (64):
≤ var( Eri (fD(ri ) (x, y)|D, Rk−1 , rk = r)
T
k=1 i=1 R(FD(r ) − R̂(FD(r ) − 2βm
− Eri (fD(ri ) (x, y)|D, Rk−1 )) (58) 1
1 2BM
r
1 2BM 2 1
≤ ln( ) + ln( ) + 8ln v̂ (66)
Note that the difference between the expectations is 2 δ 3 δ 3 δ
non-zero only for the decision stump chains that are depen- Finally, with probability at least 1 − δ we have:
dent on rk . Denote set of numbers of those chains as ρk .
Observe that for a fixed Rk−1 and D the expected value 1 1 2M
R ≤ R̂ + 2βm + B ln( )
Eri (fD(ri ) (x, y)|D, Rk−1 )) is a constant, so it does not affect 2 δ 3
the variance. Then we have: s
2 (2M )2 0 3
1 2M 2 1
T0 + ln( ) + 8ln B2 m βm +
X δ 3 δ T
var(g2 (r, k))
k=1
(67)
T0
2 X
which concludes the proof.
2B X
Eri (fD(ri ) (x, y)|D, Rk−1 , rk = r))
≤ var(
T
k=1 i∈ρk REFERENCES
(59) [1] L. Breiman, J. H. Friedman, and R. A. Olshen, Classification and Regres-
sion Trees. Belmont, CA, USA: Wadsworth, 1984.
The size of the set ρk cannot exceed a number of tree [2] T. Chen and C. Guestrin, ‘‘XGBoost: A scalable tree boosting system,’’
in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
leaves. Denote the maximum number of tree leaves in the
Aug. 2016, pp. 785–794.
ensemble as 0 = supk |ρk |. For simplicity we presume that [3] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32,
all ensemble trees have the same number of tree leaves. Value 2001.
Eri (fri ,D (x, y)|D, Rk−1 , rk = r) is bounded by the range of [4] M. Golea, P. Bartlett, L. Mason, and W. S. Lee, ‘‘Generalization in decision
trees and DNF: Does size matter?’’ in Proc. Adv. Neural Inf. Process. Syst.,
f , which is M ; therefore we can put: 1998, pp. 259–265.
[5] V. Vapnik, Statistical Learning Theory. Hoboken, NJ, USA: Wiley, 1998,
T0 T0
2 X
2 (2BM )2 0 3
X 2B p. 768.
var(g2 (r, k)) ≤ M0 = (60) [6] L. Breiman, ‘‘Some properties of splitting criteria,’’ Mach. Learn., vol. 24,
T T no. 1, pp. 41–47, 1996.
k=1 k=1
[7] W. Liu and I. W. Tsang, ‘‘Sparse perceptron decision tree for millions of
Now we apply the McDiarmid inequality (3.8) dimensions,’’ in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 1181–1187.
from [46]: [8] W. Liu and I. W. Tsang, ‘‘Making decision trees feasible in ultrahigh
feature and label dimensions,’’ J. Mach. Learn. Res., vol. 18, pp. 1–36,
Jul. 2017.
PD,r K (D, r) − ED,r (K (D, r)) ≥ t) [9] J. H. Friedman and P. Hall, ‘‘On bagging and nonlinear estimation,’’
J. Stat. Planning Inference, vol. 137, no. 3, pp. 669–683, Mar. 2007, doi:
10.1016/j.jspi.2006.06.002.
t2 [10] A. Elisseeff, T. Evgeniou, and M. Pontil, ‘‘Stability of randomized learning
≤ exp(− ), (61)
2v̂(1 + (bt/3v̂)) algorithms,’’ J. Mach. Learn. Res., vol. 6, no. 1, pp. 55–79, Dec. 2005.
[11] R. Tibshirani and T. Hastie, ‘‘Margin trees for high-dimensional classifi-
where b = BM . Let’s denote r.h.s. of the inequality as δ and cation,’’ J. Mach. Learn. Res., vol. 8, no. 3, pp. 637–652, 2007.
express t from it: [12] J. R. Quinlan, ‘‘Induction of decision trees,’’ Mach. Learn., vol. 1, no. 1,
pp. 81–106, 1986.
[13] Y.-H. Chen, S.-H. Lyu, and Y. Jiang, ‘‘Improving deep forest by exploiting
t2
δ = exp(− ) (62) high-order interactions,’’ in Proc. IEEE Int. Conf. Data Mining (ICDM),
2v̂(1 + (bt/3v̂)) Dec. 2021, pp. 1030–1035.
[14] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, ‘‘OC1: A randomized
Because t > 0 : algorithm for building oblique decision trees,’’ in Proc. AAAI Conf. Artif.
r Intell., vol. 93, 1993, pp. 322–327.
1 1 2BM 1 2BM 2 1 [15] T. Evgeniou, M. Pontil, and A. Elisseeff, ‘‘Leave one out error, stability,
t= ln( ) + ln( ) + 8ln v̂ (63) and generalization of voting combinations of classifiers,’’ Mach. Learn.,
2 δ 3 δ 3 δ vol. 55, no. 1, pp. 71–97, Apr. 2004.
With probability 1 − δ with respect to D and r: [16] G. DeSalvo and M. Mohri, ‘‘Random composite forests,’’ in Proc. AAAI
Conf. Artif. Intell., Phoenix, AZ, USA, 2016, pp. 1540–1546.
[17] M. Norouzi, M. D. Collins, D. J. Fleet, and P. Kohli, ‘‘CO2 forest:
K (D, r) − ED,r (K (D, r)) Improved random forest by continuous optimization of oblique splits,’’
r 2015, arXiv:1506.06155.
1 1 2BM 1 2BM 2 1 [18] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer,
≤ ln( ) + ln( ) + 8ln v̂ (64)
2 δ 3 δ 3 δ ‘‘Large margin methods for structured and interdependent output vari-
ables,’’ J. Mach. Learn. Res., vol. 6, no. 9, pp. 15–64, 2005.
The next step is completely analogous to the proof of [19] B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and
F. A. Hamprecht, ‘‘On oblique random forests,’’ in Proc. Joint Eur.
Theorem 3.4 in [10]. Let’s obtain the expectation K (D, r) on
Conf. Mach. Learn. Knowl. Discovery Databases, Bristol, U.K., 2011,
D and r. pp. 453–469.
[20] O. Irsoy and E. Alpaydin, ‘‘Autoencoder trees,’’ in Proc. Asian Conf. Mach.
ED,r (K (D, r)) = 2βm (65) Learn., Hamilton, New Zealand, 2016, pp. 378–390.
[21] Z. Chai and C. Zhao, ‘‘Multiclass oblique random forests with dual- [40] YouTube Channels Dataset. Accessed: Jul. 1, 2022. [Online]. Available:
incremental learning capacity,’’ IEEE Trans. Neural Netw. Learn. Syst., http://keen.isa.ru/youtube
vol. 31, no. 12, pp. 5192–5203, Dec. 2020, doi: 10.1109/TNNLS. [41] I. Smirnov, M. Stankevich, Y. Kuznetsova, M. Suvorova, D. Larionov,
2020.2964737. E. Nikitina, and O. Grigoriev, ‘‘TITANIS: A tool for intelligent text
[22] B. B. Yang, S. Q. Shen, and W. Gao, ‘‘Weighted oblique decision analysis in social media,’’ in Proc. Russian Conf. Artif. Intell., 2021,
trees,’’ in Proc. AAAI Conf. Artif. Intell., Honolulu, HI, USA, 2019, pp. 232–247.
pp. 5621–5627. [42] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan,
[23] T. M. Hehn, J. F. P. Kooij, and F. A. Hamprecht, ‘‘End-to-end learn- ‘‘A dual coordinate descent method for large-scale linear SVM,’’ in Proc.
ing of decision trees and forests,’’ Int. J. Comput. Vis., vol. 128, no. 4, 25th Int. Conf. Mach. Learn., 2008, pp. 408–415.
pp. 997–1011, Apr. 2020, doi: 10.1007/s11263-019-01237-6. [43] A. Abdiansah and R. Wardoyo, ‘‘Time complexity analysis of support
[24] K. P. Bennett and J. A. Blue, ‘‘A support vector machine approach to vector machines (SVM) in LibSVM,’’ Int. J. Comput. Appl., vol. 128, no. 3,
decision trees,’’ in Proc. IEEE Int. Joint Conf. Neural Netw., IEEE World pp. 28–34, 2015, doi: 10.5120/ijca2015906480.
Congr. Comput. Intell., vol. 3, May 1998, pp. 2396–2401. [44] Z.-H. Zhou and J. Feng, ‘‘Deep forest,’’ 2017, arXiv:1702.08835.
[25] M. A. Carreira-Perpinán and P. Tavallali, ‘‘Alternating optimization of [45] P. Ma, Y. Wu, Y. Li, L. Guo, and Z. Li, ‘‘DBC-Forest: Deep forest with
decision trees, with application to learning sparse oblique trees,’’ in Proc. binning confidence screening,’’ Neurocomputing, vol. 475, pp. 112–122,
Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–11. Feb. 2022, doi: 10.1016/j.neucom.2021.12.075.
[26] M. A. Kumar and M. Gopal, ‘‘A hybrid SVM based decision tree,’’ [46] C. McDiarmid, ‘‘Concentration,’’ in Probabilistic Methods for Algorithmic
Pattern Recognit., vol. 43, no. 12, pp. 3977–3987, Dec. 2010, doi: Discrete Mathematics. Berlin, Germany: Springer, 1998, pp. 195–248.
10.1016/j.patcog.2010.06.010.
[27] N. Manwani and P. S. Sastry, ‘‘Geometric decision tree,’’ IEEE Trans.
Syst., Man, Cybern. B, Cybern., vol. 42, no. 1, pp. 181–192, Feb. 2012, DMITRY A. DEVYATKIN received the M.S.
doi: 10.1109/TSMCB.2011.2163392. degree in computer science from Rybinsk State
[28] A. L. Yuille and A. Rangarajan, ‘‘The concave-convex procedure,’’ Neural Aviation Technology University, Rybinsk, Russia,
Comput., vol. 15, no. 4, pp. 915–936, Apr. 2003. in 2011. He is currently pursuing the Ph.D. degree
[29] P. L. Bartlett and S. Mendelson, ‘‘Rademacher and Gaussian complexi- in computer science with the Federal Research
ties: Risk bounds and structural results,’’ J. Mach. Learn. Res., vol. 3,
Center ‘‘Computer Science and Control,’’ RAS,
pp. 463–482, Nov. 2002.
[30] R. Hecht-Nielsen, ‘‘Theory of the backpropagation neural network,’’ in
Moscow, Russia.
Neural Networks for Perception. New York, NY, USA: Academic, 1992, Since 2011, he has been a Researcher with the
1992, pp. 65–93. Russian Artificial Intelligence Research Institute,
[31] M. Robnik-Sikonja, ‘‘Improving random forests,’’ in Machine Learning: Federal Research Center ‘‘Computer Science and
ECML 2004. Berlin, Germany: Springer, 2004, pp. 359–370. Control,’’ RAS. His research interests include machine learning, randomized
[32] S. Bernard, L. Heutte, and S. Adam, ‘‘Forest-RK: A new random for- ensembles, natural language processing, information extraction, and infor-
est induction method,’’ in Advanced Intelligent Computing Theories and mation retrieval.
Applications. With Aspects of Artificial Intelligence. Berlin, Germany: Mr. Devyatkin’s awards and honors include the Best Paper Award from
Springer, 2008, pp. 430–437. the IEEE 8th International Conference on Intelligent Systems (IS).
[33] S. Ren, X. Cao, Y. Wei, and J. Sun, ‘‘Global refinement of random forest,’’
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
pp. 723–730. OLEG G. GRIGORIEV received the M.S. degree
[34] G. King and L. Zeng, ‘‘Logistic regression in rare events data,’’ Political in computer science from the Moscow Institute of
Anal., vol. 9, no. 2, pp. 137–163, 2001. Electronics and Mathematics (MIEM), Moscow,
[35] B. Taskar, C. Guestrin, and D. Koller, ‘‘Max-margin Markov networks,’’ Russia, in 1980, and the Ph.D. degree in computer
in Proc. Adv. Neural Inf. Process. Syst., vol. 16, 2004, pp. 25–32. science from the Moscow State University of
[36] A. Elisseeff, T. Evgeniou, and M. Pontil, ‘‘Stability of randomized learning Technology ‘‘STANKIN,’’ in 2004.
algorithms with an application to bootstrap methods,’’ Tech. Rep., 2004. From 1980 to 1989, he had been a Research
[37] P. M. Murphy and D. W. Aha, ‘‘UCI Repository of machine learn-
Fellow with the Computing Centre of the
ing databases,’’ Dept. Inf. Comput. Sci., Univ. California, Irvine, CA,
USA, Tech. Rep., 1991. Accessed: Jul. 24, 2022. [Online]. Available:
Soviet Academy of Sciences, Moscow. In 1989,
https://archive.ics.uci.edu/ml/about.html he founded ‘‘Informatic Ltd.’’ He had been the
[38] A. Krizhevsky, ‘‘Learning multiple layers of features from tiny images,’’ CEO of this company, from 1989 to 2010. Since 2010, he has been a Principal
M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Researcher at the Federal Research Center ‘‘Computer Science and Control,’’
2009. RAS, Moscow. He is also the Developer of the first Russian spell checker
[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, ORFO. His research interests include machine learning, natural language
and E. Duchesnay, ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. processing, and digital healthcare.
Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.