
Received 4 July 2022, accepted 16 July 2022, date of publication 25 July 2022, date of current version 28 July 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3193385

Random Kernel Forests


DMITRY A. DEVYATKIN AND OLEG G. GRIGORIEV
Federal Research Center ‘‘Computer Science and Control,’’ RAS, 119333 Moscow, Russia
Corresponding author: Dmitry A. Devyatkin ([email protected])
This work was supported by the Ministry of Science and Higher Education of the Russian Federation under Project 075-15-2020-799.

ABSTRACT Random forests of axis-parallel decision trees still show competitive accuracy in various
tasks; however, they have drawbacks that limit their applicability. Namely, they perform poorly for multidimensional sparse data. A straightforward solution is to create forests of decision trees with oblique splits;
however, most training approaches have low performance. Besides, those ensembles appear unstable and
easily overfit, so they should be regularized to find the trade-off between complexity and generalization.
This paper proposes an algorithm to train kernel decision trees and random forests. At the top level, this
algorithm follows a common greedy procedure to train decision trees; however, it produces quasi-optimal
oblique and kernel splits at the decision stump level. At each stump, the algorithm finds a quasi-optimal
distribution of classes to subtrees and trains this stump via optimization of an SVM-like loss function with
a margin re-scaling approach, which helps optimize the margin between subtree data and arbitrary impurity
criteria. We also try to reveal uniform stability-based generalization bounds for those forests and apply them
to select the regularization technique. The obtained bounds explicitly consider primary hyperparameters
of forests, trees, and decision stumps. The bounds also show a relationship between outliers in decision
tree structure, stability and generalization, and illustrate how forests smooth these outliers. We performed
an experimental evaluation on several tasks, such as studying the reaction of social media users, image
recognition, and bank scoring. The experiments show that the proposed algorithm trains ensembles, which
outperform other oblique or kernel forests in many commonly-recognized datasets. Namely, the proposed
algorithm shows 99.1% accuracy on MNIST and 58.1% on CIFAR-10. It has been confirmed that the selected
regularization technique helps reduce overfitting on several datasets. Therefore, the proposed algorithm
may be considered a small step toward customized and regularized ensembles of kernel trees that keep
reasonable training performance on large datasets. We believe that the proposed algorithm can be utilized as
a construction element of approaches to training context-aware models on complex problems with integer
inputs and outputs, such as text mining and image recognition.

INDEX TERMS Generalization bounds, kernel splits, kernel forests, random forests, random uniform
stability, regularization, slack re-scaling.

I. INTRODUCTION
Decision trees [1], their ensembles [2], and random forests [3] are the cornerstone of much intelligent software. They are still winners in many data mining competitions. Binary decision trees consist of nodes called decision stumps. These nodes divide the data into two subsets according to a splitting (impurity) criterion, such as Gini impurity or information gain, which corresponds to a discrete optimization problem. Optimizing axis-parallel decision stumps is straightforward because one can test all possible thresholds for each feature and then select the best parameters according to the split criterion. However, that type of split has limited discriminative strength. For example, it performs poorly on multidimensional sparse data, which is a frequent input in many text and image processing tasks.
One of the approaches to tackle the problem is to train trees with more complex decision rules, for example, with oblique (linear) splits. Unfortunately, oblique decision trees have low generalization ability. Besides, most approaches to training oblique splits have low performance or are hard to tune; they often presume optimization of a non-convex or even non-smooth loss.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.


Promising directions of oblique and kernel tree development are bagging, sub-bagging, or random forest ensembles, which have better stability than their underlying classifiers. However, even those ensembles appear unstable and easily overfit, so they should be regularized to reveal the trade-off between complexity and generalization. Several studies show that the margin between subtree data and the splitting criteria affect tree generalization [4], [11]. Therefore, kernel or oblique tree decision stumps should optimize those parameters simultaneously. Further improvement of oblique decision tree ensembles requires a detailed estimate of the generalization ability that would consider the values of the training hyperparameters, the properties of the tree structures, and the features of the training algorithms. However, studies on generalization consider single trees or randomized ensembles of arbitrary estimators but do not focus on the features of combined concepts like bagging on decision trees or Random Forests.
This paper proposes decision trees with kernel splits and random forests of those trees, i.e., trees whose splits are kernel classifiers with arbitrary kernel functions. The main contributions of our paper are as follows.
• We propose a greedy algorithm to train kernel trees and obtain their kernel splits via optimization of an SVM-like loss function. This function helps improve both arbitrary impurity criteria and the margin between sub-trees. Thereby, the method should train trees with sufficient generalization, and existing efficient linear-time solvers can be used to perform the training on large datasets.
• The experiments show that it makes sense to complicate the decision splits further. Namely, kernel forests with non-linear (radial basis function and polynomial) kernels outperform oblique forests on several datasets.
• We also try to obtain uniform stability-based generalization bounds for randomized tree ensembles and apply them to select and modify the regularization technique. Namely, we determined that it makes sense to focus on improving uniform stability, which affects variance, and choose a pruning method that follows this idea. The experiments show that the proposed regularization can significantly improve the generalization of the kernel forests.
We studied the performance of kernel tree ensembles on several well-recognized datasets, mostly related to the UCI (University of California Irvine) set. In addition, we tested the applicability of those ensembles in psycholinguistics. Namely, we used them to assess the direction and dynamics of the reaction of social media users to significant events based on multi-level features of network discourse.
This paper is structured as follows. Section 2 provides important notations and definitions used in the paper. Section 3 discusses the current state of the art in oblique and kernel trees and forests, approaches to estimating the generalization error of trees and ensembles, and ways to regularize them. Section 4 contains the proposed algorithm to train kernel trees and forests. Sections 5 and 6 present the obtained uniform stability-based generalization bounds and the application of the bounds to choose the forest regularization. Sections 7 and 8 present the dataset descriptions and the obtained experimental results. Finally, Section 9 provides some results on the training time and complexity of the proposed algorithm.

TABLE 1. Important notations in this paper.

II. NOTATION AND DEFINITIONS
Let us introduce some definitions. We consider a decision tree as a set of decision stump chains. Each chain represents a path from the tree root to a leaf. Each chain contains from one to several binary decision stump estimators that should return a positive label if a classified object belongs to that chain and zero otherwise. The chain also includes a leaf estimator, which returns empirical probabilities that the classified object belongs to particular classes.
We modified the notation from [4] to define the trees more formally (Table 1). Let X be a random variable that takes values from R^f, U = {u_1, u_2, ..., u_C} be a set of class labels with size C, and Y be a random variable that takes values from U. Let P_{XY} be a distribution over X and Y, and D_m be a training set of size m whose elements are i.i.d. with P_{XY}: D_m ~ P_{XY}. Let X_m be the set of training objects from D_m and Y_m be the label set for those objects.
A set H = {h_1, h_2, ..., h_Γ} of length Γ is a family of leaf functions s = h_i(x, y) that map objects and their labels onto a subset of real numbers, R^f × U → [0..1]. Let all the functions

h^i_{D_{m_i}}(x, y) = \frac{1}{|D_{m_i}|} \left|\{(\tilde{x}, \tilde{y}) \mid (\tilde{x}, \tilde{y}) \in D_{m_i} \wedge \tilde{y} = y\}\right| \in H    (1)

return the probability that an object x with the label y belongs to the leaf with index i, where D_{m_i} is the subset of D_m that belongs to the leaf with index i.
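The leaf estimator in (1) is simply an empirical class-frequency table for the samples routed to a leaf. A minimal Python sketch (our own illustration, not from the paper; names are hypothetical):

import numpy as np

def leaf_class_probabilities(y_leaf, classes):
    # Empirical estimate of h_i(x, y) from (1): the share of samples in the
    # leaf subset D_{m_i} that carry each class label.
    y_leaf = np.asarray(y_leaf)
    counts = np.array([(y_leaf == c).sum() for c in classes], dtype=float)
    return counts / max(len(y_leaf), 1)

print(leaf_class_probabilities([0, 0, 1, 2, 1], classes=[0, 1, 2]))  # [0.4 0.4 0.2]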

We will presume further that h_{D_i}(x, y) denotes h^i_{D_{m_i}}(x, y) for simplicity. We define Q = {q_1, q_2, ..., q_n} as a family of decision stump estimators s = q_{j D_{m_j}}(x) trained on the subsets D_{m_j} ⊂ D_m. We will presume further that q_{D_j}(x) denotes q_{j D_{m_j}}(x) for simplicity. q_{D_j}(x) returns 1 if x belongs to the decision stump j, and 0 otherwise: R^f → {0, 1}.
Definition 1: A decision stump chain of a leaf with index i is a tuple ⟨h_{D_i}, Q⟩ whose decision function is

f_{D_{m_i}}(i, x, y) = h_{D_i}(x, y) \prod_{j=1}^{|Q|} q_{D_j}(x)    (2)

We assert that all the decision stump chains should satisfy the following properties:
1) There is an order in how the chain's estimators are trained: (i < j) ∧ (q_{D_i}(x), q_{D_j}(x) ∈ Q) → D_j ⊂ D_i
2) Each decision stump estimator defines the data that is used to train the following stump in the chain: (i < j) ∧ ((x_1, y_1), (x_2, y_2) ∈ D_j) ∧ (q_{D_i}(x), q_{D_j}(x) ∈ Q) → q_{D_i}(x_1) q_{D_i}(x_2) > 0
3) Each decision stump estimator in the chain defines the data that will be ignored when the following stump in the chain is trained: (i < j) ∧ ((x_1, y_1) ∈ D_i, (x_2, y_2) ∈ D_i \ D_j) ∧ (q_{D_i}(x), q_{D_j}(x) ∈ Q) → q_{D_i}(x_1) q_{D_j}(x_2) = 0
Definition 2: Let Q = {Q_1, ..., Q_Γ} be a set of families of decision stump estimators. A binary tree is a tuple ⟨H, Q⟩ whose decision function is:

f_{D_m}(x, y) = \sum_{i=1}^{\Gamma} h_{D_i}(x, y) \prod_{j=1}^{|Q_i \in Q|} q_{D_{ij}}(x),    (3)

where q_{D_{ij}} refers to the decision stump estimator q_{D_j} from the family Q_i, and Γ is the number of leaves.
Therefore, a decision tree is a combination of the decision stump chains (see Fig. 1).

FIGURE 1. Decision tree with height 3 and its representation as the set of decision stump chains.

Definition 3: A kernel decision tree is a binary decision tree with the following stump estimators:

\forall q_D \in Q: \ q_D(x) = \frac{1}{2}\left(1 + \mathrm{sgn}\left(\sum_{i=1}^{|D|} a_i y_i K(x_i, x) + b\right)\right),    (4)

where D is a training dataset for q_D, (x_i, y_i) ∈ D, a_i and b are parameters of the estimators, and K is a positively defined kernel, so that

K(x, z) = \langle \phi(x), \phi(z) \rangle,    (5)

and φ(x) is a transformation function

\phi: R^f \to X_{tr},    (6)

where X_{tr} is some transformed feature space.
Definition 4: Suppose there are T sets of leaf functions H_1, ..., H_T and T sets of families of decision stump estimators Q_1, ..., Q_T. A bagging ensemble over decision trees is a set of binary decision trees {t_i | t_i = ⟨H_i, Q_i⟩, i = 1, ..., T} such that the decision function can be defined as follows:

F_{D_m(r)}(x, y) = \frac{1}{T} \sum_{i=1}^{T} \sum_{j=1}^{\Gamma_i} h_{D_{ij}(r_{ij})}(x, y) \prod_{k=1}^{|Q_j \in Q_i|} q_{D_{ijk}(r_{ij})}(x),    (7)

where h_{D_{ij}} refers to the leaf function h_{D_j} from the set H_i, q_{D_{ijk}} refers to the decision stump estimator q_{D_k} from the family Q_j that is in the set Q_i, r = {r_{11}, r_{12}, ..., r_{Γ_T T}} are random parameters that control data sampling in the ensemble trees, D_m(r) is a sampled dataset obtained with the parameter r, T is the number of trees, and Γ_i is the number of leaves in the tree with index i.
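To make Definitions 1 and 3 concrete, the following Python sketch (our own illustration, not the authors' implementation; class and function names are hypothetical) evaluates a kernel stump of the form (4) and a decision stump chain of the form (2):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # One admissible positively defined kernel in the sense of (5).
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

class KernelStump:
    # Kernel decision stump in the form of (4): q_D(x) in {0, 1}.
    def __init__(self, support_x, support_y, a, b, kernel=rbf_kernel):
        self.support_x, self.support_y = support_x, support_y
        self.a, self.b, self.kernel = a, b, kernel

    def __call__(self, x):
        s = sum(ai * yi * self.kernel(xi, x)
                for ai, yi, xi in zip(self.a, self.support_y, self.support_x))
        return 1.0 if s + self.b >= 0 else 0.0

def chain_decision(leaf_probs, stumps, x, y):
    # Decision stump chain (2): leaf estimate gated by the product of stump outputs.
    gate = 1.0
    for q in stumps:
        gate *= q(x)
    return leaf_probs.get(y, 0.0) * gate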


III. RELATED WORK
A. GENERALIZATION BOUNDS OF DECISION TREES, BAGGING AND RANDOM FORESTS
We will begin with a review of generalization scores for decision trees and their randomized ensembles because they can provide requirements for training algorithms that help improve generalization. Study [4] gives an estimate of the generalization of decision trees for binary classification. It introduces the concept of an effective number of tree leaves, which is a scaled original leaf number. The closer the empirical distribution of training samples over leaves is to the uniform one, the closer the effective number is to the original one. The researchers showed that the upper bound of the generalization error positively depends on the effective number of leaves and the VC-dimension (Vapnik-Chervonenkis dimension) [5] of the decision stump estimators. In the context of the result above, it is worth noting that the tree structure, including the data distribution over leaves, depends on the splitting criterion [6]; therefore, the criterion affects generalization. Papers [7] and [8] present a data-dependent generalization error bound for a kernel decision tree, which shows that the kernel capacity and the margin between training data in subtrees affect generalization.
The generalization of bagging is significantly affected by the training algorithm. Friedman and Hall [9], via the framework of Taylor decomposition, show that the loss function of the ensemble has nonlinear stochastic perturbations. Bagging reduces those perturbations and the variance; therefore, if they are sufficiently symmetric, bagging does not significantly increase bias.
Breiman proposed an approach to estimating the upper bound of the generalization error for Random Forests and bagging [3]. He considers an ensemble as a set of trees, and each tree returns a label of the predicted class. He defines an ensemble margin function that is the difference between the probability of true labels and the maximum probability of labels that are not true. Therefore, the larger the margin function is, the more samples are correctly classified. Eventually, let the ensemble strength be the expectation of the margin function on all the data. Then the generalization error depends on the strength and the margin correlation of individual trees (the bias/variance ratio). The higher the strength and the lower the correlation, the lower the error.
Although that estimation cannot be applied to perform ensemble regularization, because all the data is unknown and the strength and correlation cannot be evaluated, it shows that the expectation of the ensemble margin and the correlation of the tree margins affect generalization. This result benefits ensembles of decision trees built with greedy algorithms because they introduce additional randomness into the tree margin. Therefore, one can reduce the generalization error by applying trees with higher strength, which would not increase the correlation much. We believe that claim is the primary motivation to modify Random Forest.
Studies [3] and [9] reveal the mechanism that reduces generalization error in bagging ensembles and random forests. However, those approaches do not provide many clues regarding defining the hyperparameters of forests, trees, and splitting criteria, or possible ways to regularize the ensembles. Some practically relevant results [10] here were achieved using the concept of uniform stability. We will use that concept in our further reasoning; therefore, let us consider a formal definition.
In randomized ensembles the generalization depends not only on the complexity of the estimators, but rather on the algorithm used to train the ensemble. Let us denote the loss of an estimator f: R^f → U for an object x as l(f(x), y). Let D_m and D_m^{\setminus i} be two training datasets of size m that differ by only one sample with index i. Then the training algorithm A is uniformly stable with stability γ if, for every D_m and D_m^{\setminus i}, it trains estimators that satisfy the following:

\forall D_m \sim P_{XY}, \ \forall i = 1, ..., m: \quad \| l(f_{D_m}(\cdot), \cdot) - l(f_{D_m^{\setminus i}}(\cdot), \cdot) \|_\infty \leq \gamma_m    (8)

We presume that in our case the decision stump estimators h_D(x, y) return real numbers, but as has been shown in [15], that approach can be easily modified for estimators like q_D(x): R^f → {0, 1} (binary classification). Therefore, for simplicity we stick to the approach above (8), because all the obtained results can be easily modified for other cases.
Uniform stability can be used to determine the generalizing ability of estimators trained with stable algorithms [15].
Theorem 1 (Evgeniou, 2004): Let an estimator f_D be trained with a γ-stable algorithm and a loss l that is bounded as 0 ≤ l ≤ M. Then with a probability of at least 1 − δ, for each set D_m containing m i.i.d. samples:

R(f_D) \leq \hat{R}(f_D) + 2\gamma_m + (4m\gamma_m + M)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}    (9)

However, this approach does not consider the features of randomized algorithms for ensemble training. Paper [10] proposes a random uniform stability score β_m. Let r be a random variable that controls the training of the particular estimators in the ensemble. Then the random uniform stability is bounded as follows:

\forall D_m \sim P_{XY}, \ \forall (x, y) \in D_m: \quad \sup_{D_m, x} \left| E_r\big(l(f_{D_m(r)}(x), y)\big) - E_r\big(l(f_{D_m^{\setminus i}(r)}(x), y)\big) \right| \leq \beta_m    (10)

Note that, according to paper [10], if the loss function used to train the estimators is B-Lipschitz, then the random uniform stability is bounded as follows for an ensemble of size T:

\sup_{D, x} E_{r_1, ..., r_T}\left| \frac{B}{T}\sum_{t=1}^{T} \big(f_{D(r_t)}(x) - f_{D^{\setminus i}(r_t)}(x)\big) \right| = \sup_{D, x} E_{r_1, ..., r_T}\left| \frac{B}{T}\sum_{t=1}^{T} \gamma_{d(r_t)} \right| \geq \beta_m,    (11)

where r_1, ..., r_T are the random parameters of the estimators 1, ..., T; γ_{d(r)} is the uniform stability of the algorithm used to train an ensemble's estimator, and d(r) defines which data is used to train the estimator.
For the bagging algorithm, a random uniform stability was also proposed. Before describing it, it is necessary to set a few notations and preconditions (the sampling scheme of condition 1 is illustrated in the sketch after this list).
1) Let R = {0, 1}^m. The bagging uses i.i.d. random parameter vectors r = {r_1, r_2, ..., r_T}, r_t ∈ R, to define which objects from D_m should be used to train a particular estimator of the ensemble. Namely, for each element of the ensemble the bagging produces a random vector of length m, and each element of that vector is 1 if the corresponding object from D_m is used for training and 0 otherwise. We denote by D(r_t) the dataset generated for the estimator number t with the parameter vector r_t. Obviously, r depends on the bagging algorithm and does not depend on the dataset D_m.
2) The same parameter vector r_t ∈ r is used to train both estimators f_D(r_t) and f_{D^{\setminus i}}(r_t), where D \ i is the set D without object number i.
3) If there were several copies of one training object in the dataset, that would not affect the training result.

Theorem 2 (Eliseeff, 2005): Let us suppose all the conditions 1-3 are satisfied, and F_{D_m(r)} is an ensemble of T estimators built with the bagging algorithm on the i.i.d. dataset D_m. Let us suppose the training algorithms of all the estimators in the ensemble optimize B-Lipschitz loss functions and all those algorithms are γ_m-uniformly stable. Then the random uniform stability of the bagging algorithm is upper bounded as follows:

\beta_m \leq \frac{B}{T}\sum_{t=1}^{T} E_{r_t}\big(\gamma_{r_t} \mathbb{1}_{i \in r_t}\big) = B \sum_{k=1}^{m} \frac{k\gamma_k}{m} \Pr(d(r) = k),    (12)

where d(r) is the number of distinct samples in D(r), and Pr(d(r) = k) returns the probability that the sampled training dataset of an ensemble estimator has k distinct samples.
The concept of random uniform stability can be used to bound the generalization error of a bagging ensemble.
Theorem 3 (Eliseeff, 2005): Let us suppose all the conditions 1-3 are satisfied, F_{D(r)} is a bagging ensemble of T estimators, and the random uniform stability of the bagging is β_m. Let us suppose the bagging algorithm uses the i.i.d. dataset D_m of size m to perform training, and all training algorithms of the ensemble estimators optimize a B-Lipschitz loss function with range M. Then with probability greater than 1 − δ:

R(F_{D(r)}) \leq \hat{R}(F_{D(r)}) + 2\beta_m + \left(\frac{M + 4m\beta_m}{\sqrt{2m}} + \frac{2BM}{\sqrt{T}}\right)\sqrt{\log(2/\delta)}    (13)

Therefore, if the algorithms used to train the ensemble estimators are uniformly stable, so that γ_m = O(1/m), the bagging is not more stable than those algorithms. This shows why bagging is not helpful in training ensembles of well-regularized estimators like SVM classifiers. Unfortunately, the approach above cannot be directly applied to obtain regularization bounds for random forests or boosting on decision trees, at least for the chosen formalism for decision trees. Namely, if the decision stump chains are dependent, that means the random parameter vectors r = {r_1, r_2, ..., r_T} are dependent too. Besides, we lack an expression for the uniform stability of a binary decision tree.
In summary, the primary outcome of the results on generalization is that promising algorithms to train decision trees in randomized ensembles should be greedy to enhance the randomness of the tree structures, and they should also optimize the margin and the splitting criteria. We believe additional requirements can be inferred from stability-based generalization scores for random tree ensembles, if such are derived.
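To give a feel for the quantities in (12), the sketch below (an illustration under our own assumptions, e.g., the SVM-like stability profile γ_k = θ²C/(2k) used later in Section 5) combines an empirical distribution of d(r) with a stability profile to evaluate the bound on β_m:

import numpy as np

def bagging_stability_bound(m, B, gamma, n_draws=20000, seed=0):
    # beta_m <= B * sum_k (k * gamma(k) / m) * Pr(d(r) = k), see (12).
    rng = np.random.default_rng(seed)
    counts = np.zeros(m + 1)
    for _ in range(n_draws):
        counts[np.unique(rng.integers(0, m, size=m)).size] += 1
    pr_d = counts / n_draws
    return B * sum(k * gamma(k) / m * pr_d[k] for k in range(1, m + 1))

# Assumed stability profile of a kernel SVM stump: gamma_k = theta^2 * C / (2k).
theta2, C = 1.0, 1.0
bound = bagging_stability_bound(m=200, B=1.0, gamma=lambda k: theta2 * C / (2 * k))
print(bound)  # roughly B * theta^2 * C / (2 * m) in this setting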


B. OBLIQUE DECISION TREES AND RANDOM FORESTS
Decision trees [1] and random forests [3] have long been successful in machine learning, due in part to their computational efficiency and their applicability to big-data classification and regression problems. The standard approach to constructing them remains a greedy recursive algorithm. In this algorithm, at each node of the tree, a one-dimensional optimization is performed that divides the training data into two subsets according to a partitioning criterion, such as the Gini criterion in CART (Classification and Regression Tree) [6] or the information gain in C4.5 [12]. Most research on decision tree algorithms has been limited to decision trees with discrete features [12] or decision trees with one-dimensional splits for continuous features [1]. In a one-dimensional split, the value of some feature is compared with a constant, which is equivalent to splitting the training data set by a hyperplane parallel to one of the axes. Optimizing one-dimensional decision stumps is straightforward because it is possible to enumerate all possible thresholds for each feature and choose the best parameters according to a given splitting criterion.
Various studies show that considering interactions between features of the training data helps improve the accuracy of decision trees and their ensembles [13]. Training trees with oblique (linear) or more complex decision splits is a natural way to consider such interactions [14]. However, the indeterminacy of the optimized criterion set and of the loss function design makes training such trees non-trivial. Another problem associated with linear splits is instability (in the sense given in [15]) and excessive complexity of the resulting algorithms, which leads to the construction of unstable and overtrained classifiers. Therefore, a more promising direction is the creation of compositions of such classifiers, which make it possible to reduce the effects of overfitting and increase the stability of the resulting classifiers.

TABLE 2. Considered studies on oblique and kernel trees and forests.

Table 2 shows the primary approaches to training oblique and kernel trees and forests from 1998 to the present. The table has the following structure. Column ''Kernel'' shows whether the approaches support training of non-linear decision stumps. Column ''Margin opt.'' highlights the approaches that optimize a margin between data in sub-trees when training decision stumps. Column ''Impurity opt.'' indicates whether the approaches optimize any data impurity criterion like Gini or information gain. Column ''Arbitrary impurity'' clarifies whether that criterion can be easily changed by a user. Column ''Loss function'' shows the types of optimized loss at each decision stump; those types define whether the optimization is time- and hardware-demanding. Finally, column ''Greedy'' indicates whether the applied tree-growing procedure is greedy. In all the columns, ''-'' means that the option is not available, ''+'' shows the option is presumed, and ''±'' indicates an option is partially supported (for example, in Composite Forest [16] internal univariate stumps are built with impurity optimization, but terminal oblique stumps are built without it; therefore, impurity optimization is partially supported). Let us consider those studies in more detail.
The Support Vector Machine (SVM) is a well-known framework to train regularized linear and non-linear classifiers; therefore, it is a very straightforward idea to modify it to train decision splits. One of the earliest attempts to utilize SVM is presented in paper [24]. The paper proposes a method to train multivariate nonlinear or linear decision splits for trees with a fixed structure. The decision splits can be linear threshold units, polynomials, sigmoidal neural networks, or radial basis function networks. However, the fixed structure implies a high correlation between tree margins. Besides, the proposed error function is not convex, so existing high-performance optimization methods cannot be applied. Paper [19] proposes a variant of a random forest with oblique decision trees. Ridge regression or linear discriminant analysis was used to search for separating hyperplanes. This method, however, allows building trees only for binary classification problems. Tibshirani and colleagues imply that pure impurity criteria do not properly capture the spatial distribution of the training data. They use an optimal margin classifier at each split [11] to tackle the problem. They revealed that the proposed method has competitive accuracy and provides additional interpretability in its grouping of the classes. Following the same motivation, Manwani and Sastry [27] proposed a strategy to assess the hyperplanes in such a way that the geometric structure in the data is taken into account. At each stump of the decision tree, they find the clustering hyperplanes for both classes and use their angle bisectors as the split rule at that stump. Experimental results show that the strategy helps train small decision trees and improve performance. Papers [7] and [8] introduce a budget-aware classifier and a pruning strategy to minimize overfitting. The papers propose a sparse coding tree framework for multi-label annotation problems to deal with ultrahigh label dimensions. All the considered studies focused on optimizing margin or purity criteria; however, as we learned from the theoretical results, both those factors affect overfitting [4], [11]. Norouzi et al. [17] proposed CO2 trees (Continuously Optimized Oblique trees) with oblique decision stumps. This work defines decision stump training as a structured prediction problem with latent variables [18]. The researchers propose a convex-concave loss function (convex-concave, i.e., the difference of several convex functions) and also apply a regularization approach when training such trees. Note that the convex-concave optimization problem can be effectively solved by the gradient method proposed in [28]. Although that function is an upper bound of the empirical loss, the ''smoothness'' of that function and the tightness of the bound heavily depend on hyperparameters and the scale of the features, limiting the applicability of the method.
Paper [16] presents composite decision trees whose leaf estimators are chosen from a set of hypotheses consisting of subfamilies with different complexity. They also propose a data-dependent overfitting estimate for this family. That overfitting score is used to select classification algorithms for each leaf. It is noteworthy that the score depends on the Rademacher complexity of the subfamily of algorithms and the proportion of training set objects correctly classified in each leaf [29]. Random composite decision trees are also introduced, and overfitting estimates are obtained for them, which are also applied to random forests. They propose an algorithm for training random composite trees (Random Composite Forest, RCF) based on the generation of a set of random composite trees. However, non-terminal decision stumps of such trees still use axis-parallel splits.
Another promising direction of research is the end-to-end training of oblique decision trees and forests. Paper [23] presents an expectation-maximization training approach in which the trees are fully probabilistic at train time but deterministic at test time. The experiments on image datasets show that the approach can learn more complex splits than common oblique ones and facilitates interpretability through spatial regularization. However, the expectation-maximization algorithm converges slowly on large datasets with many features. Similarly, study [20] proposes to use the backpropagation method, which is usually used to train multilayer neural networks [30], to train oblique decision trees with a fixed structure. However, fixing the tree structure can lead to the training of overly complex classifiers with a large number of parameters and, as a result, worse generalization.
Study [22] notes that algorithms to build oblique decision trees require more computation than axis-parallel trees. Besides, the training results depend on the initialization of the splitting hyperplanes. The researchers proposed a Weighted Oblique Decision Tree (WODT) based on continuous optimization with random initialization. They assign weights to the training samples of child nodes at all internal nodes and then train a decision stump by optimizing a weighted information entropy loss. It is worth noting that the WODT loss is still non-convex; therefore, the global optimum may not be found with pure gradient methods. At the same time, the experimental results show competitive accuracy scores on many datasets.

Several papers are focused on techniques to improve already trained decision trees or forests. Paper [25] proposes an algorithm that improves a given tree and produces a new tree with the same or smaller structure but new parameter values that lower or leave unchanged the misclassification error. A hybrid SVM-based decision tree is proposed in [26]. The researchers focused on reducing the number of test samples that need the SVM's help in getting classified. The central idea is to approximate the decision boundary of the SVM using decision trees. The resulting tree is a hybrid tree in the sense that it has both univariate and SVM nodes. The hybrid tree takes the SVM's help only in classifying crucial samples lying near the decision boundary; the remaining, less crucial samples are classified by univariate nodes. Paper [21] focuses on dual incremental learning strategies for oblique random forests when new inputs from existing classes or unseen classes come. They propose a batch multiclass algorithm that uses a broad learning system and a multi-to-binary method to obtain an optimal oblique hyperplane in a higher-dimensional space and then separate the samples into two supervised clusters at each node. The incremental strategy consists in analyzing the parameters of all stumps on the classification route for the increment of input samples and the increment of input classes.
We propose a method for training and regularizing a random forest of decision trees with kernel splits (with linear, polynomial, Gaussian, and other kernels), i.e., allowing one to create estimators with arbitrarily complex generalizing ability. Similarly to CO2 Forest [17], we consider decision stump training as a structured learning problem; however, we define it in a simpler, explicit form. Similarly to WODT [22], we weigh training samples and minimize an impurity criterion, although we optimize this criterion together with a margin between data in sub-trees, which should lead to better generalization. In addition, the splitting criterion is planted into the optimization problem explicitly; therefore, a user can define their own criteria without re-creating the loss function from scratch. Still, in our case, training of decision stumps is reduced to solving the SVM with slack re-scaling problem, which is a convex optimization with inequality constraints, for which a large number of computationally efficient methods have been previously proposed. We initially refused to find a global optimum to keep the training speed high. Instead, we implemented a greedy stump optimization, presuming that the tree training is greedy itself.

C. REGULARIZATION OF BAGGING AND RANDOM FORESTS
According to the strength/correlation concept [3], the most obvious way to achieve better generalization of a tree ensemble is to add more randomness and reduce the correlation. One can randomly choose the splitting criterion [31] or randomly define how many features should be tested to build a stump [32]. The paper [33] proposes to discard the class empirical probabilities stored in all tree leaves of a pre-trained random forest and to relearn them through a ''global'' refinement: all tree leaves are simultaneously optimized by explicitly minimizing a global loss function defined on all training samples, according to the averaging rule of random forests. We are going to use that approach as the basis for our experiments on ensemble regularization, so we will consider it more formally. Let Φ: R^f → {0, 1}^{TΓ} be a function that for any sample x returns the binary vector whose elements are 1 if x goes to the corresponding decision stump chain and 0 otherwise:

\Phi(x) = (\phi_1(x), \phi_2(x), ..., \phi_{T\Gamma}(x))    (14)

Matrix W contains the corresponding class weights for all the decision chains in the ensemble:

W = (w_1, w_2, ..., w_{T\Gamma})    (15)

Ren and colleagues define the refined tree classifier as the following linear function [33]:

y = W^* \Phi(x),    (16)

where W^* can be found with the following SVM-like optimization:

W^* = \arg\min_W \frac{1}{2}\|W\|_F^2 + \frac{C}{m}\sum_{i=1}^{m} l(y_i, \hat{y}_i), \quad \text{s.t. } \hat{y}_i = W\Phi(x_i), \ \forall i \in [1, m]    (17)

The proposed global refinement can be efficiently solved with a linear SVM or a ridge regression method. As a result, the complementary information between trees is exploited, and the fitting power is significantly improved. However, the global optimization in training might cause over-fitting for many tree leaves. Paper [33] proposes a global pruning method that alternates between refining all tree leaves and merging the insignificant leaves to reduce the risk of over-fitting and the model size. The method is to join two adjacent leaves if the norm of their leaf vectors is close to zero.

IV. TRAINING OF THE KERNEL TREES
We apply a commonly used greedy procedure to build the trees (see Algorithm 1). At each step, we train a decision stump, which splits the training data; then we recursively repeat this for the ''left'' and ''right'' partitions of the data until the requested tree height is reached.

Algorithm 1 Kernel Tree
Input: Dataset D_m; regularization term C; J, the threshold that determines whether the class-subtree distribution is found exactly or greedily; N, the size of the list of top splittings from which the class-subtree distribution is chosen randomly.
1: call BuildTree(D_m)
2: BuildTree(D):
3: if D has only one class then return
4: else
5:   if |U| < J then
6:     s := all_distributions(U, H = {+1, −1})
7:     H_best := sort(Impurity(s))[rnd(1..N)]    ▷ U → H
8:   else
9:     H_best := greedy_find_best(D)    ▷ U → H
10:  end if
11:  L_1^*, ..., L_m^* := L(h_i, −h_i), h_i ∈ H_best
12:  w^*, ε_1^*, ..., ε_m^* := optimize_node(C, X, L_1^*, ..., L_m^*)    ▷ Solve an SVM with slack re-scaling: ε_1/L_1^*, ..., ε_m/L_m^*
13:  D_l := D[classify(D, w^*) ≥ 0]
14:  D_r := D[classify(D, w^*) < 0]
15:  BuildTree(D_l)
16:  BuildTree(D_r)
17: end if

First (steps 5-10), we set the target distribution of the classes in the sub-trees. We consider this distribution as a binary classification problem where the labels H = {−1, +1} represent the left and right sub-trees. That does not contradict the definition from the ''Notation and definitions'' section (II), since there is a one-to-one mapping between the sets {0, 1} and {−1, +1}. Usually, fitting of decision stumps in decision trees is done by optimization of some purity criterion like Gini impurity or information gain. In our case, the target distribution c_s: U → H should minimize a decision stump impurity criterion (Gini, information gain, etc.). Some criteria split classes, but we overlook that for the sake of speed performance:

s_{best} = \arg\min_s P_{L_s} g(p_{L_s}) + P_{R_s} g(p_{R_s}),    (18)

where for the Gini impurity we have

g(p) = \sum_{i=1}^{N} (1 - p_i) p_i,    (19)

and for the information gain

g(p) = -\sum_{i=1}^{N} p_i \log(p_i).    (20)

If the number of classes |U| is relatively small (we use a threshold value J to reveal that fact), the algorithm enumerates all possible distributions and assesses the value of the related impurity criterion (steps 6-7). Then the obtained distribution list is sorted according to the impurity criterion, and the target distribution c_s is selected randomly from the top of the sorted list (we use a hyperparameter N to define the size of the top). If the number of classes |U| is greater than J, the algorithm applies a greedy procedure to reveal the target distribution (step 9). The procedure starts by generating a random distribution of classes to subtrees. Then the algorithm iteratively changes the generated distribution and keeps the swaps that improve the impurity criterion. The algorithm repeats this procedure several times and outputs the best distribution c_s. As a result, we add some randomization to the decision stump training, which should reduce the correlation between trees. Then the algorithm uses c_s to define a subtree for each training sample from the dataset D_m: H_best = c_s(Y_m).
The next steps of the algorithm (steps 11-12) are related to the training of a decision stump classifier; therefore, it should optimize the margin between sub-tree data and the impurity criterion simultaneously. In univariate trees, the optimization of the impurity criterion can be done quite effectively by selecting some features and testing thresholds by brute force. However, such an approach is not applicable in the case of oblique or more complex splits because it leads to deficient training speed. Suppose one wants to keep the original idea of purity maximization but make the fitting more effective. In this case, one needs to construct some continuous objective function that reflects the dependence between the parameters of the decision stump and the purity of the data. Thus, the problem could be solved via existing gradient optimization techniques.
Our method uses the general idea that a misclassification with a high purity decrease should be penalized more heavily than one that changes the purity only slightly. One of the implementations of this idea for SVM training is called ''slack re-scaling'' [18]. It was originally introduced to classify complex structures like syntax trees. The benefit of that implementation is that it is invariant with regard to the scale of the features or the separating hyperplane. In contrast to other methods, we explicitly define the distribution of samples to subtrees. The training of the decision stump can be defined as follows:

w^*, \epsilon^* = \arg\min_{w, \epsilon} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} \epsilon_i,    (21)

\text{s.t. } \forall i: \ w^T x_i h_i \geq 1 - \frac{\epsilon_i}{L(h_i, -h_i)},    (22)

where w^* are the parameters of the splitting hyperplane, ε^* are the slack variables, h_i ∈ H_best is the predicted subtree for the data sample with index i, and L(h_i, −h_i) reflects the growth of the impurity criterion if the sample with index i is classified to the wrong sub-tree.
If one applies the KKT conditions to that problem, one gets the dual problem:

\tilde{a} = \arg\max_a -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} a_i a_j K(x_i, x_j) + \sum_{i=1}^{m} a_i,    (23)

\text{s.t. } \sum_{i=1}^{m} \frac{a_i}{L(h_i, -h_i)} \leq \frac{C}{m},    (24)

where a_i is the weight of the training sample i (non-zero for the support vectors), L(h_i, −h_i) reflects the growth of the impurity criterion in case of misclassification, and K(x_i, x_j): R^f × R^f → R is a positively defined kernel function.
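Because scaling the slack of sample i by L(h_i, −h_i) in (21)-(22) acts like giving that sample a proportional error weight in a standard soft-margin SVM, a decision stump of this kind can be prototyped with an off-the-shelf kernel SVM and per-sample weights. The sketch below is our own approximation of steps 11-14, not the authors' implementation; the impurity-growth values L_i are assumed to be precomputed.

import numpy as np
from sklearn.svm import SVC

def train_kernel_stump(X, h, L, C=1.0, kernel="rbf"):
    # X : training samples of the current node
    # h : target subtree labels in {-1, +1} taken from the chosen distribution c_s
    # L : impurity growth L(h_i, -h_i) caused by sending sample i to the wrong subtree
    # Slack re-scaling in (21)-(22) is emulated here by per-sample penalties.
    stump = SVC(C=C, kernel=kernel)
    stump.fit(X, h, sample_weight=np.asarray(L, dtype=float))
    return stump

def split_node(stump, X):
    # Steps 13-14: boolean mask of the samples routed to the "left" (h = +1) subtree.
    return stump.decision_function(X) >= 0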


In contrast to the classification of structures [18], this problem can be efficiently solved in an explicit form since there are only two classes. The regularization hyperparameter C should be found empirically for each dataset. In both the primal and the dual definitions, parameter C helps find a trade-off between the optimization of the impurity and the margin maximization.
Finally (steps 13-14), the algorithm classifies all the training data to subtrees with the trained classifier and recursively builds the decision stumps for the obtained sub-tree data D_l and D_r.
In the program implementation of the algorithm we also added the following changes:
1) We re-scale the values of L(h_i, −h_i) because they can be really small in large datasets.
2) We use the ''balanced'' heuristic from [34] because the optimal splits of the training data are often imbalanced, especially when we use Gini impurity.

FIGURE 2. Purity criteria and R̂(w). R̂(w) is a tangent to the criterion function at the position related to the best split hyperplane w_best. w^* is the split hyperplane that can actually be reached.

FIGURE 3. Difference in a decision stump training result with (right) and without (left) slack re-scaling. Dataset: Titanic; criterion optimized: Gini impurity.

Now, let us clarify how solving problem (21) or (23) optimizes the impurity criterion. Let w_best be the parameters (possibly non-existent) of the splitting hyperplane that achieves s_best (see (18)). Let the minimum value of the impurity criterion correspond to the probabilities of assigning a data sample to the sub-trees, P_L and P_R, and to the sub-tree class probabilities of that sample, p_L and p_R. Suppose the decision stump classifies the sample to the sub-tree h^* instead of h. Then those probabilities change (in comparison to the ones from the best splitting s_best) to P_L^*, P_R^*, p_L^*, p_R^*, and the increase in the criterion is:

L(h^*, h) = P_L^* g(p_L^*) + P_R^* g(p_R^*) - P_L g(p_L) - P_R g(p_R)    (25)

Denote \hat{R}(w) = \frac{1}{m}\sum_{i=1}^{m} L(h_i, x_i^T w) as the empirical risk on the training set. Papers [18], [35] provide the following proof that this approach optimizes the empirical risk R̂(w).
Theorem 4 (Tsochantaridis, 2005): If ε^*(w) are the optimal values of ε in terms of (21) for a given w^*, then the empirical risk is upper bounded: \hat{R}(w) \leq \frac{1}{m}\sum_{i=1}^{m} \epsilon_i^*.
Therefore, (21) or (23) optimizes R̂(w). It is worth noting that the empirical risk R̂(w) differs from the original impurity criteria. We can reformulate the empirical risk as \hat{R}(p_{err}) = \sum_{u \in U} p_{err}(u) L(c_s(u), -c_s(u)), where p_{err}(u) returns the proportion of incorrectly classified samples of a particular class (the ones which do not follow the side distribution s_best). All the criteria we consider (Gini impurity, information gain, etc.) are convex functions of class probabilities. One can unambiguously transform the loss function of the class samples L(c_s(u), −c_s(u)) into the loss function of the changes of class probabilities in comparison to the best splitting H_best (see expression (25)). Therefore, the loss L(h_i, x_i^T w) approximately returns a partial derivative of the splitting criterion at the point H_best with respect to the variable h_i, i.e., R̂ is a tangent of the splitting criterion function at the point H_best with the splitting hyperplane parameters w_best. When we optimize (21), we go along the tangent of the criterion function towards w_best as closely as the distribution of objects and classes, the chosen kernel function, etc., allow (see Fig. 2). The presented method guarantees that, for a given distribution over subtrees, the separating hyperplane will be constructed in such a way as to minimize R̂(w) (see Fig. 3). However, it does not guarantee that for other distributions U → H the value of R̂(w) would not become smaller because the obtained w^* would become closer to its w_best. Theoretically, one needs to consider all the pairs U → H to optimize the criterion. However, realizing that the tree construction algorithm itself is greedy, we will not require the global minimum to be reached in this case either. We propose to process the ''best split'' only, or to randomly choose several good splits, solve (21) for all of them, and pick the best one. The latter provides additional randomness to the structure of the trees. This approach has been tested on several datasets (MNIST, CIFAR, USPS). No difference in the accuracy of the obtained ensembles was found compared to testing all the possible U → H distributions.
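The candidate-distribution step (steps 6-7 of Algorithm 1) can be illustrated with the following Python sketch (our illustration; function and variable names are made up): it enumerates all class-to-subtree assignments, scores them with the weighted Gini criterion of (18)-(19), and picks one of the N best at random.

import itertools
import random
import numpy as np

def gini(p):
    # g(p) from (19).
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def weighted_impurity(y, assignment):
    # P_L * g(p_L) + P_R * g(p_R) from (18) for one class-to-subtree assignment.
    y = np.asarray(y)
    sides = np.array([assignment[label] for label in y])
    score, m = 0.0, len(y)
    for side in (-1, +1):
        y_side = y[sides == side]
        if len(y_side) == 0:
            continue
        _, counts = np.unique(y_side, return_counts=True)
        score += (len(y_side) / m) * gini(counts / len(y_side))
    return score

def choose_distribution(y, classes, top_n=3, rng=random.Random(0)):
    # Enumerate all non-trivial assignments U -> {-1, +1}, keep the top_n best,
    # and pick one at random to decorrelate the trees (steps 6-7 of Algorithm 1).
    candidates = []
    for bits in itertools.product((-1, +1), repeat=len(classes)):
        if len(set(bits)) < 2:
            continue   # skip assignments that leave one subtree empty
        assignment = dict(zip(classes, bits))
        candidates.append((weighted_impurity(y, assignment), assignment))
    candidates.sort(key=lambda t: t[0])
    return rng.choice(candidates[:top_n])[1]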


V. THEORETICAL BOUNDS ON THE KERNEL TREES
A. UNIFORM STABILITY OF BAGGING ON DECISION TREES
Suppose one wants to apply random uniform stability (Theorem 3) to estimate the generalization of a tree bagging ensemble. In that case, one first needs to assess the uniform stability of a binary decision tree, and we have to start from the stability of decision stump classifier chains.
Lemma 1: Let Pr(x) = \prod_{q \in G} q(x) be a product of all the estimators q with range [0..1] from a set G of size n. Those estimators are trained on an i.i.d. dataset D_m of size m, and the training algorithms of those estimators optimize some B-Lipschitz loss functions. Assuming those training algorithms are γ_q-uniformly stable, the uniform stability of the algorithm that trains the estimator product Pr(x) is upper bounded as follows:

\gamma_{Pr} \leq \left(\frac{1}{n}\sum_{i=1}^{n-2}\left(\frac{B}{2}\right)^i + 2\left(\frac{B}{2}\right)^{n-1}\right)\sum_{q \in G}\gamma_q    (26)

Therefore, ∀q ∈ G, γ_Pr = O(γ_q).
Due to the properties of decision stump classifier chains, each next stump j in the chain is trained on a subset of the training data of the current stump j − 1, i.e., D_j ⊂ D_{j−1}. Let the probability that elements from the set D_{j−1} are present in the set D_j be defined by P_D(j): N → [0..1] (Fig. 4). Suppose the uniform stability of the algorithms that train the decision stumps in the stump chain is upper bounded, so that γ = O(1/m). Then the uniform stability of training a decision classifier chain of length n is bounded as follows:

\gamma_{Pr} \leq \left(\frac{1}{n}\sum_{i=1}^{n-2}\left(\frac{B}{2}\right)^i + 2\left(\frac{B}{2}\right)^{n-1}\right)\sum_{i=1}^{n}\frac{\gamma}{\prod_{j=1}^{i} P_D(j)} = \phi(B, n)\sum_{i=1}^{n}\frac{\gamma}{\prod_{j=1}^{i} P_D(j)}    (27)

The latter means that decision classifier stump chains themselves are uniformly stable.

FIGURE 4. Decision stump classifier chain with length 3. p11, p21, p31 are the probabilities of the corresponding transitions between decision stumps.

For the kernel support vector machine, the uniform stability depends on the regularization parameter C: γ_SVM = θ²C/(2m), where θ² ≥ K(x, x) reflects the range of the kernel function. Therefore, the uniform stability of an algorithm that trains a kernel decision stump chain is upper bounded as follows:

\gamma_{SVM} = \phi(B, n)\sum_{i=1}^{n}\frac{\theta^2 C}{2m\prod_{j=1}^{i} P_D(j)}    (28)

Unfortunately, when assessing randomized uniform stability, the tree ensembles cannot be considered as sets of independent decision stump chains because the random parameters r_i of the chains in each tree are dependent. That means that Theorem 2 cannot be applied to obtain the desired score. We define a weakened set of conditions:
1) Let R = {0, 1}^m. The bagging on trees uses random parameter vectors r = {r_{11}, r_{12}, ..., r_{Γ_T T}}, r_{ij} ∈ R, to generate datasets for the particular decision stump chains. Each random parameter vector r_{ij} has the length of the training dataset, m, and each element of that vector is 1 if the corresponding sample from D_m is used to train the stump classifier and 0 otherwise. Let D(r_{ij}) be a dataset generated from D_m with the parameter vector r_{ij} for the decision stump chain with index j of the tree with index i.
2) D_m and r are independent from each other.
3) One uses the same parameter r_{ij} ∈ r to train both classifiers f_D and f_{D \setminus t}, where D \ t is D without sample number t.
4) Having multiple copies of the same example in an estimator's training dataset does not affect the training outcome.
5) One uses a B-Lipschitz loss with range [0..M] when training the decision stumps.
We propose the following expression for the random uniform stability of bagging on decision trees.
Theorem 5: Suppose one uses bagging to train an ensemble of T decision trees on a dataset D_m of size m, and each tree i contains Γ_i decision stump chains of lengths n_j. One also uses γ-uniformly stable algorithms that optimize a B-Lipschitz loss function to train the decision stumps of the trees. Let the weakened set of conditions be satisfied. Then the randomized uniform stability of the bagging is upper bounded as follows:

\beta_m \leq \frac{B}{T}\sum_{k=1}^{m}\sum_{i=1}^{T}\sum_{j=1}^{\Gamma_i}\frac{k}{m}\,\gamma_{Pr}(i, j, k)\,\Pr_{ij}(d(r_{ij}) = k),    (29)

where \gamma_{Pr}(i, j, k) = \phi(B, n_j)\sum_{l=1}^{n_j}\frac{\gamma_k}{\prod_{t=1}^{l} P_{D(r_{ij})}(t)}, \ \phi(B, n) = \frac{1}{n}\sum_{i=1}^{n-2}\left(\frac{B}{2}\right)^i + 2\left(\frac{B}{2}\right)^{n-1}, the function P_{D(r_{ij})} defines the ratio of the elements from the training set D(r_{ij-1}) of stump j − 1 that are also present in the training set D(r_{ij}) of the next decision stump j, d(r_{ij}) is the number of distinct samples in D(r_{ij}), and Pr(d(r_{ij}) = k) returns the probability that the sampled training datasets of an ensemble estimator have k distinct samples.
If γ_k = O(1/k), then the sum for β_m reaches its minimum if the probability distribution defined for all decision stumps by the expression \prod_{t=1}^{l} P_{D(r_{ij})}(t) is uniform. It is worth noting that Theorem 5 does not demand that the tree leaf distribution be uniform; it just benefits trees with balanced decision stumps. Conversely, in the case of uniform leaves, the number of chains reaches its maximum and increases the stability bound.

FIGURE 5. Probability histogram of \prod_{t=1}^{l} P_{D(r_{ij})}(t) for a single decision tree and bagging ensembles with different numbers of trees.

Study [10] shows that as the number of trees grows, T → ∞, k converges to its expectation k = (1 − 1/e)m ≈ 0.632m. Obviously, not all the random vectors from r are dependent on each other. Suppose one has an infinite bagging tree ensemble that has Γ sets of decision stump chains with i.i.d. parameters r_{ij}. Then the following holds:

\beta_m \leq 0.632\,B\sum_{i=1}^{\Gamma} E_{r_i}\big(\gamma_{Pr}(i, 0.632m)\big)    (30)

That is, for each ''group'' of independent decision chains in the ensemble, the average probability of selection will tend to some expected value. As a result, probability ''outliers'' in individual trees are smoothed out (see Fig. 5).
In Theorem 5 the value \prod_{t=1}^{l} P_{D(r_{ij})}(t) is bounded within [1/m .. (m−1)/m]. It makes no sense to consider the case when the probability of transition along the chain is 0, since such a chain can be reduced to a shorter one. Otherwise, after going through the chain, at least one training sample remains, which gives the probability 1/m. By similar reasoning, this value will not exceed (m−1)/m. Then the change in the uniform stability of the decision stump chain training algorithm is finite and does not exceed the value |γ_{k/m} − γ_{k(m−1)/m}| ≤ c_k. Therefore, the random uniform stability of bagging on decision trees converges to its expected value (30) at an exponential rate:

P(|\beta_m - \mu| \geq t) \leq 2\exp\left(-\frac{T\Gamma t^2}{\left(\gamma_{k/m} - \gamma_{k(m-1)/m}\right)^2}\right) = O\left(\frac{1}{\exp(T)}\right)    (31)

This result is consistent with the repeatedly empirically confirmed thesis about the pointlessness of building bagging ensembles of 1000 or more trees.

B. GENERALIZATION OF BAGGING ON DECISION TREES
Theorem 3 can only be applied to assess the generalization error of ensembles built by randomized algorithms in which the training parameters are independent for each model in the ensemble. We represent decision tree ensembles as sets of decision stump chains (7); therefore, that condition is violated. We introduce a modified version of this approach in which the independence of the parameters for ensemble training is not required.
Theorem 6: Suppose the weakened set of conditions is satisfied, and F_{D(r)}(x, y) = \frac{1}{T}\sum_{t=1}^{T} f_{D(r_t)}(x, y) is a decision tree ensemble trained with a bagging algorithm on an i.i.d. set D_m of size m. Each tree has Γ decision stump chains, and the randomized uniform stability of the bagging algorithm does not exceed β_m. Then with probability at least 1 − δ the following holds:

R(F_{D(r)}) \leq \hat{R}(F_{D(r)}) + 2\beta_m + \frac{1}{2}B\ln\left(\frac{1}{\delta}\right)\frac{2M}{3} + \sqrt{\left(\frac{1}{2}\ln\left(\frac{1}{\delta}\right)\frac{2M}{3}\right)^2 + 8\ln\left(\frac{2}{\delta}\right)\left(B^2 m\,\beta_m^2 + \frac{(2M)^2\Gamma^3}{T}\right)}    (32)

Here B^2 m \beta_m^2 refers to a training-data-driven variance, and \frac{(2M)^2\Gamma^3}{T} is an upper bound of the variance related to the random sampling parameters r. The results are consistent with the ''bias/variance'' study from [36], although we also show the effect of the tree structures on stability and variance.
On the one hand, the obtained expression for the generalization error contains all the hyperparameters of the bagging algorithm, the trees, and the decision stumps. On the other hand, it explicitly integrates the ''bias/variance'' part [3] and the ''nonlinear perturbations'' part of the random uniform stability [9]. Although direct optimization of the expression from Theorem 6 does not make much sense, it provides some clues that can be used to construct an approach to regularizing tree ensembles. Namely, we can reduce the random uniform stability and optimize tree structures, since stability affects the variance. The latter can be done via pruning of extra decision stump chains that have low probability and do not affect R̂ much.

VI. REGULARIZATION OF BAGGING ON DECISION TREES
The study [33] provides a general optimization-based refinement and pruning method that helps minimize the empirical risk better while not worsening the generalization. The pruning helps reduce the number of tree leaves, which results in better stability of the tree ensemble (Theorem 5) without a significant increase in the empirical risk. However, such an approach makes the ensemble estimators no longer independent. We believe some improvement can be achieved here if one finds a trade-off between the global optimization of leaf vectors and the dependency of the trees. We propose to modify the feature generation process of the refinement as follows:

\Phi_r(x) = (\phi_1(x)r_1, \phi_2(x)r_2, ..., \phi_{T\Gamma}(x)r_{T\Gamma}),    (33)

where r_1, r_2, ..., r_{TΓ} are i.i.d. random binary variables taking values in {0, 1}.
77972 VOLUME 10, 2022


D. A. Devyatkin, O. G. Grigoriev: Random Kernel Forests

When one generates the features for the refinement with our approach, they can control the ratio of the samples that will be excluded from the dataset for particular decision stump chains. The rest of the refinement and pruning is the same as in [33]. We believe this would help the covariance not increase so much during the refinement (see Theorem 6). One can argue that SVM itself is a well-regularized algorithm that can hardly be disturbed by minor changes of the features if the regularization term C is correctly defined. However, our experiments show a slight but steady improvement in classification accuracy when we add a portion of noise (10-30%) as described above (33). We have also implemented a baseline tree pruning approach: in that approach, we perform refinement and remove decision stump chains with low weights (i.e., a small l2-norm of the leaf vectors) instead of merging them.
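A minimal sketch of this masked refinement step is given below. It assumes the per-chain leaf-indicator features φ_t(x) of a trained forest have already been stacked into a matrix; the names refine_with_noise, leaf_indicators, drop_ratio, and prune_ratio are illustrative and not part of any released implementation, and LinearSVC merely stands in for whichever linear solver performs the global refinement.

import numpy as np
from sklearn.svm import LinearSVC

def refine_with_noise(leaf_indicators, y, drop_ratio=0.2, prune_ratio=0.5, C=1.0, seed=0):
    # Global refinement of leaf vectors with random masking of decision stump
    # chains (Eq. 33), followed by the norm-based pruning baseline described above.
    rng = np.random.default_rng(seed)
    n_samples, n_chains = leaf_indicators.shape
    # i.i.d. binary variables r_1, ..., r_T': a chain is dropped for a given
    # sample with probability drop_ratio (the 10-30% "noise" ratio)
    mask = rng.random((n_samples, n_chains)) >= drop_ratio
    masked = leaf_indicators * mask
    clf = LinearSVC(C=C).fit(masked, y)            # re-fit the leaf vectors globally
    # baseline pruning: discard chains whose refined leaf vectors have a small l2-norm
    norms = np.linalg.norm(np.atleast_2d(clf.coef_), axis=0)
    keep = norms >= np.quantile(norms, prune_ratio)
    return clf, keep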
VII. DATASETS
We conduct experiments on five UCI multi-class datasets, Kaggle's BNP Paribas (Banque Nationale de Paris and Paribas) credit scoring dataset, the CIFAR-10 image dataset, and the ''Youtube channels'' psycholinguistic dataset. We used SatImage, USPS, Letter, and MNIST [37] from the UCI. Table 3 presents their primary parameters. Most datasets are devoted to image recognition problems. For example, the MNIST and USPS datasets contain handwritten images of digits, while the Letter dataset contains Latin letters. The CIFAR-10 dataset is also related to image recognition [38]. It contains 32 by 32 colored images of 10 classes (airplane, horse, bird, etc.) with eight gray levels. We apply a simple preprocessing technique to all the image recognition datasets. Namely, we perform feature-level normalization of the data with the ''MinMaxScaler'' and ''Normalize'' tools from Scikit-Learn [39]. No other complex processing is used.

TABLE 3. The properties of the standard machine learning datasets used in the experiments.
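This preprocessing can be reproduced with a two-step Scikit-Learn pipeline; the sketch below assumes that the ''Normalize'' tool mentioned above corresponds to the Normalizer transformer, and the data matrices are random placeholders rather than the actual datasets.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer

X_train = np.random.rand(100, 64)   # placeholder for the real training images
X_test = np.random.rand(20, 64)     # placeholder for the real test images

# feature-level min-max scaling followed by sample-wise normalization;
# the pipeline is fit on the training split only and then applied to the test split
preprocess = make_pipeline(MinMaxScaler(), Normalizer())
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)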
Kaggle's BNP Paribas dataset is devoted to predicting the category of a claim based on features available early in the process. The dataset contains categorical and numeric features available when BNP Paribas Cardiff received the claims. All string features are categorical. We utilize one-hot encoding to deal with the categorical features. Thus, we obtain data with pretty high-dimensional and sparse features. It is a tool to test the ensemble methods on such data types. As in the previous case, we applied Scikit-Learn to perform simple normalization.
''Youtube channels'' [40] is an example of a mid-sized dataset with i.i.d. samples. It does not contain complex structural dependencies between features. It provides some real-valued markers measured on discussions from different Russian Youtube channels that belong to three categories: political-governmental, political-oppositional, and non-political. A period of 1 year, between April 30, 2020, and April 30, 2021, was chosen as the targeted time span for data download. In total, comments were processed for 4807 videos: 2629 videos with political discussions (5 million messages) and 2178 videos for the non-political subcorpus (1.2 million messages). To perform classification, we used a 70%-30% train-test split with respect to the governmental and oppositional classes to keep the proportion balance for 3-class classification.
The dataset contains features that were used to analyze the reaction of the population to the COVID-19 lockdown in the Pikabu social network [41]. These features can be divided into several categories.
1) Psycholinguistic markers: the set of linguistic properties of texts, including the proportions of POS tags used in the text, psycholinguistic coefficients, and stylistic features. For example, these markers include:
• the ratio of the number of verbs to the number of adjectives (Trager coefficient);
• the ratio of the number of infinitives to the total number of verbs;
• the frequency of use of first-person plural pronouns;
• the frequency of first-person verbs in the past tense.
2) Topical groups of words: these features were retrieved in the form of the occurrence frequency of terms from various dictionaries: topical (e.g., healthcare terms, environmental terms, political terms) and sentiment/mood-based (e.g., motivation vocabulary, anxiety vocabulary, hostile vocabulary).
3) Basic emotions dictionary (BE): the set of dictionaries consists of terms related to the expression of basic human emotions in the text: disgust, shame, anger, fear, sadness, happiness, and wonder.
4) Sentiment: a single-item feature computed with the help of the Linis-Crowd sentiment dictionary. For each video, all comments and replies were concatenated into a single text. The combined texts were processed to retrieve text features.

VIII. EXPERIMENTAL RESULTS
Two versions of the kernel forests that utilize the proposed algorithm have been implemented for the tests. The first one was used primarily to perform experiments on small and medium-sized datasets like USPS or Youtube channels, or on the large datasets but with oblique trees only. That version is based on the Scikit-learn implementation of the Liblinear SVM solver. The second version is devoted to processing large datasets with non-linear kernels, and it uses the ThunderSVM library, which provides GPU acceleration.


It is also worth noting that we added simple modifications to both solvers, because they do not consider sample weights (the version of Scikit-learn we used has a parameter for the sample weights but does not actually handle it).
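Recent Scikit-learn releases do forward per-sample weights to the liblinear backend, so a weighted oblique stump can be sketched as follows; fit_weighted_stump and subtree_labels are illustrative names, and the weight vector is assumed to come from the re-weighting step of the stump training procedure rather than from any released code.

import numpy as np
from sklearn.svm import LinearSVC

def fit_weighted_stump(X, subtree_labels, sample_weight, C=1000.0):
    # Oblique decision stump: a linear SVM that separates the two groups of
    # classes assigned to the left and right subtrees; per-sample weights
    # encode the re-scaled losses of the current stump.
    stump = LinearSVC(C=C)
    stump.fit(X, subtree_labels, sample_weight=np.asarray(sample_weight, dtype=float))
    return stump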
We applied a commonly recognized cross-validation technique to estimate the ensemble hyperparameters: the number of trees T = {30, 100, 300}, the stump regularization C = {100, 1000, 3000, 5000}, the maximum tree depth n = {3, 4, 5, 6, 7}, the proportion of features to be considered at each stump f = {0.08, 0.1, 0.2, 0.3, 0.4, 0.5}, the noise (up to 0.3) and pruning (up to 0.9) ratios, and the kernel parameters (gamma = {10, 100} for the Gaussian kernel, degree = 3 for the Polynomial kernel). Random uniform stability β_m converges exponentially to its expected value with growing T. Consequently, setting T greater than several dozen does not really improve β_m (31). Similarly, setting T greater than several hundred does not much improve the upper bound of the variance related to the random sampling parameters r (32). Therefore, we conclude that the tree number T should be between several dozen and several hundred, so as not to increase the training time much while still keeping the variance of errors low. We trained ensembles with a small number of trees (30 trees) for large datasets because of hardware limitations, and ensembles with up to 300 trees for small datasets. Parameters C and n are used in the scores (28) and (29); they have a linear effect on the random uniform stability of ensembles. However, they also strongly affect the empirical risk R̂, so we had to utilize an empirical approach to define the grid search options for them. Eventually, the regularization parameter C is pretty big for all of the datasets, and it varies from 100 to 5000; otherwise, the slack re-scaling approach does not have the necessary effect.
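Written out explicitly, the search amounts to an exhaustive grid with cross-validation; the sketch below assumes a KernelForestClassifier wrapper that follows the Scikit-learn estimator interface and accepts the listed hyperparameters under illustrative keyword names.

from itertools import product
from sklearn.model_selection import cross_val_score

# KernelForestClassifier, X_train, and y_train are assumed to be defined elsewhere
param_grid = {
    "n_trees": [30, 100, 300],
    "C": [100, 1000, 3000, 5000],            # stump regularization
    "max_depth": [3, 4, 5, 6, 7],
    "feature_ratio": [0.08, 0.1, 0.2, 0.3, 0.4, 0.5],
    "noise_ratio": [0.0, 0.1, 0.3],          # up to 0.3
    "prune_ratio": [0.0, 0.5, 0.9],          # up to 0.9
    "gamma": [10, 100],                      # Gaussian kernel only
}

best_score, best_params = -1.0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    model = KernelForestClassifier(kernel="rbf", degree=3, **params)
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params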
We use accuracy, precision, and recall scores to estimate the performance of the ensembles, except for the BNP Paribas task, where we utilize log-loss to make the results comparable with the results from the original competition. Besides, we could perform the experiments on the CIFAR-10 dataset with Random Forest and Kernel Forest only, because of time and memory limitations.

TABLE 4. Accuracy of considered randomized tree ensembles (non-regularized).

Table 4 presents the obtained test results for the non-regularized ensembles. The best results for the image recognition datasets have been achieved with the radial basis function (RBF) kernel, whereas trees with the Polynomial kernel outperformed the others on the ''Youtube channels'' dataset. We used oblique splits for the ''BNP Paribas'' and Titanic datasets because more complex kernels led to over-fitting. Although the difference is slight, the proposed method outperforms the other approaches on most datasets. Statistical analysis of the obtained results (accuracy scores) with the Wilcoxon test shows that the proposed method (Kernel Forest) outperforms all other approaches except CO2 Forest (p-values are about 0.004). However, there is no reliable conclusion regarding the comparison of CO2 Forest and Kernel Forest (the p-value is 0.1). CO2 Forest with oblique trees optimizes cross-entropy and margin simultaneously; therefore, it shows a competitive level of accuracy on relatively small datasets (like ''Letter''), where the usage of non-linear kernels leads to over-fitting.

TABLE 5. Precision of considered randomized tree ensembles (non-regularized).
TABLE 6. Recall of considered randomized tree ensembles (non-regularized).

Tables 5 and 6 present the precision and recall scores with macro-averaging. They also show that the proposed approach (Kernel Forest) is superior to most of the other considered approaches, except CO2 Forest.

TABLE 7. Regularization of Kernel Forests.

Table 7 shows the results of the experiments with the ensemble regularizations. We selected datasets on which quite large over-fitting was observed. One can note that regularization helps significantly improve the results on those datasets (by up to 1.5%), enhancing the applicability of trees with complex kernel decision stumps. However, introducing noise to decorrelate the ensemble estimators brings only a minor improvement. Statistical analysis of the obtained results with the Wilcoxon test shows that the usage of the noise and regularization helps to improve the results (p-value 0.05).
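The significance checks above rely on the paired Wilcoxon signed-rank test; a minimal sketch with SciPy is shown below, where the two score lists are placeholders for per-dataset (or per-fold) accuracies of the compared ensembles, not the actual values reported in the tables.

from scipy.stats import wilcoxon

# paired accuracy scores of two methods on the same datasets/folds (placeholder values)
kernel_forest_acc = [0.95, 0.91, 0.88, 0.97, 0.84, 0.90]
other_forest_acc  = [0.94, 0.90, 0.87, 0.96, 0.82, 0.89]

stat, p_value = wilcoxon(kernel_forest_acc, other_forest_acc)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.3f}")
# a p-value below the chosen threshold (e.g., 0.05) indicates a systematic difference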


IX. TRAINING TIME AND COMPLEXITY ANALYSIS
A. COMPLEXITY ANALYSIS
For an axis-parallel decision tree, the training cost can be estimated as follows. Assuming that the subtrees remain approximately balanced, the cost at each decision stump consists of searching through f features to find the one that offers the largest reduction in the impurity criterion. Therefore, the final training cost for a balanced tree is

O(m f log(m)),   (34)

where m is the size of the training dataset.
In general, kernel trees follow the same greedy algorithm to grow. However, the cost of decision stump training in that case depends on the split type (linear or a more complex one) and on the particular optimization method used to train the stump. In the case of a linear split, one can utilize the dual coordinate descent method from [42]. This method has linear complexity with respect to the number of training samples:

O(m f)   (35)

Therefore, the final complexity score is the same as for ordinary axis-parallel trees:

O(m f log(m))   (36)

In the case of non-linear kernels, one of the solvers that can be applied is LibSVM, which uses the sequential minimal optimization (SMO) algorithm [43]. The training cost for this solver is

O(m² f).   (37)

Therefore, the final complexity score is polynomial, which can be a problem for training on large datasets:

O(m² f log(m))   (38)
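These estimates follow directly from the greedy recursion; the sketch below shows how a kernel tree composes a pluggable per-stump solver with roughly halving data at each level, so an O(mf) linear solver yields O(mf log m) per tree while an O(m²f) SMO-based solver yields O(m²f log m). The helper names and the binary-split simplification (the class-to-subtree assignment step is omitted) are illustrative only.

from dataclasses import dataclass
from collections import Counter
import numpy as np

@dataclass
class Node:
    stump: object
    left: object
    right: object

def grow_tree(X, y, depth, max_depth, train_stump):
    # Greedy growth: one decision stump per node, then recursion on the induced split.
    # train_stump is the pluggable solver (e.g., a linear SVM or an SMO-based kernel SVM)
    # and is assumed to expose a binary decision_function.
    if depth == max_depth or len(np.unique(y)) == 1:
        return Counter(y.tolist()).most_common(1)[0][0]   # leaf: majority class
    stump = train_stump(X, y)
    go_left = stump.decision_function(X) <= 0
    if go_left.all() or (~go_left).all():                 # degenerate split: stop early
        return Counter(y.tolist()).most_common(1)[0][0]
    return Node(stump,
                grow_tree(X[go_left], y[go_left], depth + 1, max_depth, train_stump),
                grow_tree(X[~go_left], y[~go_left], depth + 1, max_depth, train_stump))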
B. TRAINING TIME
We assessed the time to train random forests of oblique and kernel trees with different approaches. We set the ensemble size to 10 trees and the depth to 5 for all the approaches. All the experiments were performed on a computer with the following hardware: a 12-core AMD Ryzen Threadripper 1920 CPU, 128 GB of RAM, and an Nvidia GeForce RTX 2080 GPU.
In the first experiment, we randomly sampled batches of fixed size (from 10K to 50K) from several different datasets and estimated the training time for oblique forests (Fig. 6). We used the dual coordinate descent method from Scikit-Learn to train oblique decision stumps with the proposed approach (Kernel Forest). As a result, we revealed that most of the considered approaches, including the proposed one, show a linear dependence on the dataset size for a fixed tree depth. However, CO2 Forest, which finds the global optimum of a convex-concave loss function, shows a polynomial trend. The obtained scores agree with the theoretical results of the complexity analysis, and the proposed approach can be used to train oblique random forests on large datasets.

FIGURE 6. Time to train Random forests with different types of oblique trees depending on the dataset size.

In the second experiment, we tested the proposed approach with the RBF kernel on the same datasets as in the first test (Fig. 7). We used the ThunderSVM and LibSVM libraries to train the decision stumps with the proposed method. ThunderSVM uses the GPU, so it allows training kernel trees six times faster than LibSVM; however, there is a polynomial dependency between the training time and the size of a dataset for kernel forests for both libraries. The latter means that further study of training kernel SVMs is required to make the proposed method reasonably applicable to processing large data.

FIGURE 7. Time to train Kernel forests with the RBF kernel depending on the dataset size.
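The measurement protocol itself can be summarized in a few lines; forest_factory below stands for whichever ensemble constructor is being timed (ours or a baseline), and the batch sizes mirror the 10K-50K samples used above.

import time
import numpy as np

def time_training(forest_factory, X, y, sizes=(10_000, 20_000, 30_000, 40_000, 50_000), seed=0):
    # Train a fresh 10-tree, depth-5 forest on random batches of fixed size
    # and record the wall-clock training time for each batch size.
    rng = np.random.default_rng(seed)
    timings = {}
    for m in sizes:
        idx = rng.choice(len(X), size=m, replace=False)
        forest = forest_factory(n_trees=10, max_depth=5)
        start = time.perf_counter()
        forest.fit(X[idx], y[idx])
        timings[m] = time.perf_counter() - start
    return timings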


X. CONCLUSION
Non-linear kernels, margin and impurity optimization help the tree ensembles achieve better accuracy on many problems, especially those related to image recognition. However, they are quite imperfect in terms of speed and memory requirements. For example, better classification results are usually achieved with C > 1000, when the loss becomes ''steep'' and hard to optimize. Therefore, future research is needed to create fast, presumably GPU-based, algorithms to train the decision stumps. It seems that the stability and generalization of decision trees depend on ''outliers'' in the leaf probabilities. Decision stump chains with low probability are unstable, and randomized ensembles smooth them out. As we revealed, the tree refinement and pruning method [33] also helps make the trees more stable while not decreasing the discriminative power much. We tried to modify that method to reduce the correlation between estimators, and we obtained a steady improvement in accuracy on some datasets.
Many classification and regression problems presume the analysis of discrete processes, which could be handled better by discrete models. We believe the future development of randomized tree ensembles should be focused on utilizing them as a part of more sophisticated models that can catch complex relationships between features and data samples, e.g., [44], [45].

APPENDIX A
PROOF OF THE LEMMA 1
Proof: First we consider the product of two estimators Pr(x) = q_{1D}(x) q_{2D}(x):

γ_{Pr} = sup_D |l(q_{1D}(x) q_{2D}(x), y) − l(q_{1D\i}(x) q_{2D\i}(x), y)|,  ∀i ∈ 1, . . . , |D|, ∀x ∈ X.   (39)

As far as the loss functions are B-Lipschitz, we can reformulate that expression as follows:

γ_{Pr} ≤ B sup_D |q_{1D}(x) q_{2D}(x) − q_{1D\i}(x) q_{2D\i}(x)|   (40)

Because the range of q_1(x) and q_2(x) is [0..1] for all x, q_1(x) q_2(x) ≤ q_1(x) + q_2(x). Hence

γ_{Pr} ≤ (B/2) sup_D |(q_{1D}(x) + q_{2D}(x)) − (q_{1D\i}(x) + q_{2D\i}(x))| ≤ (B/2)(γ_1 + γ_2),   (41)

which proves the claim for two estimators.
Now we can consider the product of n estimators. It can be represented as q = q_{i_1} Π_{j≠i_1} q_j and, in accordance with the previous case,

γ_n ≤ (B/2)(γ_{i_1} + γ_{j≠i_1}).   (42)

Because of the commutativity of the product, i_1 can be any natural number in the range [1, . . . , n], and all the options are equal. The product Π_{j≠i_1} q_j can also be decomposed as Π_{j≠i_1} q_j = q_{i_2} Π_{j≠i_1,i_2} q_j. The decomposition process repeats while more than one factor remains in the product. As a result, the uniform stability of the product construction algorithm is limited as follows:

γ_n ≤ (B/2) γ_{i_1} + (B/2)² γ_{i_2} + · · · + (B/2)^{n−1} γ_{i_{n−1}} + (B/2)^{n−1} γ_{i_n}   (43)

One can define a probability space (Ω, A, P) and a discrete random variable Ψ : (Ω, A) → (R, B) on this space, such that Ψ(ω) = ω, ∀ω ∈ Ω, where Ω = {1, 2, . . . , n}, A is an algebra on Ω, and B is a Borel algebra. That random variable defines a position β_i on the right-hand side of (43). Because of the product commutativity, for all the values i ∈ B the probability is

P(i ∈ Ψ) = 1/n   (44)

Evidently, P satisfies all the properties of a probability. Now we can obtain the expectation of the right-hand side of expression (43):

γ_n ≤ ( Σ_{i=1}^{n−2} (B/2)^i P(i) + (B/2)^{n−1} P(n−1) + (B/2)^{n−1} P(n) ) Σ_{i=1}^{n} γ_i ≤ (1/n) ( Σ_{i=1}^{n−2} (B/2)^i + 2 (B/2)^{n−1} ) Σ_{j=1}^{n} γ_j,   (45)

which proves (26). □

APPENDIX B
PROOF OF THE THEOREM 5
Proof: Let us introduce the random uniform stability for bagging on kernel trees. We consider the bagging ensemble as a set of dependent decision stump chains. Following [10], the expression for the random uniform stability can be defined as follows:

β_m ≤ (B/T) Σ_{t=1}^{T} E_{r_t}(γ_{r_t} 1_{i∈r_t})   (46)

It is worth noting that this expression does not add conditions of independence of the parameters r_t.
For simplicity, we shall consider an ensemble with T trees, where each of the trees has Γ chains of length n. Each chain contains decision stump classifiers trained with an algorithm with a uniform stability γ.

β_m ≤ (B/T) Σ_{t=1}^{T} Σ_{j=1}^{Γ} E_{r_tj}(γ_{r_tj} 1_{i∈r_tj}) = (B/T) Σ_{k=1}^{m} Σ_{t=1}^{T} Σ_{j=1}^{Γ} P_{r_tj}(d(r_tj) = k) γ_k E_{r_tj}(1_{i∈r_tj}; d(r_tj) = k),   (47)

where d(r_tj) = k denotes the event that the training data of decision stump chain j from tree t has k distinct samples.
Let us estimate the expectation E_{r_tj}(1_{i∈r_tj}; d(r_tj) = k). This expectation is the probability that element i (the deleted element) is present in the training data of decision stump chain j of tree t, provided that the randomized algorithm for training this chain has selected k objects from the original training data set of size m.

Let us consider an event A that the training sample number i is present in the training data for some decision classifier chain, an event A_1 that it is present in the training data for the root decision stump classifier, an event A_2 that it is present in the training data for the next decision stump after the root, and so on up to an event A_n that it is present in the training data for the n-th decision stump.
Recall that, according to the properties of stump chains, when training the stumps in a chain, each next stump is trained on a subset of the data on which the previous stump was trained: D_n ⊂ · · · ⊂ D_2 ⊂ D_1. Therefore, A_n ⊂ · · · ⊂ A_2 ⊂ A_1, P(∪_{i=1}^{n} A_i) = P(A_1) is the desired probability, and P(A_1) = k/m.
Now, substituting the expression for the uniform stability of the algorithm that builds a decision stump classifier chain (27), we obtain the desired expression. □

APPENDIX C
PROOF OF THE THEOREM 6
Proof: For simplicity, we will use flat indexing for decision stumps. Suppose our ensemble has T′ decision stumps.
Let us define the difference between the generalization error and the empirical risk as the following random function:

K(D, r) = R(F_{D(r)}) − R̂(F_{D(r)}),   (48)

where D = {D_1, . . . , D_m} is a training dataset of size m and r = {r_1, . . . , r_{T′}} is a set of random vectors that are parameters for data sampling in the decision stump nodes. It is worth noting that some of the vectors in r are interdependent.
We need to assess an upper bound of the random function K(D, r) to obtain the upper bound of the generalization error. We use McDiarmid's inequality (Theorem 3.7 from [46]) to assess the deviation of K from its expected value over r and D. Let us represent K according to the definitions of R̂ and R:

K(D, r) = E_{XY}(l(F_{D(r)}(X), Y)) − (1/m) Σ_{i=1}^{m} l(F_{D(r)}(x_i), y_i) ≤ |E_{XY}(l(F_{D(r)}(X), Y))| + |(1/m) Σ_{i=1}^{m} l(F_{D(r)}(x_i), y_i)|,   (49)

where x_i, y_i ∈ D.
Since l(F_{D(r)}(x), y) is B-Lipschitz, we can put

K(D, r) ≤ 2B sup_X |F_{D(r)}(X)|   (50)

By the definition of the decision stump chain, F_{D(r)}(x) > 0; thus

K(D, r) ≤ 2B sup_X F_{D(r)}(X)   (51)

Let R^{k−1} denote the event when r_1, . . . , r_{k−1} are fixed, and D^{k−1} the event when D_1 = d_1, . . . , D_{k−1} = d_{k−1}; R denotes R^{T′} and D denotes D^m. Let D = d_1, . . . , d_{k−1}, D_k, . . . , D_m and D^{\k∪d} = d_1, . . . , d_{k−1}, d, D_{k+1}, . . . , D_m.
Now we consider the following functions:

g_1(d, k) = E_{D,r}(K(D, r) | D^{k−1}, D_k = d) − E_{D,r}(K(D, r) | D^{k−1})   (52)

and

g_2(r, k) = E_r(K(D, r) | D, R^{k−1}, r_k = r) − E_r(K(D, r) | D, R^{k−1})   (53)

Let the value

v̂ = sup_{r_1,...,r_{k−1}, d_1,...,d_{k−1}, x} ( Σ_{k=1}^{m} var(g_1(d, k)) + Σ_{k=1}^{T′} var(g_2(r, k)) ) = sup_{d_1,...,d_{k−1}, x} Σ_{k=1}^{m} var(g_1(d, k)) + sup_{r_1,...,r_{k−1}, d_1,...,d_{k−1}, x} Σ_{k=1}^{T′} var(g_2(r, k))   (54)

be the upper bound on the sum of the variances of the functions g_1 and g_2. Namely, these expressions reflect the variance of the generalization error that originates from the training data (for g_1) and from the random parameters of the training (for g_2).
For the variance of g_1 we have the following:

Σ_{k=1}^{m} var(g_1(d, k)) = Σ_{k=1}^{m} var( E_{D,r}(K(D, r) | D^{k−1}, d_k = d) − E_{D,r}(K(D, r) | D^{k−1}) )   (55)

The expected value E_{D,r}(K(D, r) | D^{k−1}) for a fixed D^{k−1} does not affect the variance. We also know that D_1, . . . , D_m are independent; besides, r does not depend on D and vice versa. Therefore, for the fixed D^{k−1} we can put

E_{D,r}(K(D, r) | D^{k−1}, d_k = d) = E_{D,r}(K(D^{\k∪d}, r))   (56)

In accordance with Theorem 3.4 from [36], we have the following:

Σ_{k=1}^{m} sup_{d_1,...,d_{k−1}, x} var(g_1(d, k)) ≤ 4B² Σ_{k=1}^{m} sup_{d_1,...,d_{k−1}, x} E_d( E_d(E_{D,r}(F_{D^{\k∪d}(r)}(x, y))) − E_{D,r}(F_{D^{\k∪d}(r)}(x, y)) )² ≤ 4B² m β_m   (57)

Now consider g_2(r, k). Combining (51) and the expression F_{D(r)}(x, y) = Σ_{i=1}^{T′} f_{D(r)}(x, y), we have:


Σ_{k=1}^{T′} var(g_2(r, k)) ≤ (2B/T)² Σ_{k=1}^{T′} var( Σ_{i=1}^{T′} E_{r_i}(f_{D(r_i)}(x, y) | D, R^{k−1}, r_k = r) − E_{r_i}(f_{D(r_i)}(x, y) | D, R^{k−1}) )   (58)

Note that the difference between the expectations is non-zero only for the decision stump chains that depend on r_k. Denote the set of indices of those chains as ρ_k. Observe that for fixed R^{k−1} and D the expected value E_{r_i}(f_{D(r_i)}(x, y) | D, R^{k−1}) is a constant, so it does not affect the variance. Then we have:

Σ_{k=1}^{T′} var(g_2(r, k)) ≤ (2B/T)² Σ_{k=1}^{T′} var( Σ_{i∈ρ_k} E_{r_i}(f_{D(r_i)}(x, y) | D, R^{k−1}, r_k = r) )   (59)

The size of the set ρ_k cannot exceed the number of tree leaves. Denote the maximum number of tree leaves in the ensemble as Γ = sup_k |ρ_k|. For simplicity, we presume that all ensemble trees have the same number of leaves. The value E_{r_i}(f_{r_i,D}(x, y) | D, R^{k−1}, r_k = r) is bounded by the range of f, which is M; therefore, we can put

Σ_{k=1}^{T′} var(g_2(r, k)) ≤ Σ_{k=1}^{T′} (2B/T)² M² Γ² = (2BM)² Γ³ / T   (60)

Now we apply the McDiarmid inequality (3.8) from [46]:

P_{D,r}( K(D, r) − E_{D,r}(K(D, r)) ≥ t ) ≤ exp( − t² / (2 v̂ (1 + bt/(3 v̂))) ),   (61)

where b = BM. Let us denote the right-hand side of the inequality as δ and express t from it:

δ = exp( − t² / (2 v̂ (1 + bt/(3 v̂))) )   (62)

Because t > 0:

t = (1/2) ( (2BM/3) ln(1/δ) + sqrt( ((2BM/3) ln(1/δ))² + 8 ln(1/δ) v̂ ) )   (63)

With probability 1 − δ with respect to D and r:

K(D, r) − E_{D,r}(K(D, r)) ≤ (1/2) ( (2BM/3) ln(1/δ) + sqrt( ((2BM/3) ln(1/δ))² + 8 ln(1/δ) v̂ ) )   (64)

The next step is completely analogous to the proof of Theorem 3.4 in [10]. Let us obtain the expectation of K(D, r) over D and r:

E_{D,r}(K(D, r)) = 2 β_m   (65)

Then substitute it into (64):

R(F_{D(r)}) − R̂(F_{D(r)}) − 2 β_m ≤ (1/2) ( (2BM/3) ln(1/δ) + sqrt( ((2BM/3) ln(1/δ))² + 8 ln(1/δ) v̂ ) )   (66)

Finally, with probability at least 1 − δ we have:

R ≤ R̂ + 2 β_m + B ( (1/2) ( (2M/3) ln(1/δ) + sqrt( ((2M/3) ln(1/δ))² + 8 ln(1/δ) ( B² m β_m + (2M)² Γ³ / T ) ) ) ),   (67)

which concludes the proof. □

REFERENCES
[1] L. Breiman, J. H. Friedman, and R. A. Olshen, Classification and Regression Trees. Belmont, CA, USA: Wadsworth, 1984.
[2] T. Chen and C. Guestrin, ''XGBoost: A scalable tree boosting system,'' in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[3] L. Breiman, ''Random forests,'' Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[4] M. Golea, P. Bartlett, L. Mason, and W. S. Lee, ''Generalization in decision trees and DNF: Does size matter?'' in Proc. Adv. Neural Inf. Process. Syst., 1998, pp. 259–265.
[5] V. Vapnik, Statistical Learning Theory. Hoboken, NJ, USA: Wiley, 1998, p. 768.
[6] L. Breiman, ''Some properties of splitting criteria,'' Mach. Learn., vol. 24, no. 1, pp. 41–47, 1996.
[7] W. Liu and I. W. Tsang, ''Sparse perceptron decision tree for millions of dimensions,'' in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 1181–1187.
[8] W. Liu and I. W. Tsang, ''Making decision trees feasible in ultrahigh feature and label dimensions,'' J. Mach. Learn. Res., vol. 18, pp. 1–36, Jul. 2017.
[9] J. H. Friedman and P. Hall, ''On bagging and nonlinear estimation,'' J. Stat. Planning Inference, vol. 137, no. 3, pp. 669–683, Mar. 2007, doi: 10.1016/j.jspi.2006.06.002.
[10] A. Elisseeff, T. Evgeniou, and M. Pontil, ''Stability of randomized learning algorithms,'' J. Mach. Learn. Res., vol. 6, no. 1, pp. 55–79, Dec. 2005.
[11] R. Tibshirani and T. Hastie, ''Margin trees for high-dimensional classification,'' J. Mach. Learn. Res., vol. 8, no. 3, pp. 637–652, 2007.
[12] J. R. Quinlan, ''Induction of decision trees,'' Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[13] Y.-H. Chen, S.-H. Lyu, and Y. Jiang, ''Improving deep forest by exploiting high-order interactions,'' in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2021, pp. 1030–1035.
[14] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, ''OC1: A randomized algorithm for building oblique decision trees,'' in Proc. AAAI Conf. Artif. Intell., vol. 93, 1993, pp. 322–327.
[15] T. Evgeniou, M. Pontil, and A. Elisseeff, ''Leave one out error, stability, and generalization of voting combinations of classifiers,'' Mach. Learn., vol. 55, no. 1, pp. 71–97, Apr. 2004.
[16] G. DeSalvo and M. Mohri, ''Random composite forests,'' in Proc. AAAI Conf. Artif. Intell., Phoenix, AZ, USA, 2016, pp. 1540–1546.
[17] M. Norouzi, M. D. Collins, D. J. Fleet, and P. Kohli, ''CO2 forest: Improved random forest by continuous optimization of oblique splits,'' 2015, arXiv:1506.06155.
[18] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer, ''Large margin methods for structured and interdependent output variables,'' J. Mach. Learn. Res., vol. 6, no. 9, pp. 15–64, 2005.
[19] B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht, ''On oblique random forests,'' in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases, Bristol, U.K., 2011, pp. 453–469.
[20] O. Irsoy and E. Alpaydin, ''Autoencoder trees,'' in Proc. Asian Conf. Mach. Learn., Hamilton, New Zealand, 2016, pp. 378–390.
[21] Z. Chai and C. Zhao, ''Multiclass oblique random forests with dual-incremental learning capacity,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 12, pp. 5192–5203, Dec. 2020, doi: 10.1109/TNNLS.2020.2964737.
[22] B. B. Yang, S. Q. Shen, and W. Gao, ''Weighted oblique decision trees,'' in Proc. AAAI Conf. Artif. Intell., Honolulu, HI, USA, 2019, pp. 5621–5627.
[23] T. M. Hehn, J. F. P. Kooij, and F. A. Hamprecht, ''End-to-end learning of decision trees and forests,'' Int. J. Comput. Vis., vol. 128, no. 4, pp. 997–1011, Apr. 2020, doi: 10.1007/s11263-019-01237-6.
[24] K. P. Bennett and J. A. Blue, ''A support vector machine approach to decision trees,'' in Proc. IEEE Int. Joint Conf. Neural Netw., IEEE World Congr. Comput. Intell., vol. 3, May 1998, pp. 2396–2401.
[25] M. A. Carreira-Perpinán and P. Tavallali, ''Alternating optimization of decision trees, with application to learning sparse oblique trees,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 1–11.
[26] M. A. Kumar and M. Gopal, ''A hybrid SVM based decision tree,'' Pattern Recognit., vol. 43, no. 12, pp. 3977–3987, Dec. 2010, doi: 10.1016/j.patcog.2010.06.010.
[27] N. Manwani and P. S. Sastry, ''Geometric decision tree,'' IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 1, pp. 181–192, Feb. 2012, doi: 10.1109/TSMCB.2011.2163392.
[28] A. L. Yuille and A. Rangarajan, ''The concave-convex procedure,'' Neural Comput., vol. 15, no. 4, pp. 915–936, Apr. 2003.
[29] P. L. Bartlett and S. Mendelson, ''Rademacher and Gaussian complexities: Risk bounds and structural results,'' J. Mach. Learn. Res., vol. 3, pp. 463–482, Nov. 2002.
[30] R. Hecht-Nielsen, ''Theory of the backpropagation neural network,'' in Neural Networks for Perception. New York, NY, USA: Academic, 1992, pp. 65–93.
[31] M. Robnik-Sikonja, ''Improving random forests,'' in Machine Learning: ECML 2004. Berlin, Germany: Springer, 2004, pp. 359–370.
[32] S. Bernard, L. Heutte, and S. Adam, ''Forest-RK: A new random forest induction method,'' in Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence. Berlin, Germany: Springer, 2008, pp. 430–437.
[33] S. Ren, X. Cao, Y. Wei, and J. Sun, ''Global refinement of random forest,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 723–730.
[34] G. King and L. Zeng, ''Logistic regression in rare events data,'' Political Anal., vol. 9, no. 2, pp. 137–163, 2001.
[35] B. Taskar, C. Guestrin, and D. Koller, ''Max-margin Markov networks,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 16, 2004, pp. 25–32.
[36] A. Elisseeff, T. Evgeniou, and M. Pontil, ''Stability of randomized learning algorithms with an application to bootstrap methods,'' Tech. Rep., 2004.
[37] P. M. Murphy and D. W. Aha, ''UCI Repository of machine learning databases,'' Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, USA, Tech. Rep., 1991. Accessed: Jul. 24, 2022. [Online]. Available: https://archive.ics.uci.edu/ml/about.html
[38] A. Krizhevsky, ''Learning multiple layers of features from tiny images,'' M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2009.
[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, and E. Duchesnay, ''Scikit-learn: Machine learning in Python,'' J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[40] YouTube Channels Dataset. Accessed: Jul. 1, 2022. [Online]. Available: http://keen.isa.ru/youtube
[41] I. Smirnov, M. Stankevich, Y. Kuznetsova, M. Suvorova, D. Larionov, E. Nikitina, and O. Grigoriev, ''TITANIS: A tool for intelligent text analysis in social media,'' in Proc. Russian Conf. Artif. Intell., 2021, pp. 232–247.
[42] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, ''A dual coordinate descent method for large-scale linear SVM,'' in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 408–415.
[43] A. Abdiansah and R. Wardoyo, ''Time complexity analysis of support vector machines (SVM) in LibSVM,'' Int. J. Comput. Appl., vol. 128, no. 3, pp. 28–34, 2015, doi: 10.5120/ijca2015906480.
[44] Z.-H. Zhou and J. Feng, ''Deep forest,'' 2017, arXiv:1702.08835.
[45] P. Ma, Y. Wu, Y. Li, L. Guo, and Z. Li, ''DBC-Forest: Deep forest with binning confidence screening,'' Neurocomputing, vol. 475, pp. 112–122, Feb. 2022, doi: 10.1016/j.neucom.2021.12.075.
[46] C. McDiarmid, ''Concentration,'' in Probabilistic Methods for Algorithmic Discrete Mathematics. Berlin, Germany: Springer, 1998, pp. 195–248.

DMITRY A. DEVYATKIN received the M.S. degree in computer science from Rybinsk State Aviation Technology University, Rybinsk, Russia, in 2011. He is currently pursuing the Ph.D. degree in computer science with the Federal Research Center ''Computer Science and Control,'' RAS, Moscow, Russia. Since 2011, he has been a Researcher with the Russian Artificial Intelligence Research Institute, Federal Research Center ''Computer Science and Control,'' RAS. His research interests include machine learning, randomized ensembles, natural language processing, information extraction, and information retrieval. Mr. Devyatkin's awards and honors include the Best Paper Award from the IEEE 8th International Conference on Intelligent Systems (IS).

OLEG G. GRIGORIEV received the M.S. degree in computer science from the Moscow Institute of Electronics and Mathematics (MIEM), Moscow, Russia, in 1980, and the Ph.D. degree in computer science from the Moscow State University of Technology ''STANKIN,'' in 2004. From 1980 to 1989, he was a Research Fellow with the Computing Centre of the Soviet Academy of Sciences, Moscow. In 1989, he founded ''Informatic Ltd.,'' and he was the CEO of this company from 1989 to 2010. Since 2010, he has been a Principal Researcher at the Federal Research Center ''Computer Science and Control,'' RAS, Moscow. He is also the developer of the first Russian spell checker, ORFO. His research interests include machine learning, natural language processing, and digital healthcare.
