MASTER DEGREE IN APPLIED COMPUTER SCIENCE
(2016-2017)
A Sparse-Coding Based Approach for
Class-Specific Feature Selection
Supervisor:
Prof. Angelo CIARAMELLA
Co-Supervisor:
Prof. Antonino STAIANO
Author:
Mr. Davide NARDONE
0120/131
PARTHENOPE UNIVERSITY OF NAPLES
DEPARTMENT OF SCIENCE AND TECHNOLOGY
December 14, 2017
“Success is the result of
perfection, hard work,
learning from failure,
loyalty, and persistence.”
Colin Powell
Contents
Abstract
Introduction
1 Mathematical Optimization Problems
  1.1 Introduction to Optimization Problems
  1.2 Types of Optimization Problems and Traditional Numerical Methods
    1.2.1 Linear optimization problems
    1.2.2 Quadratic optimization problems
    1.2.3 Stochastic optimization problems
    1.2.4 Convex optimization problems
  1.3 Nonlinear optimization
    1.3.1 Local solution
    1.3.2 Global solution
  1.4 Duality
    1.4.1 The Lagrangian
    1.4.2 The Lagrange dual function
    1.4.3 The Lagrange dual problem
    1.4.4 Dual Ascent
    1.4.5 Augmented Lagrangian and the Method of Multipliers
  1.5 Alternating Direction Method of Multipliers
    1.5.1 Convergence
    1.5.2 Convergence in practice
    1.5.3 Extensions and Variations
2 Sparse statistical models
  2.1 Sparse Models
  2.2 Introduction
  2.3 Ridge Regression
  2.4 Least Absolute Shrinkage and Selection Operator (LASSO)
    2.4.1 The LASSO estimator
    2.4.2 LASSO Regression
    2.4.3 Computation of LASSO solution
    2.4.4 Single Predictor: Soft Thresholding
  2.5 Key Difference
3 Data Size Reduction
  3.1 Dimensionality Reduction
  3.2 Feature extraction
  3.3 Feature selection
  3.4 Feature Selection for Classification
  3.5 Flat Feature Selection Techniques
    3.5.1 Filter methods
    3.5.2 Wrapper methods
    3.5.3 Embedded and hybrid methods
    3.5.4 Regularization models
  3.6 Algorithms for Structured Features
  3.7 Algorithms for Streaming Features
  3.8 Feature Selection Application Domains
    3.8.1 Text Mining
    3.8.2 Image Processing
    3.8.3 Bioinformatics
4 Class-Specific Feature Selection Methodology
  4.1 Introduction
  4.2 Problem formulation
    4.2.1 Learning compact dictionaries
    4.2.2 Finding representative features
  4.3 Reformulation problem
    4.3.1 ADMM for the LASSO
  4.4 Class-Specific Feature Selection
    4.4.1 General Framework for Class-Specific Feature Selection
    4.4.2 A Sparse Learning-Based Approach for Class-Specific Feature Selection
5 Experimental Results
  5.1 Experimental Analysis
  5.2 Datasets Description
  5.3 Experiment Setup
  5.4 Experimental Results and Discussion
6 Conclusion
References
List of Figures
1.1 Global and local maxima.
2.1 Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least-squares error function.
2.2 The ridge coefficients (green) are a reduced factor of the simple linear regression coefficients (red) and thus never attain zero but only very small values, whereas the lasso coefficients (blue) become zero in a certain range and are reduced by a constant factor, which explains their low magnitude in comparison to ridge.
2.3 The soft-thresholding function Sλ(x) = sign(x)(|x| − λ)+ is shown in blue (broken lines), along with the 45° line in black.
3.1 A general Framework of Feature Selection for Classification.
3.2 Taxonomy of Algorithms for Feature Selection for Classification.
3.3 A General Framework for Wrapper Methods of Feature Selection for Classification.
3.4 Illustration of Lasso, Group Lasso and Sparse Group Lasso. Features can be grouped into 4 disjoint groups G1, G2, G3, G4. Each cell denotes a feature and white color represents the corresponding cell with coefficient zero.
3.5 An illustration of a simple index tree of height 3 with 8 features.
3.6 An illustration of the graph of 7 features {f1, f2, ..., f7} and its representation A.
4.1 Cumulative sum of the rows of matrix C in descending order of their ℓ1-norm. The regularization parameter of eq. 4.12 is set to 20 and 50, respectively.
4.2 A General Framework for Class-Specific Feature Selection [41].
4.3 A Sparse Learning-Based approach for Class-Specific Feature Selection.
5.1 Classification accuracy comparisons of seven feature selection algorithms on six data sets. SVM with 5-fold cross-validation is used for classification. SCBA and SCBA-CSFS are our methods.
5.2 Classification accuracy comparisons of seven feature selection algorithms on six datasets. SVM with 5-fold cross-validation is used for classification. SCBA-CSFS is our method.
List of Tables
5.1 Datasets Description.
5.2 Accuracy score of SVM using 5-fold cross-validation. Six TFS methods are compared against our methods. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features, SCBA-FS and SCBA-CSFS: our methods. The best results are highlighted in bold.
5.3 Accuracy score of SVM using 5-fold cross-validation. The GCSFS [41] framework using 5 traditional feature selectors is compared against our SCBA-CSFS. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features and SCBA-CSFS: our method. The best results are highlighted in bold.
Abstract
Feature selection (FS) plays a key role in several fields, and in particular in computational biology,
where it makes it possible to work with models having fewer variables, which in turn are easier to
explain and may speed up experimental validation by providing valuable insight into the importance
of the variables and their role. Here, we propose a novel FS procedure consisting of a two-step approach.
First, a sparse-coding based learning technique is used to find the best subset of features
for each class of the training data. In doing so, it is assumed that a class is represented by
a subset of features, called representatives, such that each sample in that class
can be described as a linear combination of them. Second, the discovered feature subsets
are fed to a class-specific feature selection scheme in order to assess the effectiveness of the selected
features in a classification task. To this end, an ensemble of classifiers is built by training one
classifier per class on its own feature subset, i.e., the one discovered in the previous
step, and a proper decision rule is adopted to combine the ensemble responses. To assess the
effectiveness of the proposed FS approach, a number of experiments have been performed on
benchmark microarray datasets, in order to compare its performance to several FS techniques
from the literature. In all cases, the proposed FS methodology exhibits convincing results, often
outperforming its competitors.
Introduction
In many areas such as computer vision, signal/image processing, and bioinformatics, data are
represented by high dimensional feature vectors and it is typical to solve problems through data
analysis, particularly through the use of statistical and machine learning algorithms. This has
motivated a lot of work in the area of dimensionality reduction, whose goal is to find compact
representations of the data that can save memory and computational time and also improve the
performance of algorithms that deal with the data. Moreover, dimensionality reduction can also
improve our understanding and interpretation of the data. Feature selection aims to select a
subset of features from the high dimensional feature set, trying to achieve a compact and proper
data representation. A large number of developments on feature selection have been made in
the literature and there are many recent reviews and workshops devoted to this topic. Feature
selection has been playing a critical role in many applications such as bioinformatics, where
a large amount of genomic and proteomic data are produced for biological and biomedical
studies.
In this work, we propose a novel Feature Selection framework called Sparse-Coding Based
Approach for Class-Specific Feature Selection (SCBA-CSFS) that simultaneously exploits the
ideas of Compressed Sensing and Class-Specific Feature Selection. Regarding the feature selection
step, the feature selection matrix is constrained to have sparse rows, which is formulated as
an ℓ2,1-norm minimization term. To solve the proposed optimization problem, an efficient iterative
algorithm is adopted. Preliminary experiments are conducted on several bioinformatics
datasets, showing that the proposed approach outperforms the state of the art on specific
datasets.
Chapter 1
Mathematical Optimization
Problems
In this Chapter we give an overview of mathematical optimization, focusing in particular
on the special role of convex optimization. In addition, we briefly review two optimization
algorithms that are precursors to the Alternating Direction Method of Multipliers.
1.1 Introduction to Optimization Problems
Mathematical Optimization problems are concerned with finding the values for one or for
several decision variables that meet the objective(s) the best, without violating a given set of
constraints [104, 105]. A mathematical optimization problem has the form
minimize f0(x)
subject to fi(x) ≤ bi, i = 1,...,m
hi(x) = bi, i = 1,...,p
(1.1)
which describes the problem of finding an x that minimizes f0(x) among all x that satisfy
the conditions fi(x) ≤ bi, i = 1,...,m and hi(x) = bi, i = 1,...,p. We call the vector x =
(x1,...,xn) the optimization variable and the function f0 : Rn → R the objective function or cost
function. The inequalities fi(x) ≤ bi are called the inequality constraints and the corresponding
functions fi : Rn → R are the inequality constraint functions; likewise, the equalities hi(x) = bi are
the equality constraints with equality constraint functions hi : Rn → R, and the constants b1,...,bm are the
bounds for the constraints. If there are no constraints (i.e., m = p = 0) then the problem (1.1) is
unconstrained. The set of points for which the objective function and all constraint functions
are defined,
\[ \mathcal{D} = \bigcap_{i=0}^{m} \operatorname{dom} f_i \;\cap\; \bigcap_{i=1}^{p} \operatorname{dom} h_i \qquad (1.2) \]
is called the domain of the optimization problem (1.1). A point x ∈ D is feasible if it satisfies
the constraints fi(x) ≤ bi,i = 1,...,m and hi(x) = bi,i = 1,...,p. The problem (1.1) is said
to be feasible if there exists at least one feasible point, and infeasible otherwise. The set of
all feasible points is called the feasible set. A vector x⋆ is called a solution of the problem
(1.1) if it has the smallest objective value among all vectors that satisfy the constraints: for
any z with f1(z) ≤ b1,...,fm(z) ≤ bm, we have f0(z) ≥ f0(x⋆). We generally consider fami-
lies or classes of optimization problems, characterized by particular forms of the objective and
constraint functions.
1.2 Types of Optimization Problems and Traditional Numerical Methods
Traditional numerical methods are usually based on iterative search or heuristic algorithms.
The former starts with a (deterministic or arbitrary) solution which is iteratively improved ac-
cording to some deterministic rule, while the latter starts off with a more or less arbitrary initial
solution and iteratively produces new solutions by some generation rule and evaluates these new
solutions, thus eventually reporting the best solution found during the search process. Which
type of method should and can be applied depends largely on the type of problem.
1.2.1 Linear optimization problems
A Linear Program (LP) is an optimization problem (1.1) in which the objective func-
tion and the constraint functions f0,...,fm are linear, i.e., they satisfy
fi(αx+βy) = αfi(x)+βfi(y) (1.3)
for all x,y ∈ Rn and all α,β ∈ R. If the optimization problem is not linear, it is called Non
Linear Programming (NLP). Since all linear functions are convex, LP problems are intrinsically
easier to solve than general NLP, which may be non-convex. The most popular method for
solving LPs is the Simplex Algorithm, where the inequalities are first transformed into equalities
by adding slack variables, and basic variables are then swapped in and out of the basis until the optimum is
found. Though its worst-case computational complexity is exponential, it is found to work quite
efficiently for many instances.
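For concreteness, the small sketch below (illustrative only; the problem data are invented and not part of the thesis) solves a tiny LP with SciPy's general-purpose linprog solver.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP:  maximize x1 + 2*x2  subject to  -x1 + x2 <= 1,  3*x1 + 2*x2 <= 12,  x >= 0.
# linprog minimizes, so the objective is negated.
c = np.array([-1.0, -2.0])
A_ub = np.array([[-1.0, 1.0],
                 [ 3.0, 2.0]])
b_ub = np.array([1.0, 12.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal x:", res.x, " optimal value:", -res.fun)   # expected x = (2, 3), value 8
```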
1.2.2 Quadratic optimization problems
A Quadratic Programming (QP) is an optimization problem in which the objective function
and the (inequality) constraints are quadratic
\[
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\,x^{T}Px + q^{T}x + r \\
\text{subject to} \quad & Gx \le h \\
& Ax = b
\end{aligned} \qquad (1.4)
\]
where P ∈ Sn is the Hessian matrix of the objective function, q is its gradient, and G ∈ Rm×n and
A ∈ Rp×n define the linear inequality and equality constraints, respectively. QP problems,
like LP problems, have only one feasible region with "flat faces" on its surface (due to the linear
constraints), but the optimal solution may be found anywhere within the region or on its surface.
The quadratic objective function may be convex, which makes the problem easy to solve, or
non-convex, which makes it very difficult to solve.
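As a minimal sketch (again with invented data, not from the thesis), the simplest convex case, an equality-constrained QP with positive definite P, can be solved directly by assembling the KKT optimality conditions into a single linear system:

```python
import numpy as np

# Illustrative equality-constrained QP:
#   minimize  0.5 * x^T P x + q^T x   subject to  A x = b
P = np.array([[4.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite
q = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])        # single equality constraint: x1 + x2 = 1
b = np.array([1.0])

# KKT system:  [P  A^T] [x ]   [-q]
#              [A   0 ] [nu] = [ b]
KKT = np.block([[P, A.T],
                [A, np.zeros((A.shape[0], A.shape[0]))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:P.shape[0]], sol[P.shape[0]:]
print("x* =", x, " dual nu* =", nu)
```

Inequality-constrained QPs require active-set or interior-point methods instead of a single linear solve.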
1.2.3 Stochastic optimization problems
A Stochastic Programming (SP) is an optimization problem where (some of the) data in-
corporated in the objective function are uncertain. Usual approaches include assumption of
different scenarios and sensitivity analyses.
1.2.4 Convex optimization problems
Convex optimization problems are far more general than LP problems, but they share the
desirable properties of LP problems: they can be solved quickly and reliably up to very large
size –hundreds of thousands of variables and constraints. The issue has been that, unless your
objective and constraints were linear, it was difficult to determine whether or not they were
convex. A convex optimization problem is one of the form
\[
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le b_i, \quad i = 1,\dots,m \\
& a_i^{T}x = b_i, \quad i = 1,\dots,p
\end{aligned} \qquad (1.5)
\]
where f0,...,fm are convex functions. Comparing eq. 1.5 with the general standard form prob-
lem (1.1), the convex problem has three additional requirements:
• The objective function must be convex.
• The inequality constraint functions must be convex.
• The equality constraint functions hi(x) = ai^T x − bi must be affine.
We immediately note an important property: the feasible set of a convex optimization prob-
lem is convex, since it is the intersection of the (convex) domain of the problem with the convex
sublevel sets defined by the inequality constraints and the hyperplanes defined by the equality
constraints. Another fundamental property of convex optimization problems is that any locally
optimal point is also (globally) optimal.
A non-convex optimization problem is any problem where the objective or any of the con-
straints are non-convex. Such a problem may have multiple feasible regions and multiple locally
optimal points within each region. It might take time exponential in the number of variables and
constraints to determine that a non-convex problem is infeasible, that the objective function is
unbounded, or that an optimal solution is the "global optimum" across all feasible regions.
Figure 1.1: Global and local maxima.
1.3 Nonlinear optimization
Nonlinear optimization is the term used to describe an optimization problem in which the objec-
tive function is not linear; it may be convex or non-convex. Unfortunately, there are no effective
methods for solving the general NLP problem. Nonlinear functions, unlike linear functions,
may involve variables that are raised to a power or multiplied or divided by other variables.
They may also use transcendental functions such as exp, log, sine and cosine.
1.3.1 Local solution
A locally optimal solution is one where there are no other feasible solutions with better
function values in its neighborhood. Rather than seeking the optimal x which minimizes the
objective over all feasible points, we seek a point that is only locally optimal, which means that
it minimizes the objective function among feasible points that are near it, but is not guaranteed
to have a lower objective value than all other feasible points.
In Convex Optimization problems, a locally optimal solution is also globally optimal. These
include i) LP problems; ii) QP problems where the objective is positive definite (if minimizing,
negative definite if maximizing); iii) NLP problems where the objective is a convex function (if
minimizing; concave if maximizing) and the constraints form a convex set.
Local optimization methods can be fast, can handle large-scale problems, and are widely
applicable, since they only require differentiability of the objective and constraint functions.
There are several disadvantages of local optimization methods, beyond (possibly) not finding
the actual, globally optimal solution. The methods require an initial guess for the optimization
variable.
1.3.2 Global solution
A globally optimal solution is one where there are no other feasible solutions with better
objective values. Global optimization is used for problems with a small number of variables,
where computing time is not critical, and the value of finding the actual global solution is very
high.
1.4 Duality
In mathematical optimization theory, duality is the principle by which optimization prob-
lems may be viewed from either of two perspectives, the primal problem or the dual problem. It
is a powerful and widely employed tool in applied mathematics for a number of reasons. First,
the dual problem is always convex even if the primal is not. Second, the number of variables in
the dual is equal to the number of constraints in the primal, which is often less than the number
of variables in the primal program. Third, the maximum value achieved by the dual problem
is often equal to the minimum of the primal [1]. However, in general, the optimal values of
the primal and dual problems need not be equal. When not equal, their difference is called the
duality gap.
1.4.1 The Lagrangian
We consider an optimization problem in the standard form (1.1)
minimize f0(x)
subject to fi(x) ≤ bi, i = 1,...,m
hi(x) = bi, i = 1,...,p
(1.6)
with variable x ∈ Rn. We assume its domain \(\mathcal{D} = \bigcap_{i=0}^{m} \operatorname{dom} f_i \cap \bigcap_{i=1}^{p} \operatorname{dom} h_i\) is nonempty,
and we denote the optimal value of (1.6) by p⋆. We do not assume that the problem (1.6) is convex.
The basic idea in Lagrangian duality is to take the constraints in 1.6 into account by aug-
menting the objective function with a weighted sum of the constraint functions. We define the
Lagrangian L : Rn ×Rm ×Rp → R associated with the problem ( 1.6) as
\[ L(x,\lambda,\nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \qquad (1.7) \]
with dom L = D × Rm × Rp. We refer to λi as the Lagrange multiplier associated with the i-th in-
equality constraint fi(x) ≤ bi; similarly, we refer to νi as the Lagrange multiplier associated with
the i-th equality constraint hi(x) = bi. The vectors λ and ν are called the Lagrange multiplier
vectors associated with the problem (1.6).
1.4.2 The Lagrange dual function
We define the Lagrange dual function g : Rm × Rp → R as the minimum value of the La-
grangian over x: for λ ∈ Rm,ν ∈ Rp
\[ g(\lambda,\nu) = \inf_{x \in \mathcal{D}} L(x,\lambda,\nu) = \inf_{x \in \mathcal{D}} \Big( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \Big) \qquad (1.8) \]
When the Lagrangian is unbounded below in x, the dual function takes on the value −∞.
Since the dual function is the point-wise infimum of a family of affine functions of (λ,ν), it is
concave, even when the problem (1.6) is not convex.
1.4.3 The Lagrange dual problem
The Lagrange dual problem, usually referred to as the Lagrangian dual problem, is
obtained by forming the Lagrangian (1.7), i.e., by adding the constraints to the objective function
weighted by non-negative Lagrange multipliers, and then solving for the primal variable
values that minimize the Lagrangian. This solution gives the primal variables as functions
of the Lagrange multipliers, which are called dual variables, so that the new problem is to
maximize the objective function with respect to the dual variables under the derived constraints
on the dual variables (including at least non-negativity):
\[
\begin{aligned}
\text{maximize} \quad & g(\lambda,\nu) \\
\text{subject to} \quad & \lambda \succeq 0
\end{aligned} \qquad (1.9)
\]
1.4.4 Dual Ascent
The Dual Ascent is an optimization method for solving the Lagrangian Dual problem. Con-
sider the equality-constrained convex optimization problem
minimize f(x)
subject to Ax = b
(1.10)
with variable x ∈ Rn, where A ∈ Rm×n and f : Rn → R is convex. The Lagrangian for problem
(1.10) is
\[ L(x,y) = f(x) + y^{T}(Ax - b) \qquad (1.11) \]
and the dual function is
\[ g(y) = \inf_{x} L(x,y) = -f^{*}(-A^{T}y) - b^{T}y \qquad (1.12) \]
where y is the vector of Lagrange multipliers (the dual variable), and f∗ is the convex conjugate
of f. The associated dual problem is
maximize g(y) (1.13)
Assuming that strong duality holds, the optimal values of the primal and dual problems are the
same. We can recover a primal optimal point x⋆ from a dual optimal point y⋆ as
\[ x^{\star} = \underset{x}{\arg\min}\; L(x, y^{\star}) \qquad (1.14) \]
provided there is only one minimizer of L(x, y⋆), which is the case if f is strictly convex.
The dual problem is solved using gradient ascent. Assuming that g is differentiable,
the gradient ∇g(y) can be evaluated as follows: we first find x+ = argmin_x L(x, y); then
∇g(y) = Ax+ − b, which is the residual of the equality constraint. The dual ascent
method consists of iterating the updates
\[ x^{k+1} = \underset{x}{\arg\min}\; L(x, y^{k}) \qquad (1.15) \]
\[ y^{k+1} = y^{k} + \alpha^{k}\,(Ax^{k+1} - b) \qquad (1.16) \]
where αk > 0 is a step size. The first step is an x-minimization step, and the second step is a
dual variable update. The dual ascent method can be used even in cases when g is not differen-
tiable. If αk is chosen appropriately and several other assumptions hold, then xk converges to
an optimal point and yk converges to an optimal dual point. However, these assumptions do not
hold in many applications, so dual ascent often cannot be used.
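The updates (1.15)-(1.16) are easy to transcribe; the sketch below (synthetic data, not the thesis implementation) applies dual ascent to the strictly convex problem minimize ½‖x − c‖² subject to Ax = b, for which the x-minimization has a closed form.

```python
import numpy as np

# Dual ascent sketch for:  minimize 0.5*||x - c||^2  subject to  A x = b
rng = np.random.default_rng(0)
n, p = 10, 3
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
c = rng.standard_normal(n)

y = np.zeros(p)                                  # dual variable
alpha = 1.0 / np.linalg.norm(A, 2) ** 2          # step below 2/lambda_max(A A^T) ensures convergence
for k in range(500):
    x = c - A.T @ y                              # x-update: argmin_x L(x, y) in closed form
    y = y + alpha * (A @ x - b)                  # dual update along the gradient of g

print("primal residual ||Ax - b|| =", np.linalg.norm(A @ x - b))
```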
1.4.5 Augmented Lagrangian and the Method of Multipliers
Augmented Lagrangian methods were developed in part to bring robustness to the dual
ascent method and in particular to yield convergence without assumptions such as strict convexity
or finiteness of f. The augmented Lagrangian for (1.10) is
\[ L_{\rho}(x,y) = f(x) + y^{T}(Ax - b) + \tfrac{\rho}{2}\,\lVert Ax - b \rVert_2^2 \qquad (1.17) \]
where ρ > 0 is called the penalty parameter. The augmented Lagrangian can be viewed as the
(unaugmented) Lagrangian associated with the problem
\[
\begin{aligned}
\text{minimize} \quad & f(x) + \tfrac{\rho}{2}\,\lVert Ax - b \rVert_2^2 \\
\text{subject to} \quad & Ax = b
\end{aligned} \qquad (1.18)
\]
The gradient of the augmented dual function is found the same way as with the ordinary La-
grangian. Applying dual ascent to the modified problem yields the algorithm
\[ x^{k+1} = \underset{x}{\arg\min}\; L_{\rho}(x, y^{k}) \qquad (1.19) \]
\[ y^{k+1} = y^{k} + \rho\,(Ax^{k+1} - b) \qquad (1.20) \]
This is known as the method of multipliers for solving (1.10). It is the same as standard dual
ascent, except that the x-minimization step uses the augmented Lagrangian, and the penalty param-
eter ρ is used as the step size αk. The method of multipliers converges under far more general
conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly
convex. Finally, the greatly improved convergence properties of the method of multipliers over
dual ascent comes at a cost: when f is separable, the augmented Lagrangian Lρ is not separable,
so the x-minimization step cannot be split into independent subproblems and carried out in parallel.
1.5 Alternating Direction Method of Multipliers
The Alternating Direction Method of Multipliers (ADMM) [105] is a Lagrangian based
approach intended to blend the decomposability of dual ascent with the superior properties of
the method of multipliers. Consider a problem of the form
\[ \underset{x \in \mathbb{R}^{m},\, z \in \mathbb{R}^{n}}{\text{minimize}} \;\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c \qquad (1.21) \]
where f : Rm → R and g : Rn → R are convex functions, A ∈ Rd×m and B ∈ Rd×n are
(known) constraint matrices, and c ∈ Rd is a constraint vector. To solve this problem
we introduce a vector y ∈ Rd of Lagrange multipliers associated with the constraint, and then
consider the augmented Lagrangian
\[ L_{\rho}(x,z,y) = f(x) + g(z) + y^{T}(Ax + Bz - c) + \tfrac{\rho}{2}\,\lVert Ax + Bz - c \rVert_2^2 \qquad (1.22) \]
where ρ > 0 is a fixed penalty parameter. The ADMM algorithm is based on minimizing the
augmented Lagrangian (1.22) successively over x and z, and then applying a dual
variable update to y. Doing so yields the updates
\[ x^{k+1} = \underset{x \in \mathbb{R}^{m}}{\arg\min}\; L_{\rho}(x, z^{k}, y^{k}) \qquad (1.23) \]
\[ z^{k+1} = \underset{z \in \mathbb{R}^{n}}{\arg\min}\; L_{\rho}(x^{k+1}, z, y^{k}) \qquad (1.24) \]
\[ y^{k+1} = y^{k} + \rho\,(Ax^{k+1} + Bz^{k+1} - c) \qquad (1.25) \]
for iterations k = 0,1,2,.... The algorithm is very similar to dual ascent and the method of
multipliers: it consists of an x-minimization step (1.23), a z-minimization step (1.24) and a
dual variable update (1.25). As in the method of multipliers, the dual update uses a step size equal
to the augmented Lagrangian parameter ρ. The method of multipliers, by contrast, has the form
\[ (x^{k+1}, z^{k+1}) = \underset{x,z}{\arg\min}\; L_{\rho}(x, z, y^{k}) \qquad (1.26) \]
\[ y^{k+1} = y^{k} + \rho\,(Ax^{k+1} + Bz^{k+1} - c) \qquad (1.27) \]
Here the augmented Lagrangian is jointly minimized w.r.t. the two primal variables. In ADMM,
on the other hand, x and z are updated in alternating or sequential fashion, which accounts for
the term alternating direction. The ADMM framework has several advantages. First, convex
problems with non-differentiable terms can be easily handled thanks to the separation of the param-
eters x and z. A second advantage of ADMM is its ability to break up a large problem into
smaller pieces: for datasets with a large number of observations, we can break up the data into
blocks, and carry out the optimization over each block.
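To make the updates (1.23)-(1.25) concrete, the sketch below (illustrative code, not the thesis implementation) instantiates them for the LASSO problem discussed in Chapter 2, minimize ½‖Ax − b‖² + λ‖z‖₁ subject to x − z = 0: the x-update is a linear solve, the z-update is an elementwise soft-thresholding, and u = y/ρ is the scaled dual variable.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding operator S_kappa(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """ADMM sketch for: minimize 0.5*||A x - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u is the scaled dual variable y/rho
    AtA_rhoI = A.T @ A + rho * np.eye(n)              # formed once; a factorization could be cached
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))   # x-update (eq. 1.23)
        z = soft_threshold(x + u, lam / rho)                 # z-update (eq. 1.24)
        u = u + x - z                                        # scaled dual update (eq. 1.25)
    return z

# tiny synthetic check
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -3.0, 1.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(admm_lasso(A, b, lam=1.0), 2))
```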
1.5.1 Convergence
Under modest assumptions on f and g, the ADMM iterates, for any ρ > 0, satisfy:
• Residual convergence: rk → 0 as k → ∞, i.e., the primal iterates approach feasibility.
• Objective convergence: f(xk) + g(zk) → p⋆ as k → ∞, i.e., the objective value of the iterates approaches the optimal value.
• Dual variable convergence: yk → y⋆ as k → ∞, where y⋆ is a dual optimal point.
Here, rk = Axk + Bzk − c denotes the primal residual at iteration k.
1.5.2 Convergence in practice
Simple examples show that ADMM can be very slow to converge to high accuracy. How-
ever, it is often the case that ADMM converges to modest accuracy, sufficient for many applications,
within a few tens of iterations. This behavior makes ADMM similar to algorithms like the con-
jugate gradient method. However, the slow convergence of ADMM also distinguishes it from
algorithms such as Newton’s method, where high accuracy can be attained in a reasonable
amount of time.
1.5.3 Extensions and Variations
In practice, ADMM obtains a relatively accurate solution in a handful of iterations, but
requires many iterations for a highly accurate solution. Hence, it behaves more like a first-order
method than a second-order method. In order to attain superior convergence, many variations
on ADMM have been explored in the literature.
A standard extension is to use possibly different penalty parameters ρk for each iteration,
with the goal of improving the convergence in practice, as well as making performance less
dependent on the initial choice of the penalty parameter. Though it can be difficult to prove the
convergence of ADMM when ρ varies by iteration, the fixed ρ theory still applies if one just
assumes that ρ becomes fixed after a finite number of iterations. A simple scheme that often
works well is [2]
ρk+1
=



τincrρk if rk
2
> µ sk
2
ρk/τdecr if sk
2
> µ rk
2
ρk otherwise,
(1.28)
where µ > 1, τincr > 1, and τdecr > 1 are parameters, and rk and sk denote the primal and dual
residuals at iteration k (with sk+1 = ρA^T B(zk+1 − zk)). Typical choices are µ = 10 and
τincr = τdecr = 2.
The form of the augmented Lagrangian (1.22) suggests that large values of ρ place a large penalty on
violations of primal feasibility and so tend to produce small primal residuals. Conversely, the
definition of sk+1 suggests that small values of ρ tend to reduce the dual residual, but at the
expense of reducing the penalty on primal feasibility, which may result in larger primal residuals.
Therefore, the adjustment scheme (1.28) inflates ρ by τincr when the primal residual appears
large compared to the dual residual, and deflates ρ by τdecr when the primal residual seems too
small relative to the dual residual.
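In code, rule (1.28) amounts to a few lines; the helper below (a sketch that could be called once per iteration of an ADMM loop such as the LASSO example above) uses the typical parameter values quoted in the text.

```python
def update_rho(rho, r_norm, s_norm, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    """Residual-balancing update of the ADMM penalty parameter, eq. (1.28)."""
    if r_norm > mu * s_norm:        # primal residual too large: increase rho
        return rho * tau_incr
    if s_norm > mu * r_norm:        # dual residual too large: decrease rho
        return rho / tau_decr
    return rho                      # residuals balanced: keep rho unchanged
```

Note that in the scaled form of ADMM the scaled dual variable u = y/ρ must be rescaled by ρ_old/ρ_new whenever ρ changes, and any cached factorization that depends on ρ has to be recomputed.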
Chapter 2
Sparse statistical models
In this Chapter, we summarize the actively developing field of Statistical Learning with
Sparsity. In particular, we introduce the LASSO estimator for linear regression. We describe
the basic LASSO method, and outline a simple approach for its implementation. We relate and
compare LASSO to Ridge Regression, and give a brief description of the latter.
2.1 Sparse Models
Nowadays, large quantities of data are collected and mined in nearly every area of sci-
ence, entertainment, business, and industry. Medical scientists study the genomes of patients
to choose the best treatments and to learn the underlying causes of their disease. Thus
the world is overwhelmed with data and there is a crucial need to sort through this mass of
information, and pare it down to its bare essentials.
For this process to be successful, we need to hope that the world is not as complex as it
might be. For example, we hope that not all of the 30,000 or so genes in the human body are
directly involved in the process that leads to the development of cancer.
This points to an underlying assumption of simplicity. One form of simplicity is sparsity.
Broadly speaking, a sparse statistical model is one in which only a relatively small number of
parameters (or predictors) play an important role. It represents a classic case of "less is more":
a sparse model can be much easier to estimate and interpret than a dense model. The sparsity
assumption allows us to tackle such problems and extract useful and reproducible patterns from
big datasets. The leading example is linear regression, which we will discuss throughout this
Chapter.
2.2 Introduction
In the linear regression setting, we are given N samples {(xi, yi)}, i = 1,...,N, where each
xi = (xi,1,...,xi,p) is a p-dimensional vector of features or predictors, and each yi ∈ R is the associated re-
sponse variable. The goal is to approximate the response variable yi using a linear combination
of the predictors. A linear regression model assumes that
\[ \eta(x_i) = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + e_i \qquad (2.1) \]
The model is parameterized by the vector of regression weights β = (β1,...,βp) ∈ Rp and the
intercept (or "bias") term β0 ∈ R, which are unknown parameters, and ei is an error term. The
method of Least Squares (LS) provides estimates of the parameters by the minimization of the
following squared-loss function:
\[ \underset{\beta_0,\,\beta}{\text{minimize}} \;\Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} \Bigg\} \qquad (2.2) \]
Typically all of the LS estimates from (2.2) will be nonzero. This will make interpretation of
the final model challenging if p is large. In fact, if p > N, the LS estimates are not unique.
There is an infinite set of solutions that make the objective function equal to zero, and these
solutions almost surely overfit the data as well. There are two reasons why we might consider
an alternative to the LS estimate:
1. The prediction accuracy: the LS estimate often has low bias but large variance, and
prediction accuracy can sometimes be improved by shrinking the values of the regression
coefficients, or setting some coefficients to zero. By doing so, we introduce some bias but
reduce the variance of the predicted values, hence this may improve the overall prediction
accuracy (as measured in terms of the Mean Squared Error (MSE)).
2. The purposes of interpretation: with a large number of predictors, we often would like to
identify a smaller subset of these predictors that exhibit the strongest effects.
Thus, there is a need to constrain, or regularize, the estimation process. The lasso, or ℓ1-regularized
regression, is a method that combines the LS loss (2.2) with an ℓ1-constraint, i.e., a bound on the sum
of the absolute values of the coefficients. Relative to the LS solution, this constraint has the ef-
fect of shrinking the coefficients, and even setting some to zero. In this way, it provides an
automatic way for doing model selection in linear regression. Moreover, unlike some other cri-
teria for model selection, the resulting optimization problem is convex, and can be efficiently
solved for large problems.
2.3 Ridge Regression
Ridge Regression performs ℓ2 regularization, i.e., it adds a penalty proportional to the sum of squares of
the coefficients to the optimization objective. Thus, ridge regression optimizes the following
\[ \underset{\beta \in \mathbb{R}^{p}}{\text{minimize}} \;\Bigg\{ \underbrace{\frac{1}{2N}\,\lVert y - X\beta \rVert_2^2}_{\text{loss}} + \underbrace{\delta\,\lVert \beta \rVert_2^2}_{\text{penalty}} \Bigg\} \qquad (2.3) \]
where δ is the parameter which balances the amount of emphasis given to minimizing the Resid-
ual Sum of Squares (RSS) versus minimizing the sum of squares of the coefficients. The δ
parameter can take different values:
• δ = 0: the objective reduces to simple Linear Regression and we obtain the same
coefficients as simple linear regression;
• δ = ∞: the coefficients will be zero, because with an infinite weight on the squared coeffi-
cients any nonzero coefficient makes the objective infinite;
• 0 < δ < ∞: the magnitude of δ decides the weight given to the two parts of the ob-
jective, and the coefficients will lie between zero and those of simple linear regression.
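Because the ridge objective (2.3) is smooth, it admits a closed-form solution; the sketch below (illustrative, with synthetic data) derives it for the 1/2N loss scaling used above and compares it with ordinary least squares.

```python
import numpy as np

def ridge_closed_form(X, y, delta):
    """Ridge solution for eq. (2.3): (1/2N)*||y - X b||^2 + delta*||b||^2.
    Setting the gradient to zero gives b = (X^T X + 2*N*delta*I)^{-1} X^T y."""
    N, p = X.shape
    return np.linalg.solve(X.T @ X + 2 * N * delta * np.eye(p), X.T @ y)

# synthetic comparison with ordinary least squares
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(30)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]            # delta = 0 recovers LS
print(np.round(beta_ls, 2))
print(np.round(ridge_closed_form(X, y, delta=0.1), 2))    # shrunk toward zero, never exactly zero
```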
2.4 Least Absolute Shrinkage and Selection Operator
(LASSO)
In statistics and machine learning, Least Absolute Shrinkage and Selection Operator (LASSO)
[28] is a regression analysis method that performs both variable selection and regularization in
order to enhance the prediction accuracy and interpretability of the statistical model it produces.
Figure 2.1: Estimation picture for the lasso (left) and ridge regression (right). Shown
are contours of the error and constraint functions. The solid blue areas are the constraint
regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours
of the least-squares error function.
2.4.1 The LASSO estimator
Given a collection of N predictor-response pairs {(xi, yi)}, i = 1,...,N, the LASSO finds the solu-
tion (β̂0, β̂) to the optimization problem
\[
\begin{aligned}
\underset{\beta_0,\,\beta}{\text{minimize}} \quad & \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} \\
\text{subject to} \quad & \sum_{j=1}^{p} |\beta_j| \le t.
\end{aligned} \qquad (2.4)
\]
The constraint \(\sum_{j=1}^{p} |\beta_j| \le t\) can be written as the ℓ1-norm constraint ‖β‖1 ≤ t. Furthermore, (2.4)
is often represented using matrix-vector notation. Let y = (y1,...,yN) denote the N-vector
of responses, and let X be the N × p matrix with xi ∈ Rp in its i-th row; then the optimization
problem (2.4) can be re-expressed in matrix form as
\[
\begin{aligned}
\underset{\beta_0,\,\beta}{\text{minimize}} \quad & \frac{1}{2N}\,\lVert y - \beta_0\mathbf{1} - X\beta \rVert_2^2 \\
\text{subject to} \quad & \lVert \beta \rVert_1 \le t
\end{aligned} \qquad (2.5)
\]
where 1 is the vector of N ones, and ‖·‖2 denotes the usual Euclidean norm on vectors. The
bound t is a kind of ‘budget’: it limits the sum of the absolute values of the parameter estimates.
It is often convenient to rewrite the LASSO problem in the so-called Lagrangian form (cf. eq. (1.7))
\[ \underset{\beta \in \mathbb{R}^{p}}{\text{minimize}} \;\Bigg\{ \underbrace{\frac{1}{2N}\,\lVert y - X\beta \rVert_2^2}_{\text{loss}} + \underbrace{\lambda\,\lVert \beta \rVert_1}_{\text{penalty}} \Bigg\} \qquad (2.6) \]
for some λ ≥ 0. The ℓ1 penalty promotes sparse solutions: as λ is increased,
elements of β become exactly zero. Due to the non-differentiability of the penalty function,
there is no closed-form solution to equation (2.6).
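In practice, solvers for (2.6) are readily available; for instance, scikit-learn's Lasso documents the same 1/(2N) loss scaling, so its alpha parameter plays the role of λ (the data and the value alpha = 0.1 below are arbitrary illustrations).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[[0, 3, 7]] = [2.0, -1.5, 3.0]           # only three active predictors
y = X @ beta_true + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1)                           # alpha plays the role of lambda in (2.6)
lasso.fit(X, y)
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
```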
2.4.2 LASSO Regression
LASSO Regression is a powerful technique generally used for creating models that deal with a
‘large’ number of features. Lasso regression performs ℓ1 regularization, i.e., it adds a penalty
proportional to the sum of the absolute values of the coefficients to the optimization objective. It works by
penalizing the magnitude of the coefficients/estimates while minimizing the error between predicted and
actual observations: the larger the penalty applied, the further the coefficients are shrunk towards
zero. Thus, LASSO regression optimizes the following

\[ \underset{\beta \in \mathbb{R}^{p}}{\text{minimize}} \;\Bigg\{ \frac{1}{2N}\,\lVert y - X\beta \rVert_2^2 + \lambda\,\lVert \beta \rVert_1 \Bigg\} \qquad (2.7) \]

where λ plays a role similar to that of δ in problem (2.3) and provides a trade-off between the RSS and
the sum of the absolute values of the coefficients. As in ridge regression, λ can take various values:
• λ = 0: the objective reduces to simple Linear Regression and we obtain the same
coefficients as simple Linear Regression;
• λ = ∞: the coefficients will all be zero, as for Ridge;
• 0 < λ < ∞: the coefficients will lie between zero and those of simple Linear Regression.

Figure 2.2: The ridge coefficients (green) are a reduced factor of the simple linear regression
coefficients (red) and thus never attain zero but only very small values, whereas the
lasso coefficients (blue) become zero in a certain range and are reduced by a constant
factor, which explains their low magnitude in comparison to ridge.
2.4.3 Computation of LASSO solution
The LASSO problem is a convex program, specifically a QP with a convex constraint. As
such, there are many sophisticated QP methods for solving the LASSO. However, there is a par-
ticularly simple and effective computational algorithm that gives insight into how the LASSO
works. For convenience, we rewrite the criterion in Lagrangian form
\[ \underset{\beta \in \mathbb{R}^{p}}{\text{minimize}} \;\Bigg\{ \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}. \qquad (2.8) \]
As discussed in Chapter 1, the Lagrangian form is especially convenient for numerical compu-
tation of the solution by using techniques such as the Dual Ascent, the Augmented Lagrangian
and the Method of Multipliers or the Alternating Direction Method of Multipliers.
Figure 2.3: The soft-thresholding function Sλ(x) = sign(x)(|x| − λ)+ is shown in blue (broken
lines), along with the 45° line in black.
2.4.4 Single Predictor: Soft Thresholding
Soft thresholding has become a very popular tool in computer vision and machine learning.
Essentially, it allows one to tackle problem (2.8) and solve it very quickly. Since ℓ1
penalties are used nearly everywhere at the moment, the ability of soft thresholding to
efficiently find the solution of (2.8) is very useful. Let us first consider a
single-predictor setting, based on samples {(zi, yi)}, i = 1,...,N. The problem then is to solve
\[ \underset{\beta}{\text{minimize}} \;\Bigg\{ \frac{1}{2N} \sum_{i=1}^{N} (y_i - z_i\beta)^{2} + \lambda\,|\beta| \Bigg\}. \qquad (2.9) \]
The standard approach to this univariate minimization problem would be to take the gradient
(first derivative) with respect to β and set it to zero. There is a complication, however, because
the absolute value function does not have a derivative at β = 0. Nevertheless, we can proceed by
direct inspection of the function (2.9), and find that
\[
\hat{\beta} =
\begin{cases}
\frac{1}{N}\langle z, y\rangle - \lambda & \text{if } \frac{1}{N}\langle z, y\rangle > \lambda \\
0 & \text{if } \big|\frac{1}{N}\langle z, y\rangle\big| \le \lambda \\
\frac{1}{N}\langle z, y\rangle + \lambda & \text{if } \frac{1}{N}\langle z, y\rangle < -\lambda
\end{cases} \qquad (2.10)
\]
(assuming the predictor values are standardized so that \(\frac{1}{N}\sum_{i=1}^{N} z_i^2 = 1\)), which can be written compactly as
\[ \hat{\beta} = S_{\lambda}\!\Big(\frac{1}{N}\langle z, y\rangle\Big). \qquad (2.11) \]
Here the soft-thresholding operator
\[ S_{\lambda}(x) = \operatorname{sign}(x)\,(|x| - \lambda)_{+} \qquad (2.12) \]
translates its argument x toward zero by the amount λ, and sets it to zero if |x| ≤ λ.
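Equations (2.10)-(2.12) translate directly into code; the sketch below assumes, as noted above, that the single predictor z is standardized.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding operator S_lambda(x) = sign(x) * (|x| - lambda)_+, eq. (2.12)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(z, y, lam):
    """Closed-form single-predictor LASSO solution, eq. (2.11), for standardized z."""
    N = len(y)
    return soft_threshold(np.dot(z, y) / N, lam)

# tiny example
rng = np.random.default_rng(0)
z = rng.standard_normal(200)
z = (z - z.mean()) / z.std()                  # standardize so that (1/N) * sum(z_i^2) = 1
y = 0.7 * z + 0.1 * rng.standard_normal(200)
print(lasso_single_predictor(z, y, lam=0.2))  # roughly 0.7 - 0.2 = 0.5
```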
2.5 Key Difference
Although similar in formulation, the benefits and properties of Ridge and LASSO are quite
different. In Ridge Regression, we aim to reduce the variance of the estimators and predictions,
which is particularly helpful in the presence of multicollinearity. It includes all (or none) of the
features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and
reducing model complexity. On the other hand, LASSO is a tool for model (predictor) selection
and consequently for the improvement of interpretability.
A fundamental difference between these two regression models is that the penalty term of the
LASSO uses the ℓ1 norm while ridge uses the squared ℓ2 norm. This difference has a simple impli-
cation on the solutions of the two optimization problems. In fact, the ℓ2 penalty of ridge regres-
sion leads to a shrinkage of the regression coefficients, much like the ℓ1 penalty of the LASSO,
but the coefficients are not forced to be exactly zero for finite values of the regularization parameter.
This phenomenon of coefficients being exactly zero is called sparsity. In addition, a benefit of
ridge regression is that a unique solution is available even when the data matrix X is rank
deficient, e.g., when there are more predictors than observations (p > N).
Chapter 3
Data Size Reduction
In this Chapter, we will introduce the Curse of Dimensionality and examine the
two main approaches used for tackling this problem. We will then give an overview of the main
approaches for feature selection, particularly focusing on algorithms for flat feature selection,
where features are assumed to be independent. Finally, we will examine some application
domains where these methods are mostly relevant.
3.1 Dimensionality Reduction
Nowadays, the growth of high-throughput technologies has resulted in exponential growth
of the harvested data w.r.t. both dimensionality and sample size. Efficient and effective manage-
ment of this data becomes increasingly challenging. Traditionally, manual management of these
datasets has proved to be impractical. Therefore, data mining and machine learning techniques
were developed to automatically discover knowledge and recognize patterns from the data. In
industry, this trend has been referred to as ‘Big Data’, and it has a significant impact in areas
as varied as artificial intelligence, internet applications, computational biology, medicine, fi-
nance, marketing, etc. However, this collected data is usually associated with a high level of
noise. There are many variables that can cause noise in the data, some examples of this are
an imperfection in the technology that compiles the data or the source of the data itself. Di-
mensionality reduction is one of the most popular techniques to remove noise and redundant
features. Dimensionality reduction is important, because a high number of features in a dataset,
comparable to or higher than the number of samples, leads to model overfitting, which in turn
leads to poor results on the testing datasets. Additionally, constructing models from datasets
with many features is more computationally demanding [3].
Dataset size reduction [4] can be performed in one of two ways: feature set reduction or
sample set reduction. In this thesis, we focus on feature set reduction. Feature set reduction,
also known as dimensionality reduction, can be mainly categorized into feature extraction and
feature selection.
3.2 Feature extraction
Feature extraction maps the original feature space to a new feature space with lower dimen-
sion. It is generally difficult to link the features of the original feature space to the new features. Therefore,
further analysis of the new features is problematic, since the transformed features obtained from
feature extraction techniques have no physical meaning. Feature extraction can be used
in this context to reduce complexity and give a simple representation of the data, representing each
variable in the new feature space as a linear combination of the original input variables. The most popular
and widely used feature extraction techniques include Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA) and Multidimen-
sional Scaling [4].
3.3 Feature selection
Feature selection –the process of selecting a subset of relevant features– is a key component
in building robust machine learning models for classification, clustering, and other tasks. It has
been playing an important role in many applications since it can speed up the learning process,
lead to better learning performance (e.g., higher learning accuracy for classification), lower
computational costs, improve model interpretability and alleviate the effect of the Curse
of Dimensionality [5]. Moreover, in the presence of a large number of features, a learning
model tends to overfit, resulting in degraded performance. To address the problem of
the curse of dimensionality, dimensionality reduction techniques have been studied. This is an
important branch in the machine learning and data mining research areas.
In addition, unlike feature extraction, feature selection selects a subset of features from the
original feature set without any transformation, and maintains the physical meanings of the
original features. In this sense, feature selection is superior in terms of better readability and
interpretability. This property has its significance in many practical applications such as finding
relevant genes to a specific disease and building a sentiment lexicon for sentiment analysis. For
the classification problem, feature selection aims to select subset of highly discriminant features
which are capable of discriminating samples that belong to different classes.
Depending on whether the training set is labelled or not, feature selection algorithms can be catego-
rized into supervised [47, 48], unsupervised [49, 50] and semi-supervised learning algorithms
[51, 52]. For supervised learning, feature selection algorithms maximize some functions of the
predictive accuracy. Because we are given class labels, it is natural that we want to keep only the
features that are related to or lead to these classes. For unsupervised learning, feature selection
is a less constrained search problem without class labels, depending on clustering quality mea-
sures [53], and can result in many equally valid feature subsets. Because no label
information is directly available, it is much more difficult to select discriminative features. Super-
vised feature selection assesses the relevance of features guided by the label information but a
good selector needs enough labeled data, which in turn is time consuming. While unsupervised
feature selection works with unlabeled data, it is difficult to evaluate the relevance of features.
It is common to have a data set with huge dimensionality but a small labeled-sample size; the
combination of these two data characteristics poses a new research challenge. Under the as-
sumption that labeled and unlabeled data are sampled from the same population generated by
the target concept, semi-supervised feature selection makes use of both labeled and unlabeled data
to estimate feature relevance [54].

Figure 3.1: A general Framework of Feature Selection for Classification.
Typically, a feature selection method consists of four basic steps [67], namely, subset gen-
eration, subset evaluation, stopping criterion, and result validation. In the first step, a candidate
feature subset will be chosen based on a given search strategy, which is sent, in the second step,
to be evaluated according to a certain evaluation criterion. The subset that best fits the evaluation
criterion will be chosen from all the candidates that have been evaluated after the stopping cri-
terion is met. In the final step, the chosen subset will be validated using domain knowledge or
a validation set.
Feature selection is an NP-hard problem [7]: if there are n features in total, finding the optimal
subset of m ≪ n features requires evaluating \(\binom{n}{m}\) candidate subsets. Therefore,
sub-optimal search strategies are considered.
3.4 Feature Selection for Classification
Feature selection is based on the terms of feature relevance and redundancy w.r.t. the goal
(i.e., classification). More specifically, a feature is usually categorized as: 1) strongly relevant,
2) weakly relevant, but not redundant, 3) irrelevant, and 4) redundant [55, 56]. A strongly rele-
vant feature is always necessary for an optimal feature subset; it cannot be removed without af-
fecting the original conditional target distribution [55]. A weakly relevant feature may not always
be necessary for an optimal subset; this may depend on certain conditions. Irrelevant features
are not directly associated with the target concept but affect the learning process. Redundant
features are those that are weakly relevant but can be completely replaced with a set of other
features. Redundancy is thus always inspected in the multivariate case (when examining a feature
subset), whereas relevance is established for individual features. The aim of feature selection
is to maximize relevance and minimize redundancy. It usually includes finding a feature subset
consisting of only relevant features. In many classification problems, it is difficult to learn good
classifiers before removing these unwanted features, due to the huge size of the data. Reducing
the number of irrelevant/redundant features can drastically reduce the running time of the learn-
ing algorithms and yield a more general classifier. A general feature selection for classification
framework is demonstrated in Figure 3.1. Usually, feature selection for classification attempts
to select a subset of features of minimal size according to the following criteria: 1)
the classification accuracy does not significantly decrease and, 2) the resulting class distribution,
given only the values of the selected features, is as close as possible to the original class
distribution given all features.

Figure 3.2: Taxonomy of Algorithms for Feature Selection for Classification.
The selection process can be achieved in a number of ways depending on the goal, the
resources at hand, and the desired level of optimization. In this chapter, we categorize feature
selection for classification into three classes (Fig. 3.2):
1. Methods for flat features.
2. Methods for structured features.
3. Methods for streaming features.
3.5 Flat Feature Selection Techniques
In this section, we will review algorithms for flat features, where features are assumed to
be independent. Algorithms in this category can be mainly classified into filters, wrappers,
embedded and hybrid methods.
Figure 3.3: A General Framework for Wrapper Methods of Feature Selection for Classifica-
tion.
3.5.1 Filter methods
Filter methods select features based on a performance measure without utilizing any clas-
sification algorithms [57]. A typical filter algorithm consists of two steps: in the first step, it
ranks features based on certain criteria; in the second step, the features with the highest rankings
are selected. A filter algorithm can rank individual features or evalu-
ate entire feature subsets. We can roughly classify the developed measures for feature filtering
into: information, distance, consistency, similarity, and statistical measures. Moreover, the fil-
ter methods can be either Univariate or Multivariate. Univariate feature filters rank each feature
independently of the feature space whereas the Multivariate feature filters evaluate an entire
feature subset (batch way). Therefore, the Multivariate approach has the ability of handling
redundant features. Among all the filter methods, we can highlight the Relief-F [8], Fisher [9],
LaplacianScore [10] and so on. Filter models select features independently of any specific classi-
fier. However, the major disadvantage of the filter approach is that it totally ignores the effects
of the selected feature subset on the performance of the induction algorithm [58, 59].
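As an example of the univariate filter criteria mentioned above, the sketch below computes a Fisher-type score (between-class scatter over within-class scatter, a common formulation that may differ in detail from the variant of [9]) and ranks the features accordingly.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher-type score per feature: between-class scatter over within-class scatter."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        nc = Xc.shape[0]
        num += nc * (Xc.mean(axis=0) - overall_mean) ** 2
        den += nc * Xc.var(axis=0)
    return num / (den + 1e-12)              # small constant guards against zero variance

# rank features on a toy two-class problem
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = rng.integers(0, 2, size=100)
X[:, 0] += 3 * y                             # make feature 0 strongly class-dependent
print("ranking (best first):", np.argsort(fisher_score(X, y))[::-1])
```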
3.5.2 Wrapper methods
Wrapper models utilize a specific classifier to evaluate the quality of selected features, and
offer a simple and powerful way to address the problem of feature selection, regardless of the
chosen learning machine [58, 60]. A wrapper method optimizes a specific classifier as part of
the selection process in order to evaluate the quality of the selected features. Thus, for
classification tasks, a wrapper will evaluate subsets based on the classifier performance (e.g.
Naïve Bayes or SVM)[12, 13], whereas for clustering, a wrapper will evaluate subsets based on
the performance of a clustering algorithm (e.g. K-means) [14]. Given a predefined classifier, a
typical wrapper model will perform the following steps:
1. Searching a subset of features.
2. Evaluating the selected subset of features by the performance of the classifier.
3. Repeating 1 and 2 until the desired quality is reached.
A general representation for wrapper methods of feature selection for classification is shown
in Fig. 3.3. Wrapper methods tend to give better results, but they are much slower than filters in
finding sufficiently good subsets because they must repeatedly train and evaluate the underlying
model (a minimal greedy forward-selection sketch is given below). Therefore, when it is necessary
to effectively handle datasets with a huge number of features, filter methods are indispensable in
order to obtain a reduced set of features that can then be treated by other, more expensive feature
selection methods.
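The sketch below (illustrative only; the classifier, dataset and subset size are arbitrary choices) implements steps 1-3 above as a greedy forward search scored by cross-validated accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def greedy_forward_selection(X, y, clf, k, cv=5):
    """Wrapper FS: repeatedly add the feature whose inclusion maximizes CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean()
                  for j in remaining}                 # step 2: evaluate each candidate subset
        best = max(scores, key=scores.get)
        selected.append(best)                         # step 1: grow the subset with the best feature
        remaining.remove(best)                        # step 3: repeat until k features are chosen
    return selected

X, y = make_classification(n_samples=200, n_features=15, n_informative=4, random_state=0)
print(greedy_forward_selection(X, y, SVC(kernel="linear"), k=4))
```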
3.5.3 Embedded and hybrid methods
Embedded methods perform feature selection by injecting the selection process into the mod-
elling algorithm’s execution. There are three types of embedded methods. The first are pruning
methods that initially utilize all features to train a model and then attempt to eliminate some fea-
tures by setting the corresponding coefficients to zero, while maintaining model performance
such as recursive feature elimination using support vector machine (SVM) [62]. The second
are models with a built-in mechanism for feature selection such as ID3 [62] and C4.5 [63]. The
third are regularization models, whose objective functions minimize fitting errors and at the
same time force the coefficients to be small or exactly zero. Features with coefficients that
are close to zero are then eliminated [64].
3.5.4 Regularization models
Recently, regularization models have attracted much attention in the feature selection con-
text due to their good compromise between accuracy and computational complexity. In this
section, we only consider linear classifiers w, in which the classification of Y can be based on a
linear combination of X such as SVM and logistic regression. In regularization methods, classi-
fier induction and feature selection are achieved simultaneously by estimating w with properly
tuned penalties. The learned classifier w can have coefficients exactly equal to zero. Since
each coefficient of w corresponds to one feature (i.e., wi for fi), only features with nonzero
coefficients in w will be used in the classifier. Specifically, we define ˆw as
ˆw = minimize
w
f(w,X)+αg(w) (3.1)
where f(·) is the classification objective function, g(w) is a regularization term, and α is the
regularization parameter controlling the trade-off between the f(·) and the penalty. Popular
Popular choices of f(·) include the quadratic loss, as in Least Squares, the hinge loss, as in ℓ1-SVM [65], and the logistic loss, as in BlogReg [66]:

f(w, X) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 \qquad \text{(Quadratic loss)} \qquad (3.2)

f(w, X) = \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^\top x_i) \qquad \text{(Hinge loss)} \qquad (3.3)

f(w, X) = \sum_{i=1}^{n} \log\left(1 + \exp(-y_i (w^\top x_i + b))\right) \qquad \text{(Logistic loss)} \qquad (3.4)
One of the most important embedded models based on regularization is the LASSO regularization [28], which is based on the ℓ1-norm of the coefficients of w and is defined as

g(w) = \sum_{i=1}^{n} |w_i| \qquad (\ell_1 \text{ penalty}) \qquad (3.5)

An important property of the ℓ1 regularization is that it can generate an estimate of w with exactly zero coefficients. In other words, there are zero entries in w, which means that the corresponding features are eliminated during the classifier learning process. Therefore, it can be used for feature selection.
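As a sketch of how such a regularization model performs feature selection in practice (assuming scikit-learn; the ℓ1-penalized linear SVM and the value C = 0.1 are illustrative choices, not those used later in this thesis):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)

# l1-penalized linear SVM: classifier induction and feature selection happen
# together, since many coefficients of w are driven exactly to zero.
clf = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
selector = SelectFromModel(clf, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)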
Some other representative embedded models based on regularization are the Adaptive LASSO [68], Bridge regularization [69, 70] and Elastic net regularization [71]. Unlike the LASSO regularization, the latter methods are used, for instance, to counteract the known issue of LASSO estimates being biased for large coefficients (Adaptive LASSO) or to handle features with high correlations (Elastic net regularization).
3.6 Algorithms for Structured Features
The models previously introduced totally ignore the feature structure and assume that features are independent [76]. However, in many real-world applications the features exhibit certain intrinsic structures, e.g., spatial or temporal smoothness [77, 78], disjoint/overlapping groups [79], trees [80], and graphs [81]. Incorporating knowledge about the structure of the features may significantly improve the classification performance and help identify the important features. For example, in the study of arrayCGH data [77, 82], the features (the DNA copy numbers along the genome) have a natural spatial order, and incorporating this structure information through an extension of the ℓ1-norm outperforms the LASSO in both classification and feature selection. A very popular and successful approach to learning linear classifiers with structured features is to minimize an empirical error penalized by a regularization term:

\hat{w} = \arg\min_{w} \; f(w^\top X, Y) + \alpha\, g(w, \mathcal{G}) \qquad (3.6)
Figure 3.4: Illustration of LASSO, Group LASSO and Sparse Group LASSO. Features are grouped into 4 disjoint groups G1, G2, G3, G4. Each cell denotes a feature; white cells correspond to zero coefficients.
where \mathcal{G} denotes the structure of the features and α controls the trade-off between data fitting and regularization. Equation (3.6) leads to sparse classifiers, which makes them particularly apt for interpretation; this is often of primary importance in many applications such as biology or the social sciences [83]. Algorithms for structured feature selection can be distinguished according to the type of structure they exploit:
• Features with Group Structure: These take into account real-world applications where features form group structures. For instance, in speech and signal processing, different frequency bands can be represented by groups [84]. Depending on whether the features in the chosen groups are fully or partially selected, we identify two main algorithms: Group LASSO and Sparse Group LASSO. The former drives all coefficients of a group to zero together, whereas the latter also selects features within the selected groups, i.e., it performs simultaneous group selection and feature selection. Figure 3.4 illustrates the different solutions of LASSO, Group LASSO and Sparse Group LASSO for 4 disjoint groups {G1, G2, G3, G4}.
• Features with Tree Structure: These represent features using certain tree structures. For instance, genes/proteins may form hierarchical tree structures [85]; the pixels of a face image can be represented as a tree, where each parent node contains a series of child nodes that enjoy spatial locality. One of the best-known algorithms that exploit this structure is the Tree-guided group LASSO regularization [80, 85, 86]. A simple representation of this algorithm is depicted in Figure 3.5.
• Features with Graph Structure: These take into account real-world applications where dependencies exist between features. For example, many biological studies have suggested that genes tend to work in groups according to their biological functions, and there are regulatory relationships between genes. In these cases, features form an undirected graph, where the nodes represent the features and the edges represent the relationships between them. For features with graph structure, a subset of highly connected features in the graph is likely to be selected or not selected as a whole. For example, in Figure 3.6, {f5, f6, f7} are selected, while {f1, f2, f3, f4} are not selected.
Figure 3.5: An illustration of a simple index tree of height 3 with 8 features.
Figure 3.6: An illustration of the graph of 7 features {f1,f2,...,f7} and its representation
A.
3.7 Algorithms for Streaming Features
A different problem concerns the selection of features when they are sequentially presented to the classifier for potential inclusion in the model [87, 88, 89, 90]. All the methods introduced above assume that all features are known in advance. In this scenario, instead, the candidate features are generated dynamically and the total number of features is unknown. Such features are called streaming features, and feature selection for streaming features is called streaming feature selection. Streaming feature selection has practical significance in many applications. For example, the famous microblogging website Twitter produces more than 250 million tweets per day and many new words (features) are generated, such as abbreviations. When performing feature selection for tweets, it is not practical to wait until all features have been generated, so a streaming feature selection scheme may be preferable. A typical streaming feature selection method will perform the following steps (a minimal sketch is given below):
1. Generate a new feature.
2. Determine whether to add the newly generated feature to the set of currently selected features.
3. Determine whether to remove features from the set of currently selected features.
4. Repeat steps 1 to 3.
Since steps 2 and 3 account for most of the above procedure, in recent years many algorithms have been proposed for them (e.g., the Grafting algorithm [87], the Alpha-investing algorithm [91], and the Online Streaming Feature Selection algorithm [89]).
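The following is a generic skeleton of such a procedure (an illustrative sketch only, not an implementation of Grafting, Alpha-investing or OSFS): a new feature is accepted if it improves the cross-validated accuracy by more than a tolerance, and the pruning step is left out.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def streaming_selection(feature_stream, y, tol=0.005, cv=5):
    """Generic streaming-selection skeleton (illustrative only): a new feature
    is kept if it improves cross-validated accuracy by more than `tol`."""
    selected = []                                  # 1-D feature columns kept so far
    best_score = 0.0
    clf = LinearSVC(C=1.0, dual=False)
    for f in feature_stream:                       # step 1: a new feature arrives
        candidate = np.column_stack(selected + [f])
        score = cross_val_score(clf, candidate, y, cv=cv).mean()
        if score > best_score + tol:               # step 2: add it if it helps
            selected.append(f)
            best_score = score
        # step 3 (pruning previously selected features) is omitted in this sketch
    return np.column_stack(selected) if selected else None

X, y = make_classification(n_samples=120, n_features=40, n_informative=8, random_state=0)
X_sel = streaming_selection((X[:, j] for j in range(X.shape[1])), y)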
3.8 Feature Selection Application Domains
The choice of a feature selection method strongly depends on the target application area. In the following subsections, we give a brief review of the most important feature selection methods for some well-known application domains.
3.8.1 Text Mining
In text mining, the standard way of representing a document is by using the bag-of-words
model. The idea is to model each document with the count of words occurring in that document.
Feature vectors are typically formed so that each feature (i.e., each element of the feature vector)
represents the count of a specific word, an alternative being to just indicate the presence/absence
of a word, by using a binary representation. The set of words whose occurrences are counted is
called a vocabulary. Given a dataset that needs to be represented, one can use all the words from
all the documents in the dataset to build the vocabulary and then prune the vocabulary using
feature selection. It is common to apply a degree of pre-processing prior to feature selection,
typically including the removal of rare words with only a few occurrences, the removal of overly
common words (e.g. "a", "the", "and" and similar) and grouping the differently inflected forms
of a word together (lemmatization, stemming) [22].
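As an illustration (the toy corpus, the χ² criterion and the value k = 3 are arbitrary assumptions), building the bag-of-words representation and pruning the vocabulary via feature selection could look like this with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply today"]
labels = [0, 0, 1]                       # toy document classes

# Bag-of-words: one count feature per vocabulary word, common stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Prune the vocabulary: keep the 3 words most associated with the labels.
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
kept = [w for w, keep in zip(vectorizer.get_feature_names_out(),
                             selector.get_support()) if keep]
print(kept)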
3.8.2 Image Processing
Representing images is not a straightforward task as the number of possible image features
is practically unlimited [23]. The choice of features typically depends on the target application.
Examples of features include histograms of oriented gradients, edge orientation histograms,
Haar wavelets, raw pixels, gradient values, edges, color channels, etc. [24]. Some authors [25]
studied the use of a hybrid combination of feature selection algorithms for the problem of image
classification that yields better performance than when using Relief or SFFS/SFBS alone. In
cases when there are no irrelevant or redundant features in the dataset, the proposed algorithm
does not degrade performance.
3.8.3 Bioinformatics
In the past ten years, feature selection has seen much activity, primarily due to the advances
in bioinformatics where a large amount of genomic and proteomic data are produced for bio-
logical and biomedical studies. An interesting bioinformatics application of feature selection is
in biomarker discovery from genomics data. In genomics data (DNA microarray), individual
features correspond to the expression levels of thousands of genes in a single experiment. Gene
expression data usually contains a large number of genes, but a small number of samples, so
it’s critical to identify the most relevant features, which may give more important knowledge
about the genes that are the most discriminative for a particular problem. In fact, a given disease
or a biological function is usually associated with only a few genes [26]. Selecting a few relevant genes out of several thousands thus becomes a key problem in bioinformatics research [27]. In proteomics, high-throughput mass spectrometry (MS) screening measures the
molecular weights of individual bio-molecules (such as proteins and nucleic acids) and has
potential to discover putative proteomic bio-markers. Each spectrum is composed of peak am-
plitude measurements at approximately 15,500 features, represented by a corresponding mass-
to-charge value. The identification of meaningful proteomic features from MS is crucial for
disease diagnosis and protein-based bio-marker profiling [27]. In bioinformatics applications,
many feature selection methods have been proposed and applied. Widely used filter-type feature selection methods include the F-statistic [29], ReliefF [8], mRMR [30], the t-test, and Information Gain [31], which compute the sensitivity (correlation or relevance) of a feature w.r.t. the class label distribution of the data. These methods can be characterized as using global statistical information.
Chapter 4
Class-Specific Feature Selection
Methodology
In this chapter we introduce a Sparse Learning-Based Feature Selection approach, aiming to reduce the feature space based on the concept of Compressed Sensing. Basically, this approach is a joint sparse optimization problem [37] which tries to find a subset of features, called representatives, that can best reconstruct the entire dataset by linearly combining each feature component. Later on, we also propose a Class-Specific Feature Selection (CSFS) model [41] based on this idea. It transforms a c-class problem into c sub-problems, one for each class, where the instances of a class are used to decide which features best represent (up to an error) the data of that class. In CSFS, the feature subset selected for each sub-problem is assigned to the class from which the sub-problem was constructed. In order to classify new instances, our framework uses an ensemble of classifiers, where, for each class, a classifier is trained on the whole training set but using only the feature subset assigned to that class. Finally, our framework applies an ad-hoc decision rule for joining the classifier outputs. In doing so, it tries to improve both the feature selection and the classification process, since it first finds the best feature subsets for representing each class of a dataset and then uses them to build an ensemble of classifiers that improves the classification accuracy.
4.1 Introduction
Given a set of features in R^m arranged as the columns of the data matrix X = [x1, ..., xN], we consider the following optimization problem

\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{\text{row},0} \le k \qquad (4.1)

where C ∈ R^{N×N} is the coefficient matrix and \|C\|_{\text{row},0} counts the number of nonzero rows of C. In a nutshell, we would like to find at most k ≪ N representative features which best
represent/reconstruct the whole dataset. This can be viewed as a Sparse Dictionary Learning (SDL) scheme where the atoms of the dictionary are chosen from the data points themselves. Unlike the classic procedures for building the dictionary [33], in our formulation the dictionary is given by the data matrix X, which also plays the role of the measurements, and the unknown sparse code selects the features via convex optimization.
4.2 Problem formulation
Consider a set of features in R^m arranged as the columns of the data matrix X = [x1, ..., xN]. In this section we formulate the problem of finding representative features for a given, fixed feature space belonging to a collection of data points.
4.2.1 Learning compact dictionaries
Finding compact dictionaries to represent data has been well studied in the literature [34, 35, 36, 37, 38]. More specifically, in Dictionary Learning (DL) problems, one tries to simultaneously learn a compact dictionary D = [d1, ..., dk] ∈ R^{m×k} and the coefficients C = [c1, ..., cN] ∈ R^{k×N} that can well represent collections of data points. The best representation of the data is typically obtained by minimizing the objective function

\sum_{i=1}^{N} \|x_i - D c_i\|_2^2 = \|X - DC\|_F^2 \qquad (4.2)
w.r.t. the dictionary D and the coefficient matrix C, subject to appropriate constraints. In the
SDL framework [34, 35, 37, 38], one requires the coefficient matrix C to be sparse by solving
the optimization program
\min_{D, C} \; \|X - DC\|_F^2 \quad \text{subject to} \quad \|c_i\|_0 \le s, \; \|d_j\|_2 \le 1, \; \forall i, j \qquad (4.3)

where \|c_i\|_0 indicates the number of nonzero elements of c_i. In other words, one simultaneously learns a dictionary and coefficients such that each column of X is written as a linear combination of at most s atoms of the dictionary. Aside from being NP-hard due to the use of the ℓ0 norm, this problem is also non-convex because of the product of the two unknown and constrained matrices D and C. As a result, iterative procedures such as those introduced in Chapter 1 are employed to find each unknown matrix while fixing the other.
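For illustration only (not the approach adopted in this thesis, which fixes the dictionary as described next), such an alternating scheme is available in scikit-learn's DictionaryLearning; the sketch below codes each column of a random matrix X with at most s = 3 of k = 10 learned atoms:

import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(30, 100)        # m = 30 dimensions, N = 100 columns to represent

# Learn k = 10 atoms; each column of X is coded with at most s = 3 atoms (OMP).
dl = DictionaryLearning(n_components=10, transform_algorithm="omp",
                        transform_n_nonzero_coefs=3, random_state=0)
codes = dl.fit_transform(X.T)          # scikit-learn expects samples as rows
D = dl.components_.T                   # dictionary atoms as columns, shape (30, 10)
C = codes.T                            # sparse coefficients, shape (10, 100)
print(np.linalg.norm(X - D @ C, "fro"))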
4.2.2 Finding representative features
The learned atoms of the dictionary almost never coincide with the original features [37, 38, 40], and therefore cannot be considered as selected features of the collection of samples.
Figure 4.1: Cumulative sum of the rows of the matrix C in descending order of their ℓ1 norms, with the regularization parameter of Eq. (4.12) set to (a) τ = 20 and (b) τ = 50, respectively.
In order to find a subset of features that best represents the entire feature space, we consider a modification of the DL framework which, first, addresses the problem of local minima by transforming the previous non-convex problem into a convex one and, second, enforces the selection of a representative set of features from a collection of samples. We do this by setting the dictionary D to be the data matrix X itself, and minimizing the expression
\sum_{i=1}^{N} \|x_i - X c_i\|_2^2 = \|X - XC\|_F^2 \qquad (4.4)

w.r.t. the coefficient matrix C = [c1, ..., cN] ∈ R^{N×N}, subject to additional constraints that we describe next. In other words, we minimize the reconstruction error of each feature component obtained by linearly combining all the components of the feature space. To choose k ≪ N representatives, which take part in the linear reconstruction of each component in (4.4), we enforce

\|C\|_{0,q} \le k \qquad (4.5)
where the mixed ℓ0/ℓq norm is defined as \|C\|_{0,q} \triangleq \sum_{i=1}^{N} I(\|c^i\|_q > 0), with c^i denoting the i-th row of C and I(·) the indicator function. In a nutshell, \|C\|_{0,q} counts the number of nonzero rows of C. The indices of the nonzero rows of C correspond to the indices of the columns of X which are chosen as the representative features. As said before, the aim is to select k ≪ N representative features that can reconstruct each feature of the matrix X up to a fixed error. In order to find them, we solve

\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{0,q} \le k \qquad (4.6)

which is an NP-hard problem, as it requires a combinatorial search over every subset of k columns of X. Therefore, we relax the ℓ0 norm to the ℓ1 norm, and solve instead
\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{1,q} \le \tau \qquad (4.7)
where \|C\|_{1,q} \triangleq \sum_{i=1}^{N} \|c^i\|_q is the sum of the ℓq norms of the rows of C, and τ > 0 is an appropriately chosen parameter. The solution of the optimization program (4.7) not only indicates the representative features as the nonzero rows of C, but also provides information about the ranking of the selected features. More precisely, a representative with a higher ranking takes part in the reconstruction process more than the others; hence, its corresponding row in the optimal coefficient matrix C has many nonzero elements with large values. Conversely, a representative with a lower ranking takes part in the reconstruction process less than the others; hence, its corresponding row in C has a few nonzero elements with smaller values. Thus, we can rank the k representative features y_{i_1}, ..., y_{i_k} as i_1 ≥ i_2 ≥ ... ≥ i_k whenever, for the corresponding rows of C, we have
\|c^{i_1}\|_q \ge \|c^{i_2}\|_q \ge \cdots \ge \|c^{i_k}\|_q \qquad (4.8)
Another optimization problem closely related to (4.7) is

\min_{C} \; \|C\|_{0,q} \quad \text{subject to} \quad \|X - XC\|_F \le \epsilon \qquad (4.9)

which minimizes the number of representatives that can represent the whole feature space up to an error ε. As before, we relax the problem using the ℓ1 norm, obtaining
\min_{C} \; \|C\|_{1,q} \quad \text{subject to} \quad \|X - XC\|_F \le \epsilon \qquad (4.10)

This optimization problem can also be viewed as a compression scheme in which we want to choose a few representatives that can reconstruct the data up to a given error.
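Once an optimal coefficient matrix C has been computed (for instance with the ADMM procedure of the next section), the ranking in (4.8) amounts to sorting row norms; a minimal NumPy sketch (with hypothetical argument names) is:

import numpy as np

def rank_representatives(C, k, q=2):
    """Rank features by the l_q norms of the rows of the coefficient matrix C
    and return the indices of the top-k representatives (cf. Eq. 4.8)."""
    row_norms = np.linalg.norm(C, ord=q, axis=1)   # ||c^i||_q for each row i
    order = np.argsort(row_norms)[::-1]            # descending ranking i1, i2, ...
    return order[:k], row_norms[order[:k]]

# Example: C = scba_admm_lasso(X, lam)  (see the sketch in Section 4.3.1)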
4.3 Problem reformulation
Both optimization problems (4.7) and (4.10) can be expressed by means of Lagrange multipliers as

\min_{C} \; \frac{1}{2}\|X - XC\|_F^2 + \lambda \|C\|_{1,q} \qquad (4.11)

which is nothing more than the LASSO problem described in Section 2.4. In practice, we use the ADMM described in Section 1.5 for finding the representative features of a given dataset.
procedure ADMM for LASSO
    Data: X ∈ R^{m×n}
    Result: θ ∈ R^{n×n}
    Initialize θ^0, µ^0 ∈ R^{n×n} as zero matrices, t = 0
    repeat
        β^{t+1} = (X^T X + ρI)^{-1} (X^T X + ρθ^t − µ^t)
        θ^{t+1} = S_{λ/ρ}(β^{t+1} + µ^t/ρ)
        µ^{t+1} = µ^t + ρ(β^{t+1} − θ^{t+1})
        t = t + 1
    until convergence
4.3.1 ADMM for the LASSO
In Section 2.4.4 we saw Soft Thresholding and how it can be applied very efficiently to problems like (4.11). The latter is commonly referred to as LASSO regression, which constrains the ℓ1 norm of the parameter vector to be no greater than a given value. Unfortunately, in this case it is neither trivial nor efficient to apply Soft Thresholding directly, since the design matrix X couples the coefficients and the elements cannot be treated independently. Hence, we can apply the Alternating Direction Method of Multipliers approach of Section 1.5 for solving the LASSO, which in Lagrange form is

\min_{\beta \in \mathbb{R}^p, \, \theta \in \mathbb{R}^p} \; \underbrace{\|Y - X\beta\|_2^2}_{\text{loss}} + \underbrace{\lambda \|\theta\|_1}_{\text{penalty}} \quad \text{subject to} \quad \beta - \theta = 0 \qquad (4.12)
When applied to this problem, the ADMM updates for (4.12) take the form

\beta^{t+1} = (X^T X + \rho I)^{-1} (X^T y + \rho\,\theta^{t} - \mu^{t})
\theta^{t+1} = S_{\lambda/\rho}(\beta^{t+1} + \mu^{t}/\rho)
\mu^{t+1} = \mu^{t} + \rho\,(\beta^{t+1} - \theta^{t+1}) \qquad (4.13)
which allows us to break the original problem into a sequence of two sub-problems. As shown in (4.13), in the first sub-problem, when minimizing (4.12) w.r.t. β only, the ℓ1 penalty \|\theta\|_1 disappears from the objective, making it a simple and very efficient Ridge Regression problem.
Figure 4.2: A General Framework for Class-Specific Feature Selection [41].
In the second sub-problem, when minimizing (4.12) w.r.t. θ only, the loss term \|Y - X\beta\|_2^2 disappears, allowing θ to be solved independently across each element and therefore letting us efficiently use Soft-Thresholding. The current estimates of β and θ are then combined in the last step of the ADMM to update the current estimate of the Lagrange multiplier µ. After the start-up phase, which involves computing the product X^T X, the subsequent iterations have a cost of O(Np). Consequently, the cost per iteration is similar to that of Coordinate Descent or of the Composite Gradient method.
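For concreteness, a minimal NumPy sketch of these updates applied to the matrix-valued problem (4.11), with an element-wise soft-thresholding step as in the pseudocode above (an assumption; the actual implementation used for the experiments may differ), is:

import numpy as np

def scba_admm_lasso(X, lam, rho=1.0, n_iter=200):
    # ADMM for min_C 0.5*||X - X C||_F^2 + lam*||C||_1 (element-wise l1 penalty
    # used here as a stand-in for the row-wise penalty of Eq. 4.11; sketch only).
    n = X.shape[1]
    G = X.T @ X                                   # start-up: Gram matrix X^T X
    F = np.linalg.inv(G + rho * np.eye(n))        # reused by every beta-update
    theta = np.zeros((n, n))                      # sparse copy of the coefficients
    mu = np.zeros((n, n))                         # Lagrange multiplier matrix
    for _ in range(n_iter):
        beta = F @ (G + rho * theta - mu)         # ridge-like update (Eq. 4.13)
        z = beta + mu / rho
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)  # soft-thresholding
        mu = mu + rho * (beta - theta)            # dual update
    return theta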
4.4 Class-Specific Feature Selection
Feature selection has been widely used for eliminating redundant or irrelevant features, and it can be done in two ways: Traditional Feature Selection (TFS) for all classes and Class-Specific Feature Selection (CSFS). CSFS is the process of finding a different set of features for each class, and different methods of this kind have been proposed [42]. In contrast to a TFS algorithm, where a single feature subset is selected for discriminating among all the classes of a supervised classification problem, a CSFS algorithm selects a subset of features for each class. A general framework for CSFS can use any traditional feature selector for selecting a possibly different subset for each class of a supervised classification problem. Depending on the type of feature selector, the overall process may slightly change.
4.4.1 General Framework for Class-Specific Feature Selection
Generally, we can observe that most CSFS algorithms are strongly tied to the use of a particular classifier. In general, it would be desirable to apply CSFS independently of the classifier used in the classification stage as well as of the feature selector employed. To this end, the authors of [41] built a General Framework for CSFS (GF-CSFS) which allows using any traditional feature selector for CSFS, as well as any classifier, and consists of four stages (a minimal code sketch of the pipeline is given after the list):
1. Class binarization: In CSFS, the goal is to select a feature subset that allows discriminating a class from the remaining classes. Therefore, in the first stage of [41], the one-against-all class binarization is used to transform a c-class problem into c binary problems.
2. Class balancing: Since the binarization stage makes the generated binary problems unbalanced, it is necessary to balance the classes by applying an oversampling process before applying a conventional feature selector to a binary problem. In the literature, there are several methods for oversampling; some of the most used are random oversampling [43], SMOTE [44] and Borderline-SMOTE [45].
3. Class-Specific Feature Selection: For each binary problem, features are selected using a traditional feature selector, and the selected features are assigned to the class from which the binary problem was constructed. In this way, c possibly different feature subsets are obtained, one for each class of the original c-class supervised classification problem. In this stage, it is possible to use a different traditional feature selector for each binary problem. At the end of this stage, the CSFS process is complete. However, since a possibly different subset of features has been selected for each class, it is very important to define how to use these subsets for classification.
4. Classification: For doing class-specific selection, the multiclass problem has been transformed into c binary problems. At first glance, the straightforward way of using the feature subset associated to each class would be to follow a multi-classification scheme, training a classifier for each binary problem and integrating the decisions of the classifiers. However, following this approach would end up solving a problem different from the one originally formulated. It is important to point out that the set of features associated to a class is, in theory, the best subset found for discriminating the objects of this class from the objects of the other classes. Therefore, in the classification stage, for each class ci a classifier ei is trained for the original problem (i.e., the instances in the training set for the classifier ei keep their original class labels) but taking into account only the features selected for the class ci. In this way, we obtain a classifier ensemble E = {e1, ..., ec}. When a new instance O is classified through the ensemble, its original dimensionality d is first reduced to the dimensionality di used by the classifier ei, i = 1, ..., c, and a custom majority voting scheme is then used to classify the new instance.
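A minimal sketch of the four stages (illustrative assumptions: SMOTE from the imbalanced-learn package for stage 2, a univariate ANOVA filter as the "traditional feature selector" of stage 3, and a linear SVM as base classifier; none of these are prescribed by [41]) could be:

import numpy as np
from imblearn.over_sampling import SMOTE                    # stage 2 (imbalanced-learn)
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def gf_csfs_fit(X, y, k=20):
    # Illustrative GF-CSFS-like pipeline: binarize, balance, select, train.
    feature_subsets, ensemble = {}, {}
    for c in np.unique(y):
        y_bin = (y == c).astype(int)                        # stage 1: one-against-all
        X_bal, y_bal = SMOTE().fit_resample(X, y_bin)       # stage 2: oversampling
        sel = SelectKBest(f_classif, k=k).fit(X_bal, y_bal) # stage 3: any TFS would do
        idx = np.where(sel.get_support())[0]
        feature_subsets[c] = idx
        # stage 4: e_c is trained on the ORIGINAL multiclass labels,
        # but only on the features assigned to class c.
        ensemble[c] = SVC(kernel="linear", C=1).fit(X[:, idx], y)
    return feature_subsets, ensemble

X, y = make_classification(n_samples=150, n_features=60, n_informative=12,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
subsets, models = gf_csfs_fit(X, y, k=20)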
Figure 4.3: A Sparse Learning Based approach for Class-Specific Feature Selection.
4.4.2 A Sparse Learning-Based Approach for Class-Specific Feature Selection
In this section we describe a novel Sparse Learning-Based Approach for Class-Specific Feature Selection, based on the sparse model presented in Section 4.2. Following the concept of representativeness described in Section 4.2.2, we try to best represent each class subset of the training set by using only a few representative features. More specifically, the method consists of the following steps:
1. Class-sample separation: Unlike the General Framework for Class-Specific Feature Selection, our CSFS model does not employ a class binarization stage to transform a c-class problem into c binary problems; instead, it simply separates the samples of the training set by class, in order to obtain a different subset/configuration of samples for each class.
2. Class balancing: Once the class sample sets of the training set have been split apart, each class subset may be unbalanced. Therefore, we used the SMOTE [44] re-sampling method to balance each class subset (this stage might not be necessary if there are enough samples for each class in the training set).
3. Intra-Class-Specific feature selection: Unlike GF-CSFS, where a binarization step is
carried out for doing class-specific selection, our method involves using the Sparse-Coding
Based Approach described in Section 4.2.2 for retrieving the most representative features for each class subset of the training data, i.e., the subset that minimizes Equation (4.11) and therefore best represents/reconstructs the whole collection of objects. In doing so, this approach takes advantage of intra-class feature selection to improve the classification accuracy with respect to TFS or GF-CSFS.
4. Classification: In the classification step we follow an ensemble procedure for classifying new instances. As in [41], for each class ci a classifier ei is trained for the original problem but taking into account only the features selected for the class ci. In this way, we produce an ensemble of classifiers E = {e1, ..., ec}. Whenever a new instance O needs to be classified through the ensemble, its original dimensionality is first reduced to the dimensionality di used by the classifier ei, i = 1, ..., c. The following ad-hoc majority decision rule is then used for determining the class of the sample O (a minimal sketch of this rule is given after the list):
(a) If a classifier ei outputs the class ci, i.e., the same class for which the features used to train ei were selected, then the class ci is assigned to O. If there is a tie (two or more classifiers output their own class), the class of O is assigned through a majority vote among all classifiers. If the tie persists, the class of O is the majority class among the tied classes.
(b) If no classifier outputs the class for which its features were selected, the class of O is assigned through a majority vote. If there is a tie, the class of O is the majority class among the tied classes.
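A minimal sketch of this decision rule (with hypothetical names: feature_subsets and ensemble as produced by a fitting routine such as the one sketched in Section 4.4.1, and class_priors giving the class frequencies of the training set, used here to break the final ties) is:

import numpy as np
from collections import Counter

def predict_csfs(o, feature_subsets, ensemble, class_priors):
    # Ad-hoc majority rule (sketch): o is a single instance (1-D array),
    # class_priors maps each class to its frequency in the training set.
    votes = {c: ensemble[c].predict(o[feature_subsets[c]].reshape(1, -1))[0]
             for c in ensemble}
    self_votes = [c for c, v in votes.items() if v == c]    # rule (a)
    if len(self_votes) == 1:
        return self_votes[0]
    counts = Counter(votes.values())                        # ties and rule (b): majority vote
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda c: class_priors[c])         # persistent tie: majority class

# e.g. predict_csfs(X[0], subsets, models, {c: np.mean(y == c) for c in models})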
Chapter 5
Experimental Results
In this chapter, we conduct several experiments to evaluate the performance of the proposed SCBA-CSFS technique. In particular, we compare the proposed method against state-of-the-art feature selection methods as well as against the GF-CSFS introduced in Section 4.4.
5.1 Experimental Analysis
In our experiments, in order to evaluate the performance of our feature selection methods, we applied them to a total of six publicly available (bioinformatics) microarray datasets: the acute lymphoblastic leukemia and acute myeloid leukemia (ALLAML) dataset [93], the human carcinomas (CAR) dataset [94], the human lung carcinomas (LUNG) dataset [96], the lung discrete (LUNG_DISCRETE) dataset [97], the diffuse large B-cell lymphoma (DLBCL) dataset [98], and the malignant glioma (GLIOMA) dataset [99]. The Support Vector Machine (SVM) classifier is employed on these datasets, using 5-fold cross-validation.
We focused on this type of dataset because selecting small subsets out of the thousands of genes in microarray data is an important task for several medical purposes. Microarray data analysis is known for involving a huge number of genes compared to a relatively small number of samples. In particular, gene selection is the task of identifying the most significantly differentially expressed genes under different conditions, and it has been an open research focus. Gene selection is a prerequisite in many applications [102], and the selected genes are very useful in clinical applications such as recognizing diseased profiles. Nonetheless, because of high costs, the number of experiments that can be used for classification purposes is usually limited. This small number of samples, compared to the large number of genes in an experiment, is well known as the Curse of Dimensionality [5] and challenges the classification task as well as other data analyses. Moreover, it is well known that a number of genes play an important role, whereas many others could be unrelated to the classification task [101]. Therefore, a critical step towards effective classification is to identify the representative genes, thereby decreasing the number of genes used for classification.
Dataset Size # of Features # of Classes
ALLAML 72 7129 2
CAR 174 9182 11
LUNG_C 203 3312 5
LUNG_D 73 325 7
DLBCL 96 4026 9
GLIOMA 50 4434 4
Table 5.1: Datasets Description.
5.2 Datasets Description
We give a brief description of all the datasets used in our experiments.
The ALLAML dataset [93] contains in total 72 samples in two classes, ALL and AML, which contain 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values.
The CARCINOM dataset [94] contains in total 174 samples in eleven classes: prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas, and lung squamous cell carcinoma, with 26, 8, 26, 23, 12, 11, 7, 27, 6, 14, and 14 samples, respectively. After the pre-processing described in [95], the dataset was reduced to 174 samples and 9,182 genes.
The LUNG dataset [96] contains in total 203 samples in five classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6 and 17 samples, respectively. Genes with a standard deviation smaller than 50 expression units were removed, yielding a dataset with 203 samples and 3,312 genes.
The LUNG_DISCRETE dataset [97] contains in total 73 samples in seven classes. Each sample has 325 features.
The DLBCL dataset [98] is a modified version of the original DLBCL dataset. It consists of 96 samples in nine classes, where each sample is described by the expression of 4,026 genes. The class sizes in the DLBCL dataset are 46, 10, 9, 11, 6, 6, 4, 2, and 2 samples, respectively.
The GLIOMA dataset [99] contains in total 50 samples in four classes: cancer glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer oligodendrogliomas, with 14, 14, 7 and 15 samples, respectively. Each sample has 12,625 genes. After the pre-processing described in [99], the dataset was reduced to 50 samples and 4,433 genes.
All datasets were downloaded from [103]; their characteristics are summarized in Table 5.1.
5.3 Experiment Setup
To validate the effectiveness of the Sparse-Coding Based Approach for Feature Selection (SCBA-FS) and the Sparse-Coding Based Approach for Class-Specific Feature Selection (SCBA-CSFS), we compared our methods against several TFS methods and against the GF-CSFS proposed in [41].
More precisely, in our experiments we first compared our methods against the TFS methods. In addition, given that the framework proposed in [41] can use any TFS method as a base for doing CSFS, we carried out experiments injecting both filter and wrapper methods into it. We also compared the accuracy results against those obtained by using all the features (BSL). Since our feature selection methods are sparse filter-type approaches, we found it appropriate to compare them to the following TFS methods:
• RFS [100]: the Robust Feature Selection method is a sparse learning-based approach for feature selection which emphasizes the joint ℓ2,1-norm minimization on both the loss and the regularization function.
• ls-ℓ2,1 [4]: ls-ℓ2,1 is a supervised sparse feature selection method. It exploits the ℓ2,1-norm regularized regression model for joint feature selection from multiple tasks, where the classification objective function is a quadratic loss.
• ll-ℓ2,1 [4]: ll-ℓ2,1 is a supervised sparse feature selection method which uses the same concept as ls-ℓ2,1 but with a logistic loss.
• Fisher [9]: the Fisher score is one of the most widely used supervised filter feature selection methods. It scores each feature as the ratio of inter-class separation to intra-class variance; features are evaluated independently, and the final feature selection is obtained by aggregating the m top-ranked ones.
• Relief-F [8]: Relief-F is an iterative, randomized, and supervised filter approach that estimates the quality of the features according to how well their values differentiate data samples that are near to each other; it does not discriminate among redundant features, and its performance decreases with few samples.
• mRmR [30]: Minimum-Redundancy-Maximum-Relevance is a mutual information based filter algorithm which selects features according to the maximal statistical dependency criterion.
We pre-processed all the datasets using Z-score normalization [92]. The Support Vector Machine (SVM) classifier was applied to each dataset using 5-fold cross-validation. We used the linear kernel with the parameter C = 1 and the one-vs-the-rest strategy for multi-class classification. For RFS, ls-ℓ2,1 and ll-ℓ2,1 we fixed the regularization parameter γ to 1.0 by default. For Relief-F we fixed k, which specifies the size of the neighborhood, to 5 by default. For our methods we tuned the regularization parameter λ in order to achieve better results on all datasets.
For evaluating the performance of all the methods, we chose the number of features to range from 1 to 300 and then computed the average accuracy. The evaluation metric used for assessing the classification performance of all the methods is the Accuracy Score (AS), defined as follows:

\text{Accuracy\_Score}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \mathbf{1}(\hat{y}_i = y_i) \qquad (5.1)
where yi and ŷi are, respectively, the ground truth and the predicted label of the i-th sample, and n_samples is the number of samples in the test set. Obviously, a larger AS indicates better performance.
Although the two-class classification problem is an important type of task, it is relatively easy, since a random choice of class labels would already give 50% accuracy. Classification problems with multiple classes are generally more difficult and give a more realistic assessment of the proposed methods.
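For reference, the evaluation protocol (linear kernel, C = 1, one-vs-the-rest, 5-fold cross-validation, accuracy as in Eq. 5.1) can be reproduced along these lines with scikit-learn; the synthetic data below only stands in for a reduced microarray dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy stand-in for a reduced microarray dataset (the real data are described in 5.2).
X_reduced, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                                   n_classes=4, n_clusters_per_class=1, random_state=0)

clf = OneVsRestClassifier(SVC(kernel="linear", C=1))   # linear kernel, C = 1, one-vs-rest
scores = cross_val_score(clf, X_reduced, y, cv=5, scoring="accuracy")
print("mean 5-fold accuracy (Eq. 5.1 averaged over folds): %.4f" % scores.mean())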
Figure 5.1: Classification accuracy comparisons of seven feature selection algorithms on six datasets ((a) ALLAML, (b) CARCINOM, (c) LUNG, (d) LUNG_DISCRETE, (e) LYMPHOMA, (f) GLIOMA). SVM with 5-fold cross-validation is used for classification. SCBA and SCBA-CSFS are our methods.
Average accuracy of top 20 features (%):
Dataset   RFS    ls-ℓ21  ll-ℓ21  Fisher  Relief-F  mRmR   SCBA   SCBA-CSFS  BSL
ALLAML    80.82  74.27   88.36   96.78   95.67     74.62  92.57  80.81      93.21
CAR       85.19  72.03   84.84   65.67   85.58     85.88  84.84  93.60      90.25
LUNG_C    93.24  92.37   97.84   90.22   96.98     96.55  96.98  98.99      95.57
LUNG_D    89.89  87.72   93.26   90.51   91.89     93.93  89.20  96.60      83.43
DLBCL     93.96  92.99   95.89   97.34   99.03     99.52  95.89  99.52(5)   93.74
GLIOMA    80     60      73.33   76.67   76.66     75     76.67  76.67      74
Average   87.18  79.9    88.92   86.19   90.97     87.58  89.36  91.03      88.37

Average accuracy of top 80 features (%):
Dataset   RFS       ls-ℓ21  ll-ℓ21  Fisher  Relief-F  mRmR   SCBA   SCBA-CSFS
ALLAML    97.84     74.27   95.73   98.95   98.89     83.16  96.84  95.67
CAR       96.98     88.88   94.61   92.92   96.95     94.93  94.61  99.32
LUNG_C    98.12     97.84   98.99   99.28   99.57     98.71  99.42  99.70
LUNG_D    95.93     95.93   94.62   95.93   97.31     96.60  97.29  97.29
DLBCL     99.03     95.42   99.76   99.76   99.76     99.8   99.76  99.76
GLIOMA    88.33(29) 70      80      83.33   80        78.33  81.67  88.33
Average   96.03     87.05   93.95   95.03   95.41     91.92  94.93  96.68

Table 5.2: Accuracy Score of SVM using 5-fold cross-validation. Six TFS methods are compared against our methods. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features; SCBA-FS and SCBA-CSFS are our methods. The best results are highlighted in bold.
5.4 Experimental Results and Discussion
The results were obtained using a workstation with dual Intel(R) Xeon(R) 2.40GHz CPUs and 64GB of RAM.
We summarize the 5-fold cross-validation accuracy results of the different methods on the six datasets listed in Table 5.1. Tables 5.2-5.3 show the experimental results using SVM. For all the comparisons we computed the average accuracy using the top 20 and top 80 features for all feature selection methods. Where there is a tie among the methods, we considered the best accuracy achieved with the smaller number of features.
Firstly, we compared the performance of our methods against the TFS methods. Table 5.2 shows the experimental results using SVM. We can see from the table that our approach (SCBA-CSFS) significantly outperforms the other methods when the datasets have many classes.
Figure 5.2: Classification accuracy comparisons of seven feature selection algorithms on six datasets ((a) ALLAML, (b) CARCINOM, (c) LUNG, (d) LUNG_DISCRETE, (e) LYMPHOMA, (f) GLIOMA). SVM with 5-fold cross-validation is used for classification. SCBA-CSFS is our method.
Average accuracy of top 20 features (%):
Dataset   RFS    ls-ℓ21  ll-ℓ21  Fisher  Relief-F  mRmR      SCBA-CSFS  BSL
ALLAML    82     72.29   80.57   98.57   95.81     72.10     80.82      93.21
CAR       54     35.61   69.01   89.70   89.13     78.18     93.60      90.25
LUNG_C    85.73  81.35   90.63   93.11   91.15     86.74     98.99      95.57
LUNG_D    72.76  68.76   72.67   89.05   85.05     86.38     96.60      83.43
DLBCL     94.44  92.75   97.56   99.28   99.76     99.76(6)  99.52      93.74
GLIOMA    68     58      70      76      66        64        76.67      74
Average   76.16  68.12   80.07   90.95   87.82     81.19     91.03      88.37

Average accuracy of top 80 features (%):
Dataset   RFS    ls-ℓ21  ll-ℓ21  Fisher  Relief-F  mRmR   SCBA-CSFS
ALLAML    95.71  83.33   90.20   98.57   97.14     81.71  95.67
CAR       83.88  64.35   83.88   94.84   94.86     89.63  99.32
LUNG_C    93.07  88.67   94.05   95.56   94.56     91.61  99.70
LUNG_D    89.14  86.48   86.38   91.71   89.05     89.05  97.29
DLBCL     99.98  99.28   99.76   99.76   99.98     99.98  99.76
GLIOMA    76     70      70      82      80        66     88.33
Average   89.63  82.02   87.38   93.74   92.6      86.33  96.68

Table 5.3: Accuracy Score of SVM using 5-fold cross-validation. The GF-CSFS [41] framework using five traditional feature selectors is compared against our SCBA-CSFS. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features; SCBA-CSFS is our method. The best results are highlighted in bold.
Based on our experimental results, we may affirm that applying TFS usually allows obtaining better results than using all the available features. However, in most cases, applying our CSFS allows obtaining better results than applying TFS methods. In particular, we noticed that our CSFS method seems to achieve the best results when the datasets have many classes (e.g., LUNG_C, LUNG_D, CAR, DLBCL). In addition, as shown in Fig. 5.1, it is important to point out that our method always outperforms the others with a smaller number of features. As a result, we can assert that our method is able to identify/retrieve the most representative features, i.e., those that maximize the classification accuracy. With the top 20 and top 80 features, our method is around 1%-12% and 1%-10% better than the other methods on all six datasets, respectively.
Secondly, we compared the performance of our method against [41]. The experimental results are shown in Table 5.3. From the table, we can appreciate that the process underlying our SCBA for feature selection is more suitable for retrieving the best features
for the purpose of classification w.r.t. the GF-CSFS, leading most of the time to better results. With the top 20 and top 80 features, our method is around 1%-23% and 1%-10% better than the other methods on all six datasets, respectively.
Chapter 6
Conclusion
In this thesis, we proposed a novel Sparse-Coding Based Approach for Feature Selection that emphasizes joint ℓ1,2-norm minimization, together with Class-Specific Feature Selection. Experimental results on six different datasets validate the unique aspects of SCBA-CSFS and demonstrate the better performance achieved against state-of-the-art methods.
One of the main characteristics of our framework is that, by jointly exploiting the ideas of Compressed Sensing and Class-Specific Feature Selection, it is able to identify/retrieve the most representative features, which maximize the classification accuracy when the dataset is made up of many classes.
Based on our experimental results, we can conclude that applying TFS usually allows achieving better results than using all the available features. However, in most cases, applying our proposed method SCBA-CSFS allows obtaining better results than TFS, as well as than GF-CSFS with several TFS methods injected.
Future work will include the analysis of our method on other types of datasets and the injection of further TFS methods into our framework for comparing their performance. In addition, we plan to test our method on real-case datasets such as the EPIC dataset [106], after a thorough analysis of pre-filtering.
References
[1] Benjamin Recht. Convex Modeling with Priors. PhD thesis, Massachusetts Institute of Tech-
nology, Media Arts and Sciences Department, 2006.
[2] He, B. S., Hai Yang, and S. L. Wang. Alternating direction method with self-adaptive
penalty parameters for monotone variational inequalities. Journal of Optimization Theory
and applications 106.2 (2000): 337-356.
[3] Korn, Flip, B-U. Pagel, and Christos Faloutsos. On the "dimensionality curse" and the "self-
similarity blessing". IEEE Transactions on Knowledge and Data Engineering 13.1 (2001):
96-111.
[4] J. Tang, Jiliang, Salem Alelyani, and Huan Liu. Feature selection for classification: A
review. Data Classification: Algorithms and Applications (2014): 37.
[5] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and data mining.
Vol. 454. Springer Science & Business Media, 2012.
[6] Guyon, Isabelle, and André Elisseeff. An introduction to variable and feature selection.
Journal of machine learning research 3.Mar (2003): 1157-1182.
[7] Yager, Ronald R., and Liping Liu, eds. Classic works of the Dempster-Shafer theory of
belief functions. Vol. 219. Springer, 2008.
[8] Kira K, Rendell LA. A practical approach to feature selection. InProceedings of the ninth
international workshop on Machine learning 1992 Jul 12 (pp. 249-256).
[9] Gu, Quanquan, Zhenhui Li, and Jiawei Han. Generalized fisher score for feature selection.
arXiv preprint arXiv:1202.3725 (2012).
[10] He, Xiaofei, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. Advances
in neural information processing systems. 2006.
[11] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and data min-
ing. Vol. 454. Springer Science & Business Media, 2012.
[12] Bradley, Paul S., and Olvi L. Mangasarian. Feature selection via concave minimization
and support vector machines. ICML. Vol. 98. 1998.
[13] Maldonado, Sebastián, Richard Weber, and Fazel Famili. Feature selection for high-
dimensional class-imbalanced data sets using Support Vector Machines. Information Sci-
ences 286 (2014): 228-246.
[14] Kim, YongSeog, W. Nick Street, and Filippo Menczer. Evolutionary model selection in
unsupervised learning. Intelligent data analysis 6.6 (2002): 531-556.
[15] Hapfelmeier, Alexander, and Kurt Ulm. A new variable selection approach using random
forests. Computational Statistics & Data Analysis 60 (2013): 50-69.
[16] Cawley, Gavin C., Nicola L. Talbot, and Mark Girolami. Sparse multinomial logistic re-
gression via bayesian l1 regularisation. Advances in neural information processing systems.
2007.
[17] Das, S. (2001, June). Filters, wrappers and a boosting-based hybrid for feature selection.
In ICML (Vol. 1, pp. 74-81).
[18] Cadenas, José M., M. Carmen Garrido, and Raquel MartíNez. Feature subset selection
filter–wrapper based on low quality data. Expert systems with applications 40.16 (2013):
6241-6252.
[19] I. S. Oh, J. S. Lee, and B. R. Moon, "Hybrid genetic algorithms for feature selection,"
IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, 2004.
[20] S. I. Ali and W. Shahzad. A Feature Subset Selection Method based on Conditional Mutual
Information and Ant Colony Optimization. International Journal of Computer Applications,
vol. 60, no. 11, pp. 5–10, 2012.
[21] S. Sarafrazi and H. Nezamabadi-pour. Facing the classification of binary problems with
a GSA-SVM hybrid system. Mathematical and Computer Modelling, vol. 57, issues 1-2, pp.
270–278, 2013.
[22] Forman, George. An extensive empirical study of feature selection metrics for text classifi-
cation. Journal of machine learning research 3.Mar (2003): 1289-1305.
[23] J. Bins and B. A. Draper. Feature selection from huge feature sets, in: Proc. 8th Interna-
tional Conference on Computer Vision (ICCV-01), Vancouver, British Columbia, Canada,
IEEE Computer Society, pp. 159–165, 2001.
[24] K. Brkić. Structural analysis of video by histogram-based description of local space-time appearance. Ph.D. dissertation, University of Zagreb, Faculty of Electrical Engineering and Computing, 2013.
[25] M. Muštra, M. Grgić, and K. Delač. Breast density classification using multiple feature selection. Automatika, vol. 53, pp. 1289–1305, 2012.
[26] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Analysis and Machine Intelligence, 27, 2005.
[27] Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in bioinformat-
ics. Bioinformatics, 23(19):2507–2517, 2007.
[28] Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological) (1996): 267-288.
[29] Ding, Chris, and Hanchuan Peng. Minimum redundancy feature selection from microarray
gene expression data. Journal of bioinformatics and computational biology 3.02 (2005): 185-
205.
[30] Peng, Hanchuan, Fuhui Long, and Chris Ding. Feature selection based on mutual infor-
mation criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on pattern analysis and machine intelligence 27.8 (2005): 1226-1238.
[31] L. E. Raileanu and K. Stoffel. Theoretical comparison between the gini index and infor-
mation gain criteria. Univeristy of Neuchatel, 2000.
[32] Nie, Feiping, et al. Efficient and robust feature selection via joint 2,1-norms minimization.
Advances in neural information processing systems. 2010.
[33] Xu, Jin, Haibo He, and Hong Man. Active Dictionary Learning in Sparse Representation
Based Classification. arXiv preprint arXiv:1409.5763 (2014).
[34] Aharon, Michal, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54.11 (2006): 4311-4322.
[35] Engan, Kjersti, Sven Ole Aase, and J. Hakon Husoy. Method of optimal directions for
frame design. Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE
International Conference on. Vol. 5. IEEE, 1999.
[36] Jolliffe, Ian T. Principal component analysis and factor analysis. Principal component
analysis (2002): 150-166.
[37] Mairal, Julien, et al. Discriminative learned dictionaries for local image analysis. Com-
puter Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[38] Ramirez, Ignacio, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering
via dictionary learning with structured incoherence and shared features. Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[39] Ramirez, Ignacio, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering
via dictionary learning with structured incoherence and shared features. Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[40] Mairal, Julien, et al. Non-local sparse models for image restoration. Computer Vision,
2009 IEEE 12th International Conference on. IEEE, 2009.
[41] Pineda-Bautista, Bárbara B., Jesús Ariel Carrasco-Ochoa, and J. Fco. Martínez-Trinidad. General framework for class-specific feature selection. Expert Systems with Applications 38.8 (2011): 10018-10024.
[42] Fu, Xiuju, and Lipo Wang. A GA-based RBF classifier with class-dependent features. Evo-
lutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on. Vol. 2. IEEE,
2002.
[43] Van Hulse, Jason, Taghi M. Khoshgoftaar, and Amri Napolitano. Experimental perspec-
tives on learning from imbalanced data. Proceedings of the 24th international conference on
Machine learning. ACM, 2007.
[44] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
[45] Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling
method in imbalanced data sets learning. In Proceedings of the international conference on
intelligent computing (pp. 878–887).
[46] Liu, Huan, and Hiroshi Motoda, eds. Computational methods of feature selection. CRC
Press, 2007.
[47] Weston, Jason, et al. Use of the zero-norm with linear models and kernel methods. Journal
of machine learning research 3.Mar (2003): 1439-1461.
[48] Song, Le, et al. Supervised feature selection via dependence estimation. Proceedings of
the 24th international conference on Machine learning. ACM, 2007.
[49] Dy, Jennifer G., and Carla E. Brodley. Feature selection for unsupervised learning. Journal
of machine learning research 5.Aug (2004): 845-889.
[50] P. Mitra, C. A. Murthy, and S. Pal. Unsupervised feature selection using feature similarity.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:301–312, 2002.
[51] Zhao, Zheng, and Huan Liu. Semi-supervised feature selection via spectral analysis. Pro-
ceedings of the 2007 SIAM International Conference on Data Mining. Society for Industrial
and Applied Mathematics, 2007.
[52] Xu, Zenglin, et al. Discriminative semi-supervised feature selection via manifold regular-
ization. IEEE Transactions on Neural networks 21.7 (2010): 1033-1047.
[53] Dy, Jennifer G., and Carla E. Brodley. Feature subset selection and order identification
for unsupervised learning. ICML. 2000.
[54] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of the SIAM International Conference on Data Mining, 2007.
[55] Yu, Lei, and Huan Liu. Efficient feature selection via analysis of relevance and redundancy.
Journal of machine learning research 5.Oct (2004): 1205-1224.
[56] Aggarwal, Charu C., and Chandan K. Reddy, eds. Data clustering: algorithms and appli-
cations. CRC press, 2013.
[57] Liu, Huan, and Hiroshi Motoda, eds. Computational methods of feature selection. CRC
Press, 2007.
[58] Kohavi, Ron, and George H. John. Wrappers for feature subset selection. Artificial intelli-
gence 97.1-2 (1997): 273-324.
[59] Hall, Mark A., and Lloyd A. Smith. Feature Selection for Machine Learning: Comparing
a Correlation-Based Filter Approach to the Wrapper. FLAIRS conference. Vol. 1999. 1999.
[60] I. Inza, P. Larranaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene selection
approaches in dna microarray domains. Artificial intelligence in medicine,31(2):91–103,
2004.
[61] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Machine learning, 46(1-3):389–422, 2002.
[62] Quinlan, J. Ross. Induction of decision trees. Machine learning 1.1 (1986): 81-106.
[63] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[64] Ma, Shuangge, and Jian Huang. Penalized feature selection and classification in bioinfor-
matics. Briefings in bioinformatics 9.5 (2008): 392-403.
[65] Bradley, Paul S., and Olvi L. Mangasarian. Feature selection via concave minimization
and support vector machines. ICML. Vol. 98. 1998.
[66] Cawley, Gavin C., Nicola L. Talbot, and Mark Girolami. Sparse multinomial logistic re-
gression via bayesian l1 regularisation. Advances in neural information processing systems.
2007.
[67] Liu, Huan, and Lei Yu. Toward integrating feature selection algorithms for classification
and clustering. IEEE Transactions on knowledge and data engineering 17.4 (2005): 491-502.
[68] Zou, Hui. The adaptive lasso and its oracle properties. Journal of the American statistical
association 101.476 (2006): 1418-1429.
[69] Knight, Keith, and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of statistics
(2000): 1356-1378.
[70] Huang, Jian, Joel L. Horowitz, and Shuangge Ma. Asymptotic properties of bridge estima-
tors in sparse high-dimensional regression models. The Annals of Statistics (2008): 587-613.
[71] Zou, Hui, and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005):
301-320.
[72] Wang, Li, Ji Zhu, and Hui Zou. Hybrid huberized support vector machines for microarray
classification. Proceedings of the 24th international conference on Machine learning. ACM,
2007.
[73] Obozinski, Guillaume, Ben Taskar, and Michael Jordan. Multi-task feature selection.
Statistics Department, UC Berkeley, Tech. Rep 2 (2006).
[74] Argyriou, Andreas, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature
learning. Advances in neural information processing systems. 2007.
[75] Yuan, Ming, and Yi Lin. Model selection and estimation in regression with grouped vari-
ables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1
(2006): 49-67.
[76] Ye, Jieping, and Jun Liu. Sparse methods for biomedical data. ACM Sigkdd Explorations
Newsletter 14.1 (2012): 4-15.
[77] Tibshirani, Robert, et al. Sparsity and smoothness via the fused lasso. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
[78] Zhou, Jiayu, et al. Modeling disease progression via fused sparse group lasso. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.
[79] Jenatton, Rodolphe, Jean-Yves Audibert, and Francis Bach. Structured variable selection
with sparsity-inducing norms. Journal of Machine Learning Research 12.Oct (2011): 2777-
2824.
[80] Kim, Seyoung, and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[81] Huang, J., T. Zhang, and D. Metaxas. Learning with structured sparsity. Proceedings of the 26th Annual International Conference on Machine Learning, pages 417-424. ACM, 2009.
[82] Tibshirani, Robert, and Pei Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9.1 (2008): 18-29.
[83] Yuan, Ming, and Yi Lin. Model selection and estimation in regression with grouped vari-
ables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1
(2006): 49-67.
[84] McAuley, James, et al. Subband correlation and robust speech recognition. IEEE Transac-
tions on Speech and Audio Processing 13.5 (2005): 956-964.
[85] Liu, Jun, and Jieping Ye. Moreau-Yosida regularization for grouped tree structure learning.
Advances in Neural Information Processing Systems. 2010.
[86] Jenatton, Rodolphe, et al. Proximal methods for sparse hierarchical dictionary learning.
Proceedings of the 27th international conference on machine learning (ICML-10). 2010.
[87] Perkins, Simon, and James Theiler. Online feature selection using grafting. Proceedings
of the 20th International Conference on Machine Learning (ICML-03). 2003.
[88] Zhou, Dengyong, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and
unlabeled data on a directed graph. Proceedings of the 22nd international conference on
Machine learning. ACM, 2005.
[89] Wu, Xindong, et al. Online streaming feature selection. Proceedings of the 27th interna-
tional conference on machine learning (ICML-10). 2010.
[90] Wang, Jialei, et al. Online feature selection and its applications. IEEE Transactions on
Knowledge and Data Engineering 26.3 (2014): 698-710.
[91] Zhou, Jing, et al. Streaming feature selection using alpha-investing. Proceedings of the
eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.
ACM, 2005.
[92] Kreyszig, Erwin. Advanced Engineering Mathematics. 4th ed. Wiley, 1979, p. 880, eq. 5. ISBN 0-471-02140-7.
[93] Golub, Todd R., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.5439 (1999): 531-537.
[94] Nutt, Catherine L., et al. Gene expression-based classification of malignant gliomas corre-
lates better with survival than histological classification. Cancer research 63.7 (2003): 1602-
1607.
[95] Yang, Kun, et al. A stable gene selection in microarray data analysis. BMC bioinformatics
7.1 (2006): 228.
[96] Bhattacharjee, Arindam, et al. Classification of human lung carcinomas by mRNA ex-
pression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National
Academy of Sciences 98.24 (2001): 13790-13795.
[97] Peng, Hanchuan, Fuhui Long, and Chris Ding. Feature selection based on mutual infor-
mation criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on pattern analysis and machine intelligence 27.8 (2005): 1226-1238.
[98] Alizadeh, Ash A., et al. Distinct types of diffuse large B-cell lymphoma identified by gene
expression profiling. Nature 403.6769 (2000): 503-511.
[99] Nutt, Catherine L., et al. Gene expression-based classification of malignant gliomas corre-
lates better with survival than histological classification. Cancer research 63.7 (2003): 1602-
1607.
[100] Nie, Feiping, et al. Efficient and robust feature selection via joint ℓ2,1-norms minimization. Advances in neural information processing systems. 2010.
[101] Xiong, Momiao, Xiangzhong Fang, and Jinying Zhao. Biomarker identification by fea-
ture wrappers. Genome Research 11.11 (2001): 1878-1887.
[102] Mukherjee, Sach, and Stephen J. Roberts. A theoretical analysis of gene selection. Com-
putational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE.
IEEE, 2004.
[103] http://featureselection.asu.edu/datasets.php
[104] Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university
press, 2004.
[105] Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3.1 (2011): 1-122.
[106] Demetriou, Christiana A., Jia Chen, Silvia Polidoro, Karin Van Veldhoven, Cyrille
Cuenin, Gianluca Campanella, Kevin Brennan et al. Methylome analysis and epigenetic
changes associated with menarcheal age. PloS one 8, no. 11 (2013): e79391.
  • 5.
    List of Figures 1.1Global and local maxima. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Estimation picture for the lasso (left) and the ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas the constraints regions |β1|+|β2| ≤ t and β2 1 +β2 2 ≤ t2, respectively, while the red ellipse are the contours of the least squares error function. . . . . . . . . . . . . 16 2.2 The ridge coefficients (green) are a reduced factor of the simple linear regres- sion coefficients (red) and thus never attain zero values but very small values, whereas the the lasso coefficients (blue) become zero in a certain range and are reduced by a constant factor, which explains there low magnitude in comparison to ridge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Soft thresholding function Sλ(x) = sign(x)(|x|−λ)+ is shown in blue (broken lines), along with the 45◦line in black. . . . . . . . . . . . . . . . . . . . . . . 19 3.1 A general Framework of Feature Selection for Classification. . . . . . . . . . . 23 3.2 Taxonomy of Algorithms for Feature Selection for Classification. . . . . . . . . 24 3.3 A General Framework for Wrapper Methods of Feature Selection for Classifi- cation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4 Illustration of Lasso, Group Lasso and Sparse Group Lasso. Features can be grouped into 4 disjoint groups G1,G2,G3,G4. Each cell denotes a feature and white color represents the corresponding cell with coefficient zero. . . . . . . . 28 3.5 An illustration of a simple index tree of height 3 with 8 features. . . . . . . . . 29 3.6 An illustration of the graph of 7 features {f1,f2,...,f7} and its representation A. 29 4.1 Cumulative sum of the rows of matrix C in descending order of · 1. The regularization parameter of the eq. 4.12 is set to 20 and 50 respectively. . . . . 35 4.2 A General Framework for Class-Specific Feature Selection [41]. . . . . . . . . 38 4.3 A Sparse Learning Based approach for Class-Specific Feature Selection. . . . . 40 5.1 Classification accuracy comparisons of seven feature selection algorithms on six data sets. SVM with 5-fold cross validation is used for classification. SCBA and SCBA-CSFS are our methods. . . . . . . . . . . . . . . . . . . . . . . . . 46 V
  • 6.
    LIST OF FIGURES 5.2Classification accuracy comparisons of seven feature selection algorithms on six datasets. SVM with 5-fold cross validation is used for classification. SCBA- CSFS is our method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 VI
  • 7.
    List of Tables 5.1Datasets Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Accuracy Score of SVM using 5-fold cross validation. Six TFS methods are compared against our methods. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features, SCBA- FS and SCBA-CSFS our methods. The best results are highlighted in bold. . . . 46 5.3 Accuracy Score of SVM using 5-fold cross validation. The GCSFS [41] frame- work using 5 traditional feature selector is compared against our SCBA-CSFS. RFS: Robust Feature Selector, FS: Fisher Score, mRmR: Minimum-Redundancy- Maximum-Relevance, BSL: all features and SCBA-CSFS our method. The best results are highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . 47 VII
  • 8.
    Abstract Feature selection (FS)plays a key role in several fields and in particular computational biology, making it possible to treat models with fewer variables, which in turn are easier to explain and might speed the experimental validation up, by providing valuable insight into the importance and their role. Here, we propose a novel procedure for FS conceiving a two-steps approach. Firstly, a sparse coding based learning technique is used to find the best subset of features for each class of the training data. In doing so, it is assumed that a class is represented by using a subset of features, called representatives, such that each sample, in a specific class, can be described as a linear combination of them. Secondly, the discovered feature subsets are fed to a class-specific feature selection scheme, to assess the effectiveness of the selected features in classification task. To this end, an ensemble of classifiers is built by training a classifier, one for each class on its own feature subset, i.e., the one discovered in the previous step and a proper decision rule is adopted to compute the ensemble responses. To assess the effectiveness of the proposed FS approach, a number of experiments have been performed on benchmark microarray data sets, in order to compare the performance to several FS techniques from literature. In all cases, the proposed FS methodology exhibits convincing results, often overcoming its competitors. 1
  • 9.
    Introduction In many areassuch as computer vision, signal/image processing, and bioinformatics, data are represented by high dimensional feature vectors and it is typical to solve problems through data analysis, particularly through the use of statistical and machine learning algorithms. This has motivated a lot of work in the area of dimensionality reduction, whose goal is to find compact representations of the data that can save memory and computational time and also improve the performance of algorithms that deal with the data. Moreover, dimensionality reduction can also improve our understanding and interpretation of the data. Feature selection aims to select a subset of features from the high dimensional feature set, trying to achieve a compact and proper data representation. A large number of developments on feature selection have been made in the literature and there are many recent reviews and workshops devoted to this topic. Feature selection has been playing a critical role in many applications such as bioinformatics, where a large amount of genomic and proteomic data are produced for biological and biomedical studies. In this work, we propose a novel Feature Selection framework called Sparse-Coding Based Approach for Class Specific Feature Selection (SCBA-CSFS) that simultaneously exploits the idea of Compressed Sensing and Class-Specific Feature Selection. Regarding the feature selec- tion step, the feature selection matrix is constrained to have sparse rows, which is formulated as 2,1-norm minimization term. To solve the proposed optimization problem, an efficient iterative algorithm is adopted. Preliminary experiments of our work are conducted on different bioinfor- matics datasets, which shows that the proposed approach (for specific datasets) outperforms the state-of-the-arts. 2
  • 10.
    Chapter 1 Mathematical Optimization Problems Inthis Chapter we give an overview of mathematical optimization, focusing in particular on the special role of convex optimization. In addition, we briefly review two optimization algorithms that are precursors to the Alternating Direction Method of Multipliers. 1.1 Introduction to Optimization Problems Mathematical Optimization problems are concerned with finding the values for one or for several decision variables that meet the objective(s) the best, without violating a given set of constraints [104, 105]. A mathematical optimization problem has the form minimize f0(x) subject to fi(x) ≤ bi, i = 1,...,m hi(x) = bi, i = 1,...,p (1.1) which describes the problem of finding an x that minimizes f0(x) among all x that satisfy the conditions fi(x) ≤ 0,i = 1,...,m and hi(x) ≤ 0,i = 1,...,p. We call the the vector x = (x1,...,xn) the optimization variable and the function f0 : Rn → R the objective function or cost function. The terms fi(x) ≤ 0 are called the inequality constraints, the corresponding functions hi : Rn → R are called the inequality constraints functions, and the constants b1,...,bm are the bounds for the constraints. If there are no constraints (i.e.,m = p = 0) then the problem (1.1) is unconstrained. The set of points for which the objective functions and all constraint functions are defined D = m i=0 dom fi ∩ p i=1 dom hi (1.2) is called the domain of the optimization problem (1.1). A point x ∈ D is feasible if it satisfies the constraints fi(x) ≤ bi,i = 1,...,m and hi(x) = bi,i = 1,...,p. The problem (1.1) is said to be feasible if there exists at least one feasible point, and unfeasible otherwise. The set of 3
  • 11.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS all feasible points is called the feasible set. A vector x is called a solution of the problem (1.1), if it has the smallest objective value among all vectors that satisfy the constraints: for any z with f1(z) ≤ b1,...,fm(z) ≤ bm, we have f0(z) ≥ f0(x ). We generally consider fami- lies or classes of optimization problems, characterized by particular forms of the objective and constraint functions. 1.2 Types of Optimization Problem and Traditional Numerical methods Traditional numerical methods are usually based on iterative search or heuristic algorithms. The former starts with a (deterministic or arbitrary) solution which is iteratively improved ac- cording to some deterministic rule, while the latter starts off with a more or less arbitrary initial solution and iteratively produces new solutions by some generation rule and evaluates these new solutions, thus eventually reporting the best solution found during the search process.Which type of method should and could be applied depends largely on the type of problem. 1.2.1 Linear optimization problems A Linear Programming (LP) is an optimization problem (1.1) in which the objective func- tion and the constraints functions f0,...,fm are linear, and satisfy fi(αx+βy) = αfi(x)+βfi(y) (1.3) for all x,y ∈ Rn and all α,β ∈ Rn. If the optimization problem is not linear, it is called Non Linear Programming (NLP). Since all linear functions are convex, LP problems are intrinsically easier to solve than general NLP, which may be non-convex. The most popular method for solving LP is the Simplex Algorithm, where the inequalities are first transformed into equalities by adding slack variables and then including and excluding base variables until the optimum is found. Though its worst case computational complexity is exponential, it is found to work quite efficiently for many instances. 1.2.2 Quadratic optimization problems A Quadratic Programming (QP) is an optimization problem in which the objective function and the (inequality) constraints are quadratic minimize (1/2)xT Px+qT x+r subject to Ax = 0 Gx ≤ q (1.4) 4
  • 12.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS where P ∈ Sn is the Hessian matrix of the objective function, q is its gradient, G ∈ Rm×n and A ∈ Rp×n contain the linear equality and not equality constraints, respectively. QP problems, like LP problems, have only one feasible region with "flat faces" on its surface (due to the linear constraints), but the optimal solution may be found anywhere within the region or on its surface. The quadratic objective function may be convex –which makes the problem easy to solve or non-convex, which makes it very difficult to solve. 1.2.3 Stochastic optimization problems A Stochastic Programming (SP) is an optimization problem where (some of the) data in- corporated in the objective function are uncertain. Usual approaches include assumption of different scenarios and sensitivity analyses. 1.2.4 Convex optimization problems Convex optimization problems are far more general than LP problems, but they share the desirable properties of LP problems: they can be solved quickly and reliably up to very large size –hundreds of thousands of variables and constraints. The issue has been that, unless your objective and constraints were linear, it was difficult to determine whether or not they were convex. A convex optimization problem is one of the form minimize f0(x) subject to fi(x) ≤ bi, i = 1,...,m aT i x = bi, i = 1,...,p (1.5) where f0,....fm are convex functions. Comparing eq. 1.5 with the general standard form prob- lem (1.1), the convex problem has three additional requirements: • The objective function must be convex. • The inequality constraint functions must be convex. • The equality constraint functions hi(x) = aT i x−bi must be affine. We immediately note an important property: The feasible set of a convex optimization prob- lem is convex, since it is the intersection of the domain of the problem. Another fundamental property of convex optimization problems is that any locally optimal points is also (globally) optimal. A non-convex optimization problem is any problem where the objective or any of the con- straints are non-convex. Such a problem may have multiple feasible regions and multiple locally optimal point within each set. It might take exponential time in the number of variables and constraints to determine that a non-convex problem is unfeasible, that the objective function is unbounded, or that an optimal solution is the "global optimum" across all feasible regions. 5
  • 13.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS Figure 1.1: Global and local maxima. 1.3 Nonlinear optimization Nonlinear optimization is the term used to describe an optimization problem when the objec- tive functions are not linear and it may be convex or non-convex. Sadly, there are no effective methods for solving the general NLP problem. Nonlinear functions, unlike linear functions, may involve variables that are raised to a power or multiplied or divided by other variables. They may also use transcendental functions such as exp, log, sine and cosine. 1.3.1 Local solution A locally optimal solution is one where there are no other feasible solutions with better function values in the neighborhood. Rather than seeking for the optimal x which minimizes the objective over all feasible points, we seek a point that is only locally optimal, which means that it minimizes the objective function among feasible points that are near it, but is not guaranteed to have a lower objective value than all other feasible points. In Convex Optimization problems, a locally optimal solution is also globally optimal. These include i) LP problems; ii) QP problems where the objective is positive definite (if minimizing, negative definite if maximizing); iii) NLP problems where the objective is a convex function (if minimizing; concave if maximizing) and the constraints form a convex set. Local optimization methods can be fast, can handle large-scale problems, and are widely applicable, since they only require differentiability of the objective and constraint functions. There are several disadvantages of local optimization methods, beyond (possibly) not finding the actual, globally optimal solution. The methods require an initial guess for the optimization 6
  • 14.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS variable. 1.3.2 Global solution A globally optimal solution is one where there are no other feasible solutions with a better function values. Global optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the actual global solution is very high. 7
  • 15.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS 1.4 Duality In mathematical optimization theory, the duality is the principle by which optimization prob- lems may be viewed from either of two perspectives, the primal problem or the dual problem. It is a powerful and widely employed tool in applied mathematics for a number of reasons. First, the dual problem is always convex even if the primal is not. Second, the number of variables in the dual is equal to the number of constraints in the primal, which is often less than the number of variables in the primal program. Third, the maximum value achieved by the dual problem is often equal to the minimum of the primal [1]. However, in general, the optimal values of the primal and dual problems need not be equal. When not equal, their difference is called the duality gap. 1.4.1 The Lagrangian We consider an optimization problem in the standard (1.5) minimize f0(x) subject to fi(x) ≤ bi, i = 1,...,m hi(x) = bi, i = 1,...,p (1.6) with variable x ∈ Rn. We assume its domain D = { m i=1 domfi} ∩ { p i=1 dom hi} is nonempty and denote the optimal value of 1.6 by p . We do not assume the problem ( 1.6) is convex. The basic idea in Lagrangian duality is to take the constraints in 1.6 into account by aug- menting the objective function with a weighted sum of the constraint functions. We define the Lagrangian L : Rn ×Rm ×Rp → R associated with the problem ( 1.6) as L(x,λ,ν) = f0(x)+ m i=1 λifi(x)+ p i=1 νihi(x) (1.7) with dom L = D ×Rm ×Rp. We refer to λi as the Lagrange multiplier associated with i-th in- equality constraint fi(x) ≤ 0; similarly we refer to νi as the Lagrange multiplier associated with the i-th equality constraint hi(x) = 0. The vectors λ and ν are called the Lagrange multipliers vectors associated with the problem ( 1.7). 1.4.2 The Lagrange dual function We define the Lagrange dual function g : Rm × Rp → R as the minimum value of the La- grangian over x: for λ ∈ Rm,ν ∈ Rp g(λ,ν) = inf x∈D L(x,λ,ν) = inf x∈D  f0(x)+ m i=1 λifi(x)+ p i=1 νihi(x)   (1.8) 8
  • 16.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS where the Lagrangian is unbounded below in x and the dual function takes on the value −∞. Since the dual function is the point-wise infimum of a family of affine functions of (λ,ν), it is concave, even when the problem (1.24) is not convex. 1.4.3 The Lagrange dual problem The Lagrange dual problem term, usually referred to as the Lagrangian dual problem is obtained by forming the Lagrangian (1.7) using: i) non-negative Lagrange multipliers; ii) to add the constraints to the objective function, iii) and then solving for some primal variable values that minimize the Lagrangian. This solution gives the primal variables as functions of the Lagrange multipliers, which are called dual variables, so that the new problem is to maximize the objective function with respect to the dual variables under the derived constraints on the dual variables (including at least the non-negativity). maximize g(λ,ν) subject to λ 0 (1.9) 1.4.4 Dual Ascent The Dual Ascent is an optimization method for solving the Lagrangian Dual problem. Con- sider the equality-constrained convex optimization problem minimize f(x) subject to Ax = b (1.10) with variable x ∈ Rn, where A ∈ Rm×n and f : Rn → R is convex. The Lagrangian for problem (1.10) is L(x,y) = f(x)+yT (Ax−b) (1.11) and the dual function is g(y) = inf β∈Rm L(x,y) = −f∗ (−AT y)−bT y (1.12) where y is the algebraic expression of the Lagrange multipliers, and f∗ is the convex conjugate of f. The associate dual problem is maximize g(y) (1.13) Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point x from a dual optimal point y∗ as x = argmin x L(x,y ) (1.14) 9
  • 17.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS providing there is only one minimizer of L(x,y ) whether f is strictly convex. The Dual problem is solved by using the gradient ascent. Assuming that g is differentiable, the gradient g(y,ν) can be evaluated as follows. We first find x+ = argminx L(x,y,ν), then we have g(y) = Ax+ − b, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates xk+1 = argmin x L(x,yk ) (1.15) yk+1 = yk +αk (Axk+1 −b) (1.16) where αk > 0 is a step size. The first step is an x-minimization step, and the second step is a dual variable update. The dual ascent method can be used even in cases when g is not differen- tiable. If αk is chosen appropriately and several other assumptions hold, then xk converges to an optimal point and yk converges to an optimal dual point. However, these assumptions do not hold in many applications, so dual ascent often cannot be used. 1.4.5 Augmented Lagrangian and the Method of Multipliers Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method and in particular to yield convergence without assumptions, like strict convexity or fitness of f. The augmented Lagrangian for (1.10) is Lp(x,y) = f(x)+yT (Ax−b)+ ρ 2 Ax−b 2 2 (1.17) where ρ > 0 is called the penalty parameter. The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem minimize f(x)+ ρ 2 Ax−b 2 2 subject to Ax = b (1.18) The gradient of the augmented dual function is found the same way as with the ordinary La- grangian. By applying dual ascent to the modified problem yields the algorithm xk+1 = argmin x L(x,yk ) (1.19) yk+1 = yk +ρk (Axk+1 −b) (1.20) this is known as the method of multipliers for solving (1.10). This is the same as standard dual ascent, except that x-minimization step uses the augmented Lagrangian, and the penalty param- eter ρ is used as the step size αk. The method of multipliers converges under far more general conditions than dual ascent, including cases when f takes on the value +∞ or is not strictly convex. Finally, the greatly improved convergence properties of the method of multipliers over 10
  • 18.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS dual ascent comes at cost. When f is separable, the augmented Lagrangian Lρ is not separable, so the x-minimization step cannot be carried out separately (parallelize). 1.5 Alternating Direction Method of Multipliers The Alternating Direction Method of Multipliers (ADMM) [105] is a Lagrangian based approach intended to blend the decomposability of dual ascent with the superior properties of the method of multipliers. Consider a problem of the form minimize x∈Rm,z∈Rn f(x)+g(z) subject to Ax+Bz = c (1.21) where f : Rm → R and g : Rn → R are convex functions and A ∈ Rn×d and B ∈ Rn×d are (known) matrices of constraints, and c ∈ Rd is a constrained vector. To solve this problem we introduce a vector y ∈ Rd of Lagrange multipliers associated with the constraint, and then consider the augmented Lagrangian Lp(x,z,y) = f(x)+g(z)+y(Ax+Bz −c)+ ρ 2 Ax+Bz −c 2 2 (1.22) where ρ > 0 is a small fixed parameter. The ADMM algorithm is based on minimizing the augmented Lagrangian problem (1.17) successively over x and z, and then applying a dual variable update to y.In doing so yields the updates xk+1 = arg min x∈Rm Lρ(x,zk ,yk ) (1.23) zk+1 = arg min z∈Rn Lρ(xk+1 ,z,yk ) (1.24) yk+1 = y +ρ(Axk+1 +Bzk+1 −c) (1.25) for iterations t = 0,1,2,.... The algorithm is very similar to dual ascent and the method of multipliers: It consists of an x-minimization step (1.23), a z-minimization step (1.24) and a dual variable update (1.25). As the method of multipliers, the dual update uses a step size equal to the augmented Lagrangian parameter ρ. The method of multipliers has the form (xk+1 ,zk+1 ) = argmin x,z Lp(x,z,yk ) (1.26) yk+1 = yk +ρ(Axk+1 +Bzk+1 −c) (1.27) Here the augmented Lagrangian is jointly minimized w.r.t. the two primal variables. In ADMM, on the other hand, x and z are updated in alternating or sequential fashion, which accounts for the term alternating direction. The ADMM framework has several advantages. First, convex problems with non-differentiable constraints can be easily handled by the separation of param- eters x and z. A second advantage of ADMM is its ability to break up a large problem into 11
  • 19.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS smaller pieces. For dataset with a large number of observations we can break up the data into blocks, and carry out the optimization over each block. 1.5.1 Convergence Under modest assumption on f and g, we get that ADMM iterates, for any ρ > 0, satisfy: • Residual convergence: rk → 0 as k → ∞, i.e., primal iterates approach feasibility. • Objective convergence: f(xk) + g(zk) → p as k → ∞, i.e., the objective function of the iterates approaches to the optimal value. • Dual variable convergence: yk → y as k → ∞, where y is a dual optimal point. Here, rk is the primal residual at iteration k as rk = Axk +Bzk −c. 1.5.2 Convergence in practice Simple examples show that ADMM can be very slow to convergen to high accuracy. How- ever, it is often the case that ADMM converges to modest accuracy –sufficient for many applications– within a few tens of iterations. This behavior makes ADMM similar to algorithms like the con- jugate gradient method. However, the slow convergence of ADMM also distinguishes it from algorithms such as Newton’s method, where the high accuracy can be attained in a reasonable amount of time. 1.5.3 Extension and Variations In practice, ADMM obtains a relatively accurate solution in a handful of iterations, but requires many iterations for a highly accurate solution. Hence, it behaves more like a first-order method than a second-order method. In order to attain superior convergence, many variations on the ADMM have been explored in literature. A standard extension is to use possible different penalty parameters ρk for each iteration, with the goal of improving the convergence in practice, as well as making performance less dependent on the initial choice of the penalty parameter. Though it can be difficult to prove the convergence of ADMM when ρ varies by iteration, the fixed ρ theory still applies if one just assumes that ρ becomes fixed after a finite number of iterations. A simple scheme that often works well is [2] ρk+1 =    τincrρk if rk 2 > µ sk 2 ρk/τdecr if sk 2 > µ rk 2 ρk otherwise, (1.28) where µ > 1, τincr > 1, and τdecr > 1 are parameters. Typical choices might be µ = 10 and τincr = τdecr = 2. 12
  • 20.
    CHAPTER 1. MATHEMATICALOPTIMIZATION PROBLEMS The ADMM update equation (1.28) suggests that large value of ρ place a large penalty on violations of primal feasibility and so tend to produce small primal residuals. Conversely, the definition of sk+1 suggests that small values of ρ tend to reduce the dual residual, but at the expense of reducing the penalty on primal feasibility, which may result in larger primal residual. Therefore, the adjustment scheme (1.28) inflates ρ by τincr when the primal residual appears large compared to the dual residual, and deflates ρ by τdecr when the primal residual seems too small relative to the dual residual. 13
  • 21.
    Chapter 2 Sparse statisticalmodels In this Chapter, we summarize the actively developing field of Statistical Learning with Sparsity. In particular, we introduce the LASSO estimator for linear regression. We describe the basic LASSO method, and outline a simple approach for its implementation. We relate and compare LASSO to Ridge Regression, and give a brief description of the latter. 2.1 Sparse Models Nowadays, large quantities of data are collected and mined in nearly every area of sci- ence, entertainment, business, and industry. Medical scientists study the genomes of patients to choose the best treatments, in order to learn the underlying causes of their disease. Thus the world is overwhelmed with data and there is a crucial need to sort through this mass of information, and pare it down to its bare essentials. For this process to be successful, we need to hope that the world is not as complex as it might be. For example, we hope that not all of the 30,000 or so genes in the human body are directly involved in the process that leads to the development of cancer. This points to an underlying assumption of simplicity. One form of simplicity is sparsity. Broadly speaking, a sparse statistical model is one in which only a relatively small number of parameters (or predictors) plays an important role. It represents a classic case of "less is more": a sparse model can be much easier to estimate and interpret than a dense model. The sparsity assumption allows us to tackle such problems and extract useful and reproducible patterns from big datasets. The leading example is linear regression, which we will discuss through this Chapter. 2.2 Introduction In linear regression settings, we are given N samples {(xi,yi)}N i=1, where each xi = (xi,1,..., xi,p) is a p−dimensional vector of features or predictors, and each yi ∈ R is the associated re- sponse variable. The goal is to approximate the response variable yi using a linear combination 14
  • 22.
    CHAPTER 2. SPARSESTATISTICAL MODELS of the predictors. A linear regression model assumes that η(xi) = β0 + p j=1 xijβj +ei (2.1) The model is parameterized by the vector of regression weights β = (β1,...,βp) ∈ Rp and the intercept (or "bias") term β0 ∈ R which are unknown parameters and ei is an error term. The method of Least Squares (LS) provides estimates of the parameters by the minimization of the following squared-loss function: minimize β0,β    N i=1 (yi −β0 − p j=1 xijβj)2    (2.2) Typically all of the LS estimates from (2.2) will be nonzero. This will make interpretation of the final model challenging if p is large. In fact, if p > N, the LS estimates are not unique. There is an infinite set of solutions that make the objective function equal to zero, and these solutions almost surely overfit the data as well. There are two reasons why we might consider an alternative to the LS estimate: 1. The prediction accuracy: the LS estimate often has low bias but large variance, and prediction accuracy can sometimes be improved by shrinking the values of the regression coefficients, or setting some coefficients to zero. By doing so, we introduce some bias but reduce the variance of the predicted values, hence this may improve the overall prediction accuracy (as measured in terms of the Mean Squared Error (MSE)). 2. The purposes of interpretation: with a large number of predictors, we often would like to identify a smaller subset of these predictors that exhibit the strongest effects. Thus, there is a need to constrain, or regularize the estimation process. The lasso or 1-regularized regression is a method that combines the LS loss (2.2) with a 1-constraint, or bound on the sum of the absolute values of the coefficients. Relative to the LS solution, this constraint has the ef- fect of shrinking the coefficients, and even setting some to zero. In this way, it provides an automatic way for doing model selection in linear regression. Moreover, unlike some other cri- teria for model selection, the resulting optimization problem is convex, and can be efficiently solved for large problems. 2.3 Ridge Regression The Ridge Regression performs 2 regularization, i.e., it adds a factor of sum of squares of coefficients in the optimization objective. Thus, ridge regression optimizes the following minimize β∈Rp    1 2N y−Xδ 2 2 loss +δ β 2 2 penalty    (2.3) 15
  • 23.
    CHAPTER 2. SPARSESTATISTICAL MODELS where δ is the parameter which balances the amount of emphasis given to minimizing the Resid- ual Sum of Squares (RSS) versus minimizing the Sum of Square of Coefficients (SSO). The δ value can take different values: • δ = 0: The objective becomes simple as Linear Regression and we will get the same coefficients as a simple linear regression; • δ = ∞: The coefficients will be zero because of infinite weighting on square of coeffi- cients, anything less than zero will make the objective infinite; • 0 < δ < ∞: The magnitude of δ will decide the weighting given to different parts of ob- jective. The coefficients will be somewhere between 0 and 1 for simple linear regression. 2.4 Least Absolute Shrinkage and Selection Operator (LASSO) In statistics and machine learning, Least Absolute Shrinkage and Selection Operator (LASSO) [28] is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Figure 2.1: Estimation picture for the lasso (left) and the ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas the constraints regions |β1|+|β2| ≤ t and β2 1 +β2 2 ≤ t2, respectively, while the red ellipse are the contours of the least squares error function. 16
  • 24.
    CHAPTER 2. SPARSESTATISTICAL MODELS 2.4.1 The LASSO estimator Given a collection of N predictors-response pairs {(xi,yi)}N i=1, the LASSO finds the solu- tion ( ˆβ0, ˆβ) to the optimization problem minimize β0,β    1 2N N i=1 (yi −β0 − p j=1 xijβj)2    subject to p j=1 |βj| ≤ t. (2.4) The constraint p j=1 |βj| ≤ t, can be written as the l1-norm constraint β 1 ≤ t. Furthermore (2.4) is often represented using matrix-vector notation. Let y = (yi,...,yN ) denotes the N-vector of responses, and X be the N × p matrix with xi ∈ Rp in its ith row, then the optimization problem (2.4) can be re-expressed as a Linear Regression problem minimize β0,β    1 2N y−β01−Xβ 2 2    subject to β 1 ≤ t (2.5) where 1 is the vector of N ones, and · 2 denotes the usual Euclidean norm on vectors. The bound t is a kind of ‘budget’: It limits the sum of the absolute values of the parameters estimates. It is often convenient to rewrite the LASSO problem in the so-called Lagrangian form (1.7) minimize β∈Rp    1 2N y−Xβ 2 2 loss +λ β 1 penalty    (2.6) for some λ ≥ 0. The 1 penalty will promote sparse solutions. This means that as λ is increased, elements of β will become exactly zero. Due to the non-differentiability of the penalty function, there are no closed-form solutions to equation (2.3). 2.4.2 LASSO Regression LASSO Regression is a powerful technique generally used for creating models that deal with ‘large’ number of features. Lasso regression performs 1 regularization, i.e., it adds a factor of sum of absolute value of coefficients in the optimization objective. It works by penalizing the magnitude of the coefficients/estimates along with minimizing the error between predicted and actual observations –the larger the penalty applied, the further coefficients are shrunk towards 17
  • 25.
    CHAPTER 2. SPARSESTATISTICAL MODELS Figure 2.2: The ridge coefficients (green) are a reduced factor of the simple linear regression coefficients (red) and thus never attain zero values but very small values, whereas the the lasso coefficients (blue) become zero in a certain range and are reduced by a constant factor, which explains there low magnitude in comparison to ridge. zero. Thus, LASSO regression optimizes the following minimize β∈Rp    1 2N y−Xβ 2 2 +λ β 1    (2.7) where λ works similar to problem (2.3) and provides a trade-off between balancing RSS and the Sum of Coefficients (SOC). As in ridge regression, λ can take various values: • δ = 0: The objective becomes sample as Linear Regression and we will get the same coefficients as a simple Linear Regression; • δ = ∞: The coefficients will be zero as for the Ridge; • 0 < δ < ∞: The coefficients will be between 0 and 1 for simple Linear Regression. 2.4.3 Computation of LASSO solution The LASSO problem is a convex program, specifically a QP with a convex constraint. As such, there are many sophisticated QP methods for solving the LASSO. However, there is a par- ticularly simple and effective computational algorithm that gives insight into how the LASSO works. For convenience, we rewrite the criterion in Lagrangian form minimize β∈Rp    1 2N N i=1 (yi − p j=1 xijβj)2 +λ p j=1 |βj|    . (2.8) As discussed in Chapter 1, the Lagrangian form is especially convenient for numerical compu- tation of the solution by using techniques such as the Dual Ascent, the Augmented Lagrangian 18
  • 26.
    CHAPTER 2. SPARSESTATISTICAL MODELS and the Method of Multipliers or the Alternating Direction Method of Multipliers. Figure 2.3: Soft thresholding function Sλ(x) = sign(x)(|x|−λ)+ is shown in blue (broken lines), along with the 45◦line in black. 2.4.4 Single Predictor: Soft Thresholding Soft thresholding is becoming a very popular tool in computer vision and machine learning. Essentially it allows to tackle the problem (2.8) and solve it in a very fast fashion way. Since 1 penalties are being used nearly everywhere at the moment, the property of soft thresholding to efficiently find the solution to the above form ( 2.8) becomes very useful. Let’s first consider a single predictor setting, based on samples {(zi,yi)}N i=1. The problem then is to solve minimize β    1 2N N i=1 (yi −ziβ)2 +λ|β|    . (2.9) The standard approach to this univariate minimization problem, would be to take the gradient (first derivative) with respect to β, and set it to zero. There is a complication, however, because the absolute value function does not have a derivative at |β| = 0. However, we can proceed by direct inspection of the function (2.9), and find that ˆβ =    1 N x,y −λ if 1 N z,y > λ 0 if 1 N z,y ≤ λ 1 N x,y +λ if 1 N z,y < −λ (2.10) which can be shortly written as ˆβ = Sλ( 1 N z,y ). (2.11) 19
  • 27.
    CHAPTER 2. SPARSESTATISTICAL MODELS Here the soft-thresholding operator Sλ(x) = sign(x)|x|−λ)+ (2.12) translates its argument x toward zero by the amount λ, and sets it to zero if |x| ≤ λ. 2.5 Key Difference Although similar in formulation, the benefits and properties for Ridge and LASSO are quite different. In Ridge Regression, we aim to reduce the variance of the estimators and predictions, which is particularly helpful in the presence of multicollinearity. It includes all (or none) of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity. On the other hand, LASSO is a tool for model (predictor) selection and consequently for the improvement of interpretability. A fundamental difference among these two regression models, is that the penalty term for LASSO uses the 1 norm and ridge uses the squared 2 norm. This difference has a simple impli- cation on the solution of the two optimization problems. In fact, the 2 penalty of ridge regres- sion leads to a shrinkage of the regression coefficients, much like the 1 penalty of the LASSO, but the coefficients are not forced to be exactly zero for finite values of λ. This phenomenon of the coefficients being zero is called sparsity. In addition, a benefit of ridge regression is that a unique solution is available, also when the data matrix X is rank deficient, e.g., when there are more predictors than observations (p>N). 20
  • 28.
    Chapter 3 Data SizeReduction In this Chapter, we will introduce the problem of Curse of Dimensionality and, examine the two main approaches used for tackling this problem. We will then, give an overview of the main approaches for feature selection, particularly focusing on algorithms for flat feature selection, where features are assumed to be independent. Finally, we will examine some application domains where these methods are mostly relevant. 3.1 Dimensionality Reduction Nowadays, the growth of the high-throughput technologies has resulted in exponential growth in the harvested data w.r.t both dimensionality and sample size. Efficient and effective manage- ment of this data becomes increasingly challenging. Traditionally, manual management of these datasets has proved to be impractical. Therefore, data mining and machine learning techniques were developed to automatically discover knowledge and recognize patterns from the data. In industry, this trend has been referred to as ‘Big Data’, and it has a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, fi- nance, marketing, etc. However, this collected data is usually associated with a high level of noise. There are many variables that can cause noise in the data, some examples of this are an imperfection in the technology that compiles the data or the source of the data itself. Di- mensionality reduction is one of the most popular techniques to remove noise and redundant features. Dimensionality reduction is important, because a high number of features in a dataset, comparable to or higher than the number of samples, leads to model overfitting, which in turn leads to poor results on the testing datasets. Additionally, constructing models from datasets with many features is more computationally demanding [3]. Dataset size reduction [4] can be performed in one of the two ways: feature set reduction or sample set reduction. In this thesis, we focus on feature set reduction. The feature set reduction, also known as dimensionality reduction can be mainly categorized into feature extraction and feature selection. 21
  • 29.
    CHAPTER 3. DATASIZE REDUCTION 3.2 Feature extraction Feature extraction maps the original feature space to a new feature space with lower dimen- sion. It is difficult to link the features from original feature space to new features. Therefore, further analysis of new features is problematic since there is no physical meaning for the trans- formed features obtained from feature extraction techniques. Features extraction can be used in this context to reduce complexity and give a simple representation of data representing each variable in feature space as a linear combination of original input variable. The most popular and widely used feature extraction techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Canonical Correlation Analysis (CCA) and Multidimen- sional Scaling [4]. 3.3 Feature selection Feature selection –the process of selecting a subset of relevant features– is a key component in building robust machine learning models for classification, clustering, and other tasks. It has been playing an important role in many applications since it can speed up the learning process, leads to better learning performance (e.g., higher learning accuracy for classification), lower computational costs, improve the model interpretability and alleviate the effect of the Curse of Dimensionality [5]. Moreover, with the presence of a large number of features, a learning model tends to overfit, resulting in their performance to degenerate. To address the problem of the curse of dimensionality, dimensionality reduction techniques have been studied. This is an important branch in the machine learning and data mining research areas. In addition, unlike feature extraction, feature selection selects a subset of features from the original feature set without any transformation, and maintains the physical meanings of the original features. In this sense, feature selection is superior in terms of better readability and interpretability. This property has its significance in many practical applications such as finding relevant genes to a specific disease and building a sentiment lexicon for sentiment analysis. For the classification problem, feature selection aims to select subset of highly discriminant features which are capable of discriminating samples that belong to different classes. Depending on the training set is labelled or not, feature selection algorithms can be catego- rized into supervised [47, 48], unsupervised [49, 50] and semi-supervised learning algorithms [51, 52]. For supervised learning, feature selection algorithms maximize some functions of the predictive accuracy. Because we are given class labels, it is natural that we want to keep only the features that are related to or lead to these classes. For unsupervised learning, feature selection is a less constrained search problem without class labels, depending on clustering quality mea- sures [53], and can eventuate in many equally valid feature subsets. Because there is no label information directly available, its much more difficult to select discriminative features. Super- vised feature selection assesses the relevance of features guided by the label information but a good selector needs enough labeled data, which in turn is time consuming. While unsupervised 22
  • 30.
    CHAPTER 3. DATASIZE REDUCTION Figure 3.1: A general Framework of Feature Selection for Classification. feature selection works with unlabeled data, it is difficult to evaluate the relevance of features. It is common to have a data set with huge dimensionality but small labeled-sample size. The combination of the two data characteristics manifests a new research challenge. Under the as- sumption that labeled and unlabeled data are sampled from the same population generated by target concept, semi-supervised feature selection makes use of both labeled and unlabeled data to estimate feature relevance [54]. Typically, a feature selection method consists of four basic steps [67], namely, subset gen- eration, subset evaluation, stopping criterion, and result validation. In the first step, a candidate feature subset will be chosen based on a given search strategy, which is sent, in the second step, to be evaluated according to certain evaluation criterion. The subset that best fits the evaluation criterion will be chosen from all the candidates that have been evaluated after the stopping cri- terion is met. In the final step, the chosen subset will be validated using domain knowledge or a validation set. Feature selection is an NP-hard problem [7], since it requires to evaluate (if there are n features in total) n m combinations to find the optimal subset of m n features. Therefore, sub-optimal search strategies are considered. 3.4 Feature Selection for Classification Feature selection is based on the terms of feature relevance and redundancy w.r.t. the goal (i.e., classification). More specifically, a feature is usually categorized as: 1) strongly relevant, 2) weakly relevant, but not redundant, 3) irrelevant, and 4) redundant [55, 56]. A strongly rele- vant feature is always necessary for an optimal feature subset; it cannot be removed without af- fecting the original conditional target distribution [55]. Weakly relevant feature may not always be necessary for an optimal subset, this may depend on certain conditions. Irrelevant features are not directly associated with the target concept but affect the learning process. Redundant 23
  • 31.
    CHAPTER 3. DATASIZE REDUCTION Figure 3.2: Taxonomy of Algorithms for Feature Selection for Classification. features are those that are weakly relevant but can be completely replaced with a set of other features. Redundancy is thus always inspected in multivariate case (when examining feature subset), whereas relevance is established for individual features. The aim of feature selection is to maximize relevance and minimize redundancy. It usually includes finding a feature subset consisting of only relevant features. In many classification problems, it is difficult to learn good classifiers before removing these unwanted features, due to the huge size of the data. Reducing the number of irrelevant/redundant features can drastically reduce the running time of the learn- ing algorithms and yields a more general classifier. A general feature selection for classification framework is demonstrated in Figure 3.1. Usually, feature selection for classification attempts to select the subset of feature with minimal dimension according to the following criteria: 1) the classification accuracy does not significantly decrease and, 2) the resulting class distribution –giving only the values for the selected features, is as close as possible to the original class distribution, giving all features. The selection process can be achieved in a number of ways depending on the goal, the resources at hand, and the desired level of optimization. In this chapter, we categorize feature selection for classification into three classes (Fig. 3.2): 1. Methods for flat features. 2. Methods for structured features. 3. Methods for streaming features. 3.5 Flat Feature Selection Techniques In this section, we will review algorithms for flat features, where features are assumed to be independent. Algorithms in this category can be mainly classified into filters, wrappers, embedded and hybrid methods. 24
Figure 3.3: A general framework for wrapper methods of feature selection for classification.

3.5.1 Filter methods

Filter methods select features based on a performance measure without utilizing any classification algorithm [57]. A typical filter algorithm consists of two steps: in the first step it ranks features based on certain criteria; a filter algorithm can rank individual features or evaluate entire feature subsets. We can roughly classify the measures developed for feature filtering into information, distance, consistency, similarity, and statistical measures. Moreover, filter methods can be either univariate or multivariate. Univariate feature filters rank each feature independently of the rest of the feature space, whereas multivariate feature filters evaluate an entire feature subset (batch-wise); the multivariate approach is therefore able to handle redundant features. Among the filter methods we can highlight Relief-F [8], the Fisher score [9], the Laplacian Score [10], and so on. Filter models select features independently of any specific classifier. However, the major disadvantage of the filter approach is that it totally ignores the effect of the selected feature subset on the performance of the induction algorithm [58, 59].

3.5.2 Wrapper methods

Wrapper models utilize a specific classifier to evaluate the quality of the selected features, and offer a simple and powerful way to address the problem of feature selection, regardless of the chosen learning machine [58, 60]. Wrapper methods optimize a specific classifier as part of the selection process in order to evaluate the quality of the selected features. Thus, for classification tasks, a wrapper will evaluate subsets based on the performance of a classifier (e.g., Naïve Bayes or SVM) [12, 13], whereas for clustering, a wrapper will evaluate subsets based on the performance of a clustering algorithm (e.g., K-means) [14].
Given a predefined classifier, a typical wrapper model will perform the following steps:

1. Searching a subset of features.

2. Evaluating the selected subset of features by the performance of the classifier.

3. Repeating steps 1 and 2 until the desired quality is reached.

A general representation of wrapper methods for feature selection for classification is shown in Fig. 3.3. Wrapper methods tend to give better results, but are much slower than filters in finding sufficiently good subsets because they inherit the resource demands of the modelling algorithm. Therefore, when datasets with a huge number of features must be handled effectively, filter methods are indispensable for obtaining a reduced set of features that can then be treated by other, more expensive feature selection methods.

3.5.3 Embedded and hybrid methods

Embedded methods perform feature selection by injecting the selection process into the modelling algorithm's execution. There are three types of embedded methods. The first are pruning methods, which initially use all features to train a model and then attempt to eliminate some features by setting the corresponding coefficients to zero while maintaining model performance, such as recursive feature elimination with a support vector machine (SVM) [61]. The second are models with a built-in mechanism for feature selection, such as ID3 [62] and C4.5 [63]. The third are regularization models, whose objective functions minimize the fitting error while at the same time forcing the coefficients to be small or exactly zero; features whose coefficients are close to zero are then eliminated [64].

3.5.4 Regularization models

Recently, regularization models have attracted much attention in the feature selection context due to the good compromise they offer between accuracy and computational complexity. In this section, we only consider linear classifiers $w$, in which the classification of $Y$ can be based on a linear combination of $X$, such as SVM and logistic regression. In regularization methods, classifier induction and feature selection are achieved simultaneously by estimating $w$ with properly tuned penalties. The learned classifier $w$ can have coefficients exactly equal to zero. Since each coefficient of $w$ corresponds to one feature (i.e., $w_i$ to $f_i$), only features with nonzero coefficients in $w$ will be used in the classifier. Specifically, we define $\hat{w}$ as

$\hat{w} = \arg\min_{w} \; f(w, X) + \alpha\, g(w)$   (3.1)

where $f(\cdot)$ is the classification objective function, $g(w)$ is a regularization term, and $\alpha$ is the regularization parameter controlling the trade-off between $f(\cdot)$ and the penalty.
Popular choices of $f(\cdot)$ include the quadratic loss, as in least squares, the hinge loss, as in $\ell_1$-SVM [65], and the logistic loss, as in BlogReg [66]:

$f(w, X) = \sum_{i=1}^{n} (y_i - w^{\top} x_i)^2$   (Quadratic loss)   (3.2)

$f(w, X) = \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^{\top} x_i)$   (Hinge loss)   (3.3)

$f(w, X) = \sum_{i=1}^{n} \log\left(1 + \exp(-y_i (w^{\top} x_i + b))\right)$   (Logistic loss)   (3.4)

One of the most important embedded models based on regularization is LASSO regularization [28], which is based on the $\ell_1$-norm of the coefficients of $w$ and is defined as

$g(w) = \sum_{i=1}^{n} |w_i|$   ($\ell_1$ penalty)   (3.5)

An important property of $\ell_1$ regularization is that it can produce an estimate of $w$ with exactly zero coefficients. In other words, there are zero entries in $w$, which means that the corresponding features are eliminated during the classifier learning process. Therefore, it can be used for feature selection. Other representative embedded models based on regularization are Adaptive LASSO [68], Bridge regularization [69, 70] and Elastic net regularization [71]. Unlike LASSO regularization, the latter methods are used, for instance, to counteract the known issue of LASSO estimates being biased for large coefficients (Adaptive LASSO) or to handle features with high correlations (Elastic net regularization).

3.6 Algorithms for Structured Features

The models introduced so far totally ignore feature structure and assume that features are independent [76]. However, in many real-world applications the features exhibit certain intrinsic structures, e.g., spatial or temporal smoothness [77, 78], disjoint/overlapping groups [79], trees [80], and graphs [81]. Incorporating knowledge about the structure of the features may significantly improve classification performance and help identify the important features. For example, in the study of arrayCGH [77, 82], the features (the DNA copy numbers along the genome) have a natural spatial order; incorporating this structural information through an extension of the $\ell_1$-norm outperforms the LASSO in both classification and feature selection. A very popular and successful approach to learning linear classifiers with structured features is to minimize an empirical error penalized by a regularization term:

$\hat{w} = \arg\min_{w} \; f(w^{\top} X, Y) + \alpha\, g(w, \mathcal{G})$   (3.6)
Figure 3.4: Illustration of LASSO, Group LASSO and Sparse Group LASSO. Features are grouped into 4 disjoint groups G1, G2, G3, G4. Each cell denotes a feature and a white cell represents a coefficient equal to zero.

where $\mathcal{G}$ denotes the structure of the features and $\alpha$ controls the trade-off between data fitting and regularization. Equation 3.6 leads to sparse classifiers, which makes them particularly amenable to interpretation; this is often of primary importance in applications such as biology or the social sciences [83]. The algorithms for structured feature selection can be distinguished according to the type of structure they exploit:

• Features with group structure: these take into account real-world applications where features form groups. For instance, in speech and signal processing, different frequency bands can be represented by groups [84]. Depending on whether the features of the chosen groups are selected fully or partially, we identify two main algorithms, Group LASSO and Sparse Group LASSO (a sketch of the underlying proximal operator is given after this list). The former drives all coefficients of a group to zero together, whereas the latter also selects features within the selected groups, i.e., it performs simultaneous group selection and feature selection. Figure 3.4 illustrates the different solutions of LASSO, Group LASSO and Sparse Group LASSO for 4 disjoint groups {G1, G2, G3, G4}.

• Features with tree structure: these represent features using certain tree structures. For instance, genes/proteins may form hierarchical tree structures [85]; the pixels of a face image can be represented as a tree, where each parent node contains a series of child nodes that enjoy spatial locality. One of the best-known algorithms exploiting this structure is Tree-guided group LASSO regularization [80, 85, 86]. A simple representation of this structure is depicted in Figure 3.5.

• Features with graph structure: these take into account real-world applications where dependencies exist between features. For example, many biological studies have suggested that genes tend to work in groups according to their biological functions, and that there are regulatory relationships between genes. In these cases, features form an undirected graph, where the nodes represent the features and the edges the relationships between them.
For features with graph structure, a subset of highly connected features in the graph is likely to be selected or not selected as a whole. For example, in Figure 3.6, {f5, f6, f7} are selected, while {f1, f2, f3, f4} are not.

Figure 3.5: An illustration of a simple index tree of height 3 with 8 features.

Figure 3.6: An illustration of a graph of 7 features {f1, f2, ..., f7} and its representation A.
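As a concrete illustration of how group structure induces sparsity, the sketch below implements the proximal operator of the (non-overlapping) Group LASSO penalty, i.e., block soft-thresholding of each group of coefficients. It assumes NumPy is available; the group layout and the threshold value are hypothetical example values, and this is only a minimal sketch of the mechanism, not a full Group LASSO solver.

```python
import numpy as np

def group_soft_threshold(w, groups, threshold):
    """Proximal operator of the (non-overlapping) Group LASSO penalty.

    Each group of coefficients is shrunk as a block: if the group's l2 norm
    falls below `threshold`, the whole group is set to zero (group discarded);
    otherwise the group is rescaled while keeping its direction.
    """
    w_new = np.zeros_like(w)
    for idx in groups:                      # idx: array of feature indices in one group
        g = w[idx]
        norm = np.linalg.norm(g)
        if norm > threshold:
            w_new[idx] = (1.0 - threshold / norm) * g
        # else: the entire group stays at zero
    return w_new

# Toy example with 4 disjoint groups, as in Figure 3.4 (hypothetical values).
w = np.array([0.9, 1.2, -0.1, 0.05, 0.08, 2.0, -1.5, 0.3])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4]), np.array([5, 6, 7])]
print(group_soft_threshold(w, groups, threshold=0.5))
```

In the output, the second and third groups are driven to zero as a whole, which is exactly the group-level selection behaviour depicted in Figure 3.4.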
3.7 Algorithms for Streaming Features

A further problem concerns the selection of features when they are presented sequentially to the classifier for potential inclusion in the model [87, 88, 89, 90]. All the methods introduced above assume that all features are known in advance. In this different scenario, instead, the candidate features are generated dynamically and the total number of features is unknown. We call such features streaming features, and feature selection for streaming features is called streaming feature selection. Streaming feature selection has practical significance in many applications. For example, the well-known microblogging website Twitter produces more than 250 million tweets per day, and many new words (features) are generated, such as abbreviations. When performing feature selection for tweets, it is not practical to wait until all features have been generated, so a streaming feature selection scheme may be preferable. A typical streaming feature selection procedure performs the following steps:

1. Generating a new feature.

2. Determining whether to add the newly generated feature to the set of currently selected features.

3. Determining whether to remove features from the set of currently selected features.

4. Repeating steps 1 to 3.

Since steps 2 and 3 account for most of the above procedure, many algorithms have been proposed for them in recent years (e.g., the Grafting algorithm [87], the Alpha-investing algorithm [91], and the Online Streaming Feature Selection algorithm [89]).
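The loop below is a minimal, self-contained sketch of the streaming procedure just described. The acceptance and removal criteria used here (a simple correlation threshold against the target and a redundancy check against already selected features) are illustrative assumptions only; they are not the actual tests used by Grafting, Alpha-investing or OSFS.

```python
import numpy as np

def streaming_feature_selection(feature_stream, y, add_thr=0.3, red_thr=0.9):
    """Skeleton of streaming feature selection.

    feature_stream : iterable yielding (feature_id, values) pairs, one at a time.
    y              : target labels/values.
    add_thr        : minimum |correlation| with y required to add a feature.
    red_thr        : a new feature makes an old one removable if the two are
                     highly correlated and the new one is more relevant.
    """
    selected = {}                                # feature_id -> (values, relevance)
    for fid, x in feature_stream:                # step 1: a new feature arrives
        rel = abs(np.corrcoef(x, y)[0, 1])
        if rel < add_thr:                        # step 2: decide whether to add it
            continue
        # step 3: decide whether previously selected features became redundant
        for old_id, (x_old, rel_old) in list(selected.items()):
            if abs(np.corrcoef(x, x_old)[0, 1]) > red_thr and rel > rel_old:
                del selected[old_id]
        selected[fid] = (x, rel)
    return sorted(selected)

# Toy usage: 50 samples, features generated one at a time (step 4 is the loop itself).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50).astype(float)
stream = ((i, y + rng.normal(scale=1.0 + i, size=50)) for i in range(5))
print(streaming_feature_selection(stream, y))
```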
3.8 Feature Selection Application Domains

The choice of a feature selection method strongly depends on the application area at hand. In the following subsections we give a brief review of the most important feature selection methods for some well-known application domains.

3.8.1 Text Mining

In text mining, the standard way of representing a document is the bag-of-words model. The idea is to model each document with the counts of the words occurring in that document. Feature vectors are typically formed so that each feature (i.e., each element of the feature vector) represents the count of a specific word, an alternative being to simply indicate the presence/absence of a word using a binary representation. The set of words whose occurrences are counted is called a vocabulary. Given a dataset that needs to be represented, one can use all the words from all the documents in the dataset to build the vocabulary and then prune the vocabulary using feature selection. It is common to apply a degree of pre-processing prior to feature selection, typically including the removal of rare words with only a few occurrences, the removal of overly common words (e.g., "a", "the", "and" and similar) and the grouping of differently inflected forms of a word (lemmatization, stemming) [22].

3.8.2 Image Processing

Representing images is not a straightforward task, as the number of possible image features is practically unlimited [23]. The choice of features typically depends on the target application. Examples of features include histograms of oriented gradients, edge orientation histograms, Haar wavelets, raw pixels, gradient values, edges, color channels, etc. [24]. Some authors [25] studied the use of a hybrid combination of feature selection algorithms for image classification, which yields better performance than using Relief or SFFS/SFBS alone; in cases where there are no irrelevant or redundant features in the dataset, the proposed algorithm does not degrade performance.

3.8.3 Bioinformatics

In the past ten years feature selection has seen much activity, primarily due to the advances in bioinformatics, where large amounts of genomic and proteomic data are produced for biological and biomedical studies. An interesting bioinformatics application of feature selection is biomarker discovery from genomics data. In genomics data (DNA microarrays), individual features correspond to the expression levels of thousands of genes in a single experiment. Gene expression data usually contains a large number of genes but a small number of samples, so it is critical to identify the most relevant features, which may give important knowledge about the genes that are most discriminative for a particular problem.
In fact, a given disease or a biological function is usually associated with only a few genes [26]. Selecting a few relevant genes out of several thousands thus becomes a key problem in bioinformatics research [27]. In proteomics, high-throughput mass spectrometry (MS) screening measures the molecular weights of individual bio-molecules (such as proteins and nucleic acids) and has the potential to discover putative proteomic bio-markers. Each spectrum is composed of peak amplitude measurements at approximately 15,500 features, each represented by a corresponding mass-to-charge value. The identification of meaningful proteomic features from MS is crucial for disease diagnosis and protein-based bio-marker profiling [27]. Many feature selection methods have been proposed and applied in bioinformatics applications. Widely used filter-type feature selection methods include the F-statistic [29], ReliefF [8], mRMR [30], the t-test, and Information Gain [31], which compute the sensitivity (correlation or relevance) of a feature w.r.t. the class label distribution of the data. These methods can be characterized as using global statistical information.
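To make the filter-type approach concrete, the sketch below ranks microarray-style features with two of the univariate criteria just mentioned (an F-statistic and mutual information) and keeps the top-k genes. It assumes scikit-learn is available and uses randomly generated data in place of a real gene expression matrix; it is only an illustration of the generic filtering step, not of any specific method proposed in this thesis.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in for a gene expression matrix: 60 samples x 500 genes, 2 classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 500))
y = rng.integers(0, 2, size=60)
X[y == 1, :10] += 1.5          # make the first 10 "genes" informative

# Univariate filter ranking with the F-statistic (ANOVA).
F, _ = f_classif(X, y)
top_f = np.argsort(F)[::-1][:20]

# The same filtering step with mutual information as the relevance measure.
mi = mutual_info_classif(X, y, random_state=0)
top_mi = np.argsort(mi)[::-1][:20]

print("Top genes by F-statistic:       ", sorted(top_f))
print("Top genes by mutual information:", sorted(top_mi))
```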
Chapter 4

Class-Specific Feature Selection Methodology

In this chapter we introduce a sparse learning-based feature selection approach that aims to reduce the feature space based on concepts from Compressed Sensing. Basically, this approach is a joint sparse multiple optimization problem [37] which tries to find a subset of features, called representatives, that can best reconstruct the entire dataset by linearly combining each feature component. We then propose a Class-Specific Feature Selection (CSFS) model [41] based on this idea. It transforms a c-class problem into c sub-problems, one for each class, where the instances of a class are used to decide which features best represent (up to an error) the data of that class. For CSFS, the feature subset selected for each sub-problem is assigned to the class from which that sub-problem was constructed. In order to classify new instances, our framework uses an ensemble of classifiers where, for each class, a classifier is trained on the whole training set but using only the feature subset assigned to that class. Finally, our framework applies an ad-hoc decision rule to combine the classifier outputs. In doing so, it tries to improve both the feature selection and the classification process, since it first finds the best feature subsets for representing each class of a dataset and then uses them to build an ensemble of classifiers that improves the classification accuracy.

4.1 Introduction

Given a set of features in $\mathbb{R}^m$ arranged as the columns of the data matrix $X = [x_1, \ldots, x_N]$, we consider the following optimization problem

$\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{\mathrm{row},0} \le k$   (4.1)

where $C \in \mathbb{R}^{N \times N}$ is the coefficient matrix and $\|C\|_{\mathrm{row},0}$ counts the number of nonzero rows of $C$. In a nutshell, we would like to find at most $k \ll N$ representative features which best represent/reconstruct the whole dataset.
This can be viewed as a Sparse Dictionary Learning (SDL) scheme in which the atoms of the dictionary are chosen from the data points. Unlike classic procedures for building the dictionary [33], in our formulation the dictionary is given by the data matrix $X$, which also plays the role of the measurements, and the unknown sparse code selects the features via convex optimization.

4.2 Problem formulation

Consider a set of features in $\mathbb{R}^m$ arranged as the columns of the data matrix $X = [x_1, \ldots, x_N]$. In this section we formulate the problem of finding representative features given a fixed feature space belonging to a collection of data points.

4.2.1 Learning compact dictionaries

Finding compact dictionaries to represent data has been well studied in the literature [34, 35, 36, 37, 38]. More specifically, in Dictionary Learning (DL) problems one tries to simultaneously learn a compact dictionary $D = [d_1, \ldots, d_m] \in \mathbb{R}^{k \times m}$ and coefficients $C = [c_1, \ldots, c_N] \in \mathbb{R}^{m \times N}$ that can well represent a collection of data points. The best representation of the data is typically obtained by minimizing the objective function

$\sum_{i=1}^{N} \|x_i - D c_i\|_2^2 = \|X - DC\|_F^2$   (4.2)

w.r.t. the dictionary $D$ and the coefficient matrix $C$, subject to appropriate constraints. In the SDL framework [34, 35, 37, 38], one requires the coefficient matrix $C$ to be sparse by solving the optimization program

$\min_{D, C} \; \|X - DC\|_F^2 \quad \text{subject to} \quad \|c_i\|_0 \le s, \;\; \|d_j\|_2 \le 1, \;\; \forall i, j$   (4.3)

where $\|c_i\|_0$ indicates the number of nonzero elements of $c_i$. In other words, one simultaneously learns a dictionary and coefficients such that each feature component is written as a linear combination of at most $s$ atoms of the dictionary (the selected features). Aside from being NP-hard due to the use of the $\ell_0$ norm, this problem is non-convex because of the product of the two unknown and constrained matrices $D$ and $C$. As a result, iterative procedures such as those introduced in Chapter 1 are employed to find each unknown matrix while keeping the other fixed.

4.2.2 Finding representative features

The learned atoms of the dictionary almost never correspond to the original feature space [37, 38, 40], and therefore cannot be considered good features for a collection of samples.
Figure 4.1: Cumulative sum of the rows of the matrix C, in descending order of their $\ell_1$ norms, with the regularization parameter of Eq. 4.12 set to (a) $\tau = 20$ and (b) $\tau = 50$.

In order to find a subset of features that best represents the entire feature space, we consider a modification of the DL framework which, first, addresses the problem of local minima by transforming the previous non-convex problem into a convex one and, second, enforces the selection of a representative set of features from a collection of samples. We do this by setting the dictionary $D$ to be the data matrix $X$ and minimizing the expression

$\sum_{i=1}^{N} \|x_i - X c_i\|_2^2 = \|X - XC\|_F^2$   (4.4)

w.r.t. the coefficient matrix $C = [c_1, \ldots, c_N] \in \mathbb{R}^{N \times N}$, subject to additional constraints that we describe next. In other words, we minimize the reconstruction error of each feature component obtained by linearly combining all components of the feature space. To choose $k \ll N$ representatives, which take part in the linear reconstruction of each component in (4.4), we enforce

$\|C\|_{0,q} \le k$   (4.5)
where the mixed $\ell_0/\ell_q$ norm is defined as $\|C\|_{0,q} \triangleq \sum_{i=1}^{N} I(\|c^i\|_q > 0)$, $c^i$ denotes the $i$-th row of $C$ and $I(\cdot)$ denotes the indicator function. In a nutshell, $\|C\|_{0,q}$ counts the number of nonzero rows of $C$. The indices of the nonzero rows of $C$ correspond to the indices of the columns of $X$ which are chosen as the representative features. As said before, the aim is to select $k \ll N$ representative features that can reconstruct each feature of the matrix $X$ up to a fixed error. In order to find them, we solve

$\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{0,q} \le k$   (4.6)

which is an NP-hard problem, as it implies a combinatorial search over every subset of $k$ columns of $X$. Therefore, we relax the $\ell_0$ norm to the $\ell_1$ norm, solving instead

$\min_{C} \; \|X - XC\|_F^2 \quad \text{subject to} \quad \|C\|_{1,q} \le \tau$   (4.7)

where $\|C\|_{1,q} \triangleq \sum_{i=1}^{N} \|c^i\|_q$ is the sum of the $\ell_q$ norms of the rows of $C$, and $\tau > 0$ is an appropriately chosen parameter. The solution of the optimization program (4.7) not only indicates the representative features through the nonzero rows of $C$, but also provides information about the ranking of the selected features. More precisely, a representative with a higher ranking takes part in the reconstruction process more than the others; hence, its corresponding row in the optimal coefficient matrix $C$ has many nonzero elements with large values. Conversely, a representative with a lower ranking takes part in the reconstruction process less than the others; hence, its corresponding row in $C$ has few nonzero elements, with smaller values. Thus, we can rank $k$ representative features $y_{i_1}, \ldots, y_{i_k}$ as $i_1 \ge i_2 \ge \cdots \ge i_k$ whenever the corresponding rows of $C$ satisfy

$\|c^{i_1}\|_q \ge \|c^{i_2}\|_q \ge \cdots \ge \|c^{i_k}\|_q$   (4.8)

Another optimization problem closely related to (4.7) is

$\min_{C} \; \|C\|_{0,q} \quad \text{subject to} \quad \|X - XC\|_F \le \epsilon$   (4.9)

which minimizes the number of representatives that can represent the whole feature space up to an error $\epsilon$. As before, we relax the problem using the $\ell_1$ norm, obtaining

$\min_{C} \; \|C\|_{1,q} \quad \text{subject to} \quad \|X - XC\|_F \le \epsilon$   (4.10)

This optimization problem can also be viewed as a compression scheme in which we want to choose a few representatives that can reconstruct the data up to an error.
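Once a coefficient matrix C has been obtained by solving a program such as (4.7) or (4.10), selecting and ranking the representative features reduces to inspecting the row norms of C, as in (4.8). The snippet below is a minimal sketch of this post-processing step; the choice q = 2 and the toy matrix are assumptions made only for illustration.

```python
import numpy as np

def rank_representatives(C, k, q=2):
    """Rank features by the l_q norm of the corresponding rows of C (Eq. 4.8).

    Returns the indices of the k rows with the largest norms, i.e. the k most
    representative features, in decreasing order of importance, plus all row norms.
    """
    row_norms = np.linalg.norm(C, ord=q, axis=1)   # one norm per row of C
    order = np.argsort(row_norms)[::-1]            # descending
    return order[:k], row_norms

# Toy coefficient matrix: rows 0 and 3 are clearly the "active" (nonzero) rows.
C = np.array([[0.9, 0.1, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.01, 0.0, 0.0],
              [0.5, 0.0, 0.8, 0.2]])
top, norms = rank_representatives(C, k=2)
print("selected feature indices:", top, "row norms:", norms.round(3))
```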
4.3 Problem reformulation

Both optimization problems (4.7) and (4.10) can be expressed using Lagrange multipliers as

$\min_{C} \; \tfrac{1}{2}\|X - XC\|_F^2 + \lambda \|C\|_{1,q}$   (4.11)

which is nothing more than the LASSO problem described in Section 2.4. In practice, we use the ADMM described in Section 1.5 to find the representative features of a given dataset.

procedure ADMM for LASSO
    Data: $X \in \mathbb{R}^{m \times n}$
    Result: $\theta \in \mathbb{R}^{n \times n}$
    Initialize $\theta, \mu \in \mathbb{R}^{n \times n}$ as zero matrices
    repeat
        $\beta^{t+1} = (X^{\top}X + \rho I)^{-1}(X^{\top}X + \rho\,\theta^{t} - \mu^{t})$
        $\theta^{t+1} = S_{\lambda/\rho}(\beta^{t+1} + \mu^{t}/\rho)$
        $\mu^{t+1} = \mu^{t} + \rho\,(\beta^{t+1} - \theta^{t+1})$
        $t = t + 1$
    until convergence

4.3.1 ADMM for the LASSO

In Section 2.4.4 we saw soft thresholding and how it can be applied very efficiently to problems like (4.11). The latter is commonly referred to as LASSO regression, which constrains the $\ell_1$ norm of the parameter vector to be no greater than a given value. Unfortunately, in this case it is non-trivial and inefficient to apply soft thresholding directly, since the elements of $X$ cannot be treated independently. Hence, we can apply the Alternating Direction Method of Multipliers introduced in Section 1.5 to solve the LASSO, which in Lagrangian form is

$\min_{\beta \in \mathbb{R}^p,\, \theta \in \mathbb{R}^p} \; \underbrace{\|Y - X\beta\|_2^2}_{\text{loss}} + \lambda \underbrace{\|\theta\|_1}_{\text{penalty}} \quad \text{subject to} \quad \beta - \theta = 0$   (4.12)

When applied to this problem, the ADMM updates for (4.12) take the form

$\beta^{t+1} = (X^{\top}X + \rho I)^{-1}(X^{\top}Y + \rho\,\theta^{t} - \mu^{t})$
$\theta^{t+1} = S_{\lambda/\rho}(\beta^{t+1} + \mu^{t}/\rho)$   (4.13)
$\mu^{t+1} = \mu^{t} + \rho\,(\beta^{t+1} - \theta^{t+1})$

which allows us to break the original problem into a sequence of two sub-problems. As shown in (4.13), in the first sub-problem, when minimizing (4.12) w.r.t. $\beta$ only, the $\ell_1$ penalty $\|\theta\|_1$ disappears from the objective, making it a very efficient and simple ridge regression problem. In the second sub-problem, when minimizing (4.12) w.r.t. $\theta$ only, the term $\|Y - X\beta\|_2^2$ disappears, allowing $\theta$ to be solved independently across each element, so that soft thresholding can be applied efficiently. The current estimates of $\beta$ and $\theta$ are then combined in the last step of the ADMM to update the current estimate of the Lagrange multiplier matrix $\mu$. After the start-up phase, which involves computing the product $X^{\top}X$, the subsequent iterations have a cost of $O(Np)$. Consequently, the cost per iteration is similar to that of coordinate descent or the composite gradient method.
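A direct Python transcription of the updates in (4.13) is sketched below. It solves the vector LASSO of (4.12); in the setting of this chapter the same scheme is run with the data matrix X itself as the target, as in the procedure box above. The stopping tolerance and the values of ρ and λ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Element-wise soft-thresholding operator S_kappa(a)."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_lasso(X, y, lam=0.1, rho=1.0, max_iter=500, tol=1e-6):
    """ADMM for the LASSO (Eq. 4.13): min (1/2)||y - X b||_2^2 + lam ||b||_1."""
    n, p = X.shape
    beta, theta, mu = np.zeros(p), np.zeros(p), np.zeros(p)
    # Start-up phase: form (X^T X + rho I) once and reuse it at every iteration.
    A = X.T @ X + rho * np.eye(p)
    Xty = X.T @ y
    for _ in range(max_iter):
        beta = np.linalg.solve(A, Xty + rho * theta - mu)        # ridge sub-problem
        theta_new = soft_threshold(beta + mu / rho, lam / rho)   # soft-thresholding
        mu = mu + rho * (beta - theta_new)                       # dual (multiplier) update
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    return theta

# Toy usage: a sparse ground truth recovered from noisy measurements (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[[2, 7, 11]] = [1.5, -2.0, 0.8]
y = X @ w_true + 0.05 * rng.normal(size=100)
print(np.round(admm_lasso(X, y, lam=1.0), 2))
```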
Figure 4.2: A general framework for Class-Specific Feature Selection [41].

4.4 Class-Specific Feature Selection

Feature selection has been widely used to eliminate redundant or irrelevant features, and it can be done in two ways: Traditional Feature Selection (TFS) for all classes, and Class-Specific Feature Selection (CSFS). CSFS is the process of finding a different set of features for each class, and several methods of this kind have been proposed [42]. Unlike a TFS algorithm, where a single feature subset is selected for discriminating among all the classes of a supervised classification problem, a CSFS algorithm selects a subset of features for each class. A general framework for CSFS can use any traditional feature selector to select a possibly different subset for each class of a supervised classification problem. Depending on the type of feature selector, the overall process may change slightly.
4.4.1 General Framework for Class-Specific Feature Selection

Generally, all CSFS algorithms are strongly tied to the use of a particular classifier. In general, it would be desirable to apply CSFS independently of the classifier used in the classification stage as well as of the feature selector used. To allow this, the authors of [41] built a General Framework for CSFS (GF-CSFS) which allows using any traditional feature selector for CSFS as well as any classifier, and which consists of four stages:

1. Class binarization: in CSFS, the goal is to select a feature subset that allows discriminating a class from the remaining classes. Therefore, in the first stage of [41], the one-against-all class binarization is used to transform a c-class problem into c binary problems.

2. Class balancing: since the binarization stage makes the generated binary problems unbalanced, it is necessary to balance the classes by applying an oversampling process before applying a conventional feature selector to a binary problem. Several oversampling methods exist in the literature; some of the most used are random oversampling [43], SMOTE [44] and Borderline-SMOTE [45].

3. Class-specific feature selection: for each binary problem, features are selected using a traditional feature selector, and the selected features are assigned to the class from which the binary problem was constructed. In this way, c possibly different feature subsets are obtained, one for each class of the original c-class supervised classification problem. In this stage it is possible to use a different traditional feature selector for each binary problem. At the end of this stage the CSFS process has properly finished; however, since a possibly different subset of features has been selected for each class, it is very important to define how to use these subsets for classification.

4. Classification: for doing class-specific selection, a multiclass problem has been transformed into c binary problems. At first glance, the straightforward way of using the feature subset associated with each class is to follow a multi-classification scheme, training a classifier for each binary problem and integrating the decisions of the classifiers. However, this approach would end up solving a problem different from the one originally formulated. It is important to point out that the set of features associated with a class is, in theory, the best subset found for discriminating the objects of that class from the objects of the other classes. Therefore, in the classification stage, for each class $c_i$ a classifier $e_i$ is trained for the original problem (i.e., the instances in the training set for the classifier $e_i$ keep their original class labels) but taking into account only the features selected for class $c_i$. In this way, we obtain a classifier ensemble $E = \{e_1, \ldots, e_c\}$. When a new instance $O$ is classified through the ensemble, its original dimensionality $d$ must be reduced to the dimensionality $d_i$ used by the classifier $e_i$, $i = 1, \ldots, c$, and a custom majority voting scheme is used to classify the new instance.
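The four GF-CSFS stages can be sketched as a small pipeline. The fragment below assumes scikit-learn and imbalanced-learn are available and uses ANOVA-based SelectKBest as the "injected" traditional feature selector and a linear SVM as the base classifier; these choices, the value of k and the SMOTE settings are illustrative assumptions, not the ones prescribed in [41].

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

def gf_csfs_fit(X, y, k=20):
    """Sketch of the GF-CSFS stages: binarization, balancing, per-class selection,
    and training one classifier per class on the original (multiclass) labels."""
    classes = np.unique(y)
    feature_subsets, ensemble = {}, {}
    for c in classes:
        y_bin = (y == c).astype(int)                                  # 1) one-against-all binarization
        X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y_bin)   # 2) class balancing
        selector = SelectKBest(f_classif, k=k).fit(X_bal, y_bal)      # 3) per-class selection
        idx = selector.get_support(indices=True)
        feature_subsets[c] = idx
        # 4) classifier trained on the ORIGINAL multiclass problem, restricted to idx
        ensemble[c] = LinearSVC(C=1.0).fit(X[:, idx], y)
    return feature_subsets, ensemble
```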
Figure 4.3: A Sparse Learning-Based approach for Class-Specific Feature Selection.

4.4.2 A Sparse Learning-Based Approach for Class-Specific Feature Selection

In this section we describe a novel Sparse Learning-Based approach for Class-Specific Feature Selection, based on the sparse model presented in the previous sections of this chapter. Following the concept of representativeness described in Section 4.2.2, we try to best represent each class-set of the training set using only a few representative features. More specifically, the method consists of the following steps:

1. Class-sample separation: unlike the General Framework for Class-Specific Feature Selection, our CSFS model does not employ a class binarization stage to transform the c-class problem into c binary problems; instead, it simply separates the training samples by class, obtaining a different subset/configuration of samples for each class.

2. Class balancing (this stage may not be necessary if there are enough samples for each class in the training set): once the class sample sets of the training set have been split apart, each class subset may be unbalanced. Therefore, we use the SMOTE [44] re-sampling method to balance each class subset.

3. Intra-class-specific feature selection: unlike GF-CSFS, where a binarization step is carried out for doing class-specific selection, our method uses the Sparse-Coding
Based Approach described in Section 4.2.2 to retrieve the most representative features for each class subset of the training data, namely those that minimize Equation 4.11 and therefore best represent/reconstruct the whole collection of objects of that class. In doing so, this approach takes advantage of intra-class feature selection to improve the classification accuracy with respect to TFS or GF-CSFS.

4. Classification: in the classification step we follow an ensemble-wise procedure for classifying new instances. As in [41], for each class $c_i$ a classifier $e_i$ is trained for the original problem, but taking into account only the features selected for class $c_i$. In this way, we produce an ensemble of classifiers $E = \{e_1, \ldots, e_c\}$. Whenever a new instance $O$ needs to be classified through the ensemble, its original dimensionality must be reduced to the dimensionality $d_i$ used by the classifier $e_i$, $i = 1, \ldots, c$. The following ad-hoc majority decision rule is used to determine the class of the sample $O$:

(a) If a classifier $e_i$ outputs the class $c_i$, i.e., the same class for which the features (used for training $e_i$) were selected, then the class $c_i$ is assigned to $O$. If there is a tie (two or more classifiers output their own class), the class of $O$ is assigned through a majority vote among all classifiers. If the tie persists, the class of $O$ is the majority class among the tied classes.

(b) If no classifier outputs the class for which its features were selected, the class of $O$ is assigned through a majority vote. If there is a tie, the class of $O$ is the majority class among the tied classes.
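The ad-hoc decision rule above can be written compactly as a small function. The sketch below assumes each per-class classifier exposes a scikit-learn-style predict method and that the per-class feature indices come from the selection step; it is a plain reading of rules (a) and (b), with "majority class" interpreted as the class that is most frequent in the training set, which is an assumption where the text leaves the final tie-break unspecified.

```python
from collections import Counter

def csfs_predict(o, ensemble, feature_subsets, class_priors):
    """Classify one instance `o` with the ad-hoc SCBA-CSFS ensemble decision rule.

    ensemble        : dict class -> trained classifier e_i
    feature_subsets : dict class -> indices of the features selected for that class
    class_priors    : dict class -> training-set frequency, used only to break ties
                      (this reading of "majority class" is an assumption)
    """
    # Each classifier sees only its own feature subset of the instance.
    outputs = {c: ensemble[c].predict(o[feature_subsets[c]].reshape(1, -1))[0]
               for c in ensemble}
    # Rule (a): classifiers voting for "their own" class.
    self_votes = [c for c, pred in outputs.items() if pred == c]
    if len(self_votes) == 1:
        return self_votes[0]
    # Rule (a) with a tie, or rule (b): plain majority vote over all outputs.
    votes = Counter(outputs.values())
    best = max(votes.values())
    tied = [c for c, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # Persisting tie: fall back to the most frequent class among the tied ones.
    return max(tied, key=lambda c: class_priors.get(c, 0))
```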
Chapter 5

Experimental Results

In this chapter we conduct several experiments to evaluate the performance of the proposed SCBA-CSFS technique. In particular, we compare the proposed method to state-of-the-art feature selection methods as well as to the GF-CSFS introduced in Section 4.4.

5.1 Experimental Analysis

In our experiments, in order to evaluate the performance of our feature selection method, we applied our methods to a total of six publicly available (bioinformatics) microarray datasets: the acute lymphoblastic leukemia and acute myeloid leukemia (ALLAML) dataset [93], the human carcinomas (CAR) dataset [94], the human lung carcinomas (LUNG) dataset [96], the lung discrete (LUNG_DISCRETE) dataset [97], the diffuse large B-cell lymphoma (DLBCL) dataset [98], and the malignant glioma (GLIOMA) dataset [99]. The Support Vector Machine (SVM) classifier is applied to these datasets using 5-fold cross-validation.

We focused on this type of dataset because selecting small subsets out of the thousands of genes in microarray data is an important task for several medical purposes. Microarray data analysis is known for involving a huge number of genes compared to a relatively small number of samples. In particular, gene selection is the task of identifying the most significantly differentially expressed genes under different conditions, and it has been an open research focus. Gene selection is a prerequisite in many applications [102], and the selected genes are very useful in clinical applications such as recognizing diseased profiles. Nonetheless, because of high costs, the number of experiments that can be used for classification purposes is usually limited. This small number of samples, compared to the large number of genes in an experiment, is the well-known Curse of Dimensionality [5] and challenges the classification task as well as other data analyses. Moreover, it is well known that a significant number of genes play an important role, whereas many others may be unrelated to the classification task [101]. Therefore, a critical step towards effective classification is to identify the representative genes and thereby decrease the number of genes used for classification purposes.
Dataset     Size   # of Features   # of Classes
ALLAML        72            7129              2
CAR          174            9182             11
LUNG_C       203            3312              5
LUNG_D        73             325              7
DLBCL         96            4026              9
GLIOMA        50            4434              4

Table 5.1: Datasets description.

5.2 Datasets Description

We give a brief description of all the datasets used in our experiments.

The ALLAML dataset [93] contains 72 samples in two classes, ALL and AML, with 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values.

The CARCINOM dataset [94] contains 174 samples in eleven classes: prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas, and lung squamous cell carcinoma, with 26, 8, 26, 23, 12, 11, 7, 27, 6, 14 and 14 samples, respectively. After the pre-processing described in [95], the dataset was reduced to 174 samples and 9,182 genes.

The LUNG dataset [96] contains 203 samples in five classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6 and 17 samples, respectively. Genes with standard deviation smaller than 50 expression units were removed, yielding a dataset with 203 samples and 3,312 genes.

The LUNG_DISCRETE dataset [97] contains 73 samples in seven classes. Each sample has 325 features.

The DLBCL dataset [98] is a modified version of the original DLBCL dataset. It consists of 96 samples in nine classes, where each sample is described by the expression of 4,026 genes. The class sizes of the DLBCL dataset are 46, 10, 9, 11, 6, 6, 4, 2 and 2, respectively.

The GLIOMA dataset [99] contains 50 samples in four classes: cancer glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer oligodendrogliomas, with 14, 14, 7 and 15 samples, respectively. Each sample has 12,625 genes. After the pre-processing described in [99], the dataset was reduced to 50 samples and 4,433 genes.

All datasets were downloaded from [103]; their characteristics are summarized in Table 5.1.
5.3 Experiment Setup

To validate the effectiveness of the Sparse-Coding Based Approach for Feature Selection (SCBA-FS) and the Sparse-Coding Based Approach for Class-Specific Feature Selection (SCBA-CSFS), we compared our methods against several TFS methods and against the GF-CSFS proposed in [41]. More precisely, in our experiments we first compared our methods against the TFS methods. In addition, given that the framework proposed in [41] can use any TFS method as a base for doing CSFS, we ran some experiments injecting both filter and wrapper methods into it. We also compared the accuracy results against those obtained using all the features (BSL). Since our feature selection methods are sparse filter-type approaches, we found it appropriate to compare them to the following TFS methods:

• RFS [100]: the Robust Feature Selection method is a sparse learning-based approach for feature selection which emphasizes joint $\ell_{2,1}$-norm minimization on both the loss and the regularization function.

• ls-$\ell_{2,1}$ [4]: ls-$\ell_{2,1}$ is a supervised sparse feature selection method. It exploits the $\ell_{2,1}$-norm regularized regression model for joint feature selection from multiple tasks, where the classification objective function is a quadratic loss.

• ll-$\ell_{2,1}$ [4]: ll-$\ell_{2,1}$ is a supervised sparse feature selection method which follows the same idea as ls-$\ell_{2,1}$ but uses a logistic loss instead.

• Fisher [9]: the Fisher score is one of the most widely used supervised filter feature selection methods. It scores each feature as the ratio of inter-class separation to intra-class variance, where features are evaluated independently, and the final feature selection is obtained by aggregating the m top-ranked ones.

• Relief-F [8]: Relief-F is an iterative, randomized, supervised filter approach that estimates the quality of the features according to how well their values differentiate data samples that are near to each other; it does not discriminate among redundant features, and its performance decreases when few data are available.

• mRmR [30]: Minimum-Redundancy-Maximum-Relevance is a mutual information-based filter algorithm which selects features according to the maximal statistical dependency criterion.

We pre-processed all datasets using Z-score normalization [92]. The Support Vector Machine (SVM) classifier was run on all datasets using 5-fold cross-validation. We used the linear kernel with parameter C = 1 and the one-vs-the-rest strategy for multi-class classification. For RFS, ls-$\ell_{2,1}$ and ll-$\ell_{2,1}$ we fixed the regularization parameter γ to 1.0 by default. For Relief-F we fixed k, which specifies the size of the neighborhood, to 5 by default. For our methods, we tuned the regularization parameter λ in order to achieve better results on all datasets.
For evaluating the performance of all the methods, we let the number of selected features range from 1 to 300 and then computed the average accuracy. The evaluation metric used for assessing the classification performance of all the methods is the Accuracy Score (AS), defined as follows:

$\mathrm{Accuracy\_Score}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{n_{\mathrm{samples}}} \mathbb{1}(\hat{y}_i = y_i)$   (5.1)

where $y_i$ and $\hat{y}_i$ are, respectively, the ground truth and the predicted label of the i-th sample, and $n_{\mathrm{samples}}$ is the number of samples in the test set. Obviously, a larger AS indicates better performance. Although two-class classification is an important type of task, it is relatively easy, since a random choice of class labels would already give 50% accuracy. Classification problems with multiple classes are generally more difficult and give a more realistic assessment of the proposed methods.
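The evaluation protocol described above (Z-score normalization, linear SVM with C = 1, one-vs-rest strategy, 5-fold cross-validation, accuracy score) can be reproduced with a few lines of scikit-learn. The sketch below evaluates an already selected feature subset; the synthetic data and the chosen subset are placeholders standing in for the microarray datasets and for the features returned by any of the compared selectors.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_subset(X, y, selected_idx):
    """5-fold CV accuracy of a linear one-vs-rest SVM (C=1) on a feature subset,
    with Z-score normalization fitted inside each fold."""
    model = make_pipeline(
        StandardScaler(),                                  # Z-score normalization
        OneVsRestClassifier(SVC(kernel="linear", C=1.0)),  # linear SVM, one-vs-rest
    )
    scores = cross_val_score(model, X[:, selected_idx], y, cv=5, scoring="accuracy")
    return scores.mean()

# Placeholder data standing in for a microarray dataset and an already selected subset.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 4, size=100)
print("Average 5-fold accuracy: %.3f" % evaluate_subset(X, y, np.arange(20)))
```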
Figure 5.1: Classification accuracy comparisons of seven feature selection algorithms on six datasets: (a) ALLAML, (b) CARCINOM, (c) LUNG, (d) LUNG_DISCRETE, (e) LYMPHOMA, (f) GLIOMA. SVM with 5-fold cross-validation is used for classification. SCBA and SCBA-CSFS are our methods.

Table 5.2: Accuracy Score of SVM using 5-fold cross-validation. Six TFS methods are compared against our methods. RFS: Robust Feature Selection, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features, SCBA-FS and SCBA-CSFS: our methods. The best results are highlighted in bold.

          |              Average accuracy of top 20 features (%)              |  BSL  |              Average accuracy of top 80 features (%)
Dataset   | RFS    ls-l21  ll-l21  Fisher  Relief-F  mRmR   SCBA   SCBA-CSFS  |       | RFS       ls-l21  ll-l21  Fisher  Relief-F  mRmR   SCBA   SCBA-CSFS
ALLAML    | 80.82  74.27   88.36   96.78   95.67     74.62  92.57  80.81      | 93.21 | 97.84     74.27   95.73   98.95   98.89     83.16  96.84  95.67
CAR       | 85.19  72.03   84.84   65.67   85.58     85.88  84.84  93.60      | 90.25 | 96.98     88.88   94.61   92.92   96.95     94.93  94.61  99.32
LUNG_C    | 93.24  92.37   97.84   90.22   96.98     96.55  96.98  98.99      | 95.57 | 98.12     97.84   98.99   99.28   99.57     98.71  99.42  99.70
LUNG_D    | 89.89  87.72   93.26   90.51   91.89     93.93  89.20  96.60      | 83.43 | 95.93     95.93   94.62   95.93   97.31     96.60  97.29  97.29
DLBCL     | 93.96  92.99   95.89   97.34   99.03     99.52  95.89  99.52(5)   | 93.74 | 99.03     95.42   99.76   99.76   99.76     99.8   99.76  99.76
GLIOMA    | 80     60      73.33   76.67   76.66     75     76.67  76.67      | 74    | 88.33(29) 70      80      83.33   80        78.33  81.67  88.33
Average   | 87.18  79.9    88.92   86.19   90.97     87.58  89.36  91.03      | 88.37 | 96.03     87.05   93.95   95.03   95.41     91.92  94.93  96.68

5.4 Experimental Results and Discussion

The results were obtained on a workstation with dual Intel(R) Xeon(R) 2.40 GHz processors and 64 GB of RAM. We summarize the 5-fold cross-validation accuracy results of the different methods on the six datasets described in Table 5.1. Tables 5.2 and 5.3 show the experimental results using SVM. For all comparisons, we computed the average accuracy using the top 20 and top 80 features for all feature selection methods. Where there is a tie among the methods, we considered the best accuracy achieved with the fewest features.

Firstly, we compared the performance of our method against the TFS methods. Table 5.2 shows the experimental results using SVM. We can see from the table that our approach (SCBA-CSFS) significantly outperforms the other methods when the datasets have many classes.
Figure 5.2: Classification accuracy comparisons of seven feature selection algorithms on six datasets: (a) ALLAML, (b) CARCINOM, (c) LUNG, (d) LUNG_DISCRETE, (e) LYMPHOMA, (f) GLIOMA. SVM with 5-fold cross-validation is used for classification. SCBA-CSFS is our method.

Table 5.3: Accuracy Score of SVM using 5-fold cross-validation. The GF-CSFS framework [41], injected with several traditional feature selectors, is compared against our SCBA-CSFS. RFS: Robust Feature Selection, FS: Fisher Score, mRmR: Minimum-Redundancy-Maximum-Relevance, BSL: all features, SCBA-CSFS: our method. The best results are highlighted in bold.

          |          Average accuracy of top 20 features (%)          |  BSL  |          Average accuracy of top 80 features (%)
Dataset   | RFS    ls-l21  ll-l21  Fisher  Relief-F  mRmR      SCBA-CSFS |       | RFS    ls-l21  ll-l21  Fisher  Relief-F  mRmR   SCBA-CSFS
ALLAML    | 82     72.29   80.57   98.57   95.81     72.10     80.82     | 93.21 | 95.71  83.33   90.20   98.57   97.14     81.71  95.67
CAR       | 54     35.61   69.01   89.70   89.13     78.18     93.60     | 90.25 | 83.88  64.35   83.88   94.84   94.86     89.63  99.32
LUNG_C    | 85.73  81.35   90.63   93.11   91.15     86.74     98.99     | 95.57 | 93.07  88.67   94.05   95.56   94.56     91.61  99.70
LUNG_D    | 72.76  68.76   72.67   89.05   85.05     86.38     96.60     | 83.43 | 89.14  86.48   86.38   91.71   89.05     89.05  97.29
DLBCL     | 94.44  92.75   97.56   99.28   99.76     99.76(6)  99.52     | 93.74 | 99.98  99.28   99.76   99.76   99.98     99.98  99.76
GLIOMA    | 68     58      70      76      66        64        76.67     | 74    | 76     70      70      82      80        66     88.33
Average   | 76.16  68.12   80.07   90.95   87.82     81.19     91.03     | 88.37 | 89.63  82.02   87.38   93.74   92.6      86.33  96.68

Based on our experimental results, we may affirm that applying TFS usually gives better results than using all the available features. However, in most cases, applying our CSFS gives better results than applying TFS methods. In particular, we noticed that our CSFS method achieves its best results when the datasets have many classes, suggesting that it performs best on datasets with many classes (e.g., LUNG_C, LUNG_D, CAR, DLBCL). In addition, as shown in Fig. 5.1, it is important to point out that our method always outperforms the others with a smaller number of features. As a result, we can assert that our method is able to identify/retrieve the most representative features, which maximize the classification accuracy. With the top 20 and 80 features, our method is around 1%-12% and 1%-10% better than the other methods on all six datasets, respectively.

Secondly, we compared the performance of our method against [41]. The experimental results are shown in Table 5.3. From the table, we can see that the process underlying our SCBA feature selection is better suited than GF-CSFS for retrieving the most useful features for classification,
leading most of the time to better results. With the top 20 and 80 features, our method is around 1%-23% and 1%-10% better than the other methods on all six datasets, respectively.
Chapter 6

Conclusion

In this thesis, we proposed a novel Sparse-Coding Based Approach for Feature Selection that emphasizes joint $\ell_{1,2}$-norm minimization, together with a Class-Specific Feature Selection scheme. Experimental results on six different datasets validate the distinctive aspects of SCBA-CSFS and demonstrate the better performance achieved against state-of-the-art methods.

One of the main characteristics of our framework is that, by jointly exploiting the ideas of Compressed Sensing and Class-Specific Feature Selection, it is able to identify/retrieve the most representative features that maximize the classification accuracy when the dataset comprises many classes. Based on our experimental results, we can conclude that applying TFS usually achieves better results than using all the available features. However, in most cases, applying our proposed SCBA-CSFS method gives better results than TFS as well as GF-CSFS with several injected TFS methods.

Future work will include the analysis of our method on other types of datasets and the injection of other TFS methods into our framework to compare their performance. In addition, we plan to test our method on real-case datasets such as the EPIC dataset [106], after a thorough pre-filtering analysis.
References

[1] Benjamin Recht. Convex Modeling with Priors. PhD thesis, Massachusetts Institute of Technology, Media Arts and Sciences Department, 2006.

[2] He, B. S., Hai Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications 106.2 (2000): 337-356.

[3] Korn, Flip, B-U. Pagel, and Christos Faloutsos. On the "dimensionality curse" and the "self-similarity blessing". IEEE Transactions on Knowledge and Data Engineering 13.1 (2001): 96-111.

[4] Tang, Jiliang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review. Data Classification: Algorithms and Applications (2014): 37.

[5] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and data mining. Vol. 454. Springer Science & Business Media, 2012.

[6] Guyon, Isabelle, and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research 3.Mar (2003): 1157-1182.

[7] Yager, Ronald R., and Liping Liu, eds. Classic works of the Dempster-Shafer theory of belief functions. Vol. 219. Springer, 2008.

[8] Kira, K., and Rendell, L. A. A practical approach to feature selection. In Proceedings of the Ninth International Workshop on Machine Learning, 1992, pp. 249-256.

[9] Gu, Quanquan, Zhenhui Li, and Jiawei Han. Generalized Fisher score for feature selection. arXiv preprint arXiv:1202.3725 (2012).

[10] He, Xiaofei, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. Advances in Neural Information Processing Systems. 2006.

[11] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and data mining. Vol. 454. Springer Science & Business Media, 2012.

[12] Bradley, Paul S., and Olvi L. Mangasarian. Feature selection via concave minimization and support vector machines. ICML. Vol. 98. 1998.
[13] Maldonado, Sebastián, Richard Weber, and Fazel Famili. Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines. Information Sciences 286 (2014): 228-246.

[14] Kim, YongSeog, W. Nick Street, and Filippo Menczer. Evolutionary model selection in unsupervised learning. Intelligent Data Analysis 6.6 (2002): 531-556.

[15] Hapfelmeier, Alexander, and Kurt Ulm. A new variable selection approach using random forests. Computational Statistics & Data Analysis 60 (2013): 50-69.

[16] Cawley, Gavin C., Nicola L. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. Advances in Neural Information Processing Systems. 2007.

[17] Das, S. (2001, June). Filters, wrappers and a boosting-based hybrid for feature selection. In ICML (Vol. 1, pp. 74-81).

[18] Cadenas, José M., M. Carmen Garrido, and Raquel Martínez. Feature subset selection filter-wrapper based on low quality data. Expert Systems with Applications 40.16 (2013): 6241-6252.

[19] I. S. Oh, J. S. Lee, and B. R. Moon. Hybrid genetic algorithms for feature selection. IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424-1437, 2004.

[20] S. I. Ali and W. Shahzad. A Feature Subset Selection Method based on Conditional Mutual Information and Ant Colony Optimization. International Journal of Computer Applications, vol. 60, no. 11, pp. 5-10, 2012.

[21] S. Sarafrazi and H. Nezamabadi-pour. Facing the classification of binary problems with a GSA-SVM hybrid system. Mathematical and Computer Modelling, vol. 57, issues 1-2, pp. 270-278, 2013.

[22] Forman, George. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3.Mar (2003): 1289-1305.

[23] J. Bins and B. A. Draper. Feature selection from huge feature sets. In Proc. 8th International Conference on Computer Vision (ICCV-01), Vancouver, British Columbia, Canada, IEEE Computer Society, pp. 159-165, 2001.

[24] K. Brkić. Structural analysis of video by histogram-based description of local space-time appearance. Ph.D. dissertation, University of Zagreb, Faculty of Electrical Engineering and Computing, 2013.

[25] M. Muštra, M. Grgić, and K. Delač. Breast density classification using multiple feature selection. Automatika, vol. 53, pp. 1289-1305, 2012.
[26] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Analysis and Machine Intelligence, 27, 2005.

[27] Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, 2007.

[28] Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

[29] Ding, Chris, and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology 3.02 (2005): 185-205.

[30] Peng, Hanchuan, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27.8 (2005): 1226-1238.

[31] L. E. Raileanu and K. Stoffel. Theoretical comparison between the Gini index and information gain criteria. University of Neuchatel, 2000.

[32] Nie, Feiping, et al. Efficient and robust feature selection via joint l2,1-norms minimization. Advances in Neural Information Processing Systems. 2010.

[33] Xu, Jin, Haibo He, and Hong Man. Active Dictionary Learning in Sparse Representation Based Classification. arXiv preprint arXiv:1409.5763 (2014).

[34] Aharon, Michal, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54.11 (2006): 4311-4322.

[35] Engan, Kjersti, Sven Ole Aase, and J. Hakon Husoy. Method of optimal directions for frame design. Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference on. Vol. 5. IEEE, 1999.

[36] Jolliffe, Ian T. Principal component analysis and factor analysis. Principal Component Analysis (2002): 150-166.

[37] Mairal, Julien, et al. Discriminative learned dictionaries for local image analysis. Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.

[38] Ramirez, Ignacio, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[39] Ramirez, Ignacio, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.

[40] Mairal, Julien, et al. Non-local sparse models for image restoration. Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.

[41] Pineda-Bautista, Bárbara B., Jesús Ariel Carrasco-Ochoa, and J. Fco Martínez-Trinidad. General framework for class-specific feature selection. Expert Systems with Applications 38.8 (2011): 10018-10024.

[42] Fu, Xiuju, and Lipo Wang. A GA-based RBF classifier with class-dependent features. Evolutionary Computation, 2002. CEC'02. Proceedings of the 2002 Congress on. Vol. 2. IEEE, 2002.

[43] Van Hulse, Jason, Taghi M. Khoshgoftaar, and Amri Napolitano. Experimental perspectives on learning from imbalanced data. Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

[44] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

[45] Han, H., Wang, W. Y., and Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing (pp. 878-887).

[46] Liu, Huan, and Hiroshi Motoda, eds. Computational methods of feature selection. CRC Press, 2007.

[47] Weston, Jason, et al. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3.Mar (2003): 1439-1461.

[48] Song, Le, et al. Supervised feature selection via dependence estimation. Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

[49] Dy, Jennifer G., and Carla E. Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research 5.Aug (2004): 845-889.

[50] P. Mitra, C. A. Murthy, and S. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:301-312, 2002.

[51] Zhao, Zheng, and Huan Liu. Semi-supervised feature selection via spectral analysis. Proceedings of the 2007 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2007.
[52] Xu, Zenglin, et al. Discriminative semi-supervised feature selection via manifold regularization. IEEE Transactions on Neural Networks 21.7 (2010): 1033-1047.

[53] Dy, Jennifer G., and Carla E. Brodley. Feature subset selection and order identification for unsupervised learning. ICML. 2000.

[54] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In Proceedings of the SIAM International Conference on Data Mining, 2007.

[55] Yu, Lei, and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research 5.Oct (2004): 1205-1224.

[56] Aggarwal, Charu C., and Chandan K. Reddy, eds. Data clustering: algorithms and applications. CRC Press, 2013.

[57] Liu, Huan, and Hiroshi Motoda, eds. Computational methods of feature selection. CRC Press, 2007.

[58] Kohavi, Ron, and George H. John. Wrappers for feature subset selection. Artificial Intelligence 97.1-2 (1997): 273-324.

[59] Hall, Mark A., and Lloyd A. Smith. Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper. FLAIRS Conference. Vol. 1999. 1999.

[60] I. Inza, P. Larranaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31(2):91-103, 2004.

[61] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.

[62] Quinlan, J. Ross. Induction of decision trees. Machine Learning 1.1 (1986): 81-106.

[63] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[64] Ma, Shuangge, and Jian Huang. Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics 9.5 (2008): 392-403.

[65] Bradley, Paul S., and Olvi L. Mangasarian. Feature selection via concave minimization and support vector machines. ICML. Vol. 98. 1998.

[66] Cawley, Gavin C., Nicola L. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. Advances in Neural Information Processing Systems. 2007.

[67] Liu, Huan, and Lei Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17.4 (2005): 491-502.
[68] Zou, Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101.476 (2006): 1418-1429.
[69] Knight, Keith, and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of Statistics (2000): 1356-1378.
[70] Huang, Jian, Joel L. Horowitz, and Shuangge Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics (2008): 587-613.
[71] Zou, Hui, and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301-320.
[72] Wang, Li, Ji Zhu, and Hui Zou. Hybrid huberized support vector machines for microarray classification. Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[73] Obozinski, Guillaume, Ben Taskar, and Michael Jordan. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep. 2 (2006).
[74] Argyriou, Andreas, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems. 2007.
[75] Yuan, Ming, and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006): 49-67.
[76] Ye, Jieping, and Jun Liu. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter 14.1 (2012): 4-15.
[77] Tibshirani, Robert, et al. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
[78] Zhou, Jiayu, et al. Modeling disease progression via fused sparse group lasso. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012.
[79] Jenatton, Rodolphe, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research 12.Oct (2011): 2777-2824.
[80] Kim, Seyoung, and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
[81] Huang, J., T. Zhang, and D. Metaxas. Learning with structured sparsity. Proceedings of the 26th Annual International Conference on Machine Learning, pages 417-424. ACM, 2009.
[82] Tibshirani, Robert, and Pei Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9.1 (2007): 18-29.
[83] Yuan, Ming, and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006): 49-67.
[84] McAuley, James, et al. Subband correlation and robust speech recognition. IEEE Transactions on Speech and Audio Processing 13.5 (2005): 956-964.
[85] Liu, Jun, and Jieping Ye. Moreau-Yosida regularization for grouped tree structure learning. Advances in Neural Information Processing Systems. 2010.
[86] Jenatton, Rodolphe, et al. Proximal methods for sparse hierarchical dictionary learning. Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
[87] Perkins, Simon, and James Theiler. Online feature selection using grafting. Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003.
[88] Zhou, Dengyong, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and unlabeled data on a directed graph. Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
[89] Wu, Xindong, et al. Online streaming feature selection. Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
[90] Wang, Jialei, et al. Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering 26.3 (2014): 698-710.
[91] Zhou, Jing, et al. Streaming feature selection using alpha-investing. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 2005.
[92] Kreyszig, E. Advanced Engineering Mathematics. Fourth ed. Wiley, 1979. p. 880, eq. 5. ISBN 0-471-02140-7.
[93] Golub, T. R., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.5439 (1999): 531-537.
[94] Nutt, Catherine L., et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.
[95] Yang, Kun, et al. A stable gene selection in microarray data analysis. BMC Bioinformatics 7.1 (2006): 228.
[96] Bhattacharjee, Arindam, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences 98.24 (2001): 13790-13795.
[97] Peng, Hanchuan, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27.8 (2005): 1226-1238.
[98] Alizadeh, Ash A., et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403.6769 (2000): 503-511.
[99] Nutt, Catherine L., et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.
[100] Nie, Feiping, et al. Efficient and robust feature selection via joint ℓ2,1-norms minimization. Advances in Neural Information Processing Systems. 2010.
[101] Xiong, Momiao, Xiangzhong Fang, and Jinying Zhao. Biomarker identification by feature wrappers. Genome Research 11.11 (2001): 1878-1887.
[102] Mukherjee, Sach, and Stephen J. Roberts. A theoretical analysis of gene selection. Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. IEEE, 2004.
[103] http://featureselection.asu.edu/datasets.php
[104] Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[105] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3.1 (2011): 1-122.
[106] Demetriou, Christiana A., Jia Chen, Silvia Polidoro, Karin Van Veldhoven, Cyrille Cuenin, Gianluca Campanella, Kevin Brennan, et al. Methylome analysis and epigenetic changes associated with menarcheal age. PLoS ONE 8.11 (2013): e79391.