1 The Problem Statement
Figure 2: For N = 2, two possible configurations of rectangles that enclose the ellipse.
1. Each rectangle has 5 degrees of freedom, so 5 independent variables are needed to specify it completely: the X and Y coordinates of its center, its length and breadth, and the angle by which one of its sides is rotated relative to the X axis. For N rectangles there are therefore 5N independent variables.
2. The first difficulty was determining a way to calculate the objective function for a given valid set of 5N independent variables, where "valid" means that the N rectangles corresponding to those variables completely enclose the ellipse. Assuming a valid set, one can calculate the area enclosed by all the rectangles and then subtract the area of the ellipse to get the objective function value. The calculation of the area enclosed by all the rectangles is non-trivial but possible with a suitable algorithm: it involves computing a set of disjoint (non-intersecting) polygons that represent the union of all the rectangles, calculating the area of each disjoint polygon, and summing the areas. (A sketch of such a computation is given after this list.)
3. The next difficulty was computing the gradient of the objective function with respect to the 5N independent variables, and the same difficulty arose for the Hessian. The lack of an analytical formula for the objective function, and consequently for the gradient and the Hessian, means that one has to rely entirely on finite difference schemes to evaluate these quantities while solving the optimization problem. This approach carries its own risks, as the objective function is suspected to be non-differentiable, but this has not been properly investigated.
4. However, the most important difficulty was encountered in trying to come up with
a set of functional constraints that must be satisfied in order for the ellipse to be
completely enclosed by N rectangles, given by a set of 5N numbers. No way was
found to overcome this difficulty, but again it may be due to my ignorance in this
field.
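As a concrete illustration of points 2 and 3 above, here is a minimal sketch (not the code used for this project) of how the objective and a finite-difference gradient could be evaluated in the general 5N-variable case. It relies on the shapely library for the polygon union; the ordering of the five variables per rectangle (center x, center y, length, breadth, rotation angle) and all function names are my own assumptions for the purpose of illustration.

import numpy as np
from shapely.geometry import Polygon
from shapely.affinity import rotate, translate
from shapely.ops import unary_union

def rectangle(x, y, length, breadth, angle):
    # Rectangle centred at the origin, rotated by `angle` radians, then moved to (x, y).
    r = Polygon([(-length / 2, -breadth / 2), (length / 2, -breadth / 2),
                 (length / 2, breadth / 2), (-length / 2, breadth / 2)])
    return translate(rotate(r, angle, use_radians=True), xoff=x, yoff=y)

def objective(v, a, b):
    # v holds the 5N independent variables.  Assumes v is a "valid set", i.e. the
    # rectangles actually enclose the ellipse, so the objective is the area of the
    # union of the rectangles minus the exact ellipse area pi*a*b.
    rects = [rectangle(*v[5 * i:5 * i + 5]) for i in range(len(v) // 5)]
    return unary_union(rects).area - np.pi * a * b

def fd_gradient(f, v, eps=1e-6):
    # Central-difference approximation of the gradient, as mentioned in point 3.
    v = np.asarray(v, dtype=float)
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

Note that this sketch does not test whether the rectangles actually enclose the ellipse, which is exactly the difficulty described in point 4.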
Figure 3: For N = 2, two possible symmetric configurations of rectangles that enclose the ellipse.
Instead of trying to solve the optimization problem in the general case, we impose ex-
tra conditions on the possible configurations that the rectangles can achieve in terms of
their positions and orientations. These conditions are that the sides of each rectangle
must be parallel to the coordinate axes and the center of each rectangle must coincide
with the origin. Imposing these extra conditions leads to a symmetry requirement that
the rectangles must be invariant under inversion about the origin (x ← −x, y ← −y),
and reflection about the X axis (x ← x, y ← −y) and the Y axis (x ← −x, y ← y). We will call such configurations of rectangles that cover the ellipse "symmetric covering configurations" (SCCs). Two such configurations are shown for N = 2 in Figure 3. It turns out that although the configuration in Figure 3(b) is an SCC, it cannot be one that minimizes the area outside the ellipse: the outermost rectangle can always be shrunk so that its sides are tangential to the ellipse, thereby decreasing the area outside the ellipse.
In general, it can be proved that an SCC that minimizes the area outside the ellipse must be of the form shown in Figure 3(a). We skip the proof here, but present a geometric argument for why this must be true for N = 2. Start by noting that if there exists an SCC in which one or both rectangles completely enclose the ellipse individually, then each such rectangle can be shrunk so that it is tangential to the four apexes of the ellipse, thereby decreasing the area outside the ellipse. The configuration thus obtained still has an area outside the ellipse that is larger than that of any configuration of the form in Figure 3(a). Next, note that at the minimum configuration at least one rectangle must be tangential to the apexes joined by the minor axis of the ellipse. But this rectangle cannot be tangential to the other two apexes as well, because it would then fully enclose the ellipse, and there exist configurations of the form in Figure 3(a) with a smaller area. This rectangle must therefore intersect the ellipse at four additional points in the minimum configuration. It then follows that the other rectangle must be tangential to the remaining two apexes and, in addition, must intersect the ellipse at the same four non-apex points, leading to the configuration of Figure 3(a). This argument can be generalized to N > 2, but the generalization is more involved. In Figure 4, we show possible SCCs for N = 3 and 4.
Figure 4: Possible symmetric covering configurations for N = 3 and 4.
In what follows, we will only be concerned with finding solutions to the optimization problem where the configuration of the rectangles is an SCC of the form described in Figure 4. We will see that this leads to an enormous simplification of the optimization problem, as we will be able to write an analytical expression for the objective function. This in turn will allow us to find simple analytical expressions for the gradient and Hessian of the objective function. But first, we reduce the problem to one on the first quadrant by exploiting the symmetry of the solution.
Figure 5: (a) This figure represents the picture in the first quadrant for a SCC of rectangles for N = 3.
Note that now all the rectangles have the lower left vertex coinciding with the origin. (b) This figure
represents the configuration of the rectangles leading to an equivalent optimization problem as in (a).
A = ab \sum_{i=1}^{N} h_i \left[\, 1 - \Bigl(\sum_{j=1}^{i-1} h_j\Bigr)^{2} \right]^{1/2} \qquad (2)
If we can solve the above problem for the unit circle in (4), we can get a solution for the problem for the ellipse in (3) for any arbitrary a and b by simply scaling the X and Y axes appropriately (x ← ax, y ← by). That in turn can be used to construct a solution for the whole ellipse in all four quadrants. This is great progress!
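As a small illustration (my own sketch, with hypothetical function names, not code from the report), the covered area in equation (2) and the rescaling from the unit circle to the ellipse can be written as follows.

import numpy as np

def covered_area(h, a=1.0, b=1.0):
    # Equation (2): A = a*b * sum_i h_i * sqrt(1 - (h_1 + ... + h_{i-1})^2).
    h = np.asarray(h, dtype=float)
    s_prev = np.concatenate(([0.0], np.cumsum(h)[:-1]))   # partial sums s_{i-1}
    return a * b * np.sum(h * np.sqrt(1.0 - s_prev ** 2))

def ellipse_rectangles(h, a, b):
    # Map a unit-circle solution h (with sum(h) = 1) to the first-quadrant
    # rectangles for the ellipse: the i-th slab has height b*h_i and width
    # a*sqrt(1 - s_{i-1}^2), which are exactly the factors summed in equation (2).
    h = np.asarray(h, dtype=float)
    s_prev = np.concatenate(([0.0], np.cumsum(h)[:-1]))
    return a * np.sqrt(1.0 - s_prev ** 2), b * h           # (widths, heights)

Because the scaling multiplies every area by the constant factor ab, heights h_i that are optimal for the unit circle remain optimal for the ellipse.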
The Hessian is symmetric, and it can be directly verified from the relationships in equation (6) that ∂²F/∂h_m∂h_k = ∂²F/∂h_k∂h_m. Surprisingly, it turns out that the Hessian calculation is not very expensive for this objective function. One needs to calculate only the diagonal entries of the Hessian and the first sub-diagonal: all the other below-diagonal elements belonging to the same row are equal, which can again be verified from equation (6). The entries of the Hessian above the diagonal are then obtained from the computed entries below the diagonal by using the symmetry of the Hessian.
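Since equation (6) itself is not reproduced above, the following sketch should be read as an illustration of this structure rather than as a transcription of the report's formulas: it differentiates the quarter-circle objective implied by equation (2) with a = b = 1, namely F(h) = Σ_i h_i √(1 − s_{i−1}²) with s_k = h_1 + · · · + h_k (the constant −π/4 from the quarter-circle area is dropped, since it affects neither the gradient nor the Hessian). The function name is hypothetical.

import numpy as np

def objective_grad_hess(h):
    # F(h) = sum_i h_i * sqrt(1 - s_{i-1}^2),  s_k = h_1 + ... + h_k  (a = b = 1).
    # Assumes h is strictly feasible: h_i > 0 and every partial sum s_k < 1.
    h = np.asarray(h, dtype=float)
    N = len(h)
    s = np.concatenate(([0.0], np.cumsum(h)))        # s[0], ..., s[N]
    c = np.sqrt(1.0 - s[:-1] ** 2)                    # c_i = sqrt(1 - s_{i-1}^2)

    F = np.sum(h * c)

    def tail(x):
        # tail(x)[k] = sum of x[i] over i > k
        r = np.cumsum(x[::-1])[::-1]
        return np.concatenate((r[1:], [0.0]))

    T = tail(h * s[:-1] / c)                          # sum_{i>k} h_i s_{i-1} / c_i
    U = tail(h / c ** 3)                              # sum_{i>k} h_i / c_i^3

    g = c - T                                         # gradient

    # Hessian: diagonal entry H_kk = -U_k; every below-diagonal entry of row k is
    # H_kk - s_{k-1}/c_k (so only the diagonal and the first sub-diagonal need to
    # be computed), and the upper triangle follows by symmetry.
    diag = -U
    below = diag - s[:-1] / c
    H = np.tril(np.tile(below[:, None], (1, N)), -1)  # equal entries below the diagonal
    H += np.diag(diag)
    H = H + np.tril(H, -1).T                          # symmetrize
    return F, g, H

Formulas like these are easy to get wrong, so in practice one would sanity-check them against central differences at a random strictly feasible point before trusting them.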
3 Numerical Implementation
In this section, we present some of the key aspects of the numerical implementation for
solving the optimization problem in (4). We will briefly discuss the following aspects:
1. Efficient computation of the objective function, the gradient and the Hessian.
2. Null space active set method to enforce the constraints.
3. Determining the direction of descent based on steepest descent or modified Newton.
4. The linesearch algorithm based on Goldstein or Strong-Wolfe criteria.
5. The complete high level algorithm to find the solution.
At any feasible point, the constraints in (7) must be true. The strategy that we adopt to
solve the problem is to always be feasible with respect to the strict equality constraint.
So initially, we will start from a point that is feasible with respect to all the constraints
and then search for the minimizer in the null space of only the equality constraint.
The strategy to maintain feasibility with respect to the inequality constraints is different. The key observation that motivates this approach is that at the solution to the optimization problem in (4), none of the inequality constraints can be satisfied exactly. This can be proved easily with a geometric argument. Suppose that for N rectangles the solution involves some h_m = 0 for m ∈ {1, . . . , N}. This cannot be true, because one can split any of the other rectangles with non-zero width into two and decrease the area outside the ellipse, leading to a contradiction. Hence we cannot have h_m = 0 at the minimizer, and applying this argument recursively, there can be no i such that h_i = 0 at the minimizer. Therefore, at the minimizer all the inequality constraints must be inactive. Also note that the equality and inequality constraints in equation (7) together specify a bounded convex domain.
These observations suggest that as long as we stay feasible with respect to the hyperplane h_1 + h_2 + · · · + h_N = 1, and in addition remain inactive with respect to the inequality constraints, we should be able to converge to a minimizer inside the bounded convex domain. This motivates the following strategy to remain feasible with respect to the inequality constraints: starting from a feasible point at the current iterate, we determine a search direction, and then inside the linesearch routine we choose the maximum step length so that none of the inequality constraints are hit. This leads to a feasible point that is inactive with respect to the inequality constraints, and the same process is repeated at the next iteration.
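A minimal sketch of this ratio test (the safety fraction 0.99 is an illustrative choice, not a value taken from the report):

import numpy as np

def max_feasible_step(h, p, frac=0.99):
    # Largest step alpha such that h + alpha*p keeps every h_i strictly positive.
    # The equality constraint sum(h) = 1 is preserved automatically because the
    # search direction satisfies sum(p) = 0 (it lies in the null space of A).
    h = np.asarray(h, dtype=float)
    p = np.asarray(p, dtype=float)
    neg = p < 0
    if not np.any(neg):
        return np.inf                        # no inequality constraint can be hit
    return frac * np.min(-h[neg] / p[neg])   # ratio test over the decreasing components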
With this strategy our working set will always consist of only the row vector [1 1 . . . 1 1].
Therefore, using the notation in Walter’s notes, we have:
A_{1×N} = [ 1  1  · · ·  1  1 ]        (8)
For such an A, the orthogonal matrix Z that spans the null space of A is obtained by
performing a QR decomposition of the matrix M in equation (9).
M_{N \times (N-1)} =
\begin{bmatrix}
 1 &    &        &    \\
-1 &  1 &        &    \\
   & -1 & \ddots &    \\
   &    & \ddots &  1 \\
   &    &        & -1
\end{bmatrix}
\qquad (9)
It is easy to check that M spans the null space of A, i.e. AM = 0, and has full column rank. The matrix Z is then given by ZR = M, where R is a non-singular upper-triangular matrix. We then have AZ = 0 and Z^T Z = I. This QR factorization only needs to be performed once and does not need to be changed from one iteration to the next for a fixed number of rectangles N. The initial point, feasible with respect to both the equality and inequality constraints, can be taken for example to be h^(0) = [1/N, 1/N, . . . , 1/N]^T. Once Z is calculated, one can see that any search direction of the form p = Z p_z ensures that the successive iterates satisfy the equality constraint exactly. Finally, the reduced gradient and the reduced Hessian can be calculated as in equation (10); computing these quantities requires O(N^2) and O(N^3) operations respectively.
Reduced gradient: ĝ = Z^T g
Reduced Hessian: Ĥ = Z^T H Z        (10)
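As a concrete illustration (a NumPy sketch of my own, not the report's code), Z and the reduced quantities of equation (10) can be computed as follows.

import numpy as np

def nullspace_basis(N):
    # Orthonormal basis Z for the null space of A = [1 1 ... 1], obtained from the
    # reduced QR factorization M = Z R of the matrix M in equation (9).
    M = np.zeros((N, N - 1))
    idx = np.arange(N - 1)
    M[idx, idx] = 1.0        # main diagonal
    M[idx + 1, idx] = -1.0   # first sub-diagonal
    Z, R = np.linalg.qr(M)   # Z is N x (N-1) with orthonormal columns
    return Z

def reduced_quantities(Z, g, H):
    # Equation (10); the two products cost O(N^2) and O(N^3) respectively.
    return Z.T @ g, Z.T @ H @ Z

# quick sanity check: A Z = 0 and Z^T Z = I
N = 6
Z = nullspace_basis(N)
assert np.allclose(np.ones(N) @ Z, 0.0)
assert np.allclose(Z.T @ Z, np.eye(N - 1))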
Given ĝ, Ĥ and Z, the modified Cholesky algorithm can be used to compute the descent direction at O(N^3) computational cost. Even though this is costlier than steepest descent, it will be seen that it more than justifies itself through the quadratic convergence of the Newton method close to the true solution.
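The report uses a modified Cholesky factorization (in the Gill-Murray sense); the sketch below substitutes a simpler diagonal-shift variant, which illustrates the same idea of producing a guaranteed descent direction from a possibly indefinite reduced Hessian but is not the algorithm actually implemented here.

import numpy as np

def newton_direction(Z, ghat, Hhat, beta=1e-3):
    # Add tau*I to Hhat until a plain Cholesky factorization succeeds, then solve
    # (Hhat + tau*I) p_z = -ghat and return the full-space direction p = Z p_z
    # (which automatically satisfies A p = 0).
    n = Hhat.shape[0]
    tau = 0.0 if np.all(np.diag(Hhat) > 0) else beta
    while True:
        try:
            L = np.linalg.cholesky(Hhat + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, beta)
    y = np.linalg.solve(L, -ghat)
    pz = np.linalg.solve(L.T, y)
    return Z @ pz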
Let us next briefly discuss the choice of parameters for the linesearch algorithms and
some of the main implementation details.
1. Goldstein linesearch: For a search direction p, the current iterate h and the gradient g at the current iterate, the Goldstein linesearch algorithm proceeds by finding an initial interval D = [α_1, α_2] ⊂ [0, α_max] such that equation (13) holds, for some choice of µ_1 and µ_2 with 0 < µ_1 ≤ µ_2 < 1.

F(h + α_1 p) < F(h) + µ_2 α_1 g^T p ,   F(h + α_2 p) > F(h) + µ_1 α_2 g^T p        (13)

Once such an interval is determined, we use bisection to reduce the interval iteratively until we find a point in D satisfying the Goldstein conditions, i.e. a point for which equation (14) holds. Note that while searching for an interval satisfying equation (13), we might find a point that satisfies equation (14) before an interval is found; in that case we have already found a valid α satisfying the Goldstein conditions and nothing further needs to be done.

F(h) + µ_2 α g^T p ≤ F(h + αp) ≤ F(h) + µ_1 α g^T p        (14)
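The sketch below is a simplified, single-phase version of this procedure (my own illustration, not the report's implementation): it bisects directly on [0, α_max] instead of first constructing the bracket in (13), and the default values of µ_1 and µ_2 are arbitrary.

import numpy as np

def goldstein_linesearch(F, h, p, g, alpha_max, mu1=1e-4, mu2=0.9, max_iter=50):
    # Accept alpha when  F0 + mu2*alpha*slope <= F(h + alpha*p) <= F0 + mu1*alpha*slope,
    # which is equation (14); F is the objective callable, g the gradient at h,
    # p a descent direction (g @ p < 0), alpha_max the largest feasible step.
    F0, slope = F(h), g @ p
    lower = lambda a: F0 + mu2 * a * slope   # maximum allowed decrease
    upper = lambda a: F0 + mu1 * a * slope   # required (sufficient) decrease
    a1, a2 = 0.0, alpha_max
    alpha = min(1.0, alpha_max)
    for _ in range(max_iter):
        Fa = F(h + alpha * p)
        if Fa > upper(alpha):                # too little decrease: step is too long
            a2 = alpha
        elif Fa < lower(alpha):              # "too much" decrease: step can be lengthened
            a1 = alpha
        else:
            return alpha                     # both Goldstein conditions hold
        alpha = 0.5 * (a1 + a2)              # bisect the current bracket
    return alpha                             # fall back to the last trial step

Note that each trial step costs only one function evaluation; g^T p is computed once, which is consistent with the gradient-evaluation counts discussed later for the Goldstein line search.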
4 Results
In this section, we present the results obtained by the optimization program. All results
will be presented on the unit circle, although at the end we will show an example of how
the result looks for the case of an ellipse. As we shall see from the results, solving the
problem on the unit circle reveals some very interesting properties of the solution. We
begin by studying some of these properties generated using the optimization program
outlined in Algorithm 5. The Strong-Wolfe linesearch and modified Newton search direction were used inside the algorithm for all the results discussed in Sections 4.1-4.2, while the initial feasible point was taken as h^(0) = [1/N, 1/N, . . . , 1/N]^T for solving each problem for different values of N.
N rectangles, and we want to find the solution for N + 1 rectangles. One can start
by dividing any rectangle at the optimal solution for N into two rectangles to yield a
feasible point for the case N + 1 (as an example, consider the diagrams in Figure 7 for
N = 1 and N = 2 to see how such a division could be done). Let the objective function value in this configuration be F′_{N+1}. If the value of the objective function at the optimal solution for N + 1 rectangles is F*_{N+1}, then it must be true that F*_{N+1} ≤ F′_{N+1} < F*_N.
Therefore as N increases, the area outside the unit circle decreases monotonically. In
the limit as N → ∞, this area converges to 0.
Figure 7: The configuration of the rectangles at the optimal solution point for N = 8, 16, 32, 64, 128 and 256.
The minimum area outside the unit circle (F*_N) determined by solving the optimization problem numerically, as a function of the number of rectangles N, is plotted in Figure 8. To be precise, we have plotted F*_N on a logarithmic axis versus log_2 N. The interesting result here is that the plot seems to be linear for large N.

Figure 8: Plot of F*_N versus log_2 N, with the vertical axis on a logarithmic scale.
Another quantity of interest is the area outside the unit circle for each individual rectangle in the solution for a total of N rectangles. We plot this quantity for N = 32, 64 and 128 in Figure 9. The easy observation from these figures is that the area outside the circle for each rectangle decreases as the number of rectangles N increases. This
is completely expected. However, there is another more interesting fact about these
rectangles that is a little harder to see. This is an important result and we state it
below:
For a fixed value of N, the area outside the unit circle at the optimal solution for each of the N rectangles is a symmetric function of the rectangle index about N/2.
To see this, look at Figure 10 which plots the same figures as in Figure 9 for N =
32 and 128 over a smaller dynamic range suitable for each figure. The symmetry is
easily seen. This is, however, predictable from a geometric argument on the circle. Imagine how the figures shown previously in Figure 5 (a) and (b) would look for the unit circle, the latter case being the optimization problem we are solving. The equivalence of the problems characterized by Figures 5 (a) and (b) (as explained in Section 2.4) leads to an important conclusion: for any N, Figure 5 (a) at the optimal point for the unit circle must be invariant under a swap of the coordinate axes (y ← x, x ← y), by symmetry. Note that this symmetry argument only holds for the configuration of the rectangles at the optimal solution, with the circle playing a key role in the argument. What this implies is that if one swapped the X and Y axes, the entire image (all the rectangles and the circle combined) would look the same as before the swap.
Figure 9: This figure plots the area outside each rectangle at the optimal solution determined by the
optimization program for N = 32, 64 and 128. The vertical axis here is plotted in a logarithmic scale,
and the range of the vertical axis is kept the same.
Figure 10: Same plot as in Figure 9 for N = 32 and 128 with reduced range on the vertical axis.
An exactly analogous symmetry argument suggests the following result, which we state below:
The angle subtended at the center of the circle by the part of the circumference contained inside each of the N rectangles at the optimal solution is also a symmetric function of the rectangle index about N/2.
This result is shown in Figure 11 for the cases N = 32 and 64. The angle being plotted
here is the angle subtended by the arc at the center of the circle, where the end points
of the arc for each rectangle are the two points where each rectangle intersects the
circumference of the circle.
Figure 11: Plot of the angle subtended by the part of the circumference contained inside each rectangle
at the center of the circle, for the cases N = 32 and 64.
Both the symmetry results stated above follow because of symmetries of the circle.
This is an important point because we would have missed these nice properties of the
solution if we had solved our optimization problem on the ellipse directly, as the same quantities would then no longer have been symmetric.
In the next section, we study how the performance of the algorithm, measured by run time, number of iterations and other key performance indicators, varies as a function of the number of rectangles N.
investigation). There are several noteworthy aspects of these figures that deserve some discussion.
(a) The figures illustrate that increasing the dimension of the problem does not lead to an ever increasing amount of work in terms of function and gradient evaluations. It should be pointed out that this is a consequence of using modified Newton to find the search direction at every iteration, and would not be the case if the search directions were determined using steepest descent.
(b) Numerical tests with steepest descent did not even converge in a reasonable amount of time for N = 2^10 rectangles on the Windows PC on which all computing for this project was done. Moreover, with steepest descent the number of function and gradient evaluations increases with N.
(c) Another aspect that could be behind these figures is the choice of the initial feasible point. Everything we have discussed so far is based on choosing the initial point as h^(0) = [1/N, 1/N, . . . , 1/N]^T. It may be that this is a very good starting solution, and changing the initial feasible point to something else might lead to degraded performance.
Figure 12: (a) Plot of the total number of function evaluations vs log2 N . (b) Plot of the total number
of gradient evaluations vs log2 N . The vertical axis is plotted on a logarithmic scale.
2. Number of iterations and total run time: We next study the total number of
iterations needed to converge to the true solution as a function of N and also how
the run time varies with N . These results are plotted in Figure 13 (a) and (b)
respectively. We make the following comments about these plots below:
(a) The optimization algorithm with the modified Newton search direction is extremely robust in terms of generating search directions that quickly converge to the optimal solution. The number of iterations appears to be approximately 10 across the entire range of N tried in this project (2^0 to 2^10). However, this
again might be influenced by the fact that the initial feasible solution is a
good starting point. An important comment regarding this is that the conclu-
sions would be completely different had one used steepest descent to generate
the search directions. With steepest descent the number of iterations needed to
converge goes up drastically.
(b) Figure 13 (b) is the more practical plot of interest: it shows that the actual run time does increase with increasing N. Neglecting the initial part of the curve for small values of N, where overhead effects are probably influencing its behavior, we see that for large N the run time increases exponentially as a function of the horizontal coordinate log_2 N, i.e. polynomially in N.
(c) A big reason for the increase in run time is the cost of forming the Hessian matrix, accessing its elements, performing the modified Cholesky factorization, and the associated growth in the cost of the matrix-vector multiplications as N increases. So this dependence of the total run time on N is expected.
Figure 13: (a) This figure is a plot of the total number of iterations taken to converge to the solution
vs log2 N . (b) This figure is a plot of the total run time of the optimization program vs log2 N . The
vertical axis in (b) is plotted in a logarithmic scale.
4.3 Performance analysis of the algorithm for different line search strategies and for different descent directions for a fixed value of N
So far we have looked at different results by using descent directions generated using
the modified Newton algorithm and the Strong-Wolfe line search criteria inside the
optimization program. In this section, we will study how the performance of the opti-
mization program depends on the choice of the line search algorithm and the choice of
using different descent directions. We choose N = 100 for this study and do not vary
it.
In the first test, we use steepest descent to generate our descent directions and then compare some key performance parameters for the two line search methods, Goldstein and Strong-Wolfe, as a function of the iteration number. For the purpose of this study, we fix the total number of iterations at 100. The solution does not converge in 100 iterations (because we are using steepest descent), but this is enough to illustrate the main differences. The following quantities are of interest as a function of the iteration number:
• Norm of reduced gradient ||ĝ||
• Norm of search direction ||p||
• Objective function F
• The step length α
• Number of function and gradient evaluations at every iteration
• Number of cumulative function and gradient evaluations
Figure 14: In this figure we plot the norm of the search direction vector ||p||2 and the norm of the
reduced gradient ||ĝ||2 as a function of the iteration number, in (a) and (b) respectively. The vertical
axis is plotted in a logarithmic scale.
In Figures 14 (a) and (b), we plot the norm of the search direction vector ||p||_2 and the norm of the reduced gradient ||ĝ||_2 as a function of the iteration number. As we can see, we are very far from achieving convergence, with ||ĝ||_2 ∼ 10^{-4} after 100 iterations. However, the plots suggest that the Strong-Wolfe line search achieves an order of magnitude better convergence than the Goldstein line search.
In Figure 15 (a), we plot the function value F versus the iteration number. The plot suggests that, in terms of achieving the optimal value of the objective function, both line search algorithms do well overall; the main difference is in how many digits after the decimal point the answers match the true solution. As we can see, after 40 iterations the algorithm with either line search matches the true solution with less than 1% error. In Figure 15 (b), we plot the step length α as a function of the iteration number. The value of α for either line search spans the range 10^{-1} to 10^{1}, but there is no pattern of particular relevance in this figure.
Figure 15: In this figure we plot the function value F and the step length α as a function of the
iteration number, in (a) and (b) respectively. The vertical axis in (b) is plotted in a logarithmic scale.
Next, in Figure 16 we plot some useful statistics of the number of function and gradient evaluations as a function of the iteration number for the Goldstein and Strong-Wolfe line search algorithms. Figures 16 (a) and (b) show the exact number of function and gradient evaluations, respectively, per iteration. As one can see, the number of gradient evaluations in the optimization algorithm with Goldstein line search is just one per iteration. This is because, as equations (13) and (14) show, the conditions only check the function value F(h + αp) at the new points; the quantity g^T p appearing in the conditions needs to be computed only once, and does not change as new points are checked. For the Strong-Wolfe line search, however, we need to compute the gradient of the function at each point we check, which explains Figure 16 (b). Finally, for Figures 16 (a) and (b), one should note that for the Strong-Wolfe line search both the number of function and gradient evaluations show an overall increasing tendency with increasing iteration number. (This is worrying, because we expect the amount of computation involved in the line search to decrease as we converge closer and closer to the solution. This probably points to inefficiencies in the line search code that we currently have!)
Figure 16: In this figure we plot: (a) number of function evaluations per iteration, (b) number of gradient evaluations per iteration, (c) cumulative number of function evaluations, and (d) cumulative number of gradient evaluations, versus the iteration number.
In Figures 16 (c) and (d), we plot the cumulative number of function and gradient evaluations as a function of the iteration number. They contain no more information than Figures 16 (a) and (b), but it is still nice to visualize these results. In particular, they reveal that the number of function evaluations tends to be about the same for the Goldstein and Strong-Wolfe line search algorithms, while the number of gradient evaluations is clearly much greater for the Strong-Wolfe line search than for the Goldstein line search.
Comparison of steepest descent and modified Newton search directions
In this section we study the superiority of the modified Newton algorithm as compared
to the steepest descent algorithm when it comes to generating good search directions.
We know of the theoretical result that close to the true solution, Newton iterates exhibit
quadratic convergence. This is really what is behind some of the spectacular results that follow. The number of rectangles in this study is again fixed at N = 100, while the optimization program is run for 100 iterations, without an explicit convergence criterion enforced to stop execution. We have already demonstrated that with steepest descent we do not achieve convergence in 100 iterations, but we will be pleasantly surprised to see that modified Newton converges in about 50 iterations. For this study, we will use the Strong-Wolfe conditions in the line search. We plot the same quantities as in Figures 14, 15 and 16, but this time we compare the results obtained using the modified Newton and steepest descent algorithms.
We begin by plotting the norm of the search direction vector ||p||_2 and the norm of the reduced gradient ||ĝ||_2 in Figures 17 (a) and (b) respectively, as a function of the iteration number, using modified Newton and steepest descent directions. The results show that both quantities reach machine precision, ∼ 10^{-15}, by iteration 50 for modified Newton. Contrast this with the dismal performance of steepest descent, which only manages to reduce these quantities to about 10^{-5} after 100 iterations. This is clearly a consequence of the quadratic convergence of the Newton iterates close to the optimal point, a property that the steepest descent iterates lack completely.
Figure 17: In this figure we plot the norm of the search direction vector ||p||2 and the norm of the
reduced gradient ||ĝ||2 as a function of the iteration number, in (a) and (b) respectively. The vertical
axis is plotted in a logarithmic scale.
In Figure 18 (a), we plot the objective function F versus the number of iterations for the optimization program run using modified Newton and steepest descent directions. The rate at which the modified Newton iterates converge to the true solution is spectacular: we reach the true solution within an error tolerance of 1% of the objective function value at the optimal solution in just 5 iterations. By comparison, the steepest descent iterates take about 40 iterations to reach the same tolerance in terms of the objective function value. This remarkable convergence of the modified Newton iterates is a consequence of the quadratic convergence of the Newton iterates near the optimal solution. In Figure 18 (b), we plot the value of the step length α as a function of the iteration number. The general observation is that for steepest descent, α lies in the range 10^{-1} to 10^{1}. For the modified Newton iterates, α has a bigger dynamic range, roughly 10^{-6} to 10^{0}, but there is also an undesirable feature from roughly iteration 65 onwards: the value of α settles at about 10^{-3} in this range. This should not happen, since from theory we know that α should be close to 1 near the optimal solution for search directions generated using the Newton method. This is probably due to inefficiencies in the line search routine and needs more investigation. However, it is equally likely that this effect is a manifestation of the fact that very small numbers (∼ 10^{-15}) cannot be computed with high accuracy, and the resulting lack of precision in the function value and the gradient interferes with the quadratic convergence of the Newton iterates.
Figure 18: In this figure we plot the function value F and the step length α as a function of the
iteration number, in (a) and (b) respectively. The vertical axis in (b) is plotted in a logarithmic scale.
Finally, we plot the number of function and gradient evaluations per iteration, for
steepest descent and modified Newton iterates in Figures 19 (a) and (b) respectively.
We see that for modified Newton search directions, the Strong-Wolfe line search needs
to perform many more function and gradient evaluations compared to the steepest
descent search directions. The unpleasant aspect about these plots is that for modified
Newton search directions, the line search routine does too many computations even after
convergence is reached. This can be due either to the inability to compute and compare very small numbers (∼ 10^{-15}) accurately on a computer, or to inefficiencies in the line search implementation.
Figures 19 (c) and (d) plot the cumulative number of function and gradient evaluations, respectively, versus the iteration number, for the optimization program with descent directions generated using the modified Newton or steepest descent algorithm. Although we expect the plots to increase with the number of iterations (which is the case for both algorithms), it is clear that the total number of function and gradient computations becomes very large for the modified Newton implementation. This may also be true for the steepest descent directions once the iterates get "close enough" to the true solution, which the steepest descent algorithm fails to do in 100 iterations, so we do not see the effect in these plots.
Figure 19: In this figure we plot: (a) number of function evaluations per iteration, (b) number of gradient evaluations per iteration, (c) cumulative number of function evaluations, and (d) cumulative number of gradient evaluations, versus the iteration number.
Finally, we should note that even though we plotted 100 iterations with the modified Newton algorithm generating the search directions inside the optimization program, Figure 12 indicates that convergence is reached in only about 10-15 iterations. So all the issues pointed out about the line search routine are never really encountered in practice, as the program stops much earlier once convergence is achieved. Nevertheless, the plots in Figures 18 and 19 are important to consider in order to find bugs in the code.
Figure 20: In this figure we plot, for different randomly generated starting feasible solutions: (a) the function value F, (b) the norm of the reduced gradient ||ĝ||_2, (c) the cumulative number of function evaluations, and (d) the cumulative number of gradient evaluations, versus the iteration number. The vertical axes in (a) and (b) are plotted on a logarithmic scale. Each color represents a different starting solution, and the colors are consistent across all four figures.
Again, we have plotted some key performance indicators for 100 iterations, such as the function value F, the norm of the reduced gradient ||ĝ||_2, and the cumulative number of function and
gradient evaluations as a function of the iteration number. We use the modified Newton
algorithm to generate search directions and Strong-Wolfe line search for this study. The
results are plotted in Figure 20. Each of the colors represent the results for a randomly
generated initial feasible point. As one can see, there is not much difference in the
performance of the algorithm when we use different starting feasible points, and all of
them exhibit similar convergence characteristics and computational effort in terms of
function and gradient evaluations.
Figure 21: This figure illustrates how the solution obtained can be extended to obtain the configuration
of the rectangles for the unit circle on all the four quadrants. (a) This is the solution of the optimization
problem that we solve for the unit circle on the first quadrant. (b) Next, note how the solution in (a)
remains unchanged for this new configuration of the rectangles. (c) Finally, the configuration in (b)
can be extended to all the four quadrants to give the solution of the ellipse covering problem in all
four quadrants. Here we have plotted the results for N = 8 rectangles.
We finish the analysis by showing how the problem solved on only the first quadrant
for a unit circle directly yields an SCC solution for the whole ellipse in all four quadrants.
Figure 22: The SCC using N = 8 rectangles for an ellipse with semi-major and semi-minor axes lengths
a = 2 and b = 1 respectively, that minimizes the area outside the ellipse.
5 Future Work
Some cool results have been generated by running the program on super-ellipses, which are described by equation (17). However, these results are still in unstructured form and more study needs to be done, so we do not include them here; some pictures will be sent at a later date.

\left(\frac{x}{a}\right)^{\alpha} + \left(\frac{y}{b}\right)^{\beta} = 1 ,\qquad \alpha > 0,\ \beta > 0 \qquad (17)
6 Acknowledgements
I’d like to thank Carlos for excellent exposition of some very important and difficult
concepts pertinent to numerical optimization in his office hours for the course CME 304.
I’d also like to thank Prof. Walter Murray for making this project an essential compo-
nent of the course. Trying to implement computer programs to solve the problem really
taught me a great deal about some of the major difficulties that one encounters during
program execution like limits of finite precision floating point arithmetic, performance
issues with inefficient line search routines, trade-offs necessary to solve problems in a
reasonable amount of time with limited memory and computing resources etc.
References
[1] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Science & Business Media, 2006.
[2] P. E. Gill, W. Murray and M. H. Wright, Practical Optimization, Academic Press, 1981.
[3] W. Murray, Numerical Optimization, CME 304 course lecture notes, 2016.