
Programming Basics and AI

with Matlab and Python

Lectures on YouTube:
https://www.youtube.com/@mathtalent

Seongjai Kim

Department of Mathematics and Statistics


Mississippi State University
Mississippi State, MS 39762 USA
Email: [email protected]

Updated: May 19, 2023


Coding techniques may improve a computer program by tens of percent,
while an effective algorithmic design can improve it by a factor of tens or hundreds.

In computational literacy, coding ability is to the tip of the iceberg
as the ability of algorithmic design is to its remainder.

Seongjai Kim, Professor of Mathematics, Department of Mathematics and Statistics, Mississippi State University, Mississippi State, MS 39762 USA. Email: [email protected].
Prologue
This lecture note provides an overview of scientific computing, i.e., of modern information
engineering tasks to be tackled by powerful computer simulations. The emphasis throughout
is on the understanding of modern algorithmic designs and their efficient implementation.
As is well known in the computational methods community, computer programming is
the process of constructing an executable computer program in order to accomplish a specific
computational task. Programming in practice cannot be realized without computational
languages. However, it is not simply a matter of experience with those languages; it involves
concerns such as

• mathematical analysis,
• generating computational algorithms,
• profiling algorithms' accuracy and cost, and
• the implementation of algorithms in selected programming languages
  (commonly referred to as coding).

The source code of a program can be written in one or more programming languages.
The manuscript is conceived as an introduction to the thriving field of information engineering, particularly for early-year college students who are interested in mathematics, engineering, and other sciences, without an already strong background in computational methods. It will also be suitable for talented high school students. All examples to be treated in this manuscript are implemented in Matlab and Python, and occasionally in Maple.

Currently, the lecture note is growing.


May 19, 2023

Contents

Title ii

Prologue iii

Table of Contents viii

1 Programming Basics 1
1.1. What is Programming or Coding? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1. Programming: Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2. Simple form of programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3. Functions: generalization and reusability . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4. Becoming a good programmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2. Matlab: A Powerful Computer Language . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1. Introduction to Matlab/Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2. Graphics with Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3. Repetition: iteration loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.4. Loop control statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.5. Anonymous function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2.6. Open source alternatives to Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Simple Programming Examples 29


2.1. Area Estimation of A Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2. Visualization of Complex-Valued Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3. Inverse Functions and Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.1. Inverse functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2. Exponential functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.3. Logarithmic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3 Programming with Calculus 53


3.1. Derivative: The Slope of the Tangent Line . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2. Basis Functions and Power Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.1. Power series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.2. Taylor series expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67


3.3. Newton’s Method for Zero-Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


3.4. Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1. Horner’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5. Multi-Variable Functions and the Gradient Vector . . . . . . . . . . . . . . . . . . . . . 81
3.5.1. Functions of several variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.5.2. First-order partial derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.5.3. The gradient vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4 Programming with Linear Algebra 89


4.1. Solutions of Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.1.1. Solving a linear system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.2. Matrix equation Ax = b . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.3. Reduced row echelon form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2. Invertible Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3. Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4. Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.1. Characteristic equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.4.2. Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.4.3. Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5 Regression Analysis 113


5.1. Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2. Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.1. Regression line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.2. Least-squares fitting of other curves . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2.3. Nonlinear regression: Linearization . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.3. Scene Analysis with Noisy Data: RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.1. Weighted least-squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3.2. RANdom SAmple Consensus (RANSAC) . . . . . . . . . . . . . . . . . . . . . . . 126
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6 Fundamentals of AI 131
6.1. What is Artificial Intelligence (AI)? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2. Constituents of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3. Designing Artificial Brains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4. Future of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7 Python Basics 137


7.1. Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2. Python in an Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7.2.1. Python essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


7.2.2. Frequently used Python rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.2.3. Looping and functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.3. Zeros of Polynomials in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4. Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.5. A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8 Mathematical Optimization 161


8.1. Gradient Descent (GD) Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2. Newton’s Method for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

9 Vector Spaces and Orthogonality 171


9.1. Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

10 Principal Component Analysis 175


10.1.Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
10.1.1. The covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
10.1.2. Computation of principal components . . . . . . . . . . . . . . . . . . . . . . . . 181
10.1.3. Dimensionality reduction: Data compression . . . . . . . . . . . . . . . . . . . . 183
10.2.Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.2.1. Algebraic interpretation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.2.2. Computation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.3.Application of the SVD for LS Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

11 Machine Learning 205


11.1.What is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.1.1. Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.1.2. Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
11.2.Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.2.1. Adaline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.2.2. Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.2.3. Multi-class classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.3.Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.3.1. A simple network to classify hand-written digits . . . . . . . . . . . . . . . . . . 220
11.3.2. Implementing a network to classify digits [7] . . . . . . . . . . . . . . . . . . . . 224
11.4.Multi-Column Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

12 Scikit-Learn: A Popular Machine Learning Library 237



12.1.Scikit-Learn Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238


12.1.1. Why scikit-learn? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.1.2. Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.2.Scikit-Learn – Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.2.1. Scikit-Learn supervised learning modules . . . . . . . . . . . . . . . . . . . . . . 247
12.2.2. Performance comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.3.Scikit-Learn – Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.3.1. Scikit-Learn unsupervised learning modules . . . . . . . . . . . . . . . . . . . . 248
12.3.2. Performance comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Exercises for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

P Projects 251
P.1. Edge Detection, using Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
P.2. Number Plate Detection, using Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

Bibliography 255

Index 257
Chapter 1. Programming Basics

In this chapter, you will learn


• what programming is
• what coding is
• what programming languages are
• how to convert mathematical terms to codes
• how to control repetitions

Contents of Chapter 1
1.1. What is Programming or Coding? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Matlab: A Powerful Computer Language . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercises for Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


1.1. What is Programming or Coding?


Definition 1.1. Computer programming is the process of building
an executable computer program in order to accomplish a specific com-
putational task.
• Programming involves various concerns such as
– mathematical/physical analysis,
– generating computational algorithms,
– profiling algorithms’ accuracy and cost, and
– the implementation of algorithms in a chosen programming lan-
guage (commonly referred to as coding).
• The purpose of programming is to find a sequence of instructions
that will automate the performance of a task for solving a given
problem.
• Thus, the process of programming often requires expertise in sev-
eral different subjects, including knowledge of the application
domain, specialized algorithms, and formal logic.

1.1.1. Programming: Some examples


Example 1.2. Assume that we need to find the sum of integers from 2 to
5:
2 + 3 + 4 + 5.
Then, you may start with 2; add 3, add 4, and finally add 5; the answer
is 14. This simple procedure is the result of programming in your brain.
Programming is thinking.

Example 1.3. Let's try to get √5. Your calculator must have a function key √ . When you input 5 and push Enter, your calculator displays the answer on the spot. How can the calculator get the answer?
Solution. Calculators or computers cannot keep a table to look the answer up. They compute the answer on the spot as follows.

Let Q = 5.
1. initialization: p (e.g., p = 1)
2. for i = 1, 2, · · · , itmax
       p ← (p + Q/p)/2;
3. end for
squareroot_Q.m
1 Q=5;
2

3 p = 1;
4 for i=1:8
5 p = (p+Q/p)/2;
6 fprintf("%3d %.20f\n",i,p)
7 end

Output
1 1 3.00000000000000000000
2 2 2.33333333333333348136
3 3 2.23809523809523813753
4 4 2.23606889564336341891
5 5 2.23606797749997809888
6 6 2.23606797749978980505
7 7 2.23606797749978980505
8 8 2.23606797749978980505

The algorithm has converged to all 20 displayed decimal digits in just 6 iterations.

Note: The above example shows what really happens in your calculators and computers. In general, √Q can be found in a few iterations of simple mathematical operations.
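The iteration above can also be packaged as a reusable routine. The following is a minimal sketch of our own (the function name mysqrt, the iteration cap, and the stopping tolerance 1e-15 are assumptions, not from the text):

mysqrt.m (a sketch)
function p = mysqrt(Q)
% function p = mysqrt(Q)
%   Approximates sqrt(Q) for Q > 0 by the update p <- (p + Q/p)/2.
p = 1;                                     % initialization
for i = 1:100                              % itmax = 100 is more than enough
    pnew = (p + Q/p)/2;                    % the same update as in squareroot_Q.m
    if abs(pnew - p) <= 1e-15*pnew, p = pnew; break; end   % stop when iterates agree
    p = pnew;
end

For example, mysqrt(5) returns the same value as the last line of the output above.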

Remark 1.4. Note that


p ← (p + Q/p)/2 = p − (p^2 − Q)/(2p),    (1.1)
which can be interpreted as follows.
1. Square the current iterate p;
2. Measure the difference from Q;
3. Scale the difference by twice the current iterate (2p)
4. Update p by subtracting the scaled difference (correction term)

Question. How could we know that a good scaling factor in the correction term is 2p?

• The answer comes from a mathematical analysis.


• In general, programming consists of
(a) mathematical analysis,
(b) algorithmic design,
(c) implementation to the computer (coding), and
(d) verification for accuracy and efficiency.
• Once you have done a mathematical analysis and performed algorithmic
design, the next step is to implement the algorithm as a code (coding).
• In implementation, you can use one (or more) of computer languages
such as Matlab, Python, C, C++, and Java.

• Through the course, you will learn programming techniques,


using simple languages such as Matlab and Python.
• Why simple languages?

To focus on mathematical logic and algorithmic design



Remark 1.5. (Coding vs. Programming)


At this moment, you may ask questions like:
• What is coding?
• How is it related to programming?
Though the terms are often used interchangeably, coding and program-
ming are two different things.
Particularly, in Software Development Industries,
• Coding refers to writing codes for applications, but programming
is a much broader term.
– Coding is basically the process of creating codes from one lan-
guage to another, while programming is to find solutions of prob-
lems and determine how they should be solved.
– Programmers generally deal with the big picture in applications,
not just compartmentalized lines of code.
So you will learn programming!

1.1.2. Simple form of programming


In order to understand more clearly what programming is, let us first con-
sider the following example.
Example 1.6. Find the sum of the square of consecutive integers from 1
to 10.
Solution.

• This example asks to evaluate the quantity:


1^2 + 2^2 + · · · + 10^2 = sum_{i=1}^{10} i^2.    (1.2)

• A Matlab code can be written as



1 sqsum = 0;
2 for i=1:10
3     sqsum = sqsum + i^2;
4 end

• When the code is executed, the variable sqsum saves 385.
• The code is a simple form of repetition, one of the most common building blocks in programming.

1.1.3. Functions: generalization and reusability

Remark 1.7. Reusability.


• The above Matlab program produces the square sum of integers
from 1 to 10, which may not be useful for other occasions.
• In order to make programs reusable for various situations, opera-
tions or a group of operations must be
– implemented with variable inputs, and
– saved as a form of function.

Example 1.8. (Generalization of Example 1.6). Find the sum of the


square of consecutive integers from m to n.

Solution. As a generalization of the above Matlab code, it can be imple-


mented and saved in squaresum.m as follows.
squaresum.m
1 function sqsum = squaresum(m,n)
2 %function sqsum = squaresum(m,n)
3 % Evaluates the square sum of consecutive integers: m to n.
4 % input: m,n
5 % output: sqsum
6

7 sqsum = 0;
8 for i=m:n
9 sqsum = sqsum + i^2;
10 end

• In Matlab, each of saved function files is called an M-file, of which the


first line specifies
– the function name (squaresum),
– input variables (m,n),
– outputs (sqsum).

• Lines 2–5 of squaresum.m, beginning with the percent sign (%), provide a
convenient user interface. The built-in function help can be utilized
whenever we want to see what the programmer has commented for the
function.
• For example,
help
1 >> help squaresum
2 function sqsum = squaresum(m,n)
3 Evaluates the square sum of consecutive integers: m to n.
4 input: m,n
5 output: sqsum

• The last four lines of squaresum.m include the required operations for
the given task.
• On the command window, the function is called for various m and n.
For example,

1 >> squaresum(1,10)
2 ans = 385

Remark 1.9. Programming Basics


When a computational task is programmed, a code (functions) must be
implemented with sufficient information, including
• operations (e.g., summing the square of consecutive integers)
• inputs (e.g., the beginning and terminal integers)
• outputs (e.g., the sum)

1.1.4. Becoming a good programmer

In-Reality 1.10. Components in Programming:


As aforementioned, computer programming (or programming) is
the process of building an executable computer program for accomplish-
ing a specific computational task. A task may consist of numerous
sub-tasks each of which can be implemented as a function; some
functions may be used more than once or repeatedly in a pro-
gram. The reader may get a deeper understanding of programming by
reading through the following.

• Task modularization: The given computational task can be partitioned


into several small sub-tasks (modules), each of which is manage-
able conveniently and effectively in both mathematical analysis and
computer implementation. The major goal of task modularization is to
build a backbone of programming.
• Development of algorithms: For each module, computational algo-
rithms must be developed and saved in functions.
• Choice of computer languages: One can choose a computer
language in which all the sub-tasks are implemented. However, it
is occasionally the case that sub-tasks are implemented in more than
one computer language, in order to maximize the performance
of the resulting program and/or to minimize human effort.
• Debugging: Once all the modules are implemented and linked for the
given computational task, the code must be verified for correctness
and effectiveness. Such a process of finding and resolving defects or
issues within a computer program is called debugging.

Note: It is occasionally the case that verification and debugging take
much longer than the implementation itself.

Remark 1.11. Tips for Programming:


• Add functions one-by-one: Building a program is not a simple prob-
lem but a difficult project, particularly when the program should be
constructed from scratch. A good strategy for an effective program-
ming is:
(a) Add functions one-by-one.
(b) Check if the program is correct, each time adding a function.
That is, the programmer should keep the program in a working condi-
tion for the whole period of time of implementation.
• Use/modification of functions: One can build a new program pretty
effectively by trying to modify and use old functions used for the same
or similar projects. When it is the case,
you may have to start the work by copying old func-
tions to newly-named functions to modify, rather than
adding/replacing lines of the original functions.
Such a strategy will make the programmer debug much more easily
and keep the program in a working condition all the time.

Note: For a successful programming, the programmer may consider the


following, before he/she starts implementation (coding).
• Understanding the problem: inputs, operations, & outputs
• Required algorithms: reusable or new
• Required mathematical methods/derivation
• Program structure: how to place operations/functions
• Verification: How to verify the code to ensure correctness

Example 1.12. Let us write a program for sorting an array of numbers


from smallest to largest.
Solution. We should consider the following before coding.
• The goal: A sorting algorithm.
• Method: Comparison of component pairs for the smaller to move up.
• Verification: How can I verify the program works correctly?
Let’s use e.g., a randomly-generated array of numbers.
• Parameters: Overall, what could be input/output parameters?

All being considered, a program is coded as follows.


mysort.m
1 function S = mysort(R)
2 %function S = mysort(R)
3 % which sorts an array from smallest to largest
4

5 %% initial setting
6 S = R;
7

8 %% get the length


9 n = length(R);
10

11 %% begin sorting
12 for j=n:-1:2 %index for the largest among remained
13 for i=1:j-1
14 if S(i) > S(i+1)
15 tmp = S(i);
16 S(i) = S(i+1);
17 S(i+1) = tmp;
18 end
19 end
20 end

SortArray.m
1 % User parameter
2 n=10;
3

4 % An array of random numbers


5 % (1,n) vector of integer random values <= 100
6 R = randi(100,1,n)
7

8 % Call sorting, without using "sort"


9 S = mysort(R)

Output
1 >> SortArray
2 R =
3 33 88 75 17 91 94 79 36 2 72
4 S =
5 2 17 33 36 72 75 79 88 91 94

Summary 1.13. Programming vs. Coding

• Programming consists of analysis, design, coding, & verification.


It requires creative thinking and reasoning, on top of coding.

• It would be better to begin with a simple computer language.



1.2. Matlab: A Powerful Computer Language


Matlab (matrix laboratory) is a multi-paradigm numerical comput-
ing environment and proprietary programming language developed by
MathWorks.
• Flexibility: Matlab allows matrix manipulations, plotting of func-
tions and data, implementation of algorithms, creation of user in-
terfaces, and interfacing with programs written in other languages,
including C, C++, C#, Java, Fortran, and Python; it is particularly
good at matrix manipulations.
• Computer Algebra: Although Matlab is intended primarily for nu-
merical computing, an optional toolbox uses the MuPAD symbolic
engine, allowing access to symbolic computing abilities.
• Most Convenient Computer Language: Overall, Matlab is about
the easiest computer language to learn and to use as well.

Remark 1.14. For each of programming languages, there are four


essential components to learn.
1. Looping – repetition
2. Conditional statements – dealing with cases
3. Input/Output – using data and saving/visualizing results
4. Functions – reusability and programming efficiency

1.2.1. Introduction to Matlab/Octave


Vectors and Matrices
The most basic thing you will need to do is to enter vectors and matrices.
You would enter commands to Matlab at a prompt that looks like >>.
• Rows are separated by semicolons (;) or Enter .
• Entries in a row are separated by commas (,) or Space .

For example,
Vectors and Matrices
1 >> v = [1; 2; 3] % column vector
2 v =
3 1
4 2
5 3
6 >> w = [5, 6, 7, 8] % row vector
7 w =
8 5 6 7 8
9 >> A = [2 1; 1 2] % matrix
10 A =
11 2 1
12 1 2
13 >> B = [2, 1; 1, 2]
14 B =
15 2 1
16 1 2

• The symbols (,) and (;) can be used to combine more than one command
in the same line.
• If we use semicolon (;), Matlab sets the variable but does not print the
output.

For example,

1 >> p = [2; -3; 1], q = [2; 0; -3];


2 p =
3 2
4 -3
5 1
6 >> p+q
7 ans =
8 4
9 -3
10 -2
11 >> d = dot(p,q);

where dot computes the dot product of two vectors.



• Instead of entering a matrix at once, we can build it up from either its


rows or its columns.

For example,
1 >> c1=[1; 2]; c2=[3; 4];
2 >> M=[c1,c2]
3 M =
4 1 3
5 2 4
6 >> c3=[5; 6];
7 >> M=[M,c3]
8 M =
9 1 3 5
10 2 4 6
11 >> c4=c1; r3=[2 -1 5 0];
12 >> N=[M, c4; r3]
13 N =
14 1 3 5 1
15 2 4 6 2
16 2 -1 5 0

Operations with Vectors and Matrices


• Matlab uses the symbol (*) for both scalar multiplication and matrix-
vector multiplication.
• In Matlab, to retrieve the (i, j)-th entry of a matrix M, type M(i,j).
• To retrieve more than one element at a time, give a list of columns and
rows that you want.
• For example, 2:4 is the same as [2 3 4].
• A colon (:) by itself means all. Thus, M(i,:) extracts the i-th row of
M. Similarly, M(:,j) extracts the j-th column of M.

For example,
1 >> M=[1 2 3 4; 5 6 7 8; 9 10 11 12], v=[1;-2;2;1];
2 M =
3 1 2 3 4
4 5 6 7 8
5 9 10 11 12
6 >> M(2,3)

7 ans =
8 7
9 >> M(3,[2 4])
10 ans =
11 10 12
12 >> M(:,2)
13 ans =
14 2
15 6
16 10
17 >> 3*v
18 ans =
19 3
20 -6
21 6
22 3
23 >> M*v
24 ans =
25 7
26 15
27 23

• To multiply two matrices in Matlab, use the symbol (*).


• The n × n identity matrix is formed with the command eye(n).
• You can ask Matlab for its reasoning using the command why. Unfor-
tunately, Matlab usually takes attitude and gives a random response.

For example,

1 >> A=[1 2; 3 4], B=[4 5; 6 7],


2 A =
3 1 2
4 3 4
5 B =
6 4 5
7 6 7
8 >> A*B
9 ans =
10 16 19
11 36 43
12 >> I=eye(3)
13 I =
14 1 0 0
15 0 1 0
16 0 0 1
17 >> C=[2 4 6; 1 3 5; 0 1 1];
18 >> C_inv = inv(C)
19 C_inv =
20 1.0000 -1.0000 -1.0000
21 0.5000 -1.0000 2.0000
22 -0.5000 1.0000 -1.0000
23 >> C_inv2=C\I
24 C_inv2 =
25 1.0000 -1.0000 -1.0000
26 0.5000 -1.0000 2.0000
27 -0.5000 1.0000 -1.0000
28 >> C_inv*C
29 ans =
30 1 0 0
31 0 1 0
32 0 0 1

1.2.2. Graphics with Matlab


In Matlab, the most popular graphic command is plot, which creates
a 2D line plot of the data in Y versus the corresponding values in X. A
general syntax for the command is
plot(X1,Y1,LineSpec1,...,Xn,Yn,LineSpecn)

For example,
fig_plot.m
1 close all
2

3 %% a curve
4 X1=linspace(0,2*pi,11); % n=11
5 Y1=cos(X1);
6

7 %% another curve
8 X2=linspace(0,2*pi,51);
9 Y2=sin(X2);
10

11 %% plot together
12 plot(X1,Y1,'-or','linewidth',2, X2,Y2,'-b','linewidth',2)
13 legend({'y=cos(x)','y=sin(x)'})
14 axis tight
15 print -dpng 'fig_cos_sin.png'

Figure 1.1: plot of y = cos x and y = sin x.



The above fig_plot.m is a typical M-file for plotting with plot.


• Line 1: It closes all figures currently open.
• Lines 3, 7, and 11 (comments): When the percent sign (%) appears, the
rest of the line will be ignored by Matlab.
• Lines 4 and 8: The command linspace(x1,x2,n) returns a row vector
of n evenly spaced points between x1 and x2.
• Line 12: Its result is a figure shown in Figure 1.1.
• Line 15: it saves the figure into a png format, named fig_cos_sin.png.
The first function (y = cos x) is plotted with 11 points so that its curve
shows the local linearity, while the graph of y = sin x looks smooth with
51 points.

• For contour plots, you may use contour.


• For figuring 3D objects, you may try surf and mesh.
• For function plots, you can use fplot, fsurf, and fmesh.

Remark 1.15. (help and doc).


Matlab is powerful and well-documented as well. To see what a built-in
function does or how you can use it, type

help <name> or doc <name>

The command doc opens the Help browser. If the Help browser is already
open, but not visible, then doc brings it to the foreground and opens a
new tab. Try doc surf, followed by doc contour.

1.2.3. Repetition: iteration loops


Note: Repetition
• In scientific computing, one of the most frequently occurring events is
repetition.
• Each repetition of the process is also called an iteration.
• It is the act of repeating a process, to generate a (possibly un-
bounded) sequence of outcomes, with the aim of approaching a de-
sired goal, target or result. Thus,
(a) Iteration must start with an initialization (starting point), and
(b) Perform a step-by-step marching in which the results of one it-
eration are used as the starting point for the next iteration.

In the context of mathematics or computer science, iteration (along with


the related technique of recursion) is a very basic building block in pro-
gramming. As in other computer languages, Matlab provides a few types of
loops to handle looping requirements including: while loops, for loops, and
nested loops.

While loop
The while loop repeatedly executes statements while a specified condi-
tion is true. The syntax of a while loop in Matlab is as follows.
while <expression>
<statements>
end
An expression is true when the result is nonempty and contains all
nonzero elements, logical or real numeric; otherwise the expression is
false.

Example 1.16. Here is an example for the while loop.


%% while loop
a=10; b=15;
fprintf('while loop execution: a=%d, b=%d\n',a,b);

while a<=b
fprintf(' The value of a=%d\n',a);
a = a+1;
end
When the code above is executed, the result will be:
while loop execution: a=10, b=15
The value of a=10
The value of a=11
The value of a=12
The value of a=13
The value of a=14
The value of a=15

For loop
A for loop is a repetition control structure that allows you to efficiently
write a loop that needs to execute a specific number of times. The syntax
of a for loop in Matlab is as follows:
for index = values
<program statements>
end
Here values can be any list of numbers. For example:
• initval:endval – increments the index variable from initval to
endval by 1, and repeats execution of program statements while in-
dex is not greater than endval.
• initval:step:endval – increments index by the value step on each
iteration, or decrements when step is negative.

Example 1.17. The code in Example 1.16 can be rewritten as a for loop.
%% for loop
a=10; b=15;
fprintf('for loop execution: a=%d, b=%d\n',a,b);

for i=a:b
fprintf(' The value of i=%d\n',i);
end
When the code above is executed, the result will be:
for loop execution: a=10, b=15
The value of i=10
The value of i=11
The value of i=12
The value of i=13
The value of i=14
The value of i=15

Nested loops
Matlab also allows you to use one loop inside another loop. The syntax for a
nested loop in Matlab is as follows:
for n = n0:n1
for m = m0:m1
<statements>;
end
end
The syntax for a nested while loop statement in Matlab is as follows:
while <expression1>
while <expression2>
<statements>;
end
end
For a nested loop, you can combine
• for loops and while loops
• more than two levels of nesting

1.2.4. Loop control statements

Note: Loop control statements change execution from its normal se-
quence.
• When execution leaves a scope, all automatic objects that were cre-
ated in that scope are destroyed.
• The scope defines where variables are valid in Matlab; typically, a
scope within a loop body extends from the beginning to the end of the
conditional code. A loop control statement tells Matlab what to do when
the conditional code fails in the loop.
• Matlab supports both break statement and continue statement.

Break Statement
The break statement terminates execution of for or while loops.
• Statements in the loop that appear after the break statement are
not executed.
• In nested loops, break exits only from the loop in which it occurs.
• Control passes to the statement following the end of that loop.

Example 1.18. Let’s modify the code in Example 1.16 to involve a break
statement.
%% "break" statement with while loop
a=10; b=15; c=12.5;
fprintf('while loop execution: a=%d, b=%d, c=%g\n',a,b,c);

while a<=b
fprintf(' The value of a=%d\n',a);
if a>c, break; end
a = a+1;
end
When the code above is executed, the result is:
while loop execution: a=10, b=15, c=12.5
The value of a=10
The value of a=11
The value of a=12
The value of a=13
When the condition a>c is satisfied, break is invoked, which terminates the while loop.

Continue Statement
continue passes control to the next iteration of a for or while loop.
• It skips any remaining statements in the body of the loop for the
current iteration; the program continues execution from the next
iteration.
• continue applies only to the body of the loop where it is called.
• In nested loops, continue skips remaining statements only in the
body of the loop in which it occurs.

Example 1.19. Consider a modification of the code in Example 1.17.


%% for loop with "continue"
a=10; b=15;
fprintf('for loop execution: a=%d, b=%d\n',a,b);

for i=a:b
if mod(i,2), continue; end % even integers, only
disp([' The value of i=' num2str(i)]);
end
When the code above is executed, the result is:
for loop execution: a=10, b=15
The value of i=10
The value of i=12
The value of i=14

Note: In the above, mod(i,2) returns the remainder when i is divided


by 2 (so that the result is either 0 or 1). In general,
• mod(a,m) returns the remainder after division of a by m, where a is
the dividend and m is the divisor.
• This mod function is often called the modulo operation.

1.2.5. Anonymous function


Matlab-code 1.20. In Matlab, one can define an anonymous function,
which is a function that is not stored in a program file.
anonymous_function.m
1 %% Define an anonymous function
2 f = @(x) x.^3-x-2;
3

4 %% Evaluate the function


5 f1 = f(1)
6 X = 1:6;
7 fX = feval(f,X)
8

9 %% Calculus
10 q = integral(f,1,3)

Output
1 >> anonymous_function
2 f1 =
3 -2
4 fX =
5 -2 4 22 58 118 208
6 q =
7 12

1.2.6. Open source alternatives to Matlab

• Octave is the best-known alternative to Matlab. Octave strives for


exact compatibility, so many of your projects developed for Matlab may
run in Octave with no modification necessary.
• NumPy is the main package for scientific computing with Python. It
can process n-dimensional arrays, complex matrix transforms, linear
algebra, Fourier transforms, and can act as a gateway for C and C++
integration. It is the fundamental data-array structure for the SciPy
Stack, and an ecosystem of Python-based math, science, and engineer-
ing software. Python basics will be considered in Chapter 7, p. 137.

Exercises for Chapter 1

1.1. On Matlab command window, perform the following

• 1:20
• 1:1:20
• 1:2:20
• 1:3:20;
• isprime(12)
• isprime(13)
• for i=3:3:30, fprintf('[i,i^2]=[%d, %d]\n',i,i^2), end
  The above is the same as
      for i=3:3:30
          fprintf('[i,i^2]=[%d, %d]\n',i,i^2)
      end
• for i=1:10,if isprime(i),fprintf('prime=%d\n',i);end,end
Rewrite it with linebreaks, rather than using comma (,).

1.2. Compose a code, written as a function, for the sum of prime numbers not larger than
a positive integer n.
1.3. Modify the function you made in Exercise 1.2 to count the number of prime numbers
and return the count along with the sum. For multiple outputs, the function may
start with
function [sum, number] = <function_name>(inputs)

1.4. Let, for k, n positive integers,

    S_k = 1 + 2 + · · · + k = sum_{i=1}^{k} i    and    T_n = sum_{k=1}^{n} S_k.

Write a code to find and print out S_n and T_n for n = 1 : 10.



1.5. The golden ratio is the number φ = (1 + √5)/2.
     (a) Verify that the golden ratio is the positive solution of x^2 − x − 1 = 0.
     (b) Evaluate the golden ratio to 12-digit decimal accuracy.

1.6. The Fibonacci sequence is a series of numbers, defined by

    f_0 = 0, f_1 = 1;  f_n = f_{n−1} + f_{n−2},  n = 2, 3, · · ·    (1.3)

The Fibonacci sequence has interesting properties; two of them are

(i) The ratio r_n = f_n/f_{n−1} approaches the golden ratio, as n increases.

(ii) Let x_1 and x_2 be two solutions of x^2 − x − 1 = 0:

        x_1 = (1 − √5)/2   and   x_2 = (1 + √5)/2.

    Then
        t_n := ((x_2)^n − (x_1)^n)/√5 = f_n,  for all n ≥ 0.    (1.4)

(a) Compose a code to print out the following in a table format.

n    f_n    r_n    t_n

for n ≤ K = 20.
You may start with
Fibonacci_sequence.m
1 K = 20;
2 F = zeros(1,K);
3 F(1)=1; F(2)=F(1);
4

5 for n=3:K
6 F(n) = F(n-1)+F(n-2);
7 rn = F(n)/F(n-1);
8 fprintf("n =%3d; F = %7d; rn = %.12f\n",n,F(n),rn);
9 end

(b) Find n such that rn has 12-digit decimal accuracy to the golden ratio φ.
Ans: (b) n = 32
Chapter 2. Simple Programming Examples

Contents of Chapter 2
2.1. Area Estimation of A Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2. Visualization of Complex-Valued Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3. Inverse Functions and Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


2.1. Area Estimation of A Region

Problem 2.1. It is common in reality that a region is stored as a sequence of points: For some n > 0,
(x0 , y0 ), (x1 , y1 ), · · · , (xn , yn ), (xn , yn ) = (x0 , y0 ). (2.1)

Figure 2.1: A region and its approximation.

Here the question is:


If a sequence of points (2.1) represents a region, how can we com-
pute its area accurately?

Derivation of Computational Formulas


Example 2.2. Let’s begin with a very simple example.
(a) Find the area of a rectangle [a, b] × [c, d].

Solution. We know the area = (b − a) · (d − c).


It can be rewritten as

b · (d − c) − a · (d − c) = b · (d − c) + a · (c − d)

from which we may guess that


    Area = sum_i x_i^* · Δy_i,    (2.2)

where the sum is carried out over the line segments L_i and x_i^* denotes the
midpoint value of x on L_i.

(b) Find the area of a triangle.


Solution. We know the area = (1/2)(b − a) · (d − c).
Now, let's try to find the area using the formula (2.2):

    Area = sum_i x_i^* · Δy_i.

Let L_1, L_2, L_3 be the bottom side, the vertical side, and the hypotenuse, respectively. Then

    Area = (a+b)/2 · (c − c) + (b+b)/2 · (d − c) + (b+a)/2 · (c − d)
         = 0 + b · (d − c) + (b+a)/2 · (c − d)
         = (b − (b+a)/2) · (d − c) = (1/2)(b − a) · (d − c).
Okay. The formula is correct!
Note: Horizontal line segments make no contribution to the area.

(c) Let’s verify the formula once more.

The area of the M-shaped region is 30.


Let’s collect only nonzero values:

2 · 3 − 2.5 · 2 + 3.5 · 2 − 4 · 3
+6 · 6
−3.5 · 2 + 2.5 · 2
= 6 − 5 + 7 − 12
+36
−7 + 5
= 30

Again, the formula is correct!!



Summary 2.3. The above work can be summarized as follows.


• Let a region R be represented as a sequence of points

(x0 , y0 ), (x1 , y1 ), · · · , (xn , yn ), (xn , yn ) = (x0 , y0 ). (2.3)

• Let L_i be the i-th line segment connecting (x_{i−1}, y_{i−1}) and (x_i, y_i), i =
1, 2, · · · , n. Then the area of R can be computed using the formula

    Area(R) = sum_{i=1}^{n} x_i^* · Δy_i,    (2.4)

where
    x_i^* = (x_{i−1} + x_i)/2,   Δy_i = y_i − y_{i−1}.

Note: The formula (2.4) is a result of Green’s Theorem for the line
integral and numerical approximation.
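A minimal sketch of a function implementing (2.4) is given below. It is our own completion, shown only as a guide (Exercise 2.1 asks you to write your own version); it assumes the input is an (n+1)-by-2 array of points whose last row equals its first.

area_closed_curve.m (a sketch)
function area = area_closed_curve(data)
% function area = area_closed_curve(data)
%   Approximates the area enclosed by the curve represented by data,
%   using formula (2.4): the sum of x_i^* * Delta y_i over the segments.
area = 0;
for i = 2:size(data,1)
    xstar = ( data(i-1,1) + data(i,1) )/2;   % x_i^*, midpoint of x on L_i
    dy    = data(i,2) - data(i-1,2);         % Delta y_i
    area  = area + xstar*dy;
end

For points listed counterclockwise, as in (2.5) below, the result is positive.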

Example 2.4. We will generate a dataset, save it to a file, and read it to


plot and measure the area.

(a) Generate a dataset that represents the circle of radius 1 centered at the
origin. For example, for i = 0, 1, 2, · · · , n,

    (x_i, y_i) = (cos θ_i, sin θ_i),   θ_i = i · (2π/n).    (2.5)
Note that (xn , yn ) = (x0 , y0 ).
(b) Analyze accuracy improvement of the area as n grows. The larger n you
choose, the more accurately the data would represent the region.

Solution.
circle.m
1 n = 10;
2 %%---- Data generation -----------------
3 theta = linspace(0,2*pi,n+1)'; % a column vector
4 data = [cos(theta),sin(theta)];
5

6 %%---- Plot it & Save the image --------


7 figure,
8 plot(data(:,1),data(:,2),'r-','linewidth',2);
9 daspect([1 1 1]); axis tight;
10 xlim([-1 1]), ylim([-1 1]);
11 title(['Circle: n=' int2str(n)])
12 image_name = 'circle.png';
13 saveas(gcf,image_name);
14

15 %%---- Save the data -------------------


16 filename = 'circle-data.txt';
17 csvwrite(filename,data)
18 %writematrix(data,filename,'Delimiter',' ');
19

20 %%======================================
21 %%---- Read the data -------------------
22 %%======================================
23 DATA = load(filename);
24 X = DATA(:,1);
25 Y = DATA(:,2);
26

27 figure,
28 plot(X,Y,'b--','linewidth',2);
29 daspect([1 1 1]); axis tight
30 xlim([-1 1]), ylim([-1 1]);
31 title(['Circle: n=' int2str(n)])
32 yticks(-1:0.5:1)
33 saveas(gcf,'circle-dashed.png');
34

35 %%---- Area computation ----------------


36 area = area_closed_curve(DATA); %See an Exercise problem
37 fprintf('n = %3d; area = %.12f, misfit = %.12f\n', ...
38 size(DATA,1)-1,area, abs(pi-area));

Figure 2.2: An approximation of the unit circle, with n = 10.

Accuracy Improvement
1 n = 10; area = 2.938926261462, misfit = 0.202666392127
2 n = 20; area = 3.090169943749, misfit = 0.051422709840
3 n = 40; area = 3.128689300805, misfit = 0.012903352785
4 n = 80; area = 3.138363829114, misfit = 0.003228824476
5 n = 160; area = 3.140785260725, misfit = 0.000807392864

Note: The misfit becomes about a quarter of its previous value each time the number of points is doubled.

Remark 2.5. From Example 2.4, you learn how to


• Generate datasets
• Save data into a file
• Read a data file
• Set figure environments
• Call functions

2.2. Visualization of Complex-Valued Solutions

Problem 2.6. Seeking the solutions of

f (x) = x2 − x + 1 = 0, (2.6)

we can easily find that the equation has no real solutions. However,
by using the quadratic formula, the complex-valued solutions are

    x = (1 ± √3 i)/2.
Here the question is:
What does the complex-valued solution mean? Can we visualize
the solutions?

Remark 2.7. Complex Number System:


The most complete number system is the system of complex numbers:

C = {x + yi | x, y ∈ R}

where i = √−1, called the imaginary unit.
• Seeking a real-valued solution of f (x) = 0 is the same as finding
a solution of f (z) = 0, z = x + yi, restricted to the x-axis (y = 0).
• If
f (z) = A(x, y) + B(x, y) i, (2.7)
then the complex-valued solutions are the points x + yi such that
A(x, y) = B(x, y) = 0.

Example 2.8. For f (x) = x2 − x + 1, express f (x + yi) in the form of (2.7).


Solution.


Ans: f (z) = (x2 − x − y 2 + 1) + (2x − 1)y i

Example 2.9. Implement a code to visualize complex-valued solutions of


f (x) = x2 − x + 1 = 0.
Solution. From Example 2.8,
f (z) = A(x, y) + B(x, y) i, A(x, y) = x2 − x − y 2 + 1, B(x, y) = (2x − 1)y.
visualize_complex_solution.m
1 close all
2 if exist('OCTAVE_VERSION','builtin'), pkg load symbolic; end
3

4 syms x y real
5

6 %% z^2 -z +1 = 0
7 A = @(x,y) x.^2-x-y.^2+1;
8 B = @(x,y) (2*x-1).*y;
9 T = 'z^2-z+1=0';
10

11 figure, % Matlab 'fmesh' is not yet in Octave


12 np=41; X=linspace(-5,5,np); Y=linspace(-5,5,np);
13 mesh(X,Y,A(X,Y'),'EdgeColor','r'), hold on
14 mesh(X,Y,B(X,Y'),'EdgeColor','b'),
15 mesh(X,Y,zeros(np,np),'EdgeColor','k'),
16 legend("A","B","0"),
17 xlabel('x'), ylabel('y'), title(['A and B for ' T])
18 hold off
19 print -dpng 'complex-solutions-A-B-fmesh.png'
20

21 %%--- Solve A=0 and B=0 --------------


22 [xs,ys] = solve(A(x,y)==0,B(x,y)==0,x,y);
23

24 figure,
25 np=101; X=linspace(-5,5,np); Y=linspace(-5,5,np);
26 contour(X,Y,A(X,Y'), [0 0],'r','linewidth',2), hold on
27 contour(X,Y,B(X,Y'), [0 0],'b--','linewidth',2)
28 plot(double(xs),double(ys),'r.','MarkerSize',30) % the solutions
29 grid on
30 %ax=gca; ax.GridAlpha=0.5; ax.FontSize=13;
31 legend("A=0","B=0")
32 xlabel('x'), ylabel('yi'), title(['Compex solutions of ' T])
33 hold off
34 print -dpng 'complex-solutions-A-B=0.png'

Figure 2.3: The two solutions are 1/2 − 3^(1/2)/2 i and 1/2 + 3^(1/2)/2 i.

Remark 2.10. You can easily find the real part and the imaginary
part of polynomials of z = x + iy as follows.
Real and Imaginary Parts
1 syms x y real
2 z = x + 1i*y;
3

4 g = z^2 -z +1;
5 simplify(real(g))
6 simplify(imag(g))

Here “1i” (number 1 and letter i), appearing in Line 2, means the imaginary
unit i = √−1.
Output
1 ans = x^2 - x - y^2 + 1
2 ans = y*(2*x - 1)

2.3. Inverse Functions and Logarithms

In-Reality 2.11. A function f is a rule that assigns an output y


to each input x: f (x) = y. Thus a function is a set of actions that de-
termines the system. However, in reality, it is often the case that the
equation must be solved for either the input or the function.
1. Given (f, x), getting y is the simplest and most common task.
2. Given (f, y), solving for x is to find the inverse function of f .
3. Given (x, y), solving for f is not a simple task in practice.
• Using many data points {(xi , yi )}, finding an approximation of f
is the core subject of polynomial interpolation, regression
analysis, and machine learning.

Roughly Speaking, What is the Inverse of a Function?

Key Idea 2.12. Let f : X → Y be a function. For simplicity, consider

y = f (x) = 2x + 1. (2.8)

• Then, f is a rule that performs two actions: ×2 and followed by +1.


• The reverse of f must be: −1 followed by ÷2.
– Let y ∈ Y . Then the reverse of f can be written as
x = (y − 1)/2 =: g(y) (2.9)

The function g is the inverse function of f .


– However, it is conventional to choose x for the independent vari-
able. Thus it can be formulated as
y = g(x) = (x − 1)/2. (2.10)

• Let’s summarize the above:


(a) Solve y = f (x) for x: x = (y − 1)/2 =: g(y).
(b) Exchange x and y: y = g(x) = (x − 1)/2.
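A quick numerical check of this idea, using Matlab anonymous functions (Section 1.2.5); the snippet is our own illustration:

f = @(x) 2*x + 1;      % y = f(x) = 2x + 1
g = @(x) (x - 1)/2;    % the inverse: y = g(x) = (x - 1)/2
g(f(7))                % returns 7: g undoes f
f(g(7))                % returns 7: f undoes g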

2.3.1. Inverse functions


Note: The first step for finding the inverse function of f is to solve y =
f (x) for x, to get x = g(y). Here what is required is that g be a function.

Definition 2.13. A function f is called a one-to-one function if it


never takes on the same value twice; that is,

f(x_1) ≠ f(x_2) whenever x_1 ≠ x_2.    (2.11)

Claim 2.14. Horizontal Line Test.


A function is one-to-one if and only if no horizontal line intersects its
graph more than once.

Example 2.15. Check if the function is one-to-one.

1. f (x) = x2 3. h(x) = x3
2. g(x) = x2 , x ≥ 0

Solution.

Definition 2.16. Let f be a one-to-one function with domain X and


range Y . Then its inverse function f −1 has domain Y and range X and
is defined by
f −1 (y) = x ⇐⇒ f (x) = y, (2.12)
for any y ∈ Y .

Remark 2.17.
The definition says that if f maps x into
y, then f −1 maps y back into x. From
(2.12), we can obtain the cancellation
equations

f −1 (f (x)) = x for all x ∈ X


(2.13)
f (f −1 (y)) = y for all y ∈ Y

Example 2.18. For example, if f (x) = x3 , then f −1 (x) = x1/3 and so that
the cancellation equations read

f −1 (f (x)) = f −1 (x3 ) = (x3 )1/3 = x


f (f −1 (y)) = f (y 1/3 ) = (y 1/3 )3 = y

Example 2.19. Assume f is a one-to-one function.

(a) If f (1) = 5, what is f −1 (5)? (b) If f −1 (8) = −10, what is f (−10)?

Solution.

Caution 2.20. Do not mistake the −1 in f −1 for an exponent.


f^{−1}(x) does not mean 1/f(x).    (2.14)
If f were not one-to-one, then its inverse would not be uniquely defined
and cannot be a function. ⇒ An inverse function does not exist.

Strategy 2.21. How to Find the Inverse Function of a One-to-One


Function f : Write y = f (x).
Step 1: Solve this equation for x in terms of y (if possible).
Step 2: Interchange x and y; the resulting equation is y = f −1 (x).

Example 2.22. Find the inverse of the function h(x) = (6 − 3x)/(5x + 7).
Solution.

Ans: syms x; finverse((6-3*x)/(5*x+7),x) ⇒ -(7*x - 6)/(5*x + 3)


Example 2.23. Find the inverse of the function f (x) = x3 + 2, expressed
as a function of x.

Solution. Write y = x^3 + 2.

Step 1: Solve it for x:

    x^3 = y − 2  ⇒  x = (y − 2)^{1/3}.

Step 2: Exchange x and y:

    y = (x − 2)^{1/3}.

Therefore the inverse function is

    f^{−1}(x) = (x − 2)^{1/3}.

Observation 2.24.
• The graph of f −1 is obtained by reflecting the graph of f about the line
y = x.
• (Domain of f −1 ) = (Range of f )

2.3.2. Exponential functions


Definition 2.25. A function of the form

f(x) = a^x,  where a > 0 and a ≠ 1,    (2.15)

is called an exponential function (with base a).


• All exponential functions have domain (−∞, ∞) and range (0, ∞), so
an exponential function never assumes the value 0.
• All exponential functions are either increasing (a > 1) or decreasing
(0 < a < 1) over the whole domain.

Figure 2.4: Exponential functions.

Example 2.26. Sketch the graph of the function f (x) = 3 − 2x and deter-
mine its domain and range.
Solution.

Example 2.27. Table 2.1 shows data for the population of the world in
the 20th century. Figure 2.5 shows the corresponding scatter plot.
• The pattern of the data points in Figure 2.5 suggests an exponential
growth.
• Use an exponential regression algorithm to find a model of the
form
P (t) = a · bt , (2.16)
where t = 0 corresponds to 1900.

Table 2.1
    t (years since 1900)    Population P (millions)
    0                       1650
    10                      1750
    20                      1860
    30                      2070
    40                      2300
    50                      2560
    60                      3040
    70                      3710
    80                      4450
    90                      5280
    100                     6080
    110                     6870

Figure 2.5: Scatter plot for world population growth.

Solution. We will see details of the exponential regression later.


population.m
1 Data =[0 1650; 10 1750; 20 1860; 30 2070;
2 40 2300; 50 2560; 60 3040; 70 3710;
3 80 4450; 90 5280; 100 6080; 110 6870];
4 m = size(Data,1);
5

6 % exponential model, through linearization


7 A = ones(m,2);
8 A(:,2) = Data(:,1);
9 r = log(Data(:,2));
10 lm = (A'*A)\(A'*r);
11 a = exp(lm(1)), b = exp(lm(2)),

12

13 plot(Data(:,1),Data(:,2),'k.','MarkerSize',20)
14 xlabel('Years since 1900');
15 ylabel('Millions'); hold on
16 print -dpng 'population-data.png'
17 t = Data(:,1);
18 plot(t,a*b.^t,'r-','LineWidth',2)
19 print -dpng 'population-regression.png'
20 hold off

The program results in

    a = 1.4365 × 10^3,   b = 1.0140.

Thus the exponential model reads

    P(t) = (1.4365 × 10^9) · (1.0140)^t.    (2.17)

Figure 2.6 shows the graph of this exponential function together with the original data points. We see that the exponential curve fits the data reasonably well.

Figure 2.6: Exponential model for world population growth.

Integer and Rational Exponents

• When x = n is a positive integer,

    a^n = a · a · ... · a   (n times).

• When x = −n for some positive integer n,

    a^{−n} = 1/a^n = (1/a)^n.

• When x = p/q is a rational number,

    a^{p/q} = (a^p)^{1/q} = (a^{1/q})^p.

Laws of Exponents
If a > 0 and b > 0, the following rules hold for all real numbers x and y.

1. a^x · a^y = a^{x+y}
2. a^x / a^y = a^{x−y}
3. (a^x)^y = (a^y)^x = a^{xy}
4. a^x · b^x = (ab)^x
5. a^x / b^x = (a/b)^x

The Number e
Of all possible bases for an exponential function, there is one that is
most convenient for the purposes of calculus. The choice of a base a is
influenced by the way the graph of y = ax crosses the y-axis.
• Some of the formulas of calculus will be greatly simplified, if we
choose the base a so that the slope of the tangent line to y = ax
at x = 0 is exactly 1.
• In fact, there is such a number and it is denoted by the letter e.
(This notation was chosen by the Swiss mathematician Leonhard
Euler in 1727, probably standing for exponential.)
• It turns out that the number e lies between 2 and 3:

e ≈ 2.718282 (2.18)

Figure 2.7: The number e.



Remark 2.28. Properties of the Natural Exponential Function


The exponential function f (x) = ex is an increasing continuous function
with domain R and range (0, ∞). Thus ex > 0 for all x and the x-axis is a
horizontal asymptote of f (x) = ex .
The properties hold for f (x) = ax , where a > 1, as well.

Self-study 2.29. Find the domain of the following functions.

(a) f(x) = (1 + x)/e^{cos x}        (b) f(x) = (1 − e^{x^2})/(1 − e^{1−x^2})

Solution.

Example 2.30. Graph the function y = (1/2)e^{−x} + 1 and state the domain and
range.
Solution.

2.3.3. Logarithmic functions

Recall: If a > 0 and a ≠ 1, the exponential function f(x) = a^x is either
increasing or decreasing and so it is one-to-one by the Horizontal Line
Test. It therefore has an inverse function.

Definition 2.31. The logarithmic function with base a, written


y = loga x, is the inverse of y = ax (a > 0, a 6= 1). That is,

loga x = y ⇐⇒ ay = x. (2.19)

Example 2.32. Find the inverse of y = 2x .

Solution.

1. Solve y = 2x for x:
x = log2 y

2. Exchange x and y:
y = log2 x

Thus the graph of y = log_2 x must be the reflection of the graph of y = 2^x
about the line y = x.

Figure 2.8: Graphs of y = 2^x and y = log_2 x.

Note:
• Equation (2.19) represents the action of “solving for x”
• The domain of y = loga x must be the range of y = ax , which is (0, ∞).

The Natural Logarithm and the Common Logarithm


Of all possible bases a for logarithms, we will see later that the most conve-
nient choice of a base is the number e.
Definition 2.33.

• The logarithm with base e is called the natural logarithm and has
a special notation:
loge x = ln x (2.20)

• The logarithm with base 10 is called the common logarithm and


has a special notation:

log10 x = log x (2.21)

Remark 2.34.

• From your calculator, you can see buttons of LN and LOG , which
represent ln = loge and log = log10 , respectively.
• When you implement a code on computers, the functions ln and
log can be called by “log” and “log10”, respectively.

Properties of Logarithms

• Algebraic Properties: for a > 0, a ≠ 1,

      Product Rule:     log_a(xy)  = log_a x + log_a y
      Quotient Rule:    log_a(x/y) = log_a x − log_a y
      Power Rule:       log_a x^α  = α log_a x                      (2.22)
      Reciprocal Rule:  log_a(1/x) = − log_a x

• Inverse Properties

      a^{log_a x} = x,  x > 0;      log_a a^x = x,  x ∈ R
                                                                    (2.23)
      e^{ln x} = x,     x > 0;      ln e^x = x,     x ∈ R

Example 2.35. Use the laws of logs to expand ln( x^2 √(x^2 + 3) / (3x + 1) ).
Solution.

Example 2.36. Simplify the following.

(a) log3 75 − 2 log3 5 (b) 2 log5 100 − 4 log5 50

Solution.

Example 2.37. Solve for x.

(a) e5−3x = 3.
(b) log3 x + log3 (x − 2) = 1
(c) ln(ln x) = 0

Solution.

Ans: (a) x = (5 − ln 3)/3. (b) x = 3. (Caution: x = −1 cannot be a solution.)
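The answers can be double-checked with the Symbolic Math Toolbox; the sketch below is illustrative (not one of the lecture files), and in part (b) solve may also report the extraneous root x = −1, which must be discarded as noted above.

    syms x
    solve(exp(5-3*x) == 3, x)                        % (a): (5 - log(3))/3
    solve(log(x)/log(3) + log(x-2)/log(3) == 1, x)   % (b): 3
    solve(log(log(x)) == 0, x)                       % (c): exp(1)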

Claim 2.38.
(a) Every exponential function is a power of the natural exponential function:

        a^x = e^{x ln a}.      (2.24)

(b) Every logarithmic function is a constant multiple of the natural logarithm:

        log_a x = ln x / ln a,   (a > 0, a ≠ 1),      (2.25)

    which is called the Change-of-Base Formula.

Proof. (a). a^x = e^{ln(a^x)} = e^{x ln a}.
(b). ln x = ln(a^{log_a x}) = (log_a x)(ln a), from which one can get (2.25).

Remark 2.39. Based on Claim 2.38, all exponential and logarithmic
functions can be evaluated with the natural exponential and natural
logarithmic functions, which are named "exp()" and "log()" in Matlab.
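For instance, a^x and log_a(x) for an arbitrary base can be computed from exp and log alone; a minimal sketch, with the illustrative values a = 2 and x = 10:

    a = 2;  x = 10;
    ax    = exp(x*log(a));      % a^x = e^(x ln a), cf. (2.24)
    logax = log(x)/log(a);      % log_a(x) = ln(x)/ln(a), cf. (2.25)
    fprintf('a^x = %g,  log_a(x) = %g\n', ax, logax)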

Exercises for Chapter 2

2.1. Download a dataset saved in heart-data.txt:


https://skim.math.msstate.edu/LectureNotes/heart-data.txt

(a) Draw a figure for it.


(b) Use the formula (2.4) to find the area.

Hint : You may use the following. You should finish the function area_closed_curve.
Note that the index in Matlab arrays begins with 1, not 0.
heart.m
1 DATA = load('heart-data.txt');
2

3 X = DATA(:,1); Y = DATA(:,2);
4 figure, plot(X,Y,'r-','linewidth',2);
5

6 [m,n] = size(DATA);
7 area = area_closed_curve(DATA);
8

9 fprintf('# of points = %d; area = %g\n',m,area);

area_closed_curve.m
1 function area = area_closed_curve(data)
2 % compute the area of a region of closed curve
3

4 [m,n] = size(data);
5 area = 0;
6

7 for i=2:m
8 %FILL HERE APPROPRIATELY
9 end

Ans: (b) 9.41652.


2.2. Function f (x) = x3 − 2x2 + x − 2 has two complex-values zeros and a real zero. Imple-
ment a code to visualize all the solutions in the complex coordinates.
Hint : Find the real and imaginary parts of f (z) as in Remark 2.10.
2.3. The population of Starkville, Mississippi, was 2,689 in the year 1900 and 25,495 in
2020. Assume that the population in Starkville grows exponentially with the model

Pn = P0 · (1 + r)n , (2.26)

where n is the elapsed year and r denotes the growth rate per year.

(a) Find the growth rate r.


(b) Estimate the population in 1950 and 2000.

(c) Approximately when is the population going to reach 50,000?

Hint : Applying the natural log to (2.26) reads log(Pn /P0 ) = n log(1 + r). Dividing it
by n and applying the natural exponential function gives 1 + r = exp(log(Pn /P0 )/n),
where Pn = 25495, P0 = 2689, and n = 120.
Ans: (a) r = 0.018921(= 1.8921%). (c) 2056.
Chapter 3

Programming with Calculus

Contents of Chapter 3
3.1. Derivative: The Slope of the Tangent Line . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2. Basis Functions and Power Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3. Newton’s Method for Zero-Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4. Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5. Multi-Variable Functions and the Gradient Vector . . . . . . . . . . . . . . . . . . . . . 81
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86


3.1. Derivative: The Slope of the Tangent Line

Problem 3.1. A function y = f (x) can be graphed as a curve, while


z = f (x, y) plots a surface. In many applications, the tangent line or
tangent plane plays a crucial role for the computation of approximate
solutions. Here a questions is:
• What is the tangent line?
• How can we find it?

Note: In the late 16th century, Galileo discovered that a solid object
dropped from rest (initially not moving) near the surface of the earth
and allowed to fall freely will fall a distance proportional to the square
of the time it has been falling.
• This type of motion is called free fall.
• It assumes negligible air resistance to slow the object down, and that
gravity is the only force acting on the falling object.
• If y denotes the distance fallen in feet after t seconds, then the
Galileo’s law of free-fall is

y = 16t2 ft. (3.1)

The Galileo’s law of free-fall states that, in the absence of air resistance, all
bodies fall with the same acceleration, independent of their mass.
Average and Instantaneous Speed
Average Speed. When f(t) measures the distance traveled at time t,

    Average speed over [t_0, t_0+h] = (distance traveled)/(elapsed time)
                                    = [f(t_0+h) − f(t_0)] / [(t_0+h) − t_0].      (3.2)

Instantaneous Speed. For h very small,

    Instantaneous speed at t_0 ≈ [f(t_0+h) − f(t_0)] / h.      (3.3)

Example 3.2. Let f(t) = 16t^2 and t_0 = 1.

(a) Find the average speed, i.e. the difference quotient

        [f(t_0+h) − f(t_0)] / h,

    for various h, positive and negative.
(b) Estimate the instantaneous speed at t = t_0.

Solution.
free_fall.m
1 syms f(t) Q(h) %also, views t and h as symbols
2

3 f(t) = 16*t.^2; t0=1;


4 Int = [t0-1.5,t0+1.1];
5 fplot(f(t),Int, 'k-','LineWidth',3)
6 hold on
7

8 %%---- Difference Quotient (DQ) ----------


9 Q(h) = (f(t0+h)-f(t0))/h;
10 S(t,h) = Q(h)*(t-t0)+f(t0); % Secant line
11

12 %%---- Secant Lines, with Various h ------


13 for h0 = [-1 -0.5 0.5 1]
14 fplot(S(t,h0),Int, 'b--','LineWidth',2)
15 plot([t0+h0],[f(t0+h0)],'b.','markersize',30)
16 end
17

18 %%---- Limit of the DQ -------------------


19 tan_slope = limit(Q(h),h,0);
20 T(t) = tan_slope*(t-t0)+f(t0);
21 fplot(T(t),Int, 'r-','LineWidth',3)
22 plot([t0],[f(t0)],'r.','markersize',30)
23

24 axis tight, grid on; hold off


25 ax=gca; ax.FontSize=15; ax.GridAlpha=0.5;
26 print -dpng 'free-fall-tangent.png'

27

28 %%---- Measure Q(h) wih h=+-10^-i --------


29 for i = 1:5
30 h=-10^(-i); fprintf(" h= %.5f; Q(h) = %.8f\n",h,Q(h))
31 end
32 for i = 1:5
33 h=10^(-i); fprintf(" h= %.5f; Q(h) = %.8f\n",h,Q(h))
34 end

Difference Quotient at t0 = 1
1 h= -0.10000; Q(h) = 30.40000000
2 h= -0.01000; Q(h) = 31.84000000
3 h= -0.00100; Q(h) = 31.98400000
4 h= -0.00010; Q(h) = 31.99840000
5 h= -0.00001; Q(h) = 31.99984000
6 h= 0.10000; Q(h) = 33.60000000
7 h= 0.01000; Q(h) = 32.16000000
8 h= 0.00100; Q(h) = 32.01600000
9 h= 0.00010; Q(h) = 32.00160000
10 h= 0.00001; Q(h) = 32.00016000

Let's confirm this algebraically.
• When f(t) = 16t^2 and t_0 = 1, the difference quotient reads

      Δy/Δt (t_0 = 1) = [16(1+h)^2 − 16(1)^2]/h = [16(1 + 2h + h^2) − 16(1)^2]/h
                      = (32h + 16h^2)/h = 32 + 16h.      (3.4)

• As h gets closer and closer to 0, the average speed has the limiting
  value 32 ft/sec when t_0 = 1 sec.
• Thus, the slope of the tangent line is 32.

Example 3.3. Find an equation of the tangent line to the graph of y = x2


at x0 = 2.

Figure 3.1: Graph of y = x2 and the tangent line at x0 = 2.

Solution. Let’s first try to find the slope, as the limit of the difference
quotient.

Definition 3.4.
The slope of the curve y = f(x) at the point P(x_0, f(x_0)) is the number

    lim_{h→0} [f(x_0+h) − f(x_0)] / h    (provided the limit exists).      (3.5)

The tangent line to the curve at P is the line through P with this slope.

Example 3.5. Can you find the tangent line to y = |x^2 − 1| at x_0 = 1?
Solution. As one can see from Figure 3.2, the left-hand and the right-hand
slopes of the difference quotient are not the same. Or, you may say the
left-hand and the right-hand secant lines converge differently. Thus no
tangent line can be defined.

Figure 3.2: Graph of y = |x^2 − 1| and secant lines at x_0 = 1.

secant_lines_abs_x2_minus_1.m
1 syms f(x) Q(h) %also, views t and h as symbols
2

3 f(x)=abs(x.^2-1); x0=1;
4 figure, fplot(f(x),[x0-3,x0+1.5], 'k-','LineWidth',3)
5 hold on
6

7 Q(h) = (f(x0+h)-f(x0))/h;
8 S(x,h) = Q(h)*(x-x0)+f(x0); % Secant line
9 %%---- Secant Lines, with Various h ------
10 for h0 = [-0.5 -0.25 -0.1 0.1 0.25 0.5]
11 fplot(S(x,h0),[x0-1,x0+1], 'b--','LineWidth',2)
12 plot([x0+h0],[f(x0+h0)],'b.','markersize',25)
13 end
14 plot([x0],[f(x0)],'r.','markersize',35)
15 daspect([1 2 1])
16 axis tight, grid on
17 ax=gca; ax.FontSize=15; ax.GridAlpha=0.5;
18 hold off
19 print -dpng 'secant-y=abs-x2-1.png'
Example 3.6. Find and simplify the difference quotient [f(x+h) − f(x)]/h
for the functions, and then apply lim_{h→0}.

    (a) f(x) = ax + b      (b) f(x) = x^2      (c) f(x) = x^3

Solution.

Definition 3.7. The derivative of a function f(x), denoted f′(x) or df(x)/dx, is

    f′(x) = df(x)/dx = lim_{h→0} [f(x+h) − f(x)] / h,      (3.6)

provided that the limit exists.

Formula 3.8. From the last example,

    f(x) = x     ⇒  f′(x) = 1
    f(x) = x^2   ⇒  f′(x) = 2x
    f(x) = x^3   ⇒  f′(x) = 3x^2
      ⋮                 ⋮
    f(x) = x^n   ⇒  f′(x) = n x^{n−1}

Example 3.9. Differentiate the following powers of x.

(a) x6 (b) x3/2 (c) x1/2

Solution.

Formula 3.10. Differentiation Rules:

• Let f(x) = a u(x) + b v(x), for some constants a and b. Then

      [f(x+h) − f(x)]/h = {[a u(x+h) + b v(x+h)] − [a u(x) + b v(x)]}/h
                        = a [u(x+h) − u(x)]/h + b [v(x+h) − v(x)]/h      (3.7)
                        → a u′(x) + b v′(x)

• Let f(x) = u(x) v(x). Then

      [f(x+h) − f(x)]/h = [u(x+h)v(x+h) − u(x)v(x)]/h
                        = [u(x+h)v(x+h) − u(x)v(x+h) + u(x)v(x+h) − u(x)v(x)]/h
                        = ([u(x+h) − u(x)]/h) v(x+h) + u(x) ([v(x+h) − v(x)]/h)
                        → u′(x)v(x) + u(x)v′(x)      (3.8)

Example 3.11. Use the product rule (3.8) to find the derivative of the
function
f (x) = x6 = x2 · x4

Solution.

Example 3.12. Does the curve y = x4 −2x2 +2 have any horizontal tangent
line? Use the information you just found, to sketch the graph.
Solution.

Rules of Derivative
Example 3.13. Consider a computer program.
derivative_rules.m
1 syms n a b real
2 syms u(x) v(x)
3 syms f(x) Q(h)
4

5 %%-- Define f(x) ----------------------


6 f(x) = x^n;
7

8 %%-- Define Q(h) and take limit -------


9 Q(h) = (f(x+h)-f(x))/h;
10 fp = limit(Q(h),h,0); simplify(fp)

Use the program to verify various rules of differentiation.


Solution.
Table 3.1: Rules of the derivative

    f(x)               Results                                  Mathematical formula
    x^n                n*x^(n-1)                                (x^n)′ = n x^{n−1}              (power rule)
    a*u(x) + b*v(x)    a*D(u)(x) + b*D(v)(x)                    (au + bv)′ = a u′ + b v′        (linearity rule)
    u(x)*v(x)          D(u)(x)*v(x) + u(x)*D(v)(x)              (uv)′ = u′v + uv′               (product rule)
    u(x)/v(x)          (D(u)(x)*v(x) - D(v)(x)*u(x))/v(x)^2     (u/v)′ = (u′v − uv′)/v^2        (quotient rule)
    u(v(x))            D(v)(x)*D(u)(v(x))                       [u(v(x))]′ = u′(v(x)) · v′(x)   (chain rule)

Example 3.14. Find the derivative of g(x) = (2x + 1)^{10}.

Solution. Let u(x) = x^{10} and v(x) = 2x + 1. Then g(x) = u(v(x)).

• u′(x) = 10x^9 and v′(x) = 2.
• Thus
      g′(x) = u′(v(x)) · v′(x) = 10 (v(x))^9 · 2 = 20(2x + 1)^9.

Remark 3.15. Rules of Derivative:

• The power rule holds for real numbers n.
• The quotient rule can be explained using the product rule:

      (u/v)′ = (u · v^{−1})′ = u′ · v^{−1} + u · (v^{−1})′
             = u′ · v^{−1} + u · (−v^{−2} v′) = u′/v − u v′/v^2      (3.9)
             = (u′v − uv′)/v^2

• The chain rule can be verified algebraically:

      d u(v(x))/dx = lim_{h→0} [u(v(x+h)) − u(v(x))] / h
                   = lim_{h→0} [u(v(x+h)) − u(v(x))]/[v(x+h) − v(x)] · [v(x+h) − v(x)]/h      (3.10)
                   = u′(v(x)) · v′(x).

  Here u′(v(x)) means the rate of change of u with respect to v(x).
• We may rewrite (3.10):

      d u(v(x))/dx = lim_{Δx→0} Δu/Δx = lim_{Δx→0} (Δu/Δv)(Δv/Δx) = u′(v(x)) · v′(x).      (3.11)

Self-study 3.16. Find the derivative of the functions.

    (a) f(x) = (3x − 1)^7 x^5        (b) g(x) = (3x − 1)^7 / x^5

Solution.

Ans: (b) (3x − 1)^6 (6x + 5) / x^6

3.2. Basis Functions and Power Series


In-Reality 3.17. Two Major Mathematical Techniques.
In the history of the engineering research and development (R&D),
there have been two major mathematical techniques: the change of
variables and dealing with basis functions.
• In a nut shell, the change of variables is a basic technique used
to simplify problems.
• A function can be either represented or approximated by a lin-
ear combination of basis functions.

Example 3.18. Find solutions of the system of equations

      xy + 2x + 2y = 20
      x^2 y + x y^2 = 48,      (3.12)

where x and y are positive real numbers with x < y.

Solution. Let s = xy and t = x + y (a change of variables).

Ans: (x, y) = (2, 4)


Note: Most subjects in Calculus, particularly Vector Calculus, are deeply
related to the change of variables, to handle differentiation and integra-
tion over general 2D and 3D domains more effectively.

3.2.1. Power series


The most common basis is the monomial basis given by

    {x^n | n = 0, 1, 2, ···}.      (3.13)

This basis is used in power series and Taylor series.

Definition 3.19. A power series about x = 0 is a series of the form

    Σ_{n=0}^∞ c_n x^n = c_0 + c_1 x + c_2 x^2 + ··· + c_n x^n + ···.      (3.14)

A power series about x = a is a series of the form

    Σ_{n=0}^∞ c_n (x − a)^n = c_0 + c_1(x − a) + c_2(x − a)^2 + ··· + c_n(x − a)^n + ···,      (3.15)

in which the center a and the coefficients c_0, c_1, c_2, ···, c_n, ··· are constants.

Example 3.20. Taking all the coefficients to be 1 in (3.14) gives the geometric series

    Σ_{n=0}^∞ x^n = 1 + x + x^2 + ··· + x^n + ···,

which converges to 1/(1 − x) for |x| < 1. That is,

    1/(1 − x) = 1 + x + x^2 + ··· + x^n + ···,   |x| < 1.      (3.16)

Remark 3.21. It follows from Example 3.20 that


1. A function can be approximated by a power series.
2. A power series may converge only on a certain interval, of radius R.

Theorem 3.22. (The Ratio Test)
Let Σ a_n be any series and suppose that

    lim_{n→∞} |a_{n+1}/a_n| = ρ.      (3.17)

(a) If ρ < 1, then the series converges absolutely (Σ |a_n| converges).
(b) If ρ > 1, then the series diverges.
(c) If ρ = 1, then the test is inconclusive.

Example 3.23. For what values of x do the following power series converge?

    (a) Σ_{n=1}^∞ (−1)^{n−1} x^n/n = x − x^2/2 + x^3/3 − ···
    (b) Σ_{n=0}^∞ x^n/n! = 1 + x + x^2/2! + x^3/3! + ···

Solution.

Ans: (a) x ∈ (−1, 1].


Theorem 3.24. Term-by-Term Differentiation.
If Σ_{n=0}^∞ c_n (x − a)^n has radius of convergence R > 0, it defines a function

    f(x) = Σ_{n=0}^∞ c_n (x − a)^n    on |x − a| < R.      (3.18)

This function f has derivatives of all orders inside the interval, and we
obtain the derivatives by differentiating the original series term by term:

    f′(x)  = Σ_{n=1}^∞ n c_n (x − a)^{n−1},
    f″(x)  = Σ_{n=2}^∞ n(n−1) c_n (x − a)^{n−2},      (3.19)

and so on. Each of these derived series converges at every point of the
interval a − R < x < a + R.

This theorem similarly holds for Term-by-Term Integration.

3.2.2. Taylor series expansion

Remark 3.25. We have seen how a converging power series defines or


generates a function. In order to make infinite series more useful:
• Here we will try to express a given function as a specific form of
infinite series called the Taylor series.
• In many cases, the Taylor series provides useful polynomial ap-
proximation of the original function.
• Because approximation by polynomials is extremely useful to both
mathematicians and scientists, Taylor series are at the core of the
theory of infinite series.
Series Representations

Key Idea 3.26. Taylor Series.
• Let's assume that a given function f is the sum of a power series about x = a:

      f(x) = Σ_{n=0}^∞ c_n (x − a)^n
           = c_0 + c_1(x − a) + c_2(x − a)^2 + ··· + c_n(x − a)^n + ···,      (3.20)

  with a positive radius of convergence R > 0.
• Then,

      f′(x)  = c_1 + 2 c_2 (x − a) + 3 c_3 (x − a)^2 + ··· + n c_n (x − a)^{n−1} + ···
      f″(x)  = 2 c_2 + 3·2 c_3 (x − a) + ··· + n(n−1) c_n (x − a)^{n−2} + ···      (3.21)
      f‴(x)  = 3·2 c_3 + 4·3·2 c_4 (x − a) + ··· + n(n−1)(n−2) c_n (x − a)^{n−3} + ···

  with the nth derivative being

      f^{(n)}(x) = n! c_n + (n+1)! c_{n+1} (x − a) + ···.      (3.22)

• Thus, when x = a,

      f^{(n)}(a) = n! c_n   ⇒   c_n = f^{(n)}(a)/n!.      (3.23)

Definition 3.27. Let f be a function with derivatives of all orders
throughout some interval containing a as an interior point. Then the
Taylor series generated by f at x = a is

    Σ_{n=0}^∞ f^{(n)}(a)/n! (x − a)^n = f(a) + f′(a)(x − a) + f″(a)/2! (x − a)^2 + ···      (3.24)

The Maclaurin series of f is the Taylor series generated by f at x = 0:

    Σ_{n=0}^∞ f^{(n)}(0)/n! x^n = f(0) + f′(0) x + f″(0)/2! x^2 + ···      (3.25)
Taylor Polynomials
Definition 3.28. Let f be a function with derivatives of order k =
1, 2, ···, N in some interval containing a as an interior point. Then for
any integer n from 0 through N, the Taylor polynomial of order n
generated by f at x = a is the polynomial

    P_n(x) = f(a) + f′(a)(x − a) + f″(a)/2! (x − a)^2 + ··· + f^{(n)}(a)/n! (x − a)^n.      (3.26)

Example 3.29. Find the Taylor series and Taylor polynomials generated
by f(x) = cos x at x = 0.
Solution. The cosine and its derivatives are

    f(x)   = cos x,         f′(x)  = −sin x,
    f″(x)  = −cos x,        f‴(x)  = sin x,
      ⋮                        ⋮
    f^{(2n)}(x) = (−1)^n cos x,    f^{(2n+1)}(x) = (−1)^{n+1} sin x.

At x = 0, the cosines are 1 and the sines are 0, so

    f^{(2n)}(0) = (−1)^n,   f^{(2n+1)}(0) = 0.      (3.27)

The Taylor series generated by cos x at x = 0 is

    1 + 0·x − x^2/2! + 0·x^3/3! + x^4/4! + ··· = 1 − x^2/2! + x^4/4! − x^6/6! + ···      (3.28)

Figure 3.3: y = cos x and its Taylor polynomials near x = 0.
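A figure like Figure 3.3 can be reproduced with the Symbolic Math Toolbox; the sketch below is illustrative (the plotting window and the chosen orders 2, 4, 8 are arbitrary).

    syms x
    f = cos(x);
    fplot(f, [-6 6], 'k-', 'LineWidth', 2); hold on
    for N = [2 4 8]
        PN = taylor(f, x, 0, 'Order', N+1);   % Taylor polynomial keeping terms up to x^N
        fplot(PN, [-6 6], '--', 'LineWidth', 1.5)
    end
    ylim([-2 2]); grid on; hold off
    legend('cos x', 'P_2', 'P_4', 'P_8')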


Commonly Used Taylor Series

    1/(1−x)    = 1 + x + x^2 + x^3 + ···       = Σ_{n=0}^∞ x^n,                        x ∈ (−1, 1)
    e^x        = 1 + x + x^2/2! + ···          = Σ_{n=0}^∞ x^n/n!,                     x ∈ R
    cos x      = 1 − x^2/2! + x^4/4! − ···     = Σ_{n=0}^∞ (−1)^n x^{2n}/(2n)!,        x ∈ R
    sin x      = x − x^3/3! + x^5/5! − ···     = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n+1)!,    x ∈ R       (3.29)
    ln(1+x)    = x − x^2/2 + x^3/3 − ···       = Σ_{n=1}^∞ (−1)^{n+1} x^n/n,           x ∈ (−1, 1]
    tan^{−1} x = x − x^3/3 + x^5/5 − ···       = Σ_{n=0}^∞ (−1)^n x^{2n+1}/(2n+1),     x ∈ [−1, 1]

Note: The interval of convergence can be verified using e.g. the ratio
test, presented in Theorem 3.22, p. 66.

Self-study 3.30. Plot the sinc function f(x) = sin(x)/x and its Taylor
polynomials of order 4, 6, and 8, about x = 0.
Solution. Hint: Use e.g., syms x; T4 = taylor(sin(x)/x,x,0,'Order',5). Here "Order"
means the leading order of truncated terms.

3.3. Newton’s Method for Zero-Finding


The Newton’s method is also called the Newton-Raphson method.
Recall: The objective is to find a zero p of f :

f (p) = 0. (3.30)

Strategy 3.31. Let p_0 be an approximation of p. We will try to find a
correction term h such that (p_0 + h) is a better approximation of p than
p_0; ideally (p_0 + h) = p.

• If f″ exists and is continuous, then by Taylor's Theorem

      0 = f(p) = f(p_0 + h) = f(p_0) + (p − p_0) f′(p_0) + (p − p_0)^2/2 f″(ξ),      (3.31)

  where ξ lies between p and p_0.
• If |p − p_0| is small, it is reasonable to ignore the last term of (3.31) and
  solve for h = p − p_0:

      h = p − p_0 ≈ −f(p_0)/f′(p_0).      (3.32)

• Define
      p_1 = p_0 − f(p_0)/f′(p_0);      (3.33)
  then p_1 may be a better approximation of p than p_0.
• The above can be repeated.

Algorithm 3.32. (Newton's method for solving f(x) = 0). For p_0
chosen close to a root p, compute {p_n} repeatedly satisfying

      p_n = p_{n−1} − f(p_{n−1})/f′(p_{n−1}),   n ≥ 1.      (3.34)
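Algorithm 3.32 takes only a few lines of Matlab; the sketch below is illustrative (the function name, tolerance, and iteration cap are not part of the lecture files).

    newton.m (a sketch)
    function [p,it] = newton(f,df,p0,tol,itmax)
    % f, df: function handles for f and f'; p0: initial guess
    p = p0;
    for it=1:itmax
        h = -f(p)/df(p);        % correction term, cf. (3.32)
        p = p + h;
        if(abs(h)<tol), break; end
    end

For example, [p,it] = newton(@(x) x.^2-16, @(x) 2*x, 5, 1e-12, 100) should return p = 4 in a few iterations.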
Graphical interpretation
• Let p_0 be the initial approximation close to p. Then, the tangent line
  at (p_0, f(p_0)) reads

      L(x) = f′(p_0)(x − p_0) + f(p_0).      (3.35)

• To find the x-intercept of y = L(x), let

      0 = f′(p_0)(x − p_0) + f(p_0).

  Solving the above equation for x becomes

      x = p_0 − f(p_0)/f′(p_0),      (3.36)

  of which the right side is the same as in (3.33).

Figure 3.4: Graphical interpretation of the Newton's method.


An Example of Divergence (in Maple)
1 f := arctan(x);
2 Newton(f, x = Pi/2, output = plot, maxiterations = 3);
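For readers without Maple, the same experiment can be reproduced in Matlab; a minimal sketch, mirroring the starting point π/2 of the Maple call:

    f  = @(x) atan(x);
    df = @(x) 1./(1 + x.^2);
    p = pi/2;
    for n = 1:3
        p = p - f(p)/df(p);     % Newton update (3.34); the iterates move away from the zero
        fprintf('p_%d = %.6f\n', n, p);
    end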
Remark 3.33.
• The Newton's method may diverge, unless the initialization is accurate.
• It cannot be continued if f′(p_{n−1}) = 0 for some n. As a matter of fact,
  the Newton's method is most effective when f′(x) is bounded away
  from zero near p.

Convergence analysis for the Newton's method
Define the error in the n-th iteration: e_n = p_n − p. Then

    e_n = p_n − p = p_{n−1} − f(p_{n−1})/f′(p_{n−1}) − p
        = [e_{n−1} f′(p_{n−1}) − f(p_{n−1})] / f′(p_{n−1}).      (3.37)

On the other hand, it follows from the Taylor's Theorem that

    0 = f(p) = f(p_{n−1} − e_{n−1}) = f(p_{n−1}) − e_{n−1} f′(p_{n−1}) + (1/2) e_{n−1}^2 f″(ξ_{n−1}),      (3.38)

for some ξ_{n−1}. Thus, from (3.37) and (3.38), we have

    e_n = (1/2) [f″(ξ_{n−1}) / f′(p_{n−1})] e_{n−1}^2.      (3.39)

Theorem 3.34. (Convergence of Newton's method): Let f ∈ C^2[a, b]
and p ∈ (a, b) be such that f(p) = 0 and f′(p) ≠ 0. Then, there is a
neighborhood of p such that if the Newton's method is started with p_0 in
that neighborhood, it generates a convergent sequence p_n satisfying

    |p_n − p| ≤ C |p_{n−1} − p|^2,      (3.40)

for a positive constant C.


Example 3.35. Apply the Newton's method to solve f(x) = arctan(x) = 0,
with p_0 = π/5.

    Newton(arctan(x), x = Pi/5, output = sequence, maxiterations = 5)
    0.6283185308, -0.1541304479, 0.0024295539, -9.562*10^(-9), 0., 0.

Since p = 0, e_n = p_n and

    |e_n| ≤ 0.67 |e_{n−1}|^3,      (3.41)

which is an occasional super-convergence.

Theorem 3.36. (Newton's Method for a Convex Function): Let
f ∈ C^2(R) be increasing, convex, and have a zero. Then, the zero is unique
and the Newton iteration will converge to it from any starting point.

Example 3.37. Use the Newton's method to find the square root of a
positive number Q.

Solution. Let x = √Q. Then x is a root of x^2 − Q = 0. Define f(x) = x^2 − Q;
then f′(x) = 2x. The Newton's method reads

    p_n = p_{n−1} − f(p_{n−1})/f′(p_{n−1}) = p_{n−1} − (p_{n−1}^2 − Q)/(2 p_{n−1})
        = (1/2) (p_{n−1} + Q/p_{n−1}).      (3.42)

mysqrt.m
1 function x = mysqrt(q)
2 %function x = mysqrt(q)
3

4 x = (q+1)/2;
5 for n=1:10
6 x = (x+q/x)/2;
7 fprintf('x_%02d = %.16f\n',n,x);
8 end

Results
1 >> mysqrt(16); 1 >> mysqrt(0.1);
2 x_01 = 5.1911764705882355 2 x_01 = 0.3659090909090910
3 x_02 = 4.1366647225462421 3 x_02 = 0.3196005081874647
4 x_03 = 4.0022575247985221 4 x_03 = 0.3162455622803890
5 x_04 = 4.0000006366929393 5 x_04 = 0.3162277665175675
6 x_05 = 4.0000000000000506 6 x_05 = 0.3162277660168379
7 x_06 = 4.0000000000000000 7 x_06 = 0.3162277660168379
8 x_07 = 4.0000000000000000 8 x_07 = 0.3162277660168379
9 x_08 = 4.0000000000000000 9 x_08 = 0.3162277660168379
10 x_09 = 4.0000000000000000 10 x_09 = 0.3162277660168379
11 x_10 = 4.0000000000000000 11 x_10 = 0.3162277660168379

Note: The function sqrt is implemented the same way as mysqrt.m.



3.4. Zeros of Polynomials


Definition 3.38. A polynomial of degree n has a form

    P(x) = a_n x^n + a_{n−1} x^{n−1} + ··· + a_1 x + a_0,      (3.43)

where a_n ≠ 0 and the a_i's are called the coefficients of P.

Theorem 3.39. (Theorem on Polynomials).
• Fundamental Theorem of Algebra: Every nonconstant polynomial
  has at least one root (possibly, in the complex field).
• Complex Roots of Polynomials: A polynomial of degree n has exactly
  n roots in the complex plane, being agreed that each root shall be
  counted a number of times equal to its multiplicity. That is, there
  are unique (complex) constants x_1, x_2, ···, x_k and unique integers
  m_1, m_2, ···, m_k such that

      P(x) = a_n (x − x_1)^{m_1} (x − x_2)^{m_2} ··· (x − x_k)^{m_k},   Σ_{i=1}^k m_i = n.      (3.44)

• Localization of Roots: All roots of the polynomial P lie in the open
  disk centered at the origin and of radius

      ρ = 1 + (1/|a_n|) max_{0≤i<n} |a_i|.      (3.45)

• Uniqueness of Polynomials: Let P(x) and Q(x) be polynomials of
  degree n. If x_1, x_2, ···, x_r, with r > n, are distinct numbers with
  P(x_i) = Q(x_i), for i = 1, 2, ···, r, then P(x) = Q(x) for all x.
  – For example, two polynomials of degree n are the same if they
    agree at (n + 1) points.

3.4.1. Horner’s method


Note: Known as nested multiplication and also as synthetic divi-
sion, Horner’s method can evaluate polynomials very efficiently. It
requires n multiplications and n additions to evaluate an arbitrary n-th
degree polynomial.

Algorithm 3.40. Let us try to evaluate P(x) at x = x_0.

• Utilizing the Remainder Theorem, we can rewrite the polynomial as

      P(x) = (x − x_0) Q(x) + r = (x − x_0) Q(x) + P(x_0),      (3.46)

  where Q(x) is a polynomial of degree n − 1, say

      Q(x) = b_n x^{n−1} + ··· + b_2 x + b_1.      (3.47)

• Substituting the above into (3.46), utilizing (3.43), and setting equal
  the coefficients of like powers of x on the two sides of the resulting
  equation, we have

      b_n     = a_n
      b_{n−1} = a_{n−1} + x_0 b_n
        ⋮                                  (3.48)
      b_1     = a_1 + x_0 b_2
      P(x_0)  = a_0 + x_0 b_1

• Introducing b_0 = P(x_0), the above can be rewritten as

      b_{n+1} = 0;   b_k = a_k + x_0 b_{k+1},   n ≥ k ≥ 0.      (3.49)

• If the calculation of Horner's algorithm is to be carried out with pencil
  and paper, the following arrangement is often used (known as synthetic division):

Example 3.41. Use Horner’s algorithm to evaluate P (3), where

P (x) = x4 − 4x3 + 7x2 − 5x − 2. (3.50)

Solution. For x0 = 3, we arrange the calculation as mentioned above:

Note that the 4-th degree polynomial in (3.50) is written as

P (x) = (x − 3)(x3 − x2 + 4x + 7) + 19.

Remark 3.42. When the Newton's method is applied for finding an
approximate zero of P(x), the iteration reads

    x_n = x_{n−1} − P(x_{n−1})/P′(x_{n−1}).      (3.51)

Thus both P(x) and P′(x) must be evaluated in each iteration.

Strategy 3.43. How to evaluate P′(x): The derivative P′(x) can be
evaluated by using the Horner's method with the same efficiency.
Indeed, differentiating (3.46)

    P(x) = (x − x_0) Q(x) + P(x_0)

reads
    P′(x) = Q(x) + (x − x_0) Q′(x).      (3.52)

Thus
    P′(x_0) = Q(x_0).      (3.53)

That is, the evaluation of Q at x_0 becomes the desired quantity P′(x_0).

Example 3.44. Evaluate P 0 (3) for P (x) considered in Example 3.41, the
previous example.
Solution. As in the previous example, we arrange the calculation and carry
out the synthetic division one more time:

Example 3.45. Implement the Horner’s algorithm to evaluate P (3) and


P 0 (3), for the polynomial in (3.50): P (x) = x4 − 4x3 + 7x2 − 5x − 2.
Solution.
horner.m
1 function [p,d] = horner(A,x0)
2 % input: A = [a_0,a_1,...,a_n]
3 % output: p=P(x0), d=P'(x0)
4

5 n = size(A(:),1);
6 p = A(n); d=0;
7

8 for i = n-1:-1:1
9 d = p + x0*d;
10 p = A(i) +x0*p;
11 end

Call_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 [p,d] = horner(a,x0);
4 fprintf(" P(%g)=%g; P'(%g)=%g\n",x0,p,x0,d)
5 Result: P(3)=19; P'(3)=37

Example 3.46. Let P (x) = x4 − 4x3 + 7x2 − 5x − 2, as in (3.50). Use the


Newton’s method and the Horner’s method to implement a code and find an
approximate zero of P near 3.
Solution.
newton_horner.m
1 function [x,it] = newton_horner(A,x0,tol,itmax)
2 % input: A = [a_0,a_1,...,a_n]; x0: initial for P(x)=0
3 % outpue: x: P(x)=0
4

5 x = x0;
6 for it=1:itmax
7 [p,d] = horner(A,x);
8 h = -p/d;
9 x = x + h;
10 if(abs(h)<tol), break; end
11 end
Call_newton_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 tol = 10^-12; itmax=1000;
4 [x,it] = newton_horner(a,x0,tol,itmax);
5 fprintf(" newton_horner: x0=%g; x=%g, in %d iterations\n",x0,x,it)
6 Result: newton_horner: x0=3; x=2, in 7 iterations

Figure 3.5: Polynomial P (x) = x4 − 4x3 + 7x2 − 5x − 2. Its two zeros are −0.275682 and 2.
3.5. Multi-Variable Functions and the Gradient Vector

3.5.1. Functions of several variables
Definition 3.47. A function of two variables, f , is a rule that assigns
each ordered pair of real numbers (x, y) in a set D ⊂ R2 a unique real
number denoted by f (x, y). The set D is called the domain of f and its
range is the set of values that f takes on, that is, {f (x, y) : (x, y) ∈ D}.

Definition 3.48. Let f be a function of two variables, and z = f (x, y).


Then x and y are called independent variables and z is called a de-
pendent variable.

Problem 3.49. Let f(x, y) = √(x + y + 1) / (x − 1). Evaluate f(3, 2) and give its domain.
Solution.

Ans: f(3, 2) = √6/2;  D = {(x, y) : x + y + 1 ≥ 0, x ≠ 1}

Problem 3.50. Find the domain and the range of f(x, y) = √(9 − x^2 − y^2).
Solution.

3.5.2. First-order partial derivatives

Recall: A function y = f(x) is differentiable at a if

    f′(a) = lim_{h→0} [f(a+h) − f(a)] / h   exists.

Figure 3.6: Ordinary derivative f′(a) and partial derivatives f_x(a, b) and f_y(a, b).

Let f be a function of two variables (x, y). Suppose we let only x vary while
keeping y fixed, say y = b. Then g(x) := f(x, b) is a function of a single
variable. If g is differentiable at a, then we call g′(a) the partial derivative
of f with respect to x at (a, b), denoted by f_x(a, b):

    g′(a) = lim_{h→0} [g(a+h) − g(a)] / h
          = lim_{h→0} [f(a+h, b) − f(a, b)] / h =: f_x(a, b).      (3.54)

Similarly, the partial derivative of f with respect to y at (a, b), denoted
by f_y(a, b), is obtained by keeping x fixed, say x = a, and finding the
ordinary derivative at b of G(y) := f(a, y):

    G′(b) = lim_{h→0} [G(b+h) − G(b)] / h
          = lim_{h→0} [f(a, b+h) − f(a, b)] / h =: f_y(a, b).      (3.55)

Problem 3.51. Find f_x(0, 0), when f(x, y) = (x^3 + y^3)^{1/3}.
Solution. Using the definition,

    f_x(0, 0) = lim_{h→0} [f(h, 0) − f(0, 0)] / h

Ans: 1

Definition 3.52. If f is a function of two variables, its partial derivatives
are the functions f_x = ∂f/∂x and f_y = ∂f/∂y defined by:

    f_x(x, y) = ∂f/∂x (x, y) = lim_{h→0} [f(x+h, y) − f(x, y)] / h   and
    f_y(x, y) = ∂f/∂y (x, y) = lim_{h→0} [f(x, y+h) − f(x, y)] / h.      (3.56)
Observation 3.53. The partial derivative with respect to x represents
the slope of the tangent lines to the curve that are parallel to the xz-plane
(i.e. in the direction of ⟨1, 0, ·⟩). Similarly, the partial derivative with
respect to y represents the slope of the tangent lines to the curve that are
parallel to the yz-plane (i.e. in the direction of ⟨0, 1, ·⟩).

Rule for finding Partial Derivatives of z = f(x, y)

• To find f_x, regard y as a constant and differentiate f w.r.t. x.
• To find f_y, regard x as a constant and differentiate f w.r.t. y.

Problem 3.54. If f(x, y) = x^3 + x^2 y^3 − 2y^2, find f_x(2, 1) and f_y(2, 1).

Solution.

Ans: f_x(2, 1) = 16;  f_y(2, 1) = 8

Problem 3.55. Let f(x, y) = sin(x/(1 + y)). Find the first partial derivatives
of f(x, y).
Solution.
3.5.3. The gradient vector

Definition 3.56. Let f be a differentiable function of two variables x
and y. Then the gradient of f is the vector function

    ∇f(x, y) = ⟨f_x(x, y), f_y(x, y)⟩ = ∂f/∂x i + ∂f/∂y j.      (3.57)

Problem 3.57. If f(x, y) = sin(x) + e^{xy}, find ∇f(x, y) and ∇f(0, 1).
Solution.

Ans: ⟨2, 0⟩
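The gradient in Problem 3.57 can be checked symbolically; a minimal sketch with the Symbolic Math Toolbox (using diff and subs):

    syms x y
    f = sin(x) + exp(x*y);
    gradf = [diff(f,x), diff(f,y)]     % [cos(x) + y*exp(x*y), x*exp(x*y)]
    subs(gradf, [x y], [0 1])          % gives [2, 0], matching the answer above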

Claim 3.58. The gradient direction is the direction where the function
changes fastest, more precisely, increases fastest!

Example 3.59. Consider a level curve f(x, y) = −x^2 + y = k. Set e.g.
k = 1 and graphically figure out that

(a) ∇f(x, y) is normal to the curve
(b) It points in the fastest increasing direction.

Solution.

Exercises for Chapter 3

3.1. In Example 3.5, we considered the curve y = |x^2 − 1|. Find the left-hand and the
     right-hand slopes of the difference quotient at x_0 = 1.
     Ans: −2 and 2.
3.2. The number e is determined so that the slope of the graph of y = e^x at x = 0 is exactly
     1. Let h be a point near 0. Then

         Q(h) := (e^h − e^0)/(h − 0) = (e^h − 1)/h

     represents the average slope of the graph between the two points (0, 1) and (h, e^h).
     Evaluate Q(h), for h = 0.1, 0.01, 0.001, 0.0001. What can you say about the results?
     Ans: For example, Q(0.01) = 1.0050.
3.3. Recall the Taylor series for e^x, cos x and sin x in (3.29). Let x = iθ, where i = √−1.
     Then
         e^{iθ} = 1 + iθ + i^2 θ^2/2! + i^3 θ^3/3! + i^4 θ^4/4! + i^5 θ^5/5! + i^6 θ^6/6! + ···      (3.58)

     (a) Prove that e^{iθ} = cos θ + i sin θ, which is called the Euler's identity.
     (b) Prove that e^{iπ} + 1 = 0.

3.4. Implement a code to visualize complex-valued solutions of ez = −1.

• Use fimplicit
• Visualize, with ylim([-2*pi 4*pi]), yticks(-pi:pi:3*pi)

Hint : Use the code in § 2.2, starting with


eulers_identity.m
1 syms x y real
2 z = x+1i*y;
3

4 %% ---- Euler's identity


5 g = exp(z)+1;
6 RE = simplify(real(g))
7 IM = simplify(imag(g))
8

9 A = @(x,y) <Copy RE appropriately>


10 B = @(x,y) <Copy IM appropriately>
11

12 %%--- Solve A=0 and B=0 --------------


13

3.5. Using your calculator (or pencil-and-paper), run two iterations of Newton’s method to
find x2 for given f and x0 .

(a) f (x) = x4 − 2, x0 = 1
(b) f (x) = xex − 1, x0 = 0.5
Ans: (b) x2 = 0.56715557
3.6. The graphs of y = x2 (x + 1) and y = 1/x (x > 0) intersect at one point x = r. Use
Newton’s method to estimate the value of r to eight decimal places.

3.7. Consider the level curve f (x, y) = −x2 + y = k as in Example 3.59. For k = 1:

(a) Plot the level curve. (Explore the command ‘contour’.)


(b) Superpose gradient vectors at x=[-1,0,1,2]. (You may use ‘quiver’.)
Chapter 4

Programming with Linear Algebra


Real-world systems can be approximated/represented as systems of
linear equations

    Ax = b,   A = [a_{11} a_{12} ··· a_{1n};
                   a_{21} a_{22} ··· a_{2n};
                     ⋮      ⋮    ⋱    ⋮
                   a_{m1} a_{m2} ··· a_{mn}] ∈ R^{m×n},      (4.1)

where b is the source and x is the solution.

• When m < n, the system is underdetermined; when consistent, it has
  infinitely many solutions.
• When m > n, the system is overdetermined; it may have no solution.
• When m = n, the system may have either a unique solution or infinitely
  many solutions. When it has a unique solution, its solution can
  formally be written as

      x = A^{−1} b.      (4.2)

Contents of Chapter 4
4.1. Solutions of Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2. Invertible Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3. Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4. Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112


4.1. Solutions of Linear Systems


Definition 4.1. A linear equation in the variables x1 , x2 , · · · , xn is an
equation that can be written in the form

a1 x1 + a2 x2 + · · · + an xn = b, (4.3)

where b and the coefficients a1 , a2 , · · · , an are real or complex numbers.

A system of linear equations (or a linear system) is a collection of one


or more linear equations involving the same variables – say, x1 , x2 , · · · , xn .
Example 4.2.
( 
4x1 − x2 = 3  2x + 3y − 4z = 2

(a)
2x1 + 3x2 = 5 (b) x − 2y + z = 1

 3x + y − 2z = −1

• Solution: A solution of the system is a list (s1 , s2 , · · · , sn ) of numbers


that makes each equation a true statement, when the values s1 , s2 , · · · , sn
are substituted for x1 , x2 , · · · , xn , respectively.
• Solution Set: The set of all possible solutions is called the solution set
of the linear system.
• Equivalent System: Two linear systems are called equivalent if they
have the same solution set.

• For example, above (a) is equivalent to


(
2x1 − 4x2 = −2
R1 ← R1 − R2
2x1 + 3x2 = 5

Remark 4.3. Linear systems may have

no solution ) : inconsistent system


exactly one (unique) solution
: consistent system
infinitely many solutions

Example 4.4. Consider the case of two equations in two unknowns.


( ( (
−x + y = 1 x+y = 1 −2x + y = 2
(a) (b) (c)
−x + y = 3 x−y = 2 −4x + 2y = 4

4.1.1. Solving a linear system


Consider a simple system of 2 linear equations:
(
−2x1 + 3x2 = −1
(4.4)
x1 + 2x2 = 4
Such a system of linear equations can be treated much more conveniently
and efficiently with matrix form. In matrix form, (4.4) reads
" # " # " #
−2 3 x1 −1
= . (4.5)
1 2 x2 4
| {z }
coefficient matrix
The essential information of the system can be recorded compactly in a
rectangular array called an augmented matrix:
" # " #
−2 3 −1 −2 3 −1
or (4.6)
1 2 4 1 2 4

Solving (4.4):

System of linear equations Matrix form


( " #
−2x1 + 3x2 = −1 1 −2 3 −1
x1 + 2x2 = 4 2 1 2 4

1 ↔ 2 : (interchange)
( " #
x1 + 2x2 = 4 1 1 2 4
−2x1 + 3x2 = −1 2 −2 3 −1

2 ← 2 +2· 1 : (replacement)
( " #
x1 + 2x2 = 4 1 1 2 4
7x2 = 7 2 0 7 7

2 ← 2 /7: (scaling)
( " #
x1 + 2x2 = 4 1 1 2 4
x2 = 1 2 0 1 1

1 ← 1 −2· 2 : (replacement)
( " #
x1 = 2 1 1 0 2
x2 = 1 2 0 1 1

At the last step:


( " #
x1 = 2 2
LHS: solution : RHS : I
x2 = 1 1

Tools 4.5. Three Elementary Row Operations (ERO):


• Replacement: Replace one row by the sum of itself and a multiple
of another row
Ri ← Ri + k · Rj , j 6= i
• Interchange: Interchange two rows
Ri ↔ Rj , j 6= i
• Scaling: Multiply all entries in a row by a nonzero constant
Ri ← k · Ri , k 6= 0

Definition 4.6. Two matrices are row equivalent if there is a se-


quence of EROs that transforms one matrix to the other.

4.1.2. Matrix equation Ax = b


A fundamental idea in linear algebra is to view a linear combination of
vectors as a product of a matrix and a vector.
Definition 4.7. Let A = [a1 a2 · · · an ] be an m × n matrix and x ∈ Rn ,
then the product of A and x denoted by A x is the linear combination
of columns of A using the corresponding entries of x as weights, i.e.,
 
x1
x 
 2
A x = [a1 a2 · · · an ]  ..  = x1 a1 + x2 a2 + · · · + xn an . (4.7)
.
xn

A matrix equation is of the form A x = b, where b is a column vector of


size m × 1.

Example 4.8. x = [x1 , x2 ]T = [−3, 2]T is the solution of the linear system.

Matrix equation Vector equation Linear system


" #" # " # " # " # " #
1 2 x1 1 1 2 1 x1 + 2x2 = 1
= x1 + x2 =
3 4 x2 −1 3 4 −1 3x1 + 4x2 = −1

Theorem 4.9. Let A = [a1 a2 · · · an ] be an m × n matrix, x ∈ Rn , and


b ∈ Rm . Then the matrix equation
Ax = b (4.8)

has the same solution set as the vector equation


x1 a1 + x2 a2 + · · · + xn an = b, (4.9)

which, in turn, has the same solution set as the system with augmented
matrix
[a1 a2 · · · an : b]. (4.10)

Two Fundamental Questions about a Linear System:


1. (Existence): Is the system consistent; that is, does at least one
solution exist?
2. (Uniqueness): If a solution exists, is it the only one; that is, is the
solution unique?

Example 4.10. Determine the values of h such that the given system is a
consistent linear system
x + h y = −5
2x − 8y = 6

Solution.

Ans: h 6= −4

4.1.3. Reduced row echelon form


Example 4.11. Solve the following system of linear equations, using the 3
EROs. Then, determine if the system is consistent.

x2 − 2x3 = 0
x1 − 2x2 + 2x3 = 3
4x1 − 8x2 + 6x3 = 14

Solution.

Ans: x = [1, −2, −1]T


Note: The system of linear equations can be solved by transforming the
augmented matrix to the reduced row echelon form (rref).
linear_equations_rref.m
1 A = [0 1 -2; 1 -2 2; 4 -8 6];
2 b = [0; 3; 14];
3

4 AA = [A b];
5 rref(AA)

Result
1 ans =
2 1 0 0 1
3 0 1 0 -2
4 0 0 1 -1

Example 4.12. Find the general solution of the system whose aug-
mented matrix is 
1 0 0 1 7
0 1 3 0 −1
[A|b] = 
 
2 −1 −3 2 15

1 0 −1 0 4
Solution.
linear_equations_rref.m
1 Ab = [1 0 0 1 7; 0 1 3 0 -1; 2 -1 -3 2 15; 1 0 -1 0 4];
2 rref(Ab)

Result
1 ans =
2 1 0 0 1 7
3 0 1 0 -3 -10
4 0 0 1 1 3
5 0 0 0 0 0

Remark 4.13. Using rref, one can find the general solution which
may consist of infinitely many solutions.

4.2. Invertible Matrices


Definition 4.14. An n × n (square) matrix A is said to be invertible
(nonsingular) if there is an n × n matrix B such that AB = In = BA,
where In is the identity matrix.

Note: In this case, B is the unique inverse of A denoted by A−1 .


(Thus AA−1 = In = A−1 A.)
" # " #
2 5 −7 −5
Example 4.15. If A = and B = . Find AB and BA.
−3 −7 3 2
Solution.

Theorem 4.16. (Inverse of an n × n matrix, n ≥ 2) An n × n matrix


A is invertible if and only if A is row equivalent to In and in this case any
sequence of elementary row operations that reduces A into In will also
reduce In to A−1 .

Algorithm 4.17. Algorithm to find A−1 :


1) Row reduce the augmented matrix [A : In ]
2) If A is row equivalent to In , then [A : In ] is row equivalent to
[In : A−1 ]. Otherwise A does not have any inverse.
" #
3 2
Example 4.18. Find the inverse of A = .
8 5
Solution. You may begin
 with
3 2 1 0
[A : I2 ] =
8 5 0 1

 
0 1 0
Self-study 4.19. Use pencil-and-paper to find the inverse of A = 1 0 3,
 
4 −3 8
if it exists.
Solution.

When it is implemented:
inverse_matrix.m
1 A = [0 1 0
2 1 0 3
3 4 -3 8];
4 I = eye(3);
5

6 AI = [A I];
7 rref(AI)

Result
1 ans =
2 1.0000 0 0 2.2500 -2.0000 0.7500
3 0 1.0000 0 1.0000 0 0
4 0 0 1.0000 -0.7500 1.0000 -0.2500
Theorem 4.20.
a. (Inverse of a 2 × 2 matrix) Let A = [a b; c d]. If ad − bc ≠ 0, then A
   is invertible and

       A^{−1} = 1/(ad − bc) [d −b; −c a].      (4.11)

b. If A is an invertible matrix, then A^{−1} is also invertible and
   (A^{−1})^{−1} = A.
c. If A and B are n × n invertible matrices, then AB is also invertible
   and (AB)^{−1} = B^{−1} A^{−1}.
d. If A is invertible, then A^T is also invertible and (A^T)^{−1} = (A^{−1})^T.
e. If A is an n × n invertible matrix, then for each b ∈ R^n, the equation
   Ax = b has a unique solution x = A^{−1} b.
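Item (a) is easy to verify numerically; a minimal sketch, reusing the matrix of Example 4.18:

    A = [3 2; 8 5];                       % ad - bc = 15 - 16 = -1, nonzero
    Ainv = (1/(A(1,1)*A(2,2) - A(1,2)*A(2,1))) * [A(2,2) -A(1,2); -A(2,1) A(1,1)];
    norm(Ainv - inv(A))                   % expected to be (near) zero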

True-or-False 4.21.
a. In order for a matrix B to be the inverse of A, both equations AB = In
and BA = In must be true.
" #
a b
b. if A = and ad = bc, then A is not invertible.
c d
c. If A is invertible, then elementary row operations that reduce A to the
identity In also reduce A−1 to In .
Solution.
Ans: T,T,F

4.3. Determinants
Definition 4.22. Let A be an n × n square matrix. Then determinant
is a scalar value denoted by det A or |A|.
1) Let A = [a] ∈ R1 × 1 . Then det A = a.
" #
a b
2) Let A = ∈ R2 × 2 . Then det A = ad − bc.
c d
" #
2 1
Example 4.23. Let A = . Consider a linear transformation T : R2 → R2
0 3
defined by T (x) = Ax.

1) Find the determinant of A.


2) Determine the image of a rectangle R = [0, 2] × [0, 1] under T .
3) Find the area of the image.
4) Figure out how det A, the area of the rectangle (= 2), and the area of the
image are related.
Solution.

Ans: 3) 12
Note: The determinant can be viewed as a volume scaling factor.

Definition 4.24. Let Aij be the submatrix of A obtained by deleting row


i and column j of A. Then the (i, j)-cofactor of A = [aij ] is the scalar Cij ,
given by
Cij = (−1)i+j det Aij . (4.12)

Definition 4.25. For n ≥ 2, the determinant of an n × n matrix A =


[aij ] is given by the following formulas:
1. The cofactor expansion across the first row:

det A = a11 C11 + a12 C12 + · · · + a1n C1n (4.13)

2. The cofactor expansion across the row i:

det A = ai1 Ci1 + ai2 Ci2 + · · · + ain Cin (4.14)

3. The cofactor expansion down the column j:

det A = a1j C1j + a2j C2j + · · · + anj Cnj (4.15)


 
1 5 0
Example 4.26. Find the determinant of A = 2 4 −1, by expanding
 
0 −2 0
across the first row and down column 3.
Solution.

Ans: −2

Note: If A is triangular (upper or lower) matrix, then det A is the product


of entries on the main diagonal of A.
 
1 −2 5 2
0 −6 −7 5
Example 4.27. Compute the determinant of A =  .
 
0 0 3 0
0 0 0 4
Solution.

determinant.m
1 A = [1 -2 5 2; 0 -6 -7 5; 0 0 3 0; 0 0 0 4];
2 det(A)

Result
1 ans =
2 -72

Properties of Determinants

Theorem 4.28. Let A be n × n square matrix.


a) (Replacement): If B is obtained from A by a row replacement, then
det B = det A.
" # " #
1 3 1 3
A= , B=
2 1 0 −5

b) (Interchange): If two rows of A are interchanged to form B, then


det B = −det A.
" # " #
1 3 2 1
A= , B=
2 1 1 3

c) (Scaling): If one row of A is multiplied by k (6= 0), then


det B = k · det A.
" # " #
1 3 1 3
A= , B=
2 1 −4 −2

 
1 −4 2
Example 4.29. Compute det A, where A = −2 8 −9, after applying
 
−1 7 0
a couple of steps of replacement operations.
Solution.

Ans: 15

Theorem 4.30. A square matrix A is invertible if and only if det A 6= 0.

Remark 4.31. Let A and B be n × n matrices.


a) det AT = det A.
" # " #
1 3 1 2
A= , AT =
2 1 3 1

b) det (AB) = det A · det B.


" # " # " #
1 3 1 1 13 7
A= , B= ; then AB = .
2 1 4 2 6 4

1
c) If A is invertible, then det A−1 = . (∵ det In = 1.)
det A

Example 4.32. Suppose the sequence 5 × 5 matrices A, A1 , A2 , and A3 are


related by following elementary row operations:
R ←R −3R R3 ←(1/5) R3 R ↔R
A −−2−−−2 1
−−→ A1 −−−−−−−→ A2 −−4−−→
5
A3
 
1 2 3 4 1
0 −2 1 −1 1
 
Find det A, if A3 = 0 0 3 0 1
 
 
0 0 0 −1 1
0 0 0 0 1
Solution.

Ans: −30

4.4. Eigenvalues and Eigenvectors


Definition 4.33. Let A be an n × n matrix. An eigenvector of A is a
nonzero vector x such that Ax = λx for some scalar λ. In this case, a
scalar λ is an eigenvalue and x is the corresponding eigenvector.
" # " #
−1 5 2
Example 4.34. Is an eigenvector of ? What is the eigenvalue?
1 3 6
Solution.

" #
1 6
Example 4.35. Let A = . Show that 7 is an eigenvalue of matrix A,
5 2
and find the corresponding eigenvectors.
Solution. Hint : Start with Ax = 7x. Then (A − 7I)x = 0.

Definition 4.36. The set of all solutions of (A − λI) x = 0 is called the


eigenspace of A corresponding to eigenvalue λ.

Remark 4.37. Let λ be an eigenvalue of A. Then


a) Eigenspace is a subspace of Rn and the eigenspace of A corresponding
to λ is Nul (A − λI) := {x | (A − λI) x = 0}.
b) The homogeneous equation (A − λI) x = 0 has at least one free vari-
able.

Theorem 4.38. A square matrix A is invertible if and only if the num-


ber 0 is not an eigenvalue of A.

4.4.1. Characteristic equation


Definition 4.39. The scalar equation det (A − λI) = 0 is called the
characteristic equation of A; the polynomial p(λ) = det (A − λI)
is called the characteristic polynomial of A. The solutions of
det (A − λI) = 0 are the eigenvalues of A.

Example 4.40. Find the characteristic polynomial, eigenvalues, and cor-


" #
8 2
responding eigenvectors of A = .
3 3
Solution.

Example 4.41. Find the characteristic polynomial and all eigenvalues of


 
1 1 0
A = 6 0 5
 
0 0 2
eigenvalues.m
1 syms x
2 A = [1 1 0; 6 0 5; 0 0 2];
3

4 polyA = charpoly(A,x)
5 eigenA = solve(polyA)
6 [P,D] = eig(A) % A*P = P*D
7 P*D*inv(P)

Results
1 polyA =
2 12 - 4*x - 3*x^2 + x^3
3

4 eigenA =
5 -2
6 2
7 3
8

9 P =
10 0.4472 -0.3162 -0.6155
11 0.8944 0.9487 -0.6155
12 0 0 0.4924
13 D =
14 3 0 0
15 0 -2 0
16 0 0 2
17

18 ans =
19 1.0000 1.0000 -0.0000
20 6.0000 0.0000 5.0000
21 0 0 2.0000

4.4.2. Similarity
Definition 4.42. Let A and B be n × n matrices. Then, A is similar to
B, if there is an invertible matrix P such that

A = P BP −1 , or equivalently, P −1 AP = B.

Writing Q = P −1 , we have B = QAQ−1 . So B is also similar to A, and we


say simply that A and B are similar. The map A 7→ P −1 AP is called a
similarity transformation.

The next theorem illustrates one use of the characteristic polynomial, and
it provides the foundation for several iterative methods that approximate
eigenvalues.
Theorem 4.43. If n × n matrices A and B are similar, then they
have the same characteristic polynomial and hence the same eigenvalues
(with the same multiplicities).

Proof. B = P −1 AP . Then,

B − λI = P −1 AP − λI
= P −1 AP − λP −1 P
= P −1 (A − λI)P,

from which we conclude that det (B − λI) = det (A − λI).



4.4.3. Diagonalization
Definition 4.44. An n × n matrix A is said to be diagonalizable if
there exists an invertible matrix P and a diagonal matrix D such that

    A = P D P^{−1}   (or P^{−1} A P = D).      (4.16)

Remark 4.45. Let A be diagonalizable, i.e., A = P D P^{−1}. Then

    A^2    = (P D P^{−1})(P D P^{−1}) = P D^2 P^{−1}
    A^k    = P D^k P^{−1}
    A^{−1} = P D^{−1} P^{−1}   (when A is invertible)      (4.17)
    det A  = det D

Diagonalization enables us to compute A^k and det A quickly.

Self-study 4.46. Let A = [7 2; −4 1]. Find a formula for A^k, given that
A = P D P^{−1}, where P = [1 1; −1 −2] and D = [5 0; 0 3].
Solution.

Ans: A^k = [2·5^k − 3^k,  5^k − 3^k;  2·3^k − 2·5^k,  2·3^k − 5^k]
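Remark 4.45 and the formula above are easy to check numerically; a minimal sketch (the power k = 5 is an arbitrary choice):

    A = [7 2; -4 1];
    [P,D] = eig(A);               % A*P = P*D
    k = 5;
    norm(A^k - P*D^k*inv(P))      % expected to be (near) zero
    det(A) - det(D)               % expected to be (near) zero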

Theorem 4.47. (The Diagonalization Theorem)


1. An n × n matrix A is diagonalizable if and only if A has n linearly
independent eigenvectors v1 , v2 , · · · , vn .
2. In fact, A = P DP −1 if and only if columns of P are n linearly inde-
pendent eigenvectors of A. In this case, the diagonal entries of D are
the corresponding eigenvalues of A. That is,

P = [v1 v2 · · · vn ],
 
λ1 0 · · · 0
 0 λ ··· 0  (4.18)
2
D = diag(λ1 , λ2 , · · · , λn ) =  .. .. . . ,
 
. . .
. .. 
0 0 · · · λn

where Avk = λk vk , k = 1, 2, · · · , n.

Proof. Let P = [v1 v2 · · · vn ] and D = diag(λ1 , λ2 , · · · , λn ), arbitrary ma-


trices. Then,

AP = A[v1 v2 · · · vn ] = [Av1 Av2 · · · Avn ], (4.19)

while
 
λ1 0 · · · 0
 0 λ ··· 0 
2
P D = [v1 v2 · · · vn ] .. .. . .  = [λ1 v1 λ2 v2 · · · λn vn ]. (4.20)
 
. . .
. .. 
0 0 · · · λn
(⇒ ) Now suppose A is diagonalizable and A = P DP −1 . Then we have
AP = P D; it follows from (4.19) and (4.20) that

[Av1 Av2 · · · Avn ] = [λ1 v1 λ2 v2 · · · λn vn ],

from which we conclude

Avk = λk vk , k = 1, 2, · · · , n. (4.21)

Furthermore, P is invertible ⇒ {v1 , v2 , · · · , vn } is linearly independent.


(⇐ ) It is almost trivial.

Example 4.48. Diagonalize the following matrix, if possible.


 
1 3 3
A = −3 −5 −3
 
3 3 1

Solution.
1. Find the eigenvalues of A.
2. Find three linearly independent eigenvectors of A.
3. Construct P from the vectors in step 2.
4. Construct D from the corresponding eigenvalues.
Check: AP = P D?
−1 −1
     
1
Ans: λ = 1, −2, −2. v1 = −1 , v2 =
   1 , v3 =
  0
1 0 1
diagonalization.m
1 A = [1 3 3; -3 -5 -3; 3 3 1];
2 [P,D] = eig(A) % A*P = P*D
3 P*D*inv(P)

Results
1 P =
2 -0.5774 -0.7876 0.4206
3 0.5774 0.2074 -0.8164
4 -0.5774 0.5802 0.3957
5 D =
6 1.0000 0 0
7 0 -2.0000 0
8 0 0 -2.0000
9

10 ans =
11 1.0000 3.0000 3.0000
12 -3.0000 -5.0000 -3.0000
13 3.0000 3.0000 1.0000

Attention: Eigenvectors corresponding to λ = −2



Exercises for Chapter 4

4.1. An important concern in the study of heat Write a system of four equations whose so-
transfer is to determine the steady-state tem- lution gives estimates for the temperatures
perature distribution of a thin plate when the T1 , T2 , · · · , T4 , and solve it.
temperature around the boundary is known.
Assume the plate shown in the figure repre-
sents a cross section of a metal beam, with
negligible heat flow in the direction perpen-
dicular to the plate. Let T1 , T2 , · · · , T4 denote
the temperatures at the four interior nodes of
the mesh in the figure. The temperature at
a node is approximately equal to the average
of the four nearest nodes. For example, T1 =
(10 + 20 + T2 + T4 )/4 or 4T1 = 10 + 20 + T2 + T4 .
Figure 4.1

1 −2
 
  1
3 −4
4.2. Find the inverses of the matrices, if exist: A = and B =  4 −7 3
7 −8
−2 6 −4
Ans: B is not invertible.
 
3 1
4.3. Let A = . Write 5A. Is det (5A) = 5det A?
4 2
1 1 −3
 

4.4. Let A = 0 2 8.


2 4 2

(a) Find det A.


(b) Let U = [0, 1]3 , the unit cube. What can you say about A(U ), the image of U under
the matrix multiplication by A.
 
1 0 1
4.5. Use pencil-and-paper to compute det (B 6 ), where B = 1 1 2.
1 2 1
Ans: 64
 
3 1 0
4.6. A matrix is not always diagonalizable. Let A = 0 3 1. Use [P,D] = eig(A) in
0 0 3
Matlab to verify

(a) P does not have its inverse.


(b) AP = P D.
Chapter 5

Regression Analysis

Contents of Chapter 5
5.1. Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2. Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3. Scene Analysis with Noisy Data: RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . 124
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129


5.1. Least-Squares Problem


Definition 5.1. For a given dataset {(xi , yi )}, let a continuous function
p(x) be constructed.
(a) p is an interpolation if it passes (interpolates) all the data points.
(b) p is an approximation if it approximates (represents) the data
points.
Dataset, in Maple
1 with(LinearAlgebra): with(CurveFitting):
2 n := 100: roll := rand(-n..n):
3 m := 10: xy := Matrix(m, 2):
4 for i to m do
5 xy[i, 1] := i;
6 xy[i, 2] := i + roll()/n;
7 end do:
8 plot(xy,color=red, style=point, symbol=solidbox, symbolsize=20);

Note: Interpolation may be too oscillatory to be useful; furthermore, it


may not be defined.

The Least-Squares (LS) Problem

Note: Let A be an m × n matrix. Then Ax = b may have no solution,
particularly when m > n. In real-world problems,
• m ≫ n, where m represents the number of data points and n denotes
  the dimension of the points;
• we need to find a best solution for Ax ≈ b.

Definition 5.2. Let A ∈ R^{m×n}, m ≥ n, and b ∈ R^m. The least-squares
problem is to find x̂ ∈ R^n which minimizes ‖Ax − b‖_2:

    x̂ = arg min_x ‖Ax − b‖_2,   or, equivalently,   x̂ = arg min_x ‖Ax − b‖_2^2,      (5.1)

where x̂ is called a least-squares solution of Ax = b.
Normal Equations

Theorem 5.3. The set of LS solutions of Ax = b coincides with the
nonempty set of solutions of the normal equations

    A^T A x = A^T b.      (5.2)

Method of Calculus
Let J(x) = ‖Ax − b‖^2 = (Ax − b)^T (Ax − b) and x̂ a minimizer of J(x).
• Then we must have

      ∇_x J(x̂) = ∂J(x)/∂x |_{x = x̂} = 0.      (5.3)

• Let's compute the gradient of J:

      ∂J(x)/∂x = ∂[(Ax − b)^T (Ax − b)]/∂x
               = ∂(x^T A^T A x − 2 x^T A^T b + b^T b)/∂x      (5.4)
               = 2 A^T A x − 2 A^T b.

• By setting the last term to zero, we obtain the normal equations.

Remark 5.4. Theorem 5.3 implies that LS solutions of Ax = b are
solutions of the normal equations A^T A x̂ = A^T b.
• When A^T A is not invertible, the normal equations have either no
  solution or infinitely many solutions.
• So, data acquisition is important, to make it invertible.
5.1. Least-Squares Problem 117

Theorem 5.5. (Method of Normal Equations) Let A ∈ Rm×n , m ≥ n.


The following statements are logically equivalent:
a. The equation Ax = b has a unique LS solution for each b ∈ Rm .
b. The matrix AT A is invertible.
When these statements are true, the unique LS solution x̂ is given by

       x̂ = (A^T A)^{−1} A^T b.                                           (5.5)

Example 5.6. Describe all least squares solutions of the equation Ax = b,


given

       A = [1 1 0; 0 1 0; 0 0 1; 1 0 1]   and   b = [1; 3; 8; 2].
Solution. Let’s try to solve the problem with pencil-and-paper.

least_squares.m
1 A = [1 1 0; 0 1 0; 0 0 1; 1 0 1];
2 b = [1; 3; 8; 2];
3 x = (A'*A)\(A'*b)

Ans: x = [−4, 4, 7]T
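
A NumPy rendering of the same solve may be helpful for readers who will follow the Python chapters later; the sketch below (file name illustrative) forms and solves the normal equations (5.2) and cross-checks the result with numpy.linalg.lstsq.

least_squares_np.py (a sketch)
    import numpy as np

    A = np.array([[1., 1., 0.], [0., 1., 0.], [0., 0., 1.], [1., 0., 1.]])
    b = np.array([1., 3., 8., 2.])

    # Solve the normal equations  A^T A x = A^T b
    x = np.linalg.solve(A.T @ A, A.T @ b)
    print(x)                                    # expected: [-4.  4.  7.]

    # np.linalg.lstsq solves the same LS problem without forming A^T A
    x2, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(x2)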


118 Chapter 5. Regression Analysis

5.2. Regression Analysis


Definition 5.7. Regression analysis is a set of statistical methods
used for the estimation of relationships between a dependent variable
and one or more independent variables.

5.2.1. Regression line

Figure 5.1: A regression line.

Definition 5.8. Suppose a set of experimental data points are given as

(x1 , y1 ), (x2 , y2 ), · · · , (xm , ym )

such that the graph is close to a line. We determine a line

y = β0 + β1 x (5.6)

that is as close as possible to the given points. This line is called the
least-squares line; it is also called the regression line of y on x and
β0 , β1 are called regression coefficients.
5.2. Regression Analysis 119

Calculation of Least-Squares Lines


Consider a least-squares (LS) model of the form y = β0 + β1 x, for a given
data set {(xi , yi ) | i = 1, 2, · · · , m}.
• Then
         Predicted y-value        Observed y-value
           β0 + β1 x1        =          y1
           β0 + β1 x2        =          y2                               (5.7)
               ...                      ...
           β0 + β1 xm        =          ym

• It can be equivalently written as


Xβ = y, (5.8)

where

       X = [1 x1; 1 x2; · · · ; 1 xm],   β = [β0; β1],   y = [y1; y2; · · · ; ym].
Here we call X the design matrix, β the parameter vector, and y
the observation vector.
• Thus the LS solution can be determined by solving the normal equa-
tions:
X T Xβ = X T y, (5.9)
provided that X T X is invertible.
• The normal equations for the regression line read

       [  m      Σxi  ]        [  Σyi   ]
       [ Σxi     Σxi² ]  β  =  [ Σxi yi ].                               (5.10)
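
As an illustration of (5.8)–(5.10), the sketch below assembles the design matrix and solves the normal equations in Python; the data are synthetic and for demonstration only.

regression_line.py (a sketch)
    import numpy as np

    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])        # synthetic data

    X = np.column_stack((np.ones_like(x), x))      # design matrix, rows [1, x_i]
    beta = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations (5.10)
    print(beta)                                    # [beta_0, beta_1]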
120 Chapter 5. Regression Analysis

Remark 5.9. (Pointwise construction of the normal equations)


The normal equations for the regression line in (5.10) can be rewritten
as
        m   [ 1     xi  ]         m   [   yi  ]
        Σ   [ xi    xi² ]  β  =   Σ   [ xi yi ].                         (5.11)
       i=1                       i=1

• The pointwise construction of the normal equation is convenient when


either points are first to be searched or weights are applied depending
on the point location.
• The idea is applicable for other regression models as well.

Self-study 5.10. Find the equation y = β0 + β1 x of least-squares line that


best fits the given points:
(−1, 0), (0, 1), (1, 2), (2, 4)
Solution.
5.2. Regression Analysis 121

5.2.2. Least-squares fitting of other curves

Remark 5.11. Consider a regression model of the form

y = β0 + β1 x + β2 x2 ,

for a given data set {(xi , yi ) | i = 1, 2, · · · , m}. Then


         Predicted y-value             Observed y-value
           β0 + β1 x1 + β2 x1²    =          y1
           β0 + β1 x2 + β2 x2²    =          y2                          (5.12)
                  ...                        ...
           β0 + β1 xm + β2 xm²    =          ym

It can be equivalently written as


Xβ = y, (5.13)

where

       X = [1 x1 x1²; 1 x2 x2²; · · · ; 1 xm xm²],   β = [β0; β1; β2],   y = [y1; y2; · · · ; ym].
Now, it can be solved through normal equations:
   
           [ Σ1      Σxi     Σxi²  ]        [   Σyi   ]
 X^T Xβ =  [ Σxi     Σxi²    Σxi³  ]  β  =  [ Σxi yi  ]  =  X^T y.       (5.14)
           [ Σxi²    Σxi³    Σxi⁴  ]        [ Σxi² yi ]

Self-study 5.12. Find an LS curve of the form y = β0 + β1 x + β2 x2 that best


fits the given points:
(0, 1), (1, 1), (1, 2), (2, 3).
Solution.

Ans: y = 1 + 0.5x2
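
The stated answer is easy to check numerically; a minimal Python sketch (not a substitute for the pencil-and-paper work):

    import numpy as np

    pts = np.array([[0., 1.], [1., 1.], [1., 2.], [2., 3.]])
    x, y = pts[:, 0], pts[:, 1]
    X = np.column_stack((np.ones_like(x), x, x**2))    # rows [1, x_i, x_i^2]
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)             # approximately [1., 0., 0.5], i.e., y = 1 + 0.5 x^2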
122 Chapter 5. Regression Analysis

5.2.3. Nonlinear regression: Linearization

Strategy 5.13. For nonlinear models, a change of variables can be


applied to get a linear model.

     Model                   Change of Variables             Linearization

     y = A + B/x             x̃ = 1/x,    ỹ = y         ⇒     ỹ = A + B x̃
     y = 1/(A + Bx)          x̃ = x,      ỹ = 1/y       ⇒     ỹ = A + B x̃       (5.15)
     y = C e^{Dx}            x̃ = x,      ỹ = ln y      ⇒     ỹ = ln C + D x̃
     y = 1/(A + B ln x)      x̃ = ln x,   ỹ = 1/y       ⇒     ỹ = A + B x̃
The above table contains just a few examples of linearization; for other
nonlinear models, use your imagination and creativity.

Example 5.14. Find the best fitting curve of the form y = c e^{dx} for the data

       x       y
      0.1     1.9940
      0.2     2.0087
      0.3     1.8770
      0.4     3.5783
      0.5     3.9203
      0.6     4.7617
      0.7     6.7246
      0.8     7.1491
      0.9     9.5777
      1.0    11.5625

Solution. Applying the natural log function (ln) to y = cedx gives

ln y = ln c + dx. (5.16)

Using the change of variables

Y = ln y, a0 = ln c, a1 = d, X = x,
5.2. Regression Analysis 123

the equation (5.16) reads


Y = a0 + a1 X, (5.17)
for which one can apply the linear LS procedure.
Linearized regression, in Maple
1 # The transformed data
2 xlny := Matrix(m, 2):
3 for i to m do
4 xlny[i, 1] := xy[i, 1];
5 xlny[i, 2] := ln(xy[i, 2]);
6 end do:
7

8 # The linear LS
9 L := CurveFitting[LeastSquares](xlny, x, curve = b*x + a);
10 0.295704647799999 + 2.1530740654363654 x
11

12 # Back to the original parameters


13 c := exp(0.295704647799999) = 1.344073123
14 d := 2.15307406543637:
15

16 # The desired nonlinear model


17 c*exp(d*x);
18 1.344073123 exp(2.15307406543637 x)
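
The same linearize-then-fit procedure can be reproduced in Python; the sketch below hard-codes the data table of Example 5.14 and should recover parameters close to the Maple output (c ≈ 1.344, d ≈ 2.153).

linearized_fit.py (a sketch)
    import numpy as np

    x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    y = np.array([1.9940, 2.0087, 1.8770, 3.5783, 3.9203,
                  4.7617, 6.7246, 7.1491, 9.5777, 11.5625])

    Y = np.log(y)                                  # Y = ln y, as in (5.16)
    X = np.column_stack((np.ones_like(x), x))      # linear model Y = a0 + a1*X
    a0, a1 = np.linalg.solve(X.T @ X, X.T @ Y)

    c, d = np.exp(a0), a1                          # back to the original parameters
    print(c, d)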
124 Chapter 5. Regression Analysis

5.3. Scene Analysis with Noisy Data: RANSAC


Note: Scene analysis is concerned with the interpretation of acquired
data in terms of a set of predefined models. It consists of 2 subproblems:
1. finding the best model (classification problem)
2. computing the best parameter values (parameter estimation problem)

• Traditional parameter estimation techniques, such as least-


squares (LS), optimize the model to all of the presented data.
– Those techniques are simple averaging methods, based on the
  smoothing assumption: there will always be enough good data
  points to smooth out any gross deviation.
• However, in many interesting parameter estimation problems, the
smoothing assumption does not hold; that is, the data set may
involve gross errors such as noise.
– Thus, in order to obtain more reliable model parameters, there
  must be internal mechanisms to determine which points match
  the model (inliers) and which points are false matches (outliers).

5.3.1. Weighted least-squares


Definition 5.15. When certain data points are more important or more
reliable than the others, one may try to compute the coefficient vector
with larger weights on more reliable data points. The weighted least-
squares method is an LS method which involves a weight. The weight
is often given as a diagonal matrix

W = diag(w1 , w2 , · · · , wm ). (5.18)

The weight matrix W can be decided either manually or automatically.


5.3. Scene Analysis with Noisy Data: RANSAC 125

Algorithm 5.16. (Weighted Least-Squares)


• Given data {(xi , yi )}, 1 ≤ i ≤ m, the best-fitting curve can be found
by solving an over-determined algebraic system (5.8):

Xβ = y. (5.19)

• When a weight matrix is applied, the above system can be written


as
W Xβ = W y. (5.20)

• Thus its weighted normal equations read

X T W Xβ = X T W y. (5.21)

Example 5.17. Given data, find the LS line with and without a weight.
When a weight is applied, weigh the first and the last data point by 1/4.
" #T
1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
xy :=
5.89 1.92 2.59 4.41 4.49 6.22 7.74 7.07 9.05 5.7

Solution.
Weighted-LS
1 LS := CurveFitting[LeastSquares](xy, x);
2 2.7639999999999967 + 0.49890909090909125 x
3 WLS := CurveFitting[LeastSquares](xy, x,
4 weight = [1/4,1,1,1,1,1,1,1,1,1/4]);
5 1.0466694879390623 + 0.8019424460431653 x
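
A Python sketch of the weighted normal equations (5.21) for the same data (the weights mirror the Maple call above; small numerical differences are possible depending on how the software applies the weights):

weighted_ls.py (a sketch)
    import numpy as np

    x = np.arange(1., 11.)
    y = np.array([5.89, 1.92, 2.59, 4.41, 4.49, 6.22, 7.74, 7.07, 9.05, 5.7])

    X = np.column_stack((np.ones_like(x), x))
    w = np.ones(10); w[0] = w[-1] = 0.25           # weigh the first and last points by 1/4
    W = np.diag(w)

    beta_ls  = np.linalg.solve(X.T @ X, X.T @ y)            # plain LS
    beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)    # weighted LS, (5.21)
    print(beta_ls, beta_wls)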
126 Chapter 5. Regression Analysis

5.3.2. RANdom SAmple Consensus (RANSAC)


The random sample consensus (RANSAC) is one of the most powerful
tools for the reconstruction of ground structures from point cloud obser-
vations in many applications. The algorithm utilizes iterative search
techniques for a set of inliers to find a proper model for given data.

Algorithm 5.18. (RANSAC) (Fischler-Bolles, 1981) [3]


Input: Measurement set X = {xi }, the error tolerance τe , the stopping
threshold η, and the maximum number of iterations N .
1. Select randomly a minimum point set S, required to determine a
hypothesis.
2. Generate a hypothesis p = g(S).
3. Compute the hypothesis consensus set, fitting within the error tol-
erance τe :
C = inlier(X, p, τe )
4. If |C| ≥ γ = η|X|, then re-estimate a hypothesis p = g(C) and stop.
5. Otherwise, repeat steps 1–4 (maximum of N times).

Example 5.19. Let’s set a hypothesis for a regression line.

1. Minimum point set S: a set of two points, (x1 , y1 ) and (x2 , y2 ).


2. Hypothesis p: y = a + bx  (⇒ a + bx − y = 0)

       y = b(x − x1) + y1 = a + bx   ⇐   b = (y2 − y1)/(x2 − x1),   a = y1 − b x1.

3. Consensus set C:

       C = { (xi, yi) ∈ X  |  d = |a + b xi − yi| / √(b² + 1)  ≤  τe }          (5.22)
5.3. Scene Analysis with Noisy Data: RANSAC 127

Note: In practice:
• Step 2: A hypothesis p is the set of model parameters, rather than
the model itself.
• Step 3: The consensus set can be represented more conveniently by
considering C as an index array. That is,
              { 1   if xi ∈ C
       C(i) = {                                                          (5.23)
              { 0   if xi ∉ C

See inlier.m implemented for Exercise 5.2, p. 129.
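
For readers working in Python rather than Matlab, a one-function sketch of the consensus test (5.22)–(5.23) could look like the following (names are illustrative):

    import numpy as np

    def inlier(X, p, tau_e):
        """X: (m,2) data; p = (a, b) for the line a + b*x - y = 0.
           Returns the 0/1 index array C of (5.23)."""
        a, b = p
        dist = np.abs(a + b*X[:, 0] - X[:, 1]) / np.sqrt(b**2 + 1)
        return (dist <= tau_e).astype(int)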

Remark 5.20. The “inlier” function in Step 3 collects points whose


distance from the model, f (p), is not larger than τe . Thus, the function
can be interpreted as an automatic weighting mechanism. Indeed,
for each point xi ,
                        { ≤ τe,  then wi = 1
       dist(f(p), xi)   {                                                (5.24)
                        { > τe,  then wi = 0

Then the re-estimation in Step 4, p = g(C), can be seen as a parameter
estimation p = g(X) with the corresponding weight matrix
W = {w1, w2, · · · , wm}.

Remark 5.21.
• The above basic RANSAC algorithm is an iterative search method
for a set of inliers which may produce presumably accurate model
parameters.
• It is simple to implement and efficient. However, it is problematic
and often erroneous.
• The main disadvantage of RANSAC is that RANSAC is unrepeat-
able; it may yield different results in each run so that none of the
results can be optimal.
128 Chapter 5. Regression Analysis

Figure 5.2: The RANSAC for linear-type synthetic datasets:
(Nin = 200, Nout = 50) and (Nin = 1200, Nout = 300).

Table 5.1: The RANSAC: model fitting y = a0 + a1 x. The algorithm runs 1000 times for
each dataset to find the standard deviation of the error: σ(a0 − â0) and σ(a1 − â1).

      Data     σ(a0 − â0)     σ(a1 − â1)     E-time (sec)
        1        0.1156         0.0421         0.0156
        2        0.1101         0.0391         0.0147

The RANSAC is neither repeatable nor optimal.


In order to overcome the drawbacks, various variants have been stud-
ied in the literature. Nonetheless, it remains a prevailing algorithm for
finding inliers. For variants, see e.g.,
• Maximum Likelihood Estimation Sample Consensus (MLESAC) [9]
• Progressive Sample Consensus (PROSAC) [1]
• Recursive RANSAC (R-RANSAC) [6]
5.3. Scene Analysis with Noisy Data: RANSAC 129

Exercises for Chapter 5

5.1. Given data

xi 0.2 0.4 0.6 0.8 1. 1.2 1.4 1.6 1.8 2.


yi 1.88 2.13 1.76 2.78 3.23 3.82 6.13 7.22 6.66 9.07

(a) Plot the data (scattered point plot)


(b) Decide what curve fits the data best.
(c) Implement an LS code to find the curve.
(d) Plot the curve superposed over the point plot.

5.2. This problem uses the data in Example 5.17, p.125.

(a) Implement the method of normal equations for the least-squares regression to
find the best-fitting line.
(b) The RANSAC, Algorithm 5.18 is implemented for you below. Use the code to
analyze the performance of the RANSAC.
• Set τe = 1, γ = η|X| = 8, and N = 100.
• Run ransac2 100 times to get the minimum, maximum, and average number
of iterations for the RANSAC to find an acceptable hypothesis consensus set.
(c) Plot the best-fitting lines found from (a) and (b), superposed along the data.
ransac2.m
1 function [p,C,iter] = ransac2(X,tau_e,gamma,N)
2 % Input: X = {(x_i,y_i)}
3 % tau_e: the error tolerance
4 % gamma = eta*|X|
5 % N: the maximum number of iterations
6 % Output: p = [a,b], where y= a+b*x
7

8 %%-----------
9 [m,n] = size(X);
10 if n>m, X=X'; [m,n] = size(X); end
11

12 for iter = 1:N


13 % step 1
14 s1 = randi([1 m]); s2 = randi([1 m]);
15 while s1==s2, s2 = randi([1 m]); end
16 S = [X(s1,:);X(s2,:)];
17 % step 2
18 p = get_hypothesis_WLS(S,[1;1]);
19 % step 3
20 C = inlier(X,p,tau_e);
21 % step 4
130 Chapter 5. Regression Analysis

22 if sum(C)>=gamma
23 p = get_hypothesis_WLS(X,C);
24 break;
25 end
26 end

get_hypothesis_WLS.m
1 function p = get_hypothesis_WLS(X,C)
2 % Get hypothesis p, with C being used as weights
3 % Output: p = [a,b], where y= a+b*x
4

5 m = size(X,1);
6

7 A = [ones(m,1) X(:,1)];
8 A = A.*C; %A = bsxfun(@times,A,C);
9 r = X(:,2).*C;
10

11 p = ((A'*A)\(A'*r))';

inlier.m
1 function C = inlier(X,p,tau_e)
2 % Input: p=[a,b] s.t. a+b*x-y=0
3

4 m = size(X,1);
5 C = zeros(m,1);
6

7 a = p(1); b=p(2);
8 factor = 1./sqrt(b^2+1);
9 for i=1:m
10 xi = X(i,1); yi = X(i,2);
11 dist = abs(a+b*xi-yi)*factor; %distance from point to line
12 if dist<=tau_e, C(i)=1; end
13 end
CHAPTER 6

Fundamentals of AI

Contents of Chapter 6
6.1. What is Artificial Intelligence (AI)? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2. Constituents of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3. Designing Artificial Brains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4. Future of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

131
132 Chapter 6. Fundamentals of AI

6.1. What is Artificial Intelligence (AI)?


DoiNg HeRe
6.2. Constituents of AI 133

6.2. Constituents of AI
134 Chapter 6. Fundamentals of AI

6.3. Designing Artificial Brains


6.4. Future of AI 135

6.4. Future of AI
136 Chapter 6. Fundamentals of AI

Exercises for Chapter 6

6.1.
CHAPTER 7

Python Basics

Contents of Chapter 7
7.1. Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2. Python in an Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3. Zeros of Polynomials in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4. Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.5. A Machine Learning Modelcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

137
138 Chapter 7. Python Basics

7.1. Why Python?


Note: A good programming language must be easy to learn and use,
and it must be flexible and reliable.

Advantages of Python
Python has the following characteristics.
• Easy to learn and use
• Flexible and reliable
• Extensively used in Data Science
• Handy for Web Development purposes
• Having Vast Libraries support
• Among the fastest-growing programming languages in the tech
industry
Disadvantage of Python
Python is an interpreted and dynamically-typed language. The line-by-
line execution of code, combined with its high flexibility, most likely leads to
slow execution: Python is slower than Matlab, which in turn is slower than C.

Remark 7.1. Speed up Python Programs.


• Use numpy and scipy for all mathematical operations.
• Always use a C library wherever possible.

• You yourself may create and import your own C-module into Python.
If you extend Python with pieces of compiled C-code, then the re-
sulting code is easily 100× faster than Python. Best choice!
• Cython: It is designed as a C-extension for Python, which is
  developed for users not familiar with C. For Cython implementation,
  see e.g. https://www.youtube.com/watch?v=JKMkhARcwdU, one of
  Simon Funke’s YouTube videos.
7.1. Why Python? 139

• The library numpy is designed for a Matlab-like implementation.


• Python can be used as a convenient desktop calculator.
– First, set a startup environment
– Use Python as a desktop calculator

∼/.python_startup.py
1 #.bashrc: export PYTHONSTARTUP=~/.python_startup.py
2 #.cshrc: setenv PYTHONSTARTUP ~/.python_startup.py
3 #---------------------------------------------------
4 print("\t^[[1;33m~/.python_startup.py")
5

6 import numpy as np; import sympy as sym


7 import numpy.linalg as la; import matplotlib.pyplot as plt
8 print("\tnp=numpy; la=numpy.linalg; plt=matplotlib.pyplot; sym=sympy")
9

10 from numpy import zeros,ones


11 print("\tzeros,ones, from numpy")
12

13 import random
14 from sympy import *
15 x,y,z,t = symbols('x,y,z,t');
16 print("\tfrom sympy import *; x,y,z,t = symbols('x,y,z,t')")
17

18 print("\t^[[1;37mTo see details: dir() or dir(np)^[[m")

Figure 7.1: Python startup.


140 Chapter 7. Python Basics

7.2. Python in an Hour


Key Features of Python
• Python is a simple, readable, open source programming language
which is easy to learn.
• It is an interpreted language, not a compiled language.
• In Python, variables are untyped; i.e., there is no need to define the
data type of a variable while declaring it.
• Python supports the object-oriented programming model.
• It is platform-independent and easily extensible and embeddable.
• It has a huge standard library with lots of modules and packages.
• Python is a high level language as it is easy to use because of simple
syntax, powerful because of its rich libraries and extremely versatile.

Programming Features
• Python has no support for pointers.
• Python codes are stored with .py extension.
• Indentation: Python uses indentation to define a block of code.
– A code block (body of a function, loop, etc.) starts with indenta-
tion and ends with the first unindented line.
– The amount of indentation is up to the user, but it must be consis-
tent throughout that block.
• Comments:
– The hash (#) symbol is used to start writing a comment.
– Multi-line comments: Python uses triple quotes, either ''' or """.
7.2. Python in an Hour 141

7.2.1. Python essentials


• Sequence datatypes: list, tuple, string
– [list]: defined using square brackets (and commas)
>>> li = ["abc", 14, 4.34, 23]
– (tuple): defined using parentheses (and commas)
>>> tu = (23, (4,5), ’a’, 4.1, -7)
– "string": defined using quotes (", ’, or """)
>>> st = ’Hello World’
>>> st = "Hello World"
>>> st = """This is a multi-line string
. . . that uses triple quotes."""
• Retrieving elements
>>> li[0]
’abc’
>>> tu[1],tu[2],tu[-2]
((4, 5), ’a’, 4.1)
>>> st[25:36]
’ng\nthat use’
• Slicing
>>> tu[1:4] # be aware
((4, 5), ’a’, 4.1)
• The + and ∗ operators
>>> [1, 2, 3]+[4, 5, 6,7]
[1, 2, 3, 4, 5, 6, 7]
>>> "Hello" + " " + ’World’
Hello World
>>> (1,2,3)*3
(1, 2, 3, 1, 2, 3, 1, 2, 3)
142 Chapter 7. Python Basics

• Reference semantics
>>> a = [1, 2, 3]
>>> b = a
>>> a.append(4)
>>> b
[1, 2, 3, 4]
Be careful when copying lists and numpy arrays!
• numpy, range, and iteration
>>> list(range(8))
[0, 1, 2, 3, 4, 5, 6, 7]
>>> import numpy as np
>>> for k in range(np.size(li)):
... li[k]
. . . <Enter>
’abc’
14
4.34
23
• numpy array and deepcopy
>>> from copy import deepcopy
>>> A = np.array([1,2,3])
>>> B = A
>>> C = deepcopy(A)
>>> A *= 4
>>> B
array([ 4, 8, 12])
>>> C
array([1, 2, 3])
7.2. Python in an Hour 143

7.2.2. Frequently used Python rules


frequently_used_rules.py
1 ## Multi-line statement
2 a = 1 + 2 + 3 + 4 + 5 +\
3 6 + 7 + 8 + 9 + 10
4 b = (1 + 2 + 3 + 4 + 5 +
5 6 + 7 + 8 + 9 + 10) #inside (), [], or {}
6 print(a,b)
7 # Output: 55 55
8

9 ## Multiple statements in a single line using ";"


10 a = 1; b = 2; c = 3
11

12 ## Docstrings in Python
13 def double(num):
14 """Function to double the value"""
15 return 2*num
16 print(double.__doc__)
17 # Output: Function to double the value
18

19 ## Assigning multiple values to multiple variables


20 a, b, c = 1, 2, "Hello"
21 ## Swap
22 b, c = c, b
23 print(a,b,c)
24 # Output: 1 Hello 2
25

26 ## Data types in Python


27 a = 5; b = 2.1
28 print("type of (a,b)", type(a), type(b))
29 # Output: type of (a,b) <class 'int'> <class 'float'>
30

31 ## Python Set: 'set' object is not subscriptable


32 a = {5,2,3,1,4}; b = {1,2,2,3,3,3}
33 print("a=",a,"b=",b)
34 # Output: a= {1, 2, 3, 4, 5} b= {1, 2, 3}
144 Chapter 7. Python Basics

35

36 ## Python Dictionary
37 d = {'key1':'value1', 'Seth':22, 'Alex':21}
38 print(d['key1'],d['Alex'],d['Seth'])
39 # Output: value1 21 22
40

41 ## Output Formatting
42 x = 5.1; y = 10
43 print('x = %d and y = %d' %(x,y))
44 print('x = %f and y = %d' %(x,y))
45 print('x = {} and y = {}'.format(x,y))
46 print('x = {1} and y = {0}'.format(x,y))
47 # Output: x = 5 and y = 10
48 # x = 5.100000 and y = 10
49 # x = 5.1 and y = 10
50 # x = 10 and y = 5.1
51

52 print("x=",x,"y=",y, sep="#",end="&\n")
53 # Output: x=#5.1#y=#10&
54

55 ## Python Input
56 C = input('Enter any: ')
57 print(C)
58 # Output: Enter any: Starkville
59 # Starkville
7.2. Python in an Hour 145

7.2.3. Looping and functions


Example 7.2. Compose a Python function which returns cubes of natural
numbers.
Solution.
get_cubes.py
1 def get_cubes(num):
2 cubes = []
3 for i in range(1,num+1):
4 value = i**3
5 cubes.append(value)
6 return cubes
7

8 if __name__ == '__main__':
9 num = input('Enter a natural number: ')
10 cubes = get_cubes(int(num))
11 print(cubes)

Remark 7.3. get_cubes.py


• Lines 8-11 are added for the function to be called directly. That is,
[Fri Jul.22] python get_cubes.py
Enter a natural number: 6
[1, 8, 27, 64, 125, 216]
• When get_cubes is called from another function, the last four lines
will not be executed.
call_get_cubes.py
1 from get_cubes import *
2

3 cubes = get_cubes(8)
4 print(cubes)

Execution
1 [Fri Jul.22] python call_get_cubes.py
2 [1, 8, 27, 64, 125, 216, 343, 512]
146 Chapter 7. Python Basics

7.3. Zeros of Polynomials in Python


In this section, a code in Python will be implemented for zeros of polynomi-
als; we will compare the Python code with the Matlab code in §3.4.

Recall: Let’s begin with recalling how to find zeros of polynomials, pre-
sented in §3.4.
• Remark 3.42: When the Newton’s method is applied for finding an
approximate zero of P (x), the iteration reads
       xn = xn−1 − P(xn−1)/P′(xn−1).                                     (7.1)

  Thus both P(x) and P′(x) must be evaluated in each iteration.
• Strategy 3.43: The derivative P′(x) can be evaluated by using
  the Horner’s method with the same efficiency. Indeed, differentiating (3.46)

       P(x) = (x − x0)Q(x) + P(x0)

  reads
       P′(x) = Q(x) + (x − x0)Q′(x).                                     (7.2)
  Thus
       P′(x0) = Q(x0).                                                   (7.3)
  That is, the evaluation of Q at x0 becomes the desired quantity P′(x0).
7.3. Zeros of Polynomials in Python 147

Example 7.4. (Revisit of Example 3.46, p. 80)


Let P (x) = x4 −4x3 +7x2 −5x−2. Use the Newton’s method and the Horner’s
method to implement a code and find an approximate zero of P near 3.
Solution. First, let’s try to use built-in functions.
zeros_of_poly_built_in.py
1 import numpy as np
2

3 coeff = [1, -4, 7, -5, -2]


4 P = np.poly1d(coeff)
5 Pder = np.polyder(P)
6

7 print(P)
8 print(Pder)
9 print(np.roots(P))
10 print(P(3), Pder(3))

Output
1 4 3 2
2 1 x - 4 x + 7 x - 5 x - 2
3 3 2
4 4 x - 12 x + 14 x - 5
5 [ 2. +0.j 1.1378411+1.52731225j 1.1378411-1.52731225j -0.2756822+0.j ]
6 19 37

Now, we implement a code for Newton-Horner method to find an approxi-


mate zero of P near 3.
Zeros-Polynomials-Newton-Horner.py
1 def horner(A,x0):
2 """ input: A = [a_n,...,a_1,a_0]
3 output: p,d = P(x0),DP(x0) = horner(A,x0) """
4 n = len(A)
5 p = A[0]; d = 0
6

7 for i in range(1,n):
8 d = p + x0*d
9 p = A[i] +x0*p
10 return p,d
148 Chapter 7. Python Basics

11

12 def newton_horner(A,x0,tol,itmax):
13 """ input: A = [a_n,...,a_1,a_0]
14 output: x: P(x)=0 """
15 x=x0
16 for it in range(1,itmax+1):
17 p,d = horner(A,x)
18 h = -p/d;
19 x = x + h;
20 if(abs(h)<tol): break
21 return x,it
22

23 if __name__ == '__main__':
24 coeff = [1, -4, 7, -5, -2]; x0 = 3
25 tol = 10**(-12); itmax = 1000
26 x,it =newton_horner(coeff,x0,tol,itmax)
27 print("newton_horner: x0=%g; x=%g, in %d iterations" %(x0,x,it))
Execution
1 [Sat Jul.23] python Zeros-Polynomials-Newton-Horner.py
2 newton_horner: x0=3; x=2, in 7 iterations

Note: The above Python code must be compared with the Matlab code
in §3.4.
newton_horner.m
1 function [x,it] = newton_horner(A,x0,tol,itmax)
2 % input: A = [a_0,a_1,...,a_n]; x0: initial for P(x)=0
3 % output: x: P(x)=0
4

5 x = x0;
6 for it=1:itmax
7 [p,d] = horner(A,x);
8 h = -p/d;
9 x = x + h;
10 if(abs(h)<tol), break; end
11 end
7.3. Zeros of Polynomials in Python 149

horner.m
1 function [p,d] = horner(A,x0)
2 % input: A = [a_0,a_1,...,a_n]
3 % output: p=P(x0), d=P'(x0)
4

5 n = size(A(:),1);
6 p = A(n); d=0;
7

8 for i = n-1:-1:1
9 d = p + x0*d;
10 p = A(i) +x0*p;
11 end

Call_newton_horner.m
1 a = [-2 -5 7 -4 1];
2 x0=3;
3 tol = 10^-12; itmax=1000;
4 [x,it] = newton_horner(a,x0,tol,itmax);
5 fprintf(" newton_horner: x0=%g; x=%g, in %d iterations\n",x0,x,it)
6 Result: newton_horner: x0=3; x=2, in 7 iterations

Observation 7.5.
Python programming is as easy and simple as Matlab programming.
• In particular, numpy is developed for Matlab-like implementation,
with enhanced convenience.
• Python uses classes for object-oriented programming.
• Furthermore, Python is an open source (free) programming lan-
guage, which explains why Python is fastest-growing in use.
150 Chapter 7. Python Basics

7.4. Classes
Remark 7.6. Classes are a key concept in the so-called object-
oriented programming (OOP). Classes provide a means of
bundling data and functionality together.
• A class is a user-defined template or prototype from which real-
world objects are created.
• A class tells us what data an object should have, what are the ini-
tial/default values of the data, and what methods are associated
with the object to take actions on the objects using their data.
• An object is an instance of a class, and creating an object from a
class is called instantiation.

In the following, we would build a simple class, as Dr. Xu did in [11, Ap-
pendix B.5]; you will learn how to initiate, refine, and use classes.
7.4. Classes 151

Polynomial_01.py
1 class Polynomial():
2 """A class of polynomials"""
3

4 def __init__(self,coefficient):
5 """Initialize coefficient attribute of a polynomial."""
6 self.coeff = coefficient
7

8 def degree(self):
9 """Find the degree of a polynomial"""
10 return len(self.coeff)-1
11

12 if __name__ == '__main__':
13 p2 = Polynomial([1,2,3])
14 print(p2.coeff) # a variable; output: [1, 2, 3]
15 print(p2.degree()) # a method; output: 2

• Lines 1-2: define a class called Polynomial with a docstring.


– The parentheses in the class definition are empty because we cre-
ate this class from scratch.
• Lines 4-10: define two functions, __init__() and degree(). A function
in a class is called a method.
– The __init__() method is a special method for initialization; it is
called the __init__() constructor.
– The self parameter is required and must come first before the
other parameters in each method.
– Whenever we make an object from the class, we need to provide
arguments for parameters, except for self.
– The variable self.coeff (prefixed with self) is available to every
method and is accessible by any object created from the class.
– Variables prefixed with self are called attributes.
• Line 13: The line p2 = Polynomial([1,2,3]) creates an object p2 (a
polynomial x2 + 2x + 3), by passing the coefficient list [1,2,3].
– When Python reads this line, it calls the method __init__() in the
class Polynomial and creates the object named p2 that represents
this particular polynomial x2 + 2x + 3.
152 Chapter 7. Python Basics

Refinement of the Polynomial class


Polynomial_02.py
1 class Polynomial():
2 """A class of polynomials"""
3

4 count = 0
5

6 def __init__(self):
7 """Initialize coefficient attribute of a polynomial."""
8 self.coeff = [1]
9 Polynomial.count += 1
10

11 def __del__(self):
12 """Delete a polynomial object"""
13 Polynomial.count -= 1
14

15 def degree(self):
16 """Find the degree of a polynomial"""
17 return len(self.coeff)-1
18

19 def evaluate(self,x):
20 """Evaluate a polynomial."""
21

22 n = self.degree()
23 eval = []
24 for xi in x:
25 p = self.coeff[0] #Horner's method
26 for k in range(1,n+1):
27 p = self.coeff[k]+ xi*p
28 eval.append(p)
29 return eval
30

31 if __name__ == '__main__':
32 poly1 = Polynomial()
33 print('poly1, default coefficients:', poly1.coeff)
34 poly1.coeff = [1,2,-3]
35 print('poly1, coefficients after reset:', poly1.coeff)
36 print('poly1, degree:', poly1.degree())
37

38 poly2 = Polynomial()
39 poly2.coeff = [1,2,3,4,-5]
40 print('poly2, coefficients after reset:', poly2.coeff)
41 print('poly2, degree:', poly2.degree())
42
7.4. Classes 153

43 print('number of created polynomials:', Polynomial.count)


44 del poly1
45 print('number of polynomials after a deletion:', Polynomial.count)
46

47 print('poly2.evaluate([-1,0,1,2]):',poly2.evaluate([-1,0,1,2]))

• Line 4: The variable count is a class attribute of Polynomial.


– A class attribute is a variable that belongs to a class but not a
particular object.
– All objects of the class share this same variable (the class at-
tribute).
• Line 8: initializes the class attribute self.coeff.
– Every object or class attribute in a class needs an initial value.
– One can set a default value for an object attribute in the
__init__() constructor, and we then do not have to include a pa-
rameter for that attribute in the constructor. See Lines 32 and
38.
• Lines 11-13: define the __del__() method in the class for the deletion
of objects. See Line 44.
• Lines 19-29: define another method called evaluate, which uses the
Horner’s method. See Example 7.4, p.147.

Execution
1 [Sat Jul.23] python Polynomial_02.py
2 poly1, default coefficients: [1]
3 poly1, coefficients after reset: [1, 2, -3]
4 poly1, degree: 2
5 poly2, coefficients after reset: [1, 2, 3, 4, -5]
6 poly2, degree: 4
7 number of created polynomials: 2
8 number of polynomials after a deletion: 1
9 poly2.evaluate([-1,0,1,2]): [-7, -5, 5, 47]
154 Chapter 7. Python Basics

Inheritance
Note: If we want to write a class that is just a specialized version of
another class, we do not need to write the class from scratch.
• We call the specialized class a child class and the other general
class a parent class.
• The child class can inherit all the attributes and methods from the
  parent class; it can also define its own special attributes and meth-
  ods or even override methods of the parent class.
Classes.py
1 class Polynomial():
2 """A class of polynomials"""
3

4 def __init__(self,coefficient):
5 """Initialize coefficient attribute of a polynomial."""
6 self.coeff = coefficient
7

8 def degree(self):
9 """Find the degree of a polynomial"""
10 return len(self.coeff)-1
11

12 class Quadratic(Polynomial):
13 """A class of quadratic polynomial"""
14

15 def __init__(self,coefficient):
16 """Initialize the coefficient attributes ."""
17 super().__init__(coefficient)
18 self.power_decrease = 1
19

20 def roots(self):
21 a,b,c = self.coeff
22 if self.power_decrease != 1:
23 a,c = c,a
24 discriminant = b**2-4*a*c
25 r1 = (-b+discriminant**0.5)/(2*a)
26 r2 = (-b-discriminant**0.5)/(2*a)
27 return [r1,r2]
28

29 def degree(self):
30 return 2
7.4. Classes 155

• Line 12: We must include the name of the parent class in the paren-
theses of the definition of the child class (to indicate the parent-child
relation for inheritance).
• Line 17: The super() function is to give an child object all the at-
tributes defined in the parent class.
• Line 18: An additional child class attribute self.power_decrease is
initialized.
• Lines 20-27: define a new method called roots.
• Lines 29-30: The method degree() overrides the parent’s method.
call_Quadratic.py
1 from Classes import *
2

3 quad1 = Quadratic([2,-3,1])
4 print('quad1, roots:',quad1.roots())
5 quad1.power_decrease = 0
6 print('roots when power_decrease = 0:',quad1.roots())
7 # Output: quad1, roots: [1.0, 0.5]
8 # roots when power_decrease = 0: [2.0, 1.0]
156 Chapter 7. Python Basics

7.5. A Machine Learning Modelcode


A code for machine learning can start with the following machine learn-
ing modelcode. You may copy-and-paste the scripts to run.
Machine_Learning_Model.py
1 import numpy as np; import pandas as pd
2 import seaborn as sbn; import matplotlib.pyplot as plt
3 import time
4 from sklearn.model_selection import train_test_split
5 from sklearn import datasets; #print(dir(datasets))
6 np.set_printoptions(suppress=True)
7

8 #=====================================================================
9 # DATA: Read & Preprocessing
10 # load_iris, load_wine, load_breast_cancer, ...
11 #=====================================================================
12 data_read = datasets.load_iris(); #print(data_read.keys())
13

14 X = data_read.data
15 y = data_read.target
16 dataname = data_read.filename
17 targets = data_read.target_names
18 features = data_read.feature_names
19

20 print('X.shape=',X.shape, 'y.shape=',y.shape)
21 #---------------------------------------------------------------------
22 # SETTING
23 #---------------------------------------------------------------------
24 N,d = X.shape; labelset=set(y)
25 nclass=len(labelset);
26 print('N,d,nclass=',N,d,nclass)
27

28 rtrain = 0.7e0; run = 100


29 rtest = 1-rtrain
30

31 #=====================================================================
32 # CLASSIFICATION
33 #=====================================================================
34 btime = time.time()
35 Acc = np.zeros([run,1])
36 ##from sklearn.neighbors import KNeighborsClassifier
37 ##clf = KNeighborsClassifier(5)
38 from myCLF import myCLF ## My classifier
7.5. A Machine Learning Modelcode 157

39

40 for it in range(run):
41 Xtrain, Xtest, ytrain, ytest = train_test_split(
42 X, y, test_size=rtest, random_state=it, stratify = y)
43 ##clf.fit(Xtrain, ytrain);
44 clf = myCLF(Xtrain,ytrain); clf.fit(); ## My classifier
45 Acc[it] = clf.score(Xtest, ytest)
46

47 #-----------------------------------------------
48 # Print: Accuracy && E-time
49 #-----------------------------------------------
50 etime = time.time()-btime
51 print(' %s: Acc.(mean,std) = (%.2f,%.2f)%%; Average E-time= %.5f'
52 %(dataname,np.mean(Acc)*100,np.std(Acc)*100,etime/run))
53

54 #=====================================================================
55 # Scikit-learn Classifiers, for Comparisons
56 #=====================================================================
57 exec(open("sklearn_classifiers.py").read())

sklearn_classifiers.py
1 #=====================================================================
2 # Required: X, y, [dataname, run]
3 print('========= Scikit-learn Classifiers, for Comparisons =========')
4 #=====================================================================
5 from sklearn.preprocessing import StandardScaler
6 from sklearn.datasets import make_moons, make_circles, make_classification
7 from sklearn.neural_network import MLPClassifier
8 from sklearn.neighbors import KNeighborsClassifier
9 from sklearn.linear_model import LogisticRegression
10 from sklearn.svm import SVC
11 from sklearn.gaussian_process import GaussianProcessClassifier
12 from sklearn.gaussian_process.kernels import RBF
13 from sklearn.tree import DecisionTreeClassifier
14 from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
15 from sklearn.naive_bayes import GaussianNB
16 from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
17 from sklearn.inspection import DecisionBoundaryDisplay
18

19 #-----------------------------------------------
20 classifiers = [
21 LogisticRegression(max_iter = 1000),
22 KNeighborsClassifier(5),
23 SVC(kernel="linear", C=0.5),
158 Chapter 7. Python Basics

24 SVC(gamma=2, C=1),
25 RandomForestClassifier(max_depth=5, n_estimators=50, max_features=1),
26 MLPClassifier(alpha=1, max_iter=1000),
27 AdaBoostClassifier(),
28 GaussianNB(),
29 QuadraticDiscriminantAnalysis(),
30 GaussianProcessClassifier(),
31 ]
32 names = [
33 "Logistic Regr",
34 "KNeighbors-5 ",
35 "Linear SVM ",
36 "RBF SVM ",
37 "Random Forest",
38 "Deep-NN ",
39 "AdaBoost ",
40 "Naive Bayes ",
41 "QDA ",
42 "Gaussian Proc",
43 ]
44 #-----------------------------------------------
45 if dataname is None: dataname = 'No-dataname';
46 if run is None: run = 100;
47

48 #===============================================
49 acc_max=0
50 for name, clf in zip(names, classifiers):
51 Acc = np.zeros([run,1])
52 btime = time.time()
53

54 for it in range(run):
55 Xtrain, Xtest, ytrain, ytest = train_test_split(
56 X, y, test_size=rtest, random_state=it, stratify = y)
57

58 clf.fit(Xtrain, ytrain);
59 Acc[it] = clf.score(Xtest, ytest)
60

61 etime = time.time()-btime
62 accmean = np.mean(Acc)*100
63 print('%s: %s: Acc.(mean,std) = (%.2f,%.2f)%%; E-time= %.5f'
64 %(dataname,name,accmean,np.std(Acc)*100,etime/run))
65 if accmean>acc_max:
66 acc_max= accmean; algname = name
67 print('sklearn classifiers max: %s= %.2f' %(algname,acc_max))
7.5. A Machine Learning Modelcode 159

Exercises for Chapter 7

You should use Python for the following problems.


7.1. Use nested for loops to assign entries of a 5 × 5 matrix A such that A[i, j] = ij.
7.2. The variable d is initially equal to 1. Use a while loop to keep dividing d by 2 until
d < 10−6 .

(a) Determine how many divisions are made.


(b) Verify your result by algebraic derivation.

Note: A while loop has not been considered in the lecture. However, you can figure it out
easily by yourself.
7.3. Write a function that takes as input a list of values and returns the largest value. Do
this without using the Python max() function; you should combine a for loop and an
if statement.

(a) Produce a random list of size 10-20 to verify your function.

7.4. Let P4 (x) = 2x4 − 5x3 − 11x2 + 20x + 10. Solve the following.

(a) Plot P4 over the interval [−3, 4].


(b) Find all zeros of P4 , modifying Zeros-Polynomials-Newton-Horner.py, p.147.
(c) Add markers for the zeros to the plot.
(d) Find all roots of P4′(x) = 0.
(e) Add markers for the zeros of P4′ to the plot.

Hint : For plotting, you may import: “import matplotlib.pyplot as plt” then use
plt.plot(). You will see the Python plotting is quite similar to Matlab plotting.
160 Chapter 7. Python Basics
CHAPTER 8

Mathematical Optimization

Problem 8.1. (Minimization Problem) Let Ω ⊂ Rn , n ≥ 1. Given a


real-valued function f : Ω → R, the general problem of finding the value
that minimizes f is formulated as follows.

min f (x). (8.1)


x∈Ω

In this context, f is the objective function (sometimes referred to as


loss function or cost function). Ω ⊂ Rn is the domain of the function
(also known as the constraint set).

In this chapter, we solve the minimization problem (8.1) iteratively as


follows: Given an initial guess x0 ∈ Rn , find successive approximations
xk ∈ Rn of the form

xk+1 = xk + γk pk , k = 0, 1, · · · , (8.2)

where pk is the search direction and γk > 0 is the step length.

Contents of Chapter 8
8.1. Gradient Descent (GD) Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2. Newton’s Method for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

161
162 Chapter 8. Mathematical Optimization

8.1. Gradient Descent (GD) Method


Note: The gradient descent method is also known as the steepest
descent method or the Richardson’s method.
• Recall that we would solve the minimization problem (8.1) using
iterative algorithms of the form (8.2).

Derivation of the GD method


• Given xk+1 as in (8.2), we have by Taylor’s formula: for some ξ,

       f(xk+1) = f(xk + γk pk)
               = f(xk) + γk f′(xk) · pk + (γk²/2) pk · f″(ξ) pk.         (8.3)
• Assume that f″ is bounded. Then

       f(xk+1) = f(xk) + γk f′(xk) · pk + O(γk²),   as γk → 0.

• The Goal: To find pk and γk such that

       f(xk+1) < f(xk),                                                  (8.4)

  which can be achieved if

       f′(xk) · pk < 0                                                   (8.5)

  and either γk is sufficiently small or f″(ξ) is nonnegative.

• Choice: Let f′(xk) ≠ 0. If we choose

       pk = −f′(xk),                                                     (8.6)

  then
       f′(xk) · pk = −||f′(xk)||² < 0,                                   (8.7)
which satisfies (8.5) and therefore (8.4).
• Summary: In the GD method, the search direction is the negative
gradient, the steepest descent direction.
8.1. Gradient Descent (GD) Method 163

The Gradient Descent Method in 1D


Algorithm 8.2. Consider the minimization problem in 1D:

       min_x f(x),   x ∈ S,                                              (8.8)

where S is a closed interval in R. Then its gradient descent method reads

       xk+1 = xk − γ f′(xk).                                             (8.9)

Picking the step length γ: Assume that the step length was chosen to
be independent of k, although one can play with other choices as well. The
question is how to select γ in order to make the best gain of the method. To
turn the right-hand side of (8.9) into a more manageable form, we invoke
Taylor’s Theorem¹:

       f(x + t) = f(x) + t f′(x) + ∫_x^{x+t} (x + t − s) f″(s) ds.       (8.10)

Assuming that |f″(s)| ≤ L, we have

       f(x + t) ≤ f(x) + t f′(x) + (t²/2) L.

Now, letting x = xk and t = −γ f′(xk) reads

       f(xk+1) = f(xk − γ f′(xk))
               ≤ f(xk) − γ f′(xk) f′(xk) + (1/2) L [γ f′(xk)]²           (8.11)
               = f(xk) − [f′(xk)]² (γ − (L/2) γ²).

The gain (learning) from the method occurs when

       γ − (L/2) γ² > 0   ⇒   0 < γ < 2/L,                               (8.12)

and it will be best when γ − (L/2) γ² is maximal. This happens at the point

       γ = 1/L.                                                          (8.13)

¹ Taylor’s Theorem with integral remainder: Suppose f ∈ C^{n+1}[a, b] and x0 ∈ [a, b]. Then, for every
  x ∈ [a, b],  f(x) = Σ_{k=0}^{n} [f^(k)(x0)/k!] (x − x0)^k + Rn(x),
  where  Rn(x) = (1/n!) ∫_{x0}^{x} (x − s)^n f^(n+1)(s) ds.
164 Chapter 8. Mathematical Optimization

Thus an effective gradient descent method (8.9) can be written as

       xk+1 = xk − γ f′(xk) = xk − (1/L) f′(xk) = xk − f′(xk)/max|f″(x)|.   (8.14)

Furthermore, it follows from (8.11) and (8.13) that

       f(xk+1) ≤ f(xk) − (1/(2L)) [f′(xk)]².                             (8.15)
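
A tiny 1D illustration of (8.14), assuming the quadratic f(x) = (x − 2)² + 1 so that f″ ≡ 2 and L = 2 (for this f, the step γ = 1/L reaches the minimizer in a single step):

gd_1d.py (a sketch)
    f      = lambda x: (x - 2.0)**2 + 1.0
    fprime = lambda x: 2.0*(x - 2.0)
    L = 2.0                        # bound on |f''|
    gamma = 1.0/L                  # step length (8.13)

    x = 10.0                       # initial guess
    for k in range(20):
        x = x - gamma*fprime(x)    # gradient descent step (8.14)
    print(x, f(x))                 # x -> 2, the minimizer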

Remark 8.3. (Convergence of gradient descent method).


Thus it is obvious that the method defines a sequence of points {xk } along
which {f (xk )} decreases.
• If f is bounded from below and the level sets of f are bounded,
  {f(xk)} converges; so does {xk}. That is, there is a point x̂ such that

       lim_{k→∞} xk = x̂.                                                (8.16)

• Now, we can rewrite (8.15) as

       [f′(xk)]² ≤ 2L [f(xk) − f(xk+1)].                                 (8.17)

  Since f(xk) − f(xk+1) → 0, also f′(xk) → 0.

• When f′ is continuous, using (8.16) reads

       f′(x̂) = lim_{k→∞} f′(xk) = 0,                                    (8.18)

  which implies that the limit x̂ is a critical point.
• The method thus generally finds a critical point but that could still
be a local minimum or a saddle point. Which it is cannot be decided
at this level of analysis.
8.1. Gradient Descent (GD) Method 165

Example 8.4. (Rosenbrock function). For example, the Rosenbrock


function in the two-dimensional (2D) space is defined as2

f (x, y) = (1 − x)2 + 100 (y − x2 )2 . (8.19)

Use the GD method to find the minimizer, starting with x0 = (−1, 2).

Figure 8.1: Plots of the Rosenbrock function f (x, y) = (1 − x)2 + 100 (y − x2 )2 .

rosenbrock_2D_GD.py
1 import numpy as np; import time
2

3 itmax = 10000; tol = 1.e-7; gamma = 1/500


4 x0 = np.array([-1., 2.])
5

6 def rosen(x):
7 return (1.-x[0])**2+100*(x[1]-x[0]**2)**2
8

9 def rosen_grad(x):
10 h = 1.e-5;
11 g1 = ( rosen([x[0]+h,x[1]]) - rosen([x[0]-h,x[1]]) )/(2*h)
12 g2 = ( rosen([x[0],x[1]+h]) - rosen([x[0],x[1]-h]) )/(2*h)
13 return np.array([g1,g2])
14

² The Rosenbrock function in 3D is given as f(x, y, z) = [(1 − x)² + 100(y − x²)²] + [(1 − y)² + 100(z − y²)²],
  which has exactly one minimum at (1, 1, 1). Similarly, one can define the Rosenbrock function in gen-
  eral N-dimensional spaces, for N ≥ 4, by adding one more component for each enlarged dimension.
  That is, f(x) = Σ_{i=1}^{N−1} [(1 − xi)² + 100(xi+1 − xi²)²], where x = [x1, x2, · · · , xN] ∈ R^N. See Wikipedia
  (https://en.wikipedia.org/wiki/Rosenbrock_function) for details.
166 Chapter 8. Mathematical Optimization

15 # Now, GD iteration begins


16 if __name__ == '__main__':
17 t0 = time.time()
18 x=x0
19 for it in range(itmax):
20 corr = gamma*rosen_grad(x)
21 x = x - corr
22 if np.linalg.norm(corr)<tol: break
23 print('GD Method: it = %d; E-time = %.4f' %(it+1,time.time()-t0))
24 print(x)

Output
1 GD Method: it = 7687; E-time = 0.0521
2 [0.99994416 0.99988809]

The Choice of Step Size and Line Search


Note: The convergence of the gradient descent method can be extremely
sensitive to the choice of step size. It often requires to choose the
step size adaptively: the step size would better be chosen small in re-
gions of large variability of the gradient, while in regions with small
variability we would like to take it large.

Strategy 8.5. Backtracking line search procedures allow to select


a step size depending on the current iterate and the gradient. In this
procedure, we select an initial (optimistic) step size γk and evaluate the
following inequality (known as sufficient decrease condition):
       f(xk − γk ∇f(xk))  ≤  f(xk) − (γk/2) ||∇f(xk)||².                 (8.20)
If this inequality is verified, the current step size is kept. If not, the step
size is divided by 2 (or any number larger than 1) repeatedly until (8.20)
is verified. To get a better understanding, refer to (8.15) on p. 164.
8.1. Gradient Descent (GD) Method 167

The gradient descent algorithm with backtracking line search then becomes
Algorithm 8.6. (The Gradient Descent Algorithm, with Back-
tracking Line Search).

  input: initial guess x0, step size γ > 0;
  for k = 0, 1, 2, · · · do
      initial step size estimate γk;
      while (TRUE) do
          if f(xk − γk ∇f(xk)) ≤ f(xk) − (γk/2) ||∇f(xk)||²
              break;                                                     (8.21)
          else γk = γk/2;
      end while
      xk+1 = xk − γk ∇f(xk);
  end for
  return xk+1;
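
A minimal Python sketch of Algorithm 8.6, applied to the Rosenbrock function of Example 8.4. It assumes that rosen and rosen_grad from rosenbrock_2D_GD.py are importable; the stopping rule (correction norm below a tolerance) is one reasonable choice, not the only one.

gd_backtracking.py (a sketch)
    import numpy as np
    from rosenbrock_2D_GD import rosen, rosen_grad   # assumes the file above is on the path

    def gd_backtracking(f, grad, x0, gamma0=1.0, tol=1.e-7, itmax=20000):
        x = np.array(x0, dtype=float)
        for it in range(itmax):
            g = grad(x); gamma = gamma0
            # backtracking: halve gamma until the sufficient decrease condition (8.20) holds
            while f(x - gamma*g) > f(x) - 0.5*gamma*np.dot(g, g):
                gamma *= 0.5
            corr = gamma*g
            x = x - corr
            if np.linalg.norm(corr) < tol: break
        return x, it+1

    x, it = gd_backtracking(rosen, rosen_grad, [-1., 2.])
    print(x, it)       # x should approach the minimizer [1, 1]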

Remark 8.7. Incorporated with


• either a line search
• or partial updates,
the gradient descent method is the major computational algorithm
for various machine learning tasks.

Note: The gradient descent method with partial updates is called the
stochastic gradient descent (SGD) method.
168 Chapter 8. Mathematical Optimization

8.2. Newton’s Method for Optimization


DoiNg HeRe
8.2. Newton’s Method for Optimization 169

rosenbrock_opt_Newton.py
1 import numpy as np; import time
2 from scipy import optimize as opt
3

4 x0 = np.array([-1., 2.])
5

6 # method='Newton-CG' (default: 'BFGS')


7 t0 = time.time()
8 res = opt.minimize(opt.rosen, x0, method='Newton-CG', tol=1e-7,
9 jac=opt.rosen_der, hess=opt.rosen_hess)
10

11 print('Method = %s; E-time = %.4f' %('Newton-CG',time.time()-t0))


12 print(res)

Output
1 Method = Newton-CG; E-time = 0.0244
2 fun: 9.003798065813694e-20
3 jac: array([ 9.97569418e-08, -5.00786967e-08])
4 message: 'Optimization terminated successfully.'
5 nfev: 169
6 nhev: 148
7 nit: 148
8 njev: 169
9 status: 0
10 success: True
11 x: array([1., 1.])
170 Chapter 8. Mathematical Optimization

Exercises for Chapter 8

8.1.
CHAPTER 9

Vector Spaces and Orthogonality

Contents of Chapter 9
9.1. Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

171
172 Chapter 9. Vector Spaces and Orthogonality

9.1. Linear Independence


DoiNg HeRe
9.1. Linear Independence 173

Exercises for Chapter 9

9.1.
174 Chapter 9. Vector Spaces and Orthogonality
CHAPTER 10

Principal Component Analysis

Contents of Chapter 10
10.1.Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
10.2.Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.3.Application of the SVD for LS Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Exercises for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

175
176 Chapter 10. Principal Component Analysis

10.1. Principal Component Analysis


Definition 10.1. Principal component analysis (PCA) is the pro-
cess of computing and using the principal components to perform a
change of basis on the data, sometimes with only the first few princi-
pal components and ignoring the rest.

The PCA, in a Nutshell


• The PCA is a statistical procedure that uses an orthogonal transfor-
mation to convert a set of observations of possibly correlated variables
into a set of linearly uncorrelated variables called the principal com-
ponents.
• The orthogonal axes of the new subspace can be interpreted as the
directions of maximum variance given the constraint that the new
feature axes are orthogonal to each other:

Figure 10.1: Principal components.

• It can be shown that the principal directions are eigenvectors of the


data’s covariance matrix.
• The PCA directions are highly sensitive to data scaling, and we
need to standardize the features prior to PCA, particularly when the
features were measured on different scales and we want to assign
equal importance to all features.
10.1. Principal Component Analysis 177

10.1.1. The covariance matrix


Definition 10.2. Variance measures the variation of a single random
variable, whereas covariance is a measure of the joint variability of two
random variables. Let the random variable pair (x, y) take on the values
{(xi , yi ) | i = 1, 2, · · · , n}, with equal probabilities pi = 1/n. Then
• The formula for variance of x is given by

       σx² = (1/n) Σ_{i=1}^{n} (xi − x̄)²,                               (10.1)

  where x̄ is the mean of x values.

• The covariance σ(x, y) of two random variables x and y is given by

       cov(x, y) = (1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).                  (10.2)

Remark 10.3. In reality, data are saved in a matrix X ∈ Rn×d :


• each of the n rows represents a different data point, and
• each of the d columns gives a particular kind of feature.
Thus, d describes the dimension of data points and also can be considered
as the number of random variables.

Definition 10.4. The covariance matrix of a data matrix X ∈ Rn×d is


a square matrix C ∈ Rd×d , whose (i, j)-entry is the covariance of the i-th
column and the j-th column of X. That is,

C = [Cij ] ∈ Rd×d , Cij = cov(Xi , Xj ). (10.3)

Example 10.5. Let X̂ be the data X with the mean subtracted column-
wise: X̂ = X − E[X]. Then the covariance matrix of X reads

       C = (1/n) X̂^T X̂ = (1/n) (X − E[X])^T (X − E[X]),                 (10.4)
for which the scaling factor 1/n is often ignored in reality.
178 Chapter 10. Principal Component Analysis

Example 10.6. Generate a synthetic data X in 2D to find its covariance


matrix and principal directions.
Solution.
util_Covariance.py
1 import numpy as np
2

3 # Generate data
4 def generate_data(n):
5 # Normally distributed around the origin
6 x = np.random.normal(0,1, n)
7 y = np.random.normal(0,1, n)
8 S = np.vstack((x, y)).T
9 # Transform
10 sx, sy = 1, 3;
11 Scale = np.array([[sx, 0], [0, sy]])
12 theta = 0.25*np.pi; c,s = np.cos(theta), np.sin(theta)
13 Rot = np.array([[c, -s], [s, c]]).T #T, due to right multiplication
14

15 return S.dot(Scale).dot(Rot) +[5,2]


16

17 # Covariance
18 def cov(x, y):
19 xbar, ybar = x.mean(), y.mean()
20 return np.sum((x - xbar)*(y - ybar))/len(x)
21

22 # Covariance matrix
23 def cov_matrix(X):
24 return np.array([[cov(X[:,0], X[:,0]), cov(X[:,0], X[:,1])], \
25 [cov(X[:,1], X[:,0]), cov(X[:,1], X[:,1])]])
Covariance.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3 from util_Covariance import *
4

5 # Generate data
6 n = 200
7 X = generate_data(n)
8 print('Generated data: X.shape =', X.shape)
9

10 # Covariance matrix
11 C = cov_matrix(X)
10.1. Principal Component Analysis 179

12 print('C:\n',C)
13

14 # Principal directions
15 eVal, eVec = np.linalg.eig(C)
16 xbar,ybar = np.mean(X,0)
17 print('eVal:\n',eVal); print('eVec:\n',eVec)
18 print('np.mean(X, 0) =',xbar,ybar)
19

20 # Plotting
21 plt.style.use('ggplot')
22 plt.scatter(X[:, 0],X[:, 1],c='#00a0c0',s=10)
23 plt.axis('equal');
24 plt.title('Generated Data')
25 plt.savefig('py-data-generated.png')
26

27 for e, v in zip(eVal, eVec.T):


28 plt.plot([0,2*np.sqrt(e)*v[0]]+xbar,\
29 [0,2*np.sqrt(e)*v[1]]+ybar, 'r-', lw=2)
30 plt.title('Principal Directions')
31 plt.savefig('py-data-principal-directions.png')
32 plt.show()

Figure 10.2: Synthetic data and its principal directions (right).


180 Chapter 10. Principal Component Analysis

Output
1 Generated data: X.shape = (200, 2)
2 C:
3 [[ 5.10038723 -4.15289232]
4 [-4.15289232 4.986776 ]]
5 eVal:
6 [9.19686242 0.89030081]
7 eVec:
8 [[ 0.71192601 0.70225448]
9 [-0.70225448 0.71192601]]
10 np.mean(X, 0) = 4.986291809096116 2.1696690114181947

Observation 10.7. Covariance Matrix.


• Symmetry: The covariance matrix C is symmetric so that it is di-
agonalizable. (See §4.4.3, p.109.) That is,

C = U DU −1 , (10.5)

where D is a diagonal matrix of eigenvalues of C and U is the corre-


sponding eigenvectors of C such that U T U = I. (Such a square matrix
U is called an orthogonal matrix.)
• Principal directions: The principal directions are eigenvectors of
the data’s covariance matrix.
• Minimum volume enclosing ellipsoid (MVEE): The PCA can be
viewed as fitting a d-dimensional ellipsoid to the data, where each
axis of the ellipsoid represents one of principal directions.
– If some axis of the ellipsoid is small, then the variance along that
axis is also small.
10.1. Principal Component Analysis 181

10.1.2. Computation of principal components

• Consider a data matrix X ∈ Rn×d :


– each of the n rows represents a different data point,
– each of the d columns gives a particular kind of feature, and
– each column has zero empirical mean (e.g., after standardization).
• Our goal is to find an orthogonal weight matrix W ∈ Rd×d such that

Z = X W, (10.6)

where Z ∈ Rn×d is called the score matrix. Columns of Z represent


principal components of X.

First weight vector w1 : the first column of W :


In order to maximize variance of z1, the first weight vector w1 should satisfy

       w1 = arg max_{||w||=1} ||z1||² = arg max_{||w||=1} ||Xw||²
          = arg max_{||w||=1} w^T X^T X w = arg max_{w≠0} (w^T X^T X w)/(w^T w),     (10.7)
where the quantity to be maximized can be recognized as a Rayleigh quo-
tient.

Theorem 10.8. For a positive semidefinite matrix (such as X T X),


the maximum of the Rayleigh quotient is the same as the largest eigen-
value of the matrix, which occurs when w is the corresponding eigenvec-
tor, i.e.,
       w1 = arg max_{w≠0} (w^T X^T X w)/(w^T w) = v1/||v1||,   (X^T X) v1 = λ1 v1,   (10.8)
where λ1 is the largest eigenvalue of X T X ∈ Rd×d .

Example 10.9. With w1 found, the first principal component of a data


vector x^(i), the i-th row of X, is then given as a score z1^(i) = x^(i) · w1.
182 Chapter 10. Principal Component Analysis

Further weight vectors wk :


The k-th weight vector can be found by (1) subtracting the first (k − 1) prin-
cipal components from X:

       X̂k := X − Σ_{i=1}^{k−1} X wi wi^T,                               (10.9)

and then (2) finding the weight vector which extracts the maximum variance
from this new data matrix:

       wk = arg max_{||w||=1} ||X̂k w||²,                                (10.10)

which turns out to give the remaining eigenvectors of X T X.

Remark 10.10. The principal components transformation can also be


associated with the singular value decomposition (SVD) of X:
X = U ΣV T , (10.11)

where
U : n × d orthogonal (the left singular vectors of X.)
Σ : d × d diagonal (the singular values of X.)
V : d × d orthogonal (the right singular vectors of X.)

• The matrix Σ explicitly reads


Σ = diag(σ1 , σ2 , · · · , σd ), (10.12)

where σ1 ≥ σ2 ≥ · · · ≥ σd ≥ 0.
• In terms of this factorization, the matrix X T X reads

X T X = (U ΣV T )T U ΣV T = V ΣU T U ΣV T = V Σ2 V T . (10.13)

• Comparing with the eigenvector factorization of X T X, we conclude


  – the right singular vectors V ≅ the eigenvectors of X^T X  ⇒  V ≅ W
  – (the square of singular values of X) = (the eigenvalues of X^T X)
       ⇒  σj² = λj,   j = 1, 2, · · · , d.
10.1. Principal Component Analysis 183

Summary 10.11. (Computation of Principal Components)


1. Compute the singular value decomposition (SVD) of X:

X = U ΣV T . (10.14)

2. Set
W = V. (10.15)
Then the score matrix, the set of principal components, is

Z = XW = XV = U ΣV T V = U Σ
(10.16)
= [σ1 u1 |σ2 u2 | · · · |σd ud ].

* The SVD will be discussed in §10.2.
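
Summary 10.11 can be carried out directly with numpy.linalg.svd. The sketch below reuses generate_data from util_Covariance.py (Example 10.6); that import is the only assumption.

pca_svd.py (a sketch)
    import numpy as np
    from util_Covariance import generate_data          # from Example 10.6

    X = generate_data(200)
    Xc = X - X.mean(axis=0)                             # center each column

    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U*Sigma*V^T, as in (10.14)
    W = Vt.T                                            # weight matrix (10.15)
    Z = Xc @ W                                          # score matrix (10.16), equal to U*Sigma

    print('singular values:', S)
    print('eigenvalues of Xc^T Xc:', S**2)              # sigma_j^2 = lambda_j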

10.1.3. Dimensionality reduction: Data compression

• The transformation Z = XW maps a data vector x(i) ∈ Rd to a new


space of d variables which are now uncorrelated.
• However, not all the principal components need to be kept.
• Keeping only the first k principal components, produced by using only
the first k eigenvectors of X T X (k  d), gives the truncated score
matrix:
Zk := X Wk = U ΣV T Wk = U Σk , (10.17)
where Zk ∈ Rn×k , Wk ∈ Rd×k , and

Σk := diag(σ1 , · · · , σk , 0, · · · , 0). (10.18)

• It follows from (10.17) that the corresponding truncated data matrix


reads

Xk = Zk WkT = U Σk WkT = U Σk W T = U Σk V T . (10.19)

Questions. How can we choose k?  And is the difference ||X − Xk|| (that we truncated) small?

Claim 10.12. It follows from (10.11) and (10.19) that
   ‖X − X_k‖_2 = ‖U Σ V^T − U Σ_k V^T‖_2 = ‖U (Σ − Σ_k) V^T‖_2 = ‖Σ − Σ_k‖_2 = σ_{k+1},   (10.20)
where ‖·‖_2 is the induced matrix L²-norm.

Remark 10.13. Efficient algorithms exist to compute the SVD of X


without having to form the matrix X T X, so computing the SVD is now
the standard way to carry out the PCA. See [4, 10].

Image Compression
• Dyadic Decomposition: The data matrix X ∈ R^{m×n} is expressed as a sum of rank-1 matrices:
     X = U Σ V^T = Σ_{i=1}^{n} σ_i u_i v_i^T,   (10.21)

where
V = [v1 , · · · , vn ], U = [u1 , · · · , un ].

• Approximation: X can be approximated by
     X ≈ X_k := U Σ_k V^T = Σ_{i=1}^{k} σ_i u_i v_i^T,   (10.22)
  which is closest to X among matrices of rank ≤ k, and
     ‖X − X_k‖_2 = σ_{k+1}.

• It only takes n · k + m · k = (m + n) · k words to store [v1 , v2 , · · · , vk ] and


[σ1 u1 , σ2 u2 , · · · , σk uk ], from which we can reconstruct Xk .
• We use Xk as our compressed images, stored using (m + n) · k words.

A Matlab code to demonstrate the SVD compression of images:


peppers_compress.m
1 img = imread('Peppers.png'); [m,n,d]=size(img);
2 [U,S,V] = svd(reshape(im2double(img),m,[]));
3 %%---- select k <= p=min(m,n)
4 k = 20;
5 img_k = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
6 img_k = reshape(img_k,m,n,d);
7 figure, imshow(img_k)

The "Peppers" image has size [270, 270, 3]; reshaped for the SVD, it is treated as a matrix in R^{270×810}.


[Figure: Image compression of "Peppers" using k singular values — panels: Original (k = 270), k = 1, k = 10, k = 20, k = 50, k = 100]

[Figures: Peppers — singular values; Peppers — compression quality]

   PSNR (dB) = 13.7 (k = 1),  20.4 (k = 10),  23.7 (k = 20),  29.0 (k = 50),  32.6 (k = 100),  37.5 (k = 150),

where PSNR is the "Peak Signal-to-Noise Ratio."

Peppers Storage: It requires (m + n) · k words.


For example, when k = 50,

(m + n) · k = (270 + 810) · 50 = 54,000 , (10.23)

which is approximately a quarter of the full storage space

270 × 270 × 3 = 218,700 .



10.2. Singular Value Decomposition


Here we will deal with the SVD in detail.

Theorem 10.14. (SVD Theorem). Let A ∈ Rm×n with m ≥ n. Then we


can write
A = U ΣV T, (10.24)
where U ∈ Rm×n and satisfies U T U = I, V ∈ Rn×n and satisfies V T V = I,
and Σ = diag(σ1 , σ2 , · · · , σn ), where

σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

Remark 10.15. The matrices are illustrated pictorially as
   A = U Σ V^T,   (10.25)
where
U : m × n orthogonal (the left singular vectors of A.)
Σ : n × n diagonal (the singular values of A.)
V : n × n orthogonal (the right singular vectors of A.)

• For some r ≤ n, the singular values may satisfy

   σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > σ_{r+1} = ··· = σ_n = 0,   (10.26)
where σ_1, ···, σ_r are the nonzero singular values. In this case, rank(A) = r.


• If m < n, the SVD is defined by considering AT .

Proof. (of Theorem 10.14) Use induction on m and n: we assume that the SVD exists for (m − 1) × (n − 1) matrices, and prove it for m × n. We assume A ≠ 0; otherwise we can take Σ = 0 and let U and V be arbitrary orthogonal matrices.

• The basic step occurs when n = 1 (m ≥ n). We let A = U ΣV T with


U = A/||A||2 , Σ = ||A||2 , V = 1.
• For the induction step, choose v so that

||v||2 = 1 and ||A||2 = ||Av||2 > 0.

• Let u = Av / ‖Av‖_2, which is a unit vector. Choose Ũ, Ṽ such that

U = [u Ũ ] ∈ Rm×n and V = [v Ṽ ] ∈ Rn×n

are orthogonal.
• Now, we write
     U^T A V = [u^T; Ũ^T] · A · [v  Ṽ] = [u^T A v   u^T A Ṽ;   Ũ^T A v   Ũ^T A Ṽ].
  Since
     u^T A v = (Av)^T (Av) / ‖Av‖_2 = ‖Av‖_2² / ‖Av‖_2 = ‖Av‖_2 = ‖A‖_2 ≡ σ,
     Ũ^T A v = Ũ^T u ‖Av‖_2 = 0,
  we have
     U^T A V = [σ   0;   0   U_1 Σ_1 V_1^T] = [1  0;  0  U_1] [σ  0;  0  Σ_1] [1  0;  0  V_1]^T,
  or equivalently
     A = ( U [1  0;  0  U_1] ) [σ  0;  0  Σ_1] ( V [1  0;  0  V_1] )^T.   (10.27)

  Equation (10.27) is our desired decomposition.



10.2.1. Algebraic interpretation of the SVD

Let rank(A) = r. Let the SVD of A be A = U Σ V^T, with

U = [u1 u2 · · · un ],
Σ = diag(σ1 , σ2 , · · · , σn ),
V = [v1 v2 · · · vn ],

and σr be the smallest positive singular value. Since

A = U Σ V T ⇐⇒ AV = U ΣV T V = U Σ,

we have
   AV = A[v_1 v_2 ··· v_n] = [Av_1 Av_2 ··· Av_n]
      = [u_1 ··· u_r ··· u_n] · diag(σ_1, ···, σ_r, 0, ···, 0)   (10.28)
      = [σ_1 u_1 ··· σ_r u_r 0 ··· 0].

Therefore,
   A = U Σ V^T   ⟺   Av_j = σ_j u_j (j = 1, 2, ···, r)  and  Av_j = 0 (j = r+1, ···, n).   (10.29)
Similarly, starting from A^T = V Σ U^T,
   A^T = V Σ U^T   ⟺   A^T u_j = σ_j v_j (j = 1, 2, ···, r)  and  A^T u_j = 0 (j = r+1, ···, n).   (10.30)

Summary 10.16. It follows from (10.29) and (10.30) that

• (vj , σj2 ), j = 1, 2, · · · , r, are eigenvector-eigenvalue pairs of AT A.


AT A vj = AT (σj uj ) = σj2 vj , j = 1, 2, · · · , r. (10.31)

So, the singular values play the role of eigenvalues.


• Similarly, we have

AAT uj = A(σj vj ) = σj2 uj , j = 1, 2, · · · , r. (10.32)

• Equation (10.31) gives how to find the singular values {σj } and the
right singular vectors V , while (10.29) shows a way to compute the
left singular vectors U .
• (Dyadic decomposition) The matrix A ∈ R^{m×n} can be expressed as
     A = Σ_{j=1}^{n} σ_j u_j v_j^T.   (10.33)
  When rank(A) = r ≤ n,
     A = Σ_{j=1}^{r} σ_j u_j v_j^T.   (10.34)

This property has been utilized for various approximations and ap-
plications, e.g., by dropping singular vectors corresponding to small
singular values.

10.2.2. Computation of the SVD

For A ∈ Rm×n , the procedure is as follows.


1. Form AT A (AT A – covariance matrix of A).
2. Find the eigen-decomposition of A^T A by an orthogonalization process:
     A^T A = V Λ V^T,   Λ = diag(λ_1, ···, λ_n),
   where V = [v_1 ··· v_n] is orthogonal, i.e., V^T V = I.


3. Sort the eigenvalues according to their magnitude and let
     σ_j = √λ_j,   j = 1, 2, ···, n.

4. Form the U matrix as follows:
     u_j = (1/σ_j) A v_j,   j = 1, 2, ···, r.
   If necessary, pick up the remaining columns of U so it is orthogonal.
   (These additional columns must be in Null(AA^T).)
5. A = U Σ V^T = [u_1 ··· u_r ··· u_n] · diag(σ_1, ···, σ_r, 0, ···, 0) · [v_1^T; ···; v_n^T].
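
To make the five steps concrete, here is a small NumPy sketch (purely illustrative; in practice one calls a library SVD routine directly, cf. Remark 10.13):

svd_by_eig_sketch.py (illustrative)
import numpy as np

A = np.random.randn(6, 4)              # a toy matrix with m >= n (full rank almost surely)

lam, V = np.linalg.eigh(A.T @ A)       # steps 1-2: eigen-decomposition of A^T A
idx = np.argsort(lam)[::-1]            # step 3: sort eigenvalues in descending order
lam, V = lam[idx], V[:, idx]
sigma = np.sqrt(lam)

U = (A @ V) / sigma                    # step 4: u_j = (1/sigma_j) A v_j, column by column

print(np.allclose(A, U @ np.diag(sigma) @ V.T))   # step 5: A = U Sigma V^T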

Lemma 10.17. Let A ∈ Rn×n be symmetric. Then (a) all the eigenvalues
of A are real and (b) eigenvectors corresponding to distinct eigenvalues
are orthogonal.
Example 10.18. Find the SVD for A = [1 2; −2 1; 3 2].

Solution.
1. A^T A = [14 6; 6 9].
2. Solving det(A^T A − λI) = 0 gives the eigenvalues of A^T A:
      λ_1 = 18 and λ_2 = 5,
   of which the corresponding eigenvectors are
      ṽ_1 = [3; 2],  ṽ_2 = [−2; 3]   ⟹   V = [3/√13  −2/√13;  2/√13  3/√13].
3. σ_1 = √λ_1 = √18 = 3√2 and σ_2 = √λ_2 = √5. So
      Σ = [√18  0;  0  √5].
4. u_1 = (1/σ_1) A v_1 = (1/√18)(1/√13) A [3; 2] = (1/√234) [7; −4; 13] = [7/√234; −4/√234; 13/√234],
   u_2 = (1/σ_2) A v_2 = (1/√5)(1/√13) A [−2; 3] = (1/√65) [4; 7; 0] = [4/√65; 7/√65; 0].
5. A = U Σ V^T = [7/√234  4/√65;  −4/√234  7/√65;  13/√234  0] · [√18  0;  0  √5] · [3/√13  2/√13;  −2/√13  3/√13].
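
The hand computation can be checked numerically; a quick sketch using numpy.linalg.svd follows (the singular vectors returned by a library may differ in sign, which is allowed):

import numpy as np

A = np.array([[1., 2.], [-2., 1.], [3., 2.]])
U, s, VT = np.linalg.svd(A, full_matrices=False)
print(s)                                    # approx [4.2426, 2.2361] = [3*sqrt(2), sqrt(5)]
print(np.allclose(A, U @ np.diag(s) @ VT))  # True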

10.3. Application of the SVD for LS Problems


Recall: (Definition 5.2, p. 115): Let A ∈ R^{m×n}, m ≥ n, and b ∈ R^m. The least-squares problem is to find x̂ ∈ R^n which minimizes ‖Ax − b‖_2:
   x̂ = arg min_x ‖Ax − b‖_2,   or, equivalently,   x̂ = arg min_x ‖Ax − b‖_2²,   (10.35)
where x̂ is called a least-squares solution of Ax = b.

Note: When A^T A is invertible, the equation Ax = b has a unique LS solution for each b ∈ R^m (Theorem 5.5). It can be solved by the method of normal equations; the unique LS solution x̂ is given by
   x̂ = (A^T A)^{-1} A^T b.   (10.36)

Definition 10.19. (A^T A)^{-1} A^T is called the pseudoinverse of A. Let A = U Σ V^T be the SVD of A. Then
   (A^T A)^{-1} A^T = V Σ^{-1} U^T =: A^+.   (10.37)

Example 10.20. Find the pseudoinverse of A = [1 2; −2 1; 3 2].

Solution. From Example 10.18, p. 192, we have
   A = U Σ V^T = [7/√234  4/√65;  −4/√234  7/√65;  13/√234  0] · [√18  0;  0  √5] · [3/√13  2/√13;  −2/√13  3/√13].
Thus,
   A^+ = V Σ^{-1} U^T = [3/√13  −2/√13;  2/√13  3/√13] · [1/√18  0;  0  1/√5] · [7/√234  −4/√234  13/√234;  4/√65  7/√65  0]
       = [−1/30  −4/15  1/6;  11/45  13/45  1/9].
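
As a quick numerical check (np.linalg.pinv computes the pseudoinverse via the SVD):

import numpy as np

A = np.array([[1., 2.], [-2., 1.], [3., 2.]])
print(np.linalg.pinv(A))
# approx [[-0.0333, -0.2667, 0.1667],     i.e., [[-1/30, -4/15, 1/6 ],
#         [ 0.2444,  0.2889, 0.1111]]           [ 11/45, 13/45, 1/9 ]]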

Question. What if A^T A is not invertible? Even if it is invertible, what if the hypothesis space is either too big or too small?

Solving LS Problems by the SVD


Let A ∈ Rm×n , m > n, with rank(A) = k ≤ n.
• Suppose that the SVD of A is given, that is,

A = U ΣV T .

• Since U and V are ℓ_2-norm preserving, we have

||Ax − b|| = ||U ΣV T x − b|| = ||ΣV T x − U T b||. (10.38)

• Define z = V^T x and c = U^T b. Then
     ‖Ax − b‖ = ( Σ_{i=1}^{k} (σ_i z_i − c_i)² + Σ_{i=k+1}^{n} c_i² )^{1/2}.   (10.39)

• Thus the norm is minimized when z is chosen with
     z_i = c_i/σ_i  when i ≤ k,   and  z_i arbitrary otherwise.   (10.40)

• After determining z, one can find the solution as
     x̂ = V z.   (10.41)
  Then the least-squares error reads
     min_x ‖Ax − b‖ = ( Σ_{i=k+1}^{n} c_i² )^{1/2}.   (10.42)

Strategy 10.21. When z is obtained as in (10.40), it is better to choose


zero for the “arbitrary” part:

z = [c1 /σ1 , c2 /σ2 , · · · , ck /σk , 0, · · · , 0]T . (10.43)

In this case, z can be written as
   z = Σ_k^+ c = Σ_k^+ U^T b,   (10.44)
where
   Σ_k^+ = diag(1/σ_1, 1/σ_2, ···, 1/σ_k, 0, ···, 0).   (10.45)
Thus the corresponding LS solution reads
   x̂ = V z = V Σ_k^+ U^T b.   (10.46)
Note that x̂ involves no components of the null space of A;
x̂ is unique in this sense.

Remark 10.22.
• When rank(A) = k = n: It is easy to see that
     V Σ_k^+ U^T = V Σ^{-1} U^T,   (10.47)
  which is the pseudoinverse of A.
• When rank(A) = k < n: A^T A is not invertible. However,
     A_k^+ := V Σ_k^+ U^T   (10.48)
  plays the role of the pseudoinverse of A. Thus we will call it the k-th pseudoinverse of A.

Note: For some LS applications, although rank(A) = n, the k-th pseudoinverse A_k^+, with a small k < n, may give more reliable solutions.
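
The strategy (10.43)–(10.46) translates into a few lines of NumPy. The sketch below is illustrative (the function name and the random test system are made up); a full regression implementation appears in Example 10.23 and Exercise 10.2.

import numpy as np

def ls_solve_svd(A, b, k):
    """Solve min ||Ax - b|| keeping only the k largest singular values."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b
    z = np.zeros_like(s)
    z[:k] = c[:k] / s[:k]      # (10.40), with the 'arbitrary' part set to zero
    return VT.T @ z            # x = V z, as in (10.46)

A = np.random.randn(50, 6); b = np.random.randn(50)
print(ls_solve_svd(A, b, k=3))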

Example 10.23. Generate a synthetic dataset in 2D to find least-squares


solutions, using
(a) the method of normal equations and
(b) the SVD with various numbers of principal components.

Solution. Here we implement a Matlab code. You will redo it in Python;


see Exercise 10.2.
util.m
1 classdef util,
2 methods(Static)
3 %---------------------------------------
4 function data = get_data(npt,bx,sigma)
5 data = zeros(npt,2);
6 data(:,1) = rand(npt,1)*bx;
7 data(:,2) = max(bx/3,2*data(:,1)-bx);
8

9 r = randn(npt,1)*sigma; theta = randn(npt,1)*pi;


10 noise = r.*[cos(theta),sin(theta)];
11 data = data+noise;
12 end % indentation is not required, but an extra 'end' is.
13 %---------------------------------------
14 function mysave(gcf,filename)
15 exportgraphics(gcf,filename,'Resolution',100)
16 fprintf('saved: %s\n',filename)
17 end
18 %---------------------------------------
19 function A = get_A(data,n)
20 npt = size(data,1);
21 A = ones(npt,n);
22 for j=2:n
23 A(:,j) = A(:,j-1).*data;
24 end
25 end
26 %---------------------------------------
27 function Y = predict_Y(X,coeff,S_mean,S_std)
28 n = numel(coeff);
29 if nargin==2, S_mean=zeros(1,n); S_std=ones(1,n); end
30 A = util.get_A(X(:),n);
31 Y = ((A-S_mean)./S_std)*coeff;
32 end
33 end,end

Note: In Matlab, you can save multiple functions in a file, using


classdef and methods(Static).
• The functions will be called as class_name.function_name().
• Lines 12, 17, 25, 32: The extra ‘end’ is required for Matlab to distin-
guish functions without ambiguity.
– You may put the extra ‘end’ also for stand-alone functions.
• Line 29: A Matlab function can be implemented so that you may call
the function without some arguments using default arguments.
• Line 30: See how to call a class function from another function.
pca_regression.m
1 function [sol_PCA,S_mean,S_std] = pca_regression(A,b,npc)
2 % input: npc = the number of principal components
3

4 %% Standardization
5 %%---------------------------------------------
6 S_mean = mean(A); S_std = std(A);
7 if S_std(1)==0, S_std(1)=1/S_mean(1); S_mean(1)=0; end
8 AS = (A-S_mean)./S_std;
9

10 %% SVD regression, using the pseudoinverse


11 %%---------------------------------------------
12 [U,S,V] = svd(AS,'econ');
13 S1 = diag(S); % a column vector
14 C1 = zeros(size(S1));
15 C1(1:npc) = 1./S1(1:npc);
16 C = diag(C1); % a matrix
17

18 sol_PCA = V*C*U'*b;
19 end

Note: The standardization variables are included in output to be used


for the prediction.
• Line 7: Note that A(:,1)=1 so that its std must be 0.
• Lines 13 and 16: The function diag() toggles between a column vector
and a diagonal matrix.
• Line 19: The function puts an extra ‘end’ at the end.

Regression_Analysis.m
1 clear all; close all;
2

3 %%-----------------------------------------------------
4 %% Setting
5 %%-----------------------------------------------------
6 regen_data = 0; %==1, regenerate the synthetic data
7 poly_n = 9;
8 npt=300; bx=5.0; sigma=0.50; %for synthetic data
9 datafile = 'synthetic-data.txt';
10

11 %%-----------------------------------------------------
12 %% Data: Generation and Read
13 %%-----------------------------------------------------
14 if regen_data || ~isfile(datafile)
15 DATA = util.get_data(npt,bx,sigma);
16 writematrix(DATA, datafile);
17 fprintf('%s: re-generated.\n',datafile)
18 end
19 DATA = readmatrix(datafile,"Delimiter",",");
20

21 %%-----------------------------------------------------
22 %% The system: A x = b
23 %%-----------------------------------------------------
24 A = util.get_A(DATA(:,1),poly_n+1);
25 b = DATA(:,2);
26

27 %%-----------------------------------------------------
28 %% Method of Normal Equations
29 %%-----------------------------------------------------
30 sol_NE = (A'*A)\(A'*b);
31 figure,
32 plot(DATA(:,1),DATA(:,2),'k.','MarkerSize',8);
33 axis tight; hold on
34 yticks(1:5); ax = gca; ax.FontSize=13; %ax.GridAlpha=0.25
35 title(sprintf('Synthetic Data: npt = %d',npt),'fontsize',13)
36 util.mysave(gcf,'data-synthetic.png');
37 x=linspace(min(DATA(:,1)),max(DATA(:,1)),51);
38 plot(x,util.predict_Y(x,sol_NE),'r-','linewidth',2);
39 Pn = ['P_',int2str(poly_n)];
40 legend('data',Pn, 'location','best','fontsize',13)
41 TITLE0=sprintf('Method of NE: npt = %d',npt);
42 title(TITLE0,'fontsize',13)
43 hold off

44 util.mysave(gcf,'data-synthetic-sol-NE.png');
45

46 %%-----------------------------------------------------
47 %% PCA Regression
48 %%-----------------------------------------------------
49 for npc=1:size(A,2);
50 [sol_PCA,S_mean,S_std] = pca_regression(A,b,npc);
51 figure,
52 plot(DATA(:,1),DATA(:,2),'k.','MarkerSize',8);
53 axis tight; hold on
54 yticks(1:5); ax = gca; ax.FontSize=13; %ax.GridAlpha=0.25
55 x=linspace(min(DATA(:,1)),max(DATA(:,1)),51);
56 plot(x,util.predict_Y(x,sol_PCA,S_mean,S_std),'r-','linewidth',2);
57 Pn = ['P_',int2str(poly_n)];
58 legend('data',Pn, 'location','best','fontsize',13)
59 TITLE0=sprintf('Method of PC: npc = %d',npc);
60 title(TITLE0,'fontsize',13)
61 hold off
62 savefile = sprintf('data-sol-PCA-npc-%02d.png',npc);
63 util.mysave(gcf,savefile);
64 end

Note: Regression_Analysis is the main function. The code is simple; the


complication is due to plotting.
• Lines 6, 14-19: Data is read from a datafile.
– Setting regen_data = 1 will regenerate the datafile.

Figure 10.3: The synthetic data and the LS solution P9 (x), overfitted.

Figure 10.4: PCA regression of the data, with various numbers of principal components.
The best regression is achieved when npc = 3.

Exercises for Chapter 10

10.1. Download wine.data from the UCI database:


https://archive.ics.uci.edu/ml/machine-learning-databases/wine/
The data is extensively used in the Machine Learning community. The first column
of the data is the label and the others are features of three different kinds of wines.

(a) Add lines to the code given, to verify (10.20), p.184. For example, set k = 5.
Wine_data.py
1 import numpy as np
2 from numpy import diag,dot
3 from scipy.linalg import svd,norm
4 import matplotlib.pyplot as plt
5

6 data = np.loadtxt('wine.data', delimiter=',')


7 X = data[:,1:]; y = data[:,0]
8

9 #-----------------------------------------------
10 # Standardization
11 #-----------------------------------------------
12 X_mean, X_std = np.mean(X,axis=0), np.std(X,axis=0)
13 XS = (X - X_mean)/X_std
14

15 #-----------------------------------------------
16 # SVD
17 #-----------------------------------------------
18 U, s, VT = svd(XS)
19 if U.shape[0]==U.shape[1]:
20 U = U[:,:len(s)] # cut the unnecessary columns
21 Sigma = diag(s) # transform to a matrix
22 print('U:',U.shape, 'Sigma:',Sigma.shape, 'VT:',VT.shape)

Note:

• Line 12: np.mean and np.std are applied, with the option axis=0, to get
the quantities column-by-column vertically. Thus X_mean and X_std are row
vectors.
• Line 18: In Python, svd produces [U, s, VT], where VT = V T . If you would
like to get V , then V = VT.T.

10.2. Implement the code in Example 10.23, in Python.

(a) Report your complete code.


(b) Attached figures as in Figures 10.3 and 10.4.

Clue: The major reason that a class is used in the Matlab code in Example 10.23 is
to combine multiple functions to be saved in a file. In Python, you do not have to use
a class to save multiple functions in a file. You may start with the following.
util.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3

4 def get_data(npt,bx,sigma):
5 data = np.zeros([npt,2]);
6 data[:,0] = np.random.uniform(0,1,npt)*bx;
7 data[:,1] = np.maximum(bx/3,2*data[:,0]-bx);
8 r = np.random.normal(0,1,npt)*sigma;
9 theta = np.random.normal(0,1,npt)*np.pi;
10 noise = np.column_stack((r*np.cos(theta),r*np.sin(theta)));
11 data += noise;
12 return data
13

14 def mysave(filename):
15 plt.savefig(filename,bbox_inches='tight')
16 print('saved:',filename)
17

18 # Add other functions

Regression_Analysis.py
1 import numpy as np
2 import numpy.linalg as la
3 import matplotlib.pyplot as plt
4 from os.path import exists
5 import util
6

7 ##-----------------------------------------------------
8 ## Setting
9 ##-----------------------------------------------------
10 regen_data = 1; #==1, regenerate the synthetic data
11 poly_n = 9;
12 npt=300; bx=5.0; sigma=0.50; #for synthetic data
13 datafile = 'synthetic-data.txt';
14 plt.style.use('ggplot')
15

16 ##-----------------------------------------------------
17 ## Data: Generation and Read
18 ##-----------------------------------------------------
19 if regen_data or not exists(datafile):
20 DATA = util.get_data(npt,bx,sigma);
21 np.savetxt(datafile,DATA,delimiter=',');

22 print('%s: re-generated.' %(datafile))


23

24 DATA = np.loadtxt(datafile, delimiter=',')


25

26 plt.figure() # initiate a new plot


27 plt.scatter(DATA[:,0],DATA[:,1],s=8,c='k')
28 plt.title('Synthetic Data: npt = '+ str(npt))
29 util.mysave('data-synthetic-py.png')
30 #plt.show()
31

32 ##-----------------------------------------------------
33 ## The system: A x = b
34 ##-----------------------------------------------------

Note: The semicolons (;) are neither necessary nor harmful in Python; they are left over from copying the Matlab lines. The ggplot style emulates "ggplot", a popular plotting package for R. When Regression_Analysis.py is executed, you will have a saved image:

Figure 10.5: data-synthetic-py.png


Chapter 11
Machine Learning

Contents of Chapter 11
11.1. What is Machine Learning? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.2. Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.3. Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.4. Multi-Column Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Exercises for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236


11.1. What is Machine Learning?


Definition 11.1. Machine learning (ML)

• ML algorithms are algorithms that can learn from data (input)


and produce functions/models (output).
• Machine learning is the science of getting machines to act, without
functions/models being explicitly programmed to do so.

Example 11.2. There are three different types of ML:

• Supervised learning: e.g., classification, regression


– Labeled data
– Direct feedback
– Predict outcome/future

• Unsupervised learning: e.g., clustering


– No labels
– No feedback
– Find hidden structure in data

• Reinforcement learning: e.g., chess engine


– Decision process
– Reward system
– Learn series of actions

Note: The most popular type is supervised learning.



11.1.1. Supervised learning

Assumption. Given a data set {(xi , yi )}, where yi are labels,


there exists a relation f : X → Y .
Supervised learning:
   Given: a training dataset {(x_i, y_i) | i = 1, ···, N};
   Find:  f̂ : X → Y, a good approximation to f.   (11.1)

Figure 11.1: Supervised learning and prediction.

Figure 11.2: Classification and regression.



Why is ML not Always Simple?


Major Issues in ML
1. Overfitting: Fitting training data too tightly
• Difficulties: Accuracy drops significantly for test data
• Remedies:
– More training data (often, impossible)
– Early stopping; feature selection
– Regularization; ensembling (multiple classifiers)

2. Curse of Dimensionality: The feature space becomes increas-


ingly sparse for an increasing number of dimensions (of a fixed-
size training dataset)
• Difficulties: Larger error, more computation time;
Data points appear equidistant from all the others
• Remedies
– More training data (often, impossible)
– Dimensionality reduction (e.g., Feature selection, PCA)

3. Multiple Local Minima Problem


Training often involves minimizing an objective function.

• Difficulties: Larger error, unrepeatable


• Remedies
– Gaussian sailing; regularization
– Careful access to the data (e.g., mini-batch)

4. Interpretability:
Although ML has come very far, researchers still don’t know exactly
how some algorithms (deep nets) work.
• If we don’t know how training nets actually work, how do we make
any real progress?
5. One-Shot Learning:
We still haven’t been able to achieve one-shot learning. Traditional
gradient-based networks need a huge amount of data, and are
often in the form of extensive iterative training.
• Instead, we should find a way to enable neural networks to learn,
using just a few examples.

11.1.2. Unsupervised learning


Note:
• In supervised learning, we know the right answer beforehand
when we train our model, and in reinforcement learning, we de-
fine a measure of reward for particular actions by the agent.
• In unsupervised learning, however, we are dealing with unla-
beled data or data of unknown structure. Using unsupervised learn-
ing techniques, we are able to explore the structure of our data
to extract meaningful information, without the guidance of a known
outcome variable or reward function.
• Clustering is an exploratory data analysis technique that allows
us to organize a pile of information into meaningful subgroups
(clusters) without having any prior knowledge of their group mem-
berships.

Figure 11.3: Clustering.



11.2. Binary Classifiers


A binary classifier is a function which can decide whether or not an
input vector belongs to some specific class (e.g., spam/ham).
• Binary classification often refers to those classification tasks that
have two class labels. (two-class classification)
• It is a type of linear classifier, i.e. a classification algorithm that
makes its predictions based on a linear predictor function.
• Linear classifiers are artificial neurons.
Examples: Perceptron [8], Adaline, Logistic Regression, Support Vector
Machine [2]

Remark 11.3. Neurons are interconnected nerve cells, involved in the


processing and transmitting of chemical and electrical signals. Such a
nerve cell can be described as a simple logic gate with binary outputs;
• multiple signals arrive at the dendrites,
• they are integrated into the cell body,
• and if the accumulated signal exceeds a certain threshold, an output
signal is generated that will be passed on by the axon.

Figure 11.4: A schematic description of a neuron.



Definition 11.4. Let {(x^{(i)}, y^{(i)})} be labeled data, with x^{(i)} ∈ R^d and
y^{(i)} ∈ {0, 1}. A binary classifier finds a hyperplane in R^d that separates
the data points X = {x^{(i)}} into two classes; see Figure 11.2, p. 207.

Observation 11.5. Let’s consider the following to interpret binary clas-


sifiers in a unified manner.
• The labels (0 and 1) are chosen for simplicity.
• A hyperplane can be formulated by a normal vector w ∈ Rd and a
shift (bias) b:
z = wT x + b. (11.2)
In ML, z is called the net input.
– The net input can go very high in magnitude for some x.
– It is a weighted sum (linear combination) of input features x.
• Activation function: In order (a) to keep the net input re-
stricted to a certain limit as per our requirement and, more im-
portantly, (b) to add nonlinearity to the network; we apply an
activation function φ(z).

• To learn w and b, you may formulate a cost function to minimize,


as the Sum of Squared Errors (SSE):
   J(w) = (1/2) Σ_i ( y^{(i)} − φ(w^T x^{(i)}) )².   (11.3)

However, in order to get them more effectively, we must formulate


the cost function more meaningfully.

Thus, an important task in ML is on the choice of effective


• activation functions and
• cost functions.

Activation functions:
   Perceptron:            φ(z) = 1 if z ≥ θ, and 0 otherwise;
   Adaline:               φ(z) = z;                                  (11.4)
   Logistic Regression:   φ(z) = σ(z) := 1 / (1 + e^{−z}),
where "Adaline" stands for ADAptive LInear NEuron. The activation function σ(z) is called the standard logistic sigmoid function or simply the sigmoid function.
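
For concreteness, the three activation functions in (11.4) can be written as a small NumPy sketch (the threshold θ and the test values are illustrative):

import numpy as np

def perceptron_step(z, theta=0.0):      # Perceptron: unit step at threshold theta
    return np.where(z >= theta, 1, 0)

def identity(z):                        # Adaline: phi(z) = z
    return z

def sigmoid(z):                         # Logistic Regression: phi(z) = sigma(z)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(perceptron_step(z), identity(z), sigmoid(z))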

Remark 11.6. (The standard logistic sigmoid function)
   σ(x) = 1 / (1 + e^{−x}) = e^x / (1 + e^x)   (11.5)
• The standard logistic function is the solution of the simple first-order non-linear ordinary differential equation
     dy/dx = y(1 − y),   y(0) = 1/2.   (11.6)
• It can be verified easily as
     σ′(x) = [e^x (1 + e^x) − e^x · e^x] / (1 + e^x)² = e^x / (1 + e^x)² = σ(x)(1 − σ(x)).   (11.7)
• σ′ is even: σ′(−x) = σ′(x).
• Rotational symmetry about (0, 1/2):
     σ(x) + σ(−x) = 1/(1 + e^{−x}) + 1/(1 + e^x) = (2 + e^x + e^{−x})/(2 + e^x + e^{−x}) ≡ 1.   (11.8)
• ∫ σ(x) dx = ∫ e^x/(1 + e^x) dx = ln(1 + e^x), which is known as the softplus function in artificial neural networks. It is a smooth approximation of the rectifier (an activation function) defined as
     f(x) = x_+ = max(x, 0).   (11.9)

Figure 11.5: Popular activation functions: (left) The standard logistic sigmoid function
and (right) the rectifier and softplus function.

11.2.1. Adaline
Algorithm 11.7. Adaline Learning:
From data {(x(i) , y (i) )}, learn the weights w and bias b, with
• Activation function: φ(z) = z (i.e., identity activation)
• Cost function: the SSE
     J(w, b) = (1/2) Σ_i ( y^{(i)} − φ(z^{(i)}) )².   (11.10)

where z (i) = wT x(i) + b and φ = I, the identity.

The dominant algorithm for the minimization of the cost function is the Gradient Descent Method.

Algorithm 11.8. The Gradient Descent Method uses −∇J for the
search direction (update direction):

w = w + ∆w = w − η∇w J (w, b),


(11.11)
b = b + ∆b = b − η∇b J (w, b),

where η > 0 is the step length (learning rate).



Computation of ∇J for Adaline:

The partial derivatives of the cost function J with respect to w_j and b read
   ∂J(w,b)/∂w_j = − Σ_i ( y^{(i)} − φ(z^{(i)}) ) x_j^{(i)},
   ∂J(w,b)/∂b   = − Σ_i ( y^{(i)} − φ(z^{(i)}) ).   (11.12)

Thus, with φ = I,
   Δw = −η ∇_w J(w,b) = η Σ_i ( y^{(i)} − φ(z^{(i)}) ) x^{(i)},
   Δb = −η ∇_b J(w,b) = η Σ_i ( y^{(i)} − φ(z^{(i)}) ).   (11.13)
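
A minimal NumPy sketch of Adaline learning with the updates (11.11) and (11.13) follows; the learning rate, iteration count, and toy data are illustrative choices only:

import numpy as np

def adaline_fit(X, y, eta=0.001, n_iter=200):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(n_iter):
        z = X @ w + b                  # net input; phi(z) = z for Adaline
        errors = y - z
        w += eta * (X.T @ errors)      # Delta w in (11.13)
        b += eta * errors.sum()        # Delta b in (11.13)
    return w, b

X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy 0/1 labels
w, b = adaline_fit(X, y)
print(w, b)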

Hyperparameters
Definition 11.9. In ML, a hyperparameter is a parameter whose
value is set before the learning process begins. Thus it is an algorithmic
parameter. Examples are
• The learning rate (η)
• The number of maximum epochs/iterations (n_iter)

Figure 11.6: Well-chosen learning rate vs. a large learning rate

Note: There are effective searching schemes to set the learning rate η
automatically.

11.2.2. Logistic Regression

Algorithm 11.10. Logistic Regression Learning:


From data {(x(i) , y (i) )}, learn the weights w and bias b, with
• Activation function: φ(z) = σ(z), the logistic sigmoid function
• Cost function: The likelihood is maximized.
Based on the log-likelihood, we define the logistic cost function to
be minimized:
     J(w, b) = Σ_i [ −y^{(i)} ln( φ(z^{(i)}) ) − (1 − y^{(i)}) ln( 1 − φ(z^{(i)}) ) ],   (11.14)

where z (i) = wT x(i) + b.

Computation of ∇J for Logistic Regression :


Let's start by calculating the partial derivative of the logistic cost function with respect to the j-th weight, w_j:
   ∂J(w,b)/∂w_j = Σ_i ( −y^{(i)} · 1/φ(z^{(i)}) + (1 − y^{(i)}) · 1/(1 − φ(z^{(i)})) ) · ∂φ(z^{(i)})/∂w_j,   (11.15)
where, using z^{(i)} = w^T x^{(i)} and (11.7),
   ∂φ(z^{(i)})/∂w_j = φ′(z^{(i)}) ∂z^{(i)}/∂w_j = φ(z^{(i)}) ( 1 − φ(z^{(i)}) ) x_j^{(i)}.
Thus, it follows from the above and (11.15) that
   ∂J(w,b)/∂w_j = Σ_i [ −y^{(i)} ( 1 − φ(z^{(i)}) ) + (1 − y^{(i)}) φ(z^{(i)}) ] x_j^{(i)}
                = − Σ_i [ y^{(i)} − φ(z^{(i)}) ] x_j^{(i)},
and therefore
   ∇_w J(w) = − Σ_i [ y^{(i)} − φ(z^{(i)}) ] x^{(i)}.   (11.16)
Similarly, one can get
   ∇_b J(w) = − Σ_i [ y^{(i)} − φ(z^{(i)}) ].   (11.17)

Algorithm 11.11. Gradient descent learning for Logistic Regression is


formulated as
w := w + ∆w, b := b + ∆b, (11.18)
where η > 0 is the step length (learning rate) and
   Δw = −η ∇_w J(w,b) = η Σ_i [ y^{(i)} − φ(z^{(i)}) ] x^{(i)},
   Δb = −η ∇_b J(w,b) = η Σ_i [ y^{(i)} − φ(z^{(i)}) ].   (11.19)

Note: The above gradient descent rule for Logistic Regression has the same form as that of Adaline; see (11.13) on p. 215. The only difference is the activation function φ.
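
To see this concretely, the Adaline sketch above becomes Logistic Regression by replacing the activation line with the sigmoid; everything else in the update (11.19) is unchanged (again, the names and hyperparameters are illustrative):

import numpy as np

def logistic_fit(X, y, eta=0.01, n_iter=200):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(n_iter):
        phi = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # phi(z) = sigma(z)
        errors = y - phi
        w += eta * (X.T @ errors)
        b += eta * errors.sum()
    return w, b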

11.2.3. Multi-class classification

Figure 11.7: Classification for three classes.

One-versus-all (one-versus-rest) classification


Learning: learn 3 classifiers

• − vs {◦, +} ⇒ weights w−
• + vs {◦, −} ⇒ weights w+
• ◦ vs {+, −} ⇒ weights w◦

Prediction: for a new data sample x,
   ŷ = arg max_{i ∈ {−, +, ◦}} φ(w_i^T x).

Figure 11.8: Three weights: w−, w+, and w◦.

OvA (OvR) is readily applicable for classification of general n classes, n ≥ 2.
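
A small sketch of the OvA prediction rule is given below; the weight matrix W is made up for illustration (each column holds the bias and weights of one trained binary classifier):

import numpy as np

W = np.array([[ 0.2,  1.0, -0.7],       # biases   (row 0)
              [-0.5,  0.3,  0.9],       # weights for x1
              [ 0.1, -0.8,  0.4]])      # weights for x2;  shape (1+d, n_classes)

def ova_predict(X, W):
    """Pick the class whose classifier gives the largest net input."""
    A = np.column_stack((np.ones(len(X)), X))   # prepend 1 for the bias
    return np.argmax(A @ W, axis=1)

X_new = np.array([[0.5, 1.2], [2.0, -0.3]])
print(ova_predict(X_new, W))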



11.3. Neural Networks


Recall: The Perceptron (or, Adaline, Logistic Regression) is the simplest
artificial neuron that makes decisions by weighting up evidence.

Figure 11.9: A simplest artificial neuron.

Complex Neural Networks


• Obviously, a simple artificial neuron is not a complete model of human
decision-making!
• However, they can be used as building blocks for more complex neu-
ral networks.

Figure 11.10: A complex neural network.



11.3.1. A simple network to classify hand-written digits


• The problem of recognizing hand-written digits has two components:
segmentation and classification.

=⇒
Figure 11.11: Segmentation.

• We’ll focus on algorithmic components for the classification of individ-


ual digits.

MNIST data set :


A modified subset of two data sets collected by NIST (US National Insti-
tute of Standards and Technology):
• Its first part contains 60,000 images (for training)
• The second part is 10,000 images (for test), each of which is in 28 × 28
grayscale pixels

A Simple Neural Network

Figure 11.12: A sigmoid network having a single hidden layer.



What the Neural Network Will Do


• Let’s concentrate on the first output neuron, the one that is trying
to decide whether or not the input digit is a 0.
• It does this by weighing up evidence from the hidden layer of neurons.

• What are those hidden neurons doing?


• Let’s suppose for the sake of argument that the first neuron
in the hidden layer may detect whether or not an image like the
following is present

It can do this by heavily weighting input pixels which overlap with the
image, and only lightly weighting the other inputs.
• Similarly, let’s suppose that the second, third, and fourth neurons
in the hidden layer detect whether or not the following images are
present

• As you may have guessed, these four images together make up the 0
image that we saw in the line of digits shown in Figure 11.11:

• So if all four of these hidden neurons are firing, then we can conclude
that the digit is a 0.

Learning with Gradient Descent

• Data set {(x(i) , y(i) )}, i = 1, 2, · · · , N


(e.g., if an image x(k) depicts a 2, then y(k) = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)T .)
• Cost function
     C(W, B) = (1/(2N)) Σ_i ‖y^{(i)} − a(x^{(i)})‖²,   (11.20)

where W denotes the collection of all weights in the network, B all the
biases, and a(x(i) ) is the vector of outputs from the network when x(i)
is input.
• Gradient descent method
     [W; B] ← [W; B] + [ΔW; ΔB],   (11.21)
  where
     [ΔW; ΔB] = −η [∇_W C; ∇_B C].

Note: To compute the gradient ∇C, we need to compute the gradients ∇C_{x^{(i)}} separately for each training input, x^{(i)}, and then average them:
   ∇C = (1/N) Σ_i ∇C_{x^{(i)}}.   (11.22)

Unfortunately, when the number of training inputs is very large, it


can take a long time, and learning thus occurs slowly. An idea called
stochastic gradient descent can be used to speed up learning.

Stochastic Gradient Descent


The idea is to estimate the gradient ∇C by computing ∇Cx(i) for a small
sample of randomly chosen training inputs. By averaging over this
small sample, it turns out that we can quickly get a good estimate of
the true gradient ∇C; this helps speed up gradient descent, and thus
learning.

• Pick out a small number of randomly chosen training inputs (m ≪ N):
     x̃^{(1)}, x̃^{(2)}, ···, x̃^{(m)},
  which we refer to as a mini-batch.


• Average the ∇C_{x̃^{(k)}} to approximate the gradient ∇C. That is,
     (1/m) Σ_{k=1}^{m} ∇C_{x̃^{(k)}} ≈ ∇C := (1/N) Σ_i ∇C_{x^{(i)}}.   (11.23)

• For classification of hand-written digits for the MNIST data set, you
may choose: batch_size = 10.

Note: In practice, you can implement the stochastic gradient descent as


follows. For an epoch,
• Shuffle the data set
• For each m samples (selected from the beginning), update (W , B)
using the approximate gradient (11.23).

11.3.2. Implementing a network to classify digits [7]


network.py
1 """
2 network.py (by Michael Nielsen)
3 ~~~~~~~~~~
4 A module to implement the stochastic gradient descent learning
5 algorithm for a feedforward neural network. Gradients are calculated
6 using backpropagation. """
7 #### Libraries
8 # Standard library
9 import random
10 # Third-party libraries
11 import numpy as np
12

13 class Network(object):
14 def __init__(self, sizes):
15 """The list ``sizes`` contains the number of neurons in the
16 respective layers of the network. For example, if the list
17 was [2, 3, 1] then it would be a three-layer network, with the
18 first layer containing 2 neurons, the second layer 3 neurons,
19 and the third layer 1 neuron. """
20

21 self.num_layers = len(sizes)
22 self.sizes = sizes
23 self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
24 self.weights = [np.random.randn(y, x)
25 for x, y in zip(sizes[:-1], sizes[1:])]
26

27 def feedforward(self, a):


28 """Return the output of the network if ``a`` is input."""
29 for b, w in zip(self.biases, self.weights):
30 a = sigmoid(np.dot(w, a)+b)
31 return a
32

33 def SGD(self, training_data, epochs, mini_batch_size, eta,


34 test_data=None):
35 """Train the neural network using mini-batch stochastic
36 gradient descent. The ``training_data`` is a list of tuples
37 ``(x, y)`` representing the training inputs and the desired
38 outputs. """
39

40 if test_data: n_test = len(test_data)


41 n = len(training_data)
42 for j in xrange(epochs):
43 random.shuffle(training_data)
44 mini_batches = [
45 training_data[k:k+mini_batch_size]
46 for k in range(0, n, mini_batch_size)]

47 for mini_batch in mini_batches:


48 self.update_mini_batch(mini_batch, eta)
49 if test_data:
50 print("Epoch {0}: {1} / {2}".format(
51 j, self.evaluate(test_data), n_test))
52 else:
53 print("Epoch {0} complete".format(j))
54

55 def update_mini_batch(self, mini_batch, eta):


56 """Update the network's weights and biases by applying
57 gradient descent using backpropagation to a single mini batch.
58 The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
59 is the learning rate."""
60 nabla_b = [np.zeros(b.shape) for b in self.biases]
61 nabla_w = [np.zeros(w.shape) for w in self.weights]
62 for x, y in mini_batch:
63 delta_nabla_b, delta_nabla_w = self.backprop(x, y)
64 nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
65 nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
66 self.weights = [w-(eta/len(mini_batch))*nw
67 for w, nw in zip(self.weights, nabla_w)]
68 self.biases = [b-(eta/len(mini_batch))*nb
69 for b, nb in zip(self.biases, nabla_b)]
70

71 def backprop(self, x, y):


72 """Return a tuple ``(nabla_b, nabla_w)`` representing the
73 gradient for the cost function C_x. ``nabla_b`` and
74 ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
75 to ``self.biases`` and ``self.weights``."""
76 nabla_b = [np.zeros(b.shape) for b in self.biases]
77 nabla_w = [np.zeros(w.shape) for w in self.weights]
78 # feedforward
79 activation = x
80 activations = [x] #list to store all the activations, layer by layer
81 zs = [] # list to store all the z vectors, layer by layer
82 for b, w in zip(self.biases, self.weights):
83 z = np.dot(w, activation)+b
84 zs.append(z)
85 activation = sigmoid(z)
86 activations.append(activation)
87 # backward pass
88 delta = self.cost_derivative(activations[-1], y) * \
89 sigmoid_prime(zs[-1])
90 nabla_b[-1] = delta
91 nabla_w[-1] = np.dot(delta, activations[-2].transpose())
92

93 for l in range(2, self.num_layers):



94 z = zs[-l]
95 sp = sigmoid_prime(z)
96 delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
97 nabla_b[-l] = delta
98 nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
99 return (nabla_b, nabla_w)
100

101 def evaluate(self, test_data):


102 test_results = [(np.argmax(self.feedforward(x)), y)
103 for (x, y) in test_data]
104 return sum(int(x == y) for (x, y) in test_results)
105

106 def cost_derivative(self, output_activations, y):


107 """Return the vector of partial derivatives \partial C_x /
108 \partial a for the output activations."""
109 return (output_activations-y)
110

111 #### Miscellaneous functions


112 def sigmoid(z):
113 return 1.0/(1.0+np.exp(-z))
114

115 def sigmoid_prime(z):


116 return sigmoid(z)*(1-sigmoid(z))

The code is executed using


Run_network.py
1 import mnist_loader
2 training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
3

4 import network
5 n_neurons = 20
6 net = network.Network([784 , n_neurons, 10])
7

8 n_epochs, batch_size, eta = 30, 10, 3.0


9 net.SGD(training_data , n_epochs, batch_size, eta, test_data = test_data)

len(training_data)=50000, len(validation_data)=10000, len(test_data)=10000



Validation Accuracy
Validation Accuracy
1 Epoch 0: 9006 / 10000
2 Epoch 1: 9128 / 10000
3 Epoch 2: 9202 / 10000
4 Epoch 3: 9188 / 10000
5 Epoch 4: 9249 / 10000
6 ...
7 Epoch 25: 9356 / 10000
8 Epoch 26: 9388 / 10000
9 Epoch 27: 9407 / 10000
10 Epoch 28: 9410 / 10000
11 Epoch 29: 9428 / 10000

Accuracy Comparisons
• scikit-learn’s SVM classifier using the default settings: 9435/10000
• A well-tuned SVM: ≈98.5%
• Well-designed Convolutional NN (CNN):
9979/10000 (only 21 missed!)

Note: For well-designed neural networks, the performance is close


to human-equivalent, and is arguably better, since quite a few of
the MNIST images are difficult even for humans to recognize with confi-
dence, e.g.,

Figure 11.13: MNIST images difficult even for humans to recognize.

Moral of the Neural Networks


• Let all the complexity be learned, automatically, from data
• Simple algorithms can perform well for some problems:
(sophisticated algorithm) ≤ (simple learning algorithm + good training data)

11.4. Multi-Column Least-Squares Problem


• Consider a synthetic data of three classes:

Figure 11.14: A synthetic data of three classes.

• Let the superscript in () denote the class. A point in the c-th class is expressed as
     x^{(c)} = [x_1^{(c)}, x_2^{(c)}] = [x_1, x_2, c],   c = 0, 1, 2.

• Let’s design a neural network, for simplicity, of no hidden layer.


(a) The Weights: A set of weights can be trained in a way that points in a class are heavily weighted by the corresponding part of weights, i.e.,
      w_0^{(j)} + w_1^{(j)} x_1^{(i)} + w_2^{(j)} x_2^{(i)} = δ_{ij} = 1 if i = j, and 0 if i ≠ j,   (11.24)
    where δ_{ij} is called the Kronecker delta and w_0^{(j)} is a bias for the class j.
(b) Thus, for neural networks of C classes, the weights to be trained
must have dimensions (d + 1) × C.
(c) Least-Squares Formulation: The weights can be computed by
the least-squares method.

Multi-Column Least-Squares Problem

• Dataset: Let the dataset be
     X = [x_{11} x_{12};  x_{21} x_{22};  ··· ;  x_{N1} x_{N2}] ∈ R^{N×2},   y = [c_1; c_2; ···; c_N],   (11.25)
  where c_i ∈ {0, 1, 2}, the class number.


• The Algebraic System: It can be formulated using (11.24).
– Define the information matrix
     A = [np.ones([N, 1]), X] = [1 x_{11} x_{12};  1 x_{21} x_{22};  ··· ;  1 x_{N1} x_{N2}] ∈ R^{N×3}.   (11.26)

– The weight matrix to be learned is:
     W = [w^{(0)}, w^{(1)}, w^{(2)}] = [w_0^{(0)} w_0^{(1)} w_0^{(2)};  w_1^{(0)} w_1^{(1)} w_1^{(2)};  w_2^{(0)} w_2^{(1)} w_2^{(2)}],   (11.27)
  where the j-th column weights heavily the points in the j-th class.
– Define the source matrix
     B = [δ_{c_i, j}] ∈ R^{N×3}.   (11.28)

For example, if the i-th point is in the class 0, then the i-th row of
B is [1, 0, 0].
• Then the multi-column least-squares problem reads
     Ŵ = arg min_W ‖AW − B‖²,   (11.29)
  which can be solved by the method of normal equations:
     Ŵ = (A^T A)^{-1} A^T B,   A^T A ∈ R^{3×3}.   (11.30)

Prediction
• Let [x1 , x2 ] be a new point.
• Compute
     [1, x_1, x_2] Ŵ = [p_0, p_1, p_2],   Ŵ ∈ R^{3×3}.   (11.31)
  Ideally, if the point [x_1, x_2] is in class j, then p_j is near 1, while the others would be near 0. Thus p_j is the largest.
• Decide the class c:

c = np.argmax([p0 , p1 , p2 ], axis = 1). (11.32)

Python Code : The dataset in Figure 11.14 is produced by


GLOBAL_VARIABLES.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3

4 N_D1 = 100
5 FORMAT = '%.3f','%.3f','%d'
6

7 SCALE = [[1,1],[1,2],[1.5,1]]
8 THETA = [0,-0.25*np.pi, 0]
9 TRANS = [[0,0],[6,0],[3,4]]
10 COLOR = ['r','b','c']
11 MARKER = ['.','s','+','*']
12 LINESTYLE = [['r--','r-'],['b--','b-'],['c--','c-']]
13

14 N_CLASS = len(SCALE)
15

16 DAT_FILENAME = 'synthetic.data'
17 FIG_FILENAME = 'synthetic-data.png'
18 FIG_INTERPRET = 'synthetic-data-interpret.png'
19

20 def myfigsave(figname):
21 plt.savefig(figname,bbox_inches='tight')
22 print(' saved: %s' %(figname))

synthetic_data.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3 from GLOBAL_VARIABLES import *
4

5 def generate_data(n,scale,theta):
6 # Normally distributed around the origin
7 x = np.random.normal(0,1, n); y = np.random.normal(0,1, n)
8 P = np.vstack((x, y)).T
9 # Transform
10 sx,sy = scale
11 S = np.array([[sx,0],[0,sy]])
12 c,s = np.cos(theta), np.sin(theta)
13 R = np.array([[c,-s],[s,c]]).T #T, due to right multiplication
14 return P.dot(S).dot(R)
15

16 def synthetic_data():
17 N=0
18 plt.figure()
19 for i in range(N_CLASS):
20 scale = SCALE[i]; theta = THETA[i]; N+=N_D1
21 D1 = generate_data(N_D1,scale,theta) +TRANS[i]
22 D1 = np.column_stack((D1,i*np.ones([N_D1,1])))
23 if i==0: DATA = D1
24 else: DATA = np.row_stack((DATA,D1))
25 plt.scatter(D1[:,0],D1[:,1],s=15,c=COLOR[i],marker=MARKER[i])
26

27 np.savetxt(DAT_FILENAME,DATA,delimiter=',',fmt=FORMAT)
28 print(' saved: %s' %(DAT_FILENAME))
29

30 #xmin,xmax = np.min(DATA[:,0]), np.max(DATA[:,0])


31 ymin,ymax = np.min(DATA[:,1]), np.max(DATA[:,1])
32 plt.ylim([int(ymin)-1,int(ymax)+1])
33

34 plt.title('Synthetic Data: N = '+str(N))


35 myfigsave(FIG_FILENAME)
36 if __name__ == '__main__':
37 plt.show(block=False); plt.pause(10)
38

39 if __name__ == '__main__':
40 synthetic_data()

The multi-column least-square method can be implemented as


util_MC_LS.py
1 import numpy as np
2 import numpy.linalg as la
3 from numpy import diag,dot
4 from scipy.linalg import svd
5

6 def set_MC_LS(X,y):
7 N,d = X.shape; nclass = len(set(y))
8 A = np.column_stack((np.ones([N,]),X))
9 b = np.zeros([N,nclass])
10 for i,v in enumerate(y): # one-hot encoding
11 b[i,int(v)] = 1
12 return A,b
13

14 def ls_solve(A,b,npc):
15 if npc==0:
16 return la.solve((A.T).dot(A),(A.T).dot(b))
17 else:
18 U, s, VT = svd(A)
19 V = VT.T
20 U = U[:,:npc]
21 C = diag(1/s[:npc])
22 V = V[:,:npc]
23 return V.dot(C.dot((U.T).dot(b)))
24

25 def prediction(A,sol):
26 forward = A.dot(sol)
27 return np.argmax(forward,axis=1)
28

29 def count_diff(u,v):
30 count =0
31 for i in range(len(u)):
32 if u[i] != v[i]: count+=1
33 return count

Multi_Column_LS.py
1 import numpy as np
2 import matplotlib.pyplot as plt
3 import time
4 from util_MC_LS import *; from GLOBAL_VARIABLES import *
5

6 #-----------------------------------------------
7 # Add Data: append(['name','delimiter',clabel])
8 #-----------------------------------------------
9 DLIST =[]
10 DLIST.append(['synthetic.data', ',', -1])
11 DLIST.append(['wine.data', ',', 0])
12 DLIST.append(['seeds_dataset.txt','\t',-1])
13

14 #-----------------------------------------------
15 # User Setting
16 #-----------------------------------------------
17 idata = 0
18 refigure = 1
19 rtrain = 0.7; run = 1000
20

21 #-----------------------------------------------
22 # DATA: Read & Preprocessing
23 #-----------------------------------------------
24 DATA =np.loadtxt(DLIST[idata][0], delimiter=DLIST[idata][1]);
25 clabel =int(DLIST[idata][2])
26

27 N,d = DATA.shape; d-=1;


28 labelset=set(DATA[:,clabel]); nclass=len(labelset);
29 print(' %s: X.shape= (%d,%d); nclass= %d'\
30 %(DLIST[idata][0],N,d,nclass))
31

32 l0=int(min(labelset));
33 if l0: DATA[:,clabel]-=l0 # label begins with 0
34

35 #-----------------------------------------------
36 # Machine Learning: Multi-Column Least-Squares
37 #-----------------------------------------------
38 ntrain = int(N*rtrain)
39 Acc = np.zeros([run,1])
40 print(' Multi-Column Least-Squares: (rtrain,run) =(%.2f,%d)' %(rtrain,run))
41

42 btime = time.time()
43 for i in range(run):

44 # Begins with Shuffling and Cutting


45 #-----------------------------------------------
46 np.random.shuffle(DATA) # shuffle returns type=None
47 if clabel==0: X = DATA[:,1:];
48 else: X = DATA[:,:d];
49 y = DATA[:,clabel]
50

51 Xtrain = X[0:ntrain,:]; ytrain = y[0:ntrain]


52 Xtest = X[ntrain:,:]; ytest = y[ntrain:]
53

54 # Multi-Column Least-Squares:
55 #-----------------------------------------------
56 A,b = set_MC_LS(Xtrain,ytrain)
57 sol = ls_solve(A,b,0)
58 if i==0 and DLIST[idata][0]=='synthetic.data': param = sol
59

60 # Prediction
61 #-----------------------------------------------
62 A1,b1 = set_MC_LS(Xtest,ytest)
63 predicted = prediction(A1,sol)
64 Acc[i] = 1-count_diff(predicted,ytest)/len(ytest)
65

66 etime = time.time()-btime
67 print(' Accuracy.(mean,std) = (%.2f,%.2f)%%'\
68 %(np.mean(Acc)*100,np.std(Acc)*100))
69 print(' Average Total Etime = %.5f' %(etime/run))
70

71 #-----------------------------------------------
72 # Figuring
73 #-----------------------------------------------
74 if DLIST[idata][0]=='synthetic.data':
75 if refigure:
76 plt.figure()
77 DATA = np.loadtxt(DAT_FILENAME,delimiter=',')
78 for i in range(nclass):
79 D1 = DATA[(i*N_D1):((i+1)*N_D1),:]
80 plt.scatter(D1[:,0],D1[:,1],s=15,c=COLOR[i],marker=MARKER[i])
81

82 xmin,xmax = np.min(DATA[:,0]), np.max(DATA[:,0])


83 ymin,ymax = np.min(DATA[:,1]), np.max(DATA[:,1])
84 x = np.linspace(int(xmin)-1,int(xmax)+1,100)
85 plt.ylim([int(ymin)-1,int(ymax)+1])
86 LEGEND=[]
87 for i in range(nclass):
88 a,b,c=param[:,i]

89 p=0; y=(p-a-b*x)/c; plt.plot(x,y,LINESTYLE[i][0],lw=1.5+0.5*i)


90 LEGEND.append('$L_'+str(i)+'(x_1,x_2)=0$')
91 p=1; y=(p-a-b*x)/c; plt.plot(x,y,LINESTYLE[i][1],lw=1.5+0.5*i)
92 LEGEND.append('$L_'+str(i)+'(x_1,x_2)=1$')
93 plt.legend(LEGEND)
94 myfigsave(FIG_INTERPRET)
95 #plt.draw(); plt.waitforbuttonpress(0)
96 plt.show(block=False); plt.pause(5)
97 else:
98 print(' Set refigure = 1, to see figure')

Note: param = Ŵ represents three lines, a column for each line.
• Let [w_0^{(j)}, w_1^{(j)}, w_2^{(j)}]^T be the j-th column of Ŵ.
• Then, define L_j(x_1, x_2) as
     L_j(x_1, x_2) = w_0^{(j)} + w_1^{(j)} x_1 + w_2^{(j)} x_2.   (11.33)
• In Figure 11.15, the lines L_j(x_1, x_2) = k, j = 0, 1, 2, k = 0, 1, are superposed over the dataset.

Figure 11.15: Lines represented by the weight vectors.

Note: The multi-column least-squares method may be viewed as a one-versus-all classification, presented in §11.2.3.

Exercises for Chapter 11

11.1. You will work on machine learning with real datasets.

(a) Save the code in §11.4 to files of the same name.


(b) Download the UCI datasets:
• wine.data
• seeds_dataset.txt
The delimiter in wine.data is ',' (comma), and the file is ready to use as saved.
On the other hand, the delimiter in seeds_dataset.txt is '\t' (tab), which is
doubled in several spots. You have to eliminate the extra tabs before use.
(c) Run the code with setting idata=0, 1, or 2.
(d) Run it with various rtrain to see how the accuracy varies.

Note: The multi-column least-squares (MC-LS) method is a brand-new algorithm.

• For wine.data, the best known algorithm can predict with accuracy about
95%, while the MC-LS can predict with accuracy about 98.5%.
• For seeds_dataset.txt, the best known algorithm can predict with accuracy
about 92%, while the new algorithm can achieve about 97% accuracy.

It is simple, but better than others for some datasets.

11.2. Now, modify synthetic_data.py to produce new synthetic datasets having 3-4 classes.
Set rtrain = 0.7.

• Generate a synthetic dataset of three classes where the centers of classes are in
a straight line.
• Generate a synthetic dataset of four classes, with class centers not on a straight
line.

(a) Modify Multi_Column_LS.py, if necessary, to process the new datasets and pro-
duce figures as in Figure 11.15, p. 235.
(b) How about accuracy? Is the MC-LS similarly good for the new datasets?
Chapter 12
Scikit-Learn: A Popular Machine Learning Library

Contents of Chapter 12
12.1. Scikit-Learn Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.2. Scikit-Learn – Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.3. Scikit-Learn – Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Exercises for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249


12.1. Scikit-Learn Basics


Scikit-learn is one of the most useful and robust libraries for machine
learning in Python.
• It provides a selection of efficient tools for machine learning and
statistical modeling including preprocessing, classification, re-
gression, clustering, dimensionality reduction, and ensemble
methods.
• This library, which is largely written in Python, is built upon
NumPy, SciPy, pandas, and Matplotlib.

• Prerequisites: The following are required, before we start using


scikit-learn.
– Python 3
– Numpy, Scipy, Matplotlib
– Seaborn (visualization)
– Pandas (data analysis)
• Installation: For example, using pip:
pip install -U scikit-learn

12.1.1. Why scikit-learn?


Five Main Steps, in Machine Learning
1. Selection of features
2. Choosing a performance metric
3. Choosing a classifier and optimization algorithm
4. Evaluating the performance of the model
5. Tuning the algorithm

In practice :
• Each algorithm has its own quirks/characteristics and is based on cer-
tain assumptions.
• It is always recommended that you compare the performance of at
least a handful of different learning algorithms to select the best model
for the particular problem.
• No Free Lunch Theorem: No single classifier works best across all
possible scenarios.

Why Scikit-Learn?
• Nice documentation and usability
• Covers most machine-learning tasks
• Scikit-learn scales to most data problems
⇒ Easy-to-use, convenient, and powerful enough

An Example Code
iris_sklearn.py
1 #------------------------------------------------------
2 # Load Data
3 #------------------------------------------------------
4 from sklearn import datasets
5 # dir(datasets); load_iris, load_digits, load_breast_cancer, load_wine, ...
6

7 iris = datasets.load_iris()
8

9 feature_names = iris.feature_names
10 target_names = iris.target_names
11 print("## feature names:", feature_names)
12 print("## target names :", target_names)
13 print("## set(iris.target):", set(iris.target))
14

15 #------------------------------------------------------
16 # Create "model instances"
17 #------------------------------------------------------
18 from sklearn.linear_model import LogisticRegression
19 from sklearn.neighbors import KNeighborsClassifier
20 LR = LogisticRegression(max_iter = 1000)
21 KNN = KNeighborsClassifier(n_neighbors = 3)
22

23 #------------------------------------------------------
24 # Split, train, and fit
25 #------------------------------------------------------
26 import numpy as np
27 from sklearn.model_selection import train_test_split
28

29 X = iris.data; y = iris.target
30 iter = 20; Acc = np.zeros([iter,2])
31

32 for i in range(iter):
33 X_train, X_test, y_train, y_test = train_test_split(
34 X, y, test_size=0.3, random_state=i, stratify=y)
35 LR.fit(X_train, y_train); Acc[i,0] = LR.score(X_test, y_test)
36 KNN.fit(X_train, y_train); Acc[i,1] = KNN.score(X_test, y_test)
37

38 acc_mean = np.mean(Acc,axis=0)
39 acc_std = np.std(Acc,axis=0)
40 print('## iris.Accuracy.LR : %.4f +- %.4f' %(acc_mean[0],acc_std[0]))
41 print('## iris.Accuracy.KNN: %.4f +- %.4f' %(acc_mean[1],acc_std[1]))
42

43 #------------------------------------------------------
44 # New Sample
45 #------------------------------------------------------
46 sample = [[5, 3, 2, 4],[4, 3, 3, 6]];
47 print('## New sample =',sample)
48 predL = LR.predict(sample); predK = KNN.predict(sample)
49 print(" ## sample.LR.predict :",target_names[predL])
50 print(" ## sample.KNN.predict:",target_names[predK])

Output
1 ## feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
2 'petal width (cm)']
3 ## target names : ['setosa' 'versicolor' 'virginica']
4 ## set(iris.target): {0, 1, 2}
5 ## iris.Accuracy.LR : 0.9667 +- 0.0294
6 ## iris.Accuracy.KNN: 0.9678 +- 0.0248
7 ## New sample = [[5, 3, 2, 4], [4, 3, 3, 6]]
8 ## sample.LR.predict : ['setosa' 'virginica']
9 ## sample.KNN.predict: ['versicolor' 'virginica']

Note: In Scikit-Learn, particularly with ensembling, you can finish


machine learning tasks easily and conveniently.

12.1.2. Data preprocessing


Data preprocessing is a data mining technique.
• It involves transforming raw data into an understandable and more tractable format.
• Real-world data is often incomplete, redundant, inconsistent,
and/or lacking in certain behaviors or trends, and is likely to contain
many errors.
• Data preprocessing is a proven method of resolving such issues.
• Often, data preprocessing is the most important phase of a ma-
chine learning project, especially in computational biology.

Remark 12.1. Data preparation is difficult because the process is


not objective, and it is important because ML algorithms learn from
data. Consider the following.
• Preparing data for analysis is one of the most important steps in
any data-mining project – and traditionally, one of the most time-consuming.
• Often, it takes up to 80% of the time.
• Data preparation is not a one-off process; it is iterative, as you
understand the problem more deeply on each successive pass.
• It is critical that you feed the algorithms with the right data for
the problem you want to solve. Even if you have a good dataset, you
need to make sure that it is in a useful scale and format and that
meaningful features are included.

• The more disciplined you are in your handling of data, the
more consistent and better results you are likely to achieve.
• An important step is data visualization.

Note: The “preprocessing” module in Scikit-learn can be even more effective
when used along with Pandas, a data analysis library.
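As an illustration, the following minimal sketch standardizes the iris features with the
preprocessing module and keeps the result in a Pandas data frame; the file name
preprocessing_sketch.py is an illustrative choice.

preprocessing_sketch.py
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print('## before:\n', df.describe().loc[['mean', 'std']])
print('## after :\n', df_scaled.describe().loc[['mean', 'std']])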

Data Visualization: Matplotlib vs. Seaborn


iris_seaborn.py
1 import seaborn as sbn; import matplotlib.pyplot as plt
2

3 iris = sbn.load_dataset('iris')
4 print(iris.head())
5 sbn.pairplot(iris, hue='species',height=3)
6 plt.savefig('seaborn-pairplot-iris.png',bbox_inches='tight')
7 #plt.show()

Output
1 sepal_length sepal_width petal_length petal_width species
2 0 5.1 3.5 1.4 0.2 setosa
3 1 4.9 3.0 1.4 0.2 setosa
4 2 4.7 3.2 1.3 0.2 setosa
5 3 4.6 3.1 1.5 0.2 setosa
6 4 5.0 3.6 1.4 0.2 setosa

• Seaborn is more agreeable and convenient in handling Pandas data
frames, while Matplotlib is associated more closely with Numpy.
• Seaborn is an extended version of Matplotlib, which uses Matplotlib
along with Numpy and Pandas for plotting graphs.
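For comparison, reproducing even a single panel of the pairplot with plain Matplotlib
requires manual bookkeeping for each class. The following minimal sketch is illustrative
only; the output file name is a hypothetical choice.

iris_matplotlib_scatter.py (a sketch)
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# One scatter call per species, colored automatically by Matplotlib
for k, name in enumerate(iris.target_names):
    plt.scatter(X[y == k, 0], X[y == k, 1], label=name)
plt.xlabel(iris.feature_names[0]); plt.ylabel(iris.feature_names[1])
plt.legend()
plt.savefig('matplotlib-scatter-iris.png', bbox_inches='tight')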

Pandas: Data Analysis


iris_sklearn_pandas.py
1 from sklearn import datasets
2 import numpy as np; import pandas as pd
3 import seaborn as sbn; import matplotlib.pyplot as plt
4

5 #------------------------------------------------------
6 # Load Data
7 #------------------------------------------------------
8 iris = datasets.load_iris()
9 nclass = len(set(iris.target))
10 print('## iris: nclass =', nclass)
11

12 #------------------------------------------------------
13 # Use pandas, for data analysis
14 #------------------------------------------------------
15 data = pd.DataFrame(iris.data)
16 target = pd.DataFrame(iris.target)
17

18 print('## data.head(3):\n', data.head(3))


19 print('## Re-assign data.columns and target[0] ##')
20 data.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
21 print('## data.head(3):\n', data.head(3))
22

23 target = target.rename(columns = {0: 'target'})


24 print('## target.head(3):\n', target.head(3))
25

26 #------------------------------------------------------
27 print('## Visualization: use seaborn + matplotlib.pyplot ##')
28 #------------------------------------------------------
29 sbn.heatmap(data.corr(), annot = True, cmap='Greys');
30 plt.title('iris.data.corr()');
31 plt.savefig('iris_data_corr.png',bbox_inches='tight')
32 #plt.show()
33 #------------------------------------------------------
34 print('## df = pd.concat([data, target], axis = 1) ##')

35 #------------------------------------------------------
36 df = pd.concat([data, target], axis = 1)
37 print('## df.head(3):\n', df.head(3))
38

39 #------------------------------------------------------
40 print('## Check for Missing Values')
41 #------------------------------------------------------
42 print('## df.isnull().sum():\n',df.isnull().sum())
43 print('## df.describe():\n', df.describe())
44

45 #------------------------------------------------------
46 print("## Data Separation: C0 = df.loc[df['target']==0]")
47 #------------------------------------------------------
48 C0 = df.loc[df['target']==0]
49 print('## C0.describe():\n', C0.describe())
50 print('## C0.count()[0] , C0.mean()[0] =',C0.count()[0],',',C0.mean()[0])
51

52 y0 = C0.pop('target')
53 plt.figure() # new figure
54 sbn.heatmap(C0.corr(), annot = True, cmap='Greys');
55 plt.title('iris.C0.corr()');
56 plt.savefig('iris_C0_corr.png',bbox_inches='tight')

Output
1 ## iris: nclass = 3
2 ## data.head(3):
3 0 1 2 3
4 0 5.1 3.5 1.4 0.2
5 1 4.9 3.0 1.4 0.2
6 2 4.7 3.2 1.3 0.2
7 ## Re-assign data.columns and target[0] ##
8 ## data.head(3):
9 sepal_length sepal_width petal_length petal_width
10 0 5.1 3.5 1.4 0.2
11 1 4.9 3.0 1.4 0.2
12 2 4.7 3.2 1.3 0.2
13 ## target.head(3):
14 target
15 0 0
16 1 0
17 2 0
18 ## Visualization: use seaborn + matplotlib.pyplot ##
19 ## df = pd.concat([data, target], axis = 1) ##
20 ## df.head(3):
21 sepal_length sepal_width petal_length petal_width target

22 0 5.1 3.5 1.4 0.2 0


23 1 4.9 3.0 1.4 0.2 0
24 2 4.7 3.2 1.3 0.2 0
25 ## Check for Missing Values
26 ## df.isnull().sum():
27 sepal_length 0
28 sepal_width 0
29 petal_length 0
30 petal_width 0
31 target 0
32 dtype: int64
33 ## df.describe():
34 sepal_length sepal_width petal_length petal_width target
35 count 150.000000 150.000000 150.000000 150.000000 150.000000
36 mean 5.843333 3.057333 3.758000 1.199333 1.000000
37 std 0.828066 0.435866 1.765298 0.762238 0.819232
38 min 4.300000 2.000000 1.000000 0.100000 0.000000
39 25% 5.100000 2.800000 1.600000 0.300000 0.000000
40 50% 5.800000 3.000000 4.350000 1.300000 1.000000
41 75% 6.400000 3.300000 5.100000 1.800000 2.000000
42 max 7.900000 4.400000 6.900000 2.500000 2.000000
43 ## Data Separation: C0 = df.loc[df['target']==0]
44 ## C0.describe():
45 sepal_length sepal_width petal_length petal_width target
46 count 50.00000 50.000000 50.000000 50.000000 50.0
47 mean 5.00600 3.428000 1.462000 0.246000 0.0
48 std 0.35249 0.379064 0.173664 0.105386 0.0
49 min 4.30000 2.300000 1.000000 0.100000 0.0
50 25% 4.80000 3.200000 1.400000 0.200000 0.0
51 50% 5.00000 3.400000 1.500000 0.200000 0.0
52 75% 5.20000 3.675000 1.575000 0.300000 0.0
53 max 5.80000 4.400000 1.900000 0.600000 0.0
54 ## C0.count()[0] , C0.mean()[0] = 50 , 5.006

Figure 12.1: iris_data_corr.png and iris_C0_corr.png

Summary 12.2. Scikit-Learn


• Works well with: Numpy, Scipy, Matplotlib, Seaborn, Pandas
• Many built-in datasets can be used directly, without saving them as files
• Easy-to-use, convenient, and powerful enough; particularly for
– Data preprocessing, data analysis, visualization
• Also available are various machine learning (ML) models and
optimization algorithms.
• Scikit-Learn (sklearn) is one of the most popular libraries, used for
industrial ML projects.

12.2. Scikit-Learn – Supervised Learning


(This section is under development.)
12.2.1. Scikit-Learn supervised learning modules
12.2.2. Performance comparisons
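Until this section is completed, the following minimal sketch indicates the intended
flavor: a few of Scikit-Learn's supervised-learning modules are compared on the iris
dataset by 5-fold cross-validation. The choice of models and settings here is an
illustrative assumption, not the text's final selection.

sklearn_supervised_sketch.py
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_iris(return_X_y=True)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNeighbors (k=3)': KNeighborsClassifier(n_neighbors=3),
    'SVC': SVC(),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print('## %-20s: %.4f +- %.4f' % (name, scores.mean(), scores.std()))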

12.3. Scikit-Learn – Unsupervised Learning


(This section is under development.)
12.3.1. Scikit-Learn unsupervised learning modules
12.3.2. Performance comparisons
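Likewise, until this section is completed, the following minimal sketch illustrates two
common unsupervised-learning modules, k-means clustering and PCA, on the iris data;
the settings are illustrative assumptions.

sklearn_unsupervised_sketch.py
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = datasets.load_iris(return_X_y=True)

# k-means clustering into 3 groups; the labels y are not used for fitting
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print('## cluster sizes:', np.bincount(km.labels_))

# PCA: project the 4-D features onto their two principal components
pca = PCA(n_components=2)
Z = pca.fit_transform(X)
print('## explained variance ratio:', pca.explained_variance_ratio_)
print('## reduced data shape:', Z.shape)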

Exercises for Chapter 12

12.1.
Appendix P

Projects

Contents of Chapter P
P.1. Edge Detection, using Matlab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
P.2. Number Plate Detection, using Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253


P.1. Edge Detection, using Matlab


Use NumPy to create the masks (e.g., the Sobel filter), perform the convo-
lution operations, and combine the horizontal and vertical mask outputs to
extract all the edges.

• with pre-smoothing
• without pre-smoothing
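Since the project statement above mentions NumPy, a minimal Python/NumPy sketch of the
Sobel-based pipeline is given below; a Matlab version would follow the same steps with
conv2. The image file name 'peppers.png' and the smoothing width sigma are illustrative
assumptions.

edge_detection_sketch.py
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import convolve, gaussian_filter

img = plt.imread('peppers.png')
if img.ndim == 3:                       # convert RGB to grayscale
    img = img[..., :3].mean(axis=2)

# Sobel masks for the horizontal and vertical differences
Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Ky = Kx.T

def edges(u):
    gx = convolve(u, Kx)                # horizontal mask output
    gy = convolve(u, Ky)                # vertical mask output
    return np.sqrt(gx**2 + gy**2)       # combine to extract all the edges

E0 = edges(img)                               # without pre-smoothing
E1 = edges(gaussian_filter(img, sigma=1.0))   # with pre-smoothing

plt.imsave('edges_no_smoothing.png',   E0, cmap='gray')
plt.imsave('edges_with_smoothing.png', E1, cmap='gray')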

P.2. Number Plate Detection, using Python


import pytesseract
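One possible pipeline uses OpenCV (cv2) to locate a rectangular plate candidate and
pytesseract for the character recognition; Tesseract itself must be installed separately.
The following is a minimal sketch; the image file name and the parameter values are
illustrative assumptions.

plate_detection_sketch.py
import cv2
import pytesseract

img  = cv2.imread('car.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray, 11, 17, 17)     # smooth while keeping edges
edged = cv2.Canny(gray, 30, 200)

# Keep the largest contours and look for a 4-corner (rectangular) candidate
contours, _ = cv2.findContours(edged, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = sorted(contours, key=cv2.contourArea, reverse=True)[:10]

plate = None
for c in contours:
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    if len(approx) == 4:
        x, y, w, h = cv2.boundingRect(approx)
        plate = gray[y:y+h, x:x+w]
        break

if plate is not None:
    text = pytesseract.image_to_string(plate, config='--psm 7')
    print('## plate text:', text.strip())
else:
    print('## no rectangular plate candidate found')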
Index
:, Python slicing, 141 class, 149, 150
:, in Matlab, 14 class attribute, 153
_ _init_ _() constructor, 151 Classes.py, 154
classification problem, 124
activation function, 212 clustering, 210
activation function, why?, 212 CNN, 227
activation functions, popular, 214 code block, 140
Adaline, 213, 214 coding, iii, 2
adaptive step size, 166 coding vs. programming, 5
algorithmic design, 4 coefficient matrix, 91
algorithmic parameter, 215 coefficients, 76
anonymous function, 25 cofactor, 101
anonymous_function.m, 25 cofactor expansion, 101
approximation, 114 common logarithm, 48
area_closed_curve.m, 33, 51 complex number system, 35
artificial neurons, 211 computer programming, iii, 2, 8
attributes, 151 consistent system, 91
augmented matrix, 91 constraint set, 161
average slope, 86 continue, 24
average speed, 54 contour, 36, 87
contour, in Matlab, 18
backbone of programming, 8 convergence of Newton’s method, 73
backtracking line search, 166 converges absolutely, 66
basis function, 64 correction term, 71
binary classifier, 211, 212 cost function, 212
break, 23 covariance, 177
covariance matrix, 176–178, 180, 191
call_get_cubes.py, 145 Covariance.py, 178
cancellation equations, 40 critical point, 164
chain rule, 62 csvwrite, 32
change of basis, 176 curse of dimensionality, 208
change of variables, 64, 122 cython, 138
change-of-base formula, 50
characteristic equation, 106 daspect, 32
characteristic polynomial, 106 data matrix, 181
charpoly, 107 data preparation, 241
child class, 154 data preprocessing, 241
circle.m, 32 data visualization, 241


debugging, 8 Fibonacci sequence, 27


deepcopy, 142 Fibonacci_sequence.m, 27
dependent variable, 81 fig_plot.m, 17
derivative, 59 fimplicit, 86
derivative_rules.m, 62 finverse, in Matlab, 41
design matrix, 119 fmesh, in Matlab, 18
desktop calculator, 139 for loop, 21
det, 102 four essential components, 12
determinant, 100, 101 fplot, in Matlab, 18
determinant.m, 102 free fall, 54
diagonalizable, 109, 180 free_fall.m, 55
diagonalization theorem, 110 frequently_used_rules.py, 143
diagonalization.m, 111 fsurf, in Matlab, 18
difference quotient, 55 function, 38
differentiable, 82 function of two variables, 81
differentiate, 60 fundamental questions, two, 94
differentiation rules, 60 Fundamental Theorem of Algebra, 76
diverges, 66
doc, in Matlab, 18 Galileo, 54
domain, 81, 161 Galileo’s law of free-fall, 54
dot product, 13 general solution, 96
dot, in Matlab, 13 get_cubes.py, 145
dyadic decomposition, 184, 190 get_hypothesis_WLS.m, 130
ggplot, 203
e, 45 GLOBAL_VARIABLES.py, 230
edge detection, 252 golden ratio, 26
effective programming, 9 gradient, 85
eig, 107, 112 gradient descent algorithm, 167
eigenspace, 106 gradient descent method, 162, 164, 214,
eigenvalue, 105, 106 222
eigenvalues.m, 107 Green’s Theorem, 32
eigenvector, 105
elementary row operations, 93, 103 help, in Matlab, 7, 18
ellipsoid, 180 horizontal asymptote, 46
ensembling, 240 horizontal line test, 39
equivalent system, 90 Horner’s method, 77, 153
Euler’s identity, 86 horner, in Python, 147
Euler, Leonhard, 45 horner.m, 79, 149
eulers_identity.m, 86 hyperparameter, 215
existence, 94 hyperplane, 212
exponential function, 42 hypothesis space, 194
exponential regression, 43
imag, imaginaty part, 37
exponents, laws of, 45
image compression, 184
eye, in Matlab, 16
imaginary part, 37
fastest increasing direction, 85 imaginary unit, 35

inconclusive, 66 logistic cost function, 216


inconsistent system, 91 Logistic Regression, 213, 216
indentation, 140 loop control statements, 22
independent variable, 81 LS problem, 115, 193
information engineering, iii
information matrix, 229 M-file, 7
inheritance, 154 machine learning, 38, 206
initialization, 19 machine learning algorithm, 206
inlier.m, 130 machine learning modelcode, 156
inliers, 124 Machine_Learning_Model.py, 156
instance, 150 Maclaurin series, 68
instantaneous speed, 54 mathematical analysis, 4
instantiation, 150 Matlab, 12
interpolation, 114 Matplotlib, 242
interpretability, 209 Matplotlib vs. Seaborn, 242
interval of convergence, 70 matrix equation, 93
inverse function, 38, 39 matrix form, 91
inverse_matrix.m, 98 matrix-matrix multiplication, 16
invertible matrix, 97 matrix-vector multiplication, 14
iris_seaborn.py, 242 mesh, 36
iris_sklearn.py, 239 mesh, in Matlab, 18
iris_sklearn_pandas.py, 243 method of normal equations, 117, 193, 229
iteration, 19 method, in Python class, 151
mini-batch, 223
Kronecker delta, 228 Minimization Problem, 161
minimum volume enclosing ellipsoid, 180
laws of exponents, 45 MLESAC, 128
learning rate, 214 MNIST data, 220
least-squares line, 118 mod, 24
least-squares problem, 115, 193 modularization, 8
least-squares solution, 115, 193 module, 8
least_squares.m, 117 modulo, 24
left singular vectors, 182, 187 monomial basis, 65
left-hand limit, 58 multi-class classification, 218
level curve, 85 multi-column least-squares problem, 229
linear equation, 90 multi-line comments, 140
linear system, 90 Multi_Column_LS.py, 233
linear_equations_rref.m, 95 multiple local minima problem, 209
linearity rule, 62 multiple output, 26
linearization, 122 MVEE, 180
linspace, in Matlab, 18 mysort.m, 10
list, in Python, 141 mysqrt.m, 74
load, 32
localization of roots, 76 natural exponential function, 46
logarithmic function, 47 natural logarithm, 48

nested loop, 22 pca_regression.m, 197


nested multiplication, 77 peppers_compress.m, 185
net input, 212 Perceptron, 213
network.py, 224 plot, in Matlab, 17
neural network, 228 polynomial approximation, 67
neuron, 211 polynomial interpolation, 38
Newton’s method, 71 polynomial of degree n, 76
Newton-Raphson method, 71 Polynomial_01.py, 151
newton_horner, in Python, 147 Polynomial_02.py, 152
newton_horner.m, 80, 148 population.m, 43
No Free Lunch Theorem, 239 positive semidefinite matrix, 181
nonlinear regression, 122 power rule, 62
nonsingular matrix, 97 power series, 65
normal equations, 116, 121 preprocessing, Scikit-learn, 241
number plate detection, 253 principal component analysis, 176
numerical approximation, 32 principal components, 183
numpy, 25, 138, 149 principal directions, 176, 178, 180
numpy.loadtxt, 203 product rule, 62
numpy.savetxt, 203 programming, iii, 2, 8
PROSAC, 128
object-oriented programming, 150 pseudoinverse, 193, 195
objective function, 161 pseudoinverse, the k-th, 195
observation vector, 119 PSNR, 186
Octave, 25 Python, 138
Octave, how to import symbolic package, Python essentials, 141
36 python_startup.py, 139
Octave, how to know if Octave is runng-
ing, 36 quadratic formula, 35
one-shot learning, 209 quiver, 87
one-to-one function, 39 quotient rule, 62
one-versus-all, 218, 235
one-versus-rest, 218 R-RANSAC, 128
OOP, 150 R&D, 64
orthogonal matrix, 180 radius of convergence, 67
outliers, 124 random sample consensus, 126
overdetermined, 89 range, 81
overfitting, 208 range, in Python, 142
RANSAC, 126
pairplot, in Seaborn, 242 ransac2.m, 129
Pandas, 241, 243 ratio test, 66, 70
parameter estimation problem, 124 Rayleigh quotient, 181
parameter vector, 119 readmatrix, 33
parent class, 154 real part, 37
partial derivative, 83 real, real part, 37
PCA, 176 real-valued solution, 35

real_imaginary_parts.m, 37 singular value decomposition, 182, 183,


rectifier, 213 187
reduced row echelon form, 95 singular values, 182, 187
reference semantics, in Python, 142 sklearn_classifiers.py, 157
region, 30 slicing, in Python, 141
regression analysis, 38, 118 slope of the curve, 57
regression coefficients, 118 smoothing assumption, 124
regression line, 118 softplus function, 213
Regression_Analysis.m, 198 solution, 90
Regression_Analysis.py, 202 solution set, 90
Remainder Theorem, 77 SortArray.m, 11
repetition, 6, 19 source matrix, 229
research and development, 64 Speed up Python Programs, 138
retrieving elements, in Python, 141 sqrt, 75
reusability, 6 square root, 74
reverse, 38 squareroot_Q.m, 3
Richardson’s method, 162 squaresum.m, 6
right singular vectors, 182, 187 SSE, 212
right-hand slope, 58 standard logistic sigmoid function, 213
Rosenbrock function, 165 Starkville, 51
rosenbrock_2D_GD.py, 165 steepest descent method, 162
rosenbrock_opt_Newton.py, 169 step length, 161, 214, 217
rotational symmetry, 213 stochastic gradient descent, 167, 223
row equivalent, 93 string, in Python, 141
rref, 95, 96 submatrix, 101
rules of derivative, 62 sufficient decrease condition, 166
Run_network.py, 226 Sum of Squared Errors, 212
super-convergence, 74
save multiple functions in a file, 197 supervised learning, 207, 247
saveas, 32, 33 surf, in Matlab, 18
scene analysis, 124 SVD, 182
Scikit-learn, 238 SVD theorem, 187
scikit-learn, 246 SVD, algebraic interpretation, 189
scipy, 138 symbolic computation, 55
scope in loops, 22 symmetric, 180
score matrix, 181 synthetic division, 77
Seaborn, 242 synthetic_data.py, 231
search direction, 161, 214 system, 38
secant_lines_abs_x2_minus_1.m, 58 system of linear equations, 90
self, 151 systems of linear equations, 89
SGD, 167
sigmoid function, 213 tangent line, 54, 57, 72
similar, 108 tangent plane, 54
similarity transformation, 108 Taylor polynomial of order n, 69
sinc function, 70 Taylor series, 67, 68

Taylor series, commonly used, 70 util_MC_LS.py, 232


Taylor’s formula, 162
Taylor’s Theorem with integral remain- variance, 177
der, 163 vehicle number plate detection, 253
taylor, in Matlab, 70 visualize_complex_solution.m, 36
Term-by-Term Differentiation, 67 volume scaling factor, 100
term-by-term integration, 67
training data, 207 weight matrix, 124, 229
truncated data matrix, 183 weighted least-squares method, 124
truncated score matrix, 183 weighted normal equations, 125
tuple, in Python, 141 while loop, 20
why, in Matlab, 16
underdetermined, 89 Wine_data.py, 201
unique inverse, 97 writematrix, 33
uniqueness, 94
unsupervised learning, 210, 248 x-intercept, 72
update direction, 214
util.m, 196 Zeros-Polynomials-Newton-Horner.py,
util.py, 202 147
util_Covariance.py, 178 zeros_of_poly_built_in.py, 147
