Problems Chap1
Chapter 1
Pierre Paquay
Problem 1.1
Let $B_1$ and $B_2$ be the events "the first picked ball is black" and "the second picked ball is black". We have
$$P(B_2 | B_1) = \frac{P(B_2 \cap B_1)}{P(B_1)},$$
with
$$P(B_2 \cap B_1) = \frac{1}{2} \cdot 1 = \frac{1}{2}$$
(both balls are black only when the all-black bag is chosen) and
$$P(B_1) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot \frac{1}{2} = \frac{3}{4}.$$
In conclusion, we get
$$P(B_2 | B_1) = \frac{1/2}{3/4} = \frac{2}{3}.$$
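As a quick cross-check (not part of the original solution), a small simulation of the two-bag experiment gives an estimate close to 2/3.

# Simulate the two-bag experiment: bag 1 holds two black balls, bag 2 one black
# and one white ball; we condition on the first drawn ball being black.
set.seed(1)
n <- 100000
bag <- sample(1:2, n, replace = TRUE)
first_is_black <- ifelse(bag == 1, TRUE, runif(n) < 0.5)
second_is_black <- ifelse(bag == 1, TRUE, !first_is_black)
mean(second_is_black[first_is_black])  # close to 2/3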
Problem 1.2
The regions where $h(x) = +1$ and $h(x) = -1$ are separated by the boundary line
$$w_0 + w_1 x_1 + w_2 x_2 = 0.$$
For $w = (1, 2, 3)^T$, we have the graph below.
[Figure: the boundary line in the $(x, y)$ plane for $w = (1, 2, 3)^T$.]
For $w = -(1, 2, 3)^T$, we have the graph below.
[Figure: the boundary line in the $(x, y)$ plane for $w = -(1, 2, 3)^T$.]
We may notice that the lines are identical in these two graphs (which is not that surprising considering that the coefficients are opposites). However, the regions where $h(x) = +1$ and $h(x) = -1$ are swapped: in the first plot the positive region is the one above the line, and in the second plot the positive region is the one below the line.
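For reference, here is a small sketch (not the original plotting code) of how these two boundaries can be drawn with ggplot2; it only assumes the two weight vectors used above.

# Plot the line w0 + w1*x + w2*y = 0 over [-1, 1]^2 for a given weight vector w;
# w and -w give exactly the same line, only the sign of h flips on each side.
library(ggplot2)

plot_boundary <- function(w) {
  ggplot(data.frame(x = c(-1, 1), y = c(-1, 1)), aes(x, y)) +
    geom_blank() +
    geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3]) +
    labs(x = "x", y = "y")
}
plot_boundary(c(1, 2, 3))   # h(x) = +1 above the line
plot_boundary(-c(1, 2, 3))  # same line, h(x) = +1 below it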
Problem 1.3
(a) Since the data set is linearly separable, there exists a weight vector $w^*$ such that
$$y_n = \operatorname{sign}(w^{*T} x_n),$$
which translates to
$$y_n (w^{*T} x_n) > 0$$
for all $n = 1, \cdots, N$. This implies that
$$\rho = \min_{1 \le n \le N} y_n (w^{*T} x_n) > 0.$$
It remains to prove that $w^T(t) w^* \ge t\rho$; to do this, we proceed by induction on $t$. If $t = 0$, we obviously get that $0 \cdot w^* \ge 0$. If the claim is true for $t - 1$, let us prove it for $t$. If we use the first part of point (b), we have that
$$w^T(t) w^* \ge w^T(t-1) w^* + \rho \ge (t-1)\rho + \rho = t\rho.$$
Moreover, a similar induction shows that $\|w(t)\| \le \sqrt{t}\,R$, so that
$$\frac{w^T(t) w^*}{\|w(t)\|} \ge \frac{t\rho}{\sqrt{t}\,R} = \sqrt{t}\,\frac{\rho}{R}.$$
Since, by the Cauchy-Schwarz inequality, the left-hand side is at most $\|w^*\|$, we conclude that
$$t \le \frac{R^2 \|w^*\|^2}{\rho^2}.$$
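As a sanity check of this bound (an illustration, not part of the proof), we can compare the number of PLA updates with $R^2\|w^*\|^2/\rho^2$ on a small synthetic data set; the data set and the separating $w^*$ below are arbitrary choices.

# Empirical check of t <= R^2 * ||w*||^2 / rho^2 on an arbitrary separable data set.
set.seed(1)
n_pts <- 50
X <- cbind(1, matrix(runif(2 * n_pts, min = -1, max = 1), ncol = 2))  # inputs with x0 = 1
w_star <- c(0.1, 0.6, -0.8)                                           # a separating weight vector
y <- sign(X %*% w_star)

rho <- min(y * (X %*% w_star))     # rho = min_n y_n (w*^T x_n) > 0
R <- max(sqrt(rowSums(X^2)))       # R = max_n ||x_n||
bound <- R^2 * sum(w_star^2) / rho^2

w <- c(0, 0, 0)                    # PLA starting from w(0) = 0
t <- 0
repeat {
  mis <- which(sign(X %*% w) != y)
  if (length(mis) == 0)
    break
  w <- w + y[mis[1]] * X[mis[1], ]
  t <- t + 1
}
c(updates = t, bound = bound)      # the number of updates never exceeds the bound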
Problem 1.4
(a) Below, we generate a linearly separable data set of size 20 and the target function f (in red).
library(ggplot2)

set.seed(101)
# The target weights and the base scatter plot p are defined in a chunk that is not
# echoed in the source; the values below are arbitrary stand-ins so the code runs.
w0 <- 0.1; w1 <- 0.6; w2 <- -0.8
h <- function(x, w) {
  scalar_prod <- cbind(1, x$x1, x$x2) %*% w
  return(as.vector(sign(scalar_prod)))
}
f <- function(x) {
  return(h(x, c(w0, w1, w2)))
}
D <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D <- cbind(D, y = f(D))
# Base scatter plot of D (reconstructed to match the plotting code used later on).
p <- ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none")
p_f <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red")
p_f
[Figure: the data set D and the target function f (in red).]
(b) Below, we plot the training data, the target function f (in red) and the final hypothesis g (in blue)
generated by PLA.
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D, w)
  D_mis <- subset(D, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
p_g <- p_f + geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, the target function f (red) and the final PLA hypothesis g (blue).]
Here, the PLA took 5 iterations before converging. We may notice that although g is pretty close to f , they
are not quite identical.
(c) Below, we repeat what we did in point (b) with another randomly generated data set of size 20.
D1 <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
# The first line of this plotting call was lost at a page break in the source; it is
# reconstructed here following the pattern used elsewhere in the document.
ggplot(D1, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none") +
  geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
[Figure: the data set of size 20 from (c), f (red) and g (blue).]
In this case, the PLA took 12 iterations (which is greater than in (b)) before converging. We may notice that,
as in point (b), although g is pretty close to f , they are not quite identical.
(d) Below, we repeat what we did in point (b) with another randomly generated data set of size 100.
D1 <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
[Figure: the data set of size 100, f (red) and g (blue).]
In this case, the PLA took 33 iterations (which is greater than in (b) and (c)) before converging. We may notice that here f and g are very close to each other.
(e) Below, we repeat what we did in point (b) with another randomly generated data set of size 1000.
D1 <- data.frame(x1 = runif(1000, min = -1, max = 1), x2 = runif(1000, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
[Figure: the data set of size 1000, f (red) and g (blue).]
In this case, the PLA took 511 iterations (which is greater than in (b), (c) and (d)) before converging. We may notice that here f and g are nearly indistinguishable.
(f) Here, we randomly generate a linearly separable data set of size 1000 with $x_n \in \mathbb{R}^{10}$.
N <- 10
h <- function(x, w) {
  scalar_prod <- cbind(1, x) %*% w
  return(as.vector(sign(scalar_prod)))
}
w <- runif(N + 1)
f <- function(x) {
  return(h(x, w))
}
# The generation of the data set D2 is not echoed in the source; the two lines below
# reconstruct it (size 1000 as stated in the text; the [-1, 1] range is assumed, as in the previous parts).
D2 <- as.data.frame(matrix(runif(1000 * N, min = -1, max = 1), ncol = N))
D2 <- cbind(D2, y = f(as.matrix(D2[, 1:N])))
iter <- 0
w0 <- rep(0, N + 1)
repeat {
  y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
  D_mis <- subset(D2, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
  iter <- iter + 1
}
In this case, we may see that the number of iterations is 4326, which is very large; this is a direct consequence of the increase in dimension from 2 to 10.
(g) Below, we repeat the algorithm on the same data set as (f ) for 100 experiments, and we pick x(t) randomly
instead of deterministically.
updates <- rep(0, 100)
for (i in 1:100) {
  # Reset the weight vector and the update counter for each experiment.
  iter <- 0
  w0 <- rep(0, N + 1)
  repeat {
    y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
    D_mis <- subset(D2, y != y_pred)
    if (nrow(D_mis) == 0)
      break
    x_t <- D_mis[sample(nrow(D_mis), 1), ]
    w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
    iter <- iter + 1
  }
  updates[i] <- iter
}
Now, we plot a histogram of the number of updates that the PLA takes to converge.
ggplot(data.frame(updates), aes(x = updates)) + geom_histogram(bins = 15, col = "black") +
labs(x = "Number of updates", y = "Count")
[Figure: histogram of the number of PLA updates over the 100 experiments.]
We may see that the number of updates needed to converge varies considerably from one randomized run to another.
(h) As we saw above, the more data points N we have, the more accurately g approximates f, but the longer the PLA takes to converge. Moreover, the greater the dimension d becomes, the greater the running time gets as well.
Problem 1.5
(a) Below, we generate a training data set of size 100 and a test data set of size 10000. We also plot the target function f (in red) and the final hypothesis g (in blue) generated by Adaline [we use an η value of 5 instead of 100 to keep the computed values manageable].
set.seed(1975)
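# Note: the target weights w0, w1, w2 and the base scatter plot p used further down
# are assumed to be defined in an earlier chunk that is not echoed here (as in Problem 1.4).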
h <- function(x, w) {
  scalar_prod <- cbind(1, x$x1, x$x2) %*% w
  return(as.vector(sign(scalar_prod)))
}
f <- function(x) {
  return(h(x, c(w0, w1, w2)))
}
D_train <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D_train <- cbind(D_train, y = f(D_train))
D_test <- data.frame(x1 = runif(10000, min = -1, max = 1), x2 = runif(10000, min = -1, max = 1))
D_test <- cbind(D_test, y = f(D_test))
iter <- 0
eta <- 5
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D_train, w)
  D_mis <- subset(D_train, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  obs_t <- D_mis[sample(nrow(D_mis), 1), ]
  x_t <- c(1, as.numeric(obs_t[1:2]))
  y_t <- as.numeric(obs_t[3])
  s_t <- sum(w * x_t)
  if (y_t * s_t <= 1)
    w <- w + eta * (y_t - s_t) * x_t
  iter <- iter + 1
  if (iter == 1000)
    break
}
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, the target f (red) and the Adaline hypothesis g (blue) for η = 5.]
We have a classification error rate of 2.23% on the test set.
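The line that computes this error rate is not echoed in the source; it presumably amounts to something like the following (using the h, w and D_test defined above).

# Fraction of test points misclassified by the final hypothesis g.
mean(h(D_test, w) != D_test$y)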
(b) Now we repeat everything we did in (a) with η = 1.
iter <- 0
eta <- 1
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D_train, w)
  D_mis <- subset(D_train, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  obs_t <- D_mis[sample(nrow(D_mis), 1), ]
  x_t <- c(1, as.numeric(obs_t[1:2]))
  y_t <- as.numeric(obs_t[3])
  s_t <- sum(w * x_t)
  if (y_t * s_t <= 1)
    w <- w + eta * (y_t - s_t) * x_t
  iter <- iter + 1
  if (iter == 1000)
    break
}
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 1.]
(c) Now we repeat everything we did in (a) with η = 0.01 [the training loop, which is analogous to the one above, is not echoed here].
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 0.01.]
(d) Finally, we repeat everything with η = 0.0001 [again, the training loop is not echoed here].
p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none")
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 0.0001.]
Problem 1.6
(b) For 1000 independent samples of 10 marbles each, we have
$$\begin{aligned}
P(\text{at least one sample has } \nu = 0) &= 1 - P(\nu_i > 0 \ \forall i) \\
&= 1 - \prod_{i=1}^{1000} P(\nu_i > 0) \\
&= 1 - \prod_{i=1}^{1000} \left[1 - P(\nu_i = 0)\right] \\
&= 1 - \prod_{i=1}^{1000} \left[1 - (1 - \mu)^{10}\right] \\
&= 1 - \left[1 - (1 - \mu)^{10}\right]^{1000}.
\end{aligned}$$
This gives us 1 when µ = 0.05, 0.6235762 when µ = 0.5, and $1.0239476 \times 10^{-4}$ when µ = 0.8.
(c) Here, we repeat (b) for 1000000 independent samples. In that case, we obtain
$$P(\text{at least one sample has } \nu = 0) = 1 - \left[1 - (1 - \mu)^{10}\right]^{1000000},$$
which gives us 1 when µ = 0.05, 1 when µ = 0.5, and 0.0973316 when µ = 0.8.
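The values above come from plugging µ into this formula; a small helper (a sketch, not the original code) reproduces them.

# P(at least one of M independent samples of 10 marbles has nu = 0).
p_at_least_one_zero <- function(mu, M) 1 - (1 - (1 - mu)^10)^M

p_at_least_one_zero(c(0.05, 0.5, 0.8), 1000)     # ~ 1, 0.6235762, 1.0239476e-04
p_at_least_one_zero(c(0.05, 0.5, 0.8), 1000000)  # ~ 1, 1, 0.0973316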
Problem 1.7
(a) First we treat the case where µ = 0.05. For one coin, we have
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.05)^{10} = 0.5987369;$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.05)^{10}\right]^{1000} = 1;$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.05)^{10}\right]^{1000000} = 1.$$
We repeat the same reasoning for µ = 0.8. For one coin, we have
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.8)^{10} = 1.024 \times 10^{-7};$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.8)^{10}\right]^{1000} = 1.0239476 \times 10^{-4};$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.8)^{10}\right]^{1000000} = 0.0973316.$$
(b) Here, we consider N = 6, two coins, and µ = 0.5. If we use the Hoeffding Inequality together with the union bound over the two coins, we obtain
$$P\Big[\max\big(|\nu_1 - \mu|, |\nu_2 - \mu|\big) > \epsilon\Big] \le 2 \cdot 2e^{-2\epsilon^2 N} = 4e^{-12\epsilon^2}.$$
Below, we plot the above probability together with its Hoeffding inequality bound for ε in the range [0, 1].
N <- 6
mu <- 0.5
max_abs <- function(epsilon) {
  c1 <- sample(c(0, 1), N, replace = TRUE)
  nu1 <- mean(c1)
  c2 <- sample(c(0, 1), N, replace = TRUE)
  nu2 <- mean(c2)
  m <- max(abs(nu1 - mu), abs(nu2 - mu))
  # The end of this chunk is cut off in the source; returning the indicator that the
  # largest deviation exceeds epsilon is the natural completion.
  return(m > epsilon)
}
[Figure: the estimated probability and the Hoeffding bound (Proba) as functions of epsilon.]
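The code that produced the plot above is not shown in the source; a minimal sketch of one way to obtain it, reusing N, mu and the max_abs helper defined above, could look like this.

# Monte Carlo estimate of P[max(|nu1 - mu|, |nu2 - mu|) > epsilon] together with
# the (union-bound) Hoeffding bound 2 * 2 * exp(-2 * N * epsilon^2).
library(ggplot2)

eps <- seq(0, 1, by = 0.01)
proba <- sapply(eps, function(e) mean(replicate(2000, max_abs(e))))
bound <- 2 * 2 * exp(-2 * N * eps^2)

curves <- data.frame(epsilon = rep(eps, 2),
                     Proba = c(proba, bound),
                     curve = rep(c("estimated probability", "Hoeffding bound"),
                                 each = length(eps)))
ggplot(curves, aes(x = epsilon, y = Proba, colour = curve)) + geom_line() +
  labs(x = "epsilon", y = "Proba")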
Problem 1.8
(a) If $t$ is a non-negative random variable and $\alpha > 0$, then
$$E(t) \ge E\big(t \,[\![t \ge \alpha]\!]\big) \ge \alpha\, E\big([\![t \ge \alpha]\!]\big) = \alpha\, P(t \ge \alpha),$$
which proves the Markov Inequality.
(b) Here $u$ is a random variable with mean µ and variance σ² and $\alpha > 0$. If we consider the non-negative random variable $(u - \mu)^2$, the Markov Inequality tells us that
$$P[(u - \mu)^2 \ge \alpha] \le \frac{E[(u - \mu)^2]}{\alpha} = \frac{\sigma^2}{\alpha},$$
which proves the Chebyshev Inequality.
(c) Now $u_1, \cdots, u_N$ are iid random variables, each with mean µ and variance σ², $u = \frac{1}{N}\sum_{n=1}^N u_n$, and $\alpha > 0$. We consider the random variable $u$; its expectation is
$$E(u) = \frac{1}{N}\sum_{n=1}^N \mu = \mu$$
and, by independence, its variance is
$$\mathrm{Var}(u) = \frac{1}{N^2}\sum_{n=1}^N \sigma^2 = \frac{\sigma^2}{N}.$$
The Chebyshev Inequality then gives
$$P[(u - \mu)^2 \ge \alpha] \le \frac{\sigma^2}{N\alpha}.$$
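As a quick illustration of this bound (not part of the proof), we can check it by simulation for uniform(0, 1) variables, where µ = 1/2 and σ² = 1/12.

# Empirical P[(u_bar - mu)^2 >= alpha] versus the Chebyshev bound sigma^2 / (N * alpha).
set.seed(42)
n <- 20
alpha <- 0.01
u_bar <- replicate(100000, mean(runif(n)))
mean((u_bar - 0.5)^2 >= alpha)  # empirical probability (well below the bound)
(1 / 12) / (n * alpha)          # Chebyshev bound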
Problem 1.9
(a) Let $t$ be a finite random variable, $\alpha > 0$ and $s > 0$; since $e^{st}$ is a non-negative random variable and $t \ge \alpha$ exactly when $e^{st} \ge e^{s\alpha}$, the Markov Inequality gives
$$P(t \ge \alpha) = P(e^{st} \ge e^{s\alpha}) \le e^{-s\alpha} E(e^{st}).$$
(b) With $u = \frac{1}{N}\sum_{n=1}^N u_n$, where the $u_n$ are iid with $U(s) = E(e^{s u_n})$, part (a) gives
$$P(u \ge \alpha) = P(Nu \ge N\alpha) \le e^{-sN\alpha} E(e^{sNu}) = e^{-sN\alpha} \prod_{n=1}^N E(e^{s u_n}) = \left(e^{-s\alpha} U(s)\right)^N.$$
(c) If $P[u_n = 0] = P[u_n = 1] = \frac{1}{2}$, then $U(s) = \frac{1 + e^s}{2}$; letting $f(s) = e^{-s\alpha} U(s)$, we get immediately
$$\frac{df}{ds} = \frac{e^{-s\alpha}}{2}\left[(1 - \alpha)e^s - \alpha\right],$$
which has a root for $s = \ln\frac{\alpha}{1 - \alpha}$, and
$$\frac{d^2 f}{ds^2} = \frac{e^{-s\alpha}}{2}\left[(1 - \alpha)^2 e^s + \alpha^2\right] \ge 0.$$
So, we conclude that $s = \ln\frac{\alpha}{1 - \alpha}$ is a minimum of $f(s)$.
(d) If we now take $\alpha = \frac{1}{2} + \epsilon$ and plug in the optimal $s = \ln\frac{1/2 + \epsilon}{1/2 - \epsilon}$ from (c), we obtain
$$\begin{aligned}
P\left(u \ge \frac{1}{2} + \epsilon\right) &\le \left(e^{-s(\frac{1}{2} + \epsilon)} U(s)\right)^N \\
&\le \min_s \left(e^{-s(\frac{1}{2} + \epsilon)} U(s)\right)^N \\
&= \left(e^{-\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\,(\frac{1}{2} + \epsilon)}\; U\!\left(\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\right)\right)^N \\
&= \left(e^{-\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\,(\frac{1}{2} + \epsilon)}\; \frac{1}{2}\left(1 + e^{\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}}\right)\right)^N \\
&= \left[\frac{1}{2}\left(\frac{1/2 - \epsilon}{1/2 + \epsilon}\right)^{1/2 + \epsilon}\left(1 + \frac{1/2 + \epsilon}{1/2 - \epsilon}\right)\right]^N \\
&= \left[\frac{1}{2}\left(\frac{1/2 - \epsilon}{1/2 + \epsilon}\right)^{1/2 + \epsilon}\left(\frac{1}{2} - \epsilon\right)^{-1}\right]^N \\
&= \left[2^{-1}\,\frac{1}{(1/2 + \epsilon)^{1/2 + \epsilon}}\,\frac{1}{(1/2 - \epsilon)^{1/2 - \epsilon}}\right]^N \\
&= \left[2^{-1 - \log_2 (1/2 + \epsilon)^{1/2 + \epsilon} - \log_2 (1/2 - \epsilon)^{1/2 - \epsilon}}\right]^N \\
&= 2^{-\beta N},
\end{aligned}$$
with $\beta = 1 + \left(\frac{1}{2} + \epsilon\right)\log_2\left(\frac{1}{2} + \epsilon\right) + \left(\frac{1}{2} - \epsilon\right)\log_2\left(\frac{1}{2} - \epsilon\right)$.
It remains to see that β > 0; to do that, we plot the graph of β for 0 < ε < 1/2 below.
[Figure: β as a function of ε on (0, 1/2).]
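The plotting code is not shown in the source; β can be computed and plotted directly from its definition, for instance as follows.

# beta = 1 + (1/2 + eps) * log2(1/2 + eps) + (1/2 - eps) * log2(1/2 - eps) on (0, 1/2).
library(ggplot2)

eps <- seq(0.001, 0.499, by = 0.001)
beta <- 1 + (1 / 2 + eps) * log2(1 / 2 + eps) + (1 / 2 - eps) * log2(1 / 2 - eps)
ggplot(data.frame(epsilon = eps, beta = beta), aes(x = epsilon, y = beta)) +
  geom_line() + labs(x = "epsilon", y = "beta")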
Problem 1.10
which is equivalent to the fact that the number of points with $h(x_{N+m}) \ne f(x_{N+m})$ must be equal to $k$. And the number of ways we can have $k$ errors in $M$ trials is $C_M^k$ (the binomial coefficient).
(d) We may write that
$$E_f[E_{\text{off}}(h, f)] = \frac{1}{M}\, E_f[\text{Number of } h(x_{N+m}) \ne f(x_{N+m})] = \frac{1}{M} \cdot M \cdot \frac{1}{2} = \frac{1}{2}.$$
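A tiny enumeration (an illustration, not part of the solution) confirms this: averaging over all equally likely target labelings of the M off-training points, any fixed hypothesis disagrees with f on half of them on average.

# Average off-training-set error of a fixed h over all 2^M equally likely labelings f.
M <- 3
h_off <- c(1, -1, 1)                                 # arbitrary fixed predictions
f_all <- expand.grid(rep(list(c(-1, 1)), M))         # all 2^M labelings
mean(apply(f_all, 1, function(f) mean(f != h_off)))  # = 0.5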
Problem 1.11
For the supermarket, we have
$$\begin{aligned}
E_{\text{in}}^{(S)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\Big] \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} 10 \cdot [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} [\![h(x_n) \ne -1]\!]\Big],
\end{aligned}$$
and for the CIA we have
$$\begin{aligned}
E_{\text{in}}^{(C)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\Big] \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} 1000 \cdot [\![h(x_n) \ne -1]\!]\Big].
\end{aligned}$$
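These weighted errors are straightforward to compute in code; the small helper below (a sketch, with hypothetical y_pred and y vectors) illustrates the two weightings.

# Weighted in-sample error: false rejects (y = +1 classified as -1) and false
# accepts (y = -1 classified as +1) are penalized with different costs.
weighted_ein <- function(y_pred, y, false_reject_cost, false_accept_cost) {
  mean(ifelse(y == 1,
              false_reject_cost * (y_pred != 1),
              false_accept_cost * (y_pred != -1)))
}
# Supermarket: weighted_ein(y_pred, y, false_reject_cost = 10, false_accept_cost = 1)
# CIA:         weighted_ein(y_pred, y, false_reject_cost = 1, false_accept_cost = 1000)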
Problem 1.12
(a) To minimize $E_{\text{in}}(h) = \sum_{n=1}^N (h - y_n)^2$, we have to find the stationary points of this function. We immediately find that
$$\frac{dE_{\text{in}}(h)}{dh} = 2\sum_{n=1}^N (h - y_n) = 0$$
implies $h = \frac{1}{N}\sum_{n=1}^N y_n = h_{\text{mean}}$. It is actually a minimum, as we have that
$$\frac{d^2 E_{\text{in}}(h)}{dh^2} = 2N > 0.$$
(b) We proceed in the same fashion to minimize $E_{\text{in}}(h) = \sum_{n=1}^N |h - y_n|$, so we get that
$$\frac{dE_{\text{in}}(h)}{dh} = \sum_{n=1}^N \operatorname{sign}(h - y_n) = 0$$
implies $h = \text{median}\{y_1, \cdots, y_N\} = h_{\text{med}}$, as the derivative is equal to zero only if the number of positive terms is equal to the number of negative terms.
(c) In the case where $y_N$ becomes an outlier ($y_N \to \infty$), we get that $h_{\text{mean}} \to \infty$ while $h_{\text{med}}$ remains unchanged.
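A quick numerical illustration (not part of the solution) of the three statements above:

# The mean minimizes the squared error, the median the absolute error, and only
# the mean is dragged away when y_N becomes an outlier.
set.seed(7)
y <- rnorm(11)
sq_err  <- function(h) sum((h - y)^2)
abs_err <- function(h) sum(abs(h - y))
optimize(sq_err, range(y))$minimum   # close to mean(y)
optimize(abs_err, range(y))$minimum  # close to median(y)
y[11] <- 1e6                         # y_N becomes an outlier
mean(y)                              # explodes
median(y)                            # barely moves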