
Problem Solutions

Chapter 1
Pierre Paquay

Problem 1.1

Let $B_1$ and $B_2$ be the events "the first ball picked is black" and "the second ball picked is black". We have

$$P(B_2 \mid B_1) = \frac{P(B_2 \cap B_1)}{P(B_1)},$$

with

$$P(B_1) = P(B_1 \mid \text{Bag 1})P(\text{Bag 1}) + P(B_1 \mid \text{Bag 2})P(\text{Bag 2}) = 1 \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = \frac{3}{4}$$

and

$$P(B_1 \cap B_2) = P(B_1 \cap B_2 \mid \text{Bag 1})P(\text{Bag 1}) + P(B_1 \cap B_2 \mid \text{Bag 2})P(\text{Bag 2}) = 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{2} = \frac{1}{2}.$$

In conclusion, we get
$$P(B_2 \mid B_1) = \frac{1/2}{3/4} = \frac{2}{3}.$$
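A quick Monte Carlo check of this value can be run along the following lines (a sketch, not part of the original solution; the setup is taken from the problem statement: Bag 1 holds two black balls, Bag 2 one black and one white ball, a bag is picked at random and both of its balls are then drawn).

set.seed(42)
n_sim <- 1e5
bag <- sample(1:2, n_sim, replace = TRUE)
# First ball: always black from Bag 1, black with probability 1/2 from Bag 2.
first_black <- ifelse(bag == 1, TRUE, runif(n_sim) < 0.5)
# Second ball: black from Bag 1; from Bag 2 it is the remaining (other) ball.
second_black <- ifelse(bag == 1, TRUE, !first_black)
mean(second_black[first_black])   # should be close to 2/3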

Problem 1.2

We have $h(x) = \mathrm{sign}(w^T x)$ with $w = (w_0, w_1, w_2)^T$ and $x = (1, x_1, x_2)^T$.


(a) If $h(x) = +1$ (resp. $-1$), then $w^T x > 0$ (resp. $< 0$). So the boundary between these two regions is the line of equation $w^T x = 0$, or more explicitly
$$w_0 + w_1 x_1 + w_2 x_2 = 0.$$
We may also write this equation as
$$x_2 = a x_1 + b,$$
where $a = -w_1/w_2$ and $b = -w_0/w_2$ (assuming $w_2 \neq 0$).
(b) For w = (1, 2, 3)T , we have the graph below.

[Figure: the line $w^T x = 0$ for $w = (1, 2, 3)^T$ over $[-1, 1]^2$.]
For w = −(1, 2, 3)T , we have the graph below.

[Figure: the line $w^T x = 0$ for $w = -(1, 2, 3)^T$ over $[-1, 1]^2$.]
We may notice that the lines are identical in these two graphs (which is not that surprising considering that the two weight vectors are opposites). However, the regions where h(x) = +1 and h(x) = −1 are swapped: in the first plot the positive region is the one above the line, and in the second plot it is the one below the line.
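As a small illustration of this (not part of the original text), the sketch below classifies a grid of points with both weight vectors; h_lin is a hypothetical helper introduced here for the example.

library(ggplot2)

# Hypothetical helper: evaluate h(x) = sign(w0 + w1 * x1 + w2 * x2) on vectors.
h_lin <- function(x1, x2, w) sign(w[1] + w[2] * x1 + w[3] * x2)

grid <- expand.grid(x1 = seq(-1, 1, by = 0.05), x2 = seq(-1, 1, by = 0.05))
grid$pos <- h_lin(grid$x1, grid$x2, c(1, 2, 3))   # labels for w = (1, 2, 3)
grid$neg <- h_lin(grid$x1, grid$x2, -c(1, 2, 3))  # labels for w = -(1, 2, 3)

# Same boundary line x2 = -(2/3) x1 - 1/3 in both cases, opposite labels.
ggplot(grid, aes(x = x1, y = x2, col = as.factor(pos))) + geom_point(size = 0.5) +
  geom_abline(slope = -2 / 3, intercept = -1 / 3) +
  theme(legend.position = "none")
all(grid$neg == -grid$pos)   # TRUE: the two regions are swapped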

Problem 1.3

(a) As every $x_n$ is correctly classified by $w^*$ for all $n = 1, \cdots, N$, we have that
$$y_n = \mathrm{sign}(w^{*T} x_n),$$
which translates to
$$y_n (w^{*T} x_n) > 0$$
for all $n = 1, \cdots, N$. This implies that
$$\rho = \min_n y_n (w^{*T} x_n) > 0.$$

(b) We have that
$$w^T(t)\,w^* = [w(t-1) + y(t-1)x(t-1)]^T w^* = w^T(t-1)\,w^* + y(t-1)\,w^{*T} x(t-1) \ge w^T(t-1)\,w^* + \rho.$$
It remains to prove that $w^T(t)\,w^* \ge t\rho$; to do this we proceed by induction. If $t = 0$, we obviously get that $w^T(0)\,w^* = 0 \ge 0$. If the claim holds for $t-1$, let us prove it for $t$. Using the first part of point (b), we have that
$$w^T(t)\,w^* \ge w^T(t-1)\,w^* + \rho \ge (t-1)\rho + \rho = t\rho.$$

(c) We may write that
$$\begin{aligned}
\|w(t)\|^2 &= \|w(t-1) + y(t-1)x(t-1)\|^2 \\
&= \|w(t-1)\|^2 + \|y(t-1)x(t-1)\|^2 + 2y(t-1)\,w^T(t-1)x(t-1) \\
&\le \|w(t-1)\|^2 + \|y(t-1)x(t-1)\|^2 \\
&= \|w(t-1)\|^2 + \|x(t-1)\|^2,
\end{aligned}$$
as $x(t-1)$ is misclassified by $w(t-1)$ (so $y(t-1)\,w^T(t-1)x(t-1) \le 0$) and $y(t-1)^2 = 1$.


(d) Now we prove by induction that $\|w(t)\|^2 \le t R^2$, where $R = \max_n \|x_n\|$. If $t = 0$, we trivially have that $0 \le 0 \cdot R^2$. If the claim holds for $t-1$, let us prove it for $t$. Because of point (c), we may write that
$$\|w(t)\|^2 \le \|w(t-1)\|^2 + \|x(t-1)\|^2 \le (t-1)R^2 + R^2 = tR^2.$$

(e) By using points (b) and (d), we get that
$$\frac{w^T(t)\,w^*}{\|w(t)\|} \ge \frac{t\rho}{\sqrt{t}\,R} = \sqrt{t}\,\frac{\rho}{R}.$$
By using the inequality above, we may write that
$$t \le \frac{R^2}{\rho^2}\,\frac{(w^T(t)\,w^*)^2}{\|w(t)\|^2} = \frac{R^2}{\rho^2}\,\frac{(w^T(t)\,w^*)^2}{\|w(t)\|^2\,\|w^*\|^2}\,\|w^*\|^2.$$
However, we have that
$$\frac{(w^T(t)\,w^*)^2}{\|w(t)\|^2\,\|w^*\|^2} \le 1,$$
as
$$(w^T(t)\,w^*)^2 \le \|w(t)\|^2\,\|w^*\|^2$$
by the Cauchy-Schwarz inequality. In conclusion, we get that
$$t \le \frac{R^2\,\|w^*\|^2}{\rho^2}.$$
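As a purely numerical illustration of this bound (not part of the proof), the sketch below builds a small separable data set from a made-up separator w_star, runs PLA on it, and checks that the number of updates stays below R^2 ||w*||^2 / rho^2.

set.seed(1)
N <- 50
X <- cbind(1, runif(N, -1, 1), runif(N, -1, 1))   # rows are x_n = (1, x1, x2)
w_star <- c(0.1, 2, -1)                           # made-up separating weights
y <- sign(X %*% w_star)

rho <- min(y * (X %*% w_star))
R <- max(sqrt(rowSums(X^2)))

# Standard PLA: update on the first misclassified point until convergence.
w <- c(0, 0, 0)
n_updates <- 0
repeat {
  mis <- which(sign(X %*% w) != y)
  if (length(mis) == 0)
    break
  n <- mis[1]
  w <- w + y[n] * X[n, ]
  n_updates <- n_updates + 1
}

c(updates = n_updates, bound = R^2 * sum(w_star^2) / rho^2)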

Problem 1.4

(a) Below, we generate a linearly separable data set of size 20 and plot it together with the target function f (in red).
set.seed(101)

h <- function(x, w) {
scalar_prod <- cbind(1, x$x1, x$x2) %*% w

return(as.vector(sign(scalar_prod)))
}

w0 <- runif(1, min = -999, max = 999)
w1 <- runif(1, min = -999, max = 999)
w2 <- runif(1, min = -999, max = 999)

f <- function(x) {
return(h(x, c(w0, w1, w2)))
}

D <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D <- cbind(D, y = f(D))

p <- ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none")

p_f <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red")
p_f

[Figure: the 20 data points of D colored by class, with the target function f shown in red.]
(b) Below, we plot the training data, the target function f (in red) and the final hypothesis g (in blue)
generated by PLA.
iter <- 0
w <- c(0, 0, 0)

repeat {
y_pred <- h(D, w)
D_mis <- subset(D, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[1, ]
w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
iter <- iter + 1
}

p_g <- p_f + geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g

[Figure: the data set D with the target function f (red) and the final PLA hypothesis g (blue).]
Here, the PLA took 5 iterations before converging. We may notice that although g is pretty close to f , they
are not quite identical.
(c) Below, we repeat what we did in point (b) with another randomly generated data set of size 20.
D1 <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))

iter <- 0
w <- c(0, 0, 0)
repeat {
y_pred <- h(D1, w)
D_mis <- subset(D1, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[1, ]
w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
iter <- iter + 1
}

ggplot(D1, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +

theme(legend.position = "none") +
geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")

[Figure: the new size-20 data set with f (red) and g (blue).]
In this case, the PLA took 12 iterations (which is greater than in (b)) before converging. We may notice that,
as in point (b), although g is pretty close to f , they are not quite identical.
(d) Below, we repeat what we did in point (b) with another randomly generated data set of size 100.
D1 <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))

iter <- 0
w <- c(0, 0, 0)
repeat {
y_pred <- h(D1, w)
D_mis <- subset(D1, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[1, ]
w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
iter <- iter + 1
}

ggplot(D1, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none") +
geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")

[Figure: the size-100 data set with f (red) and g (blue).]
In this case, the PLA took 33 iterations (which is greater than in (b) and (c)) before converging. We may notice that here f and g are very close to each other.
(e) Below, we repeat what we did in point (b) with another randomly generated data set of size 1000.
D1 <- data.frame(x1 = runif(1000, min = -1, max = 1), x2 = runif(1000, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))

iter <- 0
w <- c(0, 0, 0)
repeat {
y_pred <- h(D1, w)
D_mis <- subset(D1, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[1, ]
w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
iter <- iter + 1
}

ggplot(D1, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none") +
geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")

[Figure: the size-1000 data set with f (red) and g (blue).]
In this case, the PLA took 511 iterations (which is greater than in (b), (c) and (d)) before converging. We may notice that here f and g are nearly indistinguishable.
(f) Here, we randomly generate a linearly separable data set of size 1000 with $x_n \in \mathbb{R}^{10}$.
N <- 10

h <- function(x, w) {
scalar_prod <- cbind(1, x) %*% w

return(as.vector(sign(scalar_prod)))
}

w <- runif(N + 1)

f <- function(x) {
return(h(x, w))
}

D2 <- matrix(runif(10000, min = -1, max = 1), ncol = N)


D2 <- cbind(D2, y = f(D2))
D2 <- data.frame(D2)

iter <- 0
w0 <- rep(0, N + 1)
repeat {
y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
D_mis <- subset(D2, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[1, ]
w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
iter <- iter + 1
}

In this case, we may see that the number of iterations is 4326, which is very large; this is a direct consequence of the increase in dimension from 2 to 10.
(g) Below, we repeat the algorithm on the same data set as (f) for 100 experiments, this time picking the misclassified point x(t) at random instead of deterministically.
updates <- rep(0, 100)
for (i in 1:100) {
iter <- 0
w0 <- rep(0, N + 1)
repeat {
y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
D_mis <- subset(D2, y != y_pred)
if (nrow(D_mis) == 0)
break
x_t <- D_mis[sample(nrow(D_mis), 1), ]
w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
iter <- iter + 1
}
updates[i] <- iter
}

Now, we plot a histogram of the number of updates that the PLA takes to converge.
ggplot(data.frame(updates), aes(x = updates)) + geom_histogram(bins = 15, col = "black") +
labs(x = "Number of updates", y = "Count")

[Figure: histogram of the number of updates over the 100 runs.]
We may see that the number of updates needed for convergence varies noticeably from run to run, depending on which misclassified point is picked at each iteration.
(h) As we saw above, the larger the number of data points N, the more accurately g approximates f and the longer the PLA takes to converge. Moreover, the larger the dimension d, the longer the running time as well.

Problem 1.5

(a) Below, we generate a training data set of size 100 and a test data set of size 10000. We also plot the target function f (in red) and the final hypothesis g (in blue) generated by Adaline. [We use an η value of 5 instead of 100 to keep the computed values manageable.]
set.seed(1975)

h <- function(x, w) {
scalar_prod <- cbind(1, x$x1, x$x2) %*% w

return(as.vector(sign(scalar_prod)))
}

w0 <- runif(1, min = -999, max = 999)
w1 <- runif(1, min = -999, max = 999)
w2 <- runif(1, min = -999, max = 999)

f <- function(x) {
return(h(x, c(w0, w1, w2)))
}

D_train <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D_train <- cbind(D_train, y = f(D_train))

D_test <- data.frame(x1 = runif(10000, min = -1, max = 1), x2 = runif(10000, min = -1, max = 1))
D_test <- cbind(D_test, y = f(D_test))

iter <- 0
eta <- 5
w <- c(0, 0, 0)
repeat {
y_pred <- h(D_train, w)
D_mis <- subset(D_train, y != y_pred)
if (nrow(D_mis) == 0)
break
obs_t <- D_mis[sample(nrow(D_mis), 1), ]
x_t <- c(1, as.numeric(obs_t[1:2]))
y_t <- as.numeric(obs_t[3])
s_t <- sum(w * x_t)
if (y_t * s_t <= 1)
w <- w + eta * (y_t - s_t) * x_t
iter <- iter + 1
if (iter == 1000)
break
}

test_error <- mean(h(D_test, w) != D_test$y)

p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none")

p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")

p_g
[Figure: the training data with f (red) and the Adaline hypothesis g (blue) for η = 5.]
We have a classification error rate of 2.23% on the test set.
(b) Now we repeat everything we did in (a) with η = 1.
iter <- 0
eta <- 1
w <- c(0, 0, 0)
repeat {
y_pred <- h(D_train, w)
D_mis <- subset(D_train, y != y_pred)
if (nrow(D_mis) == 0)
break
obs_t <- D_mis[sample(nrow(D_mis), 1), ]
x_t <- c(1, as.numeric(obs_t[1:2]))
y_t <- as.numeric(obs_t[3])
s_t <- sum(w * x_t)
if (y_t * s_t <= 1)
w <- w + eta * (y_t - s_t) * x_t
iter <- iter + 1
if (iter == 1000)
break
}

test_error <- mean(h(D_test, w) != D_test$y)

p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none")

p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g

[Figure: the training data with f (red) and g (blue) for η = 1.]
We may see that the classification error rate has now decreased to 1.23% on the test set.
(c) Now we repeat everything we did in (a) with η = 0.01.
iter <- 0
eta <- 0.01
w <- c(0, 0, 0)
repeat {
y_pred <- h(D_train, w)
D_mis <- subset(D_train, y != y_pred)
if (nrow(D_mis) == 0)
break
obs_t <- D_mis[sample(nrow(D_mis), 1), ]
x_t <- c(1, as.numeric(obs_t[1:2]))
y_t <- as.numeric(obs_t[3])
s_t <- sum(w * x_t)
if (y_t * s_t <= 1)
w <- w + eta * (y_t - s_t) * x_t
iter <- iter + 1
if (iter == 1000)
break
}

test_error <- mean(h(D_test, w) != D_test$y)

p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +


theme(legend.position = "none")

p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g

[Figure: the training data with f (red) and g (blue) for η = 0.01.]
We may see that the classification error rate has now increased to 2.43% on the test set.
(d) Now we repeat everything we did in (a) with η = 0.0001.
iter <- 0
eta <- 0.0001
w <- c(0, 0, 0)
repeat {
y_pred <- h(D_train, w)
D_mis <- subset(D_train, y != y_pred)
if (nrow(D_mis) == 0)
break
obs_t <- D_mis[sample(nrow(D_mis), 1), ]
x_t <- c(1, as.numeric(obs_t[1:2]))
y_t <- as.numeric(obs_t[3])
s_t <- sum(w * x_t)
if (y_t * s_t <= 1)
w <- w + eta * (y_t - s_t) * x_t
iter <- iter + 1
if (iter == 1000)
break
}

test_error <- mean(h(D_test, w) != D_test$y)

p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
theme(legend.position = "none")

p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g

[Figure: the training data with f (red) and g (blue) for η = 0.0001.]
We may see that the classification error rate has now increased to 2.5% on the test set.
(e) We may conclude that the η value that results in the minimum classification error rate on the test set is
actually 1.

Problem 1.6

(a) For one sample we have that
$$P(\nu = 0) = (1 - \mu)^{10}.$$
So for $\mu = 0.05$ we get $P(\nu = 0) = 0.5987369$; for $\mu = 0.5$ we get $P(\nu = 0) = 9.765625 \times 10^{-4}$; and for $\mu = 0.8$ we get $P(\nu = 0) = 1.024 \times 10^{-7}$.
(b) Now, for 1000 independent samples, we have that
$$\begin{aligned}
P(\text{at least one sample has } \nu = 0) &= 1 - P(\nu_i > 0 \ \forall i) \\
&= 1 - \prod_{i=1}^{1000} P(\nu_i > 0) \\
&= 1 - \prod_{i=1}^{1000} [1 - P(\nu_i = 0)] \\
&= 1 - \prod_{i=1}^{1000} [1 - (1 - \mu)^{10}] \\
&= 1 - [1 - (1 - \mu)^{10}]^{1000},
\end{aligned}$$
which gives us (approximately) 1 when $\mu = 0.05$, $0.6235762$ when $\mu = 0.5$, and $1.0239476 \times 10^{-4}$ when $\mu = 0.8$.
(c) Here, we repeat (b) for 1000000 independent samples. In that case, we obtain
$$P(\text{at least one sample has } \nu = 0) = 1 - [1 - (1 - \mu)^{10}]^{1000000},$$
which gives us (approximately) 1 when $\mu = 0.05$, 1 when $\mu = 0.5$, and $0.0973316$ when $\mu = 0.8$.
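These numbers are easy to check directly in R (a quick sketch, not part of the original text; 10 flips per sample, as in the problem statement).

p_nu_zero <- function(mu, n_flips = 10) (1 - mu)^n_flips
p_at_least_one <- function(mu, n_samples, n_flips = 10) {
  1 - (1 - p_nu_zero(mu, n_flips))^n_samples
}

mus <- c(0.05, 0.5, 0.8)
p_nu_zero(mus)                 # part (a)
p_at_least_one(mus, 1000)      # part (b)
p_at_least_one(mus, 1000000)   # part (c)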

Problem 1.7

(a) First we treat the case where $\mu = 0.05$. For one coin, we have that
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.05)^{10} = 0.5987369;$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - [1 - (1 - 0.05)^{10}]^{1000} = 1;$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - [1 - (1 - 0.05)^{10}]^{1000000} = 1.$$
We repeat the same reasoning for $\mu = 0.8$. For one coin, we have that
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.8)^{10} = 1.024 \times 10^{-7};$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - [1 - (1 - 0.8)^{10}]^{1000} = 1.0239476 \times 10^{-4};$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - [1 - (1 - 0.8)^{10}]^{1000000} = 0.0973316.$$

(b) Here, we consider $N = 6$, two coins, and $\mu = 0.5$. If we use the Hoeffding inequality bound (together with the independence of the two coins), we obtain that
$$\begin{aligned}
P\!\left(\max_i |\nu_i - \mu_i| > \epsilon\right) &= P(|\nu_1 - \mu_1| > \epsilon \text{ or } |\nu_2 - \mu_2| > \epsilon) \\
&= P(|\nu_1 - \mu_1| > \epsilon) + P(|\nu_2 - \mu_2| > \epsilon) - P(|\nu_1 - \mu_1| > \epsilon \text{ and } |\nu_2 - \mu_2| > \epsilon) \\
&= P(|\nu_1 - \mu_1| > \epsilon) + P(|\nu_2 - \mu_2| > \epsilon) - P(|\nu_1 - \mu_1| > \epsilon)\,P(|\nu_2 - \mu_2| > \epsilon) \\
&\le 4e^{-12\epsilon^2}.
\end{aligned}$$
Below, we plot the above probability together with its Hoeffding inequality bound for $\epsilon$ in the range $[0, 1]$.

N <- 6
mu <- 0.5
max_abs <- function(epsilon) {
c1 <- sample(c(0, 1), N, replace = TRUE)
nu1 <- mean(c1)
c2 <- sample(c(0, 1), N, replace = TRUE)
nu2 <- mean(c2)
m <- max(abs(nu1 - mu), abs(nu2 - mu))

return(m > epsilon)


}
proba <- function(epsilon) mean(replicate(10000, max_abs(epsilon = epsilon)))
proba <- Vectorize(proba)
ggplot(data.frame(epsilon = c(0, 1)), aes(x = epsilon)) +
stat_function(fun = proba, geom = "line") +
stat_function(fun = function(epsilon) 4 * exp(-12 * epsilon^2), geom = "line", col = "red") +
labs(y = "Proba")

[Figure: simulated P(max_i |ν_i − µ_i| > ε) (black) and the Hoeffding bound 4e^{−12ε²} (red) for ε in [0, 1].]

Problem 1.8

(a) If $t$ is a non-negative random variable and $\alpha > 0$, we may write that
$$\alpha I_{\{t \ge \alpha\}} \le t,$$
as in the case where $t \ge \alpha$ we have $\alpha \cdot 1 \le t$, and in the case where $t < \alpha$ we have $\alpha \cdot 0 \le t$. As expectation is monotone, we get
$$E(\alpha I_{\{t \ge \alpha\}}) \le E(t).$$
As we also have that
$$E(\alpha I_{\{t \ge \alpha\}}) = \alpha P(t \ge \alpha) + 0 \cdot P(t < \alpha) = \alpha P(t \ge \alpha),$$
we finally get
$$P(t \ge \alpha) \le \frac{E(t)}{\alpha},$$
which proves the Markov Inequality.
(b) Here $u$ is a random variable with mean $\mu$ and variance $\sigma^2$, and $\alpha > 0$. If we apply the Markov Inequality to the non-negative random variable $(u - \mu)^2$, we obtain
$$P[(u - \mu)^2 \ge \alpha] \le \frac{E[(u - \mu)^2]}{\alpha} = \frac{\sigma^2}{\alpha},$$
which proves the Chebyshev Inequality.
(c) Now $u_1, \cdots, u_N$ are iid random variables, each with mean $\mu$ and variance $\sigma^2$, $u = \frac{1}{N}\sum_{n=1}^N u_n$ and $\alpha > 0$. We consider the random variable $u$; its expectation is
$$E(u) = \frac{1}{N}\sum_{n=1}^N \mu = \mu$$
and its variance is
$$\mathrm{Var}(u) = \frac{1}{N^2}\sum_{n=1}^N \sigma^2 = \frac{\sigma^2}{N}.$$
It now suffices to apply the Chebyshev Inequality to $u$ to obtain that
$$P[(u - \mu)^2 \ge \alpha] \le \frac{\sigma^2}{N\alpha}.$$
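A small simulation can be used to sanity-check this bound (a sketch, not part of the original solution); here the $u_n$ are taken to be iid Uniform(0, 1), so that $\mu = 1/2$ and $\sigma^2 = 1/12$.

set.seed(7)
N <- 20
alpha <- 0.01
mu <- 1 / 2
sigma2 <- 1 / 12

# Empirical frequency of (u_bar - mu)^2 >= alpha versus the Chebyshev bound.
u_bar <- replicate(1e5, mean(runif(N)))
c(empirical = mean((u_bar - mu)^2 >= alpha),
  chebyshev_bound = sigma2 / (N * alpha))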

Problem 1.9

(a) Let $t$ be a finite random variable, $\alpha > 0$ and $s > 0$. We may write that
$$\begin{aligned}
P(t \ge \alpha) &= P(st \ge s\alpha) \\
&= P(e^{st} \ge e^{s\alpha}) \\
&\le \frac{E(e^{st})}{e^{s\alpha}} = e^{-s\alpha}\,T(s)
\end{aligned}$$
because of the Markov Inequality (applied to the non-negative random variable $e^{st}$).


(b) Here $u_1, \cdots, u_N$ are iid random variables and $u = \frac{1}{N}\sum_{n=1}^N u_n$. We have successively that
$$\begin{aligned}
P(u \ge \alpha) &= P(Nu \ge N\alpha) \\
&\le e^{-sN\alpha}\,E(e^{sNu}) \\
&= e^{-sN\alpha} \prod_{n=1}^N E(e^{su_n}) \\
&= \left(e^{-s\alpha}\,U(s)\right)^N.
\end{aligned}$$

(c) In this case, we may write that
$$U(s) = E(e^{su_n}) = e^{s \cdot 0}\,P(u_n = 0) + e^{s \cdot 1}\,P(u_n = 1) = 1 \cdot \frac{1}{2} + e^s \cdot \frac{1}{2} = \frac{1}{2}(1 + e^s).$$
Let
$$f(s) = e^{-s\alpha}\,\frac{1}{2}(1 + e^s);$$
we get immediately
$$\frac{df}{ds} = \frac{e^{-s\alpha}}{2}\left[(1 - \alpha)e^s - \alpha\right],$$
which has a root for $s = \ln(\alpha/(1 - \alpha))$, and
$$\frac{d^2 f}{ds^2} = \frac{e^{-s\alpha}}{2}\left[(\alpha - 1)^2 e^s + \alpha^2\right] \ge 0.$$
So, we conclude that $s = \ln(\alpha/(1 - \alpha))$ is a minimum of $f(s)$.
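This minimizer is easy to verify numerically (a quick sketch, not part of the original text; $\alpha = 0.7$ is an arbitrary example value).

alpha <- 0.7
f_s <- function(s) exp(-s * alpha) * (1 + exp(s)) / 2
optimize(f_s, interval = c(0, 10))$minimum   # numerical minimizer of f(s)
log(alpha / (1 - alpha))                     # closed-form minimizer ln(alpha / (1 - alpha))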

1/2 +  1/2 + 
− ln (1/2 + )U (ln )
1/2 −  1/2 − 

(d) First, we notice that
$$E(u) = \frac{1}{N}\sum_{n=1}^N E(u_n) = \frac{1}{N}\sum_{n=1}^N \left(0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{2}\right) = \frac{1}{2}.$$
Now let $0 < \epsilon < \frac{1}{2}$. Since the bound from (b) holds for every $s > 0$, we may write that
$$\begin{aligned}
P\!\left(u \ge \frac{1}{2} + \epsilon\right) &\le \min_{s > 0}\left(e^{-s(1/2+\epsilon)}\,U(s)\right)^N \\
&= \left(e^{-(1/2+\epsilon)\ln\frac{1/2+\epsilon}{1/2-\epsilon}}\,U\!\left(\ln\frac{1/2+\epsilon}{1/2-\epsilon}\right)\right)^N \\
&= \left(e^{-(1/2+\epsilon)\ln\frac{1/2+\epsilon}{1/2-\epsilon}}\,\frac{1}{2}\left(1 + e^{\ln\frac{1/2+\epsilon}{1/2-\epsilon}}\right)\right)^N \\
&= \left[\frac{1}{2}\left(\frac{1/2-\epsilon}{1/2+\epsilon}\right)^{1/2+\epsilon}\left(1 + \frac{1/2+\epsilon}{1/2-\epsilon}\right)\right]^N \\
&= \left[\frac{1}{2}\left(\frac{1/2-\epsilon}{1/2+\epsilon}\right)^{1/2+\epsilon}\left(\frac{1}{2}-\epsilon\right)^{-1}\right]^N \\
&= \left[2^{-1}\,\frac{1}{(1/2+\epsilon)^{1/2+\epsilon}}\,\frac{1}{(1/2-\epsilon)^{1/2-\epsilon}}\right]^N \\
&= \left[2^{-1-\log_2(1/2+\epsilon)^{1/2+\epsilon}-\log_2(1/2-\epsilon)^{1/2-\epsilon}}\right]^N \\
&= 2^{-N[1+(1/2+\epsilon)\log_2(1/2+\epsilon)+(1/2-\epsilon)\log_2(1/2-\epsilon)]} = 2^{-\beta N}.
\end{aligned}$$
It remains to see that $\beta > 0$; to do that, we plot the graph of $\beta$ for $0 < \epsilon < 1/2$ below.

[Figure: β as a function of ε for 0 < ε < 1/2.]
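The plot above can be reproduced with a short chunk along the following lines (a sketch in the style of the earlier chunks).

library(ggplot2)

beta_eps <- function(epsilon) {
  1 + (1 / 2 + epsilon) * log2(1 / 2 + epsilon) +
    (1 / 2 - epsilon) * log2(1 / 2 - epsilon)
}
# Stay strictly inside (0, 1/2) to avoid 0 * log2(0) at the right endpoint.
ggplot(data.frame(epsilon = c(0.001, 0.499)), aes(x = epsilon)) +
  stat_function(fun = beta_eps, geom = "line") +
  labs(x = "epsilon", y = "beta")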

Problem 1.10

(a) In this case, we may write that
$$E_{\text{off}}(h, f) = \frac{1}{M}\sum_{m=1}^M [\![h(x_{N+m}) \ne f(x_{N+m})]\!] = \frac{1}{M}\,[\text{number of odd } m \text{ between } 1 \text{ and } M],$$
which is equal to $1/2$ if $M$ is even, and to $1/2 + 1/(2M)$ if $M$ is odd.


(b) If D is fixed of size $N$, there are $2^M$ target functions $f$ that can generate D in a noiseless setting (one free choice of label for each of the $M$ points outside D).
(c) We must have that
$$\sum_{m=1}^M [\![h(x_{N+m}) \ne f(x_{N+m})]\!] = k,$$
which is equivalent to the fact that the number of points with $h(x_{N+m}) \ne f(x_{N+m})$ must be equal to $k$. And the number of ways we can have $k$ errors among the $M$ points is $\binom{M}{k}$ (the binomial coefficient).
(d) We may write that
$$E_f[E_{\text{off}}(h, f)] = \frac{1}{M}\,E_f[\text{number of } h(x_{N+m}) \ne f(x_{N+m})] = \frac{1}{M} \cdot M \cdot \frac{1}{2} = \frac{1}{2},$$
as the number of disagreements follows a binomial distribution with parameters $M$ and $1/2$ when $f$ is drawn uniformly among the $2^M$ possible target functions.
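This can be checked by brute force over all target functions (a sketch, not part of the original solution; M and the hypothesis' predictions on the off-training points are made up for the example).

M <- 8
h_off <- rep(1, M)                                       # h's predictions on the M test points
all_f <- as.matrix(expand.grid(rep(list(c(-1, 1)), M)))  # all 2^M labelings of the test points
E_off <- rowMeans(all_f != matrix(h_off, nrow(all_f), M, byrow = TRUE))
mean(E_off)                                              # equals 1/2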

Problem 1.11

For the supermarket, we have
$$\begin{aligned}
E_{\text{in}}^{(S)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\left[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\right] \\
&= \frac{1}{N}\left[\sum_{y_n = 1} 10 \cdot [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} [\![h(x_n) \ne -1]\!]\right].
\end{aligned}$$
And for the CIA, we have
$$\begin{aligned}
E_{\text{in}}^{(C)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\left[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\right] \\
&= \frac{1}{N}\left[\sum_{y_n = 1} [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} 1000 \cdot [\![h(x_n) \ne -1]\!]\right].
\end{aligned}$$
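These two weighted error measures are straightforward to compute in R (a sketch, not from the original text; the labels, predictions, and the cost_fr / cost_fa argument names are made up).

# cost_fr: cost of rejecting a true customer (y = +1 classified as -1);
# cost_fa: cost of accepting an intruder (y = -1 classified as +1).
weighted_ein <- function(y_pred, y_true, cost_fr, cost_fa) {
  mean(ifelse(y_true == 1, cost_fr * (y_pred != 1), cost_fa * (y_pred != -1)))
}

y_true <- c(1, 1, -1, -1, 1)
y_pred <- c(1, -1, -1, 1, 1)
weighted_ein(y_pred, y_true, cost_fr = 10, cost_fa = 1)     # supermarket
weighted_ein(y_pred, y_true, cost_fr = 1, cost_fa = 1000)   # CIA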

Problem 1.12
(a) To minimize $E_{\text{in}}(h) = \sum_{n=1}^N (h - y_n)^2$, we have to find the stationary points of this function. We immediately find that
$$\frac{dE_{\text{in}}(h)}{dh} = 2\sum_{n=1}^N (h - y_n) = 0$$
implies $h = \frac{1}{N}\sum_{n=1}^N y_n = h_{\text{mean}}$. It is actually a minimum as we have that
$$\frac{d^2 E_{\text{in}}(h)}{dh^2} = 2N > 0.$$
(b) We proceed in the same fashion to minimize $E_{\text{in}}(h) = \sum_{n=1}^N |h - y_n|$, so we get that
$$\frac{dE_{\text{in}}(h)}{dh} = \sum_{n=1}^N \mathrm{sign}(h - y_n) = 0$$
(for $h \ne y_n$) implies $h = \mathrm{median}\{y_1, \cdots, y_N\} = h_{\text{med}}$, as the derivative is equal to zero only if the number of positive terms is equal to the number of negative terms.
(c) In the case where $y_N$ becomes an outlier ($y_N \to \infty$), we get that $h_{\text{mean}} \to \infty$ while $h_{\text{med}}$ remains unchanged.
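A tiny numerical illustration of this robustness (a sketch, not part of the original solution; the values are made up):

y <- c(1.2, 0.8, 1.1, 0.9, 1.5)
y_outlier <- c(y[-length(y)], 1e6)   # the last value becomes an outlier

c(h_mean = mean(y), h_med = median(y))
c(h_mean = mean(y_outlier), h_med = median(y_outlier))   # mean explodes, median is unchanged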
