Problems Chap1
Chapter 1
Pierre Paquay
Problem 1.1
Let $B_1$ and $B_2$ be the events "the first picked ball is black" and "the second picked ball is black". We have
$$P(B_2 | B_1) = \frac{P(B_2 \cap B_1)}{P(B_1)},$$
with
$$P(B_2 \cap B_1) = \frac{1}{2} \cdot 1 = \frac{1}{2}$$
(both balls are black only when the all-black bag is chosen) and
$$P(B_1) = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot \frac{1}{2} = \frac{3}{4}.$$
In conclusion, we get
$$P(B_2 | B_1) = \frac{1/2}{3/4} = \frac{2}{3}.$$
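As a quick cross-check (not part of the original solution), a small simulation of the two-bag experiment gives an estimate close to 2/3.

# Simulate the two-bag experiment: bag 1 holds two black balls, bag 2 one black
# and one white ball; we condition on the first drawn ball being black.
set.seed(1)
n <- 100000
bag <- sample(1:2, n, replace = TRUE)
first_is_black <- ifelse(bag == 1, TRUE, runif(n) < 0.5)
second_is_black <- ifelse(bag == 1, TRUE, !first_is_black)
mean(second_is_black[first_is_black])  # close to 2/3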
Problem 1.2
The regions where $h(x) = +1$ and $h(x) = -1$ are separated by the boundary line
$$w_0 + w_1 x_1 + w_2 x_2 = 0.$$
For $w = (1, 2, 3)^T$, we have the graph below.
[Figure: the boundary line in the $(x, y)$ plane for $w = (1, 2, 3)^T$.]
For $w = -(1, 2, 3)^T$, we have the graph below.
[Figure: the boundary line in the $(x, y)$ plane for $w = -(1, 2, 3)^T$.]
We may notice that the lines are identical in these two graphs (which is not that surprising considering that the coefficients are opposites). However, the regions where $h(x) = +1$ and $h(x) = -1$ are swapped: in the first plot the positive region is the one above the line, and in the second plot the positive region is the one below the line.
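For reference, here is a small sketch (not the original plotting code) of how these two boundaries can be drawn with ggplot2; it only assumes the two weight vectors used above.

# Plot the line w0 + w1*x + w2*y = 0 over [-1, 1]^2 for a given weight vector w;
# w and -w give exactly the same line, only the sign of h flips on each side.
library(ggplot2)

plot_boundary <- function(w) {
  ggplot(data.frame(x = c(-1, 1), y = c(-1, 1)), aes(x, y)) +
    geom_blank() +
    geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3]) +
    labs(x = "x", y = "y")
}
plot_boundary(c(1, 2, 3))   # h(x) = +1 above the line
plot_boundary(-c(1, 2, 3))  # same line, h(x) = +1 below it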
Problem 1.3
(a) Since the data set is linearly separable, there exists a weight vector $w^*$ such that
$$y_n = \operatorname{sign}(w^{*T} x_n),$$
which translates to
$$y_n (w^{*T} x_n) > 0$$
for all $n = 1, \cdots, N$. This implies that
$$\rho = \min_{1 \le n \le N} y_n (w^{*T} x_n) > 0.$$
It remains to prove that $w^T(t) w^* \ge t\rho$; to do this, we proceed by induction on $t$. If $t = 0$, we obviously get that $0 \cdot w^* \ge 0$. If the claim is true for $t - 1$, let us prove it for $t$. If we use the first part of point (b), we have that
$$w^T(t) w^* \ge w^T(t-1) w^* + \rho \ge (t-1)\rho + \rho = t\rho.$$
Moreover, a similar induction shows that $\|w(t)\| \le \sqrt{t}\,R$, so that
$$\frac{w^T(t) w^*}{\|w(t)\|} \ge \frac{t\rho}{\sqrt{t}\,R} = \sqrt{t}\,\frac{\rho}{R}.$$
Since, by the Cauchy-Schwarz inequality, the left-hand side is at most $\|w^*\|$, we conclude that
$$t \le \frac{R^2 \|w^*\|^2}{\rho^2}.$$
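As a sanity check of this bound (an illustration, not part of the proof), we can compare the number of PLA updates with $R^2\|w^*\|^2/\rho^2$ on a small synthetic data set; the data set and the separating $w^*$ below are arbitrary choices.

# Empirical check of t <= R^2 * ||w*||^2 / rho^2 on an arbitrary separable data set.
set.seed(1)
n_pts <- 50
X <- cbind(1, matrix(runif(2 * n_pts, min = -1, max = 1), ncol = 2))  # inputs with x0 = 1
w_star <- c(0.1, 0.6, -0.8)                                           # a separating weight vector
y <- sign(X %*% w_star)

rho <- min(y * (X %*% w_star))     # rho = min_n y_n (w*^T x_n) > 0
R <- max(sqrt(rowSums(X^2)))       # R = max_n ||x_n||
bound <- R^2 * sum(w_star^2) / rho^2

w <- c(0, 0, 0)                    # PLA starting from w(0) = 0
t <- 0
repeat {
  mis <- which(sign(X %*% w) != y)
  if (length(mis) == 0)
    break
  w <- w + y[mis[1]] * X[mis[1], ]
  t <- t + 1
}
c(updates = t, bound = bound)      # the number of updates never exceeds the bound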
Problem 1.4
(a) Below, we generate a linearly separable data set of size 20 and the target function f (in red).
library(ggplot2)

set.seed(101)
# The target weights and the base scatter plot p are defined in a chunk that is not
# echoed in the source; the values below are arbitrary stand-ins so the code runs.
w0 <- 0.1; w1 <- 0.6; w2 <- -0.8
h <- function(x, w) {
  scalar_prod <- cbind(1, x$x1, x$x2) %*% w
  return(as.vector(sign(scalar_prod)))
}
f <- function(x) {
  return(h(x, c(w0, w1, w2)))
}
D <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D <- cbind(D, y = f(D))
# Base scatter plot of D (reconstructed to match the plotting code used later on).
p <- ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none")
p_f <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red")
p_f
[Figure: the data set D and the target function f (in red).]
(b) Below, we plot the training data, the target function f (in red) and the final hypothesis g (in blue)
generated by PLA.
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D, w)
  D_mis <- subset(D, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
p_g <- p_f + geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, the target function f (red) and the final PLA hypothesis g (blue).]
Here, the PLA took 5 iterations before converging. We may notice that although g is pretty close to f , they
are not quite identical.
(c) Below, we repeat what we did in point (b) with another randomly generated data set of size 20.
D1 <- data.frame(x1 = runif(20, min = -1, max = 1), x2 = runif(20, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
# The first line of this plotting call was lost at a page break in the source; it is
# reconstructed here following the pattern used elsewhere in the document.
ggplot(D1, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none") +
  geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
[Figure: the data set of size 20 from (c), f (red) and g (blue).]
In this case, the PLA took 12 iterations (which is greater than in (b)) before converging. We may notice that,
as in point (b), although g is pretty close to f , they are not quite identical.
(d) Below, we repeat what we did in point (b) with another randomly generated data set of size 100.
D1 <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
[Figure: the data set of size 100, f (red) and g (blue).]
In this case, the PLA took 33 iterations (which is greater than in (b) and (c)) before converging. We may notice that here f and g are very close to each other.
(e) Below, we repeat what we did in point (b) with another randomly generated data set of size 1000.
D1 <- data.frame(x1 = runif(1000, min = -1, max = 1), x2 = runif(1000, min = -1, max = 1))
D1 <- cbind(D1, y = f(D1))
iter <- 0
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D1, w)
  D_mis <- subset(D1, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w <- w + c(1, x_t$x1, x_t$x2) * x_t$y
  iter <- iter + 1
}
[Figure: the data set of size 1000, f (red) and g (blue).]
In this case, the PLA took 511 iterations (which is greater than in (b), (c) and (d)) before converging. We may notice that here f and g are nearly indistinguishable.
(f) Here, we randomly generate a linearly separable data set of size 1000 with $x_n \in \mathbb{R}^{10}$.
N <- 10
h <- function(x, w) {
  scalar_prod <- cbind(1, x) %*% w
  return(as.vector(sign(scalar_prod)))
}
w <- runif(N + 1)
f <- function(x) {
  return(h(x, w))
}
# The generation of the data set D2 is not echoed in the source; the two lines below
# reconstruct it (size 1000 as stated in the text; the [-1, 1] range is assumed, as in the previous parts).
D2 <- as.data.frame(matrix(runif(1000 * N, min = -1, max = 1), ncol = N))
D2 <- cbind(D2, y = f(as.matrix(D2[, 1:N])))
iter <- 0
w0 <- rep(0, N + 1)
repeat {
  y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
  D_mis <- subset(D2, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  x_t <- D_mis[1, ]
  w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
  iter <- iter + 1
}
In this case, we may see that the number of iterations is 4326, which is very large; this is a direct consequence of the increase in dimension from 2 to 10.
(g) Below, we repeat the algorithm on the same data set as (f ) for 100 experiments, and we pick x(t) randomly
instead of deterministically.
updates <- rep(0, 100)
for (i in 1:100) {
  # Reset the weight vector and the update counter for each experiment.
  iter <- 0
  w0 <- rep(0, N + 1)
  repeat {
    y_pred <- h(as.matrix(D2[, 1:N]), as.numeric(w0))
    D_mis <- subset(D2, y != y_pred)
    if (nrow(D_mis) == 0)
      break
    x_t <- D_mis[sample(nrow(D_mis), 1), ]
    w0 <- w0 + cbind(1, x_t[, 1:N]) * x_t$y
    iter <- iter + 1
  }
  updates[i] <- iter
}
Now, we plot a histogram of the number of updates that the PLA takes to converge.
ggplot(data.frame(updates), aes(x = updates)) + geom_histogram(bins = 15, col = "black") +
labs(x = "Number of updates", y = "Count")
[Figure: histogram of the number of PLA updates over the 100 experiments.]
We may see that the number of updates needed to converge varies considerably from one randomized run to another.
(h) As we saw above, the more data points N we have, the more accurately g approximates f, but the longer the PLA takes to converge. Moreover, the greater the dimension d becomes, the greater the running time gets as well.
Problem 1.5
(a) Below, we generate a training data set of size 100 and a test data set of size 10000. We also plot the target function f (in red) and the final hypothesis g (in blue) generated by Adaline [we use an η value of 5 instead of 100 to keep the computed values manageable].
set.seed(1975)
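# Note: the target weights w0, w1, w2 and the base scatter plot p used further down
# are assumed to be defined in an earlier chunk that is not echoed here (as in Problem 1.4).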
h <- function(x, w) {
  scalar_prod <- cbind(1, x$x1, x$x2) %*% w
  return(as.vector(sign(scalar_prod)))
}
f <- function(x) {
  return(h(x, c(w0, w1, w2)))
}
D_train <- data.frame(x1 = runif(100, min = -1, max = 1), x2 = runif(100, min = -1, max = 1))
D_train <- cbind(D_train, y = f(D_train))
D_test <- data.frame(x1 = runif(10000, min = -1, max = 1), x2 = runif(10000, min = -1, max = 1))
D_test <- cbind(D_test, y = f(D_test))
iter <- 0
eta <- 5
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D_train, w)
  D_mis <- subset(D_train, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  obs_t <- D_mis[sample(nrow(D_mis), 1), ]
  x_t <- c(1, as.numeric(obs_t[1:2]))
  y_t <- as.numeric(obs_t[3])
  s_t <- sum(w * x_t)
  if (y_t * s_t <= 1)
    w <- w + eta * (y_t - s_t) * x_t
  iter <- iter + 1
  if (iter == 1000)
    break
}
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, the target f (red) and the Adaline hypothesis g (blue) for η = 5.]
We have a classification error rate of 2.23% on the test set.
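The line that computes this error rate is not echoed in the source; it presumably amounts to something like the following (using the h, w and D_test defined above).

# Fraction of test points misclassified by the final hypothesis g.
mean(h(D_test, w) != D_test$y)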
(b) Now we repeat everything we did in (a) with η = 1.
iter <- 0
eta <- 1
w <- c(0, 0, 0)
repeat {
  y_pred <- h(D_train, w)
  D_mis <- subset(D_train, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  obs_t <- D_mis[sample(nrow(D_mis), 1), ]
  x_t <- c(1, as.numeric(obs_t[1:2]))
  y_t <- as.numeric(obs_t[3])
  s_t <- sum(w * x_t)
  if (y_t * s_t <= 1)
    w <- w + eta * (y_t - s_t) * x_t
  iter <- iter + 1
  if (iter == 1000)
    break
}
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 1.]
(c) Now we repeat everything we did in (a) with η = 0.01 [the training loop, which is analogous to the one above, is not echoed here].
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 0.01.]
(d) Finally, we repeat everything with η = 0.0001 [again, the training loop is not echoed here].
p <- ggplot(D_train, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none")
p_g <- p + geom_abline(slope = -w1 / w2, intercept = -w0 / w2, colour = "red") +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3], colour = "blue")
p_g
[Figure: the training data, f (red) and g (blue) for η = 0.0001.]
Problem 1.6
(b) For 1000 independent samples of 10 marbles each, we have
$$\begin{aligned}
P(\text{at least one sample has } \nu = 0) &= 1 - P(\nu_i > 0 \ \forall i) \\
&= 1 - \prod_{i=1}^{1000} P(\nu_i > 0) \\
&= 1 - \prod_{i=1}^{1000} \left[1 - P(\nu_i = 0)\right] \\
&= 1 - \prod_{i=1}^{1000} \left[1 - (1 - \mu)^{10}\right] \\
&= 1 - \left[1 - (1 - \mu)^{10}\right]^{1000}.
\end{aligned}$$
This gives us 1 when µ = 0.05, 0.6235762 when µ = 0.5, and $1.0239476 \times 10^{-4}$ when µ = 0.8.
(c) Here, we repeat (b) for 1000000 independent samples. In that case, we obtain
$$P(\text{at least one sample has } \nu = 0) = 1 - \left[1 - (1 - \mu)^{10}\right]^{1000000},$$
which gives us 1 when µ = 0.05, 1 when µ = 0.5, and 0.0973316 when µ = 0.8.
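The values above come from plugging µ into this formula; a small helper (a sketch, not the original code) reproduces them.

# P(at least one of M independent samples of 10 marbles has nu = 0).
p_at_least_one_zero <- function(mu, M) 1 - (1 - (1 - mu)^10)^M

p_at_least_one_zero(c(0.05, 0.5, 0.8), 1000)     # ~ 1, 0.6235762, 1.0239476e-04
p_at_least_one_zero(c(0.05, 0.5, 0.8), 1000000)  # ~ 1, 1, 0.0973316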
Problem 1.7
(a) First we treat the case where µ = 0.05. For one coin, we have
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.05)^{10} = 0.5987369;$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.05)^{10}\right]^{1000} = 1;$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.05)^{10}\right]^{1000000} = 1.$$
We repeat the same reasoning for µ = 0.8. For one coin, we have
$$P(\text{at least one coin has } \nu = 0) = (1 - 0.8)^{10} = 1.024 \times 10^{-7};$$
for 1000 coins, we have
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.8)^{10}\right]^{1000} = 1.0239476 \times 10^{-4};$$
and finally for 1000000 coins we get
$$P(\text{at least one coin has } \nu = 0) = 1 - \left[1 - (1 - 0.8)^{10}\right]^{1000000} = 0.0973316.$$
(b) Here, we consider N = 6, two coins, and µ = 0.5. If we use the Hoeffding Inequality together with the union bound over the two coins, we obtain
$$P\Big[\max\big(|\nu_1 - \mu|, |\nu_2 - \mu|\big) > \epsilon\Big] \le 2 \cdot 2e^{-2\epsilon^2 N} = 4e^{-12\epsilon^2}.$$
Below, we plot the above probability together with its Hoeffding inequality bound for ε in the range [0, 1].
N <- 6
mu <- 0.5
max_abs <- function(epsilon) {
  c1 <- sample(c(0, 1), N, replace = TRUE)
  nu1 <- mean(c1)
  c2 <- sample(c(0, 1), N, replace = TRUE)
  nu2 <- mean(c2)
  m <- max(abs(nu1 - mu), abs(nu2 - mu))
  # The end of this chunk is cut off in the source; returning the indicator that the
  # largest deviation exceeds epsilon is the natural completion.
  return(m > epsilon)
}
[Figure: the estimated probability and the Hoeffding bound (Proba) as functions of epsilon.]
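The code that produced the plot above is not shown in the source; a minimal sketch of one way to obtain it, reusing N, mu and the max_abs helper defined above, could look like this.

# Monte Carlo estimate of P[max(|nu1 - mu|, |nu2 - mu|) > epsilon] together with
# the (union-bound) Hoeffding bound 2 * 2 * exp(-2 * N * epsilon^2).
library(ggplot2)

eps <- seq(0, 1, by = 0.01)
proba <- sapply(eps, function(e) mean(replicate(2000, max_abs(e))))
bound <- 2 * 2 * exp(-2 * N * eps^2)

curves <- data.frame(epsilon = rep(eps, 2),
                     Proba = c(proba, bound),
                     curve = rep(c("estimated probability", "Hoeffding bound"),
                                 each = length(eps)))
ggplot(curves, aes(x = epsilon, y = Proba, colour = curve)) + geom_line() +
  labs(x = "epsilon", y = "Proba")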
Problem 1.8
(a) If $t$ is a non-negative random variable and $\alpha > 0$, then
$$E(t) \ge E\big(t \,[\![t \ge \alpha]\!]\big) \ge \alpha\, E\big([\![t \ge \alpha]\!]\big) = \alpha\, P(t \ge \alpha),$$
which proves the Markov Inequality.
(b) Here $u$ is a random variable with mean µ and variance σ² and $\alpha > 0$. If we consider the non-negative random variable $(u - \mu)^2$, the Markov Inequality tells us that
$$P[(u - \mu)^2 \ge \alpha] \le \frac{E[(u - \mu)^2]}{\alpha} = \frac{\sigma^2}{\alpha},$$
which proves the Chebyshev Inequality.
(c) Now $u_1, \cdots, u_N$ are iid random variables, each with mean µ and variance σ², $u = \frac{1}{N}\sum_{n=1}^N u_n$, and $\alpha > 0$. We consider the random variable $u$; its expectation is
$$E(u) = \frac{1}{N}\sum_{n=1}^N \mu = \mu$$
and, by independence, its variance is
$$\mathrm{Var}(u) = \frac{1}{N^2}\sum_{n=1}^N \sigma^2 = \frac{\sigma^2}{N}.$$
The Chebyshev Inequality then gives
$$P[(u - \mu)^2 \ge \alpha] \le \frac{\sigma^2}{N\alpha}.$$
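As a quick illustration of this bound (not part of the proof), we can check it by simulation for uniform(0, 1) variables, where µ = 1/2 and σ² = 1/12.

# Empirical P[(u_bar - mu)^2 >= alpha] versus the Chebyshev bound sigma^2 / (N * alpha).
set.seed(42)
n <- 20
alpha <- 0.01
u_bar <- replicate(100000, mean(runif(n)))
mean((u_bar - 0.5)^2 >= alpha)  # empirical probability (well below the bound)
(1 / 12) / (n * alpha)          # Chebyshev bound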
Problem 1.9
(a) Let $t$ be a finite random variable, $\alpha > 0$ and $s > 0$; since $e^{st}$ is a non-negative random variable and $t \ge \alpha$ exactly when $e^{st} \ge e^{s\alpha}$, the Markov Inequality gives
$$P(t \ge \alpha) = P(e^{st} \ge e^{s\alpha}) \le e^{-s\alpha} E(e^{st}).$$
(b) With $u = \frac{1}{N}\sum_{n=1}^N u_n$, where the $u_n$ are iid with $U(s) = E(e^{s u_n})$, part (a) gives
$$P(u \ge \alpha) = P(Nu \ge N\alpha) \le e^{-sN\alpha} E(e^{sNu}) = e^{-sN\alpha} \prod_{n=1}^N E(e^{s u_n}) = \left(e^{-s\alpha} U(s)\right)^N.$$
(c) If $P[u_n = 0] = P[u_n = 1] = \frac{1}{2}$, then $U(s) = \frac{1 + e^s}{2}$; letting $f(s) = e^{-s\alpha} U(s)$, we get immediately
$$\frac{df}{ds} = \frac{e^{-s\alpha}}{2}\left[(1 - \alpha)e^s - \alpha\right],$$
which has a root for $s = \ln\frac{\alpha}{1 - \alpha}$, and
$$\frac{d^2 f}{ds^2} = \frac{e^{-s\alpha}}{2}\left[(1 - \alpha)^2 e^s + \alpha^2\right] \ge 0.$$
So, we conclude that $s = \ln\frac{\alpha}{1 - \alpha}$ is a minimum of $f(s)$.
(d) If we now take $\alpha = \frac{1}{2} + \epsilon$ and plug in the optimal $s = \ln\frac{1/2 + \epsilon}{1/2 - \epsilon}$ from (c), we obtain
$$\begin{aligned}
P\left(u \ge \frac{1}{2} + \epsilon\right) &\le \left(e^{-s(\frac{1}{2} + \epsilon)} U(s)\right)^N \\
&\le \min_s \left(e^{-s(\frac{1}{2} + \epsilon)} U(s)\right)^N \\
&= \left(e^{-\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\,(\frac{1}{2} + \epsilon)}\; U\!\left(\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\right)\right)^N \\
&= \left(e^{-\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}\,(\frac{1}{2} + \epsilon)}\; \frac{1}{2}\left(1 + e^{\ln\frac{1/2 + \epsilon}{1/2 - \epsilon}}\right)\right)^N \\
&= \left[\frac{1}{2}\left(\frac{1/2 - \epsilon}{1/2 + \epsilon}\right)^{1/2 + \epsilon}\left(1 + \frac{1/2 + \epsilon}{1/2 - \epsilon}\right)\right]^N \\
&= \left[\frac{1}{2}\left(\frac{1/2 - \epsilon}{1/2 + \epsilon}\right)^{1/2 + \epsilon}\left(\frac{1}{2} - \epsilon\right)^{-1}\right]^N \\
&= \left[2^{-1}\,\frac{1}{(1/2 + \epsilon)^{1/2 + \epsilon}}\,\frac{1}{(1/2 - \epsilon)^{1/2 - \epsilon}}\right]^N \\
&= \left[2^{-1 - \log_2 (1/2 + \epsilon)^{1/2 + \epsilon} - \log_2 (1/2 - \epsilon)^{1/2 - \epsilon}}\right]^N \\
&= 2^{-\beta N},
\end{aligned}$$
with $\beta = 1 + \left(\frac{1}{2} + \epsilon\right)\log_2\left(\frac{1}{2} + \epsilon\right) + \left(\frac{1}{2} - \epsilon\right)\log_2\left(\frac{1}{2} - \epsilon\right)$.
It remains to see that β > 0; to do that, we plot the graph of β for 0 < ε < 1/2 below.
[Figure: β as a function of ε on (0, 1/2).]
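The plotting code is not shown in the source; β can be computed and plotted directly from its definition, for instance as follows.

# beta = 1 + (1/2 + eps) * log2(1/2 + eps) + (1/2 - eps) * log2(1/2 - eps) on (0, 1/2).
library(ggplot2)

eps <- seq(0.001, 0.499, by = 0.001)
beta <- 1 + (1 / 2 + eps) * log2(1 / 2 + eps) + (1 / 2 - eps) * log2(1 / 2 - eps)
ggplot(data.frame(epsilon = eps, beta = beta), aes(x = epsilon, y = beta)) +
  geom_line() + labs(x = "epsilon", y = "beta")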
Problem 1.10
which is equivalent to the fact that the number of points with $h(x_{N+m}) \ne f(x_{N+m})$ must be equal to $k$. And the number of ways we can have $k$ errors in $M$ trials is $C_M^k$ (the binomial coefficient).
(d) We may write that
$$E_f[E_{\text{off}}(h, f)] = \frac{1}{M}\, E_f[\text{Number of } h(x_{N+m}) \ne f(x_{N+m})] = \frac{1}{M} \cdot M \cdot \frac{1}{2} = \frac{1}{2}.$$
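A tiny enumeration (an illustration, not part of the solution) confirms this: averaging over all equally likely target labelings of the M off-training points, any fixed hypothesis disagrees with f on half of them on average.

# Average off-training-set error of a fixed h over all 2^M equally likely labelings f.
M <- 3
h_off <- c(1, -1, 1)                                 # arbitrary fixed predictions
f_all <- expand.grid(rep(list(c(-1, 1)), M))         # all 2^M labelings
mean(apply(f_all, 1, function(f) mean(f != h_off)))  # = 0.5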
Problem 1.11
For the supermarket, we have
$$\begin{aligned}
E_{\text{in}}^{(S)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\Big] \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} 10 \cdot [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} [\![h(x_n) \ne -1]\!]\Big],
\end{aligned}$$
and for the CIA we have
$$\begin{aligned}
E_{\text{in}}^{(C)}(h) &= \frac{1}{N}\sum_{n=1}^N e(h(x_n), f(x_n)) \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} e(h(x_n), 1) + \sum_{y_n = -1} e(h(x_n), -1)\Big] \\
&= \frac{1}{N}\Big[\sum_{y_n = 1} [\![h(x_n) \ne 1]\!] + \sum_{y_n = -1} 1000 \cdot [\![h(x_n) \ne -1]\!]\Big].
\end{aligned}$$
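These weighted errors are straightforward to compute in code; the small helper below (a sketch, with hypothetical y_pred and y vectors) illustrates the two weightings.

# Weighted in-sample error: false rejects (y = +1 classified as -1) and false
# accepts (y = -1 classified as +1) are penalized with different costs.
weighted_ein <- function(y_pred, y, false_reject_cost, false_accept_cost) {
  mean(ifelse(y == 1,
              false_reject_cost * (y_pred != 1),
              false_accept_cost * (y_pred != -1)))
}
# Supermarket: weighted_ein(y_pred, y, false_reject_cost = 10, false_accept_cost = 1)
# CIA:         weighted_ein(y_pred, y, false_reject_cost = 1, false_accept_cost = 1000)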
Problem 1.12
(a) To minimize $E_{\text{in}}(h) = \sum_{n=1}^N (h - y_n)^2$, we have to find the stationary points of this function. We immediately find that
$$\frac{dE_{\text{in}}(h)}{dh} = 2\sum_{n=1}^N (h - y_n) = 0$$
implies $h = \frac{1}{N}\sum_{n=1}^N y_n = h_{\text{mean}}$. It is actually a minimum, as we have that
$$\frac{d^2 E_{\text{in}}(h)}{dh^2} = 2N > 0.$$
(b) We proceed in the same fashion to minimize $E_{\text{in}}(h) = \sum_{n=1}^N |h - y_n|$, so we get that
$$\frac{dE_{\text{in}}(h)}{dh} = \sum_{n=1}^N \operatorname{sign}(h - y_n) = 0$$
implies $h = \text{median}\{y_1, \cdots, y_N\} = h_{\text{med}}$, as the derivative is equal to zero only if the number of positive terms is equal to the number of negative terms.
(c) In the case where $y_N$ becomes an outlier ($y_N \to \infty$), we get that $h_{\text{mean}} \to \infty$ while $h_{\text{med}}$ remains unchanged.
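A quick numerical illustration (not part of the solution) of the three statements above:

# The mean minimizes the squared error, the median the absolute error, and only
# the mean is dragged away when y_N becomes an outlier.
set.seed(7)
y <- rnorm(11)
sq_err  <- function(h) sum((h - y)^2)
abs_err <- function(h) sum(abs(h - y))
optimize(sq_err, range(y))$minimum   # close to mean(y)
optimize(abs_err, range(y))$minimum  # close to median(y)
y[11] <- 1e6                         # y_N becomes an outlier
mean(y)                              # explodes
median(y)                            # barely moves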