ProbList2 24 SLN
Exercise 1
This is an exercise on the use of plot and its arguments. You will use the data in the file prob1.csv.
(1) Read the data in the file prob1.csv into a data frame named data1. Use str to explore the structure of
data1. If the variable class is of mode chr, change it to a factor.
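data1 <- read.csv('prob1.csv', header = TRUE)
# class must be a factor for the boxplots and for coloring by class below
data1$class <- as.factor(data1$class)
str(data1)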
(2) Divide the plotting window into four sectors and draw boxplots for the four numerical variables according
to class. Comment on what you observe.
par(mfrow = c(2,2))
plot(var1 ~ class, data = data1, col = 'wheat')
plot(var2 ~ class, data = data1, col = 'wheat')
plot(var3 ~ class, data = data1, col = 'wheat')
plot(var4 ~ class, data = data1, col = 'wheat')
[Figure: boxplots of var1, var2, var3, and var4 by class (A, B), in a 2 x 2 layout.]
par(mfrow = c(1,1))
For var1 the values cover approximately the same range in both classes, and there seem to be no important
differences between them. For var2, on the other hand, the differences are marked: the boxes
(representing the central 50% of the data) are disjoint, and the range of values for class B is much narrower
than for class A. For var3 the values for class A are lower than for class B, but the central boxes overlap
considerably. Finally, for var4 the situation is similar to that of var1.
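The visual reading of the boxplots can be checked numerically. The sketch below is only an illustration on simulated stand-in data: the data frame toy and its values are made up, since prob1.csv is not reproduced here. It computes the quartiles that delimit each box and tests whether the two boxes are disjoint.

```r
# Simulated stand-in for data1 (hypothetical values, not the real prob1.csv).
set.seed(1)
toy <- data.frame(
  class = factor(rep(c("A", "B"), each = 50)),
  var2  = c(rnorm(50, mean = 5, sd = 8), rnorm(50, mean = -8, sd = 2))
)

# First and third quartiles of var2 within each class: the edges of the boxes.
# (boxplot() draws hinges, which can differ slightly from quantile() output.)
box_edges <- sapply(split(toy$var2, toy$class), quantile, probs = c(0.25, 0.75))
box_edges

# The boxes are disjoint when one box ends before the other begins.
boxes_disjoint <- box_edges["75%", "B"] < box_edges["25%", "A"] ||
  box_edges["75%", "A"] < box_edges["25%", "B"]
boxes_disjoint
```

With the real data, the same two steps applied to data1$var2 and data1$class would give the numbers behind the comment on var2.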
(3) Divide the plotting window into four sectors; on the left column, you will use var1 and on the right
column, var2, while on top, you will use data from class A, and the bottom corresponds to class B.
Plot histograms of relative frequencies in the four windows according to the previous description. Use
the variable name as a label for the x-axis and add a title to each plot. Since you want to compare the
distribution of the variables according to class, the scales on the x-axis should be the same for plots on
the same column. Comment on what you observe.
par(mfcol = c(2,2))
hist(data1$var1[data1$class=='A'], breaks = 10, xlim = c(2, 5), xlab = 'var 1',
main = 'Var1 for class A', col = 'azure2')
hist(data1$var1[data1$class=='B'], breaks = 8, xlim = c(2, 5), xlab = 'var 1',
main = 'Var1 for class B', col = 'azure2')
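# panel for var2, class A (number of breaks left at the default)
hist(data1$var2[data1$class=='A'], xlim = c(-20, 20), xlab = 'var 2',
main = 'Var2 for class A', col = 'azure2')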
hist(data1$var2[data1$class=='B'], breaks = 8, xlim = c(-20, 20), xlab = 'var 2',
main = 'Var2 for class B', col = 'azure2')
[Figure: histograms in a 2 x 2 layout: var 1 (left column) and var 2 (right column), for class A (top row) and class B (bottom row); y-axis: Frequency.]
par(mfrow = c(1,1))
This confirms our observations from (2): the distributions of var1 in classes A and B are very similar, while
those of var2 are very different. For class B, var2 takes only negative values, roughly between -15
and 0, while for class A the values range from -20 to 20.
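When two histograms are to be compared, it can help to give them the same break points as well as the same x-axis limits, and to plot densities so that groups of different sizes are comparable. The following is a minimal sketch on simulated stand-in values (x_a and x_b are hypothetical, not taken from prob1.csv). Note that freq = FALSE puts densities on the y-axis, so it is the bar areas, not the heights, that add up to 1.

```r
# Simulated stand-ins for var2 in each class (hypothetical values).
set.seed(2)
x_a <- rnorm(60, mean = 0, sd = 9)
x_b <- rnorm(60, mean = -7, sd = 3)

# One set of break points spanning both groups, so the bins line up.
common_breaks <- seq(floor(min(x_a, x_b)), ceiling(max(x_a, x_b)),
                     length.out = 13)

par(mfrow = c(2, 1))
h_a <- hist(x_a, breaks = common_breaks, freq = FALSE,
            xlab = "var 2", main = "Class A (simulated)", col = "azure2")
h_b <- hist(x_b, breaks = common_breaks, freq = FALSE,
            xlab = "var 2", main = "Class B (simulated)", col = "azure2")
par(mfrow = c(1, 1))

# With freq = FALSE the bar areas (height times bin width) add up to 1.
area_a <- sum(h_a$density * diff(h_a$breaks))
area_b <- sum(h_b$density * diff(h_b$breaks))
c(area_a, area_b)
```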
(4) Using the function plot, create a matrix of plots for the four numerical variables in prob1. Use a solid
square as the plotting symbol and color by the values of class. Comment on what you observe. Which
variables seem to be related?
plot(data1[,1:4], col = data1$class, pch = 15)
[Figure: scatterplot matrix of var1, var2, var3, and var4, with solid squares colored by class.]
We see that var1 and var3 seem to be linearly related for both classes, having similar slopes. var2 and var4
also seem to have a linear relation, but in this case the slopes for the two classes are very different. The rest
of the variables do not seem to be related.
Exercise 2
For this exercise, we will use the data set Auto in the ISLR package, which has information on fuel
consumption (miles per gallon, mpg) and other variables for 392 different car models.
(1) Use the functions str and help to explore this data set.
library(ISLR)
data(Auto)
str(Auto)
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2
# help(Auto)
The file has information on mpg and 8 other variables for 392 vehicles. The variables are numeric except name
which is a factor. The help page (not shown) gives detailed information about the variables in the data set.
(2) If you use the command unique(Auto$cylinders), you will get the different values for the number of
cylinders in the data set. We will be only interested in cars with 4, 6, or 8 cylinders. Using the function
subset, create a new data frame named Auto_new that only includes cars with the selected number of
cylinders.
unique(Auto$cylinders)
## [1] 8 4 6 3 5
We see that there are cars with 3, 4, 5, 6, and 8 cylinders. We use subset to extract the rows required by
the question.
Auto_new <- subset(Auto, cylinders %in% c(4, 6, 8))
str(Auto_new)
[Figure: mpg against year (70 to 82), plotted with triangles colored by cylinders; legend: 4, 6, 8.]
Blue triangles, corresponding to cars with 4 cylinders, usually have the higher values, indicating better
efficiency. Grey triangles, representing cars with 8 cylinders, are usually at the bottom. There are no grey
triangles for years 80 - 82. The plot shows an increasing trend, indicating improving fuel efficiency.
(4) In some countries, fuel consumption is measured in liters of fuel per 100 kilometers. Using the fact that
one mile = 1.61 kilometers, and one gallon = 3.785 liters, create a new variable in Auto_new called fc
(for fuel consumption), that has fuel consumption measured in liters per 100 kilometers.
km_lt <- Auto_new$mpg*1.61/3.785
Auto_new$fc <- 100/km_lt
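The two conversion steps can be wrapped in a small function, which makes the arithmetic easy to check on a few values. This is a sketch only: the function name mpg_to_l_per_100km and its argument names are made up for illustration, and the constants are the ones given in the question. For example, 20 mpg is 20 * 1.61 / 3.785 km per liter, which corresponds to about 11.75 liters per 100 km.

```r
# Hypothetical helper: convert miles per gallon to liters per 100 km,
# using the conversion factors stated in the exercise.
mpg_to_l_per_100km <- function(mpg, km_per_mile = 1.61, l_per_gallon = 3.785) {
  km_per_l <- mpg * km_per_mile / l_per_gallon  # kilometers driven per liter
  100 / km_per_l                                # liters needed for 100 km
}

example_fc <- mpg_to_l_per_100km(c(10, 20, 40))
example_fc
```

Because the relation is a reciprocal, doubling mpg halves the consumption, so equal steps in mpg do not correspond to equal steps in fc.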
(5) Plot fc against displacement and color the dots by cylinders. Use a solid point as plotting symbol
and add a legend and title to the plot.
plot(fc ~ displacement, data = Auto_new, pch = 16, col = cylinders)
legend('topleft',legend = c(4,6,8),pch = 16, col = c(4,6,8), title = 'cylinders')
[Figure: fc against displacement, with solid points colored by cylinders; legend: 4, 6, 8.]
We see an increasing trend in fuel consumption as the engine displacement increases, but the variability does
not seem constant.
In this plot, blue dots, corresponding to cars with four cylinders, are in the lower left corner, corresponding
to smaller engines (lower displacement) and lower fuel consumption. In the central part of the plot, we
have red dots corresponding to 6-cylinder cars. They have larger displacement and higher fuel
consumption, with variability similar to that of the blue dots. Finally, the grey dots occupy mostly the
upper right region of the graph, with larger displacement and higher fuel consumption. They also show more
variability than the other two groups.
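In the plotting call for this part, col = cylinders passes the numbers 4, 6, and 8 to col, and R looks numeric colors up in the current palette (see palette()), so the colors that appear depend on that palette. A sketch of an alternative that names the colors explicitly is shown below. The data frame toy is a hypothetical stand-in for Auto_new, and the color choices and plot title are arbitrary.

```r
# Hypothetical stand-in for Auto_new with the three columns used in the plot.
toy <- data.frame(
  cylinders    = c(4, 8, 6, 4, 8),
  displacement = c(100, 350, 230, 120, 400),
  fc           = c(8, 18, 12, 9, 20)
)

# Explicit mapping from number of cylinders to a named color.
cyl_cols <- c("4" = "blue", "6" = "red", "8" = "grey50")
point_cols <- cyl_cols[as.character(toy$cylinders)]

plot(toy$displacement, toy$fc, pch = 16, col = point_cols,
     xlab = "displacement", ylab = "fc",
     main = "Fuel consumption against displacement")
legend("topleft", legend = names(cyl_cols), pch = 16, col = cyl_cols,
       title = "cylinders")
```

Since the legend and the points read their colors from the same named vector, they cannot get out of step.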
Exercise 3
Histograms
For this exercise we are going to use simulated data from a mixture of normal distributions. In this population,
45% of the points come from a normal distribution with mean 13 and standard deviation 0.75, and 55% come
from a normal distribution with mean 16 and standard deviation 1.
[Figure: Mixture Distribution. Population density against values, for values from 10 to 20.]
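The density of this mixture is f(x) = 0.45 * phi(x; 13, 0.75) + 0.55 * phi(x; 16, 1), where phi(x; m, s) is the normal density with mean m and standard deviation s. A minimal sketch of how such a curve could be computed and drawn follows. The names points.x and points.dens are the ones that appear in later code in this document, but their original definition is not shown, so the grid used here is an assumption.

```r
# Mixture density: weights 0.45 and 0.55 on the two normal components.
mix_dens <- function(x) {
  0.45 * dnorm(x, mean = 13, sd = 0.75) + 0.55 * dnorm(x, mean = 16, sd = 1)
}

# Assumed evaluation grid covering the range shown in the figure.
points.x <- seq(10, 20, length.out = 401)
points.dens <- mix_dens(points.x)

plot(points.x, points.dens, type = "l", xlab = "values", ylab = "density",
     main = "Mixture Distribution")

# Sanity check: a density should integrate to 1 (the mass outside
# the interval from 5 to 25 is negligible here).
total_area <- integrate(mix_dens, lower = 5, upper = 25)$value
total_area
```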
The following commands draw a sample of size 500 from this mixture and print the range of values for the
simulated data. The sample is stored in the vector mix.sample.
n <- 500; set.seed(4567)
unif.sample <- runif(n) <= 0.45
mix.sample <- unif.sample *rnorm(n, mean=13, sd = 0.75) +
(1-unif.sample)*rnorm(n, mean=16, sd = 1)
(rng <- range(mix.sample))
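These commands work because unif.sample is a logical vector that is TRUE with probability 0.45. In arithmetic it behaves as 1 or 0, so each position keeps the draw from exactly one of the two components. A quick sanity check, repeating the same commands, compares the share of points from the first component with 0.45 and the sample mean with the population mean 0.45 * 13 + 0.55 * 16 = 14.65. The names prop_first and pop_mean are introduced here for illustration only.

```r
n <- 500; set.seed(4567)
unif.sample <- runif(n) <= 0.45
mix.sample <- unif.sample * rnorm(n, mean = 13, sd = 0.75) +
  (1 - unif.sample) * rnorm(n, mean = 16, sd = 1)

prop_first <- mean(unif.sample)      # share of points from the first component
pop_mean   <- 0.45 * 13 + 0.55 * 16  # theoretical mean of the mixture
c(prop_first = prop_first, sample_mean = mean(mix.sample), pop_mean = pop_mean)
```

Both sample quantities should be close to their population values, but not equal to them, since they are subject to sampling variability.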
[Figure: four histograms in a 2 x 2 layout; x-axis: values, from 10 to 18.]
Not really. In the second and third plots the bimodality is not clear. The second plot looks like a right-skewed
distribution. In the third plot the data look more uniformly distributed, with data missing in some intervals.
2. Divide the plotting window into 4 using the function par with argument mfrow. Draw successive
histograms of relative frequency for the first 25, 50, 100, and 500 points in mix.sample. Set the bin
width to 0.5 in all plots. Make sure that the scales are the same for all plots. Are these plots similar to
the mixture density shown earlier?
library(MASS)  # provides truehist
par(mfrow=c(2,2))
truehist(mix.sample[1:25], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])))
truehist(mix.sample[1:50], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])))
truehist(mix.sample[1:100], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])))
truehist(mix.sample[1:500], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])))
[Figure: four histograms of increasing subsets of mix.sample, in a 2 x 2 layout; x-axis: values, from 10 to 18.]
As the sample size increases, the plots look more and more like the population density. They clearly show the
bimodal nature of the data, and the proportion between the two modes is approximately correct.
3. Using again the function par with argument mfrow, set the graphical window to a single graph. Draw a
histogram of relative frequency using all the points in mix.sample. Choose the number of bins (nbins)
using the Scott rule.
par(mfrow=c(1,1))
truehist(mix.sample, xlab = 'values',
xlim = c(floor(rng[1]),ceiling(rng[2])),
main = 'Histogram of simulated data', nbins = 'Scott',
ylim = c(0,0.25))
[Figure: Histogram of simulated data; x-axis: values, from 10 to 18; y-axis from 0 to 0.25.]
4. Using the function lines with argument density(mix.sample), add an estimate of the density for
this sample. Color the line in blue. Also add a graph of the theoretical density in red (look back at
how this density was plotted before and make the necessary changes). Comment
on what you observe.
par(mfrow=c(1,1))
truehist(mix.sample, xlab = 'values',
xlim = c(floor(rng[1]),ceiling(rng[2])),
main = 'Histogram of simulated data', nbins = 'FD',
ylim = c(0,0.25))
lines(density(mix.sample),col = 'blue', lwd=2)
lines(points.x,points.dens,type='l',col = 2, lwd=2)
[Figure: Histogram of simulated data, with the estimated density in blue and the theoretical density in red; x-axis: values, from 10 to 18.]
We see that the estimated density and the histogram are reasonably close to the population density.
Exercise 4
In this exercise we look at quantile plots. In all cases we will consider samples simulated from the normal
distribution. We explore the effect of size, mean, and variance, and also use qqplot to compare samples.
1. Divide the graphical window into four regions using par and mfrow. Generate four samples from
the standard normal distribution of size 10 and draw normal quantile plots. Add lines with qqline.
Comment on what you observe.
2. Repeat for sample sizes 20, 50, and 100. Comment on what you observe.
3. Draw samples of size 50 from normal distributions with means -6, -2, 2, and 6, all with variance 1 and
draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable common
scale for the axes for all plots. Comment on the similarities and differences between the plots.
4. Draw samples of size 50 from normal distributions with mean 1 and standard deviations 0.5, 2, 4, and
6, and draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable
common scale for the axes for all plots. Comment on the similarities and differences between the plots.
5. Draw two samples of size 10 from the standard normal distribution and compare them using qqplot.
Repeat a total of four times. Plot the four graphs on the same window. Comment on what you see.
1. Divide the graphical window into four regions using par and mfrow. Generate four samples from
the standard normal distribution of size 10 and draw normal quantile plots. Add lines with qqline.
Comment on what you observe.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(10); qqnorm(samp1); qqline(samp1)}
[Figure: normal Q-Q plots with qqline for four samples of size 10, in a 2 x 2 layout; y-axis: Sample Quantiles.]
In plot 1 the six points in the middle are aligned but the other four points do not show a good fit. Something
similar occurs with plot 3. Plots 2 and 4 show a good alignment of the points.
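To read these plots it helps to know what qqnorm draws: the sorted sample values against the standard normal quantiles at the probabilities returned by ppoints(n). The sketch below rebuilds those coordinates by hand for one sample and compares them with what qqnorm returns. The seed and the names theo, manual, and qq are chosen here for illustration.

```r
set.seed(10)
samp <- rnorm(10)
n <- length(samp)

# Theoretical quantiles: standard normal quantiles at the plotting positions.
theo <- qnorm(ppoints(n))

# Coordinates of the Q-Q plot built by hand: sorted sample against theo.
manual <- cbind(theoretical = theo, sample = sort(samp))
manual

# qqnorm returns the same coordinates, in the order of the original data.
qq <- qqnorm(samp, plot.it = FALSE)
all.equal(sort(qq$x), theo)
all.equal(sort(qq$y), sort(samp))
```

With only 10 points, each point moves a lot from sample to sample, which is why some of the four plots look poorly aligned even though all samples are normal.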
2. Repeat for sample sizes 20, 50, and 100. Comment on what you observe.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(20); qqnorm(samp1); qqline(samp1)}
[Figure: normal Q-Q plots with qqline for four samples of size 20, in a 2 x 2 layout; y-axis: Sample Quantiles.]
For sample size 20 the fit is reasonable, but there are still some points that deviate markedly from the line.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(50); qqnorm(samp1); qqline(samp1)}
[Figure: normal Q-Q plots with qqline for four samples of size 50, in a 2 x 2 layout; y-axis: Sample Quantiles.]
For sample size 50 the fit is better. In three out of four plots the fit is very good. Plot 3 has a large minimum
value that deviates from the rest.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(100); qqnorm(samp1); qqline(samp1)}
[Figure: normal Q-Q plots with qqline for four samples of size 100, in a 2 x 2 layout; y-axis: Sample Quantiles.]
Now the fit is very good in all cases. We see that, as the sample size grows, the fit improves.
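One way to put a number on how well the points line up is the correlation between the sorted sample and the theoretical normal quantiles. The sketch below averages this correlation over repeated samples for each of the four sample sizes. The helper qq_cor, the seed, and the choice of 500 replications are assumptions made for illustration, not part of the exercise.

```r
set.seed(20)

# Correlation between the two coordinates of a normal Q-Q plot
# for one standard normal sample of size n.
qq_cor <- function(n) {
  y <- rnorm(n)
  cor(sort(y), qnorm(ppoints(n)))
}

sizes <- c(10, 20, 50, 100)
avg_cor <- sapply(sizes, function(n) mean(replicate(500, qq_cor(n))))
names(avg_cor) <- sizes
round(avg_cor, 3)
```

The average correlation should move closer to 1 as n increases, in line with what the plots show.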
3. Draw samples of size 50 from normal distributions with means -6, -2, 2, and 6, all with variance 1 and
draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable common
scale for the axes for all plots. Comment on the similarities and differences between the plots.
par(mfrow=c(2,2))
for (m in c(-6,-2,2,6)) {
dat <- rnorm(50,mean=m);qqnorm(dat,ylim=c(-10,10));qqline(dat)}
[Figure: normal Q-Q plots with qqline for the four samples with different means, in a 2 x 2 layout with a common y-axis scale; y-axis: Sample Quantiles.]
In the plots we see that the slope of the lines remains constant, but the lines shift upwards as the mean
increases.
par(mfrow=c(2,2))
for (m in c(-6,-2,2,6)) {
dat <- rnorm(50,mean=m);qqnorm(dat,ylim=c(-10,10));qqline(dat)
abline(v=0,col='red'); abline(h=mean(dat),col = 'red')}
[Figure: the same four Q-Q plots with red reference lines: a vertical line at 0 and a horizontal line at the sample mean.]
4. Draw samples of size 50 from normal distributions with mean 1 and standard deviations 0.5, 2, 4, and
6, and draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable
common scale for the axes for all plots. Comment on the similarities and differences between the plots.
par(mfrow=c(2,2))
for (s in c(0.5,2,4,6)) {
dat <- rnorm(50,mean=1,sd=s);qqnorm(dat,ylim=c(-20,20));qqline(dat)}
[Figure: normal Q-Q plots with qqline for the four samples with different standard deviations, in a 2 x 2 layout with a common y-axis scale; y-axis: Sample Quantiles.]
In these plots we see that the height of the central points remains constant, but the slope of the lines increases
as the variance increases.
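The effect of the mean and of the standard deviation on the reference line can also be seen numerically. By default qqline draws the line through the first and third quartiles of the sample and of the standard normal distribution, so for normal data its intercept is close to the mean and its slope is close to the standard deviation. The sketch below computes that line by hand. The helper qq_line_coef, the seed, and the large sample size of 5000 (used to reduce noise) are assumptions made for illustration.

```r
# Intercept and slope of the line through the quartile pairs,
# as qqline does with its default settings.
qq_line_coef <- function(y) {
  q_y <- quantile(y, c(0.25, 0.75), names = FALSE)  # sample quartiles
  q_z <- qnorm(c(0.25, 0.75))                       # standard normal quartiles
  slope <- diff(q_y) / diff(q_z)
  c(intercept = q_y[1] - slope * q_z[1], slope = slope)
}

set.seed(30)
settings <- data.frame(mean = c(-6, 6, 1, 1), sd = c(1, 1, 0.5, 6))
coefs <- t(mapply(function(m, s) qq_line_coef(rnorm(5000, m, s)),
                  settings$mean, settings$sd))
cbind(settings, round(coefs, 2))
```

Changing the mean moves the intercept and leaves the slope near 1, while changing the standard deviation changes the slope and leaves the intercept near the mean.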
5. Draw two samples of size 10 from the standard normal distribution and compare them using qqplot.
Repeat a total of four times. Plot the four graphs on the same window. Comment on what you see.
par(mfrow=c(2,2))
for (i in 1:4) {
dat <- rnorm(20);qqplot(dat[1:10],dat[11:20], ylim=c(-2.5,2.5),
xlab = 'Sample 1', ylab = 'Sample 2')}
[Figure: four Q-Q plots of Sample 2 against Sample 1, in a 2 x 2 layout.]
We see that the first two plots do not show adequate alignment, and we would probably conclude that the
two samples come from different distributions. The last two plots show points that are reasonably aligned,
and we would conclude that in this case they come from a common distribution function.
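Since every pair of samples here was drawn from the same standard normal distribution, the poor alignment in some plots is sampling variability, not a real difference. For two samples of equal size, qqplot simply plots the sorted values of one against the sorted values of the other. The sketch below checks this and then looks at how much the correlation between the two sorted samples varies over repeated draws of size 10. The seed, the 1000 replications, and the use of correlation as a summary are assumptions made for illustration.

```r
set.seed(40)
s1 <- rnorm(10)
s2 <- rnorm(10)

# For equal sample sizes, the Q-Q coordinates are just the sorted samples.
qq <- qqplot(s1, s2, plot.it = FALSE)
same_as_sorted <- isTRUE(all.equal(qq$x, sort(s1))) &&
  isTRUE(all.equal(qq$y, sort(s2)))
same_as_sorted

# Spread of the alignment over many pairs drawn from the same distribution.
cors <- replicate(1000, cor(sort(rnorm(10)), sort(rnorm(10))))
quantile(cors, c(0.05, 0.5, 0.95))
```

The lower quantile gives an idea of how misaligned a plot can look purely by chance when both samples have only 10 points.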