
ProbList2 24 SLN

The document outlines a series of exercises related to applied statistics and data analysis using R. It includes tasks such as reading data, creating plots, and analyzing relationships between variables using datasets like prob1.csv and Auto from the ISLR package. The exercises cover boxplots, histograms, and scatter plots, with commentary on the observations made from the visualizations.

Uploaded by

Sam

STAT 210

Applied Statistics and Data Analysis


Problem List 2 - Solution
(due on week 3)
Spring 2024

Exercise 1
This is an exercise on the use of plot and its arguments. You will use the data in the file prob1.csv.
(1) Read the data in the file prob1.csv into a data frame named data1. Use str to explore the structure of
data1. If the variable class is of mode chr, change it to a factor.
data1 <- read.csv('prob1.csv', header = TRUE)
str(data1)

## 'data.frame': 100 obs. of 5 variables:


## $ var1 : num 3.18 4.28 3.2 3.33 3.67 ...
## $ var2 : num 5.312 7.678 10.07 -0.496 -2.169 ...
## $ var3 : num 6.56 12.33 6.89 7.3 9.4 ...
## $ var4 : num 3.199 4.838 6.415 0.412 0.716 ...
## $ class: chr "A" "A" "A" "A" ...
## Change class to a factor:
data1$class <- factor(data1$class)
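The question asks for the conversion only if class is of mode chr. A minimal sketch of that conditional step follows; since prob1.csv is not reproduced here, it uses a small made-up data frame (toy is a hypothetical name) in place of data1.

```r
# Toy stand-in for data1 (prob1.csv is not reproduced here)
toy <- data.frame(var1 = c(3.2, 4.3, 2.9, 3.8),
                  class = c("A", "A", "B", "B"),
                  stringsAsFactors = FALSE)

# Convert only when the column is still of type character
if (is.character(toy$class)) {
  toy$class <- factor(toy$class)
}

str(toy)
levels(toy$class)
```

Running the block a second time leaves the column unchanged, because the test fails once class is already a factor.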

(2) Divide the plotting window into four sectors and draw boxplots for the four numerical variables according
to class. Comment on what you observe.
par(mfrow = c(2,2))
plot(var1 ~ class, data = data1, col = 'wheat')
plot(var2 ~ class, data = data1, col = 'wheat')
plot(var3 ~ class, data = data1, col = 'wheat')
plot(var4 ~ class, data = data1, col = 'wheat')

[Figure: 2 x 2 grid of boxplots of var1, var2, var3, and var4 (y-axes) by class (A, B).]
par(mfrow = c(1,1))

For var1 the values cover approximately the same range in both classes, and there seem to be no important
differences between them. For var2, on the other hand, the differences are marked: the boxes
(representing the central 50% of the data) are disjoint, and the range of values for class B is much narrower
than for class A. For var3 the values for class A are lower than for class B, but the central boxes overlap
considerably. Finally, for var4 the situation is similar to that of var1.
(3) Divide the plotting window into four sectors; on the left column, you will use var1 and on the right
column, var2, while on top, you will use data from class A, and the bottom corresponds to class B.
Plot histograms of relative frequencies in the four windows according to the previous description. Use
the variable name as a label for the x-axis and add a title to each plot. Since you want to compare the
distribution of the variables according to class, the scales on the x-axis should be the same for plots on
the same column. Comment on what you observe.
par(mfcol = c(2,2))   # mfcol fills the window column by column
hist(data1$var1[data1$class=='A'], breaks = 10, xlim = c(2, 5), xlab = 'var 1',
     main = 'Var1 for class A', col = 'azure2')
hist(data1$var1[data1$class=='B'], breaks = 8, xlim = c(2, 5), xlab = 'var 1',
     main = 'Var1 for class B', col = 'azure2')
hist(data1$var2[data1$class=='A'], breaks = 10, xlim = c(-20, 20), xlab = 'var 2',
     main = 'Var2 for class A', col = 'azure2')
hist(data1$var2[data1$class=='B'], breaks = 8, xlim = c(-20, 20), xlab = 'var 2',
     main = 'Var2 for class B', col = 'azure2')

[Figure: 2 x 2 grid of histograms. Left column: var 1 for class A (top) and class B (bottom), x-axis from 2 to 5. Right column: var 2 for class A (top) and class B (bottom), x-axis from -20 to 20. The y-axes show Frequency.]
par(mfrow = c(1,1))

This confirms our observations from (2): the distributions of var1 in classes A and B are very similar, while
those of var2 are very different. For class B, var2 takes only negative values, roughly between -15
and 0, while for class A the values go from -20 to 20.
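The question asks for relative frequencies, while hist shows counts by default when the bins have equal width. One possible way to put proportions on the y-axis is sketched below: compute the histogram without drawing it, rescale the counts, and plot the modified object. The vector x is simulated and only stands in for one variable within one class.

```r
set.seed(1)
x <- rnorm(50, mean = 3.5, sd = 0.6)   # hypothetical stand-in for one variable in one class

h <- hist(x, breaks = 10, plot = FALSE)  # compute the bins without drawing
h$counts <- h$counts / sum(h$counts)     # counts -> proportions of the sample

plot(h, xlab = 'x', ylab = 'Relative frequency',
     main = 'Relative frequencies', col = 'azure2')

sum(h$counts)   # the proportions add up to 1
```

This differs from hist(x, freq = FALSE), which plots densities: there the bar areas, not the bar heights, add up to 1.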
(4) Using the function plot, create a matrix of plots for the four numerical variables in prob1. Use a solid
square as the plotting symbol and color by the values of class. Comment on what you observe. Which
variables seem to be related?
plot(data1[,1:4], col = data1$class, pch = 15)

[Figure: scatterplot matrix of var1, var2, var3, and var4, with solid squares colored by class.]

We see that var1 and var3 seem to be linearly related for both classes, with similar slopes. var2 and var4
also seem to have a linear relation, but in this case the slopes for the two classes are very different. The
remaining pairs of variables do not seem to be related.
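A visual impression of this kind can be backed up with within-class correlations and slopes. The sketch below assumes a simulated data frame (toy, with columns x and y, all hypothetical) in which the slope differs between the two classes; it does not use the values in prob1.csv.

```r
set.seed(2)
# Hypothetical stand-in: y depends linearly on x, with a different slope in each class
toy <- data.frame(class = factor(rep(c("A", "B"), each = 50)),
                  x = rnorm(100))
toy$y <- ifelse(toy$class == "A", 2, 0.5) * toy$x + rnorm(100, sd = 0.3)

# Pearson correlation between x and y within each class
by_class <- sapply(split(toy, toy$class), function(d) cor(d$x, d$y))
by_class

# Least-squares slope of y on x within each class
slopes <- sapply(split(toy, toy$class),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])
slopes
```

Both classes show a strong correlation here, so the correlation alone would not reveal the difference in slopes that the colored scatter plot makes visible.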

Exercise 2
For this exercise, we will use the data set Auto in the ISLR package, which has information regarding fuel
consumption (miles per gallon, mpg) and other variables for 392 different car models.
(1) Use the functions str and help to explore this data set.
library(ISLR)
data(Auto)
str(Auto)

## 'data.frame': 392 obs. of 9 variables:


## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...

## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2
# help(Auto)

The file has information on mpg and 8 other variables for 392 vehicles. The variables are numeric except name
which is a factor. The help page (not shown) gives detailed information about the variables in the data set.
(2) If you use the command unique(Auto$cylinders), you will get the different values for the number of
cylinders in the data set. We will be only interested in cars with 4, 6, or 8 cylinders. Using the function
select, create a new data frame named Auto_new that only includes cars with the selected number of
cylinders.
unique(Auto$cylinders)

## [1] 8 4 6 3 5
We see that there are cars with 3, 4, 5, 6, and 8 cylinders. We use subset to extract the rows required by
the question.
Auto_new <- subset(Auto, cylinders %in% c(4, 6, 8))
str(Auto_new)

## 'data.frame': 385 obs. of 9 variables:


## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2
With the restriction on the number of cylinders, we have lost seven cars in the data set.
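The number of rows lost can be checked directly from the filtering condition. The sketch below uses a small made-up data frame (cars, a hypothetical stand-in for Auto) and the %in% operator, which is equivalent here to chaining the three comparisons with |.

```r
# Hypothetical toy data standing in for Auto
cars <- data.frame(model = paste0("car", 1:8),
                   cylinders = c(8, 4, 6, 3, 5, 4, 8, 6))

keep <- cars$cylinders %in% c(4, 6, 8)   # TRUE for the rows to retain
cars_new <- cars[keep, ]

table(cars$cylinders)         # counts per number of cylinders before filtering
sum(!keep)                    # number of rows dropped
nrow(cars) - nrow(cars_new)   # the same number, from the row counts
```

Applied to Auto, table(Auto$cylinders) would show how the dropped cars split between 3 and 5 cylinders.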
(3) Using the data frame you created in (2), plot mpg as a function of year, and color the points according to the
value of cylinders. Use a solid triangle as plotting symbol and add a legend on the top left corner.
Comment on the plot.
plot(mpg ~ year, data = Auto_new, col = cylinders, pch = 17)
legend('topleft',legend = c(4,6,8),pch = 17, col = c(4,6,8), title = 'cylinders')

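[Figure: scatter plot of mpg (y-axis) against year (x-axis, 70 to 82), with solid triangles colored by number of cylinders; legend 'cylinders' (4, 6, 8) in the top left corner.]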

Blue triangles, corresponding to cars with 4 cylinders, usually have the higher values, indicating better
efficiency. Grey triangles, representing cars with 8 cylinders, are usually at the bottom. There are no grey
triangles for years 80 - 82. The plot shows an increasing trend, indicating improving fuel efficiency.
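In the code above, col = cylinders works because R treats the numbers 4, 6, and 8 as positions in the current color palette, which is also why the legend uses col = c(4,6,8). An alternative that does not depend on the palette is to map each value to a named color, as in the sketch below. The data frame toy and the particular colors are assumptions made for illustration.

```r
# Hypothetical toy data standing in for Auto_new
toy <- data.frame(year = c(70, 72, 75, 78, 80, 82),
                  mpg = c(15, 24, 18, 30, 20, 36),
                  cylinders = c(8, 4, 6, 4, 6, 4))

# Named vector: one explicit color per number of cylinders
cyl_cols <- c('4' = 'blue', '6' = 'red', '8' = 'grey50')
point_cols <- cyl_cols[as.character(toy$cylinders)]   # look up the color of each row

plot(mpg ~ year, data = toy, col = point_cols, pch = 17)
legend('topleft', legend = names(cyl_cols), pch = 17,
       col = cyl_cols, title = 'cylinders')
```

Because the legend is built from the same named vector, the points and the legend cannot get out of step.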
(4) In some countries, fuel consumption is measured in liters of fuel per 100 kilometers. Using the fact that
one mile = 1.61 kilometers, and one gallon = 3.785 liters, create a new variable in Auto_new called fc
(for fuel consumption), that has fuel consumption measured in liters per 100 kilometers.
km_lt <- Auto_new$mpg*1.61/3.785   # kilometers per liter
Auto_new$fc <- 100/km_lt           # liters per 100 kilometers
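The two steps combine into a single formula, fc = 100 x 3.785 / (1.61 x mpg), or roughly 235.09 / mpg. A small sketch of the same conversion as a function follows; the name mpg_to_l100km is hypothetical, and the conversion factors are the ones quoted in the question.

```r
# Conversion factors quoted in the exercise
km_per_mile <- 1.61
l_per_gallon <- 3.785

# Hypothetical helper: miles per gallon -> liters per 100 km
mpg_to_l100km <- function(mpg) {
  km_per_l <- mpg * km_per_mile / l_per_gallon   # kilometers per liter
  100 / km_per_l                                 # liters needed for 100 km
}

mpg_to_l100km(c(18, 30))
100 * l_per_gallon / km_per_mile   # the constant k in fc = k / mpg
```

Since fc is inversely proportional to mpg, cars with high mpg have low fc, and equal steps in mpg do not correspond to equal steps in fc.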

(5) Plot fc against displacement and color the dots by cylinders. Use a solid point as plotting symbol
and add a legend and title to the plot.
plot(fc ~ displacement, data = Auto_new, pch = 16, col = cylinders,
     main = 'Fuel consumption vs. displacement')
legend('topleft',legend = c(4,6,8),pch = 16, col = c(4,6,8), title = 'cylinders')

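[Figure: scatter plot of fc (y-axis) against displacement (x-axis), with solid points colored by number of cylinders; legend 'cylinders' (4, 6, 8) in the top left corner.]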

We see an increasing trend in fuel consumption as the engine displacement increases, but the variability does
not seem constant.
In this plot, blue dots, corresponding to cars with four cylinders, are at the lower left corner, corresponding
to smaller engines (lower displacement) and lower fuel consumption. In the central part of the plot, we
have red dots corresponding to 6-cylinder cars. They have larger displacement and higher fuel
consumption. The variability seems similar to that of the blue dots. Finally, the grey dots occupy mostly the
upper right region of the graph, with larger displacement and higher fuel consumption. They also show more
variability than the other two groups.

Exercise 3
Histograms
For this exercise we are going to use simulated data from a mixture of normal distributions. In this population,
45% of the points come from a normal distribution with mean 13 and standard deviation 0.75, and 55% come
from a normal distribution with mean 16 and standard deviation 1.

$$0.45 \times N(13,\ 0.75^2) + 0.55 \times N(16,\ 1)$$

The code below plots the density for this distribution.


points.x <- seq(10,20,length=1000)
points.dens <- 0.45*(dnorm(points.x, mean=13, sd = 0.75)) +
0.55*(dnorm(points.x, mean = 16, sd = 1))
plot(points.x,points.dens,type='l',xlab='values',ylab='density',lwd = 2,
col = 'navyblue', main = 'Mixture Distribution')

[Figure: 'Mixture Distribution', the density (y-axis) plotted against values from 10 to 20 (x-axis).]

The following commands draw a sample of size 500 from this mixture and print the range of values for the
simulated data. The sample is stored in the vector mix.sample.
n <- 500; set.seed(4567)
unif.sample <- runif(n) <= 0.45
mix.sample <- unif.sample *rnorm(n, mean=13, sd = 0.75) +
(1-unif.sample)*rnorm(n, mean=16, sd = 1)
(rng <- range(mix.sample))

## [1] 10.84217 18.75402
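The sampling code uses one uniform draw per observation to decide which normal component the observation comes from. Two quick sanity checks are sketched below: the mixture density should integrate to 1, and the sample mean should be close to 0.45 x 13 + 0.55 x 16 = 14.65. The names dmix, from_first, and x are hypothetical, and ifelse is used here as an alternative way of writing the component selection.

```r
# Mixture density from the text, written as a function (dmix is a hypothetical name)
dmix <- function(x) {
  0.45 * dnorm(x, mean = 13, sd = 0.75) + 0.55 * dnorm(x, mean = 16, sd = 1)
}
total <- integrate(dmix, lower = 5, upper = 25)$value   # range covers essentially all the mass
total

# Same sampling idea: a logical vector picks the component for each observation
n <- 500; set.seed(4567)
from_first <- runif(n) <= 0.45
x <- ifelse(from_first,
            rnorm(n, mean = 13, sd = 0.75),
            rnorm(n, mean = 16, sd = 1))

mean(from_first)           # proportion from the first component, near 0.45
mean(x)                    # sample mean
0.45 * 13 + 0.55 * 16      # theoretical mean of the mixture
```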


We will use this sample to draw histograms with the function truehist in the MASS package. Look up the
help for truehist. It is also a good idea to explore the use of the function hist in base R by
repeating this exercise using hist.
1. Divide the plotting window into 4 using the function par with argument mfrow. Select four disjoint
subsets of data of length 25 and draw histograms for them. Set the bin width to 0.5 in all plots. Make
sure that the scales are the same for all plots. Are these plots similar to the density in the previous
slide?
library(MASS)
par(mfrow=c(2,2))
truehist(mix.sample[1:25], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])),
ylim = c(0,0.6))
truehist(mix.sample[101:125], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])),
ylim = c(0,0.6))
truehist(mix.sample[201:225], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])),
ylim = c(0,0.6))
truehist(mix.sample[301:325], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])),
ylim = c(0,0.6))

[Figure: 2 x 2 grid of histograms of four disjoint subsamples of size 25, with values from 10 to 18 on the x-axis and a common y-axis from 0 to 0.6.]

Not really. In the second and third plots the bimodality is not clear. The second plot looks like a right-skewed
distribution. In the third plot the data look more uniformly distributed, with data missing in some intervals.
2. Divide the plotting window into 4 using the function par with argument mfrow. Draw successive
histograms of relative frequency for the first 25, 50, 100, and 500 points in mix.sample. Set the bin
width to 0.5 in all plots. Make sure that the scales are the same for all plots. Are these plots similar to
the density in the previous slide?
par(mfrow=c(2,2))
# First 25, 50, 100, and 500 points, as stated in the question, with common scales
truehist(mix.sample[1:25], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])), ylim = c(0,0.6))
truehist(mix.sample[1:50], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])), ylim = c(0,0.6))
truehist(mix.sample[1:100], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])), ylim = c(0,0.6))
truehist(mix.sample[1:500], xlab = 'values', h = 0.5,
xlim = c(floor(rng[1]),ceiling(rng[2])), ylim = c(0,0.6))

[Figure: 2 x 2 grid of histograms for increasing sample sizes, with values from 10 to 18 on the x-axis.]

As the sample size increases, the plots look more and more like the population density. They clearly show the
bimodal nature of the data, and the proportion between the two modes is approximately correct.
3. Using again the function par with argument mfrow, set the graphical window to a single graph. Draw a
histogram of relative frequency using all the points in mix.sample. Choose the number of bins (nbins)
using the Scott rule.
par(mfrow=c(1,1))
truehist(mix.sample, xlab = 'values',
xlim = c(floor(rng[1]),ceiling(rng[2])),
main = 'Histogram of simulated data', nbins = 'Scott',
ylim = c(0,0.25))

[Figure: 'Histogram of simulated data', with values from 10 to 18 on the x-axis and a y-axis from 0 to 0.25.]
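The Scott rule chooses a bin width proportional to the sample standard deviation times n^(-1/3), and the number of bins follows from dividing the data range by that width. The sketch below computes both by hand and compares the result with nclass.scott from grDevices. The vector x is simulated and only stands in for mix.sample; a plotting function may still round the resulting breaks to convenient values.

```r
set.seed(3)
x <- rnorm(500, mean = 15, sd = 2)   # hypothetical stand-in for mix.sample

# Scott's rule: bin width proportional to s * n^(-1/3)
h <- 3.5 * sd(x) * length(x)^(-1/3)
n_bins <- ceiling(diff(range(x)) / h)   # bins needed to cover the data range

h
n_bins
nclass.scott(x)   # built-in helper from grDevices, for comparison
```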
4. Using the function lines with argument density(mix.sample), add an estimate of the density for
this sample. Color the line in blue. Add also a graph of the theoretical density in red (look back to the
previous page to see how this density was plotted before and make the necessary changes). Comment
on what you observe.
par(mfrow=c(1,1))
truehist(mix.sample, xlab = 'values',
xlim = c(floor(rng[1]),ceiling(rng[2])),
main = 'Histogram of simulated data', nbins = 'FD',
ylim = c(0,0.25))
lines(density(mix.sample),col = 'blue', lwd=2)
lines(points.x,points.dens,type='l',col = 2, lwd=2)

[Figure: 'Histogram of simulated data' with the estimated density (blue) and the theoretical density (red) superimposed; values from 10 to 18 on the x-axis.]
We see that the estimated density and the histogram are reasonably close to the population density.

Exercise 4
In this exercise we look at quantile plots. In all cases we will consider samples simulated from the normal
distribution. We explore the effect of size, mean, and variance, and also use qqplot to compare samples.
1. Divide the graphical window into four regions using par and mfrow. Generate four samples from
the standard normal distribution of size 10 and draw normal quantile plots. Add lines with qqline.
Comment on what you observe.
2. Repeat for sample sizes 20, 50, and 100. Comment on what you observe.
3. Draw samples of size 50 from normal distributions with means -6, -2, 2, and 6, all with variance 1 and
draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable common
scale for the axes for all plots. Comment on the similarities and differences between the plots.
4. Draw samples of size 50 from normal distributions with mean 1 and standard deviations 0.5, 2, 4, and
6, and draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable
common scale for the axes for all plots. Comment on the similarities and differences between the plots.
5. Draw two samples of size 10 from the standard normal distribution and compare them using qqplot.
Repeat a total of four times. Plot the four graphs on the same window. Comment on what you see.
1. Divide the graphical window into four regions using par and mfrow. Generate four samples from
the standard normal distribution of size 10 and draw normal quantile plots. Add lines with qqline.
Comment on what you observe.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(10); qqnorm(samp1); qqline(samp1)}

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) for four samples of size 10.]

In plot 1 the six points in the middle are aligned but the other four points do not show a good fit. Something
similar occurs with plot 3. Plots 2 and 4 show a good alignment of the points.

2. Repeat for sample sizes 20, 50, and 100. Comment on what you observe.
par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(20); qqnorm(samp1); qqline(samp1)}

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) for four samples of size 20.]

For sample size 20 the fit is reasonable, but there are still some points that deviate markedly from the line.

par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(50); qqnorm(samp1); qqline(samp1)}

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) for four samples of size 50.]

For sample size 50 the fit is better. In three out of four plots the fit is very good. Plot 3 has a large minimum
value that deviates from the rest.

par(mfrow=c(2,2))
for(i in 1:4) {samp1 <- rnorm(100); qqnorm(samp1); qqline(samp1)}

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) for four samples of size 100.]

Now the fit is very good in all cases. We see that, as the sample size grows, the fit improves.
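It may help to see what qqnorm actually plots: the sorted sample values against normal quantiles evaluated at the plotting positions given by ppoints. The sketch below checks this on a simulated sample (samp, a hypothetical name) and rebuilds the plot by hand.

```r
set.seed(4)
samp <- rnorm(20)
n <- length(samp)

qq <- qqnorm(samp, plot.it = FALSE)   # coordinates only, no plot

theo <- qnorm(ppoints(n))             # theoretical quantiles at the plotting positions
all.equal(sort(qq$x), theo)           # the x-coordinates are these quantiles
all.equal(sort(qq$y), sort(samp))     # the y-coordinates are the sample values

# Plotting the sorted data against theo reproduces the normal Q-Q plot
plot(theo, sort(samp), xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles')
qqline(samp)
```

With few points, each sorted value is a noisy estimate of its quantile, which is consistent with the poorer alignment seen for the smaller samples.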

3. Draw samples of size 50 from normal distributions with means -6, -2, 2, and 6, all with variance 1 and
draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable common
scale for the axes for all plots. Comment on the similarities and differences between the plots.
par(mfrow=c(2,2))
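# Means and sample size as stated in the question; the common y-range covers all four samples
for (m in c(-6,-2,2,6)) {
dat <- rnorm(50, mean = m); qqnorm(dat, ylim = c(-10,10)); qqline(dat)}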

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) on a common y-axis, one panel per mean.]

In the plots we see that the slope of the lines remains constant, but the lines shift upwards as the mean
increases.

par(mfrow=c(2,2))
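# Same plots with red reference lines at x = 0 and at the sample mean;
# the fitted line passes close to their intersection
for (m in c(-6,-2,2,6)) {
dat <- rnorm(50, mean = m); qqnorm(dat, ylim = c(-10,10)); qqline(dat)
abline(v=0,col='red'); abline(h=mean(dat),col = 'red')}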

[Figure: 2 x 2 grid of normal Q-Q plots on a common y-axis, one panel per mean, each with a red vertical line at 0 and a red horizontal line at the sample mean.]

4. Draw samples of size 50 from normal distributions with mean 1 and standard deviations 0.5, 2, 4, and
6, and draw the corresponding quantile plots. To be able to compare the four graphs, find a suitable
common scale for the axes for all plots. Comment on the similarities and differences between the plots.
par(mfrow=c(2,2))
# Mean, standard deviations, and sample size as stated in the question
for (s in c(0.5,2,4,6)) {
dat <- rnorm(50, mean = 1, sd = s); qqnorm(dat, ylim = c(-20,20)); qqline(dat)}

[Figure: 2 x 2 grid of normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) on a common y-axis, one panel per standard deviation.]

In these plots we see that the height of the central points remains constant, but the slope of the lines increases
as the variance increases.
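Both observations, the vertical shift with the mean and the change in slope with the spread, follow from how qqline is drawn: by default it joins the first and third quartiles of the sample and of the standard normal. The sketch below recomputes that line by hand for a large simulated sample; dat, slope, and intercept are hypothetical names, and the mean of 2 and standard deviation of 3 are chosen only for illustration.

```r
set.seed(5)
dat <- rnorm(5000, mean = 2, sd = 3)   # large hypothetical sample so the estimates are stable

# qqline joins the first and third quartiles of the sample and of the normal
probs <- c(0.25, 0.75)
y_q <- quantile(dat, probs, names = FALSE)   # sample quartiles
x_q <- qnorm(probs)                          # standard normal quartiles

slope <- diff(y_q) / diff(x_q)          # close to the standard deviation
intercept <- y_q[1] - slope * x_q[1]    # height of the line at x = 0, close to the mean

c(slope = slope, intercept = intercept)
```

For normal data the slope therefore estimates the standard deviation and the height at x = 0 estimates the mean, which matches the two comments above.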

5. Draw two samples of size 10 from the standard normal distribution and compare them using qqplot.
Repeat a total of four times. Plot the four graphs on the same window. Comment on what you see.
par(mfrow=c(2,2))
for (i in 1:4) {
dat <- rnorm(20);qqplot(dat[1:10],dat[11:20], ylim=c(-2.5,2.5),
xlab = 'Sample 1', ylab = 'Sample 2')}
[Figure: 2 x 2 grid of Q-Q plots of Sample 2 (y-axis) against Sample 1 (x-axis) for four pairs of samples of size 10.]

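We see that the first two plots do not show adequate alignment, and from the plots alone we would probably
conclude that the two samples come from different distributions, even though both were drawn from the
standard normal. The last two plots show points that are reasonably aligned, and we would conclude that in
this case they come from a common distribution. With samples of size 10, these plots are too variable to
support firm conclusions.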
