R Exercises For Modules
2. Start up R.
3. Install the R packages we will be using in the course by copying the following 2 lines
of code and pasting them into the R Console window (not in R Studio).
install.packages(c('iNZightPlots', 'FutureLearnData'), dependencies = TRUE,
repos = c('http://r.docker.stat.auckland.ac.nz/R', 'https://cran.rstudio.com'))
If it asks you, “Would you like to create a personal library to install packages into?”,
say “Yes”.
5. Now try the following. (Paste lines of code, or even several lines of code at a time,
into the R Console window and see what they do.)
# R CODE COMMENTARY
4. Ask for plots and summaries of other variables whose names you can see in the
names list.
5. When you have finished, close R. When it asks “Save Workspace image?”, click
“No”.
From Exercise 1.10 (R version) you have already seen how to make data sets in the
FutureLearnData package available for analysis but we will reiterate the general pattern soon.
The # character in R: If you type or paste a line into the R Console window, R will ignore everything
that comes after a “#” character. So # tells R that what follows is a comment left for human
readers, not an instruction for R itself.
We will use this in the following as we talk about the pattern for making the data in a package
(in our case the FutureLearnData package) available for analysis.
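As a small illustration of the # rule described above, here is a sketch using made-up values (the names x and label are illustrative only and do not come from the course data). It also shows one detail worth knowing: a # inside a quoted string is ordinary text, not the start of a comment.

```r
# This whole line is a comment, so R skips it.
x = 2 + 3   # R evaluates "2 + 3"; everything after the # is ignored
print(x)

# A "#" inside quotes is part of the text, not the start of a comment
label = "Item #1"
print(label)
print(nchar(label))   # counts every character in the string, including the #
```

Pasting these lines into the R Console should show that only the instructions are run and the comments are skipped.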
3. Now try the following. (Paste lines of code, or even several lines of code at a time, into the
R Console window and see what they do.)
# R CODE (COMMENTARY follows each command as # comment lines)
# Import the file Census at School-500.csv
cas_500 = read.csv(file.choose(), header = TRUE)
#   read.csv is asking R to read a csv file.
#   file.choose() is telling R to throw up a browser window that will allow you to
#   navigate to wherever you have stored Census at School-500.csv and open the file.
#   header = TRUE tells R that this file has a header line containing the names of
#   the variables.
#   cas_500 = tells R to store the result as cas_500.
# Now import the file olympics100m.txt
Olymp_imp = read.table(file.choose(), header = TRUE, sep="\t")
#   As above, but to read the tab-separated text file we use read.table, not read.csv.
#   We include sep="\t" to tell R to look for tab characters as the separators
#   between data fields.
#   We store the result as Olymp_imp.
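If you want to try the two import commands without hunting for the course files, the following sketch may help. It writes two tiny files of made-up data to a temporary folder and reads them back. The file contents and the names csv_file, tab_file, heights and times are assumptions for illustration only; file paths are used in place of file.choose() so the code runs without a browser window.

```r
# Write two tiny example files to a temporary folder (made-up data)
csv_file = tempfile(fileext = ".csv")
writeLines(c("name,height", "Ana,150", "Ben,162"), csv_file)

tab_file = tempfile(fileext = ".txt")
writeLines(c("year\ttime", "1896\t12.0", "1900\t11.0"), tab_file)

# Comma-separated file: read.csv, with header = TRUE for the line of names
heights = read.csv(csv_file, header = TRUE)

# Tab-separated file: read.table, with sep = "\t" for tab separators
times = read.table(tab_file, header = TRUE, sep = "\t")

names(heights)   # variable names taken from the header line
names(times)
times
```

Replacing csv_file or tab_file with file.choose() gives the interactive version used in the exercise.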
[Note: Most actions in R are invoked by calling an R function. Function calls in R are of the
form:
function.name(list of function parameters separated by commas)
When you look at help files you will note in the “Usage” paragraph that a function will often
have a large number of parameters. You do not need to include any parameters in your call to a
function if that parameter is set equal to a value in this paragraph. That assigned value is the
default value. You do not need to include any parameter that has a default in your call unless
you want to change its value from the default to something else.]
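The note about default values can be tried out with a short sketch. The function describe and the data values below are hypothetical, written only for this illustration; args() and read.csv are standard R.

```r
# A hypothetical function with one required parameter (x) and one
# parameter that has a default value (digits = 1)
describe = function(x, digits = 1) {
  round(mean(x), digits)
}

values = c(1.234, 2.345, 3.456)   # made-up data

describe(values)               # digits is left at its default of 1
describe(values, digits = 3)   # the default is overridden

# args() lists a function's parameters and any default values
args(read.csv)
```

The output of args(read.csv) shows the same information as the “Usage” paragraph of the help file, so you can see which parameters may be left out of a call.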
4. Try some variations of the above, e.g. plotting new variables, reading another data file.
5. When you have finished, close R. When it asks “Save Workspace image?”, click, “No”.
iNZightPlot(Race3, data=nhanes_1000)
getPlotSummary(Race3, data=nhanes_1000)
getPlotSummary(Race3, data=nhanes_1000,
summary.type="inference", inference.type="conf")
# R CODE (COMMENTARY follows each command as # comment lines)
# Create a new variable Race3.reord to re-order Race3
# with the categories in frequency order
#   R calls a categorical variable a “factor”.
levels(nhanes_1000$Race3)
#   Show me the levels of Race3 (I can also see them in the graph). Output is …
#   [1] "Asian" "Black" "Hispanic" "Mexican" "Other" "White"
nhanes_1000$Race3.reord =
    factor(nhanes_1000$Race3, levels = c("White",
    "Black", "Mexican", "Hispanic", "Asian", "Other") )
#   I can see what the frequency order should be from the graph. (This can be done
#   generally with code but the code is too complex to do at this stage.)
#   So I’ll make Race3.reord from Race3 and put the levels in the order I want.
#   (Getting the number of levels and spelling exactly right is crucial.)
iNZightPlot(Race3.reord, data=nhanes_1000)
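For the curious, here is one possible way to get the frequency order from code rather than reading it off the graph. This is a base R sketch on made-up data, not necessarily the approach the course authors had in mind; the names answers, counts, freq_order and answers.reord are illustrative.

```r
# Made-up categorical data standing in for a variable such as Race3
answers = factor(c("Black", "White", "Asian", "White", "Black", "White"))
levels(answers)                  # default level order is alphabetical

counts = table(answers)          # frequency of each level
freq_order = names(sort(counts, decreasing = TRUE))
freq_order                       # level names, most frequent first

# Rebuild the factor with its levels in frequency order
answers.reord = factor(answers, levels = freq_order)
levels(answers.reord)
```

Because the level names are taken from the data itself, there is no risk of a spelling mistake in the levels list.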
a sensible order
iNZightPlot(Education, data=nhanes_1000)
[Figure: bar chart of Education. Vertical axis: Proportion (%). Horizontal axis: Education. 285 missing values.]
iNZightPlot(Education.reord, data=nhanes_1000)
[Figure: bar chart titled “Distribution of Education.reord”. Vertical axis: Proportion (%). Horizontal axis: Education.reord. 285 missing values.]
iNZightPlot(Education.reord,
    data=nhanes_1000, colby=Education.reord)
#   Commentary: Colour by Education.reord
# Now change the colour palette to rainbow colours
iNZightPlot(Education.reord,
    data=nhanes_1000, colby=Education.reord,
    col.fun=rainbow)
#   Commentary: col.fun has to be a colour palette function.
#   There are lots of colour palette functions in R; for many you have to install
#   other packages to get them. rainbow() is a generally available colour palette.
Try repeating the above using other choices for variables and settings
Installing the package viridis and then loading it [via library(viridis)] will give you access to
the colour functions: viridis, magma, and inferno
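To see what a “colour palette function” is, it can help to call one directly. The sketch below uses only functions that come with R. The palette blue_to_red is a hypothetical example of our own, and whether iNZightPlot accepts it for col.fun is an assumption you would need to check by trying it.

```r
# A colour palette function takes a number n and returns n colours
cols = rainbow(5)
print(cols)                 # five colour codes as text

# heat.colors is another palette function that comes with R
print(heat.colors(5))

# A hypothetical palette of our own, built with colorRampPalette()
blue_to_red = colorRampPalette(c("blue", "red"))
print(blue_to_red(5))       # five colours running from blue to red
```

Note that col.fun=rainbow passes the function itself (no brackets), so that the plotting code can ask it for as many colours as it needs.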
iNZightPlot(Pulse, data=nhanes_1000)
getPlotSummary(Pulse, data=nhanes_1000)
getPlotSummary(Pulse, data=nhanes_1000,
summary.type="inference",
inference.type="conf")
Coloured by Age
iNZightPlot(Height, data=nhanes_1000)
Try doing more things like the above but using other variables and settings
# Setup Commentary
library(iNZightPlots)
library(FutureLearnData)
data(gapminder_2008)   # Make gapminder_2008 inside FutureLearnData available for analysis
iNZightPlot(ChildrenPerWoman, Region,
data=gapminder_2008)
getPlotSummary(ChildrenPerWoman, Region,
data=gapminder_2008)
# Colour by Region
iNZightPlot(ChildrenPerWoman, Region,
data=gapminder_2008, colby=Region)
# Try also
iNZightPlot(ChildrenPerWoman, Region, data=gapminder_2008,
colby=Region, cex.text=.3)
iNZightPlot(ChildrenPerWoman, Region, data=gapminder_2008,
colby=Region, hide.legend = TRUE)
# Colour by Infantmortality
iNZightPlot(ChildrenPerWoman, Region,
data=gapminder_2008, colby=Infantmortality)
Also try colouring by other variables you think might help explain the Regional
differences.
# Setup Commentary
library(iNZightPlots)
library(FutureLearnData)
data(gapminder)   # Use gapminder NOT gapminder_2008
names(gapminder)
# Now put it in a loop and do it for every year, i.e. for every level of Year_cat
# Do not display a new plot UNTIL you have clicked on the plot window
old.value = devAskNewPage(TRUE) # save current plotting behaviour and ask for new behaviour
for (k in levels(gapminder$Year_cat))
iNZightPlot(ChildrenPerWoman,Region, g1=Year_cat, g1.level=k, data=gapminder)
devAskNewPage(old.value) # Reset the plotting behaviour back to the way it was before
# Play the plots, but with a 2 second delay between plots
for (k in levels(gapminder$Year_cat)) {
    iNZightPlot(ChildrenPerWoman,Region, g1=Year_cat, g1.level=k,
        data=gapminder)
    Sys.sleep(2)
}
#   Commentary: This time there are 2 lines of code to be run at each step so we
#   have to put them in “{ .. }” brackets so that both lines get run.
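The brace rule described above can be tried out without any plotting. In this sketch the factor years is made-up data standing in for gapminder$Year_cat, and shown is an illustrative name; the loop body prints a message where the exercise draws a plot.

```r
# Made-up factor standing in for gapminder$Year_cat
years = factor(c("[1972]", "[1976]", "[1972]", "[1980]"))

shown = character(0)               # will record each level we visit
for (k in levels(years)) {
  cat("Plotting year", k, "\n")    # first line of the loop body
  shown = c(shown, k)              # second line of the loop body
}
shown   # one entry per level, even though "[1972]" occurs twice in the data
```

Without the braces only the first line after for (...) would be repeated, and the second line would run just once after the loop had finished.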
1. Creating plots of two categorical variables (when the predictor variable has only 2
groups).
2. Making side-by-side bar charts and separate bar charts for two categorical
variables.
3. Filtering out unwanted groups
iNZightPlot(Education.reord, data=nhanes_1000)
# (This is an example of “filtering” the data)
# (Warning: there is a leading space on all of AgeDecade’s level names, a data bug)
Temp = subset(nhanes_1000, AgeDecade!=" 0-9" & AgeDecade!=" 10-19")
#   “!=” is read as “is not equal to”
table(Temp$AgeDecade)
#   Output:
#     0-9 10-19 20-29 30-39 40-49 50-59 60-69   70+
#       0     0   146   110   144   111    05    66
#   It still has the empty levels
# This will remove the empty levels
Temp$AgeDecade=factor(Temp$AgeDecade)
# Replot the data using the subset of the data called Temp
iNZightPlot(Education.reord, g1=AgeDecade, data=Temp,
colby=AgeDecade)
Now construct similar graphics using other categorical variables whose behaviour
you may be interested in
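The filtering pattern above (subset, then remove the empty levels) can be tried on a tiny made-up data set. The data frame people and its values are invented for this sketch; the leading spaces in the level names mimic the data bug mentioned in the exercise.

```r
# Made-up data with the same leading-space level names as AgeDecade
people = data.frame(
  AgeDecade = factor(c(" 0-9", " 10-19", " 20-29", " 20-29", " 30-39")),
  Pulse     = c(90, 80, 72, 70, 68)
)

# Keep only rows that are NOT in the two youngest groups
adults = subset(people, AgeDecade != " 0-9" & AgeDecade != " 10-19")
table(adults$AgeDecade)          # the filtered-out levels remain, with count 0

# Re-creating the factor drops the levels that are no longer used
adults$AgeDecade = factor(adults$AgeDecade)
table(adults$AgeDecade)

# droplevels() is a base R alternative that cleans every factor in the data
adults2 = droplevels(subset(people, AgeDecade != " 0-9" & AgeDecade != " 10-19"))
levels(adults2$AgeDecade)
```

Either approach stops the empty groups from appearing as blank panels or bars in later plots.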
1. Creating a scatterplot of two numeric variables (when the predictor variable has
more than groups).
2. Adding a trend line (if suitable) and labelling any interesting points.
3. Train your eyes to observe and draw an envelope around the scatter.
When we call iNZightPlot(x,y) using two numeric variables x and y we will get a scatterplot.
The 1st variable (x) gets plotted against the horizontal axis and the 2nd variable (y) is plotted
against the vertical axis. So we should put the predictor variable first and the outcome
variable second.
[Note: I’ve just seen that I have reversed the order of the variables in this exercise from the
way it was in the (older) iNZight version. The order below is more natural. The change
affects nothing that matters in the exercise but the graphs are “the other way around” from
the iNZight version.]
#R Code Commentary
# Setup
library(iNZightPlots)
library(FutureLearnData)
data(gapminder_2008)
iNZightPlot(CO2Emissions, EnergyUsePerPerson,
data=gapminder_2008)
# Note the reversal in the order of the names between the call and
# ordinary English usage for what we want to do!
# Recall: the plot call is in the order iNZightPlot(x,y,…)
iNZightPlot(CO2Emissions, EnergyUsePerPerson,
data=gapminder_2008, locate.extreme=4, locate=Country)
iNZightPlot(CO2Emissions, EnergyUsePerPerson,
data=gapminder_2008, trend="linear")
iNZightPlot(CO2Emissions, EnergyUsePerPerson,
data=gapminder_2008, xlab="CO2 emissions per person",
ylab="Energy use per person",
main="Energy use versus CO2 emissions")
iNZightPlot(CO2Emissions, EnergyUsePerPerson,
data=gapminder_2008, trend="linear",
col.trend=list(linear="red"))
Commentary
# Label specified data points
The options added to the plots above can all be used together. Try putting several of them
together in the same call to iNZightPlot. There are lots of others you will find out about over
time, e.g. if you are putting a line on your plot then adding “, lwd=2” to your call (without
the “”) will double the thickness of the line.
Try to identify more or other countries (spelling and lower/upper case is critical).
Try working with other variables.
1. Create a scatterplot of two numeric variables and apply a suitable trend line.
2. Use techniques such as jittering, transparency and running quartiles to deal
with overprinting.
We are going to explore the relationship between variables Age and Weight of
people in the nhanes_1000 dataset in the FutureLearnData package.
# Setup
library(iNZightPlots)
library(FutureLearnData)
data(nhanes_1000)
iNZightPlot(Age, Weight,data=nhanes_1000)
# Make all the lines thicker, all solid lines, and change line colours
iNZightPlot(Age,Weight,data=nhanes_1000, trend=c("linear",
"quadratic", "cubic") , smooth=.25, lwd=2,
lty.trend=list(linear=1,quadratic=1,cubic=1),
col.trend=list(linear="red", quadratic="yellow", cubic="blue"),
col.smooth="green")
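To get a feel for what “linear”, “quadratic” and “cubic” trends mean, here is a base R sketch that fits the three curves to made-up data with lm(). It is an illustration of the idea only; it is not a description of how iNZightPlot computes its trend lines, and the variables age and weight below are simulated, not taken from nhanes_1000.

```r
set.seed(1)                        # make the made-up data reproducible
age = 1:60
weight = 20 + 2.5 * age - 0.03 * age^2 + rnorm(60, sd = 3)

# Fit a straight line, then curves with squared and cubed terms
linear_fit    = lm(weight ~ age)
quadratic_fit = lm(weight ~ age + I(age^2))
cubic_fit     = lm(weight ~ age + I(age^2) + I(age^3))

# Residual sum of squares: smaller means the curve follows the points more closely
fits = list(linear = linear_fit, quadratic = quadratic_fit, cubic = cubic_fit)
sapply(fits, deviance)
```

Each more flexible curve can only match the data at least as closely as the one before it, which is why looking at the plot (and not just the fit) matters when choosing a trend.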
We are going to explore the relationship between the variables Infantmortality and ChildrenPerWoman of
countries in the Gapminder dataset over time.
# Setup
library(iNZightPlots)
library(FutureLearnData)
data(gapminder)
# Scatterplot of Infantmortality against ChildrenPerWoman
iNZightPlot(ChildrenPerWoman,Infantmortality ,
data=gapminder)
# Subset by Year_cat
iNZightPlot(ChildrenPerWoman,Infantmortality, g1=Year_cat,
data=gapminder)
iNZightPlot(ChildrenPerWoman,Infantmortality,g1=Year_cat,
data=gapminder, colby=Region, sizeby=Populationtotal)
iNZightPlot(ChildrenPerWoman,Infantmortality,g1=Year_cat,
g1.level="[1972]",data=gapminder, colby=Region,
sizeby=Populationtotal)
iNZightPlot(ChildrenPerWoman,Infantmortality, g1=Year_cat,
g1.level="[1972]",data=gapminder, colby=Region,
sizeby=Populationtotal, bg="darkgray")
iNZightPlot(ChildrenPerWoman,Infantmortality, g1=Year_cat,
g1.level="[1972]",data=gapminder, colby=Region,
sizeby=Populationtotal, alpha=.45, cex.dotpt=.5)
iNZightPlot(ChildrenPerWoman,Infantmortality, g1=Year_cat,
g1.level="[1972]",data=gapminder, colby=Region,
sizeby=Populationtotal, alpha=.45, cex.dotpt=.5,
locate.id=ids, locate=Country, xlim=c(0,9))
for (k in levels(gapminder$Year_cat)) {
iNZightPlot(ChildrenPerWoman,Infantmortality, g1=Year_cat,
g1.level=k, data=gapminder, colby=Region,
sizeby=Populationtotal, alpha=.45, cex.dotpt=.5,
locate.id=ids, locate=Country)
Sys.sleep(1)
}
Play some more with these settings and try other variables
For even more settings, type ?inzpar into R to get help on inzpar, or type inzpar
to just get a complete list (last time I looked, the help file wasn’t entirely complete).
Use iNZightPlot to get inferential mark-ups of plots so that you can make visual
comparisons between sub-groups allowing for sampling error.
To obtain numerical confidence limits for true between-group differences.
# What do you see here?
#   Commentary: The thick black lines are called ‘comparison intervals’ and are the
#   lines that we look at when observing any overlap. The thin red lines are the
#   individual confidence intervals for each mean/median.
etc
# Reorder the HealthGen variable and create HealthGen.reord with the levels in
# a sensible order
Temp$HealthGen.reord = factor(Temp$HealthGen,
    levels = c("Excellent","Vgood","Good","Fair","Poor") )
Play some more with these settings and try other variables
1. Generate a Time Series plot and a Seasonal plot for a single numeric variable.
2. Get an Additive and Multiplicative Decomposition.
3. Make a forecast.
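As background for item 2, the sketch below shows what a decomposition does, using the decompose() function that comes with R on a made-up quarterly series. This is a stand-in for illustration: the iNZightTS functions used in this exercise may compute their decomposition differently, and the series visitors and the values in quarter_effect are invented.

```r
# Made-up quarterly series: a steady upward trend plus a repeating seasonal pattern
quarter_effect = c(10, -5, -10, 5)
visitors = ts(100 + 2 * (1:24) + rep(quarter_effect, times = 6),
              start = c(2010, 1), frequency = 4)

# Additive: data = trend + seasonal + remainder
additive = decompose(visitors, type = "additive")
# Multiplicative: data = trend * seasonal * remainder
multiplicative = decompose(visitors, type = "multiplicative")

round(additive$figure, 2)         # estimated effect of each quarter
round(multiplicative$figure, 3)   # seasonal factors (multipliers) for each quarter

# Check that the additive parts add back up to the data
# (the first and last two trend values are NA, so compare the middle part)
rebuilt = additive$trend + additive$seasonal + additive$random
all.equal(as.numeric(rebuilt[3:22]), as.numeric(visitors[3:22]))
```

An additive decomposition suits a series whose seasonal swings stay about the same size over time; a multiplicative one suits swings that grow as the level of the series grows.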
Generate a Time Series plot and a Seasonplot for a single numeric variable
The data we are using shows us the number of visitors from different countries who
are currently staying in New Zealand. We will investigate the changes in the number
of Australian visitors over time.
# Install the iNZightTS package
#   Commentary: This installation only has to be done once
install.packages(c('iNZightTS'), dependencies = TRUE,
repos = c('http://r.docker.stat.auckland.ac.nz/R', 'https://cran.rstudio.com'))
# R CODE (COMMENTARY follows each command as # comment lines)
Australia = iNZightTS(week8_AverageVisitorsQuarterly, var=2)
#   Create Time Series object for the Australian series.
#   Australia is the 2nd variable in the dataset.
# Plot the data -- t controls smoothing
rawplot( Australia , t=25)
#   t controls the smoothing (t must be between 0 and 100)
seasonplot( Australia )
# Decomposition plot
# Recomposed plot
forecastplot( Australia )
# R CODE (COMMENTARY follows each command as # comment lines)
# Let’s establish this pattern for another country
China = iNZightTS(week8_AverageVisitorsQuarterly, var=3)
#   Create Time Series object for the China series. China is the 3rd variable in
#   the dataset week8_AverageVisitorsQuarterly.
#   Now you can start plotting …
rawplot( China , t=20)
decompositionplot( China, t=20)
# etc …
Repeat what we have done above for any other country that interests you and try
to interpret the patterns you see as has been done in the video
o Skim-read the iNZight version for the commentary that is missing here. (This
document just concentrates on how the code works)
[In the next Exercise we will start comparing series from different countries.]
This exercise will enable you to use iNZight to compare several time series by viewing them
simultaneously in two different ways.
# Set up
library(iNZightTS)
library(FutureLearnData)
data(week8_AverageVisitorsQuarterly)
head(week8_AverageVisitorsQuarterly)
multiseries(Aus_USA, t=20)
multiseries(ALL, t=20)
compareplot(Aus_USA, t=30)
compareplot(ALL, t=30)
Repeat what we have done above for any other combinations of countries that
interest you and try to interpret the patterns you see as has been done in the video
o Skim-read the iNZight version for the commentary that is missing here. (This
document just concentrates on how the code works)
and for exploration questions