R-Course
Matthew Lau
November 4, 2008
Today, we covered a lot of ground with the goal of providing the tools to
input and manipulate data, conduct analyses and make plots. Below,
I have detailed all of the activities that we did in R.
1. Scripting
2. Math Operations
3. Help!
4. Data Entry
5. Statistical Analyses
6. Plotting
7. Next Class
1 Scripting
Perhaps the most important thing we learned, above everything else,
is to use a script editor. Do not work solely in the console command
line. Please open a new script file to work in by going to the File
menu and selecting "New Document". This document can be saved
for future use, unlike the R Console, which will save what you have
done, but not in a reproducible format.
When you script, you write out your code and then run it by
placing the cursor on the line, or highlighting all of the script you
want to run, and running it (Windows users press CTRL + R; Mac
users press COMMAND (APPLE) + ENTER). You can add notes by using
the pound symbol, #, which tells R not to run any text to the right
of that symbol on that line.
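For example, a short scripted snippet might look like this:

> # calculate the mean of three numbers
> (1 + 2 + 3)/3 # R ignores everything to the right of the #
[1] 2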
2 Math Operations
Math operations are basic and intuitive, but are essential tools for
working effectively in R.
2.1 Basic Math Operations:
> 1 + 1
[1] 2
> 1 - 1
[1] 0
> 2 * 2
[1] 4
> 2/2
[1] 1
> 2^2
[1] 4
> (2 + 2) * 2
[1] 8
> sqrt(2)
[1] 1.414214
> log(2)
[1] 0.6931472
> cos(2)
[1] -0.4161468
There are also several statistically relevant math commands that I'll just
mention now, but we’ll use them later: mean(), sd() and length()
return the mean, standard deviation and number of elements of a
vector of numbers.
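For example, applied to a small vector of numbers:

> x <- c(2, 4, 6, 8)
> mean(x)
[1] 5
> sd(x)
[1] 2.581989
> length(x)
[1] 4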
3 HELP!
You may have already encountered some issues just with trying to do
these simple operations and commands. In order to get around obstacles
and improve your R knowledge base, it is important to know how
to access help resources. There are many books out there, including
the R "bible", many of which are listed on the R website. I typically
rely on two resources, the internal help files within R and internet
search engines. You can reach help within R in a number of ways,
but one of the easiest is to use the help() command. For example:
> help(sqrt)
This will bring up a help file with information pertaining to the sqrt()
and other related commands.
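A question mark in front of a command name is a shortcut for the same
thing, and help.search looks through the help files by keyword:

> ?sqrt
> help.search("standard deviation")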
4 Data Entry
Handling data is vital to successful analyses. Bringing data into R
can be done in at least two ways: you can enter data manually,
not unlike a calculator, or you can input data via file-reading
commands.

4.1 Entering Data Manually

The simplest way to get numbers into R is to type them in directly.
For example, the colon operator generates a sequence of integers:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
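To enter an arbitrary set of numbers by hand, we can use the c
(combine) command, for example:

> c(3.2, 7.1, 4.8, 5.5)
[1] 3.2 7.1 4.8 5.5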
4.2 Reading Data from File
More often than not, we already have our data saved as a data file.
Here are two commands that you can use to bring data into R. Note
that both of these commands require the data to be in a specific
format. Check the help file for the commands to see how the data
should be structured.
read.csv("file location") will read a .csv file. This is most useful when
you have your data organized as spreadsheets, which many people do.
For example:
> read.csv("/Users/artemis/Desktop/rock_lichen_data.csv")
moth canopy A
1 2 14.3 7.2
2 2 28.5 2.8
3 2 34.9 4.6
4 2 43.6 5.2
5 2 169.9 25.4
6 2 38.0 0.0
7 2 102.4 9.7
8 2 64.8 9.9
9 2 39.2 1.8
10 2 39.2 4.7
11 1 320.0 39.9
12 1 206.4 58.0
13 1 194.9 39.1
14 1 135.2 18.3
15 1 89.8 12.4
16 1 417.7 44.1
17 1 164.5 24.1
18 1 155.7 34.9
19 1 155.1 43.1
20 1 169.0 19.0
This is the standard formatting for the read.csv command. Data are
organized into columns with the columns headed by their appropriate
names. Column names should not contain spaces or math operations
and should not start with numbers.
4.3 Creating Objects

We can use different symbols to represent or "store" our data. This
is done by using either the equals sign "=" or by making a left arrow
with the less-than sign "<" and a dash "-". For instance:
> a = 10
> a
[1] 10
> a <- 10
> a
[1] 10
You can see that the first line above tells R that "a" is the number
10, because when we enter "a" R returns the number 10. Note that
object names: a) cannot start with a number, b) are case sensitive
and c) cannot contain math operators (e.g. +, -, /, etc.).
We can also make much more complex objects with vectors and matrices,
as well as data frames, data tables and lists. For example, we
can create an object from a data matrix read in from a file:
> data <- read.csv("/Users/artemis/Desktop/rock_lichen_data.csv")
> data
moth canopy A
1 2 14.3 7.2
2 2 28.5 2.8
3 2 34.9 4.6
4 2 43.6 5.2
5 2 169.9 25.4
6 2 38.0 0.0
7 2 102.4 9.7
8 2 64.8 9.9
9 2 39.2 1.8
10 2 39.2 4.7
11 1 320.0 39.9
12 1 206.4 58.0
13 1 194.9 39.1
14 1 135.2 18.3
15 1 89.8 12.4
16 1 417.7 44.1
17 1 164.5 24.1
18 1 155.7 34.9
19 1 155.1 43.1
20 1 169.0 19.0
5 Statistical Analyses
5.1 t-test
A t-test is a very common parametric analysis of ecological data, used when
we have one or two samples that we would like to analyze. Here
we looked at an example where we were interested in testing whether
or not habitat restoration had improved the abundance of tigers in a
preserve, with a set of tiger counts taken once a month for
six months. We did a t-test in two different ways. The first way, we
used a command that comes in the R base package, t.test():
> tigers <- c(10, 14, 11, 12, 10, 18)
> t.test(tigers)
data: tigers
t = 9.934, df = 5, p-value = 0.0001765
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.265422 15.734578
sample estimates:
mean of x
12.5
This conducts a t-test using the default settings (see the help file),
testing whether the population mean is different from zero. One problem is that it is
a two-tailed test, which we don't want (counts can never be less than
zero). So we can re-run it specifying the argument "alternative" as
"greater" for a one-tailed test of the mean greater than zero:
> t.test(tigers, alternative = "greater")
data: tigers
t = 9.934, df = 5, p-value = 8.823e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
9.964453 Inf
sample estimates:
mean of x
12.5
We also programmed our own t-test from scratch using what we knew
about the mechanics of the t-test. First we calculated our observed
t, and then we computed a p-value for that t:
> tobs <- (mean(tigers) - 0)/(sd(tigers)/sqrt(length(tigers)))
> pt(tobs, df = length(tigers) - 1, lower.tail = FALSE)
[1] 8.82312e-05
Here the mean() and sd() commands compute our mean and
standard deviation for us. The length() command yields the number
of elements in our data vector "tigers", which is also our sample size
(i.e. n). The pt() command gives us our p-value. Setting the "lower.tail"
argument to FALSE gives us the area under the t distribution to
the right of our observed t.
5.2 Checking Assumptions

Before relying on the t-test we should check that the data are reasonably
close to normal. The shapiro.test command runs a Shapiro-Wilk test of normality:

> shapiro.test(tigers)

data: tigers
W = 0.8502, p-value = 0.158
[Figure: Normal Q-Q plot of the tiger counts (Sample Quantiles vs. Theoretical Quantiles).]
> hist(tigers)
[Figure: Histogram of tigers.]
5.3 Wilcoxon Signed-Rank Test

As a non-parametric alternative to the one-sample t-test, we also ran a
Wilcoxon signed-rank test with the wilcox.test command:

> wilcox.test(tigers, alternative = "greater")

data: tigers
V = 21, p-value = 0.01776
alternative hypothesis: true location is greater than 0
Here we see that the results, although different, yield the same answer
as our t-tests.
5.4 Regression
In R there are two steps to running a regression. First we need to
fit a model. Here we can use the lm command to fit a least squares
regression model to the data that we loaded earlier:
> attach(data)
> fit <- lm(A ~ canopy)
> fit

Call:
lm(formula = A ~ canopy)

Coefficients:
(Intercept)       canopy
     2.5353       0.1368
Notice that we used the attach command. This tells R to make an
object from each column of the data frame: a vector named by the
column name. This allows us to use these objects in our lm command.
Notice the structure of the argument here. This is our model specification.
It consists of the response variable (A), a tilde "~" signifying
regression, and our predictor variable (canopy). The lm command fits
the model and generates the typical output that you need to conduct
the regression analysis. Thus, we save this output to a new object,
fit. This object contains different components, such as the
fitted values and the residuals. We can access these components using
a dollar sign "$". For instance, we can pull the residuals out of
this output to check the assumption that the residuals are normally
distributed:
> residuals <- fit$residuals
> shapiro.test(residuals)
data: residuals
W = 0.843, p-value = 0.004082
> qqnorm(residuals)
> qqline(residuals)
[Figure: Normal Q-Q plot of the residuals with the qqline reference line.]
> hist(residuals)
[Figure: Histogram of residuals.]
Now we can use this information to conduct a regression analysis
(i.e. a hypothesis test for the slope different from zero), using the
summary command:
> summary(fit)
Call:
lm(formula = A ~ canopy)
Residuals:
Min 1Q Median 3Q Max
-15.5971 -6.1815 -2.7243 0.3875 27.2191
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.53526 3.69067 0.687 0.501
canopy 0.13685 0.02246 6.092 9.33e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is our standard F-table, which we can use to conduct our hy-
pothesis test.
5.5 ANOVA
Analysis of Variance (ANOVA) is theoretically very similar to regres-
sion. It makes sense then that in R conducting an ANOVA is nearly
identical to regression. We do only two things differently. First, we
need to make sure that we specify our predictor variable, in this case
”moth”, as categorical. We can do this with the factor command.
Second, we can use the anova command to produce an F-table that is
more useful for conducting an ANOVA. Overall, the process is still
very much the same:
> fit.anova <- lm(A ~ factor(moth))
> anova(fit.anova)
Response: A
Df Sum Sq Mean Sq F value Pr(>F)
factor(moth) 1 3421.7 3421.7 26.599 6.612e-05 ***
Residuals 18 2315.6 128.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One note here is that we could have made a new object for "moth"
as a factor and then used the new object in our lm command. It's
merely a matter of personal preference; R doesn't care:
> moth <- factor(moth)
> fit.anova <- lm(A ~ moth)
> anova(fit.anova)
Analysis of Variance Table
Response: A
Df Sum Sq Mean Sq F value Pr(>F)
moth 1 3421.7 3421.7 26.599 6.612e-05 ***
Residuals 18 2315.6 128.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6 Plotting
Visualizing data to look for trends is an important aspect of analysis.
We can use one command, plot, to create great plots that complement
both regression and ANOVA.
6.1 Scatterplots
To make a scatterplot, first we simply use the plot command:
> plot(canopy, A)
[Figure: Scatterplot of A against canopy.]
The plot command is expecting the "x" variable first and then the
"y" variable second. We could also specify the input as a formula:
> plot(A ~ canopy)

[Figure: Scatterplot of A against canopy, produced with the formula interface.]
> plot(canopy, A)
> abline(fit)

[Figure: Scatterplot of A against canopy with the least squares regression line added.]

This uses our fitted model, "fit", to draw the regression line.
If the first variable given to plot is a factor, R draws a boxplot instead:

[Figure: Boxplots of A for the two levels of moth (1 and 2).]
We can also use the barplot command to plot the mean of A for each
level of moth:

[Figure: Barplot of the mean of A for the two moth levels (no axis labels yet).]
This plot looks pretty poor at this point. There is a whole suite
of arguments that can be specified to alter how the plot looks. For
instance, you’ll notice that the axes have no names. We can add (or
change default) axis names by specifying the xlab and ylab arguments:
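Assuming mean.R and mean.S hold the two group means (they are used
again just below), the call might look like:

> barplot(c(mean.R, mean.S), xlab = "Moth", ylab = "Lichen Abundance")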
[Figure: Barplot with ylab = "Lichen Abundance" and xlab = "Moth".]
We can also add names to the two levels of ”moth” by specifying the
names argument:
> barplot(c(mean.R, mean.S), xlab = "Moth", ylab = "Lichen Abundance",
+ names = c("R", "S"))
[Figure: Barplot of lichen abundance with the bars labeled "R" and "S".]
7 Next class:
Next week, we'll go over the topics listed at the start of Day 2.
Introduction to Ecological Analysis in R (Day 2)
Matthew Lau
November 4, 2008
1. Integrative Analysis
2. Making Better Barplots
3. Community Data
4. Species Accumulation Curves
5. Visualizing Data
1 Integrative Analyses
When working in R it’s helpful to make sure that you manage objects,
functions (i.e. commands) and packages effectively. You can do this
by:
1. Setting the working directory to a specific folder where you can keep all
of the files pertinent to your analysis
• You can now enter the file name instead of the whole file path when
loading data into R
2. Promptly removing objects using the rm command
• rm(list=ls()) removes all objects visible to R
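A minimal sketch of this housekeeping, assuming a (hypothetical) folder
named R-course on the desktop:

> setwd("/Users/artemis/Desktop/R-course") # set the working directory
> data <- read.csv("rock_lichen_data.csv") # the file name alone now works
> rm(data) # remove a single object
> rm(list = ls()) # or remove everything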
2 Making Better Barplots

To make barplots with error bars, we can use the barplot2 function in
the gplots package. First, we install the package:

> install.packages("gplots")

Next, we load the package into R:
> library(gplots)
Now, we can load our data and make the vector of group means that we
will use to make our plot:
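A sketch of this step, assuming a file named BatData.csv with the
Bat_Abundance and Fire_Intensity columns used below:

> bats <- read.csv("BatData.csv")
> attach(bats)
> anova.means <- c(mean(Bat_Abundance[Fire_Intensity == 1]),
+     mean(Bat_Abundance[Fire_Intensity == 2]),
+     mean(Bat_Abundance[Fire_Intensity == 3]))
> anova.means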
Now we need to create a vector for our error bars. We do this
by first making a vector of standard deviations, which we divide by
the square root of our sample size (obtained by the length of one of
our response vectors):
> anova.sd <- c(sd(Bat_Abundance[Fire_Intensity == 1]), sd(Bat_Abundance[Fire_Intensity ==
+ 2]), sd(Bat_Abundance[Fire_Intensity == 3]))
> n <- length(Bat_Abundance[Fire_Intensity == 1])
> anova.se <- anova.sd/sqrt(n)
> anova.se
We now add and subtract the standard error vector from our means
to get our error bar vectors:
> anova.ci.u <- anova.means + anova.se
> anova.ci.l <- anova.means - anova.se
> anova.ci.u
> anova.ci.l
> barplot2(anova.means, plot.ci = TRUE, ci.u = anova.ci.u, ci.l = anova.ci.l)
[Figure: Barplot of mean bat abundance with error bars (no axis labels yet).]
If you look carefully at the argument specification, you can see that
to get our error bars we had to specify the logical argument plot.ci
as TRUE and the ci.u and ci.l arguments, which define the "upper"
and "lower" limits of the error bars, using our error bar vectors. We
then also had to specify the axis labels and the bar names to finish the plot:
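Putting it all together, the finished plot could be produced with
something like:

> barplot2(anova.means, plot.ci = TRUE, ci.u = anova.ci.u, ci.l = anova.ci.l,
+     xlab = "Fire Intensity", ylab = "Bat Abundance", names = c("1", "2", "3"))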
[Figure: Barplot of mean bat abundance (y-axis "Bat Abundance") for the three fire intensities (x-axis "Fire Intensity", bars 1-3) with error bars.]
3 Community Data
Working with community data presents its own set of questions and
issues. Here we detail how to:
1. Manage community data
2. Check how well our sample represents the community
3. Obtain univariate community statistics
4. Visualize community data
5. Test for patterns in our community data
> com.data <- read.csv("CommData.csv")
This file contains both the grouping information as well as the species
abundances for each observation in rows. First, we pull out our envi-
ronmental data and our community matrix:
> env <- factor(com.data$env)
> com <- com.data[, 2:ncol(com.data)]
> com <- as.matrix(com)
The as.matrix function converts our current "matrix" into the matrix
format recognized by R (the read.csv function imports it as a data
frame). The dollar sign "$" refers to a particular column (in
this case env) within a data frame (in this case com.data).
4 Species Accumulation Curves

Next, to check how well our sampling represents the community, we load
the vegan package:

> library(vegan)
Now, we will use the specaccum function to generate the species accumulation curve information:
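The call itself might look like the following (the number of permutations
here is an assumption):

> sp.curve <- specaccum(com, method = "random", permutations = 100)
> plot(sp.curve)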
Notice how we have specified our community matrix, method and per-
mutation arguments. Simply, this tells the specaccum function what
the data matrix is called, what method to use to generate the curves
and how many permutations to use to generate our confidence intervals.

[Figure: Species accumulation curve (method "random") with the number of sites on the x-axis.]
Considering that the curve levels off well before the number of
observations we have, we can confidently say that we have adequately
sampled the community. Now we can proceed with our analyses.
5 Visualizing Data

5.1 Heatmaps

One way to get a quick picture of the whole community matrix is a
heatmap made with the image function:

> image(com)

[Figure: Heatmap of the community matrix.]
You'll notice that we first used the as.matrix function on our community
matrix. This is because when we imported the data using the
read.csv function, R formatted it as a "data frame". Unfortunately,
some functions, such as the image function, require the data to be
in "matrix" format. You can check to see if an object is a matrix by
using the is.matrix function.
Next, we may want to alter this plot in order to make it more comprehensible.
For instance, we can change the axes by:
> obs = seq(1, nrow(com), by = 1)
> species = seq(1, ncol(com), by = 1)
> image(x = obs, y = species, z = com, xaxt = "n", yaxt = "n")
> axis(side = 1, tick = TRUE, at = obs, labels = env)
> axis(2, tick = TRUE, at = species, labels = colnames(com))
[Figure: Heatmap with the observations labeled by env (1 or 2) on the x-axis and the species names (V1-V30) on the y-axis.]
We now have a plot with the appropriate labels. To do this, we first
made two vectors that define the axis markings for our observations
and our species. We then created a plot with no axis labels by
specifying the arguments xaxt and yaxt as "n", which is short for
"none". We then used the axis function to plot both the x (observations)
and the y (species) axes. For more information on the argument
specifications for these functions, please consult the help files
(e.g. help(axis)).
We can also make a legend by adding the script at the bottom spec-
ifying the legend function:
> obs = seq(1, nrow(com), by = 1)
> species = seq(1, ncol(com), by = 1)
> image(x = obs, y = species, z = com, xaxt = "n", yaxt = "n")
> axis(side = 1, tick = TRUE, at = obs, labels = env)
> axis(2, tick = TRUE, at = species, labels = colnames(com))
> legend("topleft", legend = seq(min(com), max(com), by = 50),
+ col = heat.colors(9), pch = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0), bg = "white")
5.2 Ordination
The heatmaps are only useful to a certain extent for visualizing community
data. The more commonly used method is ordination. This
is merely a process of simplifying the multi-dimensional data into
something with fewer dimensions that is presentable and still conveys
the important patterns in the data. Non-metric Multidimensional Scaling
(NMS) using Bray-Curtis distance is the most commonly used ordination
technique for community data. This method is robust to
typical characteristics of community data. It also represents the data
in distance space, preserving the rank order of the observations.
To make the ordination plots we will need a couple of additional packages:
> install.packages("BiodiversityR")
> install.packages("ellipse")
As before with the gplots package, we now need to load these packages
into R:
> library(ecodist)
> library(BiodiversityR)
> library(ellipse)
> library(vegan)
Notice here that I have also loaded the vegan package, even though
I didn’t install it in the last step. This is because once you have
installed a package, as we did when we were generating species area
curves above, you do not need to install it again (unless you delete
it). You do, however, need to load it again if you have detached it or
closed and re-started R.
> dis <- distance(com, method = "bray-curtis")
> scree <- nmds(dis, mindim = 1, maxdim = 5, nits = 10)
Using random start configuration
... (repeated 50 times, once per random start) ...
[Figure: Stress values of the 50 minimized configurations plotted against their index.]
This plot is of the stress values from the minimized stress configurations
from ten random starts (nits=10) at each level of dimensionality,
one to five (mindim=1, maxdim=5), ranked in order from the
first to the last configuration, which means that the first ten stress
values are all from 1-D configurations, the second ten are from 2-D,
and so on. What we can see is that the stress rapidly decreases as
we increase the number of NMS axes, and that at two axes we have an
acceptable stress level with little variation among repeated random
starting configurations. Thus, 2-D is fine for our purposes.
Here is the code for a more presentable scree plot (dissect it at your
leisure):
> stress <- scree$stress
> axis.seq <- c(seq(1, 1, length = 10), seq(2, 2, length = 10),
+     seq(3, 3, length = 10), seq(4, 4, length = 10), seq(5, 5,
+     length = 10))
> plot(stress ~ factor(axis.seq))
[Figure: Boxplots of stress by number of dimensions (1-5).]
We can now re-run the nmds command with more iterations to ”ex-
plore” more of the ordination configuration space and ensure that we
have the lowest stress configuration possible:
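A sketch of this step, assuming two dimensions, 30 random starts, and
the object names nms.final and nms.min used below (nmds.min, also from
ecodist, pulls out the lowest-stress configuration):

> nms.final <- nmds(dis, mindim = 2, maxdim = 2, nits = 30)
> nms.min <- nmds.min(nms.final)
> plot(nms.min[, 1], nms.min[, 2])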
Using random start configuration
... (repeated 30 times, once per random start) ...
[Figure: Scatterplot of the minimum-stress configuration, nms.min[, 2] against nms.min[, 1].]
This plot is hardly satisfactory. To make a better plot we can use more
functions from our packages that are specific to making ordination
plots:
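A simple base-R sketch of the idea, coloring the points by the env groups:

> plot(nms.min[, 1], nms.min[, 2], xlab = "X1", ylab = "X2",
+     col = as.numeric(env), pch = 19)
> legend("topright", legend = levels(env), col = 1:nlevels(env), pch = 19)

Functions such as ordiplot and ordiellipse in the vegan package offer
more specialized ordination graphics.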
[Figure: Improved NMS ordination plot with axes X1 and X2.]
Last, to get the actual stress value for our final configuration:
> min(nms.final$stress)
[1] 0.1565177
Introduction to Ecological Analysis in R (Day 3)
Matthew Lau
November 14, 2008
> library(vegan)
> dist.com <- vegdist(com, method = "bray")
> anosim(dis = dist.com, grouping = env)
Call:
anosim(dis = dist.com, grouping = env)
Dissimilarity: bray
ANOSIM statistic R: 1
Significance: < 0.001
> mrpp(dat = com, grouping = env, distance = "bray")

Call:
mrpp(dat = com, grouping = env, distance = "bray")
> adonis(com ~ env, permutations = 1000)

Call:
adonis(formula = com ~ env, permutations = 1000)
NOTE: the number of permutations directly affects the resulting p-value.
This is because the p-value is calculated as the number of
permutations that result in an F-statistic greater than or equal to our
observed F-statistic, divided by the total number of permutations. For
instance, if you only ran ten permutations, the smallest non-zero p-value
that you could possibly get is 0.10 (i.e. one divided by ten).
The rule of thumb is to run at least 200 permutations, but try to
run as many as you can. Usually 1000 is sufficient.
V29 2 0.7588 0.001
> detach(package:labdsv)
To see how a for-loop works, we can tell R to add one to an object, "a",
ten times:

> a <- 0
> for (i in 1:10) {
+     a <- a + 1
+ }
> a

[1] 10
This is a very simple task, and we most likely would never do this,
but it illustrates all you need to know to construct almost any for-loop,
which is a very powerful thing. Note that inside of the curly-brackets,
we are telling R to add one to the object, ”a”, for ten steps. That is R
adds one to ”a” then does it again to the ”new” value of ”a” generated
by the last step, and so on, until it has done it ten times. Also note,
we created ”a” outside of the loop. This is important because it needs
to be created before we can do anything with it, and if we have it
inside the loop, R will make a new ”a” at each step.
Now, say for instance our advisor asks us to run tests for the effect of
our environmental factor on all of our species. This could take a long
time or a lot of lines of code to do by hand. However, for-loops can save the
day. We simply write the test into a for-loop properly and, voila, R
will do all of the tests for us:
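The loop might look something like this (a sketch, assuming an ANOVA of
each species column of com against env):

> results <- list()
> for (i in 1:ncol(com)) {
+     results[[i]] <- anova(lm(com[, i] ~ env)) # test the ith species
+ }
> results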
[[1]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 146890 146890 24.496 0.0001036 ***
Residuals 18 107938 5997
---
[[2]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 69502 69502 24.208 0.0001104 ***
Residuals 18 51679 2871
---
[[3]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 36722 36722 8.0937 0.01075 *
Residuals 18 81668 4537
---
[[4]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 112650 112650 47.359 1.95e-06 ***
Residuals 18 42816 2379
---
[[5]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 16416 16416 5.3427 0.03286 *
Residuals 18 55309 3073
---
[[6]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 78250 78250 76.257 6.885e-08 ***
Residuals 18 18470 1026
---
[[7]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 65322 65322 34.603 1.435e-05 ***
Residuals 18 33979 1888
---
[[8]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 135137 135137 126.88 1.388e-09 ***
Residuals 18 19171 1065
---
[[9]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 57459 57459 14.535 0.001275 **
Residuals 18 71156 3953
---
[[10]]
Df Sum Sq Mean Sq F value Pr(>F)
env 1 193258 193258 50.243 1.313e-06 ***
Residuals 18 69236 3846
---
You'll notice that we first made a new list object called "results".
We then wrote the for-loop, which consists of two lines:
1. The number of loops
2. The function to be repeated, which says: run this test on the column of
com equal to "i" (i.e. the loop or step number) and save it as the "i"th
component of the list "results"
R-Course Day 4
Matthew K. Lau
November 29, 2008
1. Review of Analysis in R
2. Writing Functions
3. Sorting Data
4. Principal Components Analysis
5. Information Theory Based Multi-Model Inference
1 Review of Analysis in R
Before we keep going with progressively more complex topics, I would like to bring
things back to brass tacks. At the heart of all of these fancy scripts and packages
is our desire to conduct efficient inferences about our questions of interest. This
inherently includes all aspects of analysis from data management to presenta-
tion. It’s very easy to lose sight of this in the midst of searching around for
the ”right” test and presentation format. I constantly remind myself by asking,
”what the $#&% is the point of all this?”
Once you get around the initial difficulties of using R it provides an analyt-
ical environment that has a greater potential. The goal of making plots and
conducting analyses is to generate a coherent, reasoned argument for or against
a particular line of inference, not to fulfill some predefined list of rules. Here is
a quick run through a hypothetical analysis scenario:
First we get our data, which in this case are generated by R, but we
could have also entered them via any number of data importing functions:
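For example (a sketch; the means and spread here are arbitrary, chosen to
roughly match the summaries below):

> x1 <- rnorm(10, mean = 10, sd = 1.5)
> x2 <- rnorm(10, mean = 15, sd = 1.5)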
Once we have the data input, we would most likely want to visualize and
summarize the data:
> plot(density(x1), xlim = c(0, 50))
> lines(density(x2), lty = 2)
> summary(cbind(x1, x2))
x1 x2
Min. : 7.455 Min. :11.18
1st Qu.: 8.838 1st Qu.:14.90
Median :10.028 Median :15.59
Mean : 9.799 Mean :15.06
3rd Qu.:10.639 3rd Qu.:15.75
Max. :12.211 Max. :16.84
[Figure: Kernel density plots of x1 (solid line) and x2 (dashed line).]
To test for a difference between the two samples, we use a two-sample t-test:

> t.test(x1, x2)

data: x1 and x2
t = -7.705, df = 17.849, p-value = 4.405e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.698512 -3.826834
sample estimates:
mean of x mean of y
9.798954 15.061627
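Next, we save both the data and the test results to file. A sketch of that
step, using the file paths that appear in the read.table calls below:

> write.table(cbind(x1, x2),
+     file = "/Users/artemis/Documents/FALL2008/R-Course_Day4_Example/Day4_data.R")
> write.table(unlist(t.test(x1, x2)),
+     file = "/Users/artemis/Documents/FALL2008/R-Course_Day4_Example/Day4_results.R")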
Notice that we used the cbind and the unlist functions to reformat the data
and the results before saving them as ”.R” files. To double check that everything
was saved correctly, we can read the files back into R:
> read.table("/Users/artemis/Documents/FALL2008/
R-Course_Day4_Example/Day4_data.R")
x1 x2
1 7.454862 16.06117
2 10.543494 15.67643
3 10.671207 15.48232
4 8.794536 15.51148
5 12.211224 14.69980
6 11.155776 15.73512
7 8.134445 11.17991
8 9.930993 16.83588
9 8.968685 15.74910
10 10.124315 13.68506
> read.table(file = "/Users/artemis/Documents/FALL2008/
R-Course_Day4_Example/Day4_results.R")
x
statistic.t -7.7050266505846
parameter.df 17.8489694691014
p.value 4.40486369636702e-07
conf.int1 -6.6985120477549
conf.int2 -3.82683448815842
estimate.mean of x 9.79895376046191
estimate.mean of y 15.0616270284186
null.value.difference in means 0
alternative two.sided
method Welch Two Sample t-test
data.name x1 and x2
Once we have our analyses done, we will most likely want to create a figure to
present these results. Previously, we used the barplot2 function in the gplots
package, but here we will use the lines function to plot our confidence intervals.
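A sketch of how that can be done, assuming the x1 and x2 samples from above:

> means <- c(mean(x1), mean(x2))
> ses <- c(sd(x1), sd(x2))/sqrt(10)
> bars <- barplot(means, names = c("X1", "X2")) # barplot returns the bar midpoints
> lines(x = rep(bars[1], 2), y = c(means[1] - ses[1], means[1] + ses[1]))
> lines(x = rep(bars[2], 2), y = c(means[2] - ses[2], means[2] + ses[2]))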
[Figure: Barplot of the means of X1 and X2 with confidence intervals drawn using lines.]
As you can see, the barplot function creates an initial plot, which we
then "write" on top of with the lines function.
And last, we clean up our mess by removing all of the objects that we created.
Had we loaded any packages that we weren’t going to use later on, we’d unload
them by the detach function:
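For example:

> rm(list = ls())
> detach(package:gplots) # only if gplots is currently loaded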
2 Writing Functions
We have been using functions left and right. All a function does is group together
other functions in an applied way. Because R is an "open-source" programming
language, many of the functions in its employ have been created by users of R
(e.g. barplot2). We can see all of the code (i.e. the guts) of any function by
typing in the name of the function with no parentheses:
> sd
Function Anatomy
Functions have a structure very similar to loops. In fact, one way of looking
at a function is as something that creates a new function from existing functions.
For example, if we want to create a function that will calculate the standard
error of a sample for us, we would write:
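A minimal version of such a function, assuming we name it SE:

> SE <- function(x) {
+     se <- sd(x)/sqrt(length(x)) # standard error of the mean
+     return(se)
+ }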
Let's take a look at what's going on. First, there is the function function
itself. This allows the arguments to be specified in a general, object-oriented way.
When we use our function later on, these arguments will need to be specified.
Second, we specify the "action" of the function (i.e. what the function will be
doing). This can be anything. Here we calculate the standard error, "se", and
then use the return function to give us the calculated value. Last, we create
an object with the backward arrow, <-. This names our function by saving it
as an object. Now, we can use our function:
> SE(x)
[1] 0.3822643
Notice here that functions can be composed of any other functions, including
loops. For example:
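Here is a sketch of one such function, a loop-based sum (the particular
input in the call is arbitrary):

> loop.sum <- function(x) {
+     total <- 0
+     for (i in 1:length(x)) {
+         total <- total + x[i] # add each element in turn
+     }
+     return(total)
+ }
> loop.sum(2 * (1:10))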
[1] 110
Functions are the atoms of the nucleus that is R. Knowing how to write them
will allow you to do many things in R that you cannot do in your standard
point-and-click stats packages.
3 Sorting Data
Sorting data is an essential part of data management. In R sorting is very
simply achieved by using the bracket specification. For example, if we wanted
to sort our data matrix by the central column from lowest to highest:
> data <- cbind(1:10, rep(c(1, 2), 10), rnorm(10))
We could go in and create a vector of the positions of all the values in the
correct order, and put it in our bracket specification:
> sorting.vector <- c(1, 3, 5, 7, 9, 11, 13, 15, 17, 19,
+ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
> data[sorting.vector, ]
[18,] 6 2 1.2972297
[19,] 8 2 0.3452335
[20,] 10 2 0.1251559
Note that we are placing our "sorting vector" in the row portion of our bracket
specification. This is because we are telling R to give us the rows of the object
data as specified by our sorting vector, just like we've used it before to pull out
portions of data vectors and matrices.
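Writing the sorting vector out by hand does not scale, so we can let the
order function build it for us. A sketch, assuming we sort on the second column:

> sorting.vector2 <- order(data[, 2])
> sorting.vector
> sorting.vector2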
[1] 1 3 5 7 9 11 13 15 17 19 2 4 6 8 10 12 14 16 18 20
[1] 1 3 5 7 9 11 13 15 17 19 2 4 6 8 10 12 14 16 18 20
> data[sorting.vector2, ]
[3,] 5 1 -0.2910797
[4,] 7 1 2.1438397
[5,] 9 1 0.4063706
[6,] 1 1 -0.3354490
[7,] 3 1 0.4490912
[8,] 5 1 -0.2910797
[9,] 7 1 2.1438397
[10,] 9 1 0.4063706
[11,] 2 2 -1.1793572
[12,] 4 2 1.0220319
[13,] 6 2 1.2972297
[14,] 8 2 0.3452335
[15,] 10 2 0.1251559
[16,] 2 2 -1.1793572
[17,] 4 2 1.0220319
[18,] 6 2 1.2972297
[19,] 8 2 0.3452335
[20,] 10 2 0.1251559
4 Principal Components Analysis

Load the vegan package and the "dune" data set, which comes with it:
> library(vegan)
> data(dune)
> dune[1:3, ]
[1] TRUE
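The eigenvalues come from the eigen function applied to the covariance matrix
of the data; a sketch of that step, matching the object name printed below:

> e.values <- eigen(cov(dune))$values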
> e.values
[9] 2.481984e+00 1.853767e+00 1.747117e+00 1.313583e+00
[13] 9.905115e-01 6.377937e-01 5.508266e-01 3.505841e-01
[17] 1.995562e-01 1.487978e-01 1.157526e-01 2.477742e-16
[21] 1.265238e-16 9.915497e-17 7.827192e-17 7.767934e-17
[25] -1.615479e-17 -1.670766e-16 -1.790788e-16 -1.994938e-16
[29] -7.571117e-16 -7.735208e-15
> e.values[1]/sum(e.values)
[1] 0.2947484
A scree plot is a nice visual summary of this information. Here cpve holds
the cumulative proportion of variance explained (e.g. cpve <- cumsum(e.values)/sum(e.values)):

> plot(cpve)
> abline(a = 0.95, b = 0)
To get the principal components, we simply multiply the original data matrix
by the eigenvector matrix that we can get from the eigen function that we
used to get our eigenvalues. To do this, we need to first convert the dune data
into a matrix (it's currently a data frame), and then we need to use the matrix
multiplication operator "%*%" to multiply the two together:
> X <- as.matrix(dune)
> V <- eigen(cov(dune))$vector
> PC <- X %*% V
> PC[1:5, ]
[,1] [,2] [,3] [,4] [,5] [,6]
2 7.8632333 -6.9156455 -1.802930 -0.7592717 1.0038289 4.880190
13 -0.1837002 -9.7166345 4.098879 -1.7969386 -0.5913193 3.223097
4 2.0525466 -9.3132762 -2.539806 -0.7808169 -3.3998013 1.247622
16 -7.5610307 -3.9151554 -1.455246 -6.4509398 -1.8612388 2.165339
6 8.9943182 -0.1316022 2.075440 -8.7597034 -4.8445081 2.326789
[,7] [,8] [,9] [,10] [,11] [,12]
2 -5.013353 -2.2455974 -2.6476661 -2.8427649 -0.4593202 1.2118617
13 -2.296845 0.8166063 -2.1470136 0.3124062 -2.8458103 2.1506161
4 -7.624091 3.8758407 1.2243159 -1.9621888 -1.0915741 0.6602083
16 -2.827212 1.5025837 -2.2036058 -3.7778249 0.3165852 0.5928802
6 -3.365366 0.2061268 0.2744863 -3.2406152 -0.9433286 2.7334334
[,13] [,14] [,15] [,16] [,17] [,18]
2 -0.40358193 0.1596677 1.22251362 1.0052880 0.4765954 -0.6637914
13 -0.73115235 1.6597452 -0.03941401 1.9043549 -0.3642813 -0.7397268
4 0.12584658 1.6981585 0.38371399 0.9805425 0.3957807 -0.8773133
16 0.08986328 1.6755371 1.84847618 2.3480531 0.1527719 -0.4932394
6 0.67422064 0.9965782 -0.47402683 1.3346653 0.3012177 -0.5218971
[,19] [,20] [,21] [,22] [,23] [,24] [,25]
2 0.8675266 0.4202482 0.6969981 0.349236 -1.300027 0.2059339 0.6115661
13 0.4284945 0.4202482 0.6969981 0.349236 -1.300027 0.2059339 0.6115661
4 0.4281225 0.4202482 0.6969981 0.349236 -1.300027 0.2059339 0.6115661
16 0.3784169 0.4202482 0.6969981 0.349236 -1.300027 0.2059339 0.6115661
6 0.6381589 0.4202482 0.6969981 0.349236 -1.300027 0.2059339 0.6115661
[,26] [,27] [,28] [,29] [,30]
2 -0.8477374 -0.1990721 -0.8456182 0.1189631 0.5027105
13 -0.8477374 -0.1990721 -0.8456182 0.1189631 0.5027105
4 -0.8477374 -0.1990721 -0.8456182 0.1189631 0.5027105
16 -0.8477374 -0.1990721 -0.8456182 0.1189631 0.5027105
6 -0.8477374 -0.1990721 -0.8456182 0.1189631 0.5027105
And, if we were so inclined, we could plot all pairs of the first 10 principal
components using the pairs function:
> pairs(PC[, 1:10])
[Figure: Pairs plot of the first 10 principal components (panels labeled var 1 through var 10).]

When we are finished with the PCA, we clean up by detaching the package and
removing our objects:

> detach(package:vegan)
> rm(list = ls())
5 Information Theory Based Multi-Model Inference
We can get our richness estimate from the dune data by using the specnumber
function:
> library(vegan)
> data(dune)
> data(dune.env)
> R = specnumber(dune)
Once we have our data and our models of interest, the steps for
conducting the analysis are to fit each model, compute its information
criterion, and convert these into model weights:
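The models themselves were specific to this example; as a sketch of the
general recipe, suppose (hypothetically) that we compare models of richness
built from two dune.env variables and their interaction:

> fit1 <- lm(R ~ A1 * Moisture, data = dune.env) # interaction model (hypothetical)
> fit2 <- lm(R ~ A1 + Moisture, data = dune.env)
> fit3 <- lm(R ~ A1, data = dune.env)
> fit4 <- lm(R ~ Moisture, data = dune.env)
> fit5 <- lm(R ~ 1, data = dune.env)
> ic <- AIC(fit1, fit2, fit3, fit4, fit5)$AIC # information criterion for each model
> D <- ic - min(ic) # difference from the best model
> weights <- exp(-D/2)/sum(exp(-D/2)) # Akaike weights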
Model # Info. Crit. D Exp(-D/2) Weights
1 1 85.69719 0.000000 1.000000000 0.977879763
2 4 94.94671 9.249522 0.009805996 0.009589085
3 5 94.95225 9.255069 0.009778840 0.009562530
4 2 98.66867 12.971484 0.001525029 0.001491295
5 3 98.68749 12.990304 0.001510745 0.001477327
From this we can see that the clear winner is Model 1 or the Interaction
Model, which has the highest Akaike weight.
Introduction to Ecological Analyses in R (Day 5):
Monte-Carlo Methods and Bayesian Beginnings
Matthew K. Lau
1 Monte Carlo Methods

1.1 Flipping a Coin

First, we need to be able to "flip" a coin. This can be achieved by using the
sample function:
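A sketch of the set-up, storing the two faces of the coin as a character
vector named coin (the name referred to below):

> coin <- c("H", "T")
> sample(coin, size = 1, replace = TRUE, prob = NULL)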
[1] "H"
Here you can see that the function drew a "sample" of size one from our object
"coin". The latter two arguments specify whether or not previously chosen values
can be drawn again when the sample size is larger than one (replace), and the
probability of sampling each value in the object (prob), where the default
(i.e. NULL) applies equal probability to all values.
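To estimate the probability of heads we repeat the flip many times. A sketch,
assuming 1,000 flips recoded so that heads = 1 and tails = 0 (matching the
plots below):

> flips <- ifelse(sample(coin, size = 1000, replace = TRUE) == "H", 1, 0)
> plot(flips) # each flip against its index
> abline(h = mean(flips)) # horizontal line at the mean
> barplot(table(flips), ylab = "Number of Flips")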
[Figure: Outcome (0 = tails, 1 = heads) of each simulated coin flip plotted against its index, with a horizontal line at the mean.]
[Figure: Barplot of the number of flips landing on 0 (tails) and 1 (heads).]
Looking at the first plot, we can see that the outcomes of our flips
tend to come up pretty evenly as tails (=0) and heads (=1). The center line here
shows the mean value of our simulation, which in this case would be equivalent
to the probability of the flip coming up heads. We can calculate the probability
of either heads or tails by counting the number of flips that produced either one
and dividing by the total number of flips:
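A sketch of that calculation:

> p.heads <- sum(flips == 1)/length(flips)
> p.tails <- sum(flips == 0)/length(flips)
> cbind(p.heads, p.tails)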
p.heads p.tails
[1,] 0.511 0.489
Both of these probabilities are very close to the analytical solutions: P(heads)
= 1/2 = 0.5 and P(tails) = 1/2 = 0.5. However, the values are just a little off
of the mark. This is because there is a small amount of random error inherent
to the MC algorithm. Theoretically, as we increase the number of simulations
toward infinity, this error would diminish to zero. It is generally recommended
to not only run a large (>1000) number of simulations, but also assess the MC
error of the simulation by looking at the variance of multiple MC simulations (i.e.
repeat the MC simulation over and over, each time calculating the probability
value of interest and then calculate the variance of the simulated probabilities).
This is an instance where writing functions can really help to reduce the length
of the script!
1.2 Dice: craps simulation
Here is another example where the analytical solution is easy to obtain, giving
us an easy way to confirm our MC simulation. Say we are going to Las Vegas and
we would like to try our luck at the craps table. Let’s figure out the probability
of rolling a 7 or 11 in one roll of two dice.
1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
The above table shows all of the possible outcomes of one roll of two dice.
From this we can calculate the analytical solution which is P(7) + P(11) = 6/36
+ 2/36 = 0.17 + 0.06 = 0.23.
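A sketch of the simulation, assuming 1,000 rolls of two dice:

> rolls <- sample(1:6, 1000, replace = TRUE) + sample(1:6, 1000, replace = TRUE)
> sum(rolls == 7 | rolls == 11)/length(rolls)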
[1] 0.203
As you can see this MC simulation solution is very close to the analytical
solution, but just a little off, as with the coin example above.
1.3 Monte Carlo Test of Two Samples from the Same Population: Boxing Kangaroos
A well-respected biologist suggests to you that kangaroos from W. Australia
are better boxers (i.e. better intra-specific competitors) than kangaroos from E.
Australia, due to selective pressure from greater resource limitation in the West.
You set out to test this hypothesis. After collecting kangaroo boxing ability
estimates, which are continuous and range from zero to infinity, from both sides of
the continent, you wish to test the hypothesis that the western population has
no better or worse boxing ability than the eastern population.
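A sketch of a Monte Carlo (permutation) version of this test, assuming the
hypothetical vectors west and east hold the boxing-ability estimates, and that
obs.d and sim.d are the observed and simulated differences used below:

> obs.d <- abs(mean(west) - mean(east)) # observed difference
> pool <- c(west, east) # pool the two samples
> sim.d <- replicate(10000, {
+     shuffled <- sample(pool) # reshuffle group membership
+     abs(mean(shuffled[1:length(west)]) -
+         mean(shuffled[-(1:length(west))]))
+ })
> sum(sim.d >= obs.d)/length(sim.d) # Monte Carlo p-value
> mean(sim.d >= obs.d) # the same thing, more compactly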
[1] 0.0264
[1] 0.0264
> hist(sim.d)
> abline(v = c(mean(sim.d), obs.d), lty = c(2, 1))
[Figure: Histogram of sim.d with vertical lines at the mean of sim.d (dashed) and at obs.d (solid).]
2 Bayesian Beginnings

Bayesian statistics defines probability as a degree of belief in a hypothesis,
rather than basing it, as the frequentist approach does, on the long-run
frequency of an event. The results of Bayesian statistical approaches are
often more intuitive, since they typically give us direct estimates
of our hypotheses of interest, rather than awkwardly tiptoeing around with p-
values in the Null Hypothesis Testing framework that is the most prevalent ap-
proach. Also, Bayesian methods are logically in line with the scientific method,
because they develop a subjective estimate of probability, as a degree of belief,
in a particular hypothesis by quantitatively modifying prior beliefs (prior proba-
bility) with current observations (likelihood) to generate an updated probability
estimate (posterior probability).
But why talk about Bayesian methods in conjunction with MC methods? Be-
cause, MC methods are at the core of a modern revolution in Bayesian Statistics.
Observe Bayes’ theorem, which allows one to calculate the posterior probability:
Posterior = (Prior * Likelihood) / (Marginal Probability)
Ignoring the denominator for a second, the numerator shows us that the Posterior
is simply a modification of the Prior by the Likelihood, which comes from the data
that we have in hand. In other words, we start with an estimate of the probability
of our hypothesis of interest (which must be quantitative), collect data and
calculate the Likelihood, which is the probability of observing our data given our
hypothesis, and multiply these two together. This, however, only produces a value proportional
to the Posterior probability. To get the true Posterior probability estimate, we
must divide by the Marginal Probability, which is simply a normalizing constant.
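To see this machinery in action, we need some data for a simple regression.
A sketch of a simulated example (the slope, intercept and noise level are arbitrary):

> x <- seq(10, 100, by = 5)
> y <- 1.2 * x + rnorm(length(x), sd = 15)
> plot(x, y)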
[Figure: Scatterplot of the simulated regression data, y against x.]
Now, we load up our MCMCpack functions and conduct our Bayesian regres-
sion:
> library(MCMCpack)
> mcmc.output <- MCMCregress(y ~ x)
> xtable(summary(mcmc.output)[1]$statistics, caption = "Bayesian Results1")
We could conclude here that the true slope parameter is not equal to zero with
95% probability, which is a similar, but not identical, conclusion to the one we
would come to with the frequentist, null hypothesis approach. However, there is more
that can be done with the posterior probability distribution. Two things, which
I will discuss briefly and refer you to the web to learn more about, are multi-
model inference and prediction. Bayesian multi-model inference is very similar
to frequentist based multi-model inference (often referred to as AIC for the
most widely used information criterion, Akaike’s Information Criterion) except
that the posterior probabilities would be used to compute estimates of relative
model performance. Also, the posterior probability distribution can be used to
generate predictions about our parameters of interest. This allows one to use all
available information at hand in a systematic analytical framework to predict
future values, which is especially useful for reducing variance in estimates by
using prior information to augment noisy data.