Week - 1 - Getting - Started in RStudio - 2023

This document provides an overview and introduction to using RStudio for a quantitative biology course. It discusses that RStudio is a powerful and flexible software for data analysis and visualization. It has advantages over other statistical software like being free, open-source, and providing transparent and reproducible analysis. The document outlines how to set up the RStudio environment, including setting a working directory and uploading data and packages. It also previews the learning outcomes for the first week, which will focus on data frames, the RStudio interface, and basic descriptive and visual analytics.

Quantitative Biology: Week 1

Week 1 Getting started in RStudio


RStudio is an integrated development environment for R, a suite of software facilities for
data manipulation, simulation, calculation and graphical display. R handles and analyses data
very effectively and it contains a suite of operators for calculations on arrays and matrices.
In addition, it has the graphical capabilities for very sophisticated graphs and data displays.
Finally, R is an elegant, object-oriented programming language.
There are lots of statistical software packages available, so why are we using RStudio? The
advantages to using RStudio include:
1. Independent, free and open source, which means you can complete your assignments
on your own computer at home
2. Created by scientists, who are constantly improving and providing new code
3. Provides the most transparent and reproducible form of data analysis, which is
becoming a requirement in science
4. More powerful and flexible than most other statistical packages
5. Provides you with a more in-depth understanding of statistics in comparison to most
other ‘point and click’ statistical packages
6. Huge amount of online support available, and through the R website forum, questions
are addressed quickly.
7. Looks great on your CV!!

For those with no programming experience, RStudio is a user-friendly interface to R. It’s a
great platform to get started, and once you’ve mastered RStudio, you will also have the
knowledge to work directly in R, should you want to.
Now don’t be scared by the prospect of coding! This unit has been designed to enable you
to walk away with the ability to design, implement and analyse your own data. That is the key
aim of this unit. Therefore, we will provide the code where necessary. However, for those of
you excited by the prospect of learning code and who want to take it a step further, there
will be ample opportunity to understand how the code is compiled, break it down and
further develop your coding skills.
The first couple of weeks will focus on getting familiar with the RStudio environment,
before we move on to statistically analysing and graphing your data. If you find that you are
struggling during the first 2 weeks, then I recommend that you jump on YouTube. I have
found the following videos particularly instructive.
• RStudio training by Mike Marin (several short videos on key features)
• RMarkdown with Roger Peng
• Introduction to RStudio by Justin Murphy

You will also find this week’s practical recorded and available online. I will only record the
first week’s practical as we are laying the foundations down for the course and you will
have lots of questions. If you feel that you struggled in the first week, you can view the
recording and stop and start it as you go through the practical at your own pace.

As you work through the practical, please make sure that you answer all the questions
(denoted in blue) and complete any exercises provided. Your tutor will go over these key
questions in class, so also check that you got them correct.

Today’s learning outcomes


1. Understand what a data frame is and how to save data for importing into RStudio
2. Introduction to the RStudio environment
3. How to set a working directory
4. How to upload data
5. How to upload packages
6. Descriptive data
7. Visualising data using histograms and box plots
8. Vectors and matrices


1. Data and data files


It is important before we start that you understand how data is organised and laid out in
tables. For statistical analysis to be effective, you need to make sure that you have
structured your data correctly before you start analysing it. So we are going to have a look
at the Excel file Dummy Data first, which you can find on Blackboard. Once you have the file
open, inspect it and answer the following questions:
1. How many rows and columns are there?

2. Which columns contain the dependent variable and which the independent
variable? How do we distinguish between them?

3. What is a cell in a data frame?

4. What is the number in cell (3,3)? HINT: Think about the row and column
numbering.

5. What is the difference between a data frame and a matrix?

6. What file type is the Dummy Data?


2. Introduction to the RStudio environment


RStudio will already be installed on the university computer you may be working on.
However, if you decide to use your own computer/laptop, you will need to install the
software. First, you need to install R before you can install RStudio.
Install R: https://cran.curtin.edu.au/
Install RStudio: https://www.rstudio.com/products/rstudio/download/#download
or you can simply Google “install R” and “install RStudio”.
Once both R and RStudio are installed, find RStudio using the start button and open it up.
When you open up RStudio you will notice 3 panels: the console, the workspace and the files panel.

You will introduce a fourth panel, the editor, by clicking on the new file icon (or File > New
File) and choosing a new R script.

The editor window is where you will write your code, the results of which are displayed in
the console window. The workspace shows you what data you are working with and the
files panel, which is further subdivided, provides information on packages available,
provides help with functions and is where your plots will appear. The first thing we need to
do is set up a working directory.


3. Organising data and setting the working directory


It is always good to get into the habit of organizing your data into appropriate folders, and
even more so with statistical analysis when you are often using lots of different data sets
and you can create lots of outputs!
So one of the first things we need to do is create a Quantitative Biology folder and set this
as your working directory. I recommend that you put ALL your files into this working
directory and store the working directory in a place that can be easily accessed from more
than one computer, e.g. your I drive, Dropbox etc.

Now we need to set this folder location as your working directory. There are several
ways to do this, but the simplest is to:
1. Click on the Session menu at the top of the window
2. Go to Set Working Directory
3. Here you can “Choose Directory” by navigating to and opening the folder.
You can always check your working directory by typing in getwd(), and it will return the
answer in your console.
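
If you prefer typing to clicking, the same steps can be sketched in code. This is only an illustration: the folder is created inside a temporary location so the example runs on any computer, and the names qb_dir, old_dir and current_dir are made up for this sketch. On your own machine you would give setwd() the path to your real Quantitative Biology folder.

```r
# Create a folder to act as the working directory.
# tempdir() is used only so that this sketch runs anywhere (an assumption);
# replace it with the location of your own Quantitative Biology folder.
qb_dir <- file.path(tempdir(), "Quantitative Biology")
dir.create(qb_dir, showWarnings = FALSE)

# setwd() changes the working directory and returns the previous one
old_dir <- setwd(qb_dir)

# getwd() reports where R is currently looking for files
current_dir <- getwd()
current_dir

# Go back to where we started
setwd(old_dir)
```

Typing the path has one advantage over the menu: the line can be saved at the top of your script, so the working directory is set again every time you run it.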

Now that you have set the folder as your working directory, this is where you should store
all your data files so they can be uploaded into the program.


4. How to upload data


There are several ways to upload your own data, but the data needs to be saved as either
a .csv file or a .txt file. I recommend saving your Excel data spreadsheets as a .csv file as this
file type can be accessed either through your working directory or from another location on
your computer. For this first exercise, we will be using a dummy data set, which you can
find as an Excel spreadsheet on Blackboard. This data contains three columns, one with
categorical data and two with numeric data, therefore we call it a data frame.
Now we are going to download the Dummy data set from Blackboard and save it into your
working directory as a .csv file. It should now appear in the working directory folder that
you have created. Double click on it to view it. To upload the data into your global
environment, type the following command:
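Plants<-read.csv("Dummy data.csv", stringsAsFactors=TRUE)
#the <- means that you are creating an object called Plants that contains these data
#stringsAsFactors=TRUE reads text columns in as factors (no longer the default from R 4.0.0)

NOTE: when you see the # symbol, the text after it is a comment. It’s just there to give you
extra information, and R won't treat it as code!!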
The data set with 10 observations and 3 variables should now be in your global
environment.

Check that the data frame has imported correctly by viewing the data. To view the data at
any time, you can click on the table icon next to a data set in the global environment.
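
If you want to try the import step before you have downloaded the Blackboard file, here is a small sketch. The values below are made up for illustration and are NOT the Dummy Data; the names demo, demo_file and Plants_demo are also just placeholders. The sketch writes a tiny .csv file and reads it straight back in.

```r
# Made-up values for illustration only (not the Blackboard Dummy Data)
demo <- data.frame(
  Fertiliser = rep(c("NO", "YES"), each = 3),
  Plant.1    = c(2.4, 3.0, 3.3, 4.1, 4.5, 5.1),
  Plant.2    = c(2.5, 3.1, 2.9, 6.8, 7.2, 7.6)
)

# Save it as a .csv file (in a temporary folder so the sketch runs anywhere)
demo_file <- file.path(tempdir(), "demo data.csv")
write.csv(demo, demo_file, row.names = FALSE)

# Read it back in; text columns become factors
Plants_demo <- read.csv(demo_file, stringsAsFactors = TRUE)

dim(Plants_demo)    # number of rows, then number of columns
names(Plants_demo)  # the column headings
```

Note that read.csv() was given the full path to the file here. If the file sits in your working directory, the file name on its own is enough.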


5. How to upload and attach packages

There are lots of additional packages that can be installed on top of the ones already in
there. Packages contain different types of tests and/or plot functions and allow you to do even
more with your data. A list of the most useful packages can be found at:
https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
Now make sure you understand the difference between upload and attach. To upload (install) a
package means that you don’t yet have the package on your computer and have to
be connected to the internet to install it. You only have to do this step once, so
once the package is on your computer, it will always be there (unless you actively remove
it!). BUT, once it’s uploaded on to your computer, you still have to tell RStudio when you
want to use it. To use the package, you have to attach it, and this you have to do in every
new RStudio session in which you want to use the package.
How to upload a package
If you have to install a package from the CRAN website, click on the Install button in the
Packages tab of the files panel and follow the instructions. It may take a few minutes to
install a package, so just let it do its thing! You know that RStudio is processing information
if you see the stop sign in the console; just leave it until the stop sign disappears. You can
check that it has installed properly by clicking on the Packages tab in the files panel and
seeing if the package is there.
install.packages("car")

How to attach a package.


To attach a package (i.e. it is already installed on your computer, but not active), type in the
following command. We are going to attach a package called “car”, which you will use a lot
over the next few weeks.
library(car)

To check that the “car” package has attached, scroll down your packages list and you should
see a tick in the box next to it.
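
The install-once, attach-every-time pattern can also be written so that R only installs a package when it is missing. The sketch below is one common way of doing this, not the only one. It uses the "stats" package as a stand-in because it ships with R, so nothing is downloaded when you run it; the variable name pkg is made up for this example, and you would change its value to "car" for this unit.

```r
# Name of the package we want. "stats" ships with R and is used here only
# as a stand-in (an assumption for this sketch); use "car" for this unit.
pkg <- "stats"

# Install only if the package is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Attach the package for this session
# (character.only = TRUE is needed because pkg holds the name as text)
library(pkg, character.only = TRUE)

# Check that it is now attached
pkg %in% .packages()
```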


6. Descriptive data

One of the first things we do when working with data is to have a good look at the data, get
some descriptive information (e.g. mean, median, min, max and interquartile values) and
plot the data. We can get all that descriptive data using the summary function.
summary(Plants)
##  Fertiliser    Plant.1         Plant.2     
##  NO :5      Min.   :2.400   Min.   :2.400  
##  YES:5      1st Qu.:3.125   1st Qu.:3.025  
##             Median :3.900   Median :4.950  
##             Mean   :3.790   Mean   :4.980  
##             3rd Qu.:4.500   3rd Qu.:6.975  
##             Max.   :5.100   Max.   :7.600  

There are also some other really useful functions that you should start using when you first
import a data set into the R environment. Type the following functions into RStudio
and see what information they give you back about your data set.
A. str(Plants)

B. dim(Plants)

C. View(Plants)

D. head(Plants)
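
To see what these functions return without relying on the Dummy Data, you can try them on PlantGrowth, a small data set that comes built into R. It is used here only as a stand-in and is not the data for this practical. View() is left out of the sketch because it opens a viewer window rather than printing to the console.

```r
# PlantGrowth is a built-in R data set, used here as a stand-in
data(PlantGrowth)

# Structure: the class of each column and its first few values
str(PlantGrowth)

# Dimensions: number of rows, then number of columns
dim(PlantGrowth)

# The first rows of the data frame (here just the first 3)
head(PlantGrowth, 3)
```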

Using the help function!!


Sometimes you might come across a function and not understand what it does or how to
build it. But all the information is already in RStudio - you just need to access it. Try using
the help function to work out what tail does by clicking on the help tab and writing in tail.
What does tail do?
?tail

When you have a data set, you often just want to find out some information about one
column. To refer to a column in a data frame, you use the $ symbol. For example, if I want
to get the mean value of Plant.1 then I type in:
mean(Plants$Plant.1)
## [1] 3.79

It is also important to check what class our data is, i.e. is it numeric, integer or a factor. This
will become more important as you carry on throughout the semester. You can check what
class each variable in the data frame is using the "str" function, or you can check one
column like this:
class(Plants$Plant.1)
## [1] "numeric"

OK, let’s say that you wanted to know the mean value of plants in Plant.1 but only those
plants that had had fertiliser treatment. To do this, we need to use your factor, which here is
Fertiliser. You can ONLY use factors to differentiate groups within a column. There are two
ways of finding this mean value out.
mean(Plants$Plant.1[Plants$Fertiliser=="YES"])
## [1] 4.46

or
tapply(Plants$Plant.1, Plants$Fertiliser, mean)
## NO YES
## 3.12 4.46

tapply is a VERY useful command, and one that we will use often.
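
The same two approaches work with any summary function, not just mean. As a sketch, here they are applied to the built-in PlantGrowth data set (a stand-in, not the Dummy Data), where weight is the numeric column and group is the factor. The object names ctrl_mean, group_means and group_max are made up for this example.

```r
# PlantGrowth is a built-in R data set, used here as a stand-in
data(PlantGrowth)

# Approach 1: pick out one group with square brackets, then summarise it
ctrl_mean <- mean(PlantGrowth$weight[PlantGrowth$group == "ctrl"])

# Approach 2: tapply(values, factor, function) summarises every group at once
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Swap the function to get a different summary, e.g. the maximum per group
group_max <- tapply(PlantGrowth$weight, PlantGrowth$group, max)

ctrl_mean
group_means
group_max
```

The value from approach 1 should match the "ctrl" entry that tapply returns in approach 2.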
Exercise 1
• What is the mean plant height of plant 1 with and plant 1 without fertiliser?
• What is the mean plant height of plant 2 with and plant 2 without fertiliser?
• What is the maximum height value of plant 1 and plant 2 with fertiliser?
Did you get these answers: 4.46 cm and 3.12 cm; 7.06 cm and 2.9 cm; 5.1 cm and 7.6 cm?


7. Visualising data using histograms and box plots


When you are given a data set or have collected your own data set, one of the very first
things that you should do is graph the data. You need to get a really good idea of what the
data ‘looks’ like, in particular the spread of data and the amount of variability in the data
set. You should do this along with getting the basic descriptives of the data set.
To get an idea of the spread of data, you should create a histogram plot using the
dependent variable i.e. the value that is changing in response to an independent variable
such as gender or an environmental factor. In the case of the plant data, height is the
dependent variable. To create a histogram, we use the histogram function hist().
hist(Plants$Plant.1)

What’s going on here? What do we need to do?

par(mfrow=c(1,2))
hist(Plants$Plant.1[Plants$Fertiliser=="YES"])
hist(Plants$Plant.1[Plants$Fertiliser=="NO"])

What sort of distributions do we have?

Use the help function to work out what 'par(mfrow=c(1,2))' does. Type in par in the ‘help’
menu (or ?par) and scroll down to the relevant (mfcol, mfrow) section in the help menu.
Sometimes we may want to add a line over the top of the fertilised treatment for Plant.1 to
check whether it follows a normal distribution.
par(mfrow=c(1,1))
hist(Plants$Plant.1[Plants$Fertiliser=="YES"])
xfit<-seq(3,6, length.out=100) #create a sequence of numbers
#find y values on a normal distribution with appropriate mean and sd for
#each x value in xfit (the *3 rescales the height of the curve)
yfit<-dnorm(xfit, mean=mean(Plants$Plant.1[Plants$Fertiliser=="YES"]),
            sd=sd(Plants$Plant.1[Plants$Fertiliser=="YES"]))*3
lines(xfit,yfit) #plot the line
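
If you want to draw a curve like this on a different histogram, you need some way of choosing the scaling factor. One common approach (an assumption here, and not necessarily how the value above was chosen) is to multiply the normal density by the number of observations and the width of the histogram bars. The sketch uses the built-in PlantGrowth data set as a stand-in, and the names x, h and bin_width are made up for this example.

```r
# PlantGrowth is a built-in R data set, used here as a stand-in
data(PlantGrowth)
x <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]

# Draw the histogram and keep the details R used to build it
h <- hist(x, main = "Control group", xlab = "Weight")

# Width of one bar (R's default bars are all the same width)
bin_width <- diff(h$breaks)[1]

# x values spanning the histogram, and matching heights on a normal curve
xfit <- seq(min(h$breaks), max(h$breaks), length.out = 100)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * length(x) * bin_width

# Add the curve on top of the histogram
lines(xfit, yfit)
```

Scaling by the number of observations times the bar width puts the curve on the same frequency axis as the bars, so the two can be compared by eye.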


The last thing we need to do is add the all important labels to the graph.
par(mfrow=c(1,1))
hist(Plants$Plant.1[Plants$Fertiliser=="YES"], main="Plant 1 with Fertiliser",
     xlab="Growth (cm)", ylab="Frequency")

Then add the line again if we want to. Remember that we have already made the sequences
xfit and yfit. They should be in your global environment.
lines(xfit,yfit) #plot the line

Histograms are great for understanding the distribution of data points, but sometimes you
might want to compare between treatments. Box plots are a great way of doing this and for
comparing the amount of variability between treatments.
Let’s compare the data between our plant groups.
boxplot(Plants$Plant.1, Plants$Plant.2)

What is this graph showing us?

Again we should add labels to the graph


boxplot(Plants$Plant.1, Plants$Plant.2, main="Plant growth",
        xlab="Plant groups", ylab="Growth (cm)")


We can also add colour!!


boxplot(Plants$Plant.1, Plants$Plant.2, main="Plant growth",
        xlab="Plant groups", ylab="Growth (cm)", col=rainbow(2))
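
The box plots above compare two columns. If you want to compare treatments within one column instead, boxplot() also accepts a formula of the form values ~ factor. The sketch below shows this with the built-in PlantGrowth data set (a stand-in, not the Dummy Data); the object name bp is made up for this example.

```r
# PlantGrowth is a built-in R data set, used here as a stand-in
data(PlantGrowth)

# weight ~ group means "plot weight, split by the levels of group"
bp <- boxplot(weight ~ group, data = PlantGrowth,
              main = "Plant growth by treatment",
              xlab = "Treatment group", ylab = "Weight",
              col = rainbow(3))

# boxplot() also returns the numbers it plotted: one column per group,
# rows running from the lower whisker up to the upper whisker
bp$names
bp$stats
```

With the formula version, R labels each box with the name of its group automatically.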

8. Vectors, matrices and data frames
So far we have been working with a data frame that you imported into RStudio, and this is
most likely the way that you will work with most of your data. But it’s also important to
understand how to create matrices and data frames in RStudio, should you need to.
We can create variables, which are either a single value, a string of numbers (vector) or a
matrix (columns and rows). Note that all the values in a matrix must be of one type (e.g. all
numeric), but you can also get data that contains categories (e.g. male, female). When you get
data with columns and rows that has these types of data as well as numeric data, we call it a
data frame. You create objects using the <- assignment operator.
VECTORS
1. Create a variable named “a” with the value of 1
a <- 1

#to recall it you can type


a
## [1] 1

2. Create a numeric vector named “b” with elements equal to 1, 2 and 3. There are at
least 3 ways to do this in R.
b <-c(1,2,3)
b
## [1] 1 2 3
assign("b",c(1,2,3))
b
## [1] 1 2 3
b <- seq(1,3)
b
## [1] 1 2 3

3. Create the following vectors: (a) (1 3 5); (b) (1 2 3 0 1 2 3); (c) (1 1 1 1); and (d) (1 2
3 1 2 3 1 2 3)
seq(1,6,2)
## [1] 1 3 5
c(b,0,b)
## [1] 1 2 3 0 1 2 3
rep(1,4)
## [1] 1 1 1 1
rep(b,3)
## [1] 1 2 3 1 2 3 1 2 3

4. Create character vectors containing: (a) the names of at least 5 students in this class;
(b) the values X1, X2, X3 and X4, and call it labels
student.names <- c("Hunter","Eric","Sara","Arvind","Abigail")
student.names
## [1] "Hunter" "Eric" "Sara" "Arvind" "Abigail"
labels <- paste("X",1:4, sep="")
labels
## [1] "X1" "X2" "X3" "X4"
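
The functions used above have a few extra arguments that are worth knowing about. The sketch below shows three of them; the object names evens, pairs and ids are made up for this example.

```r
# seq() with by= sets the step size between numbers
evens <- seq(2, 10, by = 2)

# rep() with each= repeats every element in turn,
# rather than repeating the whole vector
pairs <- rep(c("A", "B"), each = 2)

# paste0() is a shortcut for paste(..., sep = "")
ids <- paste0("S", 1:3)

evens
pairs
ids
```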

MATRICES
5. Create the following matrices: (a) a 3 x 3 matrix with numbers 1 to 9 with
numbers increasing from left to right, (b) a 3 x 3 matrix with numbers 1 to 9 with
numbers increasing from top to bottom
matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, byrow=FALSE)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

6. Create the following matrices: (a) a 3 x 3 identity matrix with 1's in the diagonal and
0's in all of the off-diagonal elements; (b) a 3 x 3 matrix with the values 1, 2 and 3
along the diagonal and 0's in the off-diagonals; (c) an empty matrix with 2 rows and
3 columns.
I <- diag(1,3)
I
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
diag(b)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3
A <- matrix(nrow=2,ncol=3)
A
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA

7. Create a 2 x 3 matrix (2 rows and 3 columns) containing the numbers 1, 3, 5, 7, 9, 11,
with numbers increasing from left to right.
matrix(seq(1,12,2),nrow=2, byrow=TRUE)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    7    9   11

8. Adding columns and rows can easily be done using the bind functions (cbind and rbind).
First we have to give the matrix a name, then we can bind another column to it
data<-matrix(seq(1,12,2),nrow=2, byrow=TRUE)
data2<-cbind(data,c(13,15))
##cbind binds a new column onto the matrix named "data" to create a new
##matrix named "data2"
data2
##the column that we added was a concatenated vector of the numbers 13 and 15

To add an extra row, we use rbind instead! If you ever come across a function in a piece of
code and you’re not sure what it does, then you can always use the help function by typing
a question mark in front of the function name. The help information will pop up in the Help
tab of the files panel.
?cbind
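
As a sketch of the row version, the example below builds the same matrix as in step 8, binds a third row to it with rbind, and then pulls a single value back out using square brackets. The object names m and m2 and the numbers in the new row are made up for this example.

```r
# The same 2 x 3 matrix as in step 8
m <- matrix(seq(1, 12, 2), nrow = 2, byrow = TRUE)

# rbind adds a row, so the new row needs one value per column (3 here)
m2 <- rbind(m, c(13, 15, 17))

# The matrix now has 3 rows and 3 columns
dim(m2)

# Square brackets extract a value: [row, column]
m2[3, 2]
```

Remember the order inside the square brackets: the row number always comes first and the column number second.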

Exercise 2
• Create a vector of 10 numbers from 10 to 100, with numbers in multiples of 10. Call this
vector X
• Bind a column to X containing the numbers 1 to 5 repeated twice (10 numbers in total). Call
this matrix Y
• Add 5 to every number in Y, call this new matrix Z
• Extract the number in the fifth row, second column of Z.
The value should be 10.
