Week2 Slides
Yuting Huang
AY24/25
1 / 76
Basics in R Programming
Week 1:
1. R objects
2. R syntax
Week 2:
3. R functions
4. Base R plotting
5. Generate reports with R Markdown
2 / 76
Conditional executions
x <- 10
if (x > 5) {
  print(paste(x, "is larger than 5."))
}
3 / 76
Single-line if statement
▶ A single statement can follow if on the same line without braces.
▶ When the body has more than one line, we must enclose it in curly
braces.
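A minimal sketch of the two forms, using made-up values (the variable name excess is only for illustration):

```r
x <- 10

# One statement: braces are optional.
if (x > 5) print("x is larger than 5.")

# Several statements: braces group them so that both run conditionally.
if (x > 5) {
  excess <- x - 5
  print(paste("x exceeds 5 by", excess))
}
```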
4 / 76
else if for additional conditions
x <- 2
if (x > 5) {
  print(paste(x, "is larger than 5."))
} else if (x < 5) {
  print(paste(x, "is smaller than 5."))
}
5 / 76
else for all other cases
x <- 5
if (x > 5) {
  print(paste(x, "is larger than 5."))
} else if (x < 5) {
  print(paste(x, "is smaller than 5."))
} else {
  print(paste(x, "is equal to 5."))
}
6 / 76
Vectorized ifelse()
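if checks a single condition, whereas ifelse(test, yes, no) checks every element of a vector at once. A small sketch with an illustrative vector (the values and labels are assumptions, not taken from the slide):

```r
x <- c(2, 5, 8)

# For each element: return the first label where the test is TRUE,
# and the second label otherwise.
size <- ifelse(x > 5, "larger than 5", "not larger than 5")
size
```

The result has the same length as x, so it can be stored directly as a new column or vector.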
7 / 76
for loops
▶ For each iteration, it runs the statement for the provided value
(year).
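The loop that this bullet describes is not shown here. A minimal sketch, assuming a hypothetical vector of years:

```r
years <- 2021:2023   # hypothetical values, for illustration only

# The body runs once per element; year takes each value in turn.
for (year in years) {
  print(paste("The year is", year))
}
```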
8 / 76
A for loop with conditional logic
9 / 76
while loops
A while loop repeats a task until a specified condition becomes
FALSE.
number <- 1
while (number <= 5) {
  print(number)
  number <- number + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
10 / 76
Using break in a while loop
The break statement exits a loop immediately, even if the loop
condition is still TRUE.
▶ The loop below stops once number reaches 3, even though the loop
condition number <= 5 still holds.
number <- 1
while (number <= 5) {
  print(number)
  number <- number + 1
  if (number == 3) break
}
## [1] 1
## [1] 2
11 / 76
Use loops with caution
▶ for: best when the number of iterations is known.
▶ while: best when the stopping condition is flexible. It can result
in an infinite loop if not written properly, so it requires more
careful condition checks.
Loops are powerful, but if not written properly, they can get very
inefficient when applied to large data sets.
▶ Should be avoided if better alternatives exist.
▶ As you learn more, you will realize that vectorization is
preferred over loops since it is more efficient and produces
cleaner code.
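To illustrate the point about vectorization, here is a small sketch that squares the same numbers with a loop and with a single vectorized expression (the vector is made up for illustration):

```r
x <- 1:5

# Loop version: pre-allocate the result, then fill it element by element.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operator is applied to every element at once.
squares_vec <- x^2

identical(squares_loop, squares_vec)
```

Both versions give the same result; the vectorized one is shorter and leaves the iteration to R's internal code.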
12 / 76
R functions
1. Built-in functions.
2. Writing your own functions.
3. The family of apply functions.
13 / 76
R Functions
14 / 76
R Functions
args(sd)
15 / 76
Built-in functions
In the dice game, we simulate a roll of the die with the sample()
function.
sample(1:6, size = 1)
## [1] 5
▶ To roll a die twice, we can set size = 2 and use the argument
replace = TRUE.
sample(1:6, size = 2, replace = TRUE)
## [1] 4 2
16 / 76
Writing R functions
17 / 76
A simple example
18 / 76
Default value of an argument
In the previous example, price and tax do not have default values.
▶ We will get an error if we don’t supply values for them, as R
would not know what value to use.
▶ If we declare the argument and its default value, when we call the
function and leave the argument(s) empty, R will automatically
use the default value.
calculate_total(price = 100)
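The definition of calculate_total() is not shown here. The sketch below is a hypothetical reconstruction: the function body and the 9% default tax rate are assumptions made only to illustrate how a default value works.

```r
# Hypothetical version of calculate_total(); the body and the default
# tax rate of 0.09 are assumptions, not the original definition.
calculate_total <- function(price, tax = 0.09) {
  price * (1 + tax)
}

calculate_total(price = 100)               # tax falls back to the default
calculate_total(price = 100, tax = 0.07)   # the default is overridden
```

Arguments without a default (here, price) must still be supplied on every call.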
19 / 76
Example: A single game of dice
Let us suppose that we wish to write a function to simulate one game
of dice between players A and B.
▶ Code (from last lecture) that simulates one single game:
set.seed(2101)
A <- sample(1:6, size = 1)
B <- sample(1:6, size = 1)
if (A > B) {
  results <- "A is the winner"
} else if (A == B) {
  results <- "It is a draw"
} else {
  results <- "B is the winner"
}
results
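▶ The same logic wrapped in a function, with shorter outcome labels:
single_game <- function() {
  A <- sample(1:6, size = 1)
  B <- sample(1:6, size = 1)
  if (A > B) {
    results <- "A"
  } else if (A == B) {
    results <- "Draw"
  } else {
    results <- "B"
  }
  return(results)
}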
21 / 76
Now we can run the game by calling the function.
set.seed(2101)
single_game()
## [1] "B"
22 / 76
Clearer code for 1000 iterations
set.seed(2101)
results <- character(1000)  # pre-allocate a character vector
for (i in 1:1000) {
  results[i] <- single_game()
}
table(results)
## results
## A B Draw
## 419 431 150
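As a possible alternative to the explicit loop, replicate() calls an expression a given number of times and collects the results. The sketch below redefines single_game() so that it runs on its own. The definition is an assumption based on how the function is described in these slides.

```r
# Assumed definition of single_game(), restated so the block is self-contained.
single_game <- function() {
  A <- sample(1:6, size = 1)
  B <- sample(1:6, size = 1)
  if (A > B) {
    "A"
  } else if (A == B) {
    "Draw"
  } else {
    "B"
  }
}

set.seed(2101)
# Run the game 1000 times and collect the outcomes in a character vector.
results <- replicate(1000, single_game())
table(results)
```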
23 / 76
The family of apply functions
24 / 76
apply()
25 / 76
apply()
Basic syntax: apply(X, MARGIN, FUN)
▶ The input argument X can be a matrix or a data frame.
▶ MARGIN = 1 indicates rows; MARGIN = 2 indicates columns.
▶ FUN is the function to be applied.
▶ Let’s first compute column means for the matrix.
## [1] 5 6
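The matrix used for this output is not shown. As a sketch, the hypothetical 2 x 2 matrix below was chosen so that its column means match the output above:

```r
# Hypothetical matrix; matrix() fills column by column,
# so the columns are (4, 6) and (5, 7).
m <- matrix(c(4, 6, 5, 7), nrow = 2)

apply(m, 2, mean)   # MARGIN = 2: one mean per column
apply(m, 1, mean)   # MARGIN = 1: one mean per row
```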
26 / 76
Example: USPersonalExpenditure data
Let’s work with a base R data set USPersonalExpenditure, which is
a matrix.
▶ US Personal Expenditures (in billions of dollars) for several
consumption categories for the years 1940, 1945, 1950, 1955, and
1960.
data(USPersonalExpenditure)
USPersonalExpenditure
27 / 76
1. Calculate the sum of personal expenditure across categories for
each year.
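One way to approach this task, given that the years are the columns of the matrix and the categories are the rows (the name yearly_total is only illustrative):

```r
data(USPersonalExpenditure)

# MARGIN = 2 applies sum() to each column, i.e. to each year.
yearly_total <- apply(USPersonalExpenditure, 2, sum)
yearly_total
```

For simple sums, the built-in colSums() gives the same values and is usually faster.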
28 / 76
sapply() and lapply()
Let us create a list and then apply the two functions to compare the
outputs.
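The list itself is not shown here. A minimal sketch, assuming a hypothetical list y whose elements were chosen so that their means match the outputs that follow:

```r
# Hypothetical list: a numeric element A and a logical element B.
# mean() treats TRUE as 1 and FALSE as 0.
y <- list(A = 1:5, B = c(TRUE, FALSE, TRUE))
str(y)
```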
29 / 76
▶ lapply() returns a list.
lapply(y, mean)
## $A
## [1] 3
##
## $B
## [1] 0.6666667
class(lapply(y, mean))
## [1] "list"
30 / 76
▶ sapply() simplifies the result where possible. Here it returns a
named numeric vector with the output of the function for each element.
sapply(y, mean)
## A B
## 3.0000000 0.6666667
class(sapply(y, mean))
## [1] "numeric"
31 / 76
aggregate()
This function works similarly to apply() – it computes group
statistics in the data.
▶ Let’s consider another base R data set, airquality, which
contains daily air quality measurements in New York.
head(airquality, 3)
▶ To compute the mean temperature for each month:
aggregate(Temp ~ Month, data = airquality, FUN = mean)
## Month Temp
## 1 5 65.54839
## 2 6 79.10000
## 3 7 83.90323
## 4 8 83.96774
## 5 9 76.90000
32 / 76
The family of apply functions
33 / 76
Manipulation of important class types
1. Strings
2. Factors
3. Date
34 / 76
Strings
# install.packages("stringr")
library(stringr)
36 / 76
Subsetting strings
str_sub(fruits, 1, 3)
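The fruits vector is not defined in the text shown here. The sketch below assumes a hypothetical vector to show how str_sub(string, start, end) extracts characters by position:

```r
library(stringr)

fruits <- c("apple", "banana", "pear", "kiwi")   # hypothetical vector

str_sub(fruits, 1, 3)     # first three characters of each string
str_sub(fruits, -3, -1)   # negative positions count from the end
```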
37 / 76
Parsing strings
Regular expressions are a language for expressing patterns in
strings.
▶ They are used in many programming languages including R.
38 / 76
Matching patterns
To detect presence/absence of a pattern, we use str_detect().
▶ Basic syntax: str_detect(string, pattern).
▶ To detect the pattern ana:
str_detect(fruits, "ana")
39 / 76
Matching patterns
str_detect(fruits, "[abc]")
str_detect(fruits, "[p-t]")
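To see what these patterns match, here is a sketch that assumes a hypothetical fruits vector. Square brackets match any one of the listed characters, and a hyphen inside them denotes a range.

```r
library(stringr)

fruits <- c("apple", "banana", "pear", "kiwi")   # hypothetical vector

str_detect(fruits, "ana")     # contains the literal sequence "ana"
str_detect(fruits, "[abc]")   # contains at least one of a, b, or c
str_detect(fruits, "[p-t]")   # contains at least one letter from p to t
```

Each call returns a logical vector with one value per string, so it can be used directly for subsetting, e.g. fruits[str_detect(fruits, "ana")].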
40 / 76
▶ To replace the first match of a pattern in each string, we use
str_replace(). To replace every match, we use str_replace_all().
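A short sketch of the difference between replacing the first match and replacing all matches, again using a hypothetical fruits vector:

```r
library(stringr)

fruits <- c("apple", "banana", "pear", "kiwi")   # hypothetical vector

first_only <- str_replace(fruits, "a", "A")      # first "a" in each string
every_one  <- str_replace_all(fruits, "a", "A")  # every "a" in each string

first_only
every_one
```

Strings without a match (here "kiwi") are returned unchanged.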
41 / 76
Example
▶ Here is a simple vector uni with names of universities:
str_detect(uni, "Singapore")
42 / 76
More on regular expressions
43 / 76
Factors
44 / 76
Creating factors
Suppose we have a vector containing classification of university
students:
x1 <- as.factor(x1)
x1
The levels are the possible values that the variable could take on.
▶ By default, they are arranged in alphabetical order.
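The original x1 vector is not shown. The sketch below assumes hypothetical student classifications and shows both the default alphabetical levels and how to set a more meaningful order with the levels argument of factor():

```r
# Hypothetical classifications, for illustration only.
x1 <- c("Junior", "Freshman", "Senior", "Sophomore", "Freshman")

f_default <- as.factor(x1)
levels(f_default)   # sorted alphabetically

# Specify the order of the levels explicitly.
f_ordered <- factor(x1,
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"))
levels(f_ordered)
table(f_ordered)    # counts are reported in the specified level order
```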
45 / 76
Levels of a factor
46 / 76
Dates
R contains a Date class type to work with dates.
▶ Dates are stored internally as the number of days since 1 January
1970.
▶ This allows R to compute differences between dates, generate
sequences of dates, and divide dates into convenient periods.
## [1] "Date"
d1
## [1] "2025-01-17"
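The code that produced this output is not shown. A minimal sketch, assuming d1 was created with as.Date() from a string in year-month-day format (d2 is an extra date added for illustration):

```r
d1 <- as.Date("2025-01-17")
class(d1)

# The internal representation counts days from 1 January 1970.
as.numeric(as.Date("1970-01-01"))
as.numeric(d1)

# Because dates are numbers underneath, arithmetic works directly.
d2 <- as.Date("2025-01-19")
d2 - d1                               # difference in days
seq(d1, by = "day", length.out = 3)   # a sequence of consecutive dates
```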
47 / 76
Here are some convenient functions for extracting information that we
typically need from Date objects.
## [1] "2025-01-19"
## [1] "Sunday"
## [1] "Jan"
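The calls behind these outputs are not shown, so the following is only a guess at functions that would produce similar results. Note that weekday and month names depend on the locale of the R session.

```r
d1 <- as.Date("2025-01-17")

d2 <- d1 + 2                    # adding a number shifts the date by days
d2
weekdays(d2)                    # name of the day of the week
months(d2, abbreviate = TRUE)   # abbreviated month name
format(d2, "%d %B %Y")          # custom formatting with format()
```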
48 / 76
Base R plotting
1. Scatter plot
2. Bar plot
49 / 76
Scatter plot
By default, the axes, scales, titles, and plotting symbols are all
chosen automatically, but they can be overridden manually.
50 / 76
cars data set
data(cars)
head(cars, n = 3)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
51 / 76
Scatterplot
plot(cars$speed, cars$dist)
[Figure: scatter plot of cars$dist (y-axis) against cars$speed (x-axis)]
52 / 76
Adding axis label and title
We can make the graph more informative.
plot(cars$speed, cars$dist,
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")
[Figure: scatter plot of stopping distance against speed; x-axis: Speed (mph)]
53 / 76
Altering plotting symbols
In R, the plotting symbols are referred to as the plotting character
(pch for short).
▶ Change the symbol by specifying the pch argument to plot().
▶ The full list of symbols is displayed below.
▶ The default is pch = 1.
54 / 76
Altering plotting symbols
For instance, we can use unfilled triangles instead of unfilled circles.
[Figure: scatter plot with triangular plotting symbols; x-axis: Speed (mph)]
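The call for this plot is not shown. A plausible sketch, assuming the same labels as the earlier plot and pch = 2, which draws unfilled triangles in base R:

```r
data(cars)

# pch = 2 replaces the default unfilled circles (pch = 1)
# with unfilled triangles.
plot(cars$speed, cars$dist, pch = 2,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Relationship between Speed and Braking")
```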
55 / 76
Altering plotting symbols
To change the size of the plotting character, we can modify the cex
argument.
▶ cex stands for “character expansion”, with a default value of 1.
▶ Larger values will make the symbol larger.
▶ This is an important abbreviation, because you will see a similar
argument in a lot of help pages referring to other parameters:
▶ cex.axis affects the font size of the axis
▶ cex.main affects the font size of the title, and so on.
56 / 76
Colors
The color of the plotting characters can be changed using the col
argument of the plot() function.
▶ Common colors can be accessed by name, e.g., "blue", "green",
"red".
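A small sketch that combines col with the cex arguments discussed earlier. The specific color and sizes are arbitrary choices for illustration.

```r
data(cars)

plot(cars$speed, cars$dist,
     col = "blue",     # color of the plotting symbols
     pch = 16,         # filled circles
     cex = 1.5,        # symbols 1.5 times the default size
     cex.axis = 0.8,   # smaller axis tick labels
     cex.main = 1.2,   # larger title
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Relationship between Speed and Braking")

# colors() lists all the color names that R recognizes.
head(colors())
```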
57 / 76
Colors
[Figure: scatter plot; x-axis: Speed (mph)]
58 / 76
Adding a trend line
plot(cars$speed, cars$dist, col = "red", pch = 16, cex = 1,
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")
abline(reg = lm(dist ~ speed, data = cars),
col = "gray60", lty = "dashed")
[Figure: scatter plot with a dashed trend line; x-axis: Speed (mph)]
59 / 76
Bar plot
## category amount
## 1 Asset 38.0
## 2 Manpower 519.4
## 3 Other 141.4
60 / 76
Bar plot
61 / 76
Summary: Base R plotting
62 / 76
Generate reports with R Markdown
Literate programming with R Markdown
63 / 76
What is (R)Markdown?
64 / 76
Our workflow may look something like this:
65 / 76
A more efficient and robust workflow would look like this:
66 / 76
What is R Markdown?
67 / 76
What is R Markdown?
68 / 76
R Markdown pre-requisites
69 / 76
To create a new R Markdown file, click on File -> New file -> R
Markdown...
▶ At this step, RStudio may prompt you to update/install other
packages. Click Yes and wait for the installation to complete.
▶ Select HTML as the default output format.
▶ File -> Save and save the file in your src folder.
70 / 76
R Markdown basics
---
title: "Untitled"
date: "2023-11-23"
output: html_document
---
71 / 76
The rest of the Rmd file will consist of code chunks (R code) and text.
▶ Code chunks are delimited by lines of three backticks, with {r}
after the opening backticks.
▶ Text will be simple text, formatted with #, *, etc.
72 / 76
Global chunk options
We can specify local chunk options that modify the chunk output
for this specific code chunk.
Refer to the R Markdown Reference Guide for a complete set of (global) chunk options.
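As an illustration of global options, the call below would typically go in a setup chunk at the top of the Rmd file. It assumes the knitr package is installed, and the particular options chosen are only examples.

```r
library(knitr)

# Defaults applied to every chunk in the document:
# show the code, but hide warnings and messages in the output.
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

opts_chunk$get("echo")   # inspect the current global value
```

An option set in an individual chunk header, such as echo = FALSE, overrides the global default for that chunk only.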
74 / 76
Using R Markdown
75 / 76
Useful resources:
76 / 76