Week2 Slides

The document covers the basics of R programming, focusing on conditional executions, loops, and functions. It explains how to use if statements, for and while loops, and the apply family of functions for efficient data manipulation. Additionally, it provides examples of writing custom functions and using built-in functions in R.

Uploaded by

Tùng Nguyễn

DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 2: Basics in R Programming II

1 / 76
Basics in R Programming

Week 1:

1. R objects
2. R syntax

Week 2:

3. R functions
4. Base R plotting
5. Generate reports with R Markdown

2 / 76
Conditional executions

An if statement tells R to execute certain tasks if a condition is TRUE.

x <- 10
if (x > 5) {
print(paste(x, "is larger than 5."))
}

## [1] "10 is larger than 5."

▶ If x > 5 is TRUE, R will print the statement.

3 / 76
Single-line if statement

If there’s only one statement to perform, we can simplify it into a single line.
▶ A concise way to write the same logic.

if (x > 5) print(paste(x, "is larger than 5."))

## [1] "10 is larger than 5."

▶ For bodies with more than one statement, we must enclose the
statements in curly braces.

4 / 76
else if for additional conditions

We can use else if to test multiple conditions.

x <- 2
if (x > 5) {
print(paste(x, "is larger than 5."))
} else if (x < 5) {
print(paste(x, "is smaller than 5."))
}

## [1] "2 is smaller than 5."

5 / 76
else for all other cases

else catches anything that’s not caught by the preceding conditions.

x <- 5
if (x > 5) {
print(paste(x, "is larger than 5."))
} else if (x < 5) {
print(paste(x, "is smaller than 5."))
} else {
print(paste(x, "is equal to 5."))
}

## [1] "5 is equal to 5."

6 / 76
Vectorized ifelse()

R has a built-in ifelse() function.


▶ It is a compact and efficient way to apply conditional logic to
vectors.
▶ Basic syntax:
ifelse(test, action_if_TRUE, action_if_FALSE)

yes <- paste(x, "is larger than 11.")
no <- paste(x, "is not larger than 11.")
ifelse(x > 11, yes, no)

## [1] "5 is not larger than 11."
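
Because ifelse() is vectorized, the same call can test every element of a vector at once. A minimal sketch, assuming a made-up vector `scores` and an arbitrary cutoff of 50 (neither comes from the slides):

```r
# Hypothetical exam scores; the cutoff of 50 is an assumption for illustration
scores <- c(35, 50, 72, 48, 90)

# ifelse() evaluates the test for each element and picks the matching label
grade <- ifelse(scores >= 50, "pass", "fail")
grade
# "fail" "pass" "pass" "fail" "pass"
```

The result has the same length as the test, which is what makes ifelse() convenient for creating new columns in a data frame.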

7 / 76
for loops

A for loop repeats code for each element in its input.


▶ Particularly useful when the set of values to iterate over is known
in advance.

for (year in 2023:2025) {
  print(paste("The year is", year))
}

## [1] "The year is 2023"
## [1] "The year is 2024"
## [1] "The year is 2025"

▶ For each iteration, it runs the statement for the provided value
(year).

8 / 76
A for loop with conditional logic

for (year in 2023:2025) {
  message <- ifelse(year == 2025,
                    paste(year, "- the current year!"),
                    paste(year, "- not the current year."))
  print(message)
}

## [1] "2023 - not the current year."
## [1] "2024 - not the current year."
## [1] "2025 - the current year!"

9 / 76
while loops
A while loop repeats a task until a specified condition
becomes FALSE.

1. Test the condition.


2. If FALSE, exit the loop.
3. If TRUE, execute the task and return to step 1.

number <- 1
while (number <= 5) {
print(number)
number <- number + 1
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

10 / 76
Using break in a while loop
The break statement exits a loop immediately, even if the test condition
is still TRUE.
▶ The loop below stops once number == 3, even though the loop
condition number <= 5 still holds.

number <- 1
while (number <= 5) {
print(number)
number <- number + 1

if (number == 3) break
}

## [1] 1
## [1] 2

11 / 76
Use loops with caution

▶ for: Best when the number of iterations is known.
▶ while: Best when the stopping condition is flexible. Can result in
infinite loops if not written properly, so it requires more careful
condition checks.

Loops are powerful, but if not written properly, they can become very
inefficient when applied to large data sets.
▶ They should be avoided if better alternatives exist.
▶ As you learn more, you will find that vectorization is often
preferred over loops, since vectorized code is usually more efficient
and cleaner.
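
To make the comparison concrete, here is a small sketch that computes the same squares with a loop and with a vectorized expression. The vector `x` is made up for illustration:

```r
x <- 1:10

# Loop version: pre-allocate the result, then fill it one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the ^ operator works on the whole vector at once
squares_vec <- x^2

# Both approaches give the same answer
identical(squares_loop, squares_vec)
```

The vectorized line needs no index variable and no pre-allocation, which is why it tends to be easier to read.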

12 / 76
R functions

1. Built-in functions.
2. Writing your own functions.
3. The family of apply functions.

13 / 76
R Functions

We have already encountered a few functions in R.


▶ As you have noticed, they can take arguments that modify their
behavior.
▶ Not all arguments need to be specified.

14 / 76
R Functions

▶ The args() command can be used to list the arguments of a function.

args(sd)

## function (x, na.rm = FALSE)


## NULL

▶ The first argument, x, does not have a default value, so it must be
supplied.
▶ na.rm has a default value, which is FALSE.
▶ It specifies whether missing values (NAs) should be ignored in the
computation.
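
A short sketch of what na.rm changes in practice, using a made-up vector `temps` that contains one missing value:

```r
# Hypothetical measurements with one missing value
temps <- c(21.5, 23.0, NA, 22.5)

# With the default na.rm = FALSE, the missing value propagates to the result
sd(temps)

# With na.rm = TRUE, the NA is dropped before the computation
sd(temps, na.rm = TRUE)
```

The first call returns NA, while the second returns the standard deviation of the three observed values.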

15 / 76
Built-in functions

In the dice game, we simulate a roll of the die with the sample()
function.

sample(1:6, size = 1)

## [1] 5

▶ To roll a die twice, we can set size = 2 and use the argument
replace = TRUE.

sample(1:6, size = 2, replace = TRUE)

## [1] 4 2

16 / 76
Writing R functions

At some point, we have to write a function of our own.
We will need to decide:

1. What arguments it should take.
2. Whether these arguments should have default values, and if so,
what those default values should be.
3. What output it should return.

The typical approach is to write a sequence of expressions that work,
then package them into a function.

17 / 76
A simple example

calculate_total <- function(price, tax) {


total_price <- price * (1 + tax)
message <- paste("The total is", total_price)
return(message)
}

calculate_total(price = 100, tax = 0.09)

## [1] "The total is 109"

▶ The function name (calculate_total) and its arguments
(price, tax) are defined.
▶ The arguments do not have default values.

18 / 76
Default value of an argument
In the previous example, price and tax do not have default values.
▶ We will get an error if we don’t supply values for them, as R
would not know what value to use.
▶ If we declare the argument and its default value, when we call the
function and leave the argument(s) empty, R will automatically
use the default value.

calculate_total <- function(price, tax = 0.09) {


total_price <- price * (1 + tax)
message <- paste("The total is", total_price)
return(message)
}

calculate_total(price = 100)

## [1] "The total is 109"
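
A default value can still be overridden when the function is called. A minimal sketch that repeats the function definition so it runs on its own; the tax rate of 0.25 is an arbitrary value chosen for illustration:

```r
calculate_total <- function(price, tax = 0.09) {
  total_price <- price * (1 + tax)
  message <- paste("The total is", total_price)
  return(message)
}

# Override the default by naming the argument
calculate_total(price = 100, tax = 0.25)
# "The total is 125"

# Arguments can also be matched by position: price first, then tax
calculate_total(100, 0.25)
# "The total is 125"
```

Naming the arguments is usually safer than relying on position, because the call keeps working if the order of the arguments changes.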

19 / 76
Example: A single game of dice
Let us suppose that we wish to write a function to simulate one game
of dice between players A and B.
▶ Code (from last lecture) that simulates one single game:

set.seed(2101)
A <- sample(1:6, size = 1)
B <- sample(1:6, size = 1)

if (A > B) {
results <- "A is the winner"
} else if (A == B) {
results <- "It is a draw"
} else {
results <- "B is the winner"
}

results

## [1] "B is the winner"


20 / 76
Construct a function single_game() that simulates the game:

single_game <- function() { # the function takes zero arguments

  A <- sample(1:6, size = 1)
  B <- sample(1:6, size = 1)

  if (A > B) {
    results <- "A"
  } else if (A == B) {
    results <- "Draw"
  } else {
    results <- "B"
  }

  return(results) # return the vector results
}

21 / 76
Now we can run the game by calling the function.

set.seed(2101)
single_game()

## [1] "B"

22 / 76
Clearer code for 1000 iterations

set.seed(2101)
results <- rep(0, 1000)

for(i in 1:1000) {
results[i] <- single_game()
}

table(results)

## results
## A B Draw
## 419 431 150
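
As an alternative to the explicit loop, base R's replicate() can call the function repeatedly and collect the results. This is a sketch rather than the approach used in the slides; single_game() is repeated here in compact form so the block runs on its own:

```r
# Compact version of single_game(), repeated so the example is self-contained
single_game <- function() {
  A <- sample(1:6, size = 1)
  B <- sample(1:6, size = 1)
  if (A > B) "A" else if (A == B) "Draw" else "B"
}

set.seed(2101)

# replicate() evaluates the expression 1000 times and returns a character vector
results <- replicate(1000, single_game())
table(results)
```

There is no need to pre-allocate a results vector or manage an index variable.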

23 / 76
The family of apply functions

We sometimes find ourselves having to repeat the same operation


multiple times.
▶ The apply family of functions in base R allows us to perform tasks
in a repetitive way without explicit use of loop constructs.
▶ apply(): apply a function across rows or columns of a matrix or
a data frame.
▶ sapply(): apply a function across elements of a list, and return a
vector or a matrix.
▶ lapply(): apply a function across elements of a list, and return a
list.

Another related function: aggregate(), for computation of group


statistics.

24 / 76
apply()

If we have a matrix and want to apply a function to each row or


column separately, then the apply() function is what we need.
▶ Let us first generate a 2-by-5 matrix with values 1 to 10 in the
entries.

my_mat <- matrix(1:10, nrow = 2, byrow = FALSE)


my_mat

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10

25 / 76
apply()
Basic syntax: apply(X, MARGIN, FUN)
▶ The input argument X can be a matrix or a data frame.
▶ MARGIN = 1 indicates rows; MARGIN = 2 indicates columns.
▶ FUN is the function to be applied.
▶ Let’s first compute column means for the matrix.

apply(my_mat, MARGIN = 2, mean)

## [1] 1.5 3.5 5.5 7.5 9.5

▶ Next, we compute row means.

apply(my_mat, MARGIN = 1, mean)

## [1] 5 6
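
For the common case of means, base R also provides the dedicated functions colMeans() and rowMeans(). They are not covered in the slides, but the sketch below shows that they agree with the apply() calls above:

```r
my_mat <- matrix(1:10, nrow = 2, byrow = FALSE)

# Dedicated helpers for column and row means
colMeans(my_mat)
rowMeans(my_mat)

# They match the results from apply()
all.equal(colMeans(my_mat), apply(my_mat, MARGIN = 2, mean))
all.equal(rowMeans(my_mat), apply(my_mat, MARGIN = 1, mean))
```

apply() remains the more general tool, since FUN can be any function, including one we write ourselves.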
26 / 76
Example: USPersonalExpenditure data
Let’s work with a base R data set USPersonalExpenditure, which is
a matrix.
▶ US Personal Expenditures (in billions of dollars) for several
consumption categories for the years 1940, 1945, 1950, 1955, and
1960.

data(USPersonalExpenditure)
USPersonalExpenditure

## 1940 1945 1950 1955 1960


## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64

27 / 76
1. Calculate the sum of personal expenditure across categories for
each year.

apply(USPersonalExpenditure, MARGIN = 2, sum)

## 1940 1945 1950 1955 1960


## 37.611 68.714 102.560 129.700 163.140

2. Find the largest expenditure among the categories for each year.

apply(USPersonalExpenditure, MARGIN = 2, max)

## 1940 1945 1950 1955 1960


## 22.2 44.5 59.6 73.2 86.8
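
Note that max() returns the largest value, not the name of the category it belongs to. One possible way to recover the category name is to use which.max() to find the row position and then look it up in the row names; the object names `top_row` and `top_category` are made up for this sketch:

```r
data(USPersonalExpenditure)

# Row position of the largest expenditure in each column (year)
top_row <- apply(USPersonalExpenditure, MARGIN = 2, which.max)

# Translate the positions into category names, labelled by year
top_category <- rownames(USPersonalExpenditure)[top_row]
names(top_category) <- colnames(USPersonalExpenditure)
top_category
```

For this data set, every year maps to "Food and Tobacco", which matches the maxima shown above.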

28 / 76
sapply() and lapply()

If we wish to iterate over a list, we can use sapply() or lapply().


▶ lapply() takes in a list and returns a list.
▶ sapply() takes in a list, simplifies the outputs, and returns a
vector.

Let us create a list and then apply the two functions to compare the
outputs.

y <- list(A = 1:5,


B = c(TRUE, TRUE, FALSE))

29 / 76
▶ lapply() returns a list.

lapply(y, mean)

## $A
## [1] 3
##
## $B
## [1] 0.6666667

class(lapply(y, mean))

## [1] "list"

30 / 76
▶ sapply() returns a numeric vector that contains the output of a
function.

sapply(y, mean)

## A B
## 3.0000000 0.6666667

class(sapply(y, mean))

## [1] "numeric"
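
How sapply() simplifies depends on what the function returns for each element. A short sketch with the same list: length() returns one value per element, so the result is a vector, while range() returns two values per element, so the result is a matrix. The names `lens` and `rng` are made up for illustration:

```r
y <- list(A = 1:5,
          B = c(TRUE, TRUE, FALSE))

# One value per list element: simplified to a named vector
lens <- sapply(y, length)
lens

# Two values per list element: simplified to a matrix with one column per element
rng <- sapply(y, range)
rng
```

If the simplified shape matters for the next step, it is worth checking it with class() or dim() before moving on.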

31 / 76
aggregate()
This function works similarly to apply() – it computes group
statistics in the data.
▶ Let’s consider another base R data set, airquality, which
contains daily air quality measurements in New York.

head(airquality, 3)

## Ozone Solar.R Wind Temp Month Day


## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3

aggregate(Temp ~ Month, data = airquality, FUN = mean)

## Month Temp
## 1 5 65.54839
## 2 6 79.10000
## 3 7 83.90323
## 4 8 83.96774
## 5 9 76.90000
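
The same group means can also be computed with tapply(), another member of the apply family that is not covered in the slides. A minimal sketch; the name `month_means` is made up:

```r
# Mean temperature for each month, returned as a named vector
month_means <- tapply(airquality$Temp, airquality$Month, mean)
month_means
```

aggregate() returns a data frame, whereas tapply() returns a named vector here, so the choice can depend on what the next step expects.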
32 / 76
The family of apply functions

In practice, consider the following when choosing the appropriate


function to use:
▶ The data type of the input.
▶ Operation to perform.
▶ The subsets of the data: rows, columns, or perhaps all?
▶ The data type of the output. This may depend on how you plan
to use the output in the next step.

33 / 76
Manipulation of important class types

Now we turn to introduce functions to manipulate some important


types of R objects, including

1. Strings
2. Factors
3. Dates

34 / 76
Strings

To manipulate strings, we shall use the stringr package. It is more


powerful than the base R functions.
▶ Install the package first and then load the package:

# install.packages("stringr")
library(stringr)

Artwork by Allison Horst


35 / 76
Strings

In R, strings can be created in single or double quotes.

fruits <- c("apple", "banana", "orange")

▶ str_c() with the collapse argument combines the elements of a
character vector into a single character string.

str_c(fruits, collapse = ", ")

## [1] "apple, banana, orange"

36 / 76
Subsetting strings

To subset a string, we can use str_sub().


▶ The syntax is str_sub(string, start, end).

str_sub(fruits, 1, 3)

## [1] "app" "ban" "ora"

str_sub(fruits, -3, -1) # count backwards using negative indices

## [1] "ple" "ana" "nge"

37 / 76
Parsing strings
Regular expressions are a language for expressing patterns in
strings.
▶ They are used in many programming languages including R.

Artwork by Allison Horst

38 / 76
Matching patterns
To detect presence/absence of a pattern, we use str_detect().
▶ Basic syntax: str_detect(string, pattern).
▶ To detect the pattern ana:

str_detect(fruits, "ana")

## [1] FALSE TRUE FALSE

▶ To detect the absence of the pattern ana:

str_detect(fruits, "ana", negate = TRUE)

## [1] TRUE FALSE TRUE

39 / 76
Matching patterns

▶ To detect whether a string contains a, b, or c:

str_detect(fruits, "[abc]")

## [1] TRUE TRUE TRUE

▶ To detect whether a string contains any character between p and


t:

str_detect(fruits, "[p-t]")

## [1] TRUE FALSE TRUE

40 / 76
▶ To replace the first match in each string, we use str_replace().

str_replace(fruits, "[p-t]", "X")

## [1] "aXple" "banana" "oXange"

▶ To replace all matched patterns, use str_replace_all().

str_replace_all(fruits, "[p-t]", "X")

## [1] "aXXle" "banana" "oXange"
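
For comparison, base R has counterparts to these stringr functions: grepl(), sub(), and gsub(). They are not part of the slides, and the sketch below is only meant to show that the same regular expressions carry over. Note that the base R functions take the pattern first and the string last:

```r
fruits <- c("apple", "banana", "orange")

# Counterpart of str_detect(fruits, "ana")
grepl("ana", fruits)
# FALSE TRUE FALSE

# Counterpart of str_replace(): replaces only the first match in each string
sub("[p-t]", "X", fruits)
# "aXple" "banana" "oXange"

# Counterpart of str_replace_all(): replaces every match
gsub("[p-t]", "X", fruits)
# "aXXle" "banana" "oXange"
```

The stringr versions put the string first, which makes them easier to chain together.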

41 / 76
Example
▶ Here is a simple vector uni with names of universities:

uni <- c("National University of Singapore",


"NUS",
"Nanyang Technological University")

str_detect(uni, "Singapore")

## [1] TRUE FALSE FALSE

▶ Replace “NUS” by the full name:

str_replace(uni, "NUS", "National University of Singapore")

## [1] "National University of Singapore" "National University of Singap


## [3] "Nanyang Technological University"

42 / 76
More on regular expressions

A regular expression (regex) is a sequence of characters that
describes a certain pattern found in a text.
▶ We can use them for pattern-matching/replacing operations on
strings.
▶ Type ?"regular expression" in your Console to learn more on
regular expressions in R.
▶ The RStudio team has a useful cheatsheet for the stringr
package, which also provides a handy guide to regular
expressions.

43 / 76
Factors

We use factors when we work with categorical variables in R.


▶ These are variables that have a fixed and known set of possible
values.
▶ Examples: gender, calendar month, . . .
▶ We shall see that they are useful for dividing our data set into
groups and performing analyses.

44 / 76
Creating factors
Suppose we have a vector containing classification of university
students:

x1 <- c("Junior", "Freshmen", "Sophomores", "Seniors")


x1

## [1] "Junior" "Freshmen" "Sophomores" "Seniors"

▶ To convert the character vector to a factor:

x1 <- as.factor(x1)
x1

## [1] Junior Freshmen Sophomores Seniors


## Levels: Freshmen Junior Seniors Sophomores

The levels are the possible values that the variable could take on.
▶ By default, they are arranged in alphabetical order.
45 / 76
Levels of a factor

We can encode the levels using the factor() function.


▶ . . . specify the correct order of the categories, which is then
passed to the levels argument.
▶ With ordered = TRUE, we indicate that the factor is ordered.

x1_lvls <- c("Freshmen", "Sophomores", "Junior", "Seniors")


x2 <- factor(x1, levels = x1_lvls, ordered = TRUE)
x2

## [1] Junior Freshmen Sophomores Seniors


## Levels: Freshmen < Sophomores < Junior < Seniors
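
Once a factor is ordered, comparisons and sorting follow the level order rather than the alphabet. A minimal sketch that rebuilds the factor from the character vector so it runs on its own:

```r
x1_lvls <- c("Freshmen", "Sophomores", "Junior", "Seniors")
x2 <- factor(c("Junior", "Freshmen", "Sophomores", "Seniors"),
             levels = x1_lvls, ordered = TRUE)

# Which students are beyond the Sophomores level?
x2 > "Sophomores"
# TRUE FALSE FALSE TRUE

# Sorting follows the level order, not alphabetical order
sort(x2)

# The underlying integer codes give each value's position among the levels
as.integer(x2)
# 3 1 2 4
```

With an unordered factor, the comparison `x2 > "Sophomores"` would return NA with a warning instead.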

46 / 76
Dates
R contains a Date class type to work with dates.
▶ Dates are stored internally as the number of days since 1 January 1970.
▶ This allows R to compute differences between dates, sequences of
dates, and divide dates into convenient periods.

The easiest way to create a Date object is from character strings:

d1 <- as.Date("2025/01/17", format = "%Y/%m/%d")


class(d1)

## [1] "Date"

d1

## [1] "2025-01-17"
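
The internal representation can be seen with a few lines of date arithmetic. In this sketch, the second date `d2` and the weekly sequence are made up for illustration:

```r
d1 <- as.Date("2025/01/17", format = "%Y/%m/%d")
d2 <- as.Date("2025-02-01")  # hypothetical second date

# Difference between two dates
d2 - d1
# Time difference of 15 days

# The underlying number: days since 1970-01-01
as.numeric(d1)

# A sequence of three dates, one week apart
seq(d1, by = "week", length.out = 3)
# "2025-01-17" "2025-01-24" "2025-01-31"
```

The default format for as.Date() is year-month-day separated by dashes, so `d2` needs no format argument.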

47 / 76
Here are some convenient functions for extracting information that we
typically need from Date objects.

today <- Sys.Date()


today

## [1] "2025-01-19"

weekdays(today, abbreviate = FALSE)

## [1] "Sunday"

months(today, abbreviate = TRUE)

## [1] "Jan"

Later, we shall learn about the lubridate package, which provides
even more convenient functions.

48 / 76
Base R plotting

1. Scatter plot
2. Bar plot

49 / 76
Scatter plot

The plot() function takes arguments x and y.
▶ x is plotted on the horizontal axis.
▶ y is plotted on the vertical axis.
▶ They should be vectors of the same length.

The axes, scales, titles, and plotting symbols are all chosen
automatically, but these can be manually overridden.

50 / 76
cars data set

The cars data set in base R contains two columns:


▶ speed: speed of the car, in miles per hour.
▶ dist: distance taken to stop, in feet.

data(cars)
head(cars, n = 3)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4

51 / 76
Scatterplot

The following command creates a basic plot

plot(cars$speed, cars$dist)

[Figure: scatter plot of cars$dist (vertical axis) against cars$speed (horizontal axis)]

52 / 76
Adding axis label and title
We can make the graph more informative.

plot(cars$speed, cars$dist,
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")

[Figure: scatter plot titled "Relationship between Speed and Braking"; x-axis: Speed (mph), y-axis: Stopping distance (ft)]

53 / 76
Altering plotting symbols
In R, the plotting symbols are referred to as the plotting character
(pch for short).
▶ Change the symbol by specifying the pch argument to plot().
▶ The full list of symbols is displayed below.
▶ The default is pch = 1.

54 / 76
Altering plotting symbols
For instance, we can use unfilled triangles instead of unfilled circles.

plot(cars$speed, cars$dist, pch = 2,


xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")

[Figure: scatter plot titled "Relationship between Speed and Braking", drawn with unfilled triangles; x-axis: Speed (mph), y-axis: Stopping distance (ft)]

55 / 76
Altering plotting symbols

To change the size of the plotting character, we can modify the cex
argument.
▶ cex stands for “character expansion”, with a default value of 1.
▶ Larger values will make the symbol larger.
▶ This is an important abbreviation, because you will see a similar
argument in a lot of help pages referring to other parameters:
▶ cex.axis affects the font size of the axis
▶ cex.main affects the font size of the title, and so on.

56 / 76
Colors

The color of the plotting characters can be changed using the col
argument of the plot() function.
▶ The common colors can be accessed by their names, e.g., blue,
green, red.

plot(cars$speed, cars$dist, pch = 16, col = "red", cex = 2,


xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")

▶ To see a list of all named colors in R, use colors().


▶ Another useful reference for colors:
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

57 / 76
Colors

[Figure: scatter plot titled "Relationship between Speed and Braking", drawn with large filled red circles; x-axis: Speed (mph), y-axis: Stopping distance (ft)]

58 / 76
Adding a trend line
plot(cars$speed, cars$dist, col = "red", pch = 16, cex = 1,
xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
main = "Relationship between Speed and Braking")
abline(reg = lm(dist ~ speed, data = cars),
col = "gray60", lty = "dashed")

[Figure: scatter plot titled "Relationship between Speed and Braking", with filled red circles and a dashed gray trend line; x-axis: Speed (mph), y-axis: Stopping distance (ft)]
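
The trend line drawn by abline(reg = ...) takes its intercept and slope from the fitted linear model. A minimal sketch that inspects those two numbers directly; the object name `fit` is made up:

```r
data(cars)

# Fit the same linear model that is passed to abline()
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the trend line
coef(fit)
```

The slope is roughly 3.9, which suggests that each additional mile per hour is associated with about 3.9 more feet of stopping distance in this data set.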

59 / 76
Bar plot

A bar plot represents a variable by drawing bars whose heights are
proportional to the values of the variable.
▶ Let us create a bar chart using a data frame that we set up
previously.

category <- c("Asset", "Manpower", "Other")


amount <- c(38.0, 519.4, 141.4)
op_budget <- data.frame(category, amount)
op_budget

## category amount
## 1 Asset 38.0
## 2 Manpower 519.4
## 3 Other 141.4

60 / 76
Bar plot

barplot(op_budget$amount, names.arg = op_budget$category)


[Figure: bar plot of amount for the categories Asset, Manpower, and Other]

61 / 76
Summary: Base R plotting

The base R plotting commands are quite useful.


▶ Ideal for quick plots to examine our data.
▶ Later in the semester, we will learn about a different paradigm
when plotting – the grammar of graphics.
▶ More details about this later. For now, the aim is to give a working
knowledge of plotting with base R.
▶ As we proceed, we will see more examples of plotting with base R
until we reach the grammar of graphics topic.

62 / 76
Generate reports with R Markdown
Literate programming with R Markdown

Artwork by Allison Horst

63 / 76
What is (R)Markdown?

▶ Markdown is a plain text format that allows for common


typesetting features including
▶ Code
▶ Text formatting (bold, italic, . . . ), tables, images, links
▶ R Markdown is an extension to Markdown that allows us to
combine R code, its outputs, and text into one document.
▶ When the document is processed (or knitted), the code blocks
can be executed, and outputs such as figures and analyses can be
included directly in the formatted document.

64 / 76
Our workflow may look something like this:

65 / 76
A more efficient and robust workflow would look like this:

66 / 76
What is R Markdown?

R Markdown allows us to produce several kinds of output documents


(HTML, PDF, . . . ).
▶ This allows us to write analyses in R as we already do.
▶ Also allows us to write our reports and papers.

67 / 76
What is R Markdown?

R Markdown is useful when you wish to


▶ Write a report based on your analysis (which is what you will be
doing for the next few years in NUS).
▶ Share your work and findings with others. They will be able to
easily reproduce your exact findings.
▶ Capture all your analyses on your data set.

68 / 76
R Markdown pre-requisites

Run the following code to install some required packages:


▶ These packages help us to convert R markdown files to other
formats such as HTML, PDF, and Word.

# Install necessary packages


install.packages(c("rmarkdown", "knitr"))

69 / 76
To create a new R Markdown file, click on File -> New file -> R
Markdown...
▶ At this step, RStudio may prompt you to update/install other
packages. Click Yes and wait for the installation to complete.
▶ Select HTML as the default output format.

▶ File -> Save and save the file in your src folder.

70 / 76
R Markdown basics

The first section of an Rmd file is a YAML header.


It will look like this:

---
title: "Untitled"
date: "2023-11-23"
output: html_document
---

▶ The YAML header contains the metadata we put in at the


beginning.
▶ It can contain arguments such as title, date, author, and
output, demarcated by three dashes (---) on both ends.

71 / 76
The rest of the Rmd file will consist of code chunks (R code) and text.
▶ Code chunks will be surrounded by three backticks.
▶ Text will be simple text, formatted with #, *, etc.

Now we can knit it as an HTML file.

72 / 76
Global chunk options

Chunk output can be customized with knitr chunk options.
▶ echo = TRUE displays the code.
▶ include = TRUE includes both the code and its results in the output.
▶ message = FALSE and warning = FALSE suppress messages and warnings
generated by the code.
▶ fig.align = "center" and out.width = "60%" adjust the
alignment and the size of the figures generated by the code.
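
Global options are typically set once, in a setup chunk near the top of the Rmd file, with knitr::opts_chunk$set(). A minimal sketch using the options listed above; the particular values are only illustrative, and the requireNamespace() guard matters only if the code is run outside of knitting:

```r
# Typically placed in the first (setup) chunk of the Rmd file
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = TRUE,           # show the code
    message = FALSE,       # hide messages generated by the code
    warning = FALSE,       # hide warnings generated by the code
    fig.align = "center",  # center the figures
    out.width = "60%"      # scale the figures to 60% of the page width
  )
}
```

Every chunk after this one inherits these settings unless it specifies its own local options.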
73 / 76
Code chunks

Code chunks are surrounded by three backticks.


We can insert code chunks in R Markdown in three ways:
▶ Keyboard shortcut: Cmd + Option + I (macOS) or Ctrl + Alt + I (Windows/Linux).
▶ The Adding Chunk button in the editor toolbar.
▶ Or manually type the chunk delimiters.

We can specify local chunk options that modify the chunk output
for this specific code chunk.
Refer to the R Markdown Reference Guide for a complete set of (global) chunk options.

74 / 76
Using R Markdown

R Markdown is a great tool for sharing your work.


▶ You no longer have to zip up source code, images, and PDF
output to share with your teammates or colleagues.
▶ With just an .Rmd and the necessary data files, they can do what
you have done, exactly.
▶ It will take a short while to get used to the formatting. After
that, it will become very easy to use.

The first chapter of the Reporting with R Markdown DataCamp


assignment can help you get familiar with the Markdown syntax.

75 / 76
Useful resources:

▶ R Markdown from RStudio: Instructional videos under the “Get
Started” section.
https://rmarkdown.rstudio.com/
▶ R Markdown: The Definitive Guide by Yihui Xie:
https://bookdown.org/yihui/rmarkdown/
▶ R Markdown Reference:
https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

76 / 76
