0% found this document useful (0 votes)
18 views53 pages

A Crash Course in R - Intro To Statistical Programming

This lecture introduces the R programming language, focusing on its capabilities for statistical computation and graphics. Key topics include creating and indexing atomic vectors, lists, matrices, and data frames, as well as the importance of organizing code for reproducibility. The session also highlights R's object-oriented nature and various functions for data manipulation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views53 pages

A Crash Course in R - Intro To Statistical Programming

This lecture introduces the R programming language, focusing on its capabilities for statistical computation and graphics. Key topics include creating and indexing atomic vectors, lists, matrices, and data frames, as well as the importance of organizing code for reproducibility. The session also highlights R's object-oriented nature and various functions for data manipulation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

lecture 2: a crash course in r

STAT 545: Introduction to computational statistics

Vinayak Rao
Department of Statistics, Purdue University

August 25, 2016


The R programming language

From the manual,

• R is a system for statistical computation and graphics


• R provides a programming language, high level graphics,
interfaces to other languages and debugging facilities

It is possible to go far using R interactively


Better:

• Organize code for debugging/reproducibility/homework

1/43
The R programming language

John Chambers:

• Everything that exists is an object


• Everything that happens is a function call

• typeof() gives the type or internal storage mode of an object


• str() provides a summary of the R object
• class() returns the object’s class

2/43
Atomic vectors

Collections of objects of the same type


Common types include: “logical”, “integer”, “double”,
“complex”, “character”, “raw”

R has no scalars, just vectors of length 1

3/43
Creating vectors

One-dimensional vectors:
age <- 25 # 1-dimensional vector
name <- ”Alice”; undergrad <- FALSE
typeof(age) # Note: age is a double
#> [1] ”double”
class(age)
#> [1] ”numeric”

age <- 15L # L for long integer


typeof(age)
#> [1] ”integer”

4/43
Creating vectors

people <- c(’Alice’, ’Bob’, ’Carol’) # c() concatenates


years <- 1991:2000 # but not years <- 2000:1991, use seq()
even_years <- (years %% 2) == 0

typeof(people)
#> [1] ”character”
length(years)
#> [1] 10
[Link](even_years)
#> [1] TRUE

5/43
Indexing elements of a vector

Use brackets [] to index subelements of a vector


people[1] # First element is indexed by 1
#> [1] ”Alice”
years[1:5] # Index with a subvector of integers
#> [1] 1991 1992 1993 1994 1995
years[c(1, 3, length(years))]
#> [1] 1991 1993 2000

6/43
Indexing elements of a vector

Negative numbers exclude elements

people[-1]
#> [1] ”Bob” ”Carol” # All but the first element
years[-c(1, length(years))] # All but first and last elements
#> [1] 1991 1992 1993 1994 1995

6/43
Indexing elements of a vector

Index with logical vectors

even_years <- (years %% 2) == 0


#> [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
years[even_years] # Index with a logical vector
#> [1] 1992 1994 1996 1998 2000

6/43
Indexing elements of a vector

Example: sample 100 Gaussian random variables and find the


mean of the positive elements
xx <- rnorm(100, 0, 1) # Sample 100 Gaussians
indx_xx_pos <- (xx > 0) # Is this element positive
xx_pos <- xx[indx_xx_pos] # Extract positive elements
xx_pos_mean <- mean(xx_pos) # calculate mean

More terse
xx <- rnorm(100, 0, 1) # Sample 100 Gaussians
xx_pos_mean <- mean(xx[xx > 0]) # calc. mean of positives

6/43
Replacing elements of a vector

Can assign single elements


people[1] <- ”Dave”; print(people)
#> [1] ”Dave” ”Bob” ”Carol”

or multiple elements
years[even_years] <- years[even_years] + 1
#> [1] 1991 1993 1993 1995 1995 1997 1997 1999 1999 2001

(more on this when we look at recycling)


years[-c(1,length(years)] <- 0
#> [1] 1991 0 0 0 0 0 0 0 0 2001

7/43
Replacing elements of a vector

What if we assign to an element outside the vector?


years[length(years) + 1] <- 2015
years
#> [1] 1991 0 0 0 0 0 0 0 0 2001 2015
length(years)
#> [1] 11

We have increased the vector length by 1


In general, this is an inefficient way to go about things
Much more efficient is to first allocate the entire vector

8/43
Recycling

vals <- 1:6


#> [1] 1 2 3 4 5 6
vals + 1
#> [1] 2 3 4 5 6 7

vals + c(1, 2)
#> [1] 2 4 4 6 6 8

Can repeat explicitly too


rep(c(1, 2),3)
#> [1] 1 2 1 2 1 2
rep(c(1, 2),each=3)
#> [1] 1 1 1 2 2 2

9/43
Some useful R functions

seq(), min(), max(), length(), range(), any(), all(),

Comparison operators: <, <=, >, >=, ==, !=


Logical operators: &&, ||, !, &, |, xor()

[Link](), [Link](), [Link](), [Link]()


[Link](), [Link](), [Link](), [Link]()

‘Coercion’ often happens implicitly in function calls:


sum(rnorm(10) > 0)
10/43
Lists (generic vectors) in R

Elements of a list can be any R object (including other lists)


Lists are created using list():
> car <- list(”Ford”, ”Mustang”, 1999, TRUE)
> length(car)

Can have nested lists:


# car, house, cat and sofa are other lists
> possessions <- list(car, house, cat, sofa, ”3000USD”)

11/43
Indexing elements of a list

Use brackets [] and double brackets [[]]


Brackets [] return a sublist of indexed elements

12/43
Indexing elements of a list

Use brackets [] and double brackets [[]]


Brackets [] return a sublist of indexed elements

> car[1]
[[1]]
[1] ”Ford”

> typeof(car[1])
[1] ”list”

12/43
Indexing elements of a list

Use brackets [] and double brackets [[]]


Double brackets [[]] return element of list

> car[[1]]
[1] ”Ford”

> typeof(car[[1]])
[1] ”character”

12/43
Named lists

Can assign names to elements of a list

> names(car) <- c(”Manufacturer”, ”Make”, ”Year”,


+ ”Mileage”, ”Gasoline”)
# Or
> car <- list(”Manufacturer” = ”Ford”, ”Make” = ”Mustang”,
+ ”Year” = 1999, ”Mileage” = 120021.3, ”Gasoline” = TRUE)

13/43
Named lists

Can assign names to elements of a list

> names(car) <- c(”Manufacturer”, ”Make”, ”Year”,


+ ”Mileage”, ”Gasoline”)
# Or
> car <- list(”Manufacturer” = ”Ford”, ”Make” = ”Mustang”,
+ ”Year” = 1999, ”Mileage” = 120021.3, ”Gasoline” = TRUE)

> car[[”Year”]] # A length-one vector


[1] 1999
# Or
> car$Year # Shorthand notation
[1] 1999

13/43
Object attributes

names() is an instance of an object attribute


These store useful information about the object
Other common attributes: class, dim and dimnames.
Many have specific accessor functions e.g. class() or dim()
You can create your own

14/43
Matrices and arrays

Are two- and higher-dimensional collections of objects


These have an appropriate dim attribute
> my_mat <- 1:6 # vector
[1] 1 2 3 4 5 6
> dim(my_mat) <- c(3,2) # 3 rows and 2 columns
> my_mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

15/43
Matrices and arrays

Are two- and higher-dimensional collections of objects


These have an appropriate dim attribute
> my_mat <- 1:6 # vector
[1] 1 2 3 4 5 6
> dim(my_mat) <- c(3,2) # 3 rows and 2 columns
> my_mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

Equivalently (and better)


> my_mat <- matrix(1:6, nrow = 3, ncol = 2) # ncol is redundant

15/43
Matrices and arrays

Are two- and higher-dimensional collections of objects


These have an appropriate dim attribute
> my_arr <- array(1:8, c(2,2,2))
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4

, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8

15/43
Matrices and arrays

Useful functions include

• typeof(), class(), str()


• dim(), nrow(), ncol()
• [Link](), [Link](), …

16/43
Matrices and arrays

A vector/list is NOT an 1-d matrix (no dim attribute)


> [Link](1:6)
[1] FALSE

Use drop() to eliminate empty dimensions


> my_mat <- array(1:6, c(2,3,1)) # dim(my_mat) is (2,3,1)
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

17/43
Matrices and arrays

A vector/list is NOT an 1-d matrix (no dim attribute)


> [Link](1:6)
[1] FALSE

Use drop() to eliminate empty dimensions


> my_mat <- array(1:6, c(2,3,1)) # dim(my_mat) is (2,3,1)
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> my_mat <- drop(my_mat) # dim is now (2,3)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

17/43
Indexing matrices and arrays

> my_mat[2,1] # Again, use square brackets


[1] 2

Excluding an index returns the entire dimension


> my_mat[2,]
[1] 2 4 6
> my_arr[1,,1] # slice along dim 2, with dims 1, 3 equal to 1
[1] 6 8

Usual ideas from indexing vectors still apply


> my_mat[c(2,3),]
[,1] [,2]
[1,] 2 5
[2,] 3 6

18/43
Column-major order

We saw how to create a matrix from an array


> my_mat <- matrix(1:6, nrow = 3, ncol = 2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

In R matrices are stored in column-major order


(like Fortran, and unlike C and Python)
> my_mat[1:6]
[1] 1 2 3 4 5 6

19/43
Recycling

Column-major order explains recycling to fill larger matrices


> ones <- matrix(1, nrow=3, ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1

> my_seq <- matrix(c(1,2,3), nrow=3, ncol = 3)


[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3

20/43
Data frames

Very common and convenient data structures


Used to store tables:

Columns are variables and rows are observations

Age PhD GPA


Alice 25 TRUE 3.6
Bob 24 TRUE 3.4
Carol 21 FALSE 3.8

An R data frame is a list of equal length vectors and special


convenience syntax

21/43
Data frames
> df <- [Link](age = c(25L,24L,21L),
PhD = c(T,T,F),
GPA = c(3.6,2.4,2.8))

> df
age PhD GPA
1 25 TRUE 3.6
2 24 TRUE 2.4
3 21 FALSE 2.8
> typeof(df)
[1] ”list”
> class(df)
[1] ”[Link]”

> str(df) # Try yourself

22/43
Data frames

Since data frames are lists, we can use list indexing


Can also use matrix indexing (more convenient)
> df[2,3]
[1] 2.4
> df[2,]
age PhD GPA
2 24 TRUE 2.4
> df$GPA
[1] 3.6 2.4 2.8

• list functions apply as usual


• matrix functions are also interpreted intuitively

23/43
Data frames

Many datasets are data frames and many packages expect


dataframes
> library(”datasets”)
> class(mtcars)
[1] ”[Link]”

> head(mtcars) # Print part of a large object


mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3

24/43
if statements

Allow conditional execution of statements


if( condition1 ) {
statement1
} else if( condition2 ) {
statement2
} else {
statement3
}

25/43
Logical operators

!: logical negation
& and &&: logical ‘and’
| and ||: logical ‘or’
& and | perform elementwise comparisons on vectors
&& and ||:

• evaluate from left to right


• look at first element of each vector
• evaluation proceeds only until the result is determined

26/43
explicit looping: for(), while() and repeat()

for(elem in vect) { # Can be atomic vector or list


Do_stuff_with_elem # over successive elements of vect
}

x <- 0
for(ii in 1:50000) x <- x + log(ii) # Horrible
x <- sum(log(1:50000)) # Much more simple and efficient!
> [Link]({x<-0; for(i in 1:50000) x[i] <- i})
user system elapsed
0.048 0.000 0.048
> [Link](x <- log(sum(1:50000))
user system elapsed
0.001 0 0.002

27/43
Vectorization

Vectorization allows concise and fast loop-free code


∑|p|
Example: Entropy H(p) = − i=1 pi log pi of a prob. distrib.
H <- -sum( p * log(p) ) # Vectorized but wrong (p[i] == 0?)

H <- 0 # Correct but slow


for(i in 1:length(p))
if(p[i] > 0) H <- H - p[i] * log(p[i])

pos <- p > 0


H <- - sum( p[pos] * log(p[pos]) )

28/43
While loops

while( condition ) {
stuff # Repeat stuff while condition evaluates to TRUE
}

If stuff doesn’t affect condition, we loop forever.


Then, we need a break statement. Useful if many conditions
while( TRUE ) { # Or use ‘repeat { ... }’
stuff1
if( condition1 ) break
stuff2
if( condition2 ) break
}

29/43
The *apply family

Useful functions for repeated operations on vectors, lists etc.


Sample usage:
# Calc. mean of each element of my_list
rslt_list <- lapply(my_list, FUN = mean)

Stackexchange has a nice summary: [url]


Note (Circle 4 of the R inferno):

• These are not vectorized operations but are loop-hiding


• Cleaner code, but comparable speeds to explicit for loops

30/43
R functions

R comes with its own suite of built-in functions

• An important part of learning R is learning the vocabulary


See e.g. [Link]

Non-trivial applications require you build your own functions

• Reuse the same set of commands


• Apply the same commands to different inputs
• Cleaner, more modular code
• Easier testing/debugging

31/43
Creating functions

Create functions using function:


my_func <- function( formal_arguments ) body

The above statement creates a function called my_func

formal_arguments comma separated names


describes inputs my_func expects
function_body a statement or a block
describes what my_func does with inputs

32/43
An example function

normalize_mtrx <- function( ip_mat, row = TRUE ) {


# Normalizes columns to add up to one if row = FALSE
# If row = TRUE or row not specified, normalizes columns
if(![Link](ip_mat)) {
warning(”Expecting a matrix as input”);
return(NULL)
}
# You can define objects inside a function
# You can even define other functions
rslt <- if(row) ip_mat / rowSums(ip_mat) else
t( t(ip_mat) / colSums(ip_mat))
}

n_mtrx <- normalize_mtrx(mtrx)

33/43
Argument matching

Proceeds by a three-pass process

• Exact matching on tags


• Partial matching on tags: multiple matches gives an error
• Positional matching

Any remaining unmatched arguments triggers an error

34/43
Environments in R

R follows what is called lexical scoping


Lexical scoping:

• To evaluate a symbol R checks current environment


• If not present, move to parent environment and repeat
• Value of the variable at the time of calling is used
• Assignments are made in current environment

Here, environments are those at time of definition


Where the function is defined (rather than how it is called)
determines which variables to use
Values of these variables at the time of calling are used

35/43
Scoping in R

> x <- 5; y <- 6


> func1 <- function(x) {x + 1}
> func1(1)

> func2 <- function() {x + 1}


> func2()
> x <- 10; func2() # use new x or x at the time of definition?

> func3 <- function(x) {func1(x)}


> func3(2)

> func4 <- function(x) {func2()}


> func4(2) # func2 uses x from calling or global environment?

36/43
Scoping in R

For more on scoping, see (Advanced R, Hadley Wickham)


The bottomline

• Avoid using global variables


• Always define and use clear interfaces to functions
• Warning: you are always implicitly using global objects in R
> ’+’ <- function(x,y) x*y # Open a new RStudio session!
> 2 + 10

37/43
Creating functions

R’s base graphics supports some plotting commands


E.g. plot(), hist(), barplot()
Extending these standard graphics to custom plots is tedious
ggplot is much more flexible, and pretty


• View different graphs as sharing common structure
• Grammar of graphics breaks everything down into a set of
components and rules relating them,

Install like you’d install any other package:


[Link](’ggplot2’)
library(ggplot2)

38/43
Plotting in base R

> str(diamonds)
’[Link]’: 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 ...
$ cut : [Link] w/ 5 levels ”Fair”<”Good”<..: 5 4 2 ...
$ color : [Link] w/ 7 levels ”D”<”E”<”F”<”G”<..: 2 2 ...
$ clarity: [Link] w/ 8 levels ”I1”<”SI2”<”SI1”<..: 2 3 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 ...

plot(diamonds$carat, diamonds$price) # plot(x,y)

39/43
Plotting in ggplot

ggplot() +
layer(
data = diamonds,
mapping = aes(x = carat, y = price),
geom = ”point”,
stat = ”identity”,
position = ”identity” ) +
scale_y_continuous() + scale_x_continuous() +
coord_cartesian()

Of course, ggplot has intelligent defaults


ggplot(diamonds, aes(carat, price)) + geom_point()

There’s also further abbreviations via qplot (I find it confusing)

40/43
Layers

ggplot produces an object that is rendered into a plot


This object consists of a number of layers
Each layer can get own inputs or share arguments to ggplot()
Add another layer to previous plot:
ggplot(diamonds, aes(x=carat, y = price)) + geom_point()
+ geom_line(stat= ”smooth”, color=”blue”, size=5, alpha=0.7)

41/43
More Examples

ggplot(diamonds, aes(x=carat, y = price,colour=cut)) +


geom_point() +
geom_line(stat= ”smooth”, size=5, alpha= 0.7)

ggplot(diamonds, aes(x=carat, y = price,colour=cut)) +


geom_point() +
geom_line(stat= ”smooth”, method=lm, size=5, alpha= 0.7) +
scale_x_log10()+ scale_y_log10()

ggplot(diamonds, aes(x=carat, fill=cut)) +


geom_histogram(alpha=0.7, binwidth=.4, color=”black”,
position=”dodge”) + xlim(0,2) + coord_cartesian(xlim=c(.1,5))

42/43
A more complicated example

Moscou

55.5 Polotzk Chjat Mojaisk direction

Witebsk Tarantino A
Gloubokoe Wixma R
55.0 Kowno Malo−Jarosewii
Dorogobouge
Smolensk survivors
Wilna
1e+05
54.5 Orscha
Smorgoni Bobr 2e+05
Moiodexno Studienska
3e+05
54.0
Minsk Mohilow
24 28 32 36

‘A Layered Grammar of Graphics’, Hadlay Wickham, Journal of


Computational and Graphical Statistics, 2010
ggplot documentation: [Link]
Search ‘ggplot’ on Google Images for inspiration
Play around to make your own figures

43/43

You might also like