Understanding R for Epidemiologists

Tomás Aragón, MD, DrPH

Faculty, Division of Epidemiology

UC Berkeley School of Public Health

Health Officer, City & County of San Francisco

Director, Population Health and Prevention
San Francisco Department of Public Health

URL: http://www.medepi.com (tja)

August 27, 2012

Tomás Aragón, MD, DrPH (medepi.com) Understanding R for Epidemiologists August 27, 2012 1 / 60
Outline

1 Background
    Cost
    Quality
    Community

2 Getting started with R
    Full-function calculator/spreadsheet
    Extensible statistical packages
    High-quality graphics tool
    Multi-use programming language

3 Working with R data objects
    Atomic vs. recursive data objects
    Working with vectors, matrices, & arrays
    Working with lists, data frames, and functions

Background: Major issues

Cost
Quality
Community
Functionality

Cost: Open Source vs. Proprietary Software

Costs of software
Costs of multi-platforms
Costs of education and training
Costs of adding solutions (e.g., packages)
Costs of solving problems and sharing solutions

Quality: Open Source vs. Proprietary Software

Core Development Team

Large pool of users/testers
Quality control process for packages
Bug fixes based on need/demand, not profits

Community: Open Source vs. Proprietary Software

Large community of users

Transparent development process
Growing number of books and trainings
Growing number of free tutorials and manuals

Current R contributors

Douglas Bates
John Chambers
Peter Dalgaard
Seth Falcon
Robert Gentleman
Kurt Hornik
Stefano Iacus
Ross Ihaka
Friedrich Leisch
Thomas Lumley
Martin Maechler
Duncan Murdoch
Paul Murrell
Martyn Plummer
Brian Ripley
Deepayan Sarkar
Duncan Temple Lang
Luke Tierney
Simon Urbanek

Source: http://www.r-project.org/contributors.html

What is R?

Full-function calculator/spreadsheet
Extensible statistical packages
High-quality graphics tool
Multi-use programming language

Full-function calculator: Selected math operators

Operator  Description                          Try these examples
+         addition                             5+4
-         subtraction                          5-4
*         multiplication                       5*4
/         division                             5/4
^         exponentiation                       5^4
-         unary minus (change current sign)    -5
abs       absolute value                       abs(-23)
exp       exponentiation (e to a power)        exp(8)
log       logarithm (default is natural log)   log(exp(8))
sqrt      square root                          sqrt(64)
%/%       integer divide                       10%/%3
%%        modulus                              10%%3
%*%       matrix multiplication                xx <- matrix(1:4, 2, 2)
                                               xx%*%c(1, 1)
                                               c(1, 1)%*%xx

Extensible statistical packages

Generalized Linear Models (Base)
  ◮ Linear regression
  ◮ Logistic regression
  ◮ Poisson regression
Cox Proportional Hazard models (survival)
  ◮ Cox PH regression
  ◮ Conditional logistic regression (matched case-control studies)
Meta-analysis (meta)
Complex survey analysis (survey)
Epidemiology packages
  ◮ epitools
  ◮ epicalc
  ◮ epibasix
  ◮ epiR
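
As a minimal sketch of the first item, a logistic regression can be fit with glm() from base R. The example uses the built-in infert data set; the model formula is an assumption made only for illustration, and it ignores the matched design, so it is not the analysis these slides use later.

```r
# Illustrative only: unconditional logistic regression on the built-in
# infert data. The choice of covariates is an assumption for this sketch.
data(infert)
fit <- glm(case ~ spontaneous + induced, family = binomial, data = infert)

# Odds ratios are the exponentiated coefficients
or <- exp(coef(fit))
or
```

Changing the family argument (for example to poisson) selects a different generalized linear model with the same interface.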

Graphics display of sample size curves

[Figure: null (H0) and alternative (H1) distributions centered at µ0 and µ1, with critical values ±Z(1−α/2) marked on the horizontal axis and regions labeled α/2, β, and power (1 − β).]

Graphics display of P value function

[Figure: P value function for a rate ratio. Left axis: P-value (1 to 0); right axis: confidence level (%) (0 to 100); horizontal axis: rate ratio (0.2 to 20.0). Annotations: null hypothesis, median unbiased estimate, 95% confidence interval, 95% lower confidence limit = 0.74, 95% upper confidence limit = 21.0.]

Graphical display of multiple linear regression

[Figure: graphical display of a multiple linear regression; recoverable axis labels: y and x2.]

Epidemic curve using Color Brewer colors

[Figure: West Nile Virus Human Cases Reported in California by Disease Week as of December 14, 2004. Vertical axis: cases (0 to 80); horizontal axis: disease week and calendar month, 2004. Legend: Unknown, WNF, WNND. Annotations: + Bird 2/24, + Mosquito 4/14, + Chicken 5/17, + Horse 6/20.]

Multi-use programming language

Vectorized computations
Functional programming language
Object-oriented programming
Text processing (e.g., using regular expressions)
Links to C, Fortran, etc.
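
A brief sketch of the first, second, and fourth items, using base R only. The variable names and values (sbp, counts, ids) are hypothetical and chosen just for this illustration.

```r
# Vectorized computation: the comparison is applied to every element
sbp <- c(148, 120, 135)
high <- sbp >= 140

# Functional style: apply a function to each element of a list
counts <- list(a = c(1, 2, 3), b = c(10, 20))
totals <- sapply(counts, sum)

# Text processing with a regular expression
ids <- c("case-001", "ctrl-002", "case-003")
is.case <- grepl("^case", ids)
```

No explicit loop is needed in any of the three steps, which is typical of idiomatic R.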

Data objects in R

Object types    Operations

Vector          Create
Matrix          Name
Array           Index
List            Replace
Data frame      Manipulate
Function        Do computations

Summary of types of data objects in R

Data object    Possible modeᵃ                Default class

Atomic
  vector       character, numeric, logical   NULL
  matrix       character, numeric, logical   NULL
  array        character, numeric, logical   NULL
Recursive
  list         list                          NULL
  data frame   list                          data frame
  function     function                      NULL

ᵃ We are ignoring complex numbers.

Understanding vectors

A vector is a collection of like elements without dimensions.¹ The vector
elements are all of the same mode (either character, numeric, or logical).

> y <- c("Pedro", "Paulo", "Maria")
> y
[1] "Pedro" "Paulo" "Maria"
> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> x < 3
[1]  TRUE  TRUE FALSE FALSE FALSE

¹ In other programming languages, vectors are either row vectors or column
vectors. R does not make this distinction until it is necessary.

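Because all elements of a vector share one mode, combining values of different modes with c() coerces them to the most general one. The sketch below illustrates this with made-up values; the variable names a and b are hypothetical.

```r
# Mixing modes in c() coerces every element to the most general mode
a <- c(1, TRUE, FALSE)     # logical values become numeric
b <- c(1, "two", TRUE)     # numeric and logical values become character
mode(a)                    # "numeric"
mode(b)                    # "character"
a                          # 1 1 0
b                          # "1" "two" "TRUE"

# A plain vector carries no dimension attribute
dim(a)                     # NULL
```
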
Understanding vectors: Indexing

Indexing by   Try these examples

Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[2]              #positions to include
              x[c(2, 3)]
              x[-c(1, 3, 4)]    #positions to exclude
              x[-c(1, 4)]
Name          x["sbp"]
              x[c("sbp", "dbp")]
Logical       x < 100
              x[x < 100]
              (x < 150) & (x > 70)
              bp <- (x < 150) & (x > 70)
              x[bp]

Understanding vectors: Replacement

Replacing by  Try these examples

Position      x <- c(chol=234, sbp=148, dbp=78, age=54)
              x[1]
              x[1] <- 250
              x
Name          x["sbp"]
              x["sbp"] <- 150
              x
Logical       x[x<100]
              x[x<100] <- NA
              x

Understanding vectors: Replacement

> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x[1] <- 250 #by position
> x
chol  sbp  dbp  age
 250  148   78   54
> x["sbp"] <- 150 #by name
> x
chol  sbp  dbp  age
 250  150   78   54
> x[x<100]
dbp age
 78  54
> x[x<100] <- NA #by logical
> x
chol  sbp  dbp  age
 250  150   NA   NA

Understanding matrices

A matrix is a collection of like elements organized into a 2-dimensional
(tabular) data object. Matrix elements can be either numeric, character,
or logical. We can think of a matrix as a vector with a 2-dimensional
structure. Contingency tables in epidemiology are represented in R as
numeric matrices or arrays. An array is the generalization of matrices to 3
or more dimensions (commonly known as stratified tables). We cover
arrays later; for now we will focus on 2-dimensional tables.
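
The idea that a matrix is a vector with a 2-dimensional structure can be sketched directly: assigning a dim attribute to a plain vector turns it into a matrix. The counts are those of the tolbutamide trial used in these slides; the variable name x is arbitrary.

```r
x <- c(30, 174, 21, 184)   # plain numeric vector
dim(x)                     # NULL: no dimensions yet
dim(x) <- c(2, 2)          # add a 2 x 2 dimension attribute
is.matrix(x)               # TRUE
x[2, 1]                    # 174: elements fill column by column
x[3]                       # 21: vector-style indexing still works
```
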

Understanding matrices

When R returns a matrix the [n,] indicates the nth row and [,m]
indicates the mth column.

> x <- c("a", "b", "c", "d")
> y <- matrix(x, 2, 2)
> y
     [,1] [,2]
[1,] "a"  "c"
[2,] "b"  "d"
> y[1,]
[1] "a" "c"
> y[,2]
[1] "c" "d"

Understanding matrices

> x <- c(30, 21, 170, 180) # creating
> y <- matrix(x, 2, 2, byrow = TRUE) # creating
> y
     [,1] [,2]
[1,]   30   21
[2,]  170  180
> rownames(y) <- c("Deaths", "Survivors") # naming
> colnames(y) <- c("Tolbutamide", "Placebo") # naming
> y[2, 1] <- 174 # replace by position
> y["Survivors", "Placebo"] <- 184 # replace by name
> y
          Tolbutamide Placebo
Deaths             30      21
Survivors         174     184

Understanding matrices

Consider the 2 × 2 table of crude data below. In this randomized clinical
trial (RCT), diabetic subjects were randomly assigned to receive either
tolbutamide, an oral hypoglycemic drug, or placebo. Because this was a
prospective study we can calculate risks, odds, a risk ratio, and an odds
ratio. We will do this using R as a calculator.

Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970)

            Tolbutamide  Placebo
Deaths               30       21
Survivors           174      184

Understanding matrices

> dat <- matrix(c(30, 174, 21, 184), 2, 2)

> rownames(dat) <- c("Deaths", "Survivors")
> colnames(dat) <- c("Tolbutamide", "Placebo")
> coltot <- apply(dat, 2, sum) #column totals
> risks <- dat["Deaths",]/coltot
> risk.ratio <- risks/risks[2] #risk ratio
> odds <- risks/(1-risks)
> odds.ratio <- odds/odds[2] #odds ratio

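As a cross-check on the calculator-style steps, the same risk ratio and odds ratio can be obtained with prop.table() and the cross-product of the table. This is a sketch of an alternative route, not the method shown on the slide; the names rr and or are chosen here for illustration.

```r
dat <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(c("Deaths", "Survivors"),
                              c("Tolbutamide", "Placebo")))

# Column proportions: the "Deaths" row holds the risk in each arm
risks <- prop.table(dat, margin = 2)["Deaths", ]
rr <- risks[["Tolbutamide"]] / risks[["Placebo"]]

# Odds ratio as the cross-product ratio of the 2 x 2 table
or <- (dat[1, 1] * dat[2, 2]) / (dat[1, 2] * dat[2, 1])

round(c(risk.ratio = rr, odds.ratio = or), 4)
```

The two values should agree with the Tolbutamide column of the rbind() display in these slides.
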
Understanding matrices

> # display results
> dat
          Tolbutamide Placebo
Deaths             30      21
Survivors         174     184
> rbind(risks, risk.ratio, odds, odds.ratio)
           Tolbutamide   Placebo
risks        0.1470588 0.1024390
risk.ratio   1.4355742 1.0000000
odds         0.1724138 0.1141304
odds.ratio   1.5106732 1.0000000

Understanding arrays
An array is a collection of like elements organized into an n-dimensional
data object. When R returns an array the [n,,] indicates the nth row
and [,m,] indicates the mth column, and so on.
> x <- 1:8
> y <- array(x, dim=c(2, 2, 2))
> y
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

Understanding arrays
While a matrix is a 2-dimensional table of like elements, an array is the
generalization of matrices to n dimensions. Stratified contingency tables in
epidemiology are represented as array data objects in R. For example, the
RCT previously shown comparing the number of deaths among diabetic
subjects that received tolbutamide vs. placebo is now also stratified by age
group:

Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970), stratifying by age

                Age<55          Age≥55         Combined
             Tolb   Plac     Tolb   Plac     Tolb   Plac
Deaths          8      5       22     16       30     21
Survivors      98    115       76     69      174    184
Total         106    120       98     85      204    205

Understanding arrays

> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+     Treatment=c("Tolbutamide", "Placebo"),
+     "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55

           Treatment
Outcome     Tolbutamide Placebo
  Deaths              8       5
  Survivors          98     115

, , Age group = Age>=55

           Treatment
Outcome     Tolbutamide Placebo
  Deaths             22      16
  Survivors          76      69

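Once the stratified table is stored as an array, apply() can operate along any combination of dimensions. The sketch below, which goes beyond what the slide shows, collapses the age strata to recover the crude table and then computes stratum-specific risk ratios; the names crude, risks, and rr are chosen for illustration.

```r
tdat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              "Age group" = c("Age<55", "Age>=55")))

# Sum over the 3rd dimension to recover the crude 2 x 2 table
crude <- apply(tdat, c(1, 2), sum)

# Risk of death by treatment within each age stratum:
# deaths divided by the treatment-by-age column totals
risks <- tdat["Deaths", , ] / apply(tdat, c(2, 3), sum)

# Stratum-specific risk ratios, tolbutamide vs. placebo
rr <- risks["Tolbutamide", ] / risks["Placebo", ]
round(rr, 2)
```
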
Table: Example of 4-dimensional array: Year 2000 population estimates by age,
ethnicity, sex, and county

                                         Ethnicity
County/Sex     Age       White  AfrAmer  AsianPI   Latino  Multirace  AmerInd
Alameda
  Female       <=19     58,160   31,765   40,653   49,738     10,120      839
               20–44   112,326   44,437   72,923   58,553      7,658    1,401
               45–64    82,205   24,948   33,236   18,534      2,922      822
               65+      49,762   12,834   16,004    7,548      1,014      246
  Male         <=19     61,446   32,277   42,922   53,097     10,102      828
               20–44   115,745   36,976   69,053   69,233      6,795    1,263
               45–64    81,332   20,737   29,841   17,402      2,506      687
               65+      33,994    8,087   11,855    5,416        711      156
San Francisco
  Female       <=19     14,355    6,986   23,265   13,251      2,940      173
               20–44    85,766   10,284   52,479   23,458      3,656      526
               45–64    35,617    6,890   31,478    9,184      1,144      282
               65+      27,215    5,172   23,044    5,773        554      121
  Male         <=19     14,881    6,959   24,541   14,480      2,851      165
               20–44   105,798   11,111   48,379   31,605      3,766      782
               45–64    43,694    7,352   26,404    8,674      1,220      354
               65+      20,072    3,329   17,190    3,428        450       76

Understanding arrays

Figure: Schematic representation of a 4-dimensional array: Year 2000 population

estimates by age (1), race (2), sex (3), and county (4)

Understanding arrays

Figure: Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex
(3), party affiliation (4), and state (5)). We can see that the field “state” has 3
levels, and the field “party affiliation” has 2 levels; however, the number of
age, race, and sex levels is not apparent. Although not displayed, age levels would be
represented by row names (along 1st dimension), race levels by column names
(along 2nd dimension), and sex levels by depth names (along 3rd dimension).

Understanding lists

Up to now, we have been working with atomic data objects (vector, matrix,
array). In contrast, lists, data frames, and functions are recursive data
objects. Recursive data objects have more flexibility in combining diverse
data objects into one object. A list provides the most flexibility. Think of a
list object as a collection of “bins” that can contain any R object. Lists
are very useful for collecting results of an analysis or a function into one
data object where all its contents are readily accessible by indexing.

Understanding lists
A list is a collection of data objects without any restrictions:
> x <- c(11, 22, 34)
> y <- c("Male", "Female", "Male")
> z <- matrix(c(67, 34, 56,22), 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 11 22 34

[[2]]
[1] "Male"   "Female" "Male"

[[3]]
     [,1] [,2]
[1,]   67   56
[2,]   34   22

Understanding lists
Names can be assigned to each bin of a list.
> names(mylist) <- c("Age", "Sex", "Data")
> mylist
$Age
[1] 11 22 34

$Sex
[1] "Male" "Female" "Male"

$Data
     [,1] [,2]
[1,]   67   56
[2,]   34   22

> mylist$Sex
[1] "Male"   "Female" "Male"

Understanding lists

Figure: Schematic representation of a list of length four. The first bin [1]
contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the
third bin [3] contains a lightning bolt [[3]], and the fourth bin [4] contains
a heart [[4]]. When indexing a list object, single brackets [·] index the bin,
and double brackets [[·]] index the bin contents. If the bin has a name, then
$name also indexes the contents.
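
The difference between indexing the bin and indexing its contents can be checked at the console. This is a small sketch with a hypothetical two-bin list; the names bin and contents are chosen for illustration.

```r
mylist <- list(Age = c(11, 22, 34), Sex = c("Male", "Female", "Male"))

bin <- mylist[1]          # single brackets: still a list, of length 1
contents <- mylist[[1]]   # double brackets: the numeric vector itself

class(bin)                # "list"
class(contents)           # "numeric"

# $name and [["name"]] return the same contents
identical(mylist$Age, mylist[["Age"]])   # TRUE

# Computations need the contents, not the bin
mean(contents)
```
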

Understanding lists

For example, using the UGDP clinical trial data, suppose we perform
Fisher’s exact test for testing the null hypothesis of independence of rows
and columns in a contingency table with fixed marginals.

> udat <- read.csv("http://www.medepi.net/data/ugdp.txt")
> tab <- xtabs(~ Status + Treatment, data = udat)[,2:1]
> tab
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> ftab <- fisher.test(tab)
> ftab

Understanding lists

> ftab
Fisher’s Exact Test for Count Data

data: tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8013768 2.8872863
sample estimates:
odds ratio
1.509142

The default display shows only part of the results. The complete results are
stored in the object ftab. Let’s examine the structure of ftab and extract
some results:

Understanding lists

> str(ftab)
List of 7
$ p.value : num 0.181
$ conf.int : atomic [1:2] 0.801 2.887
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 1.51
..- attr(*, "names")= chr "odds ratio"
$ null.value : Named num 1
..- attr(*, "names")= chr "odds ratio"
$ alternative: chr "two.sided"
$ method : chr "Fisher’s Exact Test for Count Data"
$ data.name : chr "tab"
- attr(*, "class")= chr "htest"

Understanding lists

Let’s index some of the bins from ftab. Note that the full conf.int bin
carries a conf.level attribute, which is dropped when a single element is
extracted.

> ftab$estimate
odds ratio
    1.5091

> ftab$conf.int
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95

> ftab$conf.int[2]
[1] 2.887286

> ftab$p.value
[1] 0.18126

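The same extraction can be tried without downloading the data file by entering the four UGDP counts by hand. This is a sketch under that assumption; the name res and the choice of which bins to collect are illustrative.

```r
# Same 2 x 2 counts as the UGDP table, entered directly
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))
ftab <- fisher.test(tab)

names(ftab)   # the bins available for indexing

# Collect selected results into one named numeric vector
res <- c(or  = unname(ftab$estimate),
         lcl = ftab$conf.int[1],
         ucl = ftab$conf.int[2],
         p   = ftab$p.value)
round(res, 3)
```
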
Understanding data frames
A data frame is a list with a 2-dimensional (tabular) structure.
Epidemiologists are very experienced working with data frames where each
row usually represents data collected on individual subjects (also called
records or observations) and columns represent fields for each type of data
collected (also called variables).
> subjno <- c(1, 2, 3, 4)
> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> case <- c("Yes", "No", "No", "Yes")
> mydat <- data.frame(subjno, age, sex, case)
> mydat
  subjno age    sex case
1      1  34   Male  Yes
2      2  56   Male   No
3      3  45 Female   No
4      4  23   Male  Yes

Understanding data frames

Epidemiologists are familiar with tabular data sets where each row is a
record and each column is a field. A record can be data collected on
individuals or groups. We usually refer to the field name as a variable
(e.g., age, gender, ethnicity). Fields can contain numeric or character
data. In R, these types of data sets are handled by data frames. Each
column of a data frame is usually either a factor or numeric vector,
although it can have complex, character, or logical vectors. Data frames
have the functionality of matrices and lists. For example, here are the first
10 rows of the infert data set, a matched case-control study published in
1976 that evaluated whether infertility was associated with prior
spontaneous or induced abortions.
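
To make "the functionality of matrices and lists" concrete, here is a small sketch using a four-row data frame like the one built earlier in these slides. The subset named cases is hypothetical.

```r
mydat <- data.frame(subjno = 1:4,
                    age = c(34, 56, 45, 23),
                    sex = c("Male", "Male", "Female", "Male"),
                    case = c("Yes", "No", "No", "Yes"))

# Matrix-style indexing: [rows, columns]
mydat[2, "age"]
cases <- mydat[mydat$case == "Yes", c("age", "sex")]
dim(mydat)

# List-style indexing: each column is a bin
mydat$age
mydat[["age"]]
is.list(mydat)
```
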

Understanding data frames

> data(infert)
> str(infert)
'data.frame':   248 obs. of  8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
 $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
 $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...

Understanding data frames

> infert[1:10, 1:6]
   education age parity induced case spontaneous
1     0-5yrs  26      6       1    1           2
2     0-5yrs  42      1       1    1           0
3     0-5yrs  39      6       2    1           0
4     0-5yrs  34      4       2    1           0
5    6-11yrs  35      3       1    1           1
6    6-11yrs  36      4       2    1           1
7    6-11yrs  23      1       0    1           0
8    6-11yrs  32      2       0    1           0
9    6-11yrs  21      1       0    1           1
10   6-11yrs  28      2       0    1           0

Understanding data frames

The fields are obviously vectors. Let’s explore a few of these vectors to see
what we can learn about their structure in R.

> #age variable

> infert$age
[1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31 27 30 26
...
[235] 25 32 25 31 38 26 31 31 25 31 34 35 29 23

> mode(infert$age)
[1] "numeric"

> class(infert$age)
[1] "numeric"

Understanding data frames

> # education variable

> infert$education
[1] 0-5yrs 0-5yrs 0-5yrs 0-5yrs 6-11yrs 6-11yrs
...
[247] 12+ yrs 12+ yrs
Levels: 0-5yrs 6-11yrs 12+ yrs

> mode(infert$education)
[1] "numeric"

> class(infert$education)
[1] "factor"

Understanding data frames and factors
A factor is R’s representation of categorical fields and keeps track of all
possible category levels.
> sex <- sample(c("Male", "Female"), 100, replace = TRUE) #random draw; your counts will differ
> mode(sex); class(sex)
[1] "character"
[1] "character"
> table(sex)
sex
Female   Male
    51     49
> sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))
> table(sexf)
sexf
       Male      Female Transgender
         49          51           0
> mode(sexf); class(sexf)
[1] "numeric"
[1] "factor"

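The reason a factor reports mode "numeric" is that it stores integer codes that point into its vector of levels. A minimal sketch with three hypothetical values:

```r
sex <- c("Male", "Female", "Male")
sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))

levels(sexf)        # the possible categories, in the order given
as.integer(sexf)    # integer codes: positions in the levels vector
typeof(sexf)        # "integer"
mode(sexf)          # "numeric"

# The labels can be recovered from the codes
labels <- levels(sexf)[as.integer(sexf)]
labels
```
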
Understanding data frames and lists

The infert data come from a matched case-control study evaluating the
association between history of abortions and infertility. We use conditional
logistic regression (clogit, from the survival package).

> library(survival)  #provides clogit and strata
> mod3 <- clogit(case ~ spontaneous + induced +
+                strata(stratum), data = infert)
> mod3
Call:
clogit(case ~ spontaneous + induced + strata(stratum), data =

            coef exp(coef) se(coef)    z       p
spontaneous 1.99      7.29    0.352 5.63 1.8e-08
induced     1.41      4.09    0.361 3.91 9.4e-05

> summod3 <- summary(mod3)

Understanding data frames and lists

> summod3
  n= 248
              coef exp(coef) se(coef)     z Pr(>|z|)
spontaneous 1.9859    7.2854   0.3524 5.635 1.75e-08 ***
induced     1.4090    4.0919   0.3607 3.906 9.38e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

            exp(coef) exp(-coef) lower .95 upper .95
spontaneous     7.285     0.1373     3.651    14.536
induced         4.092     0.2444     2.018     8.298

Rsquare= 0.193   (max possible= 0.519 )
Likelihood ratio test= 53.15  on 2 df,   p=2.869e-12
Wald test            = 31.84  on 2 df,   p=1.221e-07
Score (logrank) test = 48.44  on 2 df,   p=3.032e-11

Understanding data frames and lists

> str(summod3)
List of 12
$ call : language coxph(formula = Surv(rep(1, 248L), case) ~ sponta
$ fail : NULL
$ na.action : NULL
$ n : int 248
$ loglik : num [1:2] -90.8 -64.2
$ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...
$ conf.int : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"
$ logtest : Named num [1:3] 5.32e+01 2.00 2.87e-12
... [output truncated]

Understanding data frames and lists

> summod3$coef
                coef exp(coef)  se(coef)        z     Pr(>|z|)
spontaneous 1.985876  7.285423 0.3524435 5.634592 1.754734e-08
induced     1.409012  4.091909 0.3607124 3.906191 9.376245e-05

> summod3$coef[1, ]
        coef    exp(coef)     se(coef)            z     Pr(>|z|)
1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08

> summod3$coef[ ,2]
spontaneous     induced
   7.285423    4.091909

> summod3$coef[1,2]
[1] 7.285423

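Extracted coefficients can be fed into further calculations. As a sketch, the Wald 95% confidence limits for the spontaneous odds ratio can be rebuilt from the coefficient and its standard error. The two numbers are copied by hand from the summod3$coef output in these slides, so the block runs without the survival package; the names b, se, and z are chosen for illustration.

```r
# Values copied from the "spontaneous" row of summod3$coef
b  <- 1.985876      # coef (log odds ratio)
se <- 0.3524435     # se(coef)
z  <- qnorm(0.975)  # standard normal quantile for a 95% interval

or  <- exp(b)
lcl <- exp(b - z * se)
ucl <- exp(b + z * se)
round(c(or = or, lower = lcl, upper = ucl), 3)
```

The result should agree, up to rounding, with the exp(coef), lower .95, and upper .95 columns printed by summary(mod3).
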
Understanding functions
Risk Ratio confidence interval from baby Rothman, p. 135
rr.wald <- function(x, conf.level = 0.95){
  ## x: 2 x 2 table read with exposure groups in rows (exposed first)
  ## and the outcome of interest in column 1
  ## prepare input
  x1 <- x[1,1]; n1 <- sum(x[1,])
  x0 <- x[2,1]; n0 <- sum(x[2,])
  ## do calculations
  p1 <- x1/n1 ##risk among exposed
  p0 <- x0/n0 ##risk among unexposed
  RR <- p1/p0
  logRR <- log(RR)
  SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
  Z <- qnorm(0.5*(1 + conf.level))
  LCL <- exp(logRR - Z*SElogRR)
  UCL <- exp(logRR + Z*SElogRR)
  ##collect output
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = c(LCL, UCL), conf.level = conf.level)
}

Understanding functions

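Run the rr.wald function on the UGDP RCT data. Note that rr.wald reads the
exposure groups from the rows of x, whereas tab has outcome status in rows
and treatment in columns. As called here, the "risks" are therefore 30/51
and 174/358 (the proportion of each status row that received tolbutamide),
not the risks of death by treatment. Passing the transposed table,
rr.wald(t(tab)), gives the risk ratio of 1.44 computed earlier.

> tab
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184

> rr.wald(tab)
$x
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184

$risks
       p1        p0
0.5882353 0.4860335

$risk.ratio
[1] 1.210277

$conf.int
[1] 0.9396227 1.5588927

$conf.level
[1] 0.95
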
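The Wald calculation depends on which dimension of the table holds the exposure groups. The sketch below repeats the arithmetic inline (it does not call rr.wald) on the transposed UGDP table, so that rows are treatment arms and the first column is deaths. The names x, rr, se, and ci are chosen for illustration.

```r
# UGDP counts: rows = outcome status, columns = treatment
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))

x <- t(tab)                        # rows = treatment, column 1 = deaths
x1 <- x[1, 1]; n1 <- sum(x[1, ])   # deaths and total, tolbutamide
x0 <- x[2, 1]; n0 <- sum(x[2, ])   # deaths and total, placebo

rr <- (x1 / n1) / (x0 / n0)
se <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)   # SE of log(RR)
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se)

round(c(risk.ratio = rr, lower = ci[1], upper = ci[2]), 3)
```

With this orientation the risk ratio matches the value obtained when R was used as a calculator on the same table.
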
The epitools package

The following epidemiologists, directly or indirectly, contributed to
'epitools':

Tomás Aragón, MD, DrPH, UC Berkeley
Michael P. Fay, PhD, Mathematical Statistician, National Institute of
  Allergy and Infectious Diseases
Wayne Enanoria, PhD, MPH, UC Berkeley
Travis Porco, PhD, MPH, UC San Francisco
Michael Samuel, DrPH, California Department of Public Health

Using epitools for outbreak investigations

Using the epitab function (only arguments are displayed):

epitab(x, y = NULL,
       method = c("oddsratio", "riskratio", "rateratio"),
       conf.level = 0.95,
       rev = c("neither", "rows", "columns", "both"),
       oddsratio = c("wald", "fisher", "midp", "small"),
       riskratio = c("wald", "boot", "small"),
       rateratio = c("wald", "midp"),
       pvalue = c("fisher.exact", "midp.exact", "chi2"),
       correction = FALSE,
       verbose = FALSE)

Hypothesis testing using Oswego: Passing 2 vectors

> library(epitools) #load 'epitools' package
> data(oswego) #load Oswego dataset
> attach(oswego) #attach dataset
> round(epitab(jello, ill, method = "riskratio")$tab, 2)
         Outcome
Predictor  N   p0  Y   p1 riskratio lower upper p.value
        N 22 0.42 30 0.58      1.00    NA    NA      NA
        Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44

> round(epitab(jello, ill, method = "oddsratio")$tab, 2)
         Outcome
Predictor  N   p0  Y   p1 oddsratio lower upper p.value
        N 22 0.76 30 0.65      1.00    NA    NA      NA
        Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44
> detach(oswego) #detach dataset

Hypothesis testing using Oswego: Passing a table

> jello.tab1
     ill
jello  N  Y
    N 22 30
    Y  7 16
> round(epitab(jello.tab1)$tab, 2)
     ill
jello  N   p0  Y   p1 oddsratio lower upper p.value
    N 22 0.76 30 0.65      1.00    NA    NA      NA
    Y  7 0.24 16 0.35      1.68  0.59  4.76    0.44

> round(epitab(jello.tab1, method = "risk")$tab, 2)
     ill
jello  N   p0  Y   p1 riskratio lower upper p.value
    N 22 0.42 30 0.58      1.00    NA    NA      NA
    Y  7 0.30 16 0.70      1.21  0.84  1.72    0.44

Hypothesis testing using Oswego: Passing one vector

> round(epitab(c(22, 30, 7, 16))$tab, 2)
          Outcome
Predictor  Disease1   p0 Disease2   p1 oddsratio lower upper p.value
  Exposed1       22 0.76       30 0.65      1.00    NA    NA      NA
  Exposed2        7 0.24       16 0.35      1.68  0.59  4.76    0.44

> round(epitab(c(22, 30, 7, 16), method = "risk")$tab, 2)
          Outcome
Predictor  Disease1   p0 Disease2   p1 riskratio lower upper p.value
  Exposed1       22 0.42       30 0.58      1.00    NA    NA      NA
  Exposed2        7 0.30       16 0.70      1.21  0.84  1.72    0.44

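For readers without epitools installed, the point estimates and the Wald odds ratio interval in these tables can be approximated in base R. This is a sketch of the standard Wald formulas applied to the four jello counts, not epitab's internal code; the names or, se, ci, and rr are chosen for illustration.

```r
# Oswego jello counts: rows = jello (N, Y), columns = ill (N, Y)
tab <- matrix(c(22, 30, 7, 16), 2, 2, byrow = TRUE,
              dimnames = list(jello = c("N", "Y"), ill = c("N", "Y")))

# Odds ratio as the cross-product ratio, with a Wald interval
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
se <- sqrt(sum(1 / tab))                      # SE of log(OR)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)

# Risk ratio: risk of illness among jello eaters vs. non-eaters
rr <- (tab["Y", "Y"] / sum(tab["Y", ])) / (tab["N", "Y"] / sum(tab["N", ]))

round(c(oddsratio = or, lower = ci[1], upper = ci[2], riskratio = rr), 2)
```
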
Summary

1 Background
    Cost
    Quality
    Community

2 Getting started with R
    Full-function calculator/spreadsheet
    Extensible statistical packages
    High-quality graphics tool
    Multi-use programming language

3 Working with R data objects
    Atomic vs. recursive data objects
    Working with vectors, matrices, & arrays
    Working with lists, data frames, and functions

BM1, Applied Statistics, Lesson 1: Data and Graph Basics: Luis Del Peso Ovalle
No ratings yet
BM1, Applied Statistics, Lesson 1: Data and Graph Basics: Luis Del Peso Ovalle
17 pages
Lesson 1
No ratings yet
Lesson 1
18 pages
R Function Cheat Sheet
No ratings yet
R Function Cheat Sheet
2 pages
Module 3 R Data Science
No ratings yet
Module 3 R Data Science
158 pages
Getting Started With R
No ratings yet
Getting Started With R
155 pages
Applied Epi Course Brochure
No ratings yet
Applied Epi Course Brochure
5 pages
R Programming
No ratings yet
R Programming
47 pages
R-Tutorial - Introduction
No ratings yet
R-Tutorial - Introduction
30 pages
R PPT
No ratings yet
R PPT
63 pages
Introduction To R
No ratings yet
Introduction To R
20 pages
Fundamentals of Chemical Reaction Engineering
From Everand
Fundamentals of Chemical Reaction Engineering
Mark E. Davis
2.5/5 (3)
Introduction to Bioinformatics Using Action Labs
From Everand
Introduction to Bioinformatics Using Action Labs
Jean-Louis Lassez
5/5 (1)
Green Tio2 as Nanocarriers for Targeting Cervical Cancer Cell Lines
From Everand
Green Tio2 as Nanocarriers for Targeting Cervical Cancer Cell Lines
Mythreyi M
No ratings yet
How to Find Inter-Groups Differences Using Spss/Excel/Web Tools in Common Experimental Designs: Book 1
From Everand
How to Find Inter-Groups Differences Using Spss/Excel/Web Tools in Common Experimental Designs: Book 1
P.Y. Cheng
No ratings yet
Mastering Parallel Programming with R
From Everand
Mastering Parallel Programming with R
Simon R. Chapple
No ratings yet
Técnicas Estadísticas para la Ciencia de Datos a través de R. Aprendizaje Supervisado: Análisis Discriminante, Árboles de Decisión, Redes Neuronales y Modelos Lineales Generalizados
From Everand
Técnicas Estadísticas para la Ciencia de Datos a través de R. Aprendizaje Supervisado: Análisis Discriminante, Árboles de Decisión, Redes Neuronales y Modelos Lineales Generalizados
César Pérez López
No ratings yet
Introduction to X-Ray Powder Diffractometry
From Everand
Introduction to X-Ray Powder Diffractometry
Ron Jenkins
No ratings yet
Feedback Control Theory
From Everand
Feedback Control Theory
Bruce Francis
5/5 (1)
Partial Differential Equations: A Detailed Exploration
From Everand
Partial Differential Equations: A Detailed Exploration
Kartikeya Dutta
No ratings yet
Chapter 3 (Part 2) of The Book of Why: From Evidence To Causes - Reverend Bayes Meets Mr. Holmes
No ratings yet
Chapter 3 (Part 2) of The Book of Why: From Evidence To Causes - Reverend Bayes Meets Mr. Holmes
29 pages
Designing A Humble and Healing Organization: A Neuroscience and Skills-Based Approach
No ratings yet
Designing A Humble and Healing Organization: A Neuroscience and Skills-Based Approach
34 pages
San Francisco Substance Use Update, 2019 - The Annual David E. Smith, MD Symposium
No ratings yet
San Francisco Substance Use Update, 2019 - The Annual David E. Smith, MD Symposium
27 pages
Humility Is The New Smart: Rethinking Human Excellence in The Smart Machine Age
100% (2)
Humility Is The New Smart: Rethinking Human Excellence in The Smart Machine Age
38 pages
Clean Zomoto Data
No ratings yet
Clean Zomoto Data
469 pages
Chapter 18
No ratings yet
Chapter 18
10 pages
Estimate The Live Storage Capacity of Reservoir .
100% (2)
Estimate The Live Storage Capacity of Reservoir .
7 pages
Anwaar Qureshi Final Working Drawings
No ratings yet
Anwaar Qureshi Final Working Drawings
38 pages
Thermodynamics of Surfaces
No ratings yet
Thermodynamics of Surfaces
8 pages
Aim of The Project
No ratings yet
Aim of The Project
5 pages
QUALIS CAPES - Engenharias 1 X Interdisciplinar-1
No ratings yet
QUALIS CAPES - Engenharias 1 X Interdisciplinar-1
26 pages
Green Tea and Cognitive Functioning
No ratings yet
Green Tea and Cognitive Functioning
1 page
Manuscript Submitted Overview
No ratings yet
Manuscript Submitted Overview
5 pages
Microbiology, epidemiology, and pathogenesis of Legionella infection - UpToDate
No ratings yet
Microbiology, epidemiology, and pathogenesis of Legionella infection - UpToDate
19 pages
7th Pracitce Sheet - 1 - 18. Wastewater Story
No ratings yet
7th Pracitce Sheet - 1 - 18. Wastewater Story
4 pages
G7-Persuasive Article Writing Graphic Organize
100% (1)
G7-Persuasive Article Writing Graphic Organize
2 pages
Heat Transfer Notes
No ratings yet
Heat Transfer Notes
17 pages
SCP
No ratings yet
SCP
257 pages
Aonla Rejuvenation-English
No ratings yet
Aonla Rejuvenation-English
22 pages
Afya Presentation
No ratings yet
Afya Presentation
82 pages
Examen de Ingles Tercera Unidad
No ratings yet
Examen de Ingles Tercera Unidad
5 pages
Tooth Arrangement For The Manufacture of A Complete Denture Using A Robot
No ratings yet
Tooth Arrangement For The Manufacture of A Complete Denture Using A Robot
6 pages
1-s2.0-S1098360021002860-main (1)
No ratings yet
1-s2.0-S1098360021002860-main (1)
9 pages
General Hospital Operation Policy in ICT Environment 132page ALL
No ratings yet
General Hospital Operation Policy in ICT Environment 132page ALL
133 pages
4th Quarter Moral Education Notes (Grade 11)
No ratings yet
4th Quarter Moral Education Notes (Grade 11)
7 pages
Chapter 6
No ratings yet
Chapter 6
28 pages
Sputum Culture and Sensitivity
No ratings yet
Sputum Culture and Sensitivity
8 pages
Imds Newsletter 58
No ratings yet
Imds Newsletter 58
4 pages
Books-for-EDRA-preparation_final
No ratings yet
Books-for-EDRA-preparation_final
3 pages
12th - English - Answer Key
No ratings yet
12th - English - Answer Key
5 pages
Traditional Healing Practices in Zamboanga City, Philippines
No ratings yet
Traditional Healing Practices in Zamboanga City, Philippines
8 pages


