Understanding R for Epidemiologists
Tomás Aragón, MD, DrPH (medepi.com) Understanding R for Epidemiologists August 27, 2012 1 / 60
Outline
1 Background
Cost
Quality
Community
Background: Major issues
Cost
Quality
Community
Functionality
Cost: Open Source vs. Proprietary Software
Costs of software
Costs of multi-platforms
Costs of education and training
Costs of adding solutions (e.g., packages)
Costs of solving problems and sharing solutions
Quality: Open Source vs. Proprietary Software
Community: Open Source vs. Proprietary Software
Current R contributors
Douglas Bates
Martin Maechler
John Chambers
Duncan Murdoch
Peter Dalgaard
Paul Murrell
Seth Falcon
Martyn Plummer
Robert Gentleman
Brian Ripley
Kurt Hornik
Deepayan Sarkar
Stefano Iacus
Duncan Temple Lang
Ross Ihaka
Luke Tierney
Friedrich Leisch
Simon Urbanek
Thomas Lumley
Source: http://www.r-project.org/contributors.html
What is R?
Full-function calculator/spreadsheet
Extensible statistical packages
High-quality graphics tool
Multi-use programming language
Full-function calculator: Selected math operators
Operator  Description                          Try these examples
+         addition                             5+4
-         subtraction                          5-4
*         multiplication                       5*4
/         division                             5/4
^         exponentiation                       5^4
-         unary minus (change current sign)    -5
abs       absolute value                       abs(-23)
exp       exponentiation (e to a power)        exp(8)
log       logarithm (default is natural log)   log(exp(8))
sqrt      square root                          sqrt(64)
%/%       integer divide                       10%/%3
%%        modulus                              10%%3
%*%       matrix multiplication                xx <- matrix(1:4, 2, 2)
                                               xx%*%c(1, 1)
                                               c(1, 1)%*%xx
Extensible statistical packages
Graphics display of sample size curves
[Figure: curves centered at µ0 and µ1, with labels α/2, β, and Power (1 − β), and critical values −Z(1−α/2) and Z(1−α/2).]
Graphics display of P value function
[Figure: P value function plotted against the rate ratio, with paired vertical scales running from 1 to 0 and from 0 to 100; the 95% confidence interval is marked at the 0.05 / 95 level.]
Graphical display of multiple linear regression
[Figure: plot of a multiple linear regression with axes x1, x2, and y.]
Epidemic curve using Color Brewer colors
[Figure: epidemic curve. Vertical axis: Cases. Horizontal axis: Disease Week & Calendar Month, 2004. Legend: Unknown, WNF, WNND. Annotations: + Bird 2/24, + Mosquito 4/14, + Chicken 5/17, + Horse 6/20.]
Multi-use programming language
Vectorized computations
Functional programming language
Object-oriented programming
Text processing (e.g., using regular expressions)
Links to C, Fortran, etc.
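A few of these features can be sketched in a handful of lines. The blood pressure values and the helper cv() below are made up for illustration and do not come from the slides.

```r
# Vectorized computation: arithmetic is applied element by element, no loop needed
sbp <- c(148, 150, 132)            # hypothetical systolic pressures
dbp <- c(78, 92, 85)               # hypothetical diastolic pressures
pulse.pressure <- sbp - dbp
pulse.pressure

# Functional programming: functions are objects that can be passed to other functions
cv <- function(x) sd(x) / mean(x)  # hypothetical helper: coefficient of variation
cvs <- sapply(list(sbp = sbp, dbp = dbp), cv)
cvs

# Text processing with a regular expression: which labels start with "WN"?
dx <- c("WNF", "WNND", "Unknown")
is.wn <- grepl("^WN", dx)
is.wn
```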
Data objects in R
Summary of types of data objects in R
Note: We are ignoring complex numbers.
Understanding vectors
Note: In other programming languages, vectors are either row vectors or column
vectors. R does not make this distinction until it is necessary.
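One place where the distinction becomes necessary is matrix multiplication. The sketch below reuses the xx example from the operator table; the names as.col and as.row are introduced here only for illustration.

```r
x <- c(1, 1)
dim(x)                      # NULL: a plain vector has no row/column orientation

xx <- matrix(1:4, 2, 2)
as.col <- xx %*% x          # here x is treated as a 2 x 1 column vector
as.row <- x %*% xx          # here x is treated as a 1 x 2 row vector
dim(as.col)
dim(as.row)
```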
Understanding vectors: Indexing
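As a minimal sketch, the same named vector used in the replacement example can be indexed by position, by name, or by a logical condition. The negative-position line is an addition for illustration.

```r
x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
x[2]                  # by position
x[c("chol", "age")]   # by name
x[x < 100]            # by logical condition
x[-1]                 # a negative position drops that element
```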
Understanding vectors: Replacement
> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x[1] <- 250 #by position
> x
chol sbp dbp age
250 148 78 54
> x["sbp"] <- 150 #by name
> x
chol sbp dbp age
250 150 78 54
> x[x<100]
dbp age
78 54
> x[x<100] <- NA #by logical
> x
chol sbp dbp age
250 150 NA NA
Understanding matrices
When R prints a matrix, the label [n,] marks the nth row and [,m]
marks the mth column.
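The same bracket notation is used to index a matrix. A minimal sketch with an arbitrary 2 x 3 matrix (the object m is made up for illustration):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
m          # printed with [1,] [2,] row labels and [,1] [,2] [,3] column labels
m[2, ]     # second row
m[, 3]     # third column
m[2, 3]    # single element: row 2, column 3
```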
Understanding matrices
Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970)
            Tolbutamide  Placebo
Deaths               30       21
Survivors           174      184
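One way this table could be entered as a matrix is sketched below. The object name tab and the dimension names Status and Treatment follow the output shown later in these slides; the exact construction used by the author is not shown, so treat this as an assumption.

```r
# matrix() fills column by column: first the Tolbutamide column, then Placebo
tab <- matrix(c(30, 174, 21, 184), nrow = 2, ncol = 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))
tab
tab["Death", "Tolbutamide"]    # index by row and column name
col.totals <- colSums(tab)     # subjects per treatment arm
col.totals
```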
Understanding arrays
An array is a collection of like elements organized into an n-dimensional
data object. When R prints an array, [n,,] indicates the nth row,
[,m,] indicates the mth column, and so on.
> x <- 1:8
> y <- array(x, dim=c(2, 2, 2))
> y
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
Understanding arrays
While a matrix is a 2-dimensional table of like elements, an array
generalizes matrices to n dimensions. Stratified contingency tables in
epidemiology are represented as array data objects in R. For example, the
RCT shown previously, comparing the number of deaths among diabetic
subjects who received tolbutamide vs. placebo, is now also stratified by age
group:
Table: Deaths among subjects who received tolbutamide and placebo in the
University Group Diabetes Program (1970), stratified by age
Understanding arrays
> tdat <- c(8, 98, 5, 115, 22, 76, 16, 69)
> tdat <- array(tdat, c(2, 2, 2))
> dimnames(tdat) <- list(Outcome=c("Deaths", "Survivors"),
+ Treatment=c("Tolbutamide", "Placebo"),
+ "Age group"=c("Age<55", "Age>=55"))
> tdat
, , Age group = Age<55
Treatment
Outcome Tolbutamide Placebo
Deaths 8 5
Survivors 98 115
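A short sketch of working with this array: a single stratum can be extracted by name, and summing over the third dimension collapses the age strata back into the unstratified 2 x 2 table shown earlier. The name crude is introduced here for illustration.

```r
tdat <- array(c(8, 98, 5, 115, 22, 76, 16, 69), dim = c(2, 2, 2),
              dimnames = list(Outcome = c("Deaths", "Survivors"),
                              Treatment = c("Tolbutamide", "Placebo"),
                              "Age group" = c("Age<55", "Age>=55")))

tdat[, , "Age>=55"]                  # the second age stratum as a 2 x 2 matrix

# Keep dimensions 1 and 2 (Outcome, Treatment) and sum over age groups
crude <- apply(tdat, c(1, 2), sum)
crude
```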
Table: Example of a 4-dimensional array: Year 2000 population estimates by age,
ethnicity, sex, and county
                         Ethnicity
County/Sex      Age      White    AfrAmer  AsianPI  Latino  Multirace  AmerInd
Alameda
  Female        <=19     58,160   31,765   40,653   49,738  10,120       839
                20–44   112,326   44,437   72,923   58,553   7,658     1,401
                45–64    82,205   24,948   33,236   18,534   2,922       822
                65+      49,762   12,834   16,004    7,548   1,014       246
  Male          <=19     61,446   32,277   42,922   53,097  10,102       828
                20–44   115,745   36,976   69,053   69,233   6,795     1,263
                45–64    81,332   20,737   29,841   17,402   2,506       687
                65+      33,994    8,087   11,855    5,416     711       156
San Francisco
  Female        <=19     14,355    6,986   23,265   13,251   2,940       173
                20–44    85,766   10,284   52,479   23,458   3,656       526
                45–64    35,617    6,890   31,478    9,184   1,144       282
                65+      27,215    5,172   23,044    5,773     554       121
  Male          <=19     14,881    6,959   24,541   14,480   2,851       165
                20–44   105,798   11,111   48,379   31,605   3,766       782
                45–64    43,694    7,352   26,404    8,674   1,220       354
                65+      20,072    3,329   17,190    3,428     450        76
Understanding arrays
Figure: Schematic of a theoretical 5-D array (e.g., data by age (1), race (2), sex
(3), party affiliation (4), and state (5)). We can see that the field “state” has 3
levels and the field “party affiliation” has 2 levels; however, the number of
age, race, and sex levels is not apparent. Although not displayed, age levels would be
represented by row names (along the 1st dimension), race levels by column names
(along the 2nd dimension), and sex levels by depth names (along the 3rd dimension).
Understanding lists
Up to now, we have been working with atomic data objects (vector, matrix,
array). In contrast, lists, data frames, and functions are recursive data
objects. Recursive data objects have more flexibility in combining diverse
data objects into one object. A list provides the most flexibility. Think of a
list object as a collection of “bins” that can contain any R object. Lists
are very useful for collecting results of an analysis or a function into one
data object where all its contents are readily accessible by indexing.
Understanding lists
A list is a collection of data objects without any restrictions:
> x <- c(11, 22, 34)
> y <- c("Male", "Female", "Male")
> z <- matrix(c(67, 34, 56,22), 2, 2)
> mylist <- list(x, y, z)
> mylist
[[1]]
[1] 11 22 34
[[2]]
[1] "Male" "Female" "Male"
[[3]]
[,1] [,2]
[1,] 67 56
[2,] 34 22
Understanding lists
Names can be assigned to each bin of a list.
> names(mylist) <- c("Age", "Sex", "Data")
> mylist
$Age
[1] 11 22 34
$Sex
[1] "Male" "Female" "Male"
$Data
[,1] [,2]
[1,] 67 56
[2,] 34 22
> mylist$Sex
[1] "Male" "Female" "Male"
Understanding lists
Figure: Schematic representation of a list of length four. The first bin [1]
contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the
third bin [3] contains a lightning bolt [[3]], and the fourth bin [4] contains
a heart [[4]]. When indexing a list object, single brackets [·] index the bin,
and double brackets [[·]] index the bin contents. If the bin has a name, then
$name also indexes the contents.
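The difference between indexing a bin and indexing its contents can be checked directly. This sketch rebuilds a two-bin version of the earlier list; the names bin and contents are introduced only for illustration.

```r
mylist <- list(Age = c(11, 22, 34), Sex = c("Male", "Female", "Male"))

bin <- mylist[1]          # single brackets: still a list, of length one
contents <- mylist[[1]]   # double brackets: the numeric vector inside the bin

class(bin)
class(contents)
identical(mylist$Age, mylist[["Age"]])   # $name reaches the contents as well
mean(mylist[[1]])         # works on the contents; mean(mylist[1]) would not
```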
Understanding lists
For example, using the UGDP clinical trial data, suppose we perform
Fisher’s exact test for testing the null hypothesis of independence of rows
and columns in a contingency table with fixed marginals.
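The call that created ftab is not shown on the slides. Because the printed output reports "data: tab", a plausible reconstruction is fisher.test() applied to the UGDP table. The layout of tab below is an assumption based on the table shown earlier.

```r
# Assumed layout: rows are Status, columns are Treatment
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))

ftab <- fisher.test(tab)   # stats::fisher.test returns a list of class "htest"
class(ftab)
names(ftab)                # the bins available for extraction
```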
Understanding lists
> ftab
Fisher’s Exact Test for Count Data
data: tab
p-value = 0.1813
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8013768 2.8872863
sample estimates:
odds ratio
1.509142
The default display shows only part of the results. The complete results are
stored in the object ftab. Let’s evaluate the structure of ftab and extract some
results:
Understanding lists
> str(ftab)
List of 7
$ p.value : num 0.181
$ conf.int : atomic [1:2] 0.801 2.887
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 1.51
..- attr(*, "names")= chr "odds ratio"
$ null.value : Named num 1
..- attr(*, "names")= chr "odds ratio"
$ alternative: chr "two.sided"
$ method : chr "Fisher’s Exact Test for Count Data"
$ data.name : chr "tab"
- attr(*, "class")= chr "htest"
Understanding lists
> ftab$conf.int
[1] 0.80138 2.88729
attr(,"conf.level")
[1] 0.95
> ftab$conf.int[2]
[1] 2.887286
> ftab$p.value
[1] 0.18126
Understanding data frames
A data frame is a list with a 2-dimensional (tabular) structure.
Epidemiologists are very experienced working with data frames where each
row usually represents data collected on individual subjects (also called
records or observations) and columns represent fields for each type of data
collected (also called variables).
> subjno <- c(1, 2, 3, 4)
> age <- c(34, 56, 45, 23)
> sex <- c("Male", "Male", "Female", "Male")
> case <- c("Yes", "No", "No", "Yes")
> mydat <- data.frame(subjno, age, sex, case)
> mydat
subjno age sex case
1 1 34 Male Yes
2 2 56 Male No
3 3 45 Female No
4 4 23 Male Yes
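Because a data frame behaves both like a list of columns and like a matrix, either style of indexing works. A minimal sketch using the same four subjects (the name cases is introduced for illustration):

```r
mydat <- data.frame(subjno = c(1, 2, 3, 4),
                    age = c(34, 56, 45, 23),
                    sex = c("Male", "Male", "Female", "Male"),
                    case = c("Yes", "No", "No", "Yes"))

mydat$age        # list-style: extract one column by name
mydat[["sex"]]   # list-style with double brackets
mydat[2, ]       # matrix-style: second row, all columns
dim(mydat)       # rows and columns, as for a matrix

# Combine both: rows where case is "Yes", selected columns only
cases <- mydat[mydat$case == "Yes", c("subjno", "age")]
cases
```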
Understanding data frames
Epidemiologists are familiar with tabular data sets where each row is a
record and each column is a field. A record can be data collected on
individuals or groups. We usually refer to the field name as a variable
(e.g., age, gender, ethnicity). Fields can contain numeric or character
data. In R, these types of data sets are handled by data frames. Each
column of a data frame is usually either a factor or a numeric vector,
although it can also be a complex, character, or logical vector. Data frames
have the functionality of both matrices and lists. For example, consider
the infert data set, a matched case-control study published in
1976 that evaluated whether infertility was associated with prior
spontaneous or induced abortions.
Understanding data frames
> data(infert)
> str(infert)
‘data.frame’: 248 obs. of 8 variables:
$ education : Factor w/ 3 levels "0-5yrs",..: 1 1 ...
$ age : num NA 45 NA 23 35 36 23 32 21 28 ...
$ parity : num 6 1 6 4 3 4 1 2 1 2 ...
$ induced : num 1 1 2 2 1 2 0 0 0 0 ...
$ case : num 1 1 1 1 1 1 1 1 1 1 ...
$ spontaneous : num 2 0 0 0 1 1 0 0 1 0 ...
$ stratum : int 1 2 3 4 5 6 7 8 9 10 ...
$ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
Understanding data frames
The fields are obviously vectors. Let’s explore a few of these vectors to see
what we can learn about their structure in R.
> mode(infert$age)
[1] "numeric"
> class(infert$age)
[1] "numeric"
Understanding data frames
> mode(infert$education)
[1] "numeric"
> class(infert$education)
[1] "factor"
Understanding data frames and factors
A factor is R’s representation of categorical fields and keeps track of all
possible category levels.
> sex <- sample(c("Male", "Female"), 100, replace = TRUE)
> mode(sex); class(sex)
[1] "character"
[1] "character"
> table(sex)
sex
Female Male
51 49
> sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))
> table(sexf)
sexf
Male Female Transgender
49 51 0
> mode(sexf); class(sexf)
[1] "numeric"
[1] "factor"
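The reason a factor has mode "numeric" is that it stores integer codes that point into its set of levels. A small deterministic sketch (a fixed vector is used here instead of sample() so that the result is reproducible):

```r
sex <- c("Male", "Female", "Male", "Female", "Female")
sexf <- factor(sex, levels = c("Male", "Female", "Transgender"))

levels(sexf)                 # all possible categories, observed or not
codes <- as.integer(sexf)    # the stored integer codes
codes
levels(sexf)[codes]          # codes mapped back to the labels
counts <- table(sexf)        # unobserved levels are tabulated with count 0
counts
```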
Understanding data frames and lists
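The code that produced summod3 is not shown on the slides. The coxph(Surv(rep(1, 248L), case) ~ ...) call that appears in the str() output is the form generated by clogit() in the survival package, so one plausible reconstruction is a conditional logistic regression on the infert data. The strata(stratum) term and the intermediate name mod3 are assumptions.

```r
library(survival)   # recommended package distributed with R
data(infert)

# Assumed model: matched sets are identified by the stratum variable
mod3 <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summod3 <- summary(mod3)

class(summod3)
rownames(summod3$coefficients)
```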
> summod3
n= 248
coef exp(coef) se(coef) z Pr(>|z|)
spontaneous 1.9859 7.2854 0.3524 5.635 1.75e-08 ***
induced 1.4090 4.0919 0.3607 3.906 9.38e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Understanding data frames and lists
> str(summod3)
List of 12
$ call : language coxph(formula = Surv(rep(1, 248L), case) ~ sponta
$ fail : NULL
$ na.action : NULL
$ n : int 248
$ loglik : num [1:2] -90.8 -64.2
$ coefficients: num [1:2, 1:5] 1.986 1.409 7.285 4.092 0.352 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:5] "coef" "exp(coef)" "se(coef)" "z" ...
$ conf.int : num [1:2, 1:4] 7.285 4.092 0.137 0.244 3.651 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "spontaneous" "induced"
.. ..$ : chr [1:4] "exp(coef)" "exp(-coef)" "lower .95" "upper .95"
$ logtest : Named num [1:3] 5.32e+01 2.00 2.87e-12
... [output truncated]
Understanding data frames and lists
> summod3$coef
coef exp(coef) se(coef) z Pr(>|z|)
spontaneous 1.985876 7.285423 0.3524435 5.634592 1.754734e-08
induced 1.409012 4.091909 0.3607124 3.906191 9.376245e-05
> summod3$coef[1, ]
coef exp(coef) se(coef) z Pr(>|z|)
1.985876e+00 7.285423e+00 3.524435e-01 5.634592e+00 1.754734e-08
> summod3$coef[1,2]
[1] 7.285423
Understanding functions
Risk Ratio confidence interval from baby Rothman, p. 135
rr.wald <- function(x, conf.level = 0.95){
  ## x: 2x2 table; row 1 = exposed, row 2 = unexposed; column 1 = cases
  ## prepare input
  x1 <- x[1,1]; n1 <- sum(x[1,])
  x0 <- x[2,1]; n0 <- sum(x[2,])
  ## do calculations
  p1 <- x1/n1  ## risk among exposed
  p0 <- x0/n0  ## risk among unexposed
  RR <- p1/p0
  logRR <- log(RR)
  SElogRR <- sqrt(1/x1 - 1/n1 + 1/x0 - 1/n0)
  Z <- qnorm(0.5*(1 + conf.level))
  LCL <- exp(logRR - Z*SElogRR)
  UCL <- exp(logRR + Z*SElogRR)
  ## collect output
  list(x = x, risks = c(p1 = p1, p0 = p0), risk.ratio = RR,
       conf.int = c(LCL, UCL), conf.level = conf.level)
}
Understanding functions
> tab
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
> rr.wald(tab)
$x
          Treatment
Status     Tolbutamide Placebo
  Death             30      21
  Survivor         174     184
$risks
       p1        p0
0.5882353 0.4860335
$risk.ratio
[1] 1.210277
$conf.int
[1] 0.9396227 1.5588927
$conf.level
[1] 0.95
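The printed risks can be reproduced by hand: they are 30/51 and 174/358, that is, proportions taken along the rows of tab, whose rows are Status. If the aim is instead to compare the risk of death between treatment arms, one possible approach is to transpose the table so that treatment is in the rows. The sketch below is illustrative only and is not the analysis from the slides; the names ttab, risk.tolb, risk.plac, rr, se, and ci are introduced here.

```r
tab <- matrix(c(30, 174, 21, 184), 2, 2,
              dimnames = list(Status = c("Death", "Survivor"),
                              Treatment = c("Tolbutamide", "Placebo")))

# As displayed above: proportions along the rows of tab (rows = Status)
p1 <- tab[1, 1] / sum(tab[1, ])   # 30/51
p0 <- tab[2, 1] / sum(tab[2, ])   # 174/358
p1 / p0

# With treatment in rows: risk of death within each arm
ttab <- t(tab)
risk.tolb <- ttab[1, 1] / sum(ttab[1, ])   # 30/204
risk.plac <- ttab[2, 1] / sum(ttab[2, ])   # 21/205
rr <- risk.tolb / risk.plac

# Wald interval on the log scale, same formula as in rr.wald
se <- sqrt(1/30 - 1/204 + 1/21 - 1/205)
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se)
rr
ci
```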
The epitools package
Using epitools for outbreak investigations
epitab(x, y = NULL,
method = c("oddsratio", "riskratio", "rateratio"),
conf.level = 0.95,
rev = c("neither", "rows", "columns", "both"),
oddsratio = c("wald", "fisher", "midp", "small"),
riskratio = c("wald", "boot", "small"),
rateratio = c("wald", "midp"),
pvalue = c("fisher.exact", "midp.exact", "chi2"),
correction = FALSE,
verbose = FALSE)
Hypothesis testing using Oswego: Passing 2 vectors
Hypothesis testing using Oswego: Passing a table
> jello.tab1
ill
jello N Y
N 22 30
Y 7 16
> round(epitab(jello.tab1)$tab, 2)
ill
jello N p0 Y p1 oddsratio lower upper p.value
N 22 0.76 30 0.65 1.00 NA NA NA
Y 7 0.24 16 0.35 1.68 0.59 4.76 0.44
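The odds ratio, its Wald interval, and the p0/p1 columns in this output can be checked in base R without epitools. This is a sketch of the standard Wald calculation applied to the table above; the names col.props, or, se, and ci are introduced for illustration.

```r
jello.tab1 <- matrix(c(22, 7, 30, 16), 2, 2,
                     dimnames = list(jello = c("N", "Y"), ill = c("N", "Y")))

# p0 and p1 are column proportions: exposure distribution among not ill and ill
col.props <- prop.table(jello.tab1, 2)
round(col.props, 2)

# Cross-product odds ratio and Wald 95% confidence interval on the log scale
or <- (jello.tab1["N", "N"] * jello.tab1["Y", "Y"]) /
      (jello.tab1["N", "Y"] * jello.tab1["Y", "N"])
se <- sqrt(sum(1 / jello.tab1))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)
round(c(oddsratio = or, lower = ci[1], upper = ci[2]), 2)  # compare with row Y above
```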
Hypothesis testing using Oswego: Passing one vector
Summary
1 Background
Cost
Quality
Community