Homework 2R
PART 1 - things are done for [Link] over everything again. There are parts that you have to do yourself.
Some define Statistics as the field that focuses on turning information into knowledge. The first step in that
process is to summarize and describe the raw information - the data. In this homework, you will gain insight
into public health by generating simple graphical and numerical summaries of a data set collected by the
Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn
the indispensable skills of data processing and subsetting.
Getting started
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in
the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population
and report emerging health trends. For example, respondents are asked about their diet and weekly physical
activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The
BRFSS Web site ([Link]) contains a complete description of the survey, including the
research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there
are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter
the following commands.
cdc <- [Link]("[Link] [Link]=T)
The data set cdc that shows up in your workspace is a data frame, with each row representing a case and
each column representing a variable.
To view the names of the variables, type the command
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight" "wtdesire"
## [8] "age" "gender"
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and
gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for
genhlth, respondents were asked to evaluate their general health, responding either excellent, very good,
good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1)
or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or
did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in
their lifetime (1) or had not (0). The other variables record the respondent’s height in inches, weight in
pounds, desired weight (wtdesire), age in years, and gender.
A very useful function for taking a quick peek at your dataset is summary.
summary(cdc)
## genhlth exerany hlthplan smoke100
## Length:20000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Mode :character Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7457 Mean :0.8738 Mean :0.4721
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
## gender
## Length:20000
## Class :character
## Mode :character
##
##
##
Note that categorical variables in R are currently coded as character format, which is fine to start with. If
we explicitly tell R to convert the variables to factor format, we can get some extra insight when we ask for
the summary().
cdc <- cdc %>% mutate(gender = factor(gender),
genhlth = factor(genhlth))
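To see what the conversion buys us, here is a minimal sketch on a made-up vector (health_chr is hypothetical and only stands in for the genhlth column). It uses base R only, so it runs without loading any packages.

```r
# Hypothetical toy column standing in for cdc$genhlth
health_chr <- c("good", "excellent", "good", "poor", "very good", "good")

# As character: summary() only reports length, class, and mode
summary(health_chr)

# As factor: summary() counts how often each level occurs
health_fct <- factor(health_chr)
summary(health_fct)

# Levels default to alphabetical order; levels= imposes the natural ordering
health_ord <- factor(health_chr,
                     levels = c("poor", "fair", "good", "very good", "excellent"))
levels(health_ord)
```

Note that factor() sorts the levels alphabetically unless told otherwise, which matters when an ordinal variable is displayed in a table or plot.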
1. How many cases are there in this data set? How many variables? For each variable, identify its data
type (e.g. categorical: ordinal, if the categories have an ordering, or not, numerical: continuous or
discrete). Do not just rely on the R output, also think about the nature of the variables.
• GO:
You could also look at all of the data frame at once by typing its name into the console, but that might be
unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen.
Using head() or tail() is the way to go!
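As a quick illustration, the sketch below builds a small made-up data frame (toy is hypothetical, 20 rows rather than 20,000) and peeks at it without printing everything.

```r
# Hypothetical toy data frame standing in for cdc
toy <- data.frame(id = 1:20, age = 18 + 2 * (1:20))

head(toy)         # first 6 rows by default
tail(toy, n = 3)  # last 3 rows
dim(toy)          # number of rows, then number of columns
```

The same calls work on cdc; dim() is a compact way to confirm the number of cases and variables.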
Summaries and tables
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all
of that information into a few summary statistics and graphics. As a simple example, the function favstats
returns a numerical summary: minimum, first quartile, median, third quartile, maximum, mean, standard
deviation, number of observations, and number of missing values. For weight this is
favstats(~weight, data = cdc)
## min Q1 median Q3 max mean sd n missing
## 68 140 165 190 500 169.683 40.08097 20000 0
As we have seen, R can function like a very fancy calculator. If you wanted to compute the interquartile range
for the respondents’ weight, you could look at the output from the summary command above and then enter
190 - 140
## [1] 50
iqr(~weight, data = cdc) # or have R do it for you!
## [1] 50
R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean,
median, and variance of weight, type
mean(~weight, data = cdc)
## [1] 169.683
var(~weight, data = cdc)
## [1] 1606.484
median(~weight, data = cdc)
## [1] 165
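The same statistics can also be checked with base R functions applied directly to a numeric vector. The sketch below uses seven made-up weights (the vector w is hypothetical, not taken from cdc) so the arithmetic is easy to follow by hand.

```r
# Hypothetical toy weights in pounds
w <- c(120, 140, 150, 165, 180, 190, 210)

mean(w)                       # arithmetic mean
median(w)                     # middle value of the sorted data
var(w)                        # sample variance (divides by n - 1)
sd(w)                         # standard deviation, the square root of the variance

q <- quantile(w, c(0.25, 0.75))  # first and third quartiles
unname(q[2] - q[1])              # interquartile range by subtraction
IQR(w)                           # or let R do the subtraction
```

With the cdc data loaded, the corresponding base R calls would take the column itself, for example mean(cdc$weight).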
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about
categorical data? We would instead consider the sample frequency or relative frequency distribution. The
function tally does this for you by counting the number of times each kind of response was given. For
example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
tally(~smoke100, data = cdc)
## smoke100
## 0 1
## 10559 9441
or instead look at the relative frequency distribution by typing
tally(~smoke100, data = cdc, format = "proportion")
Notice how R automatically divides all entries in the table by 20,000 in the command above. Next, we make
a bar chart of the entries in the table by putting the table inside the barchart command.
barchart(tally(~smoke100, data = cdc, margins=FALSE), horizontal=FALSE)
Notice what we’ve done here! We’ve computed the table of smoke100 and then immediately applied the
graphical function, barchart. This is an important idea: R commands can be nested. You could also break
this into two steps by typing the following:
smoke <- tally(~smoke100, data = cdc, margins=FALSE)
barchart(smoke, horizontal=FALSE)
Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into
the console) and then used it as the input for barchart. The special symbol <- performs an assignment,
taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we’ll return to later.
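Here is a small sketch of both ideas, nesting and assignment, using base R's table() and prop.table() on a made-up vector (smoke_toy is hypothetical). It also shows that a relative frequency table is just the counts divided by the number of observations.

```r
# Hypothetical toy responses: 1 = has smoked 100 cigarettes, 0 = has not
smoke_toy <- c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0)

# Two steps: save the table of counts, then use it
counts <- table(smoke_toy)
props_two_step <- counts / length(smoke_toy)

# One step: nest table() inside prop.table()
props_nested <- prop.table(table(smoke_toy))

counts
props_two_step
props_nested
```

Both routes give the same proportions; saving the intermediate object is handy when the table is needed more than once.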
2. Create a numerical summary for height and age, and compute the interquartile range for each.
Compute the relative frequency distribution (i.e. marginal distribution) for gender and exerany. How
many males are in the sample? What proportion of the sample reports being in excellent health?
• GO:
Modifying/Subsetting the Data
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We can do
this easily using the filter function and a series of logical operators. The most commonly used logical
operators for data analysis are
• == means “equal to”
• != means “not equal to”
• > or < means “greater than” or “less than”
• >= or <= means “greater than or equal to” or “less than or equal to”
Using these, we can create a subset of the cdc dataset for just the men, and save this as a new dataset called
males:
males <- cdc %>%
filter(gender == "m")
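It can help to see what the condition inside filter produces. In the sketch below (toy is a made-up data frame, not cdc), the comparison returns one TRUE or FALSE per row, and only the TRUE rows are kept. Base R indexing is used here in place of filter so the example runs without any packages.

```r
# Hypothetical toy data frame standing in for cdc
toy <- data.frame(gender = c("m", "f", "m", "f", "m"),
                  age    = c(25, 41, 33, 19, 58),
                  stringsAsFactors = FALSE)

is_male <- toy$gender == "m"  # one logical value per row
is_male

males_toy <- toy[is_male, ]   # keep only the rows where the condition is TRUE
nrow(males_toy)               # fewer rows ...
names(males_toy)              # ... but the same columns
```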
The following lines first make a new column called bmi and then create box plots of these values, defining
groups by the variable genhlth.
cdc <- cdc %>% mutate(bmi = (weight/height^2)*703)
names(cdc) # see, we now have a new column called bmi!
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender" "bmi"
bwplot(bmi ~ genhlth, data=cdc)
[Figure: box plots of bmi (vertical axis, about 10 to 70) for each genhlth level: excellent, fair, good, poor, very good]
Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data
set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and
then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we
like R: it lets us perform computations like this using very simple expressions.
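To make the element-by-element arithmetic concrete, here is a sketch with three made-up respondents (the weights and heights are hypothetical). The single expression computes all three BMI values at once.

```r
# Hypothetical weights (pounds) and heights (inches) for three respondents
weight <- c(150, 200, 120)
height <- c(65, 70, 62)

# Same formula as above, applied to every element at once
bmi <- (weight / height^2) * 703
round(bmi, 2)

# The first value matches the calculation done by hand for one person
(150 / 65^2) * 703
```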
Returning to the males data set created earlier: it contains all the same variables as the original cdc data but just under half the rows.
As an aside, you can use several of these conditions together with & and |. The & is read “and” so that
males_and_over30 <- cdc %>%
filter(gender == "m" & age > 30)
will give you the data for men over the age of 30. The | character is read “or” so that
males_or_over30 <- cdc %>%
filter(gender == "m" | age > 30)
will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right
now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses
as you like when forming a subset.
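The difference between "and" and "or" is easy to check by counting rows. The sketch below uses a made-up data frame (toy is hypothetical) and base R indexing in place of filter.

```r
# Hypothetical toy data frame standing in for cdc
toy <- data.frame(gender = c("m", "f", "m", "f", "m"),
                  age    = c(25, 41, 33, 19, 58),
                  stringsAsFactors = FALSE)

# "and": both conditions must hold for a row to be kept
both <- toy[toy$gender == "m" & toy$age > 30, ]

# "or": at least one condition must hold
either <- toy[toy$gender == "m" | toy$age > 30, ]

nrow(both)    # men over 30
nrow(either)  # men of any age, plus anyone over 30
```

An "or" subset is never smaller than the matching "and" subset, since every row that satisfies both conditions also satisfies at least one.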
3. Create a new object called under23_and_smoke that contains all observations of respondents under the
age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the
new object as the answer to this exercise, and report the number of cases that meet these criteria.
• GO:
Visualization Tools
We’ve seen several ways to produce visual displays of variables.
• histogram (via histogram())
• bargraph (via bargraph())
• boxplot (via bwplot())
• scatterplot (via xyplot())
4. For each of these graphs, how many variables are displayed simultaneously? Are they categorical or
quantitative variables?
• GO:
To visualize the relationship between two categorical variables, we can use a mosaic plot (no relation
to the package name).
mosaicplot(tally(gender ~ smoke100, data = cdc))
We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the
next (see the tally/barchart example above).
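A sketch of the two-step version on made-up data follows (gender_toy and smoke_toy are hypothetical vectors, and base R's table() is used in place of tally). The row proportions show the numbers that the tile sizes in a mosaic plot represent.

```r
# Hypothetical toy responses for eight people
gender_toy <- c("m", "m", "m", "f", "f", "f", "f", "m")
smoke_toy  <- c(1, 1, 0, 0, 0, 1, 0, 1)

# Step 1: save the two-way table of counts
tab <- table(gender = gender_toy, smoke100 = smoke_toy)
tab

# Proportion of smokers and non-smokers within each gender (rows sum to 1)
prop.table(tab, margin = 1)

# Step 2: hand the saved table to mosaicplot
mosaicplot(tab, main = "Toy example")
```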
5. What does the mosaic plot reveal about smoking habits and gender?
• GO:
Let’s get some more practice with the questions below.
6. Pick a categorical variable from the data set and see how it relates to BMI. List the variable you chose,
why you might think it would have a relationship to BMI, create an appropriate plot, and provide an
interpretation for this plot.
• GO:
7. Pick a quantitative variable from the data set and see how it relates to BMI. List the variable you chose,
why you might think it would have a relationship to BMI, create an appropriate plot, and provide an
interpretation for this plot.
• GO:
Remember that we can examine more than two variables in a plot by using groups= and adding panels using
the | operator in our formula. Study the following examples:
xyplot(weight ~ height, groups=gender, data=cdc, [Link]=T)
[Figure: scatterplot of weight (vertical axis, 100 to 500) against height (horizontal axis, 50 to 90), with points grouped by gender (f, m)]
bwplot(weight ~ genhlth | gender, data=cdc )
[Figure: box plots of weight (vertical axis, 100 to 500) by genhlth (excellent, fair, good, poor, very good), in separate panels for gender f and m]
8. Now combine all three variables used in the last two displays together into one plot. Describe what you
learn from your plot.
• GO:
Part 2 - SLR
In this problem, we aim to produce a model to predict the number of calories using the amount of carbohydrates,
fat, fiber, or protein that Starbucks food menu items contain.
#Be sure to download the file from Moodle and edit
#the path or working directory if needed
starbucks = [Link]("[Link]")
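As a reminder of the general workflow, here is a minimal sketch of a simple linear regression on made-up numbers. The vectors x and y are hypothetical and have nothing to do with the Starbucks data; only the mechanics carry over.

```r
# Hypothetical toy data: one predictor and one response
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)        # fit the line y = intercept + slope * x
coef(fit)               # estimated intercept and slope
summary(fit)$r.squared  # proportion of variation in y explained by x

# With a single predictor, R^2 equals the squared correlation
cor(x, y)^2

# Residuals against fitted values: look for curvature or changing spread
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```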
(a) What is the most frequently appearing type of food product that is included in this dataset?
Use some code to show how you arrived at your answer.
SOLUTION:
(b) Use a scatterplot matrix to determine which ONE of the variables correlates best with
calories.
SOLUTION:
(c) Fit a simple linear regression model using the variable from (b) to predict calories. Show
the model summary.
SOLUTION:
(d) Interpret the intercept and slope based on the model summary.
SOLUTION:
(e) Interpret the value of R^2.
SOLUTION:
(f) Check the assumptions of the linear model using some plots. Do you conclude that a linear
model is appropriate?
SOLUTION: