Analyzing BRFSS Data in R

The data set contains 20,000 cases and 9 variables. The variables include categorical variables like gender (male, female) and genhlth (excellent, very good, etc.) and numerical variables like height, weight, and age. The interquartile range for height is 6 inches and for age is 26 years. About 48% of the sample are males. The relative frequency of 'excellent' health is 0.14.

Uploaded by

Harsh Vardhan Dubey

Homework 2R

PART 1 - Things are done for you. Go over everything again. There are parts that you have to do yourself.
Some define Statistics as the field that focuses on turning information into knowledge. The first step in that
process is to summarize and describe the raw information - the data. In this homework, you will gain insight
into public health by generating simple graphical and numerical summaries of a data set collected by the
Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn
the indispensable skills of data processing and subsetting.

Getting started
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in
the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population
and report emerging health trends. For example, respondents are asked about their diet and weekly physical
activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The
BRFSS Web site ([Link]) contains a complete description of the survey, including the
research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there
are over 200 variables in this data set, we will work with a small subset.
We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter
the following commands.
cdc <- [Link]("[Link] [Link]=T)

The data set cdc that shows up in your workspace is a data frame, with each row representing a case and
each column representing a variable.
To view the names of the variables, type the command
names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"
This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and
gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for
genhlth, respondents were asked to evaluate their general health, responding either excellent, very good,
good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1)
or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or
did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in
their lifetime. The other variables record the respondent’s height in inches (height), weight in pounds
(weight), desired weight in pounds (wtdesire), age in years (age), and gender.
A very useful function for taking a quick peek at your dataset is summary.
summary(cdc)

##    genhlth             exerany          hlthplan         smoke100
##  Length:20000       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  Class :character   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000
##  Mode  :character   Median :1.0000   Median :1.0000   Median :0.0000
##                     Mean   :0.7457   Mean   :0.8738   Mean   :0.4721
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##      height          weight         wtdesire          age
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00
##     gender
##  Length:20000
##  Class :character
##  Mode  :character
Note that categorical variables in R are currently coded as character format, which is fine to start with. If
we explicitly tell R to convert the variables to factor format, we can get some extra insight when we ask for
the summary().
cdc <- cdc %>% mutate(gender = factor(gender),
genhlth = factor(genhlth))
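As a small illustration of why the conversion helps, here is a sketch on a made-up data frame (the name toy and its values are hypothetical, not part of cdc). It uses base R's factor() directly rather than mutate(), so it runs without loading any package. It also shows how an explicit levels argument could give genhlth its natural order instead of the alphabetical default.

```r
# Hypothetical toy data standing in for two columns of cdc
toy <- data.frame(
  gender  = c("m", "f", "f", "m", "f"),
  genhlth = c("good", "excellent", "poor", "very good", "good"),
  stringsAsFactors = FALSE
)

# As character columns, summary() only reports length, class and mode
summary(toy)

# Convert to factors; levels = ... fixes the order of the health categories
toy$gender  <- factor(toy$gender)
toy$genhlth <- factor(toy$genhlth,
                      levels = c("excellent", "very good", "good", "fair", "poor"))

# As factors, summary() reports a count for every level
summary(toy)
```

A level that never occurs in the data (here "fair") still appears in the summary with a count of zero.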

1. How many cases are there in this data set? How many variables? For each variable, identify its data
type (e.g. categorical: ordinal, if the categories have an ordering, or not, numerical: continuous or
discrete). Do not just rely on the R output, also think about the nature of the variables.
• GO:
You could also look at all of the data frame at once by typing its name into the console, but that might be
unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen.
Using head() or tail() is the way to go!
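A minimal sketch of these inspection commands, run on a small made-up data frame (toy and its values are hypothetical) so that the output stays short:

```r
# Hypothetical toy data frame; the real cdc has 20,000 rows and 9 columns
toy <- data.frame(height = c(70, 64, 60, 66, 61, 64, 71),
                  weight = c(175, 125, 105, 132, 150, 114, 194))

dim(toy)       # number of rows, then number of columns
head(toy, 3)   # first three rows
tail(toy, 3)   # last three rows
str(toy)       # compact overview: one line per variable with its type
```

The same calls should work unchanged on cdc; dim() and str() are also a quick way to start on the exercise above.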

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all
of that information into a few summary statistics and graphics. As a simple example, the function favstats
returns a numerical summary: minimum, first quartile, median, third quartile, maximum, mean, standard
deviation, number of observations, and number of missing values. For weight this is
favstats(~weight, data = cdc)

##  min  Q1 median  Q3 max    mean       sd     n missing
##   68 140    165 190 500 169.683 40.08097 20000       0
As we have seen, R can function like a very fancy calculator. If you wanted to compute the interquartile range
for the respondents’ weight, you could look at the output from the summary command above and then enter
190 - 140

## [1] 50
iqr(~weight, data = cdc) # or have R do it for you!

## [1] 50
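If the mosaic package is not loaded, base R offers the same calculation. The sketch below uses a short made-up vector (the name w and its values are assumptions, not cdc data) to show that the interquartile range is simply the third quartile minus the first.

```r
# Hypothetical weights in pounds (not taken from cdc)
w <- c(120, 140, 150, 165, 180, 190, 210)

q <- quantile(w, probs = c(0.25, 0.75))  # first and third quartiles
unname(q[2] - q[1])                      # IQR by hand
IQR(w)                                   # base R's built-in equivalent
```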

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean,
median, and variance of weight, type
mean(~weight, data = cdc)

## [1] 169.683
var(~weight, data = cdc)

## [1] 1606.484
median(~weight, data = cdc)

## [1] 165
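These numbers are connected: the standard deviation reported by favstats is the square root of the variance. A quick check, using the values printed above plus a made-up vector w (hypothetical) to show the same relationship in general:

```r
var_weight <- 1606.484   # variance of weight reported above
sqrt(var_weight)         # close to the sd reported by favstats (40.08097)

# The same relationship on a hypothetical vector
w <- c(120, 140, 150, 165, 180, 190, 210)
all.equal(sd(w), sqrt(var(w)))
```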
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about
categorical data? We would instead consider the sample frequency or relative frequency distribution. The
function tally does this for you by counting the number of times each kind of response was given. For
example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
tally(~smoke100, data = cdc)

## smoke100
## 0 1
## 10559 9441
or instead look at the relative frequency distribution by typing
tally(~smoke100, data = cdc, format = "proportion")

Notice how R automatically divides all entries in the table by 20,000 in the command above. Next, we make
a bar chart of the entries in the table by putting the table inside the barchart command.
barchart(tally(~smoke100, data = cdc, margins=FALSE), horizontal=FALSE)

Notice what we’ve done here! We’ve computed the table of smoke100 and then immediately applied the
graphical function, barchart. This is an important idea: R commands can be nested. You could also break
this into two steps by typing the following:
smoke <- tally(~smoke100, data = cdc, margins=FALSE)
barchart(smoke, horizontal=FALSE)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into
the console) and then used it as the input for barchart. The special symbol <- performs an assignment,
taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we’ll return to later.
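To make nesting and assignment concrete without any plotting, the sketch below recomputes the relative frequencies of smoke100 from the counts printed earlier. The object names counts and props are chosen here for illustration.

```r
# Counts reported by tally(~smoke100, data = cdc) above
counts <- c("0" = 10559, "1" = 9441)

# Nested: sum() is evaluated first, then the division, then round()
round(counts / sum(counts), 3)

# Two steps: store the intermediate result with <-, then reuse it
props <- counts / sum(counts)
round(props, 3)
```

The proportion for "1" agrees with the mean of smoke100 shown in the summary output (0.4721), as it should for a variable coded 0/1.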
2. Create a numerical summary for height and age, and compute the interquartile range for each.
Compute the relative frequency distribution (i.e. marginal distribution) for gender and exerany. How
many males are in the sample? What proportion of the sample reports being in excellent health?
• GO:

Modifying/Subsetting the Data

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We can do
this easily using the filter function and a series of logical operators. The most commonly used logical
operators for data analysis are
• == means “equal to”
• != means “not equal to”
• > or < means “greater than” or “less than”

• >= or <= means “greater than or equal to” or “less than or equal to”
Using these, we can create a subset of the cdc dataset for just the men, and save this as a new dataset called
males:
males <- cdc %>%
filter(gender == "m")

The following lines first make a new column called bmi and then create box plots of these values, defining
groups by the variable genhlth.
cdc <- cdc %>% mutate(bmi = (weight/height^2)*703)
names(cdc) # see, we now have a new column called bmi!

##  [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"
##  [7] "wtdesire" "age"      "gender"   "bmi"
bwplot(bmi ~ genhlth, data=cdc)

[Figure: box plots of bmi (vertical axis) for each level of genhlth: excellent, fair, good, poor, very good]

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data
set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and
then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we
like R: it lets us perform computations like this using very simple expressions.
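The same arithmetic can be tried on a handful of made-up values (the vectors below are hypothetical, not rows of cdc) to see that one expression produces one BMI per respondent:

```r
weight <- c(150, 180, 210)   # pounds (hypothetical)
height <- c(65, 70, 72)      # inches (hypothetical)

bmi <- (weight / height^2) * 703   # one BMI per respondent, no loop needed
round(bmi, 1)
```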
Returning to the males data set created earlier: it contains all the same variables as the original cdc but just under half the rows.
As an aside, you can use several of these conditions together with & and |. The & is read “and” so that
males_and_over30 <- cdc %>%
filter(gender == "m" & age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that
males_or_over30 <- cdc %>%
filter(gender == "m" | age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right
now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses
as you like when forming a subset.
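To see how the two operators differ, it can help to look at the logical vectors they combine. The sketch below uses four made-up respondents (hypothetical values) and counts how many satisfy each combined condition.

```r
gender <- c("m", "f", "m", "f")
age    <- c(25, 40, 35, 22)

is_male <- gender == "m"   # TRUE FALSE TRUE FALSE
over_30 <- age > 30        # FALSE TRUE TRUE FALSE

sum(is_male & over_30)     # rows satisfying both conditions
sum(is_male | over_30)     # rows satisfying at least one condition
```

Because "or" keeps any row that meets either condition, it can never return fewer rows than "and".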
3. Create a new object called under23_and_smoke that contains all observations of respondents under the
age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the
new object as the answer to this exercise, and report the number of cases that meet these criteria.
• GO:

Visualization Tools
We’ve seen several ways to produce visual displays of variables.
• histogram (via histogram())
• bargraph (via bargraph())
• boxplot (via bwplot())
• scatterplot (via xyplot())
4. For each of these graphs, how many variables are displayed simultaneously? Are they categorical or
quantitative variables?
• GO:
To visualize the relationship between two categorical variables, we can use a mosaic plot (no relation
to the package name).
mosaicplot(tally(gender ~ smoke100, data = cdc))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the
next (see the tally/barchart example above).
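A sketch of that two-step version, written with base R's table() in place of tally() so that it runs without the mosaic package. The data frame toy is made up, and prop.table() is added here only as one possible numeric companion to the plot.

```r
# Hypothetical toy data standing in for cdc
toy <- data.frame(gender   = c("m", "f", "m", "f", "f", "m"),
                  smoke100 = c(1, 0, 1, 1, 0, 0))

tab <- table(toy$gender, toy$smoke100)  # step 1: save the two-way table
tab
prop.table(tab, margin = 1)             # row proportions: smoking within each gender

mosaicplot(tab)                         # step 2: plot the saved table
```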
5. What does the mosaic plot reveal about smoking habits and gender?
• GO:
Let’s get some more practice with the questions below.
6. Pick a categorical variable from the data set and see how it relates to BMI. List the variable you chose,
why you might think it would have a relationship to BMI, create an appropriate plot, and provide an
interpretation for this plot.
• GO:
7. Pick a quantitative variable from the data set and see how it relates to BMI. List the variable you chose,
why you might think it would have a relationship to BMI, create an appropriate plot, and provide an
interpretation for this plot.
• GO:
Remember that we can examine more than two variables in a plot by using groups= and adding panels using
the | operator in our formula. Study the following examples:

xyplot(weight ~ height, groups=gender, data=cdc, [Link]=T)

[Figure: scatterplot of weight (vertical axis) versus height (horizontal axis), with points grouped by gender (f, m)]

bwplot(weight ~ genhlth | gender, data=cdc )

[Figure: box plots of weight (vertical axis) by genhlth (excellent, fair, good, poor, very good), in separate panels for gender (f, m)]

8. Now combine all three variables used in the last two displays together into one plot. Describe what you
learn from your plot.
• GO:
Part 2 - SLR
In this problem, we aim to produce a model to predict the number of calories using the amount of carbohydrates,
fat, fiber, or protein that Starbucks food menu items contain.
#Be sure to download the file from Moodle and edit
#the path or working directory if needed
starbucks = [Link]("[Link]")
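Before working with the real file, the mechanics of a simple linear regression in R can be rehearsed on made-up numbers. In the sketch below, the data frame toy, its column names carb and calories, and all values are assumptions for illustration; they are not taken from the Starbucks data.

```r
# Made-up values for illustration only; not taken from the Starbucks file
toy <- data.frame(carb     = c(10, 20, 30, 40, 50, 60),
                  calories = c(110, 180, 260, 300, 410, 450))

cor(toy)                                 # correlation matrix of the columns
fit <- lm(calories ~ carb, data = toy)   # simple linear regression
summary(fit)                             # coefficients, R-squared, residual SE
coef(fit)                                # intercept and slope only
summary(fit)$r.squared                   # R-squared on its own
```

With a single predictor, R-squared equals the squared correlation between the two variables, which gives a quick consistency check on the model summary.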

(a) What is the most frequently appearing type of food product that is included in this dataset?
Use some code to show how you arrived at your answer.
SOLUTION:

(b) Use a scatterplot matrix to determine which ONE of the variables correlates best with
calories.
SOLUTION:

(c) Fit a simple linear regression model using the variable from (b) to predict calories. Show
the model summary.
SOLUTION:

(d) Interpret the intercept and slope based on the model summary.
SOLUTION:

(e) Interpret the value of R^2.

SOLUTION:

(f) Check the assumptions of the linear model using some plots. Do you conclude that a linear
model is appropriate?
SOLUTION:
