Calculus Laboratory 5
Risk Factors for Breast Cancer
Purpose. The purpose of this lab is to acquaint you with the interesting and difficult questions
that arise when one uses linear models (that is, straight lines) to extract information from data.
Preview. In this lab, we will focus on some of the factors that may affect the risk of breast
cancer, and we will attempt to model the relationship between each factor and its associated risk
using a linear function. As part of the modeling process, we will discuss what it means to have a
good fit to the data, the significance of the slope of the linear function, the validity of the model,
and the effect of other factors not taken into consideration in our data (and hence our model).
Facts and Figures. The American Cancer Society reports [1] that breast cancer is the second most
commonly diagnosed cancer (after skin cancer) for American women and the second leading cause
of cancer death (after lung cancer). A cancerous tumor is called invasive if it tends to spread in
the body. There were about 192,000 new cases of invasive breast cancer in the United States in
2001. There were about 40,200 deaths from breast cancer in that year. Among women under age
30, there were an estimated 900 new cases of invasive breast cancer and about 100 breast cancer
deaths in 2001. In addition, there were approximately 1,500 new cases of breast cancer in men in
2001, and about 400 men died of breast cancer in that year.
Age is one of several factors that are known to affect breast cancer risk: whereas a 20-year-old
woman faces a 0.05% probability of contracting breast cancer in the next ten years, that probability
increases to 2.77% for a 40-year-old woman and 4.16% for a 70-year-old woman.
The fact that cancer risk depends on several variables makes it difficult to quantify the effect
of any one variable (such as a trait or activity) on cancer risk. This is why relative risk is used.
According to the American Cancer Society, “A relative risk compares the risk of disease among
people with a particular exposure to the risk among people without that exposure. If the relative
risk is above 1.0, then the risk is higher among exposed than among unexposed persons. Relative
risk below 1.0 reflects an inverse association between a risk factor and the disease, i.e., a protective
effect, or lower risk, associated with the exposure.” In this passage, exposure refers to any given
measurable phenomenon, for example, environmental carcinogens, smoking, diet, exercise, and
genetic makeup. For instance, it is estimated that each of several inherited genetic mutations
carries a relative risk of breast cancer of over 4.0. This means that a woman with one of these
mutations is more than four times as likely to develop breast cancer as a woman without the mutation.
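To make the definition concrete, here is a minimal Python sketch of the ratio it describes. The function name relative_risk and all four counts are hypothetical, chosen only for illustration; they do not come from the American Cancer Society or from any study.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of disease among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical counts, invented purely for illustration (not real data).
rr = relative_risk(exposed_cases=30, exposed_total=1000,
                   unexposed_cases=20, unexposed_total=1000)
print(round(rr, 2))  # 1.5
```

A value above 1.0, as in this made-up example, would indicate higher risk among the exposed group; a value below 1.0 would indicate lower risk.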
A Case Study. The data used in this lab are taken in large part from a study [2] of women, aged
50-64 years, who were residents of King County in northwestern Washington State. The study
included 537 women who had been diagnosed with breast cancer between January 1988
and June 1990 (case group) and 492 women picked at random from the population who had never
been diagnosed with breast cancer (control group). Data were gathered from the study participants
on various characteristics and activities. Using these data, the authors of the study computed a
relative risk of breast cancer for each characteristic. We will consider two of these characteristics,
hours of exercise and fat intake.

[1] “Breast Cancer Facts & Figures 2001-2002,” http://www.cancer.org/
[2] “Occurrence of Breast Cancer in Relation to Recreational Exercise in Women Age 50-64 Years,” Epidemiology, Nov. 1996, Vol. 7, No. 6, 598-604.

I. Hours of Exercise. The first risk factor that we will consider is exercise. Based on responses
to the question, “During the 2-year period prior to (date of study), with what frequency did you do
any strenuous physical activities, exercise, or sports?”, the following data were collected:
Average number of exercise hours per week    0      2.0    3.0    4.0
Relative risk                                1.07   0.79   0.73   0.65
(a) In order to model the relationship between exercise and breast cancer, we wish to find a line
that is a good fit to the data. In your opinion, what general characteristics would make a
given line a good fit to data points?
(b) Draw a straight line in Figure 1 that fits the data well (you may not use the statistical features
of your calculator). Denote the relative risk by the variable R, and hours of exercise by the
variable E. Determine the slope of the line you drew. Also determine the R-intercept (where
the line intersects the vertical axis). What is the equation of the line?
Figure 1: Relative risk of breast cancer vs. exercise. [Scatter plot of the four data points from the table above; horizontal axis: hours of exercise (0 to 4); vertical axis: relative risk (0.0 to 1.2).]
(c) What does the R-intercept represent in terms of exercise and the risk of breast cancer?
(d) Is your R-intercept the same as the data point that is on the R-axis? Do you believe that the
data point is where the “correct” model would actually have an intercept? Why or why not?
(e) The slope of your line should be negative. Describe what this means in terms of exercise and
the relative risk of breast cancer.
(f) What is the absolute value of the slope of your line? Use this number and the negativity of
the slope of the line to finish the following sentence: “If a woman increases her exercise by
one hour per week, then her relative risk of breast cancer will be (approximately) . . . ”
(g) Use your linear model to estimate the relative risk of breast cancer for a woman who exercises
20 hours a week. Is this a reasonable estimate? We should, of course, qualify the sentence
above by restricting the model to a certain set of values of E. Determine a set of values of
E for which you believe your linear model will be valid, and complete the following sentence:
“If a woman exercises between ____ and ____ hours per week, and if she increases her
exercise by one hour per week, then her relative risk of breast cancer will be . . . ”
(h) Estimate the relative risk of breast cancer if a woman exercises for 2.3 hours a week.
(i) Estimate the relative risk of breast cancer if a woman exercises for 5 hours a week.
(j) Find a value for E for which R ≈ 1.0. Explain the significance of this value of E.
(k) Actually, we do have another piece of data from this study: if a woman exercises more than
5 hours a week then (according to the study) her relative risk jumps to 1.2. Do you now
wish to refine the set of values for E for which the linear model appears valid? Suggest two
possible reasons why the data show an increased relative risk of breast cancer when a
woman exercises more than 5 hours a week.
(l) In statistics, a common criterion for how well a line fits a set of data points is the sum of the
squares of the vertical distances from the line to the data points. Compute this sum for your
line. Draw a different line that fits reasonably well and compute the sum for that line.
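As an illustration of the criterion in part (l), the following Python sketch computes the sum of squared vertical distances for two candidate lines, using the exercise data from Part I. Both candidate lines (R = 1.05 - 0.10E and R = 1.07 - 0.105E) and the function name are assumptions made up for this example. They are not "the" answers, and your own hand-drawn line will generally differ.

```python
# Exercise data from Part I: hours per week (E) and relative risk (R).
hours = [0.0, 2.0, 3.0, 4.0]
risk = [1.07, 0.79, 0.73, 0.65]

def sum_squared_errors(slope, intercept, xs, ys):
    """Sum of squared vertical distances from the line to the data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Two hypothetical candidate lines, chosen only for illustration.
sse_first = sum_squared_errors(-0.10, 1.05, hours, risk)
sse_second = sum_squared_errors(-0.105, 1.07, hours, risk)

print(round(sse_first, 6))   # 0.0044
print(round(sse_second, 6))  # 0.005525
```

By this criterion, the line with the smaller sum is the better fit of the two.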
A link between exercise and breast cancer has been confirmed by many studies. A more recent
and extensive study [3] of 25,624 women in Norway found a strong relationship between exercise and
breast cancer. For instance, the study found that women whose occupation involved heavy manual
labor had a relative risk of 0.46. However, “while a causal relationship between regular exercise
and a reduced risk of breast cancer (i.e., that it is actually the exercise itself which reduces the
risk of breast cancer) has yet to be proved” [4], there appear to be many reasons why exercise would
reduce risk: reduction of fat stores, which in turn reduces the level of estrogen, thereby inhibiting
carcinogenesis in the breast; an improved immune system; and improved fertility (with moderate
exercise), which promotes the protection against breast cancer conferred by childbearing. Indeed, there
appear to be quite complex relationships between risk factors and relative risks, and it can be very
difficult to determine whether a factor is causal (and to what extent it is causal) or non-causal.
II. Fat Intake. The data in Table 1 relate dietary fat intake to relative risk.
% of daily calories from fat    25     32     40     50
(Raw) relative risk             0.77   0.87   1.02   1.14
Table 1: Relative risk of breast cancer vs. fat intake.

[3] “Physical Activity and the Risk of Breast Cancer,” The New England Journal of Medicine, Vol. 336, No. 18, 1269-1275.
[4] “Exercise and Breast Cancer – time to get moving?,” The New England Journal of Medicine, Vol. 336, No. 18, 1311-1312.

(a) Find a linear function that will fit this data.
(b) Analyze the relationship between fat intake and breast cancer as we did in the case of exercise.
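One possible way to approach part (a) is a least-squares fit, which picks the line that minimizes the sum-of-squares criterion described in Part I(l). The lab does not prescribe this method, so treat the following Python sketch as one option under that assumption. The function and variable names are invented for this example; the data are those of Table 1.

```python
# Fat-intake data from Table 1: percent of daily calories from fat (F)
# and (raw) relative risk (R).
fat_percent = [25.0, 32.0, 40.0, 50.0]
risk = [0.77, 0.87, 1.02, 1.14]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of
    squared vertical distances to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(fat_percent, risk)
print(f"R = {slope:.4f} F + {intercept:.3f}")  # R = 0.0151 F + 0.395
```

A line drawn by eye may differ somewhat from this one and still fit the data reasonably well.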
III. Correlations and Causality. Newspaper headlines often report that scientists or social
scientists have found that two particular variables are “linked” or “correlated.” This means roughly
that when one variable gets higher the other usually gets higher, too. In statistics, the “correlation
coefficient” is defined as a technical measure of the degree to which two variables are correlated; it
is always a number between -1 and 1. A correlation coefficient near 1 says that when one variable
rises, the other usually rises (as in the data set in Part II). A correlation coefficient near -1 says
that when one variable rises, the other usually falls (as in the data set in Part I). A correlation
coefficient near zero means that the two variables show little or no linear relationship to each other.
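The lab does not spell out how the correlation coefficient is computed. As a hedged sketch, the following Python code assumes the standard Pearson formula (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations) and applies it to the data sets of Parts I and II. The function name is an assumption made for this example.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Part I: hours of exercise vs. relative risk.
r_exercise = correlation_coefficient([0, 2, 3, 4], [1.07, 0.79, 0.73, 0.65])
# Part II: percent of calories from fat vs. relative risk.
r_fat = correlation_coefficient([25, 32, 40, 50], [0.77, 0.87, 1.02, 1.14])

print(round(r_exercise, 3))  # -0.984
print(round(r_fat, 3))       # 0.996
```

With only four data points in each set, these values describe the tabulated data and should not be read as strong evidence on their own.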
It is extremely important to realize that high correlation between two variables A and B does
not prove that there is a causal relationship between them. That is, it is not necessarily true that
A causes B or that B causes A. Here is an example. Suppose that every day during the Fall term
we pick a random student in front of the chapel and measure the percentage of the student’s skin that
is exposed (that’s A). We also pick a random deciduous tree in the Forest and measure the percentage
of leaves still on the tree (that’s B). We would find that A and B are correlated. But are they causally
related? When students wear long pants and sweaters, does that make the leaves fall off trees, or vice
versa? No, clearly that’s absurd. In this case, there is a third variable (C = temperature) which
“causes” both A and B. Thus, A and B are linked but not causally related.
(a) Fouls and Points Scored. Here is a table showing the total number of fouls committed and
the total number of points scored by each player on the 2000 - 2001 Duke NCAA Champion
men’s basketball team.
player            fouls committed    points scored
Williams, J.            87                841
Battier, S.             80                778
Dunleavy, M.            81                493
Boozer, C.              91                425
James, N.               78                480
Duhon, C.               65                280
Christensen, M.         62                 48
Sanders, C.             72                 87
Love, R.                21                 28
Horvath, N.              7                 17
Buchner, A.              5                  5
Caldbeck, R.             1                  1
Simpson, J.              2                 20
Sweet, A.                6                 29
Borman, A.               1                  6
Make a scatter plot of these data and write a paragraph discussing whether these two variables
are correlated. Are they causally related? If not, why are they correlated?
(b) Lead Exposure and Delinquent Behavior. A recent study in the Journal of the American
Medical Association (Volume 275(5), Feb. 7, 1996, pp. 363-369) concluded that “Lead
exposure is associated with increased risk for antisocial and delinquent behavior . . .”. Write a
paragraph speculating on how these variables could be causally related or whether they could
be linked because both are caused by a third variable.
lc, 97; mcr, 98, 00; lb, mcr, 02