Chapter 3. Regression
Linear Regression
Our work on correlation shows that it is a useful measure for understanding relationships between
variables, telling us both what type of relationship exists (positive or negative) and how strong it is.
Note that correlation only measures the strength of linear relationships. Correlation can also tell you
things like “when one variable goes up, the other tends to go up too”, but it cannot quantify by how
much. It can, however, quantify how reliable the relationship is; recall also that rank correlation ignores
the sizes of those changes and looks only at their order.
In this chapter we will study regression, which again is about understanding the relationship between
variables. But it can tell us something different from correlation. Whereas correlation says how strong
a relationship is, regression says exactly how large a change in one variable to expect from a particular
change in the other. Together, regression and correlation tell us everything we need about these types
of relationships.
Regression is a statistical technique for fitting a function to data, that is, drawing a line or a curve that
goes through data points. If you have some data, and you can fit a function to it, then you can make
predictions about what new data will look like. In a business context, making predictions is very useful,
as it allows us to answer questions such as “how much would a new shop sell in a city we have not yet entered?”
We will look first at linear regression, which it turns out is closely related to correlation. Linear models
are simple but surprisingly powerful.
Recall that a line can be written with the equation y = a + bx, where x and y are the two variables,
and a and b are constants (i.e. just numbers). That means that if x = 0, then y = a (a is called the
intercept). It also means that for every increment of one unit in x, y will increase by b (b is called the
slope). We say that y is a linear function of x.
Consider the data shown in Figure 3.1. It represents sales of a franchise-type shop (like GAP, Subway,
etc) in cities with different population sizes. We can plot this data using a scatter plot, by using
Population as the x variable, and Sales as the y variable. Notice that city 5 appears twice, meaning that
there are two shops in that city, so there are two sales values for that city.
It looks as if sales are somewhat related to city population. In fact, the relationship looks somewhat
linear. That is, one could draw a straight line through the data, and although it would not go through
every single data point, it would be quite close. We could even use that line to answer a question like
“what value for sales would we expect if a new location for this franchise was started in a new city with
population x = 1 million?”
One problem is that the process of drawing the line is quite subjective. Two different people might think
that slightly different lines were correct. If one has a particular bias or an interest in the decision of
whether or not to open a new franchise, he or she might try to push the line a little in a favourable
direction. Clearly, it is better to use an objective method of drawing the line and answering these
questions. To do that, we use a method called Least Squares.
Figure 3.1. Data on the sales of a chain of shops and the populations of the cities where
they are located.
1. Least Squares
Let us start by looking at two different lines: see Figure 3.2. The line on the left has the equation
y = 3790.1x − 1000, and we can see that it is pretty good, but it misses some points. The line on the
right has the equation y = 2522x + 56.3, and it is also pretty good, but it too misses some points. Which
is better?
[Figure: two scatter charts of Sales in euros (thousands) against City population (millions), each showing a candidate straight line with its equation displayed.]
Figure 3.2. Two linear approximations to the city population–sales data. Which is better?
Clearly, we want to choose a line that is as close to all the points as possible. And clearly there is no
straight line which goes through all the points (for this data, and generally for real-world problems).
The objective way to draw the best line is to draw the line which minimises the distance of the line to
all the points.
We can view the vertical distance from each point to the line as the error of the linear model at that
point. So we choose to draw the line which gives the smallest total error, taking into account that an
individual error can be positive or negative (if a point is above or below the line). There are several
error measures that one can use for this (a short code sketch after the list below illustrates them), such as:
• Mean Absolute Error (MAE);
• Mean Squared Error (MSE);
• Root Mean Squared Error (RMSE).
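As a minimal illustration of these three measures (the population/sales values below are invented, not the exact figures from Figure 3.1), here is a short Python sketch:

```python
# Illustrative sketch: computing MAE, MSE and RMSE for a candidate line y = a + b*x.
# The population/sales values below are invented for illustration.
import math

populations = [0.4, 0.6, 0.8, 1.0, 1.2]   # x values, in millions
sales = [1100, 1600, 2100, 2500, 3100]    # observed y values, in thousands of euros

a, b = 56.3, 2522                         # a candidate intercept and slope

errors = [y - (a + b * x) for x, y in zip(populations, sales)]

mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / len(errors)   # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

print(f"MAE = {mae:.1f}, MSE = {mse:.1f}, RMSE = {rmse:.1f}")
```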
It might be useful to think of a picture like Figure 3.3. The springs “want” to contract. As they
contract, they move the pole. The vertical distance from each hook to the pole represents the “error”
in a regression sense. Eventually the pole settles into a position which corresponds to the best possible
regression line.
In the context of linear regression, the RMSE of a line is the root mean square of the differences between
the data points yi and the values of the line at those points. How do we calculate those differences? See
Figure 3.4.
Figure 3.3. Each regression data point imagined as a fixed hook, attached by a spring to
a movable pole. The springs will try to contract, that is try to minimise the distance from
the pole to the hooks. Image from the LION book intelligent-optimization.org.
[Figure: a line y = f(x) = a + bx together with one data point (xi, yi); the vertical gap Ei between the point and the line at xi is the error.]
Figure 3.4. Calculating the error of a line y = f(x) = a + bx at a point xi. The
diamond-shaped point has coordinates (xi, yi), so the error of the line at that point is
Ei = yi − f(xi).
The y-value of a point (xi, yi) is just yi. The y-value of the line at the point xi is f(xi) = a + bxi.
Therefore the error is |yi − (a + bxi)|, i.e. the absolute difference between the observed yi value and the
corresponding f(xi) = a + bxi value.
We can then calculate the RMSE as such:
r Pn
(a + bxi ))2
i=1 (yi
RMSE =
n
where n is the number of data points. Note that the errors are squared so that positive and negative
differences do not cancel each other out.
Smaller RMSE values imply a good fit between the line and the data, i.e. the given line models or
describes well the relationship between x and y.
Exercise 3.1. Write out the formula for the error at a single point (xi , yi ). Then write out the formula
for the sum of squared errors. Then write out the formula for the mean square error. Then write out
the formula for the root mean square error. This should help you see how each part of the name root
mean square error corresponds to a “layer” in the RMSE formula.
How can we actually find the values for a and b which give the very best possible line? There is a set of
equations which find the best (optimal) a and b values for given data, that is, the a and b values which
minimise the RMSE for the line y = a + bx:

b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}, \qquad a = \bar{y} - b\bar{x}
where ȳ and x̄ are the average values of all y and x values, respectively.
These equations allow us to find the best-fitting straight line to model the relationship between any two
variables. This line is called the line of best fit.
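As a sketch of how these equations translate into code (the data values below are invented for illustration, not taken from Figure 3.1), the following Python function computes b and a directly from the formulas above:

```python
# Minimal sketch: least squares coefficients a and b from the formulas above.
def least_squares(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * (sum_x / n)                               # intercept: a = y-bar - b * x-bar
    return a, b

# Illustrative usage with invented population (millions) / sales (thousands of euros) values:
xs = [0.4, 0.6, 0.8, 1.0, 1.2]
ys = [1100, 1600, 2100, 2500, 3100]
a, b = least_squares(xs, ys)
print(f"line of best fit: y = {a:.1f} + {b:.1f}x")
```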
Exercise 3.2. Given these equations, and the data in Figure 3.1, write the equation of the line of best
fit.
Once we have fitted a regression line, we can use it to make predictions. To be specific, if we want
to estimate the likely value of y given some value of x, we can do so using a model built with linear
regression.
In the example of sales versus city population, that would allow us to make a prediction about the likely
sales if we opened a new shop in a new city. We can do this just using the chart, as follows. We find
the line of best fit using linear regression. Then we find the x-value for the new city: suppose we are
considering a city where x = 1 million. We find that point on the x-axis, and we go vertically up from
there to hit the regression line, as shown in Figure 3.5. Then we go horizontally from that point to hit
the y-axis. The y-value here — about 2600 — is our prediction for this city. It is in thousands of Euros,
hence our estimate for sales in a new shop in a city with population of 1 million is €2.6M.
Figure 3.5. Linear regression finds the line of best fit. We can find an estimate for sales
in a new shop in a new city of (e.g.) 1 million people.
We can carry out the same process more precisely. Our question is: what is the y-value of the regression
line, when x = 1.0? We would feed in the city population (x) to the equation and get back an estimate
for sales (y). This city has x = 1. The line has the equation y = 56.3 + 2522x. We calculate y by putting
x = 1 in the equation, obtaining y = 56.3 + 2522 × 1 = 2578.3, quite close to the 2600 we estimated
graphically. As always, we must interpret the figure correctly: it means €2,578,300, because the y-values
are in thousands.
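In code, this prediction step is just a matter of evaluating the fitted equation at the new x value; a small sketch using the coefficients quoted above:

```python
# Predicting sales (thousands of euros) from city population (millions),
# using the line of best fit y = 56.3 + 2522x quoted in the text.
a, b = 56.3, 2522

def predict_sales(population_millions):
    return a + b * population_millions

print(predict_sales(1.0))   # 2578.3, i.e. about 2,578,300 euros
```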
2.1. Interpolation and Extrapolation. In either case, this is an example of interpolation: making
predictions inside the range of the existing data. It is the opposite of extrapolation: making predictions
outside the range of existing data points.
We could also carry out extrapolation here. We might be interested in a new city where population
is 1.3 million. That is outside the range of data we have previously observed, but we can just extend
the regression line so that it intersects with our straight line up from x = 1.3. So we can just put
x = 1.3 straight into the equation, getting y = 56.3 + 2522(1.3) = 3334.9, which we interpret as sales of
€3,334,900. See Figure 3.6.
Figure 3.6. Extrapolation means making a prediction for y at an x-value outside the
range of existing x values.
Interpolation is more “reliable” than extrapolation, all else being equal. Extrapolation is more likely to
lead to an incorrect estimate, because with extrapolation we do not have evidence that the existing linear
trend will continue. Later we will see examples where a trend changes over time, making extrapolation
into the future unreliable.
Exercise 3.3. Make a sales prediction for a city with population of 0 people. What is the predicted
sales value? Does it make sense? What does this tell you about interpolation versus extrapolation?
3.1. Interpreting Regression Models. In the example above, we found a simple model for sales
(y) in terms of city population (x): y = a + bx. It allowed us to make predictions, and that was useful.
We can also gain some business insight from it, by interpreting the values of a and b.
Interpreting the coefficient, b: suppose we are thinking about our shop in the city of population 0.9
million. Suppose we hear that this city is growing fast. b tells us how fast the sales might change in
response to the change in population. To be specific: if x increased by 1 unit (from 0.9 to 1.9), then y
would increase by b. That is what b (the slope) means. In the current example, growing from 0.9 to 1.9
is not realistic, so we might instead consider: what if it grew from 0.9 to 1.0? That is an increase of 0.1
in x, so y would increase by 0.1b.
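For example, using the slope of the line fitted earlier, b = 2522 (thousands of euros of sales per million people), an increase of 0.1 million in population corresponds to an estimated increase in sales of 0.1 × 2522 = 252.2, i.e. roughly €252,200.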
Interpreting the intercept, a: let us pretend there was a city of population 0. That would mean
x = 0. That would imply y = a. In other words, a is the value that we might expect for sales if population
was zero. That is not a very realistic situation; maybe we can imagine a shop in the middle of nowhere?
Now is a good time to consider the principle illustrated in Figure 3.7: extrapolating too far can give you
nonsensical results. A linear model should only be trusted for interpolation or for extrapolation “near”
the original data.
Figure 3.7. Extrapolating too far backwards gives nonsensical results. From
smbc-comics.com.
Exercise 3.4. Based on the graphic in Figure 3.5, give a visual estimate for the correlation between
population and sales. Now calculate that correlation, using the equation given in Section 6.2. How close
was your estimate to the actual measured correlation value?
Exercise 3.5. Given the data shown in Figure 3.1, and the equations for the a and b least squares
coefficients, derive a linear model to predict sales (in thousands of Euros), given population (in millions).
Exercise 3.6. Calculate the RMSE for the model you derived. Compare that value to the distances
between the datapoints and the model line, as seen in Figure 3.5.
Exercise 3.7. Suppose you have data on sales versus shop size (in square metres). Provide a verbal
interpretation of what a and b would mean.
Note that the model we derived allows us to predict what the value of sales is, for a given population
size. Unlike correlation, this is not a symmetric technique: we specifically chose x to be the population
size, and y to be the corresponding sales, and a model¹ y = a + bx is made specifically to predict the
value of y.
In statistics, the variable we are trying to model (and thus predict) is usually represented with the letter
y, and can have several names; the most common are:
• dependent variable;
• response variable.
Dependent on what? Response to what? To the other variable (or, more generally, variables; though we
will not cover the topic of multiple regression). The variable we are using to predict sales is population
size; it can thus be referred to as:
• predictor variable;
• independent variable;
• explanatory variable.
It is typically represented with the letter x.

¹We could easily change this model to predict values of x: x = (y − a)/b. But we would also need to recalculate the RMSE.
5.1. Testing Goodness of Fit. Note that the line of best fit is not necessarily a good fit. Even
if there is no relationship between the variables, some line will still be found using the above methods.
Probably it will not fit the data well: it will miss the data points, and the RMSE will be large. So it
would be good to have some measure of how good a fit it is. We could use RMSE for this purpose, but
the disadvantage is that RMSE depends on the scale of the data, that is, on how large the y values
are overall.
For example, suppose you are doing a linear regression where the dependent variable is a person’s income,
and you are studying a sample of people on typical wages. You might find, say, that RMSE is 5000.
Now suppose you have a sample of professional footballers. You might find that RMSE is 12000 on
this sample. But because the footballers’ wages are larger to start with, you expect RMSE to be larger
(Exercise: think about this, and make a plot to help if needed), but it is hard to be sure how much
larger. You cannot compare the two values for RMSE. You cannot say which regression has found a
better fit.
Another way to measure how well the line fits the data is to use the coefficient of determination R²,
which is based on Linear Correlation. To be specific: we measure the correlation r between the dependent
values yi and the values predicted by the linear regression, i.e. y′ = a + bx. We regard these two things
as a paired data-set (y, y′), so we can calculate r and, from that, R², which is just defined as r². Higher
values for R² indicate a better fit, i.e. the linear model describes the relationship between x and
y well. Note carefully that this is not the same as measuring the correlation between the original variables x
and y.
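A minimal Python sketch of this calculation, assuming a fitted line with coefficients a and b and using invented data values:

```python
# Sketch: R^2 as the squared correlation between observed y values and the
# values y' predicted by the fitted line. Data and coefficients are illustrative.
import statistics

xs = [0.4, 0.6, 0.8, 1.0, 1.2]
ys = [1100, 1600, 2100, 2500, 3100]
a, b = 56.3, 2522                        # assumed fitted coefficients

y_pred = [a + b * x for x in xs]         # y' values from the regression line
r = statistics.correlation(ys, y_pred)   # Pearson correlation (Python 3.10+)
r_squared = r ** 2

print(f"R^2 = {r_squared:.3f}")
```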
The advantage of using R² in this way is that it does not depend on the scale of the data. An R² value
of 0.85 always indicates that the linear regression is giving a very good fit, whether we are talking about
millionaires or people on typical wages. An R² value of 0.05 indicates a pretty poor fit; in other words,
it indicates that linear regression is not succeeding in “modelling” the data well.
Note that RMSE is still a useful measure. In many business scenarios, you might be interested in the
mean error of your predictions; for example, if you want to predict the price of an asset, R² will tell you
if your model is adequate, but RMSE will give you an idea of how good or bad that prediction will be.
If you have a tight budget, that might be vital information.
So when testing how good a linear model is, R² gives you insight into how appropriate the model is
(with the given data), whereas RMSE gives you insight into the expected error of that model.
Exercise 3.8. Calculate the R² for the model you derived in the previous sub-section. Interpret this
value; is your model a good fit for the data?
With the model we have derived, we can predict the sales in a new city with any population (although
better predictions will be achieved when doing interpolation). We can estimate how good that model
is by comparing the estimated values with the actual datapoints (0.4, 0.5, etc.) that we used to build
the linear model, and then calculating the RMSE or R² between the actual observed y values and the
estimated y′ values.
However, one major problem in analytics methods is the difference between in-sample and out-of-sample
data. In-sample data is the data which we use to train the model (as seen in Figure 3.1). Out-of-sample
data, on the other hand, is data on which the model will be run later. The problem is that sometimes a
model will perform well (low RMSE or high R²) on the in-sample data, but badly on the out-of-sample
data.
One reason for this to occur is that the model is just not able to generalise well. Such a model is a bit
like a sports pundit who is always able to provide a good explanation for what just happened in a game,
but is never able to predict what will happen in an upcoming game.
The second reason is that the data on which the model will be used might not be similar to that on
which the model was trained. For example, the training data seen in Figure 3.1 might come from a
Western country, where prices are relatively high. So if the model is used to make predictions for cities
in a third world country, chances are that its predictions of sales figures will be wrong.
6.1. Training and Testing Error. If we have a model that performs well on in-sample data and
then makes bad predictions in practice, we may lose a lot of money! We mitigate this risk in an obvious
way: we only use some of the available data for training. We keep some of it back, “unseen” by the
model, and we test whether the model makes good predictions on it.
So in essence, we are dividing the available data into two sets: Training Data and Testing Data. We
use only the training data to create our model: we choose the model that gives us a low training error
(RMSE) or high accuracy (R²). But when estimating how well that model will do on unseen data, we
measure its test error (or test accuracy), that is, its RMSE or R² on the test data.
How many samples to use for training versus testing? It depends on the modelling technique. As a rule
of thumb, 66% of the data might be used for training, and 33% for testing (or other variants: 75/25,
80/20, etc).
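A rough sketch of the procedure (the data values and the roughly 70/30 split below are assumptions for illustration): fit the line on the training portion only, then compute the error separately on the held-out test portion.

```python
# Sketch: train/test split with a simple least squares fit.
# Data values and the 70/30 split are illustrative assumptions.
import math

xs = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
ys = [1050, 1300, 1600, 1900, 2100, 2350, 2600, 2800, 3100, 3300]

split = int(0.7 * len(xs))                     # first ~70% for training, rest for testing
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

# Fit a and b on the training data only, using the least squares formulas from earlier.
n = len(x_train)
sx, sy = sum(x_train), sum(y_train)
sxy = sum(x * y for x, y in zip(x_train, y_train))
sx2 = sum(x * x for x in x_train)
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * sx / n

def rmse(x_vals, y_vals):
    errs = [y - (a + b * x) for x, y in zip(x_vals, y_vals)]
    return math.sqrt(sum(e ** 2 for e in errs) / len(errs))

print("train RMSE:", round(rmse(x_train, y_train), 1))
print("test RMSE: ", round(rmse(x_test, y_test), 1))
```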
For example, Figure 3.8 shows the same data as before, but with four additional cities. In this case, we
decided to use the first 10 cities as our training set, and the last four as our test set. So we built a model
as before, using only the blue (training) points, and calculated the training R² and RMSE. Then, using that same
model built using only the first ten observations, we calculated the R² and RMSE values associated with
the test samples. This gives us an idea of how well or badly our model will predict unseen data, that is,
data not used to build the model.
Figure 3.8. Further data on the sales of a chain of shops and the populations of the
cities where they are located, divided into train (blue) and test (red) sets.
Note that we cannot always do this. Sometimes, our datasets are very small, and further breaking them
down into train and test data would leave very few samples to build a model. Ideally, we have many
data samples (thousands of observations), so that we create a model using a large training set, and test
its accuracy on unseen data using a realistic test set.
7.1. Least Squares in Excel. We can create linear regression models in Excel using the LINEST
function: see Figure 3.9.
Alternatively, we can start with a scatter plot of the data. Right-click a data point, choose to add
a trendline, and choose the linear trendline type. Then, under Options, check Display equation on
chart. That will show an equation with the values of b and a (yes, in this order) which Excel has
calculated using the above equations.
Figure 3.9. Calculating a regression in Excel. The x and y data are in two columns. The
linear regression will return the two values which specify a line: the slope and the constant
(i.e. b and a, in our terminology). Therefore you must select the 2x1 rectangle of cells
as shown to tell Excel that you expect two results. Enter the formula =LINEST(C2:C11,
B2:B11). Note the order of B and C here: we put the dependent variable first. Type
Ctrl+Shift+Enter to execute. The rectangle of cells is filled with the results, as shown.
The two cells give b = 2522 and a = 56.3, in this order, which specifies the line of best fit.
The spreadsheet also shows how to calculate the values of a and b using the least squares
equations. Finally, the train and test RMSE and R² values are also calculated.
Exercise 3.9. Given the x-values in column B, and y-values in column C (as in Figure 3.9), and given
the equation of the line f (x) = 56.3 + 2522x as shown, how can you calculate the RMSE? Hints: start
by calculating a new column (say G) with the values of the line at the x-points, in other words put
the formula for f (xi ) into a new column. Then calculate the line’s errors at those points in another
new column, by subtracting the line’s value from the correct value. Then use the Excel functions SQRT,
SUMSQ, and COUNTA (look up Excel help to find out what they do).
Regression can be carried out on multiple independent variables. In this case, it is usually referred to as
multiple regression. Instead of using a model like y = a + bx, we use a model like y = a + b1 x1 + b2 x2 ,
supposing there are 2 independent variables, x1 and x2 . There is always only one dependent variable, y.
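As a brief sketch of what fitting such a model can look like in code (all data values and the second variable here are invented for illustration), numpy's least squares routine handles several predictors at once:

```python
# Sketch: multiple regression y = a + b1*x1 + b2*x2 using numpy's least squares.
# All data values below are invented for illustration.
import numpy as np

x1 = np.array([0.4, 0.6, 0.8, 1.0, 1.2])      # e.g. city population (millions)
x2 = np.array([150, 200, 180, 260, 300])      # e.g. shop size (square metres)
y = np.array([1100, 1700, 2000, 2700, 3200])  # sales (thousands of euros)

# Design matrix with a column of ones so the model includes the intercept a.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coeffs
print(f"y = {a:.1f} + {b1:.1f}*x1 + {b2:.1f}*x2")
```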
Sometimes a linear model is unsuitable for the data. For example, what if we plot one independent
variable against a dependent variable and instead of seeing an approximately linear relationship, we see
a relationship that looks logarithmic (like that in Figure 3.10)?
We can see a “plateau” effect, which makes sense in context. If our advertising spend is currently low,
we can gain a lot by increasing it. But if our advertising spend is already high, adding a lot more does
not seem worth it. Clearly, we could fit a straight line to this data, but it would not fit the data well,
particularly the initial strong growth of sales with small advertising spending.
60000"
50000"
40000"
Sales&
30000"
20000"
10000"
0"
0" 5000" 10000" 15000" 20000" 25000" 30000"
Adver+sing&spend&
Figure 3.10. Sales versus advertising spend. Clearly, spending some money on advertising
is a good idea, but the effect seems to plateau. This is a logarithmic relationship.
In this case, a good approach is to carry out a variable transformation, to make the variable “look linear”.
In the current case, because the relationship “looks” logarithmic, we carry out a logarithmic transform. Let
us always use log2 (log to the base 2) for ease of interpretation (see below); it does not really matter
which base we use so long as we are consistent. We define a new variable x′ = log2(x), where x was
the original variable and x′ is the new one. We replace x with x′. Then we will see that x′ is in an
approximately linear relationship with y. Then we can run linear regression as before.
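A minimal sketch of this transform-then-fit idea (the advertising/sales values are invented; in practice you would use the real figures behind Figure 3.10):

```python
# Sketch: log2-transform the x variable, then fit an ordinary straight line to (x', y).
# The advertising/sales values below are invented for illustration.
import math

advertising = [1000, 2000, 4000, 8000, 16000, 32000]   # euros spent
sales = [10000, 18000, 27000, 35000, 44000, 52000]     # euros of sales

x_prime = [math.log2(x) for x in advertising]           # x' = log2(x)

# Ordinary least squares on (x', y), exactly as before.
n = len(x_prime)
sx, sy = sum(x_prime), sum(sales)
sxy = sum(xp * y for xp, y in zip(x_prime, sales))
sx2 = sum(xp * xp for xp in x_prime)
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * sx / n

# To predict sales for a new advertising spend, transform the input first.
new_spend = 10000
prediction = a + b * math.log2(new_spend)
print(f"y = {a:.1f} + {b:.1f} * log2(x); predicted sales at {new_spend}: {prediction:.0f}")
```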
We have to be careful when interpreting and using the resulting model, though. Remember that x′ is in
different units from the original x.
If the Advertising variable has a coefficient b in the model, then we cannot say “for every 1 euro increase
in Advertising, we see a b euro increase in Sales”. Instead, you must interpret as follows: “for every 1
unit increase in log2 (Advertising), we see a b euro increase in Sales”. But what does a 1 unit increase
in log2 (Advertising) mean? This is a crucial point. If log2 (Advertising) increases by 1, that means
Advertising itself has doubled.
If we want to predict the value of y (Sales) for some given value of x (Advertising), clearly we must
calculate x′ = log2(Advertising) before putting that into our linear model. We do not put in x itself.
Above we used a logarithmic transform to deal with logarithmic data. We can also do the opposite:
if there seems to be an exponential relationship (an “accelerating” curve, in contrast to the “slowing”
curve above), then we can transform x′ = e^x.
It is quite common to receive a data file with some missing data. Correlation still works fine: it acts as
if the whole row was deleted. However, regression does not. We can address this by deleting the row
manually.
Sometimes missing data is not shown as a blank value in datafiles. It is sometimes shown as “na”, “NA”,
“n/a” or similar, which stands for “not available”, or as “?”. In these cases we will have to filter the
data, to find those values and remove the samples containing them.
There are even cases where missing data is encoded by a special numerical value. E.g. a record of student
grades (percentages) might indicate a missing grade using the value −1. Obviously, if we do not notice
that and we calculate a mean, a correlation, or a regression, we will get misleading results. When we
receive a data file, we have to check these issues carefully before carrying out our analysis.
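As an illustrative sketch of this kind of clean-up (using the pandas library; the file name, column name and the −1 sentinel are assumptions, not part of the original example):

```python
# Sketch: filtering out missing values before running a regression.
# File name, column names and the -1 sentinel are illustrative assumptions.
import pandas as pd

# Treat common "not available" strings as missing while reading the file.
df = pd.read_csv("grades.csv", na_values=["na", "NA", "n/a", "?"])

# Also treat a special numeric code (here -1) as missing.
df["grade"] = df["grade"].replace(-1, pd.NA)

# Drop the rows with a missing grade before computing means, correlations or regressions.
df = df.dropna(subset=["grade"])
print(len(df), "complete rows remain")
```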