Interpreting Data Course Notes 365 Data Science
Interpreting Data
365 DATA SCIENCE 2
Table of Contents
ABSTRACT
1. Interpreting Data
1.1. Correlation analysis
1.1.1. Correlation coefficient
1.1.2. Correlation and causation
1.2. Simple linear regression
1.2.1. R-squared
1.3. Forecasting
1.3.1. Forecast errors
1.4. Statistical tests
1.4.1. Hypothesis testing
1.4.2. P-value
1.4.3. Statistical significance
1.5. Classification
1.5.1. Accuracy
1.5.2. Recall and precision
ABSTRACT
Now that you have built a solid theoretical foundation and understanding of data
literacy qualities, real-life data applications, data terminology, storage methods and
quality assessment techniques, in the final section of the data literacy course notes with,
performance of any asset on the financial market, predict consumer behavior and
Leaning on the more technical side, these course notes and their supplementary
classification
1. Interpreting Data
can be statistics, coefficients, probabilities, errors, etc., which can provide different
mean, to know what kind of conclusions can be drawn and how to apply them for
Data interpretation requires domain expertise, but also curiosity to ask the right
questions. When the results are not in line with one’s expectations, these should be
met with a sufficient level of doubt and examined in further depth. Such sense of
1) Correlation
2) Linear regression
3) Forecasting
4) Statistical tests
5) Classification
scatter plot:
When a correlation exists, you should be able to draw a straight line (called
Positive correlation (upward sloping regression line) - both variables move in the
well.
[Figure: scatter plots plotted against Daylight (hours); one panel is labeled Rainfall (days)]
The more tightly the plot forms a line rising from left to right, the stronger the
correlation.
“Perfect” correlation - all the points would lie on the regression line itself
[Figure: scatter plot; axis label: Temperature low (°C)]
Lack of correlation - If a line does not fit, i.e. if the dots are located far away from the
line, it means that the variables are not correlated. In that case, the line is relatively
flat.
[Figure: Correlation between daylight and rainfall amount; x-axis: Daylight (hours), y-axis: Rainfall (mm)]
Although the visual examination of a scatter plot can provide some initial clues
about the correlation between two variables, relying solely on them may lead to an
When the value of one variable increases/decreases, the value of the other one
The higher the absolute value of the correlation coefficient, the stronger the
correlation
variables.
However, this does not necessarily mean the variables are not related at all. They
Correlations in the real world rarely return coefficients of exactly +1.0, –1.0, or 0.
A rule of thumb for the interpretation of the absolute value of correlation coefficients:
When applying this rule of thumb, always bear in mind that the definition of a weak,
moderate, or strong correlation may vary from one domain to the next.
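As a rough illustration of how a correlation coefficient is obtained, the sketch below computes the Pearson coefficient by hand. It assumes the Pearson coefficient is the one meant here; the function name `pearson_r` and the daylight and temperature values are invented for illustration and are not taken from the course data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, invented for illustration only
daylight_hours = [8.0, 10.0, 12.0, 14.0, 16.0]
temperature_c = [5.0, 9.0, 14.0, 16.0, 21.0]

r = pearson_r(daylight_hours, temperature_c)
print(f"r = {r:.2f}")
```

The result always lies between -1 and +1; its sign gives the direction of the relationship and its absolute value the strength.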
significant the correlation is, however, it never provides sufficient evidence to claim a
causal link between variables. This applies to both the existence or the direction of a
A strong correlation between two variables A and B can be due to various
scenarios:
further investigation.
enough to make quality decisions. A good analyst or a wise decision maker will prefer
to strive for an explanation and always bear in mind the old statistical adage:
With simple linear regression, it is possible to predict the value of the dependent
[Figure: scatter plot with regression line Y = 1.67X - 6.64; x-axis: Daylight (hours)]
Regression equation: Y = a + bX
of Y
N.B. The higher the absolute value of b, the steeper the regression curve
The sign of b indicates the direction of the relationship between the Y and X
increase of Y
decrease of Y
Notice that a prediction using simple linear regression does not prove any
causality. The coefficient b, no matter its absolute value, says nothing about a causal
Summary: The objective of the regression is to plot the one line that best
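To make the regression equation Y = a + bX concrete, here is a minimal sketch of an ordinary least-squares fit, assuming that is the fitting method intended. The helper names `fit_line` and `predict` are hypothetical; the coefficients a = -6.64 and b = 1.67 are the ones shown in the example figure.

```python
def fit_line(x, y):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope: change in Y per unit change in X
    a = mean_y - b * mean_x  # intercept: value of Y when X = 0
    return a, b

def predict(a, b, x_value):
    """Predict Y for a given X using Y = a + bX."""
    return a + b * x_value

# Coefficients from the example equation Y = 1.67X - 6.64
a_notes, b_notes = -6.64, 1.67
predicted = predict(a_notes, b_notes, 12.0)
print(f"Predicted Y at X = 12: {predicted:.2f}")
```

A positive b, as here, means Y increases as X increases. The prediction itself says nothing about causality.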
1.2.1. R-squared
After running a linear regression analysis, you need to examine how well the
model fits the data, i.e., determine if the regression equation does a good job
The regression model fits the data well when the differences between the
observations and the predicted values are relatively small. If these differences are
too large, or if the model is biased, you cannot trust the results.
inherent variability.
• R-squared = 0 => the model does not explain any of the variance in the
make predictions
dependent variable. Its predictions perfectly fit the data, as all the
Example:
[Figure: scatter plot with regression line Y = 1.67X - 6.64; x-axis: Daylight (hours)]
R-squared = 0.72
=> the number of daylight hours accounts for about 72% of the variance of the temperature
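The calculation behind R-squared can be sketched as one minus the ratio of unexplained to total variation. The function name `r_squared` and the observations below are hypothetical; only the equation Y = 1.67X - 6.64 comes from the example.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: variation the model fails to explain
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: variation around the mean
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observations, invented for illustration only
daylight_hours = [8.0, 10.0, 12.0, 14.0, 16.0]
temperature_c = [5.0, 11.0, 12.0, 18.0, 19.0]

# Predictions from the example equation Y = 1.67X - 6.64
predictions = [1.67 * x - 6.64 for x in daylight_hours]
print(f"R-squared = {r_squared(temperature_c, predictions):.2f}")
```

Predictions that match every observation give 1, while a model that only ever predicts the mean gives 0.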
What is a good R-squared value? At what level can you trust a model?
• Some people or textbooks claim that 0.70 is such a threshold, i.e., that if
• Feel free to use this rule of thumb. At the same time, you should be aware
of its caveats.
• Indeed, the properties of R-squared are not as clear-cut as one may think
A higher R-squared indicates a better fit for the model, producing more accurate
predictions
*A model that explains 70% of the variance is likely to be much better than one
that explains 30% of the variance. However, such a conclusion is not necessarily correct.
• The sample size: The larger the number of observations, the lower the R-squared
• The granularity of the data: Models based on case-level data have lower
country data)
• The type of data employed in the model: When the variables are
continuous data
• The field of research: studies that aim at explaining human behavior tend
1.3. Forecasting
Forecast accuracy – the closeness of the forecasted value to the actual value.
For a business decision maker, the key question is how to determine the
accuracy of forecasts:
“Can you trust the forecast enough to make a decision based on it?”
As the actual value cannot be measured at the time of the forecast, the accuracy
known as errors. These errors can be assessed without knowing anything about a
*The MAE, MAPE and RMSE only measure typical errors. They cannot anticipate black
swan events like financial crises, global pandemics, terrorist attacks, or Brexit.
As the errors associated with these events are not covered by the time series data,
they cannot be modeled. Accordingly, it is impossible to determine in advance how
big the error will be.
1.3.1. Forecast errors
Mean Absolute Error (MAE) - the absolute value of the difference between the
• With the MAE, we can get an idea about how large the error from the
• The main problem with it is that it can be difficult to anticipate the relative
size of the error. How can we tell a big error from a small error?
That depends on the underlying quantities and their units. For example, if your
monthly average sales volume is 10,000 units and the MAE of your forecast for the next
month is 100, then this is an amazing forecasting accuracy. However, if sales volume is
Mean Absolute Percentage Error (MAPE) - the sum of the individual absolute errors
scales.
value and volume in US$, EUR, units, liters or gallons - even though these
• Thanks to this property, the MAPE is one of the most used indicators to
• One key problem with the MAPE is that it may understate the influence of
big, but rare, errors. Consequently, the error is smaller than it should be,
Root Mean Square Error (RMSE) - the square root of the average squared error.
RMSE has the key advantage of giving more importance to the most significant errors.
Accordingly, one big error is enough to lead to a higher RMSE. The decision maker is
not as easily pointed in the wrong direction as with the MAE or MAPE.
• Best practice is to compare the MAE and RMSE to determine whether the
forecast contains large errors. The smaller the difference between RMSE
and MAE, the more consistent the error size, and the more reliable the
value.
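The three error measures can be sketched as follows. This assumes the common textbook definitions, with the MAPE taken as the mean of the absolute errors divided by the actual values, which may differ in detail from the formula used in the course. The demand figures are invented for illustration.

```python
import math

def mae(actual, forecast):
    """Mean Absolute Error, in the same unit as the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Assumes no actual value is zero (division by the actual value).
    """
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return sum(ratios) / len(ratios) * 100

def rmse(actual, forecast):
    """Root Mean Square Error: penalizes large errors more heavily."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical demand data with one large miss in the last period
actual = [100, 100, 100, 100]
forecast = [90, 110, 100, 60]

print(f"MAE  = {mae(actual, forecast):.2f}")
print(f"MAPE = {mape(actual, forecast):.2f}%")
print(f"RMSE = {rmse(actual, forecast):.2f}")
```

In this made-up series the RMSE comes out clearly above the MAE, which reflects the single large error in the last period.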
What is a “good” or “acceptable” value of the MAE, MAPE or RMSE depends on:
not vary so much over time (e.g., electricity or water distribution), demand
forecasting model may therefore yield a very low MAPE, possibly under
5%.
• The industry: In volatile industries (e.g., machine building, oil & gas,
the FMCG or travel industries), sales volumes vary significantly over time
MAPE of a model could be much higher than 5%, and yet be useful for
continental or national level) are generally more accurate than for smaller
• The time frame: Longer period (e.g., monthly) forecasts usually yield
With three well-established indicators available, one cannot conclude that one is
better than the other. Each indicator can help you avoid some shortcomings but will
be prone to others. Only experimentation with all three indicators can tell you which
Hypothesis testing is a key tool in inferential statistics and is used in various domains,
such as social sciences, medicine, and market research. The purpose of hypothesis testing is
If you want to compare the satisfaction levels of male and female employees in your
Sample - the specific group that data are collected from. Its size is always smaller
If you randomly select 189 men and 193 women among these employees to carry out
collected, for example because the process would be too lengthy or too expensive.
In these cases, researchers need to develop specific experiment designs, and rely on
Modern statistical software is there to calculate various relevant statistics, test values,
probabilities, etc. All you need to do is learn to interpret the most important ones:
- the null hypothesis
- the p-value
- statistical significance.
A hypothesis resembles a theory in science. But it is “less” than that, because it first
propose two opposite, mutually exclusive, hypotheses so that only one can be right:
- A blue conversion button on the website results in the same CTR as a red
button
In statistics terms, the null hypothesis is therefore usually stated as the equality
- The mean CTRs of the red and blue conversion buttons are the same.
OR: the difference of the mean CTRs of the red and the blue conversion
It is called “null” hypothesis, because it is usually the hypothesis that we want to nullify
or to disprove.
The alternative hypothesis (H1) is the one that you want to investigate,
follows:
- A blue conversion button on the website will lead to a different CTR than
the one with a red button
In this case, the objective is to determine whether the population parameter is
generally distinct or differs in either direction from the hypothesized value. It is called
differs from the hypothesized value in a specific direction, i.e. is smaller or greater than
- Example: The difference of the mean CTRs of the blue and the red
- Here we only care about the blue button yielding a higher CTR than the red
button
We can also be even more aggressive in our statement and quantify that
- The difference of the mean CTRs of the red and the blue conversion button
- That would be equivalent to stating that the mean CTR of the blue button is
N.B. You do not have to specify the alternative hypothesis. Given that the two
hypotheses are opposites and mutually exclusive, only one can, and will, be true. For
the purpose of statistical testing, it is enough to reject the null hypothesis. It is therefore
1.4.2. P-value
The greater the dissimilarity between these patterns, the less likely it is that the
Examples:
If p-value = 0.0326 => if the null hypothesis were true, there would be a 0.0326 (or 3.26%)
chance of observing results at least this extreme purely by chance.
If p-value = 0.9429 => if the null hypothesis were true, results at least this extreme would occur by chance 94.29% of the time
The smaller the p-value, the stronger the evidence that you should reject the
null hypothesis
When you see a report with the results of statistical tests, look out for the p-value.
Normally, the closer to 0.000, the better – depending, of course, on the hypotheses
experiment
• If the p-value falls below the significance level, the result of the test is
statistically significant.
• Unlike the p-value, the alpha does not depend on the underlying
• The alpha will often depend on the scientific domain the research is being
carried out in
If p-value < alpha => you can reject the null hypothesis at the level alpha.
If the p-value is lower than the significance level alpha, which should be set in
advance, then we can conclude that the results are strong enough to reject the old
notion (the null hypothesis) in favor of a new one (the alternative hypothesis).
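The decision rule can be illustrated with the button example. The sketch below uses a two-sided two-proportion z-test, which is one possible test for comparing two click-through rates and is an assumption here, not necessarily the test used in the course. The click and view counts and the function name are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for the difference between two proportions.

    H0: both buttons have the same click-through rate.
    Returns the z statistic and the p-value.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis of equal CTRs
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_error
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

ALPHA = 0.05  # significance level, set before looking at the data

# Hypothetical counts: blue button 120 clicks / 1000 views, red 80 / 1000
z, p_value = two_proportion_z_test(120, 1000, 80, 1000)

if p_value < ALPHA:
    print(f"p = {p_value:.4f} < {ALPHA}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {ALPHA}: fail to reject the null hypothesis")
```

The comparison with alpha is the only step that decides significance; the test statistic itself depends on which test suits the data.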
Example:
If p-value = 0.0321, alpha = 0.05 => “Based on the results, we can reject the null
If p-value = 0.1474, alpha = 0.05 => “Based on the results, we fail to reject the null
“important”. That will depend on the real-world relevance of that result, which the
1.5. Classification
A classification model can only achieve two results: Either the prediction is
correct (i.e., the observation was placed in the right category), or it is incorrect.
classification model, especially when there are only two available categories or labels.
Out-of-sample validation - withholding some of the sample data used for the
training of the model. Once the model is ready, it is validated with the data initially set
Example:
Imagine that we trained a model for a direct marketing campaign. We used the data
We set aside 100 customer records, which constitute our validation data.
• For these 100 customers, we use the model to predict their responses.
• As these customers also receive the marketing offer, we also get to know
who responded favorably, and who did not. These responses constitute
• You can compare the predicted with the actual classes, and find out which
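The comparison of predicted and actual classes can be sketched as a simple tally of the four possible outcomes. The function name `tally_outcomes` and the six customer records are hypothetical and only illustrate the mechanics.

```python
def tally_outcomes(actual, predicted, positive="Yes"):
    """Count the four possible outcomes of a two-class prediction."""
    counts = {"true_positive": 0, "false_positive": 0,
              "false_negative": 0, "true_negative": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted "Yes": correct only if the actual class is "Yes"
            key = "true_positive" if a == positive else "false_positive"
        else:
            # Predicted "No": correct only if the actual class is "No"
            key = "false_negative" if a == positive else "true_negative"
        counts[key] += 1
    return counts

# Hypothetical responses of six customers, invented for illustration
actual = ["Yes", "No", "No", "Yes", "No", "Yes"]
predicted = ["Yes", "No", "Yes", "No", "No", "Yes"]

counts = tally_outcomes(actual, predicted)
for outcome, count in counts.items():
    print(f"{outcome}: {count}")
```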
Confusion matrix - shows the actual and predicted classes of a classification problem
(correct and incorrect matches). The rows represent the occurrences in the actual
class, while the columns represent the occurrences in the predicted class.
n = 100          Predicted: Yes   Predicted: No   Total
Actual: Yes            10               5           15
Actual: No             15              70           85
Total                  25              75          100
Out of the 100 customers who received the offer, the model predicted that 25
customers would accept it (i.e., 25 times “yes”) and that 75 customers would reject it
After running the campaign, it turned out that 15 customers responded favorably
Based on the confusion matrix, one can estimate the quality of a classification model
by calculating its:
- Accuracy
- Recall
- Precision
1.5.1. Accuracy
The model correctly predicted 10 “Yes” cases and 70 “No” cases =>
Accuracy = (10 + 70) / 100 = 80%
of classification models.
• It comes with a major flaw, which becomes apparent when the classes are
imbalanced.
• Experienced analysts are familiar with this issue, and have at their disposal
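The flaw with imbalanced classes can be illustrated with a deliberately useless model. The sketch below takes the class sizes from the example above (15 actual “Yes”, 85 actual “No”) and assumes a hypothetical model that always predicts “No”.

```python
# Actual classes from the example: 15 "Yes" and 85 "No"
actual = ["Yes"] * 15 + ["No"] * 85

# Hypothetical model that ignores the data and always predicts "No"
always_no = ["No"] * len(actual)

# Share of all predictions that are correct
correct = sum(a == p for a, p in zip(actual, always_no))
accuracy = correct / len(actual)

# Share of the actual "Yes" customers that the model found
found_yes = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, always_no))
share_of_yes_found = found_yes / actual.count("Yes")

print(f"Accuracy: {accuracy:.0%}")
print(f"Share of actual 'Yes' customers found: {share_of_yes_found:.0%}")
```

Such a model scores a higher accuracy than the 80% of the trained model, even though it never identifies a single customer who would accept the offer.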
Recall (also known as sensitivity) is the ability of a classification model to identify all
relevant instances.
Recall = 10 / 15 = 66.67%
• This means that only two-thirds of the positives were identified as positive,
Precision is the ability of a classification model to return only relevant instances (to be
predicted =>
Precision = 10 / 25 = 40%
There are two types of incorrect predictions: false positives and false negatives
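The three quality measures can be recomputed from the four cells of the confusion matrix in the example. The variable names are chosen for illustration; the counts are the ones from the matrix.

```python
# Cells of the confusion matrix from the example (n = 100)
true_positive = 10   # actual Yes, predicted Yes
false_negative = 5   # actual Yes, predicted No
false_positive = 15  # actual No, predicted Yes
true_negative = 70   # actual No, predicted No

total = true_positive + false_negative + false_positive + true_negative

# Accuracy: share of all predictions that are correct
accuracy = (true_positive + true_negative) / total

# Recall: share of actual positives that were predicted as positive
recall = true_positive / (true_positive + false_negative)

# Precision: share of predicted positives that are actually positive
precision = true_positive / (true_positive + false_positive)

print(f"Accuracy:  {accuracy:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"Precision: {precision:.2%}")
```

False negatives lower the recall and false positives lower the precision, so the two measures capture the two types of incorrect predictions separately.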
Copyright 2022 365 Data Science Ltd. Reproduction is forbidden unless authorized. All rights reserved.