

MEME19403 Data Analytics and Visualization


Module Two: Descriptive and Predictive Analytics
Instructor: Dr. Chang Yun Fah

1) Installing Real Statistics Resource Pack


To perform data analytics in Excel, we recommend that users install a few add-ins that enable more advanced analyses. These add-ins are available to download for free from web sources and include, but are not limited to:
 Analysis ToolPak and Analysis ToolPak VBA
 The Real Statistics Resource Pack: contains a variety of supplementary functions and
data analysis tools not provided by Excel. It provides more comprehensive functions than
the Analysis ToolPak and is distributed as RealStats.xlam. It is compatible with
Excel 2010, 2013 and 2016 (Windows). Get the free download from the webpage
http://www.real-statistics.com/free-download/real-statistics-resource-pack/
 Solver Add-ins

We will explain the installation of the RealStats.xlam here.


 Go to the Real Statistics Resource Pack webpage

 Download RealStats.xlam into your folder.



 Open up a blank Excel spreadsheet, click on the File tab and choose Help/Options

 Click Add-Ins and click on the Go button.

 Check the RealStats option and other add-ins you want in the dialog box that appears.
If these options do not appear, click on Browse to search the add-ins from the folder
you kept them.

 Installation is done. RealStats has been installed under the Excel Add-Ins tab. To open
RealStats, click on the Add-Ins tab or just press Ctrl-M.

 The RealStats add-in consists of many additional Excel functions, as follows:



2) Data Cleansing
Once we have identified the problematic entries in our dataset such as outliers, missing values
and errors, we should perform data cleansing before carrying out further analysis. Real
Statistics provides several functions for these purposes. For illustration, we use the data from
the Buying Decision.xlsx file.

2.1) Reformatting a Data Range


First, we use Reformatting a Data Range by Rows to sort the data, remove empty cells, and remove non-numeric cells.

Original data:
Age Income Buy
25  2500   0
35  4200   1
23  1800   1
43  3600   0
36         1
35  3600   0
52  6100   0
53  4700   0
NA  7000   1
28  3100   1

After removing empty cells:
Age Income Buy
25  2500   0
35  4200   1
23  1800   1
43  3600   0
35  3600   0
52  6100   0
53  4700   0
NA  7000   1
28  3100   1

After removing non-numeric cells:
Age Income Buy
25  2500   0
35  4200   1
23  1800   1
43  3600   0
35  3600   0
52  6100   0
53  4700   0
28  3100   1

Sorted by the first column (Age):
Age Income Buy
23  1800   1
25  2500   0
28  3100   1
35  4200   1
35  3600   0
36         1
43  3600   0
52  6100   0

Sorted by column 2 (Income):
Age Income Buy
36         1
23  1800   1
25  2500   0
28  3100   1
43  3600   0
35  3600   0
35  4200   1
53  4700   0

An alternative way to do this is by using Reformatting a Data Range.

Reshape the original data into 3 rows and 11 columns.

[Screenshots: the reshaped output (3 rows by 11 columns) and example outputs of the Randomize, Shuffle and Reverse options.]

2.2) Extracting Columns from a Data Range


During the analysis stage, a user may decide not to use one or more of the collected variables. These variables should therefore be removed from the raw data before carrying out the analysis. This can be done using the Extracting Columns from a Data Range function, as described below.

Age Buy
25 0
35 1
23 1
43 0
1
35 0
52 0
53 0
62 1
28 1

Select only Age and Buy

Original dataset

For example, there are 3 variables in the original dataset, i.e. Age, Income and Buy. By not selecting Income in the dialog box above, we now have only Age and Buy in our new dataset.

2.3) Multiple Imputation


Imputation is an approach that replaces the missing values in a dataset with estimated values. Consider the original dataset, which has 2 missing values: one Age value and one Income value. From the Input Range, we select the original dataset (either including or excluding the variables without missing values). Then we set the number of imputations, in this case 10, meaning that we will compare the performance of 10 different sets of values used to replace the missing values. Finally, we choose the values from Impute 4 as the replacement values for the missing data (23.36613 for Age, and 4734.645 for Income). These values are chosen because that imputation has the highest R-square value.

Age Income Buy


25 2500 0
35 4200 1
23 1800 1
43 3600 0
2800 1
35 3600 0
52 0
53 4700 0
62 7000 1
28 3100 1

Descriptive statistics
        Age       Income    Buy
count   9         9         10
mean    39.55556  3700      0.5
stdev   13.74874  1515.751  0.527046
min     23        1800      0
max     62        7000      1

Multiple Imputation Summary
        Age       Income    Buy
mean    38.2957   3836.927  0.5
stdev   13.58988  1500.123  0.527046
coeff   -0.06715  0.000558  0.934449
s.e.    0.0335    0.000307  0.475256
R-sq    0.388159  size      10

Frequency of Non-Missing Data
        Age     Income  Buy      rows    cells
        9       9       10       8       28
        90.00%  90.00%  100.00%  80.00%  93.33%

Patterns of Missing Data
Age  Income  Buy  freq  %
.    x       x    1     11.11%
x    .       x    1     11.11%
x    x       x    8     88.89%
                  10

Regression Analysis

OVERALL FIT
Multiple R         0.623024
R Square           0.388159
Adjusted R Square  0.213347
Standard Error     0.467456
Observations       10

ANOVA                                                    Alpha 0.05
            df  SS        MS        F         p-value   sig
Regression  2   0.970397  0.485198  2.220437  0.179158  no
Residual    7   1.529603  0.218515
Total       9   2.5

            coeff     std err   t stat    p-value   lower     upper
Intercept   0.934449  0.475256  1.966202  0.089994  -0.18935  2.05825
Age         -0.06715  0.0335    -2.0045   0.085053  -0.14637  0.012064
Income      0.000558  0.000307  1.816604  0.112125  -0.00017  0.001283

Impute 1            Impute 2            Impute 3            Impute 4            Impute 5

Age Income Buy Age Income Buy Age Income Buy Age Income Buy Age
25 2500 0 25 2500 0 25 2500 0 25 2500 0 25
35 4200 1 35 4200 1 35 4200 1 35 4200 1 35
23 1800 1 23 1800 1 23 1800 1 23 1800 1 23
43 3600 0 43 3600 0 43 3600 0 43 3600 0 43
30.80159 2800 1 29.19996 2800 1 29.28112 2800 1 23.36613 2800 1 26.87974
35 3600 0 35 3600 0 35 3600 0 35 3600 0 35
52 5921.274 0 52 4785.536 0 52 5373.001 0 52 4734.646 0 52
53 4700 0 53 4700 0 53 4700 0 53 4700 0 53
62 7000 1 62 7000 1 62 7000 1 62 7000 1 62
28 3100 1 28 3100 1 28 3100 1 28 3100 1 28

mean        38.68016 3922.127 0.5    38.52 3808.554 0.5    38.52811 3867.3 0.5    37.93661 3803.465 0.5    38.28797
stdev       13.25473 1592.366 0.527046    13.36969 1469.715 0.527046    13.36342 1523.849 0.527046    13.9368 1466.039 0.527046    13.56806
coeff       -0.04954 0.000357 1.017118    -0.06897 0.000579 0.953158    -0.06697 0.000533 1.017273    -0.06958 0.000607 0.832456    -0.07262
s.e.        0.039877 0.000332 0.57062    0.032495 0.000296 0.473212    0.037257 0.000327 0.507482    0.027308 0.00026 0.4145    0.031429
R-sq, size  0.187892 10    0.391789 10    0.318317 10    0.481309 10    0.433131

3) Descriptive Statistics and Normality


In the beginning, we used descriptive statistics option and histogram option in Analysis
ToolPak to explore the data properties. We also discussed how to generate the normal
probability plot and boxplot manually to check for normality and outliers.

The Descriptive Statistics and Normality option in RealStats provides similar functionality but with more information for the user. This option generates more descriptive statistics, a boxplot, a normal QQ plot, and the potential outliers and missing values, all by clicking just one button.

For illustration, we use the data from the Buying Decision.xlsx file as follows. There are 10 observations on the customers' age and income level and their buying decision (0 = not buy, 1 = buy).

Then press Ctrl-m and select Descriptive Statistics and Normality option from the dialog box.

Say we want to analyse the customers' properties. In the Input Range, select a few observations from the Age column and click the "Fill" button next to it. This enables the software to select all observations in the Age column. Make sure you check 'column headings included with data'.
For the options, you may select any of the following:
 Descriptive statistics
 Boxplot (graphical method for normality, outliers etc.)
 QQ plot (graphical method for normality and outliers)
 Shapiro-Wilk (analytical test for the normality of the data)
 Outliers and Missing Data (identify missing data, blank or non-numeric cells, and potential
outliers that lie more than 2.5 or 3.0 standard deviations from the mean)
 Grubbs' Test (checks whether an extreme value is an outlier or not).

You can fix the output range to specify the starting cell of the results, or click the 'New' button to display the results in a new worksheet.

3.1) Descriptive Statistics


RealStats provides standard descriptive statistics for one or more continuous variables as listed
below. The output on the left has only one variable (Age) and the output on the right has three
variables.

In the descriptive statistics output, two additional measures of central tendency provided are Geometric Mean and Harmonic Mean, while AAD and MAD are additional measures of dispersion. Their formulas are $\text{AAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - m(x)|$ and $\text{MAD} = \text{median}(|x_i - m(x)|)$, where $m(x)$ can be any of the mean, median or mode. Since the variable Buy is binary data, its geometric mean and harmonic mean cannot be computed. The mode for Income is reported as #N/A because no Income value occurs more than once.
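As a quick cross-check on these two formulas (a Python sketch, not part of RealStats), taking m(x) as the mean for AAD and as the median for MAD reproduces the Age values reported below:

import numpy as np

# Age values from Buying Decision.xlsx as used in this section
age = np.array([23, 25, 28, 35, 35, 36, 43, 52, 53, 100])

aad = np.mean(np.abs(age - age.mean()))        # AAD around the mean   -> 15.2
mad = np.median(np.abs(age - np.median(age)))  # MAD around the median -> 9.0
print(aad, mad)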

In order to check for normality, the user can use the five statistics (mean, median, mode, minimum and maximum) or the skewness and kurtosis. If the values of the mean, median and mode are the same (or at least close to each other), this indicates that the data is symmetric, an indication of (approximate) normality. On the other hand, the data is normal if the skewness is 0 and the kurtosis is 3; otherwise it is not normal. In this example, Age is right skewed (mean > median and mode, and the skewness of 2.05 is positive). Although the variable Buy has skewness 0, we do not examine its normality because it is binary data.

Output with one variable (Age):
                    Age
Mean                43
Standard Error      7.11493
Median              35.5
Mode                35
Standard Deviation  22.49938
Sample Variance     506.2222
Kurtosis            4.943037
Skewness            2.05288
Range               77
Maximum             100
Minimum             23
Sum                 430
Count               10
Geometric Mean      39.12922
Harmonic Mean       36.3019
AAD                 15.2
MAD                 9
IQR                 20

Output with three variables:
                    Age       Income    Buy
Mean                43        3277.778  0.5
Standard Error      7.11493   535.355   0.166667
Median              35.5      3100      0.5
Mode                35        #N/A      0
Standard Deviation  22.49938  1606.065  0.527046
Sample Variance     506.2222  2579444   0.277778
Kurtosis            4.943037  0.163264  -2.57143
Skewness            2.05288   0.202963  0
Range               77        5400      1
Maximum             100       6100      1
Minimum             23        700       0
Sum                 430       29500     5
Count               10        9         10
Geometric Mean      39.12922  2835.604  #NUM!
Harmonic Mean       36.3019   2274.782  #NUM!
AAD                 15.2      1219.753  0.5
MAD                 9         1100      0.5
IQR                 20        1700      1

3.2) Boxplot
There are two options for the boxplot, i.e. with and without outliers. The output on the left is the boxplot statistics without outliers, while the output on the right includes outliers. These outputs are used to draw the boxplot. Instead of the point values for the minimum, Q1, median, Q3 and maximum, the results state the differences between them. For example, the minimum value is 23, and the difference between Q1 and the minimum is 6.75. Hence, $Q_1 - \text{Min} = 6.75 \Rightarrow Q_1 = 6.75 + \text{Min} = 6.75 + 23 = 29.75$. In the boxplot created without outliers, the whiskers always link Q1 and Q3 to the minimum and maximum values, respectively.

If we examine the boxplot, we observe that the plot has a large box (purple colour) and a longer tail on the right (larger y-axis values). Thus, the data is right (positively) skewed. The empty dot in the right plot is the outlier point with value 100. The X sign in the right plot marks the mean value.

Besides checking normality and identifying outliers, the boxplot has many other applications, such as comparing different datasets, identifying trends in the data, etc.

Box Plot (without outliers)
          Age
Min       23
Q1-Min    6.75
Med-Q1    5.75
Q3-Med    14.25
Max-Q3    50.25

Box Plot (with outliers)
          Age
Min       23
Q1-Min    6.75
Med-Q1    5.75
Q3-Med    14.25
Max-Q3    3.25
Min       23
Q1        29.75
Median    35.5
Q3        49.75
Max       53
Mean      43
Grand Min 0

[Box plots of Age (vertical axis 0 to 120): left, without outliers; right, with outliers, where the outlier at 100 appears as a dot and the mean is marked with an X.]

3.3) QQ-Plot (Quantile-Quantile Plot)


There are five possible cases of QQ-plot results. If the resulting points of the QQ plot lie approximately on a straight line, with emphasis on the central values (e.g. the 0.33 and 0.67 cumulative probability points), then we may conclude that the data is normally distributed. If the QQ-plot has a C shape, it indicates that the data is negatively or left skewed. For positively or right skewed data, the QQ-plot shows an inverted C shape. The remaining two types of QQ-plot are a heavy or long-tailed distribution, when there are sharp upward and downward curves at both extremes, and a light or short-tailed distribution, when the plot flattens at the extremes (S shape).

[QQ plots: Expected Cum Prob versus Observed Cum Prob, both axes from 0.0 to 1.0.]

a) Normal (straight line) b) Negative/Left skewed c) Positive/Right skewed




d) Heavy/long-tails distribution e) Light/thinner-tails distribution

The QQ table shows the Age values in ascending order together with the inverse of the normal cumulative distribution in the column "Std Norm", assuming the underlying population is standard normal. For example, $P(Z < 1.64485) = 0.95$, i.e. 95% of the people are younger than $x = 1.64485 \times 22.4994 + 43 = 80$ years old under a normal population. The column "Std Data" gives the z-score values $z = \frac{x - \bar{x}}{s}$, i.e. each observation minus the sample mean, divided by the sample standard deviation. Both the boxplot and the QQ plot indicate that Age has a right-skewed distribution. Besides, the QQ plot shows that there is a potential outlier.
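A short Python sketch (using scipy rather than RealStats) of how the "Std Norm" and "Std Data" columns below can be reproduced: "Std Data" is the z-score of each sorted observation, and "Std Norm" is the standard normal quantile of the plotting position (i - 0.5)/n, which matches the Age values in the table:

import numpy as np
from scipy.stats import norm

age = np.array([25, 35, 23, 43, 36, 35, 52, 53, 100, 28])

x = np.sort(age)
n = len(x)
std_data = (x - x.mean()) / x.std(ddof=1)              # "Std Data": z-scores of the sorted Ages
std_norm = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # "Std Norm": standard normal quantiles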

The QQ-plot for Income has only 9 points as there is one missing value. The QQ-plot for Income follows a fairly straight line, and hence Income is (approximately) normally distributed. No outlier is found in Income.

QQ Plot - Age
Count   10
Mean    43
Std Dev 22.49938

Interval  Data  Std Norm  Std Data
1         23    -1.64485  -0.88891
3         25    -1.03643  -0.80002
5         28    -0.67449  -0.66668
7         35    -0.38532  -0.35557
9         35    -0.12566  -0.35557
11        36    0.125661  -0.31112
13        43    0.38532   0
15        52    0.67449   0.400011
17        53    1.036433  0.444457
19        100   1.644854  2.533403

QQ Plot - Income
Count   9
Mean    3277.778
Std Dev 1606.065

Interval  Data  Std Norm  Std Data
1         700   -1.59322  -1.60503
3         1800  -0.96742  -0.92012
5         2500  -0.58946  -0.48428
7         2800  -0.28222  -0.29748
9         3100  0         -0.11069
11        3600  0.282216  0.200628
13        4200  0.589456  0.574212
15        4700  0.967422  0.885532
17        6100  1.593219  1.757228

3.4) Shapiro-Wilk Test for Normality


From the Shapiro-Wilk test, we observe that Age and Buy are not normally distributed at the 0.05 significance level. The Age result is consistent with the previous normality checks using the boxplot and QQ-plot. The Buy result is expected since it is binary data. The variable Income is normally distributed, with test statistic $W = 0.9954$ and $p = 0.9998 > 0.05$. Take note that the null hypothesis of the Shapiro-Wilk test is that the population is normally distributed. Thus, the data is considered not normally distributed if the p-value is less than the chosen significance level.
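A minimal sketch of the same check with scipy (not the RealStats implementation); the exact W statistic may differ slightly from the output below because of implementation details:

import numpy as np
from scipy.stats import shapiro

age = np.array([25, 35, 23, 43, 36, 35, 52, 53, 100, 28])

w, p = shapiro(age)
print(w, p)        # reject normality when p < alpha (0.05)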

Shapiro-Wilk Test

Age Income Buy


W 0.778463 0.995399 0.655271
p-value 0.007929 0.999782 0.000254
alpha 0.05 0.05 0.05
normal no yes no

3.5) Grubbs’ Test


Grubbs' test (or the ESD test) is a method to identify outliers. For outlier identification, you may also define outliers by the number of standard deviations (Outlier Limit) or fix the number of outliers (# of Outliers). The default # of outliers is 1 when the field is left blank. Say we want to test whether the 3 furthest values (either largest or smallest) are outliers; then you need to key in 3 in the # of Outliers field. The three furthest values, 100, 53 and 52, are displayed. The outputs indicate that only the age of 100 years is a significant outlier, while 53 and 52 years are not outliers.

Grubbs/ESD Test

alpha 0.05

outlier 100 53 52
G 2.533403 1.502015 1.808838
G-crit 2.289954 2.215004 2.126645
sig yes no no

The outlier limit sets the number of standard deviations used to check whether data outside this limit are outliers. Assuming the sample is normally distributed, we know that 1-Normsdist(2.5) = 0.621% of the data should have a z-score larger than 2.5 (and, by symmetry, another 0.621% should have a z-score less than -2.5). Here, 2.5 is used as the default limit for potential outliers.
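A small illustrative sketch of this rule in Python (assumed, not the RealStats code): standardise each observation and flag it when its absolute z-score exceeds the chosen outlier limit.

import numpy as np

age = np.array([25, 35, 23, 43, 36, 35, 52, 53, 100, 28])

z = (age - age.mean()) / age.std(ddof=1)
for limit in (2.0, 3.0):
    flagged = (np.where(np.abs(z) > limit)[0] + 1).tolist()   # 1-based observation numbers
    print(f"OL={limit}: potential outliers at observations {flagged}")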

Select the Outliers and Missing Data option. Say we choose the outlier limit as 2; the output shows that there is 1 potential outlier, which is observation #9. If we change the outlier limit to 3, then no potential outlier is found. This function can also be used to identify the number of missing values in the dataset.

Note that just because a data element is identified as a potential outlier doesn’t mean that it is
wrong or should be eliminated, but it does mean that that data element should be investigated
to see if a typing mistake has been made or some other problem has occurred that will distort
any analyses that are undertaken.

Observation #9 is a potential outlier when the outlier limit (OL) is 2, but not an outlier when OL = 3.



Outliers and Missing Data (outlier limit = 2)
Age
mean        43
stdev       22.49938
# outliers  1
# blank     0
# non-num   0
1   -0.80002
2   -0.35557
3   -0.88891
4   0
5   -0.31112
6   -0.35557
7   0.400011
8   0.444457
9   2.533403 *
10  -0.66668

Outliers and Missing Data (outlier limit = 3)
Age
mean        43
stdev       22.49938
# outliers  0
# blank     0
# non-num   0
1   -0.80002
2   -0.35557
3   -0.88891
4   0
5   -0.31112
6   -0.35557
7   0.400011
8   0.444457
9   2.533403
10  -0.66668

4) Frequency Table
There are several ways to construct a frequency table in MS Excel. The common method is to use a Pivot table. Here, RealStats provides an alternative method to construct the frequency table and plot the histogram.

Select Frequency Table from the Real Statistics dialog box, and you will get the Frequency Table dialog box on the right. First, select the input range from the raw data to create the frequency table. In order to determine the number of bins in the histogram, the user can decide the bin (interval) size or the maximum bin value, or use the default settings. For example, the maximum and minimum ages are 100 and 23 respectively. If we choose the bin size as 10, there will be

$$\text{number of bins} = \frac{\text{max value} - \text{min value}}{\text{bin size}} = \frac{100 - 23}{10} = 7.7, \text{ rounded up to } 8 \text{ bins.}$$
7 ;# 10

Alternatively, the user can fix the maximum bin value at the maximum data value.

The frequency table generated has 3 columns, namely the midpoints of the intervals, the frequency values and the cumulative frequency values. Take note that the chart obtained is actually a bar chart; hence, the user needs to change the gap width between the bars to 0% using Format Data Series.
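For comparison, a hedged Python sketch of the same binning logic (bin size 10 starting at the minimum age of 23), which yields 8 bins for the Age data:

import numpy as np

age = np.array([25, 35, 23, 43, 36, 35, 52, 53, 100, 28])

bin_size = 10
edges = np.arange(age.min(), age.max() + bin_size, bin_size)   # 23, 33, ..., 103
freq, edges = np.histogram(age, bins=edges)
midpoints = (edges[:-1] + edges[1:]) / 2                       # interval midpoints
cum_freq = np.cumsum(freq)                                     # cumulative frequency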

5) Chi-Squares Test of Independence


A set of data on Gender (Male and Female) and Ethnicity (Malay, Chinese and Indian) was collected and stored in ChiSquareTest.xlsx. It is our interest to test whether the factors Gender and Ethnicity are not associated (i.e. independent) using the Chi-Square Test of Independence. The hypotheses are

$H_0$: Gender and Ethnicity are not associated (independent)

$H_1$: Gender and Ethnicity are associated

The data recorded could be in standard format (one factor one column) or in Excel format
(tabulation form). Hence, the user needs to choose the input format accordingly as below.

For the standard format, the observation tabulation will be constructed by RealStats. The expected values will be given, and these observed and expected values will be used to calculate the test statistic as follows:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
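A hedged sketch of the same test in Python; the counts below are hypothetical (they are not taken from ChiSquareTest.xlsx), and scipy's chi2_contingency computes the expected values and the statistic above directly:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: Male, Female; columns: Malay, Chinese, Indian (illustrative counts only)
observed = np.array([[20, 30, 10],
                     [25, 25, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)        # reject H0 (independence) when p < alpha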

6) Correlation
RealStats provides 3 options for correlation measures, namely Pearson's correlation, Spearman's correlation and Kendall's correlation. Unfortunately, the user can only find the correlation value for one pair of variables at a time with this option. For a correlation matrix, we will use another option in RealStats and the Data Analysis tool.

Refer to Buying Decision.xlsx again, with no missing values. Choose the Miscellaneous tab from the Real Statistics dialog box and select the correlation tests option. In the "One Sample Correlation" dialog box, insert the values of the first variable (Age) in Input Range 1 and the values of the second variable (Income) in Input Range 2. The user can choose the type of correlation from Pearson's, Spearman's or Kendall's. In order to test the hypothesis $H_0: \rho = 0$, the user can use a one-tailed test for $H_1: \rho < 0$ or $H_1: \rho > 0$, and a two-tailed test for $H_1: \rho \neq 0$, at a given level of significance (Alpha value). This hypothesis testing is applicable only if the user would like to generalize the result to the population. The correlation coefficient can also be used to quantify the sample relationship.

The Pearson's correlation coefficient is -0.2203. Since the p-value of 0.5408 is greater than the significance level, we conclude that the relationship between Age and Income is not significant. Take note that Kendall's and Spearman's are nonparametric tests and hence they are more conservative, i.e. they tend not to reject the null hypothesis. For statistical inference, there are 2 possible tests (the t-test and the Fisher test) that can be considered for Pearson's correlation.
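A hedged sketch of the three correlation options with scipy, using the Age and Income values from this example (the pair (100, 700) is the extreme point discussed below); the Pearson and Spearman coefficients match the RealStats output, while Kendall's value may differ slightly depending on how ties are handled:

import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

age    = np.array([25, 35, 23, 43, 36, 35, 52, 53, 100, 28])
income = np.array([2500, 4200, 1800, 3600, 2800, 3600, 6100, 4700, 700, 3100])

print(pearsonr(age, income))    # r   = -0.2203 with its two-tailed p-value
print(spearmanr(age, income))   # rho = 0.3201
print(kendalltau(age, income))  # tau ~ 0.34 (tie handling may differ slightly)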

Correlation Coefficients
Pearson   -0.2203
Spearman  0.320122
Kendall   0.340909

Pearson's coeff (t test)
Alpha    0.05
Tails    2
corr     -0.2203
std err  0.344868
t        -0.63879
p-value  0.540815
lower    -1.01556
upper    0.57497

Pearson's coeff (Fisher)
Rho      0
Alpha    0.05
Tails    2
corr     -0.2203
std err  0.333333
z        -0.59256
p-value  0.553474
lower    -0.74639
upper    0.475249

In this example, the Pearson method yields a negative relationship but Spearman and Kendall give a positive relationship. This is something interesting to investigate further. If we look at the scatter plot of Age versus Income, we notice that the regression line is affected by the extreme value (100, 700), which changes the relationship to negative. If we remove the extreme point, the relationship goes "back" to positive. This indicates that the Pearson correlation (and the regression model) is very sensitive to extreme values, while the nonparametric methods are robust to extreme values.

[Scatter plot of Income (0 to 7000) versus Age (0 to 120) with the fitted regression line.]

In order to find the correlation matrix for many variables, go to the Descriptive tab and choose Matrix Operations as follows. Then select the three variables in the Input Range and click the Correlation option. The resulting correlation matrix is shown below.

Correlation Matrix

1 -0.2203 0.06559
-0.2203 1 -0.5487
0.06559 -0.5487 1

Another way to create the correlation matrix is to go to the Data tab in Excel and choose the Data Analysis option as follows.

Select the Correlation tool from the dialog box and provide the input range of Age, Income and Buy. The output shows the correlation coefficients for all three pairs of variables.

Age Income Buy


Age 1
Income -0.21406 1
Buy 0.06559 -0.55951 1

7) Predictive Modelling
RealStats provides a wide range of choices for predictive modelling, such as k-means clustering, time series forecasting models, and linear and logistic regression models. These models can be classified into supervised and unsupervised approaches.

Supervised learning - the historical information of the response variable is available for model building. Supervised learning methods include linear regression, logistic regression, and time series models.

Unsupervised learning - the historical information is not available and the decision is purely
based on the explanatory variables.

In RealStats, the only unsupervised learning model is k-means clustering.

Case Study 2:
Assume that you have just landed a great analytics job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the Mid-Atlantic region, 20% of cell phone customers
leave when their contracts expire, and it is getting increasingly difficult to acquire new
customers. Since the cell phone market is now saturated, the huge growth in the wireless market
has tapered off. Communications companies are now engaged in battles to attract each other’s
customers while retaining their own customers. Customers switching from one company to
another is called churn, and it is expensive all around. A company must provide incentives to
attract a customer while another company loses revenue when the customer departs.

You have been called in to help understand the problem and to devise a solution. Attracting
new customers is much more expensive than retaining existing ones, so a good deal of
marketing budget is allocated to prevent churn. Marketing has already designed a special
retention offer. Your task is to devise a precise, step-by-step plan of how your team should use
MegaTelCo’s vast data resources to decide which customers should be offered the special

retention deal prior to the expiration of their contract. Specifically, how should MegaTelCo
choose a set of customers to receive their offer in order to best reduce churn for a particular
incentive budget?

Objectives:
i. To classify the customers into 2 clusters of customer status based on their age, gender,
payment method and house’s location.
ii. Predict whether a given customer will be churned.

Dataset: Telecom_CustomerData.xlsx

Before carrying out the above tasks, you need to clean your dataset as follows:
 Move rowNumber to the first column
 Extract the first number from PostalCode to represent the house’s location [using
=left(B2,1)+0]
 Assign the Gender as 0 (female) and 1 (male) [using if function]
 Assign the Payment Method as 1 (credit card), 2 (cheque), and 3 (cash) [using nested if
function]
 Sort the ChurnDate and replace the missing ChurnDate with today’s date (say 14 April
2017) [using =TODAY()]
 Calculate the number of days between LastTransaction and ChurnDate or Today’s date
[using =DAYS(end_date,start_date)]
 Generate a new variable to indicate whether the customer is loyal or churned. [using
=IF(G2=TODAY(),”Loyal”,”Churned”)]
 Remove/Hide the unnecessary variables.

7.1) Unsupervised data segmentation using k-means clustering


Given a data set S, there are many situations where we would like to partition the data set into
subsets (called clusters) where the data elements in each cluster are more similar to other data
elements in that cluster and less similar to data elements in other clusters. Here “similar” can
mean many things. In biology, it might mean that the organisms are genetically similar. In
marketing, it might mean that customers are in the same market segment.

In this section, we will describe a form of prototype clustering, called k-means clustering,
where a prototype member of each cluster is identified (called a centroid) which somehow
represents that cluster. The approach we take is that each data element belongs to the cluster
whose centroid is nearest to it; i.e. which minimizes the distance between that data element and
that cluster’s centroid.

Below is the basic algorithm for k-means clustering:

 Step 1: Choose the number of clusters k


 Step 2: Make an initial selection of k centroids
 Step 3: Assign each data element in S to its nearest centroid (in this way k clusters are
formed one for each centroid, where each cluster consists of all the data elements
assigned to that centroid)
 Step 4: For each cluster, make a new selection of its centroid
 Step 5: Go back to step 3, repeating the process until the centroids do not change (or some
other convergence criterion is met)
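A minimal Python sketch of these steps (an illustrative implementation, not the RealStats code), assuming X is an n x d numeric array such as the Area, Age, Sex and Payment columns:

import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 2: initial centroids
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                               # step 3: assign each point to its nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])   # step 4: recompute centroids
        if np.allclose(new_centroids, centroids):                  # step 5: stop when centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids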

Press Ctrl-M to open the RealStats dialog box. Select the K-Means Cluster Analysis from
Multi(variate) Var(iable) tab and click OK button.

Select the data range H1:K1001, with headings, for the variables Area, Age, Sex, and Payment. In this case, we do not have initial cluster centroids, and hence we leave that box blank. Since we want to group the customers into one of the "Loyal" and "Churned" statuses, the number of clusters is set to 2. The default number of iterations is 10. Store the outputs in a new worksheet.

In the output worksheet, Column A is the rowNumber; it indicates the customer's index. Column B is the cluster group of the customer. For example, customer 1 is in cluster 2, customers 2 to 8 are in cluster 1, etc.

Columns E and F are the centroid values for cluster 1 and cluster 2, respectively. Columns H and I show the distance of each customer from the two centroids, and Column J is the minimum of Columns H and I. If the value in Column J comes from Column H, then that customer is classified into cluster 1; otherwise he/she is classified into cluster 2. The sum of squared error achieved by the model is 142516.6.

7.2) Logistic Regression


Binary logistic regression is used when the response variable has only 2 possible values, e.g. yes or no, buy or not buy, agree or disagree, win or lose, etc. If we label these possible outcomes as 0 or 1 for the cases of failure and success, then our intention is to predict the probability of success, i.e. $P(y = 1)$, given all the independent variables $\{x_1, x_2, \ldots, x_k\}$. The logistic regression model is defined as

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}$$

The output of logistic regression is a probability value ranging from 0 to 1. If the predicted probability is 0.5 or greater, then the observation is classified as the case $y = 1$; otherwise it is classified as the case $y = 0$. Here, the value 0.5 is the classification cutoff value.
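A small Python sketch of this rule (a hypothetical helper, not part of RealStats): b holds the fitted coefficients with the intercept first, and x holds one observation's predictor values.

import numpy as np

def predict_case(b, x, cutoff=0.5):
    z = b[0] + np.dot(b[1:], x)          # linear predictor b0 + b1*x1 + ... + bk*xk
    p = np.exp(z) / (1 + np.exp(z))      # P(y = 1), the logistic function
    return p, int(p >= cutoff)           # classify as y = 1 when p >= cutoff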

From the Logistic Regression dialog box, the input range consists of all the independent variables and the response variable. Take note that the user must place the response variable in the last column of the dataset. This means that the last column of the dataset must contain the binary values. It is advised to analyse using Newton's method. The alpha value should be determined by the user at the beginning, before data collection. The common alpha values are 0.01, 0.025 and 0.05. For the classification cutoff, the user may start with the value 0.5 and try other values between 0 and 1 to compare the results.

The input range does not allow any discontinuity of the variables, so the user has no choice but to select all variables between the first independent variable and the response variable. For instance, if the variables are stored in the column sequence $\{x_1, x_2, \ldots, x_k, y\}$ and we know that one of the variables in between should not be considered in the model, we first move that column to another place. It is recommended to use the default 20 iterations for Newton's method to achieve convergence.

From the logistic output, the first 5 columns (Area, Age, Sex, Payment, Success=Churned) are the original data sorted in ascending order. The Failure column is the complement of the Success (i.e. Loyal) column, and the Total column is the sum of Success and Failure.

The Coeff values in Column P give the estimated parameters $b_0, b_1, \ldots, b_4$ for the Intercept, Area, Age, Sex and Payment. Thus, the fitted logistic model is

$$\hat{p} = P(y = 1) = \frac{e^{b_0 + b_1\,\text{Area} + b_2\,\text{Age} + b_3\,\text{Sex} + b_4\,\text{Payment}}}{1 + e^{b_0 + b_1\,\text{Area} + b_2\,\text{Age} + b_3\,\text{Sex} + b_4\,\text{Payment}}}$$

with the coefficient values taken from the Coeff column, which gives the probability that a customer will be churned.

Attention should be given to the predicted probability (p-Pred) column and the corresponding prediction outcome (% Correct). For a given classification cut-off value, say 0.50, a particular customer is predicted as a loyal case if his p-Pred value is below 0.50; otherwise he is classified as a churned case. If the customer is classified as a churned case and he was indeed churned (Column E: Success = 1), then the prediction is correct and % Correct is 100%. Otherwise, if the predicted class does not match the actual outcome, % Correct is 0%. Since there are 753 correct predictions out of 1000 customers, the accuracy (% correct prediction) is 75.3%.

Next, we look at the contribution of each individual explanatory variable to the prediction of customer churn. For Area, the value $\exp(b) = 1$ lies in the 95% confidence interval (0.907685, 1.071341) and the p-value $= 0.986124 > 0.05$, which indicates that Area does not make a significant contribution to the model. (A variable is significant if $\exp(b) = 1$ does not lie in the 95% CI, or equivalently if its p-value < 0.05.)

The column LL indicates the log-likelihood of each observation, and its sum of -693.129 is the value that we would like to maximize (bring closer to zero) for a given set of estimated parameters (Coeff column). LL0 is the log-likelihood of a naive model (e.g. a model with only the intercept and no explanatory variables) and LL1 is the log-likelihood of the developed model. Since $LL1 = -569.562 > -691.129 = LL0$, we say that the developed model is better than the naive model, and hence the fitted model is significant, with a p-value of 2.7E-52 for the Chi-square goodness-of-fit test.

The classification table (commonly called a confusion matrix) shows that the accuracy achieved with the cut-off value 0.5 is $ACC = \frac{TP + TN}{TP + TN + FP + FN} = \frac{373 + 380}{1000} = 0.753$. The true positive rate (TPR), also called sensitivity or recall, is $TPR = \frac{TP}{TP + FN} = \frac{373}{503} = 0.742$, and the false positive rate (FPR), or fall-out, is $FPR = \frac{FP}{FP + TN} = \frac{117}{497} = 0.235$. The true negative rate (TNR), or specificity, is $TNR = \frac{TN}{FP + TN} = \frac{380}{497} = 0.765$. Take note that $FPR = 1 - TNR$.
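A short Python sketch of these measures, using counts consistent with the quoted rates (TP = 373, FN = 130, FP = 117, TN = 380 are inferred from those rates rather than read directly from the output):

TP, FN, FP, TN = 373, 130, 117, 380

acc = (TP + TN) / (TP + TN + FP + FN)   # 0.753
tpr = TP / (TP + FN)                    # sensitivity / recall = 0.742
fpr = FP / (FP + TN)                    # fall-out = 0.235
tnr = TN / (FP + TN)                    # specificity = 0.765; note fpr == 1 - tnr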

Thus, the receiver operating characteristic (ROC) curve can be constructed by plotting the
probability of detecting the signal of interest (sensitivity) against the probability of getting a
false signal (1-specificity) for a range of possible cutoff values. The area under the ROC curve
(AUC) lies between zero and one. It measures the ability of the model to discriminate between
observations that will lead to the response of interest and those that will not lead to the response
of interest.

Hosmer and Lemeshow (2002) suggest that the AUC is a general guide to how well the model
discriminates, with the following guidelines:
 AUC = 0.5: no discrimination - the classification is a random process.
 0.7 ≤ AUC < 0.8: acceptable discrimination
 0.8 ≤ AUC < 0.9: excellent discrimination
 AUC ≥ 0.9: outstanding discrimination.

The model with Area, Age, Sex and Payment provides acceptable discrimination between loyal and churned customers, with an overall AUC of 0.793. You may try different cutoff values and compare the results.

Case Study 3:
You are the new marketing manager of an established Bicycle company. The company sells
bicycles and accessories, such as clothing and other accessories to bikers in six countries. The
company has just hired Lucy as its new Sales manager. You are assigned to introduce Lucy to
the company, its product portfolio and its sales performance since 2011. Your objective is to
predict the profit based on customer’s age, gender, state of origin and product purchased.

Dataset: Sport_SaleProfit.xlsx

Variables are Date (from 1 Jan 2011 until 31 July 2016), Year (from 2011 to 2016), Customer
ID, Age (from 17 to 87 years old), Gender, Country of origin (Australia, Canada, France,
Germany, UK, USA), Product Category (Accessories, Bikes & Clothing), Quantity, Unit Cost,
Unit Price, Cost (unit cost x quantity), and Revenue (unit price x quantity).

Carry out the following data cleansing before conducting predictive analytics:
1. Convert Gender to Sex: 0 (female) and 1 (male).
2. Convert Country to State: 1 (Australia), 2 (Canada), 3 (France), 4 (Germany), 5 (UK),
6 (USA)
3. Convert Product Category to Product: 1 (Accessories), 2 (Bikes), 3 (Clothing).
4. Create a new column Profit = Revenue – Cost.
5. Move Year, CustomerID, Gender, Country, Product Category, Quantity, Unit Cost,
Unit Price, Cost, and Revenue to the leftmost columns and hide them.
6. Go to the Formulas tab and define the headers' names.
7. Go to Insert tab and convert the dataset to Table format.

Finally, you will get the following dataset:



7.3) Convert Categorical Data to Dummy Coding


For categorical data, you may choose to use ordinary coding or alternative coding (deleting the column is not recommended). Example: for Ethnicity, we have 1=Malay, 2=Chinese and 3=Indian; hence we have the ordinary coding:
Variable 1 Variable 2
Malay 1 0
Chinese 0 1
Indian 0 0

This can be done in RealStats (old version) by selecting the categorical column into Input Range X and leaving Input Range Y blank, then selecting ordinary coding and pressing OK. The above coding will be created in the original data set. The alternative coding uses other codes to represent the ethnic groups. The above is done in the Multiple Linear Regression dialog box.
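Outside Excel, the same ordinary (dummy) coding can be sketched with pandas, for example (an assumed illustration, not the RealStats procedure):

import pandas as pd

df = pd.DataFrame({"Ethnicity": ["Malay", "Chinese", "Indian", "Malay"]})

# Keep two indicator columns; Indian is represented by (0, 0), as in the table above
coded = pd.get_dummies(df["Ethnicity"])[["Malay", "Chinese"]].astype(int)
print(coded)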

For the new RealStats version, you may follow the following steps:
 Extracting Columns from a Data Range after press Ctrl-M. The dialog box will appear
on the right-hand side.
 Enter all explanatory variables and dependent variable into the Input Range.
 Select a cell, say R1 as the starting place to put the outputs.
 Choose the Code type as Ordinary tag coding and Degree as 2.
 Click OK.

 The variables' names will appear in the middle box.


 Select Age and click Add Column. This will just copy the Age column into the new
range without any coding, as Age is not a categorical variable.
 Select Sex and click Add Column. Again, this only duplicates the Sex column
because it is already coded as 0/1.
 Select State and click Add Code. This will convert the State values into ordinary
coding with the values 0 and 1.
 Repeat the process for Product and Profit with Add Code and Add Column, respectively.

If you wish to retain the original coding of 1, 2 and 3 for the categories, then proceed with the following regression analysis.

Now, you can develop the linear regression using the original dataset or the coded dataset. In
this example, we will use the original dataset to build the model.

7.4) Splitting Sample


In performance analysis of predictive models, the sample data are divided into a training set and a testing set. The training set is used to train the model, while the testing set is used to validate the performance of the model. This approach is called cross validation. The size of the training set must be greater than the size of the testing set. Usually, the training set consists of 60% to 90% of the sample data and the testing set consists of the remaining 10% to 40%.

7.4.1) Random Split


In order to randomly split the sample into 2 sets, we create an index column and a shuffle column before the dataset, as follows. The index column is filled with 1 to n and the shuffle column is left empty. We use the following example (Buying Decision.xlsx) for illustration purposes.

Shuffle Index Age Income Buy


1 25 2500 0
2 35 4200 1
3 23 1800 1
4 43 3600 0
5 36 2800 1
6 35 3600 0
7 52 6100 0
8 53 4700 0
9 62 7000 1
10 28 3100 1

Then, we call out the function “Reformatting a Data Range” from RealStats and choose the
Shuffle option from the dialog box. Select the Index column (B2:B11) as the input range and
place the output in the Shuffle column starting from cell A2.

The following output is obtained from shuffling the index numbers. Copy and paste the values of the Shuffle column before sorting the data by it in ascending order. Our goal is to make a 70%-30% split for the training and testing sets. Therefore, the first 7 out of 10 cases will be in the training set and the remaining 3 cases in the testing set.

Shuffle Index Age Income Buy Shuffle Index Age Income Buy
6 1 25 2500 0 1 10 28 3100 1
10 2 35 4200 1 2 8 53 4700 0
8 3 23 1800 1 3 7 52 6100 0
5 4 43 3600 0 4 6 35 3600 0
9 5 36 2800 1 Sort 5 4 43 3600 0
4 6 35 3600 0 6 1 25 2500 0
3 7 52 6100 0 7 9 62 7000 1
2 8 53 4700 0 8 3 23 1800 1
7 9 62 7000 1 9 5 36 2800 1
1 10 28 3100 1 10 2 35 4200 1

7.4.2) Stratified Sampling (Optional: use Kutools add-ins)


One problem with a random split is that the selected training set may consist mostly of cases from the same group. From the above result, for example, the first 7 cases form our training set, but there are only 2 buying cases and 5 not-buying cases in the training set. Thus, if we use this training set to train the model, the fitted model will tend to be more sensitive to the not-buying cases. In order to overcome this issue, the stratified sampling method is recommended, where the samples are first grouped into a buying group and a not-buying group. This is followed by a random split of each group separately; 70% of the cases in the buying group and 70% of the cases in the not-buying group are combined together as the training set. The remaining cases from both groups are combined as the testing set.
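A hedged Python sketch of this stratified split using the Buy column of Buying Decision.xlsx: shuffle each group separately and take roughly 70% of each group for training.

import numpy as np

rng = np.random.default_rng(0)
buy = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1])    # Buy column in the original row order
idx = np.arange(len(buy))

train_idx = []
for g in (0, 1):                                   # one stratum per Buy value
    members = rng.permutation(idx[buy == g])
    train_idx.extend(members[: round(0.7 * len(members))].tolist())
test_idx = np.setdiff1d(idx, train_idx)            # the remaining cases form the testing set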

Try to apply the random split method to get the following output, where the first column A is the shuffling of the index in column B.

Next, we will split the worksheet into multiple worksheets with fewer observations in each worksheet. After installing the Kutools add-in, click on the Data tab and select the Split Data option. In the Split Data dialog box, we fix the number of rows for each worksheet at 20000, then click OK. We will obtain multiple worksheets, each with 20000 observations.

Let's use the first worksheet, which contains observations #1 to #20000, for the following linear regression analysis.

7.5) Multiple Linear Regression


Multiple linear regression is used to model a continuous response variable with a set of continuous and/or categorical independent variables. If the model contains only one independent variable, then it is called simple linear regression.

Press Ctrl-M to open the Real Statistics dialog box as follows and select the Regression tab. In the Regression dialog box, choose Multiple linear regression and then click the OK button.

Select the independent variables from your MS Excel sheet as the input range X (note that all columns must be arranged side-by-side) and the dependent variable as the input range Y. Select the headings of the variables too. Let's include the intercept in the model and use a level of significance of 0.05 for the analysis. Click the Stepwise regression check box if you want to use the stepwise selection method for model selection.

Select Regression Analysis to perform least squares regression analysis. The Residuals and Cook's D options are used to identify outliers and influential points (including DFFITS). The Durbin-Watson Test is used to test for autocorrelation of the errors/observations.

The initial model is

$$\widehat{\text{Profit}} = -145.137 + 0.728896\,\text{Age} + 2.115252\,\text{Sex} + 7.056447\,\text{State} + 342.8702\,\text{Product}$$

Since the variable 'Sex' is not significant, with p-value > 0.05, you may need to remove this variable from the subsequent analysis and reconstruct the model.
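For reference, a hedged sketch of the same least-squares fit with statsmodels, assuming df is a DataFrame that already holds the cleaned Age, Sex, State, Product and Profit columns:

import pandas as pd
import statsmodels.api as sm

def fit_profit_model(df: pd.DataFrame):
    X = sm.add_constant(df[["Age", "Sex", "State", "Product"]])   # intercept plus predictors
    model = sm.OLS(df["Profit"], X).fit()
    return model   # model.summary() reports coefficients, p-values, R-square and the ANOVA F test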

The next task is to investigate the model performance. The ANOVA test assumes that there is no linear relationship between the response and the independent variables, i.e. the null and alternative hypotheses are

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
$$H_1: \beta_j \neq 0 \text{ for at least one } j.$$

The result indicates that the ANOVA test rejects the null hypothesis (p = 0.00 and F = 1508.746) at the 0.05 level of significance. Hence, we conclude that the constructed model is significant, i.e. at least one of the independent variables contributes significantly to the estimation of the response variable.

The performance of the model can also be investigated using the coefficient of determination, $R^2 = 0.231847$, and its adjusted value, $R^2_{adj} = 0.231694$. The coefficient of determination indicates the proportion of the total variability in the response variable that can be explained by the model (the explanatory or independent variables used). The output shows that the constructed model can explain 23.18% of the variation in Profit. The model has included most of the important explanatory variables (the difference between $R^2$ and $R^2_{adj}$ is small).

The following partial outputs from the multiple linear regression are used to check for outlier points (where the response y does not follow the general trend of the other data) and to identify leverage and/or influential points. First, we compare the R-Student residual with the T-Test value (the Bonferroni critical value) to identify outlier points as follows:

$H_0$: not an outlier vs. $H_1$: is an outlier.

Reject $H_0$ if $|t_i| > t_{1-\alpha/(2n),\; n-k-2}$.

Hence, observations 1, 3, 7 and so on are potential outliers because their absolute R-Student values are larger than the T-Test value.

A data point has high leverage if it has "extreme" explanatory x values. We use the leverage values (the diagonal entries $h_{ii}$ of the hat matrix) to identify potential leverage points, with the following rules of thumb for determining whether $h_{ii}$ is "large":
a) $h_{ii} > 2(k+1)/n = 2p/n$, where $p/n$ is the mean of the $h_{ii}$;
b) $h_{ii} > 0.5$ indicates very high leverage, and $0.2 \le h_{ii} \le 0.5$ indicates moderate leverage.

Since $n = 20000$ and there are $k = 4$ independent variables in the study, $p = k + 1 = 5$. Thus, any point with a leverage value greater than $2p/n = 10/20000 = 0.0005$ is considered a high leverage point. From the output, we find that observations 98, 162, 814 and others are high leverage points.

A data point is influential if the removal of this point changes the regression line. We use the Cook's D and DFFITS measures to identify potential influential points. A point is considered influential if the absolute Cook's D value or DFFITS value is greater than 1. From the above (partial) results, we did not find any influential points.

Next, we would like to study whether the data has correlated errors, or autocorrelation. Since
the Durbin-Watson Test is not significant, we conclude that the dataset has no autocorrelation
issue.

Case Study 4: Time Series Models


You are a market analyst. Based on the daily KLSE index from 16 March 2015 until 14 April 2017, you are interested in forecasting the index values for the next 5 days. The data are given in KLSE Stock.xlsx.

[Line chart: daily KLSE Index from 16 March 2015 to 14 April 2017, ranging between about 1,500 and 1,900 points.]

7.6) Time Series Forecast


In linear regression analysis, we assume that the data are cross-sectional, in which there is no autocorrelation between errors (or response values). However, there are many situations where the observed value at time t is related to the observed values at previous times. These data often present trends, seasonal effects, cyclical effects and irregular errors. Common applications of time series forecasting include economic growth analysis, sales prediction, stock market analysis, yield projections, and process and quality control.

A time series is a sequence of observations $y_1, y_2, \ldots, y_{t-1}$, where the subscripts represent evenly spaced time intervals (seconds, minutes, hours, days, months, seasons, years, etc.). Our interest is to forecast the current value $y_t$ based on $y_1, y_2, \ldots, y_{t-1}$. There are many methods that can be applied; we will only consider two of them: the weighted moving average, and the autoregressive integrated moving average (ARIMA).

7.6.1) Autocorrelation
The Autocorrelation Function (ACF) is defined as

$$r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$

Then $r_1$ indicates how successive values of Y relate to each other, $r_2$ indicates how Y values two periods apart relate to each other, and so on. Together, the autocorrelations at lags 1, 2, ..., make up the autocorrelation function or ACF.
Partial autocorrelations are used to measure the degree of association between $y_t$ and $y_{t-k}$ when the effects of the other time lags 1, 2, 3, ..., k-1 are removed. The partial autocorrelation at lag k is

$$r_{kk} = \begin{cases} r_1 & \text{if } k = 1 \\[6pt] \dfrac{r_k - \sum_{j=1}^{k-1} r_{k-1,\,j}\, r_{k-j}}{1 - \sum_{j=1}^{k-1} r_{k-1,\,j}\, r_j} & \text{if } k = 2, 3, \ldots \end{cases}$$

where $r_{kj} = r_{k-1,\,j} - r_{kk}\, r_{k-1,\,k-j}$ for $j = 1, 2, 3, \ldots, k-1$. As with the ACF, the partial autocorrelations should all be close to zero for a white noise series.
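A minimal sketch of the ACF formula above in Python (the PACF values can be obtained analogously, e.g. via statsmodels.tsa.stattools.pacf), assuming y is a 1-D array of observations:

import numpy as np

def acf(y, nlags):
    y = np.asarray(y, dtype=float)
    d = y - y.mean()                                   # deviations from the mean
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[:len(y) - k]) / denom
                     for k in range(1, nlags + 1)])    # r_1, ..., r_nlags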

Using Sport_SaleProfit (ReducedData), we will determine the number of lags to be used in the time series analysis. Press Ctrl-M, choose the Time Series option from the main menu and then the Testing option from the dialog box. Enter the data in B2:B518 into the Input Range, excluding the header. We generate the correlations from lag 1 to lag 30 using the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF), and then press OK.

The outputs of the ACF and PACF are as follows. We then plot the ACF and PACF values on bar charts. Although the PACF values are generally small, we still choose lag 9 for further analysis, as it has an ACF of more than 0.8 and the largest absolute PACF value.

7.6.2) Weighted Moving Average


Often, the contributions of $y_{t-m}, y_{t-m+1}, \ldots, y_{t-1}$ to the forecast of $y_t$ are different, and a weight $w_i$, $i = 1, 2, \ldots, m$, with $\sum_{i=1}^{m} w_i = 1$, is assigned to each observation. The weighted moving average forecast value is defined as

$$\hat{y}_t = w_m\, y_{t-m} + w_{m-1}\, y_{t-m+1} + \cdots + w_1\, y_{t-1}$$
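A small Python sketch of this forecast (an illustrative helper, not the RealStats routine), where weights[0] is the weight $w_1$ applied to the most recent observation and the weights sum to 1:

import numpy as np

def wma_forecast(y, weights):
    w = np.asarray(weights, dtype=float)
    m = len(w)
    recent = np.asarray(y, dtype=float)[-1:-m - 1:-1]   # y[t-1], y[t-2], ..., y[t-m]
    return float(np.dot(w, recent))                     # w1*y[t-1] + w2*y[t-2] + ... + wm*y[t-m]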

To perform the weighted moving average, press Ctrl-M, choose the Time Series option from the main menu and then the Basic forecasting methods option from the dialog box. Fill in the dialog box with B1:B518, choose the Weighted Moving Averages option, and fill in the Weights Range with I2:I10 (note that the column heading is not included in the weights range). We set the # of Lags parameter to 9 (it must equal the number of weight rows), based on the previous ACF and PACF analysis, and both the number of Forecasts and Seasons to 1. The other parameters are not needed for the weighted moving average, so we leave them at their default values.

The forecast values only start at the 10th observation. The forecast KLSE index value for the next day is 1740.19, with a mean absolute error (MAE) of 14.33646 and a mean squared error (MSE) of 401.355. Both MAE and MSE are measures of forecast accuracy.

7.6.3) ARIMA Models


An autoregressive integrated moving average (ARIMA) process (aka a Box-Jenkins process) adds differencing to an ARMA process. An ARMA(p,q) process with d-order differencing is called an ARIMA(p,d,q) process. The general formula for ARIMA(p,d,q) is

$$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d\, y_t = \mu + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t$$

where the lag operator satisfies $L^i(y_t) = y_{t-i}$ and the constant (average) $\mu$ is optional. Thus, an ARIMA(2,1,0) process is an AR(2) process with first-order differencing.

Our goal now is to fit time series data to an appropriate ARIMA process. We use the following
approaches to identify a reasonable process to use:
 Plot the time series: This helps identify trends, which generally requires differencing.
We generally restrict ourselves to first or second-order differencing.
 Calculate ACF and PACF: AR processes have ACF values that converge to zero as
the lag increases, while MA processes have PACF values that converge to zero as the
lag increases. The order of the process may not be obvious when using this approach.

AR(p) processes have PACF values that are small (near zero) for lags > p, while MA(q)
processes have ACF values that are small for lags > q. If the ACF and PACF values don’t seem
to converge to zero, then differencing may be needed.

To perform ARIMA(p,d,q), press Ctrl-M, choose the Time Series option from the main menu and then the Arima Model and Forecast option from the dialog box. Fill in the dialog box with B2:B518 (without including the column heading). From the ACF and PACF analysis, we observed that the ACF converges to zero after lag 1 and the PACF does not converge to zero (probably it has very large lags). Thus, we set the parameters for the AR order as 1, the MA order as 1 (probably larger than 30 from the PACF plot), and the differencing order as 1. The model we use is ARIMA(1,1,1). We want to forecast the KLSE index for the next 5 days.

The estimated ARIMA(1,1,1) model with intercept is

$$\left(1 - \sum_{i=1}^{1} \phi_i L^i\right)(1 - L)\, y_t = \mu + \left(1 + \sum_{i=1}^{1} \theta_i L^i\right)\varepsilon_t$$
$$(1 - \phi_1 L)(y_t - y_{t-1}) = \mu + (1 + \theta_1 L)\varepsilon_t$$
$$(y_t - y_{t-1}) - \phi_1 (y_{t-1} - y_{t-2}) = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \mu$$
$$(y_t - y_{t-1}) = \phi_1 (y_{t-1} - y_{t-2}) + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \mu$$

From the outputs, we have

$$\Delta y_t = 0.478529\, \Delta y_{t-1} + \varepsilon_t - 0.41805\, \varepsilon_{t-1} - 0.06981$$

where $\Delta y_t = y_t - y_{t-1}$. Thus, the above model can be re-written as

$$y_t - y_{t-1} = 0.478529\, (y_{t-1} - y_{t-2}) + \varepsilon_t - 0.41805\, \varepsilon_{t-1} - 0.06981$$

which can be re-expressed as

$$y_t = -0.06981 + 1.478529\, y_{t-1} - 0.478529\, y_{t-2} + \varepsilon_t - 0.41805\, \varepsilon_{t-1}$$
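For comparison, a hedged sketch of fitting the same ARIMA(1,1,1) specification with statsmodels, assuming y is a pandas Series of the daily KLSE index; the estimated coefficients and forecasts may differ slightly from the RealStats output because of different estimation settings:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def fit_klse_arima(y: pd.Series, steps=5):
    model = ARIMA(y, order=(1, 1, 1)).fit()   # AR order 1, first-order differencing, MA order 1
    return model.forecast(steps=steps)        # forecasts for the next 5 days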

Both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are criteria for model selection among several models. The model with the smaller AIC or BIC value fits the data better and is preferred.

We see that the ARIMA(1,1,1) process is both stationary and invertible because the absolute values of all the roots are greater than 1. We see that all the coefficients make a significant contribution to the log-likelihood (LL) value of the ARIMA(1,1,1) model, with the exception of the constant $\mu = -0.06981$, since its p-value $= 0.786814 > 0.05$.

The psi coefficients are used to create the forecast values. Five psi coefficient values are produced since we requested 5 forecast values. Finally, we have the forecast values in the following outputs. The forecast values for times 518 to 522 are the values for the next 5 days, where $\hat{y}_t = 1730.35$, $\hat{y}_{t+1} = 1729.998$, $\hat{y}_{t+2} = 1729.73$, $\hat{y}_{t+3} = 1729.54$ and $\hat{y}_{t+4} = 1729.38$.