
Methodology for long-term prediction of time series


Antti Sorjamaa, Jin Hao, Nima Reyhani, Yongnan Ji, Amaury Lendasse
Helsinki University of Technology, Adaptive Informatics Research Centre, P.O. Box 5400, 02015 Espoo, Finland

Abstract

In this paper, a global methodology for the long-term prediction of time series is proposed. This methodology combines direct
prediction strategy and sophisticated input selection criteria: k-nearest neighbors approximation method (k-NN), mutual information
(MI) and nonparametric noise estimation (NNE). A global input selection strategy that combines forward selection, backward
elimination (or pruning) and forward–backward selection is introduced. This methodology is used to optimize the three input selection
criteria (k-NN, MI and NNE). The methodology is successfully applied to a real life benchmark: the Poland Electricity Load dataset.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Time series prediction; Input selection; k-Nearest neighbors; Mutual information; Nonparametric noise estimation; Recursive prediction;
Direct prediction; Least squares support vector machines

1. Introduction

Time series forecasting is a challenge in many fields. In finance, experts forecast stock exchange courses or stock market indices; data processing specialists forecast the flow of information on their networks; producers of electricity forecast the load of the following day. The common point to their problems is the following: how can one analyze and use the past to predict the future?

Many techniques exist for the approximation of the underlying process of a time series: linear methods such as ARX, ARMA, etc. [11], and nonlinear ones such as artificial neural networks [21]. In general, these methods try to build a model of the process. The model is then used on the last values of the series to predict the future values. The common difficulty to all the methods is the determination of sufficient and necessary information for an accurate prediction.

A new challenge in the field of time series prediction is the long-term prediction: several steps ahead have to be predicted. Long-term prediction has to face growing uncertainties arising from various sources, for instance, accumulation of errors and the lack of information [21]. In this paper, two variants of prediction strategies, namely, direct and recursive predictions are studied and compared. This paper illustrates that the direct prediction strategy gives better results than the recursive one.

In this paper, a global methodology to perform direct prediction is presented. It includes input selection strategies and input selection criteria. A global input selection strategy that combines forward selection, backward elimination and forward–backward selection is introduced. It is shown that this selection strategy is a good alternative to exhaustive search, which suffers from a too large computational load.

Three different input selection criteria are presented for the comparison of the input sets: k-nearest neighbors based input selection criteria (k-NN), mutual information (MI) and nonparametric noise estimation (NNE). The optimal set of inputs is the one that optimizes one of the three criteria; for example, the optimal set of inputs can be defined as the one that maximizes the MI between the inputs and the output.

This paper shows that all the presented criteria (k-NN, MI and NNE) provide good selections of inputs. It is also shown experimentally that the introduced global methodology provides accurate predictions with all three presented criteria.

Corresponding author. E-mail addresses: antti.sorjamaa@hut.fi (A. Sorjamaa), amaury.lendasse@hut.fi (A. Lendasse).


In this paper, least squares support vector machines (LS-SVM) are used as nonlinear models in order to avoid problems with local minima [19].

Section 2 presents the prediction strategies for the long-term prediction of time series. In Section 3 the global methodology is introduced: Section 3.1 presents the input selection strategies and Section 3.2 the input selection criteria. Finally, the prediction model, the LS-SVM, is briefly summarized in Section 4 and experimental results are shown in Section 5 using a real life benchmark: the Poland electricity load dataset.

2. Time series prediction

The time series prediction problem is the prediction of future values based on the previous values and the current value of the time series (see Eq. (1)). The previous values and the current value of the time series are used as inputs for the prediction model. One-step ahead prediction is needed in general and is referred to as short-term prediction. When multi-step ahead predictions are needed, the problem is called long-term prediction.

Unlike short-term prediction, long-term prediction typically faces growing uncertainties arising from various sources. For instance, the accumulation of errors and the lack of information make the prediction more difficult. In long-term prediction, when performing multiple step ahead prediction, there are several alternatives to build models. In the following sections, two variants of prediction strategies are introduced and compared: the direct and the recursive prediction strategies.

2.1. Recursive prediction strategy

To predict the values of a time series several steps ahead, the recursive strategy seems to be the most intuitive and simple method. It uses the predicted values as known data to predict the next ones. In more detail, the model is constructed by first making the one-step ahead prediction

  \hat{y}_{t+1} = f_1(y_t, y_{t-1}, \ldots, y_{t-M+1}),   (1)

where M denotes the number of inputs. The regressor of the model is defined as the vector of inputs y_t, y_{t-1}, \ldots, y_{t-M+1}. It is also possible to use exogenous variables as inputs in the regressor, but they are not considered here in order to simplify the notation. Nevertheless, the presented global methodology can also be used with exogenous variables.

To predict the next value, the same model is used:

  \hat{y}_{t+2} = f_1(\hat{y}_{t+1}, y_t, y_{t-1}, \ldots, y_{t-M+2}).   (2)

In Eq. (2), the predicted value \hat{y}_{t+1} is used instead of the true value, which is unknown. Then, for the H-steps ahead prediction, \hat{y}_{t+2} to \hat{y}_{t+H} are predicted iteratively. So, when the regressor length M is larger than H, there are M - H real data in the regressor to predict the Hth step. But when H exceeds M, all the inputs are predicted values, and the use of the predicted values as inputs deteriorates the accuracy of the prediction.

2.2. Direct prediction strategy

Another strategy for the long-term prediction is the direct strategy. For the H-steps ahead prediction, the model is

  \hat{y}_{t+h} = f_h(y_t, y_{t-1}, \ldots, y_{t-M+1})  with  1 \le h \le H.   (3)

This strategy estimates H direct models between the regressor (which does not contain any predicted values) and the H outputs. The errors in the predicted values are not accumulated in the next prediction. When all the values from \hat{y}_{t+1} to \hat{y}_{t+H} need to be predicted, H different models must be built. The direct strategy increases the complexity of the prediction, but more accurate results are achieved, as illustrated in Section 5. Both strategies are sketched in code below.
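To make the two strategies concrete, the sketch below (our illustration, not code from the paper) builds the regressors of Eqs. (1)–(3) and produces H-step forecasts both recursively and directly. Scikit-learn's KNeighborsRegressor is used only as a convenient stand-in for the LS-SVM model of Section 4, and all helper names are ours.

```python
# Sketch (not the authors' code): recursive vs. direct H-step prediction,
# with a k-NN regressor standing in for the LS-SVM model.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def make_regressors(series, M, h):
    """Build (X, y) pairs where X = [y_t, ..., y_{t-M+1}] and y = y_{t+h}."""
    X, y = [], []
    for t in range(M - 1, len(series) - h):
        X.append(series[t - M + 1:t + 1][::-1])   # y_t first, as in Eq. (1)
        y.append(series[t + h])
    return np.array(X), np.array(y)

def recursive_forecast(series, M, H, k=5):
    """Eqs. (1)-(2): one model f_1, fed with its own predictions."""
    X, y = make_regressors(series, M, 1)
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    window = list(series[-M:][::-1])              # most recent value first
    preds = []
    for _ in range(H):
        yhat = model.predict([window])[0]
        preds.append(yhat)
        window = [yhat] + window[:-1]             # predicted value enters the regressor
    return np.array(preds)

def direct_forecast(series, M, H, k=5):
    """Eq. (3): H separate models f_h, all fed with observed values only."""
    window = list(series[-M:][::-1])
    preds = []
    for h in range(1, H + 1):
        X, y = make_regressors(series, M, h)
        model_h = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        preds.append(model_h.predict([window])[0])
    return np.array(preds)
```

With `series` a one-dimensional array of past values, `direct_forecast(series, M=15, H=15)` returns fifteen direct predictions in the spirit of Section 5, apart from the input selection and model details discussed next.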
3. Methodology

In the experiments, the direct prediction strategy is used: H models have to be built, as shown in Eq. (3). For each model, three different input selection criteria are presented:

- minimization of the k-NN leave-one-out generalization error estimate,
- maximization of the MI between the inputs and the output,
- minimization of the NNE.

In order to optimize one of the criteria, a global input selection strategy combining the forward selection, the backward elimination and the forward–backward selection is presented in Section 3.1.

The estimation of the MI and the NNE demands the choice of hyperparameters. The definitions and the significance of these hyperparameters are explained in more depth in Sections 3.2.2 and 3.2.3. In this paper, the most adequate hyperparameter values are selected by minimizing the leave-one-out (LOO) error provided by the k-NN approximators presented in Section 3.2.

In order to avoid local minima in the training phase of the nonlinear models (f_h in Eq. (3)), the LS-SVM are used. The LS-SVM are presented in Section 4.


3.1. Input selection strategies

Input selection is an essential pre-processing stage to guarantee high accuracy, efficiency and scalability [7] in problems such as machine learning, especially when the number of observations is relatively small compared to the number of inputs. It has been the subject in many application domains like pattern recognition [14], process identification [15], time series modeling [20] and econometrics [13]. Problems that occur due to a poor selection of input variables are:

- If the input dimensionality is too large, the 'curse of dimensionality' problem [20] may appear. Moreover, the computational complexity and memory requirements of the learning model increase. Additional unrelated inputs lead to poor models (lack of generalization).
- Understanding complex models (too many inputs) is more difficult than understanding simple models (fewer inputs), which can provide comparably good performances.

Usually, the input selection methods can be divided into two broad classes: filter methods and wrapper methods, see Fig. 1.

In the case of the filter methods, the best subset of inputs is selected a priori, based only on the dataset. The input subset is chosen by an evaluation criterion, which measures the relationship of each subset of input variables with the output. In the literature, plenty of filter measures of different natures [2] exist: distance metrics, dependence measures, scores based on information theory, etc. In the case of the wrapper methods, the best input subset is selected according to a criterion that is directly defined from the learning algorithm. The wrapper methods search for a good subset of inputs using the learning model itself as a part of the evaluation function. This evaluation function is also employed to induce the final learning model.

Fig. 1. Two approaches of input variable subset selection: (a) filter method, (b) wrapper method.

Comparing these two types of input selection strategies, the wrapper methods solve the real problem, but they are potentially very time consuming, as the ultimate learning algorithm has to be included in the cost function. Therefore, thousands of evaluations are performed when searching for the best subset. For example, if 15 input variables are considered and the forward selection strategy (introduced in Section 3.1.2) is used, then 15(15 + 1)/2 = 120 different subsets have to be tested. In practice, more than 15 inputs are realistic for time series prediction problems and the computational time is thus increased dramatically. On the contrary, the filter method is much faster because the procedure is simpler. In this paper, due to the long computational time of the wrapper method, it is unrealistic to compare the wrapper and filter methods for the input selection problem that is studied.

In the following sections, the discussion is focused on the filter methods. The filter method selects a set of inputs by optimizing a criterion over different combinations of inputs. The criterion computes the dependencies between each combination of inputs and the output using predictability, correlation, mutual information or other statistics. Various alternatives of these criteria exist.

This paper uses three methods based on different criteria: k-NN, MI and NNE. In the following, MI is taken as an example to explain the global input selection strategy. For the other two input selection criteria, the procedures are similar.

3.1.1. Exhaustive search

The optimal algorithm is to compute the MI between all the possible combinations of inputs and the output: 2^M - 1 input combinations are tested (M is the number of input variables). Then, the combination that gives the maximum MI is selected. In the case of long-term prediction of time series, M is usually larger than 15, so the exhaustive search procedure becomes too time consuming. Therefore, a global input selection strategy that combines forward selection, backward elimination and forward–backward selection is used. Forward selection, backward elimination and forward–backward selection are summarized in the following sections.

3.1.2. Forward selection

In this method, starting from the empty set S of selected input variables, the best available input is added to the set S one by one, until the size of S is M. Suppose we have a set of inputs X_i, i = 1, 2, \ldots, M, and an output Y; the algorithm is summarized in Fig. 2, and a code sketch is given below.

In forward selection, only M(M + 1)/2 different input sets are evaluated. This is much less than the number of input sets evaluated with exhaustive search. On the other hand, optimality is not guaranteed: the selected set may not be the globally optimal one.

Fig. 2. Forward selection strategy.
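A compact sketch of the forward selection loop is given below (our illustration, not the authors' code). The function is generic in the criterion: `criterion` returns the value to be maximized for a candidate subset, so the MI can be passed directly, while the k-NN LOO error or the NNE would be passed negated.

```python
# Sketch of forward selection (Section 3.1.2).  `criterion(S)` scores an input
# subset S (a list of column indices); larger values are assumed to be better.
def forward_selection(n_inputs, criterion):
    selected, remaining = [], list(range(n_inputs))
    best_overall, best_set = float("-inf"), []
    while remaining:
        # Try adding each remaining input and keep the best one.
        scores = {i: criterion(selected + [i]) for i in remaining}
        best_input = max(scores, key=scores.get)
        selected.append(best_input)
        remaining.remove(best_input)
        if scores[best_input] > best_overall:
            best_overall, best_set = scores[best_input], list(selected)
    return best_set, best_overall

# Backward elimination (Section 3.1.3) is the mirror image: start from the full
# set and repeatedly drop the input whose removal gives the highest score.
```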


3.1.3. Backward elimination or pruning

Backward elimination, also called pruning [12], is the opposite of the forward selection process. In this strategy, the set S of selected inputs is initialized to contain all the input variables. Then, the input variable whose elimination maximizes the MI of the remaining set is removed from S, one by one, until the size of S is 1.

Basically, backward elimination is the same procedure as forward selection presented in the previous section, but reversed. It evaluates the same number of input sets as forward selection, M(M + 1)/2. The same restriction also exists: optimality is not guaranteed.

3.1.4. Forward–backward selection

Both the forward selection and the backward elimination methods suffer from an incomplete search. The forward–backward selection algorithm combines both methods. It offers the flexibility to reconsider input variables previously discarded and, vice versa, to discard input variables previously selected. It can start from any initial input set, including the empty, the full or a randomly initialized input set.

Let us suppose a set of inputs X_i, i = 1, 2, \ldots, M, and an output Y; the procedure of the forward–backward selection is summarized in Fig. 3.

Fig. 3. Forward–backward selection strategy.

It is noted that the selection result depends on the initialization of the input set. In this paper, two options are considered: one is to begin from the empty set and the other is to begin from the full set.

The number of input sets to be evaluated varies and depends on the initialization of the input set, the stopping criteria and the nature of the problem. Still, it is not guaranteed that this selection method finds the globally optimal input set in all cases.

3.1.5. Global selection strategy

In order to select the best input set, we propose to use all four selection methods: forward selection, backward elimination, forward–backward selection initialized with an empty set of inputs and forward–backward selection initialized with a full set of inputs. All four selection methods are fast to perform, but they do not always converge to the same input set, because of local minima. Therefore, it is necessary to use all of them to obtain a better selection. From the candidate input sets of all four selection methods, the one that optimizes the chosen criterion (k-NN, MI or NNE) is selected.

This combined strategy does not guarantee the selection of the best input set that would be obtained with the exhaustive search strategy. Nevertheless, the input selection is improved and the number of tested subsets is considerably reduced compared to the exhaustive search strategy. A code sketch of the forward–backward search and of this global strategy is given below.
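The forward–backward search of Section 3.1.4 and the global strategy of Section 3.1.5 can be sketched in the same style (our illustration; it reuses the `forward_selection` function from the sketch above and the same "larger is better" criterion convention).

```python
# Sketch of forward-backward selection (Section 3.1.4) and the global strategy
# (Section 3.1.5).  All names are ours.
def forward_backward_selection(n_inputs, criterion, initial=()):
    current = set(initial)
    best_score = criterion(sorted(current)) if current else float("-inf")
    best_set = sorted(current)
    improved = True
    while improved:
        improved = False
        for i in range(n_inputs):
            candidate = current ^ {i}             # add input i if absent, drop it if present
            if not candidate:
                continue                          # never evaluate the empty set
            score = criterion(sorted(candidate))
            if score > best_score:
                best_score, best_set, improved = score, sorted(candidate), True
                current = candidate
    return best_set, best_score

def backward_elimination(n_inputs, criterion):
    """Mirror image of forward selection: start full, drop one input at a time."""
    current = list(range(n_inputs))
    best_set, best_score = list(current), criterion(current)
    while len(current) > 1:
        scores = {i: criterion([j for j in current if j != i]) for i in current}
        drop = max(scores, key=scores.get)        # input whose removal scores best
        current.remove(drop)
        if scores[drop] > best_score:
            best_score, best_set = scores[drop], list(current)
    return best_set, best_score

def global_selection(n_inputs, criterion):
    """Run all four searches and keep the subset with the best criterion value."""
    candidates = [forward_selection(n_inputs, criterion),
                  backward_elimination(n_inputs, criterion),
                  forward_backward_selection(n_inputs, criterion, initial=()),
                  forward_backward_selection(n_inputs, criterion, initial=range(n_inputs))]
    return max(candidates, key=lambda c: c[1])
```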
3.2. Input selection criteria

3.2.1. k-Nearest neighbors

The k-NN approximation method is a very simple and powerful method. It has been used in many different applications, particularly for classification tasks [3]. The key idea behind the k-NN is that samples with similar inputs have similar output values. The nearest neighbors are selected according to the Euclidean distance, and their corresponding output values are used to obtain the approximation of the desired output. In this paper, the estimate of the output is calculated simply by averaging the outputs of the nearest neighbors:

  \hat{y}_i = \frac{1}{k} \sum_{j=1}^{k} y_{j(i)},   (4)

where \hat{y}_i represents the estimate (approximation) of the output, y_{j(i)} is the output of the jth nearest neighbor of sample x_i and k denotes the number of neighbors used.

The distances between samples are influenced by the input selection. Consequently, the nearest neighbors and the approximation of the outputs depend on the input selection.

The k-NN is a nonparametric method and only k, the number of neighbors, has to be determined. The selection of k can be performed by many different model structure selection techniques, for example k-fold cross-validation [9], leave-one-out [9], Bootstrap [5] and Bootstrap 632 [6]. These methods estimate the generalization error obtained for each value of k. The selected k is the one that minimizes the generalization error.

In [16], all these methods, the leave-one-out and the Bootstraps, select the same input sets. Moreover, the number of neighbors is more efficiently selected by the Bootstraps [16]. It has also been shown that the k-NN itself is a good approximator for time series [16]. In this paper, however, the k-NN is not used as an approximator, but as a tool to select the input set. A sketch of this LOO criterion is given below.
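The k-NN leave-one-out criterion can be sketched as follows (our illustration): the score of a candidate input subset is the LOO mean squared error of the approximator of Eq. (4), computed on the selected input columns only.

```python
# Sketch of the k-NN leave-one-out (LOO) criterion of Section 3.2.1.
import numpy as np

def knn_loo_error(X, y, subset, k=5):
    Xs = np.asarray(X, float)[:, subset]
    y = np.asarray(y, float)
    # Pairwise Euclidean distances; the diagonal is excluded so that a point
    # is never its own neighbor (leave-one-out).
    d = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each sample
    y_hat = y[neighbors].mean(axis=1)             # Eq. (4): average of their outputs
    return np.mean((y - y_hat) ** 2)

# Used with the selection sketches above, with the sign flipped so that
# "larger is better":  criterion = lambda S: -knn_loo_error(X, y, S, k=5)
```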


3.2.2. Mutual information

The MI can be used to evaluate the dependencies between random variables. The MI between two variables, say X and Y, is the amount of information obtained about X in the presence of Y, and vice versa. In a time series prediction problem, if Y is the output and X is a subset of the input variables, the MI between X and Y is a criterion for measuring the dependence between the inputs (regressor) and the output. Thus, the input subset X that gives the maximum MI is chosen to predict the output Y.

The definition of MI originates from entropy in information theory. For continuous random variables (scalar or vector), let \mu_{X,Y}, \mu_X and \mu_Y represent the joint probability density function and the two marginal density functions of the variables. The entropy of X is defined by Shannon [1] as

  H(X) = -\int_{-\infty}^{\infty} \mu_X(x) \log \mu_X(x) \, dx,   (5)

where log is the natural logarithm, so the information is measured in natural units.

The remaining uncertainty of X when Y is known is measured by the conditional entropy

  H(X|Y) = -\int_{-\infty}^{\infty} \mu_Y(y) \int_{-\infty}^{\infty} \mu_X(x|Y=y) \log \mu_X(x|Y=y) \, dx \, dy.   (6)

The joint entropy is defined as

  H(X,Y) = -\int_{-\infty}^{\infty} \mu_{X,Y}(x,y) \log \mu_{X,Y}(x,y) \, dx \, dy.   (7)

The MI between the variables X and Y is defined as [4]

  MI(X,Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).   (8)

From Eqs. (5)-(8), the MI is computed as

  MI(X,Y) = \int_{-\infty}^{\infty} \mu_{X,Y}(x,y) \log \frac{\mu_{X,Y}(x,y)}{\mu_X(x)\mu_Y(y)} \, dx \, dy.   (9)

For computing the MI, only estimates of the probability density functions \mu_{X,Y}, \mu_X and \mu_Y are required.

In this paper, MI(X,Y) is estimated by the k-NN based approach presented in [10]. In order to distinguish the number of neighbors used in the MI estimator from the one used in the k-NN criterion, the number of neighbors is denoted by l for the estimation of MI.

The novelty of this l-NN based MI estimator consists in its ability to estimate the MI between two variables of any dimension. The estimate of the MI then depends on the predefined value of l.

In [10], it is suggested to use a mid-range value l = 6. But it has been shown that, when applied to time series prediction problems, l needs to be tuned for different datasets and different data dimensions in order to obtain better performance. As explained in Section 3, to select the inputs based on the l-NN MI estimator, the optimal l is selected using the k-NN criterion and leave-one-out. A sketch of this type of estimator is given below.
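The following sketch gives an l-NN MI estimate between a (possibly multi-dimensional) input subset and a scalar output, in the spirit of the estimator of [10]. It is our illustration rather than the authors' implementation; the small tolerance used when counting marginal neighbors is an implementation convenience.

```python
# Sketch of a Kraskov-type l-NN mutual information estimate [10].
import numpy as np
from scipy.special import digamma
from scipy.spatial import cKDTree

def knn_mutual_information(X, y, subset, l=6):
    Xs = np.asarray(X, float)[:, subset]
    y = np.asarray(y, float).reshape(-1, 1)
    N = len(y)
    Z = np.hstack([Xs, y])                         # joint (X, Y) space
    # Distance to the l-th nearest neighbor in the joint space (max-norm);
    # k = l + 1 because the query point itself is returned first.
    eps = cKDTree(Z).query(Z, k=l + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(Xs), cKDTree(y)
    # Count the strictly closer neighbors in each marginal space.
    nx = np.array([len(tree_x.query_ball_point(Xs[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(N)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(N)])
    return digamma(l) + digamma(N) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# As an input selection criterion: criterion = lambda S: knn_mutual_information(X, y, S, l=6)
```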
3.2.3. Nonparametric noise estimator using the gamma test

The gamma test (GT) is a technique for estimating the variance of the noise, or the mean squared error (MSE), that can be achieved without overfitting [8]. The evaluation of the NNE is done using the GT estimation introduced by Stefansson in [17].

Given N input–output pairs (x_i, y_i) \in R^M \times R, the relationship between x_i and y_i can be expressed as

  y_i = f(x_i) + r_i,   (10)

where f is the unknown function and r is the noise. The GT estimates the variance of the noise r.

The GT is useful for evaluating the nonlinear correlation between two random variables, namely the input and output pairs. The GT has been introduced for model selection, but it can also be used for input selection: the set of inputs that minimizes the GT is the one that is selected. Indeed, according to the GT, the selected set of inputs is the one that represents the relationship between inputs and output in the most deterministic way.

The GT is based on hypotheses coming from the continuity of the regression function. If two points x and x' are close in the input space, the continuity of the regression function implies that the outputs f(x) and f(x') will be close in the output space. Alternatively, if the corresponding output values are not close in the output space, this is due to the influence of the noise.

Two versions for evaluating the GT have been suggested. The first one evaluates the pair of quantities below on sets of data of increasing size; the result for a particular parameter pair is then obtained by averaging the results over all set sizes. The newer, refined version establishes the estimation based on the k-NN differences instead of increasing the number of data points gradually. In order to distinguish the number of neighbors used in the NNE context from the conventional k of the k-NN, it is denoted by p.

Let us denote the pth nearest neighbor of the point x_i in the set \{x_1, \ldots, x_N\} by x_{p(i)}. Then the following variables, \gamma_N and \delta_N, are defined as

  \gamma_N(p) = \frac{1}{2N} \sum_{i=1}^{N} |y_{p(i)} - y_i|^2,   (11)

  \delta_N(p) = \frac{1}{2N} \sum_{i=1}^{N} |x_{p(i)} - x_i|^2,   (12)

where |\cdot| denotes the Euclidean metric and y_{p(i)} is the output of x_{p(i)}. For a correctly selected p [8], the constant term of the linear regression model between the pairs (\gamma_N(p), \delta_N(p)) determines the noise variance estimate. For a proof of the convergence of the GT, see [8]. A sketch of this estimator is given below.
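The following sketch (ours, not the authors' implementation) computes the pairs of Eqs. (11)-(12) for p = 1, ..., p_max and returns the intercept of their linear fit as the noise variance estimate.

```python
# Sketch of the gamma test / NNE of Section 3.2.3.
import numpy as np
from scipy.spatial import cKDTree

def gamma_test(X, y, subset, p_max=10):
    Xs = np.asarray(X, float)[:, subset]
    y = np.asarray(y, float)
    # Indices and distances of the p_max nearest neighbors of every sample
    # (column 0 is the sample itself and is skipped).
    dist, idx = cKDTree(Xs).query(Xs, k=p_max + 1)
    deltas, gammas = [], []
    for p in range(1, p_max + 1):
        nb = idx[:, p]                                   # p-th nearest neighbor of each x_i
        deltas.append(np.mean(dist[:, p] ** 2) / 2.0)    # delta_N(p), Eq. (12)
        gammas.append(np.mean((y[nb] - y) ** 2) / 2.0)   # gamma_N(p), Eq. (11)
    slope, intercept = np.polyfit(deltas, gammas, 1)     # regress gamma on delta
    return intercept                                      # noise variance estimate

# As an input selection criterion: criterion = lambda S: -gamma_test(X, y, S)
```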


The GT assumes the existence of the first and second derivatives of the regression function. Let us denote

  \nabla f(x) = \left( \frac{\partial f}{\partial x(i)} \right)_{i=1}^{M},  \quad  Hf(x) = \left( \frac{\partial^2 f}{\partial x(i)\,\partial x(j)} \right)_{i,j=1}^{M},   (13)

where x(i) and x(j) are the ith and jth components of x, respectively, and M is the number of variables. The GT requires both |\nabla f(x)| and |Hf(x)| to be bounded. These two conditions are general and are usually satisfied in practical problems. The GT requires no other assumption on the smoothness of the regression function. Consequently, the method is able to deal with regression functions of any degree of roughness.

The second assumption concerns the noise distribution:

  E_F\{r\} = 0  and  E_F\{r^2\} = var\{r\} < \infty,   (14)

  E_F\{r^3\} < \infty  and  E_F\{r^4\} < \infty,   (15)

where the expectations E_F\{\cdot\} are taken with respect to the noise density function F. Furthermore, it is required that the noise variable is independent and identically distributed. In the case of heterogeneous noise, the GT provides the average of the noise variance over the whole dataset.

As discussed above (see Eq. (11)), the GT depends on the number of neighbors p used to evaluate the regression. It is suggested to use a mid-range value p = 10 [8]. But, when applied to time series prediction problems, p needs to be tuned for each dataset and for each set of variables to obtain better performance. As explained in Section 3, to select the inputs, the optimal p is selected using the k-NN criterion and leave-one-out.

4. Nonlinear models

In this paper, LS-SVM are used as nonlinear models [19]. They are defined in their primal weight space by [22,18]

  \hat{y} = \omega^T \varphi(x) + b,   (16)

where \varphi(x) is a function that maps the input space into a higher-dimensional feature space, x is the vector of inputs, and \omega and b are the parameters of the model. The optimization problem can be formulated as

  \min_{\omega,b,e} J(\omega, e) = \frac{1}{2}\,\omega^T \omega + \gamma\,\frac{1}{2} \sum_{i=1}^{N} e_i^2,   (17)

  subject to  y_i = \omega^T \varphi(x_i) + b + e_i,  i = 1, \ldots, N,   (18)

and the solution is

  h(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b.   (19)

In the above equations, i refers to the index of a sample and K(x, x_i) is the kernel function, defined as the dot product \varphi(x)^T \varphi(x_i). Training methods for the estimation of the \omega and b parameters can be found in [22]. A small numerical sketch of the LS-SVM training equations is given below.
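In the dual, Eq. (19) is obtained by solving one linear system. The sketch below is our own minimal version (not the LS-SVMlab toolbox of [22]) with an RBF kernel; the kernel width sigma and the regularization parameter gamma are the hyperparameters that Section 5 tunes by cross-validation.

```python
# Sketch of LS-SVM training for regression, Eqs. (16)-(19): the dual problem
# reduces to a single linear system in (b, alpha).
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    # Dual system:  [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b   # Eq. (19)
```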
5. Experiments

5.1. Dataset

One time series is used as an example. The dataset is called Poland Electricity Load, and it represents two periods of the daily electricity load of Poland during around 1500 days in the 1990s [23]. The quasi-sinusoidal seasonal variation is clearly visible in the dataset. The first 1000 values are used for training, and the remaining data for testing. The learning part of the dataset is shown in Fig. 4.

Fig. 4. Learning set of the Poland Electricity Load dataset.

5.2. Results

The maximum regressor size is set to 15 according to [16]: a two-week regressor is large enough to catch the main dynamics of the electricity load time series. The selected inputs based on the three methods are shown in Table 1.

For example, the inputs selected by MI for the one-step ahead prediction are t, t-6 and t-7. Then, the prediction model is

  \hat{y}(t+1) = f_1(y(t), y(t-6), y(t-7)).   (20)

From Table 1, it can also be seen that the time distance between the target time and some selected inputs is constant over the whole prediction horizon. For example, input t-6 is used to predict t+1, input t-5 is used to predict t+2, input t-4 is used to predict t+3, etc. This fact is due to the weekly dynamics of the time series.

The number of inputs selected by the k-NN varies from 2 to 9 and is 7 on average (from the maximum of 15 inputs). It shows that the models are sparse and the curse of dimensionality is reduced.

The sparsity also enables a physical interpretation of the selected inputs. For example, for one-step ahead prediction, the inputs selected by the k-NN are t, t-5, t-6, t-7 and t-13. This means that, in order to predict the load of the next day, let us say Tuesday, we need to use the load of Monday (the current day); of Wednesday, Tuesday and Monday of the previous week; and of Tuesday two weeks before. The load of the current day is needed because it is the most up-to-date measurement. Monday, Tuesday and Wednesday of the previous week are needed to estimate the trend of the electricity load over Tuesday. Tuesday two weeks before is needed to handle the day-specific changes in the electricity load.

The LS-SVM are used to compare the performances. A 10-fold cross-validation [9] procedure has been applied for model structure selection.

The MSE of the direct prediction on the test set, based on the three input selection methods, is drawn in Fig. 5. All input selection criteria (k-NN, MI and GT) provide good and quite similar inputs. The selected inputs also provide predictions with similar errors. A sketch of the overall per-horizon procedure is given below.
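For completeness, the per-horizon procedure used in the experiments can be sketched as follows (our illustration, reusing the helper functions defined in the earlier sketches; the actual experiments use the full global selection strategy of Section 3.1.5 and LS-SVM hyperparameters tuned by 10-fold cross-validation).

```python
# Sketch: for every horizon h, select inputs with the k-NN LOO criterion and
# train an LS-SVM on the selected inputs (direct strategy, Eq. (3)).
import numpy as np

def train_direct_models(series, M=15, H=15):
    models = []
    for h in range(1, H + 1):
        X, y = make_regressors(series, M, h)
        criterion = lambda S: -knn_loo_error(X, y, S, k=5)
        subset, _ = forward_selection(M, criterion)
        alpha, b = lssvm_train(X[:, subset], y, gamma=10.0, sigma=1.0)
        models.append((subset, alpha, b, X[:, subset]))
    return models
```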


Table 1. Selected inputs for the Poland Electricity Load dataset. The numbers in the first row and the first column represent the prediction time step (1 to 15) and the regressor index (0 to -14), respectively. Symbol X is for the MI selected inputs, O represents the NNE selection results and D is for the k-NN selected results.


From the three methods, the k-NN is the fastest and should therefore be preferred.

Fig. 5. The MSE of the different methods on the test set of the Poland Electricity Load data, for the MI, NNE and k-NN selected inputs.

The MSE of the direct and recursive predictions on the test set, based on the k-NN input selection criterion, are shown in Fig. 6. From Fig. 6 it can be seen that the direct prediction strategy gives a smaller error than the recursive one. The error difference increases as the prediction horizon increases. The error of the direct strategy is linear with respect to the prediction horizon; this is not the case for the recursive strategy.

Fig. 6. The MSE of the direct and recursive predictions for the test set of the Poland Electricity Load data: the solid line represents the direct prediction error and the dashed line the recursive prediction error.

Fifteen time step predictions of the direct prediction method, based on the k-NN input selection, are given in Fig. 7. In Fig. 7 it can be seen that the long-term prediction has captured the intrinsic behavior of the time series.

Fig. 7. Prediction results of the k-NN method for the Poland Electricity Load data: true values and predictions.

Our results agree with the intuition and with the models used by real life electricity companies in their electricity consumption estimation.

Similar results have been obtained on other time series benchmarks. The direct prediction strategy always provides accurate predictions. Furthermore, the global methodology introduced in this paper provides sparse and accurate long-term prediction models that can be easily interpreted.

6. Conclusion

This paper presents a global methodology for the long-term prediction of time series. It illustrates that the direct prediction strategy gives better results than the recursive one. On the other hand, the direct prediction strategy multiplies the computational load by the number of prediction steps needed. In order to deal with this increase of the computational load, a fast and reliable global input selection strategy has been introduced.

It has been shown that the k-NN, the MI and the NNE criteria provide good selections of inputs. It is also shown that the global input selection strategy combining the forward selection, the backward elimination and the forward–backward selection is a good alternative to the exhaustive search, which suffers from a too large computational load. The k-NN selection criterion is the fastest, because the selection of hyperparameters is not needed. This makes the k-NN roughly 10 times faster than the MI and 20 times faster than the NNE.

The use of the LS-SVM, which do not suffer from the problems of local minima, allows a reliable comparison. The methodology has been applied successfully to a real life benchmark. The sparseness of the selected models allows straightforward physical interpretations.

In further work, efforts have to be made to reduce the computational load of the input selection criteria. Alternatives to the forward, backward and forward–backward selection strategies have to be explored as well.

References

[1] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.
[2] M. Ben-Bassat, Pattern recognition and reduction of dimensionality, in: Handbook of Statistics, vol. II, 1982, pp. 773–910.
[3] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[4] T. Cover, J. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[5] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, London, 1993.
[6] B. Efron, R.J. Tibshirani, Improvements on cross-validation: the .632+ bootstrap method, J. Am. Statist. Assoc. 92 (1997) 548–560.


[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[8] A.J. Jones, New tools in non-linear modeling and prediction, Comput. Manage. Sci. 1 (2004) 109–149.
[9] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, 1995.
[10] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (2004) 066138.
[11] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[12] G. Manzini, Perimeter search in restricted memory, Comput. Math. Appl. 32 (1996) 37–45.
[13] R. Meiri, J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications, Eur. J. Oper. Res. 171 (2006) 842–858.
[14] E. Rasek, A contribution to the problem of feature selection with similarity functionals in pattern recognition, Pattern Recognition 3 (1971) 31–36.
[15] Q. Shen, R. Jensen, Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring, Pattern Recognition 37 (2004) 1351–1363.
[16] A. Sorjamaa, N. Reyhani, A. Lendasse, Input and structure selection for k-NN approximator, in: J. Cabestany, A. Prieto, F.S. Hernandez (Eds.), Computational Intelligence and Bioinspired Systems, 8th International Work-Conference on Artificial Neural Networks, IWANN 2005, Barcelona, Spain, Lecture Notes in Computer Science, vol. 3512, Springer, Berlin/Heidelberg, 2005, pp. 985–991.
[17] A. Stefansson, N. Koncar, A.J. Jones, A note on the gamma test, Neural Comput. Appl. 5 (3) (1997) 131–133.
[18] J.A.K. Suykens, J.D. Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002) 85–105.
[19] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[20] M. Verleysen, D. Francois, The curse of dimensionality in data mining and time series prediction, in: J. Cabestany, A. Prieto, F.S. Hernandez (Eds.), Computational Intelligence and Bioinspired Systems, 8th International Work-Conference on Artificial Neural Networks, IWANN 2005, Barcelona, Spain, Lecture Notes in Computer Science, vol. 3512, Springer, Berlin/Heidelberg, 2005, pp. 758–770.
[21] A. Weigend, N. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1994.
[22] Available from <http://www.esat.kuleuven.ac.be/sista/lssvmlab/>.
[23] <http://www.cis.hut.fi/projects/tsp/?page=Timeseries>.

Antti Sorjamaa was born in 1980 in a small city in northern Finland. He received his master degree from Helsinki University of Technology in 2005. His Master's thesis is entitled "Strategies for the Long-Term Prediction of Time Series using Local Models". Currently, he is continuing as a Ph.D. student in HUT. He is the author or the coauthor of six scientific papers in international journals, books or communications to conferences with reviewing committee. The topic of his research is missing value problems in temporal databases.

Jin Hao was born in China in 1980. She received her bachelor degree from Beijing University of Technology in 2003 and her master degree from Helsinki University of Technology in 2005. Her master thesis title is "Input Selection using Mutual Information—Applications to Time Series Prediction". She is the author or the coauthor of five scientific papers in international journals, books or communications to conferences with reviewing committee. She is now working in the Telecommunications and Networking Department of Samsung Electronics in Korea.

Nima Reyhani was born in the northern part of Iran (Persia) in 1979. He received his bachelor degree from the University of Isfahan. During his bachelor studies, he was working in the Iran Telecom Research Center. He received his master degree from Helsinki University of Technology, Finland. He is the author or the coauthor of seven scientific papers in international journals, books or communications to conferences with reviewing committee. He is now a Ph.D. student in HUT and his field of research is noise estimation.

Yongnan Ji was born in 1981 in Daqing, in the northern part of China. He received his bachelor degree from Harbin Institute of Technology, China, in 2003. In 2005, he received his master degree from Helsinki University of Technology, Finland. The title of his master thesis is "Least Squares Support Vector Machines for Time Series Prediction". He is the author or the coauthor of four scientific papers in international journals, books or communications to conferences with reviewing committee. He is currently a Ph.D. student in HUT, working on machine learning algorithms for chemometrics data.

Amaury Lendasse was born in 1972 in Belgium. He received the M.S. degree in mechanical engineering from the Université catholique de Louvain (Belgium) in 1996, the M.S. in control in 1997 and the Ph.D. in 2003 from the same university. In 2003, he was a postdoctoral researcher in the Computational Neurodynamics Lab at the University of Memphis. Since 2004, he is a senior researcher in the Adaptive Informatics Research Centre at the Helsinki University of Technology in Finland. He is leading the Time Series Prediction Group. He is the author or the coauthor of 64 scientific papers in international journals, books or communications to conferences with reviewing committee. His research includes time series prediction, chemometrics, variable selection, noise variance estimation, determination of missing values in temporal databases, nonlinear approximation in financial problems, functional neural networks and classification.
