Methodology for Long-Term Prediction of Time Series

Antti Sorjamaa, Jin Hao, Nima Reyhani, Yongnan Ji, Amaury Lendasse
Abstract
In this paper, a global methodology for the long-term prediction of time series is proposed. This methodology combines a direct prediction strategy with sophisticated input selection criteria: the k-nearest neighbors approximation method (k-NN), mutual information (MI) and nonparametric noise estimation (NNE). A global input selection strategy that combines forward selection, backward elimination (or pruning) and forward–backward selection is introduced. This strategy is used to optimize the three input selection criteria (k-NN, MI and NNE). The methodology is successfully applied to a real-life benchmark: the Poland Electricity Load dataset.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Time series prediction; Input selection; k-Nearest neighbors; Mutual information; Nonparametric noise estimation; Recursive prediction;
Direct prediction; Least squares support vector machines
In this paper, least squares support vector machines (LS-SVM) are used as nonlinear models in order to avoid the problem of local minima [19].

Section 2 presents the prediction strategies for the long-term prediction of time series. In Section 3 the global methodology is introduced: Section 3.1 presents the input selection strategies and Section 3.2 the input selection criteria. Finally, the prediction model, the LS-SVM, is briefly summarized in Section 4, and experimental results are shown in Section 5 using a real-life benchmark: the Poland electricity load dataset.

2. Time series prediction

The time series prediction problem is the prediction of future values based on the previous values and the current value of the time series (see Eq. (1)). The previous values and the current value of the time series are used as inputs for the prediction model. One-step-ahead prediction is needed in general and is referred to as short-term prediction. When multi-step-ahead predictions are needed, the problem is called long-term prediction.

Unlike short-term prediction, long-term prediction typically faces growing uncertainties arising from various sources. For instance, the accumulation of errors and the lack of information make the prediction more difficult. In long-term prediction, i.e. when performing multiple-step-ahead prediction, there are several alternatives for building the models. In the following sections, two prediction strategies are introduced and compared: the recursive and the direct strategies.

2.1. Recursive prediction strategy

To predict several steps ahead of a time series, the recursive strategy seems the most intuitive and simple method. It uses the predicted values as known data to predict the next ones. In more detail, the model is constructed by first making the one-step-ahead prediction:

$\hat{y}_{t+1} = f_1(y_t, y_{t-1}, \ldots, y_{t-M+1})$,  (1)

where M denotes the number of inputs. The regressor of the model is defined as the vector of inputs: $y_t, y_{t-1}, \ldots, y_{t-M+1}$. It is also possible to use exogenous variables as inputs in the regressor, but they are not considered here in order to simplify the notation. Nevertheless, the presented global methodology can also be used with exogenous variables.

To predict the next value, the same model is used:

$\hat{y}_{t+2} = f_1(\hat{y}_{t+1}, y_t, y_{t-1}, \ldots, y_{t-M+2})$.  (2)

In Eq. (2), the predicted value $\hat{y}_{t+1}$ is used instead of the true value, which is unknown. Then, for the H-steps-ahead prediction, $\hat{y}_{t+2}$ to $\hat{y}_{t+H}$ are predicted iteratively. So, when the regressor length M is larger than H, there are M − H real data values in the regressor for predicting the Hth step. But when H exceeds M, all the inputs are predicted values. The use of predicted values as inputs deteriorates the accuracy of the prediction.

2.2. Direct prediction strategy

Another strategy for long-term prediction is the direct strategy. For the H-steps-ahead prediction, the models are

$\hat{y}_{t+h} = f_h(y_t, y_{t-1}, \ldots, y_{t-M+1})$  with  $1 \le h \le H$.  (3)

This strategy estimates H direct models between the regressor (which does not contain any predicted values) and the H outputs. The errors in the predicted values are not accumulated in the next predictions. When all the values from $\hat{y}_{t+1}$ to $\hat{y}_{t+H}$ need to be predicted, H different models must be built. The direct strategy increases the complexity of the prediction, but more accurate results are achieved, as illustrated in Section 5.
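To make the two strategies concrete, the sketch below builds the regressor vectors of Eqs. (1)–(3) from a univariate series and produces an H-steps-ahead forecast with each strategy. It is a minimal illustration under stated assumptions, not the authors' implementation: NumPy arrays are used and scikit-learn's KNeighborsRegressor stands in for the LS-SVM model of the paper.

```python
# Minimal sketch of the recursive (Eqs. (1)-(2)) and direct (Eq. (3)) strategies.
# A k-NN regressor stands in for the LS-SVM model of the paper.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def build_regressors(series, M):
    """X[i] = (y_t, ..., y_{t-M+1}) for t = M-1, ..., len(series)-2; y1[i] = y_{t+1}."""
    X = np.array([series[t - M + 1:t + 1][::-1] for t in range(M - 1, len(series) - 1)])
    y1 = np.array([series[t + 1] for t in range(M - 1, len(series) - 1)])
    return X, y1

def recursive_forecast(series, M, H):
    X, y1 = build_regressors(series, M)
    f1 = KNeighborsRegressor(n_neighbors=5).fit(X, y1)   # single one-step model f_1
    regressor = list(series[-M:][::-1])                   # (y_t, ..., y_{t-M+1})
    forecast = []
    for _ in range(H):                                    # feed predictions back as inputs
        y_hat = f1.predict([regressor])[0]
        forecast.append(y_hat)
        regressor = [y_hat] + regressor[:-1]
    return forecast

def direct_forecast(series, M, H):
    X, _ = build_regressors(series, M)
    last_regressor = list(series[-M:][::-1])
    forecast = []
    for h in range(1, H + 1):                             # one model f_h per horizon
        y_h = np.array([series[t + h] for t in range(M - 1, len(series) - h)])
        f_h = KNeighborsRegressor(n_neighbors=5).fit(X[:len(y_h)], y_h)
        forecast.append(f_h.predict([last_regressor])[0])
    return forecast
```

With this skeleton, the recursive variant trains one model and reuses it H times, while the direct variant pays for H training runs, mirroring the complexity versus accuracy trade-off discussed above.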
3. Methodology

In the experiments, the direct prediction strategy is used, so H models have to be built, as shown in Eq. (3). For each model, three different input selection criteria are presented:

- minimization of the k-NN leave-one-out generalization error estimate,
- maximization of the MI between the inputs and the output,
- minimization of the NNE.

In order to optimize one of the criteria, a global input selection strategy combining forward selection, backward elimination and forward–backward selection is presented in Section 3.1.

The estimation of MI and NNE demands the choice of hyperparameters. The definitions and the significance of these hyperparameters are explained in more depth in Sections 3.2.2 and 3.2.3. In this paper, the most adequate hyperparameter values are selected by minimizing the leave-one-out (LOO) error provided by the k-NN approximators presented in Section 3.2.

In order to avoid local minima in the training phase of the nonlinear models (the $f_h$ in Eq. (3)), the LS-SVM are used. The LS-SVM are presented in Section 4.
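As a rough illustration of the model class used for the $f_h$, the sketch below solves the standard LS-SVM regression dual system with an RBF kernel, following the general formulation of Suykens et al. [19]. The hyperparameter names gamma (regularization constant) and sigma (kernel width) and the plain linear solve are choices of this sketch, not details taken from the paper (see also the LS-SVMlab toolbox [22] for a reference implementation).

```python
# Minimal LS-SVM regression sketch (dual formulation, RBF kernel), following
# the standard formulation of Suykens et al. [19]. gamma and sigma are
# illustrative hyperparameter names.
import numpy as np

def rbf_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    # Linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b, X, sigma

def lssvm_predict(model, Xnew):
    alpha, b, Xtrain, sigma = model
    return rbf_kernel(Xnew, Xtrain, sigma) @ alpha + b

# Usage sketch: model = lssvm_fit(X_train, y_train); y_pred = lssvm_predict(model, X_test)
```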
3.1. Input selection strategies

Input selection is an essential pre-processing stage to guarantee high accuracy, efficiency and scalability [7] in machine learning problems, especially when the number of observations is relatively small compared to the number of inputs. It has been studied in many application domains like pattern recognition [14], process identification [15], time series modeling [20] and econometrics [13]. Problems that occur due to a poor selection of the input variables are:

- If the input dimensionality is too large, the 'curse of dimensionality' problem [20] may happen. Moreover, the computational complexity and memory requirements of the learning model increase. Additional unrelated inputs lead to poor models (lack of generalization).
- Understanding complex models (too many inputs) is more difficult than understanding simple models (fewer inputs), which can provide comparably good performances.

Usually, the input selection methods can be divided into two broad classes: filter methods and wrapper methods, see Fig. 1. In the case of the filter methods, the best subset of inputs is selected a priori, based only on the dataset. The input subset is chosen by an evaluation criterion, which measures the relationship of each subset of input variables with the output. In the literature, plenty of filter measures of different natures [2] exist: distance metrics, dependence measures, scores based on information theory, etc. In the case of the wrapper methods, the best input subset is selected according to a criterion that is directly defined from the learning algorithm. The wrapper methods search for a good subset of inputs using the learning model itself as a part of the evaluation function. This evaluation function is also employed to induce the final learning model.

Comparing these two types of input selection strategies, the wrapper methods solve the real problem, but they are potentially very time consuming, as the ultimate learning algorithm has to be included in the cost function. Therefore, thousands of evaluations are performed when searching for the best subset. For example, if 15 input variables are considered and the forward selection strategy (introduced in Section 3.1.2) is used, then 15(15 + 1)/2 = 120 different subsets have to be tested. In practice, more than 15 inputs are realistic for time series prediction problems, and the computational time thus increases dramatically. On the contrary, the filter methods are much faster because the procedure is simpler. In this paper, due to the long computational time of the wrapper method, it is unrealistic to compare the wrapper and filter methods for the input selection problem that is studied.

In the following sections, the discussion is focused on the filter methods. A filter method selects a set of inputs by optimizing a criterion over different combinations of inputs. The criterion computes the dependency between each combination of inputs and the output using predictability, correlation, mutual information or other statistics. Various alternatives for these criteria exist. This paper uses three methods based on different criteria: k-NN, MI and NNE. In the following, MI is taken as an example to explain the global input selection strategy; for the other two input selection criteria, the procedures are similar.

3.1.1. Exhaustive search

The optimal algorithm is to compute the MI between every possible combination of inputs and the output, i.e. 2^M − 1 input combinations are tested (M is the number of input variables). Then, the combination that gives the maximum MI is selected. In the case of long-term prediction of time series, M is usually larger than 15, so the exhaustive search procedure becomes too time consuming. Therefore, a global input selection strategy that combines forward selection, backward elimination and forward–backward selection is used. Forward selection, backward elimination and forward–backward selection are summarized in the following sections.

3.1.2. Forward selection

In this method, starting from the empty set S of selected input variables, the best available input is added to the set S one by one, until the size of S is M. Suppose we have a set of inputs $X_i$, i = 1, 2, ..., M, and an output Y; the algorithm is summarized in Fig. 2.

In forward selection, only M(M + 1)/2 different input sets are evaluated. This is much less than the number of input sets evaluated with the exhaustive search. On the other hand, optimality is not guaranteed: the selected set may not be the globally optimal one.
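A minimal sketch of this greedy loop is given below, under stated assumptions: criterion(subset) is a hypothetical callback that returns a score to maximize (for instance an MI estimate; for the k-NN LOO error or the NNE the negated error would be used), and input variables are identified by their indices 0, ..., M − 1. The authors' own procedure is the one summarized in their Fig. 2.

```python
# Forward selection (Section 3.1.2): grow the input set one variable at a time,
# always adding the input that gives the best criterion value, and finally keep
# the best intermediate set. Evaluates M(M+1)/2 subsets in total.
def forward_selection(n_inputs, criterion):
    selected, remaining = [], list(range(n_inputs))
    best_set, best_score = None, float("-inf")
    while remaining:
        # score every possible single addition and keep the best one
        score, i = max((criterion(selected + [i]), i) for i in remaining)
        selected.append(i)
        remaining.remove(i)
        if score > best_score:
            best_set, best_score = list(selected), score
    return best_set, best_score
```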
3.1.3. Backward elimination or pruning

Backward elimination, also called the pruning [12] procedure, is the opposite of the forward selection process. In this strategy, the set S of selected inputs is initialized to contain all the input variables. Then, the input variable whose elimination maximizes the MI is removed from the set S, one by one, until the size of S is 1.

Basically, backward elimination is the same procedure as the forward selection presented in the previous section, but reversed. It evaluates the same number of input sets as forward selection, M(M + 1)/2. The same restriction also exists: optimality is not guaranteed.

3.1.4. Forward–backward selection

Both the forward selection and backward elimination methods suffer from an incomplete search. The forward–backward selection algorithm combines both methods. It offers the flexibility to reconsider input variables previously discarded and, vice versa, to discard input variables previously selected. It can start from any initial input set, including an empty, full or randomly initialized input set.

Let us suppose a set of inputs $X_i$, i = 1, 2, ..., M, and an output Y; the procedure of the forward–backward selection is summarized in Fig. 3.

It is noted that the selection result depends on the initialization of the input set. In this paper, two options are considered: one is to begin from the empty set and the other is to begin from the full set. The number of input sets to be evaluated varies and depends on the initialization of the input set, the stopping criteria and the nature of the problem. Still, it is not guaranteed that this selection method finds the globally optimal input set in all cases.
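The sketch below illustrates one plausible form of this search and is only an assumption about the details: at every step the single addition or removal that most improves the criterion is applied, and the search stops when no single change helps (the authors' exact procedure is the one summarized in their Fig. 3). The criterion callback is the same hypothetical score-to-maximize as in the forward selection sketch.

```python
# Forward-backward selection (Section 3.1.4): start from any initial input set
# and repeatedly apply the single best addition OR removal; stop when no single
# change improves the criterion. The stopping rule is an assumption of this sketch.
def forward_backward_selection(n_inputs, criterion, initial=()):
    current = set(initial)
    best_score = criterion(sorted(current)) if current else float("-inf")
    improved = True
    while improved:
        improved = False
        candidates = [current | {i} for i in range(n_inputs) if i not in current]
        if len(current) > 1:
            candidates += [current - {i} for i in current]
        if not candidates:
            break
        score, cand = max(((criterion(sorted(c)), c) for c in candidates),
                          key=lambda t: t[0])
        if score > best_score:
            best_score, current, improved = score, cand, True
    return sorted(current), best_score
```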
3.1.5. Global selection strategy

In order to select the best input set, we propose to use all four selection methods: forward selection, backward elimination, forward–backward selection initialized with an empty set of inputs and forward–backward selection initialized with a full set of inputs. All four selection methods are fast to perform, but they do not always converge to the same input set because of local minima. Therefore, it is necessary to use all of them to obtain a better selection. From the candidate input sets of all four selection methods, the one that optimizes the chosen criterion (k-NN, MI or NNE) is selected.

This combined strategy does not guarantee the selection of the best input set that would be obtained with the exhaustive search strategy. Nevertheless, the input selection is improved and the number of tested subsets is considerably reduced compared to the exhaustive search strategy.
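Expressed with the sketches above (plus a backward elimination routine written as the mirror image of forward selection), the combined strategy amounts to running the four searches and keeping the best-scoring candidate set. This is again an illustration with the same hypothetical criterion callback, not the authors' code.

```python
# Backward elimination (Section 3.1.3): start from the full set and drop, one by
# one, the input whose removal gives the best criterion value.
def backward_elimination(n_inputs, criterion):
    current = list(range(n_inputs))
    best_set, best_score = list(current), criterion(current)
    while len(current) > 1:
        score, i = max((criterion([j for j in current if j != i]), i) for i in current)
        current.remove(i)
        if score > best_score:
            best_set, best_score = list(current), score
    return best_set, best_score

# Global selection strategy (Section 3.1.5): run all four searches and keep the
# candidate input set with the best criterion value.
def global_selection(n_inputs, criterion):
    candidates = [
        forward_selection(n_inputs, criterion),
        backward_elimination(n_inputs, criterion),
        forward_backward_selection(n_inputs, criterion, initial=()),
        forward_backward_selection(n_inputs, criterion, initial=range(n_inputs)),
    ]
    return max(candidates, key=lambda c: c[1])
```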
3.2. Input selection criteria

3.2.1. k-Nearest neighbors

The k-NN approximation method is a very simple and powerful method. It has been used in many different applications, particularly for classification tasks [3]. The key idea behind the k-NN is that samples with similar inputs have similar output values. The nearest neighbors are selected according to the Euclidean distance, and their corresponding output values are used to obtain the approximation of the desired output. In this paper, the estimate of the output is calculated simply by averaging the outputs of the nearest neighbors:

$\hat{y}_i = \frac{1}{k}\sum_{j=1}^{k} y_{j(i)}$,  (4)

where $\hat{y}_i$ represents the estimate (approximation) of the output, $y_{j(i)}$ is the output of the jth nearest neighbor of the sample $x_i$ and k denotes the number of neighbors used.

The distances between samples are influenced by the input selection. Hence, the nearest neighbors, and thereby the approximation of the outputs, depend on the input selection.

The k-NN is a nonparametric method and only k, the number of neighbors, has to be determined. The selection of k can be performed by many different model structure selection techniques, for example k-fold cross-validation [9], leave-one-out [9], Bootstrap [5] and Bootstrap 632 [6]. These methods estimate the generalization error obtained for each value of k, and the selected k is the one that minimizes this error.

In [16], all these methods, the leave-one-out and the Bootstraps, select the same input sets. Moreover, the number of neighbors is more efficiently selected by the Bootstraps [16]. It has also been shown that the k-NN itself is a good approximator for time series [16]. In this paper, however, the k-NN is not used as an approximator, but as a tool to select the input set.
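As a concrete form of the first criterion, the sketch below computes the leave-one-out error of the k-NN approximator of Eq. (4) on a candidate input subset, using brute-force Euclidean distances; it is an illustrative implementation, not the authors' code.

```python
# k-NN leave-one-out criterion (Section 3.2.1): approximate each output by the
# average of its k nearest neighbours' outputs, Eq. (4), excluding the sample
# itself, and return the LOO mean squared error for the candidate input subset.
import numpy as np

def knn_loo_error(X, y, subset, k=5):
    Z = X[:, list(subset)]                       # keep only the candidate inputs
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # leave-one-out: exclude the sample itself
    idx = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours of each sample
    y_hat = y[idx].mean(axis=1)                  # Eq. (4)
    return np.mean((y - y_hat) ** 2)

def make_knn_criterion(X, y, k=5):
    # Wrap as a score to MAXIMIZE for the selection sketches above.
    return lambda subset: -knn_loo_error(X, y, subset, k)
```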
3.2.2. Mutual information

The MI is a measure of the dependence between variables, here between the inputs (regressor) and the output. Thus, the input subset X which gives the maximum MI is chosen to predict the output Y.

The definition of MI originates from entropy in information theory. For continuous random variables (scalar or vector), let $\mu_{X,Y}$, $\mu_X$ and $\mu_Y$ represent the joint probability density function and the two marginal density functions of the variables. The entropy of X is defined by Shannon [1] as

$H(X) = -\int_{-\infty}^{\infty} \mu_X(x) \log \mu_X(x)\,dx$,  (5)

where log is the natural logarithm, so the information is measured in natural units.

The remaining uncertainty of X, given Y, is measured by the conditional entropy

$H(X \mid Y) = -\int_{-\infty}^{\infty} \mu_Y(y) \int_{-\infty}^{\infty} \mu_X(x \mid Y = y) \log \mu_X(x \mid Y = y)\,dx\,dy$.  (6)

The joint entropy is defined as

$H(X, Y) = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mu_{X,Y}(x, y) \log \mu_{X,Y}(x, y)\,dx\,dy$.  (7)

The MI between the variables X and Y is defined as [4]

$MI(X, Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$.  (8)

From Eqs. (5) to (8), the MI is computed as

$MI(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mu_{X,Y}(x, y) \log \frac{\mu_{X,Y}(x, y)}{\mu_X(x)\,\mu_Y(y)}\,dx\,dy$.  (9)

For computing the MI, only estimates of the probability density functions $\mu_{X,Y}$, $\mu_X$ and $\mu_Y$ are required. In this paper, MI(X, Y) is estimated by the k-NN approach presented in [10]. In order to distinguish the number of neighbors used in the MI estimation from the one used in the k-NN approximation, this number of neighbors is denoted by l.
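For illustration, the following is a compact NumPy/SciPy rendering of that k-NN MI estimator (the first estimator of Kraskov et al. [10], max-norm version), written from the published description; numerical details such as tie handling may differ from the implementation used by the authors.

```python
# k-NN mutual information estimator in the spirit of Kraskov et al. [10]
# (their estimator I^(1)), with l neighbours and the max-norm.
import numpy as np
from scipy.special import digamma

def chebyshev_distances(A):
    # Pairwise max-norm distances, shape (N, N).
    return np.abs(A[:, None, :] - A[None, :, :]).max(-1)

def kraskov_mi(X, Y, l=6):
    X = np.asarray(X, float).reshape(len(X), -1)
    Y = np.asarray(Y, float).reshape(len(Y), -1)
    N = len(X)
    dx, dy = chebyshev_distances(X), chebyshev_distances(Y)
    dz = np.maximum(dx, dy)                       # max-norm in the joint (X, Y) space
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, l - 1]           # distance to the l-th neighbour
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    nx = (dx < eps[:, None]).sum(axis=1)          # points strictly inside eps in X-space
    ny = (dy < eps[:, None]).sum(axis=1)
    return digamma(l) + digamma(N) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

Used as a selection criterion, kraskov_mi(X[:, subset], y, l) would be the score to maximize in the selection sketches above.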
3.2.3. Nonparametric noise estimator using the gamma test

The gamma test (GT) is a technique for estimating the variance of the noise, i.e. the mean squared error (MSE), that can be achieved without overfitting [8]. The evaluation of the NNE is done using the GT estimation introduced by Stefansson in [17].

Given N input–output pairs $(x_i, y_i) \in \mathbb{R}^M \times \mathbb{R}$, the relationship between $x_i$ and $y_i$ can be expressed as

$y_i = f(x_i) + r_i$,  (10)

where f is the unknown function and r is the noise. The GT estimates the variance of the noise r.

The GT is useful for evaluating the nonlinear correlation between two random variables, namely input and output pairs. The GT has been introduced for model selection but also for input selection: the set of inputs that minimizes the GT is the one that is selected. Indeed, according to the GT, the selected set of inputs is the one that represents the relationship between inputs and output in the most deterministic way.

The GT is based on hypotheses coming from the continuity of the regression function. If two points x and x' are close in the input space, the continuity of the regression function implies that the outputs f(x) and f(x') will be close in the output space. Alternatively, if the corresponding output values are not close in the output space, this is due to the influence of the noise.

Two versions for evaluating the GT have been suggested. The first one evaluates the values of γ and δ on data sets of increasing size; the result for a particular parameter pair is then obtained by averaging the results over all set sizes. The newer, refined version bases the estimation on the k-NN differences instead of increasing the number of data points gradually. In order to distinguish the k used in the NNE context from the conventional k of the k-NN, the number of nearest neighbors is denoted here by p.

Let us denote the pth nearest neighbor of the point $x_i$ in the set $\{x_1, \ldots, x_N\}$ by $x_{p(i)}$. Then the variables $\gamma_N$ and $\delta_N$ are defined; for the outputs,

$\gamma_N(p) = \frac{1}{2N} \sum_{i=1}^{N} |y_{p(i)} - y_i|^2$.  (11)

The GT requires that both $|Hf(x)|$ and $|\nabla f(x)|$ are bounded, where the Hessian Hf is formed from the second derivatives with respect to the components $x_i$ and $x_j$ of x, and M is the number of variables. These two conditions are general and are usually satisfied in practical problems. The GT requires no other assumption on the smoothness properties of the regression function.
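The surviving text only shows $\gamma_N(p)$; as a hedged sketch of how the resulting noise estimate is typically obtained in the gamma test literature [8,17], the code below also computes the companion input-space statistic $\delta_N(p)$ (mean squared distance to the pth neighbour) for p = 1, ..., p_max and regresses $\gamma_N$ on $\delta_N$, taking the intercept at δ = 0 as the estimate of the noise variance. Treat the $\delta_N$ definition and the regression step as assumptions drawn from [8,17], not from the text above.

```python
# Gamma-test style nonparametric noise estimate (sketch). gamma_N(p) follows
# Eq. (11); delta_N(p) and the final linear fit, whose intercept at delta = 0
# estimates Var(r), follow the GT literature [8,17].
import numpy as np

def gamma_test_noise(X, y, p_max=10):
    X = np.asarray(X, float).reshape(len(X), -1)
    y = np.asarray(y, float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)                # neighbours of each point, by input distance
    deltas, gammas = [], []
    for p in range(1, p_max + 1):
        nn = order[:, p - 1]                      # index of the p-th nearest neighbour
        deltas.append(d2[np.arange(len(X)), nn].mean())        # delta_N(p)
        gammas.append(((y[nn] - y) ** 2).mean() / 2.0)         # gamma_N(p), Eq. (11)
    slope, intercept = np.polyfit(deltas, gammas, 1)
    return intercept                               # estimated noise variance
```

As with the other two criteria, the NNE of a candidate input subset would then be gamma_test_noise(X[:, subset], y), to be minimized.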
Table 1. Selected inputs for the Poland electricity load dataset. The numbers in the first row and first column represent the time steps (1–15) and the regressor indices (0 to −14), respectively. Symbol X marks the inputs selected by MI, O the inputs selected by NNE and D the inputs selected by k-NN.
References

[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[8] A.J. Jones, New tools in non-linear modeling and prediction, Comput. Manage. Sci. 1 (2004) 109–149.
[9] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, 1995.
[10] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Phys. Rev. E 69 (2004) 066138.
[11] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.
[12] G. Manzini, Perimeter search in restricted memory, Comput. Math. Appl. 32 (1996) 37–45.
[13] R. Meiri, J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications, Eur. J. Oper. Res. 171 (2006) 842–858.
[14] E. Rasek, A contribution to the problem of feature selection with similarity functionals in pattern recognition, Pattern Recognition 3 (1971) 31–36.
[15] Q. Shen, R. Jensen, Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring, Pattern Recognition 37 (2004) 1351–1363.
[16] A. Sorjamaa, N. Reyhani, A. Lendasse, Input and structure selection for k-NN approximator, in: J. Cabestany, A. Prieto, F.S. Hernandez (Eds.), Computational Intelligence and Bioinspired Systems: 8th International Work-Conference on Artificial Neural Networks, IWANN 2005, Barcelona, Spain, Lecture Notes in Computer Science, vol. 3512, Springer, Berlin/Heidelberg, 2005, pp. 985–991.
[17] A. Stefansson, N. Koncar, A.J. Jones, A note on the gamma test, Neural Comput. Appl. 5 (3) (1997) 131–133.
[18] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002) 85–105.
[19] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[20] M. Verleysen, D. François, The curse of dimensionality in data mining and time series prediction (invited talk), in: J. Cabestany, A. Prieto, F.S. Hernandez (Eds.), Computational Intelligence and Bioinspired Systems: 8th International Work-Conference on Artificial Neural Networks, IWANN 2005, Barcelona, Spain, Lecture Notes in Computer Science, vol. 3512, Springer, Berlin/Heidelberg, 2005, pp. 758–770.
[21] A. Weigend, N. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1994.
[22] Available from <http://www.esat.kuleuven.ac.be/sista/lssvmlab/>.
[23] <http://www.cis.hut.fi/projects/tsp/?page=Timeseries>.

Antti Sorjamaa was born in 1980 in a small city in northern Finland. He received his master's degree from Helsinki University of Technology in 2005. His Master's thesis is entitled "Strategies for the Long-Term Prediction of Time Series using Local Models". Currently, he is continuing as a Ph.D. student at HUT. He is the author or coauthor of six scientific papers in international journals, books or communications to conferences with reviewing committee. The topic of his research is missing value problems in temporal databases.

Jin Hao was born in China in 1980. She received her bachelor's degree from Beijing University of Technology in 2003 and her master's degree from Helsinki University of Technology in 2005. Her master's thesis is entitled "Input Selection using Mutual Information—Applications to Time Series Prediction". She is the author or coauthor of five scientific papers in international journals, books or communications to conferences with reviewing committee. She is now working in the Telecommunications and Networking Department of Samsung Electronics in Korea.

Nima Reyhani was born in the northern part of Iran (Persia) in 1979. He received his bachelor's degree from the University of Isfahan; during his bachelor studies, he was working at the Iran Telecom Research Center. He received his master's degree from Helsinki University of Technology, Finland. He is the author or coauthor of seven scientific papers in international journals, books or communications to conferences with reviewing committee. He is now a Ph.D. student at HUT and his field of research is noise estimation.

Yongnan Ji was born in 1981 in Daqing, in the northern part of China. He received his bachelor's degree from Harbin Institute of Technology, China, in 2003. In 2005, he received his master's degree from Helsinki University of Technology, Finland. The title of his master's thesis is "Least Squares Support Vector Machines for Time Series Prediction". He is the author or coauthor of four scientific papers in international journals, books or communications to conferences with reviewing committee. He is currently a Ph.D. student at HUT, working on machine learning algorithms for chemometrics data.

Amaury Lendasse was born in 1972 in Belgium. He received the M.S. degree in mechanical engineering from the Université catholique de Louvain (Belgium) in 1996, the M.S. in control in 1997 and the Ph.D. in 2003 from the same university. In 2003, he was a postdoctoral researcher in the Computational Neurodynamics Lab at the University of Memphis. Since 2004, he has been a senior researcher in the Adaptive Informatics Research Centre at the Helsinki University of Technology in Finland, where he leads the Time Series Prediction Group. He is the author or coauthor of 64 scientific papers in international journals, books or communications to conferences with reviewing committee. His research includes time series prediction, chemometrics, variable selection, noise variance estimation, determination of missing values in temporal databases, nonlinear approximation in financial problems, functional neural networks and classification.