Article
Transformers for Energy Forecast
Hugo S. Oliveira 1,2, * , Helder P. Oliveira 1,2
1 Institute for Systems and Computer Engineering, Technology and Science—INESC TEC,
University of Porto, 4200-465 Porto, Portugal; [email protected]
2 Faculty of Sciences (FCUP), University of Porto, 4169-007 Porto, Portugal
* Correspondence: [email protected]
Abstract: Forecasting models of energy consumption allow for improvements in building performance and reductions in energy consumption. Energy efficiency has become a pressing concern in recent years due to increasing energy demand and concerns over climate change. This paper addresses energy consumption forecasting as a crucial ingredient in technology to optimize building system operations and identify energy efficiency upgrades. The work proposes a modified multi-head transformer model focused on multi-variable time series, which uses a learnable weighting feature attention matrix to combine all input variables and properly forecast building energy consumption. The proposed multivariate transformer-based model is compared with two recurrent neural network models, showing robust performance and a lower mean absolute percentage error. Overall, this paper highlights the superior performance of the modified transformer-based model for multivariate, multistep energy consumption forecasting, allowing it to be incorporated into future forecasting tasks and to trace future energy consumption scenarios according to current building usage, playing a significant role in creating more sustainable and energy-efficient building usage.
1. Introduction
Building energy efficiency has become increasingly important as climate change and
energy security concerns have grown [1]. Building energy usage accounts for significant
global energy consumption and greenhouse gas emissions. Improving building energy efficiency is a crucial strategy for reducing energy consumption and mitigating climate change.
One emerging technology that can potentially improve building energy efficiency is energy forecast models, trained from live data from sensors and other sources [2].
Energy forecasts can be used to improve building energy efficiency in several ways. One approach is to optimize the operation of building systems, such as heating, ventilation, and air conditioning (HVAC) systems [3], according to the current demand.
By forecasting and analyzing the building's energy performance, operators can identify ways to reduce energy consumption while still maintaining comfort and safety. Another approach is to use the forecast models to identify and prioritize energy efficiency upgrades [4].
Developing an effective energy forecast model requires heterogeneous sensor data to
unveil hidden building usage patterns. Figure 1 shows how building sensor data, combined with additional data sources and machine-learning models, can be used to represent the building energy consumption pattern.
In addition to improving building energy efficiency, forecasting energy consumption allows operators to identify ways to optimize the structure for different use cases by simulating different scenarios.
The energy forecast model is based on modifying the multi-attention transformer model by including a learnable weighting feature attention matrix to address building energy efficiency. The model is leveraged by analyzing live sensor data from the Institute for Systems and Computer Engineering, Technology and Science (INESC TEC) Research Building, consisting of computer research labs, auditoriums and support facilities, hosting more than 700 researchers, staff and other personnel.
The objective is to forecast the energy consumption for the next 10 days, allowing
the application of effective mitigation procedures regarding energy waste and quickly
assessing how these mitigation procedures impact the actual energy consumption.
With this in mind, this research work is organized as follows: Section 2 presents a comprehensive review of forecasting models, focused on energy forecasting and maintenance; Section 3 presents the methodology and techniques employed; Section 4 contains the data analysis and forecast modeling; Section 5 contains the discussion of results; and Section 6 presents the conclusions.
2. Literature Review
Building energy forecast models are an excellent tool for understanding and improving infrastructure efficiency in many fields. They allow for the creation of several data scenarios to foresee the impact of building usage modifications on energy consumption.
Maintenance decision-making and sustainable energy are the main concerns of [5], by
employing the forecasting of electrical energy consumption in equipment maintenance by
means of an artificial neural network (ANN) and particle swarm optimization (PSO). With
the same objective, [6] employs fully data-driven analysis and modeling by first analyzing
the linear correlation between the outdoor environmental parameters with the actual
measured energy consumption data and then employing a backpropagation artificial neural network (BP-ANN) to forecast energy consumption, enabling low-carbon operation and maintenance of the building HVAC systems.
The issue of CO2 emissions poses a significant challenge to building efficiency. In the
study conducted by [7], the research delves into both the advantages and limitations of
conventional energy consumption mitigation methods. Notably, the study emphasizes
the potential of integrating 2D GIS and 3D GIS (CityGML) [8] with energy prediction
approaches. This combination considers frequent interventions at the building scale,
applicability throughout the building’s life cycle and the conventional energy consumption
forecasting process.
In the work of [9], the emphasis is placed on the significance of forecasting energy usage
in building energy planning, management and optimization. The review acknowledges the
potential of deep learning approaches in effectively handling vast quantities of data, extracting
relevant features and enhancing modeling capabilities for nonlinear phenomena.
The use of LSTM and GRU for energy forecasting was the subject of study in [10]. LSTM combinations with CNN were proposed by [11] for the same purpose. The proposed CNN-LSTM model employs a CNN to extract complex non-linear features, combined with an LSTM that handles long-term dependencies by modeling the temporal information in the time-series data.
LSTM, Bi-LSTM and GRU are employed in [12] to perform occupancy prediction in
buildings with different space types. Another example related to intelligent vehicle systems
can be found in [13], where the authors evaluated the performance of different architectures,
including LSTM, to predict vehicle stop activity based on vehicular time-series data.
In a similar way, [14] exploits CNN–LSTM models for household energy consump-
tion, with a particular emphasis on data normalization. The transformers for time series
are also a subject of interest in [15], proposing Autoformer, a novel architecture with an
auto-correlation mechanism based on the series periodicity, which conducts the depen-
dencies discovery and representation aggregation at the sub-series level, outperforming
self-attention in both efficiency and accuracy.
Although energy forecasting models mostly target forecasting the variables of interest, they can also be employed to define maintenance operations, allowing for the mitigation of operational costs in buildings or manufacturing facilities [16].
Current time-series forecast models have several limitations, namely in exploiting all the available information in a meaningful way, which harms forecasting over longer periods of time. This concern makes it vital to develop models that use all information in a multivariate way and capture very long patterns, which are useful for a correct energy consumption forecast.
3. Methodology
The main objective of this work is to employ a modified multi-variable transformer to forecast building energy consumption for the next 10-day period (250 h).
To establish a comparison baseline, multistep LSTM/GRU models are also employed and trained with the same data. The overall description of the proposed and evaluated models is given below.
The equations for the long short-term memory (LSTM) [17] model are as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t                                            (1)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)

where
i_t represents the input gate activation at time step t,
f_t represents the forget gate activation at time step t,
c̃_t represents the candidate cell state at time step t,
c_t represents the cell state at time step t,
o_t represents the output gate activation at time step t,
h_t represents the hidden state (output) at time step t,
x_t represents the input at time step t,
h_{t−1} represents the hidden state at the previous time step (t − 1),
c_{t−1} represents the cell state at the previous time step (t − 1),
W represents weight matrices,
b represents bias vectors,
σ represents the sigmoid activation function,
⊙ represents the element-wise multiplication (Hadamard product).
These equations describe the operations performed by an LSTM cell to update and
pass information through time steps in a recurrent neural network architecture.
The memory cell is updated based on the input and forget gates, while the hidden state is updated based on the memory cell and the output gate. In summary, the LSTM has two main hyperparameters: the number of LSTM layers, which determines the network depth, and the number of LSTM units, which defines the size of the hidden state and memory cell and controls the network's ability to store information.
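As an illustration only (not the authors' exact implementation), the minimal PyTorch sketch below shows how these hyperparameters (number of layers and hidden size) map onto a multistep LSTM forecaster; the layer sizes, feature count and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal multistep LSTM forecaster (illustrative sketch, not the paper's exact model)."""
    def __init__(self, n_features: int, hidden_size: int = 256, num_layers: int = 2,
                 horizon: int = 250, dropout: float = 0.2):
        super().__init__()
        # num_layers controls network depth; hidden_size controls the size of the
        # hidden state and memory cell (the network's capacity to store information).
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=num_layers, dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)  # maps the last hidden state to a 250-h forecast

    def forward(self, x):                 # x: (batch, look_back, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # (batch, horizon)

# Example usage with synthetic data shaped like the building time series.
model = LSTMForecaster(n_features=5)
x = torch.randn(16, 32, 5)   # batch of 16 windows, look-back of 32, 5 sensor features
y_hat = model(x)             # -> (16, 250) hourly forecasts
```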
GRU [18] is also a type of RNN widely used in time-series forecasting. It addresses the same problem as the LSTM through a simpler architecture, described in Figure 3.
The GRU cell also contains a hidden state, which is used to pass information between cells in the network. Unlike the LSTM, the GRU only has one memory cell, which is updated using the reset and update gates. Both models commonly employ the dropout technique [19] to prevent overfitting and optimizers with adjustable learning rates, such as Adam [20] or SGD [21].
The equations for the gated recurrent unit (GRU) model are as follows:

z_t = σ(W_z · [h_{t−1}, x_t] + b_z)          (update gate)
r_t = σ(W_r · [h_{t−1}, x_t] + b_r)          (reset gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t] + b_h)                              (2)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where
z_t represents the update gate activation at time step t,
r_t represents the reset gate activation at time step t,
h̃_t represents the candidate hidden state at time step t,
h_t represents the hidden state (output) at time step t,
x_t represents the input at time step t,
h_{t−1} represents the hidden state at the previous time step (t − 1),
W represents weight matrices,
b represents bias vectors,
σ represents the sigmoid activation function,
⊙ represents the element-wise multiplication (Hadamard product).
These equations describe the operations performed by a GRU cell to update and pass
information through time steps in a recurrent neural network architecture.
Overall, LSTM and GRU architectures are both effective for time-series forecasting,
and the choice between the two will depend on the specific requirements of the task at hand.
In general, LSTM is a better choice for tasks requiring capturing long-term dependencies,
while GRU may be more appropriate for tasks requiring faster training times or datasets
with shorter-term dependencies. Although many variations of the LSTM and GRU have been proposed to improve forecasting capabilities, such as the bidirectional LSTM [22], which utilizes information from both directions, or the bidirectional GRU [23] with the same purpose, the fundamental architectures are used in this work.
The final positional encoding vector P aggregates both the even and odd positional encodings, resulting in a final input embedding X + P.
Concerning the encoder, it is formed by a stack of N_l identical layers, where N_l is a hyperparameter. Each layer is formed by a multi-head self-attention mechanism combined with a position-wise fully connected feed-forward network. Each sub-layer employs a residual connection followed by a layer normalization, with the output expressed in Equation (3).
Output = LayerNorm(y + SubLayer(y)) (3)
All sub-layers and embedding layers inside the model generate an output dimension D equal to the model size d_model, considering the input features input_feature. In this case, the optimal value was found to be 250 × input_feature to encompass the full 10-day forecast period.
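As an illustration only (not the authors' implementation), the sinusoidal positional encoding and the residual-plus-normalization wrapper of Equation (3) can be sketched in PyTorch as follows; the class names and the assumption of an even d_model are illustrative.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal encoding: sine on even indices, cosine on odd indices (d_model assumed even)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to the input embedding X, giving X + P

class ResidualSubLayer(nn.Module):
    """Residual connection followed by layer normalization, as in Equation (3)."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.2):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, y):
        # Output = LayerNorm(y + SubLayer(y))
        return self.norm(y + self.drop(self.sublayer(y)))
```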
The decoder is very similar to the encoder. However, it adds a third sub-layer that performs multi-head attention over the output of the stacked encoders, in addition to the two sub-layers, using residual connections and layer normalization similar to the encoder. Similarly, the number of decoder layers is a hyperparameter to be determined. The final decoder stack embedding output is offset by one position, preventing current positions from being mixed with subsequent ones when using squared subsequent masking.
The input queries Q, keys K and values V of dimension d_v are subject to the scaled dot product of the queries with all keys, normalized by the square root of the key dimension d_k, i.e., √d_k, to overcome the monotonic magnitude growth with the increase of the dimension d, which leads to vanishing gradients (Figure 4). The final attention matrix A that encapsulates the packed Q, K, V is obtained using a softmax to gather the weights of the values, as represented in Equation (4):

A(Q, K, V) = softmax(QK^T / √d_k) V (4)
Figure 4 shows how the queries Q, keys K and values V are combined through the dot product, passing through the scaling layer, the optional masking layer, the softmax and a final matrix multiplication to form the attention matrix A.
The multi-head attention is obtained by concatenating the individual heads and projecting them, as expressed in Equation (5):

MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O (5)

with h being the number of attention heads, with each head represented as head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), containing Q queries and K keys as a result of the dot product, and the projection matrices corresponding to parameter matrices W_i^Q ∈ R^{D×d_k}, W_i^K ∈ R^{D×d_k} and W_i^V ∈ R^{D×d_v}. Following this operation, the outputs are concatenated and multiplied by the weighting matrix W^O ∈ R^{h d_v × D} [24].
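For illustration, the sketch below implements the scaled dot-product attention of Equation (4) and the multi-head combination of Equation (5); PyTorch's built-in nn.MultiheadAttention could equally be used, and the class names and dimensions here are assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V, mask=None):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Equation (4)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # optional subsequent masking
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

class MultiHeadAttention(nn.Module):
    """head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); heads are concatenated and projected by W^O."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the W^O projection

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        split = lambda t: t.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        heads = scaled_dot_product_attention(split(self.w_q(q)), split(self.w_k(k)),
                                             split(self.w_v(v)), mask)
        concat = heads.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_head)
        return self.w_o(concat)
```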
The proposed multistep transformer with the modified attention heads and input embedding follows the diagram of Figure 6, which is largely based on [24], but with a modification of the multi-attention heads and of how the inputs X are combined: instead of a dedicated embedding for each input feature, a corresponding embedding is obtained through a learnable weight attention matrix. The final proposed aggregated input embedding is expressed in Equation (6).
ê_t = Σ_{j=1}^{n} m_{X_t}^{(j)} e_t^{(j)} (6)

where e_t^{(j)} corresponds to the linearly transformed feature input j, and m_{X_t}^{(j)} is the learned matrix weight regarding the combination of each of the transformed input features at time t towards the final embedding ê_t at instant t.
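A minimal sketch of how the learnable weighting of Equation (6) could be realized is given below; this is an interpretation of the description, not the authors' released code, and the per-feature linear embeddings and softmax normalization of the weights are assumptions.

```python
import torch
import torch.nn as nn

class WeightedFeatureEmbedding(nn.Module):
    """Aggregates per-feature embeddings with learnable weights, following Equation (6)."""
    def __init__(self, n_features: int, d_model: int):
        super().__init__()
        # One linear embedding per input feature j, producing e_t^(j).
        self.embed = nn.ModuleList([nn.Linear(1, d_model) for _ in range(n_features)])
        # Learnable combination weights m^(j) (softmax-normalized here; an assumption).
        self.m = nn.Parameter(torch.ones(n_features))

    def forward(self, x):  # x: (batch, seq_len, n_features)
        w = torch.softmax(self.m, dim=0)
        e = torch.stack([emb(x[..., j:j + 1]) for j, emb in enumerate(self.embed)], dim=-1)
        # e: (batch, seq_len, d_model, n_features); weighted sum over features -> ê_t
        return (e * w).sum(dim=-1)
```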
4.1. Models
To evaluate the performance of the proposed transformer and baseline models, after training, the models infer over 250 points of historical data (a test input corresponding to 10 days) and forecast another 250 hourly future points (corresponding to the 10-day period ahead). For training, 70% of the data, starting from day 1, are set for training, the subsequent 10% for validation and the final 20% for testing purposes.
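A simple sketch of this chronological 70/10/20 split is shown below (the fractions come from the text; the function name and array interface are illustrative).

```python
import numpy as np

def chronological_split(series: np.ndarray):
    """Split a time series chronologically into 70% train, 10% validation and 20% test."""
    n = len(series)
    train_end, val_end = int(0.7 * n), int(0.8 * n)
    return series[:train_end], series[train_end:val_end], series[val_end:]
```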
In more detail, the proposed transformer blocks follow the arrangement of Figure 6
with an aggregation of the multivariate input features using Equation (6).
The LSTM and GRU models use 250-unit cells to forecast the same period length, as described in Figure 7.
Considering the multistep forecast, an additional repeat vector layer and a time-distributed dense layer are added to the architecture, as sketched below.
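The repeat-vector and time-distributed dense layers are Keras concepts; a PyTorch equivalent (an assumption about the architecture, not the authors' exact code) repeats the encoder's final state across the forecast horizon and applies a shared dense layer at every step.

```python
import torch
import torch.nn as nn

class Seq2SeqGRUForecaster(nn.Module):
    """Encoder GRU -> repeat context over the horizon -> decoder GRU -> per-step dense layer."""
    def __init__(self, n_features: int, hidden_size: int = 128, horizon: int = 250):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(n_features, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)   # applied at every time step (time-distributed)

    def forward(self, x):                       # x: (batch, look_back, n_features)
        _, h = self.encoder(x)                  # h: (1, batch, hidden)
        context = h[-1].unsqueeze(1).repeat(1, self.horizon, 1)  # repeat vector over the horizon
        dec, _ = self.decoder(context)
        return self.out(dec).squeeze(-1)        # (batch, horizon)
```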
MSE considers the average of the squared errors between the real value y_i and the predicted ŷ_i in a given set of N samples, expressed as

MSE(y, ŷ) = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² . (8)
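For reference, the two evaluation metrics can be computed as below; the MAPE is written in its usual percentage form, an assumption since its equation is not reproduced in this excerpt.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error, Equation (8)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

def mape(y, y_hat):
    """Mean absolute percentage error; valid here because the series never crosses zero."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))
```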
4.3. Dataset
To construct a model capable of forecasting energy consumption, the first task is to
analyze the data produced by several sensors during one year. The activities conducted at
the building are mainly research regarding computer science and electrical engineering and
support services, composed of research labs, human resources, auditoriums, restaurants
and other common building infrastructure. The building comprises two blocks of four floors
encompassing several room characteristics, such as computer labs, meeting auditoriums,
restaurants and service spaces.
Each of the individual rooms contains a set of sensors to measure environmental variables and energy consumption through a dedicated sensor network. The deployed sensors in each room perform a combined measurement of humidity; temperature; CO2 concentration; pressure, expressed in Pascals; and local energy consumption, for a total of 144 sensors.
The dataset includes the following:
• Real-time historical data from the INESC TEC building.
• A two-year time span.
• A total of 8760 × 2 sample points (2 years).
Figure 8 clearly highlights that weekends account for lower consumption, and the
morning and noon periods account for a large part of the total building consumption,
corroborated by the fact that it is a computer science research lab. However, it is clear that
Thursday accounts for a peak in average room consumption, with Friday showing a clear decline in energy consumption.
Regarding the energy consumption among the building’s main floors and sections,
Figure 9 aggregates the average consumption per floor.
Figure 9. Average consumption per building floor (A or B means the building block and the number
of the floor).
From Figure 9, it is possible to identify a clear outlier, namely, floor B0, which corre-
sponds to the location of the restaurant and cafeteria of the building, where large energy
kitchen appliances are present. Conversely, floor B5 accounts for the lowest average energy consumption, explained by the fact that this floor only contains small auxiliary building devices. The restaurant information can be corroborated by Figure 10.
Figure 10. Average session consumption per building floor compared to lower floor consumption.
Figure 10 shows that floor B0 has a higher energy peak in the morning and at noon compared to the floor with the second largest consumption (A4), corresponding to the services floor. The consumption peak on floor B0 coincides with the busiest period of the restaurant, at lunchtime.
Figure 11 shows the mean building consumption by month, with clear evidence that August accounts for lower consumption due to vacations and that the winter months show a substantial increase in consumption due to lighting and heating.
From Figure 12, there is clear evidence that auditoriums present higher humidity levels due to less usage and exposure to the sun. Considering that these large auditoriums are located on the lower floor, they present higher humidity due to a lower temperature (Figure 13), a well-known ambient variable phenomenon.
The auto-correlation analysis shows that the auto-correlation is significantly non-zero for most lags. Therefore, the time series is not random, presenting some degree of seasonal patterns.
In order for the models to converge properly, the dataset is pre-processed to find outliers or erroneous measures that may harm the training of the models, replacing those values with the mean of the previous and next values, X_i = ½(x_{i−1} + x_{i+1}). Furthermore, the data were normalized with min–max scaling before being fed to the models, and the reverse process was applied to recover the target forecasted value.
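A sketch of this preprocessing is given below, replacing flagged outliers by the mean of their neighbours and applying a reversible min–max scaling; the outlier-flagging mask and class names are illustrative assumptions, as the paper does not specify how outliers are detected.

```python
import numpy as np

def replace_outliers(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace flagged samples with the mean of the previous and next values: X_i = (x_{i-1} + x_{i+1}) / 2."""
    x = x.copy()
    for i in np.where(mask)[0]:
        if 0 < i < len(x) - 1:
            x[i] = 0.5 * (x[i - 1] + x[i + 1])
    return x

class MinMaxScaler1D:
    """Min-max normalization with the inverse transform used to recover forecasts in original units."""
    def fit(self, x):
        self.lo, self.hi = float(np.min(x)), float(np.max(x))
        return self
    def transform(self, x):
        return (x - self.lo) / (self.hi - self.lo)
    def inverse_transform(self, x):
        return x * (self.hi - self.lo) + self.lo
```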
5. Results
Several combinations of hyperparameters were used to train and evaluate the three models effectively. All models were trained for 100 epochs on a machine with 2× Intel Xeon Gold 2.5 GHz CPUs (20 cores each) and 384 GB of RAM, fitted with two Tesla V100 GPUs (32 GB each) and two GTX 2080 GPUs (11 GB each). A 5-fold cross-validation (5 k-CV) was employed to determine the set of the best hyperparameters for each configuration and to evaluate the model on the test set.
In order to compare the different time-series forecasts, MAPE was employed, since it is a scale-independent measurement; our time series does not cross zero, so the undefined-division problem is circumvented.
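As an illustration of the hyperparameter selection loop under 5-fold cross-validation (not the authors' script), the sketch below uses scikit-learn's KFold; `train_and_score` is a placeholder for the actual training routine.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(config, X, y, train_and_score, n_splits=5):
    """Average 5-fold validation score for one hyperparameter configuration.
    `train_and_score(config, X_tr, y_tr, X_va, y_va)` is a placeholder callable that
    trains a model with `config` and returns its validation MAPE."""
    scores = []
    for tr_idx, va_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        scores.append(train_and_score(config, X[tr_idx], y[tr_idx], X[va_idx], y[va_idx]))
    return float(np.mean(scores))
```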
The final transformer model architecture took approximately 20 h of training, with an average of 1000 s per epoch, approximately 1.6 times longer than the LSTM and 1.9 times longer than the GRU, with the final convergence curves represented in Figure 15.
Figure 15. Convergence curves of all evaluated models (training and validation).
From Figure 15, it is evident that the GRU models presented some degree of overfitting compared with the LSTM and the proposed transformer model. The transformer model shows a more stable convergence over the set of epochs, mainly due to its larger complexity when compared with the other RNN models.
After careful debugging and hyperparameter tuning, the transformer with six heads and six stacked identical encoders obtained the best result (Table 1), with a maximum dropout of 0.2.
The decoder block employs a linear transformation of the input data with in_features = feature_size = 250 × n and out_features = 1, corresponding to the forecast of 10 days (250 h).
The loss was the L2-norm, employing SGD with gamma = 0.98, a learning rate of lr = 0.005 and step_size = 10, using a batch size of 16. The transformer's total number of parameters was 16,330,827, with a look-back window of 34.
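These optimizer and scheduler settings map naturally onto PyTorch's SGD and StepLR; the sketch below reproduces the reported numbers, while `model` and `train_loader` are assumed placeholders rather than the authors' objects.

```python
import torch
import torch.nn as nn

def make_optim(model: nn.Module):
    """Loss, optimizer and scheduler matching the reported settings."""
    criterion = nn.MSELoss()                                    # L2-norm loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005)   # learning rate from the text
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.98)
    return criterion, optimizer, scheduler

def train_one_epoch(model, train_loader, criterion, optimizer):
    """One pass over batches of size 16 with a look-back window of 34 (per the text)."""
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # scheduler.step() would be called once per epoch outside this function.
```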
Regarding the LSTM, the optimal configuration uses 256 nodes per layer (Table 1), a dropout of 0.2, a decay rate of 0.99 and a look-back window of 32, employing Adam optimization with a learning rate/step size of lr = 0.005, β1 = 0.9 and β2 = 0.99, using an L2 loss with a batch size of 16.
Regarding the GRU, the optimal configuration uses 200 cells with 128 nodes per layer, a dropout of 0.2, Xavier weight initialization, a decay rate of 0.99 and a look-back window of 26, employing Adam optimization with a learning rate/step size of lr = 0.005, β1 = 0.9 and β2 = 0.99, using an L2 loss with a batch size of 16.
Table 1 summarizes the results obtained for all the evaluated models on the test input samples (MAPE and MSE).

LSTM
N Cells   N Nodes   Window   Parameters   MAPE      MSE
400       128       32       18M          18.11%    15.43%
300       128       32       16M          16.34%    13.35%
200       256       32       13M          17.41%    14.45%
300       256       32       14M          14.26%    11.65%
200       128       24       12M          12.42%    10.02%
250       256       32       14M          10.02%    7.04%

GRU
N Cells   N Nodes   Window   Parameters   MAPE      MSE
400       128       32       14M          23.56%    21.43%
300       128       32       13M          15.98%    12.54%
200       256       32       12M          13.93%    11.56%
300       256       12       13M          15.34%    12.76%
200       128       26       11M          11.66%    9.43%
250       128       32       12M          13.59%    10.49%

Transformer
Heads   Enc/Dec   Window   Parameters   MAPE      MSE
10      10/10     32       24M          12.33%    10.61%
10      6/6       32       14M          11.25%    9.75%
10      5/5       32       13M          10.43%    8.27%
6       10/10     32       20M          10.24%    8.11%
6       6/6       32       16M          7.09%     5.42%
5       5/5       32       13M          8.36%     6.62%
Figure 16 presents a side-by-side comparison of the top-performing models against the ground-truth values in the test set.
Figure 16. The forecast of energy consumption for each best-performing model. (a) Transformer.
(b) LSTM. (c) GRU.
However, further changes and improvements are necessary to enhance the model’s
efficiency and robustness to ensure greater competence in real-world data-driven models
for different sectors.
6. Conclusions
Energy forecasting is crucial for building characterization, which often manifest un-
predictable energy consumption patterns that are not captured by models, leading to a
degradation of their performance. Deep learning approaches, on the contrary, allow for a
model of non-linear dependencies and capture relevant information to perform forecasts.
In this article, a modification of a multi-head multivariable transformer model is proposed for building energy consumption forecasting, complemented by a comprehensive performance comparison with common RNN models, such as the GRU and LSTM, on a dataset of building energy and environment data collected in real time. MSE and MAPE were used to evaluate the models. The performance of the multivariate transformer model, using a multi-head attention mechanism and a modified input embedding, is almost 3.2 p.p. better than the best-trained baselines.
The construction of a richer dataset must follow the guidelines of the ROBOD project [26],
which encompasses many sensors, HVAC, building occupancy and WiFi traffic, among others.
The main drawback of the multi-head attention transformer concerns its complexity and training time and the limited set of training features. Thus, training on a richer dataset, such as the Building Data Genome Project 2 [27] or the Global Occupant Behaviour database [28], would enable us to further validate the proposed forecast model with more variables, allowing us to foresee the potential of transformers in current energy forecasting, providing accurate energy consumption forecasts.
Abbreviations
The following abbreviations are used in this manuscript:
NLP Natural language processing
CNN Convolution neural networks
ViT Visual transformers
AUC Area under the curve
MLP Multi layer perceptron
BERT Bidirectional encoder representation of transformers
DT Digital twin
GRU Gated recurrent unit
LSTM Long short term memory
References
1. Bielecki, J. Energy security: Is the wolf at the door? Q. Rev. Econ. Financ. 2002, 42, 235–250.
2. Madni, A.M.; Madni, C.C.; Lucero, S.D. Leveraging digital twin technology in model-based systems engineering. Systems 2019,
7, 7. [CrossRef]
3. Chen, K.; Zhu, X.; Anduv, B.; Jin, X.; Du, Z. Digital twins model and its updating method for heating, ventilation and air
conditioning system using broad learning system algorithm. Energy 2022, 251, 124040. [CrossRef]
4. Teng, S.Y.; Touš, M.; Leong, W.D.; How, B.S.; Lam, H.L.; Máša, V. Recent advances on industrial data-driven energy savings:
Digital twins and infrastructures. Renew. Sustain. Energy Rev. 2021, 135, 110208. [CrossRef]
5. Jiang, X.; Ling, H.; Yan, J.; Li, B.; Li, Z. Forecasting electrical energy consumption of equipment maintenance using neural network
and particle swarm optimization. Math. Probl. Eng. 2013, 2013, 194730. [CrossRef]
6. Zhao, X.; Yin, Y.; Zhang, S.; Xu, G. Data-driven prediction of energy consumption of district cooling systems (DCS) based on the
weather forecast data. Sustain. Cities Soc. 2023, 90, 104382. [CrossRef]
7. Khahro, S.H.; Kumar, D.; Siddiqui, F.H.; Ali, T.H.; Raza, M.S.; Khoso, A.R. Optimizing energy use, cost and carbon emission
through building information modelling and a sustainability approach: A case-study of a hospital building. Sustainability 2021,
13, 3675. [CrossRef]
8. Gröger, G.; Plümer, L. CityGML–Interoperable semantic 3D city models. ISPRS J. Photogramm. Remote. Sens. 2012, 71, 12–33.
[CrossRef]
9. Runge, J.; Zmeureanu, R. A review of deep learning techniques for forecasting energy use in buildings. Energies 2021, 14, 608.
[CrossRef]
10. Mahjoub, S.; Chrifi-Alaoui, L.; Marhic, B.; Delahoche, L. Predicting Energy Consumption Using LSTM, Multi-Layer GRU and
Drop-GRU Neural Networks. Sensors 2022, 22, 4062.
11. Somu, N.; MR, G.R.; Ramamritham, K. A deep learning framework for building energy consumption forecast. Renew. Sustain.
Energy Rev. 2021, 137, 110591. [CrossRef]
12. Tekler, Z.D.; Chong, A. Occupancy prediction using deep learning approaches across multiple space types: A minimum sensing
strategy. Build. Environ. 2022, 226, 109689. [CrossRef]
13. Low, R.; Cheah, L.; You, L. Commercial vehicle activity prediction with imbalanced class distribution using a hybrid sampling
and gradient boosting approach. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1401–1410. [CrossRef]
14. Berriel, R.F.; Lopes, A.T.; Rodrigues, A.; Varejao, F.M.; Oliveira-Santos, T. Monthly energy consumption forecast: A deep learning
approach. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, Anchorage, AL, USA,
14–19 May 2017; pp. 4283–4290.
15. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.
Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
16. Sanzana, M.R.; Maul, T.; Wong, J.Y.; Abdulrazic, M.O.M.; Yip, C.C. Application of deep learning in facility management and
maintenance for heating, ventilation, and air conditioning. Autom. Constr. 2022, 141, 104445. [CrossRef]
17. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
18. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
19. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
20. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
21. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
22. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of
the 2019 IEEE International Conference on Big Data (Big Data), IEEE, Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
23. Ayoobi, N.; Sharifrazi, D.; Alizadehsani, R.; Shoeibi, A.; Gorriz, J.M.; Moosaei, H.; Khosravi, A.; Nahavandi, S.; Chofreh, A.G.;
Goni, F.A.; et al. Time series forecasting of new cases and new deaths rate for COVID-19 using deep learning methods. Results
Phys. 2021, 27, 104495. [CrossRef]
24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In
Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9
December 2017.
25. Deb, C.; Zhang, F.; Yang, J.; Lee, S.E.; Shah, K.W. A review on time series forecasting techniques for building energy consumption.
Renew. Sustain. Energy Rev. 2017, 74, 902–924. [CrossRef]
26. Tekler, Z.D.; Ono, E.; Peng, Y.; Zhan, S.; Lasternas, B.; Chong, A. ROBOD, room-level occupancy and building operation dataset.
In Building Simulation; Springer: Berlin, Germany, 2022; Volume 15, pp. 2127–2137.
27. Miller, C.; Kathirgamanathan, A.; Picchetti, B.; Arjunan, P.; Park, J.Y.; Nagy, Z.; Raftery, P.; Hobson, B.W.; Shi, Z.; Meggers, F. The
Building Data Genome Project 2, energy meter data from the ASHRAE Great Energy Predictor III competition. Sci. Data 2020,
7, 368. [CrossRef] [PubMed]
28. Dong, B.; Liu, Y.; Mu, W.; Jiang, Z.; Pandey, P.; Hong, T.; Olesen, B.; Lawrence, T.; O’Neil, Z.; Andrews, C.; et al. A global building
occupant behavior database. Sci. Data 2022, 9, 369. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.