\tocauthor

Marina Moran, Javier Balladini, Dolores Rexachs and Emilio Luque ¹¹institutetext: Facultad de Informática, Universidad Nacional del Comahue, Buenos Aires 1400, 8300 Neuquén, Argentina
¹¹email: [email protected], ²²institutetext: Departamento de Arquitectura de Computadores y Sistemas Operativos, Universitat Autònoma de Barcelona, Campus UAB, Edifici Q, 08193 Bellaterra (Barcelona), España

Checkpoint and Restart: An Energy Consumption Characterization in Clusters

Marina Morán 11 Javier Balladini 11 Dolores Rexachs 22 Emilio Luque 22

Abstract

The fault tolerance method currently used in High Performance Computing (HPC) is the rollback-recovery method by using checkpoints. This, like any other fault tolerance method, adds an additional energy consumption to that of the execution of the application. The objective of this work is to determine the factors that affect the energy consumption of the computing nodes on homogeneous cluster, when performing checkpoint and restart operations, on SPMD (Single Program Multiple Data) applications. We have focused on the energetic study of compute nodes, contemplating different configurations of hardware and software parameters. We studied the effect of performance states (states P) and power states (states C) of processors, application problem size, checkpoint software (DMTCP) and distributed file system (NFS) configuration. The results analysis allowed to identify opportunities to reduce the energy consumption of checkpoint and restart operations.

keywords:

checkpoint, restart, energy consumption, power, fault tolerance methods

1 Introduction

^†^†footnotetext: The final authenticated version is available online at https://doi.org/10.1007/978-3-030-20787-8_2

High Performance Computing (HPC) continues to increase its computing power while increasing its energy consumption. Given the limitations that exist to supply energy to this type of computers, it is necessary to know the behavior of energy consumption in these systems, to find ways to limit and decrease it. In particular, for the exaescala era, a maximum limit of 20 MW is estimated [17].

The fault tolerance method most currently used in HPC is the rollback-recovery method by using checkpoints. This, like any other fault tolerance method, adds an additional energy consumption to the execution of the application [6]. Due to this, it is important to know and predict the energy behavior of fault tolerance methods, in order to manage their impact on the total energy consumption during the execution of an application.

The objective of this work, which is an extension of [15], is to determine the factors that affect the energy consumption of checkpoint and restart operations (C/R), on Single Program Multiple Data (SPMD) applications on homogeneous cluster, contemplating different configurations of hardware and software parameters. A cluster system has compute nodes, storage nodes, and at least one interconnection network. We have focused on the energetic study of computing nodes, and we have extended the previous work contemplating, on the one hand, a second experimentation platform, and on the other hand, the storage node, in particular when studying the impact of checkpoint files compression. The energy consumption of the network is not considered in this article.

The contributions of this article are:

•

A study of the system’s own factors (hardware and software) and of applications factors that impact the energy consumption produced by checkpoint and restart operations.
•

The identification of opportunities to reduce the energy consumption of checkpoint and restart operations.

Section 2 presents some related works, while in section 3 the factors that affect the energy consumption are identified. Section 4 mention the experimental platform and the design of experiments, whose results and their analysis are presented in the section 5. Finally, the conclusions and future work can be found in the section 6.

2 Related work

[5] and [4] evaluates the energetic behavior of the coordinated and uncoordinated C/R with message logs. In [4] they also evaluate the parallel recovery and propose an analytic model to predict the behavior of these protocols at exascale. In [3] they use an analytic model to compare the execution time and the consumed energy of the replication and the coordinated C/R. [1] and [7] presents an analytic model to estimate the optimal interval of a multilevel checkpoint in terms of energy consumption. They do not measure dissipated power but use values from other publications. In [16] they measure the power dissipated and the execution time of the high level operations involved in checkpoint (coordinated, uncoordinated and hierarchical) varying the number of cores involved. They do not use different processor frequencies, nor do they indicate whether the checkpoint is compressed or not. In [8], [9] and [10] a framework for energy saving of the C/R are presented. In [8], many small I/O operations are replaced by a few large, single-core operations to make the checkpoint and restart more energy efficient. They use RAPL to measure and limit energy consumption. [9] propose to have a core to execute a replica of all the processes of the node in order to avoid re-execution from the last checkpoint and analytically compare the energy consumption of this proposal with the traditional checkpoint. In [10] a runtime that allows modifying the clock frequency and the number of processes that carry out the I/O operations of the C/R are designed, in order to optimize the energy consumption. Another work that analyze the impact of the dynamic scaling of frequency and voltage on the energy consumption of checkpoint operations is in [2]. They measure the power at the component level while writing the checkpoint locally and remotely and compare variations in remote storage: NFS using the kernel network stack and NFS using the IB RDMA interface. [11] evaluate the energy consumption of an application that uses compressed checkpoints. They show that when using compression, more energy is spent but time is saved, so that the complete execution of the application with all its checkpoints can be benefited from an energy point of view.

Our work focuses on coordinated C/R at the system level. The dissipated power of the checkpoint and restart operations are measurements obtained with an external physical meter. We have not found papers that evaluate the impact of C states and NFS configurations on the energy consumption of C/R operations.

3 Factors that affect the energy consumption

Energy can be calculated as the product between power and time. Any factor that may affect one of these two parameters should be considered and then analyze how it affects energy consumption. These factors belong to different levels: Hardware, Application Software and System Software.

3.1 Hardware

The Advanced Configuration and Power Interface (ACPI) specification provides an open standard that allows the operating system to manage the power of the devices and the entire computing system [18]. It allows managing the energy behavior of the processor, the component that consumes the most energy in a computer system [12]. ACPI defines Processor Power States (Cx states), where C0 is the execution state, and C1…Cx are inactive states.

A processor that is in the C0 state will also be in a Performance State (Px states). The status P0 means an execution at the maximum capacity of performance and power demand. As the number of state P increases, its performance and demanded power is reduced. Processors implement P states using the Dynamic Frequency and Voltage Scaling technique (DVFS) [13]. Reducing the voltage supplied reduces the energy consumption. However, the delay of the logic gates is increased, so it is necessary to reduce the clock frequency of the CPU so that the circuit works correctly. In certain multicore processors, each core is allowed to be in a different P state.

When there are no instructions to execute, the processor can be put in a C state greater than 0 to save energy. There are different C state levels, where each of the levels could turn off certain clocks, reduce certain voltages supplied to idle components, turn off the cache memory, etc. The higher the C state number, the lower the power demanded, but the higher the latency required to return to state C0 (execution status). Some processors allow the choice of a C state per core.

As both states, C and P, have an impact on power and time, it is necessary to evaluate their impact on energy consumption during C/R operations.

The GNU/Linux kernel supports frequency scaling through the subsystem CPUFreq (CPU Frequency scaling). This subsystem includes the scaling governors and the scaling drivers. The different scaling governors represent different policies for the P states. The available scaling governors are: performance (this causes the highest frequency defined by the policy), power save (this causes the lowest frequency defined by the policy), userspace (the user defines frequency), schedutil (this uses CPU utilization data available from the CPU scheduler), ondemand (this uses CPU load as a CPU frequency selection metric) and conservative (same as ondemand but it avoids changing the frequency significantly over short time intervals which may not be suitable for systems with limited power supply capacity). The scaling drivers provide information to the scaling governors about the available P states and make the changes in those states ¹¹1https://www.kernel.org/doc/html/v4.14/admin-guide/pm/cpufreq .html. In this work we use userspace and ondemand governors.

3.2 Application software

Basically, a checkpoint consists in saving the state of an application, so that in case of failure it can restart the execution from that saved point. The larger the problem size of the application, the longer the time required to save its state. Its incidence, at least over time, converts problem size into a factor that affects the energy consumption of C/R operations.

3.3 System software

There are two types of system software highly involved in C/R operations. On the one hand, the system that carry out these operations. On the other hand, since the checkpoint file needs to be protected in stable and remote storage, it is necessary to use a distributed file system. In our case, the system software we use is Distributed MultiThreaded CheckPointing (DMTCP) [14] and Network File System (NFS). Both have configuration options that affect the execution time and/or power, and therefore are factors that affect the energy consumption of the C/R operations.

In the NFS case, folders can be mounted synchronously (sync option) or asynchronously (async option). If an NFS folder is mounted with the sync option, writes at that mount point will cause the data to be completely downloaded to the NFS server, and written to persistent storage before returning control to the client²²2https://linux.die.net/man/5/nfs. Thus, the time of a write operation is affected by varying this configuration.

In the case of DMTCP, it is a tool that performs checkpoints transparently on a group of processes spread among many nodes and connected by sockets, as is the case with MPI (Message Passing Interface) applications. DMTCP is able to compress (using the gzip program) the state of a process to require less disk storage space and reduce the amount of data transmitted over the network (between the compute node and the storage node). The use or not of compression impacts the time and the power required, therefore it is another factor that affects energy consumption.

4 Experimental platforms and design

4.1 Experimental platform

The experiments were carried out on two platforms. Platform 1, on which most of the analysis of this work is carried out, is a cluster of computers with a 1 Gbps Ethernet network. Each node, both computing and storage, has 4 GiB of main memory, a SATA hard disk of 500 GB and 7200 rpm, and an Intel Core i5-750 processor, with a frequency range of 1.2 GHz to 2.66 GHz (with the Intel Turbo Boost ³³3https://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html disabled), four cores (without multithreading), 8 MiB of cache and 95 W TDP. The clock frequency range goes from 1.199 GHz. to 2.667 GHz. Platform 2 is a compute node connected to a storage node with a 1 Gbps Ethernet network. The compute node has an Intel Xeon E5-2630 processor, a frequency range of 1.2 GHz to 2.801 GHz (with the Intel Turbo Boost mechanism disabled), six cores (with multithreading disabled), 16 GiB of main memory, 15 MiB of cache and TDP of 95 W. It uses a Debian 9 Stretch operating system. The storage computer has an Intel Core 2 Quad Q6600 processor, four cores (without multithreading), 8 GiB of main memory and 4 MiB of cache memory. The clock frequency range goes from 1.2 GHz. to 2.801 GHz.

The nodes of Platform 1 and the computing node of Platform 2 use the GNU/Linux operating system Debian 8.2 Jessie (kernel version 3.16 of 64 bits), OpenMPI version 1.10.1 as an MPI message passing library, and the tool checkpoint DMTCP version 2.4.2, configured to compress the checkpoint files. The network file system used to make the remote writing of the checkpoint files is NFS v4 (Network File System).

For power measurements we use the PicoScope 2203 oscilloscope (whose accuracy is 3%), the TA041 active differential probe, and the PP264 60 A AC/DC current clamp, all Pico Technology products. The electrical signals captured by the two-channel oscilloscope are transmitted in real time to a computer through a USB connection. The voltage is measured using the TA041 probe that is connected to an input channel of the oscilloscope. The current of the phase conductor, which provides energy to the complete node (including the power source) is measured using the current clamp PP264, which is connected to the other input channel of the oscilloscope.

The selected application⁴⁴4https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesHeat for system characterization is a SPMD heat transfer application written in MPI that uses the float data type. This application describes, by means of an equation, the change of temperature in time over a plane, given an initial temperature distribution and certain edge conditions.

4.2 Experimental design

Two compute nodes are used (unless otherwise specified) and each compute node writes to a dedicated storage node through an NFS configured in asynchronous mode (unless otherwise specified). The sampling rate used for both channels of the oscilloscope was set at 1000 Hz. The power measurements correspond to the power dissipated by the complete node including the source. The tests were performed with the processor C states option active (unless otherwise specified). For the measurements of the checkpoint and restart time we use the option provided by DMTCP. To change the frequency of the processor, the userspace governor is used. The same frequency is used in all cores at the same time.

Each experiment consists of launching the application with one process per core, letting it run during a preheating period (20 seconds), performing a checkpoint manually, aborting the application from the DMTCP coordinator and re-starting the application from the command line with the script generated by DMTCP. The experiment is repeated three times for each frequency and problem size due to the low variability of the measurement instruments used.

5 Experiments and results analysis

In this section we show the experiments and analyze how the mentioned factors influence power, time, and energy consumption of CR operations.

The first subsections (5.1 to 5.4) analyze the effect of clock frequency (see subsection 3.1) and problem size (see subsection 3.2) on Platforms 1 and 2. As there are no significant differences between the two platforms for these factors, we have carried out the study of the remaining factors, NFS configuration and compression of the checkpoint files (subsections 5.5 and 5.6) only on Platform 1. We understand that these last factors are influenced mainly by the network, which is the same in both platforms.

5.1 Real power

In order to know the energy we need to know the average power dissipated of each operation. Fig. 1 shows the real power on both Platforms, obtained with the measurements delivered by the oscilloscope, when an application is executed, a checkpoint is done, a fault is injected, and a restart is initiated. These power measurements are averaged over the duration of the checkpoint and restart operation. In the rest of the work, when referring to dissipated power, we refer to the average dissipated power.

Refer to caption — (a) Platform 1 at 2.667 GHz.

Platform 2, unlike Platform 1, has two high phases and two low phases during the checkpoint. To know what happened in those phases we measured the CPU and network usage in both platforms, which is shown in Fig. 2. When observing the use of CPU, we see that the high power phases coincide with a higher CPU usage, and that low phases coincide with a low CPU usage. During a checkpoint, the CPU is used to compress. From this it follows that in Platform 2, DMTCP, at times, stops compressing. When observing the use of the network, we see that on both platforms the transmission begins a few seconds after the checkpoint has started and continues until the end of the checkpoint. We also note that in the first low phase of Platform 2, the transmission rate drops by half, causing inefficient use of the network. Studying these inefficiencies could be part of future work. In the case of the restart, the transmission rate remains stable throughout its duration, on both platforms.

5.2 Processor’s P states

The impact of processor’s P states on the C/R operations energy consumption was evaluated. The figures 3 and 4 show the average dissipated power (a) and time (b) for different frequencies of the processor, on both platforms. It is observed that as the clock frequency increases, the dissipated power increases and the time decreases. The functions obtained are strictly increasing for the case of dissipated power and strictly decreasing for the case of time. It is also observed that the checkpoint is more affected by clock frequency changes, both in the power dissipated and in time.

5.3 Processor’s C states

During the writing or reading of a checkpoint file it is possible that the processor has idle moments and therefore transitions between C states can occur. These transitions can affect the power dissipation of the processor. To study its behavior, C/R operations were performed with the C states enabled and disabled, for several frequencies of the processor, on Platform 1 and 2. The results are shown in the figures 5 and 6.

In both platforms it is observed that the power measurements with the C states enabled show greater variability, especially in the restart at Platform 1. It is also observed how the difference between the power dissipated with C states enabled and disabled increases with the increasing frequency of the processor. On Platform 1, this difference becomes approximately 9 % for the checkpoint (at the maximum frequency), and 10% for the restart (at the frequency 2.533 GHz). In Platform 2 these differences are smaller, 6% for the checkpoint, and 5% for the restart, in the frequency 2.6 GHz in both cases.

The consumption of energy is mainly benefited by the use of C states. On Platform 1, for the checkpoint case, the consumption is up to 13% higher when the C states are disabled, and in the restart case, up to 20% higher. On Platform 2 these differences are smaller, up to 5%, except for the 1.4 GHz frequency, where the differences are greater than 15%, both for checkpoint and restart.

In any case, the best option is to keep the C states enabled since they reduce the energy consumption by up to 13% for the checkpoint, and up to 20% for the restart, at certain CPU frequencies. Times showed no variation when enabling or disabling the C states.

5.4 Problem size

The impact of the problem size on energy consumption of C/R operations was evaluated. The power dissipated and the time of the checkpoint and restart were measured, for different problem sizes, on both platforms.

Problem sizes do not exceed the main memory available in the compute node. In the figures 7(a) and 8(a) it is observed that the power dissipated almost does not vary when varying the problem size (the differences do not exceed 4% in any case). In the figures 7(b) and 8(b) it is observed that the time increases as the problem size increases, as expected.

5.5 NFS configuration

The impact on energy consumption of the use of the NFS sync and async options was evaluated. The figure 9 compares the dissipated power, time and energy consumed by a checkpoint stored on a network file system mounted with the option sync and async, for three different clock frequencies (minimum, average and maximum of the processor), on Platform 1. For the three frequencies, the dissipated power is greater and the execution time is shorter when the asynchronous configuration is used. The shortest time is explained by the operating mode of the asynchronous mode, which does not need to wait for the data to be downloaded to the server to advance. The greater dissipated power may be due to the fact that the asynchronous mode decreases idle times of the processor, and therefore C states of energy saving does not activate (see subsection 3.1). However, for the minimum and average frequency, these differences are small, resulting in a similar energy consumption, as shown in Fig. 9(c). The asynchronous mode consumes 1.5% more energy at the minimum frequency. It could be said that this configuration of the NFS does not affect the energy consumption when using this frequency. In the medium frequency, the asynchronous mode consumes 7% less energy. If we now observe what happens at the maximum frequency, we see that the asynchronous mode consumes 25% less energy. Although the dissipated power is 37% higher, the time is 85% lower, and this means that at the maximum frequency, it is convenient to use the asynchronous mode for lower energy consumption.

5.6 Checkpoint file compression

The impact of checkpoint files compression on the energy consumption of the computation node was analyzed, and in this work, we added the study of the storage node. The experiments were performed on a single computation node writing on a storage node, for three different clock frequencies (minimum, average and maximum available in the processor) and for the governor ondemand (see section 3.1), on Platform 1.

The figure 10 shows the power and time of checkpoint and restart, with and without compression of the checkpoint files.

The results obtained show the following:

•

The dissipated power, both in checkpoint and restart, is greater when using compression. This is due to the higher CPU usage that is required to run the compression program (gzip) that DMTCP uses.
•

Without compression, checkpoint and restart times almost do not vary for different clock frequencies.

Fig. 11 shows the energy consumption of computation and storage nodes. In the case of the storage node, the clock frequencies indicated are the frequencies used by the computing node. Because the application does not share the storage node with other applications, the energy considered for the storage node is calculated using the total dissipated power (base power plus dynamic power). In the computation node, the energy consumption of the checkpoint is always greater when using compression (up to 55% higher in the minimum frequency). In the storage node, the energy consumption of the checkpoint is always lower when using compression, except for the minimum frequency. The energy consumption of the restart is always lower when using compression (up to 20%), both in computing and storage node.

Let’s see the total energy consumption, considering both nodes (computation and storage), in Fig. 12. For the checkpoint case, it is never advisable to compress, and even less at the minimum frequency. Although, when no compression is used, the minimum frequency is the most convenient, because it is the one with the lowest energy consumption. In any case, the energy consumption when no compression is used is similar for all frequencies studied, with differences that do not exceed 12%. For the restart case, it is always convenient to compress. When using compression, the lowest energy consumption is obtained with the frequency 1.999 GHz. However, by not using compression, the lowest energy consumption is obtained with the ondemand governor. Studying this behavior can be part of future work.

Taking into account that, in general, more checkpoint operations than restart operations are carried out, it is advisable not to use compression to reduce energy consumption.

6 Conclusions and future work

This work shows, from a series of experiments on an homogeneous cluster executing an SPMD application, how different factors of the system and the applications impact in the energy consumption of the C/R coordinated operations using DMTCP. The impact of the P and C states of the processor was studied. It was found that checkpoint operations are more sensitive to changes in P states than restart operations. On the contrary, the changes of the C states of the processor affect more the restart. In the evaluated platforms, the use of C states allows energy savings (up to 15% for the checkpoint and 20% for the restart). The increase in the application problem size under study always results in an increase in energy consumption, due to its high incidence in the time taken by the operations. It was found that by using the maximum clock frequency, up to 25% energy savings was possible when using the asynchronous mode in the NFS configuration. The compression of the checkpoint files is beneficial for the restart. When considering only the computation node, up to 20% of energy saving was registered when using compressed files. However, compression negatively impacts the checkpoint operation, with a 55% higher energy consumption when using the minimum clock frequency. Among the future works it is expected to evaluate other factors that may affect the energy consumption of checkpoint and restart operations, such as the compression program used for checkpoint files, the parallel application programming model and the fault tolerance tool used.

References

[1] Amrizal, M. A., Takizawa, H.: Optimizing Energy Consumption on HPC Systems with a Multi-Level Checkpointing Mechanism. In International Conference on Networking, Architecture, and Storage (NAS). IEEE (2017).
[2] Mills, B., Grant, R. E., Ferreira, K. B.: Evaluating energy savings for checkpoint/restart. In Proceedings of the 1st International Workshop on Energy Efficient Supercomputing. ACM (2013)
[3] Mills, B., Znati, T., Melhem, R., Ferreira, K. B., Grant, R. E.: Energy consumption of resilience mechanisms in large scale systems. In: 122nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, (2014)
[4] Meneses, E., Sarood, O., Kalé, L. V.: Assessing energy efficiency of fault tolerance protocols for HPC systems. In 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing (pp. 35-42). IEEE.
[5] Diouri, M., Glück, O., Lefevre, L., Cappello, F.: Energy considerations in checkpointing and fault tolerance protocols. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) (pp. 1-6). IEEE. (2012, June).
[6] Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Karp, S.: Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, (2008)
[7] Dauwe, D., Jhaveri, R., Pasricha, S., Maciejewski, A. A., Siegel, H. J.: Optimizing checkpoint intervals for reduced energy use in exascale systems. In 2017 Eighth International Green and Sustainable Computing Conference (IGSC) (pp. 1-8). IEEE.(2017, October)
[8] Chandrasekar, R. R., Venkatesh, A., Hamidouche, K., Panda, D. K.: Power-check: An energy-efficient checkpointing framework for HPC clusters. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (pp. 261-270). IEEE. (2015, May)
[9] Cui, X., Znati, T., Melhem, R.: Adaptive and power-aware resilience for extreme-scale computing. In Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld) (pp. 671-679). IEEE.(2016, July)
[10] Saito, T., Sato, K., Sato, H., Matsuoka, S.: Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system. In Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale (pp. 41-48). ACM. (2013, June).
[11] Ferreira, K. B., Ibtesham, D., DeBonis, D., Arnold, D.: Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms (No. SAND2014-2159C). Sandia National Lab.(SNL-NM), Albuquerque, NM (United States) (2014).
[12] Silveira, D. S., Moro, G. B., Cruz, E. H. M., Navaux, P. O. A., Schnorr, L. M., and Bampi, S.:Energy Consumption Estimation in Parallel Applications: an Analysis in Real and Theoretical Models. In XVII Simposio em Sistemas Computacionais de Alto Desempenho, (pp. 134-145) (2016).
[13] Le Sueur, E., Heiser, G.: Dynamic voltage and frequency scaling: The laws of diminishing returns. In Proceedings of the 2010 international conference on Power aware computing and systems (pp. 1-8).(2010, October).
[14] Ansel, J., Arya, K., Cooperman, G.: DMTCP: Transparent checkpointing for cluster computations and the desktop. In 2009 IEEE International Symposium on Parallel & Distributed Processing (pp. 1-12). IEEE (2009, May).
[15] Morán, M., Balladini, J., Rexachs, D., Luque E.: Factores que afectan el consumo energético de operaciones de checkpoint y restart en clusters. In: XIX Workshop Procesamiento Distribuido y Paralelo (WPDP), XXIV Congreso Argentino de Ciencias de la Computación, pp. 63-72, CACIC 2018. ISBN 978-950-658-472-6.
[16] Diouri, M., Glück, O., Lefevre, L., Cappello, F.: ECOFIT: A framework to estimate energy consumption of fault tolerance protocols for HPC applications. In 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (pp. 522-529). IEEE (2013, May).
[17] Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Karp, S.: Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 15. (2008).
[18] ACPI - Advanced Configuration and Power Interface. http://www.acpi.info