Cloud Computing for Big Data Analytics in the Process Control Industry
Abstract— The aim of this article is to present an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry. The latest innovations in the field of Process Analyzer Techniques (PAT), big data and wireless technologies have created a new environment in which almost all stages of the industrial process can be recorded and utilized, not only for safety, but also for real-time optimization. Based on the analysis of historical sensor data, machine learning based optimization models can be developed and deployed in real-time closed control loops. However, the local implementation of such systems still requires a huge investment in hardware and software, a direct result of the big data nature of the sensor data being recorded continuously. The current technological advancements in cloud computing for big data processing open new opportunities for the industry, acting as an enabler for a significant reduction in costs and making the technology available to plants of all sizes. The main contribution of this article stems from the presentation, for the first time, of a pilot cloud based architecture for the application of a data driven modeling and optimal control configuration in the field of Process Control. As will be presented, these developments have been carried out in close relationship with the process industry and pave the way for a generalized application of cloud based approaches, towards the future of Industry 4.0.

I. INTRODUCTION

For many years SCADA systems have been used to collect sensor data in order to control industrial processes, usually in real time [1]. The topological complexity of these systems (see [2]) involves large costs associated with scaling and adapting to the vast amount of gathered signals when a general reconfiguration of the control structure of the process plant is needed (see [3]). It should also be mentioned that the majority of these SCADA systems have, up to now, been utilized mainly for providing an overview of the controlled process, while having the ability to perform Process Analyzer Techniques (PAT) mainly for the statistical processing of the received data in an off-line analysis.

However, the recent innovations in online PAT and wireless embedded technologies have created a new era in which almost all stages in the industrial process can be recorded, stored and analyzed. This process produces a massive amount of sampled data that needs to be stored and processed in real time, allowing an overall reconfiguration of the control plant and a continuous operational optimality against the variations of the production stages.

Towards this vision, the industrial processes require an IT infrastructure that can efficiently manage massive amounts of complex data structures collected from disparate data sources, while providing the necessary computational power and tools for analyzing these data in batch, near real-time and hard real-time approaches. The overall problem becomes more complex because of the diversity of the acquired data, mainly due to the different data and sensor types, data reliability levels, measurement frequencies and missing data. Moreover, in every case the acquired data need to be filtered, stored and often aggregated before any meaningful analysis can be performed.

With the explosion of the "Internet of Things" [4] in the last decade, a world of new technologies has become readily accessible and relevant for the industrial process. Nowadays, with relatively low costs, it is possible to send torrents of data to the "cloud" for storage and analysis. Cloud computing encompasses cloud storage as well as batch and streaming analysis of data using the latest Machine Learning (ML) algorithms. The potential benefits of using cloud computing for dynamic optimal control in industrial plants include:

• Dramatically reduced costs of storing and analyzing large amounts of data
• Low levels of complexity relative to existing systems
• Enabling the use of advanced ML algorithms in batch and real time
• Reduced entry-level costs for implementing advanced control systems
• Enabling large scale implementations with many low cost sensors
• Very easy management from the cloud
• Easy scaling or modification of storage capacities

Inspired by these capabilities of the cloud infrastructure and the reachability of these technologies nowadays, the proposed architecture aims to combine the existing PAT based analysis of the process, which is most of the time carried out off line or on a batch of time samples, with the multiple streams of sensory data describing the process and product states. The low-dimensional data should be robust against infrequent updates of PAT measurements and missing data, while handling largely varying measurement intervals. The model should also be able to handle the multivariate and auto-correlated nature of process data and the high quantities of data from regular on-line measurements. Principles from wireless sensor networks, estimation and statistical signal processing will be integrated and evaluated with real process data in order to create a novel and reliable PAT based swarm

The work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 636834, DISIRE.
1 GSTAT, Israel
2 Robotic Team, Division of Signal and Systems, Electrical Engineering Department, Luleå University of Technology, Luleå, Sweden.
Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on February 20,2024 at 21:51:45 UTC from IEEE Xplore. Restrictions apply.
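As noted in the introduction, raw plant data typically have to be filtered and aggregated onto a regular time grid before any meaningful analysis. A minimal, self-contained Python sketch of such an aggregation step is given below; the function name, the 10-second bin width and the sample values are purely illustrative and not taken from the paper:

```python
from collections import defaultdict
from statistics import mean

def ten_second_averages(samples, start, end):
    """Aggregate irregular (timestamp, value) samples into fixed
    10-second bins; bins with no samples are reported as None so
    that downstream models can treat them as missing data."""
    bins = defaultdict(list)
    for ts, value in samples:
        if start <= ts < end and value is not None:
            bins[int((ts - start) // 10)].append(value)
    n_bins = int((end - start) // 10)
    return [mean(bins[i]) if bins[i] else None for i in range(n_bins)]

# Irregularly sampled furnace temperatures (seconds, degrees C)
raw = [(0, 850.0), (3, 852.0), (12, 848.0), (31, 855.0), (38, None)]
print(ten_second_averages(raw, 0, 40))  # [851.0, 848.0, None, 855.0]
```

Explicit `None` entries keep the output grid regular even when a sensor drops out, which is what allows largely varying measurement intervals to coexist with fixed-rate models.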
HDFS was designed to store Big Data with very high reliability and with the flexibility to scale up by simply adding commodity servers. In the presented prototype architecture, Hadoop has been utilized as the framework for setting up the HDFS cluster on which the sensor data are stored.

C. Apache Spark Engine

The main feature of Apache Spark is its in-memory cluster computing, which makes processing much faster than Hadoop's MapReduce technology. Spark uses HDFS for storage, while calculations are performed in memory on each of the nodes. Aside from the increased computation speed, the Spark engine is able to provide:

• Built-in APIs for multiple languages: Java, Scala, Python and R
• Spark-SQL for querying big data with SQL-like code
• Spark-MLlib [11] for big data parallel machine learning algorithms such as linear and logistic regression, K-means clustering, decision trees, random forests, neural networks, recommendation engines and more
• Spark-Streaming for running machine learning algorithms on streaming data

D. Process Managers

At the other end of the proposed architecture are the process managers who, through local computers, can access and perform machine learning algorithms on the data stored in the Hadoop cluster. The two leading programs that serve as an interface for conducting statistical analysis using the Spark engine are:

1) R - an open-source statistical language used widely both in industry and academia.
2) Python - an open-source general-purpose language with a vast library of functions for implementing machine learning algorithms.

As mentioned above, both of these languages have APIs that pass commands to the Spark engine. The process managers access and run these programs through a number of web-based development environments and notebooks, such as the Jupyter notebook, which is popular in the Python community, and RStudio, which is the leading IDE amongst R users.

E. Control Feedback Loop

After the process managers have performed their analysis, they can set up dynamic models for implementation in the cloud that can push responses back to the industrial processes. This process is explained further in the Near Real-Time Computing subsection.

F. Historical Big Data Repository

In the cloud, the raw data and the process managers' recommendations will be stored in the historical big data repository (AWS S3). AWS offers great flexibility in storage plans, which can easily be scaled as needed.

G. Near Real-time Computing

Apache Kafka [12] is a publish-subscribe messaging application that enables sending and receiving streaming information between the plants and the Spark engine in the cloud. On the local computers (in the plants), a Kafka API (which consists of a few Java libraries) sends streaming data to a Kafka server set up on AWS, which manages the queue of information passed on to the Spark engine. The Spark engine then performs the streaming analysis and pushes the results back to the Kafka server and from there back to the plants. The analysis can be cleaning the data, searching for outliers or applying an ML algorithm in real time. In addition, every 10 minutes the Spark server sends the accumulated data to the Historical Big Data Repository for future use or for batch computing.

H. Batch Computing

In batch computing, the data are initially stored in the Historical Big Data Repository, where they can be properly cleaned, aggregated or transformed before being analyzed by the process managers. In many cases, this step includes saving the data in the Parquet format, which can reduce the size of the data, using the R or Python languages. In general, the process managers can choose from a vast array of ML algorithms that can be implemented on the cluster through the Spark engine.

III. THE USE CASE OF THE WALKING BEAM FURNACE

The walking beam furnace is used in the steel industry to re-heat slabs (large steel beams) to a specific temperature before their refinement (see [13]). The slabs are walked from the feed to the output of the furnace by the cyclic movement of so-called walking beams. During this passage, the items are directly exposed to the heat produced by burners located inside the furnace. Since the heat distribution affects the quality of the finished product, a natural optimal control problem in this context is to regulate pre-assigned temperatures at specific points of the furnace, while minimizing the energy expenditure for the heat generation (see [14], [15]).

The walking beam furnace at MEFOS is an experimental furnace and lacks some of the features of an industrial furnace. Specifically, the temperatures throughout the furnace are not feedback controlled (as is otherwise customary in the industry), i.e., the furnace operates open loop. Currently, a human operator configures the furnace set-points manually (the set-point values are, however, computed numerically) and then measures the slab temperature at the furnace exit using a pyrometer. In fact, under normal operating conditions, the open-loop control can be tuned to work well. Additionally, this industrial installation is affected by stops and other variations that influence the control performance, and correspondingly motivate the need for a feedback control loop. In the described use case, the main variables that need to be controlled are thus: a) the furnace temperatures in several zones of the furnace and b) the temperature of the slabs at the output (the target temperature). Furthermore, the main objective is to reduce the operating costs through the reduction
of energy consumption. In this respect, a small decrease in energy consumption such as 0.5% translates into a saving of 2 kWh per ton of heated product, while optimal control strategies could lead to quality improvements as well. The overall schematic diagram of the WBF with the indicative control loops, the sensors and the different heating zones is depicted in Figure 2.

Fig. 2. Schematic Diagram of the Walking Beam Furnace

To achieve these goals there is a need to gather more information about the process on-line, while the optimal control's output would optimize the process by controlling the following variables: 1) the fuel supply rate at the burners, one burner at each zone, for a total of three burners, 2) the fuel atomization air supply rate, one for each burner, 3) the combustion air flow, one at each zone, for a total of three zones, and 4) the exhaust flow, e.g. the exhaust damper position, with one exhaust damper in the furnace.

In this use case, MEFOS has installed a dedicated PC at the WBF site for managing the flow of the measurement data. Figure 3 presents the flow of the sensory data from the ABB control system to the connectivity server, from there to the corresponding PC, and subsequently to the cloud.

In summary, the controlled variables are the fuel supply rate, the fuel atomization air supply rate, the combustion air flow and the exhaust flow. In the cloud, the raw data and the optimizer's recommendations will be stored in the historical big data repository (AWS S3). The overall schematic representation of the presented architecture is depicted in Figure 4. For this use case, the variables required by the optimal control module are the ones listed in Figure 5. The minimum data input for the optimal control is 200 past values of 10-second averages of the above parameters (one value every 10 seconds over the last 2,000 seconds, i.e. 33 minutes and 20 seconds).

A. Transferring data from the sensors to the cloud

For transferring data from the sensors to the cloud, a computer connected to the WBF process is utilized that is able to manage and update the site metadata, i.e. a MefosService method runs preliminarily to synchronize the factory list, zone list, sensor list, batch list and model list. Furthermore, this method creates a file in JSON structure with 3 fields, FactoryID, ZoneID and SensorID, for every possible value, while the posted data can be either a single message or an array.

TABLE I
MESSAGE TYPES

Message Type 1 - Process Status Change
  Factory ID        F key [Predefined Integer]
  Batch ID          F key [Predefined Integer]
  Status ID         P key [running Integer]
  Date time         [Time Stamp]
  Current Status    [Predefined String: Idle/Start/Stop/Pause/Restart]

Message Type 2 - Measurements
  Factory ID        F key [Predefined Integer: -1 / 1 / 2 / 3 / ]
  Zone ID           F key [Predefined Integer]
  Sensor ID         F key [Predefined Integer]
  Batch ID          F key [Predefined Integer]
  Date Time         [Time Stamp]
  Measurement value [Double]
  Measurement unit  [Char: C / % / m3/h / kg/h / MMWC / Boolean]
  Quality           [Integer]
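The two message types of Table I can be sketched as simple data classes. The snake_case field names and the JSON encoding below are assumptions made for illustration, since the exact wire format produced by the MefosService is not specified in the text:

```python
import json
from dataclasses import dataclass, asdict

# Fields follow Table I; names and JSON layout are illustrative.

@dataclass
class ProcessStatusChange:  # Message Type 1
    factory_id: int         # foreign key, predefined
    batch_id: int           # foreign key, predefined
    status_id: int          # primary key, running integer
    date_time: str          # time stamp
    current_status: str     # Idle / Start / Stop / Pause / Restart

@dataclass
class Measurement:          # Message Type 2
    factory_id: int
    zone_id: int
    sensor_id: int
    batch_id: int
    date_time: str
    measurement_value: float
    measurement_unit: str   # C, %, m3/h, kg/h, MMWC, Boolean
    quality: int

msg = Measurement(1, 2, 17, 5, "2017-05-04T10:00:00Z", 851.0, "C", 1)
payload = json.dumps(asdict(msg))  # what would be posted to the Kafka server
print(payload)
```

A single message serializes to one JSON object; an array of such objects would cover the "single message or array" posting mode mentioned above.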
Fig. 4. Schematic description of the Architecture
Fig. 5. Variables required by the optimal control module

The input messages are processed at the Kafka server, where the metadata are synchronized and pre-processed. After this step the data are pushed from the MefosService PC into the Kafka server and from there are pulled by the Spark cluster. For the big data repository, the Spark-Streaming process pushes the recommendations to the Kafka server. Subsequently, the Spark streaming process saves the measurement data along with the recommendations to AWS S3. The overall streaming process is depicted in Figure 6.

The Kafka server will also maintain, and be responsible for, the queue of recommendation data arriving from the Spark cluster. For transferring the results from the cloud back to the process, the Kafka server keeps the control recommendations and streams them on a specific output topic to a consumer, while the "MefosService" includes a Kafka-consumer feature that pulls the recommendation data from that output topic, e.g. "FromSpark". Finally, the output recommendations reach the Web-API of the process via a provided URL.

At the Spark-Streaming stage, the initial data are accumulated in memory and afterwards saved to the historical Big Data repository. The control recommendation data are likewise accumulated in memory and saved to the historical Big Data repository, which resides on AWS S3 (Amazon Simple Storage Service). The files will be saved in the Parquet file type, with the following benefits: 1) the structure of the table, i.e. the number of columns, their types and the delimiter between columns, is preserved; 2) the data are compressed, which saves about 60% of the volume compared to a text file; and 3) it enables
1377
Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on February 20,2024 at 21:51:45 UTC from IEEE Xplore. Restrictions apply.
direct loading into Spark's in-memory data storage, with no conversions needed. Furthermore, the historical Big Data repository will enable deep investigation of the data whenever it is required for the development of new models, BI reports, etc.
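The accumulate-then-persist behaviour described above for the Spark-Streaming stage can be sketched in plain Python. This is a language-agnostic illustration of the pattern, not the actual Spark code: the real pipeline writes Parquet files to S3, whereas this sketch, purely for illustration, writes JSON batches to a local directory.

```python
import json
import tempfile
from pathlib import Path

class MicroBatcher:
    """Accumulate records in memory and flush them to a repository
    directory once a full batch has arrived -- a sketch of the
    accumulate-then-persist pattern used by the streaming stage."""

    def __init__(self, repository: Path, batch_size: int = 100):
        self.repository = repository
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = 0  # number of batch files written so far

    def add(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        path = self.repository / f"batch-{self.flushed:05d}.json"
        path.write_text(json.dumps(self.buffer))
        self.flushed += 1
        self.buffer = []

repo = Path(tempfile.mkdtemp())
batcher = MicroBatcher(repo, batch_size=50)
for i in range(120):
    batcher.add({"sensor_id": 7, "value": float(i)})
batcher.flush()  # persist the 20 leftover records
print(batcher.flushed, "files written")  # 3 files written
```

The same idea scales from this toy buffer to Spark's in-memory accumulation: only full (or final) batches touch the slow storage layer, which keeps the per-record cost of the repository writes low.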
IV. CONCLUSIONS

In this article an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry has been presented. The current technological advancements in cloud computing for big data processing open new opportunities for the industry, acting as an enabler for a significant reduction in costs and making the technology available to plants of all sizes. The main contribution of this article has been the presentation, for the first time, of a pilot cloud based architecture for the application of a data driven modeling and optimal control configuration in the field of Process Control. These developments have been carried out in close relationship with the process industry, as demonstrated by the use case at the walking beam furnace of MEFOS in the Swedish steel industry. Part of the future work includes the full extended experimentation and validation of the proposed scheme in WBF campaigns.
REFERENCES
[1] D. Bailey and E. Wright, Practical SCADA for industry. Newnes,
2003.
[2] O. Sporns and G. Tononi, “Classes of network connectivity and
dynamics,” Complexity, vol. 7, pp. 28–38, 2001.
[3] M. van de Wal and B. de Jager, “Control structure design: a survey,”
in Proceedings of the 1995 American Control Conference, vol. 1,
pp. 225–229 vol.1, Jun 1995.
[4] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,”
Computer networks, vol. 54, no. 15, pp. 2787–2805, 2010.
[5] S. Skogestad, “Plantwide control: the search for the self-optimizing
control structure,” Journal of Process Control, vol. 10, pp. 487–507,
October 2000.
[6] W. L. Luyben, B. D. Tyreus, and M. L. Luyben, Plant-wide process
control. McGraw-Hill, 1998.
[7] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and
its role in the internet of things,” in Proceedings of the first edition
of the MCC workshop on Mobile cloud computing, pp. 13–16, ACM,
2012.
[8] T. White, Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
[9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
“Spark: Cluster computing with working sets.,” HotCloud, vol. 10,
no. 10-10, p. 95, 2010.
[10] J. Allaire, “Rstudio: Integrated development environment for r,”
Boston, MA, 2012.
[11] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu,
J. Freeman, D. Tsai, M. Amde, S. Owen, et al., “Mllib: Machine
learning in apache spark,” Journal of Machine Learning Research,
vol. 17, no. 34, pp. 1–7, 2016.
[12] N. Garg, Apache Kafka. Packt Publishing Ltd, 2013.
[13] H. S. Ko, J.-S. Kim, T.-W. Yoon, M. Lim, D. R. Yang, and I. S. Jun,
“Modeling and predictive control of a reheating furnace,” in American
Control Conference, 2000. Proceedings of the 2000, vol. 4, pp. 2725–
2729, IEEE, 2000.
[14] B. Leden, “A control system for fuel optimization of reheating
furnaces,” Scand. J. Metall., vol. 15, no. 1, pp. 16–24, 1986.
[15] J. Srisertpol, S. Tantrairatn, P. Tragrunwong, and V. Khomphis, “Es-
timation of the mathematical model of the reheating furnace walking
hearth type in heating curve up process,”