Big Data Ingestion and Preparation Tools
Modern Applied Science; Vol. 14, No. 9; 2020
ISSN 1913-1844 E-ISSN 1913-1852
Published by Canadian Center of Science and Education
Received: January 25, 2020 Accepted: August 24, 2020 Online Published: August 27, 2020
doi:10.5539/mas.v14n9p12 URL: https://doi.org/10.5539/mas.v14n9p12
Abstract
Developing Big Data applications has become very important in the last few years, as many organizations and
industries are aware that data analysis is becoming a key factor in staying competitive and in discovering new
trends and insights. The data ingestion and preparation step is the starting point for any Big Data project.
This paper reviews some of the most widely used Big Data ingestion and preparation tools, discussing the
main features, advantages, and usage of each tool. Its purpose is to help users select the right
ingestion and preparation tools according to their needs and their applications' requirements.
Keywords: big data, Hadoop, HDFS, data ingestion, data preparation
1. Introduction
In recent years data has been growing quickly; multiple sources such as computers, social media, and mobile
phones generate large volumes of data in different formats, namely structured, semi-structured, and unstructured
(Oussous, Benjelloun, Lahcen, & Belfkih, 2018; Erraissi, Belangour, & Tragha, 2018).
Big Data must be ingested, cleaned, and processed in order to extract value from it. Different models,
hardware, and technologies have been developed for Big Data to provide more trustworthy and accurate results;
most of these technologies are open source and can handle the volume and variety of the data. Hadoop is the
most popular Big Data framework, integrating different technologies for ingesting and analyzing different
types of data. However, in many cases it is challenging to choose the best technology, as this depends
on parameters such as cost, performance, and support (Oussous et al., 2018; Mohamed & Hong, 2016).
The data ingestion process is an important step in building any Big Data project, and it is frequently discussed
together with the ETL concept: extract, transform, and load. Traditionally, ETL pipelines were built for moving
data from a source to a destination, but this process is slow and not time-sensitive. Modern applications aim to
support real-time processing and decision making; in this case the ETL pipeline is built with a different
architecture that solves the latency problem and deals with streaming data such as website clicks, sensor
readings, and telecommunications, so that newly arrived data is transferred immediately for processing (Meehan,
Aslantas, Zdonik, Tatbul, & Du, 2017).
The data ingestion process should handle the differing volume, speed, and variety of data; it can be batch data
ingestion or stream data ingestion. This paper discusses the Big Data ingestion process along with tools for
batch and stream ingestion such as Sqoop, NiFi, Flume, and Kafka. Each tool is discussed with its features,
architecture, and a real use case, and a comparison of the ingestion tools based on different criteria is
provided to help users choose the tool that satisfies their needs. The paper also covers the data preparation
process, which aims to clean, validate, and reduce the ingested data, and describes data preparation tools such
as Hive, Impala, Storm, and Spark.
The paper is structured as follows. Section 2 introduces data sources and the types of data. Section 3 presents
the Big Data ingestion concept, its parameters and challenges, and reviews ingestion tools categorized by
ingestion type, either batch or stream, with details about each tool. Section 4 introduces the data preparation
process, a pre-processing step for data quality enhancement, and describes data preparation tools with their
main characteristics and real use cases.
2. Data Sources
The volume of data used in Big Data projects is very large, and the sources and formats of the data change
rapidly. There are two main kinds of data sources: internal and external. Internal sources are controlled by the
organization and include data about the company's daily operations, collected and stored in databases; in this
case we are dealing with structured data. External sources refer to all data retrieved from sources that are not
controlled by the organization (Bucur, 2015; Erraissi, Belangour, & Tragha, 2018).
Big Data comes from many sources. Social media is the most important one: Twitter and Facebook generate very
large amounts of data such as tweets, profiles, and likes, and this data can be analyzed to provide important
value; for example, analyzing social media data related to a new product can give a better understanding of
customer satisfaction. Log files are another source: clicks on a specific website can be logged into web log
files, and these logs can be analyzed to understand online user behavior. Sensors and machines such as medical
devices, smart meters, and road cameras generate large volumes of data that can be analyzed to provide valuable
output. Geospatial data generated by cell phones is another source of data that can be used by other
applications.
There are three types of data (Erraissi, Belangour, & Tragha, 2018):
• Structured data: data that has a fixed format and is stored in rows and columns, such as data stored in
relational databases.
• Unstructured data: data that has no specific format or structure, which makes it difficult to process; it can
be textual, like emails, or non-textual, like audio and video.
• Semi-structured data: lies between the two types above; it does not have a rigid format, but it carries
identifying information such as tags. An XML file is an example of semi-structured data.
Figure 1. Data Sources (Oughdir, L., Dakkak, A., & Dahdouh, K., 2019)
3. Data Ingestion
Data ingestion is the process of moving and transferring different types of data (structured, unstructured, and
semi-structured) from their sources to another system for processing. This process starts with prioritizing the
data sources, then validating the information and routing the data to the correct destination (Matacuta & Popa,
2018; Nargesian, Zhu, Miller, Pu, & Arocena, 2019).
3.3.1 Apache Sqoop
Figure 2. Sqoop Functionality (Cheng, Y., Zhang, Q., & Ye, Z., 2019)
Sqoop is used by many applications across different sectors. It was used by Coupons.com, an online marketer, to
transfer data between IBM Netezza and Hadoop, and by Apollo Group, an educational company, to import and export
data between Hadoop and relational databases (Armbrust et al., 2015).
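Since Sqoop is driven from the command line, a minimal sketch of such an import, wrapped in Python via
subprocess, may help make the workflow concrete. It assumes the sqoop binary is on the PATH; the JDBC URL,
credentials, table, and HDFS directory are placeholder values for illustration only.

```python
# A minimal sketch of invoking a Sqoop batch import from Python via the CLI;
# all connection details below are illustrative placeholders.
import subprocess

def sqoop_import(jdbc_url, username, password, table, target_dir, mappers=4):
    """Import one RDBMS table into HDFS as a Sqoop batch job."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,         # e.g. jdbc:mysql://dbhost:3306/sales
        "--username", username,
        "--password", password,
        "--table", table,              # source table in the RDBMS
        "--target-dir", target_dir,    # destination directory in HDFS
        "--num-mappers", str(mappers), # parallelism of the MapReduce import
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    sqoop_import("jdbc:mysql://dbhost:3306/sales", "etl_user", "secret",
                 "orders", "/data/raw/orders")
```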
Sqoop has also been used in different research projects. It was used for electronic medical records data
analysis on the cloud; this research is about analyzing health care and electronic medical records (EMR) data,
and its purpose is to help health care organizations detect any unusual measurements that need immediate action
and to support the decision-making process. Sqoop is used to import bulk electronic medical records from the
related database and insert the data into a Hive table; the data is then analyzed using MapReduce algorithms and
finally exported again to an external database on the cloud (Rallapalli & Gondkar, 2015).
Sqoop was also used in crime data analysis research. The purpose of this analysis is to analyze the population,
crimes, and crime rates, a very critical issue for governments in order to make strategic decisions, apply the
law, and keep citizens safe from crime. The related data is loaded from an RDBMS to HDFS using Apache Sqoop, and
Apache Flume is used to load the unstructured data; the imported data is analyzed using MapReduce and Pig to
obtain the needed results and answer the research questions. The results were the total number of crimes per
year (2000-2014), by state, type, and gender (women) (Jain & Bhatnagar, 2016).
3.3.2 Apache NiFi
NiFi is a dataflow system that can collect, transform, process, and route data. It was built on the flow-based
programming concept and was designed to automate and manage the flow of data between systems (Peng, 2019).
NiFi is Java based and executes within a JVM on a host operating system. As shown in Figure 3 below, the
architecture of NiFi consists of several components: the Web Server, which hosts NiFi's HTTP-based commands and
enables users to access NiFi via a web-based interface; the Flow Controller, which is responsible for providing
and scheduling threads for execution; the FlowFile Repository, where NiFi tracks status updates about the
FlowFiles; the Content Repository, which holds the content of the FlowFiles; and the Provenance Repository,
which holds provenance event data.
Figure 3. NiFi Architecture (Oussous, A., Benjelloun, F. Z., Lahcen, A. A., & Belfkih, S., 2018)
NiFi is able to run within a cluster; each node in the NiFi cluster performs the same tasks but operates on a
different set of data. The cluster is managed by a cluster coordinator, which is elected through Apache
Zookeeper.
Figure 4. NiFi Distributed Architecture
NiFi has a friendly web-based user interface that allows users to drag and drop components to build the
dataflow; components can be started and stopped in real time, and errors and statistics can be viewed easily.
NiFi buffers all queued data and allows prioritization schemes to be set that indicate how data is retrieved
from a queue. NiFi provides a data provenance module to track the data from the start of the flow until the end.
The implemented dataflow is secure, since NiFi uses secured protocols such as SSL, HTTPS, and SSH, along with
other encryption.
A processor is an atomic element in a NiFi dataflow that can perform different tasks: it can read data from
multiple sources and route, transform, and publish data to external resources. For batch data ingestion, NiFi
processors can read data from different sources: any SQL database server such as Oracle or MySQL, NoSQL
databases such as MongoDB, or data in various formats pulled from local or remote systems.
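NiFi also exposes its operations over a REST API, which is useful for monitoring a dataflow from scripts. The
sketch below polls the overall flow status, assuming an unsecured NiFi instance at localhost:8080; the endpoint
path and response field names should be verified against the documentation for the NiFi version in use.

```python
# A minimal sketch of polling NiFi's REST API for controller status;
# host, port, and response fields are assumptions to verify per NiFi version.
import requests

NIFI = "http://localhost:8080/nifi-api"

def flow_status():
    """Fetch overall controller status: active threads and queued flowfiles."""
    resp = requests.get(f"{NIFI}/flow/status", timeout=10)
    resp.raise_for_status()
    status = resp.json()["controllerStatus"]
    return status["activeThreadCount"], status["queued"]

if __name__ == "__main__":
    threads, queued = flow_status()
    print(f"active threads: {threads}, queued: {queued}")
```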
3.4 Stream Data Ingestion
Stream data, or real-time data, arrives quickly and must be analyzed quickly to produce a decision or action
within a short and strictly bounded time frame. The flow of data is fast and difficult to manage, store, and
analyze, so there is a strong need to support real-time data ingestion, particularly for demanding new
applications. Stream data ingestion is important for different sectors; for example, large volumes of real-time
business data need to be ingested for mobile marketing analysis, advertising recommendation frameworks, and
visualizing changing data and progress in real time (Salah, Al-Omari, Alwidian, Al-Hamadin, & Tawalbeh, 2019).
Stream data ingestion faces the challenge of processing operational and real-time data, which is vital in
quickly changing situations. The streaming process separates a continuous input stream into different units for
further processing. When real-time data is stored on hard disks, access suffers a fair amount of latency, so for
large volumes of data hard disks are not suitable; this creates the memory challenge, and existing systems often
suffer from extremely slow identification processes.
Increasingly intensive data is extracted from sources ranging from live multimedia to IoT data and real-time
data from social media and blogs. The growing use of real-time analytics on social media such as Facebook and
Twitter creates another challenge, as companies aim to ingest this data with low latency.
Security is another challenge for the stream data ingestion process, arising from the quick growth of the
internet; web-based systems face malicious and suspicious files threatening their security, so the ingestion
process should provide security, auditing, and provenance. The analytical value of stream data depends on the
accuracy and completeness of the data, so achieving good and accurate stream data ingestion is a complicated and
challenging task that requires good planning and expertise (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017; Pal, Li, &
Atkinson, 2018; Gurcan & Berigel, 2018).
3.4.1 Flume Apache
It’s a distributed reliable, available and efficient service for importing, collecting, aggregating and bringing in
huge amount of data with its streaming feature and ingest it in a way that makes it easy for processing tool,
hardly supports fault tolerance with accurate consistency ways, the data model used by flume is particularly used
for online analytic application It has the most important role in data ingestion for real time data analytics, which
is responsible for data refining and data visualization (Yadranjiaghdam, B., Pool, N., & Tabrizi, N., 2016)
(Hemavathi, D., & Srimathi, H., 2017) (Yadranjiaghdam, B., Yasrobi, S., & Tabrizi, N., 2017) (Begum, N., &
Shankara, A. A., 2016) Flume provides a framework for collecting and analyzing data from a sensor network
with high performance scalability of HDFS. (Yadranjiaghdam, B., Yasrobi, S., & Tabrizi, N., 2017) (Ji, C., Liu,
S., Yang, C., Wu, L., & Pan, L., 2016)
The data flow in Flume is like a pipeline that ingests data from a source to a destination. As shown in Figure 5
below, which depicts the Flume architecture, data is transferred from source to destination by a Flume agent, a
JVM process that hosts the components through which the data flows; it consists of a source, a channel, and a
sink. The source is the part of the agent that receives data from the data generators and moves it to the
channel; the channel is a bridge between the source and the sink; and the sink is the entity that delivers the
data to the destination (Begum & Shankara, 2016).
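A Flume agent is wired together in a properties file naming exactly these three components. Below is a minimal
sketch, written out from Python for concreteness, of a netcat source feeding an HDFS sink through a memory
channel, following the single-agent pattern in the Flume user guide; the agent name, host, port, and HDFS path
are illustrative.

```python
# A minimal sketch of a single-agent Flume configuration (source -> channel
# -> sink); the properties file is what `flume-ng agent` consumes.
FLUME_CONF = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: listens for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer bridging source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: writes the events to date-partitioned directories in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

with open("example.conf", "w") as f:
    f.write(FLUME_CONF)
# The agent is then started with, e.g.:
#   flume-ng agent --name a1 --conf-file example.conf
```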
Figure 5. Flume Architecture (Aravinth, S. S., Begam, A. H., Shanmugapriyaa, S., Sowmya, S., & Arun, E., 2015)
Flume has been used in different research projects. It was used for ingesting data in a study on variance
detection in household heating data from the Jinan municipal steam heating system. Sensors were installed in all
the rooms to collect information such as thermal power, accumulated heat, and temperature; the study covered
16,909 rooms in 394 buildings. Flume ingested all the sensor data, which was then processed by Spark to produce
the specified results (Lee & Paik, 2017).
3.4.2 Apache Kafka
Kafka is a distributed streaming tool that provides unified, high-throughput real-time data feeds and a message
brokering system. The most important characteristic of Kafka is its low latency, as all processing occurs in
memory to avoid the access latency of hard disks (Shahin, Hannen Ennab, & Alwidian, 2019). Kafka provides
high-performance ingestion of large numbers of messages with low latency and fault tolerance.
It has three major components: broker, consumer, and producer. Even if the consumer and producer are written in
different programming languages, Kafka works efficiently and connects different platforms together, and it is
used not only for streaming data but for other types of data as well. The broker acts as the server in Kafka and
is responsible for fault tolerance, which is Kafka's most important feature. The producer sends messages to the
consumer through the broker, which acts as a channel that differentiates the messages. Kafka provides a
high-performance real-time data channel (multi-node and multi-broker) (Lee & Paik, 2017).
As shown in Figure 6 below, there are two main processes in the Kafka architecture: distributing the messages
and publishing them. Kafka runs on many servers organized as clusters; each cluster can deal with thousands of
clients with huge read and write capacity, acting as a central point for a large organization's data. A Kafka
cluster keeps a log of the messages and gives each one a sequential ID. The log is distributed across the
cluster, with each node holding a share of the partitions, which guarantees fault tolerance (Hemavathi &
Srimathi, 2017; Yadranjiaghdam et al., 2017; Pal, Li, & Atkinson, 2018). Kafka provides better scalability and
message consistency than Flume.
Kafka's scalability allows the system to grow elastically and transparently without any downtime. Data can be
partitioned and distributed across the machines even if the capacity of a single machine is smaller than the
size of the data, and messages are persisted on disk to avoid data loss (Hemavathi & Srimathi, 2017).
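The producer/broker/consumer flow described above can be made concrete with a short sketch. The example below
uses the kafka-python client, which is one of several available clients and is an assumption here (the paper
names no library); the broker address and topic name are placeholders.

```python
# A minimal producer/consumer sketch using the kafka-python client; the
# broker persists the messages, and the consumer reads the log back.
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"
TOPIC = "sensor-readings"

# Producer: publishes messages to the topic through the broker
producer = KafkaProducer(bootstrap_servers=BROKER)
for reading in (b"23.1", b"23.4", b"22.9"):
    producer.send(TOPIC, value=reading)
producer.flush()  # block until all buffered messages are acknowledged

# Consumer: reads the log back, starting from the earliest retained offset
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)
```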
Figure 6. Kafka Architecture
Kafka has been used in different research projects. It was used to study people's reactions to the Tsunami
earthquake in Japan; for this purpose the researchers relied on the Twitter social network, using keywords like
"Tsunami" and "Japan earthquake" during tweet ingestion. The tweets were ingested, filtered, processed, and
visualized. Kafka was used for tweet ingestion and was linked to the Twitter Streaming API; the flow of tweets
was then classified based on their content. Spark was used for the Twitter data analysis, based on the time, the
location of the tweets, and the time zone of the tweeting. This information was processed in memory, as it was
real-time processing over a massive amount of data flowing into the system; the results discuss how people
around the world reacted to the Tsunami earthquake (Yadranjiaghdam, Yasrobi, & Tabrizi, 2017).
3.4.3 Apache NiFi
NiFi is also used for stream data ingestion; it was built to solve the challenge of data flowing between
systems, where some systems create the data and other systems consume it.
Many research projects depend on NiFi for ingesting real-time data into their systems. One example is a media
monitoring application covering channels such as Twitter and Satori, where NiFi was used in a company together
with Kafka as a dataflow in a cluster. The company used a live cloud-based platform delivering messaging
services labelled RTM, whose real-time data served as input for its open data channels initiative. The first
source of data to be ingested was Big RSS and the live data channels in Satori, as RSS feeds; this is the
biggest RSS aggregation in the world, with a volume of over 6.5 million feeds. The second, very vital, source of
data was streaming news stories from the Twitter API platform, which offers scalable access to its data. Two
kinds of filtering tools were used as the filtering capability for real-time tweets, with enterprise options
supporting from 250 to 2,000 filters of 2,048 characters for each stream.
The large volume and high velocity of the data streams coming from the Twitter streaming API depend on the
popularity of the queries. The ingestion used NiFi for the whole flow, with three local process groups. The
filters used to ingest media remove duplicates and noisy data, and the data is then routed to the related
analytics systems (Isah & Zulkernine, 2018).
Table 1. Comparison of Big Data ingestion tools

| Criterion | Sqoop | Flume | NiFi | Kafka |
|---|---|---|---|---|
| Basic nature | Works well with any RDBMS that has JDBC (Java Database Connectivity), like Oracle | Works well for streaming data sources which generate data continuously | Works well for dataflow creation between different systems | Works well for messaging streaming data |
| Type of data | Batch data | Stream data | Batch and stream data | Stream data |
| Type of loading | Not event-driven | Event-driven | Both (event and non-event) | Event-driven |
| Architecture | Connector based | Agent based | Flow based | Process topology |
| Link to HDFS | Connected | Connected | Connected | Programmable |
4. Data Preparation
In Big Data analytics, as shown in Figure 7 below, the data preparation stage is considered the most integral
phase; in it, data preprocessing and integration operations are performed in order to enhance Big Data quality
and suitability. This phase embraces a wide range of operations and techniques that are mainly applied to
generate useful data sets for further data mining algorithms.
For example, in the real world, collecting Big Data from various sources such as sensors and social media using
Internet-of-Things (IoT) techniques produces massive data with irrelevant and noisy information. Therefore,
tackling the problems of noisy data, outliers, and anomalies is required in order to provide noise-free,
high-quality datasets.
Furthermore, at this level of data cleansing and de-noising, it is imperative to deploy feature extraction
methods to separate useful and structured data from the raw Big Data. Other challenges in the data preparation
phase arise from the nature of the Big Data sources, including the velocity and variety of Big Data types. As
the data gathered from various sources differs in type, format, and dimension, intelligent data fusion,
dimension reduction, and dataset uniformization techniques are applied to achieve data integrity and consistency
for the collected unstructured and semi-structured data streams.
However, the presence of missing data values, which cannot be avoided in data analysis, remains a significant
issue even after a uniformly structured Big Data format has been created. Thus, it is necessary to deploy
operations that handle missing values, such as data elimination, sketching, and imputation, to increase the
efficiency of knowledge extraction processes and improve the overall quality of the produced data for better
decision making (Rehman, Chang, Batool, & Wah, 2016).
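Two of the missing-value strategies named above, elimination and imputation, can be illustrated with a minimal
pandas sketch; the toy sensor dataset here is invented purely for illustration.

```python
# A minimal sketch of handling missing values: elimination vs. imputation.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3],
    "temperature": [21.5, np.nan, 22.0, np.nan, 23.1],
})

# Elimination: drop every row that contains a missing value
eliminated = df.dropna()

# Imputation: fill missing readings with the column mean instead
imputed = df.fillna({"temperature": df["temperature"].mean()})

print(eliminated)
print(imputed)
```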
Figure 7. Big Data Analytics (Rallapalli, S., & Gondkar, R. R., 2015)
4.1 Apache Spark
Many features characterize Spark; the most important one is speed: an application can run in a Hadoop cluster
more than 100 times faster when it runs in memory, and 10 times faster when it runs on disk. This is achieved by
reducing the number of read and write operations on disk and storing the intermediate processing data in memory.
Spark supports multiple languages such as Java, Scala, and Python.
Spark was used in a 2019 study whose goal was to analyze agricultural big data; plenty of terminal equipment in
the agricultural park collects environmental data that affects crop growth every day. Hadoop and Spark were used
to improve this analysis. The authors developed applications for real agricultural-park big data analysis in
both frameworks and implemented a yield prediction model based on multiple linear regression using Spark MLlib.
The results show that the performance of Spark is higher than that of Hadoop, and that the model can obtain good
prediction results (Cheng, Zhang, & Ye, 2019).
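Since the cited study built its yield prediction model with multiple linear regression in Spark MLlib, a minimal
PySpark sketch of that technique may help; the column names and toy data below are illustrative, not taken from
the study.

```python
# A minimal sketch of multiple linear regression with Spark MLlib.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yield-prediction").getOrCreate()

# Environmental features (e.g. temperature, humidity) and the observed yield
data = spark.createDataFrame(
    [(24.0, 0.61, 3.2), (27.5, 0.55, 3.6), (22.1, 0.70, 2.9), (26.3, 0.58, 3.5)],
    ["temperature", "humidity", "yield"],
)

# Assemble feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                            outputCol="features")
train = assembler.transform(data)

# Fit the multiple linear regression model and inspect its parameters
lr = LinearRegression(featuresCol="features", labelCol="yield")
model = lr.fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```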
4.2 Apache Hive
Hive is a data warehouse system for Hadoop. Similar to any SQL engine, it runs queries that are compiled into
functions such as MapReduce jobs and returns the results to the user. Hadoop holds unstructured data with only
loose structure attached to it, and the most important reason for using Hive is that it makes working with the
Hadoop file system and MapReduce easy for non-developers: users such as scientists and analysts who already know
SQL syntax can explore the data by writing SQL statements instead of writing code, which means less time
(Surekha, Swamy, & Venkatramaphanikumar, 2016).
Hive is ideal for data aggregation, ad-hoc querying, and analysis of massive data, such as analyzing social
media data like Twitter data (Bhardwaj, Vanraj, Kumar, Narayan, & Kumar, 2015). The diagram below shows the
architecture of Hive. The Hive metadata store (the metastore) can use embedded, local, or remote databases. Hive
servers are built on Apache Thrift Server technology, and the variety of user interfaces, including a web UI,
gives another advantage to using Hive.
Hive was built on top of Hadoop; it simplifies access to data via SQL, enabling data warehousing tasks such as
extract/transform/load (ETL), reporting, and data analysis, and gives easy access to files stored either
directly in HDFS or in other data storage systems such as HBase.
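This SQL-only workflow can be seen in a short sketch. The example below uses the PyHive client, which is an
assumption for illustration (any HiveServer2-compatible client would do); the host, username, table, and columns
are placeholders.

```python
# A minimal sketch of querying Hive with plain SQL from Python via PyHive;
# assumes a HiveServer2 instance on the default port 10000.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()

# An aggregation typical of Hive usage: the query compiles to a batch job
# (e.g. MapReduce) behind the scenes, so the analyst writes only SQL.
cursor.execute("""
    SELECT year, COUNT(*) AS total_crimes
    FROM crimes
    GROUP BY year
    ORDER BY year
""")
for year, total in cursor.fetchall():
    print(year, total)

conn.close()
```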
In August 2016, Hive was applied as a new big data tool because traditional database systems struggle with the
differences in nature and complexity of data obtained from multiple sources. A performance profiling of
meteorological and oceanographic data on Hive was conducted. Hive, being the commonly used data warehouse
analytical platform for big data, was chosen with the aim of exposing the intricacies involved in formatting and
loading the data. When the meteorological and oceanographic data was properly formatted, its analysis with Hive
proved to be efficient compared with traditional database systems. The results of this study have the potential
to attract oil and gas companies to adopt big data technologies for handling their exploration datasets
(Abdullahi, Ahmad, & Zakaria, 2016).
4.3 Apache Impala
Impala is a SQL query tool for big data designed for real-time processing with interactive and responsive tools;
it was inspired by the Google Dremel project and developed by Cloudera. Impala approaches the data differently
from Hive and MapReduce: it uses a distributed query engine, similar to a parallel relational database, instead
of the slow batch processing mode. It can apply all SQL features and other statistical functions directly to
data stored in HDFS or HBase, which adds further value by reducing the delay (Jingmin Li, 2014).
As shown in the figure below, Impala consists of two main parts. The first, called impalad, is a distributed
query process consisting of a query planner, a query coordinator, and a query execution engine. The second part
is the Impala state process, called statestored, which is responsible for collecting CPU, memory, and network
resource information from all nodes in the cluster. It creates multiple threads to process the impalad
subscriptions and keeps a heartbeat connection with each impalad. Each impalad caches a copy of the information
in the statestored and switches between recovery mode and normal mode depending on the statestored status
(offline/online) (Jingmin Li, 2014).
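For completeness, here is a minimal sketch of an interactive Impala query from Python using the impyla client;
the client choice, host name, table, and columns are assumptions for illustration, and 21050 is impalad's usual
HiveServer2-compatible query port.

```python
# A minimal sketch of a low-latency Impala query via the impyla client;
# host, table, and column names are illustrative placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-host", port=21050)
cursor = conn.cursor()

# The query is planned and executed by the distributed engine (impalad),
# not as a batch MapReduce job, so results come back interactively.
cursor.execute("SELECT station, AVG(co2_ppm) FROM readings GROUP BY station")
for station, avg_co2 in cursor.fetchall():
    print(station, avg_co2)

conn.close()
```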
In one use case, wireless sensor network nodes for CO2 monitoring gather data and accumulate it in temporary
storage; the data is then streamed via the Kafka platform and stored using the Impala database (Wiska, Habibie,
Wibisono, Nugroho, & Mursanto, 2016).
4.4 Apache Storm
Storm is defined as a distributed system for computation and real-time data processing; it is fault-tolerant and
makes it easy to run complex real-time computations on a computer cluster, similar to what Hadoop does for batch
processing. Storm can ensure that messages are processed quickly and systematically, and it integrates with many
programming languages for development (Yang, Liu, Zhang, & Yang, 2013).
As shown in the figure below, a Storm cluster is divided into three kinds of nodes: the Nimbus node (master
node), which uploads computations for execution, distributes code across the cluster, launches workers across
the cluster, and supervises computations and reallocates workers as needed; Zookeeper nodes, which coordinate
the Storm cluster; and supervisor nodes, which communicate with Nimbus through Zookeeper and start and stop
workers according to signals from Nimbus.
Both Hive and Impala are SQL engines, but Impala is faster than Hive. Impala is a good choice for BI analytics
queries on Hadoop, as it provides low latency and high concurrency, whereas Hive is used for building efficient
data warehousing solutions.
Spark is a fast processing engine that provides different capabilities, such as interactive analytics, streaming
data, and machine learning, so it is a good choice for online and real-world applications.
5. Conclusion
The comparison between Big Data ingestion tools was discussed in Table 1 in order to help users choose the tool
that satisfies their needs, and a comparison between data preparation tools was reviewed in Table 2.
In future research, performance indicators such as speed and the number of processed files per minute will be
studied for each tool and included in the comparison.
References
Abdullahi, A. U., Ahmad, R., & Zakaria, N. M. (2016). Big data: Performance profiling of Meteorological and
Oceanographic data on Hive. 2016 3rd International Conference on Computer and Information Sciences
(ICCOINS). https://doi.org/10.1109/ICCOINS.2016.7783215
Abuqabita, F., Al-Omoush, R., & Alwidian, J. (2019). A Comparative Study on Big Data Analytics Frameworks,
Data Resources and Challenges. Modern Applied Science, 13(7), 1. https://doi.org/10.5539/mas.v13n7p1
Apache Hive. (n.d.). Retrieved December 2019, from https://hive.apache.org/
Apache Sqoop. (n.d.). Retrieved December 2019, from https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Aravinth, S. S., Begam, A. H., Shanmugapriyaa, S., Sowmya, S., & Arun, E. (2015). An efficient HADOOP
frameworks SQOOP and ambari for big data processing. International Journal for Innovative Research in
Science and Technology. ISSN (online): 2349-6010
Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A., Rosen, J., … Zaharia, M. (2015). Scaling spark in the
real world. Proceedings of the VLDB Endowment, 8(12), 1840–1843.
https://doi.org/10.14778/2824032.2824080
Begum, N., & Shankara, A. A. (2016). Rectify and envision the server log data using apache flume. Int. J.
Technol. Res. Eng, 3(9). ISSN (Online): 2347 - 4718
Bhardwaj, A., Vanraj, Kumar, A., Narayan, Y., & Kumar, P. (2015). Big data emerging technologies: A
CaseStudy with analyzing twitter data using apache hive. 2015 2nd International Conference on Recent
Advances in Engineering & Computational Sciences (RAECS).
https://doi.org/10.1109/RAECS.2015.7453400
Bucur, C. (2015, July). Using big data for intelligent businesses. In Proceedings of the Scientific Conference
AFASES (Vol. 2, pp. 605-612).
Chen, Z., Chen, N., & Gong, J. (2015). Design and implementation of the real-time GIS data model and Sensor
Web service platform for environmental big data management with the Apache Storm. 2015 Fourth
International Conference on Agro-Geoinformatics (Agro-Geoinformatics).
https://doi.org/10.1109/Agro-Geoinformatics.2015.7248139
Cheng, Y., Zhang, Q., & Ye, Z. (2019). Research on the Application of Agricultural Big Data Processing with
Hadoop and Spark. 2019 IEEE International Conference on Artificial Intelligence and Computer
Applications (ICAICA). https://doi.org/10.1109/ICAICA.2019.8873519
Erraissi, A., Belangour, A., & Tragha, A. (2018). Meta-Modeling of Data Sources and Ingestion Big Data Layers.
SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3185342
Gurcan, F., & Berigel, M. (2018). Real-Time Processing of Big Data Streams: Lifecycle, Tools, Tasks, and
Challenges. 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies
(ISMSIT). https://doi.org/10.1109/ISMSIT.2018.8567061
Cloudera, Inc. (2020, January 14). Cloudera. Retrieved December 2019, from https://www.cloudera.com/
Hemavathi, D., & Srimathi, H. (2017). Survey on data failure handling methods of streaming data. 2017
International Conference on Intelligent Computing and Control Systems (ICICCS).
https://doi.org/10.1109/ICCONS.2017.8250686
Isah, H., & Zulkernine, F. (2018). A Scalable and Robust Framework for Data Stream Ingestion. 2018 IEEE
International Conference on Big Data (Big Data). https://doi.org/10.1109/BigData.2018.8622360
Jain, A., & Bhatnagar, V. (2016). Crime Data Analysis Using Pig with Hadoop. Procedia Computer Science, 78,
571–578. https://doi.org/10.1016/j.procs.2016.02.104
Ji, C., Liu, S., Yang, C., Wu, L., & Pan, L. (2015). IBDP: An Industrial Big Data Ingestion and Analysis
Platform and Case Studies. 2015 International Conference on Identification, Information, and Knowledge in
the Internet of Things (IIKI). https://doi.org/10.1109/IIKI.2015.55
Jingmin Li. (2014). Design of real-time data analysis system based on Impala. 2014 IEEE Workshop on
Advanced Research and Technology in Industry Applications (WARTIA).
https://doi.org/10.1109/WARTIA.2014.6976427
Lee, C., & Paik, I. (2017). Stock market analysis from Twitter and news based on streaming big data
infrastructure. 2017 IEEE 8th International Conference on Awareness Science and Technology (iCAST).
https://doi.org/10.1109/ICAwST.2017.8256469
Matacuta, A., & Popa, C. (2018). Big Data Analytics: Analysis of Features and Performance of Big Data
Ingestion Tools. Informatica Economica, 22(2), 25–34.
https://doi.org/10.12948/issn14531305/22.2.2018.03
Meehan, J., Aslantas, C., Zdonik, S., Tatbul, N., & Du, J. (2017, January). Data Ingestion for the Connected
World. In CIDR.
Mohamed, E., & Hong, Z. (2016). Hadoop-MapReduce Job Scheduling Algorithms Survey. 2016 7th
International Conference on Cloud Computing and Big Data (CCBD).
https://doi.org/10.1109/CCBD.2016.054
Nargesian, F., Zhu, E., Miller, R. J., Pu, K. Q., & Arocena, P. C. (2019). Data lake management. Proceedings of
the VLDB Endowment, 12(12), 1986–1989. https://doi.org/10.14778/3352063.3352116
Oughdir, L., Dakkak, A., & Dahdouh, K. (2019). Big data: a distributed storage and processing for online
learning systems. International Journal of Computational Intelligence Studies, 8(3), 192.
https://doi.org/10.1504/IJCISTUDIES.2019.10024283
Oussous, A., Benjelloun, F. Z., Lahcen, A. A., & Belfkih, S. (2018). Big Data technologies: A survey. Journal of
King Saud University-Computer and Information Sciences, 30(4), 431-448.
https://doi.org/10.1016/j.jksuci.2017.06.001
Pal, G., Li, G., & Atkinson, K. (2018). Big Data Ingestion and Lifelong Learning Architecture. 2018 IEEE
International Conference on Big Data (Big Data). https://doi.org/10.1109/BigData.2018.8621859
Peng, R. (2019). Kylo Data Lakes Configuration deployed in Public Cloud environments in Single Node Mode.
DiVA, id: diva2:1367021
Rallapalli, S., & Gondkar, R. R. (2015). Map reduce programming for electronic medical records data analysis
on cloud using apache hadoop, hive and sqoop. International Journal of Latest Technology in Engineering,
Management & Applied Science. ISSN 2278.
Rehman, M. H. ur, Chang, V., Batool, A., & Wah, T. Y. (2016). Big data reduction framework for value creation
in sustainable enterprises. International Journal of Information Management, 36(6), 917–928.
https://doi.org/10.1016/j.ijinfomgt.2016.05.013
Salah, H., Al-Omari, I., Alwidian, J., Al-Hamadin, R., & Tawalbeh, T. (2019). Data Streams Curation for Better
Machine Learning Functionality and Result to Serve IoT and other Applications: A Survey. Journal of
Computer Science, 15(10), 1572–1584. https://doi.org/10.3844/jcssp.2019.1572.1584
Shahin, D., Hannen Ennab, R. S., & Alwidian, J. (2019). Big Data Platform Privacy and Security, A
Review. IJCSNS, 19(5), 24.
Surekha, D., Swamy, G., & Venkatramaphanikumar, S. (2016). Real time streaming data storage and processing
using storm and analytics with Hive. 2016 International Conference on Advanced Communication Control
and Computing Technologies (ICACCCT). https://doi.org/10.1109/ICACCCT.2016.7831712
Apache NiFi Team. (n.d.). Apache NiFi user guide. Retrieved December 2019, from
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
Wiska, R., Habibie, N., Wibisono, A., Nugroho, W. S., & Mursanto, P. (2016). Big sensor-generated data
streaming using Kafka and Impala for data storage in Wireless Sensor Network for CO2 monitoring. 2016
International Workshop on Big Data and Information Security (IWBIS).
https://doi.org/10.1109/IWBIS.2016.7872896
Yadranjiaghdam, B., Pool, N., & Tabrizi, N. (2016). A Survey on Real-Time Big Data Analytics: Applications
and Tools. 2016 International Conference on Computational Science and Computational Intelligence
(CSCI). https://doi.org/10.1109/CSCI.2016.0083
Yadranjiaghdam, B., Yasrobi, S., & Tabrizi, N. (2017). Developing a Real-Time Data Analytics Framework for
Twitter Streaming Data. 2017 IEEE International Congress on Big Data (BigData Congress).
https://doi.org/10.1109/BigDataCongress.2017.49
Yang, W., Liu, X., Zhang, L., & Yang, L. T. (2013). Big Data Real-Time Processing Based on Storm. 2013 12th
IEEE International Conference on Trust, Security and Privacy in Computing and Communications.
https://doi.org/10.1109/TrustCom.2013.247
Copyrights
Copyright for this article is retained by the author(s), with first publication rights granted to the journal.
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution
license (http://creativecommons.org/licenses/by/3.0/).