


Vol. 10, Issue 38 April-June 2020
Page Nos. 246-253

STUDY OF DATA INGESTION TOOLS


Rakhee Yadav*
Yogesh Kumar Sharma**
Rajendra Patil***

ABSTRACT
With the development of information technology, business intelligence is playing a vital role, and the data used for business intelligence is accumulated with different tools such as Kafka, NiFi, etc. This paper presents a literature study that throws light on the selection of data ingestion tools in different scenarios. The authors have selected some popular tools that are preferred by industry for accumulating data.
Index Terms – Data ingestion, data ingestion tools, real-time streaming, Apache Kafka, business intelligence.

1.0 INTRODUCTION

Data ingestion is a process where data is transferred from multiple sources to a destination where it can be stored and used for future analysis. Data arrives in multiple formats and from different sources, including relational databases, other types of databases, CSV files, and streams. Because the data comes from various places, it needs to be rectified and transformed so that it can be analysed together with data from other sources; otherwise the data is a bunch of puzzle pieces that do not fit together. The functions of data ingestion are as follows:

1) Data Collection: The primary purpose of data ingestion is the collection of data from many sources. The data comes in multiple formats, some structured, semi-structured, or unstructured, and may be available in batches that are moved into a data lake.
2) Filtration Process: Early in the data lifecycle, data passes through a process of filtration and sanitization, where parsing and removal of redundancy are possible. Complex operations, such as identifying and deleting invalid or null data values, can be performed with scripts.
3) Transportation Process: Moving the data into its respective stores within the data lake depends on the automation procedures and the clarity of the routing rules set up.

Following are the types of data ingestion.

Batch: An efficient way to process a large volume of data is to make batches of data, where a set of transactions is collected over time and then processed together. Batch results are produced with software tools such as Hadoop. Data ingested in batches can be imported at regularly scheduled intervals. This is useful for processes that run on a schedule, such as report generation executed daily at a specific time.

*Research Scholar, Shri JJT University, Jhunjhunu, Rajasthan


**Professor, Shri JJT University, Jhunjhunu, Rajasthan
Fig 1: Batch data processing
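The batch mode described above can be sketched in a few lines of Python. This is an illustrative example, not from the paper; the record layout and the `daily_report` function are hypothetical.

```python
from datetime import date

# Hypothetical transactions collected over time; in practice these would be
# read from files, a database, or an export when the scheduled job runs.
transactions = [
    {"day": date(2020, 4, 1), "amount": 120},
    {"day": date(2020, 4, 1), "amount": 80},
    {"day": date(2020, 4, 2), "amount": 50},
]

def daily_report(batch):
    """Process one accumulated batch: aggregate amounts per day."""
    totals = {}
    for t in batch:
        totals[t["day"]] = totals.get(t["day"], 0) + t["amount"]
    return totals

# A scheduler (e.g. cron) would invoke this once a day at a fixed time.
report = daily_report(transactions)
```

The key property of batch ingestion is visible here: nothing is processed until the whole batch has been collected.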
Real time or stream data: Real-time ingestion demands input from a continual source, which is processed to generate output within a small time window (near real time). It is a process of moving data into big data systems as it arrives. Real-time ingestion is beneficial when the information gleaned is highly time-sensitive, such as data from a power grid that is monitored moment to moment.

Fig 2: Stream data processing
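In contrast to the batch mode, a stream is handled one record at a time as it arrives. The sketch below uses the paper's power-grid example; the sensor readings and the `ingest_reading` function are hypothetical.

```python
def ingest_reading(reading, alerts, threshold=240.0):
    """Handle one record the moment it arrives (near real time)."""
    if reading["volts"] > threshold:
        alerts.append(reading["sensor"])

# Simulated continual source; a real system would read from a socket or queue.
stream = [
    {"sensor": "grid-7", "volts": 231.0},
    {"sensor": "grid-9", "volts": 251.5},
]
alerts = []
for reading in stream:                # each element is processed on arrival,
    ingest_reading(reading, alerts)   # not accumulated into a batch
```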


Lambda architecture: Data can also be ingested using a lambda (hybrid) architecture. This approach balances the advantages of batch and real-time modes: batch processing provides combined views over batches of data, while real-time processing provides views of the most recent data.

Fig 3: Lambda (hybrid) data processing
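The two layers of the lambda architecture can be sketched as two views that a serving layer merges at query time. This is a toy illustration with hypothetical page-view events, not an implementation from the paper.

```python
# Batch layer: precomputed view over all historical data (recomputed infrequently).
batch_events = [("page_a", 1)] * 100 + [("page_b", 1)] * 40

def batch_view(events):
    counts = {}
    for key, n in events:
        counts[key] = counts.get(key, 0) + n
    return counts

# Speed layer: events that arrived since the last batch recomputation.
recent_events = [("page_a", 1), ("page_a", 1), ("page_c", 1)]

def merged_view(batch, recent):
    """Serving layer: combine the batch view with the real-time view."""
    view = dict(batch_view(batch))
    for key, n in recent:
        view[key] = view.get(key, 0) + n
    return view

view = merged_view(batch_events, recent_events)
```

The merged result is always as complete as the batch view and as fresh as the speed layer, which is the balance the text describes.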
2.0 ARCHITECTURE OF DATA INGESTION

Data ingestion is the initial and toughest part of the entire data processing architecture. The key parameters considered when designing a data ingestion solution are data velocity, size, and format. Data streams into the system from several different sources at different speeds and sizes, including social networks, IoT devices, and machines, and every incoming stream has different semantics: a stream might be structured, unstructured, or semi-structured. Data is streamed either continually in real time or in regular batches; some use cases need data to stream in continually, while for others, such as studying trends, social media data can be streamed in at regular intervals.

Fig 4: Data ingestion architecture

3.0 SOME POPULAR DATA INGESTION TOOLS

3.1 Apache Kafka
Apache Kafka is a distributed system developed by LinkedIn that provides a partitioned, replicated, log-based service for data. Kafka maintains topics on a Kafka cluster: producers publish message feeds to topics, and consumers subscribe to topics to be fed messages. Kafka is also a stream processor, integrating applications and data streams via an API. It can run concurrent processes and transfer huge amounts of data quickly, and it has been used by organizations such as Uber and Netflix; Netflix's big data ingestion platform, for example, uses Kafka for big data streams.

Fig: Apache Kafka
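To make the topic/producer/consumer model concrete, here is a toy in-memory sketch of Kafka's core idea: an append-only log with per-consumer read offsets. This is not the real Kafka client (which talks to a broker over the network, e.g. via the kafka-python or confluent-kafka libraries), only an illustration of the log-based service described above.

```python
class TinyLog:
    """Toy model of one Kafka topic partition: an append-only log
    plus per-consumer offsets. Illustrative only."""

    def __init__(self):
        self.log = []      # messages in arrival order
        self.offsets = {}  # consumer name -> next position to read

    def produce(self, message):
        """A producer appends a message to the end of the log."""
        self.log.append(message)

    def consume(self, consumer):
        """A consumer reads every message it has not yet seen."""
        pos = self.offsets.get(consumer, 0)
        batch = self.log[pos:]
        self.offsets[consumer] = len(self.log)
        return batch

topic = TinyLog()
topic.produce("view:home")
topic.produce("view:cart")
first = topic.consume("analytics")
topic.produce("view:checkout")
second = topic.consume("analytics")
```

Because each consumer only tracks an offset into a shared log, many consumers can read the same topic independently, which is what makes the log-based design suit high-volume streams.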
3.2 Apache NiFi
Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems. It is derived from the "NiagaraFiles" software initially developed by the NSA and was open-sourced as part of the NSA's technology transfer program in 2014.

3.3 Apache Spark
Apache Spark is an open-source cluster computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
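The data-parallel model can be sketched without Spark itself: below, plain Python stands in for what a `map` over partitions followed by a `reduce` would do across a cluster. The partitioning and the squaring operation are illustrative assumptions, not Spark API calls.

```python
from functools import reduce

# Data split into partitions, as Spark would distribute it across a cluster.
partitions = [[1, 2, 3], [4, 5], [6]]

# "map" step: applied independently to each partition (parallel in Spark).
squared = [[x * x for x in part] for part in partitions]

# "reduce" step: combine the partial results from each partition.
partial_sums = [sum(part) for part in squared]
total = reduce(lambda a, b: a + b, partial_sums)
```

Because the map step never looks outside its own partition, a failed partition can simply be recomputed, which is the fault-tolerance property the text mentions.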

3.4 Apache Storm
Apache Storm is a distributed real-time streaming system for managing unbounded data streams, developed by Twitter. It differs from MapReduce mainly in that Storm topologies are long-running: a stream of data from Twitter, for instance, could be converted into a stream of the latest trending topics. The two main components of Storm are tuples and nodes.
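The trending-topics example above can be sketched as tuples flowing through processing nodes. This is a plain-Python analogy with made-up tweet data, not the Storm API (which would define these nodes as spouts and bolts in a topology).

```python
from collections import Counter

# Source node: an unbounded stream of tuples; here, a finite stand-in.
tweets = ["#kafka rocks", "#nifi flow", "#kafka again"]

def hashtag_node(tweet):
    """Emits one tuple per hashtag found in an incoming tweet."""
    return [word for word in tweet.split() if word.startswith("#")]

def trending_node(counts, tag):
    """Keeps a running count of hashtags and emits the current top topic."""
    counts[tag] += 1
    return counts.most_common(1)[0][0]

counts = Counter()
trending = None
for tweet in tweets:              # tuples flow through the topology
    for tag in hashtag_node(tweet):
        trending = trending_node(counts, tag)
```

Unlike a MapReduce job, this pipeline never terminates on its own: the counting node holds state and keeps emitting updated results for as long as tuples arrive, which is what "long-running topology" means.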

3.5 Apache Flink
Apache Flink is open-source processing software developed at the Apache Software Foundation. Flink is a distributed streaming engine that executes dataflow programs in a parallel and pipelined fashion.
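Pipelined execution, where each stage consumes records as soon as the previous stage produces them rather than waiting for a full intermediate result, can be mimicked with Python generators. This is an analogy for the execution style, not the Flink API; the stage functions are invented for illustration.

```python
def source():
    """Produces records one at a time, like a streaming source."""
    for n in range(1, 6):
        yield n

def double(records):
    """A transformation stage in the dataflow."""
    for r in records:   # consumes each record as soon as it is produced
        yield r * 2

def keep_large(records, limit=4):
    """A filtering stage further down the pipeline."""
    for r in records:
        if r >= limit:
            yield r

# The three stages run in a pipelined fashion: no stage materializes
# the full intermediate dataset before the next one starts.
result = list(keep_large(double(source())))
```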

3.6 Amazon Kinesis
Amazon Kinesis is used for real-time data processing over large numbers of data streams in a distributed environment. Kinesis is a cloud-based service provided by Amazon Web Services that can store and process terabytes of stream data per hour continuously; the Kinesis Client Library (KCL) is used to process and manage stream data from sources such as web clicks, financial transactions, social media feeds, and log-based tracking events.

3.7 Amazon Firehose
Amazon Firehose enables streaming data to be combined into business intelligence tools or data repositories through multiple interfaces. It helps to fetch data and integrate it into warehousing solutions available from Amazon, such as S3 and the Redshift cloud warehouse.

4.0 DATA INGESTION CHALLENGES

Gathering data from multiple sources and in different forms for business use cases presents a variety of challenges in data ingestion. These challenges can include:

1) Multiple source ingestion
Data sources are constantly evolving, so data ingestion should be robust and efficient enough to ingest the volume and diversity of the data. Decision-making is complicated when you have to manage and decide which data to include in your data repository. Organizations generate huge amounts of data during the product life cycle, including customer data, vendor and product data, and asset information.

2) Managing streaming/real-time data
This ingestion challenge occurs when managing data coming from scattered sources such as log files, e-commerce purchase information, or information from public networks. Data stream management systems (DSMS) were developed to help manage continual data streams. They resemble database management systems (DBMS), which are designed for static data, but a DSMS executes a continual query that is not performed just once; it is permanently installed. Most DSMS are data-driven, and a continual query produces new results as new data arrive at the system continuously.

3) Speed of ingestion
Data sources deliver data at varying frequencies. For example, comment forums amount to large data sets but arrive at a low frequency, whereas information such as tweets comes in small volumes at a high frequency and requires more rapid ingestion. A number of platforms now exist to process big data, including advanced SQL (sometimes called NewSQL) databases, which adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms, which range from file systems to document or columnar data stores and typically dispense with the need for modeling data.

4) Change detection (capturing low latency)
Because sources take in and deliver data at such different frequencies, capturing changes with low latency is itself a challenge: the same rapid-ingestion platforms described above, from NewSQL databases to NoSQL document and columnar stores, are used to detect and ingest changes as they occur.
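One common way to meet the change-detection challenge is incremental ingestion keyed on a last-modified timestamp, re-reading only rows changed since the previous run. This is a hypothetical sketch, not a technique taken from the paper; the row layout and `detect_changes` helper are invented for illustration.

```python
def detect_changes(rows, last_seen):
    """Return rows modified after the previous ingestion run,
    plus the new high-water mark to use next time."""
    changed = [r for r in rows if r["updated_at"] > last_seen]
    new_mark = max((r["updated_at"] for r in rows), default=last_seen)
    return changed, new_mark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 210},
]
# Only rows changed since timestamp 200 are re-ingested.
changed, mark = detect_changes(rows, last_seen=200)
```

Polling with a high-water mark keeps latency proportional to the polling interval; lower-latency designs read the source's change log instead, as log-based tools such as Kafka-backed pipelines do.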
data-ingestion-tools/
CONCLUSION AND FUTURE WORK

This paper discusses various data ingestion tools, only a few of which can handle low-latency data. The paper is intended to help researchers find which data ingestion tool is effective and efficient. The future scope of this work is to combine artificial intelligence with data ingestion to help ensure effortless data collection and connection to systems.

