


Vol. 10, Issue 38 April-June 2020
Page Nos. 246-253

STUDY OF DATA INGESTION TOOLS


Rakhee Yadav*
Yogesh Kumar Sharma**
Rajendra Patil***

ABSTRACT
With the development of information technology, business intelligence is playing a vital role, and the data used for business intelligence is accumulated with different tools such as Kafka, NiFi, etc. This paper presents a literature study that throws light on the selection of data ingestion tools in different scenarios. The authors have selected some popular tools that are preferred by industry for accumulating data.
Index Terms – Data ingestion, data ingestion tools, real-time streaming, Apache Kafka, business intelligence.

1.0 INTRODUCTION

Data ingestion is a process where data is transferred from multiple sources to a destination where it can be stored and used for future analysis. Data arrives in multiple formats and from different sources, including relational databases, other types of databases, CSV files, and streams. Because the data comes from various places, it needs to be rectified and transformed so that it can be analysed together with data from other sources; otherwise the data is a bunch of puzzle pieces that do not fit together. The functions of data ingestion are as follows:

1) Data Collection: The primary purpose of data ingestion is the collection of data from many sources. The data comes in multiple formats, some structured, semi-structured, or unstructured, and may be available in batches that are moved into a data lake.
2) Filtration Process: Early in the data lifecycle, data passes through a process of filtration and sanitization, where parsing and removal of redundancy are possible. Complex operations, such as identifying and deleting invalid or null data values, can be performed with scripts.
3) Transportation Process: Moving the data into its respective stores within the data lake depends on the automation procedures and the clarity of the routing rules set up.

Following are the types of data ingestion.

Batch: An efficient way to process a large volume of data is to make batches of data, where a set of transactions is collected over time and then processed together. Batch results are produced with software tools such as Hadoop. Data ingested in batches can be imported at regularly scheduled intervals. This is useful for processes that run on a schedule, such as report generation executed daily at a specific time.

*Research Scholar, Shri JJT University, Jhunjhunu, Rajasthan


**Professor, Shri JJT University, Jhunjhunu, Rajasthan
Fig 1: Batch data processing
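The batch mode described above can be sketched in a few lines of Python. This is an illustrative example, not from the paper; the record layout and the `daily_report` function are hypothetical.

```python
from datetime import date

# Hypothetical transactions collected over time; in practice these would be
# read from files, a database, or an export when the scheduled job runs.
transactions = [
    {"day": date(2020, 4, 1), "amount": 120},
    {"day": date(2020, 4, 1), "amount": 80},
    {"day": date(2020, 4, 2), "amount": 50},
]

def daily_report(batch):
    """Process one accumulated batch: aggregate amounts per day."""
    totals = {}
    for t in batch:
        totals[t["day"]] = totals.get(t["day"], 0) + t["amount"]
    return totals

# A scheduler (e.g. cron) would invoke this once a day at a fixed time.
report = daily_report(transactions)
```

The key property of batch ingestion is visible here: nothing is processed until the whole batch has been collected.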
Real time or stream data: Real-time ingestion demands input from a continual source, which is processed to generate output within a small time window (near real time). It is a process of moving data into big data systems as it arrives. Real-time ingestion is beneficial when the information gleaned is highly time-sensitive, such as data from a power grid that is monitored moment to moment.

Fig 2: Stream data processing
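In contrast to the batch mode, a stream is handled one record at a time as it arrives. The sketch below uses the paper's power-grid example; the sensor readings and the `ingest_reading` function are hypothetical.

```python
def ingest_reading(reading, alerts, threshold=240.0):
    """Handle one record the moment it arrives (near real time)."""
    if reading["volts"] > threshold:
        alerts.append(reading["sensor"])

# Simulated continual source; a real system would read from a socket or queue.
stream = [
    {"sensor": "grid-7", "volts": 231.0},
    {"sensor": "grid-9", "volts": 251.5},
]
alerts = []
for reading in stream:                # each element is processed on arrival,
    ingest_reading(reading, alerts)   # not accumulated into a batch
```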


Lambda architecture: Data can also be ingested using a lambda (hybrid) architecture. This approach balances the advantages of batch and real-time modes: batch processing provides combined views over batches of data, while real-time processing provides views of the most recent data.

Fig 3: Lambda (hybrid) data processing
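The two layers of the lambda architecture can be sketched as two views that a serving layer merges at query time. This is a toy illustration with hypothetical page-view events, not an implementation from the paper.

```python
# Batch layer: precomputed view over all historical data (recomputed infrequently).
batch_events = [("page_a", 1)] * 100 + [("page_b", 1)] * 40

def batch_view(events):
    counts = {}
    for key, n in events:
        counts[key] = counts.get(key, 0) + n
    return counts

# Speed layer: events that arrived since the last batch recomputation.
recent_events = [("page_a", 1), ("page_a", 1), ("page_c", 1)]

def merged_view(batch, recent):
    """Serving layer: combine the batch view with the real-time view."""
    view = dict(batch_view(batch))
    for key, n in recent:
        view[key] = view.get(key, 0) + n
    return view

view = merged_view(batch_events, recent_events)
```

The merged result is always as complete as the batch view and as fresh as the speed layer, which is the balance the text describes.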
2.0 ARCHITECTURE OF DATA INGESTION

Data ingestion is the initial and toughest part of the entire data processing architecture. The key parameters considered when designing a data ingestion solution are data velocity, size, and format. Data streams into the system from several different sources at different speeds and sizes, including social networks, IoT devices, and machines, and every incoming stream has different semantics: a stream might be structured, unstructured, or semi-structured. Data is streamed either continually in real time or in regular batches; some use cases need data to stream in continually, while for others, such as studying trends, social media data can be streamed in at regular intervals.

Fig 4: Data ingestion architecture

3.0 SOME POPULAR DATA INGESTION TOOLS

3.1 Apache Kafka
Apache Kafka is a distributed system developed by LinkedIn that provides a partitioned, replicated, log-based service for data. Kafka maintains topics on a Kafka cluster: producers publish message feeds to topics, and consumers subscribe to topics to be fed messages. Kafka is also a stream processor, integrating applications and data streams via an API. It can run concurrent processes and transfer huge amounts of data quickly, and it has been used by organizations such as Uber and Netflix; Netflix's big data ingestion platform, for example, uses Kafka for big data streams.

Fig: Apache Kafka
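To make the topic/producer/consumer model concrete, here is a toy in-memory sketch of Kafka's core idea: an append-only log with per-consumer read offsets. This is not the real Kafka client (which talks to a broker over the network, e.g. via the kafka-python or confluent-kafka libraries), only an illustration of the log-based service described above.

```python
class TinyLog:
    """Toy model of one Kafka topic partition: an append-only log
    plus per-consumer offsets. Illustrative only."""

    def __init__(self):
        self.log = []      # messages in arrival order
        self.offsets = {}  # consumer name -> next position to read

    def produce(self, message):
        """A producer appends a message to the end of the log."""
        self.log.append(message)

    def consume(self, consumer):
        """A consumer reads every message it has not yet seen."""
        pos = self.offsets.get(consumer, 0)
        batch = self.log[pos:]
        self.offsets[consumer] = len(self.log)
        return batch

topic = TinyLog()
topic.produce("view:home")
topic.produce("view:cart")
first = topic.consume("analytics")
topic.produce("view:checkout")
second = topic.consume("analytics")
```

Because each consumer only tracks an offset into a shared log, many consumers can read the same topic independently, which is what makes the log-based design suit high-volume streams.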
3.2 Apache NiFi
Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems. It is derived from the "NiagaraFiles" software initially developed by the NSA and was open-sourced as part of the NSA's technology transfer program in 2014.

3.3 Apache Spark
Apache Spark is an open-source cluster computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
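The data-parallel model can be sketched without Spark itself: below, plain Python stands in for what a `map` over partitions followed by a `reduce` would do across a cluster. The partitioning and the squaring operation are illustrative assumptions, not Spark API calls.

```python
from functools import reduce

# Data split into partitions, as Spark would distribute it across a cluster.
partitions = [[1, 2, 3], [4, 5], [6]]

# "map" step: applied independently to each partition (parallel in Spark).
squared = [[x * x for x in part] for part in partitions]

# "reduce" step: combine the partial results from each partition.
partial_sums = [sum(part) for part in squared]
total = reduce(lambda a, b: a + b, partial_sums)
```

Because the map step never looks outside its own partition, a failed partition can simply be recomputed, which is the fault-tolerance property the text mentions.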

3.4 Apache Storm
Apache Storm is a distributed real-time streaming system for managing unbounded data streams, developed by Twitter. It differs from MapReduce mainly in that Storm topologies are long-running: a stream of data from Twitter, for instance, could be converted into a stream of the latest trending topics. The two main components of Storm are tuples and nodes.
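The trending-topics example above can be sketched as tuples flowing through processing nodes. This is a plain-Python analogy with made-up tweet data, not the Storm API (which would define these nodes as spouts and bolts in a topology).

```python
from collections import Counter

# Source node: an unbounded stream of tuples; here, a finite stand-in.
tweets = ["#kafka rocks", "#nifi flow", "#kafka again"]

def hashtag_node(tweet):
    """Emits one tuple per hashtag found in an incoming tweet."""
    return [word for word in tweet.split() if word.startswith("#")]

def trending_node(counts, tag):
    """Keeps a running count of hashtags and emits the current top topic."""
    counts[tag] += 1
    return counts.most_common(1)[0][0]

counts = Counter()
trending = None
for tweet in tweets:              # tuples flow through the topology
    for tag in hashtag_node(tweet):
        trending = trending_node(counts, tag)
```

Unlike a MapReduce job, this pipeline never terminates on its own: the counting node holds state and keeps emitting updated results for as long as tuples arrive, which is what "long-running topology" means.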

3.5 Apache Flink
Apache Flink is open-source processing software developed at the Apache Software Foundation. Flink is a distributed streaming engine that executes dataflow programs in a parallel and pipelined fashion.
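Pipelined execution, where each stage consumes records as soon as the previous stage produces them rather than waiting for a full intermediate result, can be mimicked with Python generators. This is an analogy for the execution style, not the Flink API; the stage functions are invented for illustration.

```python
def source():
    """Produces records one at a time, like a streaming source."""
    for n in range(1, 6):
        yield n

def double(records):
    """A transformation stage in the dataflow."""
    for r in records:   # consumes each record as soon as it is produced
        yield r * 2

def keep_large(records, limit=4):
    """A filtering stage further down the pipeline."""
    for r in records:
        if r >= limit:
            yield r

# The three stages run in a pipelined fashion: no stage materializes
# the full intermediate dataset before the next one starts.
result = list(keep_large(double(source())))
```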

3.6 Amazon Kinesis
Amazon Kinesis is used for real-time data processing over large numbers of data streams in a distributed environment. Kinesis is a cloud-based service provided by Amazon Web Services that can store and process terabytes of stream data per hour continuously; the Kinesis Client Library (KCL) is used to process and manage stream data from sources such as web clicks, financial transactions, social media feeds, and log-based tracking events.

3.7 Amazon Firehose
Amazon Firehose enables streaming data to be combined into business intelligence tools or data repositories through multiple interfaces. It helps to fetch data and integrate it into warehousing solutions available from Amazon, such as S3 and the Redshift cloud warehouse.

4.0 DATA INGESTION CHALLENGES

Gathering data from multiple sources and in different forms for business use cases presents a variety of challenges in data ingestion. These challenges can include:

1) Multiple source ingestion
Data sources are constantly evolving, so data ingestion should be robust and efficient enough to ingest the volume and diversity of the data. Decision-making is complicated when you have to manage and decide which data to include in your data repository. Organizations generate huge amounts of data during the product life cycle, including customer data, vendor and product data, and asset information.

2) Managing streaming/real-time data
This ingestion challenge occurs when managing data coming from scattered sources such as log files, e-commerce purchase information, or information from public networks. Data stream management systems (DSMS) were developed to help manage continual data streams. They resemble database management systems (DBMS), which are designed for static data, but a DSMS executes a continual query that is not performed just once; it is permanently installed. Most DSMS are data-driven, and a continual query produces new results as new data arrive at the system continuously.

3) Speed of ingestion
Data sources deliver data at varying frequencies. For example, comment forums amount to large data sets but arrive at a low frequency, whereas information such as tweets comes in small volumes at a high frequency and requires more rapid ingestion. A number of platforms now exist to process big data, including advanced SQL (sometimes called NewSQL) databases, which adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms, which range from file systems to document or columnar data stores and typically dispense with the need for modeling data.

4) Change detection (capturing low latency)
Because sources take in and deliver data at such different frequencies, capturing changes with low latency is itself a challenge: the same rapid-ingestion platforms described above, from NewSQL databases to NoSQL document and columnar stores, are used to detect and ingest changes as they occur.
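One common way to meet the change-detection challenge is incremental ingestion keyed on a last-modified timestamp, re-reading only rows changed since the previous run. This is a hypothetical sketch, not a technique taken from the paper; the row layout and `detect_changes` helper are invented for illustration.

```python
def detect_changes(rows, last_seen):
    """Return rows modified after the previous ingestion run,
    plus the new high-water mark to use next time."""
    changed = [r for r in rows if r["updated_at"] > last_seen]
    new_mark = max((r["updated_at"] for r in rows), default=last_seen)
    return changed, new_mark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 210},
]
# Only rows changed since timestamp 200 are re-ingested.
changed, mark = detect_changes(rows, last_seen=200)
```

Polling with a high-water mark keeps latency proportional to the polling interval; lower-latency designs read the source's change log instead, as log-based tools such as Kafka-backed pipelines do.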
data-ingestion-tools/
CONCLUSION AND FUTURE WORK

This paper discusses various data ingestion tools, only a few of which can handle low-latency data. The paper is intended to help researchers find which data ingestion tool is effective and efficient. The future scope of this work is to combine artificial intelligence with data ingestion to help ensure effortless data collection and connection to systems.

