Data Architecture Basics
1. Data Ingestion Tools:
1. Apache Kafka:
Apache Kafka is a distributed streaming platform that excels at handling large volumes of data in real time.
Key features:
- Scalable and fault-tolerant data pipelines
- High-throughput, low-latency message delivery
- Ability to handle both batch and real-time data
- Flexible data processing through Kafka Streams and KSQL
Use cases:
- Streaming data ingestion from various sources (e.g., IoT, logs, transactions)
- Building real-time data analytics and monitoring applications
- Enabling event-driven architectures and microservices
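As a concrete illustration, here is a minimal producer sketch using the kafka-python client; the broker address, topic name, and event payload are placeholders, not part of the original notes.

    from kafka import KafkaProducer
    import json

    # Connect to a local broker (placeholder address); serialize events as JSON.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish a sample event to a hypothetical "sensor-readings" topic.
    producer.send("sensor-readings", {"device_id": "dev-42", "temp_c": 21.7})
    producer.flush()  # block until buffered messages are delivered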
2. Amazon Kinesis:
Amazon Kinesis is a fully managed real-time data streaming service provided by AWS.
Key features:
- Scalable and highly available data ingestion
- Low-latency data processing and analysis
- Integrations with other AWS services:
  1. Real-time data processing (Lambda)
  2. Long-term data storage and data lake (S3)
  3. Automated data cataloging and ETL workflows (Glue)
- Ability to handle diverse data sources (e.g., logs, metrics, clickstreams)
Use cases:
- Ingesting and processing real-time data for application monitoring and analytics
- Powering real-time dashboards and event-driven applications
- Implementing serverless architectures with event-driven computing
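For example, a minimal ingestion sketch with boto3; the region, stream name, and record contents are hypothetical.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Put one record onto a hypothetical "clickstream" stream; the partition
    # key determines which shard receives the record.
    record = {"user_id": "u-123", "page": "/home", "ts": "2024-01-01T00:00:00Z"}
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["user_id"],
    )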
3. Apache Flume:
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Key features:
- Flexible and extensible architecture for data ingestion
- Reliable and fault-tolerant data delivery
- Support for various data sources and sinks
- Ability to handle high-volume, low-latency data streams
Use cases:
- Aggregating and ingesting log data from multiple sources
- Feeding real-time data pipelines for analytical processing
- Integrating with big data ecosystems like Hadoop and Spark
4. Apache NiFi:
Apache NiFi is a powerful and scalable data flow management platform.
Key features:
- Drag-and-drop UI for building data processing flows
- Support for diverse data sources and sinks
- Automated data routing, transformation, and actions
- Monitoring, provenance, and data lineage capabilities
Use cases:
- Ingesting and processing data from various sources (e.g., databases, files, IoT devices)
- Enabling data movement, transformation, and enrichment
- Implementing data processing workflows and ETL pipelines
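Although flows are usually built in NiFi's drag-and-drop UI, data can also be pushed into a flow over HTTP. This sketch assumes a ListenHTTP processor configured on port 8081 with its default contentListener base path; the host, port, and payload are assumptions for illustration.

    import json
    import requests

    # Hypothetical endpoint exposed by a NiFi ListenHTTP processor
    # (port and base path are configured on the processor itself).
    NIFI_ENDPOINT = "http://localhost:8081/contentListener"

    event = {"source": "demo-app", "level": "INFO", "message": "user signed in"}
    resp = requests.post(
        NIFI_ENDPOINT,
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()  # ListenHTTP responds 200 when the FlowFile is accepted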
5. Google Cloud Dataflow:
Google Cloud Dataflow is a fully managed batch and streaming data processing service.
Key features:
- Unified programming model for batch and streaming data processing
- Automatic scaling and resource management
- Integrations with other Google Cloud services (e.g., Pub/Sub, BigQuery):
  1. Pub/Sub: provides a way to ingest real-time data streams and trigger data processing pipelines
  2. BigQuery: lets you store the processed data in a scalable and performant data warehouse for further analysis
Use cases:
- Ingesting and processing real-time data streams
- Performing batch data processing and ETL tasks
- Building data pipelines for analytics and machine learning
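Dataflow pipelines are written with the Apache Beam SDK. A minimal streaming sketch is shown below; the topic and table names are placeholders, and the target BigQuery table is assumed to already exist.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder resource names for this sketch.
    TOPIC = "projects/my-project/topics/events"
    TABLE = "my-project:analytics.events"

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "ParseJson" >> beam.Map(json.loads)  # Pub/Sub delivers raw bytes
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(TABLE)
        )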
6. Azure Data Factory:
Azure Data Factory is a cloud-based data integration service provided by Microsoft.
Key features:
- Drag-and-drop pipeline authoring
- Support for diverse data sources and sinks
- Scheduling and orchestration of data movement and transformation
- Monitoring and alerting capabilities
Use cases:
- Ingesting and processing data from on-premises and cloud data sources
- Implementing ETL and ELT workflows
- Enabling data-driven decision-making and business intelligence
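Pipelines are usually authored in the visual editor, but they can also be managed programmatically. Here is a minimal sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names are hypothetical.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Hypothetical identifiers for this sketch.
    SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
    RESOURCE_GROUP = "my-rg"
    FACTORY_NAME = "my-factory"
    PIPELINE_NAME = "copy-sales-data"

    credential = DefaultAzureCredential()
    adf = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

    # Kick off an on-demand run of an existing pipeline.
    run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)
    print("Started pipeline run:", run.run_id)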
7. Talend Data Fabric:
Talend Data Fabric is a unified platform for data integration, data quality, and master data management.
Key features:
- Graphical design tools for building data pipelines
- Support for batch and real-time data ingestion
- Data quality and governance capabilities
- Connectivity to a wide range of data sources and targets
Use cases:
- Ingesting and integrating data from heterogeneous sources
- Implementing data quality and master data management strategies
- Building end-to-end data pipelines for business intelligence and analytics
2. Data Ingestion Mechanisms:
-> Batch processing: Scheduled or event-driven processes that extract data in bulk from source systems, often using tools like Apache Sqoop, AWS Glue, or Azure Data Factory.
-> Real-time streaming: Leveraging stream processing frameworks like Apache Kafka, Amazon Kinesis, or Google Pub/Sub to ingest and process data in near real time.
-> API-based ingestion: Using RESTful or GraphQL APIs to retrieve data from various sources, often integrated through an API management platform (see the sketch after this list).
-> Web scraping: Deploying web scraping tools and libraries (e.g., Python's BeautifulSoup, Scrapy, or Selenium) to extract data from websites.
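A minimal sketch of API-based ingestion using Python's requests library; the endpoint URL and the response shape (a "results" list plus a "next" link) are hypothetical assumptions.

    import requests

    # Hypothetical paginated REST endpoint for this sketch.
    BASE_URL = "https://api.example.com/v1/orders"

    def fetch_all(url):
        """Pull every page of results, following the 'next' link if present."""
        records = []
        while url:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            payload = resp.json()
            records.extend(payload["results"])
            url = payload.get("next")  # None on the last page
        return records

    orders = fetch_all(BASE_URL)
    print(f"Ingested {len(orders)} records")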
3. Data Ingestion Tools and Frameworks:
-> Apache Kafka (streaming): A popular open-source distributed streaming platform for building real-time data pipelines and applications.
-> Amazon Kinesis (streaming): A fully managed AWS service for collecting, processing, and analyzing real-time streaming data.
-> Apache Flume (batch): A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
-> Apache Sqoop (batch): A tool designed for efficiently transferring bulk data between Hadoop and structured datastores like relational databases.
-> AWS Glue (batch): A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.
-> Azure Data Factory (batch and streaming): A cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and transformation.
4. Data Ingestion Strategies:
-> Incremental data loading: Ingesting only the new or updated data since the last ingestion, to minimize processing overhead (see the watermark sketch after this list).
-> Change data capture (CDC): Identifying and ingesting only the changes made to source data, often using database transaction logs or event-based triggers.
-> Data lake ingestion: Consolidating diverse data sources into a centralized data lake, using technologies like Amazon S3, Azure Data Lake Storage, or Hadoop-based solutions.
-> Hybrid ingestion: Combining batch and real-time ingestion approaches to handle both historical and newly generated data.
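To make incremental loading concrete, here is a minimal watermark sketch using sqlite3; the database file, table, and column names are hypothetical, and a real pipeline would persist the watermark between runs.

    import sqlite3

    # Hypothetical source table "events" with an updated_at timestamp column.
    conn = sqlite3.connect("source.db")

    def load_incremental(last_watermark):
        """Fetch only rows changed since the previous run's watermark."""
        rows = conn.execute(
            "SELECT id, payload, updated_at FROM events "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_watermark,),
        ).fetchall()
        # The new watermark is the latest timestamp seen; store it for the next run.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark

    rows, watermark = load_incremental("2024-01-01T00:00:00Z")
    print(f"Ingested {len(rows)} changed rows; next watermark: {watermark}")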