
Name: Group 3&4

Course: Big Data


Hadoop is an open-source framework designed for processing and storing large datasets across clusters of
computers using simple programming models. It is a powerful tool for big data analytics. A robust Hadoop
architecture for processing and analysing large datasets involves leveraging various components of the
Hadoop ecosystem. Below are the phases of the data pipeline and the justification for the choice of specific
Hadoop components for each phase.

1. Data Ingestion

Component: Apache Flume or Apache Kafka.

For ingesting large volumes of data from various sources, Apache Flume or Apache Kafka can be used. Flume
is designed for collecting and aggregating large amounts of log data, while Kafka is a distributed streaming
platform that can handle real-time data feeds. Both tools ensure that data is ingested efficiently and can be
processed in real-time or batch modes.
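To make the ingestion step concrete, below is a minimal sketch using the kafka-python client; the broker address, topic name and event fields are illustrative assumptions rather than part of the original design.

# Minimal ingestion sketch with kafka-python. The broker address,
# topic name and event fields below are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish one event; downstream consumers (Flume sinks, Spark jobs, etc.)
# can read the topic in real time or in periodic batches.
producer.send("sensor-events", {"device_id": 42, "temperature": 21.5})
producer.flush()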

2. Storage

Component: HDFS (Hadoop Distributed File System).

Once the data is ingested, it needs to be stored in a distributed manner. HDFS is the backbone of the Hadoop
ecosystem, providing a reliable and scalable storage solution. It allows for the storage of large files across
multiple machines, ensuring fault tolerance and high availability. HDFS is optimized for high-throughput
access to application data, making it suitable for big data applications. While HDFS is optimized for batch
processing, it is less efficient when dealing with real-time data due to the time required to write and read large
files, hence it is an ideal choice when the dataset is not real-time data.
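As a sketch of the storage step, the snippet below copies an ingested file into HDFS using the standard hdfs dfs command-line tool from Python; the local and HDFS paths are assumed for illustration.

# Minimal storage sketch: copy an ingested file into HDFS with the
# standard "hdfs dfs" command-line tool. Paths are illustrative.
import subprocess

local_file = "/tmp/events-2024-01-01.log"
hdfs_dir = "/data/raw/events"

# Create the target directory (if needed) and upload the file. HDFS then
# splits it into blocks and replicates them across DataNodes, which is
# what provides the fault tolerance described above.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)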

3. Data Processing

Component: YARN (Yet Another Resource Negotiator) and MapReduce/Spark.

For data processing, YARN acts as the resource management layer of Hadoop – scheduling tasks efficiently
and allocating resources based on the needs of each component, allowing multiple data processing engines to
run side by side while sharing resources efficiently. YARN is highly effective for large batch jobs, but real-time
data requires tighter integration with faster streaming tools such as Apache Kafka.

MapReduce is a programming model that enables the processing of large datasets in parallel across a
distributed cluster. It is particularly effective for batch processing tasks where data is processed in large
chunks, making it suitable for our use case, where real-time processing is not a requirement. It is, however,
slower because of its disk-based processing.
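A minimal word-count example illustrates the MapReduce model. It is written for Hadoop Streaming, which allows the map and reduce functions to be expressed in Python; the script name and HDFS paths are assumptions for illustration.

# Minimal word-count sketch for Hadoop Streaming (wordcount.py).
# Submit with (paths illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw/events -output /data/wordcount \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Map phase: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts for a word
    # are contiguous and can be summed in a single pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()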

Apache Spark can also be used for data processing, especially when low-latency processing is required. Spark
provides in-memory processing capabilities, which can significantly speed up data processing tasks compared
to traditional MapReduce. Spark’s in-memory processing offers speed but may require more memory
resources, making it costlier in large-scale scenarios.
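The sketch below shows the same kind of batch aggregation in PySpark, caching the dataset in memory so repeated queries avoid re-reading from disk; the HDFS path and column names are assumed for illustration.

# Minimal PySpark sketch: read raw events from HDFS, cache them in
# memory, and run two aggregations over the same cached data. The
# HDFS path and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-processing").getOrCreate()

events = spark.read.json("hdfs:///data/raw/events")
events.cache()  # keep the dataset in memory for repeated queries

events.groupBy("device_id").count().show()
events.agg(F.avg("temperature").alias("avg_temperature")).show()

spark.stop()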

4. Data Analysis

Component: Hive or Pig.


For analysing the processed data, Apache Hive or Apache Pig can be employed:

• Hive provides an SQL-like interface for querying and managing large datasets stored in HDFS. It is
suitable for users who are familiar with SQL and want to perform data analysis without writing
complex MapReduce code. It is ideal for batch processing and well-suited to data analysis tasks,
making it a great choice when complex queries must be run on large datasets (a minimal query sketch
follows this list). Hive’s batch-oriented nature makes it slower for real-time analytics.
• Pig is a high-level platform for creating programs that run on Hadoop. It uses a language called Pig
Latin, which is designed to handle data transformations and analysis in a more procedural way than
Hive. This makes it an excellent choice for ETL (Extract, Transform, Load) processes within the data
pipeline. Pig is optimal for batch data.
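As referenced above, the following is a minimal sketch of a Hive query issued from Python through the PyHive client; the server host, database, table and column names are assumptions for illustration.

# Minimal Hive sketch using the PyHive client: an SQL-like aggregation
# over a table stored in HDFS. The host, database, table and column
# names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="analytics")
cursor = conn.cursor()
cursor.execute(
    "SELECT device_id, AVG(temperature) AS avg_temperature "
    "FROM raw_events GROUP BY device_id"
)
for device_id, avg_temperature in cursor.fetchall():
    print(device_id, avg_temperature)
conn.close()

The equivalent transformation in Pig Latin would typically be written procedurally as a GROUP of the records followed by a FOREACH ... GENERATE over each group.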
5. Data Visualization and Reporting

Component: Apache Superset or Tableau.

For visualizing the results of the data analysis, Apache Superset or Tableau can be integrated. These tools
allow users to create interactive dashboards and reports, making it easier to derive insights from the data.

6. Workflow Management

Component: Apache Oozie

To manage the workflow of the entire data pipeline, Apache Oozie can be used. Oozie is a workflow
scheduler system that allows users to define complex data processing workflows, ensuring that tasks are
executed in the correct order and managing dependencies between different components.

In conclusion, a robust Hadoop architecture for processing and analysing large datasets can be built using
HDFS for storage, YARN for resource management, MapReduce for batch processing, Spark for advanced
processing, Hive for querying, and Pig for data transformations. Each component plays a crucial role in
ensuring that data is ingested, stored, processed, analysed, and visualized effectively, catering to the needs of
big data applications.
When designing data processing systems using frameworks like Hadoop, it is essential to consider the trade-offs
between batch processing and streaming, as well as the importance of data quality and consistency:

Batch processing advantages:

Efficiency with large datasets: Batch processing is optimized for handling large volumes of data at once, making
it suitable for complex analytics and transformations.

Simplicity: Batch jobs handle well-defined, repetitive tasks such as ETL (Extract, Transform, Load) processes and
are easier to implement.

Cost-Effective: It requires fewer resources as jobs can be scheduled during off-peak hours, utilizing the cluster’s
capacity more efficiently.

Batch processing disadvantages:

Complexity in workflows: Batch processing can complicate workflows, necessitating additional mechanisms to
bridge the gap between batch outputs and applications requiring near real-time insights.

Latency: It can lead to delays in decision-making since results are not available until the batch job completes.

Streaming processing advantages:

Event-Driven architecture: It supports use cases where actions must be taken in response to specific events or
triggers, enhancing responsiveness.

Real-time insights: It is ideal for applications like fraud detection and monitoring systems by enabling immediate
processing of data as it’s ingested.

Streaming processing disadvantages:

Resource Intensive: It can lead to higher operational costs because continuous processing may require more
computational resources compared to batch jobs.

Complexity: Due to the need for continuous data handling and state management, implementing a streaming
architecture can be more complex than batch processing.
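To make the trade-off concrete, the sketch below contrasts a one-off batch read with a continuous Structured Streaming read in PySpark; the Kafka broker, topic, paths and the use of the spark-sql-kafka connector are assumptions for illustration.

# Minimal sketch contrasting batch and streaming in PySpark. The Kafka
# broker, topic, paths and the spark-sql-kafka connector (which must be
# on the classpath) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: process everything currently stored in HDFS, then finish.
spark.read.json("hdfs:///data/raw/events").groupBy("device_id").count().show()

# Streaming: continuously consume new records from a Kafka topic and
# keep running until stopped; results are available with low latency.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-events")
    .load()
)
query = (
    stream_df.writeStream.format("console")
    .option("checkpointLocation", "hdfs:///checkpoints/sensor-events")
    .start()
)
query.awaitTermination()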

Importance of Data Quality and Consistency:

Maintaining data quality and consistency is crucial regardless of the processing method chosen.

Data Quality:

Accuracy: It ensures that the data reflects the real-world scenario correctly.

Timeliness: Data should be processed and available when needed, which is especially important for streaming applications.

Completeness: Data has to be complete, with no missing values that could impact analysis.

Data Consistency:

Schema Evolution: Because data structures change over time, maintaining a consistent schema is essential to
prevent errors and to ensure that both batch and streaming processes can interact seamlessly.

Consistency Across Systems: Ensuring that data remains consistent across different systems and versions is
crucial, especially in environments where both batch and streaming processes are used.

Handling late data: Techniques like watermarking can help address streaming scenarios where late-arriving
data can lead to inconsistencies if not managed properly.
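As a sketch of the watermarking technique mentioned above, the example below uses Spark Structured Streaming's withWatermark to tolerate events arriving up to ten minutes late before each five-minute window is finalized; the built-in rate source stands in for a real event stream, and the thresholds are assumed for illustration.

# Minimal watermarking sketch in Spark Structured Streaming: events up
# to 10 minutes late are still folded into their 5-minute window; later
# ones are dropped. The "rate" source stands in for a real event stream;
# the threshold and window size are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

windowed_counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window(F.col("timestamp"), "5 minutes"))
    .count()
)

query = (
    windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()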

Best practices for ensuring data consistency include data standardization, data validation, data normalization,
data governance and data quality monitoring. Tools for ensuring data consistency include big data analytics
platforms (Hadoop, Spark), data governance platforms (Data360, Collibra), data integration platforms
(Informatica, Talend), and data quality tools (DataCleaner, Trillium). Real-world applications include Netflix,
Amazon and healthcare organizations.

In conclusion, choosing between batch processing and streaming requires a careful assessment of the specific
needs of the application, including the required speed of insights, the volume of data and the complexity of
workflows. Additionally, prioritizing data quality and consistency is fundamental to ensuring that the insights
derived from both methods are actionable and reliable. It is sometimes wise to combine the two in a hybrid
approach that leverages both batch and streaming processing to meet varied analytical needs while
maintaining high standards of data quality.
