
Name: Group 3&4

Course: Big Data


Hadoop is an open-source framework designed for processing and storing large datasets across clusters of
computers using simple programming models. It is a powerful tool for big data analytics. A robust Hadoop
architecture for processing and analysing large datasets involves leveraging various components of the
Hadoop ecosystem. Below are the phases of the data pipeline and the justification for the choice of specific
Hadoop components for each phase.

1. Data Ingestion

Component: Apache Flume or Apache Kafka.

For ingesting large volumes of data from various sources, Apache Flume or Apache Kafka can be used. Flume
is designed for collecting and aggregating large amounts of log data, while Kafka is a distributed streaming
platform that can handle real-time data feeds. Both tools ensure that data is ingested efficiently and can be
processed in real-time or batch modes.
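To make the ingestion step concrete, below is a minimal sketch using the kafka-python client; the broker address, topic name and event fields are illustrative assumptions rather than part of the original design.

# Minimal ingestion sketch with kafka-python. The broker address,
# topic name and event fields below are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish one event; downstream consumers (Flume sinks, Spark jobs, etc.)
# can read the topic in real time or in periodic batches.
producer.send("sensor-events", {"device_id": 42, "temperature": 21.5})
producer.flush()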

2. Storage

Component: HDFS (Hadoop Distributed File System).

Once the data is ingested, it needs to be stored in a distributed manner. HDFS is the backbone of the Hadoop
ecosystem, providing a reliable and scalable storage solution. It allows for the storage of large files across
multiple machines, ensuring fault tolerance and high availability. HDFS is optimized for high-throughput
access to application data, making it suitable for big data applications. While HDFS is optimized for batch
processing, it is less efficient when dealing with real-time data due to the time required to write and read large
files, hence it is an ideal choice when the dataset is not real-time data.
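As a sketch of the storage step, the snippet below copies an ingested file into HDFS using the standard hdfs dfs command-line tool from Python; the local and HDFS paths are assumed for illustration.

# Minimal storage sketch: copy an ingested file into HDFS with the
# standard "hdfs dfs" command-line tool. Paths are illustrative.
import subprocess

local_file = "/tmp/events-2024-01-01.log"
hdfs_dir = "/data/raw/events"

# Create the target directory (if needed) and upload the file. HDFS then
# splits it into blocks and replicates them across DataNodes, which is
# what provides the fault tolerance described above.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)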

3. Data Processing

Component: YARN (Yet Another Resource Negotiator) and MapReduce/Spark.

For data processing, YARN acts as the resource management layer of Hadoop – scheduling tasks efficiently
and allocating resources based on the needs of each component, allowing multiple data processing engines to
run side by side while sharing resources efficiently. YARN is highly effective for large batch jobs, but real-time
data requires tighter integration with faster streaming tools such as Apache Kafka.

MapReduce is a programming model that enables the processing of large datasets in parallel across a
distributed cluster. It is particularly effective for batch processing tasks where data is processed in large
chunks, making it suitable for our use case, where real-time processing is not a requirement. It is, however,
slower because of its disk-based processing.
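A minimal word-count example illustrates the MapReduce model. It is written for Hadoop Streaming, which allows the map and reduce functions to be expressed in Python; the script name and HDFS paths are assumptions for illustration.

# Minimal word-count sketch for Hadoop Streaming (wordcount.py).
# Submit with (paths illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw/events -output /data/wordcount \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Map phase: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts for a word
    # are contiguous and can be summed in a single pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()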

Apache Spark can also be used for data processing, especially when low-latency processing is required. Spark
provides in-memory processing capabilities, which can significantly speed up data processing tasks compared
to traditional MapReduce. Spark’s in-memory processing offers speed but may require more memory
resources, making it costlier in large-scale scenarios.
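The sketch below shows the same kind of batch aggregation in PySpark, caching the dataset in memory so repeated queries avoid re-reading from disk; the HDFS path and column names are assumed for illustration.

# Minimal PySpark sketch: read raw events from HDFS, cache them in
# memory, and run two aggregations over the same cached data. The
# HDFS path and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-processing").getOrCreate()

events = spark.read.json("hdfs:///data/raw/events")
events.cache()  # keep the dataset in memory for repeated queries

events.groupBy("device_id").count().show()
events.agg(F.avg("temperature").alias("avg_temperature")).show()

spark.stop()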

4. Data Analysis

Component: Hive or Pig.


For analysing the processed data, Apache Hive or Apache Pig can be employed:

• Hive provides an SQL-like interface for querying and managing large datasets stored in HDFS. It is
suitable for users who are familiar with SQL and want to perform data analysis without writing
complex MapReduce code. It is ideal for batch processing and well-suited to data analysis tasks,
making it a great choice when complex queries must be run on large datasets (a minimal query sketch
follows this list). Hive’s batch-oriented nature makes it slower for real-time analytics.
• Pig is a high-level platform for creating programs that run on Hadoop. It uses a language called Pig
Latin, which is designed to handle data transformations and analysis in a more procedural way than
Hive. This makes it an excellent choice for ETL (Extract, Transform, Load) processes within the data
pipeline. Pig is optimal for batch data.
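As referenced above, the following is a minimal sketch of a Hive query issued from Python through the PyHive client; the server host, database, table and column names are assumptions for illustration.

# Minimal Hive sketch using the PyHive client: an SQL-like aggregation
# over a table stored in HDFS. The host, database, table and column
# names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="analytics")
cursor = conn.cursor()
cursor.execute(
    "SELECT device_id, AVG(temperature) AS avg_temperature "
    "FROM raw_events GROUP BY device_id"
)
for device_id, avg_temperature in cursor.fetchall():
    print(device_id, avg_temperature)
conn.close()

The equivalent transformation in Pig Latin would typically be written procedurally as a GROUP of the records followed by a FOREACH ... GENERATE over each group.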
5. Data Visualization and Reporting

Component: Apache Superset or Tableau.

For visualizing the results of the data analysis, Apache Superset or Tableau can be integrated. These tools
allow users to create interactive dashboards and reports, making it easier to derive insights from the data.

6. Workflow Management

Component: Apache Oozie

To manage the workflow of the entire data pipeline, Apache Oozie can be used. Oozie is a workflow
scheduler system that allows users to define complex data processing workflows, ensuring that tasks are
executed in the correct order and managing dependencies between different components.

In conclusion, a robust Hadoop architecture for processing and analysing large datasets can be built using
HDFS for storage, YARN for resource management, MapReduce for batch processing, Spark for advanced
processing, Hive for querying, and Pig for data transformations. Each component plays a crucial role in
ensuring that data is ingested, stored, processed, analysed, and visualized effectively, catering to the needs of
big data applications.
When designing data processing systems using frameworks like Hadoop, it is essential to consider the trade-offs
between batch processing and streaming, as well as the importance of data quality and consistency:

Batch processing advantages:

Efficiency with large datasets: Batch processing is optimized for handling large volumes of data at once, making
it suitable for complex analytics and transformations.

Simplicity: Batch jobs handle well-defined, repetitive tasks such as ETL (Extract, Transform, Load) processes and
are easier to implement.

Cost-Effective: It requires fewer resources as jobs can be scheduled during off-peak hours, utilizing the cluster’s
capacity more efficiently.

Batch processing disadvantages:

Complexity in workflows: Batch processing can complicate workflows, necessitating additional mechanisms to
bridge the gap between batch outputs and applications requiring near real-time insights.

Latency: It can lead to delays in decision-making since results are not available until the batch job completes.

Streaming processing advantages:

Event-Driven architecture: It supports use cases where actions must be taken in response to specific events or
triggers, enhancing responsiveness.

Real-time insights: It is ideal for applications like fraud detection and monitoring systems by enabling immediate
processing of data as it’s ingested.

Streaming processing disadvantages:

Resource Intensive: It can lead to higher operational costs because continuous processing may require more
computational resources compared to batch jobs.

Complexity: Due to the need for continuous data handling and state management, implementing a streaming
architecture can be more complex than batch processing.
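To make the trade-off concrete, the sketch below contrasts a one-off batch read with a continuous Structured Streaming read in PySpark; the Kafka broker, topic, paths and the use of the spark-sql-kafka connector are assumptions for illustration.

# Minimal sketch contrasting batch and streaming in PySpark. The Kafka
# broker, topic, paths and the spark-sql-kafka connector (which must be
# on the classpath) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: process everything currently stored in HDFS, then finish.
spark.read.json("hdfs:///data/raw/events").groupBy("device_id").count().show()

# Streaming: continuously consume new records from a Kafka topic and
# keep running until stopped; results are available with low latency.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-events")
    .load()
)
query = (
    stream_df.writeStream.format("console")
    .option("checkpointLocation", "hdfs:///checkpoints/sensor-events")
    .start()
)
query.awaitTermination()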

Importance of Data Quality and Consistency:

Maintaining data quality and consistency is crucial regardless of the processing method chosen.

Data Quality:

Accuracy: It ensures that the data reflects the real-world scenario correctly.

Timeliness: Data should be processed and available when needed, which is especially important for streaming applications.

Completeness: Data has to be complete, with no missing values that could impact analysis.

Data Consistency:

Schema Evolution: Because data structures change over time, maintaining a consistent schema is essential to
prevent errors and to ensure that both batch and streaming processes can interact seamlessly.

Consistency Across Systems: Ensuring that data remains consistent across different systems and versions is
crucial, especially in environments where both batch and streaming processes are used.

Handling late data: Techniques like watermarking can help address streaming scenarios where late-arriving
data can lead to inconsistencies if not managed properly.
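As a sketch of the watermarking technique mentioned above, the example below uses Spark Structured Streaming's withWatermark to tolerate events arriving up to ten minutes late before each five-minute window is finalized; the built-in rate source stands in for a real event stream, and the thresholds are assumed for illustration.

# Minimal watermarking sketch in Spark Structured Streaming: events up
# to 10 minutes late are still folded into their 5-minute window; later
# ones are dropped. The "rate" source stands in for a real event stream;
# the threshold and window size are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

windowed_counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window(F.col("timestamp"), "5 minutes"))
    .count()
)

query = (
    windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()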

Best practices for ensuring data consistency include data standardization, data validation, data normalization,
data governance and data quality monitoring. Tools for ensuring data consistency include big data analytics
platforms (Hadoop, Spark), data governance platforms (Data360, Collibra), data integration platforms
(Informatica, Talend), and data quality tools (DataCleaner, Trillium). Real-world applications include Netflix,
Amazon and healthcare organizations.

In conclusion, choosing between batch processing and streaming requires a careful assessment of the specific
needs of the application, including the required speed of insights, the volume of data and the complexity of
workflows. Additionally, prioritizing data quality and consistency is fundamental to ensuring that the insights
derived from both methods are actionable and reliable. It is sometimes wise to combine the two in a hybrid
approach that leverages both batch and streaming processing to meet varied analytical needs while
maintaining high standards of data quality.
