
NAME: Syed Muhammad Hassan

Reg no : B23F1000DS056

Hadoop Ecosystem
Hadoop is used to manage, process, and store big data, and it has its own
ecosystem. Just as a PC uses a file system, Hadoop has HDFS (Hadoop
Distributed File System), which breaks a large dataset into small blocks.
Rather than keeping data on a single device, it stores the blocks across
several machines and replicates them on other nodes in case data is lost.
Just as a PC relies on its CPU, big data in Hadoop is processed with
MapReduce.
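To make the map and reduce idea concrete, here is a minimal word-count sketch in Python, written in the style of Hadoop Streaming, where the mapper and reducer would normally be separate scripts reading stdin and writing stdout. The file name, sample input, and combined layout are illustrative assumptions so the sketch can run locally without a cluster.

# word_count.py - a minimal MapReduce-style sketch (illustrative only).
# In a real Hadoop Streaming job, mapper() and reducer() would live in
# separate scripts; they are combined here so the example is self-contained.
from itertools import groupby

def mapper(lines):
    # Map phase: split each line into words and emit (word, 1) pairs.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: group the pairs by word and sum the counts.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data big hadoop", "hadoop stores big data"]
    for word, count in reducer(mapper(sample)):
        print(f"{word}\t{count}")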

On top of these core components, the Hadoop ecosystem integrates a wide range of
tools that cater to specific use cases:
1. YARN (Yet Another Resource Negotiator): Just as a PC has an operating
system, big data processing in Hadoop relies on YARN. It provides the
resources needed to process data and efficiently allocates running resources
across the cluster.
2. MapReduce: It has two major phases: a map phase, which splits the input
data and processes it in parallel across nodes, and a reduce phase, which
combines the intermediate results into the final output (as sketched above).
3. Hive: Hive is used for structured data. If I want to run SQL-style
commands, I use Hive, which enables users familiar with relational database
management systems (RDBMS) to query and manage data stored in HDFS. Hive
translates SQL queries into MapReduce jobs, making analysis easier because
users do not need deep programming knowledge (a query sketch follows this
list).
4. Pig: Pig makes complex tasks easier. Long programs are otherwise written
directly against MapReduce, YARN, and HDFS; Pig reduces the lines of code
required, which developers appreciate. It also handles the flow of data
across Hadoop.
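As mentioned in the Hive point above, here is a minimal sketch of running a HiveQL query from Python. It assumes the PyHive client library and a HiveServer2 endpoint on localhost; the host, username, table name, and columns are hypothetical, not from this document.

# hive_query.py - illustrative sketch; assumes the PyHive package and a
# HiveServer2 instance on localhost:10000 (both assumptions).
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# A familiar SQL-style query; Hive compiles it into MapReduce jobs.
cursor.execute("SELECT dept, COUNT(*) FROM students GROUP BY dept")
for dept, cnt in cursor.fetchall():
    print(dept, cnt)

cursor.close()
conn.close()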
Organizations can choose components based on their needs, whether that is
batch processing, real-time analytics, machine learning, or data streaming,
while the platform provides the scalability, reliability, and fault tolerance
essential for handling modern big data challenges.
HBase is a column-oriented NoSQL database that works on top of HDFS.
Unlike traditional databases, it allows real-time read/write access to large
datasets. It's very scalable and good for handling sparse datasets, similar to
Google’s Bigtable model. This makes it ideal for real-time analytics or search
engines.
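The sketch below shows what real-time reads and writes against HBase can look like from Python, assuming the happybase client library and an HBase Thrift server on localhost; the table name, row key, and column family are placeholders.

# hbase_demo.py - illustrative sketch; assumes the happybase package and an
# HBase Thrift server on localhost (assumptions, not stated in the document).
import happybase

connection = happybase.Connection("localhost")
table = connection.table("web_events")   # hypothetical table name

# Real-time write: column-oriented cells under the 'metrics' column family.
table.put(b"user123", {b"metrics:clicks": b"42", b"metrics:country": b"PK"})

# Real-time read of a single row key.
row = table.row(b"user123")
print(row.get(b"metrics:clicks"))

connection.close()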
Zookeeper is a coordination service used within the Hadoop ecosystem. It
manages configuration info, synchronization, and group services in
distributed environments. Zookeeper makes sure all the components of
Hadoop can work together smoothly without any conflicts.
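A minimal sketch of the configuration-management role Zookeeper plays is shown below, assuming the kazoo client library and a ZooKeeper ensemble on 127.0.0.1:2181; the znode paths and value are hypothetical.

# zk_demo.py - illustrative sketch; assumes the kazoo client library and a
# ZooKeeper server on 127.0.0.1:2181 (assumptions).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Shared configuration that distributed components can read consistently.
zk.ensure_path("/app/config")
if not zk.exists("/app/config/batch_size"):
    zk.create("/app/config/batch_size", b"128")

value, stat = zk.get("/app/config/batch_size")
print(value.decode(), "version:", stat.version)

zk.stop()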
Oozie is a workflow scheduler that lets users define a sequence of jobs to be
run. It helps in organizing and managing the execution of Hadoop jobs, like
MapReduce, Pig, Hive, etc. It can trigger jobs based on rules like time or data
availability.
Sqoop is used for transferring large amounts of data between Hadoop and
structured data stores like relational databases. It automates importing and
exporting data between HDFS and external databases, making it easier to
work with enterprise data in big data analysis.
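Sqoop is driven from the command line rather than a Python API, so the sketch below simply assembles and runs a typical import command from Python; the JDBC URL, credentials path, table name, and target directory are placeholders, not from this document.

# sqoop_import.py - illustrative sketch; Sqoop is a command-line tool, so this
# just builds and runs a typical import command. Connection details are placeholders.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # hypothetical database
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",     # hypothetical HDFS path
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
]
subprocess.run(cmd, check=True)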
Flume is designed to bring in large amounts of streaming data into HDFS. It’s
often used to gather data from logs or other streaming sources like web
servers. Flume is good for real-time data collection from various sources.
Mahout is a library that has machine learning algorithms built on Hadoop. It
supports clustering, classification, and collaborative filtering, which helps
businesses apply machine learning to large datasets.
Apache Spark is now a key part of the Hadoop ecosystem. It offers in-memory
processing, making data analysis much faster compared to MapReduce.
Spark supports batch processing, interactive queries, and real-time streaming
workloads.
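To show the in-memory, DataFrame-style processing Spark offers, here is a minimal PySpark word-count sketch; the HDFS input path is a placeholder and the snippet assumes PySpark is installed.

# spark_wordcount.py - illustrative sketch; assumes PySpark is installed and
# that the input path (placeholder) is readable by Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# In-memory DataFrame pipeline: read lines, split into words, count per word.
lines = spark.read.text("hdfs:///data/raw/logs.txt")   # placeholder path
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count())
counts.show()

spark.stop()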
Apache Flink, like Spark, is another distributed computing engine, but it's
more focused on real-time processing and streaming analytics. It’s used when
low-latency, high-throughput stream processing is needed, making it useful
for real-time analytics.
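As a rough illustration of Flink's stream-oriented model, here is a minimal PyFlink DataStream sketch; a real low-latency job would consume a live source such as Kafka, but an in-memory collection keeps the example self-contained. The event names are hypothetical.

# flink_stream.py - illustrative sketch using the PyFlink DataStream API; a real
# streaming job would read a live source instead of a small collection.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

events = env.from_collection(["click", "view", "click", "purchase"])
# Map each event to an (event, 1) pair as a stream transformation.
events.map(lambda e: (e, 1)).print()

env.execute("flink-sketch")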
Cassandra and Kafka are not directly part of Hadoop but are often used with
it. Cassandra is a NoSQL database, and Kafka is a platform for distributed
event streaming. They are used to improve Hadoop’s data handling
capabilities, especially in real-time scenarios.
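The sketch below shows how events might be published to and read back from Kafka before a Hadoop or Spark job picks them up, assuming the kafka-python package and a broker on localhost:9092; the topic name and message are placeholders.

# kafka_demo.py - illustrative sketch; assumes the kafka-python package and a
# Kafka broker on localhost:9092 (assumptions). Topic name is a placeholder.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Publish a small event that a downstream Hadoop/Spark job could consume.
producer.send("page-views", b'{"user": "u1", "page": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)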
HDFS Federation helps to scale the HDFS namespace by adding multiple
namespaces (or NameNodes) to the same cluster. This lets big companies store
and manage huge amounts of data while avoiding performance bottlenecks.
All of these tools together make Hadoop a strong and flexible platform for
managing and analyzing huge amounts of structured, semi-structured, and
unstructured data. The modular setup allows organizations to use the parts
they need based on their requirements, whether it's batch processing, real-
time analytics, or machine learning.

Latest Research on the Hadoop Ecosystem


This is the link to the article: https://arxiv.org/pdf/2304.05028
This research was conducted at Tsinghua University.
The authors note that although many database management systems (DBMS) have
extensive support for open-source formats, those formats were designed in the
early 2010s for the Hadoop ecosystem. Both the hardware and workload
landscapes have changed since then, and the authors identify design decisions
that are advantageous with modern hardware and real-world data distributions.
These include using dictionary encoding by default, favoring decoding speed
over compression ratio for integer encoding algorithms, making block
compression optional, and embedding finer-grained auxiliary data structures.
Columnar storage has been widely adopted for data analytics because of its
advantages such as irrelevant attribute skipping, efficient data compression,
and vectorized query processing [60, 64, 73].
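To illustrate the dictionary-encoding design decision the paper highlights, the sketch below uses the pyarrow library (an assumption of this sketch; the paper evaluates columnar formats such as Parquet rather than this specific snippet). The column values and output file name are placeholders.

# dictionary_encoding.py - illustrative sketch with pyarrow (an assumption);
# shows dictionary encoding of a repetitive string column and writing a
# Parquet file with dictionary encoding on and block compression off.
import pyarrow as pa
import pyarrow.parquet as pq

cities = pa.array(["Lahore", "Karachi", "Lahore", "Lahore", "Karachi"])
encoded = cities.dictionary_encode()
print(encoded.dictionary)   # unique values: ["Lahore", "Karachi"]
print(encoded.indices)      # small integer codes: [0, 1, 0, 0, 1]

table = pa.table({"city": cities})
# use_dictionary=True mirrors "dictionary encoding by default";
# compression="none" mirrors "making block compression optional".
pq.write_table(table, "cities.parquet", use_dictionary=True, compression="none")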
Background and Related Work:
The Big Data ecosystem in the early 2010s gave rise to open-source file
formats. Apache Hadoop first introduced two row-oriented formats: SequenceFile
[54], organized as key-value pairs, and Avro [10], based on JSON. At the same
time, column-oriented DBMSs such as C-Store [107], MonetDB [84], and
VectorWise [122] were gaining adoption for analytical workloads.
Conclusion:
Ameneh Zarei and her co-authors conclude by focusing on the continued
relevance of Hadoop in handling large-scale data despite the rise of newer
frameworks like Apache Spark. The paper argues that Hadoop's core concepts
are still relevant and continue to offer strong solutions to current
problems, although there are some issues in terms of performance and
integration. They suggest that future improvements in cloud integration and
real-time processing could ensure Hadoop's sustainability in the evolving big
data ecosystem.
