Describe The Functions and Features of HDP

Hortonworks Data Platform (HDP) is an open source Apache Hadoop distribution that provides a centralized platform for storing and processing data at rest, and includes Apache Ambari for provisioning, managing, and monitoring Hadoop clusters through a web UI and REST APIs, as well as the Ambari Metrics System for collecting and serving Hadoop and system metrics.

  • HDP Features: Covers the functions and features of the Hortonworks Data Platform including its components and architecture.
  • Apache Ambari: Discusses the role of Apache Ambari in managing Hadoop clusters with detailed explanations of its architecture and components.
  • MapReduce and YARN: Explores the concept and mechanisms of MapReduce and YARN within Hadoop, including detailed engine and programming models.
  • Apache Spark: Examines Apache Spark's role in the Hadoop ecosystem, its architecture, components, and the significance of RDD operations.
  • HDFS Architecture: Details the master/slave architecture of HDFS, including the replication of blocks and advantages and disadvantages of Hadoop.

1. Describe the functions and features of HDP

Hortonworks Data Platform (HDP)


• HDP is a platform for data-at-rest
• Secure, enterprise-ready open source Apache Hadoop distribution
based on a centralized architecture (YARN)
• HDP is:
 Open
 Central
 Interoperable
 Enterprise ready
Apache Ambari
• For provisioning, managing, and monitoring Apache Hadoop clusters.
• Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs

Ambari REST APIs


 Allows application developers and system integrators to easily integrate Hadoop provisioning, management, and
monitoring capabilities into their own applications

Functionality of Apache Ambari


Ambari enables System Administrators to:
• Provision a Hadoop cluster
 Ambari provides a wizard for installing Hadoop services across any number of hosts
• Manage a Hadoop cluster
 Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster
• Monitor a Hadoop cluster
 Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster
 Ambari leverages Ambari Metrics System ("AMS") for metrics collection
• Ambari enables application developers and system integrators to:
 Easily integrate Hadoop provisioning, management, and monitoring capabilities into
their own applications with the Ambari REST APIs (a sketch of such a call follows below)
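
Example: a minimal sketch (not from the course material) of driving the Ambari REST API from a custom application in Python. The server address, cluster name, and credentials are placeholders; /api/v1/clusters/<name>/services is the standard endpoint for listing services, and a PUT on a service's desired state starts or stops it.

import requests

AMBARI = "http://ambari-server.example.com:8080"   # placeholder Ambari Server address
AUTH = ("admin", "admin")                          # default Ambari credentials (change in production)
HEADERS = {"X-Requested-By": "ambari"}             # header Ambari requires on write operations

# List the services installed in a cluster named "mycluster"
resp = requests.get(f"{AMBARI}/api/v1/clusters/mycluster/services", auth=AUTH, headers=HEADERS)
for item in resp.json()["items"]:
    print(item["ServiceInfo"]["service_name"])

# Stop the HDFS service by setting its desired state to INSTALLED (i.e. stopped)
requests.put(f"{AMBARI}/api/v1/clusters/mycluster/services/HDFS",
             auth=AUTH, headers=HEADERS,
             json={"RequestInfo": {"context": "Stop HDFS via REST"},
                   "Body": {"ServiceInfo": {"state": "INSTALLED"}}})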

Ambari Metrics System ("AMS")


• System for collecting, aggregating and serving Hadoop and system
metrics in Ambari-managed clusters. The AMS works as follows:
1. Metrics Monitors run on each host and send system-level metrics to the Metrics Collector (a daemon that runs on a designated host in the cluster).
2. Hadoop Sinks run on each host and send Hadoop-level metrics to the Collector.
3. The Metrics Collector stores and aggregates metrics. The Collector can store data either on the local
filesystem ("embedded mode") or can use an external HDFS for storage ("distributed mode").
4. Ambari exposes a REST API, which makes metrics retrieval easy.
5. Ambari REST API feeds the Ambari Web UI.
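
Example: a minimal sketch of step 4, retrieving a metric over REST with Python. The Collector host is a placeholder; /ws/v1/timeline/metrics (default port 6188) is the AMS Collector's query endpoint, and the same data also backs the Ambari Web UI.

import requests

COLLECTOR = "http://metrics-collector.example.com:6188"   # placeholder AMS Collector address

params = {
    "metricNames": "cpu_user",           # a system-level metric reported by the Metrics Monitors
    "hostname": "worker01.example.com",  # placeholder host to inspect
    "appId": "HOST",                     # "HOST" selects host-level (system) metrics
}
resp = requests.get(f"{COLLECTOR}/ws/v1/timeline/metrics", params=params)
for metric in resp.json().get("metrics", []):
    print(metric["metricname"], metric["metrics"])   # metric name and its timestamp -> value points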

The Ambari Metrics System ("AMS") is a system for collecting, aggregating and
serving Hadoop and system metrics in Ambari-managed clusters.

The Ambari user interface is a web-based interface that allows users to easily interact with the system.
Ambari Architecture
Ambari Server: contains or interacts with the following components:

 Postgres RDBMS (default) stores the cluster configurations


 Authorization Provider integrates with an organization's authentication/authorization provider, such as an
LDAP service (by default, Ambari uses an internal database as the user store for authentication and
authorization)
 Ambari Alert Framework supports alerts and notifications
 REST API integrates with the web-based front-end Ambari Web. This REST API can also be used by
custom applications.

How Ambari manages hosts in a cluster


• Ambari provides the following actions using the Hosts tab:
 Working with Hosts
 Determining Host Status
 Filtering the Hosts List
 Performing Host-Level Actions
 Viewing Components on a Host
 Decommissioning Masters and Slaves
 Deleting a Host from a Cluster
 Setting Maintenance Mode
 Adding Hosts to a Cluster

Ambari terminology
Service: Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are examples of services

Component: A service consists of one or more components. For example, HDFS has 3 components: NameNode,
DataNode and Secondary NameNode.

Node/Host: Node refers to a machine in the cluster. Node and host are used interchangeably in this document.
Node-Component: Node-component refers to an instance of a component on a particular node.

Operation: An operation refers to a set of changes or actions performed on a cluster to satisfy a user request or to
achieve a desirable state change in the cluster.

Task: Task is the unit of work that is sent to a node to execute. A task is the work that node has to carry out as part
of an action.

Stage: A stage refers to a set of tasks that are required to complete an operation and are independent of each other;
all tasks in the same stage can be run across different nodes in parallel.

Action: An 'action' consists of a task or tasks on a machine or a group of machines. Each action is tracked by an
action id and nodes report the status at least at the granularity of the action.

Stage Plan: An operation typically consists of multiple tasks on various machines, and they usually have
dependencies requiring them to run in a particular order. The stage plan divides the operation's tasks into an
ordered sequence of stages; each stage must complete before the next one is scheduled.

Manifest: Manifest refers to the definition of a task which is sent to a node for execution.

Role: A role maps to either a component (for example, NameNode, DataNode) or an action (for example, HDFS
rebalancing, HBase smoke test, other admin commands, etc.)
MapReduce and YARN
The Distributed File System (DFS)
• Driving principles
 data is stored across the entire cluster
 programs are brought to the data, not the data to the program
• Data is stored across the entire cluster (the DFS)
 the entire cluster participates in the file system
 blocks of a single file are distributed across the cluster
 a given block is typically replicated as well for resiliency

Describe the MapReduce model v1

Hadoop computational model


 Data stored in a distributed file system spanning many inexpensive computers
 Bring function to the data
 Distribute application to the compute resources where the data is stored
The MapReduce programming model
"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to
worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node
processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to
get the output - the answer to the problem it was originally trying to solve.
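
Example: a hedged word-count sketch of this model in the Hadoop Streaming style, where the mapper and reducer are small programs that read stdin and write stdout; the file name and the local test pipeline are only for illustration.

# wordcount_streaming.py - the same script acts as mapper or reducer.
# Local test: cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce
import sys
from itertools import groupby

def mapper():
    # "Map" step: emit an intermediate <word, 1> pair for every word in this worker's split
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # "Reduce" step: input arrives sorted by key (the shuffle), so consecutive
    # pairs with the same word can be summed into a single <word, count> pair
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()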

The MapReduce execution environments


• APIs vs. Execution Environment
 APIs are implemented by applications and are largely independent of
execution environment
 Execution Environment defines how MapReduce jobs are executed
• MapReduce APIs
 org.apache.hadoop.mapred:
- Old API, largely superseded; some of its classes are still used by the new API
- Not changed with YARN
 org.apache.hadoop.mapreduce:
- New API, more flexibility, widely used
- Applications may have to be recompiled to use YARN (not binary compatible)
• Execution Environments
 Classic JobTracker/TaskTracker from Hadoop v1
 YARN (MRv2), the execution environment introduced with Hadoop 2
MapReduce phases
 Map
Mappers
 Small program (typically), distributed across the cluster, local to data
 Handed a portion of the input data (called a split)
 Each mapper parses, filters, or transforms its input

 Shuffle

Shuffle phase
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key
• All of this movement (shuffle) of data is transparently orchestrated by MapReduce
 Reduce

Reducers
 Small programs (typically) that aggregate all of the values for the key
that they are responsible for
 Each reducer writes output to its own file

 Combiner

Combiner (Optional)
• The combiner pre-aggregates (sorts and merges) each mapper's output locally before it is sent to the reduce node,
doing some of the receiving reduce node's work in advance in order to minimize network traffic between map and reduce nodes.

The process of running a MapReduce job on Hadoop consists of 10 major steps:

1. The MapReduce program you have written tells the Job Client to run a MapReduce job.
2. The Job Client sends a message to the JobTracker, which produces a unique ID for the job.
3. The Job Client copies job resources, such as a jar file containing the Java code
you have written to implement the map or the reduce task, to the shared file system, usually HDFS.
4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each
"split" to a different mapper process to maximize throughput.
6. It retrieves these "input splits" from the distributed file system, not the data itself.

7. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has
work for them, it will return a map task or a reduce task as a response to the heartbeat.
8. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it and this child process runs your
map code or your reduce code. The result of the map operation will remain in the local disk for the given
TaskTracker node (not in HDFS).
10. The output of the Reduce task is stored in the HDFS file system using the number of copies specified by the
replication factor.

Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
 InputSplitter: divides a file into splits
- Splits are normally the block size, but this depends on the number of requested Map
tasks, whether any compression allows splitting, etc.
 RecordReader: takes a split and reads the file into records
- For example, one record per line (LineRecordReader)
- But note that a record can be split across splits
 InputFormat: takes each record and transforms it into a
<key, value> pair that is then passed to the Map task (a conceptual sketch follows below)

Limitations of classic MapReduce (MRv1)


The most serious limitations of classical MapReduce are:
 Scalability
 Resource utilization
 Support of workloads different from MapReduce.
• In the MapReduce framework, the job execution is controlled by two
types of processes:
 A single master process called JobTracker, which coordinates all jobs
running on the cluster and assigns map and reduce tasks to run on the TaskTrackers
 A number of subordinate processes called TaskTrackers, which run assigned tasks and periodically report the
progress to the JobTracker

YARN overhauls MRv1


• MapReduce has undergone a complete overhaul with YARN, splitting
up the two major functionalities of JobTracker (resource management
and job scheduling/monitoring) into separate daemons
• ResourceManager (RM)
 The global ResourceManager and per-node slave, the NodeManager (NM),
form the data-computation framework
 The ResourceManager is the ultimate authority that arbitrates resources
among all the applications in the system
• ApplicationMaster (AM)
 The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating
resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
 An application is either a single job in the classical sense of Map-Reduce jobs or a directed acyclic graph (DAG)
of jobs

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar
constraints such as capacities and queues.

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for
executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster
container on failure.

The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource
usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from
the Scheduler, tracking their status and monitoring for progress.

YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability

YARN major features summarized


• Multi-tenancy
 YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard
for batch, interactive, and real-time engines that can simultaneously access the same data sets
 Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
• Cluster utilization
 YARN's dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in
early versions of Hadoop
• Scalability
 Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on
scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
• Compatibility
 Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing
processes that already work.

1. List the phases in a MR job.


 Map, Shuffle, Reduce, Combiner
2. What are the limitations of MR v1?
 Centralized handling of job control flow
 Tight coupling of a specific programming model with the resource management
infrastructure
 Hadoop is now being used for all kinds of tasks beyond its original design
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
 ResourceManager
 ApplicationMaster
4. What are the major features of YARN?
 Multi-tenancy
 Cluster utilization
 Scalability
 Compatibility

---------------------------------------------------------------------------------------------------------------------
Apache Spark
List the purpose of Apache Spark in the Hadoop ecosystem
 Faster results from analytics have become increasingly important
 Apache Spark is a computing platform designed to be fast, general-purpose, and
easy to use

Who uses Spark and why?


• Parallel distributed processing, fault tolerance on commodity hardware,
scalability, in-memory computing, high level APIs, etc.
• Data scientists
 Analyze and model the data to obtain insight using ad-hoc analysis
 Transform the data into a usable format
 Statistics, machine learning, SQL
• Data engineers
 Develop a data processing system or application
 Inspect and tune their applications
 Programming with Spark's API
• Everyone else
 Ease of use
 Wide variety of functionality
 Mature and reliable.
List and describe the architecture and components of the Spark unified stack

 Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive variant of SQL).
 Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of Spark Core, making it easy for developers to move
between applications that process data stored in memory and applications that process data arriving in real time.
 MLlib is the machine learning library that provides multiple types of machine learning algorithms.
 GraphX is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations. Graphs are data structures comprised of vertices and edges
connecting them.

Describe the role of a Resilient Distributed Dataset (RDD)


Resilient Distributed Datasets (RDDs)
• Spark's primary abstraction: Distributed collection of elements, parallelized across the cluster
• Two types of RDD operations (a sketch follows below):
 Transformations
- Build up a directed acyclic graph (DAG) of the computation
- Lazily evaluated: nothing is computed when a transformation is called
- Return a new, transformed RDD rather than a computed value
 Actions
- Trigger execution of the accumulated transformations
- Return a value to the driver (or write output to storage)
• RDD provides fault tolerance
• Has in-memory caching (with overflow to disk).
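
Example: a minimal PySpark sketch (assuming a local Spark installation) showing that transformations only build the DAG and that an action triggers the computation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["to be or not to be", "that is the question"])

# Transformations: each returns a new RDD and extends the DAG; nothing runs yet
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: forces the whole DAG to execute and returns a value to the driver
print(counts.collect())   # e.g. [('to', 2), ('be', 2), ('or', 1), ...]

sc.stop()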

Resilient Distributed Dataset (RDD)


• Fault-tolerant collection of elements that can be operated on in parallel
• RDDs are immutable
• Three methods for creating an RDD (see the short sketch after this list)
 Parallelizing an existing collection
 Referencing a dataset
 Transformation from an existing RDD
• Two types of RDD operations
 Transformations
 Actions
• Dataset from any storage supported by Hadoop
 HDFS, Cassandra, HBase, Amazon S3, etc.
• Types of files supported:
 Text files, SequenceFiles, Hadoop InputFormat, etc.
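
Example: a short sketch of the three creation methods listed above; the HDFS path is a placeholder and no action is triggered here.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation")

coll_rdd = sc.parallelize(range(1000))                # 1. parallelize an existing collection
file_rdd = sc.textFile("hdfs:///data/logs/*.txt")     # 2. reference a dataset in external storage (HDFS, S3, ...)
upper_rdd = file_rdd.map(lambda line: line.upper())   # 3. transform an existing RDD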

RDD operations: Transformations


• Common transformations include map, filter, flatMap, and reduceByKey; the full set can be found on
Spark's website.
• Transformations are lazy evaluations
• Returns a pointer to the transformed RDD

RDD operations: Actions

• Actions trigger execution of the transformations and return a result to the driver (or write it to storage)
• Common actions include collect, count, reduce, take, and saveAsTextFile
RDD persistence
• Each node stores partitions of the cache that it computes in memory
• Reuses them in other actions on that dataset (or derived datasets)
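
Example: a minimal persistence sketch. The first action computes and caches the partitions; later actions on the same RDD reuse the cached data instead of recomputing it.

from pyspark import SparkContext

sc = SparkContext("local[*]", "persist-demo")

# Pretend this RDD is expensive to compute
squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.cache()          # MEMORY_ONLY; persist(StorageLevel.MEMORY_AND_DISK) allows overflow to disk

print(squares.count())   # first action: computes the partitions and caches them
print(squares.sum())     # later action: reuses the cached partitions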

Spark SQL
• Allows relational queries expressed in
 SQL
 HiveQL
 Scala
• SchemaRDD
 Row objects
 Schema
 Created from:
- Existing RDD
- Parquet file
- JSON dataset
- HiveQL against Apache Hive

• Supports Scala, Java, R, and Python
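
Example: a minimal Spark SQL sketch in Python. SchemaRDD evolved into the DataFrame API in later Spark releases; the JSON path and column names here are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

people = spark.read.json("hdfs:///data/people.json")   # rows plus an inferred schema
people.createOrReplaceTempView("people")

# Relational query expressed in SQL against the registered view
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()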


MLlib
• MLlib is Spark's machine learning library and is under active development
• It currently provides the following common algorithms and utilities:
 Classification
 Regression
 Clustering
 Collaborative filtering
 Dimensionality reduction
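
Example: a minimal MLlib sketch using k-means clustering; the input values are toy data made up purely for illustration.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-demo")

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)        # the two learned cluster centers
print(model.predict([0.5, 0.5]))   # cluster index assigned to a new point
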
Advantages and disadvantages of Hadoop
• Hadoop is good for:
 processing massive amounts of data through parallelism
 handling a variety of data (structured, unstructured, semi-structured)
 using inexpensive commodity hardware
• Hadoop is not good for:
 processing transactions (random access)
 when work cannot be parallelized
 low latency data access
 processing lots of small files
 intensive calculations with small amounts of data

Common questions


YARN enhances scalability and resource management by separating resource management and job scheduling/monitoring, which were previously handled by the singular JobTracker in MRv1. This separation allows Hadoop to support larger clusters and workloads through the ResourceManager, which is focused solely on resource arbitration. YARN's architectural improvements facilitate multi-tenancy and better cluster utilization, accommodating various workloads beyond traditional MapReduce, thus surpassing the scalability limitations of MRv1.

In the classic MapReduce model, the Map step involves the master node dividing the input data into smaller sub-problems which are processed by worker nodes before being passed to the Reduce phase. In Apache Spark, RDD transformations that perform similar tasks to the Map step do so using a more generic transformation model that allows for in-memory processing and lazy evaluation, providing greater flexibility and performance improvement over traditional MapReduce.

YARN improves cluster utilization by dynamically allocating resources across various running applications, unlike the static allocation in previous MapReduce models. It allows multiple types of processing frameworks to run concurrently, thereby optimizing the overall workload that a cluster can handle. This flexibility supports higher resource utilization and eliminates inefficiencies seen in MRv1's more rigid scheduling approach.

Ambari simplifies Hadoop cluster management by providing an intuitive web-based UI and RESTful APIs that allow system administrators to easily provision, manage, and monitor clusters. It includes features like a setup wizard for installing Hadoop services, centralized management for starting, stopping, and reconfiguring services, and a dashboard for monitoring cluster health and status. These tools streamline cluster operations and integrate seamlessly with other applications.

The Scheduler in YARN allocates resources to running applications while considering constraints like capacities and queues. It improves resource allocation efficiency by dynamically adjusting resources based on current demand, supporting diverse workloads and more finely-tuned resource distribution compared to MRv1. This flexibility enhances overall system throughput and scalability, fundamentally transforming resource management strategies in Hadoop ecosystems.

Spark Streaming allows real-time data processing by dividing data streams into mini-batches that are processed sequentially, unlike traditional batch models that process large static datasets in one go. This capability, alongside its tight integration with Spark Core, enables seamless transition between batch and streaming contexts, optimizing resources and providing near-instantaneous data processing and analytics.

Apache Ambari's REST API is integral for integrating Hadoop management capabilities into custom applications. It allows developers and system integrators to programmatically provision, manage, and monitor Hadoop clusters. This API facilitates seamless incorporation of Hadoop's extensive capabilities into various enterprise solutions, enabling custom workflows and versions without depending solely on Ambari's web UI.

YARN introduces a decoupled architecture that separates resource management from job scheduling and monitoring by using distinct ResourceManager and ApplicationMaster components. This approach enables YARN to handle cluster resources dynamically, allowing applications to utilize resources more efficiently and supporting a wider range of application types beyond MapReduce, unlike MRv1, which had a static resource allocation approach via the JobTracker and TaskTrackers.

The Ambari Metrics System (AMS) collects, aggregates, and serves both Hadoop and system metrics in clusters managed by Ambari. Metrics Monitors and Hadoop Sinks collect data on each host, which is stored and aggregated by the Metrics Collector. The REST API facilitates easy metrics retrieval and feeds into the Ambari Web UI, providing comprehensive monitoring of cluster health and enabling proactive management.

RDDs in Spark provide fault tolerance by storing lineage information that allows lost data to be recomputed from the original datasets. They facilitate data processing efficiency through in-memory caching, which minimizes disk I/O costs. RDDs are immutable, allowing transformations to be applied lazily, thus optimizing computation by deferring execution until an action is required.
