Hadoop


Difference between Traditional Database Systems and Hadoop


Traditional Database System
•Data is stored in a central location and sent to the processor at runtime.
•It cannot be used to process and store a significant amount of data (big data).
•Traditional RDBMS is used to manage only structured and semi-structured data. It cannot be used to handle unstructured data.

Hadoop
•In Hadoop, the program goes to the data. It first distributes the data to multiple systems and later runs the computation wherever the data is located.
•Hadoop works better when the data size is big. It can process and store a large amount of data efficiently and effectively.
•Hadoop can process and store a variety of data, whether it is structured or unstructured.
Hadoop Ecosystem
•HDFS (Hadoop Distributed File System)
•HBase
•Sqoop
•Flume
•Spark
•Hadoop MapReduce
•Pig
•Impala
•Hive
•Oozie
•Hue
HDFS (Hadoop Distributed File System)
•HDFS is the storage layer of Hadoop.
•HDFS is suited to distributed storage and processing: as data is stored, it is first distributed across nodes and then processed where it resides.
•HDFS provides streaming access to file system data.
•HDFS provides file permissions and authentication.
•HDFS offers a command-line interface for interacting with Hadoop.
HBase
•HBase is a NoSQL, or non-relational, database.
•HBase is important and mainly used when you need random, real-time read or write access to your Big Data.
•It supports a high volume of data and high throughput.
•In HBase, a table can have thousands of columns.
Sqoop
•Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.
•It is used to import data from relational databases (such as Oracle and
MySQL) to HDFS and export data from HDFS to relational databases.
Flume
•Flume is a distributed service that collects event data and
transfers it to HDFS.
•It is ideally suited for event data from multiple systems.
Hadoop MapReduce
•Hadoop MapReduce is the framework that processes data.
•It is the original Hadoop processing engine, which is primarily Java-based.
•It is based on the map and reduce programming model.
•Many tools, such as Hive and Pig, are built on the MapReduce model.
•It has extensive and mature fault tolerance built into the framework.
•It is still very commonly used but is losing ground to Spark.
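The map and reduce model can be sketched as a toy, single-process word count in plain Python (an illustration of the model only, not the Hadoop Java API):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: combine all counts that were emitted for one key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Shuffle: group intermediate pairs by key, as the framework would
# between the map and reduce stages.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result["the"])  # 2
```

In real Hadoop MapReduce, the map and reduce functions run as distributed tasks and the shuffle happens across the cluster; the structure of the computation is the same.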
Pig
•Pig converts its scripts to MapReduce code, thereby saving the user from writing complex MapReduce programs.
•Ad-hoc operations like Filter and Join, which are difficult to perform in MapReduce, can be easily done using Pig.
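To see why this matters, here is a filter-and-join over two hypothetical datasets, written in plain Python as a stand-in for the few Pig Latin statements (FILTER, JOIN) it would take; hand-coding the same thing as MapReduce jobs would be far longer:

```python
# Hypothetical sample data: (user, age) and (user, page) records.
users = [("alice", 34), ("bob", 17), ("carol", 25)]
visits = [("alice", "home"), ("bob", "news"), ("carol", "sports")]

# Filter: keep only adult users (what Pig's FILTER ... BY expresses).
adults = [(name, age) for name, age in users if age >= 18]

# Join: match page visits to the filtered users (what Pig's JOIN expresses).
joined = [(name, age, page)
          for name, age in adults
          for user, page in visits
          if user == name]

print(joined)  # [('alice', 34, 'home'), ('carol', 25, 'sports')]
```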
Impala
•Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
•Impala is preferred for ad-hoc queries.
•It is an open-source high-performance SQL engine, which runs on the Hadoop cluster.
•It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.
Hive
•Hive executes queries using MapReduce; however, a user need not write any code in low-level MapReduce.
•Hive is suitable for structured data. After the data is analyzed, it is ready for the users to access.
•It is very similar to Impala; however, Hive is preferred for data processing and Extract, Transform, Load (ETL) operations.
Oozie
•Oozie is a workflow or coordination system that you can use to manage Hadoop jobs.
Hue
(Hadoop User Experience)
•Upload and browse data
•Query a table in Hive and Impala
•Run Spark and Pig jobs and workflows
•Search data
•All in all, Hue makes Hadoop easier to use.
•It also provides a SQL editor for Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, and Solr SQL.
Spark
•Spark is an open source cluster computing framework.
•It provides up to 100 times faster performance for some applications with in-memory primitives, as compared to the two-stage, disk-based MapReduce paradigm of Hadoop.
•Spark can run in the Hadoop cluster and process data in HDFS.
•It also supports a wide variety of workloads, including machine learning, business intelligence, streaming, and batch processing.

•Spark Core and Resilient Distributed Datasets (RDDs)
•Spark SQL
•Spark Streaming
•Machine learning library (MLlib)
•GraphX
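The RDD style of chained, in-memory transformations can be sketched without a cluster. This toy in plain Python mimics the shape of PySpark's flatMap / map / reduceByKey word count (the real API requires a Spark installation; the helper names here are our own):

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, data):
    # Mimics RDD.flatMap: apply func to each element and flatten the results.
    return [item for element in data for item in func(element)]

def reduce_by_key(func, pairs):
    # Mimics RDD.reduceByKey: combine all values that share a key with func.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

lines = ["spark runs in memory", "spark runs fast"]

# Word count in the PySpark style, i.e. roughly:
#   lines.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)
words = flat_map(str.split, lines)
pairs = [(word, 1) for word in words]
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts["spark"])  # 2
```

Because the intermediate results stay in memory rather than being written to disk between stages, pipelines like this are where Spark's speed advantage over classic MapReduce comes from.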
