Difference between Traditional Database System and Hadoop
Traditional Database System:
• Data is stored in a central location and sent to the processor at runtime.
• It cannot be used to process and store a significant amount of data (big data).
• Traditional RDBMS is used to manage only structured and semi-structured data; it cannot handle unstructured data.

Hadoop:
• The program goes to the data: Hadoop first distributes the data to multiple systems and later runs the computation wherever the data is located.
• It works better when the data size is big, and it can process and store a large amount of data efficiently and effectively.
• It can process and store a variety of data, whether structured or unstructured.

Hadoop Ecosystem
• HDFS (Hadoop Distributed File System)
• HBase
• Sqoop
• Flume
• Spark
• Hadoop MapReduce
• Pig
• Impala
• Hive
• Oozie
• Hue

HDFS (Hadoop Distributed File System)
• HDFS is the storage layer of Hadoop.
• HDFS is suitable for distributed storage and processing: while the data is being stored, it first gets distributed and is then processed where it resides.
• HDFS provides streaming access to file system data.
• HDFS provides file permissions and authentication.
• HDFS offers a command-line interface for interacting with Hadoop (a Java API sketch appears after the Hive section below).

HBase
• HBase is a NoSQL, non-relational database.
• HBase is mainly used when you need random, real-time read or write access to your big data (see the client sketch after the Hive section below).
• It supports a high volume of data and high throughput.
• In HBase, a table can have thousands of columns.

Sqoop
• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
• It is used to import data from relational databases (such as Oracle and MySQL) into HDFS and to export data from HDFS back to relational databases.

Flume
• Flume is a distributed service that collects event data and transfers it to HDFS.
• It is ideally suited for event data from multiple systems.

Hadoop MapReduce
• Hadoop MapReduce is the framework that processes data.
• It is the original Hadoop processing engine, and it is primarily Java-based.
• It is based on the map-and-reduce programming model (see the word-count sketch after the Hive section below).
• Many tools, such as Hive and Pig, are built on the MapReduce model.
• It has extensive, mature fault tolerance built into the framework.
• It is still very commonly used but is losing ground to Spark.

Pig
• Pig converts its scripts to map and reduce code, thereby saving the user from writing complex MapReduce programs.
• Ad-hoc operations such as Filter and Join, which are difficult to perform in MapReduce, can be done easily using Pig.

Impala
• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
• Impala is preferred for ad-hoc queries.
• It is an open-source, high-performance SQL engine that runs on the Hadoop cluster.
• It is ideal for interactive analysis and has very low latency, measured in milliseconds.

Hive
• Hive executes queries using MapReduce; however, the user need not write any low-level MapReduce code (see the JDBC sketch below).
• Hive is suitable for structured data. After the data is analyzed, it is ready for users to access.
• It is very similar to Impala; however, Hive is preferred for data processing and Extract, Transform, Load (ETL) operations.
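The HDFS command-line interface mentioned above (hdfs dfs -put, hdfs dfs -cat, and so on) has a Java counterpart in Hadoop's FileSystem API. Below is a minimal sketch, assuming a reachable cluster; the NameNode address (hdfs://namenode:8020) and the file path are placeholders, not values from this material.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI; a real cluster address goes here.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a small file into HDFS (path is a placeholder).
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back with streaming access, line by line.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}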
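The random, real-time read/write access that HBase provides looks like the following with the standard HBase Java client. This is only a sketch: the "users" table, the "info" column family, and the row key "user42" are hypothetical, and connection settings are assumed to come from an hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Random, real-time write: one row keyed by a user id.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}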
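The map-and-reduce programming model is easiest to see in the classic word-count job, shown below in the style of the standard Hadoop MapReduce tutorial example. The mapper emits (word, 1) pairs; the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this is typically launched with something like: hadoop jar wordcount.jar WordCount /input /output (paths are placeholders).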
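Hive queries like those described above are commonly run from Java over the HiveServer2 JDBC interface; Hive compiles the SQL into MapReduce jobs behind the scenes. The sketch below assumes a reachable HiveServer2 (the host, port, credentials, and the "sales" table are all placeholders).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicitly register the driver (needed with older hive-jdbc versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // 'sales' is a hypothetical table; Hive turns this into MapReduce work.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}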
Oozie
• Oozie is a workflow or coordination system that you can use to manage Hadoop jobs.

Hue (Hadoop User Experience)
• Upload and browse data
• Query a table in Hive and Impala
• Run Spark and Pig jobs and workflows
• Search data
• All in all, Hue makes Hadoop easier to use.
• It also provides a SQL editor for Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, and Solr SQL.
Spark
• Spark is an open-source cluster computing framework.
• It provides up to 100 times faster performance for some applications, using in-memory primitives, compared with the two-stage, disk-based MapReduce paradigm of Hadoop.
• Spark can run in the Hadoop cluster and process data in HDFS.
• It also supports a wide variety of workloads, including machine learning, business intelligence, streaming, and batch processing (see the Java sketch below).

Spark components:
• Spark Core and Resilient Distributed Datasets (RDDs)
• Spark SQL
• Spark Streaming
• Machine learning library (MLlib)
• GraphX
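As a counterpart to the MapReduce word count shown earlier, here is a minimal sketch of the same job using Spark Core and RDDs through Spark's Java API. It runs in local mode for illustration; on a real cluster the master would be supplied by spark-submit, and the HDFS input path is a placeholder.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] is for testing only; spark-submit sets the master on a cluster.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a text file from HDFS; the path is a placeholder.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/demo/input.txt");

            // Same logic as the MapReduce word count, but intermediate RDDs
            // stay in memory instead of being written to disk between stages.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1() + "\t" + t._2()));
        }
    }
}

Note how the whole pipeline fits in a few chained transformations; this brevity, plus in-memory caching, is a large part of why Spark is displacing hand-written MapReduce for many workloads.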