Data Analytics IT 404 - Mod 6: Ojus Thomas Lee CE Kidangoor
Module VI
https://data-flair.training/blogs/top-big-data-tools/
Big Data Tools
Apache Hadoop
Hadoop is an open-source framework from Apache that runs on commodity hardware.
It is used to store, process, and analyze Big Data.
Apache Spark
Spark supports both real-time and batch processing.
It also supports in-memory computation, which can make it up to 100 times faster than Hadoop MapReduce for some workloads.
Apache Storm
Apache Storm is an open-source, distributed, real-time, and fault-tolerant processing system. It efficiently processes unbounded streams of data.
Storm's processing speed is very high, and it is easily scalable.
Apache Cassandra
Apache Cassandra is a distributed database that provides high availability and scalability without compromising performance.
Cassandra works quite efficiently under heavy loads.
It does not follow a master-slave architecture, so all nodes have the same role.
Cassandra provides atomicity, isolation, and durability at the row level and offers tunable consistency, but it does not support full multi-row ACID transactions like an RDBMS.
MongoDB
MongoDB is an open-source, cross-platform NoSQL database widely used in data analytics.
Apache Flink
Apache Flink is an open-source distributed processing framework for bounded and unbounded data streams.
Kafka
Apache Kafka is an open-source platform developed at LinkedIn and open-sourced in 2011.
Apache Kafka is a distributed event streaming platform that provides high throughput to systems.
It can handle trillions of events a day.
It is highly scalable and also provides great fault tolerance.
R Programming
R is an open-source programming language and is one of the most
comprehensive statistical analysis languages.
It helps in generating the results of data analysis in graphical as well as text format.
Apache Hadoop
HDFS
HDFS is the filesystem of Hadoop, designed for storing very large files on a cluster of commodity hardware.
It is designed around storing a small number of large files rather than a huge number of small files.
HDFS Nodes
NameNode
DataNode
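HDFS stores each large file as a sequence of fixed-size blocks, whose locations the NameNode tracks as metadata. A minimal sketch of that splitting, assuming the Hadoop 2.x default block size of 128 MB (file sizes here are illustrative):

```python
# Sketch: how HDFS would split a file into fixed-size blocks.
# 128 MB is the Hadoop 2.x default block size.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, in bytes

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the list of block sizes a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds the leftover bytes
    return blocks

# A 300 MB file needs two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```

The last block being allowed to be smaller than the block size is why HDFS prefers few large files: every block, however small, costs one metadata entry on the NameNode.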
Hadoop
[Figure: Hadoop overview (source: https://data-flair.training/blogs/top-big-data-tools/)]
Hadoop - HDFS
NameNode
The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves).
The NameNode mainly stores the metadata, i.e., the data about the data.
Metadata can be the transaction logs that keep track of the user's activity in a Hadoop cluster.
Metadata can also be the name of a file, its size, and the location information (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication.
The NameNode instructs the DataNodes to perform operations like Delete, Create, Replicate, etc.
https://www.geeksforgeeks.org/hadoop-architecture/
DataNodes
DataNodes work as slaves and are mainly used for storing the data in a Hadoop cluster.
The number of DataNodes can range from 1 to 500 or even more.
The more DataNodes there are, the more data the Hadoop cluster can store.
It is therefore advised that DataNodes have high storage capacity to hold a large number of file blocks.
Replication In HDFS
Replication ensures the availability of the data.
Replication means making a copy of something; the number of copies made of a particular item is its replication factor.
As we have seen with file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those file blocks (the default replication factor is 3).
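The placement of those copies can be sketched as follows — a simplified round-robin assignment of each block's replicas to distinct DataNodes (real HDFS placement is rack-aware, and the node names here are illustrative):

```python
# Sketch: placing block replicas on distinct DataNodes.
# Round-robin for illustration only; real HDFS placement is rack-aware.

def place_replicas(num_blocks, datanodes, replication_factor=3):
    """Map each block id to the DataNodes holding its replicas."""
    placement = {}
    for b in range(num_blocks):
        # pick `replication_factor` distinct nodes, rotating through the list
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication_factor)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_replicas(num_blocks=3, datanodes=nodes)
print(plan[0])  # ['dn1', 'dn2', 'dn3'] -- three copies on three distinct nodes
```

Because each block lives on three distinct nodes, the loss of any one DataNode still leaves two live replicas of every block it held.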
Rack Awareness
A rack is simply a physical collection of nodes in the Hadoop cluster (perhaps 30 to 40).
A large Hadoop cluster consists of many racks.
With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing reads and writes, which reduces network traffic.
Hadoop - MapReduce
Map Reduce
MapReduce is a programming model for distributed processing, not just an algorithm or a data structure.
Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast.
When you are dealing with Big Data, serial processing is no longer of any use.
MapReduce has two main tasks, divided phase-wise:
Map Task
Reduce Task
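The two phases can be sketched in plain Python with the classic word-count example. Hadoop would run many mappers and reducers in parallel across the cluster; here both phases run serially just to show the data flow:

```python
# Sketch of the MapReduce model: a word count in plain Python.
from collections import defaultdict

def map_task(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def reduce_task(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)

lines = ["big data tools", "big data analytics"]

# Shuffle: group all intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_task(line):
        grouped[word].append(count)

result = dict(reduce_task(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```

The shuffle step in the middle is what Hadoop performs over the network: all values for one key end up at the same reducer.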
[Figures: MapReduce job execution, step by step (sources: https://www.geeksforgeeks.org/hadoop-architecture/, https://www.edureka.co/blog/mapreduce-tutorial/)]
Hadoop - YARN
[Figures: MapReduce vs YARN (source: https://www.educba.com/mapreduce-vs-yarn/)]
Hadoop - HBASE
HBASE
HBase is an open-source, sorted-map data store built on top of Hadoop.
It is column-oriented and horizontally scalable.
It is based on Google's Bigtable.
It has a set of tables that keep data in key-value format.
It is part of the Hadoop ecosystem and provides random, real-time read/write access to data in the Hadoop File System.
https://www.javatpoint.com/what-is-hbase
Why HBase?
RDBMSs become exponentially slower as the data grows large.
They expect data to be highly structured, i.e., able to fit a well-defined schema.
Any change in schema might require downtime.
For sparse datasets, there is too much overhead in maintaining NULL values.
HBase Features
Apache HBase has a completely distributed architecture.
It can easily work on extremely large-scale data.
HBase offers high security and easy management, along with very high write throughput.
It can be used for both structured and semi-structured data types.
Moreover, MapReduce jobs can be backed by HBase tables.
[Figure: HBase overview (source: https://www.tutorialspoint.com/hbase/hbase_overview.htm)]
HBase Architecture
HBase is basically a column-oriented key-value data store.
It is a natural fit for deployment as a top layer on HDFS because it works extremely well with the kind of data that Hadoop processes.
[Figure: HBase architecture]
In HBase, tables are split into regions and are served by the region
servers.
Regions are vertically divided by column families into “Stores”.
Stores are saved as files in HDFS.
HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.
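The data model behind those regions — a sorted map of row key, column family, column qualifier, and value — can be mimicked with nested dictionaries. This sketches the model only; a real HBase client API looks different, and the row keys, family, and values here are illustrative:

```python
# Sketch of HBase's data model:
# row key -> column family -> column qualifier -> value.
from collections import defaultdict

table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value):
    """Store one cell, creating the row and family on first use."""
    table[row][family][qualifier] = value

def get(row, family, qualifier):
    """Fetch one cell, or None if it was never written (no NULL overhead)."""
    return table[row][family].get(qualifier)

put("user1", "info", "name", "Alice")
put("user1", "info", "city", "Kochi")
put("user2", "info", "name", "Bob")

print(get("user1", "info", "name"))  # Alice
print(sorted(table))                 # ['user1', 'user2']
```

Keeping rows ordered by key is what lets HBase split a table into contiguous key ranges (regions) and hand each range to a region server; absent cells simply do not exist, which is why sparse data carries no NULL-storage cost.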
Regions are nothing but tables that are split up and spread across the region servers.
The region servers:
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of the regions by following the region size thresholds.
Hadoop - PIG
[Figure: Pig overview (source: https://www.javatpoint.com/pig)]
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS.
Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System.
Every task that can be achieved using Pig can also be achieved using Java MapReduce programs.
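A Pig Latin script describes a dataflow of operators such as LOAD, FILTER, GROUP, and FOREACH/COUNT. As a sketch of what such a pipeline computes (the field names and records below are illustrative, and this is plain Python, not Pig itself):

```python
# Sketch: the dataflow a short Pig Latin script describes,
# LOAD -> FILTER -> GROUP -> COUNT, written in plain Python.

records = [
    {"name": "a.log", "size_mb": 300, "type": "log"},
    {"name": "b.csv", "size_mb": 10,  "type": "csv"},
    {"name": "c.log", "size_mb": 120, "type": "log"},
]

# FILTER records BY size_mb > 50;
big = [r for r in records if r["size_mb"] > 50]

# GROUP big BY type; FOREACH group GENERATE COUNT(...);
counts = {}
for r in big:
    counts[r["type"]] = counts.get(r["type"], 0) + 1

print(counts)  # {'log': 2}
```

Pig's value is that it compiles exactly this kind of filter/group/aggregate pipeline into MapReduce jobs, sparing the programmer the equivalent Java boilerplate.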
Architecture of PIG
[Figure: Pig architecture (source: https://data-flair.training/blogs/pig-architecture/)]
Hadoop - HIVE
[Figure: Hive overview (source: https://www.javatpoint.com/what-is-hive)]
Features of Hive
Hive is fast and scalable.
It provides SQL-like queries (i.e., HQL) that are implicitly transformed
to MapReduce or Spark jobs.
It is capable of analyzing large datasets stored in HDFS.
It allows different storage types such as plain text, RCFile, and HBase.
It uses indexing to accelerate queries.
It can operate on compressed data stored in the Hadoop ecosystem.
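To convey the SQL-like flavor of HQL, here is a query of the same shape run against an in-memory SQLite database. This is an analogy only: Hive would compile such a query into MapReduce or Spark jobs over files in HDFS, and the table and column names are illustrative:

```python
# Analogy: the SQL-like style of HQL, demonstrated with SQLite.
# Hive compiles queries of this shape into MapReduce/Spark jobs over HDFS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")],
)

# The same SELECT ... GROUP BY shape is valid HQL.
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
print(rows)  # [('ERROR', 2), ('INFO', 1)]
```

The point of Hive is precisely this: an analyst writes the declarative query, and the execution engine handles the distributed scan and aggregation.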
HIVE Architecture
[Figure: Hive architecture (source: https://www.javatpoint.com/what-is-hive)]
Hadoop - MAHOUT
[Figure: Mahout overview (source: https://www.tutorialspoint.com/mahout/mahout_introduction.htm)]
Features of Mahout
Mahout's algorithms are written on top of Hadoop, so they work well in a distributed environment.
Mahout uses the Apache Hadoop library to scale effectively in the cloud.
Mahout offers the coder a ready-to-use framework for performing data mining tasks on large volumes of data.
Mahout lets applications analyze large sets of data effectively and quickly.
It includes several MapReduce-enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.
It supports the Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
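The k-means iteration that Mahout distributes via MapReduce can be sketched on a single machine. This shows only the algorithm's two alternating steps (assign points to the nearest centroid, then recompute each centroid as its cluster's mean); the 1-D data and choice of k are illustrative:

```python
# Sketch of the k-means iteration Mahout runs as MapReduce jobs,
# here on 1-D points on a single machine.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans(points, centroids=[1.0, 10.0]))  # [1.5, 10.5]
```

In the distributed version, the assignment step maps each point to its nearest centroid and the update step reduces over each cluster to compute the new means — which is why the algorithm fits the MapReduce model so naturally.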
Mahout - Architecture
[Figure: Mahout architecture (source: https://www.tutorialspoint.com/mahout/mahout_introduction.htm)]