Big Data Lab Manual
A PROJECT REPORT
Submitted by
M. Rianee rayen
(950921104030)
DEPARTMENT
OF
BONAFIDE CERTIFICATE
Certified that this Naan Mudhalvan project report “BIG DATA” is the
bonafide work of “M. Rianee rayen (950921104030)” who carried out the
project at “Infosys”.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 BIG DATA
1.2 HADOOP
1.3 HIVE
1.4 SCALA
1.5 SPARK
2 SYSTEM SPECIFICATIONS
3 IMPLEMENTATION
4 CONCLUSION
5 CERTIFICATE
CHAPTER 1
INTRODUCTION
1.1 BIG DATA
Big Data refers to datasets that are extremely large in size and have diverse, complex
structures. This is very difficult for traditional data processing software to deal with.
Among the most talked-about trends in technology, big data is closely associated with
other rapidly growing fields such as deep learning, machine learning, artificial
intelligence (AI), and the Internet of Things (IoT). In combination with these
technologies, big data platforms focus on analyzing and handling large amounts of both
real-time and batch data.
Big Data is typically managed and analyzed using advanced tools and frameworks such as:
• Hadoop and Spark for distributed data storage and processing.
• NoSQL databases like MongoDB and Cassandra for flexible data handling.
• Machine learning and AI models to extract meaningful patterns and predictions.
Key Characteristics of Big Data (The 5 Vs):
1. Volume: The sheer amount of data, ranging from terabytes to petabytes and beyond.
2. Velocity: The speed at which data is generated, collected, and processed, often in real
time.
3. Variety: The diverse formats and types of data, including text, images, audio, video, and
log files.
4. Veracity: The uncertainty and reliability of data, highlighting the need for accurate and
trustworthy sources.
5. Value: The insights and business advantages derived from analyzing big data.
1.2 HADOOP
Hadoop is an open-source software framework for storing large amounts of data and
performing computation on it. The framework is written mainly in Java, with some native
code in C and shell scripts. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of large
datasets.
• HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource management component
of Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
• Hadoop also includes several ecosystem modules that provide extra functionality, such
as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
• Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and
data mining. It enables the distributed processing of large data sets across clusters of
computers using a simple programming model.
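To make the MapReduce model described above concrete, a minimal word-count job can be
written as a pair of small Python scripts and run through Hadoop Streaming; the file
names, dataset, and the choice of Python (rather than a native Java MapReduce job) are
purely illustrative.

# mapper.py - emits a (word, 1) pair for every word read from standard input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word; Hadoop Streaming delivers the
# mapper output to the reducer sorted by key
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The two scripts are submitted with the hadoop-streaming JAR, which passes them as the
-mapper and -reducer options and handles the shuffle and sort between the two phases.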
ARCHITECTURE
1.3 HIVE
Hive is a data warehouse system used to analyze structured data. It is built on top of
Hadoop and was originally developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing
in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which
are internally converted into MapReduce jobs.
Using Hive, the traditional approach of writing complex MapReduce programs can be
skipped. Hive supports Data Definition Language (DDL), Data Manipulation Language
(DML), and User Defined Functions (UDF).
FEATURES
1.4 SCALA
Scala is a general-purpose, high-level, multi-paradigm programming language. It is a pure
object-oriented programming language that also supports the functional programming
approach. Scala programs are compiled to bytecode and run on the JVM (Java Virtual
Machine). Scala stands for Scalable Language. It also provides JavaScript runtimes. Scala
is highly influenced by Java and other programming languages such as Lisp, Haskell, and
Pizza.
1.5 SPARK
Apache Spark is a powerful, open-source unified analytics engine designed for
processing and analyzing large datasets. It provides high-speed computation and supports
a wide range of big data operations, making it one of the most popular frameworks in the
big data ecosystem.
Core Components of Apache Spark
1. Spark Core: The underlying execution engine, providing distributed task scheduling,
memory management, and fault recovery.
2. Spark SQL: Enables querying of structured and semi-structured data using SQL and
DataFrames.
3. Spark Streaming: Processes live data streams as a series of small micro-batches.
4. MLlib: A scalable machine learning library with common algorithms and utilities.
5. GraphX: An API for graphs and graph-parallel computation.
6. Spark Structured Streaming: A newer API for real-time data processing with better
fault tolerance and scalability compared to Spark Streaming.
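As a small, self-contained illustration of Spark SQL and DataFrames (the data, view name,
and query here are made up for this sketch and are not part of the lab exercises):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSqlExample").getOrCreate()

# Build a tiny in-memory DataFrame, register it as a temporary view,
# and query it with ordinary SQL through Spark SQL.
df = spark.createDataFrame(
    [("Asia", 100), ("Asia", 50), ("Europe", 80)],
    ["continent", "cases"])
df.createOrReplaceTempView("covid")
spark.sql("SELECT continent, SUM(cases) AS total_cases "
          "FROM covid GROUP BY continent").show()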
ARCHITECTURE
CHAPTER 2
SYSTEM SPECIFICATIONS
• Hadoop 2.8.0: Hadoop 2.8.0 is a version of the Apache Hadoop project, released in March
2017, as part of the Hadoop 2.x series. This version introduced various improvements and
fixes over previous releases, enhancing the stability, performance, and functionality of the
Hadoop ecosystem.
• Hive 2.3.0: Apache Hive 2.3.0, released in July 2017, introduced several improvements,
optimizations, and new features aimed at enhancing performance, usability, and integration
within the Hadoop ecosystem. It continued Hive's role as a key data warehouse system for
querying and managing large datasets stored in distributed systems like HDFS.
• Sqoop 1.4.6: Apache Sqoop 1.4.6, released in August 2015, is a tool designed to transfer
data between Hadoop and structured data stores, such as relational databases and enterprise
data warehouses. Sqoop simplifies the process of importing data from external systems
into Hadoop Distributed File System (HDFS), as well as exporting data from Hadoop to
relational databases.
• Spark 2.x: Apache Spark 2.x is a major version of the open-source distributed computing
system, released to provide faster processing, enhanced performance, and more robust
APIs.
• JDK 1.8: Also known as Java 8, it was released by Oracle in March 2014. It is one of the
most significant updates to the Java programming language, introducing a wide range of
features that enhance the language's expressiveness, performance, and ease of use.
CHAPTER 3
IMPLEMENTATION
Downloading Hadoop (note: the link below was updated to a newer version of Hadoop on
6th May 2022)
===============================
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xzf hadoop-3.2.3.tar.gz
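The first configuration file (an edit to ~/.bashrc) is not shown in this copy; the Hadoop
environment variables typically added there for a single-node install look roughly like
the following, where the /home/hdoop path is an assumption based on the hadoop-3.2.3
archive extracted above.
1st File
===============================
nano ~/.bashrc
export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"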
source ~/.bashrc
2nd File
============================
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
3rd File
===============================
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
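The property values for core-site.xml are not reproduced in this copy; a typical
single-node configuration looks like the following, where the temporary-data directory
and the HDFS address/port are assumptions to adjust for the actual setup.
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://127.0.0.1:9000</value>
</property>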
4th File
====================================
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
5th File
================================================
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
6th File
==================================================
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Launching Hadoop
==================================
hdfs namenode -format
./start-dfs.sh
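On a typical single-node setup the YARN daemons are started next, and jps is used to
confirm that the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
processes are running:
./start-yarn.sh
jps
The steps below move on to Hive; they assume an Apache Hive release has already been
downloaded and extracted, with HIVE_HOME pointing at it.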
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc
Edit hive-config.sh file
====================================
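The edit itself is missing from this copy; the usual change is to tell Hive where Hadoop
lives by appending an export to $HIVE_HOME/bin/hive-config.sh, where the path below is an
assumption matching the installation above.
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/hdoop/hadoop-3.2.3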
Fix guava version mismatch
=================
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
hive
OUTPUT
[5, 3, 4, 2]
3.3.2 Find the total number of cases in each continent.
from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Data Analysis") \
    .getOrCreate()
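Only the SparkSession step survives above; a hedged sketch of the remaining steps that
would produce the table below, where the CSV path and the continent/total_cases column
names are assumptions:

from pyspark.sql.functions import sum as spark_sum

# Step 2: Load the dataset (path assumed)
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)

# Step 3: Group by continent and sum the cases (column names assumed)
total_cases = data.groupBy("continent") \
    .agg(spark_sum("total_cases").alias("Total_Cases"))

# Step 4: Show the result
total_cases.show()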
OUTPUT
+-----------+-----------+
| Continent |Total_Cases|
+-----------+-----------+
|       Asia|  123456789|
|     Europe|  987654321|
|     Africa|  543210123|
|        ...|        ...|
+-----------+-----------+
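The heading and code for the next exercise (by its output, the total number of deaths per
location) are missing from this copy; a hedged PySpark sketch that would produce a table
of this shape, with the path and column names assumed, is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)  # path assumed

# Sum the deaths per location and list the largest first (column name assumed)
total_deaths = data.groupBy("location") \
    .agg(spark_sum("new_deaths").alias("Total_Deaths")) \
    .orderBy("Total_Deaths", ascending=False)
total_deaths.show(3)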
OUTPUT
+-----------+------------+
| location|Total_Deaths|
+-----------+------------+
| USA| 3000|
| India| 1800|
| Brazil| 1500|
+-----------+------------+
3.3.4 Compute the maximum deaths at specific locations like ‘Europe’ and ‘Asia’
from pyspark.sql import SparkSession
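Only the import survives above; a hedged sketch of the remaining steps, where the dataset
path and the location/total_deaths column names are assumptions:

from pyspark.sql.functions import max as spark_max

spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)  # path assumed

# Keep only the locations of interest, then take the maximum deaths per location
max_deaths = data.filter(data.location.isin("Europe", "Asia")) \
    .groupBy("location") \
    .agg(spark_max("total_deaths").alias("Max_Deaths"))
max_deaths.show()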
OUTPUT
+--------+----------+
|location|Max_Deaths|
+--------+----------+
| Europe| 2000|
| Asia| 1500|
+--------+----------+
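The heading and code for the following exercise (by its output, the total vaccinations
per continent) are also missing; a hedged sketch with the path and column names assumed:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)  # path assumed

# Total vaccinations per continent (column names assumed)
total_vaccinated = data.groupBy("continent") \
    .agg(spark_sum("new_vaccinations").alias("Total_Vaccinated"))
total_vaccinated.show()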
OUTPUT
+---------+----------------+
|continent|Total_Vaccinated|
+---------+----------------+
| Asia| 1500000 |
| Europe| 3000000 |
| Africa| 750000 |
+---------+----------------+
3.3.6 Find the count of countrywise vaccination for the month “January 2021”
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, month, year
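Only the imports survive above; a hedged sketch of the remaining steps, where the dataset
path, date format, and column names are assumptions:

from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)  # path assumed

# Parse the date column and keep only rows from January 2021
data = data.withColumn("date_current", to_date(col("date_current"), "yyyy-MM-dd"))
jan_2021 = data.filter((year(col("date_current")) == 2021) &
                       (month(col("date_current")) == 1))

# Total vaccinations per country for that month (column names assumed)
vaccination_by_country = jan_2021.groupBy("country") \
    .agg(spark_sum("new_vaccinations").alias("Total_Vaccinated"))
vaccination_by_country.show()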
OUTPUT
+---------+----------------+
| country|Total_Vaccinated|
+---------+----------------+
| USA| 125000 |
| India| 100000 |
| Brazil| 25000 |
+---------+----------------+
3.3.7 What is the average number of total cases across all locations?
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Steps 1-3 (reconstructed): initialize the SparkSession and load the dataset
# (the CSV path is assumed)
spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)

# Step 4: Calculate the average number of total cases
# Assuming the dataset has a column 'total_cases'
average_cases = data.select(avg("total_cases").alias("Average_Total_Cases"))

# Step 5: Show the result
average_cases.show()
OUTPUT
+-------------------+
|Average_Total_Cases|
+-------------------+
| 625000.0|
+-------------------+
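The heading and the first steps of the next exercise (finding the continent with the
highest number of vaccinations) are missing here; a hedged sketch of those steps, with
the path and column names assumed, builds the continent_vaccination DataFrame used below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("Covid Data Analysis").getOrCreate()
data = spark.read.csv("covid_data.csv", header=True, inferSchema=True)  # path assumed

# Steps 1-4: total vaccinations per continent (column names assumed)
continent_vaccination = data.groupBy("continent") \
    .agg(spark_sum("new_vaccinations").alias("Total_Vaccinated"))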
# Step 5: Order by the total vaccinated in descending order and take the first row
highest_vaccination = continent_vaccination.orderBy("Total_Vaccinated",
    ascending=False).first()

if highest_vaccination:
    print(f"Continent with the highest vaccinations: {highest_vaccination['continent']} "
          f"with {highest_vaccination['Total_Vaccinated']} vaccinations.")
else:
    print("No data available.")
OUTPUT
Continent with the highest vaccinations: America with 7000000 vaccinations.
3.3.9 Extract the year, month and day from the date_current column and create
separate columns for each.
.withColumn("month", month("date_current")) \
.withColumn("day", dayofmonth("date_current"))
OUTPUT
+------------+----+-----+---+
|date_current|year|month|day|
+------------+----+-----+---+
| 2021-01-15 |2021| 1| 15|
| 2020-05-22 |2020| 5| 22|
| 2022-07-30 |2022| 7| 30|
+------------+----+-----+---+
CHAPTER 4
CONCLUSION
Big Data has revolutionized the way we analyze, process, and make decisions based on
vast amounts of information. It refers to datasets that are so large and complex that
traditional data processing methods are insufficient. The advent of technologies like
Hadoop, Spark, NoSQL databases, and cloud computing has enabled businesses and
organizations to manage, store, and analyze data at unprecedented scales.
Key Takeaways:
1. Volume, Variety, Velocity: Big Data is characterized by the three V's: high volume (large
datasets), variety (diverse data types such as structured, semi-structured, and unstructured
data), and velocity (the speed at which data is generated and needs to be processed).
2. Data Processing Frameworks: Technologies like Hadoop and Apache Spark provide
distributed processing capabilities, allowing organizations to break down complex tasks
into smaller, manageable pieces across multiple machines. These tools enable the
processing of vast datasets quickly and efficiently.
3. Analytics and Insights: By leveraging Big Data technologies, businesses can gain
valuable insights into customer behavior, market trends, and operational efficiency.
Advanced analytics, machine learning, and artificial intelligence techniques allow for the
extraction of actionable insights that can drive innovation, improve decision-making, and
optimize business processes.
4. Scalability: The scalability of Big Data systems means that they can grow to accommodate
increasingly large datasets without sacrificing performance. This is crucial for
organizations that need to handle rapidly expanding data from IoT devices, social media,
sensors, and more.
5. Real-World Applications: From healthcare and finance to marketing and e-commerce,
Big Data applications are widespread. It plays a critical role in fraud detection, predictive
maintenance, personalized recommendations, supply chain optimization, and much more.
6. Challenges: While Big Data brings numerous benefits, it also introduces challenges such
as data security, privacy concerns, data quality, and the need for specialized skills. Ensuring
the ethical and responsible use of Big Data is essential to avoid risks and build trust with
stakeholders.
7. Future of Big Data: As the amount of data continues to grow, technologies like artificial
intelligence, machine learning, and deep learning will become even more integral to
extracting meaningful insights. The future of Big Data will likely involve more advanced
predictive models, automation, and real-time analytics that can continuously adapt to new
data.
CHAPTER 5
CERTIFICATE