
NM1059 – BIG DATA BY INFOSYS

A PROJECT REPORT

Submitted by

M. Rianee rayen
(950921104030)

DEPARTMENT

OF

COMPUTER SCIENCE AND ENGINEERING

HOLY CROSS ENGINEERING COLLEGE


THOOTHUKUDI-628851

ANNA UNIVERSITY, CHENNAI 600 025


DECEMBER 2024

ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this Naan Mudhalvan project report “BIG DATA” is the
bonafide work of “M. Rianee rayen (950921104030)” who carried out the
project at “Infosys”.

STAFF INCHARGE HEAD OF THE DEPARTMENT

INTERNAL EXAMINER EXTERNAL EXAMINER


TABLE OF CONTENTS

CHAPTER NO    TITLE    PAGE NO



1 INTRODUCTION 1

1.1 BIG DATA 1

1.2 HADOOP 2

1.3 HIVE 3

1.4 SCALA PROGRAMMING 4

1.5 SPARK 5

2 SYSTEM SPECIFICATIONS 7

2.1 SOFTWARE REQUIREMENTS 7

2.2 HARDWARE REQUIREMENTS 8

3 IMPLEMENTATION 9

3.1 PROBLEM STATEMENT 9

3.2 INSTRUCTIONS FOR PROBLEM SOLVING 9
3.3 HADOOP INSTALLATION 10

3.4 HIVE INSTALLATION 16

3.5 SPARK INSTALLATION AND TASKS 18

4 CONCLUSION 31

5 CERTIFICATE 33

CHAPTER 1

INTRODUCTION
1.1 BIG DATA

Big data technology refers to software utilities designed to analyze, process, and extract information from extremely large and structurally complex data sets that traditional data processing software cannot handle effectively.
Big data technologies are closely associated with other rapidly growing technologies such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT). In combination with these technologies, big data platforms focus on analyzing and handling large amounts of both real-time and batch data.
Big Data is typically managed and analyzed using advanced tools and frameworks such as:
• Hadoop and Spark for distributed data storage and processing.
• NoSQL databases like MongoDB and Cassandra for flexible data handling.
• Machine learning and AI models to extract meaningful patterns and predictions.
Key Characteristics of Big Data (The 5 Vs):
1. Volume: The sheer amount of data, ranging from terabytes to petabytes and beyond.
2. Velocity: The speed at which data is generated, collected, and processed, often in real time.
3. Variety: The diverse formats and types of data, including text, images, audio, video, and
log files.
4. Veracity: The uncertainty and reliability of data, highlighting the need for accurate and
trustworthy sources.
5. Value: The insights and business advantages derived from analyzing big data.
1.2 HADOOP
Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written mainly in Java, with some native code in C and shell scripts. It is designed to handle big data and is based on
the MapReduce programming model, which allows for the parallel processing of large
datasets.
• HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the resource management component
of Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
• Hadoop also includes several modules that provide additional functionality, such
as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
• Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and
data mining. It enables the distributed processing of large data sets across clusters of
computers using a simple programming model.
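
To make the MapReduce model concrete, below is a minimal word-count sketch written in the mapper/reducer style used by Hadoop Streaming. It is an illustrative assumption rather than part of the lab tasks: the script name wordcount.py and the invocation are hypothetical, and the lab itself uses Hive and Spark instead of hand-written MapReduce jobs.

#!/usr/bin/env python3
# wordcount.py - illustrative Hadoop Streaming style mapper/reducer (hypothetical example).
import sys

def mapper(lines):
    # Map phase: emit one tab-separated "word<TAB>1" pair per word.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce phase: sum counts per word; Hadoop delivers the input sorted by key.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # "python3 wordcount.py map" runs the map phase, any other argument runs the reduce phase;
    # Hadoop Streaming would pipe HDFS input splits through these scripts via stdin/stdout.
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper(sys.stdin) if role == "map" else reducer(sys.stdin)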

ARCHITECTURE

1.3 HIVE

Hive is a data warehouse system which is used to analyze structured data. It is built on
the top of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing
in distributed storage. It runs SQL like queries called HQL (Hive query language) which
gets internally converted to MapReduce jobs.

Using Hive, the traditional approach of writing complex MapReduce programs can be
skipped. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML),
and User Defined Functions (UDF).

FEATURES

• Hive is fast and scalable.


• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or
Spark jobs.

• It is capable of analyzing large datasets stored in HDFS.


• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), allowing users to plug in their own functionality.
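
As a brief, hedged illustration of what an HQL query looks like, the sketch below runs a simple aggregation through a Hive-enabled SparkSession; the table name covid_data and its columns are assumptions made for illustration only and are not created anywhere in this report.

from pyspark.sql import SparkSession

# Hive-enabled SparkSession: HQL statements submitted via spark.sql() use the Hive metastore.
spark = (SparkSession.builder
         .appName("hql_example")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table and columns, shown only to illustrate HQL syntax.
spark.sql("""
    SELECT continent, SUM(total_cases) AS total_cases
    FROM covid_data
    GROUP BY continent
""").show()

spark.stop()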
ARCHITECTURE

1.4 SCALA PROGRAMMING
Scala is a general-purpose, high-level, multi-paradigm programming language. It is a pure
object-oriented programming language that also supports the functional
programming approach. Scala programs compile to bytecode and run on the
JVM (Java Virtual Machine). Scala stands for Scalable Language. It also targets
JavaScript runtimes. Scala is strongly influenced by Java and other programming
languages such as Lisp, Haskell, and Pizza.

Scala's Role in Big Data Frameworks


1. Apache Spark:
   o Spark is implemented in Scala, and its APIs for data manipulation and analytics are natively supported in Scala.
   o Features like RDDs (Resilient Distributed Datasets), DataFrames, and Datasets are optimized for Scala.
   o Scala's functional constructs, like map, reduce, and filter, align well with Spark's transformations and actions.
2. Kafka Streams:
   o Kafka, a distributed event-streaming platform, provides Scala APIs for building robust stream-processing applications.
3. Akka:
   o Akka is a toolkit for building concurrent, distributed systems in Scala, often used in real-time big data applications.
4. Big Data Pipelines:
   o Scala is often used in data engineering workflows for ETL (Extract, Transform, Load) operations in distributed systems.

1.5 SPARK
Apache Spark is a powerful, open-source unified analytics engine designed for
processing and analyzing large datasets. It provides high-speed computation and supports
a wide range of big data operations, making it one of the most popular frameworks in the
big data ecosystem.
Core Components of Apache Spark

1. Spark Core:
   o The foundational engine responsible for scheduling, memory management, fault recovery, and interacting with storage systems.
   o Implements Resilient Distributed Datasets (RDDs), the fundamental abstraction for distributed data.
2. Spark SQL:
   o Enables querying of structured and semi-structured data using SQL and DataFrames.
   o Supports integration with Hive for advanced data warehousing.
3. Spark Streaming:
   o Provides real-time stream processing capabilities for data from sources like Kafka, Flume, or socket streams.
4. MLlib (Machine Learning Library):
   o Offers distributed algorithms for classification, regression, clustering, and recommendation.
5. GraphX:
   o A library for graph processing and analytics, such as page ranking and community detection.
6. Spark Structured Streaming:
   o A newer API for real-time data processing with better fault tolerance and scalability compared to Spark Streaming.
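
The minimal, self-contained sketch below illustrates two of the components described above (Spark Core's RDD transformations and actions, and Spark SQL's DataFrame/SQL interface) using small in-memory values chosen only for demonstration.

from pyspark.sql import SparkSession

# Local SparkSession for a quick demonstration of Spark Core and Spark SQL.
spark = (SparkSession.builder
         .appName("spark_components_sketch")
         .master("local[*]")
         .getOrCreate())

# Spark Core: an RDD with a functional transformation (map) and an action (reduce).
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)  # 55

# Spark SQL: the same values as a DataFrame, queried declaratively.
df = spark.createDataFrame([(x,) for x in range(1, 6)], ["value"])
df.createOrReplaceTempView("numbers")
spark.sql("SELECT SUM(value * value) AS sum_of_squares FROM numbers").show()

spark.stop()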

ARCHITECTURE

CHAPTER 2

SYSTEM SPECIFICATIONS

2.1 SOFTWARE REQUIREMENTS


• MySQL
• Hadoop 2.8.0
• Hive 2.3.0
• Sqoop 1.4.6
• Spark 2.x
• JDK 1.8 or above
• Eclipse IDE
SOFTWARE DESCRIPTION
• MySQL: MySQL is an open-source relational database management system (RDBMS)
developed by MySQL AB and now owned by Oracle Corporation. It follows the
client-server model and is widely used in web development, business applications, and data
warehousing.

• Hadoop 2.8.0: Hadoop 2.8.0 is a version of the Apache Hadoop project, released in March
2017, as part of the Hadoop 2.x series. This version introduced various improvements and
fixes over previous releases, enhancing the stability, performance, and functionality of the
Hadoop ecosystem.

• Hive 2.3.0: Apache Hive 2.3.0, released in July 2017, introduced several improvements,
optimizations, and new features aimed at enhancing performance, usability, and integration
within the Hadoop ecosystem. It continued Hive's role as a key data warehouse system for
querying and managing large datasets stored in distributed systems like HDFS.

• Sqoop 1.4.6: Apache Sqoop 1.4.6, released in August 2015, is a tool designed to transfer
data between Hadoop and structured data stores, such as relational databases and enterprise
data warehouses. Sqoop simplifies the process of importing data from external systems
into Hadoop Distributed File System (HDFS), as well as exporting data from Hadoop to
relational databases.

• Spark 2.x: Apache Spark 2.x is a major version of the open-source distributed computing
system, released to provide faster processing, enhanced performance, and more robust
APIs.

• JDK 1.8: Also known as Java 8, it was released by Oracle in March 2014. It is one of the
most significant updates to the Java programming language, introducing a wide range of
features that enhance the language’s expressiveness, performance, and ease of use.

• Eclipse IDE: A highly popular, open-source integrated development environment (IDE),
primarily used for Java development but also supporting various other programming
languages through plugins.

2.2 HARDWARE REQUIREMENTS

• Intel Core i5 or i7 processor, or AMD Ryzen 5


• 16 GB of RAM, 500 GB of storage

CHAPTER 3
IMPLEMENTATION

3.1 PROBLEM STATEMENT


The Covid pandemic badly impacted lives across the globe in the year 2020. Assessing
the available data related to patients, treatments, post-Covid prognosis, recovery rate,
and other such details will help hospitals and health organizations evaluate which care
approaches are most effective. It can also help in understanding the effect of medication
on patients with a history of other illnesses such as cardiac problems, diabetes, and cancer.
All data related to the Covid pandemic has been continuously monitored and analyzed
to find the intensity of its spread. A sample of such data has been captured from
different locations on a daily basis. You are required to derive useful insights by
processing this data using the Big Data platform Hadoop and its ecosystem components.
Listed below are a few reports expected from the analysis:
• What is the number of people who are infected globally?
• How many cases are reported in a continent?
• Which country has recorded the maximum number of deaths due to Covid?
• How many people are vaccinated so far?

3.2 INSTRUCTIONS FOR PROBLEM SOLVING

Data source: CovidGlobalData.csv
The file contains details about Covid infections worldwide. The file structure is given below.

iso_code                 String
continent                String
Location                 String
Date_current             String
Total_cases              Double
Total_deaths             Double
Total_vaccinations       Double
People_vaccinated        Double
Median_age               Double
Age_65_older             Double
Age_70_older             Double
Cardiovasc_death_rate    Double
Diabetes_prevalence      Double
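
The file structure above can be declared as an explicit schema when loading the CSV in PySpark, which avoids relying on inferSchema. This is a hedged sketch: the column names follow the listing above but should be adjusted to match the actual header row of CovidGlobalData.csv, and the file path is the same placeholder used in the tasks that follow.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("covid_schema_sketch").getOrCreate()

# Explicit schema mirroring the file structure listed above.
covid_schema = StructType([
    StructField("iso_code", StringType(), True),
    StructField("continent", StringType(), True),
    StructField("Location", StringType(), True),
    StructField("Date_current", StringType(), True),
    StructField("Total_cases", DoubleType(), True),
    StructField("Total_deaths", DoubleType(), True),
    StructField("Total_vaccinations", DoubleType(), True),
    StructField("People_vaccinated", DoubleType(), True),
    StructField("Median_age", DoubleType(), True),
    StructField("Age_65_older", DoubleType(), True),
    StructField("Age_70_older", DoubleType(), True),
    StructField("Cardiovasc_death_rate", DoubleType(), True),
    StructField("Diabetes_prevalence", DoubleType(), True),
])

# "/path/to/CovidGlobalData.csv" is a placeholder path, as in the tasks below.
data = spark.read.csv("/path/to/CovidGlobalData.csv", header=True, schema=covid_schema)
data.printSchema()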
3.3 HADOOP INSTALLATION
Prerequisite Test
=============================
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version; javac -version
sudo apt install openssh-server openssh-client -y
sudo adduser hdoop
su - hdoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost

Downloading Hadoop (the link below points to Hadoop 3.2.3)
===============================
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xzf hadoop-3.2.3.tar.gz

Editing 6 important files


=================================
1st file
===========================
sudo nano .bashrc
(If an error says hdoop is not in the sudoers file, switch back to an admin user, run
"sudo adduser hdoop sudo", then return with "su - hdoop".)

sudo nano .bashrc


#Add below lines in this file

#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

source ~/.bashrc

2nd File
============================
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

#Add below line in this file in the end

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

3rd File
===============================
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

#Add below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>

4th File
====================================
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

#Add below lines in this file (between "<configuration>" and "</configuration>")

<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

5th File
================================================
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

#Add below lines in this file (between "<configuration>" and "</configuration>")



<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

6th File
==================================================
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

#Add below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>

<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

Launching Hadoop
==================================
hdfs namenode -format

cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh

3.4 HIVE INSTALLATION


Steps for Hive installation
• Download and unzip Hive
• Edit the .bashrc file
• Edit the hive-config.sh file
• Create Hive directories in HDFS
• Initiate the Derby database
• Configure the hive-site.xml file

Download and unzip Hive
=============================
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xzf apache-hive-3.1.2-bin.tar.gz

Edit .bashrc file
========================
sudo nano .bashrc

export HIVE_HOME=/home/hdoop/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

source ~/.bashrc
Edit hive-config.sh file
====================================
sudo nano $HIVE_HOME/bin/hive-config.sh

export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
Create Hive directories in HDFS
===================================
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

Fixing guava problem – Additional step
=================
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

Initialize Derby and hive


============================
schematool -initSchema -dbType derby

hive

Optional step – Edit hive-site.xml
===========
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml

sudo nano hive-site.xml – change the metastore location to the HDFS path created above (/user/hive/warehouse)

3.5 SPARK INSTALLATION AND TASKS


PySpark code to read a CSV file, select a particular column, and store the values in an RDD so that operations such as maximum, sum, and average can be performed on the stored values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app_name").getOrCreate()
df = spark.read.csv("/home/hdoop/Hive_Database/Hive_datasets/sales.csv", header=True, inferSchema=True)
df.show()

3.5.1 DATA PREPROCESSING

Create an RDD and print it (a specific column is selected and stored in an RDD so that operations can be performed on its values):

ratings_list = df.select("Column_name").rdd.flatMap(lambda x: x).collect()
column_name_rdd = spark.sparkContext.parallelize(ratings_list)
print(column_name_rdd.collect())

OUTPUT
[5, 3, 4, 2]
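
The values collected in the RDD above can then be aggregated directly with RDD actions. A minimal sketch, assuming column_name_rdd holds the values [5, 3, 4, 2] shown in the output:

print(column_name_rdd.max())    # Maximum value: 5
print(column_name_rdd.sum())    # Sum: 14
print(column_name_rdd.mean())   # Average: 3.5
print(column_name_rdd.count())  # Number of elements: 4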
3.5.2 Find the total number of cases in each continent.

from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Data Analysis") \
    .getOrCreate()

# Step 2: Load the CSV file
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Group by continent and calculate the total cases
# Assuming the file has columns: 'Continent' and 'Cases'
continent_cases = data.groupBy("Continent") \
    .sum("Cases") \
    .withColumnRenamed("sum(Cases)", "Total_Cases")

# Step 5: Show the result
continent_cases.show()

# Step 6: Save the result (optional)
output_path = "/path/to/output"
continent_cases.write.csv(output_path, header=True)

# Stop the Spark session
spark.stop()

OUTPUT
+---------+-----------+
|Continent|Total_Cases|
+---------+-----------+
|     Asia|  123456789|
|   Europe|  987654321|
|   Africa|  543210123|
|      ...|        ...|
+---------+-----------+

3.5.3 Find the total number of deaths in each location.

from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Deaths Analysis") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the path to your dataset
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Calculate total deaths per location
# Assuming the dataset has columns: 'location' and 'deaths'
total_deaths = data.groupBy("location") \
    .sum("deaths") \
    .withColumnRenamed("sum(deaths)", "Total_Deaths")

# Step 5: Show the result
total_deaths.show()

# Step 6: Save the result to a file (optional)
output_path = "/path/to/output"
total_deaths.write.csv(output_path, header=True)

# Stop the Spark session
spark.stop()

OUTPUT
+-----------+------------+
| location|Total_Deaths|
+-----------+------------+
| USA| 3000|
| India| 1800|
| Brazil| 1500|
+-----------+------------+

3.5.4 Compute the maximum deaths at specific locations like ‘Europe’ and ‘Asia’.

from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Deaths Analysis - Maximum by Location") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Filter for specific locations (e.g., 'Europe' and 'Asia')
specific_locations = ["Europe", "Asia"]
filtered_data = data.filter(data["location"].isin(specific_locations))

# Step 5: Compute the maximum deaths for each location
max_deaths = filtered_data.groupBy("location") \
    .max("deaths") \
    .withColumnRenamed("max(deaths)", "Max_Deaths")

# Step 6: Show the result
max_deaths.show()

# Step 7: Save the result to a file (optional)
output_path = "/path/to/output"
max_deaths.write.csv(output_path, header=True)

# Stop the Spark session
spark.stop()

OUTPUT
+--------+----------+
|location|Max_Deaths|
+--------+----------+
| Europe| 2000|
| Asia| 1500|
+--------+----------+

3.5.5 Find the total number of people vaccinated in each continent.

from pyspark.sql import SparkSession

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Vaccination Analysis by Continent") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Group by continent and calculate the total vaccinated
# Assuming the dataset has columns: 'continent' and 'vaccinated'
total_vaccinated = data.groupBy("continent") \
    .sum("vaccinated") \
    .withColumnRenamed("sum(vaccinated)", "Total_Vaccinated")

# Step 5: Show the result
total_vaccinated.show()

# Step 6: Save the result to a file (optional)
output_path = "/path/to/output"
total_vaccinated.write.csv(output_path, header=True)

# Stop the Spark session
spark.stop()

OUTPUT
+---------+----------------+
|continent|Total_Vaccinated|
+---------+----------------+
|     Asia|         1500000|
|   Europe|         3000000|
|   Africa|          750000|
+---------+----------------+

3.5.6 Find the count of country-wise vaccination for the month “January 2021”.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, month, year

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Vaccination Analysis for January 2021") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Filter data for January 2021
# Assuming the dataset has columns: 'country', 'date', and 'vaccinated'
data = data.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
filtered_data = data.filter((month(col("date")) == 1) & (year(col("date")) == 2021))

# Step 5: Group by country and calculate the total vaccinated
countrywise_vaccination = filtered_data.groupBy("country") \
    .sum("vaccinated") \
    .withColumnRenamed("sum(vaccinated)", "Total_Vaccinated")

# Step 6: Show the result
countrywise_vaccination.show()

# Step 7: Save the result to a file (optional)
output_path = "/path/to/output"
countrywise_vaccination.write.csv(output_path, header=True)

# Stop the Spark session
spark.stop()

OUTPUT
+---------+----------------+
| country|Total_Vaccinated|
+---------+----------------+
| USA| 125000 |
| India| 100000 |
| Brazil| 25000 |
+---------+----------------+

3.5.7 What is the average number of total cases across all locations?

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Covid Average Total Cases") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Calculate the average number of total cases
# Assuming the dataset has a column 'total_cases'
average_cases = data.select(avg("total_cases").alias("Average_Total_Cases"))

# Step 5: Show the result
average_cases.show()

# Stop the Spark session
spark.stop()

OUTPUT
+-------------------+
|Average_Total_Cases|
+-------------------+
| 625000.0|
+-------------------+

3.5.8 Which continent has the highest total number of vaccinations?

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Highest Total Vaccinations by Continent") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Group by continent and calculate the total vaccinated
# Assuming the dataset has columns: 'continent' and 'vaccinated'
continent_vaccination = data.groupBy("continent") \
    .sum("vaccinated") \
    .withColumnRenamed("sum(vaccinated)", "Total_Vaccinated")

# Step 5: Order by the total vaccinated in descending order and take the first row
highest_vaccination = continent_vaccination.orderBy("Total_Vaccinated", ascending=False).first()

# Step 6: Show the result
if highest_vaccination:
    print(f"Continent with the highest vaccinations: {highest_vaccination['continent']} with "
          f"{highest_vaccination['Total_Vaccinated']} vaccinations.")
else:
    print("No data available.")

# Stop the Spark session
spark.stop()

OUTPUT
Continent with the highest vaccinations: America with 7000000 vaccinations.

3.5.9 Extract the year, month and day from the date_current column and create separate columns for each.

from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth

# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Extract Year, Month, Day") \
    .getOrCreate()

# Step 2: Load the data
file_path = "/path/to/CovidGlobalData.csv"  # Replace with the actual path
data = spark.read.csv(file_path, header=True, inferSchema=True)

# Step 3: Inspect the schema (optional)
data.printSchema()

# Step 4: Extract year, month, and day from the 'date_current' column
# Assuming the 'date_current' column is of type 'date' or 'string' in the format 'yyyy-MM-dd'
data_with_date_parts = data.withColumn("year", year("date_current")) \
    .withColumn("month", month("date_current")) \
    .withColumn("day", dayofmonth("date_current"))

# Step 5: Show the result
data_with_date_parts.show()

# Stop the Spark session
spark.stop()

OUTPUT
+------------+----+-----+---+
|date_current|year|month|day|
+------------+----+-----+---+
|  2021-01-15|2021|    1| 15|
|  2020-05-22|2020|    5| 22|
|  2022-07-30|2022|    7| 30|
+------------+----+-----+---+

CHAPTER 4
CONCLUSION

Big Data has revolutionized the way we analyze, process, and make decisions based on
vast amounts of information. It refers to datasets that are so large and complex that
traditional data processing methods are insufficient. The advent of technologies like
Hadoop, Spark, NoSQL databases, and cloud computing has enabled businesses and
organizations to manage, store, and analyze data at unprecedented scales.

Key Takeaways:

1. Volume, Variety, Velocity: Big Data is characterized by the three V's: high volume (large
datasets), variety (diverse data types such as structured, semi-structured, and unstructured
data), and velocity (the speed at which data is generated and needs to be processed).
2. Data Processing Frameworks: Technologies like Hadoop and Apache Spark provide
distributed processing capabilities, allowing organizations to break down complex tasks
into smaller, manageable pieces across multiple machines. These tools enable the
processing of vast datasets quickly and efficiently.
3. Analytics and Insights: By leveraging Big Data technologies, businesses can gain
valuable insights into customer behavior, market trends, and operational efficiency.
Advanced analytics, machine learning, and artificial intelligence techniques allow for the
extraction of actionable insights that can drive innovation, improve decision-making, and
optimize business processes.
4. Scalability: The scalability of Big Data systems means that they can grow to accommodate
increasingly large datasets without sacrificing performance. This is crucial for
organizations that need to handle rapidly expanding data from IoT devices, social media,
sensors, and more.
5. Real-World Applications: From healthcare and finance to marketing and e-commerce,
Big Data applications are widespread. It plays a critical role in fraud detection, predictive
maintenance, personalized recommendations, supply chain optimization, and much more.

6. Challenges: While Big Data brings numerous benefits, it also introduces challenges such
as data security, privacy concerns, data quality, and the need for specialized skills. Ensuring
the ethical and responsible use of Big Data is essential to avoid risks and build trust with
stakeholders.
7. Future of Big Data: As the amount of data continues to grow, technologies like artificial
intelligence, machine learning, and deep learning will become even more integral to
extracting meaningful insights. The future of Big Data will likely involve more advanced
predictive models, automation, and real-time analytics that can continuously adapt to new
data.

CHAPTER 5

CERTIFICATE

5.1 Big Data 101



5.2 Big Data 201



5.3 Big Data 301



5.4 Scala Programming
