

LAB REPORT

Course Title: Big Data & IoT Lab


Course Code: CSE413

Submitted to:
Mr. Md. Mafiul Hasan Matin

Lecturer

Department of CSE

Daffodil International University

Submitted by:
Name: Avishek Das
Id: 201-15-3452
Section: 55-N
Department of CSE
Daffodil International University

Submission Date: 07/09/2023



EXPERIMENT NO: 02
EXPERIMENT NAME: UNDERSTANDING THE PYSPARK DATAFRAME.
OBJECTIVES:

PySpark DataFrames are a fundamental component of Apache Spark, a powerful open-source framework for distributed data processing. PySpark DataFrames provide a structured and efficient way to work with large datasets, making them an essential tool for data engineers, data scientists, and analysts in the big data landscape.
LEARNING OBJECTIVES:

• Understand PySpark DataFrames: Explain what PySpark DataFrames are and why they are
important for big data processing.
• Perform DataFrame Operations: Apply various DataFrame operations to manipulate and analyze
data efficiently.
• Conduct Data Exploration: Use PySpark DataFrames to explore data, calculate statistics, and create
simple visualizations.
• Work with External Data: Integrate PySpark DataFrames with external data sources to load,
transform, and save data.

CODE:

STEP 1: SETUP PYSPARK


EXPLAIN: The code for this step installs the headless build of OpenJDK 8 (the Java Development Kit), downloads Apache Spark 3.4.1 for Hadoop 3, and extracts the Spark distribution package into the current directory. This setup is a typical first step when configuring Apache Spark for development or data processing work. Then, I installed findspark. The environment variables 'JAVA_HOME' and 'SPARK_HOME' are assigned to specific directories: 'JAVA_HOME' points to the directory where OpenJDK 8 is installed, and 'SPARK_HOME' points to the location where Apache Spark 3.4.1 for Hadoop 3 was extracted. These environment variables tell Spark applications, whether written in Python or other languages, where to find the Java and Spark installations. Using the 'findspark' module, the code then initializes Apache Spark in the Python environment and starts a Spark session ('SparkSession') with a local master configuration ("local[*]") that uses every available CPU core. This session provides Spark's distributed data processing features for the rest of the lab.
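Since the original code screenshots are not reproduced in this report, the following is a minimal sketch of what the setup cell may have looked like in Colab; the exact package names, mirror URL, and installation paths are assumptions:

!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null                              # install headless OpenJDK 8
!wget -q https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz  # download Spark 3.4.1 for Hadoop 3 (mirror URL assumed)
!tar xf spark-3.4.1-bin-hadoop3.tgz                                                     # extract into the current directory
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # assumed OpenJDK 8 install path on Colab
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"   # extracted Spark distribution

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Lab02").getOrCreate()   # local master using all available CPU cores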

STEP 2: UPLOAD AND READ THE DATASET:

EXPLAIN: This code lets you upload a file through Google Colab's "files" module and then imports the NumPy and Pandas libraries. Finally, it reads the uploaded CSV file, "my.csv", into a Pandas DataFrame to enable data manipulation and analysis within the Colab environment.
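A sketch of the upload-and-read cell described above (the dataset name "my.csv" comes from the report; the rest is an assumed reconstruction):

from google.colab import files
import numpy as np
import pandas as pd

uploaded = files.upload()     # opens a file picker so the CSV can be uploaded to Colab
df = pd.read_csv("my.csv")    # read the uploaded file into a Pandas DataFrame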

EXPLAIN: Here, the schema of the DataFrame "df" is displayed, showing the structure and data types of its columns in tabular form.
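The schema check itself is not shown; on the Pandas DataFrame it could be something like:

df.info()     # column names, non-null counts, and data types of "df" (df.dtypes lists only the column types)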

STEP 3: CHECK THE SPARK VERSION:
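No explanation accompanies this step in the report; checking the version from the active session is typically a one-liner such as the following sketch:

spark.version     # should report '3.4.1' given the setup above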


STEP 4: DATA SHOW, TYPE, SELECTING SPECIFIC:

EXPLAIN: Here, I read the "my.csv" file using Apache Spark's "spark.read.csv()" method and produce a DataFrame called "df_pyspark". The "show()" function is then used to display the data in "df_pyspark", giving a tabular representation of the data in a Spark context.
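A plausible reconstruction of this read-and-show step:

df_pyspark = spark.read.csv("my.csv")   # read the CSV into a Spark DataFrame (no header handling yet)
df_pyspark.show()                       # print the rows in tabular form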

EXPLAIN: These lines of code retrieve the first two rows and the final four rows of data from the 'df_pyspark' DataFrame in Apache Spark, providing a quick look at the beginning and end of the dataset.
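One way these row peeks may have been written (the exact calls are an assumption based on the description):

df_pyspark.head(2)   # first two rows, returned as a list of Row objects
df_pyspark.tail(4)   # last four rows, returned as a list of Row objects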

STEP 5: SHOW READ OPERATION:

EXPLAIN: With the option 'header' set to 'true', this code instructs Apache Spark to treat the first row of the CSV file 'my.csv' as column headers. The data is then shown in tabular form with the DataFrame's column names visible.
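The read with the header option could have been written with .option() or a keyword argument; one possible form:

df_pyspark = spark.read.option("header", "true").csv("my.csv")   # treat the first row as column names
df_pyspark.show()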
STEP 6: SHOW DATAFRAME:

EXPLAIN: Using Apache Spark, this code reads the CSV file "my.csv", interpreting the first row as the column headers and inferring each column's data type via "inferSchema=True". The resulting DataFrame is then assigned to the variable 'df', providing a structured representation of the data with an inferred data type for each column.
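A sketch of this step, using the options named in the explanation:

df = spark.read.csv("my.csv", header=True, inferSchema=True)   # infer a data type for each column
df.show()                                                      # display the resulting DataFrame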

STEP 7: SELECT COLUMN:

EXPLAIN: The 'Age' column from the DataFrame 'df' is selected and shown using Apache Spark, presenting the information contained in that particular column in a tabular fashion.
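The selection described above, as a sketch:

df.select("Age").show()   # display only the 'Age' column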

STEP 8: COLUMN TYPE & SHOW:

EXPLAIN: The first line, "df.dtypes", returns the data type of each column in the DataFrame "df". The second line, "df.describe().show()", computes summary statistics (count, mean, standard deviation, min, max) for each numerical column in "df" and displays them in a table. This makes it easier to understand how the data in those columns is distributed.
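Both calls, exactly as named in the explanation:

df.dtypes             # list of (column name, data type) pairs
df.describe().show()  # count, mean, stddev, min and max for numerical columns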
STEP 9: ADDING COLUMN:

EXPLAIN: Here, I create a new column in the DataFrame df called "Experience after 5 years" by multiplying the values in the "Experience" column by 5. The original columns and the newly added column are then shown in the modified DataFrame.
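A sketch of the column addition; the report describes multiplying 'Experience' by 5 (adding 5 may equally have been intended given the column name), so the arithmetic below follows the report's description:

df = df.withColumn("Experience after 5 years", df["Experience"] * 5)   # derived column, per the description above
df.show()                                                              # show original and new columns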

STEP 10: DELETE COLUMN:

EXPLAIN: In this case, I delete the "Experience after 5 years" column from the DataFrame "df" and display the DataFrame without it.
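A sketch of the drop described above:

df = df.drop("Experience after 5 years")   # remove the derived column
df.show()                                  # DataFrame shown without it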

CONCLUSION:
In this lab, I gained a comprehensive understanding of PySpark DataFrames, a crucial data structure for working with structured data in big data processing. I learned how to perform various operations on DataFrames, explore data, and integrate with external data sources. These skills are essential for data engineers, data scientists, and analysts working in a big data environment.
