

LAB REPORT

Course Title: Big Data & IoT Lab


Course Code: CSE413

Submitted to:
Mr. Md. Mafiul Hasan Matin

Lecturer

Department of CSE

Daffodil International University

Submitted by:
Name: Avishek Das
Id: 201-15-3452
Section: 55-N
Department of CSE
Daffodil International University

Submission Date: 07/09/2023



EXPERIMENT NO: 02
EXPERIMENT NAME: UNDERSTANDING THE PYSPARK DATAFRAME.
OBJECTIVES:

PySpark DataFrames are a fundamental component of Apache Spark, a powerful open-source framework for distributed data processing. PySpark DataFrames provide a structured and efficient way to work with large datasets, making them an essential tool for data engineers, data scientists, and analysts in the big data landscape.
LEARNING OBJECTIVES:

• Understand PySpark DataFrames: Explain what PySpark DataFrames are and why they are
important for big data processing.
• Perform DataFrame Operations: Apply various DataFrame operations to manipulate and analyze
data efficiently.
• Conduct Data Exploration: Use PySpark DataFrames to explore data, calculate statistics, and create
simple visualizations.
• Work with External Data: Integrate PySpark DataFrames with external data sources to load,
transform, and save data.

CODE:

STEP 1: SETUP PYSPARK


EXPLAIN: The code for this step installs the headless build of OpenJDK 8 (the Java Development Kit), downloads Apache Spark 3.4.1 for Hadoop 3, and extracts the Spark distribution package into the current directory. This setup is a typical first step when configuring Apache Spark for development or data processing work. Then, I installed findspark. The environment variables 'JAVA_HOME' and 'SPARK_HOME' are assigned to specific directories: 'JAVA_HOME' points to the directory where OpenJDK 8 is installed, and 'SPARK_HOME' points to the location where Apache Spark 3.4.1 for Hadoop 3 was extracted. These environment variables tell Spark applications, whether written in Python or other languages, where to find the Java and Spark installations. Using the 'findspark' module, the code then initializes Apache Spark in the Python environment and starts a Spark session ('SparkSession') with a local master configuration ("local[*]") that uses every available CPU core. This session provides Spark's distributed data processing features for the rest of the lab.
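Since the original code screenshots are not reproduced in this report, the following is a minimal sketch of what the setup cell may have looked like in Colab; the exact package names, mirror URL, and installation paths are assumptions:

!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null                              # install headless OpenJDK 8
!wget -q https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz  # download Spark 3.4.1 for Hadoop 3 (mirror URL assumed)
!tar xf spark-3.4.1-bin-hadoop3.tgz                                                     # extract into the current directory
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # assumed OpenJDK 8 install path on Colab
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"   # extracted Spark distribution

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Lab02").getOrCreate()   # local master using all available CPU cores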

STEP 2: UPLOAD AND READ THE DATASET:

EXPLAIN: This code lets you upload a file through Google Colab's "files" module and then imports the NumPy and Pandas libraries. Finally, it reads the uploaded CSV file, "my.csv", into a Pandas DataFrame to enable data manipulation and analysis within the Colab environment.
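A sketch of the upload-and-read cell described above (the dataset name "my.csv" comes from the report; the rest is an assumed reconstruction):

from google.colab import files
import numpy as np
import pandas as pd

uploaded = files.upload()     # opens a file picker so the CSV can be uploaded to Colab
df = pd.read_csv("my.csv")    # read the uploaded file into a Pandas DataFrame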

EXPLAIN: Here, the schema of the DataFrame "df" is displayed, showing the structure and data types of its columns in tabular form.
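The schema check itself is not shown; on the Pandas DataFrame it could be something like:

df.info()     # column names, non-null counts, and data types of "df" (df.dtypes lists only the column types)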

STEP 3: CHECK THE SPARK VERSION:
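No explanation accompanies this step in the report; checking the version from the active session is typically a one-liner such as the following sketch:

spark.version     # should report '3.4.1' given the setup above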


STEP 4: DATA SHOW, TYPE, SELECTING SPECIFIC:

EXPLAIN: Here, I read the "my.csv" file using Apache Spark's "spark.read.csv()" method and produce a DataFrame called "df_pyspark". The "show()" function is then used to display the data in "df_pyspark", giving a tabular representation of the data in a Spark context.
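A plausible reconstruction of this read-and-show step:

df_pyspark = spark.read.csv("my.csv")   # read the CSV into a Spark DataFrame (no header handling yet)
df_pyspark.show()                       # print the rows in tabular form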

EXPLAIN: These lines of code retrieve the first two rows and the final four rows of data from the 'df_pyspark' DataFrame in Apache Spark, providing a quick look at the beginning and end of the dataset.
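One way these row peeks may have been written (the exact calls are an assumption based on the description):

df_pyspark.head(2)   # first two rows, returned as a list of Row objects
df_pyspark.tail(4)   # last four rows, returned as a list of Row objects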

STEP 5: SHOW READ OPERATION:

EXPLAIN: With the option 'header' set to 'true', this code instructs Apache Spark to treat the first row of the CSV file 'my.csv' as column headers. The data is then shown in tabular form with the DataFrame's column names visible.
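The read with the header option could have been written with .option() or a keyword argument; one possible form:

df_pyspark = spark.read.option("header", "true").csv("my.csv")   # treat the first row as column names
df_pyspark.show()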
STEP 6: SHOW DATAFRAME:

EXPLAIN: Using Apache Spark, this code reads the CSV file "my.csv", interpreting the first row as the column headers and inferring each column's data type via "inferSchema=True". The resulting DataFrame is then assigned to the variable 'df', providing a structured representation of the data with an inferred data type for each column.
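A sketch of this step, using the options named in the explanation:

df = spark.read.csv("my.csv", header=True, inferSchema=True)   # infer a data type for each column
df.show()                                                      # display the resulting DataFrame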

STEP 7: SELECT COLUMN:

EXPLAIN: The 'Age' column from the DataFrame 'df' is selected and shown using Apache Spark, presenting the information contained in that particular column in a tabular fashion.
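The selection described above, as a sketch:

df.select("Age").show()   # display only the 'Age' column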

STEP 8: COLUMN TYPE & SHOW:

EXPLAIN: The first line, "df.dtypes", returns the data type of each column in the DataFrame "df". The second line, "df.describe().show()", computes summary statistics (count, mean, standard deviation, min, max) for each numerical column in "df" and displays them in a table. This makes it easier to understand how the data in those columns is distributed.
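Both calls, exactly as named in the explanation:

df.dtypes             # list of (column name, data type) pairs
df.describe().show()  # count, mean, stddev, min and max for numerical columns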
STEP 9: ADDING COLUMN:

EXPLAIN: Here, I create a new column in the DataFrame df called "Experience after 5 years" by multiplying the values in the "Experience" column by 5. The original columns and the newly added column are then shown in the modified DataFrame.
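A sketch of the column addition; the report describes multiplying 'Experience' by 5 (adding 5 may equally have been intended given the column name), so the arithmetic below follows the report's description:

df = df.withColumn("Experience after 5 years", df["Experience"] * 5)   # derived column, per the description above
df.show()                                                              # show original and new columns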

STEP 10: DELETE COLUMN:

EXPLAIN: In this case, I delete the "Experience after 5 years" column from the DataFrame "df" and display the DataFrame without it.
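A sketch of the drop described above:

df = df.drop("Experience after 5 years")   # remove the derived column
df.show()                                  # DataFrame shown without it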

CONCLUSION:
In this lab, I gained a comprehensive understanding of PySpark DataFrames, a crucial data structure for working with structured data in big data processing. I learned how to perform various operations on DataFrames, explore data, and integrate with external data sources. These skills are essential for data engineers, data scientists, and analysts working in a big data environment.
