Handling Null Values in PySpark

The document outlines a series of tasks for a Data Engineer to clean a sales dataset using PySpark. Key operations include replacing NULL values in the Quantity and Price columns, dropping rows with NULL Product values, filling missing Sales_Date values, and removing rows where all columns are NULL. The document provides sample code and outputs for each operation performed on the dataset.

Interview question

Question: You are working as a Data Engineer for a company. The sales team has provided you with a dataset containing sales information. However, the data has some missing values that need to be addressed before processing. You are required to perform the following tasks:

1. Load the following sample dataset into a PySpark DataFrame:

2. Perform the following operations:


a. Replace all NULL values in the Quantity column with 0.

b. Replace all NULL values in the Price column with the average price of the existing data.

c. Drop rows where the Product column is NULL.

d. Fill missing Sales_Date with a default value of '2025-01-01'.

e. Drop rows where all columns are NULL.

Schema and sample data:

data = [
    (1, "Laptop", 10, 50000, "North", "2025-01-01"),
    (2, "Mobile", None, 15000, "South", None),
    (3, "Tablet", 20, None, "West", "2025-01-03"),
    (4, "Desktop", 15, 30000, None, "2025-01-04"),
    (5, None, None, None, "East", "2025-01-05"),
]

columns = ["Sales_ID", "Product", "Quantity", "Price", "Region", "Sales_Date"]

df = spark.createDataFrame(data, columns)

df.show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+

df.createOrReplaceTempView("sales_tbl")

# a. replace NULL values in Quantity with 0
df.fillna({"Quantity": 0}).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| 0|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| 0| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+

from pyspark.sql.types import *
from pyspark.sql.functions import *

# replace NULL Quantity with 0 using when/otherwise
df.withColumn(
    "Quantity",
    when(col("Quantity").isNull(), 0).otherwise(col("Quantity"))
).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| 0|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| 0| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+

%sql
-- fill null Quantity with 0
select *, coalesce(Quantity, 0) as Quantity_filled from sales_tbl;

# b. replace NULL values in Price with the column average
average = df.agg(avg("Price")).collect()[0][0]
print(average)

31666.666666666668

df.fillna({"Price":average}).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South| null|
| 3| Tablet| 20|31666| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null|31666| East|2025-01-05|
+--------+-------+--------+-----+------+----------+
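Note: fillna coerces the replacement value to the column's existing type. Price was inferred as a long (the sample data uses integers), so the average 31666.666666666668 is truncated to 31666 in the output above. The when/otherwise version below promotes Price to double instead and keeps the full average.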

df.withColumn("Price", when(col("Price").isNull(),
average).otherwise(col("Price"))).show()

+--------+-------+--------+------------------+------+----------+
|Sales_ID|Product|Quantity| Price|Region|Sales_Date|
+--------+-------+--------+------------------+------+----------+
| 1| Laptop| 10| 50000.0| North|2025-01-01|
| 2| Mobile| null| 15000.0| South| null|
| 3| Tablet| 20|31666.666666666668| West|2025-01-03|
| 4|Desktop| 15| 30000.0| null|2025-01-04|
| 5| null| null|31666.666666666668| East|2025-01-05|
+--------+-------+--------+------------------+------+----------+

%sql
-- fill null Price with the average price
select *, coalesce(Price, (select avg(Price) from sales_tbl)) as Price_filled from sales_tbl;

# c. drop rows where the Product column is NULL; show the original DataFrame first for reference

df.show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+

# drop rows where Product is NULL using filter
df.filter(col("Product").isNotNull()).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
+--------+-------+--------+-----+------+----------+
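The same result can be had with dropna restricted to a column subset:

# equivalent: drop rows with a NULL Product via dropna
df.dropna(subset=["Product"]).show()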

%sql
-- drop rows where product is null
select * from sales_tbl
where Product is not null;

# e. drop rows where all columns are NULL
df.dropna(how="all").show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South| null|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+
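Row 5 is kept here because only Product, Quantity, and Price are NULL; Sales_ID and Region still hold values, so not all of its columns are NULL.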

# d. fill missing Sales_Date with a default value of '2025-01-01'
df.withColumn(
    "Sales_Date",
    when(col("Sales_Date").isNull(), "2025-01-01").otherwise(col("Sales_Date"))
).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South|2025-01-01|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+

df.fillna({"Sales_Date":'2025-01-01'}).show()

+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
| 1| Laptop| 10|50000| North|2025-01-01|
| 2| Mobile| null|15000| South|2025-01-01|
| 3| Tablet| 20| null| West|2025-01-03|
| 4|Desktop| 15|30000| null|2025-01-04|
| 5| null| null| null| East|2025-01-05|
+--------+-------+--------+-----+------+----------+
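Putting the steps together, the following sketch simply chains the operations shown above into one pipeline (avg_price and clean_df are names introduced here for illustration):

from pyspark.sql.functions import avg, col, when

avg_price = df.agg(avg("Price")).collect()[0][0]

clean_df = (
    df.dropna(how="all")                      # e. drop rows where all columns are NULL
      .dropna(subset=["Product"])             # c. drop rows with a NULL Product
      .fillna({"Quantity": 0})                # a. NULL Quantity -> 0
      .withColumn(                            # b. NULL Price -> average price
          "Price",
          when(col("Price").isNull(), avg_price).otherwise(col("Price")))
      .fillna({"Sales_Date": "2025-01-01"})   # d. default Sales_Date
)
clean_df.show()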

# run the same cleanup in pandas, after converting the Spark DataFrame
pdf = df.toPandas()
pdf

# fill null Quantity with 0 in pandas
pdf["Quantity"].fillna(0)

Out[44]:
0    10.0
1     0.0
2    20.0
3    15.0
4     0.0
Name: Quantity, dtype: float64
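Quantity comes back as float64 because pandas represents missing numeric values as NaN, which forces an integer column to float.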

# fill null Price with the column average in pandas
pdf["Price"].fillna(pdf["Price"].mean())

Out[46]:
0    50000.000000
1    15000.000000
2    31666.666667
3    30000.000000
4    31666.666667
Name: Price, dtype: float64

# drop rows where the Product column is null in pandas
pdf.dropna(subset=["Product"])

# drop rows where all columns are null in pandas
pdf.dropna(how="all")
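Note that the pandas calls above return new objects rather than modifying pdf in place. To keep the cleaned data, assign the results back, as in this minimal sketch:

# persist the pandas cleanup by assigning the results back
pdf["Quantity"] = pdf["Quantity"].fillna(0)
pdf["Price"] = pdf["Price"].fillna(pdf["Price"].mean())
pdf = pdf.dropna(subset=["Product"]).dropna(how="all")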
