Handling Null values
𝐐𝐮𝐞𝐬𝐭𝐢𝐨𝐧 You are working as a Data Engineer for a company. The sales team has provided
you with a dataset containing sales information. However, the data has some missing values
that need to be addressed before processing. You are required to perform the following tasks:
b. Replace all NULL values in the Price column with the average price of the existing data.
𝐬𝐜𝐡𝐞𝐦𝐚
data = [
    (1, "Laptop", 10, 50000, "North", "2025-01-01"),
    (2, "Mobile", None, 15000, "South", None),
    (3, "Tablet", 20, None, "West", "2025-01-03"),
    (4, "Desktop", 15, 30000, None, "2025-01-04"),
    (5, None, None, None, "East", "2025-01-05"),
]
# column names as they appear in the df.show() header
columns = ["Sales_ID", "Product", "Quantity", "Price", "Region", "Sales_Date"]
df = spark.createDataFrame(data, columns)
df.show()
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|    null|15000| South|      null|
|       3| Tablet|      20| null|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
|       5|   null|    null| null|  East|2025-01-05|
+--------+-------+--------+-----+------+----------+
df.createOrReplaceTempView("sales_tbl")
# replace null value in qty with 0
df.fillna({"Quantity":0}).show()
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|       0|15000| South|      null|
|       3| Tablet|      20| null|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
|       5|   null|       0| null|  East|2025-01-05|
+--------+-------+--------+-----+------+----------+
%sql
-- fill na with 0
select *,coalesce(Quantity,0) from sales_tbl;
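# mean of the non-null prices (avg skips NULLs); the original cell is not shown, so this line is an assumed reconstruction
average = df.agg({"Price": "avg"}).first()[0]
print(average)
31666.666666666668
# fillna casts the fill value to the column type (long), so the average is stored as 31666
df.fillna({"Price":average}).show()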
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|    null|15000| South|      null|
|       3| Tablet|      20|31666|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
|       5|   null|    null|31666|  East|2025-01-05|
+--------+-------+--------+-----+------+----------+
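The 31666 in the table above comes from two steps: averaging only the known prices, then casting the result to the integer type of the Price column. The arithmetic can be sketched in plain Python without Spark; the list names here are illustrative, not part of the original notebook.

```python
# Price column from the sample data; None marks a missing value
prices = [50000, 15000, None, 30000, None]

# Spark's avg() skips NULLs, so only the three known prices contribute
known = [p for p in prices if p is not None]
average = sum(known) / len(known)  # 95000 / 3, roughly 31666.67

# Price was inferred as an integer (long) column, so the fill value
# is cast to an integer, which drops the fractional part
filled = [int(average) if p is None else p for p in prices]
print(filled)
```

If the fractional part matters, the column has to be a double before the value is filled in.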
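from pyspark.sql.functions import col, when
# when/otherwise yields a double column, so the average keeps its fractional part
df.withColumn("Price", when(col("Price").isNull(), average).otherwise(col("Price"))).show()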
+--------+-------+--------+------------------+------+----------+
|Sales_ID|Product|Quantity|             Price|Region|Sales_Date|
+--------+-------+--------+------------------+------+----------+
|       1| Laptop|      10|           50000.0| North|2025-01-01|
|       2| Mobile|    null|           15000.0| South|      null|
|       3| Tablet|      20|31666.666666666668|  West|2025-01-03|
|       4|Desktop|      15|           30000.0|  null|2025-01-04|
|       5|   null|    null|31666.666666666668|  East|2025-01-05|
+--------+-------+--------+------------------+------+----------+
%sql
select Price from sales_tbl;
df.show()
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|    null|15000| South|      null|
|       3| Tablet|      20| null|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
|       5|   null|    null| null|  East|2025-01-05|
+--------+-------+--------+-----+------+----------+
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|    null|15000| South|      null|
|       3| Tablet|      20| null|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
+--------+-------+--------+-----+------+----------+
%sql
-- drop rows where product is null
select * from sales_tbl
where Product is not null;
# replace null Sales_Date with a default date
df.fillna({"Sales_Date":'2025-01-01'}).show()
+--------+-------+--------+-----+------+----------+
|Sales_ID|Product|Quantity|Price|Region|Sales_Date|
+--------+-------+--------+-----+------+----------+
|       1| Laptop|      10|50000| North|2025-01-01|
|       2| Mobile|    null|15000| South|2025-01-01|
|       3| Tablet|      20| null|  West|2025-01-03|
|       4|Desktop|      15|30000|  null|2025-01-04|
|       5|   null|    null| null|  East|2025-01-05|
+--------+-------+--------+-----+------+----------+
pdf = df.toPandas()
Out[46]:
0    50000.000000
1    15000.000000
2    31666.666667
3    30000.000000
4    31666.666667
Name: Price, dtype: float64
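The pandas output above is consistent with filling the Price series with its own mean. The following is a minimal sketch of that step, assuming pandas is installed; it builds the frame directly instead of going through df.toPandas() so that it runs without Spark, and `filled` is an illustrative name.

```python
import pandas as pd

# Price column from the sample data; None becomes NaN in a float64 column
pdf = pd.DataFrame({"Price": [50000, 15000, None, 30000, None]})

# mean() skips NaN by default, so it averages the three known prices
filled = pdf["Price"].fillna(pdf["Price"].mean())
print(filled)
```

Because the pandas column is already float64, the mean is kept with its fractional part, unlike the Spark fillna call on a long column.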