0% found this document useful (0 votes)

16 views21 pages

Mini Project2 DAV Answers - Jupyter Notebook

This document discusses data preprocessing steps for a Bigmart sales dataset in Python. It defines necessary libraries, loads the dataset, drops unnecessary columns, extracts the target label, performs encoding of categorical variables, handles missing data through imputation, splits the data into train and test sets, and applies linear regression to calculate RMSE. Standardization is also applied before another train-test split and linear regression. The goal is to clean, preprocess and analyze the dataset to build a predictive model for sales.

Uploaded by

Priscella Coc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views21 pages

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Priscella Coc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Data Analysis and Visualization MPA-2

NITHIN RAJ

KISHORE KUMAR M

VISHNU VARADHAN REDDY

Define the necessary libraries (1 mark)

In [1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.linear_model import LinearRegression
from sklearn import metrics as mt
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,MinMaxScaler,MaxAbsScaler,MaxAbsScaler,RobustScaler,No
from sklearn.model_selection import train_test_split

Load the dataset into the dataframe (1 mark)

In [2]: df = pd.read_csv('BigmartSales.csv')

Drop the "Item_Identifier" and "Outlet_Identifier" columns (1 mark)

In [3]: # df.drop() drops the data in the dataframe-df.

# By Default it drops the row data.
# To drop the column data set the axis=1
In [4]: print('Columns in the dataset before dropping are: ',df.columns)
df = df.drop(['Item_Identifier','Outlet_Identifier'],axis=1)
print('Columns in the dataset after dropping are: ',df.columns)

Columns in the dataset before dropping are: Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visib
ility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier',
'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Columns in the dataset after dropping are: Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')

Extract the target labels (1 mark)

In [5]: # The target label in a dataset is the output data.

# It changes based on the other features in the dataset

In [6]: target_label = df.Item_Outlet_Sales

target_label

Out[6]: 0 3735.1380
1 443.4228
2 2097.2700
3 732.3800
4 994.7052
...
8518 2778.3834
8519 549.2850
8520 1193.1136
8521 1845.5976
8522 765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

Replace the field "Item_Fat_Content" with numerical value (1 mark)

In [7]: # df.replace({'old_data':'new_data'}) replaces the old_data in a dataframe with the new_data provided as the key-value

In [8]: print('Before Replacing: ',df['Item_Fat_Content'].unique())

df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'Low Fat':0,'LF':0,'Regular':1,'reg':1,'low fat':0})
print('After Replacing: ',df['Item_Fat_Content'].unique())

Before Replacing: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']

After Replacing: [0 1]

Perform ordinal encoding of the "Item_Type", "Outlet_Type", "Outlet_Location_Type" and "Outlet_Type" field (1 mark)

In [9]: # Encoding is the process of transforming the categorical (discrete) features into ordinal integers.
# This is the preprocessing step to be done before using the dataset for ML model training

In [10]: ordEnc = OrdinalEncoder()

df['Item_Type'] = ordEnc.fit_transform(df['Item_Type'].values.reshape(-1, 1))
df['Item_Type']

Out[10]: 0 4.0
1 14.0
2 10.0
3 6.0
4 9.0
...
8518 13.0
8519 0.0
8520 8.0
8521 13.0
8522 14.0
Name: Item_Type, Length: 8523, dtype: float64
In [11]: df['Outlet_Type'] = ordEnc.fit_transform(df['Outlet_Type'].values.reshape(-1, 1))
df['Outlet_Type']

Out[11]: 0 1.0
1 2.0
2 1.0
3 0.0
4 1.0
...
8518 1.0
8519 1.0
8520 1.0
8521 2.0
8522 1.0
Name: Outlet_Type, Length: 8523, dtype: float64

In [12]: df['Outlet_Location_Type'] = ordEnc.fit_transform(df['Outlet_Location_Type'].values.reshape(-1, 1))

df['Outlet_Location_Type']

Out[12]: 0 0.0
1 2.0
2 0.0
3 2.0
4 2.0
...
8518 2.0
8519 1.0
8520 1.0
8521 2.0
8522 0.0
Name: Outlet_Location_Type, Length: 8523, dtype: float64
In [13]: df.isna().sum()

Out[13]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Imputation of "Outlet_Size" field with mode value (1 mark)

In [14]: # fillna() is the method used to place the custom values at the NaN in a dataframe of series

In [15]: print('The Mode of Outlet Size is: ',df['Outlet_Size'].mode())

The Mode of Outlet Size is: 0 Medium

Name: Outlet_Size, dtype: object

In [16]: df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')

In [17]: df.isna().sum()

Out[17]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Check for null values (1 mark)

In [18]: df.isnull().sum()

Out[18]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Imputation of "Item_Weight" field with mode value (1 mark)

In [19]: print('The Mode of Item Weight is: ',df['Item_Weight'].mode())

The Mode of Item Weight is: 0 12.15

Name: Item_Weight, dtype: float64
In [20]: df['Item_Weight'] = df['Item_Weight'].fillna(12.15)

In [21]: df.isna().sum()

Out[21]: Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Display all field in the dataset using boxplot (1 mark)

In [22]: # Box plot is used to find the outliers present in a data set. Mostly used for a univariate analysis.
# Also can be applied to bivariate analysis having 1 numerical and 1 categorical data
# It is called as grouped boxplot
In [23]: plt.figure(figsize=(10,5))
sns.boxplot(df)
plt.xticks(rotation=90)
plt.title('Bigmart Sales Data')
plt.show()
Split the dataset into train and test(20%), apply Linear Regression and calculate RMSE value (1 mark)
In [24]: # train_test_split is the method in sklearn.model_selection
# It is used to create the training and testing data from a complete data
# It gets the parameters - input data, output data,
# test_size=the size of the data that has to be selected for the testing of the ML model
# it returns four values - xtrain,xtest,ytrain,ytest that are given to the ML model for training and testing

In [25]: df['Outlet_Size'] = ordEnc.fit_transform(df['Outlet_Size'].values.reshape(-1, 1))

X=df.drop('Item_Outlet_Sales',axis=1)
Y=df['Item_Outlet_Sales']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.2,)
# Create and fit the linear regression model
model = LinearRegression()
model.fit(xtrain, ytrain)

# Make predictions on the test set
ypred = model.predict(xtest)

# Calculate RMSE
rmse1 = math.sqrt(mt.mean_squared_error(ytest, ypred))

print(f"Root Mean Squared Error (RMSE): {rmse1}")

Root Mean Squared Error (RMSE): 1177.361349688933

Apply StandardScaller and split the dataset into train and test(20%) (1 mark)

In [26]: # StandardScaler standardize features by removing the mean and scaling to unit variance.
# Standardization of a dataset is a common requirement for many machine learning estimators:
# they might behave badly if the individual features do not more or less look like standard normally distributed data
In [27]: sc = StandardScaler()
df_sc = sc.fit_transform(df)
df1 = pd.DataFrame(df_sc)

df1.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

X1=df1.drop('Item_Outlet_Sales',axis=1)
Y1=df1['Item_Outlet_Sales']
x1train,x1test,y1train,y1test=train_test_split(X1,Y,test_size=0.2)
# Create and fit the linear regression model
model1 = LinearRegression()
model1.fit(x1train, y1train)

Out[27]: LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Display all field in the dataset using boxplot (1 mark)

In [28]: plt.figure(figsize=(10,5))
sns.boxplot(df1)
plt.xticks(rotation=90)
plt.title('Bigmart Sales Data')
plt.show()
Apply Linear Regression and calculate RMSE value (1 mark)
In [29]: # Make predictions on the test set
y1pred = model1.predict(x1test)

# Calculate RMSE
rmse2 = math.sqrt(mt.mean_squared_error(y1test, y1pred))

print(f"Root Mean Squared Error (RMSE): {rmse2}")

Root Mean Squared Error (RMSE): 1161.6406081768139

Apply MinMaxScaler, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [30]: # MinMaxScaler Transform features by scaling each feature to a given range.

# This estimator scales and translates each feature individually such that
# it is in the given range on the training set, e.g. between zero and one.
# This transformation is often used as an alternative to zero mean, unit variance scaling.
In [31]: mmsc = MinMaxScaler()
df_mmsc = mmsc.fit_transform(df)
df2 = pd.DataFrame(df_mmsc)

df2.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

X2=df2.drop('Item_Outlet_Sales',axis=1)
Y2=df2['Item_Outlet_Sales']
x2train,x2test,y2train,y2test=train_test_split(X2,Y,test_size=0.2)
# Create and fit the linear regression model
model2 = LinearRegression()
model2.fit(x2train, y2train)
# Make predictions on the test set
y2pred = model2.predict(x2test)

# Calculate RMSE
rmse3 = math.sqrt(mt.mean_squared_error(y2test, y2pred))

print(f"Root Mean Squared Error (RMSE): {rmse3}")

Root Mean Squared Error (RMSE): 1176.5289257439433

Apply RobustScaler,Split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [32]: # RobustScaler scales features using statistics that are robust to outliers.
# This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile R
# The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
In [33]: rsc = RobustScaler()
df_rsc = rsc.fit_transform(df)
dfr = pd.DataFrame(df_rsc)

dfr.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xr=dfr.drop('Item_Outlet_Sales',axis=1)
Yr=dfr['Item_Outlet_Sales']
xrtrain,xrtest,yrtrain,yrtest=train_test_split(Xr,Y,test_size=0.2)
# Create and fit the linear regression model
modelr = LinearRegression()
modelr.fit(xrtrain, yrtrain)
# Make predictions on the test set
yrpred = modelr.predict(xrtest)

# Calculate RMSE
rmse4 = math.sqrt(mt.mean_squared_error(yrtest, yrpred))

print(f"Root Mean Squared Error (RMSE): {rmse4}")

Root Mean Squared Error (RMSE): 1143.8487793222237

Apply MaxAbsScaler, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [34]: # MaxAbsScaler scales each feature by its maximum absolute value.

# This estimator scales and translates each feature individually such that
# the maximal absolute value of each feature in the training set will be 1.0.
# It does not shift/center the data, and thus does not destroy any sparsity.
# MaxAbsScaler doesn’t reduce the effect of outliers; it only linearily scales them down.
In [35]: masc = MaxAbsScaler()
df_masc = masc.fit_transform(df)
dfa = pd.DataFrame(df_masc)

dfa.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xa=dfa.drop('Item_Outlet_Sales',axis=1)
Ya=dfa['Item_Outlet_Sales']
xatrain,xatest,yatrain,yatest=train_test_split(Xa,Y,test_size=0.2)
# Create and fit the linear regression model
modela = LinearRegression()
modela.fit(xatrain, yatrain)
# Make predictions on the test set
yapred = modela.predict(xatest)

# Calculate RMSE
rmse5 = math.sqrt(mt.mean_squared_error(yatest, yapred))

print(f"Root Mean Squared Error (RMSE): {rmse5}")

Root Mean Squared Error (RMSE): 1195.9232136536114

Apply Normalizer, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [36]: # Normalizer normalizes samples individually to unit norm.

# Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of oth
# so that its norm (l1, l2 or inf) equals one.
In [37]: nsc = Normalizer()
df_nsc = masc.fit_transform(df)
dfn = pd.DataFrame(df_nsc)

dfn.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xn=dfn.drop('Item_Outlet_Sales',axis=1)
Yn=dfn['Item_Outlet_Sales']
xntrain,xntest,yntrain,yntest=train_test_split(Xn,Y,test_size=0.2)
# Create and fit the linear regression model
modeln = LinearRegression()
modeln.fit(xntrain, yntrain)
# Make predictions on the test set
ynpred = modeln.predict(xntest)

# Calculate RMSE
rmse6 = math.sqrt(mt.mean_squared_error(yntest, ynpred))

print(f"Root Mean Squared Error (RMSE): {rmse6}")

Root Mean Squared Error (RMSE): 1218.0003678085768

Define a function valuelabel to place the legend of each bar in the histogram (1 mark)
In [38]: def valuelabel(ax, spacing=3):

# For each bar: Place a label
for rect in ax.patches:
# Get X and Y placement of label from rect.
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2

# Number of points between bar and label
space = spacing
# Vertical alignment for positive values
va = 'bottom'

# If value of bar is negative: Place label below bar
if y_value < 0:
# Invert space to place label below
space *= -1
# Vertically align label at top
va = 'top'

# Use Y value as label and format number with one decimal place
label = "{:.1f}".format(y_value)

# Create annotation
ax.annotate(
label, # Use `label` as label
(x_value, y_value), # Place label at end of the bar
xytext=(0, space), # Vertically shift label by `space`
textcoords="offset points", # Interpret `xytext` as offset in points
ha='center', # Horizontally center label
va=va) # Vertically align label differently for
# positive and negative values.

Plot a histogram to display the RMSE value of each scaler (1 mark)

In [39]: rmses = [rmse1,rmse2,rmse3,rmse4,rmse5,rmse6]
rmse_Series = pd.Series(rmses)
labels = ['rmse1','rmse2','rmse3','rmse4','rmse5','rmse6']
# Creating histogram
plt.figure(figsize=(10,5))
ax = rmse_Series.plot(kind='bar')
ax.set_xticklabels(labels)
valuelabel(ax)
# Show plot
plt.show()

Delhivery Mani
No ratings yet
Delhivery Mani
79 pages
Script Creation For A 2D Airfoil Meshing Using ICEM CFD On ANSYS
No ratings yet
Script Creation For A 2D Airfoil Meshing Using ICEM CFD On ANSYS
30 pages
PHP Unit I - V - Notes
No ratings yet
PHP Unit I - V - Notes
113 pages
vertopal.com_Project_12_Big_Mart_Sales_Prediction
No ratings yet
vertopal.com_Project_12_Big_Mart_Sales_Prediction
15 pages
PRJ Sales Forecasting
No ratings yet
PRJ Sales Forecasting
22 pages
Lab1 Features Selections-Class-GI2
No ratings yet
Lab1 Features Selections-Class-GI2
25 pages
BigMart Sales Data Analysis
No ratings yet
BigMart Sales Data Analysis
16 pages
Big Sales Mart Final Script PDF
No ratings yet
Big Sales Mart Final Script PDF
36 pages
BIG Mart Data Analyst Project
No ratings yet
BIG Mart Data Analyst Project
19 pages
Data_preprocessing_example_programs1
No ratings yet
Data_preprocessing_example_programs1
9 pages
Deep Learning Assignments
No ratings yet
Deep Learning Assignments
13 pages
Mini Project (BDA) Output
No ratings yet
Mini Project (BDA) Output
5 pages
Project 4: Final Project: Bigmart Sales Prediction: Chapter 1: Problem Statement
No ratings yet
Project 4: Final Project: Bigmart Sales Prediction: Chapter 1: Problem Statement
35 pages
Lab File
No ratings yet
Lab File
96 pages
Data Science Tutorial 1686911993
No ratings yet
Data Science Tutorial 1686911993
41 pages
Task 6
No ratings yet
Task 6
14 pages
BigMart PDF
100% (1)
BigMart PDF
42 pages
ML Practical 4D
No ratings yet
ML Practical 4D
11 pages
Data Clearning
No ratings yet
Data Clearning
7 pages
Machine Learning Record VR19
No ratings yet
Machine Learning Record VR19
46 pages
5-2a dataframes column operations - instruction
No ratings yet
5-2a dataframes column operations - instruction
2 pages
House Price Prediction Using Machine Learning in Python
No ratings yet
House Price Prediction Using Machine Learning in Python
13 pages
Data Pre Processing
No ratings yet
Data Pre Processing
2 pages
Data Preprocessing 2
No ratings yet
Data Preprocessing 2
5 pages
Machine Exercise 3 (1)
No ratings yet
Machine Exercise 3 (1)
22 pages
Customer Segmentation With K-means Clustering and Visualization - Colab
No ratings yet
Customer Segmentation With K-means Clustering and Visualization - Colab
3 pages
practice_questions2
No ratings yet
practice_questions2
2 pages
Dmdw-Lab Manual
No ratings yet
Dmdw-Lab Manual
61 pages
HET ka FML
No ratings yet
HET ka FML
13 pages
ML Lab Manual 2025-2
No ratings yet
ML Lab Manual 2025-2
35 pages
Dataframe
No ratings yet
Dataframe
19 pages
Analysis and Prediction of House Prices by Linear Regression Model
No ratings yet
Analysis and Prediction of House Prices by Linear Regression Model
91 pages
DS-Food
No ratings yet
DS-Food
23 pages
EDS - Python Cheat Sheet
0% (1)
EDS - Python Cheat Sheet
3 pages
Data Wrangling Python.
No ratings yet
Data Wrangling Python.
8 pages
porter-case-study
No ratings yet
porter-case-study
153 pages
Practicals
No ratings yet
Practicals
42 pages
Exp 01-B Feature Selection and Extraction
No ratings yet
Exp 01-B Feature Selection and Extraction
12 pages
Handling Missing Values in A Real-Time Dataset During
No ratings yet
Handling Missing Values in A Real-Time Dataset During
5 pages
Aerofit_Case_Study
No ratings yet
Aerofit_Case_Study
16 pages
Machine Learning Lab Assignment 2
No ratings yet
Machine Learning Lab Assignment 2
23 pages
Certificate
No ratings yet
Certificate
25 pages
ML LAB manual-1
No ratings yet
ML LAB manual-1
33 pages
assignment1
No ratings yet
assignment1
7 pages
Project paarth (1) (1)
No ratings yet
Project paarth (1) (1)
21 pages
Exp 2 Data Preprocessing_ Cleaning the Dataset Obtained from the UCI ML Repository
No ratings yet
Exp 2 Data Preprocessing_ Cleaning the Dataset Obtained from the UCI ML Repository
9 pages
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
No ratings yet
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
17 pages
Data Analysis: Data Preparation
No ratings yet
Data Analysis: Data Preparation
9 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
ML Complete Notes Hridoy.docx
No ratings yet
ML Complete Notes Hridoy.docx
5 pages
SVM (Support Vector Machine) For Classification - by Aditya Kumar - Towards Data Science
100% (1)
SVM (Support Vector Machine) For Classification - by Aditya Kumar - Towards Data Science
28 pages
IP Practic MINE
No ratings yet
IP Practic MINE
30 pages
Lecture Material 10
No ratings yet
Lecture Material 10
9 pages
LAB MANUAL 5 SOLVED 40 (1)
No ratings yet
LAB MANUAL 5 SOLVED 40 (1)
13 pages
Identifying Columns with Missing Values
No ratings yet
Identifying Columns with Missing Values
4 pages
Lab Manual 4
No ratings yet
Lab Manual 4
23 pages
Another Project-Creating Customer Segments
No ratings yet
Another Project-Creating Customer Segments
31 pages
01 - Feature Engg
No ratings yet
01 - Feature Engg
43 pages
Hrithik Saini Class 12th c1, Roll No 1033
No ratings yet
Hrithik Saini Class 12th c1, Roll No 1033
25 pages
Dwdm-Lab Manual
No ratings yet
Dwdm-Lab Manual
39 pages
Data Preprocessing 1
No ratings yet
Data Preprocessing 1
6 pages
Advanced C Concepts and Programming: First Edition
From Everand
Advanced C Concepts and Programming: First Edition
Gayatri
3/5 (1)
MCSl-217 mca ignou assignment
No ratings yet
MCSl-217 mca ignou assignment
6 pages
Making Words Lesson Plan
No ratings yet
Making Words Lesson Plan
2 pages
Essay Plan
No ratings yet
Essay Plan
3 pages
MEDIA CODES AND CONVENTIONS
No ratings yet
MEDIA CODES AND CONVENTIONS
2 pages
Architectural Graphics 101 - Drawing Alignment and Notes
No ratings yet
Architectural Graphics 101 - Drawing Alignment and Notes
2 pages
User Manual Quickhash-GUI-Windows-v3.3.4 para Hashear Archivos
No ratings yet
User Manual Quickhash-GUI-Windows-v3.3.4 para Hashear Archivos
38 pages
Bibliophile Event 1
No ratings yet
Bibliophile Event 1
5 pages
Rounak's Resume PDF
No ratings yet
Rounak's Resume PDF
1 page
En4G-Iig-3.2 Melc 14: (Formerly Pasongkawayan I Elementary School)
No ratings yet
En4G-Iig-3.2 Melc 14: (Formerly Pasongkawayan I Elementary School)
7 pages
Jebin Resume 1
No ratings yet
Jebin Resume 1
2 pages
Lectures On Discrete Geometry, Jiri Matousek
100% (1)
Lectures On Discrete Geometry, Jiri Matousek
491 pages
Fitting Data To A Distribution
No ratings yet
Fitting Data To A Distribution
2 pages
Summary Book - Longman Complete Course For The Toelf Test - 2001
No ratings yet
Summary Book - Longman Complete Course For The Toelf Test - 2001
10 pages
ARREST FORMAT 04.12.2024
No ratings yet
ARREST FORMAT 04.12.2024
5 pages
Stratix 5700 and ArmorStratix 5700 - A - 15.2 (7) E3 (Released 9 - 2020) - 15.2 (8) E1 (Released 12 - 2021)
No ratings yet
Stratix 5700 and ArmorStratix 5700 - A - 15.2 (7) E3 (Released 9 - 2020) - 15.2 (8) E1 (Released 12 - 2021)
17 pages
Resolution Formatting Guide
100% (1)
Resolution Formatting Guide
2 pages
Three Word Phrasal Verbs Exercise
No ratings yet
Three Word Phrasal Verbs Exercise
2 pages
MRS 550 Ecclesiology and Church Leadership 12-1-21
No ratings yet
MRS 550 Ecclesiology and Church Leadership 12-1-21
62 pages
8.2.2022 - Grammar Practice
No ratings yet
8.2.2022 - Grammar Practice
5 pages
Jawaban Kuis Praktikum TOEFL Sesi UAS
No ratings yet
Jawaban Kuis Praktikum TOEFL Sesi UAS
7 pages
Self Introduction: English For Teenth Graders
No ratings yet
Self Introduction: English For Teenth Graders
10 pages
Ttorial Com
No ratings yet
Ttorial Com
110 pages
Analytics 2024 04 01 090020.ips - Ca.synced
No ratings yet
Analytics 2024 04 01 090020.ips - Ca.synced
69 pages
Materi 2 English - 8 Sept 2023
No ratings yet
Materi 2 English - 8 Sept 2023
3 pages
How To Write A Professional CV and Cover Letter
100% (1)
How To Write A Professional CV and Cover Letter
9 pages
SFL (Systemic Functional Linguistics) in Discourse Analysis
No ratings yet
SFL (Systemic Functional Linguistics) in Discourse Analysis
14 pages
Graduate Report Tejas
No ratings yet
Graduate Report Tejas
150 pages
Quiz 1 Present Simple & Present Continuous
100% (1)
Quiz 1 Present Simple & Present Continuous
2 pages

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Data Analysis and Visualization MPA-2

VISHNU VARADHAN REDDY

Define the necessary libraries (1 mark)

In [1]: import pandas as pd

Load the dataset into the dataframe (1 mark)

Drop the "Item_Identifier" and "Outlet_Identifier" columns (1 mark)

In [3]: # df.drop() drops the data in the dataframe-df.

Extract the target labels (1 mark)

In [5]: # The target label in a dataset is the output data.

In [6]: target_label = df.Item_Outlet_Sales

Replace the field "Item_Fat_Content" with numerical value (1 mark)

In [8]: print('Before Replacing: ',df['Item_Fat_Content'].unique())

Before Replacing: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']

In [10]: ordEnc = OrdinalEncoder()

In [12]: df['Outlet_Location_Type'] = ordEnc.fit_transform(df['Outlet_Location_Type'].values.reshape(-1, 1))

Out[13]: Item_Weight 1463

Imputation of "Outlet_Size" field with mode value (1 mark)

In [15]: print('The Mode of Outlet Size is: ',df['Outlet_Size'].mode())

The Mode of Outlet Size is: 0 Medium

In [16]: df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')

Out[17]: Item_Weight 1463

Check for null values (1 mark)

Out[18]: Item_Weight 1463

Imputation of "Item_Weight" field with mode value (1 mark)

In [19]: print('The Mode of Item Weight is: ',df['Item_Weight'].mode())

The Mode of Item Weight is: 0 12.15

Display all field in the dataset using boxplot (1 mark)

In [25]: df['Outlet_Size'] = ordEnc.fit_transform(df['Outlet_Size'].values.reshape(-1, 1))

Root Mean Squared Error (RMSE): 1177.361349688933

Display all field in the dataset using boxplot (1 mark)

Root Mean Squared Error (RMSE): 1161.6406081768139

In [30]: # MinMaxScaler Transform features by scaling each feature to a given range.

Root Mean Squared Error (RMSE): 1176.5289257439433

Root Mean Squared Error (RMSE): 1143.8487793222237

In [34]: # MaxAbsScaler scales each feature by its maximum absolute value.

Root Mean Squared Error (RMSE): 1195.9232136536114

In [36]: # Normalizer normalizes samples individually to unit norm.

Root Mean Squared Error (RMSE): 1218.0003678085768

Plot a histogram to display the RMSE value of each scaler (1 mark)

You might also like