Mini Project2 DAV Answers - Jupyter Notebook
NITHIN RAJ
KISHORE KUMAR M
In [2]: df = pd.read_csv('BigmartSales.csv')
Columns in the dataset before dropping are: Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')
Columns in the dataset after dropping are: Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Out[6]: 0 3735.1380
1 443.4228
2 2097.2700
3 732.3800
4 994.7052
...
8518 2778.3834
8519 549.2850
8520 1193.1136
8521 1845.5976
8522 765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64
Perform ordinal encoding of the "Item_Type", "Outlet_Type" and "Outlet_Location_Type" fields (1 mark)
In [9]: # Ordinal encoding transforms categorical (discrete) features into ordinal integers.
# This is a preprocessing step to be done before using the dataset for ML model training
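The cell that produced Out[10] is not shown in this export; a minimal sketch of what it likely contained, assuming ordEnc is a scikit-learn OrdinalEncoder reused by the later cells and following the same reshape pattern as In [11]:

from sklearn.preprocessing import OrdinalEncoder

ordEnc = OrdinalEncoder()  # assumed shared encoder object, reused for the other columns
df['Item_Type'] = ordEnc.fit_transform(df['Item_Type'].values.reshape(-1, 1))
df['Item_Type']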
Out[10]: 0 4.0
1 14.0
2 10.0
3 6.0
4 9.0
...
8518 13.0
8519 0.0
8520 8.0
8521 13.0
8522 14.0
Name: Item_Type, Length: 8523, dtype: float64
In [11]: df['Outlet_Type'] = ordEnc.fit_transform(df['Outlet_Type'].values.reshape(-1, 1))
df['Outlet_Type']
Out[11]: 0 1.0
1 2.0
2 1.0
3 0.0
4 1.0
...
8518 1.0
8519 1.0
8520 1.0
8521 2.0
8522 1.0
Name: Outlet_Type, Length: 8523, dtype: float64
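The cell behind Out[12] is likewise missing; it presumably mirrors In [11] for the Outlet_Location_Type column:

df['Outlet_Location_Type'] = ordEnc.fit_transform(df['Outlet_Location_Type'].values.reshape(-1, 1))
df['Outlet_Location_Type']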
Out[12]: 0 0.0
1 2.0
2 0.0
3 2.0
4 2.0
...
8518 2.0
8519 1.0
8520 1.0
8521 2.0
8522 0.0
Name: Outlet_Location_Type, Length: 8523, dtype: float64
In [13]: df.isna().sum()
In [14]: # fillna() is the method used to place custom values at the NaN entries in a DataFrame or Series
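The cells that actually fill the missing values are not shown; a minimal sketch of typical fillna() usage, under the assumption that Item_Weight and Outlet_Size are the columns containing NaN (the fill values chosen here are illustrative):

# Assumed fill strategy: numeric column with its mean, categorical column with its mode
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])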
In [18]: df.isnull().sum()
In [21]: df.isna().sum()
Out[21]: Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
In [22]: # A box plot is used to find the outliers present in a data set. It is mostly used for univariate analysis.
# It can also be applied to bivariate analysis with one numerical and one categorical variable;
# this is called a grouped boxplot
In [23]: plt.figure(figsize=(10,5))
sns.boxplot(data=df)
plt.xticks(rotation=90)
plt.title('Bigmart Sales Data')
plt.show()
Split the dataset into train and test (20%), apply Linear Regression and calculate RMSE value (1 mark)
In [24]: # train_test_split is a method in sklearn.model_selection
# It is used to create the training and testing data from the complete data
# It takes as parameters the input data, the output data, and
# test_size = the fraction of the data that has to be selected for testing the ML model
# It returns four values - xtrain, xtest, ytrain, ytest - that are given to the ML model for training and testing
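The cell that actually performs this split, fits the Linear Regression model, and computes the RMSE is not shown; a minimal sketch, assuming X holds the feature columns of df, Y holds the Item_Outlet_Sales target, sklearn.metrics is imported as mt (as used in In [33]), and the names model and rmse1 are hypothetical:

import math
import sklearn.metrics as mt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop('Item_Outlet_Sales', axis=1)  # input features
Y = df['Item_Outlet_Sales']               # target
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.2)

model = LinearRegression()
model.fit(xtrain, ytrain)
ypred = model.predict(xtest)
rmse1 = math.sqrt(mt.mean_squared_error(ytest, ypred))
print(f"Root Mean Squared Error (RMSE): {rmse1}")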
Apply StandardScaler and split the dataset into train and test (20%) (1 mark)
In [26]: # StandardScaler standardizes features by removing the mean and scaling to unit variance.
# Standardization of a dataset is a common requirement for many machine learning estimators:
# they might behave badly if the individual features do not more or less look like standard normally distributed data
In [27]: sc = StandardScaler()
df_sc = sc.fit_transform(df)
df1 = pd.DataFrame(df_sc)
df1.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']
X1=df1.drop('Item_Outlet_Sales',axis=1)
Y1=df1['Item_Outlet_Sales']
x1train,x1test,y1train,y1test=train_test_split(X1,Y1,test_size=0.2)
# Create and fit the linear regression model
model1 = LinearRegression()
model1.fit(x1train, y1train)
Out[27]: LinearRegression()
Apply MinMaxScaler, split the dataset into train and test (20%), apply LinearRegression and calculate RMSE (1 mark)
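The MinMaxScaler cells are missing from this export; a minimal sketch mirroring the RobustScaler cell below, with hypothetical names (mmsc, dfm, rmse3):

import math
import pandas as pd
import sklearn.metrics as mt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Scale the whole dataset to the [0, 1] range, keeping the original column names
mmsc = MinMaxScaler()
dfm = pd.DataFrame(mmsc.fit_transform(df), columns=df.columns)
Xm = dfm.drop('Item_Outlet_Sales', axis=1)
Ym = dfm['Item_Outlet_Sales']
xmtrain, xmtest, ymtrain, ymtest = train_test_split(Xm, Ym, test_size=0.2)

# Fit the linear regression model and compute RMSE on the held-out 20%
modelm = LinearRegression()
modelm.fit(xmtrain, ymtrain)
ympred = modelm.predict(xmtest)
rmse3 = math.sqrt(mt.mean_squared_error(ymtest, ympred))
print(f"Root Mean Squared Error (RMSE): {rmse3}")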
Apply RobustScaler, split the dataset into train and test (20%), apply LinearRegression and calculate RMSE (1 mark)
In [32]: # RobustScaler scales features using statistics that are robust to outliers.
# This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range).
# The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
In [33]: rsc = RobustScaler()
df_rsc = rsc.fit_transform(df)
dfr = pd.DataFrame(df_rsc)
dfr.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']
Xr=dfr.drop('Item_Outlet_Sales',axis=1)
Yr=dfr['Item_Outlet_Sales']
xrtrain,xrtest,yrtrain,yrtest=train_test_split(Xr,Yr,test_size=0.2)
# Create and fit the linear regression model
modelr = LinearRegression()
modelr.fit(xrtrain, yrtrain)
# Make predictions on the test set
yrpred = modelr.predict(xrtest)
# Calculate RMSE
rmse4 = math.sqrt(mt.mean_squared_error(yrtest, yrpred))
print(f"Root Mean Squared Error (RMSE): {rmse4}")
Apply MaxAbsScaler, split the dataset into train and test (20%), apply LinearRegression and calculate RMSE (1 mark)
Apply Normalizer, split the dataset into train and test (20%), apply LinearRegression and calculate RMSE (1 mark)
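The MaxAbsScaler and Normalizer cells are also missing; one way to cover both tasks is a small helper that repeats the same scale, split, fit, and RMSE pipeline used above (the helper name and variable names are hypothetical):

import math
import pandas as pd
import sklearn.metrics as mt
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def rmse_with_scaler(scaler, data):
    # Scale the full dataset (target included, as in the cells above), keeping column names
    scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    Xs = scaled.drop('Item_Outlet_Sales', axis=1)
    Ys = scaled['Item_Outlet_Sales']
    xtr, xte, ytr, yte = train_test_split(Xs, Ys, test_size=0.2)
    model = LinearRegression().fit(xtr, ytr)
    return math.sqrt(mt.mean_squared_error(yte, model.predict(xte)))

print("MaxAbsScaler RMSE:", rmse_with_scaler(MaxAbsScaler(), df))
print("Normalizer RMSE:", rmse_with_scaler(Normalizer(), df))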
Define a function valuelabel to place the legend of each bar in the histogram (1 mark)
In [38]: def valuelabel(ax, spacing=3):
    # For each bar: place a label
    for rect in ax.patches:
        # Get X and Y placement of label from rect
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
        # Number of points between bar and label
        space = spacing
        # Vertical alignment for positive values
        va = 'bottom'
        # If value of bar is negative: place label below bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertically align label at top
            va = 'top'
        # Use Y value as label and format number with one decimal place
        label = "{:.1f}".format(y_value)
        # Create annotation
        ax.annotate(
            label,                       # Use `label` as label
            (x_value, y_value),          # Place label at end of the bar
            xytext=(0, space),           # Vertically shift label by `space`
            textcoords="offset points",  # Interpret `xytext` as offset in points
            ha='center',                 # Horizontally center label
            va=va)                       # Vertically align label differently for
                                         # positive and negative values
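A possible usage sketch, assuming the histogram is drawn from the Item_Outlet_Sales column (the actual plotting cell is not shown, so the column choice and bin count are illustrative):

import matplotlib.pyplot as plt

# Draw a histogram and label each bar with its bin count using valuelabel()
ax = df['Item_Outlet_Sales'].plot(kind='hist', bins=10, figsize=(8, 4),
                                  title='Item_Outlet_Sales distribution')
valuelabel(ax)
plt.show()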