ADS Lab Manual

The document outlines the experiments and assignments for the Computer Engineering branch at Chhatrapati Shivaji Maharaj Institute of Technology for the academic year 2024-25. It includes tasks related to data preparation, visualization, modeling, and analysis using various Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn. The document also provides code snippets for data manipulation and visualization, demonstrating the application of statistical techniques on a dataset of cars.


CHHATRAPATI SHIVAJI MAHARAJ INSTITUTE OF TECHNOLOGY
(Affiliated to the University of Mumbai, Approved by AICTE, New Delhi)

Academic Year: 2024-25 Semester: VIII Branch: Computer Engineering


Sr. No.  Title of Experiments

1. Data preparation using NumPy and Pandas.
2. Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn.
3. Data Modelling.
4. Implementation of Statistical Hypothesis Tests using SciPy and scikit-learn.
5. Regression Analysis.
6. To implement classification modelling.
7. Clustering algorithms for unsupervised classification.
8. Using any machine learning technique on an available data set to develop a recommendation system.

Sr. No.  Title of Assignments
1.

2.

3.

4.

5.

6.

Signature of Student Signature of Staff


EXPERIMENT 1

Code:
from google.colab import files
uploaded = files.upload()
#Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
#Loading the csv file into a pandas dataframe
df = pd.read_csv("CARS.csv")
df.head(5)
# Removing irrelevant features
df = df.drop(['Model', 'DriveTrain', 'Invoice', 'Origin', 'Type'], axis=1)
df.head(5)
#To peek at the first five rows.
df.head(5)
#To peek at the last five rows
df.tail(5)
#Finding the null values
print(df.isnull().sum())
#printing the null value rows
df[0:249]
# Filling the rows with the mean of the column
val = df['Cylinders'].mean()
df['Cylinders'][247] = round(val)

val = df['Cylinders'].mean()
df['Cylinders'][248] = round(val)
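# Note (added): the chained indexing above (df['Cylinders'][247] = ...) is what
# raises the SettingWithCopyWarning shown in the output below. A warning-free
# sketch of the same fill step (an alternative, not the manual's original code):
#
#     df.loc[[247, 248], 'Cylinders'] = round(df['Cylinders'].mean())
#     # or, filling every null in the column at once:
#     df['Cylinders'] = df['Cylinders'].fillna(round(df['Cylinders'].mean()))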
# Removing the formatting
df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')

# Detecting and removing outliers using the IQR rule
sns.boxplot(x=df['MSRP'])
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
sns.boxplot(x=df['MSRP'])
df.describe()
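Note: with the pandas version used in this notebook, df.quantile() and the comparison operators silently skip the non-numeric Make column. On newer pandas releases (2.0 and later) they raise an error instead, so a version-safe sketch of the same IQR filter (an assumption about your environment, not part of the original manual) restricts the rule to the numeric columns:

num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
# Keep only rows whose numeric values all lie within 1.5 * IQR of the quartiles.
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]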
Output
Upload widget is only available when the cell has been executed in the current browser
session. Please rerun this cell to enable.
Saving CARS.csv to CARS (2).csv

   Make   Model           Type   Origin  DriveTrain  MSRP     Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura  MDX             SUV    Asia    All         $36,945  $33,337  3.5         6.0        265         17        23           4451    106        189
1  Acura  RSX Type S 2dr  Sedan  Asia    Front       $23,820  $21,761  2.0         4.0        200         24        31           2778    101        172
2  Acura  TSX 4dr         Sedan  Asia    Front       $26,990  $24,647  2.4         4.0        200         22        29           3230    105        183
3  Acura  TL 4dr          Sedan  Asia    Front       $33,195  $30,299  3.2         6.0        270         20        28           3575    108        186
4  Acura  3.5 RL 4dr      Sedan  Asia    Front       $43,755  $39,014  3.5         6.0        225         18        24           3880    115        197

   Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura  $36,945  3.5         6.0        265         17        23           4451    106        189
1  Acura  $23,820  2.0         4.0        200         24        31           2778    101        172
2  Acura  $26,990  2.4         4.0        200         22        29           3230    105        183
3  Acura  $33,195  3.2         6.0        270         20        28           3575    108        186
4  Acura  $43,755  3.5         6.0        225         18        24           3880    115        197

(The second df.head(5) call prints the same five rows again.)

     Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
423  Volvo  $40,565  2.4         5.0        197         21        28           3450    105        186
424  Volvo  $42,565  2.3         5.0        242         20        26           3450    105        186
425  Volvo  $45,210  2.9         6.0        268         19        26           3653    110        190
426  Volvo  $26,135  1.9         4.0        170         22        29           2822    101        180
427  Volvo  $35,145  2.5         5.0        208         20        27           3823    109        186

Output:
Make 0
MSRP 0
EngineSize 0
Cylinders 2
Horsepower 0
MPG_City 0
MPG_Highway 0
Weight 0
Wheelbase 0
Length 0
dtype: int64
     Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura  $36,945  3.5         6.0        265         17        23           4451    106        189
1    Acura  $23,820  2.0         4.0        200         24        31           2778    101        172
2    Acura  $26,990  2.4         4.0        200         22        29           3230    105        183
3    Acura  $33,195  3.2         6.0        270         20        28           3575    108        186
4    Acura  $43,755  3.5         6.0        225         18        24           3880    115        197
...  ...    ...      ...         ...        ...         ...       ...          ...     ...        ...
244  Mazda  $28,750  3.0         6.0        200         18        25           3812    112        188
245  Mazda  $22,388  1.8         4.0        142         23        28           2387    89         156
249 rows × 10 columns

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>
Output:
MSRP 18870.750

Cylinders 2.000
Horsepower 90.000
MPG_City 4.250
MPG_Highway 5.000
Weight 873.750
Wheelbase 9.000
Length 16.000
dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>

       MSRP          EngineSize  Cylinders   Horsepower  MPG_City    MPG_Highway  Weight       Wheelbase   Length
count  341.000000    341.000000  341.000000  341.000000  341.000000  341.000000   341.000000   341.000000  341.000000
mean   29789.439883  3.106158    5.686217    210.513196  19.624633   26.618768    3543.299120  107.803519  186.304985
std    11048.748802  0.890269    1.316664    54.939839   2.928538    3.834994     562.054298   5.905091    11.652099
min    10280.000000  1.300000    4.000000    104.000000  13.000000   17.000000    2403.000000  95.000000   158.000000
25%    21445.000000  2.400000    4.000000    170.000000  18.000000   25.000000    3188.000000  104.000000  178.000000

Conclusion:
Thus, the given data set was prepared, visualized and analyzed using NumPy and Pandas, in such a way that it is now ready for building a model.
EXPERIMENT 2

Code:

from google.colab import files
uploaded = files.upload()

# Importing the required libraries
import pandas as pd
import numpy as np
# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

# Loading the csv file into a pandas dataframe (as in Experiment 1)
df = pd.read_csv("CARS.csv")

# To identify the type of data
df.info()
# Getting the number of instances and features
df.shape

# Getting the dimensions of the data frame
df.ndim

# Removing duplicate data
df = df.drop_duplicates(subset='MSRP', keep='first')
df.count()

# To peek at the first five rows
df.head(5)

# To peek at the last five rows
df.tail(5)

# Finding the null values
print(df.isnull().sum())

# Printing the null value rows
df[0:248]

# Filling the rows with the mean of the column
val = df['Cylinders'].mean()
df['Cylinders'][247] = round(val)

val = df['Cylinders'].mean()
df['Cylinders'][248] = round(val)

# Removing the formatting


df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')

# Detecting outliers
sns.boxplot(x=df['MSRP'])

df.describe()

# Plotting a Histogram
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');

# Plotting a heat map


plt.figure(figsize=(10,5))
c= df.corr()
sns.heatmap(c,cmap="BrBG",annot=True)

# Plotting a scatter plot


fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(df['Horsepower'], df['MSRP'])
plt.title('Scatter plot between MSRP and Horsepower')
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')

plt.show()
Output:
A.
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>
Descriptive statistics (df.describe()):

       MSRP          EngineSize  Cylinders   Horsepower  MPG_City    MPG_Highway  Weight       Wheelbase   Length
count  341.000000    341.000000  341.000000  341.000000  341.000000  341.000000   341.000000   341.000000  341.000000
mean   29789.439883  3.106158    5.686217    210.513196  19.624633   26.618768    3543.299120  107.803519  186.304985
std    11048.748802  0.890269    1.316664    54.939839   2.928538    3.834994     562.054298   5.905091    11.652099
min    10280.000000  1.300000    4.000000    104.000000  13.000000   17.000000    2403.000000  95.000000   158.000000
25%    21445.000000  2.400000    4.000000    170.000000  18.000000   25.000000    3188.000000  104.000000  178.000000
50%    27560.000000  3.000000    6.000000    208.000000  19.000000   26.000000    3470.000000  107.000000  187.000000
75%    36395.000000  3.500000    6.000000    240.000000  21.000000   29.000000    3851.000000  112.000000  193.000000
max    65000.000000  5.700000    8.000000    390.000000  27.000000   36.000000    5270.000000  124.000000  215.000000
Scatter plot:

B. Histogram:
Heat Maps :

Conclusion:
Thus, the data for the given data set was visualized and analyzed using Matplotlib and
Seaborn, in such a way that it is now ready to build a model.
EXPERIMENT 3

Code:
# To make debugging of the logistic_regression module easier we enable the
# imported-modules autoreloading feature.
# By doing this you may change the code of the logistic_regression library and
# all those changes will be available here.
%load_ext autoreload
%autoreload 2

# Add project root folder to module loading paths.


import sys
sys.path.append('../..')
# Import 3rd party dependencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import math

# Import custom logistic regression implementation.
from homemade.logistic_regression import LogisticRegression

# Load the data.
data = pd.read_csv('../../data/mnist-demo.csv')

# Print the data table.


data.head(10)
# How many numbers to display.
numbers_to_display = 25
# Calculate the number of cells that will hold all the numbers.
num_cells = math.ceil(math.sqrt(numbers_to_display))

# Make the plot a little bit bigger than default one.


plt.figure(figsize=(10, 10))

# Go through the first numbers in a training set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit = data[plot_index:plot_index + 1].values
    digit_label = digit[0][0]
    digit_pixels = digit[0][1:]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap='Greys')
    plt.title(digit_label)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()
# Split data set on training and test sets with proportions 80/20.
# Function sample() returns a random sample of items.
pd_train_data = data.sample(frac=0.8)
pd_test_data = data.drop(pd_train_data.index)

# Convert training and testing data from Pandas to NumPy format.


train_data = pd_train_data.values
test_data = pd_test_data.values

# Extract training/test labels and features.


num_training_examples = 6000
x_train = train_data[:num_training_examples, 1:]
y_train = train_data[:num_training_examples, [0]]

x_test = test_data[:, 1:]


y_test = test_data[:, [0]]
# Set up logistic regression parameters.
max_iterations = 10000 # Max number of gradient descent iterations.
regularization_param = 10 # Helps to fight model overfitting.
polynomial_degree = 0 # The degree of additional polynomial features.
sinusoid_degree = 0 # The degree of sinusoid parameter multipliers of additional features.
normalize_data = True # Whether we need to normalize data to make it more uniform or not.

# Init logistic regression instance.
logistic_regression = LogisticRegression(x_train, y_train, polynomial_degree, sinusoid_degree, normalize_data)
# Train logistic regression.
(thetas, costs) = logistic_regression.train(regularization_param, max_iterations)

# Print thetas table.


pd.DataFrame(thetas)
# How many numbers to display.
numbers_to_display = 9

# Calculate the number of cells that will hold all the numbers.
num_cells = math.ceil(math.sqrt(numbers_to_display))

# Make the plot a little bit bigger than default one.


plt.figure(figsize=(10, 10))

# Go through the thetas and print them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit_pixels = thetas[plot_index][1:]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap='Greys')
    plt.title(plot_index)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

# Draw gradient descent progress for each label.
labels = logistic_regression.unique_labels
for index, label in enumerate(labels):
    plt.plot(range(len(costs[index])), costs[index], label=labels[index])

plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
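# Note (added): the test-set visualization below uses y_test_predictions, which
# this manual never computes. A minimal sketch, assuming the homemade
# LogisticRegression class exposes a predict() method that returns one label per
# row (a hypothetical call, since the class itself is not shown in the manual):
y_test_predictions = logistic_regression.predict(x_test)
test_precision = np.sum(y_test_predictions == y_test) / y_test.shape[0] * 100
print('Test precision: {:5.4f}%'.format(test_precision))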

# Make the plot a little bit bigger than default one.
plt.figure(figsize=(15, 15))

# Go through the first numbers in a test set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit_label = y_test[plot_index, 0]
    digit_pixels = x_test[plot_index, :]

    # Predicted label.
    predicted_label = y_test_predictions[plot_index][0]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    color_map = 'Greens' if predicted_label == digit_label else 'Reds'
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap=color_map)
    plt.title(predicted_label)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

Output:

   label  1x1  1x2  1x3  1x4  ...  28x25  28x26  28x27  28x28
0      5    0    0    0    0  ...      0      0      0      0
1      0    0    0    0    0  ...      0      0      0      0
2      4    0    0    0    0  ...      0      0      0      0
3      1    0    0    0    0  ...      0      0      0      0
4      9    0    0    0    0  ...      0      0      0      0
5      2    0    0    0    0  ...      0      0      0      0
6      1    0    0    0    0  ...      0      0      0      0
7      3    0    0    0    0  ...      0      0      0      0
8      1    0    0    0    0  ...      0      0      0      0
9      4    0    0    0    0  ...      0      0      0      0

10 rows × 785 columns

pd.DataFrame(thetas): a table of 10 rows × 785 columns holding the learned parameters, one row per digit class (column 0 contains the bias terms, roughly between -6 and -10.5; the remaining 784 columns contain the small per-pixel weights).
EXPERIMENT 4

Code:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

# histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# histogram plot
pyplot.hist(data)
pyplot.show()
# QQ Plot
from numpy.random import seed
from numpy.random import randn
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# q-q plot
qqplot(data, line='s')
pyplot.show()
# Shapiro-Wilk Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import shapiro
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = shapiro(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# D'Agostino and Pearson's Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import normaltest
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = normaltest(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# Anderson-Darling Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import anderson
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
result = anderson(data)
print('Statistic: %.3f' % result.statistic)
p = 0
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))
Output:
mean=50.303 stdv=4.426

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19:
FutureWarning: pandas.util.testing is deprecated. Use the functions in the public
API at pandas.testing instead.
import pandas.util.testing as tm

Statistics=0.992, p=0.822
Sample looks Gaussian (fail to reject H0)
Statistics=0.102, p=0.950
Sample looks Gaussian (fail to reject H0)

Statistics=0.102, p=0.950
Sample looks Gaussian (fail to reject H0)

Statistic: 0.220
15.000: 0.555, data looks normal (fail to reject H0)
10.000: 0.632, data looks normal (fail to reject H0)
5.000: 0.759, data looks normal (fail to reject H0)
2.500: 0.885, data looks normal (fail to reject H0)
1.000: 1.053, data looks normal (fail to reject H0)

Conclusion:
With the help of this experiment, we now know how to implement statistical hypothesis tests using SciPy and scikit-learn.
EXPERIMENT 5

Code :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # needed later for the confusion-matrix heat maps
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
%matplotlib inline
x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + np.exp(-x)) for x in x_values]
plt.plot(x_values, y_values)
plt.title('Logistic Function')
plt.show()

from google.colab import drive


drive.mount('/content/drive')

data = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 3/Week 2/Day


2/WA_Fn-UseC_-Telco-Customer-Churn.csv")
print("Dataset size")
print("Rows {} Columns {}".format(data.shape[0], data.shape[1]))

print("Columns and data types")


pd.DataFrame(data.dtypes).rename(columns = {0:'dtype'}
### EDA: Independent variables import
pandas as pd
import numpy as np
import matplotlib.pyplot as plt fig =
plt.figure(figsize=(9, 6)) ax = fig.gca()
df.boxplot(column = 'MonthlyCharges', by = 'Churn', ax = ax)
ax.set_ylabel("MonthlyCharges")
plt.show()

fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
df.boxplot(column='tenure', by='Churn', ax=ax)
ax.set_ylabel("Tenure")
plt.show()

df['class'] = df['Churn'].apply(lambda x: 1 if x == "Yes" else 0)

# features will be saved as X and our target will be saved as y
X = df[['tenure', 'MonthlyCharges']].copy()
y = df['class'].copy()
df.shape

#Splitting data into train and test


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X,y , test_size = 0.2, random_state = 0)
print(X_train.shape)
print(X_test.shape)
y_test.value_counts()

#Fitting logistic regression on train data


from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(fit_intercept=True, max_iter=10000)


clf.fit(X_train, y_train)

clf.coef_

# Get the model coefficients
clf.coef_
clf.intercept_

# Evaluating the performance of the trained model
# Get the predicted probabilities
train_preds = clf.predict_proba(X_train)
test_preds = clf.predict_proba(X_test)

X_test

test_preds
# Get the predicted classes
train_class_preds = clf.predict(X_train)
test_class_preds = clf.predict(X_test)

train_class_preds

from sklearn.metrics import accuracy_score, confusion_matrix


# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy) print("The


accuracy on test data is ", test_accuracy)

# Get the confusion matrix for both train and test

labels = ['Retained', 'Churned']


cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks


ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

# Get the confusion matrix for both train and test

labels = ['Retained', 'Churned']


cm = confusion_matrix(y_test, test_class_preds)
print(cm)
sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

# labels, title and ticks


ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_validate

logistic = LogisticRegression()
scoring = ['accuracy']
scores = cross_validate(logistic, X_train, y_train, scoring=scoring, cv=5,
                        return_train_score=True, return_estimator=True, verbose=10)

scores['train_accuracy']

scores['test_accuracy']

scores['estimator']

for model in scores['estimator']:
    print(model.coef_)
OUTPUT:

Mounted at /content/drive

Dataset size
Rows 7043 Columns 21

Columns and data types


dtype
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object

(7043, 22)

(5634, 2)
(1409, 2)

0 4133
1 1501
Name: class, dtype: int64

0 4133
1 1501
Name: class, dtype: int64
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

array([[-0.05646728, 0.03315385]])

array([[-0.05646728, 0.03315385]])

array([[-0.05646728, 0.03315385]])

# Evaluating the performance of the trained model

tenure MonthlyCharges
2200 19 58.20
4627 60 116.60
3225 13 71.95
2828 1 20.45
3768 55 77.75
... ... ...
2631 7 99.25
5333 13 88.35
6972 56 111.95
4598 18 56.25
3065 1 45.80
1409 rows × 2 columns

array([[0.7145149 , 0.2854851 ],
       [0.78522641, 0.21477359],
       [0.53064776, 0.46935224],
       ...,
       [0.77288679, 0.22711321],
       [0.71618111, 0.28381889],
       [0.57740038, 0.42259962]])

array([0.2854851 , 0.21477359, 0.46935224, ..., 0.22711321, 0.28381889,


0.42259962])

array([0, 0, 0, ..., 0, 1, 0])

The accuracy on train data is 0.7857649982250621
The accuracy on test data is 0.7735982966643009

[CV] ................................................................
[CV]..............., accuracy=(train=0.785, test=0.789), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.787, test=0.791), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.788, test=0.771), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.789, test=0.775), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.781, test=0.806), total= 0.0s [Parallel(n_jobs=1)]: Using
backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished

array([0.78500111, 0.78677613, 0.78788551, 0.78877302, 0.78127773])

array([0.78500111, 0.78677613, 0.78788551, 0.78877302, 0.78127773])

(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))

[[-0.05617762 0.03293792]]
[[-0.05562275 0.03215852]]
[[-0.05820295 0.03454813]]
[[-0.05711808 0.03362381]]
[[-0.05530045 0.03257423]]

Conclusion:
We have successfully implemented logistic regression to find the relationship between the variables and applied regression-model techniques to make predictions on the dataset.
EXPERIMENT 6

Code:
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf

from tensorflow import keras


from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/e
xample_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos',
origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
PIL.Image.open(str(roses[1]))
tulips = list(data_dir.glob('tulips/*'))
PIL.Image.open(str(tulips[0]))
PIL.Image.open(str(tulips[1]))

batch_size = 32
img_height = 180
img_width = 180
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
class_names = train_ds.class_names
print(class_names)
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))
num_classes = len(class_names)

model = Sequential([
layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

Output:
Downloading data from
https://storage.googleapis.com/download.tensorflow.org/example_images/flowe
r_photos.tgz
228818944/228813984 [==============================] - 1s 0us/step
228827136/228813984 [==============================] - 1s 0us/step

3670
Found 3670 files belonging to 5 classes. Using
2936 files for training.

Found 3670 files belonging to 5 classes. Using


734 files for validation.

['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']


(32, 180, 180, 3)
(32,)

Model: "sequential"

Layer (type)                    Output Shape           Param #
=================================================================
rescaling_1 (Rescaling)         (None, 180, 180, 3)    0
conv2d (Conv2D)                 (None, 180, 180, 16)   448
max_pooling2d (MaxPooling2D)    (None, 90, 90, 16)     0
conv2d_1 (Conv2D)               (None, 90, 90, 32)     4640
max_pooling2d_1 (MaxPooling2D)  (None, 45, 45, 32)     0
conv2d_2 (Conv2D)               (None, 45, 45, 64)     18496
max_pooling2d_2 (MaxPooling2D)  (None, 22, 22, 64)     0
flatten (Flatten)               (None, 30976)          0
dense (Dense)                   (None, 128)            3965056
dense_1 (Dense)                 (None, 5)              645
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0

Epoch 1/10
92/92 [==============================] - 13s 38ms/step - loss: 1.3093 -
accuracy: 0.4349 - val_loss: 1.1781 - val_accuracy: 0.5232 Epoch
2/10
92/92 [==============================] - 2s 24ms/step - loss: 1.0252 -
accuracy: 0.5940 - val_loss: 0.9801 - val_accuracy: 0.6131 Epoch
3/10
92/92 [==============================] - 2s 23ms/step - loss: 0.8460 -
accuracy: 0.6737 - val_loss: 0.9532 - val_accuracy: 0.6185 Epoch
4/10
92/92 [==============================] - 2s 23ms/step - loss: 0.6524 -
accuracy: 0.7653 - val_loss: 0.9262 - val_accuracy: 0.6526 Epoch
5/10
92/92 [==============================] - 2s 23ms/step - loss: 0.4360 -
accuracy: 0.8457 - val_loss: 1.0237 - val_accuracy: 0.6403 Epoch
6/10
92/92 [==============================] - 2s 23ms/step - loss: 0.2660 -
accuracy: 0.9111 - val_loss: 1.1619 - val_accuracy: 0.6226 Epoch
7/10
92/92 [==============================] - 2s 23ms/step - loss: 0.1573 -
accuracy: 0.9527 - val_loss: 1.4132 - val_accuracy: 0.6158 Epoch
8/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0708 -
accuracy: 0.9802 - val_loss: 1.5212 - val_accuracy: 0.6308 Epoch
9/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0446 -
accuracy: 0.9874 - val_loss: 1.6525 - val_accuracy: 0.6349 Epoch
10/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0271 -
accuracy: 0.9942 - val_loss: 1.7895 - val_accuracy: 0.6349
Conclusion:
Now we know how to implement a classification model using TensorFlow.
EXPERIMENT 7
Code:
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
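# Note (added): the manual calls df.info() below without ever loading df.
# A minimal load step, assuming the Kaggle "Mall Customer Segmentation" file
# that matches the columns shown in the output; the exact path and filename
# are an assumption, not part of the original manual:
df = pd.read_csv('../input/Mall_Customers.csv')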
print(os.listdir("../input")) df.info()
df.rename(index=str, columns={'Annual
Income (k$)': 'Income', 'Spending Score (1-
100)': 'Score'}, inplace=True)
# Let's see our data in a detailed way with pairplot
X = df.drop(['CustomerID', 'Gender'], axis=1)
sns.pairplot(df.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
plt.show()

# K-Means
from sklearn.cluster import KMeans

clusters = []
for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(X)
    clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax)
ax.set_title('Searching for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')

# Annotate arrow
ax.annotate('Possible Elbow Point', xy=(3, 140000), xytext=(3, 50000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
ax.annotate('Possible Elbow Point', xy=(5, 80000), xytext=(5, 150000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
plt.show()

# 3 clusters
km3 = KMeans(n_clusters=3).fit(X)
X['Labels'] = km3.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'], palette=sns.color_palette('hls', 3))
plt.title('KMeans with 3 Clusters')
plt.show()
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
sns.swarmplot(x='Labels', y='Income', data=X, ax=ax)
ax.set_title('Labels According to Annual Income')

ax = fig.add_subplot(122)
sns.swarmplot(x='Labels', y='Score', data=X, ax=ax)
ax.set_title('Labels According to Scoring History')

plt.show()
# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
X['Labels'] = agglom.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', 5))
plt.title('Agglomerative with 5 Clusters')
plt.show()

from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
print(dist)
Z = hierarchy.linkage(dist, 'average')
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation='right')
# Density Based Clustering (DBSCAN)
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)
X['Labels'] = db.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', np.unique(db.labels_).shape[0]))
plt.title('DBSCAN with epsilon 11, min samples 6')
plt.show()
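A note on the code above: X keeps the 'Labels' column written after K-Means when it is later passed to AgglomerativeClustering and DBSCAN, so the earlier cluster assignments leak into the later fits. A cleaner sketch (a suggested fix, not part of the original manual) drops that column before each new fit:

features = X.drop('Labels', axis=1, errors='ignore')  # hypothetical cleanup step
agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(features)
db = DBSCAN(eps=11, min_samples=6).fit(features)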
OUTPUT:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID                200 non-null int64
Gender                    200 non-null object
Age                       200 non-null int64
Annual Income (k$)        200 non-null int64
Spending Score (1-100)    200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
[[ 0. 42.05948169 33.03028913 ... 117.12813496 124.53915047
130.17296186]
[ 42.05948169 0. 75.01999733 ... 111.76761606 137.77880824
122.35195135]
[ 33.03028913 75.01999733 0. ... 129.89226305 122.24974438
143.78456106]
...
[117.12813496 111.76761606 129.89226305 ... 0. 57.10516614
14.35270009]
[124.53915047 137.77880824 122.24974438 ... 57.10516614 0.
65.06150936]
[130.17296186 122.35195135 143.78456106 ... 14.35270009 65.06150936
0. ]]
Conclusion:
With the help of this experiment, we implemented clustering algorithms for unsupervised classification and plotted the clustered data for each algorithm.
EXPERIMENT 8
Code :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# %matplotlib inline
plt.style.use("ggplot")

import sklearn
from sklearn.decomposition import TruncatedSVD
amazon_ratings = pd.read_csv('../input/amazon-ratings/ratings_Beauty.csv')
amazon_ratings = amazon_ratings.dropna()
amazon_ratings.head()
amazon_ratings.shape
popular_products = pd.DataFrame(amazon_ratings.groupby('ProductId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(10)
most_popular.head(30).plot(kind = "bar")
amazon_ratings1 = amazon_ratings.head(10000)
ratings_utility_matrix = amazon_ratings1.pivot_table(values='Rating', index='UserId', columns='ProductId', fill_value=0)
ratings_utility_matrix.head()
ratings_utility_matrix.shape

X = ratings_utility_matrix.T
X.head()
X.shape
X1 = X

SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
X.index[99]
i = "6117036094"
product_names = list(X.index)
product_ID = product_names.index(i)
product_ID
correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape

Recommend = list(X.index[correlation_product_ID > 0.90])

# Removes the item already bought by the customer


Recommend.remove(i)
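# Note (added): the manual never shows the display step that produces the list
# of recommended product IDs in the output below; presumably it comes from
# printing the first entries of Recommend, e.g. (a hypothetical display step):
print(Recommend[0:9])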

# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

product_descriptions = pd.read_csv('../input/home-depot-product-search-relevance/product_descriptions.csv')
product_descriptions.shape

# Missing values
product_descriptions = product_descriptions.dropna()
product_descriptions.shape
product_descriptions.head()

product_descriptions1 = product_descriptions.head(500)

vectorizer = TfidfVectorizer(stop_words='english')
X1 = vectorizer.fit_transform(product_descriptions1["product_description"])
X1
# Fitting K-Means to the dataset
X = X1
# Optimal number of clusters
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X1)
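# Note (added): the loop and show_recommendations() below call print_cluster(),
# which is never defined in this manual. A minimal sketch consistent with the
# "Top terms per cluster" output (a hypothetical helper; it relies on the
# order_centroids and terms variables computed in the next cell) would be:
def print_cluster(i):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()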

print("Top terms per cluster:")


order_centroids = model.cluster_centers_.argsort()[:, ::-1] terms =
vectorizer.get_feature_names()
for i in range(true_k):
print_cluster(i)
def show_recommendations(product):
#print("Cluster ID:")
Y = vectorizer.transform([product]) prediction
= model.predict(Y) #print(prediction)
print_cluster(prediction[0])
show_recommendations("cutting tool")
show_recommendations("spray paint")
show_recommendations("steel drill")
OUTPUT :

(2023070, 4)

<matplotlib.axes._subplots.AxesSubplot at 0x7fd439e493c8>
(9697, 886)

# Decomposing the Matrix
(886, 9697)

# Correlation Matrix
(886, 10)

# Isolating Product ID 6117036094 from the Correlation Matrix
'6117036094'

# Index of the product ID purchased by customer 99
(886,)
#Recommending top 10 highly correlated products in sequence
['0733001998',
'1304139212',
'1304139220',
'130414089X',
'130414643X',
'130414674X',
'1304174778',
'1304174867',
'1304174905']

(124428, 2)

0 Not only do angles make joints stronger, they ...


1 BEHR Premium Textured DECKOVER is an innovativ...
2 Classic architecture meets contemporary design...
3 The Grape Solar 265-Watt Polycrystalline PV So...
4 Update your bathroom with the Delta Vero Singl...
5 Achieving delicious results is almost effortle...
6 The Quantum Adjustable 2-Light LED Black Emerg...
7 The Teks #10 x 1-1/2 in. Zinc-Plated Steel Was...
8 Get the House of Fara 3/4 in. x 3 in. x 8 ft. ...
9 Valley View Industries Metal Stakes (4-Pack) a...
Name: product_description, dtype: object

<500x8932 sparse matrix of type '<class 'numpy.float64'>'


with 34817 stored elements in Compressed Sparse Row format>

Top terms per cluster:
Cluster 0: concrete stake ft coating apply epoxy drying sq garage formula
Cluster 1: wood patio bamboo natural frame outdoor rug size steel dining
Cluster 2: used trim painted 65 proposition nbsp residents california project 32
Cluster 3: door lbs easy dog nickle solid roof plastic house adjustable
Cluster 4: cutting saw tool blade design cut pliers grip metal non
Cluster 5: wall piece finish tile design use color easy installation water
Cluster 6: light watt bulb led fixture volt bulbs lighting use power
Cluster 7: helps water easy snow handle nozzle year features tool control
Cluster 8: air ft water unit room installation fan cooling use easy
Cluster 9: post fence gate ft screen vinyl posts aluminum brackets spline
# Keyword: cutting tool
Cluster 4: cutting saw tool blade design cut pliers grip metal non

# Keyword: spray paint
Cluster 2: used trim painted 65 proposition nbsp residents california project 32

# Keyword: steel drill
Cluster 8: air ft water unit room installation fan cooling use easy

Conclusion:
With the help of this experiment, we used machine learning techniques on an available data set to develop a recommendation system.
