ADS Lab Manual

The document outlines the experiments and assignments for the Computer Engineering branch at Chhatrapati Shivaji Maharaj Institute of Technology for the academic year 2024-25. It includes tasks related to data preparation, visualization, modeling, and analysis using various Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn. The document also provides code snippets for data manipulation and visualization, demonstrating the application of statistical techniques on a dataset of cars.


CHHATRAPATI SHIVAJI MAHARAJ INSTITUTE OF TECHNOLOGY
(Affiliated to the University of Mumbai, Approved by AICTE, New Delhi)

Academic Year: 2024-25 Semester: VIII Branch: Computer Engineering


Sr. No.  Title of Experiments

1. Data preparation using NumPy and Pandas.
2. Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn.
3. Data Modelling.
4. Implementation of Statistical Hypothesis Tests using SciPy and scikit-learn.
5. Regression Analysis.
6. To implement classification modelling.
7. Clustering algorithms for unsupervised classification.
8. Using any machine learning technique on an available data set to develop a recommendation system.

Sr. No.  Title of Assignments
1.

2.

3.

4.

5.

6.

Signature of Student Signature of Staff


EXPERIMENT 1

Code:
from google.colab import files
uploaded = files.upload()
#Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
#Loading the csv file into a pandas dataframe
df = pd.read_csv("CARS.csv")
df.head(5)
# Removing irrelevant features
df = df.drop(['Model', 'DriveTrain', 'Invoice', 'Origin', 'Type'], axis=1)
df.head(5)
#To peek at the first five rows.
df.head(5)
#To peek at the last five rows
df.tail(5)
#Finding the null values
print(df.isnull().sum())
#printing the null value rows
df[0:249]
# Filling the rows with the mean of the column
val = df['Cylinders'].mean()
df['Cylinders'][247] = round(val)

val = df['Cylinders'].mean()
df['Cylinders'][248] = round(val)
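# Note (added): the chained indexing above (df['Cylinders'][247] = ...) is what
# raises the SettingWithCopyWarning shown in the output below. A warning-free
# sketch of the same fill step (an alternative, not the manual's original code):
#
#     df.loc[[247, 248], 'Cylinders'] = round(df['Cylinders'].mean())
#     # or, filling every null in the column at once:
#     df['Cylinders'] = df['Cylinders'].fillna(round(df['Cylinders'].mean()))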
# Removing the formatting
df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')

# Detecting and removing outliers using the IQR rule
sns.boxplot(x=df['MSRP'])
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
sns.boxplot(x=df['MSRP'])
df.describe()
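Note: with the pandas version used in this notebook, df.quantile() and the comparison operators silently skip the non-numeric Make column. On newer pandas releases (2.0 and later) they raise an error instead, so a version-safe sketch of the same IQR filter (an assumption about your environment, not part of the original manual) restricts the rule to the numeric columns:

num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
# Keep only rows whose numeric values all lie within 1.5 * IQR of the quartiles.
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]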
Output
Upload widget is only available when the cell has been executed in the current browser
session. Please rerun this cell to enable.
Saving CARS.csv to CARS (2).csv

   Make   Model           Type   Origin  DriveTrain  MSRP     Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura  MDX             SUV    Asia    All         $36,945  $33,337  3.5         6.0        265         17        23           4451    106        189
1  Acura  RSX Type S 2dr  Sedan  Asia    Front       $23,820  $21,761  2.0         4.0        200         24        31           2778    101        172
2  Acura  TSX 4dr         Sedan  Asia    Front       $26,990  $24,647  2.4         4.0        200         22        29           3230    105        183
3  Acura  TL 4dr          Sedan  Asia    Front       $33,195  $30,299  3.2         6.0        270         20        28           3575    108        186
4  Acura  3.5 RL 4dr      Sedan  Asia    Front       $43,755  $39,014  3.5         6.0        225         18        24           3880    115        197

   Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura  $36,945  3.5         6.0        265         17        23           4451    106        189
1  Acura  $23,820  2.0         4.0        200         24        31           2778    101        172
2  Acura  $26,990  2.4         4.0        200         22        29           3230    105        183
3  Acura  $33,195  3.2         6.0        270         20        28           3575    108        186
4  Acura  $43,755  3.5         6.0        225         18        24           3880    115        197

(The second df.head(5) call prints the same five rows again.)

     Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
423  Volvo  $40,565  2.4         5.0        197         21        28           3450    105        186
424  Volvo  $42,565  2.3         5.0        242         20        26           3450    105        186
425  Volvo  $45,210  2.9         6.0        268         19        26           3653    110        190
426  Volvo  $26,135  1.9         4.0        170         22        29           2822    101        180
427  Volvo  $35,145  2.5         5.0        208         20        27           3823    109        186

Output:
Make 0
MSRP 0
EngineSize 0
Cylinders 2
Horsepower 0
MPG_City 0
MPG_Highway 0
Weight 0
Wheelbase 0
Length 0
dtype: int64
     Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0    Acura  $36,945  3.5         6.0        265         17        23           4451    106        189
1    Acura  $23,820  2.0         4.0        200         24        31           2778    101        172
2    Acura  $26,990  2.4         4.0        200         22        29           3230    105        183
3    Acura  $33,195  3.2         6.0        270         20        28           3575    108        186
4    Acura  $43,755  3.5         6.0        225         18        24           3880    115        197
...  ...    ...      ...         ...        ...         ...       ...          ...     ...        ...
244  Mazda  $28,750  3.0         6.0        200         18        25           3812    112        188
245  Mazda  $22,388  1.8         4.0        142         23        28           2387    89         156
249 rows × 10 columns

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>
Output:
MSRP 18870.750

Cylinders 2.000
Horsepower 90.000
MPG_City 4.250
MPG_Highway 5.000
Weight 873.750
Wheelbase 9.000
Length 16.000
dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>

       MSRP          EngineSize  Cylinders   Horsepower  MPG_City    MPG_Highway  Weight       Wheelbase   Length
count  341.000000    341.000000  341.000000  341.000000  341.000000  341.000000   341.000000   341.000000  341.000000
mean   29789.439883  3.106158    5.686217    210.513196  19.624633   26.618768    3543.299120  107.803519  186.304985
std    11048.748802  0.890269    1.316664    54.939839   2.928538    3.834994     562.054298   5.905091    11.652099
min    10280.000000  1.300000    4.000000    104.000000  13.000000   17.000000    2403.000000  95.000000   158.000000
25%    21445.000000  2.400000    4.000000    170.000000  18.000000   25.000000    3188.000000  104.000000  178.000000

Conclusion:
Thus, the given data set was prepared, visualized and analyzed using NumPy and Pandas, in such a way that it is now ready for building a model.
EXPERIMENT 2

Code:

from google.colab import files
uploaded = files.upload()

# Importing the required libraries
import pandas as pd
import numpy as np
# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

# Loading the csv file into a pandas dataframe (as in Experiment 1)
df = pd.read_csv("CARS.csv")

# To identify the type of data
df.info()
# Getting the number of instances and features
df.shape

# Getting the dimensions of the data frame
df.ndim

# Removing duplicate data
df = df.drop_duplicates(subset='MSRP', keep='first')
df.count()

# To peek at the first five rows
df.head(5)

# To peek at the last five rows
df.tail(5)

# Finding the null values
print(df.isnull().sum())

# Printing the null value rows
df[0:248]

# Filling the rows with the mean of the column
val = df['Cylinders'].mean()
df['Cylinders'][247] = round(val)

val = df['Cylinders'].mean()
df['Cylinders'][248] = round(val)

# Removing the formatting


df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')

# Detecting outliers
sns.boxplot(x=df['MSRP'])

df.describe()

# Plotting a Histogram
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');

# Plotting a heat map


plt.figure(figsize=(10,5))
c= df.corr()
sns.heatmap(c,cmap="BrBG",annot=True)

# Plotting a scatter plot


fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(df['Horsepower'], df['MSRP'])
plt.title('Scatter plot between MSRP and Horsepower')
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')

plt.show()
Output:
A.
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>

<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>
Descriptive statistics (df.describe()):

       MSRP          EngineSize  Cylinders   Horsepower  MPG_City    MPG_Highway  Weight       Wheelbase   Length
count  341.000000    341.000000  341.000000  341.000000  341.000000  341.000000   341.000000   341.000000  341.000000
mean   29789.439883  3.106158    5.686217    210.513196  19.624633   26.618768    3543.299120  107.803519  186.304985
std    11048.748802  0.890269    1.316664    54.939839   2.928538    3.834994     562.054298   5.905091    11.652099
min    10280.000000  1.300000    4.000000    104.000000  13.000000   17.000000    2403.000000  95.000000   158.000000
25%    21445.000000  2.400000    4.000000    170.000000  18.000000   25.000000    3188.000000  104.000000  178.000000
50%    27560.000000  3.000000    6.000000    208.000000  19.000000   26.000000    3470.000000  107.000000  187.000000
75%    36395.000000  3.500000    6.000000    240.000000  21.000000   29.000000    3851.000000  112.000000  193.000000
max    65000.000000  5.700000    8.000000    390.000000  27.000000   36.000000    5270.000000  124.000000  215.000000
Scatter plot:

B. Histogram:
Heat Maps :

Conclusion:
Thus, the data for the given data set was visualized and analyzed using Matplotlib and
Seaborn, in such a way that it is now ready to build a model.
EXPERIMENT 3

Code:
# To make debugging of the logistic_regression module easier we enable the
# imported-modules autoreloading feature.
# By doing this you may change the code of the logistic_regression library and
# all those changes will be available here.
%load_ext autoreload
%autoreload 2

# Add project root folder to module loading paths.


import sys
sys.path.append('../..')
# Import 3rd party dependencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import math

# Import custom logistic regression implementation.
from homemade.logistic_regression import LogisticRegression

# Load the data.
data = pd.read_csv('../../data/mnist-demo.csv')

# Print the data table.


data.head(10)
# How many numbers to display.
numbers_to_display = 25
# Calculate the number of cells that will hold all the numbers.
num_cells = math.ceil(math.sqrt(numbers_to_display))

# Make the plot a little bit bigger than default one.


plt.figure(figsize=(10, 10))

# Go through the first numbers in a training set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit = data[plot_index:plot_index + 1].values
    digit_label = digit[0][0]
    digit_pixels = digit[0][1:]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap='Greys')
    plt.title(digit_label)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()
# Split data set on training and test sets with proportions 80/20.
# Function sample() returns a random sample of items.
pd_train_data = data.sample(frac=0.8)
pd_test_data = data.drop(pd_train_data.index)

# Convert training and testing data from Pandas to NumPy format.


train_data = pd_train_data.values
test_data = pd_test_data.values

# Extract training/test labels and features.


num_training_examples = 6000
x_train = train_data[:num_training_examples, 1:]
y_train = train_data[:num_training_examples, [0]]

x_test = test_data[:, 1:]


y_test = test_data[:, [0]]
# Set up logistic regression parameters.
max_iterations = 10000 # Max number of gradient descent iterations.
regularization_param = 10 # Helps to fight model overfitting.
polynomial_degree = 0 # The degree of additional polynomial features.
sinusoid_degree = 0 # The degree of sinusoid parameter multipliers of additional features.
normalize_data = True # Whether we need to normalize data to make it more uniform or not.

# Init logistic regression instance.
logistic_regression = LogisticRegression(x_train, y_train, polynomial_degree, sinusoid_degree, normalize_data)
# Train logistic regression.
(thetas, costs) = logistic_regression.train(regularization_param, max_iterations)

# Print thetas table.


pd.DataFrame(thetas)
# How many numbers to display.
numbers_to_display = 9

# Calculate the number of cells that will hold all the numbers.
num_cells = math.ceil(math.sqrt(numbers_to_display))

# Make the plot a little bit bigger than default one.


plt.figure(figsize=(10, 10))

# Go through the thetas and print them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit_pixels = thetas[plot_index][1:]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap='Greys')
    plt.title(plot_index)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

# Draw gradient descent progress for each label.
labels = logistic_regression.unique_labels
for index, label in enumerate(labels):
    plt.plot(range(len(costs[index])), costs[index], label=labels[index])

plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
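# Note (added): the test-set visualization below uses y_test_predictions, which
# this manual never computes. A minimal sketch, assuming the homemade
# LogisticRegression class exposes a predict() method that returns one label per
# row (a hypothetical call, since the class itself is not shown in the manual):
y_test_predictions = logistic_regression.predict(x_test)
test_precision = np.sum(y_test_predictions == y_test) / y_test.shape[0] * 100
print('Test precision: {:5.4f}%'.format(test_precision))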

# Make the plot a little bit bigger than default one.
plt.figure(figsize=(15, 15))

# Go through the first numbers in a test set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit_label = y_test[plot_index, 0]
    digit_pixels = x_test[plot_index, :]

    # Predicted label.
    predicted_label = y_test_predictions[plot_index][0]

    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))

    # Convert image vector into the matrix of pixels.
    frame = digit_pixels.reshape((image_size, image_size))

    # Plot the number matrix.
    color_map = 'Greens' if predicted_label == digit_label else 'Reds'
    plt.subplot(num_cells, num_cells, plot_index + 1)
    plt.imshow(frame, cmap=color_map)
    plt.title(predicted_label)
    plt.tick_params(axis='both', which='both', bottom=False, left=False,
                    labelbottom=False, labelleft=False)

# Plot all subplots.
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

Output:

   label  1x1  1x2  1x3  1x4  ...  28x25  28x26  28x27  28x28
0      5    0    0    0    0  ...      0      0      0      0
1      0    0    0    0    0  ...      0      0      0      0
2      4    0    0    0    0  ...      0      0      0      0
3      1    0    0    0    0  ...      0      0      0      0
4      9    0    0    0    0  ...      0      0      0      0
5      2    0    0    0    0  ...      0      0      0      0
6      1    0    0    0    0  ...      0      0      0      0
7      3    0    0    0    0  ...      0      0      0      0
8      1    0    0    0    0  ...      0      0      0      0
9      4    0    0    0    0  ...      0      0      0      0

10 rows × 785 columns

pd.DataFrame(thetas): a table of 10 rows × 785 columns holding the learned parameters, one row per digit class (column 0 contains the bias terms, roughly between -6 and -10.5; the remaining 784 columns contain the small per-pixel weights).
EXPERIMENT 4

Code:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

# histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# histogram plot
pyplot.hist(data)
pyplot.show()
# QQ Plot
from numpy.random import seed
from numpy.random import randn
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# q-q plot
qqplot(data, line='s')
pyplot.show()
# Shapiro-Wilk Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import shapiro
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = shapiro(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# D'Agostino and Pearson's Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import normaltest
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = normaltest(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# Anderson-Darling Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import anderson
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
result = anderson(data)
print('Statistic: %.3f' % result.statistic)
p = 0
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))
Output:
mean=50.303 stdv=4.426

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19:
FutureWarning: pandas.util.testing is deprecated. Use the functions in the public
API at pandas.testing instead.
import pandas.util.testing as tm

Statistics=0.992, p=0.822
Sample looks Gaussian (fail to reject H0)
Statistics=0.102, p=0.950
Sample looks Gaussian (fail to reject H0)

Statistics=0.102, p=0.950
Sample looks Gaussian (fail to reject H0)

Statistic: 0.220
15.000: 0.555, data looks normal (fail to reject H0)
10.000: 0.632, data looks normal (fail to reject H0)
5.000: 0.759, data looks normal (fail to reject H0)
2.500: 0.885, data looks normal (fail to reject H0)
1.000: 1.053, data looks normal (fail to reject H0)

Conclusion:
With the help of this experiment, we now know how to implement statistical hypothesis tests using SciPy and scikit-learn.
EXPERIMENT 5

Code :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # needed later for the confusion-matrix heat maps
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
%matplotlib inline
x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + np.exp(-x)) for x in x_values]
plt.plot(x_values, y_values)
plt.title('Logistic Function')
plt.show()

from google.colab import drive


drive.mount('/content/drive')

data = pd.read_csv("/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 3/Week 2/Day


2/WA_Fn-UseC_-Telco-Customer-Churn.csv")
print("Dataset size")
print("Rows {} Columns {}".format(data.shape[0], data.shape[1]))

print("Columns and data types")


pd.DataFrame(data.dtypes).rename(columns = {0:'dtype'}
### EDA: Independent variables import
pandas as pd
import numpy as np
import matplotlib.pyplot as plt fig =
plt.figure(figsize=(9, 6)) ax = fig.gca()
df.boxplot(column = 'MonthlyCharges', by = 'Churn', ax = ax)
ax.set_ylabel("MonthlyCharges")
plt.show()

fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
df.boxplot(column='tenure', by='Churn', ax=ax)
ax.set_ylabel("Tenure")
plt.show()

df['class'] = df['Churn'].apply(lambda x: 1 if x == "Yes" else 0)

# features will be saved as X and our target will be saved as y
X = df[['tenure', 'MonthlyCharges']].copy()
y = df['class'].copy()
df.shape

#Splitting data into train and test


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X,y , test_size = 0.2, random_state = 0)
print(X_train.shape)
print(X_test.shape)
y_test.value_counts()

#Fitting logistic regression on train data


from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(fit_intercept=True, max_iter=10000)


clf.fit(X_train, y_train)

clf.coef_

# Get the model coefficients
clf.coef_
clf.intercept_

# Evaluating the performance of the trained model
# Get the predicted probabilities
train_preds = clf.predict_proba(X_train)
test_preds = clf.predict_proba(X_test)

X_test

test_preds
# Get the predicted classes
train_class_preds = clf.predict(X_train)
test_class_preds = clf.predict(X_test)

train_class_preds

from sklearn.metrics import accuracy_score, confusion_matrix


# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy) print("The


accuracy on test data is ", test_accuracy)

# Get the confusion matrix for both train and test

labels = ['Retained', 'Churned']


cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks


ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

# Get the confusion matrix for both train and test

labels = ['Retained', 'Churned']


cm = confusion_matrix(y_test, test_class_preds)
print(cm)
sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

# labels, title and ticks


ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_validate

logistic = LogisticRegression()
scoring = ['accuracy']
scores = cross_validate(logistic, X_train, y_train, scoring=scoring, cv=5,
                        return_train_score=True, return_estimator=True, verbose=10)

scores['train_accuracy']

scores['test_accuracy']

scores['estimator']

for model in scores['estimator']:
    print(model.coef_)
OUTPUT:

Mounted at /content/drive

Dataset size
Rows 7043 Columns 21

Columns and data types


dtype
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object

(7043, 22)

(5634, 2)
(1409, 2)

0 4133
1 1501
Name: class, dtype: int64

0 4133
1 1501
Name: class, dtype: int64
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

array([[-0.05646728, 0.03315385]])

array([[-0.05646728, 0.03315385]])

array([[-0.05646728, 0.03315385]])

# Evaluating the performance of the trained model

tenure MonthlyCharges
2200 19 58.20
4627 60 116.60
3225 13 71.95
2828 1 20.45
3768 55 77.75
... ... ...
2631 7 99.25
5333 13 88.35
6972 56 111.95
4598 18 56.25
3065 1 45.80
1409 rows × 2 columns

array([[0.7145149 , 0.2854851 ],
       [0.78522641, 0.21477359],
       [0.53064776, 0.46935224],
       ...,
       [0.77288679, 0.22711321],
       [0.71618111, 0.28381889],
       [0.57740038, 0.42259962]])

array([0.2854851 , 0.21477359, 0.46935224, ..., 0.22711321, 0.28381889,


0.42259962])

array([0, 0, 0, ..., 0, 1, 0])

The accuracy on train data is 0.7857649982250621
The accuracy on test data is 0.7735982966643009

[CV] ................................................................
[CV]..............., accuracy=(train=0.785, test=0.789), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.787, test=0.791), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.788, test=0.771), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.789, test=0.775), total= 0.0s [CV]
................................................................
[CV]..............., accuracy=(train=0.781, test=0.806), total= 0.0s [Parallel(n_jobs=1)]: Using
backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished

array([0.78500111, 0.78677613, 0.78788551, 0.78877302, 0.78127773])

array([0.78500111, 0.78677613, 0.78788551, 0.78877302, 0.78127773])

(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))

[[-0.05617762 0.03293792]]
[[-0.05562275 0.03215852]]
[[-0.05820295 0.03454813]]
[[-0.05711808 0.03362381]]
[[-0.05530045 0.03257423]]

Conclusion:
We have successfully implemented logistic regression to find the relationship between the variables and applied regression-model techniques to make predictions on the dataset.
EXPERIMENT 6

Code:
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf

from tensorflow import keras


from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/e
xample_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos',
origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
PIL.Image.open(str(roses[1]))
tulips = list(data_dir.glob('tulips/*'))
PIL.Image.open(str(tulips[0]))
PIL.Image.open(str(tulips[1]))

batch_size = 32
img_height = 180
img_width = 180
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
class_names = train_ds.class_names
print(class_names)
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))
num_classes = len(class_names)

model = Sequential([
layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()
epochs = 10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

Output:
Downloading data from
https://storage.googleapis.com/download.tensorflow.org/example_images/flowe
r_photos.tgz
228818944/228813984 [==============================] - 1s 0us/step
228827136/228813984 [==============================] - 1s 0us/step

3670
Found 3670 files belonging to 5 classes. Using
2936 files for training.

Found 3670 files belonging to 5 classes. Using


734 files for validation.

['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']


(32, 180, 180, 3)
(32,)

Model: "sequential"

Layer (type)                    Output Shape           Param #
=================================================================
rescaling_1 (Rescaling)         (None, 180, 180, 3)    0
conv2d (Conv2D)                 (None, 180, 180, 16)   448
max_pooling2d (MaxPooling2D)    (None, 90, 90, 16)     0
conv2d_1 (Conv2D)               (None, 90, 90, 32)     4640
max_pooling2d_1 (MaxPooling2D)  (None, 45, 45, 32)     0
conv2d_2 (Conv2D)               (None, 45, 45, 64)     18496
max_pooling2d_2 (MaxPooling2D)  (None, 22, 22, 64)     0
flatten (Flatten)               (None, 30976)          0
dense (Dense)                   (None, 128)            3965056
dense_1 (Dense)                 (None, 5)              645
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0

Epoch 1/10
92/92 [==============================] - 13s 38ms/step - loss: 1.3093 -
accuracy: 0.4349 - val_loss: 1.1781 - val_accuracy: 0.5232 Epoch
2/10
92/92 [==============================] - 2s 24ms/step - loss: 1.0252 -
accuracy: 0.5940 - val_loss: 0.9801 - val_accuracy: 0.6131 Epoch
3/10
92/92 [==============================] - 2s 23ms/step - loss: 0.8460 -
accuracy: 0.6737 - val_loss: 0.9532 - val_accuracy: 0.6185 Epoch
4/10
92/92 [==============================] - 2s 23ms/step - loss: 0.6524 -
accuracy: 0.7653 - val_loss: 0.9262 - val_accuracy: 0.6526 Epoch
5/10
92/92 [==============================] - 2s 23ms/step - loss: 0.4360 -
accuracy: 0.8457 - val_loss: 1.0237 - val_accuracy: 0.6403 Epoch
6/10
92/92 [==============================] - 2s 23ms/step - loss: 0.2660 -
accuracy: 0.9111 - val_loss: 1.1619 - val_accuracy: 0.6226 Epoch
7/10
92/92 [==============================] - 2s 23ms/step - loss: 0.1573 -
accuracy: 0.9527 - val_loss: 1.4132 - val_accuracy: 0.6158 Epoch
8/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0708 -
accuracy: 0.9802 - val_loss: 1.5212 - val_accuracy: 0.6308 Epoch
9/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0446 -
accuracy: 0.9874 - val_loss: 1.6525 - val_accuracy: 0.6349 Epoch
10/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0271 -
accuracy: 0.9942 - val_loss: 1.7895 - val_accuracy: 0.6349
Conclusion:
Now we know how to implement a classification model using TensorFlow.
EXPERIMENT 7
Code:
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
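# Note (added): the manual calls df.info() below without ever loading df.
# A minimal load step, assuming the Kaggle "Mall Customer Segmentation" file
# that matches the columns shown in the output; the exact path and filename
# are an assumption, not part of the original manual:
df = pd.read_csv('../input/Mall_Customers.csv')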
print(os.listdir("../input")) df.info()
df.rename(index=str, columns={'Annual
Income (k$)': 'Income', 'Spending Score (1-
100)': 'Score'}, inplace=True)
# Let's see our data in a detailed way with pairplot
X = df.drop(['CustomerID', 'Gender'], axis=1)
sns.pairplot(df.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
plt.show()

# K-Means
from sklearn.cluster import KMeans

clusters = []
for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(X)
    clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax)
ax.set_title('Searching for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')

# Annotate arrow
ax.annotate('Possible Elbow Point', xy=(3, 140000), xytext=(3, 50000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
ax.annotate('Possible Elbow Point', xy=(5, 80000), xytext=(5, 150000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
plt.show()

# 3 clusters
km3 = KMeans(n_clusters=3).fit(X)
X['Labels'] = km3.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'], palette=sns.color_palette('hls', 3))
plt.title('KMeans with 3 Clusters')
plt.show()
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
sns.swarmplot(x='Labels', y='Income', data=X, ax=ax)
ax.set_title('Labels According to Annual Income')

ax = fig.add_subplot(122)
sns.swarmplot(x='Labels', y='Score', data=X, ax=ax)
ax.set_title('Labels According to Scoring History')

plt.show()
# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
X['Labels'] = agglom.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', 5))
plt.title('Agglomerative with 5 Clusters')
plt.show()

from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
print(dist)
Z = hierarchy.linkage(dist, 'average')
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation='right')
# Density Based Clustering (DBSCAN)
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)
X['Labels'] = db.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', np.unique(db.labels_).shape[0]))
plt.title('DBSCAN with epsilon 11, min samples 6')
plt.show()
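A note on the code above: X keeps the 'Labels' column written after K-Means when it is later passed to AgglomerativeClustering and DBSCAN, so the earlier cluster assignments leak into the later fits. A cleaner sketch (a suggested fix, not part of the original manual) drops that column before each new fit:

features = X.drop('Labels', axis=1, errors='ignore')  # hypothetical cleanup step
agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(features)
db = DBSCAN(eps=11, min_samples=6).fit(features)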
OUTPUT:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID                200 non-null int64
Gender                    200 non-null object
Age                       200 non-null int64
Annual Income (k$)        200 non-null int64
Spending Score (1-100)    200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
[[ 0. 42.05948169 33.03028913 ... 117.12813496 124.53915047
130.17296186]
[ 42.05948169 0. 75.01999733 ... 111.76761606 137.77880824
122.35195135]
[ 33.03028913 75.01999733 0. ... 129.89226305 122.24974438
143.78456106]
...
[117.12813496 111.76761606 129.89226305 ... 0. 57.10516614
14.35270009]
[124.53915047 137.77880824 122.24974438 ... 57.10516614 0.
65.06150936]
[130.17296186 122.35195135 143.78456106 ... 14.35270009 65.06150936
0. ]]
Conclusion:
With the help of this experiment, we implemented clustering algorithms for unsupervised classification and plotted the clustered data for each algorithm.
EXPERIMENT 8
Code :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# %matplotlib inline
plt.style.use("ggplot")

import sklearn
from sklearn.decomposition import TruncatedSVD
amazon_ratings = pd.read_csv('../input/amazon-ratings/ratings_Beauty.csv')
amazon_ratings = amazon_ratings.dropna()
amazon_ratings.head()
amazon_ratings.shape
popular_products = pd.DataFrame(amazon_ratings.groupby('ProductId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(10)
most_popular.head(30).plot(kind = "bar")
amazon_ratings1 = amazon_ratings.head(10000)
ratings_utility_matrix = amazon_ratings1.pivot_table(values='Rating', index='UserId', columns='ProductId', fill_value=0)
ratings_utility_matrix.head()
ratings_utility_matrix.shape

X = ratings_utility_matrix.T
X.head()
X.shape
X1 = X

SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
X.index[99]
i = "6117036094"
product_names = list(X.index)
product_ID = product_names.index(i)
product_ID
correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape

Recommend = list(X.index[correlation_product_ID > 0.90])

# Removes the item already bought by the customer


Recommend.remove(i)
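# Note (added): the manual never shows the display step that produces the list
# of recommended product IDs in the output below; presumably it comes from
# printing the first entries of Recommend, e.g. (a hypothetical display step):
print(Recommend[0:9])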

# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

product_descriptions = pd.read_csv('../input/home-depot-product-search-relevance/product_descriptions.csv')
product_descriptions.shape

# Missing values
product_descriptions = product_descriptions.dropna()
product_descriptions.shape
product_descriptions.head()

product_descriptions1 = product_descriptions.head(500)

vectorizer = TfidfVectorizer(stop_words='english')
X1 = vectorizer.fit_transform(product_descriptions1["product_description"])
X1
# Fitting K-Means to the dataset
X = X1
# Optimal number of clusters
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X1)
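# Note (added): the loop and show_recommendations() below call print_cluster(),
# which is never defined in this manual. A minimal sketch consistent with the
# "Top terms per cluster" output (a hypothetical helper; it relies on the
# order_centroids and terms variables computed in the next cell) would be:
def print_cluster(i):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()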

print("Top terms per cluster:")


order_centroids = model.cluster_centers_.argsort()[:, ::-1] terms =
vectorizer.get_feature_names()
for i in range(true_k):
print_cluster(i)
def show_recommendations(product):
#print("Cluster ID:")
Y = vectorizer.transform([product]) prediction
= model.predict(Y) #print(prediction)
print_cluster(prediction[0])
show_recommendations("cutting tool")
show_recommendations("spray paint")
show_recommendations("steel drill")
OUTPUT :

(2023070, 4)

<matplotlib.axes._subplots.AxesSubplot at 0x7fd439e493c8>
(9697, 886)

# Decomposing the Matrix
(886, 9697)

# Correlation Matrix
(886, 10)

# Isolating Product ID 6117036094 from the Correlation Matrix
'6117036094'

# Index of the product ID purchased by customer 99
(886,)
#Recommending top 10 highly correlated products in sequence
['0733001998',
'1304139212',
'1304139220',
'130414089X',
'130414643X',
'130414674X',
'1304174778',
'1304174867',
'1304174905']

(124428, 2)

0 Not only do angles make joints stronger, they ...


1 BEHR Premium Textured DECKOVER is an innovativ...
2 Classic architecture meets contemporary design...
3 The Grape Solar 265-Watt Polycrystalline PV So...
4 Update your bathroom with the Delta Vero Singl...
5 Achieving delicious results is almost effortle...
6 The Quantum Adjustable 2-Light LED Black Emerg...
7 The Teks #10 x 1-1/2 in. Zinc-Plated Steel Was...
8 Get the House of Fara 3/4 in. x 3 in. x 8 ft. ...
9 Valley View Industries Metal Stakes (4-Pack) a...
Name: product_description, dtype: object

<500x8932 sparse matrix of type '<class 'numpy.float64'>'


with 34817 stored elements in Compressed Sparse Row format>

Top terms per cluster:
Cluster 0: concrete stake ft coating apply epoxy drying sq garage formula
Cluster 1: wood patio bamboo natural frame outdoor rug size steel dining
Cluster 2: used trim painted 65 proposition nbsp residents california project 32
Cluster 3: door lbs easy dog nickle solid roof plastic house adjustable
Cluster 4: cutting saw tool blade design cut pliers grip metal non
Cluster 5: wall piece finish tile design use color easy installation water
Cluster 6: light watt bulb led fixture volt bulbs lighting use power
Cluster 7: helps water easy snow handle nozzle year features tool control
Cluster 8: air ft water unit room installation fan cooling use easy
Cluster 9: post fence gate ft screen vinyl posts aluminum brackets spline
# Keyword: cutting tool
Cluster 4: cutting saw tool blade design cut pliers grip metal non

# Keyword: spray paint
Cluster 2: used trim painted 65 proposition nbsp residents california project 32

# Keyword: steel drill
Cluster 8: air ft water unit room installation fan cooling use easy

Conclusion:
With the help of this experiment, we used machine learning techniques on an available data set to develop a recommendation system.
