Ads Lab Manual
3. Data Modelling
4. Implementation of Statistical Hypothesis Tests using SciPy and scikit-learn
5. Regression Analysis

EXPERIMENT 1
Code:
from google.colab import files
uploaded = files.upload()
#Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
#Loading the csv file into a pandas dataframe
df = pd.read_csv("CARS.csv")
df.head(5)
# Removing irrelevant features
df = df.drop(['Model', 'DriveTrain', 'Invoice', 'Origin', 'Type'], axis=1)
df.head(5)
#To peek at the first five rows.
df.head(5)
#To peek at the last five rows
df.tail(5)
#Finding the null values
print(df.isnull().sum())
#printing the null value rows
df[0:249]
# Filling the null rows with the mean of the column
val = df['Cylinders'].mean()
df['Cylinders'][247] = round(val)
val = df['Cylinders'].mean()
df['Cylinders'][248] = round(val)
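# Note: an equivalent one-liner (an assumed alternative, not from the original listing)
# would fill every missing Cylinders value at once and avoid the chained-indexing warning:
# df['Cylinders'] = df['Cylinders'].fillna(round(df['Cylinders'].mean()))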
# Removing the formatting
df['MSRP'] = [x.replace('$', '') for x in df['MSRP']]
df['MSRP'] = [x.replace(',', '') for x in df['MSRP']]
df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')
# Detecting outliers with a box plot
sns.boxplot(x=df['MSRP'])
# Removing outliers using the IQR rule
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
sns.boxplot(x=df['MSRP'])
df.describe()
Output:
Saving CARS.csv to CARS (2).csv
    Make   Model           Type   Origin  DriveTrain  MSRP     Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0   Acura  MDX             SUV    Asia    All         $36,945  $33,337  3.5         6.0        265         17        23           4451    106        189
1   Acura  RSX Type S 2dr  Sedan  Asia    Front       $23,820  $21,761  2.0         4.0        200         24        31           2778    101        172
2   Acura  TSX 4dr         Sedan  Asia    Front       $26,990  $24,647  2.4         4.0        200         22        29           3230    105        183
3   Acura  TL 4dr          Sedan  Asia    Front       $33,195  $30,299  3.2         6.0        270         20        28           3575    108        186
4   Acura  3.5 RL 4dr      Sedan  Asia    Front       $43,755  $39,014  3.5         6.0        225         18        24           3880    115        197
Output:
Make 0
MSRP 0
EngineSize 0
Cylinders 2
Horsepower 0
MPG_City 0
MPG_Highway 0
Weight 0
Wheelbase 0
Length 0
dtype: int64
      Make   MSRP     EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0     Acura  $36,945  3.5         6.0        265         17        23           4451    106        189
1     Acura  $23,820  2.0         4.0        200         24        31           2778    101        172
2     Acura  $26,990  2.4         4.0        200         22        29           3230    105        183
3     Acura  $33,195  3.2         6.0        270         20        28           3575    108        186
4     Acura  $43,755  3.5         6.0        225         18        24           3880    115        197
...   ...    ...      ...         ...        ...         ...       ...          ...     ...        ...
244   Mazda  $28,750  3.0         6.0        200         18        25           3812    112        188
245   Mazda  $22,388  1.8         4.0        142         23        28           2387    89         156
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>
Output:
MSRP 18870.750
Cylinders 2.000
Horsepower 90.000
MPG_City 4.250
MPG_Highway 5.000
Weight 873.750
Wheelbase 9.000
Length 16.000
dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>
Conclusion:
Thus, the given data set was cleaned and analyzed using NumPy and Pandas, so that it is now ready for building a model.
EXPERIMENT 2
Code:
# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
# To identify the type of data
df.info()
# Getting the number of instances and features
df.shape
# Getting the dimensions of the data frame
df.ndim
# Removing duplicate data
df = df.drop_duplicates(subset='MSRP', keep='first')
df.count()
# To peek at the first five rows
df.head(5)
# To peek at the last five rows
df.tail(5)
# Finding the null values
print(df.isnull().sum())
# Printing the null value rows
df[0:248]
val = df['Cylinders'].mean()
df['Cylinders'][248]= round(val)
# Detecting outliers
sns.boxplot(x=df['MSRP'])
df.describe()
# Plotting a Histogram
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
plt.show()
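The output below also lists a scatter plot and a heat map; the following is a minimal sketch of how they could be produced, assuming the same df as above (the column choices and styling are illustrative, not from the original listing):
# Scatter plot (columns chosen for illustration)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['Horsepower'], df['MSRP'])
ax.set_xlabel('Horsepower')
ax.set_ylabel('MSRP')
plt.show()

# Heat map of pairwise correlations between the numeric columns
plt.figure(figsize=(10, 5))
c = df.corr()
sns.heatmap(c, cmap='BrBG', annot=True)
plt.show()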
Output:
A.
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76d5e450>
<matplotlib.axes._subplots.AxesSubplot at 0x7fad76cb92d0>
Contingency Table:
            MSRP          EngineSize  Cylinders   Horsepower  MPG_City    MPG_Highway  Weight       Wheelbase   Length
count       341.000000    341.000000  341.000000  341.000000  341.000000  341.000000   341.000000   341.000000  341.000000
mean      29789.439883      3.106158    5.686217  210.513196   19.624633   26.618768  3543.299120   107.803519  186.304985
std       11048.748802      0.890269    1.316664   54.939839    2.928538    3.834994   562.054298     5.905091   11.652099
min       10280.000000      1.300000    4.000000  104.000000   13.000000   17.000000  2403.000000    95.000000  158.000000
25%       21445.000000      2.400000    4.000000  170.000000   18.000000   25.000000  3188.000000   104.000000  178.000000
50%       27560.000000      3.000000    6.000000  208.000000   19.000000   26.000000  3470.000000   107.000000  187.000000
75%       36395.000000      3.500000    6.000000  240.000000   21.000000   29.000000  3851.000000   112.000000  193.000000
max       65000.000000      5.700000    8.000000  390.000000   27.000000   36.000000  5270.000000   124.000000  215.000000
Scatter plot:
B. Histogram:
Heat Maps :
Conclusion:
Thus, the given data set was visualized and analyzed using Matplotlib and Seaborn, so that it is now ready for building a model.
EXPERIMENT 3
Code:
# To make debugging of the logistic_regression module easier, we enable autoreloading of imported modules.
# By doing this you may change the code of the logistic_regression library and all those changes will be available here.
%load_ext autoreload
%autoreload 2
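# The snippets below assume `data` (a DataFrame of MNIST-style digits with the label in the
# first column, as shown in the output), `numbers_to_display`, and the usual imports.
# A minimal sketch of the assumed setup (the file path and display count are assumptions):
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data/mnist-demo.csv')   # path assumed
numbers_to_display = 25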
# Go through the first numbers in the training set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit = data[plot_index:plot_index + 1].values
    digit_label = digit[0][0]
    digit_pixels = digit[0][1:]
    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))
# Calculate the number of cells that will hold all the numbers.
num_cells = math.ceil(math.sqrt(numbers_to_display))
# Calculate image size (remember that each picture has square proportions).
image_size = int(math.sqrt(digit_pixels.shape[0]))
plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
# Go through the first numbers in the test set and plot them.
for plot_index in range(numbers_to_display):
    # Extract digit data.
    digit_label = y_test[plot_index, 0]
    digit_pixels = x_test[plot_index, :]
    # Predicted label.
    predicted_label = y_test_predictions[plot_index][0]
    # Calculate image size (remember that each picture has square proportions).
    image_size = int(math.sqrt(digit_pixels.shape[0]))
Output:
   label  1x1  1x2  1x3  1x4  1x5  1x6  1x7  1x8  1x9  ...  28x19  28x20  28x21  28x22  28x23  28x24  28x25  28x26  28x27  28x28
0      5    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
1      0    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
2      4    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
3      1    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
4      9    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
5      2    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
6      1    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
7      3    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
8      1    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0
9      4    0    0    0    0    0    0    0    0    0  ...      0      0      0      0      0      0      0      0      0      0

10 rows × 785 columns
Trained model parameters, one row of 785 values per digit class (10 rows × 785 columns; individual values omitted).
EXPERIMENT 4
Code:
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

# histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# histogram plot
pyplot.hist(data)
pyplot.show()
# QQ Plot
from numpy.random import seed
from numpy.random import randn
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# q-q plot
qqplot(data, line='s')
pyplot.show()
# Shapiro-Wilk Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import shapiro
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = shapiro(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# D'Agostino and Pearson's Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import normaltest
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
stat, p = normaltest(data)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

# Anderson-Darling Test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import anderson
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(100) + 50
# normality test
result = anderson(data)
print('Statistic: %.3f' % result.statistic)
p = 0
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))
Output:
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
Statistics=0.992, p=0.822
Sample looks Gaussian (fail to reject H0)
Statistics=0.102, p=0.950
Sample looks Gaussian (fail to reject H0)
Statistic: 0.220
15.000: 0.555, data looks normal (fail to reject H0)
10.000: 0.632, data looks normal (fail to reject H0)
5.000: 0.759, data looks normal (fail to reject H0)
2.500: 0.885, data looks normal (fail to reject H0)
1.000: 1.053, data looks normal (fail to reject H0)
Conclusion:
With the help of this experiment, we now know how to implement statistical hypothesis tests using SciPy and scikit-learn.
EXPERIMENT 5
Code :
clf.coef_
X_test
test_preds
# Get the predicted classes
train_class_preds = clf.predict(X_train)
test_class_preds = clf.predict(X_test)
train_class_preds
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells
scores['train_accuracy']
scores['test_accuracy']
scores['estimator']
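The listing above assumes a data set, a fitted classifier clf, a confusion matrix cm, and cross-validation results scores that are not shown. A minimal sketch of how they could be produced (the file name, feature columns, and parameter values are assumptions, not from the original listing):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import confusion_matrix

# Telco churn data assumed: tenure and MonthlyCharges as features, a 0/1 churn flag as target
data = pd.read_csv('telco_churn.csv')                      # file name assumed
X = data[['tenure', 'MonthlyCharges']]
y = data['class']                                          # target column name as shown in the output
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

test_preds = clf.predict_proba(X_test)                     # class probabilities for the test set
cm = confusion_matrix(y_test, clf.predict(X_test))         # confusion matrix used in the heatmap
scores = cross_validate(clf, X, y, cv=5, scoring='accuracy', verbose=3,
                        return_train_score=True, return_estimator=True)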
Output:
Mounted at /content/drive
Dataset size
Rows 7043 Columns 21
(7043, 22)
(5634, 2)
(1409, 2)
0 4133
1 1501
Name: class, dtype: int64
0 4133
1 1501
Name: class, dtype: int64
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
array([[-0.05646728, 0.03315385]])
array([[-0.05646728, 0.03315385]])
array([[-0.05646728, 0.03315385]])
tenure MonthlyCharges
2200 19 58.20
4627 60 116.60
3225 13 71.95
2828 1 20.45
3768 55 77.75
... ... ...
2631 7 99.25
5333 13 88.35
6972 56 111.95
4598 18 56.25
3065 1 45.80
1409 rows × 2 columns
array([[0.7145149 , 0.2854851 ],
       [0.78522641, 0.21477359],
       [0.53064776, 0.46935224],
       ...,
       [0.77288679, 0.22711321],
       [0.71618111, 0.28381889],
       [0.57740038, 0.42259962]])
[CV] ................................................................
[CV] ..............., accuracy=(train=0.785, test=0.789), total= 0.0s
[CV] ................................................................
[CV] ..............., accuracy=(train=0.787, test=0.791), total= 0.0s
[CV] ................................................................
[CV] ..............., accuracy=(train=0.788, test=0.771), total= 0.0s
[CV] ................................................................
[CV] ..............., accuracy=(train=0.789, test=0.775), total= 0.0s
[CV] ................................................................
[CV] ..............., accuracy=(train=0.781, test=0.806), total= 0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished
[[-0.05617762 0.03293792]]
[[-0.05562275 0.03215852]]
[[-0.05820295 0.03454813]]
[[-0.05711808 0.03362381]]
[[-0.05530045 0.03257423]]
Conclusion :
We have successfully implemented logistic regression to find the relationship between variables and applied regression model techniques to make predictions on the dataset.
EXPERIMENT 6
Code:
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
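# The code below assumes `train_ds`, `val_ds`, `class_names`, `img_height` and `img_width`
# have already been created. A minimal sketch of the assumed setup (the flower_photos
# dataset matches the download shown in the output; parameter values are assumptions):
import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = pathlib.Path(tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True))
batch_size, img_height, img_width = 32, 180, 180
train_ds = tf.keras.utils.image_dataset_from_directory(   # tf.keras.preprocessing.image_dataset_from_directory on older TF
    data_dir, validation_split=0.2, subset="training", seed=123,
    image_size=(img_height, img_width), batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, validation_split=0.2, subset="validation", seed=123,
    image_size=(img_height, img_width), batch_size=batch_size)
class_names = train_ds.class_names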
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))
num_classes = len(class_names)
model = Sequential([
layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()
epochs=10
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)
plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
Output:
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 1s 0us/step
228827136/228813984 [==============================] - 1s 0us/step
3670
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Model: "sequential"
max_pooling2d_2 (MaxPooling2D)   (None, 22, 22, 64)   0
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
Epoch 1/10
92/92 [==============================] - 13s 38ms/step - loss: 1.3093 - accuracy: 0.4349 - val_loss: 1.1781 - val_accuracy: 0.5232
Epoch 2/10
92/92 [==============================] - 2s 24ms/step - loss: 1.0252 - accuracy: 0.5940 - val_loss: 0.9801 - val_accuracy: 0.6131
Epoch 3/10
92/92 [==============================] - 2s 23ms/step - loss: 0.8460 - accuracy: 0.6737 - val_loss: 0.9532 - val_accuracy: 0.6185
Epoch 4/10
92/92 [==============================] - 2s 23ms/step - loss: 0.6524 - accuracy: 0.7653 - val_loss: 0.9262 - val_accuracy: 0.6526
Epoch 5/10
92/92 [==============================] - 2s 23ms/step - loss: 0.4360 - accuracy: 0.8457 - val_loss: 1.0237 - val_accuracy: 0.6403
Epoch 6/10
92/92 [==============================] - 2s 23ms/step - loss: 0.2660 - accuracy: 0.9111 - val_loss: 1.1619 - val_accuracy: 0.6226
Epoch 7/10
92/92 [==============================] - 2s 23ms/step - loss: 0.1573 - accuracy: 0.9527 - val_loss: 1.4132 - val_accuracy: 0.6158
Epoch 8/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0708 - accuracy: 0.9802 - val_loss: 1.5212 - val_accuracy: 0.6308
Epoch 9/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0446 - accuracy: 0.9874 - val_loss: 1.6525 - val_accuracy: 0.6349
Epoch 10/10
92/92 [==============================] - 2s 24ms/step - loss: 0.0271 - accuracy: 0.9942 - val_loss: 1.7895 - val_accuracy: 0.6349
Conclusion:
Thus, we now know how to implement a classification model using TensorFlow.
EXPERIMENT 7
Code:
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
print(os.listdir("../input"))
# Loading the data (file name assumed; the original listing does not show this step)
df = pd.read_csv('../input/Mall_Customers.csv')
df.info()
df.rename(index=str, columns={'Annual Income (k$)': 'Income',
                              'Spending Score (1-100)': 'Score'}, inplace=True)
# Let's see our data in a detailed way with pairplot
X = df.drop(['CustomerID', 'Gender'], axis=1)
sns.pairplot(df.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
plt.show()
# K-Means
from sklearn.cluster import KMeans

clusters = []
for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(X)
    clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax)
ax.set_title('Searching for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
# Annotate arrows
ax.annotate('Possible Elbow Point', xy=(3, 140000), xytext=(3, 50000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
ax.annotate('Possible Elbow Point', xy=(5, 80000), xytext=(5, 150000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2))
plt.show()

# 3 clusters
km3 = KMeans(n_clusters=3).fit(X)
X['Labels'] = km3.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'], palette=sns.color_palette('hls', 3))
plt.title('KMeans with 3 Clusters')
plt.show()
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
sns.swarmplot(x='Labels', y='Income', data=X, ax=ax)
ax.set_title('Labels According to Annual Income')
ax = fig.add_subplot(122)
sns.swarmplot(x='Labels', y='Score', data=X, ax=ax)   # second panel assumed to show Score
ax.set_title('Labels According to Spending Score')    # title assumed
plt.show()
# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
X['Labels'] = agglom.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
palette=sns.color_palette('hls', 5))
plt.title('Agglomerative with 5 Clusters') plt.show()
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
print(dist)

Z = hierarchy.linkage(dist, 'average')
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation='right')
# Density Based Clustering (DBSCAN)
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)
X['Labels'] = db.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(X['Income'], X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', np.unique(db.labels_).shape[0]))
plt.title('DBSCAN with epsilon 11, min samples 6')
plt.show()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID                200 non-null int64
Gender                    200 non-null object
Age                       200 non-null int64
Annual Income (k$)        200 non-null int64
Spending Score (1-100)    200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
[[ 0. 42.05948169 33.03028913 ... 117.12813496 124.53915047
130.17296186]
[ 42.05948169 0. 75.01999733 ... 111.76761606 137.77880824
122.35195135]
[ 33.03028913 75.01999733 0. ... 129.89226305 122.24974438
143.78456106]
...
[117.12813496 111.76761606 129.89226305 ... 0. 57.10516614
14.35270009]
[124.53915047 137.77880824 122.24974438 ... 57.10516614 0.
65.06150936]
[130.17296186 122.35195135 143.78456106 ... 14.35270009 65.06150936
0. ]]
Conclusion:
With the help of this experiment, we implemented clustering algorithms for unsupervised classification and plotted the clustered data for each algorithm.
EXPERIMENT 8
Code :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline
plt.style.use("ggplot")
import sklearn
from sklearn.decomposition import TruncatedSVD
amazon_ratings = pd.read_csv('../input/amazon-ratings/ratings_Beauty.csv')
amazon_ratings = amazon_ratings.dropna()
amazon_ratings.head()
amazon_ratings.shape
popular_products = pd.DataFrame(amazon_ratings.groupby('ProductId')['Rating'].count())
most_popular = popular_products.sort_values('Rating', ascending=False)
most_popular.head(10)
most_popular.head(30).plot(kind = "bar")
amazon_ratings1 = amazon_ratings.head(10000)
ratings_utility_matrix = amazon_ratings1.pivot_table(values='Rating', index='UserId', columns='ProductId', fill_value=0)
ratings_utility_matrix.head()
ratings_utility_matrix.shape
X = ratings_utility_matrix.T
X.head()
X.shape
X1 = X
SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(X)
decomposed_matrix.shape
correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
X.index[99]
i = "6117036094"
product_names = list(X.index)
product_ID = product_names.index(i)
product_ID
correlation_product_ID = correlation_matrix[product_ID]
correlation_product_ID.shape
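# The correlation scores above could be turned into recommendations as follows
# (a sketch; the 0.90 threshold and variable names are assumptions, not from the original listing):
Recommend = list(X.index[correlation_product_ID > 0.90])
Recommend.remove(i)          # drop the product the customer already bought
Recommend[0:9]               # top recommended products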
# Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
product_descriptions = pd.read_csv('../input/home-depot-product-search-relevance/product_descriptions.csv')
product_descriptions.shape
# Missing values
product_descriptions = product_descriptions.dropna()
product_descriptions.shape
product_descriptions.head()
product_descriptions1 = product_descriptions.head(500)
vectorizer = TfidfVectorizer(stop_words='english')
X1 = vectorizer.fit_transform(product_descriptions1["product_description"])
X1
# Fitting K-Means to the dataset
X = X1
# Optimal number of clusters
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X1)
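The cluster keyword listings and keyword searches shown in the output below rely on helper code that is not part of the listing above; a minimal sketch, with all function and variable names assumed:
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()   # use get_feature_names() on older scikit-learn

def print_cluster(i):
    # Print the ten highest-weighted terms of cluster i
    print("Cluster %d:" % (i + 1), " ".join(terms[ind] for ind in order_centroids[i, :10]))

for i in range(true_k):
    print_cluster(i)

def show_recommendations(keyword):
    # Map a free-text keyword to its nearest cluster and print that cluster's top terms
    Y = vectorizer.transform([keyword])
    print_cluster(model.predict(Y)[0])

show_recommendations("cutting tool")
show_recommendations("spray paint")
show_recommendations("steel drill")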
Output:
(2023070, 4)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd439e493c8>
(9697, 886)
(124428, 2)
Cluster 1: wood patio bamboo natural frame outdoor rug size steel dining
Cluster 2: used trim painted 65 proposition nbsp residents california project 32
Cluster 3: door lbs easy dog nickle solid roof plastic house adjustable
Cluster 4: cutting saw tool blade design cut pliers grip metal non
Cluster 5: wall piece finish tile design use color easy installation water
Cluster 6: light watt bulb led fence gate ft screen vinyl posts aluminum brackets spline
# Keyword: cutting tool
Cluster 4: cutting saw tool blade design cut pliers grip metal non

# Keyword: spray paint
Cluster 2: used trim painted 65 proposition nbsp residents california project 32

# Keyword: steel drill
Cluster 8: air ft water unit room installation fan cooling use easy
Conclusion:
With the help of this experiment, we developed a recommendation system on the available data set using machine learning techniques.