Data Preprocessing:-
1. Get Data Set
2. Import important libraries:-
import numpy as np :- for numeric calculations and array manipulation
import matplotlib.pyplot as plt :- for pictorial representation of results
import pandas as pd :- to read and manipulate the data, and for series operations
3. Import dataset:- (sir's example)
[Link]/xls
dataset = pd.read_csv('[Link]')
> create matrix of all independent variables (sir's example)
x = dataset.iloc[:, :-1].values
> create matrix of dependent variables (sir's example)
y = dataset.iloc[:, 3].values
4. Handling missing values
taking care of missing data:-
> from sklearn.preprocessing import Imputer (sklearn is an ML lib for multiple
jobs; Imputer is used to fill in the missing values, remember the capital I)
> imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(x[:, 1:3])
> x[:, 1:3] = imputer.transform(x[:, 1:3])
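The Imputer class used above comes from an older scikit-learn (it was later replaced by SimpleImputer in sklearn.impute). Either way, strategy='mean' with axis=0 just means column-wise mean replacement, which can be sketched in plain NumPy (the toy data below is invented for illustration):

```python
import numpy as np

# Mean imputation along axis=0 (columns), mirroring strategy='mean':
# each NaN is replaced by the mean of the non-missing values in its column.
x = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, np.nan],
              [np.nan, 61000.0]])

col_means = np.nanmean(x, axis=0)           # column means, ignoring NaN
nan_rows, nan_cols = np.where(np.isnan(x))  # positions of the missing entries
x[nan_rows, nan_cols] = col_means[nan_cols]
```

After this, x has no NaN left: the missing age became the mean of the other ages, the missing salary the mean of the other salaries.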
5. Categorical Data:-
Encoding Categorical Data:
#Encoding the independent variable:-
> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
(LabelEncoder gives numbers to the entities of the same category)
> labelencoder_x = LabelEncoder()
x[:, 0] = labelencoder_x.fit_transform(x[:, 0]) (here it will encode the first
column values as 0, 1, 2, ...)
> onehotencoder = OneHotEncoder(categorical_features=[0])
> x = onehotencoder.fit_transform(x).toarray() (encodes x in terms of 0's and
1's; the other values may display in exponential form)
x
> labelencoder_y = LabelEncoder() (encoding y)
y = labelencoder_y.fit_transform(y)
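What LabelEncoder and OneHotEncoder do to a country-style column can be sketched in plain NumPy (the toy countries array is invented for illustration):

```python
import numpy as np

# Label-encode a toy country column, then one-hot it -- the same
# transformation LabelEncoder + OneHotEncoder apply to column 0.
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

cats = np.unique(countries)               # sorted categories -> codes 0, 1, 2
codes = np.searchsorted(cats, countries)  # label encoding
onehot = np.eye(len(cats))[codes]         # one 0/1 dummy column per category
```

Each row of onehot has exactly one 1, marking which category that sample belongs to; this is why one-hot encoding turns one categorical column into several 0/1 columns.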
6. Splitting Training and Test Data:-
> from sklearn.cross_validation import train_test_split
note:- (cross_validation is the library for splitting the whole data set into training and
testing data, inside which we call the train_test_split function for splitting the data)
> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
(splits the whole data set: 80% of the data for training, 20% for testing;
random_state maintains consistency in the train and test data, otherwise
every run takes a different set of values)
try to keep the test size in the range of 20-30%, as less than 20% risks
overfitting and more than 30% leaves too little data for training
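A minimal plain-NumPy sketch of what train_test_split does with test_size=0.2 and a fixed random_state (note: in modern scikit-learn the import moved from sklearn.cross_validation to sklearn.model_selection):

```python
import numpy as np

# An 80/20 split with a fixed seed, mimicking
# train_test_split(x, y, test_size=0.2, random_state=0):
# the seed makes the shuffle reproducible across runs.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

rng = np.random.RandomState(0)
idx = rng.permutation(len(x))          # shuffled row indices
n_test = int(len(x) * 0.2)
test_idx, train_idx = idx[:n_test], idx[n_test:]
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Re-seeding with the same random_state reproduces the exact same shuffle, which is what keeps the train/test split consistent between runs.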
7. Feature scaling:- (used to scale large values into a small space, like putting two
numbers and the square of their difference on the same graph)
(Note:- values are rescaled to mean 0 and standard deviation 1, so most of them
fall in a small range around 0)
> from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)  # fit_transform -- use only for training data
x_test = sc_x.transform(x_test)
## x_train holds the independent variables; fit on it first, then reuse the scaler on x_test
## StandardScaler is a class that scales all the values based on the mean and
standard deviation of the training data
## fit() - generates the learning parameters (mean, std) from the training data (only makes
the machine learn); it makes the object ready
## transform() - applied on the fitted object to generate the transformed data set
note:- fit_transform() simply combines fit() and transform() in one call
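The standardization formula StandardScaler applies is z = (x - mean) / std, with mean and std taken from the training data only. A plain-NumPy sketch on made-up numbers:

```python
import numpy as np

# Standardization as StandardScaler does it: z = (x - mean) / std,
# with mean/std computed from the TRAINING data only, then reused on test data.
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[2.5]])  # a test value equal to the training mean

mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)
x_train_sc = (x_train - mu) / sigma  # what fit_transform(x_train) produces
x_test_sc = (x_test - mu) / sigma    # what transform(x_test) produces
```

The scaled training data has mean 0 and standard deviation 1, and a test point equal to the training mean maps to exactly 0, showing why the test set must reuse the training statistics.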
**k-NN algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
------- till here the machine is fit with training data and learns from it -------
## in sklearn, neighbors is the library in which we have KNeighborsClassifier
## KNeighborsClassifier takes some values: n_neighbors is the number of neighbours,
usually an odd number (to avoid ties)
metric == defines the distance method being used
p=2 means using Euclidean distance
------ for testing and predicting ------
y_pred = classifier.predict(x_test)  ## predicts only on the x_test values given before
y_pred
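The note that metric='minkowski' with p=2 means Euclidean distance can be checked directly, since the Minkowski distance is (sum |a-b|^p)^(1/p):

```python
import numpy as np

# Minkowski distance with p=2 reduces to plain Euclidean distance,
# which is what KNeighborsClassifier(metric='minkowski', p=2) uses.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

p = 2
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)
euclidean = np.sqrt(np.sum((a - b) ** 2))
```

Here both come out to 5.0 (a 3-4-5 right triangle); p=1 would instead give the Manhattan distance.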
**making the confusion matrix ----
from sklearn.metrics import confusion_matrix  ## confusion_matrix is a function
cm = confusion_matrix(y_test, y_pred)
cm
gives out a 2x2 confusion matrix where rows are actual classes and columns are
predicted classes, i.e. [[TN, FP], [FN, TP]] for binary labels 0/1
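A hand-rolled sketch of how that matrix (and an accuracy score) comes out of y_test and y_pred, using toy label arrays invented for illustration:

```python
import numpy as np

# Building the 2x2 confusion matrix by hand for binary labels 0/1:
# rows = actual class, columns = predicted class, so
# cm = [[TN, FP], [FN, TP]] -- the layout sklearn's confusion_matrix uses.
y_test = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = np.zeros((2, 2), dtype=int)
for actual, pred in zip(y_test, y_pred):
    cm[actual, pred] += 1

accuracy = np.trace(cm) / cm.sum()  # (TN + TP) / total predictions
```

The diagonal holds the correct predictions, so accuracy is just the diagonal sum over the total.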
*STEP 8:- Visualizing the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**
*STEP 9:- Visualizing the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**decision tree:--
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
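The criterion='entropy' choice scores each candidate node with H = -sum p_k * log2(p_k), and the tree picks splits that reduce this impurity. A quick sketch of the two boundary cases:

```python
import numpy as np

# Entropy impurity as used by criterion='entropy':
# a pure node (one class) scores 0, a 50/50 node scores 1.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

h_pure = entropy(np.array([1, 1, 1, 1]))   # all one class
h_mixed = entropy(np.array([0, 0, 1, 1]))  # perfectly mixed
```

Splits are chosen to move child nodes from the mixed end (H near 1) toward the pure end (H near 0).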
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot:--
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
***Naive Bayes
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
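The idea behind GaussianNB, sketched for a single feature and two classes with equal priors (the toy numbers are invented for illustration): model each feature per class as a Gaussian, then pick the class with the larger likelihood.

```python
import numpy as np

# Core of Gaussian naive Bayes for one feature, two classes, equal priors:
# fit a Gaussian per class, classify by whichever density is larger.
def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x0 = np.array([1.0, 2.0, 1.5])  # feature values seen for class 0
x1 = np.array([8.0, 9.0, 8.5])  # feature values seen for class 1

def predict(x_new):
    p0 = gaussian(x_new, x0.mean(), x0.var())
    p1 = gaussian(x_new, x1.mean(), x1.var())
    return 0 if p0 > p1 else 1
```

A new point near the class-0 cluster gets class 0, one near the class-1 cluster gets class 1; with more features, naive Bayes multiplies the per-feature densities together (the "naive" independence assumption).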
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot:--
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**Random forest
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
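With n_estimators=10, the forest's class prediction is a majority vote over its 10 trees. Sketched with a made-up vote array:

```python
import numpy as np

# A random forest classifies by majority vote: each of the n_estimators
# trees votes for a class, and the most common vote wins.
tree_votes = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # one vote per tree
prediction = int(np.bincount(tree_votes).argmax())
```

Here 7 of 10 trees vote for class 1, so the forest predicts class 1; averaging many decorrelated trees is what makes the ensemble more stable than a single decision tree.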
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min()-1, stop=x_set[:, 0].max()+1, step=0.01),
                     np.arange(start=x_set[:, 1].min()-1, stop=x_set[:, 1].max()+1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
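For a single feature, the line LinearRegression fits has the ordinary-least-squares closed form slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x), which can be checked by hand (toy data invented for illustration):

```python
import numpy as np

# Ordinary least squares for one feature, the same line
# LinearRegression fits: slope = cov(x, y) / var(x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # e.g. years of experience
y = 2.0 * x + 1.0                        # salary, exactly linear here

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_hat = slope * x + intercept            # fitted line, cf. regressor.predict
```

On noise-free data the formula recovers the generating line exactly; on real data like Salary_Data.csv it gives the line minimizing the sum of squared residuals.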