Machine Learning

The document outlines the roles of data engineers, analysts, and scientists in the AI journey, detailing their responsibilities in data collection, visualization, and mathematical analysis. It also covers machine learning concepts, including types, model building steps, and techniques like linear and logistic regression, as well as regularization methods to prevent overfitting. Additionally, it discusses the importance of statistical measures such as p-values and R-squared in model evaluation.


JOURNEY OF AI:

DATA ENGINEER - the job of a data engineer is to gather information from different places and put it in a common environment.

Eg: Mainframe, Linux, Unix, etc.

We collect data from these different ecosystems and build a data lake in big data/Hadoop.

Before around 2010 - databases; by around 2015 - big data.

Storage costs also fell sharply: where 1 GB once cost Rs 800, later 15 GB cost the same Rs 800.

DATA ANALYST - (minimal amount of data engineering)

Data visualization and understanding (identify the business insight and present it).
Builds a shared understanding between the business and the data analyst.

DATA SCIENTIST -
Applies mathematical statistics to the data and tries to identify relationships within it.

MACHINE LEARNING -

Predicts the future.

A set of mathematical equations/formulas which, when applied to a given set of inputs, can predict the output.

The field emerged in the mid-20th century (the term "machine learning" dates to 1959).

Types of Machine Learning -

Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning

Equation of a straight line - y = mx + c

Linear Regression -

One input and one output - simple linear regression:

Y = c + mX, usually written Y = B0 + B1X

When you have more than one input and one output - multiple linear regression:

Y = c + m1x1 + m2x2 + ... + mnxn

Example of a fitted simple model:
Y = 3.3518 + 0.0527X
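A minimal sketch of fitting Y = B0 + B1X with sklearn. The data here is made up for illustration; the coefficients 3.3518 and 0.0527 above came from a different, unspecified dataset.

```python
# Simple linear regression on illustrative data: y is roughly 2x, so
# the fitted slope B1 should come out near 2.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one input column
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])           # one output

model = LinearRegression()
model.fit(X, y)

b0 = model.intercept_  # B0
b1 = model.coef_[0]    # B1
print(f"Y = {b0:.4f} + {b1:.4f} X")
```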


Model building:

step 1: load the data
step 2: EDA (done by the data analyst and data scientist)
step 3: train test split

Imagine you have 100 records: split 70 records for training and keep the remaining 30 records for testing.

step 4: model building, using the train data
step 5: model evaluation, on the test data with the model built in step 4
- the test data has both input and output
- in step 5 the model predicts the output
- you calculate the error between the actual output and the predicted output

This error calculation should be done in both step 4 and step 5.

step 6: calculate the performance/accuracy
step 7: check for overfitting/underfitting

Eg: training - 90% accuracy, test data - 50% accuracy -> overfitting

If the model overfits/underfits:

repeat steps 4, 5, 6
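The steps above can be sketched end to end with sklearn. The data is synthetic (100 records generated from a known line plus noise), so this is only an illustration of the workflow:

```python
# Model-building steps 1-7 on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# step 1: load the data (synthetic here: y = 3 + 2x + noise)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

# step 3: train test split, 70 records for training, 30 for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# step 4: model building on the train data
model = LinearRegression().fit(X_train, y_train)

# step 5: predict on both sets and calculate the error
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# step 6: performance/accuracy
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# step 7: a large gap between train and test scores signals overfitting
print(f"train R2={train_r2:.3f} test R2={test_r2:.3f}")
```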
Linear Regression Formula -

Using the least-squares formulas below, we calculate B0 and B1 and then predict the output:

B1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
B0 = y_mean - B1 * x_mean

ERROR:

R-squared - the goodness of fit ("accuracy") of the model

The R-squared value will be between 0 and 1.
If the value is 0.90, it means the model explains 90% of the variance in the output.
Gradient Descent - an iterative method that repeatedly adjusts the coefficients in the direction that reduces the error (the update steps are shown under Lasso/Ridge below).
Logistic regression:

It is used for classification problems, where we classify the outcome as yes/no.

If yes is 1, then no is 0, and vice versa.

The logistic (sigmoid) equation gives you a probability between 0 and 1:

p = 1 / (1 + e^-(B0 + B1X))

It is the developer who chooses the threshold.

By default, logistic regression comes with a threshold of 0.5.
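A short sketch of the probability-plus-threshold idea, on synthetic data. The custom threshold of 0.3 is an arbitrary example value:

```python
# Logistic regression outputs a probability; the developer picks the
# threshold. predict() uses the default 0.5; a custom threshold is just
# a comparison against predict_proba's output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]        # probability of class 1, in [0, 1]
default_pred = clf.predict(X)             # equivalent to proba > 0.5
custom_pred = (proba >= 0.3).astype(int)  # developer-chosen threshold
```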


Multiple logistic regression using sklearn -

The confusion matrix is used to evaluate model accuracy in logistic regression -

Sensitivity - out of the actual positive outcomes, how many the model predicted correctly

Specificity - out of the actual negative outcomes, how many the model predicted correctly

Accuracy - overall accuracy
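These three measures fall straight out of the confusion matrix. A minimal sketch on made-up actual/predicted labels (1 = yes, 0 = no):

```python
# Sensitivity, specificity and accuracy from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred   = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()

sensitivity = tp / (tp + fn)  # of the actual positives, fraction correct
specificity = tn / (tn + fp)  # of the actual negatives, fraction correct
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall
```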


Logistic model building -

1, sourcing the data
2, convert alphabetic (categorical) columns to numeric (get_dummies)
3, train test split
4, model building on the train data
5, model prediction on the test data
6, accuracy (confusion matrix)
7, get the optimal threshold
7a, get sensitivity, specificity, accuracy for different thresholds
7b, based on the business use case, pick the optimal threshold as required

8, check overfit/underfit
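Steps 7a and 7b can be sketched as a simple threshold sweep; here "optimal" is taken to mean highest overall accuracy, but a real business case might weight sensitivity or specificity instead:

```python
# Step 7a: sensitivity, specificity and accuracy for thresholds 0.1-0.9.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, random_state=1)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

rows = []
for t in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    rows.append((round(t, 1),
                 tp / (tp + fn),        # sensitivity
                 tn / (tn + fp),        # specificity
                 (tp + tn) / len(y)))   # accuracy

# Step 7b: pick the threshold whose trade-off suits the use case,
# e.g. here simply the one with the highest overall accuracy.
best = max(rows, key=lambda r: r[3])
```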

ROC - receiver operating characteristic curve:

AUC - area under the ROC curve
True positive rate = recall = sensitivity
False positive rate = 1 - specificity

Specificity - out of the total negatives, how many you got right

FPR - out of the total negatives, how many you got wrong
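sklearn computes the ROC points and the AUC directly from the predicted probabilities; a minimal sketch on synthetic data:

```python
# ROC curve: true positive rate (sensitivity) against false positive
# rate (1 - specificity) across all thresholds; AUC summarizes it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=300, random_state=2)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)  # 0.5 = random guessing, 1.0 = perfect
```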
When you perform a train/test split:

If the model performs very well in training but shows more variance/more error in testing, it is overfitting.

Overfitting - high variance and low bias.

If the model is unable to learn properly even in training, it is underfitting.

Underfitting - high bias and low variance.


How to remove unnecessary columns:

1, correlated columns
2, columns that have no influence on the target, like Employee Id or Employee Name
3, derived metrics
4, all the data-analysis techniques you learned, e.g. removing outliers
Hypothesis testing - how it is useful in column selection:

A very important parameter of this analysis is the p-value.

The null hypothesis corresponding to each p-value is that the corresponding independent variable does not impact the dependent variable.
The alternate hypothesis is that the corresponding independent variable impacts the dependent variable.

A low p-value means the observed data would be very unlikely if the null hypothesis were true. Therefore, a low p-value, i.e. less than 0.05, indicates that you can reject the null hypothesis.

p > 0.05 - fail to reject the null hypothesis - the independent variable does not impact the dependent variable
p < 0.05 - reject the null hypothesis - the independent variable impacts the dependent variable

5, normalizing
6, regularization models (regression, discussed below)
7, VIF - Variance Inflation Factor

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

If my
VIF < 2 - very, very good fit, you can keep the column
VIF < 5 - very good - you can keep the column
VIF < 10 - okay - you can keep the column
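VIF can be computed per column with statsmodels; a minimal sketch where x2 is almost a copy of x1 (so both get a high VIF) while x3 is independent (VIF near 1):

```python
# Variance Inflation Factor per column of a design matrix.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # strongly correlated with x1
x3 = rng.normal(size=300)                  # independent
X = np.column_stack([x1, x2, x3])

vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
# the correlated pair gets a large VIF; the independent column stays low
```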

8, RFE - Recursive Feature Elimination

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe = rfe.fit(X_train, y_train)
rfe.support_   # boolean mask of the selected columns
rfe.ranking_   # rank 1 = selected

REGULARIZATION MODELS IN MACHINE LEARNING -

These are models which help avoid overfitting in regression.

Lasso/Ridge Regression

Lasso - L1 Regularization

1, initialize b0
2, initialize b1
3, y_pred = b0 + b1x
4, error = (1/n) * sum((y - y_pred)^2) + penalty term
Penalty term - alpha * (sum of the absolute values of all coefficients)
Alpha - a number chosen by the user (eg: 1.0, 0.1, 0.001)
5, gradient descent:
b0_new = b0_old - (learning_rate * d(error)/db0)
b1_new = b1_old - (learning_rate * d(error)/db1)

6, steps 3, 4, 5 are repeated until the optimal b0 and b1 are reached
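The six steps above can be sketched in plain NumPy for the one-input case. Following common practice (an assumption, not stated in the notes), the L1 penalty here is applied to the slope b1 only, not the intercept:

```python
# Lasso by hand: initialize b0, b1, then repeat predict -> error with
# L1 penalty -> gradient descent updates until the coefficients settle.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=100)

b0, b1 = 0.0, 0.0            # steps 1-2: initialize the coefficients
alpha, lr, n = 0.1, 0.01, len(x)

for _ in range(2000):        # step 6: repeat until b0, b1 settle
    y_pred = b0 + b1 * x                     # step 3
    residual = y - y_pred
    # step 4 error: (1/n) * sum((y - y_pred)^2) + alpha * |b1|
    # step 5: gradient descent on that error
    d_b0 = -2 / n * residual.sum()
    d_b1 = -2 / n * (residual * x).sum() + alpha * np.sign(b1)
    b0 -= lr * d_b0
    b1 -= lr * d_b1
```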

Ridge - L2 Regularization

1, initialize b0
2, initialize b1
3, y_pred = b0 + b1x
4, error = (1/n) * sum((y - y_pred)^2) + penalty term
Penalty term - alpha * (sum of the squares of all coefficients)
Alpha - a number chosen by the user (eg: 1.0, 0.1, 0.001)
5, gradient descent:
b0_new = b0_old - (learning_rate * d(error)/db0)
b1_new = b1_old - (learning_rate * d(error)/db1)

6, steps 3, 4, 5 are repeated until the optimal b0 and b1 are reached

Note: in the process of reducing the error by adding the penalty term, both Lasso and Ridge shrink the coefficient values. In the process of reducing the coefficients:

Lasso - will make some of the coefficients exactly zero (which can drop features a business use case still needs - my perspective)
Ridge - will make some of the coefficients close to zero, but never exactly zero
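The zero-vs-near-zero difference is easy to see with sklearn on data where only 2 of 10 features actually matter; the alpha values here are arbitrary illustration choices:

```python
# Lasso zeroes out useless coefficients; Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only columns 0 and 1 influence y; the other 8 are useless
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))  # exactly zero
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))  # shrunk, but not zero
```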

Polynomial regression:

The degree of the polynomial is important.

How to identify the degree: by plotting the input against the output,

or by trial and error.
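The trial-and-error approach can be sketched by fitting several candidate degrees and comparing scores; the data here is deliberately quadratic, so degree 2 should win clearly over degree 1:

```python
# Polynomial regression: try candidate degrees and compare the fit.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

scores = {}
for degree in (1, 2, 3):  # trial and error over candidate degrees
    model = make_pipeline(PolynomialFeatures(degree),
                          LinearRegression()).fit(x, y)
    scores[degree] = r2_score(y, model.predict(x))
```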

SGD - Stochastic Gradient Descent

In plain linear regression we used the closed-form formula to arrive at b0 and b1 rather than starting them at zero; we also didn't do gradient descent.

In Lasso or Ridge regression, b0 and b1 are initialized at zero, then gradient descent reduces the error; in Lasso and Ridge we also added a penalty term to the error formula.

SGD basically arrives at b0 and b1 starting from initial values,

then performs gradient descent to optimize them, updating on randomly chosen records rather than the whole dataset at each step.

In the error we don't have any penalty term; the penalty term is only for Ridge and Lasso.
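sklearn's SGDRegressor implements this. Its default penalty is L2, so as an assumption here alpha is set tiny to make that penalty negligible, matching the "no penalty term" description above:

```python
# Plain linear regression fitted by stochastic gradient descent.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# alpha=1e-6 makes the default L2 penalty effectively zero
sgd = SGDRegressor(alpha=1e-6, max_iter=2000, random_state=0)
sgd.fit(X, y)

b0, b1 = sgd.intercept_[0], sgd.coef_[0]  # should land near 1.0 and 2.0
```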
x   y
2   3
3   4
4   2
5   5
6   7
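Fitting Y = B0 + B1X to the five (x, y) points above with the least-squares formulas gives a clean result, which is easy to verify by hand:

```python
# Least-squares fit of the five points in the table:
# B1 = Sxy / Sxx, B0 = y_mean - B1 * x_mean.
import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 4, 2, 5, 7], dtype=float)

x_mean, y_mean = x.mean(), y.mean()  # 4.0 and 4.2
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean

print(f"Y = {b0:.1f} + {b1:.1f} X")  # prints Y = 0.6 + 0.9 X
```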
