Machine Learning

The document outlines the roles of data engineers, analysts, and scientists in the AI journey, detailing their responsibilities in data collection, visualization, and mathematical analysis. It also covers machine learning concepts, including types, model building steps, and techniques like linear and logistic regression, as well as regularization methods to prevent overfitting. Additionally, it discusses the importance of statistical measures such as p-values and R-squared in model evaluation.


JOURNEY OF AI:

DATA ENGINEER - the job of a data engineer is to gather information from different places and put it in a common environment.

Eg: Mainframe, Linux, Unix, etc.

We collect data from these different ecosystems and build a data lake in big data/Hadoop.

Before around 2010 - databases; by around 2015 - big data.

Storage costs also fell sharply: where 1 GB once cost Rs 800, later 15 GB cost the same Rs 800.

DATA ANALYST - (minimal amount of data engineering)

Data visualization and understanding (identify the business insight and present it).
Builds a shared understanding between the business and the data analyst.

DATA SCIENTIST -
Applies mathematical statistics to the data and tries to identify relationships within it.

MACHINE LEARNING -

Predicts the future.

A set of mathematical equations/formulas which, when applied to a given set of inputs, can predict the output.

The field emerged in the mid-20th century (the term "machine learning" dates to 1959).

Types of Machine Learning -

Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning

Equation of a straight line - y = mx + c

Linear Regression -

One input and one output - simple linear regression:

Y = c + mX, usually written Y = B0 + B1X

When you have more than one input and one output - multiple linear regression:

Y = c + m1x1 + m2x2 + ... + mnxn

Example of a fitted simple model:
Y = 3.3518 + 0.0527X
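A minimal sketch of fitting Y = B0 + B1X with sklearn. The data here is made up for illustration; the coefficients 3.3518 and 0.0527 above came from a different, unspecified dataset.

```python
# Simple linear regression on illustrative data: y is roughly 2x, so
# the fitted slope B1 should come out near 2.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one input column
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])           # one output

model = LinearRegression()
model.fit(X, y)

b0 = model.intercept_  # B0
b1 = model.coef_[0]    # B1
print(f"Y = {b0:.4f} + {b1:.4f} X")
```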


Model building:

step 1: load the data
step 2: EDA (done by the data analyst and data scientist)
step 3: train test split

Imagine you have 100 records: split 70 records for training and keep the remaining 30 records for testing.

step 4: model building, using the train data
step 5: model evaluation, on the test data with the model built in step 4
- the test data has both input and output
- in step 5 the model predicts the output
- you calculate the error between the actual output and the predicted output

This error calculation should be done in both step 4 and step 5.

step 6: calculate the performance/accuracy
step 7: check for overfitting/underfitting

Eg: training - 90% accuracy, test data - 50% accuracy -> overfitting

If the model overfits/underfits:

repeat steps 4, 5, 6
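The steps above can be sketched end to end with sklearn. The data is synthetic (100 records generated from a known line plus noise), so this is only an illustration of the workflow:

```python
# Model-building steps 1-7 on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# step 1: load the data (synthetic here: y = 3 + 2x + noise)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

# step 3: train test split, 70 records for training, 30 for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# step 4: model building on the train data
model = LinearRegression().fit(X_train, y_train)

# step 5: predict on both sets and calculate the error
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# step 6: performance/accuracy
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# step 7: a large gap between train and test scores signals overfitting
print(f"train R2={train_r2:.3f} test R2={test_r2:.3f}")
```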
Linear Regression Formula -

Using the least-squares formulas below, we calculate B0 and B1 and then predict the output:

B1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
B0 = y_mean - B1 * x_mean

ERROR:

R-squared - the goodness of fit ("accuracy") of the model

The R-squared value will be between 0 and 1.
If the value is 0.90, it means the model explains 90% of the variance in the output.
Gradient Descent - an iterative method that repeatedly adjusts the coefficients in the direction that reduces the error (the update steps are shown under Lasso/Ridge below).
Logistic regression:

It is used for classification problems, where we classify the outcome as yes/no.

If yes is 1, then no is 0, and vice versa.

The logistic (sigmoid) equation gives you a probability between 0 and 1:

p = 1 / (1 + e^-(B0 + B1X))

It is the developer who chooses the threshold.

By default, logistic regression comes with a threshold of 0.5.
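A short sketch of the probability-plus-threshold idea, on synthetic data. The custom threshold of 0.3 is an arbitrary example value:

```python
# Logistic regression outputs a probability; the developer picks the
# threshold. predict() uses the default 0.5; a custom threshold is just
# a comparison against predict_proba's output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]        # probability of class 1, in [0, 1]
default_pred = clf.predict(X)             # equivalent to proba > 0.5
custom_pred = (proba >= 0.3).astype(int)  # developer-chosen threshold
```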


Multiple logistic regression using sklearn -

The confusion matrix is used to evaluate model accuracy in logistic regression -

Sensitivity - out of the actual positive outcomes, how many the model predicted correctly

Specificity - out of the actual negative outcomes, how many the model predicted correctly

Accuracy - overall accuracy
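These three measures fall straight out of the confusion matrix. A minimal sketch on made-up actual/predicted labels (1 = yes, 0 = no):

```python
# Sensitivity, specificity and accuracy from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred   = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()

sensitivity = tp / (tp + fn)  # of the actual positives, fraction correct
specificity = tn / (tn + fp)  # of the actual negatives, fraction correct
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall
```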


Logistic model building -

1, sourcing the data
2, convert alphabetic (categorical) columns to numeric (get_dummies)
3, train test split
4, model building on the train data
5, model prediction on the test data
6, accuracy (confusion matrix)
7, get the optimal threshold
7a, get sensitivity, specificity, accuracy for different thresholds
7b, based on the business use case, pick the optimal threshold as required

8, check overfit/underfit
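Steps 7a and 7b can be sketched as a simple threshold sweep; here "optimal" is taken to mean highest overall accuracy, but a real business case might weight sensitivity or specificity instead:

```python
# Step 7a: sensitivity, specificity and accuracy for thresholds 0.1-0.9.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, random_state=1)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

rows = []
for t in np.arange(0.1, 1.0, 0.1):
    pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    rows.append((round(t, 1),
                 tp / (tp + fn),        # sensitivity
                 tn / (tn + fp),        # specificity
                 (tp + tn) / len(y)))   # accuracy

# Step 7b: pick the threshold whose trade-off suits the use case,
# e.g. here simply the one with the highest overall accuracy.
best = max(rows, key=lambda r: r[3])
```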

ROC - receiver operating characteristic curve:

AUC - area under the ROC curve
True positive rate = recall = sensitivity
False positive rate = 1 - specificity

Specificity - out of the total negatives, how many you got right

FPR - out of the total negatives, how many you got wrong
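sklearn computes the ROC points and the AUC directly from the predicted probabilities; a minimal sketch on synthetic data:

```python
# ROC curve: true positive rate (sensitivity) against false positive
# rate (1 - specificity) across all thresholds; AUC summarizes it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=300, random_state=2)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)  # 0.5 = random guessing, 1.0 = perfect
```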
When you perform a train/test split:

If the model performs very well in training but shows more variance/more error in testing, it is overfitting.

Overfitting - high variance and low bias.

If the model is unable to learn properly even in training, it is underfitting.

Underfitting - high bias and low variance.


How to remove unnecessary columns:

1, correlated columns
2, columns that have no influence on the target, like Employee Id or Employee Name
3, derived metrics
4, all the data-analysis techniques you learned, e.g. removing outliers
Hypothesis testing - how it is useful in column selection:

A very important parameter of this analysis is the p-value.

The null hypothesis corresponding to each p-value is that the corresponding independent variable does not impact the dependent variable.
The alternate hypothesis is that the corresponding independent variable impacts the dependent variable.

A low p-value means the observed data would be very unlikely if the null hypothesis were true. Therefore, a low p-value, i.e. less than 0.05, indicates that you can reject the null hypothesis.

p > 0.05 - fail to reject the null hypothesis - the independent variable does not impact the dependent variable
p < 0.05 - reject the null hypothesis - the independent variable impacts the dependent variable

5, normalizing
6, regularization models (regression, discussed below)
7, VIF - Variance Inflation Factor

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn

If my
VIF < 2 - very, very good fit, you can keep the column
VIF < 5 - very good - you can keep the column
VIF < 10 - okay - you can keep the column
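VIF can be computed per column with statsmodels; a minimal sketch where x2 is almost a copy of x1 (so both get a high VIF) while x3 is independent (VIF near 1):

```python
# Variance Inflation Factor per column of a design matrix.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # strongly correlated with x1
x3 = rng.normal(size=300)                  # independent
X = np.column_stack([x1, x2, x3])

vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
# the correlated pair gets a large VIF; the independent column stays low
```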

8, RFE - Recursive Feature Elimination

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe = rfe.fit(X_train, y_train)
rfe.support_   # boolean mask of the selected columns
rfe.ranking_   # rank 1 = selected

REGULARIZATION MODELS IN MACHINE LEARNING -

These are models which help avoid overfitting in regression.

Lasso/Ridge Regression

Lasso - L1 Regularization

1, initialize b0
2, initialize b1
3, y_pred = b0 + b1x
4, error = (1/n) * sum((y - y_pred)^2) + penalty term
Penalty term - alpha * (sum of the absolute values of all coefficients)
Alpha - a number chosen by the user (eg: 1.0, 0.1, 0.001)
5, gradient descent:
b0_new = b0_old - (learning_rate * d(error)/db0)
b1_new = b1_old - (learning_rate * d(error)/db1)

6, steps 3, 4, 5 are repeated until the optimal b0 and b1 are reached
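The six steps above can be sketched in plain NumPy for the one-input case. Following common practice (an assumption, not stated in the notes), the L1 penalty here is applied to the slope b1 only, not the intercept:

```python
# Lasso by hand: initialize b0, b1, then repeat predict -> error with
# L1 penalty -> gradient descent updates until the coefficients settle.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=100)

b0, b1 = 0.0, 0.0            # steps 1-2: initialize the coefficients
alpha, lr, n = 0.1, 0.01, len(x)

for _ in range(2000):        # step 6: repeat until b0, b1 settle
    y_pred = b0 + b1 * x                     # step 3
    residual = y - y_pred
    # step 4 error: (1/n) * sum((y - y_pred)^2) + alpha * |b1|
    # step 5: gradient descent on that error
    d_b0 = -2 / n * residual.sum()
    d_b1 = -2 / n * (residual * x).sum() + alpha * np.sign(b1)
    b0 -= lr * d_b0
    b1 -= lr * d_b1
```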

Ridge - L2 Regularization

1, initialize b0
2, initialize b1
3, y_pred = b0 + b1x
4, error = (1/n) * sum((y - y_pred)^2) + penalty term
Penalty term - alpha * (sum of the squares of all coefficients)
Alpha - a number chosen by the user (eg: 1.0, 0.1, 0.001)
5, gradient descent:
b0_new = b0_old - (learning_rate * d(error)/db0)
b1_new = b1_old - (learning_rate * d(error)/db1)

6, steps 3, 4, 5 are repeated until the optimal b0 and b1 are reached

Note: in the process of reducing the error by adding the penalty term, both Lasso and Ridge shrink the coefficient values. In the process of reducing the coefficients:

Lasso - will make some of the coefficients exactly zero (which can drop features a business use case still needs - my perspective)
Ridge - will make some of the coefficients close to zero, but never exactly zero
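The zero-vs-near-zero difference is easy to see with sklearn on data where only 2 of 10 features actually matter; the alpha values here are arbitrary illustration choices:

```python
# Lasso zeroes out useless coefficients; Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only columns 0 and 1 influence y; the other 8 are useless
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))  # exactly zero
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))  # shrunk, but not zero
```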

Polynomial regression:

The degree of the polynomial is important.

How to identify the degree: by plotting the input against the output,

or by trial and error.
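The trial-and-error approach can be sketched by fitting several candidate degrees and comparing scores; the data here is deliberately quadratic, so degree 2 should win clearly over degree 1:

```python
# Polynomial regression: try candidate degrees and compare the fit.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

scores = {}
for degree in (1, 2, 3):  # trial and error over candidate degrees
    model = make_pipeline(PolynomialFeatures(degree),
                          LinearRegression()).fit(x, y)
    scores[degree] = r2_score(y, model.predict(x))
```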

SGD - Stochastic Gradient Descent

In plain linear regression we used the closed-form formula to arrive at b0 and b1 rather than starting them at zero; we also didn't do gradient descent.

In Lasso or Ridge regression, b0 and b1 are initialized at zero, then gradient descent reduces the error; in Lasso and Ridge we also added a penalty term to the error formula.

SGD basically arrives at b0 and b1 starting from initial values,

then performs gradient descent to optimize them, updating on randomly chosen records rather than the whole dataset at each step.

In the error we don't have any penalty term; the penalty term is only for Ridge and Lasso.
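sklearn's SGDRegressor implements this. Its default penalty is L2, so as an assumption here alpha is set tiny to make that penalty negligible, matching the "no penalty term" description above:

```python
# Plain linear regression fitted by stochastic gradient descent.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# alpha=1e-6 makes the default L2 penalty effectively zero
sgd = SGDRegressor(alpha=1e-6, max_iter=2000, random_state=0)
sgd.fit(X, y)

b0, b1 = sgd.intercept_[0], sgd.coef_[0]  # should land near 1.0 and 2.0
```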
x   y
2   3
3   4
4   2
5   5
6   7
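Fitting Y = B0 + B1X to the five (x, y) points above with the least-squares formulas gives a clean result, which is easy to verify by hand:

```python
# Least-squares fit of the five points in the table:
# B1 = Sxy / Sxx, B0 = y_mean - B1 * x_mean.
import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 4, 2, 5, 7], dtype=float)

x_mean, y_mean = x.mean(), y.mean()  # 4.0 and 4.2
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean

print(f"Y = {b0:.1f} + {b1:.1f} X")  # prints Y = 0.6 + 0.9 X
```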
