FACULTY OF ENGINEERING AND TECHNOLOGY
AIML 4: SUPERVISED MACHINE LEARNING
2024-25
www.jainuniversity.ac.in  www.set.jainuniversity.ac.in
Machine Learning
Objectives
1) What is Learning?
What is Machine Learning?
Machine Learning is the study of methods for programming
computers to learn.
What is Machine Learning?
Diagram: training data is fed to a learning algorithm, which produces a trained machine; the trained machine then maps a query to an answer.
Steps in machine learning
1) Data collection.
2) Representation.
3) Modeling.
4) Estimation.
5) Validation.
General structure of a learning system
Diagram: a learning system, guided by a teacher, feeds a problem-solving component; the results are passed to performance evaluation, which in turn guides further learning.
Advantages of ML
1) Solving vision problems through statistical inference.
Disadvantages of ML
1) Computational complexity.
Types of Machine Learning
1) Unsupervised Learning.
2) Semi-Supervised Learning.
3) Supervised Learning.
Unsupervised Learning
Advantage
Most of the laws of science were developed through
unsupervised learning.
Disadvantage
The identification of the features itself is a complex
problem in many situations.
Semi-Supervised Learning
Semi-supervised learning sits between supervised and unsupervised learning in the amount of labelled and unlabelled data required for training.
The goal is to reduce the amount of supervision required compared to supervised learning, while at the same time improving the results of unsupervised clustering to meet the expectations of the user.
Applications of Machine Learning
Drug discovery
Medical diagnosis (photo, MRI, CT)
Iris verification
Radar imaging
Speech recognition
Fingerprint identification
Signature verification
Face recognition
Target recognition
Robotic vision
Traffic monitoring
Linear Regression
Regression predicts continuous output variables from independent input variables, for example predicting house prices from parameters such as house age, distance from the main road, location, and area.
Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to the observed data.
Linear Regression
When there is only one independent feature, it is known as Simple Linear Regression; when there is more than one feature, it is known as Multiple Linear Regression.
Why is Linear Regression important?
Linear Regression is a supervised learning algorithm that predicts a continuous output variable based on one or more input features.
Its simplicity is a virtue: linear regression is transparent, easy to implement, and serves as a foundational concept for more complex algorithms.
Simple Linear Regression
This is the simplest form of linear regression; it involves only one independent variable and one dependent variable. The equation for simple linear regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables; a short sketch follows.
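A minimal sketch of multiple linear regression with scikit-learn; the two-feature dataset below is invented purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Illustrative data: two independent features (X1, X2) and one target Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y = np.array([10, 11, 21, 22, 30])
model = LinearRegression()
model.fit(X, y)
# Intercept (β0) and slopes (β1, β2) of the fitted plane
print("Intercept:", model.intercept_)
print("Slopes:", model.coef_)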
Linear Regression
Y is called the dependent or target variable, and X is called the independent variable, also known as the predictor of Y.
Linear Regression Algorithm
1. Initialize model parameters (β0, β1)
2. Calculate predicted values (y_pred = β0 + β1x)
3. Calculate error (ε = y_true - y_pred)
4. Calculate cost function (MSE = (1/n) * Σ(ε^2))
5. Update model parameters using optimization
algorithm (e.g., Gradient Descent)
6. Repeat steps 2-5 until convergence
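A minimal from-scratch sketch of steps 1-6 using gradient descent; the toy data, learning rate, and iteration count are illustrative choices rather than values from the slides:
import numpy as np
# Toy data (illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_true = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
# Step 1: initialize model parameters
b0, b1 = 0.0, 0.0
lr = 0.01                       # learning rate (assumed)
for _ in range(5000):           # Step 6: repeat until (approximate) convergence
    y_pred = b0 + b1 * x        # Step 2: predicted values
    error = y_true - y_pred     # Step 3: error
    mse = np.mean(error ** 2)   # Step 4: cost function (MSE)
    # Step 5: gradient descent update of β0 and β1
    b0 += lr * 2 * np.mean(error)
    b1 += lr * 2 * np.mean(error * x)
print("beta0:", b0, "beta1:", b1, "MSE:", mse)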
Linear Regression Algorithm
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Generate sample data (X must be 2-D for scikit-learn)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3, 5, 7, 11])
Linear Regression Algorithm
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
# Plot data and regression line
plt.scatter(X, y)
plt.plot(X, predictions, color='red')
plt.show()
Applications
1. Predicting house prices
2. Forecasting sales
3. Analyzing stock prices
4. Energy consumption prediction
5. Medical diagnosis
Advantages
1. Interpretability
2. Simplicity
3. Computational efficiency
4. Wide range of applications
Disadvantages
1. Assumptions may not hold
2. Sensitive to outliers
3. Limited to linear relationships
Real-world examples
1. Google's self-driving cars (predicting steering angles)
2. Netflix's recommendation system (predicting user ratings)
3. Weather forecasting (predicting temperature and precipitation)
Example
Problem Statement: Predict house prices based on the number of bedrooms.
Dataset:
| Bedrooms | Price   |
| -------- | ------- |
| 1        | 100,000 |
| 2        | 150,000 |
| 3        | 200,000 |
| 4        | 250,000 |
| 5        | 300,000 |
Python code
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Define dataset (X must be 2-D for scikit-learn)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([100000, 150000, 200000, 250000, 300000])
# Create and train model
model = LinearRegression()
model.fit(X, y)
Python code
# Make predictions
predictions = model.predict(X)
# Plot data and regression line
plt.scatter(X, y)
plt.plot(X, predictions, color='red')
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.show()
Python code
# Print coefficients
print('Intercept (β0):', model.intercept_)
print('Slope (β1):', model.coef_)
Output:
Intercept (β0): 50000.0
Slope (β1): 50000.0
The fitted equation is: Price = 50000 + 50000 * Bedrooms. This means that for each additional bedroom, the price increases by $50,000.
Logistic Regression
Logistic Regression is a supervised learning algorithm used
to predict the probability of an event occurring (binary
classification). It models the relationship between a
dependent variable (target) and one or more independent
variables (features).
Logistic Regression
Key Components:
1. Logistic Function (Sigmoid): Maps input to a probability between 0 and 1.
2. Cost Function (Log Loss): Measures the difference between predicted probabilities and actual labels.
Logistic Regression
Logistic Regression Equation:
p = 1 / (1 + e^(-z))
where:
p: probability of the positive class
z: linear combination of the input features (z = β0 + β1X1 + … + βnXn)
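A minimal sketch of the sigmoid mapping; the z values are arbitrary examples:
import numpy as np
def sigmoid(z):
    # Maps any real-valued score z to a probability p between 0 and 1
    return 1 / (1 + np.exp(-z))
# z is the linear combination of the features, e.g. z = β0 + β1*X1 + ... + βn*Xn
for z in (-2.0, 0.0, 2.0):
    print(z, "->", sigmoid(z))   # roughly 0.12, 0.5, 0.88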
Logistic Regression
Types of Logistic Regression:
1. Binary Logistic Regression: two-class classification.
2. Multinomial Logistic Regression: multi-class classification.
3. Ordinal Logistic Regression: ordered multi-class classification.
Logistic Regression Algorithm
1. Import necessary libraries and load the dataset.
2. Preprocess the data (handle missing values, normalize/scale features).
3. Split the data into training (~70%) and testing (~30%) sets.
4. Create a Logistic Regression model.
5. Train the model using the training data.
6. Evaluate the model using the testing data.
Logistic Regression
Common Evaluation Metrics:
Accuracy
Precision
Recall
F1-score
ROC-AUC
Logistic Regression
Advantages:
1. Interpretable: model weights indicate feature importance.
2. Efficient: computationally fast.
3. Simple: easy to implement.
Disadvantages:
1. Assumes linearity in the relationship between features and target.
2. Sensitive to outliers, which affect model performance.
3. Not suitable for complex, non-linear relationships.
Logistic Regression
Applications:
Credit Risk Assessment
Medical Diagnosis
Spam Detection
Image Classification
Logistic Regression
Example: Predicting diabetes based on health indicators.
Dataset:
| Feature  | Description              |
| -------- | ------------------------ |
| Age      | Patient's age            |
| BMI      | Body Mass Index          |
| BP       | Blood Pressure           |
| Glucose  | Blood glucose level      |
| Diabetes | Target variable (Yes/No) |
Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
df = pd.read_csv('diabetes.csv')
Logistic Regression
# Preprocess data (scale the numeric features)
scaler = StandardScaler()
df[['Age', 'BMI', 'BP', 'Glucose']] = scaler.fit_transform(df[['Age', 'BMI', 'BP', 'Glucose']])
# Split data
X = df.drop('Diabetes', axis=1)
y = df['Diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create Logistic Regression model
logreg = LogisticRegression(max_iter=1000)
Logistic Regression
# Train model
logreg.fit(X_train, y_train)
# Evaluate model
y_pred = logreg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Logistic Regression
Output:
Accuracy: 0.85
Classification Report:
              precision  recall  f1-score
Diabetes          0.83    0.86      0.84
No Diabetes       0.86    0.83      0.85
Confusion Matrix:
[[55 10]
 [12 53]]
Logistic Regression
Feature Importance:
| Feature | Coefficient |
| ------- | ----------- |
| Age     | 0.23        |
| BMI     | 0.31        |
| BP      | 0.17        |
| Glucose | 0.45        |
Glucose has the highest coefficient, indicating its significant impact on the diabetes prediction.
This example demonstrates how Logistic Regression can be applied to binary classification problems in healthcare.
Parameters
Accuracy is the proportion of correctly predicted instances out of the total instances in a test dataset.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = True Positives (instances correctly predicted as positive)
TN = True Negatives (instances correctly predicted as negative)
FP = False Positives (instances incorrectly predicted as positive)
FN = False Negatives (instances incorrectly predicted as negative)
Parameters
Types of Accuracy:
1. Training Accuracy: accuracy on training data.
2. Testing Accuracy: accuracy on unseen test data.
3. Validation Accuracy: accuracy on validation data.
Parameters
Importance:
1. Evaluates model performance.
2. Compares models.
3. Identifies overfitting/underfitting.
Parameters
Precision:
Precision is the ratio of true positives (TP) to the total predicted positives.
Precision = TP / (TP + FP)
Parameters
Importance:
1. Evaluates the model's ability to avoid false positives.
2. Critical in applications with severe consequences (e.g., medical diagnosis).
3. Balances with recall to achieve optimal performance.
Parameters
Recall:
Recall is the ratio of true positives (TP) to the total actual positives.
Recall = TP / (TP + FN)
Parameters
Precision vs Recall:
1. Precision focuses on the accuracy of positive predictions; recall focuses on completeness.
2. Precision is sensitive to false positives; recall is sensitive to false negatives.
3. Precision and recall typically trade off against each other.
Parameters
F1-Score:
The F1-score is the harmonic mean of precision and recall.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Parameters
Confusion matrix:
A confusion matrix is a table that evaluates a classifier by counting true positives, true negatives, false positives, and false negatives.
Parameters
Metrics derived from the confusion matrix (a short computational sketch follows):
1. Accuracy: (TP + TN) / (TP + TN + FP + FN)
2. Precision: TP / (TP + FP)
3. Recall: TP / (TP + FN)
4. F1-score: 2 * (Precision * Recall) / (Precision + Recall)
5. False Positive Rate: FP / (FP + TN)
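A minimal sketch computing these metrics from raw counts; the TP/TN/FP/FN values are illustrative, loosely echoing the confusion matrix shown earlier:
# Illustrative counts (assumed, not from a real model)
TP, TN, FP, FN = 55, 53, 10, 12
accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
fpr       = FP / (FP + TN)          # false positive rate
print("Accuracy:", round(accuracy, 2))
print("Precision:", round(precision, 2))
print("Recall:", round(recall, 2))
print("F1-score:", round(f1, 2))
print("False Positive Rate:", round(fpr, 2))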
Decision Tree Algorithm
Algorithm Steps:
1. Choose Age as the root node attribute.
2. Split data into Age < 30 and Age >= 30.
3. For Age < 30, choose Income as the next attribute.
4. Split data into Income < 55000 and Income >= 55000.
Decision Tree Algorithm
Python Implementation:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load dataset
df = pd.DataFrame({'Age': [25, 30, 40, 20, 35],
                   'Income': [50000, 60000, 70000, 30000, 55000],
                   'Buys Car': [1, 1, 1, 0, 1]})
Decision Tree Algorithm
# Define features and target
X = df[['Age', 'Income']]
y = df['Buys Car']
# Train/Test split (test_size and random_state chosen for illustration)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Decision Tree Algorithm
# Create Decision Tree model
clf = DecisionTreeClassifier(random_state=42)
# Train model
clf.fit(X_train, y_train)
# Evaluate model on the held-out test data
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)
Decision Tree Algorithm
Advantages:
1. Easy to interpret
2. Handles categorical features
3. Fast training and prediction
Disadvantages:
1. Prone to overfitting
2. Not suitable for complex relationships
Decision Tree Algorithm
Real-World Applications:
1. Credit risk assessment
2. Medical diagnosis
3. Customer segmentation
4. Image classification
5. Natural Language Processing
Random Forest Algorithm
Random Forest is an ensemble learning algorithm that
combines multiple decision trees to improve prediction
accuracy and reduce overfitting.
Key components:
1. Bootstrapping: Randomly select samples from the training data.
2. Decision Tree Creation: Train a decision tree on the
bootstrapped samples.
3. Feature Randomization: Randomly select features for each
decision tree.
4. Voting: Combine predictions from multiple decision trees.
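A minimal from-scratch sketch of the key components above (bootstrapping, per-tree training with feature randomization, and majority voting); the toy data and tree count are illustrative, and in practice scikit-learn's RandomForestClassifier performs all of these steps internally, as in the code later in this section:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
rng = np.random.default_rng(42)
# Toy data: (age, income) -> bought car (illustrative)
X = np.array([[25, 50000], [30, 60000], [35, 70000], [20, 40000], [40, 80000]])
y = np.array([1, 1, 1, 0, 1])
trees = []
for _ in range(10):
    # 1. Bootstrapping: sample rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # 2-3. Train a decision tree on the bootstrapped sample
    #      (max_features introduces feature randomization at each split)
    tree = DecisionTreeClassifier(max_features=1, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)
# 4. Voting: combine the trees' predictions by majority vote
new_point = np.array([[28, 52000]])
votes = np.array([t.predict(new_point)[0] for t in trees])
print("Votes:", votes, "-> prediction:", np.bincount(votes).argmax())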
Random Forest Algorithm
How it works:
1. Train decision trees on bootstrapped samples.
2. Each decision tree predicts an outcome.
3. Combine the predictions using voting.
Random Forest Algorithm
Example: Suppose we want to predict whether someone will buy a car based on age, income, credit score, and location.
| Age | Income | Credit Score | Location | Bought Car |
| --- | ------ | ------------ | -------- | ---------- |
| 25  | 50000  | 700          | Urban    | Yes        |
| 30  | 60000  | 800          | Suburban | Yes        |
| 35  | 70000  | 900          | Rural    | Yes        |
| 20  | 40000  | 600          | Urban    | No         |
| 40  | 80000  | 950          | Suburban | Yes        |
Random Forest Algorithm
Random Forest model:
Create 100 decision trees, each with a random feature subset.
Each decision tree predicts whether someone will buy a car.
Combine the predictions using voting.
Random Forest Algorithm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Load data
df = pd.read_csv('car_data.csv')
# Split into features and target; one-hot encode the categorical Location column
X = pd.get_dummies(df[['Age', 'Income', 'Credit Score', 'Location']], columns=['Location'])
y = df['Bought Car']
Random Forest Algorithm
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
Random Forest Algorithm
# Make predictions
y_pred = rf.predict(X_test)
# Evaluate model
accuracy = rf.score(X_test, y_test)
print("Accuracy:", accuracy)
Random Forest Algorithm
Advantages:
1. Improves prediction accuracy
2. Reduces overfitting
3. Handles high-dimensional data
4. Robust to missing values
5. Parallelizable
Disadvantages:
1. Computationally expensive
2. Difficult to interpret
3. Requires hyperparameter tuning
Random Forest Algorithm
Real-World Applications:
Image classification
Natural language processing
Recommender systems
Credit risk assessment
Medical diagnosis
Overfitting
Overfitting occurs when a model is too complex and performs well on training data but poorly on new, unseen data.
Causes:
1. Complex models
2. Small training datasets
3. Noise in training data
4. Feature correlation
5. Poor regularization
Underfitting
Underfitting occurs when a model is too simple and fails to
capture the underlying patterns in the training data,
resulting in poor performance on both training and test
data.
Causes:
Simple models
Insufficient training data
Lack of relevant features
Inadequate model complexity
Poor feature engineering
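A minimal sketch contrasting underfitting and overfitting by fitting polynomials of increasing degree to noisy data; the data, noise level, and degrees are illustrative choices:
import numpy as np
rng = np.random.default_rng(0)
def noisy(x):
    # Underlying pattern plus random noise
    return np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
x_train = np.linspace(0, 1, 20)
y_train = noisy(x_train)
x_test = np.linspace(0.025, 0.975, 20)
y_test = noisy(x_test)
for degree in (1, 3, 10):
    # Fit a polynomial of the given degree and compare train vs test error
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree:2d}  train MSE = {train_mse:.3f}  test MSE = {test_mse:.3f}")
# A degree-1 fit underfits (both errors stay high); a high-degree fit drives the
# training error down by also fitting the noise, which is the signature of overfitting.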
KNN algorithm
KNN (K-Nearest Neighbors) is a supervised
machine learning algorithm that classifies new
data points based on the similarity to nearby
data points.
Key Components
1. Features: Input variables (e.g., age, income).
2. Target Variable: Output variable (e.g., bought
car).
3. Distance Metric: Measure of similarity (e.g.,
Euclidean).
4. K: Number of nearest neighbors.
Types
1. Classification KNN: Predict categorical labels.
2. Regression KNN: Predict continuous values.
KNN algorithm
1. Collect and preprocess data. (features and
target variable)
2. Choose K (number of nearest neighbors).
3. Calculate distance between new data point
and existing data points.
4. Select K nearest neighbors.
5. Assign label based on majority vote.
Distance Metrics
1. Euclidean Distance
2. Manhattan Distance (L1 Distance)
3. Minkowski Distance
4. Cosine Similarity
5. Hamming Distance
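A minimal sketch of the first two distance metrics; the two sample points (age, income in thousands) are arbitrary:
import numpy as np
a = np.array([32.0, 55.0])   # illustrative point: (age, income in thousands)
b = np.array([30.0, 60.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line (L2) distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences (L1)
print("Euclidean:", euclidean)   # sqrt(2^2 + 5^2) ≈ 5.39
print("Manhattan:", manhattan)   # 2 + 5 = 7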
KNN algorithm
Example: Suppose we want to predict whether someone will buy a car based on their age and income.
Training Data:
| Age | Income | Bought Car |
| --- | ------ | ---------- |
| 25  | 50000  | Yes        |
| 30  | 60000  | Yes        |
| 35  | 70000  | Yes        |
| 20  | 40000  | No         |
| 40  | 80000  | Yes        |
| 45  | 90000  | Yes        |
KNN algorithm
Test Data:
| Age | Income |
| --- | ------ |
| 32  | 55000  |
KNN algorithm
Distance Calculations (illustrative values, measured to the test point):
| Age | Income | Distance |
| --- | ------ | -------- |
| 25  | 50000  | 7.07     |
| 30  | 60000  | 5.00     |
| 35  | 70000  | 7.07     |
| 20  | 40000  | 12.53    |
| 40  | 80000  | 10.00    |
| 45  | 90000  | 12.73    |
KNN algorithm
3 Nearest Neighbors:
| Age | Income | Bought Car |
| --- | ------ | ---------- |
| 30  | 60000  | Yes        |
| 25  | 50000  | Yes        |
| 35  | 70000  | Yes        |
All three nearest neighbors are labelled Yes, so the majority vote predicts Yes for the test point.
Python implementation
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Training data from the table above: (age, income) and labels (1 = bought car, 0 = did not)
X_train = np.array([[25, 50000], [30, 60000], [35, 70000],
                    [20, 40000], [40, 80000], [45, 90000]])
y_train = np.array([1, 1, 1, 0, 1, 1])
# Test data
X_test = np.array([[32, 55000]])
# Create and train the model with K = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Make prediction
prediction = knn.predict(X_test)
print(prediction)
Advantages
1. Simple to implement.
2. Effective for non-linear relationships.
3. Handles high-dimensional data.
Disadvantages
1. Computationally expensive.
2. Sensitive to choice of K.
3. Vulnerable to outliers.
Real-World Applications
1. Image classification
2. Text classification
3. Recommendation systems
4. Customer segmentation
5. Predictive maintenance