Machine learning notes
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to learn
and improve their performance on a specific task without being explicitly programmed. Instead
of relying on hard-coded rules, machine learning models analyze data, identify patterns, and
make decisions or predictions based on that data.
1. Data: The foundation of ML. Models learn from historical data to make future
predictions or decisions.
Machine learning is categorized based on the nature of the learning process and the type of data
used. Below are the main types and their subtypes, along with the methods used in each
category.
1. Supervised Learning
Definition:
Supervised learning involves training a model on labeled data, where the input data has
corresponding output labels. The goal is to map inputs to outputs accurately.
Subtypes:
1. Regression:
Predicting continuous values.
o Methods:
▪ Linear Regression
▪ Polynomial Regression
▪ Ridge and Lasso Regression
▪ Support Vector Regression (SVR)
2. Classification:
Predicting discrete categories.
o Methods:
▪ Logistic Regression
▪ Decision Trees
▪ Random Forest
▪ Support Vector Machines (SVM)
▪ Naïve Bayes
▪ k-Nearest Neighbours (KNN)
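As a small illustration of supervised learning, the sketch below fits one of the classifiers listed above (k-Nearest Neighbours) on a tiny labelled dataset. The feature values and the use of scikit-learn are illustrative assumptions, not part of these notes.

```python
# Minimal supervised-learning sketch (assumes scikit-learn is installed).
# A k-Nearest Neighbours classifier learns from labelled examples
# and predicts the class of unseen inputs.
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: [height_cm, weight_kg] -> 0 = "child", 1 = "adult"
X = [[120, 25], [130, 30], [170, 65], [180, 80], [125, 27], [175, 70]]
y = [0, 0, 1, 1, 0, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # learn from the labelled data
print(model.predict([[165, 60]]))    # predict the label of a new input
```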
2. Unsupervised Learning
Definition:
Unsupervised learning deals with unlabelled data. The goal is to identify patterns, structures, or
groupings within the data.
Subtypes:
1. Clustering:
Grouping data points into clusters.
o Methods:
▪ K-Means
▪ K-Medoids
▪ Hierarchical Clustering
▪ DBSCAN (Density-Based Spatial Clustering)
▪ Gaussian Mixture Models (GMM)
2. Dimensionality Reduction:
Reducing the number of features while retaining important information.
o Methods:
▪ Principal Component Analysis (PCA)
3. Anomaly Detection:
Identifying data points that deviate significantly from the rest of the data.
o Methods:
▪ Isolation Forest
▪ One-Class SVM
▪ Clustering-based techniques
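A minimal unsupervised-learning sketch, assuming scikit-learn: K-Means groups unlabelled points into clusters without any target labels. The sample points below are made up purely for illustration.

```python
# Minimal clustering sketch (assumes scikit-learn).
# K-Means groups unlabelled points into k clusters by minimising
# the distance of each point to its cluster centre.
from sklearn.cluster import KMeans

# Unlabelled 2-D points: two loose groups around (1, 1) and (8, 8)
X = [[1, 1], [1.5, 2], [0.8, 1.2], [8, 8], [8.5, 7.5], [7.8, 8.2]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster assignment for each point
print(labels)                        # e.g. [0 0 0 1 1 1] (cluster ids may swap)
print(kmeans.cluster_centers_)       # learned cluster centres
```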
Regression
Regression in machine learning is a supervised learning technique used to model and predict
continuous outcomes. It involves identifying the relationship between dependent (target) and
independent (input) variables.
1. Data Preparation:
o Collect and preprocess data (handle missing values, scaling, etc.).
2. Model Selection:
o Choose the appropriate regression model based on the problem and data
characteristics.
3. Training:
o Fit the model to the training dataset.
4. Evaluation:
o Use metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-
squared to evaluate performance.
5. Prediction:
o Make predictions on unseen (test) data.
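The five steps above map directly onto a typical scikit-learn workflow. The sketch below is a minimal illustration under the assumption that the data is already numeric; the synthetic dataset is invented purely for the example.

```python
# Regression workflow sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# 1. Data preparation: synthetic data, y ~ 3x + 5 with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 2.-3. Model selection and training
model = LinearRegression().fit(X_train, y_train)

# 4. Evaluation on held-out data
pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mean_squared_error(y_test, pred))
print("R^2 :", r2_score(y_test, pred))

# 5. Prediction on new, unseen input
print(model.predict(scaler.transform([[7.5]])))
```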
Linear Regression
Linear Regression is a supervised learning technique in machine learning used to model the
relationship between a dependent variable (target) and one or more independent variables
(features). It predicts a continuous output based on this relationship.
1. Model:
o The relationship is modelled as a linear equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
o y: Target variable.
o $x_1, \dots, x_n$: Input features (independent variables).
o $\beta_0, \dots, \beta_n$: Intercept and coefficients learned from the data.
2. Objective:
o Find the coefficients that make the predicted values as close as possible to the actual values.
3. Error Minimization:
o Linear regression minimizes the error using a cost function, typically Mean
Squared Error (MSE):
o $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
o Here, $y_i$ are the actual values, and $\hat{y}_i$ are the predicted values.
The coefficients are chosen to minimize this MSE.
o The coefficients are calculated using methods like Ordinary Least Squares (OLS)
or Gradient Descent. Once fitted, the model uses the learned coefficients to
make predictions on new data.
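To make the OLS idea concrete, the sketch below computes the coefficients directly from the normal equation with NumPy and checks the resulting MSE. It is a hand-rolled illustration, not how a library necessarily implements it; the data points are invented.

```python
# Ordinary Least Squares via the normal equation (assumes NumPy).
# The coefficients beta minimise the MSE: beta = (X^T X)^(-1) X^T y.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 8.0, 10.9, 14.2, 17.1])      # roughly y = 3x + 2

X = np.column_stack([np.ones_like(x), x])        # add an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)         # [intercept, slope]
y_hat = X @ beta

mse = np.mean((y - y_hat) ** 2)
print("intercept, slope:", beta)
print("MSE:", mse)
```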
To evaluate the accuracy of a linear regression model, the following metrics and analyses are
used:
1. R-squared (R²):
• This metric indicates the proportion of variance in the target variable explained by the
independent variables.
• $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$
• R² ranges from 0 to 1: a value near 1 means the model explains most of the variance,
while a value near 0 means it explains almost none.
2. Mean Absolute Error (MAE):
• MAE measures the average magnitude of errors in the predictions, without considering
their direction.
3. Mean Squared Error (MSE):
• MSE calculates the average squared differences between the actual and predicted
values.
• It penalizes larger errors more than smaller ones, making it sensitive to outliers.
4. Root Mean Squared Error (RMSE):
• RMSE is the square root of MSE. It provides an error metric in the same units as the
target variable.
5. Residual Analysis:
• Plotting the residuals (actual minus predicted values) helps check whether the errors are
randomly scattered; visible patterns suggest the model is missing structure in the data.
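The metrics above can be computed directly from the actual and predicted values. The short sketch below does so with NumPy on made-up numbers, including the residuals used in residual analysis.

```python
# Computing regression metrics by hand (assumes NumPy).
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.6])

residuals = y_true - y_pred                      # used in residual analysis
mae  = np.mean(np.abs(residuals))                # Mean Absolute Error
mse  = np.mean(residuals ** 2)                   # Mean Squared Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                         # R-squared
print(mae, mse, rmse, r2)
```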
Regularization
Regularization adds a penalty for model complexity to the loss function, which helps with two
related problems:
1. Overfitting: In machine learning, if a model is too complex (e.g., with many features), it
might memorize the training data, leading to overfitting. This means that while the model
performs well on the training set, its performance on the test set (or any new data) is
poor.
2. Bias-Variance Tradeoff: Regularization helps to manage the tradeoff between bias
(error due to overly simple models) and variance (error due to overly complex models).
By penalizing complexity, regularization helps to find a balance that leads to better
generalization.
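One way to see the tradeoff is to compare a simple and a very flexible model on the same data: the flexible one fits the training set almost perfectly but generalizes worse. The polynomial-degree comparison below is an illustrative sketch assuming scikit-learn; the synthetic data is invented for the demo.

```python
# Overfitting sketch: low- vs high-degree polynomial regression
# (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 0.5, size=60)   # quadratic + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (2, 15):                    # reasonable vs overly complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          "train MSE:", mean_squared_error(y_tr, model.predict(X_tr)),
          "test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```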
Types of Regularization:
1. L1 Regularization (Lasso)
• Concept: In L1 regularization, the penalty term added to the loss function is the sum of
the absolute values of the model parameters (weights). This results in a sparsity effect,
where some of the model coefficients become exactly zero. This can be interpreted as
automatic feature selection, as it tends to eliminate irrelevant features from the model.
$\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{i} |w_i|$
Where:
o $w_i$ are the model weights and $\lambda$ controls the strength of the penalty.
• Effect: L1 regularization can lead to sparse models where some features are effectively
ignored (with weights set to zero). It is useful when you suspect that many features are
irrelevant.
• Use Cases: Lasso (Least Absolute Shrinkage and Selection Operator) regression is
commonly used when feature selection is important, and you want to automatically
reduce the number of variables.
2. L2 Regularization (Ridge)
• Concept: In L2 regularization, the penalty term added to the loss function is the sum of
the squares of the model parameters. This shrinks the coefficients, but unlike L1, it does
not set them exactly to zero. L2 regularization leads to a smoother model where all
features are still included, but their influence is reduced.
$\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{i} w_i^2$
Where:
o $w_i$ are the model weights and $\lambda$ controls the strength of the penalty.
• Use Cases: Ridge regression is used when you believe that most features should
contribute to the prediction but with reduced magnitude. It’s often applied when there is
multicollinearity (correlation among features) in the data.
3. Elastic Net
• Concept: Elastic Net combines the L1 and L2 penalties, adding both the sum of absolute
values and the sum of squares of the weights to the loss function.
$\text{Loss} = \text{Loss}_{\text{original}} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$
Where:
o $\lambda_1$ and $\lambda_2$ control the strength of the L1 and L2 penalties respectively.
• Effect: Elastic Net is useful when there are many correlated features. It will both shrink
the coefficients and perform feature selection, while addressing situations where L1 or
L2 alone might not work effectively.
• Use Cases: Elastic Net is commonly used when you have a large number of features
and some of them are highly correlated.
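The difference between the three penalties is easiest to see by fitting Ridge, Lasso and Elastic Net on the same data and comparing the learned coefficients. The sketch below does this with scikit-learn on a synthetic dataset in which only two features truly matter (an assumption made just for the demo).

```python
# Comparing L2 (Ridge), L1 (Lasso) and Elastic Net coefficients
# (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([5.0, -3.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 relevant features
y = X @ true_w + rng.normal(0, 0.5, size=200)

for model in (Ridge(alpha=1.0),
              Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso / Elastic Net drive irrelevant coefficients to (near) zero;
    # Ridge only shrinks them.
    print(type(model).__name__, np.round(model.coef_, 2))
```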
4. Dropout
• Concept: During each training step, dropout randomly "drops" (sets to zero) a fraction of
the neurons, so the network cannot rely too heavily on any single neuron.
• Effect: Dropout helps prevent overfitting by forcing the network to generalize better. It
also prevents neurons from "co-adapting," where two or more neurons learn to work
together too specifically for the training set.
• Use Cases: Dropout is widely used in deep learning models, especially in convolutional
and fully connected layers of neural networks.
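The mechanism itself is simple: during training, each activation is kept with probability 1 - p and zeroed otherwise, then rescaled so its expected value stays the same ("inverted dropout"). The NumPy sketch below illustrates this; deep-learning frameworks provide a Dropout layer that handles it automatically.

```python
# Inverted dropout applied to a layer's activations (assumes NumPy).
import numpy as np

def dropout(activations, p, training=True, rng=np.random.default_rng(0)):
    """Randomly zero a fraction p of activations during training."""
    if not training or p == 0.0:
        return activations                      # no dropout at inference time
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    # Scale kept units by 1/keep_prob so the expected activation is unchanged.
    return activations * mask / keep_prob

a = np.ones((2, 8))                  # pretend hidden-layer activations
print(dropout(a, p=0.5))             # roughly half the units are zeroed
```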
5. Early Stopping
• Concept: Training is stopped as soon as performance on a validation set stops improving,
even if the training loss is still decreasing.
• Use Cases: Early stopping is particularly useful in training deep neural networks where
overfitting is a common problem.
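A typical implementation tracks the validation loss each epoch and stops when it has not improved for a fixed number of epochs ("patience"). The loop below is a generic sketch; `train_one_epoch` and `validation_loss` are hypothetical placeholders for whatever model is being trained.

```python
# Early-stopping sketch with a patience counter.
# `train_one_epoch` and `validation_loss` are hypothetical placeholders.
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                  # update model on training data
        val_loss = validation_loss(model)       # measure held-out performance
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0      # improvement: reset counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}")
                break
    return model
```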
When to use each technique:
• L1 Regularization is useful when you expect many features to be irrelevant and want to
eliminate them completely.
• L2 Regularization is useful when you expect most features to be relevant but want to
reduce their impact.
• Elastic Net is useful when you have many correlated features and want to benefit from
both L1 and L2 penalties.
• Early Stopping is typically used in iterative models and can be a very effective way to
avoid overfitting without adjusting the model's architecture.
Logistic Regression
Logistic regression is a supervised learning algorithm used for classification tasks. It predicts
the probability that a given input belongs to a specific category (class). Despite its name, logistic
regression is used for classification problems rather than regression problems.
Key Concepts:
1. Binary Output: The target variable is categorical, typically one of two classes (e.g., 0 or 1).
2. Logit Function: Logistic regression models the relationship between the dependent
variable (target) and independent variables (features) using the logit function.
3. Sigmoid Function: The output of logistic regression is obtained by applying the sigmoid
function to a linear combination of the input features:
$\hat{y} = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$
The sigmoid function ensures the output is in the range [0,1], representing a probability.
4. Cost Function: Logistic regression uses the log-loss (or cross-entropy) as its cost
function to optimize the parameters w and b:
$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\left(1 - \hat{y}^{(i)}\right) \right]$
Where $m$ is the number of samples, $y^{(i)}$ is the true label, and $\hat{y}^{(i)}$ is the predicted
probability.
5. Prediction: A threshold (e.g., 0.5) is applied to the output probabilities to assign a class
label: if $\hat{y} \ge 0.5$ predict class 1, otherwise predict class 0.
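The sigmoid and the log-loss above can be written out directly. The NumPy snippet below evaluates both for a small set of made-up labels and scores; it is only a sketch of the math, not a full training routine.

```python
# Sigmoid and log-loss (cross-entropy) for logistic regression (assumes NumPy).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

z = np.array([-2.0, 0.0, 3.0])        # linear scores w.x + b
y_prob = sigmoid(z)                   # probabilities in [0, 1]
y_true = np.array([0, 0, 1])
print(y_prob, log_loss(y_true, y_prob))
```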
Solved Example:
Problem:
Predict whether a student passes an exam based on their study hours. We have the following
dataset:
Study Hours | Passed (1 = Yes, 0 = No)
1 | 0
2 | 0
3 | 0
4 | 1
5 | 1
Solution:
1. Model: $\hat{y} = \sigma(w x + b)$, where $x$ is the number of study hours.
2. Cost Function: The log-loss defined above, computed over the five training samples.
3. Optimization (Gradient Descent): Using the cost function, optimize w and b
using gradient descent.
This example demonstrates how logistic regression models probabilities and makes predictions
based on the decision boundary defined by $\hat{y} = 0.5$.
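For the study-hours example, the same steps can be run with scikit-learn. The sketch below fits a logistic regression to the five data points from the table and predicts the probability of passing for a new student; the library choice is an assumption, the notes only describe the math.

```python
# Logistic regression on the study-hours example (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])   # study hours
y = np.array([0, 0, 0, 1, 1])             # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(X, y)

hours = np.array([[3.5]])
print(model.predict_proba(hours))          # [P(fail), P(pass)]
print(model.predict(hours))                # class label using the 0.5 threshold
```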
Decision Tree
A decision tree is a supervised learning algorithm used for both classification and regression
tasks. It works by splitting the data into subsets based on the value of input features. The
structure resembles a tree, where:
• Internal nodes test a feature, branches correspond to the outcomes of the test, and leaf
nodes hold the final prediction.
Key Concepts:
1. Splitting: At each node, the algorithm chooses the best feature and value to split the
data to achieve the highest information gain or lowest impurity (see the sketch after this list).
o Gini Impurity:
$Gini = 1 - \sum_{i=1}^{c} p_i^2$
o Where $p_i$ is the probability of class $i$ in a subset, and $c$ is the number of
classes.
o Information Gain:
$IG = \text{Impurity(parent)} - \sum_{\text{children}} \frac{n}{N} \, \text{Impurity(child)}$
o Where $n$ is the number of samples in a child node, and $N$ is the total
number of samples in the parent.
4. Prediction:
o For classification: The majority class of the samples in a leaf node is the prediction.
o For regression: The average of the target values in a leaf node is the prediction.
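The impurity measures above can be computed directly. The sketch below evaluates the Gini impurity and gain of a split by hand, then fits a small scikit-learn decision tree on the same toy labels; the data is invented for illustration.

```python
# Gini impurity and a weighted split score, plus a small sklearn tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])   # a perfect split
n = len(parent)
weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
gain = gini(parent) - weighted_child
print("parent Gini:", gini(parent), "information gain:", gain)   # gain = 0.5 here

# Fitting an actual tree on 1-D toy data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = parent
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(tree.predict([[2.5], [5.5]]))
```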