Unit-2
Regularization, bias, and variance
Overfitting and underfitting are two common problems encountered in machine learning when building predictive models. They both relate to the ability of a model to generalize well to unseen data.

1. Overfitting: Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations in the data rather than the underlying patterns. As a result, an overfitted model performs very well on the training data but poorly on new, unseen data.
Characteristics of an overfitted model include excessively complex decision boundaries or relationships between features and target variables.
Overfitting often happens when a model is too flexible or has too many parameters relative to the amount of training data available.
Common remedies for overfitting include using simpler models, reducing the complexity of the model (e.g., by decreasing the number of features or parameters), or applying regularization techniques.

2. Underfitting: Underfitting occurs when a model is too simplistic to capture the underlying structure of the data. In other words, the model fails to learn the patterns in the training data and performs poorly both on the training data and new, unseen data.
Characteristics of an underfitted model include high training error and high test error, indicating that the model is not capturing the underlying relationships present in the data.
Underfitting often happens when a model is too simple or lacks the capacity to represent the complexity of the data adequately.
Common remedies for underfitting include using more complex models, increasing the number of features, or enhancing the model's capacity to capture the underlying patterns in the data.

To summarize, overfitting and underfitting represent two extremes in model performance, with overfitting occurring when a model is too complex and learns noise in the data, and underfitting occurring when a model is too simplistic and fails to capture the underlying patterns. Achieving an appropriate balance between model complexity and generalization is crucial for building effective machine learning models. Regularization techniques, cross-validation, and monitoring performance on validation or test datasets are essential strategies for mitigating overfitting and underfitting.

Regularization, bias, and variance are fundamental concepts in machine learning that are closely related to each other and play a crucial role in model performance and generalization.

1. Regularization: Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the model's objective function. This penalty discourages overly complex models by penalizing large parameter values.
The purpose of regularization is to find a balance between fitting the training data well and avoiding overfitting.
Common regularization techniques include L1 regularization (Lasso), L2 regularization (Ridge), and elastic net regularization, which combines the L1 and L2 penalties.
Regularization helps to improve a model's generalization performance by discouraging overly complex models that may perform well on the training data but poorly on new, unseen data.

2. Bias: Bias refers to the error introduced by approximating a real-world problem with a simplified model. It represents the difference between the average prediction of the model and the true value being predicted.
High bias models are overly simplistic and may fail to capture the underlying patterns in the data. These models often underfit the training data and perform poorly on both the training and test datasets.
Examples of high bias models include linear regression with few features or low polynomial degrees.

3. Variance: Variance refers to the amount by which the model's prediction would change if it were trained on a different dataset. It measures the sensitivity of the model to fluctuations in the training data.
High variance models are overly complex and tend to fit the training data too closely, capturing noise and random fluctuations in the data. These models often overfit the training data and perform well on the training dataset but poorly on new, unseen data.
Examples of high variance models include decision trees with deep branches or high polynomial degrees in polynomial regression.

Understanding the trade-off between bias and variance is essential in machine learning model selection and training. High bias models may benefit from increased model complexity or feature engineering to capture more complex patterns in the data, while high variance models may benefit from regularization techniques to reduce complexity and improve generalization.
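To make these ideas concrete, here is a minimal sketch, assuming Python with NumPy and scikit-learn, that fits an unregularized linear model, a Ridge (L2) model, and a Lasso (L1) model to the same noisy synthetic data; the dataset, the alpha values, and the variable names are illustrative assumptions rather than anything from these notes.

    # Minimal sketch: comparing an unregularized fit with Ridge (L2) and Lasso (L1).
    # The synthetic data and alpha values below are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))                 # 20 features, only 3 informative
    true_coef = np.zeros(20)
    true_coef[:3] = [3.0, -2.0, 1.5]
    y = X @ true_coef + rng.normal(scale=0.5, size=100)   # noisy targets

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, model in [
        ("Unregularized", LinearRegression()),
        ("Ridge (L2)", Ridge(alpha=1.0)),          # penalizes large squared coefficients
        ("Lasso (L1)", Lasso(alpha=0.1)),          # can drive useless coefficients to zero
    ]:
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        print(f"{name}: test MSE = {mse:.3f}")

On data like this, the regularized fits usually generalize at least as well as the unregularized one, and Lasso tends to push the coefficients of the irrelevant features to exactly zero; the exact numbers depend on the random seed.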
and a hyperplane in higher-dimensional space.

3. Maximizing Margin: SVM aims to find the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the closest data points (called support vectors) from each class. Maximizing the margin helps improve the generalization of the model.

4. Support Vectors: Support vectors are the data points closest to the hyperplane. These points are crucial for determining the hyperplane's position and orientation. Only the support vectors contribute to defining the hyperplane, while other data points can be ignored.

5. Kernel Trick: In cases where the data is not linearly separable, SVM uses a kernel trick to map the data into a higher-dimensional space where it can be separated linearly. Commonly used kernels include the polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel.

6. Classification: To classify a new data point, SVM checks which side of the hyperplane the point lies on. If it's on the positive side, the point belongs to one class; if it's on the negative side, it belongs to the other class.

Suppose you have a dataset of cats and dogs, where each data point has two features: weight and height. The goal is to classify new animals as either cats or dogs based on their weight and height.

Data Preparation: Scale the features (weight and height) to ensure they are in the same range.

Hyperplane: In this case, the hyperplane is a line in 2D space that separates cats from dogs.

Maximizing Margin: SVM finds the line that maximizes the margin between cats and dogs.

Support Vectors: The data points closest to the hyperplane are the support vectors. These points determine the hyperplane's position and orientation.

Kernel Trick: If the data is not linearly separable, SVM can use a kernel trick (e.g., polynomial kernel, RBF kernel) to transform the data into a higher-dimensional space where it can be separated linearly.

Classification: To classify a new animal, SVM checks which side of the line the animal lies on. If it's on the positive side, the animal is classified as a cat; if it's on the negative side, it's classified as a dog.
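The cats-and-dogs walk-through above can be written in a few lines of code. The sketch below is a minimal illustration, assuming Python with NumPy and scikit-learn; the weight/height values, the linear kernel, and the setting C=1.0 are made-up assumptions, not data from these notes.

    # Minimal SVM sketch for the cats-vs-dogs example.
    # The weight/height values below are made up for illustration.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Features: [weight in kg, height in cm]; labels: 0 = cat, 1 = dog
    X = np.array([[4.0, 25], [3.5, 23], [5.0, 28], [4.2, 26],        # cats
                  [20.0, 55], [25.0, 60], [18.0, 50], [30.0, 65]])   # dogs
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    # Data preparation (scaling) followed by a linear-kernel SVM that maximizes the margin
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    model.fit(X, y)

    # The support vectors (shown here in scaled feature space) define the separating line
    print("Support vectors:\n", model.named_steps["svc"].support_vectors_)

    # Classification: which side of the line does a new animal fall on?
    new_animal = np.array([[6.0, 30]])
    print("Predicted class:", model.predict(new_animal)[0])          # 0 = cat, 1 = dog

Swapping kernel="linear" for kernel="rbf" or kernel="poly" applies the kernel trick described above when the two classes cannot be separated by a straight line.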
Here's a figure to illustrate the concept of SVM:

In the figure, the hyperplane is the dashed line that separates the two classes (blue and red circles). The support vectors are the data points closest to the hyperplane (the filled circles on the dashed line). The margin is the distance between the hyperplane and the support vectors. SVM aims to maximize this margin while correctly classifying the data points.

Kernel methods

Kernel methods are a class of algorithms used in machine learning for various tasks, including classification, regression, and unsupervised learning. They are particularly popular in the context of Support Vector Machines (SVMs) for classification tasks. The kernel method allows SVMs to implicitly operate in a high-dimensional feature space without explicitly computing the transformation, thereby avoiding the computational burden associated with high-dimensional data.

Here's a step-by-step explanation of how the kernel method works in SVM:

5. Kernel Matrix: To use the kernel trick, we need to compute the kernel matrix, which is a matrix where each element (i, j) is the result of applying the kernel function to the ith and jth data points. The kernel matrix can be used to compute the inner products between all pairs of data points.

6. SVM in High-Dimensional Space: In the high-dimensional space, the SVM algorithm finds the hyperplane that separates the data points belonging to different classes. The hyperplane is defined by a set of weights (coefficients) and a bias term, just like in the original input space.

7. Classification: To classify a new data point, the SVM algorithm computes the inner product between the new data point and each support vector (data points closest to the hyperplane) in the high-dimensional space. The sign of the sum of these inner products determines the class of the new data point.

The kernel method is particularly useful when dealing with non-linear data, as it allows SVM to find a hyperplane that separates the data points in the original input space, even when they are not linearly separable. This makes SVM a powerful algorithm for a wide range of classification tasks.
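As a small illustration of the kernel matrix described in step 5, the sketch below, assuming Python with NumPy and scikit-learn, computes an RBF kernel matrix for a few 2D points and checks it against an element-by-element calculation; the sample points and the gamma value are illustrative assumptions.

    # Sketch: a kernel matrix whose entry (i, j) is k(x_i, x_j) for an RBF kernel.
    # The sample points and gamma value are illustrative assumptions.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])

    # RBF (Gaussian) kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    gamma = 0.5
    K = rbf_kernel(X, X, gamma=gamma)
    print("Kernel matrix:\n", np.round(K, 3))

    # The same matrix computed element by element, to show what each entry means
    K_manual = np.array([[np.exp(-gamma * np.sum((xi - xj) ** 2)) for xj in X] for xi in X])
    assert np.allclose(K, K_manual)

An SVM with a non-linear kernel works with exactly this kind of matrix internally, which is how it avoids mapping each data point into the high-dimensional space explicitly.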
Decision Trees are a popular and intuitive model used for both classification and regression tasks in machine learning. They are easy to understand and interpret, making them a valuable tool for many applications.

2. Decision Tree Learning: The process of constructing a decision tree involves recursively partitioning the input space into smaller regions based on the values of the input features. The goal is to create "pure" regions where all data points belong to the same class or have the same value.

3. Splitting Criteria: At each node, the decision tree algorithm selects a feature and a threshold value to split the data into two child nodes. The splitting criteria aim to maximize the "purity" of the resulting nodes, typically measured using metrics like Gini impurity or information gain.

4. Stopping Criteria: The decision tree algorithm continues to grow the tree until a stopping criterion is met, such as reaching a maximum depth, a minimum number of samples per node, or no further improvement in purity.

5. Predictions: To make predictions, the decision tree algorithm follows the decision rules from the root node to the leaf node that corresponds to the input data's feature values. The output at the leaf node is the predicted class label or value.

6. Example: Suppose we have a dataset of housing prices with features like square footage, number of bedrooms, and location. The goal is to predict the price of a house based on these features. A decision tree might have a root node that splits the data based on the square footage (e.g., if square footage > 2000, go left; otherwise, go right). The left child node might further split the data based on the number of bedrooms, and so on, until we reach leaf nodes with predicted prices (a code sketch of this example appears at the end of this section).

7. Advantages: Decision trees are easy to understand and interpret, can handle both numerical and categorical data, and can capture non-linear relationships between features and the target variable.

8. Disadvantages: Decision trees are prone to overfitting, especially with complex data, and can be sensitive to small variations in the data. They also tend to create biased trees when the class distribution is imbalanced.

Decision trees are a versatile and powerful tool in machine learning, and understanding their structure and learning process can help in building effective models for various tasks.
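To connect the housing-price example (point 6) to working code, here is a minimal sketch, assuming Python with NumPy and scikit-learn; the tiny hand-made dataset, the feature names, and the max_depth setting are illustrative assumptions.

    # Minimal decision-tree regression sketch for the housing-price example.
    # The tiny dataset and max_depth value are made up for illustration.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    # Features: [square footage, number of bedrooms]; target: price in thousands
    X = np.array([[1500, 2], [1800, 3], [2200, 3], [2600, 4], [3000, 4], [1200, 1]])
    y = np.array([200, 240, 310, 380, 450, 150])

    # Limiting the depth applies a stopping criterion and helps curb overfitting
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, y)

    # Inspect the learned splitting rules (root split, child splits, leaf values)
    print(export_text(tree, feature_names=["sqft", "bedrooms"]))

    # Predict the price of a new house by following the rules to a leaf node
    print(tree.predict([[2000, 3]]))

Capping max_depth, or requiring a minimum number of samples per node, is a simple way to apply the stopping criteria from point 4 and to limit the overfitting mentioned in point 8.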