
Machine Learning
Unit-2

Regularization, bias, and variance

Overfitting and underfitting are two common problems encountered in machine learning when building predictive models. They both relate to the ability of a model to generalize well to unseen data.

1. Overfitting:
• Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations in the data rather than the underlying patterns. As a result, an overfitted model performs very well on the training data but poorly on new, unseen data.
• Characteristics of an overfitted model include excessively complex decision boundaries or relationships between features and target variables.
• Overfitting often happens when a model is too flexible or has too many parameters relative to the amount of training data available.
• Common remedies for overfitting include using simpler models, reducing the complexity of the model (e.g., by decreasing the number of features or parameters), or applying regularization techniques.

2. Underfitting:
• Underfitting occurs when a model is too simplistic to capture the underlying structure of the data. In other words, the model fails to learn the patterns in the training data and performs poorly on both the training data and new, unseen data.
• Characteristics of an underfitted model include high training error and high test error, indicating that the model is not capturing the underlying relationships present in the data.
• Underfitting often happens when a model is too simple or lacks the capacity to represent the complexity of the data adequately.
• Common remedies for underfitting include using more complex models, increasing the number of features, or enhancing the model's capacity to capture the underlying patterns in the data.

To summarize, overfitting and underfitting represent two extremes in model performance, with overfitting occurring when a model is too complex and learns noise in the data, and underfitting occurring when a model is too simplistic and fails to capture the underlying patterns. Achieving an appropriate balance between model complexity and generalization is crucial for building effective machine learning models. Regularization techniques, cross-validation, and monitoring performance on validation or test datasets are essential strategies for mitigating overfitting and underfitting.

Regularization, bias, and variance are fundamental concepts in machine learning that are closely related to each other and play a crucial role in model performance and generalization.

1. Regularization:
• Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the model's objective function. This penalty discourages overly complex models by penalizing large parameter values.
• The purpose of regularization is to find a balance between fitting the training data well and avoiding overfitting.
• Common regularization techniques include L1 regularization (Lasso), L2 regularization (Ridge), and elastic net regularization, which combines the L1 and L2 penalties.
• Regularization helps to improve a model's generalization performance by discouraging overly complex models that may perform well on the training data but poorly on new, unseen data.

2. Bias:
• Bias refers to the error introduced by approximating a real-world problem with a simplified model. It represents the difference between the average prediction of the model and the true value being predicted.
• High bias models are overly simplistic and may fail to capture the underlying patterns in the data. These models often underfit the training data and perform poorly on both the training and test datasets.
• Examples of high bias models include linear regression with few features or low polynomial degrees.

3. Variance:
• Variance refers to the amount by which the model's predictions would change if it were trained on a different dataset. It measures the sensitivity of the model to fluctuations in the training data.
• High variance models are overly complex and tend to fit the training data too closely, capturing noise and random fluctuations in the data. These models often overfit the training data and perform well on the training dataset but poorly on new, unseen data.
• Examples of high variance models include decision trees with deep branches or high polynomial degrees in polynomial regression.

Understanding the trade-off between bias and variance is essential in machine learning model selection and training. High bias models may benefit from increased model complexity or feature engineering to capture more complex patterns in the data, while high variance models may benefit from regularization techniques to reduce complexity and improve generalization.

In summary, regularization helps to control model complexity and prevent overfitting, bias refers to the error introduced by model simplification, and variance refers to the model's sensitivity to fluctuations in the training data. Achieving an appropriate balance between bias and variance is crucial for building machine learning models that generalize well to new, unseen data.
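As a rough illustration of how the L1 and L2 penalties are applied in practice, here is a minimal sketch using scikit-learn. The synthetic data, the alpha values, and the train/test split are all assumptions made for the example, not part of the notes above.

# Sketch: comparing an unregularized linear model with Ridge (L2) and Lasso (L1)
# on noisy synthetic data with many irrelevant features. All values are arbitrary.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 20 features, only the first 3 are useful
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 0.5]
y = X @ true_coef + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    # R^2 on held-out data: the regularized models usually generalize better
    # when many features are irrelevant.
    print(name, round(model.score(X_test, y_test), 3))

Here alpha controls the strength of the penalty: larger values shrink the coefficients more aggressively, and for Lasso they drive some coefficients exactly to zero.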
Naive Bayes

Naive Bayes is a popular and simple probabilistic classification algorithm based on Bayes' theorem with the "naive" assumption of feature independence. Despite its simplicity, Naive Bayes often performs surprisingly well in many real-world classification tasks and is widely used in text classification, spam filtering, and other applications. Here's an explanation of how Naive Bayes works:

1. Bayes' Theorem: Bayes' theorem is a fundamental theorem in probability theory that describes the probability of a hypothesis given the evidence:
P(H|E) = P(E|H) × P(H) / P(E)
Where:
• P(H|E) is the posterior probability of hypothesis H given evidence E.
• P(E|H) is the likelihood of observing evidence E given hypothesis H.
• P(H) is the prior probability of hypothesis H.
• P(E) is the probability of observing evidence E.

2. Naive Assumption of Feature Independence: Naive Bayes assumes that all features are conditionally independent given the class label. In other words, the presence or absence of a particular feature is assumed to be unrelated to the presence or absence of any other feature, given the class label.

3. Classification: Given a set of features x1, x2, ..., xn and a class label Ck, Naive Bayes calculates the posterior probability of each class given the features using Bayes' theorem:
P(Ck | x1, x2, ..., xn) ∝ P(Ck) × ∏ P(xi | Ck), where the product runs over i = 1, ..., n
Where:
• P(Ck) is the prior probability of class Ck.
• P(xi | Ck) is the likelihood of observing feature xi given class Ck.

4. Model Training: A Naive Bayes classifier is trained by estimating the prior probabilities P(Ck) and the class-conditional probabilities P(xi | Ck) from the training data.

5. Types of Naive Bayes: There are different types of Naive Bayes classifiers depending on the distribution of the features:
• Gaussian Naive Bayes: Assumes that numerical features follow a Gaussian (normal) distribution.
• Multinomial Naive Bayes: Suitable for features that represent counts or frequencies, often used in text classification.
• Bernoulli Naive Bayes: Appropriate for binary features, where features represent the presence or absence of certain characteristics.

6. Classification Decision: Once the posterior probabilities for each class are calculated, Naive Bayes assigns the class with the highest posterior probability as the predicted class for the given instance.

7. Advantages:
• Simple and easy to implement.
• Fast training and prediction.
• Works well with high-dimensional data.
• Performs well even with small training datasets.

8. Limitations:
• The assumption of feature independence may not hold true in real-world datasets.
• May suffer from the "zero-frequency" problem when encountering unseen features during prediction.
• Sensitivity to irrelevant features.

Despite its simplifying assumptions, Naive Bayes often performs surprisingly well in practice and serves as a baseline model for many classification tasks in machine learning.
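A minimal sketch of this workflow, using scikit-learn's Gaussian Naive Bayes on the built-in Iris dataset; the dataset choice and the train/test split are assumptions made purely for illustration.

# Sketch: Gaussian Naive Bayes with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)            # estimates P(Ck) and per-feature Gaussians P(xi | Ck)

print(model.predict(X_test[:5]))       # class with the highest posterior for each instance
print(model.score(X_test, y_test))     # accuracy on held-out data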
Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. The primary objective of SVM is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates data points belonging to different classes.

Here's a step-by-step explanation of how SVM works:

1. Data Preparation: SVM works well with both linear and non-linear data. For linear data, it's essential to scale the features to ensure that they are all in the same range. For non-linear data, you might need to use a kernel trick (e.g., polynomial kernel, radial basis function kernel) to transform the data into a higher-dimensional space where it can be linearly separated.

2. Hyperplane: A hyperplane in an N-dimensional space is an (N-1)-dimensional plane that separates the data points belonging to different classes. In a binary classification problem, the hyperplane is a line in 2D space, a plane in 3D space, and a hyperplane in higher-dimensional space.

3. Maximizing Margin: SVM aims to find the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the closest data points (called support vectors) from each class. Maximizing the margin helps improve the generalization of the model.

4. Support Vectors: Support vectors are the data points closest to the hyperplane. These points are crucial for determining the hyperplane's position and orientation. Only the support vectors contribute to defining the hyperplane, while other data points can be ignored.

5. Kernel Trick: In cases where the data is not linearly separable, SVM uses a kernel trick to map the data into a higher-dimensional space where it can be separated linearly. Commonly used kernels include the polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel.

6. Classification: To classify a new data point, SVM checks which side of the hyperplane the point lies on. If it's on the positive side, the point belongs to one class; if it's on the negative side, it belongs to the other class.

Here's an example of SVM in action:

Suppose you have a dataset of cats and dogs, where each data point has two features: weight and height. The goal is to classify new animals as either cats or dogs based on their weight and height.

• Data Preparation: Scale the features (weight and height) to ensure they are in the same range.

• Hyperplane: In this case, the hyperplane is a line in 2D space that separates cats from dogs.

• Maximizing Margin: SVM finds the line that maximizes the margin between cats and dogs.

• Support Vectors: The data points closest to the hyperplane are the support vectors. These points determine the hyperplane's position and orientation.

• Kernel Trick: If the data is not linearly separable, SVM can use a kernel trick (e.g., polynomial kernel, RBF kernel) to transform the data into a higher-dimensional space where it can be separated linearly.

• Classification: To classify a new animal, SVM checks which side of the line the animal lies on. If it's on the positive side, the animal is classified as a cat; if it's on the negative side, it's classified as a dog.
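A minimal sketch of this cats-vs-dogs setup using scikit-learn's SVC follows; the weight/height values and labels are invented for illustration, and a linear kernel is assumed since the toy data is separable by a straight line.

# Sketch: linear SVM on toy weight (kg) / height (cm) data; all values invented.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X = np.array([[4.0, 25.0], [5.0, 28.0], [3.5, 24.0],      # cats
              [20.0, 55.0], [25.0, 60.0], [18.0, 50.0]])  # dogs
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

# Data preparation: scale features so weight and height are on the same range,
# then fit a linear SVM that maximizes the margin between the two classes.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)

print(model.predict([[22.0, 58.0]]))               # which side of the hyperplane?
print(model.named_steps["svc"].support_vectors_)   # the (scaled) support vectors

Switching the kernel to "rbf" or "poly" applies the kernel trick described in step 5 when the classes are not linearly separable.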
A typical figure illustrating the concept of SVM shows the hyperplane as a dashed line separating the two classes (blue and red circles). The support vectors are the data points closest to the hyperplane (the filled circles on the dashed line), and the margin is the distance between the hyperplane and the support vectors. SVM aims to maximize this margin while correctly classifying the data points.

Kernel methods

Kernel methods are a class of algorithms used in machine learning for various tasks, including classification, regression, and unsupervised learning. They are particularly popular in the context of Support Vector Machines (SVMs) for classification tasks. The kernel method allows SVMs to implicitly operate in a high-dimensional feature space without explicitly computing the transformation, thereby avoiding the computational burden associated with high-dimensional data.

Here's a step-by-step explanation of how the kernel method works in SVM:

1. Linear Separability: SVM is a binary classification algorithm that aims to find a hyperplane that separates data points belonging to different classes. In a linearly separable case, the hyperplane is a line in 2D space, a plane in 3D space, or a hyperplane in higher-dimensional space.

2. Kernel Function: A kernel function is a mathematical function that computes the inner product of two vectors in a high-dimensional space. Commonly used kernel functions include:
• Linear Kernel: k(x, y) = x^T y
• Polynomial Kernel: k(x, y) = (x^T y + c)^d
• Radial Basis Function (RBF) Kernel: k(x, y) = exp(-γ||x - y||^2)

3. Inner Product: The kernel function essentially computes the inner product of two vectors in the high-dimensional space. The inner product is a measure of similarity between two vectors.

4. Kernel Trick: Instead of explicitly transforming the data into a high-dimensional space and computing the inner product, the kernel trick allows us to compute the inner product directly in the original input space using the kernel function.

5. Kernel Matrix: To use the kernel trick, we need to compute the kernel matrix, which is a matrix where each element (i, j) is the result of applying the kernel function to the ith and jth data points. The kernel matrix can be used to compute the inner products between all pairs of data points.

6. SVM in High-Dimensional Space: In the high-dimensional space, the SVM algorithm finds the hyperplane that separates the data points belonging to different classes. The hyperplane is defined by a set of weights (coefficients) and a bias term, just like in the original input space.

7. Classification: To classify a new data point, the SVM algorithm computes the inner product between the new data point and each support vector (the data points closest to the hyperplane) in the high-dimensional space. The sign of the weighted sum of these inner products, plus the bias term, determines the class of the new data point.

The kernel method is particularly useful when dealing with non-linear data, as it allows SVM to find a hyperplane that separates the data points in the original input space, even when they are not linearly separable. This makes SVM a powerful algorithm for a wide range of classification tasks.
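As a small sketch of steps 2 and 5, the snippet below computes an RBF kernel matrix with NumPy; the data points and the value of gamma are arbitrary choices for illustration (a library such as scikit-learn performs an equivalent computation internally when an RBF kernel is selected).

# Sketch: RBF kernel matrix K where K[i, j] = exp(-gamma * ||x_i - x_j||^2).
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])   # four 2-D data points (arbitrary values)
gamma = 0.5

# Pairwise squared Euclidean distances between all rows of X.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)   # kernel matrix: entry (i, j) = k(x_i, x_j)

print(K.round(3))
# Each diagonal entry is 1 (a point is maximally similar to itself); nearby
# points get relatively large values, while the distant point (row 3) has
# values near 0 against the others.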
Decision Trees

Decision Trees are a popular and intuitive model used for both classification and regression tasks in machine learning. They are easy to understand and interpret, making them a valuable tool for many applications.

1. Decision Tree Structure: A decision tree is a tree-like structure where each internal node represents a "decision" based on the value of an input feature, and each leaf node represents the outcome (class label or value) of the decision tree. The topmost node in the tree is called the "root," and the branches represent the decision rules.

2. Decision Tree Learning: The process of constructing a decision tree involves recursively partitioning the input space into smaller regions based on the values of the input features. The goal is to create "pure" regions where all data points belong to the same class or have the same value.

3. Splitting Criteria: At each node, the decision tree algorithm selects a feature and a threshold value to split the data into two child nodes. The splitting criteria aim to maximize the "purity" of the resulting nodes, typically measured using metrics like Gini impurity or information gain.

4. Stopping Criteria: The decision tree algorithm continues to grow the tree until a stopping criterion is met, such as reaching a maximum depth, a minimum number of samples per node, or no further improvement in purity.

5. Predictions: To make predictions, the decision tree algorithm follows the decision rules from the root node to the leaf node that corresponds to the input data's feature values. The output at the leaf node is the predicted class label or value.

6. Example: Suppose we have a dataset of housing prices with features like square footage, number of bedrooms, and location. The goal is to predict the price of a house based on these features. A decision tree might have a root node that splits the data based on the square footage (e.g., if square footage > 2000, go left; otherwise, go right). The left child node might further split the data based on the number of bedrooms, and so on, until we reach leaf nodes with predicted prices (a code sketch of this example appears at the end of this section).

7. Advantages: Decision trees are easy to understand and interpret, can handle both numerical and categorical data, and can capture non-linear relationships between features and target variables.

8. Disadvantages: Decision trees are prone to overfitting, especially with complex data, and can be sensitive to small variations in the data. They also tend to create biased trees when the class distribution is imbalanced.

9. Ensemble Methods: To address the limitations of individual decision trees, ensemble methods like Random Forest and Gradient Boosting are often used. These methods combine multiple decision trees to improve predictive performance and robustness.

Decision trees are a versatile and powerful tool in machine learning, and understanding their structure and learning process can help in building effective models for various tasks.
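A minimal sketch of the housing-price example from step 6, fitted as a regression tree with scikit-learn; the feature values, prices, and max_depth setting are invented purely for illustration.

# Sketch: regression tree on toy housing data [square_footage, bedrooms].
# All numbers are made up; location is omitted to keep the sketch small.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1500, 2], [1800, 3], [2100, 3],
              [2400, 4], [3000, 4], [3500, 5]])
y = np.array([200_000, 240_000, 300_000, 340_000, 420_000, 500_000])

# Limiting max_depth is a simple stopping criterion that helps avoid overfitting.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned decision rules (root split, child splits, leaf predictions).
print(export_text(tree, feature_names=["square_footage", "bedrooms"]))
print(tree.predict([[2200, 3]]))   # follow the rules from root to a leaf

For regression trees, splits are chosen to minimize the error (by default, the mean squared error) within the resulting child nodes, the regression analogue of the Gini impurity and information gain criteria mentioned in step 3.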
