Machine Learning Unit - 1
Unit - 1
Shilpa Das
Machine Learning – Expectations vs
Reality
So, to avoid such hilarious mistakes,
a firm grip on the topic is a must;
that's where we begin learning
the subject in detail...
Machine Learning Process
Machine Learning Types
• Supervised
• Unsupervised
• Semi - Supervised
• Reinforcement
Data Mining Process
Machine Learning
• Machine learning is the science of getting computers to act, learn, and improve from experience
without being explicitly programmed.
• Supervised Learning – It can apply what has been learned in the past to new data using labeled
examples to predict future events. Starting from the analysis of a known training dataset, the
learning algorithm produces an inferred function to make predictions about the output values.
• The system is able to provide targets for any new input after sufficient training. The learning
algorithm can also compare its output with the correct, intended output and find errors in order to
modify the model accordingly.
• Ex – Learning to ride a bike with an instructor sitting behind, giving instructions. (Expected output & data
are given to find a method or program to achieve the optimal solution), e.g., Nearest Neighbor, Naive Bayes,
Decision Trees, Linear Regression, Support Vector Machines (SVM), Neural Networks, etc.
- Classification – An object’s category prediction
- Regression – Prediction of a specific point on a numeric axis, e.g., Forecasting
Types of Machine Learning Methods
• Unsupervised Learning – It is used when the information used to train is neither classified
nor labeled. Unsupervised learning studies how systems can infer a function to describe
a hidden structure from unlabeled data.
• The system doesn’t figure out the right output, but it explores the data and can draw inferences
from datasets to describe hidden structures from unlabeled data.
• Ex – Learning to ride a bike without any instructor, only by self-experience. (No expected output,
but data is given to find a method or program to achieve the optimal solution), e.g., K-means
clustering, Association Rules, etc.
• Semi-Supervised Learning – It falls between supervised and unsupervised learning, using a small
amount of labeled data together with a large amount of unlabeled data for training. The systems that
use this method are able to considerably improve learning accuracy. Usually, semi-supervised
learning is chosen when the acquired labeled data requires skilled and relevant resources in order to
train it/learn from it.
• Ex – Learning to ride a bike with a few instructions from an instructor at a distance. (Some
expected output & data are given to find a method or program to achieve the optimal
solution), e.g., Transductive SVM, Speech Analysis, Web Content Classification, Protein
Sequence Classification, Generative Models, Multi-view & Graph-Based Algorithms, etc.
Types of Machine Learning Methods
• Reinforcement Learning – It is a learning method that interacts with its environment by
producing actions and discovers errors or rewards. Trial and error search and delayed reward are
the most relevant characteristics of reinforcement learning.
• This method allows machines and software agents to automatically determine the ideal behavior
within a specific context in order to maximize its performance. Simple reward feedback is required
for the agent to learn which action is best; this is known as the reinforcement signal.
• Ex – Learning to ride a bike with appreciation or criticism from the instructor based on performance.
(Data along with rewards & punishments is given to find a method or program to achieve the optimal
solution), e.g., Two-Armed & K-Armed Bandit Problems, Deep Adversarial Networks, etc.
- Positive – Rewards/Appreciation
- Negative – Punishments/Criticism
Issues in Machine Learning
• 1. Understanding Which Processes Need Automation - It's becoming increasingly difficult to evaluate which
problems you’re seeking to solve. The easiest processes to automate are the ones that are done manually
every day with no variable output. Complicated processes require further inspection before automation.
While Machine Learning can definitely help automate some processes, not all automation problems need
Machine Learning.
• 2. Lack of Quality Data - While enhancing algorithms often consumes most of the time of developers in AI,
data quality is essential for the algorithms to function as intended. Noisy, dirty, and incomplete data are the
enemies of ideal Machine Learning; without good data, results suffer. The solution is to take the time to
evaluate and scope data with meticulous data governance, data integration, and data exploration until clean
data is found; only then can the ML process start. Not all data will be relevant and valuable, and bad data
leads to bad results. If the data is not well understood, ML results can also be misleading: initial testing may
suggest you are right about everything, but once launched, the model becomes disastrous. When creating
products, data scientists should run tests using unforeseen variables, including smart attackers, so that they
know about every possible outcome. Also, the ethical issues surrounding machine learning involve not so much
the algorithms themselves as the way the data is used. Collecting data without users' knowledge or consent
illustrates many of the problems associated with the collection and use of user data.
Issues in Machine Learning
• 3. Inadequate Infrastructure - Machine Learning requires vast amounts of data churning
capabilities. Legacy/normal systems often can't handle the workload and buckle under pressure.
The infrastructure should be checked to see whether it can handle the ML process; if it can't, you
should look to upgrade, complete with hardware acceleration and flexible storage.
• 4. Implementation - Organizations often have analytics engines working with them by the time
they choose to upgrade to Machine Learning. Integrating newer Machine Learning methodologies
into existing methodologies is a complicated task. Maintaining proper interpretation and
documentation goes a long way to easing implementation. Partnering with an implementation
partner can make the implementation of services like anomaly detection, predictive analysis, and
ensemble modeling much easier.
• 5. Lack of Skilled Resources - Deep analytics and Machine Learning in their current forms are still
new technologies. Thus, there is a shortage of skilled employees available to manage and develop
analytical content for Machine Learning. Data scientists often need a combination of domain
experience as well as in-depth knowledge of science, technology, and mathematics.
Issues in Machine Learning
• 6. Making the Wrong Assumptions - ML algorithms running over fully automated systems have to be able to deal
with missing data points. One popular approach to this issue is using the mean value as a replacement for the
missing value. This provides reasonable assumptions about the data, including data that is missing at random.
Whether they're being used in automated systems or not, ML algorithms automatically assume that the data is
random and representative. However, having truly random data in a company is not common. The best way to
deal with this issue is to make sure that your data does not come with gaping holes and can support a substantial
amount of assumptions.
• 7. Getting Bad Predictions to Come Together With Biases - There have been several instances of racial and other
biases making it into machine learning programs unintentionally. One algorithm identified black people as
gorillas, and another altered the facial features of people of color to make them appear more “European” while
claiming to beautify them. First, you need to impose additional constraints over an algorithm other than
accuracy alone. Second, the smarter the algorithm becomes, the more difficulty you'll have controlling it. When
you want to fit complex models to a small amount of data, you can always do so. Doing so will then allow your
complex model to hit every data point, including the random fluctuations. Depending on the amount of data and
noise, you can fit a complex model that matches these requirements. Marketers should always keep these items
in mind when dealing with data sets. Make sure that your data is as free of inherent bias as possible and guard
against overfitting resulting from noise in the data set. You can deal with this concern immediately during the
evaluation stage of an ML project while you're looking at the variations between training and test data.
Issues in Machine Learning
• 8. Having Algorithms Become Obsolete as Soon as Data Grows - ML algorithms will always require much data when
being trained. Often, these ML algorithms will be trained over a particular data set and then used to predict future
data, a process which you can’t easily anticipate. The previously “accurate” model over a data set may no longer be
as accurate as it once was when the set of data changes. For a system that changes slowly, the accuracy may still not
be compromised; however, if the system changes rapidly, the ML algorithm will have a lesser accuracy rate given that
the past data no longer applies.
• 9. ML as Black boxes - It’s impossible to see how ML algorithms really work. It may be impossible to know why a
machine learning algorithm made a decision. Some people want to know why machine learning models make certain
decisions. Why a user was served a certain ad? Why was a contract interpreted in a certain way? Why did the car
move in the way that it did? There’s an underlying belief that people should be able to explain why machine learning
algorithms and other software took certain actions. That’s a fine goal in theory, but it sets the bar far higher for
software than the one we set for ourselves. That’s because humans are not interpretable either.
Human decisions are impacted by factors they are simply not aware of.
• 10. Expectations Exceed Reality - No matter how much you’re able to accomplish with machine learning, you’ll
probably fall short of somebody’s sci-fi inspired ideas about what should be possible. These expectations are
relatively new. We know from experience how quickly expectations around artificial intelligence have accelerated.
On one hand, it’s easier than ever to talk about deploying solutions inside a company. Executives are generally
receptive. On the other hand, some people’s expectations of what machine learning can accomplish in practice can
far exceed what is possible or even reasonable.
When Do We Need Machine Learning?
When do we need machine learning rather than directly program our computers to carry out the task at
hand? Two aspects of a given problem may call for the use of programs that learn and improve on the basis of
their “experience": the problem’s complexity and the need for adaptivity.
• Tasks beyond Human Capabilities: A wide family of tasks that benefit from machine learning
techniques is related to the analysis of very large and complex data sets: astronomical data, turning medical
archives into medical knowledge, weather prediction, and analysis of genomic data, Web search engines, and
electronic commerce. With more and more available digitally recorded data, it becomes obvious that there are
treasures of meaningful information buried in data archives that are way too large and too complex for
humans to make sense of. Learning to detect meaningful patterns in large and complex data sets is a
promising domain in which the combination of programs that learn with the almost unlimited memory
capacity and ever increasing processing speed of computers opens up new horizons.
What is data?
- Domain Set - An arbitrary set, X This is the set of objects that we may wish to label. For example, in a papaya learning
problem mentioned before, the domain set will be the set of all papayas. Usually, these domain points will be
represented by a vector of features (like the papaya’s colour and softness). We also refer to domain points as instances
and to X as instance space.
- Label Set - For our current discussion, we will restrict the label set to be a two-element set, usually {0, 1} or {-1, +1}.
Let Y denote our set of possible labels. For our papayas example, let Y be {0,1}, where 1 represents being tasty and 0
stands for being not-tasty.
Formal Model of Statistical Learning Framework
- Training data - S = ((x1, y1), . . . , (xm, ym)) is a finite sequence of pairs in X × Y; that is, a sequence of labelled domain points.
This is the input that the learner has access to (like a set of papayas that have been tasted and their colour, softness, and
tastiness). Such labeled examples are often called training examples. We sometimes also refer to S as a training set.
• The learner’s output - The learner is requested to output a prediction rule, h : X-> Y. This function is also called a
predictor, a hypothesis, or a classifier. The predictor can be used to predict the label of new domain points. In our
papayas example, it is a rule that our learner will employ to predict whether future papayas he examines in the
farmers’ market are going to be tasty or not. We use the notation A(S) to denote the hypothesis that a learning
algorithm, A, returns upon receiving the training sequence S.
• A simple data-generation model – To generate a training data, first, we assume that the instances (the papayas we
encounter) are generated by some probability distribution (in this case, representing the environment). Let us
denote that probability distribution over X by D. It is important to note that we do not assume that the learner
knows anything about this distribution. For the type of learning tasks we discuss, this could be any arbitrary
probability distribution. As to the labels, in the current discussion we assume that there is some “correct" labelling
function, f: X -> Y, and that yi = f(xi) for all i. The labelling function is unknown to the learner. In fact, this is just what
the learner is trying to figure out. In summary, each pair in the training data S is generated by first sampling a point
xi according to D and then labelling it by f.
Formal Model of Statistical Learning Framework
• Measures of success - We define the error of a classifier to be the probability that it does
not predict the correct label on a random data point generated by the aforementioned
underlying distribution. That is, the error of h is the probability to draw a random instance
x, according to the distribution D, such that h(x) does not equal f(x).
• A note about the information available to the learner - The learner is blind to the
underlying distribution D over the world and to the labelling function f. In our papayas
example, we have just arrived in a new island and we have no clue as to how papayas are
distributed and how to predict their tastiness. The only way the learner can interact with
the environment is through observing the training set.
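Using the notation above (the distribution D over the domain and the true labelling function f), this error measure can be written compactly as

L_{D,f}(h) = P_{x∼D}[ h(x) ≠ f(x) ],

that is, the probability, over instances x drawn according to D, that the prediction h(x) differs from the true label f(x).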
Machine Learning Techniques
• 1. Regression – Regression algorithms are mostly used to make predictions on
numbers, i.e., when the output is a real or continuous value.
• 3. Clustering – Clustering is a Machine Learning technique that involves grouping data points
into specific clusters. If we have some objects or data points, then we can apply the clustering
algorithm(s) to analyze and group them as per their properties and features.
• Clustering methods:
• Density-based methods: In this method, clusters are considered dense regions depending on
their similarity and difference from the lower dense region.
• Hierarchical methods: The clusters formed in this method are the tree-like structures. This
method forms trees or clusters from the previous cluster. There are two types of hierarchical
methods: Agglomerative (Bottom-up approach) and Divisive (Top-down approach).
• Partitioning methods: These methods partition the objects into k clusters, and each partition forms
a single cluster (a minimal k-means sketch follows this list).
• Grid based methods: In this method, data are combined into a number of cells that form a grid-
like structure.
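As an illustration of the partitioning approach referenced above, a minimal k-means sketch using scikit-learn; the 2-D points and the choice of k = 2 are made up for demonstration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points (made-up values for illustration)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Partition the points into k = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Cluster centres:", kmeans.cluster_centers_)

Each point is assigned to the nearest of the two learned cluster centres.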
Machine Learning Techniques
• 4. Association Analysis – Association rule mining finds interesting associations and
relationships among large sets of data items. This rule shows how frequently an itemset
occurs in a transaction. A typical example is Market Basket Analysis.
• Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the
items that people buy together frequently.
• Precision is often measured by the standard deviation of a set of values, while bias is
measured by taking the difference between the mean of the set of values and the
known value of the quantity being measured. Bias can only be determined for objects
whose measured quantity is known by means external to the current situation.
• Accuracy: The closeness of measurements to the true value of the quantity being
measured.
• Accuracy depends on precision and bias, but since it is a general concept there is no
specific formula for accuracy in terms of these two quantities.
• One important aspect of accuracy is the use of significant digits. The goal is to use only
as many digits to represent the result of a measurement or calculation as are justified
by the precision of the data.
Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone connection and “snow” on a
television screen
Dimensionality Reduction
• There are a variety of benefits to dimensionality reduction. A key benefit is that many
data mining algorithms work better if the dimensionality (the number of attributes in the
data) is lower. This is partly because dimensionality reduction can eliminate irrelevant
features and reduce noise, and partly because of the curse of dimensionality.
• Even if dimensionality reduction doesn't reduce the data to two or three dimensions,
data is often visualized by looking at pairs or triplets of attributes, and the number of
such combinations is greatly reduced. Finally, the amount of time and memory required
by the data mining algorithm is reduced with a reduction in dimensionality.
• The term dimensionality reduction is often reserved for those techniques that reduce
the dimensionality of a data set by creating new attributes that are a combination of the
old attributes. The reduction of dimensionality by selecting new attributes that are a
subset of the old is known as feature subset selection or feature selection.
Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques
• Principal Component Analysis (a minimal sketch follows this list)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
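A minimal PCA sketch with scikit-learn, as referenced above; the random 4-dimensional data is made up for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 4-dimensional data (100 made-up samples)
X = np.random.RandomState(0).rand(100, 4)

# Create new attributes: project onto the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)              # (100, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)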
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates of
one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Curse of Dimensionality
• When dimensionality increases, data
becomes increasingly sparse in the
space that it occupies
Key Issues:
Data Sparsity: As the number of features increases, the data becomes sparse, making it harder to find
meaningful patterns.
Distance Metrics: In high-dimensional spaces, the distance between data points becomes less
informative, making clustering or classification less effective.
Overfitting: Higher dimensions lead to overfitting since the model may fit the noise in the data.
Mitigation:
Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) to reduce the
number of features while retaining the important information.
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection
• Techniques:
• Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm
• Filter approaches:
• Features are selected before the data mining algorithm is run (a minimal sketch follows this list)
• Wrapper approaches:
• Use the data mining algorithm as a black box to find best subset of attributes
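A minimal sketch of the filter approach referenced above, scoring features before any model is trained; the synthetic data set and the choice of k = 3 are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only 3 of them informative (made-up setup)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter approach: score each feature (ANOVA F-test) and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)   # (200, 3)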
Terminologies
• Hypothesis class - The hypothesis class from which we believe C is drawn,
namely, the set of rectangles.
• Most Specific Hypothesis – It is denoted by S; it is the tightest rectangle we can draw that includes all the
positive examples and none of the negative examples.
• Most General Hypothesis – It is denoted by G; it is the largest rectangle we can draw
that includes all the positive examples and none of the negative examples.
• Version Space - Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent
with the training set, and such h make up the version space. Given another training set, S, G, the
version space, the parameters, and thus the learned hypothesis, h, can be different.
• Doubt – In some applications, a wrong decision may be very costly and in such a case, we can say
that any instance that falls in between S and G is a case of doubt, which we cannot label with
certainty due to lack of data. In such a case, the system rejects the instance and defers the decision to
a human expert.
Terminologies
• Reject - When no hypothesis, or more than one hypothesis, takes the value 1 for an input, we cannot choose a
class; this is a case of doubt, and the classifier rejects such instances.
• VC Dimension - The maximum number of points that can be shattered by H is called the Vapnik-
Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity of H.
• Error Probability – The probability that the learned hypothesis misclassifies a randomly drawn instance; we
require it to be at most 𝜀 (the hypothesis should be approximately correct).
• Confidence Probability – The probability that the learned hypothesis meets this error bound, denoted by 1 - 𝛿
(the hypothesis should be probably correct). We can have arbitrarily large confidence by decreasing δ and
arbitrarily small error by decreasing 𝜀.
• Margin - Given X, we can find S, or G, or any h from the version space and use it as our hypothesis, h. It
seems intuitive to choose h halfway between S and G; this is to increase the margin, which is the
distance between the boundary and the instances closest to it. For our error function to have a minimum at
h with the maximum margin, we should use an error (loss) function which not only checks whether an
instance is on the correct side of the boundary but also how far away it is. That is, instead of h(x) that
returns 0/1, we need to have a hypothesis that returns a value which carries a measure of the distance to
the boundary and we need to have a loss function which uses it, different from 1(·) that checks for equality.
Terminologies
• Noise - Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to learn and zero
error may be infeasible with a simple hypothesis class.
• Ill-Posed Problem - After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an
example of an ill-posed problem, where the data by itself is not sufficient to find a unique solution.
• Inductive Bias - The set of assumptions we make to have learning possible is called the inductive bias of
the learning algorithm.
• Model Selection - Learning is not possible without inductive bias, and now the question is how to choose the right bias.
This is called model selection, which is choosing between possible H. In answering this question, we should remember
that the aim of machine learning is rarely to replicate the training data but to predict new cases. That is, we would
like to be able to generate the right output for an input instance outside the training set, one for which the correct output
is not given in the training set.
• Decision Boundary - A decision boundary or decision surface is a hypersurface that partitions the underlying vector space
into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as
belonging to one class and all those on the other side as belonging to the other class. If the decision surface is a
hyperplane, then the classification problem is linear, and the classes are linearly separable.
Terminologies
• Underfitting - If H is less complex than the function, we have underfitting, for example, when trying to fit a
line to data sampled from a third-order polynomial. In such a case, as we increase the complexity, the training error
decreases. But if we have H that is too complex, the data is not enough to constrain it and we may end up with a bad
hypothesis, h ∈ H , for example, when fitting two rectangles to data sampled from one rectangle.
• Overfitting - If there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise
in the data and may make a bad fit. This is called overfitting. In such a case, having more training data helps but only up
to a certain point.
• Appropriate Fitting - It is the point just before the error on the test dataset starts to increase where the model has good
skill on both the training dataset and the unseen test dataset.
• Triple Trade – Off - Given a training set and H , we can find h ∈ H that has the minimum training error but if H is not
chosen well, no matter which h ∈ H we pick, we will not have good generalization. In all learning algorithms that are
trained from example data, there is a trade-off between three factors:
• - The complexity of the hypothesis we fit to data, namely, the capacity of the hypothesis class
• - The amount of training data, and
• - The generalization error on new examples.
Terminologies
Dimensions of a Supervised Machine Learning Algorithm
• Let us say we have a sample X = {x^t, r^t}, t = 1, . . . , N, which is independent and identically distributed (iid),
where x^t is the input and r^t is the associated desired output (0/1 for two-class learning). The aim
is to build a good and useful approximation to r^t using the model g(x^t|θ), so there are three
decisions to make in doing so:
• Model we use in learning, denoted as g(x^t|θ), where g(·) is the model defining the
hypothesis class H, x is the input, and θ are the parameters instantiating one hypothesis h
∈ H.
• Loss function, L(·), to compute the difference between the desired output, r^t, and our
approximation to it, g(x^t|θ), given the current value of the parameters, θ. The approximation
error, or loss, is the sum of losses over the individual instances, E(θ|X) = Σt L(r^t, g(x^t|θ)). In class learning where
outputs are 0/1, L(·) checks for equality or not; in regression, because the output is a
numeric value, we have ordering information for distance and one possibility is to use the
square of the difference.
Dimensions of a Supervised Machine Learning Algorithm
• Optimization procedure to find θ* that minimizes the total error, θ* = arg minθ E(θ|X), where the argmin
function returns the argument that minimizes the error. In regression, we can solve analytically for the optimum. With more
complex models and error functions, we may need to use more complex optimization methods, for
example, gradient-based methods, simulated annealing, or genetic algorithms.
• For this to work well, the following conditions should be satisfied:
• The hypothesis class of g(·) should be large enough, that is, have enough capacity, to include the
unknown function that generated the data that is represented in r^t in a noisy form.
• Second, there should be enough training data to allow us to pinpoint the correct (or a good enough)
hypothesis
from the hypothesis class.
• Third, we should have a good optimization method that finds the correct hypothesis given the training
data.
Different machine learning algorithms differ either in the models they assume (their hypothesis
class/inductive bias), the loss measures they employ, or the optimization procedure they use.
Model Selection
•Definition: Model selection is the process of choosing the best machine learning model for
a given problem. It involves selecting an algorithm and tuning its parameters to achieve the
best performance.
•Approaches:
•Hyperparameter Tuning: Use grid search or random search to find the best set of
hyperparameters for the chosen model.
Error Analysis:
Confusion Matrix: Used for classification tasks to show the performance of a classification model.
Residual Analysis: For regression models, analyze the residuals (the difference between the
actual and predicted values) to understand the model's performance.
Model Validation
Hold-out Validation: Split data into training and test sets. Train the model on the
training set and validate it on the test set.
Cross-validation: Split the data into multiple folds and train and validate the model on
each fold in turn (a minimal sketch of both approaches follows this list).
Validation Set: Set aside a portion of the training data to validate the model during
training.
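A minimal sketch of both approaches referenced above, using scikit-learn; the synthetic regression data, split ratio, and fold count are assumptions for illustration:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Made-up synthetic regression data
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Hold-out validation: train on one split, evaluate on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R^2:", model.score(X_test, y_test))

# 5-fold cross-validation: train and validate on each fold in turn
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Cross-validation R^2 scores:", scores)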
Parametric vs. Non-Parametric Models
Parametric Models:
Definition: Parametric models make assumptions about the underlying data distribution and have a fixed
number of parameters.
Advantages:
Easier to interpret.
Disadvantages:
May not perform well if the assumption about the data distribution is wrong.
Non-Parametric Models
Definition: Non-parametric models do not make strong assumptions about the data distribution and do
not have a fixed number of parameters. They are more flexible in terms of data modeling.
Advantages:
More flexible; can capture complex, non-linear relationships without assuming a specific functional form.
Disadvantages:
Typically require more data and computation, and are more prone to overfitting.
Assumption: There is a linear relationship between hours studied (X) and exam score (Y).
Y = 5X + 50
For every additional hour studied, the score increases by 5 points. Even with minimal data (e.g., 10 students), this model
assumes the same relationship applies universally.
Scenario: Predicting a student's exam score based on hours studied, but without assuming a specific form of the
relationship.
Stores all the training data (e.g., scores of 50 students with their study hours). To predict a new student's score, it looks at the
scores of the closest students (e.g., those who studied a similar number of hours).
Example:
A new student studied for 4 hours.
The model looks at the 3 students who studied closest to 4 hours and predicts the score based on their average.
Bivariate and multivariate models
• The model has two variables, the independent or explanatory variable, x, and the
dependent variable y, the variable whose variation is to be explained.
• The relationship between x and y is a linear or straight line relationship.
• Two parameters to estimate – the slope of the line β1 and the y-intercept β0 (where
the line crosses the vertical axis).
• ε is the unexplained, random, or error component. Much more on this later.
Regression line
• The regression model is y = β0 + β1x + ε
• Data about x and y are obtained from a sample.
• From the sample of values of x and y, estimates b0 of β0 and b1 of β1 are
obtained using the least squares or another method.
• The resulting estimate of the model is
ŷ = b0 + b1x
• The symbol ŷ is termed “y hat” and refers to the predicted values of the
dependent variable y that are associated with values of x, given the
linear model.
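A minimal numeric sketch of obtaining b0 and b1 by least squares; the x and y values are made up for illustration:

import numpy as np

# Made-up sample of x and y values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # the predicted values "y hat"
print("b0 =", b0, "b1 =", b1)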
Linear Regression
• Regression Models - These are explanatory forecasting models, which attempt to develop logical
relationships that not only provide useful forecasts, but also identify the causes and factors leading
to the forecast value. Regression models assume that a linear relationship exists between a
variable designated as the dependent (unknown) variable and one or more other independent
(known) variables.
• Simple Regression - This model, also called the least squares method, assumes that the dependent
variable, D, depends on a single independent variable, I.
Linear Regression
• Regression can be used to calculate the best fit to a straight line on a normal graph. The regression
problem is to identify a line, D = a + bI, such that the sum of the squares of the deviations between
actual and estimated values (the vertical line segments in the figure) is minimized.
It models the relationship between the dependent variable Y and one or more independent variables X
by fitting a linear equation:
Types
Simple Linear Regression:
y = wx + b
Multiple Linear Regression:
y = w1x1 + w2x2 + . . . + wnxn + b
The "best fit" is achieved by optimizing the weights w and bias b to minimize the cost function.
Example:
Suppose you have data of house sizes X and their prices Y:
A linear regression model predicts house prices based on size by finding a straight line that best matches the
data.
Cost Function
Purpose:
The
cost
functi
on
meas
ures
the
error
or
differ
ence
betw
een
the
predi
cted
value
Why Use MSE?
Squaring the errors ensures they are positive, avoiding cancellations between positive and negative errors.
Larger errors are penalized more heavily, encouraging the model to minimize big mistakes.
Optimization techniques like Gradient Descent are used for this purpose.
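A minimal sketch of computing the MSE cost for a handful of made-up actual and predicted values:

import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# Mean Squared Error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print("MSE:", mse)   # 0.4375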
Example Workflow of Linear Regression
• Update w and b iteratively to reduce the cost function (using Gradient Descent).
β0 = constant/Intercept,
β1 = Slope/Coefficient,
Xi = Independent variable.
The goal of the linear regression algorithm is to get the best values for B 0 and B 1 to find the best-fit line.
The best-fit line is a line that has the least error which means the error between predicted values and
actual values should be minimum.
But how does linear regression find out which is the best-fit line?
We calculate MSE using the simple linear equation: Yi = β0 + β1Xi
Using the MSE function, we update the values of β0 and β1 such that the MSE value settles
at the minima.
These parameters can be determined using the gradient descent method such that the value for
the cost function is minimum.
Gradient Descent is one of the optimization algorithms that optimize the cost function (objective function)
to reach the optimal minimal solution.
To find the optimum solution, we need to reduce the cost function (MSE) for all data points. This is
done by updating the values of the slope coefficient (B1) and the constant coefficient (B0) iteratively
until we get an optimal solution for the linear function.
A regression model uses the gradient descent algorithm to update the coefficients of the line by
reducing the cost function: it starts from randomly selected coefficient values and then iteratively
updates them to reach the minimum of the cost function.
In the gradient descent algorithm, the size of the steps you take is set by the learning rate, and this
decides how fast the algorithm converges to the minima.
Steps to Solve Linear Regression
Example Problem
Relationships
• Economic theory specifies the type and structure of relationships that
are to be expected.
• Historical studies.
• Studies conducted by other researchers – different samples and related
issues.
• Speculation about possible relationships.
• Correlation and causation.
• Theoretical reasons for estimation of regression relationships; empirical
relationships need to have theoretical explanation.
Uses of regression
• Amount of change in a dependent variable that results from changes in the
independent variable(s) – can be used to estimate elasticities, returns on
investment in human capital, etc.
• Attempt to determine causes of phenomena.
• Prediction and forecasting of sales, economic growth, etc.
• Support or negate theoretical model.
• Modify and improve theoretical models and explanations of phenomena.
Partial Correlation
• A partial correlation measures the relationship between two variables (X and Y) while
eliminating the influence of a third variable (Z).
• Partial correlations are used to reveal the real, underlying relationship between two
variables when researchers suspect that the apparent relation may be distorted by a third
variable.
Partial Correlation (cont.)
• For example, there probably is no underlying relationship between weight and
mathematics skill for elementary school children.
• However, both of these variables are positively related to age: Older children weigh more
and, because they have spent more years in school, have higher mathematics skills.
Partial Correlation (cont.)
• As a result, weight and mathematics skill will show a positive correlation for a sample of
children that includes several different ages.
• A partial correlation between weight and mathematics skill, holding age constant, would
eliminate the influence of age and show the true correlation which is near zero.
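A minimal sketch of this computation using the standard partial-correlation formula
r_XY.Z = (r_XY - r_XZ * r_YZ) / sqrt((1 - r_XZ^2) * (1 - r_YZ^2)); the weight, skill, and age values are made up for illustration:

import numpy as np

# Made-up data: weight (X), mathematics skill (Y), age in years (Z)
weight = np.array([20.0, 26.0, 24.0, 33.0, 36.0, 45.0])
skill  = np.array([12.0, 11.0, 16.0, 18.0, 24.0, 27.0])
age    = np.array([5.0, 6.0, 6.0, 7.0, 8.0, 9.0])

def partial_corr(x, y, z):
    # Correlation of x and y with the influence of z removed
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print("Partial correlation of weight and skill, controlling for age:",
      partial_corr(weight, skill, age))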
Income hrs/week Income hrs/week
8000 38 8000 35
6400 50 18000 37.5
2500 15 5400 37
3000 30 15000 35
6000 50 3500 30
5000 38 24000 45
8000 50 1000 4
4000 20 8000 37.5
11000 45 2100 25
25000 50 8000 46
4000 20 4000 30
8800 35 1000 200
5000 30 2000 200
7000 43 4800 30
[Scatter plot: "Summer Income as a Function of Hours Worked". Vertical axis: Income (0 to 30000); horizontal axis: Hours per Week (0 to 60).]
Outliers
• The cost function quantifies the difference between the model’s predictions and the actual
outcomes for each data point in the dataset. The goal in linear regression is to find the line
(or hyperplane, in the case of multiple variables) that minimizes this difference, thereby
creating the most accurate model.
• The cost function plays a key role in determining the best-fitting line by evaluating the
difference between the actual and predicted values. When the model finds parameters that
minimize the cost function, it achieves optimal performance.
Cost Function in Linear Regression
Size (sq.ft.) True Price (in 1000$)
500 50
1000 100
1500 150
2000 200
The linear regression equation is ŷ = w · x, where ŷ is the predicted house price, x is the size of the house
(input feature), and w is the weight (slope of the line).
Our task is to find the best weight (w) such that the model’s predictions are as close as possible to the true prices. The
difference between model’s predictions and true prices is called error, and the goal is to minimize the total error.
Let’s now understand the workings and how the cost function plays a significant role in a Linear Regression model, starting
with an initial guess: w = 0.04
Using this value of w, let’s predict the prices for the given house sizes:
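A minimal sketch of this step: predictions for the table above with the initial guess w = 0.04 and the resulting MSE cost.

import numpy as np

sizes = np.array([500, 1000, 1500, 2000])        # sq. ft.
true_prices = np.array([50, 100, 150, 200])      # in 1000$

w = 0.04                                         # initial guess for the weight
predicted = w * sizes                            # y_hat = w * x  ->  [20, 40, 60, 80]

mse = np.mean((true_prices - predicted) ** 2)    # cost for this w  ->  6750.0
print("Predictions:", predicted)
print("MSE for w = 0.04:", mse)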
Multiple Linear Regression (MLR)
Introduction
Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one
dependent variable (target) and two or more independent variables (predictors).
The goal is to find the linear equation that best predicts the dependent variable using the independent
variables.
Assumptions of MLR
Linearity: The relationship between dependent and independent variables is linear.
Homoscedasticity: The variance of residuals is constant across all levels of the independent variables.
No Multicollinearity: Independent variables are not highly correlated with each other.
Applications
Predicting housing prices based on features like size, location, and number of rooms.
• The main aim of gradient descent is to find the best parameters of a model which gives
the highest accuracy on training as well as testing datasets. In gradient descent, The
gradient is a vector that points in the direction of the steepest increase of the function at
a specific point.
• Moving in the opposite direction of the gradient allows the algorithm to gradually
descend towards lower values of the function, and eventually reaching to the minimum
of the function.
Gradient Descent in Linear Regression
• Steps Required in Gradient Descent Algorithm
• Step 1 Initialize the parameters of the model (e.g., with random values or zeros).
• Step 2 Compute the gradient of the cost function with respect to each parameter. It involves
making partial differentiation of the cost function with respect to the parameters.
• Step 3 Update the parameters of the model by taking steps in the opposite direction of the
gradient. Here we choose a hyperparameter called the learning rate, denoted by alpha, which
decides the step size of each update.
• Step 4 Repeat steps 2 and 3 iteratively to get the best parameters for the defined model (a
minimal sketch follows these steps).
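A minimal sketch of these steps for simple linear regression; the data, learning rate, and iteration count are made-up assumptions:

import numpy as np

# Made-up training data; the underlying relation is y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Step 1: initialize the parameters
w, b = 0.0, 0.0
alpha = 0.01                       # learning rate (step size)

for _ in range(5000):              # Step 4: repeat steps 2 and 3
    y_pred = w * x + b
    # Step 2: gradients of the MSE cost with respect to w and b
    dw = -2 * np.mean((y - y_pred) * x)
    db = -2 * np.mean(y - y_pred)
    # Step 3: move in the opposite direction of the gradient
    w -= alpha * dw
    b -= alpha * db

print("w =", w, "b =", b)          # should approach w = 2, b = 1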
Types of Gradient Descent
1. Batch Gradient Descent
• Update: Once per pass over the full dataset
• Stable but slower on large datasets.
• SGD:
• Update: Once per sample
• Fast but noisy.
• Mini-Batch GD:
• Update: Once per batch
• Balanced and efficient.
Applications
• Batch GD: Small datasets (e.g., linear regression)
Normalization:
Useful for algorithms like k-Nearest Neighbors (k-NN) and Neural Networks that are sensitive to the scale
of data.
Standardization:
Essential for models assuming Gaussian distribution or sensitive to feature magnitude, such as Support
Vector Machines (SVM), Principal Component Analysis (PCA), and Linear Regression.
Normalization Example:
Normalized Data: [0, 0.25, 0.5, 0.75, 1.0]
Standardization Example:
Standardized Data: [−1.26, −0.63, 0, 0.63, 1.26]
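A minimal sketch reproducing both transforms with scikit-learn, assuming the scaled values above come from the equally spaced data [1, 2, 3, 4, 5] (an assumption for illustration). Note that StandardScaler divides by the population standard deviation, so its output (about ±1.41 at the ends) differs slightly from the values above, which use the sample standard deviation.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumed example data (hypothetical; chosen to match the scaled values above)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

# Normalization (min-max scaling) to the [0, 1] range
print(MinMaxScaler().fit_transform(data).ravel())    # [0.   0.25 0.5  0.75 1.  ]

# Standardization (zero mean, unit variance; population std)
print(StandardScaler().fit_transform(data).ravel())  # approx. [-1.41 -0.71 0. 0.71 1.41]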
# Example: fitting and plotting a best-fit line with scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Training data: X must be 2-D for scikit-learn
X = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 17, 19, 21]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 9, 10, 12, 15, 18, 20, 25])

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Plot the data points and the best-fit line
plt.scatter(X, y, color="blue", label="Data Points")
plt.plot(X, model.predict(X), color="red", label="Best Fit Line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression: Best Fit Line")
plt.legend()
plt.show()
Overfitting and underfitting are common problems in machine learning that affect
model performance.
They arise from the trade-off between a model's complexity and its ability to generalize
to new, unseen data.
Underfitting
A model is underfitting when it is too simple to capture the underlying patterns in the data.
Symptoms: High error on both the training data and new, unseen data.
Causes: The model is too simple or is trained on too few informative features.
Overfitting
A model is overfitting when it is too complex and learns the noise in the training data rather than the underlying pattern.
Symptoms: Very low error on the training data but high error on new, unseen data.
Example: A decision tree with very deep splits that memorizes the training data.
Example predicting house prices:
Underfitting: Using only the size of the house as a feature while ignoring other factors like
location, number of bedrooms, etc.
Overfitting: Including irrelevant details like the color of the walls, which does not
generalize to new data.
What is Regularization?
Regularization is a technique to prevent overfitting in machine learning models.
It adds a penalty term to the loss function to shrink the magnitude of coefficients, ensuring the model
generalizes well to new data.
Lasso Regularization
Penalty Term: Adds the L1 norm (sum of absolute values of coefficients) to the loss function.
Key Features:
Shrinks coefficients.
Feature Selection: Some coefficients are reduced to exactly zero, effectively removing
irrelevant features.
When to Use:
When you suspect some features are irrelevant and want to automatically select the most
important ones.
Ridge Regularization
Penalty Term: Adds the L2 norm (sum of squares of coefficients) to the loss function.
Key Features:
Retains all features and is better for handling multicollinearity (correlated features).
When to Use:
When all features are expected to contribute to the target variable, even if minimally.
Lasso Regularization: Feature Selection in Predictive
Modeling
Example:
A healthcare company is building a predictive model to estimate a patient's risk of developing diabetes based
on features like age, BMI, cholesterol levels, glucose levels, and hundreds of genetic markers.
Problem:
Many features (like certain genetic markers) might have little to no effect on the outcome.
Including all features increases model complexity and could lead to overfitting.
Solution:
The model keeps only the most important features, making it simpler and more interpretable for clinicians.
Ridge Regularization: Improving Predictions in
Multicollinear Data
Example:
An e-commerce company wants to predict product demand based on factors like price, discount offered,
advertising spend, and competitor pricing.
Problem:
Some features (e.g., "price" and "discount offered") are highly correlated, leading to multicollinearity.
Standard linear regression struggles in such scenarios, resulting in unstable coefficient estimates.
Solution:
Use Ridge regularization, which penalizes large coefficients without dropping any features.
Ridge shrinks the correlated feature coefficients, stabilizing predictions while retaining all information.
Dataset: Predict house prices based on features like size, location, and age.
Lasso: Automatically drops less relevant features like age if size and location explain most of
the variance.
Ridge: Keeps all features and distributes importance across them, even if age has minimal
impact.
Lasso = Simpler models with automatic feature selection (use when some features may be irrelevant).
Ridge = Stable, robust predictions (use when all features are useful).
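A minimal sketch contrasting the two penalties on made-up synthetic data (the feature counts, alpha values, and data set are assumptions for illustration): Lasso tends to drive the coefficients of uninformative features to exactly zero, while Ridge shrinks all coefficients but keeps them.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of them actually informative (made-up setup)
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Lasso (L1 penalty): performs feature selection by zeroing coefficients
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))

# Ridge (L2 penalty): shrinks coefficients but retains all features
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))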
Bias, Variance, and Tradeoff
Bias is when the model is too simple and cannot capture the pattern in the data properly.
This leads to underfitting.
Example:
You always guess the same average age for everyone (e.g., 25 years).
This is a high-bias guess because you ignore specific details and make overly simplistic
assumptions.
Variance is when the model is too complex and tries to learn even the smallest details
(noise) in the data. This leads to overfitting.
This is high variance because you are too sensitive to small details that don’t really
matter.
Bias-Variance Tradeoff (Simple Analogy of Exam)
High Bias (Underfitting):
You only study one chapter for the exam and ignore everything else.
Result: You don’t perform well because you miss most questions.
High Variance (Overfitting):
You try to memorize every single word from the textbook, notes, and examples, even unnecessary details.
Result: You get confused in the exam because you overanalyzed and didn’t focus on what’s important.
Balanced Preparation (Good Fit):
Result: You perform well on the exam because you have the right balance of preparation.
Goal: Find the right balance (Bias-Variance Tradeoff) so the model can generalize well to new data.
Non – Linear Regression
• Nonlinear regression is a regression in which the dependent or criterion variables are modeled as a
non-linear function of model parameters and one or more independent variables.
• The reason that these models are called nonlinear regression is because the relationships between
the dependent and independent parameters are not linear.
• Model Expression is the model used; the first task is to create a model. The selection of the model is
based on theory and past experience in the field. For example, in demographics, for the study of
population growth, the logistic nonlinear regression growth model is useful.
Non – Linear Regression
• Parameters are those which are estimated. For example, in logistic nonlinear regression growth model, the parameters
are b1, b2 and b3.
• Segmented models are required when a model consists of multiple different equations over different ranges; the
equations are then specified as terms in multiple conditional logic statements.
• Loss function is a function which is required to be minimized. This is done by nonlinear regression.
• Assumptions – The data must be quantitative; categorical variables must be coded as binary variables. The value
of the coefficients can be correctly interpreted only if the correct model has been fitted, therefore it is important to
identify useful models.
• A good choice of starting points can lead to a desirable output, a poor choice will make the output misleading. Types –
Cubic, Quadratic, Exponential, Logarithmic, Sigmoidal/Logistic, etc.
• If the correlation is higher than 0.7, apply linear regression else non – linear regression.
Nonlinear Regression
• An iterative form of linear regression, with some modifications to the normal equations to make them
work in practice.
• General procedure:
1. Linearize the model around the current parameter values. This results in a linearized objective-
function surface.
2. Using the normal equations, calculate new parameter values that are closer to the minimum of the
linearized objective-function surface, and therefore, hopefully closer to the minimum of the nonlinear
objective-function surface.
3. Repeat from step 1.
Nonlinear Regression
Given n data points (x1, y1), (x2, y2), . . . , (xn, yn), best fit y = f(x)
to the data, where f(x) is a nonlinear function of x.
[Figure: the data points (x1, y1), (x2, y2), . . . , (xi, yi), . . . , (xn, yn) with the fitted nonlinear curve y = f(x).]
• Likelihood-Based Classification – By estimating the prior probabilities, P(Ci), and the class
likelihoods, p(x|Ci), we then use Bayes’ rule to calculate the posterior densities. We then define
the discriminant functions in terms of the posteriors; a similar method is used in Naive Bayes’
classification.
• Bayes’ classifier - For minimum error, the Bayes’ classifier chooses the class with the highest
posterior probability; that is, we choose Ci if P(Ci|x) = maxk P(Ck|x).
Terminologies
• Prior Probability - P(C = 1) is called the prior probability that C takes the value 1, which in our
example corresponds to the probability that a customer is high risk, regardless of the x value. It is
called the prior probability because it is the knowledge we have as to the value of C before looking at
the observables x, satisfying P(C = 0) + P(C = 1) = 1.
• Class Likelihood - p(x|C) is called the class likelihood and is the conditional probability that an
event belonging to C has the associated observation value x. In our case, p(x1, x2|C = 1) is the
probability that a high-risk customer has his or her X1 = x1 and X2 = x2. It is what the data tells us
regarding the class.
• Evidence – p(x), the evidence, is the marginal probability that an observation x is seen, regardless
of whether it is a positive or negative example.
Terminologies
• Posterior Probability - Combining the prior and what the data tells us using Bayes’ rule, we
calculate the posterior probability of the concept, P(C|x), after having seen the observation, x.
• For now, we assume that we know the prior and likelihoods; later we discuss how to estimate P(C) and
p(x|C) from a given training sample. In the general case, we have K mutually exclusive and
exhaustive classes, Ci, i = 1, . . . , K; for example, in optical digit recognition, the input is a bitmap
image and there are ten classes. We have the prior probabilities satisfying P(Ci) ≥ 0 and Σi P(Ci) = 1.
• p(x|Ci) is the probability of seeing x as the input when it is known to belong to class Ci. The posterior
probability of class Ci can be calculated as
P(Ci|x) = p(x|Ci) P(Ci) / p(x) = p(x|Ci) P(Ci) / Σk p(x|Ck) P(Ck).
Terminologies
• Losses and Risks – It may be the case that decisions are not equally good or costly. A financial
institution when making a decision for a loan applicant should take into account the potential gain
and loss as well. An accepted low-risk applicant increases profit, while a rejected high-risk applicant
decreases loss. The loss for a high-risk applicant erroneously accepted may be different from the
potential gain for an erroneously rejected low-risk applicant. The situation is much more critical and
far from symmetry in other domains like medical diagnosis or earthquake prediction. Let us define
action αi as the decision to assign the input to class Ci and λik as the loss incurred for taking action αi
when the input actually belongs to Ck. Then the expected risk for taking action αi is R(αi|x) = Σk λik P(Ck|x).
• Let us define K actions αi, i = 1, . . . , K, where αi is the action of assigning x to Ci. In the special case
of the 0/1 loss, where λik = 0 if i = k and 1 if i ≠ k,
• all correct decisions have no loss and all errors are equally costly. The risk of taking action αi is then
R(αi|x) = Σk≠i P(Ck|x) = 1 − P(Ci|x).
Terminologies
• Reject - In some applications, wrong decisions—namely, misclassifications— may have very high cost, and it is
generally required that a more complex— for example, manual—decision is made if the automatic system has
low certainty of its decision. For example, if we are using an optical digit recognizer to read postal codes on
envelopes, wrongly recognizing the code causes the envelope to be sent to a wrong destination. In such a
case, we define an additional action of reject or doubt, αK+1, with αi, i = 1, . . . , K, being the usual actions of
deciding on classes Ci, i = 1, . . . , K . A possible loss function is
• where 0 < λ < 1 is the loss incurred for choosing the (K + 1)st action of reject. Then the risk of reject is
and the risk of choosing class Ci is
• The optimal decision rule is to
•
Bayes’ Rule & Naïve Bayes’ Classification
Naive Bayes’ Classification Model - Naive Bayes classifier is not linear, but if the likelihood factors
p(xi∣c) are from exponential families, the naive Bayes classifier corresponds to a linear classifier in a
particular feature space. Bayes’ rule was given as P(c|x) = P(x|c) P(c) / P(x), where
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class, i.e., class probability
P(x|c) is the likelihood which is the probability of predictor given class, conditional probability
P(x) is the prior probability of predictor.
• If the inputs are independent, this is called the naive Bayes’ classifier, because it ignores possible
dependencies, namely, correlations, among the inputs and reduces a multivariate problem to a
group of univariate problems:
Naïve Bayes’ Classification
Algorithm
• Naïve Bayes Algorithm – It is a classification technique based on Bayes’ Theorem with an
assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence of any other feature.
This classifier assumes the features (in this case we had words as input) are independent, hence,
the word naive. The algorithm is as follows –
Step 1: Convert the data set into a frequency table (it would be given already in the question).
Step 3: Find the likelihood probability of each attribute for each class.
Step 4: Put these values into the Bayes formula and calculate the posterior probability.
Step 5: See which class has the highest posterior probability; the input belongs to that class (a
minimal sketch follows these steps).
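A minimal sketch of these steps using scikit-learn's CategoricalNB on a small, made-up, integer-encoded data set (one categorical feature, "outlook", and a play/don't-play label); alpha=1.0 applies the Laplace smoothing mentioned under the cons that follow:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data, integer-encoded: outlook 0 = sunny, 1 = overcast, 2 = rainy
X = np.array([[0], [0], [1], [2], [2], [2], [1],
              [0], [0], [2], [0], [1], [1], [2]])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])   # 1 = play, 0 = don't play

# Fit the classifier; alpha=1.0 is Laplace smoothing for unseen categories
model = CategoricalNB(alpha=1.0).fit(X, y)

# Posterior probabilities and predicted class for a sunny day (outlook = 0)
print(model.predict_proba([[0]]))
print("Predicted class:", model.predict([[0]])[0])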
Pros & Cons of Naïve Bayes’ Classification Algorithm
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to
other models like logistic regression, and you need less training data.
It performs well in the case of categorical input variables compared to numerical variable(s). For
numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).
Cons:
If a categorical variable has a category (in the test data set) which was not observed in the training data set,
then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often
known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest
smoothing techniques is called Laplace estimation.
On the other side, naive Bayes is also known to be a bad estimator, so the probability outputs are not
to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is
almost impossible that we get a set of predictors which are completely independent.
Applications of Naïve Bayes’
Classification Algorithm
• Applications of Naive Bayes Algorithms –
• Real time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it could
be used for making predictions in real time.
• Multi class Prediction: This algorithm is also well known for multi class prediction feature. Here
we can predict the probability of multiple classes of target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly used in
text classification (due to better results in multi-class problems and the independence rule), have a higher
success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying
spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative
customer sentiments).
• Recommendation System: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.