Unit-1 ML (Reference guide for students)

The document provides a comprehensive overview of machine learning, detailing its definition, types (supervised, unsupervised, semi-supervised, and reinforcement learning), and applications in various fields such as image recognition, speech recognition, and self-driving cars. It also outlines the machine learning life cycle, which includes steps like data gathering, preparation, analysis, model training, testing, and deployment. Additionally, it distinguishes between artificial intelligence and machine learning, explaining their differences and the types of datasets used in machine learning.

Uploaded by

bifedi3655

Unit: 1

Machine Learning (Reference Guide)
Machine Learning:
Machine learning is a branch of artificial intelligence where systems learn from data,
identifying patterns and making decisions without explicit programming. Algorithms
iteratively improve their performance as they're exposed to more data. It encompasses
various techniques like supervised learning (learning from labeled data), unsupervised
learning (finding patterns in unlabeled data), and reinforcement learning (learning
from feedback).

Classification of Machine Learning:

Machine Learning
    Supervised Learning: Regression, Classification
    Unsupervised Learning: Clustering, Association
    Semi-supervised Learning
    Reinforcement Learning

Supervised Learning:
In supervised learning, the algorithm is trained on a labeled dataset, meaning the input data is
paired with corresponding target labels.

Objective: The goal is to learn a mapping from input features to the correct output labels so
that the model can make predictions on new, unseen data.

Examples:
1. Predicting house prices (regression).
2. Email spam detection (classification).
3. Image recognition (classification).
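
As a rough illustration of learning a mapping from labeled examples, the sketch below classifies a new input by copying the label of its nearest training example (a 1-nearest-neighbour rule). The feature choice (number of links and exclamation marks in an email) and all data values are hypothetical, chosen only to make the idea concrete.

```python
import math

# Hypothetical labeled training data: (feature vector, label).
# Features are (number of links, number of exclamation marks) in an email.
training_data = [
    ((8.0, 6.0), "spam"),
    ((7.0, 9.0), "spam"),
    ((1.0, 0.0), "not spam"),
    ((0.0, 1.0), "not spam"),
]

def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

# Predictions for two new, unseen inputs.
prediction_a = predict((6.0, 7.0), training_data)
prediction_b = predict((1.0, 1.0), training_data)
print(prediction_a, prediction_b)
```

The "training" here is simply storing the labeled pairs; most practical models instead fit parameters, but the input-to-label mapping is the same idea.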

Unsupervised Learning:
Unsupervised learning involves training the algorithm on an unlabeled dataset, and the model
must find patterns, relationships, or structures within the data without explicit guidance.

Objective: Discover hidden patterns or intrinsic structures in the data, such as grouping
similar data points (clustering) or reducing dimensionality.

Examples:
1. Clustering customer segments based on purchasing behavior.
2. Dimensionality reduction for data visualization.
3. Anomaly detection in network security.
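
The clustering example can be sketched with a small k-means loop on one-dimensional data. This is a minimal sketch, not a production implementation: the spending figures, the starting centroids, and the function name `k_means_1d` are assumptions made for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = sum(members) / len(members)
    return centroids, clusters

# Hypothetical monthly spending (no labels) for eight customers.
spending = [12, 15, 14, 10, 95, 102, 99, 110]
centroids, clusters = k_means_1d(spending, centroids=[0.0, 50.0])
print(centroids)
print(clusters)
```

No labels are supplied; the two customer segments emerge only from how the values group together.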

Semi-Supervised Learning:
Semi-supervised learning is a hybrid approach that combines elements of both supervised and
unsupervised learning. The algorithm is trained on a dataset that contains both labeled and
unlabeled examples.

Objective: Leverage the available labeled data for supervised learning tasks while also
benefiting from the unlabeled data to improve model generalization and performance.

Examples:
1. Text classification where only a subset of documents is labeled.
2. Image classification with a small set of labeled images and a larger set of unlabeled
images.
3. Speech recognition with limited labeled audio data.

Applications of Machine learning:

1. Image Recognition:
Application: Identifying objects, people, scenes, and activities in images and videos. Used
in facial recognition, medical imaging analysis, and self-driving cars.
Algorithm: Convolutional Neural Networks (CNNs) are highly successful for image
recognition. They slide learned filters over the pixels of an image, extracting features
and learning patterns to classify objects.
Example:
a. Unlocking your phone with Face ID (Apple) or facial recognition on Android
phones.
b. Auto-tagging friends in photos on Facebook.
c. Content moderation on social media platforms to identify inappropriate content.

2. Speech Recognition:
Application: Converting spoken language into text. Used in voice assistants, dictation
software, and automated call centers.
Algorithm: Hidden Markov Models (HMMs) are a popular choice. They analyze the
statistical properties of speech sounds to recognize words. Deep learning models are
also increasingly used for speech recognition.
Example:
a. Using voice assistants like Siri (Apple), Alexa (Amazon), or Google Assistant to
control smart devices, make calls, or dictate messages.
b. Voice search on smartphones and computers.
c. Automated customer service systems that understand spoken inquiries.

3. Traffic Prediction:
Application: Forecasting traffic congestion patterns to optimize traffic flow and
navigation.
Algorithm: Various algorithms can be used, including recurrent neural networks (RNNs)
which can analyze historical traffic data to identify patterns and predict future
conditions.
Example: Traffic navigation apps like Waze or Google Maps that suggest alternative
routes based on real-time traffic conditions.

4. Product Recommendations:
Application: Suggesting products to users based on their past purchases, browsing
history, and similar user behavior.
Algorithm: Collaborative filtering and content-based filtering are common approaches.
Collaborative filtering recommends products based on what similar users liked, while
content-based filtering recommends products similar to what users have purchased or
shown interest in.
Example:
a. Recommendation sections on e-commerce sites like Amazon or streaming services
like Netflix, suggesting products or shows you might like based on your past
purchases or viewing habits.
b. Personalized advertising that targets users based on their online activity.
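
A minimal sketch of user-based collaborative filtering, assuming a tiny hypothetical ratings table: it finds the user whose ratings are most similar (by cosine similarity over commonly rated items) and suggests items that user rated but the target user has not. The user names, items, and function names are illustrative only.

```python
import math

# Hypothetical user ratings (1-5); a missing key means "not rated yet".
ratings = {
    "alice": {"laptop": 5, "mouse": 4, "keyboard": 5},
    "bob":   {"laptop": 5, "mouse": 5, "keyboard": 4, "monitor": 5},
    "carol": {"laptop": 1, "mouse": 2, "novel": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Suggest unseen items taken from the most similar other user."""
    others = [u for u in ratings if u != user]
    most_similar = max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    return sorted(item for item in ratings[most_similar] if item not in ratings[user])

print(recommend("alice", ratings))
```

Content-based filtering would instead compare item attributes (category, brand, genre) with what the user already liked, without looking at other users.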

5. Self-Driving Cars:
Application: Enabling cars to navigate and perceive their surroundings without human
input.
Algorithm: A combination of algorithms is used, including deep learning for image
recognition, sensor data analysis, and reinforcement learning for decision making and
control.
Example: Tesla Autopilot, a driver-assistance system that uses machine learning for
features like lane centering and adaptive cruise control. (Note: Autopilot does not make
cars fully autonomous)

6. Email Spam and Malware Filtering:


Application: Identifying and filtering unwanted emails containing spam or malicious
content.
Algorithm: Naive Bayes classifiers are commonly used. They analyze email content and
compare it to known spam patterns to categorize emails.
Example: Gmail's spam filter that automatically sorts spam emails away from your
inbox.
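
To make the Naive Bayes idea concrete, here is a small sketch of a word-count classifier with Laplace (add-one) smoothing. The four training emails are invented, and real spam filters use far larger vocabularies and additional signals; this only shows the counting and scoring steps.

```python
import math
from collections import Counter

# Hypothetical labeled emails, already split into lowercase words.
emails = [
    ("win free money now".split(), "spam"),
    ("free prize claim now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project report for review".split(), "ham"),
]

def train(examples):
    """Count words per class and how often each class occurs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in examples:
        word_counts[label].update(words)
        class_counts[label] += 1
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary

def classify(words, word_counts, class_counts, vocabulary):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    scores = {}
    total_docs = sum(class_counts.values())
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(emails)
print(classify("claim free money".split(), *model))
print(classify("monday project meeting".split(), *model))
```

The "naive" part is the assumption that words occur independently given the class, which is why the per-word log-probabilities are simply added.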

7. Virtual Personal Assistant (VPA):


Application: Responding to user queries and requests using voice commands.
Algorithm: Speech recognition, natural language processing (NLP) to understand user
intent, and machine learning models to generate responses or perform actions.
Example:
a. Using voice commands to control smart home devices with Amazon Echo or
Google Home.
b. Dictating text messages or emails using your smartphone's voice assistant.

8. Online Fraud Detection:


Application: Identifying and preventing fraudulent transactions on e-commerce
platforms and financial institutions.
Algorithm: Anomaly detection algorithms can be used to analyze user behavior and
identify patterns that deviate from normal activity, potentially indicating fraud.
Example:

a. Banks and credit card companies using machine learning to identify and block
suspicious transactions in real-time.
b. Online payment platforms like PayPal that use fraud detection algorithms to
protect user accounts.
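
One simple form of anomaly detection is a z-score check: flag any value that lies far from the mean, measured in standard deviations. The sketch below assumes a hypothetical list of transaction amounts and an arbitrary threshold of 2; real fraud systems combine many features and more robust statistics.

```python
import statistics

def find_anomalies(amounts, threshold=2.0):
    """Return amounts whose z-score exceeds the threshold."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Hypothetical transaction amounts for one account.
transactions = [25, 30, 27, 22, 31, 28, 26, 950, 29, 24]
suspicious = find_anomalies(transactions)
print(suspicious)
```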

Machine learning Life cycle:


The machine learning life cycle is a cyclic process for building an efficient machine
learning project. Its main purpose is to find a solution to the problem at hand.

The machine learning life cycle involves seven major steps, which are described below.

The most important thing in the complete process is to understand the problem and
its purpose.
To solve a problem, we create a machine learning system called a "model", and this
model is created by "training" it. Training requires data, so the life cycle starts with
collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is
to identify and obtain all the data relevant to the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the
most important steps of the life cycle. The quantity and quality of the collected data will
determine the quality of the output. In general, the more good-quality data we have, the
more accurate the predictions will be.
This step includes the following tasks:

a. Identify various data sources

b. Collect data

c. Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in the further steps.

2. Data preparation:
After collecting the data, we need to prepare it for the further steps. Data preparation is a
step where we put our data into a suitable place and prepare it for use in machine
learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:

a. Data exploration:
It is used to understand the nature of the data that we have to work with. We
need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome. Here we look
for correlations, general trends, and outliers.

b. Data pre-processing:
The next step is to pre-process the data for analysis.

3. Data Wrangling:
Data wrangling is the process of cleaning raw data and converting it into a usable format.
It involves cleaning the data, selecting the variables to use, and transforming the
data into a proper format to make it more suitable for analysis in the next step. It is one of
the most important steps of the complete process. Cleaning of data is required to
address quality issues.
Not all the data we have collected is necessarily useful. In real-world applications,
collected data may have various issues, including:

a. Missing values

b. Duplicate data

c. Invalid data

d. Noise

So, we use various filtering techniques to clean the data.
It is essential to detect and remove the above issues because they can negatively affect
the quality of the outcome.
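
The cleaning steps above can be sketched in a few lines. The records, the valid age range of 0 to 120, and the choice of mean imputation for missing salaries are all assumptions for illustration; the right rules depend on the dataset.

```python
# Hypothetical raw records showing the issues listed above.
raw_records = [
    {"country": "India", "age": 38, "salary": 48000},
    {"country": "France", "age": 43, "salary": 45000},
    {"country": "France", "age": 43, "salary": 45000},   # duplicate
    {"country": "Germany", "age": 40, "salary": None},   # missing value
    {"country": "India", "age": -5, "salary": 58000},    # invalid age
]

def clean(records):
    """Drop duplicates and invalid rows, then fill missing salaries."""
    seen = set()
    valid = []
    for row in records:
        key = (row["country"], row["age"], row["salary"])
        if key in seen:
            continue              # skip exact duplicates
        seen.add(key)
        if not 0 <= row["age"] <= 120:
            continue              # skip rows with an impossible age
        valid.append(dict(row))
    known = [r["salary"] for r in valid if r["salary"] is not None]
    mean_salary = sum(known) / len(known)
    for row in valid:
        if row["salary"] is None:
            row["salary"] = mean_salary   # mean imputation
    return valid

cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```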

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:

a. Selection of analytical techniques

b. Building models

c. Review the result

The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with determining
the type of problem, where we select a machine learning technique such
as classification, regression, cluster analysis, or association. We then build the model
using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the
model.

5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training
a model is required so that it can learn the various patterns, rules, and features.

6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset to
it.
Testing the model determines its percentage accuracy, which is then compared against
the requirements of the project or problem.
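
Percentage accuracy is simply the share of test examples the model labels correctly. A minimal sketch, with made-up labels and predictions:

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical test-set labels and model predictions.
actual =    ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no"]
print(accuracy(predicted, actual))
```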

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the model
in a real-world system.
If the model prepared above produces accurate results as per our requirements
with acceptable speed, then we deploy it in the real system. But before
deploying the project, we check whether the model is improving its performance using
the available data. The deployment phase is similar to making the final report for a
project.

Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are closely related parts of computer science.
They are among the most popular technologies used for creating intelligent systems.
Although the two technologies are related and people sometimes use the terms as
synonyms, they are distinct in many cases.
On a broad level, we can differentiate AI and ML as follows:
AI is a broader concept of creating intelligent machines that can simulate human thinking
capability and behavior, whereas machine learning is an application or subset of AI that
allows machines to learn from data without being programmed explicitly.
Below are some of the main differences between AI and machine learning, along with an
overview of artificial intelligence and machine learning.

Artificial Intelligence
Artificial intelligence is a field of computer science that builds computer systems that
can mimic human intelligence. The term is composed of two words, "Artificial" and
"intelligence", which together mean "a human-made thinking power." Hence we can
define it as:
Artificial intelligence is a technology with which we can create intelligent systems that
can simulate human intelligence.
An artificial intelligence system does not need every behavior to be pre-programmed;
instead, it uses algorithms that can work with their own intelligence. It involves
machine learning algorithms such as reinforcement learning algorithms and deep
learning neural networks. AI is used in many places, such as Siri, Google's
AlphaGo, and chess-playing programs.
Based on capabilities, AI can be classified into three types:
a. Weak AI
b. General AI
c. Strong AI
Currently, we are working with weak AI, while general AI remains a research goal. The
future of AI is said to be Strong AI, which is expected to be more intelligent than humans.

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

Artificial Intelligence | Machine learning

Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without being programmed explicitly.

The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.

In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.

Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.

AI has a very wide range of scope. | Machine learning has a limited scope.

AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

AI system is concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns.

The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types, which are Supervised learning, Unsupervised learning, and Reinforcement learning.

It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced with new data.

AI completely deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.

How to get datasets for Machine Learning?
The field of ML depends heavily on datasets for training models and making
accurate predictions. Datasets play a vital role in the success of AI/ML projects and
are fundamental to becoming a skilled data scientist. In this section, we will
explore the various types of datasets used in ML and give a detailed guide on where to
find them.

What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset can
contain anything from a simple array to a database table. The table below shows an
example of a dataset:

Country   Age   Salary   Purchased
India     38    48000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
India     35    58000    Yes

A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to one record of the
dataset. The most widely supported file type for a tabular dataset is the "comma-separated
values" file, or CSV. To store tree-like data, however, a JSON file can be more suitable.
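
As a sketch of how such a table looks as CSV, the snippet below writes the example dataset as CSV text and reads it back with Python's standard `csv` module. Treating the empty field as `None` and converting "Yes"/"No" to booleans are illustrative choices, not requirements of the format.

```python
import csv
import io

# The example table written as CSV text; the empty field is the
# missing salary in the fifth data row.
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    rows.append({
        "Country": record["Country"],
        "Age": int(record["Age"]),
        # Keep a missing salary as None instead of guessing a value.
        "Salary": int(record["Salary"]) if record["Salary"] else None,
        "Purchased": record["Purchased"] == "Yes",
    })

print(len(rows), rows[4])
```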

Types of data in datasets:


I. Numerical data: Such as house price, temperature, etc.
II. Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
III. Ordinal data: Similar to categorical data, but the categories can be ranked
or compared with one another (for example, low/medium/high).

Types of datasets:
Machine learning spans different domains, each requiring specific types of
datasets. A few common types of datasets used in machine learning include:

Image Datasets:
Image datasets contain a collection of images and are normally used in computer
vision tasks such as image classification, object detection, and image segmentation.

Examples: ImageNet, CIFAR-10, MNIST

Text Datasets:
Text datasets consist of textual information, like articles, books, or social media
posts. These datasets are used in NLP tasks like sentiment analysis, text
classification, and machine translation.

Examples: Project Gutenberg dataset, IMDb movie reviews dataset

Time Series Datasets:
Time series datasets contain data points collected over time. They are
generally used in forecasting, anomaly detection, and trend analysis.
Examples: Stock market data, weather data, sensor readings.

Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets.
They contain rows representing instances or samples and columns representing features or
attributes. Tabular datasets are used for tasks like regression and classification.
The dataset given earlier in this guide is an example of a tabular dataset.

Need of Dataset
1. Well-prepared and pre-processed datasets are essential for machine learning
projects.
2. They provide the foundation for training accurate and reliable models. However,
working with large datasets can present challenges in terms of management and
processing.
3. To address these challenges, efficient data management techniques and
processing algorithms are required.

Data Pre-processing:
Data pre-processing is a fundamental stage in preparing datasets for machine learning.
It involves transforming raw data into a format suitable for model training.
Common pre-processing techniques include data cleaning to remove inconsistencies
or errors, normalization to scale data within a specific range, feature scaling to
ensure features have comparable ranges, and handling missing values.
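
Two common scaling techniques can be sketched as follows: min-max normalization maps a feature to the range 0 to 1, and standardization shifts it to zero mean and unit standard deviation. The `ages` column and the function names are hypothetical.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

ages = [30, 35, 40, 50]  # hypothetical feature column
scaled = min_max_scale(ages)
z_scores = standardize(ages)
print(scaled)
print(z_scores)
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) are usually computed on the training data only and then reused for the test data.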

During the development of an ML project, developers rely heavily on
datasets. In building ML applications, datasets are divided into two parts:

• Training dataset

• Test dataset

Training Dataset and Test Dataset:

In machine learning, datasets are ordinarily split into two parts: the training
dataset and the test dataset. The training dataset is used to train the machine
learning model, while the test dataset is used to evaluate the model's performance. This
split assesses the model's ability to generalize to unseen data. It is
essential to ensure that the datasets are representative of the problem domain and
appropriately split to avoid bias or overfitting.
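
A minimal sketch of such a split, assuming a 75/25 ratio and a fixed random seed so the result is repeatable. The function name mirrors a common convention but is defined here from scratch; it is not a library call.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

samples = list(range(20))  # stand-in for 20 dataset rows
train_rows, test_rows = train_test_split(samples)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting helps keep both parts representative when the original rows are ordered, for example by date or by class.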

Popular sources for Machine Learning datasets


Below is a list of sources of datasets that are freely available for the public to work with:

1. Kaggle Datasets:
Kaggle is one of the best sources of datasets for data scientists and machine
learning practitioners. It allows users to find, download, and publish datasets easily. It also
provides the opportunity to work with other machine learning engineers and solve
difficult data science tasks.
Kaggle provides high-quality datasets in different formats that we can easily find and
download.
The link for Kaggle datasets is https://www.kaggle.com/datasets.

2. UCI Machine Learning Repository:


The UCI Machine Learning Repository is an important resource that has been widely
used by researchers and practitioners since 1987. It contains a large
collection of datasets organized by machine learning task, such as regression, classification,
and clustering. Notable datasets in the repository include the Iris dataset,
Car Evaluation dataset, and Poker Hand dataset.
The link for the UCI Machine Learning Repository
is https://archive.ics.uci.edu/ml/index.php.

3. Datasets via AWS:


We can search, download, access, and share the datasets that are publicly available via
AWS resources. These datasets can be accessed through AWS resources but are provided
and maintained by different government organizations, researchers, businesses, or
individuals.
Anyone can analyze and build various services using shared data via AWS resources.
Sharing datasets on the cloud helps users spend more time on data analysis rather
than on data acquisition.
This source provides various types of datasets with examples and ways to use the
dataset. It also provides a search box with which we can search for the required
dataset. Anyone can add any dataset or example to the Registry of Open Data on AWS.

The link for the resource is https://registry.opendata.aws/.

4. Google's Dataset Search Engine:


Google's Dataset Search helps researchers find and access relevant datasets from
different sources across the web. It indexes datasets from domains like social sciences,
biology, and climate science. Researchers can use keywords to find datasets, filter
results based on specific criteria, and access the datasets directly from the
source.
The link for the Google dataset search engine
is https://toolbox.google.com/datasetsearch.

5. Microsoft Datasets:
Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing,
computer vision, and domain-specific sciences. It gives access to diverse,
curated datasets that can be valuable for machine learning projects.
The link to download or use the datasets from this resource
is https://msropendata.com/.

6. Awesome Public Dataset Collection:


The Awesome Public Datasets collection provides high-quality datasets arranged in a
well-organized list according to topics such as Agriculture, Biology,
Climate, Complex networks, etc. Most of the datasets are available for free, but some are
not, so it is better to check the license before downloading a dataset.
The link to download datasets from the Awesome Public Datasets collection
is https://github.com/awesomedata/awesome-public-datasets.

7. Government Datasets:
There are different sources of government-related data. Various countries publish,
for public use, government data collected from different departments.
The goal of providing these datasets is to increase the transparency of government work
among the people and to encourage innovative uses of the data. Below are some
sources of government datasets:
• Indian Government dataset
• US Government Dataset
• Northern Ireland Public Sector Datasets
• European Union Open Data Portal

8. Computer Vision Datasets:

VisualData provides a number of great datasets that are specific to computer
vision tasks such as image classification, video classification, image segmentation, etc.
Therefore, if you want to build a project on deep learning or image processing, you
can refer to this source.

The link for downloading datasets from this source is https://www.visualdata.io/.

9. Scikit-learn dataset:
Scikit-learn, a well-known machine learning library in Python, provides a few built-in
datasets for practice and experimentation. These datasets are accessible through the
scikit-learn API and can be used for learning different machine learning
algorithms. Scikit-learn offers both toy datasets, which are small and simplified, and
real-world datasets with greater complexity. Examples of scikit-learn datasets
include the Iris dataset, the Boston Housing dataset (removed from the library in
version 1.2), and the Wine dataset.

The link to download datasets from this source is https://scikit-learn.org/stable/datasets/index.html.

Error:
The difference between the actual values and the predicted values is the error, and it is used to evaluate
the model. The error of any supervised machine learning algorithm comprises three parts:

1. Bias error
2. Variance error
3. The noise

While the noise is the irreducible error that we cannot eliminate, the other two, i.e. bias and variance,
are reducible errors that we can attempt to minimize as much as possible.
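
For squared error, the three parts are commonly written as the decomposition below. It is stated here as a standard result under assumptions that this guide does not spell out: the observed output is $y = f(x) + \varepsilon$, the noise $\varepsilon$ has zero mean and variance $\sigma^2$, $\hat{f}(x)$ is the model's prediction, and the expectation is taken over training sets and noise.

```latex
\[
E\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
\]
```

The noise term does not depend on the model, which is why it is called irreducible.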

Bias and variance:


Bias and variance are two fundamental concepts in machine learning that describe
different types of errors that a model can make. Understanding bias and variance helps
in assessing the overall performance and generalization capabilities of a machine
learning model.

Bias:
It represents the difference between the model's average prediction and the true
values in the underlying data.
A high-bias model tends to underfit the training data, meaning it fails to capture the
underlying patterns and relationships in the data. It oversimplifies the problem.
Example: Consider a linear regression model attempting to predict housing prices based
solely on the number of bedrooms in a house. This model has high bias because it
oversimplifies the relationship between housing prices and various other factors like
square footage, location, amenities, etc. It fails to capture the complexities of the
housing market.
Mathematically, bias can be quantified as the difference between the expected
prediction of the model and the true value. Let's denote:

• $\hat{f}(x)$ as the predicted output of the model for input $x$
• $f(x)$ as the true output corresponding to input $x$
• $E[\hat{f}(x)]$ as the expected value (average) of the model's predictions over
different training sets.

Then, the bias can be defined as: $\mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$

Variance:
Variance refers to the model's sensitivity to fluctuations in the training data. It measures
the variability of model predictions for different training datasets.
A high-variance model tends to overfit the training data, meaning it captures noise in the
training data as if it were true patterns. Such a model performs well on the training data
but fails to generalize to unseen data.
Example: Consider a decision tree model that is highly complex and deep, capable of
perfectly fitting every data point in the training set. Such a model may have high
variance because it captures noise specific to the training data, leading to poor
performance on unseen data due to overfitting.
Mathematically, variance can be quantified as the variability of the model's predictions
over different training sets. Let's denote:

• $\hat{f}(x)$ as the predicted output of the model for input $x$
• $E[\hat{f}(x)]$ as the expected value (average) of the model's predictions over
different training sets.
• $E[\hat{f}(x)^2]$ as the expected value of the square of the model's predictions.
Then, the variance can be defined as:
$\mathrm{Var}(\hat{f}(x)) = E[\hat{f}(x)^2] - \left(E[\hat{f}(x)]\right)^2$

Bias-Variance Tradeoff:
The bias-variance tradeoff refers to the delicate balance between bias and variance in
machine learning models. Increasing model complexity typically decreases bias but
increases variance, and vice versa.
Because the two usually cannot both be minimized at once, the goal is to find the
balance that minimizes their combined error, leading to optimal model performance on
unseen data.
Example: In the case of the housing price prediction, a more sophisticated model such as
a polynomial regression with multiple features may have lower bias but higher variance
compared to a simple linear regression model. Finding the complexity level that
gives the lowest total error is crucial for good generalization performance.
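
The tradeoff can be observed numerically by training two models on many random training sets and measuring, at one test input, how far the average prediction is from the truth (bias) and how much predictions vary between training sets (variance). Everything in this sketch is an assumption for illustration: the true function x squared, the noise level, the test point 0.9, and the two models (a constant mean predictor as the overly simple model, a 1-nearest-neighbour predictor as the overly flexible one).

```python
import random
import statistics

def true_function(x):
    return x ** 2

def sample_training_set(rng, size=20, noise=0.3):
    """Draw noisy observations of the true function on [0, 1]."""
    xs = [rng.random() for _ in range(size)]
    return [(x, true_function(x) + rng.gauss(0, noise)) for x in xs]

def mean_model(train, x):
    """Very simple model: always predict the average training output."""
    return statistics.mean(y for _, y in train)

def nearest_neighbour_model(train, x):
    """Very flexible model: copy the output of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def bias_and_variance(model, x, trials=2000, seed=0):
    """Estimate squared bias and variance of a model's prediction at x."""
    rng = random.Random(seed)
    predictions = [model(sample_training_set(rng), x) for _ in range(trials)]
    expected = statistics.mean(predictions)
    bias_squared = (expected - true_function(x)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_squared, variance

simple = bias_and_variance(mean_model, x=0.9)
flexible = bias_and_variance(nearest_neighbour_model, x=0.9)
print("mean model     bias^2=%.3f variance=%.3f" % simple)
print("nearest model  bias^2=%.3f variance=%.3f" % flexible)
```

Under these assumptions the mean model should show the larger squared bias and the nearest-neighbour model the larger variance, which is the pattern the tradeoff describes.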

Overfitting and Underfitting:


Overfitting and Underfitting are two common problems encountered when training
machine learning models, both of which affect the model's ability to generalize well to
unseen data.

OVERFITTING:

Overfitting occurs when a model learns to capture noise or random fluctuations in the
training data, rather than the underlying patterns or relationships.
Characteristics of overfitting:
• The model performs well on the training data but poorly on unseen data.
• There is a high variance in the model's predictions across different training
datasets.
• The model is too complex, capturing both the signal and the noise in the data.
Causes of overfitting:
• Using a model that is too complex relative to the size or complexity of the
training data.
• Insufficient regularization or lack of constraints on the model's parameters.
Remedies for overfitting:
• Simplifying the model architecture or reducing its complexity.
• Applying regularization techniques such as L1 or L2 regularization to penalize
large parameter values.
• Increasing the amount of training data to provide more information for the
model to learn from.
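
As a small illustration of L2 regularization, the sketch below fits a one-parameter linear model y = w * x with a penalty on the size of w. For this simplified case (no intercept, a single feature) the penalized least-squares solution has a closed form, which the function uses; the data and penalty values are hypothetical.

```python
def fit_ridge_slope(xs, ys, l2_penalty):
    """Slope w minimizing sum((y - w*x)^2) + l2_penalty * w^2 (no intercept)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2_penalty
    return numerator / denominator

# Hypothetical training data roughly following y = 2x, with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

unregularized = fit_ridge_slope(xs, ys, l2_penalty=0.0)
regularized = fit_ridge_slope(xs, ys, l2_penalty=10.0)
print(unregularized, regularized)
```

A larger penalty pulls the parameter towards zero, trading a little extra bias for lower variance.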

UNDERFITTING:

Underfitting occurs when a model is too simple to capture the underlying patterns or
relationships in the training data.
Characteristics of underfitting:
• The model performs poorly on both the training data and unseen data.
• There is a high bias in the model's predictions, indicating that it fails to capture
important features of the data.
• The model is too simplistic, failing to capture the complexity of the underlying
data distribution.
Causes of underfitting:
• Using a model that is too simple or has insufficient capacity to represent the
underlying function.
• Insufficient training or not allowing the model to learn from the data effectively.
Remedies for underfitting:
• Increasing the model's complexity by adding more features or using a more
sophisticated model architecture.
• Allowing the model to train for a longer duration or increasing the number of
training iterations.
• Ensuring that the training data is representative of the underlying data
distribution and contains sufficient variability.

Function approximation:
Function approximation is a fundamental concept in machine learning that involves
estimating or approximating an unknown function based on a finite set of input-output
pairs. The goal of function approximation is to learn a model that can accurately predict
the output for new input values that were not present in the training data.
In machine learning, function approximation is typically approached through
supervised learning, where the model learns from labeled training data. The process
involves selecting a model architecture and adjusting its parameters to minimize the
discrepancy between the predicted outputs and the true labels in the training data.
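
A minimal sketch of function approximation: given a few input-output pairs, choose the straight line that minimizes the squared discrepancy (ordinary least squares) and use it to predict the output for a new input. The data points and the choice of a linear model are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical input-output pairs sampled from an unknown function.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept   # input not present in the training data
print(slope, intercept, prediction)
```

Here the model architecture is the line and its parameters are the slope and intercept; more expressive architectures follow the same fit-then-predict pattern.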

Data Ethics and Privacy:


Data ethics and privacy are essential considerations in machine learning projects. It is
fundamental to ensure that data is collected and used ethically, respecting privacy
rights and complying with relevant laws and regulations. Data professionals should take
measures to protect data privacy, obtain proper consent, and handle sensitive data
responsibly. Resources such as ethical guidelines and privacy frameworks can give
direction on maintaining ethical practices in data collection and use.

Conclusion:
In conclusion, datasets form the foundation of effective machine learning projects.
Understanding the various kinds of datasets, the importance of data pre-processing,
and the role of training and test datasets are key steps towards building robust
models. By using well-known sources such as Kaggle, the UCI Machine Learning
Repository, AWS, Google's Dataset Search, Microsoft datasets, and government datasets,
data scientists and practitioners can access a wide variety of datasets for their
machine learning projects. It is essential to consider data ethics and privacy
throughout the whole data lifecycle to ensure responsible and ethical use of data.
With the right datasets and ethical practices, machine learning models can achieve
accurate predictions and deliver meaningful insights.

