Unit-1 ML (Reference guide for students)
Machine Learning
Machine learning is broadly categorized into four types:
a. Supervised Learning
b. Unsupervised Learning
c. Semi-supervised Learning
d. Reinforcement Learning
Supervised Learning:
In supervised learning, the algorithm is trained on a labeled dataset, meaning the input data is
paired with corresponding target labels.
Objective: The goal is to learn a mapping from input features to the correct output labels so
that the model can make predictions on new, unseen data.
Examples:
1. Predicting house prices (regression).
2. Email spam detection (classification).
3. Image recognition (classification).
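To make the idea concrete, here is a minimal, illustrative sketch in Python (scikit-learn assumed, data made up): the model is fitted on labeled examples and then predicts the label of an unseen input.

# A minimal supervised-learning sketch (illustrative data): the model learns a
# mapping from labeled examples and predicts the label of a new, unseen input.
from sklearn.tree import DecisionTreeClassifier

# Toy labeled dataset: [house size in sq. ft, number of bedrooms] -> price band label
X_train = [[800, 2], [1200, 3], [2000, 4], [650, 1], [1800, 3]]
y_train = ["low", "medium", "high", "low", "high"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn the input -> label mapping
print(model.predict([[1500, 3]]))      # predict the label of an unseen house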
Unsupervised Learning:
Unsupervised learning involves training the algorithm on an unlabeled dataset, and the model
must find patterns, relationships, or structures within the data without explicit guidance.
Objective: Discover hidden patterns or intrinsic structures in the data, such as grouping
similar data points (clustering) or reducing dimensionality.
Examples:
1. Clustering customer segments based on purchasing behavior.
2. Dimensionality reduction for data visualization.
3. Anomaly detection in network security.
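As a small illustration (made-up numbers), the sketch below groups customers by purchasing behavior with k-means clustering; note that no labels are provided.

# A minimal unsupervised-learning sketch (illustrative data): group customers
# by purchasing behavior without any labels, using k-means clustering.
from sklearn.cluster import KMeans

# Each row: [annual spend, number of orders] for one customer (made-up values)
X = [[200, 3], [220, 4], [5000, 40], [4800, 35], [90, 1], [5200, 42]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index assigned to each customer
print(labels)                    # two customer segments, e.g. [0 0 1 1 0 1]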
Semi-Supervised Learning:
Semi-supervised learning is a hybrid approach that combines elements of both supervised and
unsupervised learning. The algorithm is trained on a dataset that contains both labeled and
unlabeled examples.
Objective: Leverage the available labeled data for supervised learning tasks while also
benefiting from the unlabeled data to improve model generalization and performance.
Examples:
1. Text classification where only a subset of documents is labeled.
2. Image classification with a small set of labeled images and a larger set of unlabeled
images.
3. Speech recognition with limited labeled audio data.
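A minimal sketch of the idea, assuming scikit-learn's SelfTrainingClassifier and made-up data, where unlabeled examples are marked with -1:

# A minimal semi-supervised sketch (illustrative data): a few labeled points
# plus unlabeled points (label -1); self-training uses both during fitting.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8], [1.1], [5.1]])
y = np.array([0, 0, -1, 1, -1, 1, -1, -1])   # -1 marks an unlabeled example

model = SelfTrainingClassifier(DecisionTreeClassifier())
model.fit(X, y)                        # learns from labeled + unlabeled data
print(model.predict([[1.05], [4.9]]))  # expected: [0 1]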
Applications of Machine learning:
1. Image Recognition:
Application: Identifying objects, people, scenes, and activities in images and videos. Used
in facial recognition, medical imaging analysis, and self-driving cars.
Algorithm: Convolutional Neural Networks (CNNs) are highly successful for image
recognition. They apply convolutional filters over local regions of the image,
extracting features and learning patterns to classify objects.
Example:
a. Unlocking your phone with Face ID (Apple) or facial recognition on Android
phones.
b. Auto-tagging friends in photos on Facebook.
c. Content moderation on social media platforms to identify inappropriate content.
2. Speech Recognition:
Application: Converting spoken language into text. Used in voice assistants, dictation
software, and automated call centers.
Algorithm: Hidden Markov Models (HMMs) are a popular choice. They analyze the
statistical properties of speech sounds to recognize words. Deep learning models are
also increasingly used for speech recognition.
Example:
a. Using voice assistants like Siri (Apple), Alexa (Amazon), or Google Assistant to
control smart devices, make calls, or dictate messages.
b. Voice search on smartphones and computers.
c. Automated customer service systems that understand spoken inquiries.
3. Traffic Prediction:
Application: Forecasting traffic congestion patterns to optimize traffic flow and
navigation.
Algorithm: Various algorithms can be used, including recurrent neural networks (RNNs)
which can analyze historical traffic data to identify patterns and predict future
conditions.
Example: Traffic navigation apps like Waze or Google Maps that suggest alternative
routes based on real-time traffic conditions.
4. Product Recommendations:
Application: Suggesting products to users based on their past purchases, browsing
history, and similar user behavior.
Algorithm: Collaborative filtering and content-based filtering are common approaches.
Collaborative filtering recommends products based on what similar users liked, while
content-based filtering recommends products similar to what the user has purchased or
shown interest in (a rough sketch of the collaborative-filtering idea follows the
examples below).
Example:
a. Recommendation sections on e-commerce sites like Amazon or Netflix
suggesting products or shows you might like based on your past purchases or
viewing habits.
b. Personalized advertising that targets users based on their online activity.
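The following rough sketch illustrates the collaborative-filtering idea with a made-up user-item rating matrix; the similarity measure and rating threshold are illustrative choices, not a production recommender:

# A rough sketch of collaborative filtering (made-up ratings): recommend items
# that the most similar user rated highly and the target user has not rated.
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                    # recommend for user 0
sims = [cosine(ratings[target], ratings[u]) for u in range(len(ratings))]
sims[target] = 0                              # ignore self-similarity
most_similar = int(np.argmax(sims))           # here: user 1

# Suggest items the similar user liked that the target has not rated yet
candidates = [i for i in range(ratings.shape[1])
              if ratings[target, i] == 0 and ratings[most_similar, i] > 3]
print(most_similar, candidates)               # prints: 1 [2] -> recommend item 2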
5. Self-Driving Cars:
Application: Enabling cars to navigate and perceive their surroundings without human
input.
Algorithm: A combination of algorithms is used, including deep learning for image
recognition, sensor data analysis, and reinforcement learning for decision making and
control.
Example: Tesla Autopilot, a driver-assistance system that uses machine learning for
features like lane centering and adaptive cruise control. (Note: Autopilot does not make
cars fully autonomous)
6. Fraud Detection:
Example:
a. Banks and credit card companies using machine learning to identify and block
suspicious transactions in real-time.
b. Online payment platforms like PayPal that use fraud detection algorithms to
protect user accounts.
Machine Learning Life Cycle:
The machine learning life cycle involves seven major steps, which are described below.
The most important thing in the complete process is to understand the problem and its
purpose.
In the complete life cycle, to solve a problem, we create a machine learning system
called a "model", and this model is created by providing "training". But to train a
model, we need data; hence, the life cycle starts with data collection.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is
to identify and obtain all the data relevant to the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the
most important steps of the life cycle. The quantity and quality of the collected data
determine the accuracy of the output: in general, the more data we have, the more
accurate the predictions will be.
This step includes the below tasks:
a. Identify various data sources
b. Collect data
c. Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in further steps.
2. Data preparation:
After collecting the data, we need to prepare it for further steps. Data preparation is a
step where we put our data into a suitable place and prepare it for use in machine
learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:
a. Data exploration:
It is used to understand the nature of data that we have to work with. We
need to understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this step, we find
correlations, general trends, and outliers.
b. Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling:
Data wrangling is the process of cleaning and converting raw data into a usable format.
It is the process of cleaning the data, selecting the variables to use, and transforming the
data into a proper format to make it more suitable for analysis in the next step. It is one of
the most important steps of the complete process. Cleaning of data is required to
address quality issues.
The data we have collected is not always useful, as some of it may be irrelevant. In
real-world applications, collected data may have various issues, including:
a. Missing Values
b. Duplicate data
c. Invalid data
d. Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can negatively affect
the quality of the outcome.
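As an illustrative sketch (pandas assumed, values made up), the snippet below handles the missing values, duplicates, and invalid entries listed above:

# A small data-wrangling sketch (illustrative data): remove duplicates,
# fill missing values, and filter out invalid entries.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, np.nan, 200, 31],      # 200 is invalid, NaN is missing
    "salary": [50000, 50000, 62000, 48000, np.nan],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())            # fill missing ages
df["salary"] = df["salary"].fillna(df["salary"].median())   # fill missing salaries
df = df[(df["age"] > 0) & (df["age"] < 120)]                # drop rows with invalid ages
print(df)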
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
a. Selection of analytical techniques
b. Building models
c. Review of the result
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with determining the
type of problem, where we select machine learning techniques such as classification,
regression, cluster analysis, or association; we then build the model using the
prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the
model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance for a better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training
a model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset to
it.
Testing the model determines the percentage accuracy of the model as per the
requirements of the project or problem.
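A minimal sketch of the train and test steps together, using one of scikit-learn's built-in toy datasets; the chosen model and split ratio are illustrative:

# A minimal sketch of steps 5 and 6: fit the model on a training split, then
# measure accuracy on a held-out test split.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)                               # Step 5: train the model
accuracy = accuracy_score(y_test, model.predict(X_test))  # Step 6: test the model
print(f"Test accuracy: {accuracy:.2f}")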
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in
the real-world system.
If the prepared model produces accurate results as per our requirements with
acceptable speed, we deploy the model in the real system. But before deploying the
project, we check whether the model keeps improving its performance using the
available data. The deployment phase is similar to making the final report for a
project.
Artificial Intelligence
Artificial intelligence is a field of computer science that builds computer systems that
can mimic human intelligence. It comprises two words, "artificial" and "intelligence",
meaning "a human-made thinking power." Hence we can define it as:
Artificial intelligence is a technology with which we can create intelligent systems that
can simulate human intelligence.
An artificial intelligence system does not need to be pre-programmed; instead, it uses
algorithms that can work with their own intelligence. It involves machine learning
algorithms such as reinforcement learning and deep learning neural networks. AI is
used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Based on capabilities, AI can be classified into three types:
a. Weak AI
b. General AI
c. Strong AI
Currently, practical systems fall under weak AI, while general AI remains a research
goal. The future of AI is strong AI, which is expected to be more intelligent than humans.
Difference between Artificial Intelligence (AI) and Machine Learning (ML):
1. AI: The goal of AI is to make a smart computer system, like humans, to solve complex problems.
   ML: The goal of ML is to allow machines to learn from data so that they can give accurate output.
2. AI: In AI, we make intelligent systems to perform any task like a human.
   ML: In ML, we teach machines with data to perform a particular task and give an accurate result.
3. AI: Machine learning and deep learning are the two main subsets of AI.
   ML: Deep learning is the main subset of machine learning.
4. AI: AI has a very wide range of scope.
   ML: Machine learning has a limited scope.
5. AI: An AI system is concerned with maximizing the chances of success.
   ML: Machine learning is mainly concerned with accuracy and patterns.
6. AI: It includes learning, reasoning, and self-correction.
   ML: It includes learning and self-correction when introduced with new data.
7. AI: AI deals with structured, semi-structured, and unstructured data.
   ML: Machine learning deals with structured and semi-structured data.
How to get datasets for Machine Learning?
The field of ML depends heavily on datasets for training models and making
accurate predictions. Datasets play a vital role in the success of ML projects and
are essential for becoming a skilled data scientist. Below, we look at the various
types of datasets used in machine learning and provide a guide on where to find them.
What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset can
contain any data from a series of an array to a database table. Below table shows an
example of the dataset:
Country    Age    Salary    Purchased
Germany    30     54000     No
France     48     65000     No
Germany    40               Yes
A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to a record of the
dataset. The most common file type for a tabular dataset is the "Comma-Separated
Values" (CSV) file. But to store tree-like data, a JSON file can be used more efficiently.
Types of datasets:
Machine learning spans different domains, each requiring specific types of
datasets. Some common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are typically used in computer
vision tasks such as image classification, object detection, and image segmentation.
Text Datasets:
Text datasets consist of textual data, such as articles, books, or social media
posts. These datasets are used in NLP tasks like sentiment analysis, text
classification, and machine translation.
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets.
They contain rows representing instances or samples and columns representing features or
attributes. Tabular datasets are used for tasks such as regression and classification.
The dataset given earlier is an example of a tabular dataset.
Need of Dataset
1. Properly prepared and pre-processed datasets are crucial for machine learning
projects.
2. They provide the foundation for training accurate and reliable models. However,
working with large datasets can present challenges in terms of management and
processing.
3. To address these challenges, efficient data management techniques and processing
algorithms are required.
Data Pre-processing:
Data pre-processing is a fundamental stage in preparing datasets for machine learning.
It involves transforming raw data into a format suitable for model training.
Common pre-processing techniques include data cleaning to remove inconsistencies
or errors, normalization to scale data within a specific range, feature scaling to
ensure features have comparable ranges, and handling of missing values.
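As a small illustration (made-up values, scikit-learn assumed), the sketch below imputes missing values and standardizes features:

# A small pre-processing sketch (illustrative data): impute missing values
# and scale features to a comparable range.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [40.0, np.nan],
              [31.0, 58000.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # handle missing values
X_scaled = StandardScaler().fit_transform(X_imputed)         # standardize each feature
print(X_scaled.mean(axis=0).round(2))   # each column has mean ~0 after standardization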
During the development of an ML project, the developers rely heavily on datasets. In
building ML applications, datasets are divided into two parts:
a. Training dataset
b. Test dataset
It is fundamental to ensure that the datasets are representative of the problem space and
appropriately split to avoid bias or overfitting.
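A minimal sketch of such a split (toy data; the 60/40 split and stratification are illustrative choices):

# A minimal sketch of splitting data into training and test sets; stratify keeps
# the class proportions similar in both parts, which helps keep the split representative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)              # 10 samples, 2 features (toy data)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)   # (6, 2) (4, 2)
print(sorted(y_test))                # [0, 0, 1, 1] -- class balance preserved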
1. Kaggle Datasets:
Kaggle is one of the best sources of datasets for data scientists and machine learning
practitioners. It allows users to find, download, and publish datasets in an easy way. It
also provides the opportunity to work with other machine learning engineers and solve
difficult data science related tasks.
Kaggle provides high-quality datasets in different formats that we can easily find and
download.
The link for the Kaggle dataset is https://www.kaggle.com/datasets.
3. AWS Datasets:
This source provides various types of datasets with examples and ways to use them. It
also provides a search box with which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.
5. Microsoft Datasets:
Microsoft has launched the "Microsoft Research Open Data" repository with a
collection of free datasets in various areas such as natural language processing,
computer vision, and domain-specific sciences. It gives access to diverse and
curated datasets that can be valuable for machine learning projects.
The link to download or use the dataset from this resource
is https://msropendata.com/.
7. Government Datasets:
There are different sources to get government-related data. Various countries publish
government data for public use collected by them from different departments.
The goal of providing these datasets is to increase the transparency of government work
and to encourage innovative uses of the data. Below are some links
to government datasets:
Indian Government dataset
US Government Dataset
Northern Ireland Public Sector Datasets
European Union Open Data Portal
8. Computer Vision Datasets:
The link for downloading datasets from this source is https://www.visualdata.io/.
9. Scikit-learn dataset:
Scikit-learn, a well-known machine learning library in Python, provides several built-in
datasets for practice and experimentation. These datasets are accessible through the
scikit-learn API and can be used for learning different machine learning algorithms.
Scikit-learn offers both toy datasets, which are small and simplified, and real-world
datasets with greater complexity. Examples of scikit-learn datasets include the Iris
dataset, the Boston Housing dataset (removed in recent versions), and the Wine dataset.
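For example, a toy dataset can be loaded directly through the scikit-learn API:

# A quick sketch of loading one of scikit-learn's built-in toy datasets.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)         # (150, 4): 150 samples, 4 features
print(iris.feature_names)      # sepal/petal lengths and widths
print(iris.target_names)       # ['setosa' 'versicolor' 'virginica']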
Error:
The difference between the actual values and the predicted values is the error, and it is used to evaluate
the model. The error for any supervised machine learning algorithm comprises three parts:
1. Bias error
2. Variance error
3. The noise
While the noise is the irreducible error that we cannot eliminate, the other two i.e. Bias and Variance
are reducible errors that we can attempt to minimize as much as possible.
Bias:
It represents the difference between the predicted values by the model and the true
values in the underlying data.
A high bias model tends to underfit the training data, meaning it fails to capture the
underlying patterns and relationships in the data. It oversimplifies the problem.
Example: Consider a linear regression model attempting to predict housing prices based
solely on the number of bedrooms in a house. This model has high bias because it
oversimplifies the relationship between housing prices and various other factors like
square footage, location, amenities, etc. It fails to capture the complexities of the
housing market.
Mathematically, bias can be quantified as the difference between the expected
prediction of the model and the true value. Let $f(x)$ denote the true underlying
function, $\hat{f}(x)$ the model's prediction, and $E[\cdot]$ the expectation taken over
different training sets. Then the bias can be defined as:
$\mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$
Variance:
Variance refers to the model's sensitivity to fluctuations in the training data. It measures
the variability of model predictions for different training datasets.
A high variance model tends to overfit the training data, meaning it captures noise in the
training data as if it were true patterns. Such a model performs well on the training data
but fails to generalize to unseen data.
Example: Consider a decision tree model that is highly complex and deep, capable of
perfectly fitting every data point in the training set. Such a model may have high
variance because it captures noise specific to the training data, leading to poor
performance on unseen data due to overfitting.
Mathematically, variance can be quantified as the variability of the model's predictions
over different training sets. Using the same notation as above, the variance can be
defined as:
$\mathrm{Var}(\hat{f}(x)) = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]$
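For squared-error loss, these quantities combine into the model's expected prediction error. Writing $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, the standard decomposition (stated here for reference) ties together the three error components listed earlier:
$E\big[(y - \hat{f}(x))^2\big] = \big(E[\hat{f}(x)] - f(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Variance} + \text{irreducible error}$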
Bias-Variance Tradeoff:
The bias-variance tradeoff refers to the delicate balance between bias and variance in
machine learning models. Increasing model complexity typically decreases bias but
increases variance, and vice versa.
The goal is to find the right balance that minimizes both bias and variance, leading to
optimal model performance on unseen data.
Example: In the case of the housing price prediction, a more sophisticated model such as
a polynomial regression with multiple features may have lower bias but higher variance
compared to a simple linear regression model. Finding the right complexity level that
minimizes both bias and variance is crucial for good generalization performance.
OVERFITTING:
Overfitting occurs when a model learns to capture noise or random fluctuations in the
training data, rather than the underlying patterns or relationships.
Characteristics of overfitting:
The model performs well on the training data but poorly on unseen data.
There is a high variance in the model's predictions across different training
datasets.
The model is too complex, capturing both the signal and the noise in the data.
Causes of overfitting:
Using a model that is too complex relative to the size or complexity of the
training data.
Insufficient regularization or lack of constraints on the model's parameters.
Remedies for overfitting:
Simplifying the model architecture or reducing its complexity.
Applying regularization techniques such as L1 or L2 regularization to penalize
large parameter values.
Increasing the amount of training data to provide more information for the
model to learn from.
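As a small illustration of the regularization remedy above (made-up data; the penalty strength alpha = 1.0 is an arbitrary choice), ridge regression applies an L2 penalty that shrinks large coefficients:

# A small sketch of one overfitting remedy: L2 regularization via ridge regression.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 10)                        # few samples, many features
y = 3 * X[:, 0] + rng.normal(0, 0.5, 20)    # only the first feature truly matters

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty on coefficient size

print(np.abs(plain.coef_).round(2))         # unregularized coefficients
print(np.abs(ridge.coef_).round(2))         # shrunk toward zero, reducing variance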
UNDERFITTING:
Underfitting occurs when a model is too simple to capture the underlying patterns or
relationships in the training data.
Characteristics of underfitting:
The model performs poorly on both the training data and unseen data.
There is a high bias in the model's predictions, indicating that it fails to capture
important features of the data.
The model is too simplistic, failing to capture the complexity of the underlying
data distribution.
Causes of underfitting:
Using a model that is too simple or has insufficient capacity to represent the
underlying function.
Insufficient training or not allowing the model to learn from the data effectively.
Remedies for underfitting:
Using a more complex model or adding more informative features so that the
model has sufficient capacity to capture the underlying patterns.
Allowing the model to train for a longer duration or increasing the number of
training iterations.
Ensuring that the training data is representative of the underlying data
distribution and contains sufficient variability.
Function approximation:
Function approximation is a fundamental concept in machine learning that involves
estimating or approximating an unknown function based on a finite set of input-output
pairs. The goal of function approximation is to learn a model that can accurately predict
the output for new input values that were not present in the training data.
In machine learning, function approximation is typically approached through
supervised learning, where the model learns from labeled training data. The process
involves selecting a model architecture and adjusting its parameters to minimize the
discrepancy between the predicted outputs and the true labels in the training data.
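A minimal function-approximation sketch (the noisy sine below stands in for the unknown true function; the polynomial degree is an illustrative choice):

# Approximate an unknown function from a finite set of input-output pairs by
# fitting a polynomial to the observed samples.
import numpy as np

x = np.linspace(0, np.pi, 30)
y = np.sin(x) + np.random.RandomState(0).normal(0, 0.05, 30)  # observed input-output pairs

coeffs = np.polyfit(x, y, deg=3)       # fit a degree-3 polynomial to the samples
approx = np.poly1d(coeffs)
print(round(approx(np.pi / 2), 2))     # close to sin(pi/2) = 1.0 for an unseen input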
Conclusion:
In conclusion, datasets form the foundation of effective machine learning projects.
Understanding the various kinds of datasets, the importance of data pre-processing,
and the role of training and testing datasets are key steps towards building powerful
models. By using well-known sources such as Kaggle, the UCI Machine Learning
Repository, AWS, Google's Dataset Search, Microsoft Datasets, and government datasets,
data scientists and practitioners can access a wide variety of datasets for their
machine learning projects. It is essential to consider data ethics and privacy
throughout the entire data lifecycle to ensure responsible and ethical use of data.
With the right datasets and ethical practices, machine learning models can achieve
accurate predictions and drive meaningful insights.
References:
1. https://www.javatpoint.com/machine-learning
2. https://www.geeksforgeeks.org/machine-learning/
3. https://ngugijoan.medium.com/bias-and-variance-6cf244080082
4. https://www.analyticsvidhya.com/blog/2020/12/a-measure-of-bias-and-variance-an-experiment/
5. https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/
6. https://www.analyticsvidhya.com/blog/2020/08/bias-and-variance-tradeoff-machine-learning/