
FAIML Unit 4 Introduction To ML

This document provides an introduction to machine learning (ML), covering its definition, workflow, types (supervised, unsupervised, reinforcement), and applications across various fields. It also discusses data preprocessing steps, cross-validation techniques, and the distinction between training and testing datasets. Additionally, it highlights the differences between artificial intelligence and machine learning, as well as the concepts of positive and negative classes in classification tasks.

Uploaded by

Sumit Kolgire
Copyright
© All Rights Reserved

FAIML: Unit 4: Introduction to ML
Syllabus:
Introduction to Machine Learning: History of ML, Examples of Machine Learning Applications, Learning Types, ML Life Cycle, AI & ML, Dataset for ML, Data Pre-processing, Training versus Testing, Positive and Negative Class, Cross-validation.

Q.1) What is machine learning? Give an overview of machine learning with a suitable diagram.
Ans:

✅ What is Machine Learning?


1. Machine Learning is a method where computers learn from data instead of
being programmed step by step.

2. It helps machines make decisions or predictions based on past experiences
(data).

🔄 Steps in Machine Learning – Two Points Each


1. Data Collection
Gather data from sources like files, websites, or sensors.

This data is used to train and test the machine learning model.

2. Data Preprocessing

Clean the data by handling missing values, duplicates, and wrong formats.

Convert data into a usable form (e.g., change text to numbers, normalize).

3. Data Splitting
Divide the data into Training Set (to learn) and Testing Set (to check).

Common split: 80% training and 20% testing.

4. Model Selection
Choose the right algorithm based on the problem (e.g., regression or
classification).

Different tasks need different models like Decision Trees, SVM, etc.

5. Model Training
Feed the training data to the model to help it learn patterns.

The model adjusts itself to give the correct output for input data.

6. Model Testing
Check how well the model works using the testing data.

Calculate accuracy, error, or other performance metrics.

7. Prediction / Deployment
Use the trained model to predict outcomes on new data.

Deploy the model in apps or websites for real-time use.
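
The steps above can be sketched end to end in a few lines of Python. This is only an illustrative sketch using the standard library: the toy "hours studied" data, the threshold model, and all function names are assumptions made for the example, not part of the original notes.

```python
import random

def make_dataset(n, seed=0):
    """Step 1 (toy data): hours studied -> 1 (pass) if hours >= 5, else 0 (fail)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        data.append((hours, 1 if hours >= 5 else 0))
    return data

def train_test_split(data, test_ratio=0.2, seed=0):
    """Step 3: shuffle a copy of the data, then split it (default 80% / 20%)."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

def train_threshold(train):
    """Step 5: 'learn' a cut-off halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, hours):
    """Step 7: predict the class of a new input."""
    return 1 if hours >= threshold else 0

def accuracy(threshold, rows):
    """Step 6: fraction of rows whose prediction matches the true label."""
    correct = sum(predict(threshold, x) == y for x, y in rows)
    return correct / len(rows)

data = make_dataset(200)
train, test = train_test_split(data)
threshold = train_threshold(train)
print("train/test sizes:", len(train), len(test))
print("test accuracy:", accuracy(threshold, test))
print("prediction for 8.5 hours:", predict(threshold, 8.5))
```

Preprocessing (step 2) and model selection (step 4) are skipped here because the toy data is already clean and only one model is considered.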

📊 Diagram: Machine Learning Workflow


+----------------+
|      Data      |
|   Collection   |
+--------+-------+
         |
         v
+----------------+
|      Data      |
| Preprocessing  |
+--------+-------+
         |
         v
+----------------+
| Data Splitting |
|  (Train/Test)  |
+--------+-------+
         |
         v
+----------------+
|     Model      |
|   Selection    |
+--------+-------+
         |
         v
+----------------+
| Model Training |
+--------+-------+
         |
         v
+----------------+
| Model Testing  |
+--------+-------+
         |
         v
+----------------+
|  Prediction /  |
|   Deployment   |
+----------------+

🧠 Types of Machine Learning – Two Points Each


🔵 Supervised Learning
The model learns from labeled data (input with correct output).

Examples: Email spam detection, exam result prediction.

🟢 Unsupervised Learning
The model finds patterns in data without labels.

Examples: Customer grouping, market segmentation.

🔴 Reinforcement Learning
The model learns by trial and error using rewards or punishments.

Examples: Robots learning to walk, AI playing games.

Q.2) Elaborate various cross-validation techniques with advantages and limitations.
Ans:

✅ What is Cross-Validation?
Cross-validation is a technique used to check how well a machine learning
model performs on unseen data. It helps ensure that the model is not just
memorizing training data but can generalize to new inputs.

📌 Why is Cross-Validation Important?


To detect overfitting (a model performing well on training data but poorly on new
data)

To evaluate model performance more reliably

To use data efficiently, especially when the dataset is small

🧪 Types of Cross-Validation Techniques


🔹 1. Hold-Out Validation
The dataset is split into two parts:

Training Set (e.g., 70%)

Testing Set (e.g., 30%)

Model is trained on the training set and tested on the testing set.

Advantages:

Simple and fast

Useful for large datasets

Limitations:

Performance depends on how the data was split

May not represent the whole dataset well

🔹 2. K-Fold Cross-Validation
The dataset is divided into K equal parts (folds).

The model is trained on K-1 folds and tested on the remaining 1 fold.

This process is repeated K times, with a different fold used for testing each
time.

The average result is taken as final performance.

Advantages:

More reliable than hold-out

Every sample is used for both training and testing (in different rounds)

Limitations:

More computationally expensive

Can be slow if K is large
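
The K-Fold procedure can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the toy labels, and the majority-class "model" are assumptions for the example. In practice the data is usually shuffled before it is divided into folds.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Stand-in model: always predict the most common label in the training folds."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

labels = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
folds = list(k_fold_indices(len(labels), 5))
scores = [majority_baseline_accuracy(labels, tr, te) for tr, te in folds]
mean_score = sum(scores) / len(scores)   # the average is the final performance
print("per-fold accuracy:", scores)
print("mean accuracy:", mean_score)
```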

🔹 3. Stratified K-Fold Cross-Validation


Similar to K-Fold, but it ensures that each fold has approximately the same class
proportions as the full dataset (important for classification problems).

Advantages:

Good for imbalanced datasets

Gives better model evaluation in classification

Limitations:

Slightly more complex to implement
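
One simple way to build stratified folds is to deal the indices of each class round-robin across the folds. The sketch below assumes this approach and a hypothetical imbalanced label list; it is not the only way to stratify.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Return k lists of test indices with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices across the folds like cards.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [1] * 4 + [0] * 16          # imbalanced: 20% positive, 80% negative
folds = stratified_k_fold(labels, 4)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print("fold size:", len(fold), "positives:", positives)
```

Each fold here keeps the 20% / 80% ratio of the full dataset, which plain K-Fold on unshuffled data would not guarantee.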

🔹 4. Leave-One-Out Cross-Validation (LOOCV)


Special case of K-Fold where K = number of data points.

Each sample is used once as test data, and the rest are used for training.

Advantages:

Maximum use of data for training

Low bias: each model is trained on almost the entire dataset

Limitations:

Very time-consuming for large datasets

High variance: the performance estimate can change considerably with small changes in the dataset

🔹 5. Leave-P-Out Cross-Validation
Similar to LOOCV, but instead of leaving one point out, P data points are left
out for testing in each iteration, and every possible group of P points is used once.

Advantages:

More general and flexible than LOOCV

Limitations:

Very high computation cost

Not commonly used for large datasets
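
The splits for Leave-P-Out, and for LOOCV as the special case P = 1, can be enumerated with the standard library. This is a small sketch; the function name is an assumption. It also shows why the cost grows so quickly: the number of splits is "n choose P".

```python
from itertools import combinations
from math import comb

def leave_p_out(n_samples, p):
    """Yield (train_indices, test_indices) for every possible test set of size p."""
    all_idx = set(range(n_samples))
    for test_idx in combinations(range(n_samples), p):
        train_idx = sorted(all_idx - set(test_idx))
        yield train_idx, list(test_idx)

loocv = list(leave_p_out(5, 1))   # P = 1 is Leave-One-Out
lpo = list(leave_p_out(5, 2))     # P = 2

print(len(loocv))                 # 5 splits, one per sample
print(len(lpo))                   # 10 splits = C(5, 2)
print(comb(100, 2))               # 4950 splits for only 100 samples with P = 2
```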

🔹 6. Repeated K-Fold Cross-Validation


K-Fold Cross-Validation is repeated multiple times with different random splits.

Advantages:

Reduces the effect of any single random split and gives more stable results

Limitations:

Even slower due to repeated training/testing

More complex

📊 Summary Table
Technique | Advantages | Limitations
Hold-Out | Simple, fast | Less reliable, depends on split
K-Fold | More reliable, uses all data | Slower, higher computational cost
Stratified K-Fold | Good for classification tasks | Slightly complex
Leave-One-Out (LOOCV) | Uses nearly all data, low bias | Very slow, high variance
Leave-P-Out | Flexible | Extremely slow for large datasets
Repeated K-Fold | More stable results | Time-consuming

Q.3) What is a dataset? Differentiate between the training dataset and the testing dataset.


Ans:

✅ What is a Dataset?
A dataset is a collection of data used to train and evaluate machine learning
models.

It usually contains multiple rows (examples) and columns (features or attributes).


📝 Example:
If you're building a model to predict student marks:

Each row = one student

Columns = features like hours studied, attendance, and final marks

🔍 Types of Datasets in Machine Learning:


Dataset Type | Purpose
Training Dataset | To teach the model (learn patterns)
Testing Dataset | To check how well the model has learned

🔁 Difference Between Training Dataset and Testing Dataset

Feature | Training Dataset | Testing Dataset
📘 Purpose | Used to train the machine learning model | Used to test the model’s accuracy/performance
🔢 Data Used | Usually 70%–80% of the total data | Usually 20%–30% of the total data
🎯 Role | Helps the model learn patterns and relationships | Checks how well the model performs on new/unseen data
🧠 Learning | Model learns from this data | Model is evaluated using this data
🔄 Repeated Use | Used multiple times during training | Used only once, after training

✅ Summary
The training dataset helps the model learn.

The testing dataset checks how well the model has learned.

Using both helps check that the model is not overfitting and can make good predictions
on real-world data.

Q.4) Difference between Positive And Negative class


Ans:

Positive Class vs Negative Class


Aspect | Positive Class | Negative Class
Meaning | The class/category that represents the target event or condition we want to detect or identify | The class/category that represents the absence of the target event or condition
Example | In spam email detection: Spam emails (what we want to detect) | In spam email detection: Non-spam (ham) emails
Role in Classification | Often considered the “true” or “important” class | Considered the “other” class or background
In Binary Classification | Labeled as 1 or “positive” | Labeled as 0 or “negative”
Use in Metrics | True Positives (TP) and False Negatives (FN) count examples whose actual class is positive | True Negatives (TN) and False Positives (FP) count examples whose actual class is negative

Quick Example:
If you want to detect cancer in medical images:

Positive Class: Images with cancer (disease present)

Negative Class: Images without cancer (healthy)
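
To make the role of the two classes in metrics concrete, the sketch below counts true/false positives and negatives for a small, made-up set of labels and predictions (1 = positive, 0 = negative). The data and function name are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual classes
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical model predictions

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)           # of predicted positives, how many were right
recall = tp / (tp + fn)              # of actual positives, how many were found
print(tp, fp, tn, fn)                # 3 1 3 1
print(precision, recall)             # 0.75 0.75
```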

Q.5) Explain different types of machine learning techniques.
Ans:
Machine Learning techniques are mainly divided into three types:

1. Supervised Learning
What it is: The model learns from labeled data—data where each example
has input features and a known correct output (label).

Goal: Predict the output for new, unseen data based on learned patterns.

Example:
Email Spam Detection:

Input: Email text and metadata

Output Label: Spam or Not Spam

The model learns from emails that are already marked as spam or not
spam, then predicts whether new emails are spam.

2. Unsupervised Learning
What it is: The model learns from unlabeled data—data that has no output
labels. It tries to find patterns, groups, or structures by itself.

Goal: Discover hidden patterns or groupings in data.

Example:
Customer Segmentation:

Input: Customer purchase data without any labels

Task: Group customers into clusters based on buying habits, so
businesses can target marketing campaigns effectively.

3. Reinforcement Learning

What it is: The model (called an agent) learns by trial and error, taking actions
in an environment and receiving rewards or penalties.

Goal: Learn the best sequence of actions to maximize cumulative rewards
over time.

Example:
Training a Robot to Walk:

The robot tries different movements.

It receives positive feedback (reward) for stable walking and negative
feedback (penalty) for falling.

Over time, it learns how to walk better by maximizing rewards.
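
Trial-and-error learning from rewards can be illustrated with a problem far simpler than a walking robot: an agent repeatedly chooses between two actions with unknown reward probabilities. The sketch below assumes an epsilon-greedy strategy (mostly pick the action that currently looks best, sometimes explore at random); the reward probabilities, function name, and parameters are all made up for the example.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn by trial and error which action yields the most reward on average."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)      # how often each action was tried
    values = [0.0] * len(reward_probs)    # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))                   # explore
        else:
            action = max(range(len(values)), key=lambda a: values[a])   # exploit
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental update of the average reward observed for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Action 0 is rewarded 20% of the time, action 1 is rewarded 80% of the time.
values, counts = run_bandit([0.2, 0.8])
print("estimated values:", values)
print("times each action was chosen:", counts)
```

The agent is never told which action is better; it ends up choosing the better action far more often purely because of the rewards it receives.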

Q.6) Discuss various applications of machine learning.


Ans:

🌟 Applications of Machine Learning


1. Healthcare
Disease Diagnosis: ML models analyze medical images or patient data to
detect diseases like cancer, diabetes, or heart conditions early.

Personalized Treatment: Recommends treatments based on patient history
and genetic data.

2. Finance
Fraud Detection: Detects unusual transactions or fraud patterns in banking
and credit card usage.

Stock Market Prediction: Predicts stock prices and market trends using
historical data.

3. E-commerce
Recommendation Systems: Suggests products based on your browsing and
purchase history (like Amazon, Netflix).

Customer Segmentation: Groups customers by behavior to target marketing
campaigns better.

4. Transportation
Self-Driving Cars: Autonomous vehicles use ML to understand surroundings
and make driving decisions.

Traffic Prediction: Predicts traffic congestion and suggests optimal routes.

5. Natural Language Processing (NLP)


Speech Recognition: Converts spoken language into text (e.g., Siri, Google
Assistant).

Language Translation: Translates text from one language to another
automatically (Google Translate).

6. Image and Video Analysis


Facial Recognition: Used in security and tagging photos on social media.

Object Detection: Detects and classifies objects in images or videos (useful in
surveillance, robotics).

7. Gaming
Game AI: Creates intelligent agents that learn to play and improve over time
(e.g., AlphaGo).

Personalized Gaming Experience: Adjusts difficulty based on player skill.

8. Manufacturing
Predictive Maintenance: Predicts machine failures before they happen,
reducing downtime.

Quality Control: Automatically inspects products for defects using image
recognition.

9. Education
Personalized Learning: Adapts course content based on student performance
and learning style.

Automated Grading: Grades assignments and exams using ML algorithms.

10. Social Media


Content Filtering: Detects and filters spam, fake news, or inappropriate
content.

User Behavior Analysis: Understands user preferences for better
engagement.

Q.7) Explain the data preprocessing steps in machine learning.
Ans:

What is Data Preprocessing?


Data preprocessing is the process of cleaning and transforming raw data into a
suitable format so that machine learning models can learn from it effectively.

Key Steps in Data Preprocessing


1. Data Collection
Gather raw data from different sources like databases, files, or online.

2. Data Cleaning
Handle missing values: Fill missing data using mean, median, or remove
rows/columns.

Remove duplicates: Get rid of repeated records.

Fix errors: Correct inconsistencies or typos in the data.

3. Data Transformation
Normalization/Scaling: Adjust numerical data to a standard scale (e.g., 0 to 1)
so that no feature dominates due to scale differences.

Encoding categorical data: Convert categories (like “red”, “blue”) into
numbers using techniques like one-hot encoding or label encoding.

4. Feature Selection
Choose the most important features (columns) that contribute to the
prediction and remove irrelevant or redundant ones.

5. Data Splitting
Split the dataset into training, testing, and sometimes validation sets to
evaluate model performance properly.

6. Data Reduction (Optional)


Reduce the data size while preserving important information using techniques
like Principal Component Analysis (PCA).
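
The cleaning and transformation steps (2 and 3) can be sketched with plain Python lists. The function names and sample values below are assumptions for illustration; real projects would normally use a data library for this.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def remove_duplicates(rows):
    """Data cleaning: drop repeated records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def min_max_scale(values):
    """Normalization: rescale numbers to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Encoding: one 0/1 column per category, columns in alphabetical order."""
    vocab = sorted(set(categories))
    return [[1 if c == name else 0 for name in vocab] for c in categories]

ages = fill_missing_with_mean([20, None, 40])
print(ages)                                    # [20, 30.0, 40]
print(min_max_scale(ages))                     # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))         # [[0, 1], [1, 0], [0, 1]]
print(remove_duplicates([("a", 1), ("b", 2), ("a", 1)]))   # [('a', 1), ('b', 2)]
```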

Summary Table
Step | Purpose
Data Collection | Gather raw data
Data Cleaning | Fix missing values, remove duplicates, correct errors
Data Transformation | Scale features, convert categories to numbers
Feature Selection | Select important features, remove noise
Data Splitting | Divide data into training and testing sets
Data Reduction | Reduce dataset size while keeping key info

Why is Data Preprocessing Important?
Improves model accuracy

Makes training faster and more stable

Helps handle real-world messy data

Q.8) Distinguish between artificial intelligence and machine learning.
Ans:

Aspect | Artificial Intelligence (AI) | Machine Learning (ML)
Definition | AI is the broad science of creating machines that can perform tasks that normally require human intelligence. | ML is a subset of AI that enables machines to learn from data and improve automatically without being explicitly programmed.
Goal | To build intelligent systems that can simulate human reasoning, decision-making, and problem-solving. | To develop algorithms that allow machines to learn patterns from data and make predictions or decisions.
Scope | Encompasses reasoning, planning, natural language processing, robotics, and ML. | Focused on learning from data using statistical methods and optimization.
Examples | Chatbots, self-driving cars, game-playing AI agents, expert systems. | Spam detection, image recognition, recommendation systems.
How it works | May use rules, logic, and knowledge bases (as well as ML itself) to mimic intelligence. | Uses data and algorithms to find patterns and make predictions.

Q.9) What is unsupervised learning? Explain the types of unsupervised learning algorithms with examples.
Ans:

What is Unsupervised Learning?
Unsupervised Learning is a type of machine learning where the algorithm learns
without any labeled output. This means:

You provide the model with data, but no answers (no categories or target
values).

The model tries to find hidden patterns, structures, or relationships in the
data on its own.

Useful when you don’t know what to expect from the data or want to explore
it.

Types of Unsupervised Learning Algorithms
1. Clustering
Goal:
Group data points into clusters so that points in the same group are more similar to
each other than to those in other groups.
How it works:

The algorithm measures the similarity or distance between data points (using
metrics like Euclidean distance).

It then assigns points to clusters iteratively to minimize within-cluster
distances and maximize between-cluster distances.

Popular Algorithms:

K-Means Clustering:

Choose the number of clusters k.

Randomly initialize cluster centers.

Assign each data point to the nearest center.

Update cluster centers as the mean of assigned points.

Repeat until centers stabilize.

Hierarchical Clustering:

Builds a tree (dendrogram) of clusters by either merging or splitting
clusters step-by-step.

Example:

Suppose an online retailer wants to group customers based on their shopping
behavior to offer personalized deals.

Using clustering, similar customers are grouped so marketing campaigns can
be targeted more effectively.
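
The K-Means steps listed above can be sketched in plain Python as follows. This is an illustrative sketch, not a production implementation: the function name, the iteration cap, and the six sample points are assumptions, and distance is measured with Euclidean distance as mentioned earlier.

```python
import random
from math import dist   # Euclidean distance (Python 3.8+)

def k_means(points, k, iterations=100, seed=0):
    """Plain K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # randomly initialize centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign each data point to the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update each center as the mean of its assigned points.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])       # keep an empty cluster's old center
        if new_centers == centers:                   # repeat until centers stabilize
            break
        centers = new_centers
    return centers, clusters

# Two visibly separate groups of made-up 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, k=2)
print("centers:", sorted(centers))
print("cluster sizes:", sorted(len(c) for c in clusters))
```

For well-separated data like this, the two centers settle at the means of the two groups. On harder data the result can depend on the random initialization.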

2. Dimensionality Reduction
Goal:
Reduce the number of features (dimensions) in the data while preserving as much
important information as possible.
Why needed:

High-dimensional data is hard to visualize and can slow down learning.

Removing redundant or irrelevant features can improve model performance.

Popular Algorithm:

Principal Component Analysis (PCA):

Finds new axes (principal components) that capture maximum variance in
data.

Projects data onto these axes, reducing the number of dimensions.

Example:

Compressing image data: reduce 1000 features (pixels) to 50 while retaining
the main image structure.

Helps in visualizing complex data in 2D or 3D.

Summary Table
Algorithm Type | Purpose | How It Works | Example Application
Clustering | Group similar data points | Assign points to clusters based on similarity | Customer segmentation
Dimensionality Reduction | Reduce features/dimensions | Transform data to fewer features preserving variance | Image compression, visualization

Why Use Unsupervised Learning?


When labeled data is unavailable or expensive to obtain.

To explore and understand data structure.

To detect anomalies or unusual patterns.

For data preprocessing before supervised learning.

Q.10) Explain features with suitable examples. Explain feature selection and feature extraction.
Ans:

What are Features?


Features are the individual measurable properties or characteristics of the
data that are used by machine learning models to make predictions or
decisions.

Each feature represents one aspect of the data. Together, features describe
each data point.

Example:
If you want to predict house prices, features could be:

Number of bedrooms

Size in square feet

Location

Age of the house

Each house in your dataset is described by these features.

Feature Selection
Feature Selection is the process of selecting the most important features from
the original dataset and removing irrelevant or redundant features.

The goal is to keep only those features that improve model performance,
reduce complexity, and avoid overfitting.

Why do feature selection?


To improve model accuracy

To reduce training time

To simplify the model for better understanding

Example:
From 20 features describing a customer, only 5 may be useful to predict whether
they will buy a product. Feature selection methods help identify these 5 important
features.
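
One simple feature selection method (a "filter" method) ranks each feature by the strength of its correlation with the target and keeps the top few. The sketch below assumes Pearson correlation as the score; the feature names and values are invented for illustration, and other selection methods exist.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if spread_x == 0 or spread_y == 0:   # a constant column carries no information
        return 0.0
    return cov / (spread_x * spread_y)

def select_top_features(features, target, k):
    """Keep the k features most strongly correlated (positively or negatively) with the target."""
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
    return ranked[:k]

features = {                              # hypothetical student data
    "hours_studied": [1, 2, 3, 4, 5],
    "attendance": [60, 70, 65, 85, 90],
    "shoe_size": [7, 7, 7, 7, 7],         # irrelevant, constant feature
}
marks = [35, 45, 55, 65, 75]

selected = select_top_features(features, marks, k=2)
print(selected)                           # ['hours_studied', 'attendance']
```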

Feature Extraction
Feature Extraction creates new features by transforming or combining the
original features into a smaller set of more informative features.

Unlike selection, which picks existing features, extraction generates new
features.

Why do feature extraction?


To reduce dimensionality while preserving important information

To simplify data representation for better learning

Example:
Using Principal Component Analysis (PCA) to combine many correlated
features into a few principal components that summarize the data.

Converting a long text document into a set of numerical features using TF-IDF
(Term Frequency-Inverse Document Frequency).
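
The TF-IDF example can be sketched as below. This assumes one common, unsmoothed variant (term frequency = count / document length, inverse document frequency = log(N / number of documents containing the word)); libraries often use smoothed variations, so exact numbers may differ. The three short documents are made up.

```python
from math import log

def tf_idf(documents):
    """Turn each document into a numeric vector with one TF-IDF weight per vocabulary word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}
    vectors = []
    for doc in tokenized:
        vector = []
        for word in vocab:
            tf = doc.count(word) / len(doc)       # how frequent the word is in this document
            idf = log(n_docs / doc_freq[word])    # how rare the word is across documents
            vector.append(tf * idf)
        vectors.append(vector)
    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tf_idf(docs)
print(vocab)                 # ['cat', 'dog', 'ran', 'sat', 'the']
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("dog") gets the highest weight in that document.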

Q.11) Write a short Note on PCA.


Ans:

Principal Component Analysis (PCA)


PCA is a popular dimensionality reduction technique used in machine learning
and data analysis. It transforms a large set of correlated variables into a smaller
set of uncorrelated variables called principal components. These components
capture the most important information (variance) in the data.

How PCA works:

It finds new axes (directions) in the feature space where the data varies the
most.

The first principal component captures the highest variance, the second
captures the next highest, and so on.

By keeping only the first few components, PCA reduces the data’s dimensions
while preserving most of its information.
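
As a rough illustration of these steps, the sketch below finds only the first principal component, using power iteration on the covariance matrix. Power iteration is one simple way to find the direction of highest variance and is an assumption of this example; libraries typically use an eigendecomposition or SVD instead. The four sample points are made up and lie exactly on the line y = 2x.

```python
def first_principal_component(rows, iterations=200):
    """Return the direction of maximum variance and each row's coordinate along it."""
    n, d = len(rows), len(rows[0])
    # Center the data by subtracting each feature's mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    # (Assumes the starting vector is not orthogonal to the top component.)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project the centered data onto the component: 2 features become 1.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
component, scores = first_principal_component(rows)
print([round(x, 4) for x in component])   # approximately [0.4472, 0.8944]
print([round(s, 3) for s in scores])
```

Because the points lie on a single line, one component captures all of the variance, so two features are reduced to one without losing information.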

Advantages:

Reduces computational cost by decreasing the number of features.

Helps visualize high-dimensional data in 2D or 3D plots.

Can reduce noise and redundancy in the features, which may improve model performance.

Limitations:

PCA assumes linearity; it may not capture complex nonlinear relationships.

Interpretation of principal components can be difficult because they are
combinations of original features.

Sensitive to outliers and scaling of data.

Example Use Case:


In image processing, each image can have thousands of pixels (features). PCA
reduces these thousands to a few components, preserving important visual
information but greatly simplifying computations for tasks like face recognition.

Q.12) What is dimensionality reduction? What are its advantages and disadvantages?
Ans:

What is Dimensionality Reduction?


Dimensionality Reduction is a process used to reduce the number of features
(dimensions) in a dataset while keeping as much important information as
possible.

Imagine you have data with many features (like 100 or 1000 columns).

It can be hard to analyze or visualize such high-dimensional data.

Dimensionality reduction helps by creating a smaller set of features that
summarize the original data.

Advantages of Dimensionality Reduction


1. Simplifies Data:
Fewer features make the data easier to visualize and understand.

2. Speeds Up Algorithms:

Machine learning models train faster with fewer features.

3. Reduces Overfitting:

Removing irrelevant or noisy features helps models generalize better to new
data.

4. Removes Redundancy:

Combines correlated features into fewer new features, reducing duplication.

Disadvantages of Dimensionality Reduction


1. Loss of Information:
Some important details might be lost when reducing dimensions.

2. Hard to Interpret:
New features created (like in PCA) are combinations of original ones, so they
can be difficult to understand.

3. Computational Cost:

Some methods (like t-SNE) can be slow on very large datasets.

4. Not Always Useful:


If data already has few important
