FAIML: Unit 4: Introduction to ML
Syllabus:
Introduction to Machine Learning: History of ML, Examples of Machine Learning Applications, Learning Types, ML Life Cycle, AI & ML, Datasets for ML, Data Pre-processing, Training versus Testing, Positive and Negative Class, Cross-validation.
Q.1) What is machine learning? Give an overview of
machine learning with suitable diagram
Ans:
✅ What is Machine Learning?
1. Machine Learning is a method where computers learn from data instead of
being programmed step by step.
2. It helps machines make decisions or predictions based on past experiences
(data).
🔄 Steps in Machine Learning – Two Points Each
1. Data Collection
Gather data from sources like files, websites, or sensors.
This data is used to train and test the machine learning model.
2. Data Preprocessing
Clean the data by handling missing values, duplicates, and wrong formats.
Convert data into a usable form (e.g., change text to numbers, normalize).
3. Data Splitting
Divide the data into Training Set (to learn) and Testing Set (to check).
Common split: 80% training and 20% testing.
4. Model Selection
Choose the right algorithm based on the problem (e.g., regression or
classification).
Different tasks need different models like Decision Trees, SVM, etc.
5. Model Training
Feed the training data to the model to help it learn patterns.
The model adjusts itself to give the correct output for input data.
6. Model Testing
Check how well the model works using the testing data.
Calculate accuracy, error, or other performance metrics.
7. Prediction / Deployment
Use the trained model to predict outcomes on new data.
Deploy the model in apps or websites for real-time use.
📊 Diagram: Machine Learning Workflow
+----------------+
|      Data      |
|   Collection   |
+--------+-------+
         |
         v
+----------------+
|      Data      |
|  Preprocessing |
+--------+-------+
         |
         v
+----------------+
| Data Splitting |
|  (Train/Test)  |
+--------+-------+
         |
         v
+----------------+
|     Model      |
|   Selection    |
+--------+-------+
         |
         v
+----------------+
| Model Training |
+--------+-------+
         |
         v
+----------------+
| Model Testing  |
+--------+-------+
         |
         v
+----------------+
|  Prediction /  |
|   Deployment   |
+----------------+
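As a rough end-to-end sketch, the same workflow can be expressed in a few lines of scikit-learn. The Iris dataset and the Decision Tree model below are illustrative choices, not part of the question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # 1-2. data collection (already clean)

X_train, X_test, y_train, y_test = train_test_split(  # 3. data splitting (80/20)
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()                      # 4. model selection
model.fit(X_train, y_train)                           # 5. model training

y_pred = model.predict(X_test)                        # 6. model testing
print("Accuracy:", accuracy_score(y_test, y_pred))

print(model.predict(X_test[:1]))                      # 7. prediction on new data
```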
🧠 Types of Machine Learning – Two Points Each
🔵 Supervised Learning
The model learns from labeled data (input with correct output).
Examples: Email spam detection, exam result prediction.
🟢 Unsupervised Learning
The model finds patterns in data without labels.
Examples: Customer grouping, market segmentation.
🔴 Reinforcement Learning
The model learns by trial and error using rewards or punishments.
Examples: Robots learning to walk, AI playing games.
Q.2) Elaborate various cross validation techniques with
advantages and limitations.
Ans:
✅ What is Cross-Validation?
Cross-validation is a technique used to check how well a machine learning
model performs on unseen data. It helps ensure that the model is not just
memorizing training data but can generalize to new inputs.
📌 Why Cross-Validation is Important?
To avoid overfitting (model performing well on training data but poorly on new
data)
To evaluate model performance more reliably
To use data efficiently, especially when the dataset is small
🧪 Types of Cross-Validation Techniques
🔹 1. Hold-Out Validation
The dataset is split into two parts:
Training Set (e.g., 70%)
Testing Set (e.g., 30%)
Model is trained on the training set and tested on the testing set.
Advantages:
Simple and fast
Useful for large datasets
Limitations:
Performance depends on how the data was split
May not represent the whole dataset well
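A minimal hold-out sketch in scikit-learn, using the Iris dataset and logistic regression as stand-ins for any data and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# 70/30 hold-out split; random_state fixes the split so results are repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```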
🔹 2. K-Fold Cross-Validation
The dataset is divided into K equal parts (folds).
The model is trained on K-1 folds and tested on the remaining 1 fold.
This process is repeated K times, with a different fold used for testing each
time.
The average result is taken as final performance.
Advantages:
More reliable than hold-out
Uses all data for both training and testing
Limitations:
More computationally expensive
Can be slow if K is large
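A sketch of 5-fold cross-validation with scikit-learn (same illustrative data and model as above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
# 5 folds: train on 4, test on the 5th, rotate, then average the 5 scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```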
🔹 3. Stratified K-Fold Cross-Validation
Similar to K-Fold, but it ensures that each fold has the same percentage of
classes (important for classification problems).
Advantages:
Good for imbalanced datasets
Gives better model evaluation in classification
Limitations:
Slightly more complex to implement
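The same sketch with stratified folds, so every fold keeps roughly the full dataset's class proportions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# Each fold preserves the class balance, which matters for imbalanced data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Stratified mean accuracy:", scores.mean())
```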
🔹 4. Leave-One-Out Cross-Validation (LOOCV)
Special case of K-Fold where K = number of data points.
Each sample is used once as test data, and the rest are used for training.
Advantages:
Maximum use of data for training
Low bias (very accurate)
Limitations:
Very time-consuming for large datasets
High variance: the resulting performance estimate can be unstable
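A LOOCV sketch; note that even for a dataset as small as Iris this trains 150 separate models:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
# One fold per sample: 150 fits, each tested on a single left-out point.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())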
🔹 5. Leave-P-Out Cross-Validation
Similar to LOOCV, but instead of leaving one point out, P data points are left
out for testing in each iteration.
Advantages:
More general and flexible than LOOCV
Limitations:
Very high computation cost
Not commonly used for large datasets
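Because Leave-P-Out enumerates every possible combination of P test points, the cost explodes quickly; it is easiest to see on a tiny toy array:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)  # toy dataset: 4 samples, 2 features
lpo = LeavePOut(p=2)            # every possible pair of samples becomes a test set
print("Number of train/test splits:", lpo.get_n_splits(X))  # C(4, 2) = 6
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)
```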
🔹 6. Repeated K-Fold Cross-Validation
K-Fold Cross-Validation is repeated multiple times with different random splits.
Advantages:
Reduces randomness and gives more stable results
Limitations:
Even slower due to repeated training/testing
More complex
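A Repeated K-Fold sketch: 5 folds repeated 3 times gives 15 scores, whose average is less sensitive to any one random split:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5-fold CV, run 3 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Scores from", len(scores), "fits; mean:", scores.mean())
```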
📊 Summary Table

| Technique | Advantages | Limitations |
| --- | --- | --- |
| Hold-Out | Simple, fast | Less reliable, depends on split |
| K-Fold | More reliable, uses all data | Slower, higher computational cost |
| Stratified K-Fold | Good for classification tasks | Slightly complex |
| Leave-One-Out (LOOCV) | Uses nearly all data, low bias | Very slow, high variance |
| Leave-P-Out | Flexible | Extremely slow for large datasets |
| Repeated K-Fold | More stable results | Time-consuming |
Q.3) What is Dataset? Differentiate between Training
dataset and Testing dataset
Ans:
✅ What is a Dataset?
A dataset is a collection of data used to train and evaluate machine learning
models.
It usually contains multiple rows (examples) and columns (features or attributes).
📝 Example:
If you're building a model to predict student marks:
Each row = one student
Columns = features like hours studied, attendance, and final marks
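A tiny pandas sketch of such a student dataset (the column names are hypothetical):

```python
import pandas as pd

# Each row is one student, each column is one feature.
df = pd.DataFrame({
    "hours_studied":  [10, 4, 7],
    "attendance_pct": [95, 60, 80],
    "final_marks":    [88, 45, 70],  # target column the model learns to predict
})
print(df.shape)  # (3, 3): 3 examples, 3 columns
```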
🔍 Types of Datasets in Machine Learning:
| Dataset Type | Purpose |
| --- | --- |
| Training Dataset | To teach the model (learn patterns) |
| Testing Dataset | To check how well the model has learned |
🔁 Difference Between Training Dataset and Testing Dataset

| Feature | Training Dataset | Testing Dataset |
| --- | --- | --- |
| 📘 Purpose | Used to train the machine learning model | Used to test the model's accuracy/performance |
| 🔢 Data Used | Usually 70%–80% of the total data | Usually 20%–30% of the total data |
| 🎯 Role | Helps the model learn patterns and relationships | Checks how well the model performs on new/unseen data |
| 🧠 Learning | Model learns from this data | Model is evaluated using this data |
| 🔄 Repeated Use | Used multiple times during training | Used only once, after training |
✅ Summary
The training dataset helps the model learn.
The testing dataset checks how well the model has learned.
Using both ensures the model is not overfitting and can make good predictions
on real-world data.
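A quick sketch of the common 80/20 split, checking the sizes of the two sets (Iris is used here only as a convenient built-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 examples in total
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4): 80% train, 20% test
```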
Q.4) Difference between Positive And Negative class
Ans:
Positive Class vs Negative Class
| Aspect | Positive Class | Negative Class |
| --- | --- | --- |
| Meaning | The class/category that represents the target event or condition we want to detect or identify | The class/category that represents the absence of the target event or condition |
| Example | In spam email detection: Spam emails (what we want to detect) | In spam email detection: Non-spam (ham) emails |
| Role in Classification | Often considered the "true" or "important" class | Considered the "other" class or background |
| In Binary Classification | Labeled as 1 or "positive" | Labeled as 0 or "negative" |
| Use in Metrics | True Positives (TP) relate to the positive class | True Negatives (TN) relate to the negative class |
Quick Example:
If you want to detect cancer in medical images:
Positive Class: Images with cancer (disease present)
Negative Class: Images without cancer (healthy)
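A small sketch showing how the positive (1) and negative (0) labels map to TP/TN/FP/FN in a confusion matrix; the labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class (cancer present), 0 = negative (healthy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 3 TN: 3 FP: 1 FN: 1
```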
Q.5) Explain different types of machine learning
techniques.
Ans:
Machine Learning techniques are mainly divided into three types:
1. Supervised Learning
What it is: The model learns from labeled data—data where each example
has input features and a known correct output (label).
Goal: Predict the output for new, unseen data based on learned patterns.
Example:
Email Spam Detection:
Input: Email text and metadata
Output Label: Spam or Not Spam
The model learns from emails that are already marked as spam or not
spam, then predicts whether new emails are spam.
2. Unsupervised Learning
What it is: The model learns from unlabeled data—data that has no output
labels. It tries to find patterns, groups, or structures by itself.
Goal: Discover hidden patterns or groupings in data.
Example:
Customer Segmentation:
Input: Customer purchase data without any labels
Task: Group customers into clusters based on buying habits, so
businesses can target marketing campaigns effectively.
3. Reinforcement Learning
What it is: The model (called an agent) learns by trial and error, taking actions
in an environment and receiving rewards or penalties.
Goal: Learn the best sequence of actions to maximize cumulative rewards
over time.
Example:
Training a Robot to Walk:
The robot tries different movements.
It receives positive feedback (reward) for stable walking and negative
feedback (penalty) for falling.
Over time, it learns how to walk better by maximizing rewards.
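A minimal tabular Q-learning sketch on a made-up 1-D world (5 cells, reward for reaching the rightmost one). This is one simple instance of learning from rewards, not the robot example itself:

```python
import random

# Toy world: the agent stands on a line of 5 cells and earns a reward of 1
# for reaching the rightmost cell; every other step earns 0.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # the learned table should prefer action 1 (right) in every state
```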
Q.6) Discuss various applications of machine learning.
Ans:
🌟 Applications of Machine Learning
1. Healthcare
Disease Diagnosis: ML models analyze medical images or patient data to
detect diseases like cancer, diabetes, or heart conditions early.
Personalized Treatment: Recommends treatments based on patient history
and genetic data.
2. Finance
Fraud Detection: Detects unusual transactions or fraud patterns in banking
and credit card usage.
Stock Market Prediction: Predicts stock prices and market trends using
historical data.
3. E-commerce
Recommendation Systems: Suggests products based on your browsing and
purchase history (like Amazon, Netflix).
Customer Segmentation: Groups customers by behavior to target marketing
campaigns better.
4. Transportation
Self-Driving Cars: Autonomous vehicles use ML to understand surroundings
and make driving decisions.
Traffic Prediction: Predicts traffic congestion and suggests optimal routes.
5. Natural Language Processing (NLP)
Speech Recognition: Converts spoken language into text (e.g., Siri, Google
Assistant).
Language Translation: Translates text from one language to another
automatically (Google Translate).
6. Image and Video Analysis
Facial Recognition: Used in security and tagging photos on social media.
Object Detection: Detects and classifies objects in images or videos (useful in
surveillance, robotics).
7. Gaming
Game AI: Creates intelligent agents that learn to play and improve over time
(e.g., AlphaGo).
Personalized Gaming Experience: Adjusts difficulty based on player skill.
8. Manufacturing
Predictive Maintenance: Predicts machine failures before they happen,
reducing downtime.
Quality Control: Automatically inspects products for defects using image
recognition.
9. Education
Personalized Learning: Adapts course content based on student performance
and learning style.
Automated Grading: Grades assignments and exams using ML algorithms.
10. Social Media
Content Filtering: Detects and filters spam, fake news, or inappropriate
content.
User Behavior Analysis: Understands user preferences for better
engagement.
Q.7) Explain the data preprocessing steps in machine
learning.
Ans:
What is Data Preprocessing?
Data preprocessing is the process of cleaning and transforming raw data into a
suitable format so that machine learning models can learn from it effectively.
Key Steps in Data Preprocessing
1. Data Collection
Gather raw data from different sources like databases, files, or online.
2. Data Cleaning
Handle missing values: Fill missing data using mean, median, or remove
rows/columns.
Remove duplicates: Get rid of repeated records.
Fix errors: Correct inconsistencies or typos in the data.
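A short pandas sketch of the cleaning steps above, on a hypothetical messy table:

```python
import numpy as np
import pandas as pd

# Made-up messy data: a missing age and a duplicated row.
df = pd.DataFrame({
    "age":  [25, np.nan, 31, 31],
    "city": ["Pune", "Mumbai", "Delhi", "Delhi"],
})
df = df.drop_duplicates()                       # remove repeated records
df["age"] = df["age"].fillna(df["age"].mean())  # fill missing values with the mean
print(df)
```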
3. Data Transformation
Normalization/Scaling: Adjust numerical data to a standard scale (e.g., 0 to 1)
so that no feature dominates due to scale differences.
Encoding categorical data: Convert categories (like “red”, “blue”) into
numbers using techniques like one-hot encoding or label encoding.
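A sketch of both transformations with scikit-learn (note: the sparse_output argument assumes scikit-learn 1.2 or newer):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

nums = np.array([[10.0], [50.0], [100.0]])
print(MinMaxScaler().fit_transform(nums).ravel())  # values rescaled to the 0-1 range

colors = np.array([["red"], ["blue"], ["red"]])
enc = OneHotEncoder(sparse_output=False)           # sparse_output: scikit-learn >= 1.2
print(enc.fit_transform(colors))                   # one binary column per category
```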
4. Feature Selection
Choose the most important features (columns) that contribute to the
prediction and remove irrelevant or redundant ones.
5. Data Splitting
Split the dataset into training, testing, and sometimes validation sets to
evaluate model performance properly.
6. Data Reduction (Optional)
Reduce the data size while preserving important information using techniques
like Principal Component Analysis (PCA).
Summary Table
Step Purpose
Data Collection Gather raw data
Data Cleaning Fix missing values, remove duplicates, correct errors
Data Transformation Scale features, convert categories to numbers
Feature Selection Select important features, remove noise
Data Splitting Divide data into training and testing sets
Data Reduction Reduce dataset size while keeping key info
Why is Data Preprocessing Important?
Improves model accuracy
Makes training faster and more stable
Helps handle real-world messy data
Q.8) Distinguish between artificial intelligence and
machine learning.
Ans:
| Aspect | Artificial Intelligence (AI) | Machine Learning (ML) |
| --- | --- | --- |
| Definition | AI is the broad science of creating machines that can perform tasks that normally require human intelligence. | ML is a subset of AI that enables machines to learn from data and improve automatically without being explicitly programmed. |
| Goal | To build intelligent systems that can simulate human reasoning, decision-making, and problem-solving. | To develop algorithms that allow machines to learn patterns from data and make predictions or decisions. |
| Scope | Encompasses reasoning, planning, natural language processing, robotics, and ML. | Focused on learning from data using statistical methods and optimization. |
| Examples | Chatbots, self-driving cars, game-playing AI agents, expert systems. | Spam detection, image recognition, recommendation systems. |
| How it works | Uses rules, logic, and knowledge bases to mimic intelligence. | Uses data and algorithms to find patterns and make predictions. |
Q.9) What is unsupervised learning and explain types
of unsupervised learning algorithm with example.
Ans:
What is Unsupervised Learning?
Unsupervised Learning is a type of machine learning where the algorithm learns
without any labeled output. This means:
You provide the model with data, but no answers (no categories or target
values).
The model tries to find hidden patterns, structures, or relationships in the
data on its own.
Useful when you don’t know what to expect from the data or want to explore
it.
Types of Unsupervised Learning Algorithms
1. Clustering
Goal:
Group data points into clusters so that points in the same group are more similar to
each other than to those in other groups.
How it works:
The algorithm measures the similarity or distance between data points (using
metrics like Euclidean distance).
It then assigns points to clusters iteratively to minimize within-cluster
distances and maximize between-cluster distances.
Popular Algorithms:
K-Means Clustering:
Choose the number of clusters k.
Randomly initialize cluster centers.
Assign each data point to the nearest center.
Update cluster centers as the mean of assigned points.
Repeat until centers stabilize.
Hierarchical Clustering:
Builds a tree (dendrogram) of clusters by either merging or splitting
clusters step-by-step.
Example:
Suppose an online retailer wants to group customers based on their shopping
behavior to offer personalized deals.
Using clustering, similar customers are grouped so marketing campaigns can
be targeted more effectively.
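A K-Means sketch on made-up customer data, where each row is [annual spend, number of orders] for one customer:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: two loose groups of low and high spenders.
X = np.array([[200, 3], [220, 4], [800, 20], [850, 22], [790, 18], [210, 2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", kmeans.labels_)          # which group each customer fell into
print("Cluster centers:", kmeans.cluster_centers_)
```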
2. Dimensionality Reduction
Goal:
Reduce the number of features (dimensions) in the data while preserving as much
important information as possible.
Why needed:
High-dimensional data is hard to visualize and can slow down learning.
Removing redundant or irrelevant features improves model performance.
Popular Algorithm:
Principal Component Analysis (PCA):
Finds new axes (principal components) that capture maximum variance in
data.
Projects data onto these axes, reducing the number of dimensions.
Example:
Compressing image data: reduce 1000 features (pixels) to 50 while retaining
the main image structure.
Helps in visualizing complex data in 2D or 3D.
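A PCA sketch reducing the Iris data from 4 features to 2, which makes it plottable in 2D:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features
pca = PCA(n_components=2)          # project onto the 2 directions of maximum variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (150, 2): now easy to visualize
```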
Summary Table
| Algorithm Type | Purpose | How It Works | Example Application |
| --- | --- | --- | --- |
| Clustering | Group similar data points | Assign points to clusters based on similarity | Customer segmentation |
| Dimensionality Reduction | Reduce features/dimensions | Transform data to fewer features preserving variance | Image compression, visualization |
Why Use Unsupervised Learning?
When labeled data is unavailable or expensive to obtain.
To explore and understand data structure.
To detect anomalies or unusual patterns.
For data preprocessing before supervised learning.
Q.10) Explain features with suitable examples. Explain
Feature selection and feature extraction.
Ans:
What are Features?
Features are the individual measurable properties or characteristics of the
data that are used by machine learning models to make predictions or
decisions.
Each feature represents one aspect of the data. Together, features describe
each data point.
Example:
If you want to predict house prices, features could be:
Number of bedrooms
Size in square feet
Location
Age of the house
Each house in your dataset is described by these features.
Feature Selection
Feature Selection is the process of selecting the most important features from
the original dataset and removing irrelevant or redundant features.
The goal is to keep only those features that improve model performance,
reduce complexity, and avoid overfitting.
Why do feature selection?
To improve model accuracy
To reduce training time
To simplify the model for better understanding
Example:
From 20 features describing a customer, only 5 may be useful to predict whether
they will buy a product. Feature selection methods help identify these 5 important
features.
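One common selection method is a univariate statistical test. A sketch with scikit-learn's SelectKBest on synthetic data (the 20-feature / 5-informative setup below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which actually carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Kept feature indices:", selector.get_support(indices=True))
```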
Feature Extraction
Feature Extraction creates new features by transforming or combining the
original features into a smaller set of more informative features.
Unlike selection, which picks existing features, extraction generates new
features.
Why do feature extraction?
To reduce dimensionality while preserving important information
To simplify data representation for better learning
Example:
Using Principal Component Analysis (PCA) to combine many correlated
features into a few principal components that summarize the data.
Converting a long text document into a set of numerical features using TF-IDF
(Term Frequency-Inverse Document Frequency).
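A short TF-IDF sketch on two made-up sentences, turning raw text into a numeric feature matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # one numeric feature per vocabulary word
print(vec.get_feature_names_out())   # the extracted vocabulary
print(X.toarray().round(2))          # 2 documents x vocabulary matrix
```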
Q.11) Write a short Note on PCA.
Ans:
Principal Component Analysis (PCA)
PCA is a popular dimensionality reduction technique used in machine learning
and data analysis. It transforms a large set of correlated variables into a smaller
set of uncorrelated variables called principal components. These components
capture the most important information (variance) in the data.
How PCA works:
It finds new axes (directions) in the feature space where the data varies the
most.
The first principal component captures the highest variance, the second
captures the next highest, and so on.
By keeping only the first few components, PCA reduces the data’s dimensions
while preserving most of its information.
Advantages:
Reduces computational cost by decreasing the number of features.
Helps visualize high-dimensional data in 2D or 3D plots.
Removes noise and redundant features, improving model performance.
Limitations:
PCA assumes linearity; it may not capture complex nonlinear relationships.
Interpretation of principal components can be difficult because they are
combinations of original features.
Sensitive to outliers and scaling of data.
Example Use Case:
In image processing, each image can have thousands of pixels (features). PCA
reduces these thousands to a few components, preserving important visual
information but greatly simplifying computations for tasks like face recognition.
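A sketch of how much variance each component keeps, using scikit-learn's built-in digits images (64 pixel features each) as a stand-in for larger image data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 8x8 images flattened to 64 pixel features
pca = PCA(n_components=10).fit(X)
# Fraction of total variance each principal component captures, in order.
print(pca.explained_variance_ratio_.round(3))
print("Variance kept by 10 components:", pca.explained_variance_ratio_.sum().round(3))
```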
Q.12) What is dimensionality reduction? What are its
advantages and disadvantages?
Ans:
What is Dimensionality Reduction?
Dimensionality Reduction is a process used to reduce the number of features
(dimensions) in a dataset while keeping as much important information as
possible.
Imagine you have data with many features (like 100 or 1000 columns).
It can be hard to analyze or visualize such high-dimensional data.
Dimensionality reduction helps by creating a smaller set of features that
summarize the original data.
Advantages of Dimensionality Reduction
1. Simplifies Data:
Fewer features make the data easier to visualize and understand.
2. Speeds Up Algorithms:
Machine learning models train faster with fewer features.
3. Reduces Overfitting:
Removing irrelevant or noisy features helps models generalize better to new
data.
4. Removes Redundancy:
Combines correlated features into fewer new features, reducing duplication.
Disadvantages of Dimensionality Reduction
1. Loss of Information:
Some important details might be lost when reducing dimensions.
2. Hard to Interpret:
New features created (like in PCA) are combinations of original ones, so they
can be difficult to understand.
3. Computational Cost:
Some methods (like t-SNE) can be slow on very large datasets.
4. Not Always Useful:
If the data already has only a few important features, reducing dimensions further adds little benefit.