ML Mid1
MOD-1
SA
Q1) Define supervised learning with an example?
Ans:
Supervised learning is a type of machine learning where the model is trained on labeled data. The
algorithm learns to map inputs to outputs based on the provided labels.
Example: Predicting house prices based on labeled data like house size, location, and price.
Q2) Define unsupervised learning with an example?
Ans:
Unsupervised learning is a machine learning method where the algorithm learns patterns from
unlabeled data. The model finds hidden structures without any explicit output labels.
Example: Customer segmentation based on purchasing behavior using clustering algorithms like K-Means.
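A minimal sketch of this segmentation idea with scikit-learn's K-Means (the purchase features and values below are invented for illustration):

# Hypothetical customer-segmentation sketch: K-Means on invented purchase features.
import numpy as np
from sklearn.cluster import KMeans

# Made-up behaviour per customer: [annual_spend, visits_per_month]
X = np.array([[200, 2], [250, 3], [5000, 20], [5200, 22], [900, 8], [950, 7]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which segment each customer landed in
print(kmeans.cluster_centers_)  # the centre of each segment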
Q3) Define reinforcement learning with an example?
Ans:
Reinforcement learning is a type of machine learning where an agent learns to make decisions by
interacting with its environment and receiving rewards or penalties.
Example: Training a robot to walk by rewarding it for successful movements and penalizing it for falling.
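The robot itself is beyond a few lines, but the reward-driven learning loop can be sketched with tabular Q-learning on a toy corridor environment (the states, actions, and reward values are invented for illustration):

# Tabular Q-learning on a toy corridor: states 0..4, reward only for reaching state 4.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))   # learned value of each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.3

rng = np.random.default_rng(0)
for episode in range(300):
    s = 0
    while s != 4:                     # an episode ends at the goal state
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else min(s + 1, 4)
        r = 1.0 if s_next == 4 else 0.0                  # reward for reaching the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy: states 0-3 should prefer action 1 (right)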
Q4) Define deep learning with an example?
Ans:
Deep learning is a subset of machine learning that uses neural networks with many layers (deep
networks) to learn complex patterns in data.
Example: Image recognition systems that classify objects (e.g., identifying cats or dogs) using
Convolutional Neural Networks (CNNs).
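As an illustrative sketch, a small CNN in TensorFlow/Keras (the 64x64 RGB input and the binary cat/dog output are assumptions; real systems train on large labeled image sets):

# Minimal cat/dog-style CNN sketch in TensorFlow/Keras (assumed installed).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                 # assumed 64x64 RGB images
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # learns local visual features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # probability of one class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_images, train_labels, epochs=5)  # hypothetical labeled image data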
Q5) Define semi-supervised learning with an example?
Ans:
Semi-supervised learning is a technique that uses a small amount of labeled data and a large amount of
unlabeled data for training. It helps improve performance when labeling data is expensive or difficult.
Example: A model trained to classify emails as spam using a small labeled dataset and a larger set of
unlabeled emails.
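scikit-learn offers a self-training wrapper that captures this idea: unlabeled examples are marked with -1 and the model labels them iteratively (a sketch on synthetic data, not real email data):

# Semi-supervised sketch: ~90% of labels hidden as -1, then filled in by self-training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.9] = -1   # -1 marks "unlabeled" for scikit-learn

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
print(clf.score(X, y))                     # accuracy against the true labels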
LA
Q1) What is Machine Learning and explain the different types of Machine Learning?
Ans:
Machine Learning (ML) is a subset of artificial intelligence that allows systems to learn from data and
improve their performance over time without explicit programming. ML algorithms find patterns in data
and make decisions or predictions based on it.
1. Supervised Learning:
Involves labeled data, where the algorithm learns from input-output pairs. It is used for tasks like
classification (e.g., spam detection) and regression (e.g., predicting house prices).
2. Unsupervised Learning:
Works with unlabeled data. The goal is to find hidden patterns or structures. For example,
clustering algorithms like K-Means can group customers into segments based on purchasing
behavior.
3. Reinforcement Learning:
This method learns by interacting with an environment and receiving feedback in the form of
rewards or punishments. It is used in robotics, game AI, and self-driving cars.
4. Semi-supervised Learning:
Uses a small amount of labeled data and a large amount of unlabeled data. It is useful when
labeling data is expensive or time-consuming.
5. Self-supervised Learning:
A variant of unsupervised learning where the system creates its own labels from the input data.
This is popular in NLP and computer vision tasks.
Q2) How is unsupervised learning different from supervised learning with a practical example?
Ans:
The main difference between unsupervised learning and supervised learning lies in the presence of
labeled data:
1. Supervised Learning:
o In supervised learning, the algorithm is trained on a labeled dataset, where each input
has a corresponding output (label). The goal is to learn a mapping from inputs to
outputs.
o Example: Predicting house prices based on labeled data where features like house size
and location are linked to known prices.
2. Unsupervised Learning:
o In unsupervised learning, the algorithm is trained on unlabeled data and must discover hidden patterns or groupings on its own.
o Example: Segmenting customers based on purchasing behavior with a clustering algorithm like K-Means, without any predefined categories.
In supervised learning, the model aims to minimize the error between predicted and actual values, while in unsupervised learning, the focus is on identifying inherent patterns in the data.
Q3) How is reinforcement learning different from deep learning with a practical example?
Ans:
Reinforcement learning and deep learning are different approaches in machine learning:
1. Reinforcement Learning:
o An agent learns to make decisions by interacting with an environment and receiving rewards or penalties as feedback.
o Example: Training a robot to walk. The robot tries different movements and receives positive or negative feedback based on its success, eventually learning to walk efficiently.
2. Deep Learning:
o Uses neural networks with many layers to learn complex patterns directly from large amounts of data.
o Example: Image classification using Convolutional Neural Networks (CNNs). The system learns to classify images (e.g., identifying objects) by learning features from the raw pixel data.
While reinforcement learning focuses on learning through interaction and feedback, deep learning
focuses on finding complex patterns in large amounts of data using neural networks.
Q4) What is the difference between MCAR, MAR, and MNAR in Machine Learning?
Ans:
These terms refer to different mechanisms of missing data in machine learning:
1. MCAR (Missing Completely At Random):
o The probability of a value being missing is unrelated to any variable, observed or unobserved.
o Example: A sensor intermittently fails to log readings, leaving gaps with no pattern.
2. MAR (Missing At Random):
o The missingness depends on other observed features, but not on the missing value itself.
o Example: Age data is missing, but its absence is related to another observed feature like income level.
3. MNAR (Missing Not At Random):
o The missingness depends on the value that is itself missing.
o Example: Individuals with higher income may choose not to disclose their income in surveys, meaning the missing data is not random.
Understanding the type of missing data is essential for choosing the appropriate imputation technique to
handle it in machine learning models.
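The three mechanisms can be simulated on a toy table to see how they differ (synthetic data; the column names and thresholds are invented):

# Simulating MCAR, MAR and MNAR on a synthetic income-and-age table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
full = pd.DataFrame({"income": rng.normal(50_000, 15_000, n),
                     "age": rng.integers(18, 70, n).astype(float)})

mcar = full.copy()   # MCAR: age vanishes purely at random
mcar.loc[rng.random(n) < 0.1, "age"] = np.nan

mar = full.copy()    # MAR: age is missing more often when observed income is high
mar.loc[(mar["income"] > 65_000) & (rng.random(n) < 0.5), "age"] = np.nan

mnar = full.copy()   # MNAR: high incomes hide themselves (depends on the missing value)
mnar.loc[mnar["income"] > 80_000, "income"] = np.nan

for name, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, d.isna().mean().round(3).to_dict())  # fraction missing per column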
Q5) What are the applications of Machine Learning?
Ans:
Machine learning has a wide range of applications across industries, impacting various aspects of life:
1. Healthcare:
o Used for medical diagnosis, drug discovery, and predicting disease outbreaks. For
example, machine learning models can analyze medical images to detect diseases like
cancer.
2. Finance:
o Machine learning is used for fraud detection, stock market predictions, and credit
scoring. It helps banks and financial institutions identify unusual transaction patterns
that may indicate fraud.
3. E-commerce:
o Recommendation systems like those used by Amazon and Netflix are powered by
machine learning. They suggest products or content based on user behavior and
preferences.
4. Autonomous Vehicles:
o Self-driving cars use machine learning to perceive their environment, make decisions,
and navigate safely.
5. Manufacturing:
o Machine learning supports predictive maintenance and quality inspection, using sensor data to detect equipment faults and product defects early.
MOD-2
SA
Q1) Define Normal Distribution?
Ans:
A normal distribution is a symmetrical, bell-shaped distribution where most data points cluster around
the mean. In a normal distribution, the mean, median, and mode are all equal. It is commonly used in
statistics and machine learning to model natural data patterns.
Q2) Define Skewed Distribution?
Ans:
A skewed distribution is an asymmetrical distribution where data points are not evenly distributed
around the mean. It can be positively skewed (right tail is longer) or negatively skewed (left tail is
longer). Skewed data can affect the performance of machine learning models.
Q3) Why is data transformation important in Machine Learning?
Ans:
Data transformation is important in machine learning to ensure that data is in the appropriate format for
model training. It helps to normalize, scale, or standardize features, making models more accurate and
efficient by removing bias, improving convergence, and handling outliers.
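For instance, a sketch of two common transformations, a log transform for a skewed feature and standard scaling (synthetic data):

# Two common transformations: log1p for a skewed feature, StandardScaler for scale.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1, size=(500, 1))  # right-skewed feature

logged = np.log1p(incomes)                        # compresses the long right tail
scaled = StandardScaler().fit_transform(logged)   # rescales to mean 0, std 1

print(round(scaled.mean(), 3), round(scaled.std(), 3))  # ~0.0 and ~1.0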
LA
Q1) What are the different types of data distributions in Machine Learning?
Ans:
In machine learning, different types of data distributions are important to understand for effective model
development. The most common types include:
1. Normal Distribution: Also called Gaussian distribution, where data points are symmetrically
distributed around the mean. It's shaped like a bell curve.
2. Uniform Distribution: All outcomes have an equal chance of occurring. The probability is
constant, leading to a flat, horizontal distribution.
3. Skewed Distribution: In skewed data, most of the data points are located on one side, either
left-skewed (negative skew) or right-skewed (positive skew).
4. Binomial Distribution: Used for binary data, representing the probability of success or failure
(e.g., heads/tails in coin tosses).
5. Poisson Distribution: Describes the number of events occurring within a fixed interval of time or
space, assuming the events happen with a known constant rate.
6. Exponential Distribution: Used to model the time between events in a Poisson process, where
events occur continuously and independently.
Understanding the data distribution helps in selecting the right algorithms and transformation
techniques to improve model performance.
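Each of these distributions can be sampled with NumPy to inspect its shape (the parameters below are arbitrary choices):

# Drawing samples from the six distributions listed above (arbitrary parameters).
import numpy as np

rng = np.random.default_rng(0)
samples = {
    "normal":      rng.normal(loc=0.0, scale=1.0, size=10_000),
    "uniform":     rng.uniform(low=0.0, high=1.0, size=10_000),
    "skewed":      rng.lognormal(mean=0.0, sigma=1.0, size=10_000),  # right-skewed
    "binomial":    rng.binomial(n=10, p=0.5, size=10_000),
    "poisson":     rng.poisson(lam=3.0, size=10_000),
    "exponential": rng.exponential(scale=1.0, size=10_000),
}
for name, s in samples.items():
    print(f"{name:12s} mean={s.mean():6.2f}  std={s.std():5.2f}")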
Q2) How do you handle imbalanced data in Machine Learning?
Ans:
Handling imbalanced data in machine learning involves adjusting the dataset or algorithm to ensure the
minority class is well-represented. Common approaches include:
1. Resampling Techniques:
o Oversampling: Replicating instances of the minority class to balance the class
distribution (e.g., using SMOTE - Synthetic Minority Over-sampling Technique).
o Undersampling: Reducing the number of majority class instances to match the minority
class, which might lead to loss of information.
2. Use of Different Evaluation Metrics: Instead of accuracy, metrics like Precision, Recall, F1-score,
and ROC-AUC are used to evaluate the model’s performance on imbalanced datasets.
3. Cost-Sensitive Learning: Assigning a higher misclassification cost to the minority class helps the
algorithm focus more on correctly classifying it. Algorithms such as decision trees and SVMs can
be adapted for this purpose.
4. Ensemble Methods: Techniques like Random Forest or Boosting (e.g., AdaBoost, XGBoost) can
be helpful as they build multiple models and can adjust for class imbalance.
5. Synthetic Data Generation: Using techniques like SMOTE to synthetically generate new instances
for the minority class, thus balancing the dataset.
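A sketch of the SMOTE route using the imbalanced-learn package (assumed installed) on a synthetic 95/5 dataset:

# Rebalancing a 95/5 synthetic dataset with SMOTE (imbalanced-learn assumed installed).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))              # heavily skewed toward class 0

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))          # synthetic minority samples even it out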
Q3) Explain the filter method of feature selection?
Ans:
Filter feature selection is a method used to select relevant features (variables) before training a machine
learning model, independent of the model. It evaluates each feature individually using statistical
techniques and ranks them based on their relevance.
1. Methods Used:
o Correlation Coefficient: Measures the correlation between each feature and the target
variable.
o Chi-Square Test: Tests the dependence between categorical features and the target
variable.
o Variance Threshold: Removes features with low variance, assuming they don’t carry
useful information.
2. Advantages:
o Fast and computationally efficient because it does not involve running the machine
learning model.
3. Disadvantages:
o Since it’s independent of the model, it doesn’t account for interactions between features
that might impact model performance.
Filter methods are useful when working with high-dimensional datasets, where reducing feature space
can enhance model training and interpretation.
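These filter techniques map onto scikit-learn utilities; a sketch on synthetic data (the F-test stands in for correlation-style scoring here, since chi-square requires non-negative features):

# Filter-style selection with scikit-learn: variance threshold + univariate scoring.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

X_var = VarianceThreshold(threshold=0.1).fit_transform(X)  # drop near-constant columns
print(X_var.shape)

# Score each feature against the target independently and keep the best 5.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print(selector.get_support())  # boolean mask of the kept features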
Q4) Explain the wrapper method of feature selection?
Ans:
Wrapper feature selection is a method that uses a machine learning model to evaluate the importance
of features by assessing their impact on the model’s performance. It wraps the feature selection process
around the model training process.
1. Approaches:
o Forward Selection: Starts with no features and iteratively adds features that improve
model performance until no further improvement is observed.
o Backward Elimination: Starts with all features and iteratively removes the least
important features.
o Recursive Feature Elimination (RFE): Works by recursively removing the least important
feature, based on the model’s performance, until the optimal set of features is found.
2. Advantages:
o Model-specific, meaning it finds the feature set that gives the best performance for a
specific model.
o It takes into account feature interactions, which can be beneficial for complex datasets.
3. Disadvantages:
o Computationally expensive, since the model must be trained and evaluated repeatedly for many candidate feature subsets.
Wrapper methods provide more accurate feature selection but at the cost of higher computation, making them suitable when accuracy is prioritized over speed.
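For instance, recursive feature elimination is available directly in scikit-learn (synthetic data; the choice of 4 features is arbitrary):

# Wrapper-style selection: RFE repeatedly drops the weakest feature for this model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # True for the 4 features kept
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier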
Q5) Explain the embedded method of feature selection?
Ans:
Embedded feature selection integrates the feature selection process into the training of the machine
learning model itself. The model determines which features contribute most to its performance during
the learning process.
1. Methods Used:
o Decision Trees and Random Forests: These algorithms inherently rank feature importance based on how well features split the data at each node.
o LASSO (L1 Regularization): Penalizes coefficient sizes so that uninformative features are shrunk to exactly zero and effectively dropped.
2. Advantages:
o More efficient than wrapper methods since feature selection is done during model
training, reducing the need for multiple iterations.
o Often more accurate than filter methods because it’s specific to the model being trained.
3. Disadvantages:
o The selected features are tied to the specific model being trained, so they may not transfer well to other algorithms.
Embedded methods strike a balance between the efficiency of filter methods and the accuracy of wrapper methods, making them useful for many real-world applications.
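A sketch of both embedded flavours on synthetic data (the model settings are arbitrary choices):

# Embedded selection: importances fall out of model training itself.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=3, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_.round(3))   # impurity-based feature importances

lasso = Lasso(alpha=1.0).fit(X, y)            # L1 penalty zeroes weak coefficients
print((lasso.coef_ != 0).sum(), "features kept by Lasso")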
MOD-3
SA
Q1) Define Linear Regression?
Ans:
Linear Regression is a supervised machine learning algorithm used to model the relationship between a
dependent variable and one or more independent variables by fitting a linear equation to the data. It
predicts continuous outcomes, like predicting house prices based on size and location.
Q2) Define Logistic Regression?
Ans:
Logistic Regression is a classification algorithm used to predict a binary outcome (e.g., yes/no, 0/1) based
on one or more predictor variables. It uses the logistic function to model the probability of the target
variable belonging to a particular class.
Q3) List the techniques used in filter and wrapper feature selection?
Ans:
Filter methods:
1. Chi-Square Test
2. Correlation Coefficient
3. Variance Threshold
Wrapper methods:
1. Forward Selection
2. Backward Elimination
LA
Q1) How to detect outliers in Machine Learning?
Ans:
Outliers are data points that significantly differ from other observations in a dataset. Detecting outliers is
crucial as they can distort the training process and lead to inaccurate models. Some common methods to
detect outliers in machine learning include:
1. Statistical Methods:
o Z-Score: Measures how many standard deviations a data point is from the mean. If the Z-score is above a threshold (e.g., greater than 3), the data point is considered an outlier.
o IQR (Interquartile Range): Uses the range between the first quartile (Q1) and the third quartile (Q3). Data points outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
2. Visualization:
o Box Plot: A graphical tool that highlights outliers as points beyond the "whiskers."
o Scatter Plot: Can visually reveal data points that are far from others in a two-dimensional space.
3. Distance-Based Methods:
o Euclidean Distance: In high-dimensional data, calculating the distance from a data point
to its neighbors. If the distance is much greater than that of other points, it may be an
outlier.
4. Model-Based Methods:
o DBSCAN (Density-Based Spatial Clustering): Classifies data points that do not belong to
a cluster as outliers.
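The Z-score and IQR rules take only a few lines of NumPy (one outlier is planted in otherwise normal synthetic data):

# Flagging outliers with the Z-score and IQR rules described above.
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), [120.0])  # one planted outlier

# Z-score rule: more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])  # the planted point, possibly a borderline one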
Q2) Explain Logistic Regression with a practical example?
Ans:
Logistic Regression is a supervised machine learning algorithm used for binary classification problems. It
predicts the probability of an outcome that has two possible values (e.g., 0 or 1). The logistic function
(sigmoid) is used to map predicted values to a range of 0 to 1.
Practical Example:
Consider classifying emails as spam or not spam, where each email is described by features such as the frequency of suspicious words and the number of links, and labeled spam (1) or not spam (0).
By training a logistic regression model on this dataset, it learns the relationship between the input
features and the probability of an email being spam. After training, the model can predict the probability
of a new email being spam, and if the probability exceeds a threshold (e.g., 0.5), the email is classified as
spam.
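A sketch of that workflow on a tiny invented dataset (the two feature columns, counts of suspicious words and links, are hypothetical):

# Toy spam classifier; the two feature columns are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

#           [suspicious_words, num_links]        label: 1 = spam, 0 = not spam
X = np.array([[8, 5], [7, 4], [9, 6], [1, 0], [0, 1], [2, 0], [6, 5], [1, 1]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)
prob_spam = model.predict_proba([[5, 3]])[0, 1]  # probability for a new email
print(f"P(spam) = {prob_spam:.2f}")              # classify as spam if above 0.5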
Q3) Explain Linear Regression with a practical example?
Ans:
Linear Regression is a supervised learning algorithm used to predict a continuous output based on input
features. It models the relationship between the dependent variable (Y) and one or more independent
variables (X) by fitting a linear equation to the data.
Practical Example:
Suppose we predict house prices from house size. After training, the model learns the linear equation Price = 300 * Size + 50,000, where 300 is the learned price per square foot and 50,000 is the intercept (base price).
For example, if the house size is 2,000 square feet, the predicted price would be:
Price = 300 * 2000 + 50,000 = $650,000.
Linear regression finds the best-fitting line by minimizing the error between predicted and actual values,
often using the least squares method.
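The same equation can be recovered with scikit-learn from synthetic (size, price) pairs generated by exactly that rule:

# Recovering Price = 300 * Size + 50,000 from synthetic (size, price) pairs.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[1000], [1500], [2000], [2500], [3000]])  # square feet
prices = 300 * sizes.ravel() + 50_000                       # the rule from the text

model = LinearRegression().fit(sizes, prices)
print(model.coef_[0], model.intercept_)   # ~300 and ~50,000
print(model.predict([[2000]]))            # ~650,000, matching the example above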
Q4) Explain the differences between Linear Regression and Logistic Regression?
Ans:
Linear Regression and Logistic Regression are both supervised learning algorithms, but they differ in
purpose and approach:
1. Purpose:
o Linear Regression: Used for predicting a continuous numerical value (e.g., predicting
house prices).
o Logistic Regression: Used for binary classification problems (e.g., predicting whether an
email is spam or not).
2. Output:
o Linear Regression: Produces a continuous, unbounded numerical value.
o Logistic Regression: Produces a probability between 0 and 1, which is thresholded into a class label.
3. Equation:
o Linear Regression: Uses the straight-line equation Y = mX + b.
o Logistic Regression: Uses the sigmoid function P(Y=1) = 1 / (1 + e^-(mX + b)) to map the output to a probability.
4. Use Case:
o Linear Regression: Used for regression tasks, like forecasting sales or estimating prices.
o Logistic Regression: Used for classification tasks, like disease prediction (yes/no), or fraud detection.
5. Loss Function:
o Linear Regression: Uses Mean Squared Error (MSE) to minimize the difference between
predicted and actual values.
o Logistic Regression: Uses Log Loss (Cross-Entropy) to measure the difference between
predicted probabilities and actual class labels.
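The output difference in point 2 is easy to see numerically: the same score mX + b is unbounded as a linear output but squashed into (0, 1) by the sigmoid (slope and intercept below are arbitrary example values):

# Same score mX + b, read two ways: raw value vs. sigmoid probability.
import numpy as np

m, b = 2.0, -1.0                      # example slope and intercept
x = np.array([-3.0, 0.0, 0.5, 3.0])
score = m * x + b                     # linear regression output: unbounded
prob = 1 / (1 + np.exp(-score))       # logistic regression output: in (0, 1)
print(score)                          # [-7. -1.  0.  5.]
print(prob.round(3))                  # [0.001 0.269 0.5   0.993]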
Q5) What are the various sources of data used in Machine Learning?
Ans:
Data is the core component of machine learning. The various sources of data used for training and
building machine learning models include:
1. Public Datasets:
o Many organizations provide datasets for research purposes. Popular sources include
Kaggle, UCI Machine Learning Repository, and Google Dataset Search.
2. Web Scraping:
o Data can be gathered from websites using scraping tools like BeautifulSoup or Scrapy. This data is often used for applications such as price comparisons, sentiment analysis, and news aggregation (a minimal scraping sketch appears at the end of this answer).
3. User-Generated Data:
o Social media platforms, forums, and review sites generate vast amounts of data. For
example, Twitter, Amazon, and Reddit provide text data used for natural language
processing (NLP).
4. Transactional Data:
o This includes data generated through e-commerce transactions, financial records, or any
system that logs interactions. It’s commonly used for recommendation systems and
fraud detection.
5. Sensor and IoT Data:
o Devices equipped with sensors (e.g., smartwatches, medical devices) generate real-time data used in predictive maintenance, health monitoring, and smart city applications.
Machine learning practitioners choose data sources based on the problem being solved, data availability,
and the structure of the data (e.g., structured or unstructured).
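A minimal scraping sketch with requests and BeautifulSoup (the URL and the markup are placeholders; real scraping should respect a site's robots.txt and terms of service):

# Minimal scraping sketch; the URL and the h2.title markup are placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical markup: each product name sits in an <h2 class="title"> tag.
for tag in soup.find_all("h2", class_="title"):
    print(tag.get_text(strip=True))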