ML Mid1

Uploaded by

melllo gang
MACHINE LEARNING

MOD-1
SA
Q1) Define supervised learning with an example?

Ans:
Supervised learning is a type of machine learning where the model is trained on labeled data. The
algorithm learns to map inputs to outputs based on the provided labels.
Example: Predicting house prices based on labeled data like house size, location, and price.

Q2) Define unsupervised learning with an example?

Ans:
Unsupervised learning is a machine learning method where the algorithm learns patterns from
unlabeled data. The model finds hidden structures without any explicit output labels.
Example: Customer segmentation based on purchasing behavior using clustering algorithms like
K-Means.

Q3) Define reinforcement learning with an example?

Ans:
Reinforcement learning is a type of machine learning where an agent learns to make decisions by
interacting with its environment and receiving rewards or penalties.
Example: Training a robot to walk by rewarding it for successful movements and penalizing it for falling.

Q4) Define deep learning with an example?

Ans:
Deep learning is a subset of machine learning that uses neural networks with many layers (deep
networks) to learn complex patterns in data.
Example: Image recognition systems that classify objects (e.g., identifying cats or dogs) using
Convolutional Neural Networks (CNNs).

Q5) Define semi-supervised learning with an example?

Ans:
Semi-supervised learning is a technique that uses a small amount of labeled data and a large amount of
unlabeled data for training. It helps improve performance when labeling data is expensive or difficult.
Example: A model trained to classify emails as spam using a small labeled dataset and a larger set of
unlabeled emails.

LA
Q1) What is Machine Learning and explain the different types of Machine Learning?

Ans:
Machine Learning (ML) is a subset of artificial intelligence that allows systems to learn from data and
improve their performance over time without explicit programming. ML algorithms find patterns in data
and make decisions or predictions based on it.

Types of Machine Learning:

1. Supervised Learning:
Involves labeled data, where the algorithm learns from input-output pairs. It is used for tasks like
classification (e.g., spam detection) and regression (e.g., predicting house prices).

2. Unsupervised Learning:
Works with unlabeled data. The goal is to find hidden patterns or structures. For example,
clustering algorithms like K-Means can group customers into segments based on purchasing
behavior.

3. Reinforcement Learning:
This method learns by interacting with an environment and receiving feedback in the form of
rewards or punishments. It is used in robotics, game AI, and self-driving cars.

4. Semi-supervised Learning:
Uses a small amount of labeled data and a large amount of unlabeled data. It is useful when
labeling data is expensive or time-consuming.

5. Self-supervised Learning:
A variant of unsupervised learning where the system creates its own labels from the input data.
This is popular in NLP and computer vision tasks.

Q2) How is unsupervised learning different from supervised learning with a practical example?

Ans:
The main difference between unsupervised learning and supervised learning lies in the presence of
labeled data:

1. Supervised Learning:

o In supervised learning, the algorithm is trained on a labeled dataset, where each input
has a corresponding output (label). The goal is to learn a mapping from inputs to
outputs.
o Example: Predicting house prices based on labeled data where features like house size
and location are linked to known prices.

2. Unsupervised Learning:

o In unsupervised learning, the algorithm is trained on an unlabeled dataset and has to
find patterns or groupings within the data.

o Example: Customer segmentation in marketing. A business might use unsupervised
learning to segment its customer base based on their purchasing behavior, even though
no labels are available.

In supervised learning, the model aims to minimize the error between predicted and actual values, while
in unsupervised learning, the focus is on identifying inherent patterns in the data.

Q3) How is reinforcement learning different from deep learning with a practical example?

Ans:
Reinforcement learning and deep learning are different approaches in machine learning:

1. Reinforcement Learning (RL):


RL involves an agent that interacts with an environment and learns by receiving feedback in the
form of rewards or penalties. The goal is to maximize the cumulative reward over time.

o Example: Training a robot to walk. The robot tries different movements and receives
positive or negative feedback based on its success, eventually learning to walk efficiently.

2. Deep Learning (DL):


DL is a subset of machine learning that uses neural networks with multiple layers (deep neural
networks) to model complex patterns in data. It is typically used for tasks such as image
recognition, speech processing, and NLP.

o Example: Image classification using Convolutional Neural Networks (CNNs). The system
learns to classify images (e.g., identifying objects) by learning features from the raw pixel
data.

While reinforcement learning focuses on learning through interaction and feedback, deep learning
focuses on finding complex patterns in large amounts of data using neural networks.

Q4) What is the difference between MCAR, MAR, and MNAR in Machine Learning?

Ans:
These terms refer to different types of missing data in machine learning:

1. MCAR (Missing Completely at Random):


Data is missing independently of both observed and unobserved data. There is no pattern to the
missing data, and it occurs purely by chance.
o Example: A sensor fails to record data due to a random malfunction.

2. MAR (Missing at Random):


The probability of missing data depends only on observed data but not on the missing data itself.
In other words, missingness is related to other observed variables.

o Example: Age data is missing, but its absence is related to another feature like income
level.

3. MNAR (Missing Not at Random):


The probability that a value is missing depends on the unobserved value itself, so the
missingness cannot be explained by the observed variables alone.

o Example: Individuals with higher income may choose not to disclose their income in
surveys, meaning the missing data is not random.

Understanding the type of missing data is essential for choosing the appropriate imputation technique to
handle it in machine learning models.

Q5) Explain the applications of Machine Learning?

Ans:
Machine learning has a wide range of applications across industries, impacting various aspects of life:

1. Healthcare:

o Used for medical diagnosis, drug discovery, and predicting disease outbreaks. For
example, machine learning models can analyze medical images to detect diseases like
cancer.

2. Finance:

o Machine learning is used for fraud detection, stock market predictions, and credit
scoring. It helps banks and financial institutions identify unusual transaction patterns
that may indicate fraud.

3. E-commerce:

o Recommendation systems like those used by Amazon and Netflix are powered by
machine learning. They suggest products or content based on user behavior and
preferences.

4. Autonomous Vehicles:

o Self-driving cars use machine learning to perceive their environment, make decisions,
and navigate safely.

5. Natural Language Processing (NLP):


o Machine learning powers applications like language translation, sentiment analysis, and
chatbots, making human-computer interactions more natural.

6. Manufacturing:

o Machine learning is used for predictive maintenance, allowing companies to predict
equipment failures and schedule maintenance before breakdowns occur.

These applications demonstrate machine learning's ability to automate decision-making, improve
efficiency, and provide insights across various domains.

MOD-2
SA
Q1) Define Normal Distribution?

Ans:
A normal distribution is a symmetrical, bell-shaped distribution where most data points cluster around
the mean. In a normal distribution, the mean, median, and mode are all equal. It is commonly used in
statistics and machine learning to model natural data patterns.

Q2) Define Skewed Distribution?

Ans:
A skewed distribution is an asymmetrical distribution where data points are not evenly distributed
around the mean. It can be positively skewed (right tail is longer) or negatively skewed (left tail is
longer). Skewed data can affect the performance of machine learning models.
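
A minimal sketch of how the direction of skew can be checked numerically. It uses the moment-based skewness coefficient (third central moment divided by the variance raised to the power 1.5), which is one common definition among several. The function name and the sample lists are illustrative assumptions, not part of the original notes.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for perfectly symmetric data)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no spread, so no skew
    return m3 / m2 ** 1.5


# Illustrative data (assumed for this example)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]        # long right tail
left_skewed = [-v for v in right_skewed]     # mirror image: long left tail

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

A positive result indicates a longer right tail and a negative result a longer left tail.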

Q3) Why is data transformation important in Machine Learning?

Ans:
Data transformation is important in machine learning to ensure that data is in the appropriate format for
model training. It normalizes, scales, or standardizes features so that no feature dominates simply because
of its units, which helps gradient-based training converge and reduces the influence of outliers.

Q4) List two data transformation techniques in Machine Learning?

Ans:

1. Normalization: Scales data to a range (e.g., 0 to 1).

2. Standardization: Rescales data to have a mean of 0 and a standard deviation of 1.
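
The two techniques above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the house-size values are assumptions made for the example, and the population standard deviation is used for standardization.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no range to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - mu) / sigma for v in values]


# Illustrative house sizes in square feet (assumed data)
sizes = [800, 1200, 1500, 2000, 3000]
normalized = min_max_normalize(sizes)
standardized = standardize(sizes)
print(normalized)
print(standardized)
```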


Q5) List out the three types of Missing Values in Machine Learning?

Ans:

1. MCAR (Missing Completely at Random)

2. MAR (Missing at Random)

3. MNAR (Missing Not at Random)

LA
Q1) What are the different types of data distributions in Machine Learning?

Ans:
In machine learning, different types of data distributions are important to understand for effective model
development. The most common types include:

1. Normal Distribution: Also called Gaussian distribution, where data points are symmetrically
distributed around the mean. It's shaped like a bell curve.

2. Uniform Distribution: All outcomes have an equal chance of occurring. The probability is
constant, leading to a flat, horizontal distribution.

3. Skewed Distribution: In skewed data, most of the data points are located on one side, either
left-skewed (negative skew) or right-skewed (positive skew).

4. Binomial Distribution: Models the number of successes in a fixed number of independent
yes/no trials, each with the same success probability (e.g., the number of heads in 10 coin tosses).

5. Poisson Distribution: Describes the number of events occurring within a fixed interval of time or
space, assuming the events happen with a known constant rate.

6. Exponential Distribution: Used to model the time between events in a Poisson process, where
events occur continuously and independently.

Understanding the data distribution helps in selecting the right algorithms and transformation
techniques to improve model performance.

Q2) Explain the process to handle imbalanced data in Machine Learning?

Ans:
Handling imbalanced data in machine learning involves adjusting the dataset or algorithm to ensure the
minority class is well-represented. Common approaches include:

1. Resampling Techniques:
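o Oversampling: Increasing the number of minority class instances to balance the class
distribution, either by replicating existing instances (random oversampling) or by generating
synthetic ones (e.g., using SMOTE - Synthetic Minority Over-sampling Technique).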

o Undersampling: Reducing the number of majority class instances to match the minority
class, which might lead to loss of information.

2. Use of Different Evaluation Metrics: Instead of accuracy, metrics like Precision, Recall, F1-score,
and ROC-AUC are used to evaluate the model’s performance on imbalanced datasets.

3. Cost-Sensitive Learning: Assigning a higher misclassification cost to the minority class helps the
algorithm focus more on correctly classifying it. Algorithms such as decision trees and SVMs can
be adapted for this purpose.

4. Ensemble Methods: Techniques like Random Forest or Boosting (e.g., AdaBoost, XGBoost) can
be helpful as they build multiple models and can adjust for class imbalance.

5. Synthetic Data Generation: Using techniques like SMOTE to synthetically generate new instances
for the minority class, thus balancing the dataset.
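
To see why accuracy alone is misleading on imbalanced data (point 2 above), here is a small sketch that computes accuracy, precision, recall, and F1 by hand. The function name and the 95/5 class split are assumptions chosen for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed toy dataset: 95 majority-class (0) and 5 minority-class (1) examples
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100  # a "model" that never predicts the minority class

metrics = classification_metrics(y_true, always_majority)
print(metrics)
```

The always-majority predictor scores 95% accuracy yet has zero recall on the minority class, which is exactly the failure that recall and F1 expose.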

Q3) Explain Filter feature selection in Machine Learning?

Ans:
Filter feature selection is a method used to select relevant features (variables) before training a machine
learning model, independent of the model. It evaluates each feature individually using statistical
techniques and ranks them based on their relevance.

1. Methods Used:

o Correlation Coefficient: Measures the correlation between each feature and the target
variable.

o Chi-Square Test: Tests the dependence between categorical features and the target
variable.

o Variance Threshold: Removes features with low variance, assuming they don’t carry
useful information.

o Mutual Information: Measures how much information a feature contributes to the
prediction of the target.

2. Advantages:

o Fast and computationally efficient because it does not involve running the machine
learning model.

o Can reduce overfitting by removing irrelevant or redundant features early on.

3. Disadvantages:
o Since it’s independent of the model, it doesn’t account for interactions between features
that might impact model performance.

Filter methods are useful when working with high-dimensional datasets, where reducing feature space
can enhance model training and interpretation.
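
As a rough sketch of a filter method, the snippet below ranks features by the absolute Pearson correlation with the target, without training any model. The feature names and values are made up for illustration, and a constant feature is given a score of 0 as a simplifying convention.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_features(features, target):
    """Score each feature independently and sort from most to least relevant."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical dataset: only "size" is truly related to the price
features = {
    "size": [1000, 1500, 2000, 2500, 3000],
    "door_color_code": [3, 1, 2, 2, 3],
    "constant": [1, 1, 1, 1, 1],
}
target = [350, 500, 650, 800, 950]  # price in thousands

ranking = rank_features(features, target)
print(ranking)
```

Because each feature is scored on its own, this approach is fast but cannot detect features that are only useful in combination.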

Q4) Explain Wrapper feature selection in Machine Learning?

Ans:
Wrapper feature selection is a method that uses a machine learning model to evaluate the importance
of features by assessing their impact on the model’s performance. It wraps the feature selection process
around the model training process.

1. Approaches:

o Forward Selection: Starts with no features and iteratively adds features that improve
model performance until no further improvement is observed.

o Backward Elimination: Starts with all features and iteratively removes the least
important features.

o Recursive Feature Elimination (RFE): Repeatedly trains the model, removes the least important
feature according to the model’s coefficients or feature importances, and stops when the desired
number of features remains.

2. Advantages:

o Model-specific, meaning it finds the feature set that gives the best performance for a
specific model.

o It takes into account feature interactions, which can be beneficial for complex datasets.

3. Disadvantages:

o Computationally expensive since it involves training and evaluating models multiple
times.

o Time-consuming, especially for large datasets.

Wrapper methods provide more accurate feature selection but at the cost of higher computation,
making them suitable when accuracy is prioritized over speed.
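
The forward selection approach described above can be sketched as a greedy loop. In this illustration, `score_fn` is a placeholder for "train the model on this feature subset and return its validation score"; the `toy_score` function and its numbers are hypothetical stand-ins so the example runs without a real model.

```python
def forward_selection(candidates, score_fn, min_gain=1e-9):
    """Greedily add the feature that improves the score most; stop when none helps."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        # Evaluate the model once per remaining feature (the expensive part)
        trial_scores = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trial_scores)
        if top_score - best_score <= min_gain:
            break  # no candidate improves the score any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical stand-in for a cross-validated accuracy of a trained model."""
    gains = {"size": 0.30, "location": 0.15, "noise": 0.0}
    # Small penalty per feature mimics the cost of adding uninformative inputs
    return 0.5 + sum(gains[f] for f in subset) - 0.01 * len(subset)


selected, best = forward_selection(["size", "location", "noise"], toy_score)
print(selected, best)
```

Each pass calls `score_fn` once per remaining feature, which is why wrapper methods become expensive as the number of features grows.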

Q5) Explain Embedded feature selection in Machine Learning?

Ans:
Embedded feature selection integrates the feature selection process into the training of the machine
learning model itself. The model determines which features contribute most to its performance during
the learning process.

1. Examples of Embedded Methods:


o Lasso Regression (L1 Regularization): Shrinks less important feature coefficients to zero,
effectively selecting the most important features.

o Decision Trees and Random Forests: These algorithms inherently rank feature
importance based on how well they split the data at each node.

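o Ridge Regression (L2 Regularization): Penalizes large coefficients, shrinking the coefficients
of less important features toward zero without setting them exactly to zero. It reduces their
impact rather than removing them, so it is weaker than Lasso as a selection method.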

2. Advantages:

o More efficient than wrapper methods since feature selection is done during model
training, reducing the need for multiple iterations.

o Often more accurate than filter methods because it’s specific to the model being trained.

3. Disadvantages:

o Model-specific, meaning different models may select different features.

o Requires careful tuning of hyperparameters (like regularization strength) to ensure
proper feature selection.

Embedded methods strike a balance between the efficiency of filter methods and the accuracy of
wrapper methods, making them useful for many real-world applications.
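
A small sketch of why L1 regularization selects features while L2 only shrinks them. It relies on a simplifying assumption: with orthonormal features and a suitably scaled penalty, the Lasso solution is the soft-thresholded least-squares coefficient and the Ridge solution is the least-squares coefficient divided by (1 + lambda). Real datasets rarely satisfy this, so treat it as an illustration of the mechanism, not a full solver. The coefficient values are made up.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: move coef toward zero by lam, clipping at exactly zero."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0


def ridge_shrink(coef, lam):
    """Proportional shrinkage: the coefficient gets smaller but never exactly zero."""
    return coef / (1.0 + lam)


# Hypothetical least-squares coefficients before regularization
ols_coefs = {"size": 2.0, "location": -0.8, "noise": 0.1}
lam = 0.5  # regularization strength (a hyperparameter to tune)

lasso = {name: lasso_shrink(c, lam) for name, c in ols_coefs.items()}
ridge = {name: ridge_shrink(c, lam) for name, c in ols_coefs.items()}
selected = [name for name, c in lasso.items() if c != 0.0]

print(lasso)
print(ridge)
print(selected)
```

Here the weak "noise" coefficient is set to exactly zero by the L1 rule and so drops out, while the L2 rule keeps every feature with a smaller weight.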

MOD-3
SA
Q1) Define Linear Regression?

Ans:
Linear Regression is a supervised machine learning algorithm used to model the relationship between a
dependent variable and one or more independent variables by fitting a linear equation to the data. It
predicts continuous outcomes, like predicting house prices based on size and location.

Q2) Define Logistic Regression?

Ans:
Logistic Regression is a classification algorithm used to predict a binary outcome (e.g., yes/no, 0/1) based
on one or more predictor variables. It uses the logistic function to model the probability of the target
variable belonging to a particular class.

Q3) Define Data Imputation?


Ans:
Data Imputation is the process of replacing missing or incomplete data with substituted values to
maintain dataset integrity. It helps in handling missing data, which can otherwise skew results, by
techniques like mean substitution or predictive modeling.
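
Mean substitution, the simplest technique mentioned above, can be sketched as follows. Missing values are represented by `None` here, which is an assumption of this example, as are the function name and the sample ages.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative ages with two missing entries (assumed data)
ages = [25, None, 35, 30, None]
imputed = impute_mean(ages)
print(imputed)
```

Mean substitution keeps the column mean unchanged but shrinks its variance, so it is most defensible when values are missing completely at random.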

Q4) List out the 4 Filter Feature selection techniques?

Ans:

1. Chi-Square Test

2. Correlation Coefficient

3. Variance Threshold

4. ANOVA (Analysis of Variance)

Q5) List out the 4 Wrapper Feature selection techniques?

Ans:

1. Forward Selection

2. Backward Elimination

3. Recursive Feature Elimination (RFE)

4. Exhaustive Feature Selection

LA
Q1) How to detect outliers in Machine Learning?

Ans:
Outliers are data points that significantly differ from other observations in a dataset. Detecting outliers is
crucial as they can distort the training process and lead to inaccurate models. Some common methods to
detect outliers in machine learning include:

1. Statistical Methods:

o Z-Score: Measures how many standard deviations a data point is from the mean. If the
absolute Z-score is above a threshold (e.g., greater than 3), the data point is considered an outlier.

o IQR (Interquartile Range): The IQR is the difference between the third quartile (Q3) and the
first quartile (Q1). Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as
outliers.

2. Visualization:

o Box Plot: A graphical tool that highlights outliers as points beyond the "whiskers."
o Scatter Plot: Can visually reveal data points that are far from others in a
two-dimensional space.

3. Distance-Based Methods:

o Euclidean Distance: Calculate the distance from each data point to its nearest neighbors.
If a point's distance is much greater than that of other points, it may be an
outlier.

4. Model-Based Methods:

o Isolation Forest: A tree-based model specifically designed to detect outliers by isolating
data points that require fewer splits.

o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Classifies data points
that do not belong to any cluster as outliers.
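
The two statistical methods from point 1 can be sketched with the standard library. The thresholds (3 for the Z-score, 1.5 for the IQR multiplier) follow the text; the function names, the choice of the "inclusive" quartile method, and the sample data are assumptions of this example.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Assumed sample: 20 typical readings plus one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 10, 13, 12, 11, 12, 10, 13, 11, 95]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

Note that the Z-score method is itself affected by the outlier, since the extreme value inflates both the mean and the standard deviation; the IQR method is more robust because quartiles barely move.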

Q2) Explain Logistic Regression with a practical example?

Ans:
Logistic Regression is a supervised machine learning algorithm used for binary classification problems. It
predicts the probability of an outcome that has two possible values (e.g., 0 or 1). The logistic function
(sigmoid) is used to map predicted values to a range of 0 to 1.

Practical Example:

Spam Email Classification:


Suppose you want to classify whether an email is spam or not spam. You have a dataset with emails
labeled as "spam" (1) or "not spam" (0), along with features like the number of links, words, and the
sender's address.

• Input Features (X): Number of links, suspicious words, etc.

• Output (Y): 0 (not spam) or 1 (spam).

By training a logistic regression model on this dataset, it learns the relationship between the input
features and the probability of an email being spam. After training, the model can predict the probability
of a new email being spam, and if the probability exceeds a threshold (e.g., 0.5), the email is classified as
spam.

The logistic function is represented as:

P(Y=1) = 1 / (1 + e^-(mX + b)), where m is the coefficient (weight), X is the input feature, and b is the bias (intercept).
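
The formula above can be evaluated directly. In this sketch the single feature is the number of links in an email, and the values of m and b are hypothetical; in practice they would be learned from the labeled dataset.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_spam_probability(num_links, m, b):
    """P(Y=1) = 1 / (1 + e^-(mX + b)) with X = number of links."""
    return sigmoid(m * num_links + b)


# Hypothetical parameters, not fitted to real data
m, b = 0.8, -2.0

p = predict_spam_probability(5, m, b)  # mX + b = 0.8 * 5 - 2.0 = 2.0
label = 1 if p >= 0.5 else 0           # apply the 0.5 threshold from the text
print(round(p, 4), label)
```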

Q3) Explain Linear Regression with a practical example?

Ans:
Linear Regression is a supervised learning algorithm used to predict a continuous output based on input
features. It models the relationship between the dependent variable (Y) and one or more independent
variables (X) by fitting a linear equation to the data.

Practical Example:

House Price Prediction:


Suppose a real estate company wants to predict the price of a house based on its size (square feet).
Using linear regression, the company can create a model based on historical data of house sizes and
prices.

• Independent Variable (X): Size of the house in square feet.

• Dependent Variable (Y): Sale price of the house.

The model will establish a linear relationship such as:


Price = 300 * (Size) + 50,000.

For example, if the house size is 2,000 square feet, the predicted price would be:
Price = 300 * 2000 + 50,000 = $650,000.

Linear regression finds the best-fitting line by minimizing the error between predicted and actual values,
often using the least squares method.
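
A minimal sketch of the least squares method for one input feature, using the closed-form slope and intercept. The training data is synthetic and generated from the example equation above, so the fit should recover a slope of 300 and an intercept of 50,000; real data would contain noise and give only approximate values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data lying exactly on Price = 300 * Size + 50,000 (assumed for illustration)
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [300 * s + 50_000 for s in sizes]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 2000 + intercept
print(slope, intercept, predicted)
```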

Q4) Explain the differences between Linear Regression and Logistic Regression?

Ans:
Linear Regression and Logistic Regression are both supervised learning algorithms, but they differ in
purpose and approach:

1. Purpose:

o Linear Regression: Used for predicting a continuous numerical value (e.g., predicting
house prices).

o Logistic Regression: Used for binary classification problems (e.g., predicting whether an
email is spam or not).

2. Output:

o Linear Regression: Outputs a continuous value (can be any real number).

o Logistic Regression: Outputs a probability between 0 and 1, which is then classified as 0
or 1 based on a threshold (e.g., 0.5).

3. Equation:

o Linear Regression: Uses a linear equation Y = mX + b.

o Logistic Regression: Uses the sigmoid function P(Y=1) = 1 / (1 + e^-(mX + b)) to map the
output to a probability.

4. Use Case:

o Linear Regression: Predicts numerical values, such as sales, temperature, or house
prices.

o Logistic Regression: Used for classification tasks, like disease prediction (yes/no), or
fraud detection.

5. Loss Function:

o Linear Regression: Uses Mean Squared Error (MSE) to minimize the difference between
predicted and actual values.

o Logistic Regression: Uses Log Loss (Cross-Entropy) to measure the difference between
predicted probabilities and actual class labels.
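
The two loss functions from point 5 can be written out directly. This sketch uses small made-up inputs; the `eps` clipping in the log loss is a common practical safeguard against taking log(0) and is an addition of this example.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Linear regression style: continuous targets
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])  # (1 + 4) / 2

# Logistic regression style: labels vs. predicted probabilities
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
print(mse, confident_right, confident_wrong)
```

Log loss penalizes confident wrong predictions far more heavily than confident correct ones, which is why it suits probability outputs better than MSE.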

Q5) Explain the sources of data in Machine Learning?

Ans:
Data is the core component of machine learning. The various sources of data used for training and
building machine learning models include:

1. Public Datasets:

o Many organizations provide datasets for research purposes. Popular sources include
Kaggle, UCI Machine Learning Repository, and Google Dataset Search.

2. Web Scraping:

o Data can be gathered from websites using scraping tools like BeautifulSoup or Scrapy.
This data is often used for applications such as price comparisons, sentiment analysis,
and news aggregation.

3. User-Generated Data:

o Social media platforms, forums, and review sites generate vast amounts of data. For
example, Twitter, Amazon, and Reddit provide text data used for natural language
processing (NLP).

4. Transactional Data:

o This includes data generated through e-commerce transactions, financial records, or any
system that logs interactions. It’s commonly used for recommendation systems and
fraud detection.

5. Sensor Data (IoT):

o Devices equipped with sensors (e.g., smartwatches, medical devices) generate real-time
data used in predictive maintenance, health monitoring, and smart city applications.

Machine learning practitioners choose data sources based on the problem being solved, data availability,
and the structure of the data (e.g., structured or unstructured).
