UNIVERSITY OF GUJRAT
Hafiz Hayat Campus (Morning)
FINAL TERM EXAMINATIONS
FALL-2023
Course Code : IT-446 Course Title : Data Mining
Section A
Q2 Answer the following Questions.
i) What are overfitted models? Explain their effects on performance.
Ans : Definition: Overfitting occurs when a machine learning model learns the training data too well,
capturing noise or random fluctuations in the data instead of just the underlying patterns.
Causes:
▪ Too complex model architecture with too many parameters.
▪ Too many features relative to the amount of training data.
▪ Lack of regularization techniques (e.g., L1/L2 penalties, dropout, or early stopping).
Effects on Performance:
▪ Poor Generalization: Overfitted models may perform exceptionally well on the training data but
fail to generalize to new, unseen data.
▪ High Variance: Overfitting increases the variance of the model, making it sensitive to small
fluctuations in the training data.
▪ Loss of Interpretability: The model captures noise as if it were meaningful patterns, leading to less
interpretable and less useful models.
▪ Increased Error on Test Data: The model tends to perform poorly on new, unseen data because
it has essentially memorized the training set rather than learning the underlying patterns.
▪ Complex Models: Overfitting often results from overly complex models that capture noise
rather than the true underlying relationships in the data.
▪ Loss of Predictive Power: Overfitted models may not provide accurate predictions or
classifications for real-world scenarios due to their focus on training data specifics.
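The effect described above can be shown with a minimal NumPy sketch (the sine-curve dataset and polynomial degrees are illustrative assumptions, not part of the question): a degree-9 polynomial has enough parameters to memorize 10 noisy training points, driving its training error toward zero while its error on fresh test points from the same curve stays high.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve: true pattern plus random fluctuation."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

def fit_and_mse(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

simple = fit_and_mse(3)    # moderate capacity
complex_ = fit_and_mse(9)  # enough parameters to memorize all 10 points

print("degree 3 (train, test):", simple)
print("degree 9 (train, test):", complex_)
```

Because least-squares models of increasing degree are nested, the degree-9 fit always matches the training data at least as well as the degree-3 fit; the gap between its training and test error is the overfitting signature.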
ii) Explain K-Fold-Cross Validation Technique with diagram.
The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size. These
groups are called folds. For each learning run, the prediction function is trained on k-1 folds, and the
remaining fold is used as the test set. This is a very popular CV approach because it is easy to
understand, and its output is less biased than other methods.
The steps for k-fold cross-validation are:
▪ Split the input dataset into K groups (folds).
▪ For each group:
   ▪ Take that group as the reserve or test data set.
   ▪ Use the remaining groups as the training dataset.
   ▪ Fit the model on the training set and evaluate its performance using the test set.
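The steps above can be sketched in plain Python. The "model" here is a trivial mean predictor, an illustrative stand-in just to make the loop concrete; any learner could take its place.

```python
def k_fold_scores(data, k):
    """Split `data` (a list of (x, y) pairs) into k folds; each fold serves
    as the test set once while the other k-1 folds form the training set."""
    folds = [data[i::k] for i in range(k)]  # round-robin split into k groups
    scores = []
    for i in range(k):
        test = folds[i]                     # reserve one fold for testing
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        mean_y = sum(y for _, y in train) / len(train)  # "fit" the model
        mse = sum((y - mean_y) ** 2 for _, y in test) / len(test)
        scores.append(mse)
    return scores                           # one evaluation per fold

data = [(x, 2 * x) for x in range(10)]
print(k_fold_scores(data, 5))
```

Averaging the returned per-fold scores gives the usual cross-validated estimate of model performance.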
iii) Why Binning is used in Data Preprocessing?
Ans : Binning, or discretization, is used in data preprocessing to improve the quality and effectiveness of
data analysis and machine learning models. Its main purposes are:
Quantization and Error Reduction: Binning reduces the impact of minor errors in data by grouping
values into intervals and assigning representative values, aiding in error reduction.
Non-Linearity and Model Performance: Introducing non-linearity through binning can improve model
performance, especially when transforming continuous variables into categorical features.
Overfitting Prevention: Binning helps prevent overfitting by providing a smoother representation of the
data, which is particularly beneficial in small datasets.
Identification of Outliers and Missing Values: Binning can be used to identify outliers and missing values
in the dataset.
Categorical Transformation: It transforms continuous variables into categorical features, enhancing
interpretability and addressing non-linear relationships in the data.
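As an illustration of the error-reduction point, here is a sketch of equal-width binning with smoothing by bin means; the sample values and the choice of three bins are illustrative assumptions.

```python
import numpy as np

def equal_width_bin(values, n_bins):
    """Replace each value with the mean of its equal-width bin,
    smoothing out minor fluctuations."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # np.digitize assigns each value to an interval; clip keeps the
    # maximum value inside the last bin instead of overflowing it.
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    return np.array([values[idx == b].mean() for b in idx])

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bin(data, 3))
```

Each raw value is replaced by its bin's mean, so small measurement errors within a bin no longer affect downstream analysis.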
iv) What is Clustering? Explain Hierarchical clustering with Example.
Ans : Definition: Clustering is a data analysis technique that involves grouping similar data points
together based on certain characteristics, aiming to discover patterns and structures within a dataset. It
is commonly used in machine learning, exploratory data analysis, and pattern recognition to reveal
inherent groupings within the data.
Hierarchical Clustering (Type: Agglomerative, Bottom-Up)
Definition: Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It starts with
individual data points and progressively merges or divides them.
Example:
Dataset: Consider customer data with features like spending habits and types of products bought.
Steps:
1. Start with Individual Points: Each customer is a separate cluster.
2. Calculate Similarity: Measure similarity based on features.
3. Merge Similar Clusters: Merge the two most similar clusters.
4. Repeat: Iterate until all points are in one cluster.
5. Result: A hierarchical tree (dendrogram) visually representing clusters.
Application: Identifying customer segments for tailored marketing strategies.
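The agglomerative steps above can be sketched with single-linkage merging in NumPy. The customer coordinates (spend, product variety) are invented for illustration; the recorded merge sequence is exactly what a dendrogram draws.

```python
import numpy as np

def agglomerative(points):
    """Single-linkage agglomerative clustering: start with one cluster per
    point, repeatedly merge the two closest clusters, record each merge."""
    clusters = [[i] for i in range(len(points))]   # step 1: singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return merges

# Hypothetical customers: (annual spend, distinct product types bought)
pts = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
for left, right, dist in agglomerative(pts):
    print(left, right, round(dist, 2))
```

The nearby pairs merge first at small distances and the outlying customer joins last, mirroring steps 1-5 above.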
Section B
Q. 2 Write Down about your Term project that clearly mentions the Objectives/Goal of the
Project, Feature Extraction/Addition/Deletion Techniques, Machine Learning Algorithms and
Experimental Results.
Ans : Title: Automated Speech Emotion Recognition
Objectives/Goals:
▪ Develop a system capable of accurately recognizing and classifying emotions from spoken
language.
▪ Improve human-computer interaction by enabling machines to understand and respond to
users' emotional states.
▪ Enhance applications such as customer service and virtual assistants with emotion-aware
functionalities.
Feature Extraction/Addition/Deletion Techniques:
▪ Utilize signal processing techniques to extract acoustic features such as pitch, intensity, and
formants from speech signals.
▪ Explore natural language processing (NLP) techniques for extracting linguistic features, including
sentiment-related words and tone.
▪ Investigate the addition of prosodic features like speech rate and pauses to capture emotional
nuances.
▪ Experiment with feature scaling and normalization for better model convergence.
Machine Learning Algorithms:
▪ Employ deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent
Neural Networks (RNNs), for end-to-end emotion recognition from audio signals.
▪ Combine acoustic and linguistic features using ensemble methods like Stacking or Fusion
models.
▪ Implement transfer learning approaches to leverage pre-trained models on large speech
datasets.
Experimental Results:
▪ Evaluate the model on diverse datasets containing a variety of emotional expressions in speech.
▪ Measure accuracy, precision, recall, and F1 score to assess the model's performance across
different emotion categories.
▪ Conduct user studies to evaluate the system's effectiveness in real-world scenarios.
▪ Showcase the system's potential applications in improving user experience in virtual
environments, customer service applications, and other relevant domains.
OR
Title: Fraud Detection in Financial Transactions
Objectives/Goals:
▪ Develop a robust system for real-time detection of fraudulent activities in financial transactions.
▪ Enhance security measures and protect customers from unauthorized access and financial loss.
▪ Improve the efficiency of fraud detection systems to minimize false positives and negatives.
Feature Extraction/Addition/Deletion Techniques:
▪ Implement feature scaling and normalization to standardize numerical variables in transaction
data.
▪ Explore dimensionality reduction techniques such as Principal Component Analysis (PCA) to
handle high-dimensional data.
▪ Investigate the addition of derived features like transaction frequency, geographical location,
and user behavior patterns.
▪ Experiment with anomaly detection methods to identify irregularities in transaction patterns.
Machine Learning Algorithms:
▪ Utilize supervised learning algorithms, including Logistic Regression and Random Forests, for
binary classification of transactions into fraudulent and non-fraudulent categories.
▪ Implement unsupervised learning algorithms such as Isolation Forest and One-Class SVM for
anomaly detection in transaction data.
▪ Combine multiple models using ensemble techniques like Bagging or Boosting to improve
overall system performance.
Experimental Results:
▪ Evaluate the model on a large and diverse dataset of financial transactions to assess its accuracy
and efficiency.
▪ Measure metrics such as precision, recall, and F1 score to quantify the system's ability to detect
fraudulent activities.
▪ Analyze the system's performance in real-time scenarios to ensure timely and accurate fraud
detection.
▪ Showcase the reduction in financial losses and false alarms achieved through the implemented
fraud detection system.
Q. 3 With the help of the given confusion matrix,

                Predicted (YES)   Predicted (NO)
Actual YES      44 (TP)           36 (FN)
Actual NO       59 (FP)           101 (TN)
Find :
a) F1 Score
b) Accuracy
c) Precision
Solution:
The formulas used to solve this question are:
▪ Precision = TP / (TP + FP)
▪ Recall = TP / (TP + FN)
▪ F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
▪ Accuracy = (TP + TN) / (TP + FP + FN + TN)
From the question, we get the following values:
▪ TP = 44
▪ FP = 59
▪ FN = 36
▪ TN = 101
Note: this video explains other related concepts regarding the confusion matrix:
Link : https://www.youtube.com/watch?si=71QU5QS1dMLWqGEt&v=AyP85ocS-8Y&feature=youtu.be
Putting the values into the formulas gives:
▪ Precision = 44 / (44 + 59) = 44/103 ≈ 0.427
▪ Recall = 44 / (44 + 36) = 44/80 = 0.55
▪ F1 Score = 2 × (0.427 × 0.55) / (0.427 + 0.55) ≈ 0.481
▪ Accuracy = (44 + 101) / 240 = 145/240 ≈ 0.604
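The confusion-matrix arithmetic can be checked in a few lines of Python:

```python
# Counts taken directly from the question's confusion matrix.
TP, FP, FN, TN = 44, 59, 36, 101

precision = TP / (TP + FP)                  # 44 / 103
recall = TP / (TP + FN)                     # 44 / 80
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)  # 145 / 240

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# precision ≈ 0.427, recall = 0.55, F1 ≈ 0.481, accuracy ≈ 0.604
```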
Q.4 Use the computed conditional probabilities to predict the class label for a test sample
(A=1, B=0, C=0) using the Naïve Bayes classifier.
A B C Q
1 0 1 1
1 1 1 1
0 1 1 0
1 1 0 0
1 0 1 0
0 0 0 1
0 0 0 1
0 0 1 0
Ans :
Step 1: Count class instances
▪ Instances where Q = 1 (positive class): 4
▪ Instances where Q = 0 (negative class): 4
Step 2: Calculate prior probabilities
▪ P(Q=1) = 4/8 = 0.5
▪ P(Q=0) = 4/8 = 0.5
Step 3: Calculate conditional probabilities
▪ For Q = 1: P(A=1|Q=1) = 2/4 = 0.5, P(B=0|Q=1) = 3/4 = 0.75, P(C=0|Q=1) = 2/4 = 0.5
▪ For Q = 0: P(A=1|Q=0) = 2/4 = 0.5, P(B=0|Q=0) = 2/4 = 0.5, P(C=0|Q=0) = 1/4 = 0.25
Step 4: Posterior probabilities for the test sample (A=1, B=0, C=0)
▪ Q = 1: P(Q=1) × P(A=1|Q=1) × P(B=0|Q=1) × P(C=0|Q=1) = 0.5 × 0.5 × 0.75 × 0.5 = 0.09375
▪ Q = 0: P(Q=0) × P(A=1|Q=0) × P(B=0|Q=0) × P(C=0|Q=0) = 0.5 × 0.5 × 0.5 × 0.25 = 0.03125
Step 5: Normalize the probabilities
The normalization constant is the sum of the unnormalized posteriors:
0.09375 + 0.03125 = 0.125
Step 6: Calculate actual probabilities
We obtain the actual probabilities by dividing each posterior by the normalization constant:
▪ P(Q=1 | A=1, B=0, C=0) = 0.09375 / 0.125 = 0.75
▪ P(Q=0 | A=1, B=0, C=0) = 0.03125 / 0.125 = 0.25
These values represent the likelihood of the test sample belonging to each class given the observed
attributes A=1, B=0, C=0. Since 0.75 > 0.25, the predicted class label is Q = 1.
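The hand computation above can be re-derived in Python; the eight training rows and the test sample (A=1, B=0, C=0) come straight from the question's table.

```python
rows = [  # (A, B, C, Q)
    (1, 0, 1, 1), (1, 1, 1, 1), (0, 1, 1, 0), (1, 1, 0, 0),
    (1, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 1, 0),
]
test = {"A": 1, "B": 0, "C": 0}

def posterior(q):
    """Unnormalized Naive Bayes posterior: prior times the product of
    per-feature conditional probabilities for class Q = q."""
    subset = [r for r in rows if r[3] == q]
    prior = len(subset) / len(rows)
    likelihood = 1.0
    for i, name in enumerate("ABC"):
        matches = sum(1 for r in subset if r[i] == test[name])
        likelihood *= matches / len(subset)   # P(feature | Q=q)
    return prior * likelihood

p1, p0 = posterior(1), posterior(0)
total = p1 + p0                               # normalizing constant
print(p1, p0, p1 / total, p0 / total)
# p1 = 0.09375, p0 = 0.03125 -> normalized 0.75 vs 0.25, so predict Q = 1
```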