UNIVERSITY OF GUJRAT

Hafiz Hayat Campus (Morning)


FINAL TERM EXAMINATIONS
FALL-2023

Course Code: IT-446        Course Title: Data Mining


Section A
Q2. Answer the following questions.

i) What are overfitted models? Explain their effects on performance.

Ans : Definition: Overfitting occurs when a machine learning model learns the training data too well,
capturing noise or random fluctuations in the data instead of just the underlying patterns.

Causes:

▪ A model architecture that is too complex, with too many parameters.
▪ Too many features relative to the amount of training data.
▪ A lack of regularization techniques.

Effects on Performance:
▪ Poor Generalization: Overfitted models perform exceptionally well on the training data but fail to generalize to new, unseen data.
▪ High Variance: Overfitting increases the variance of the model, making it sensitive to small fluctuations in the training data.
▪ Loss of Interpretability: Noise is captured as if it were a meaningful pattern, making the model less interpretable and less useful.
▪ Increased Error on Test Data: The model tends to perform poorly on new, unseen data because it has essentially memorized the training set rather than learning the underlying patterns.
▪ Complex Models: Overfitting often results from overly complex models that capture noise rather than the true underlying relationships in the data.
▪ Loss of Predictive Power: Overfitted models may not provide accurate predictions or classifications for real-world scenarios because they are tuned to the specifics of the training data.

A minimal sketch below illustrates the train/test gap that overfitting produces.
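As an illustration (an addition, not part of the original answer), the following scikit-learn sketch fits a simple and a very flexible model to the same small noisy sample; the high-degree polynomial typically scores far better on the training split than on the held-out split, which is the overfitting signature. The sine-curve data and the degrees (1 and 15) are assumptions chosen for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy sample of a sine curve (hypothetical data)
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # A large gap between train and test R^2 signals overfitting
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))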
ii) Explain the K-Fold Cross-Validation technique with a diagram.

The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These groups are called folds. For each learning round, the prediction function is trained on k-1 folds, and the remaining fold is used as the test set. This is a very popular CV approach because it is easy to understand, and its output is less biased than other methods.

The steps for k-fold cross-validation are (see the sketch below):

▪ Split the input dataset into k groups.
▪ For each group:
▪ Take that group as the reserve (test) dataset.
▪ Use the remaining groups as the training dataset.
▪ Fit the model on the training set and evaluate its performance using the test set.
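A minimal scikit-learn sketch of 5-fold cross-validation (an illustrative addition; the Iris dataset, the logistic regression model, and n_splits=5 are assumptions, not part of the original answer):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold serves once as the test set; the other k-1 folds train the model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across the folds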

iii) Why is binning used in data preprocessing?

Ans : Binning, or discretization, is used in data preprocessing for various purposes to improve the
quality and effectiveness of data analysis and machine learning models.
Quantization and Error Reduction: Binning reduces the impact of minor errors in data by grouping
values into intervals and assigning representative values, aiding in error reduction.

Non-Linearity and Model Performance: Introducing non-linearity through binning can improve model
performance, especially when transforming continuous variables into categorical features.

Overfitting Prevention: Binning helps prevent overfitting, providing a smoother representation of data,
which is particularly beneficial in small datasets.

Identification of Outliers and Missing Values: Binning can be used to identify outliers and missing values
in the dataset.

Categorical Transformation: It transforms continuous variables into categorical features, enhancing interpretability and addressing non-linear relationships in the data. (A minimal binning sketch follows.)
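A minimal pandas sketch of equal-width and equal-frequency binning (an illustrative addition; the age values and bin counts are assumptions):

import pandas as pd

ages = pd.Series([5, 12, 19, 25, 33, 41, 58, 64])  # hypothetical ages

# Equal-width binning: three intervals spanning equal value ranges
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Equal-frequency (quantile) binning: four bins with roughly equal counts
equal_freq = pd.qcut(ages, q=4)

print(equal_width.tolist())
print(equal_freq.value_counts())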
iv) What is clustering? Explain hierarchical clustering with an example.

Ans : Definition: Clustering is a data analysis technique that involves grouping similar data points
together based on certain characteristics, aiming to discover patterns and structures within a dataset. It
is commonly used in machine learning, exploratory data analysis, and pattern recognition to reveal
inherent groupings within the data.

Hierarchical Clustering (Type: Agglomerative, Bottom-Up)

Definition: Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It starts with
individual data points and progressively merges or divides them.

Example:

Dataset: Consider customer data with features like spending habits and types of products bought.

Steps:

1. Start with Individual Points: Each customer is a separate cluster.
2. Calculate Similarity: Measure similarity between clusters based on the features.
3. Merge Similar Clusters: Merge the two most similar clusters.
4. Repeat: Iterate until all points are in one cluster.
5. Result: A hierarchical tree (dendrogram) visually representing the clusters.

Application: Identifying customer segments for tailored marketing strategies (see the sketch below).
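A minimal SciPy sketch of agglomerative clustering on hypothetical customer data (an illustrative addition; the feature values, Ward linkage, and the two-cluster cut are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical customers: [monthly spending, number of product types bought]
X = np.array([[200, 3], [220, 4], [800, 10], [850, 12], [400, 6]])

# Agglomerative (bottom-up) merging using Ward linkage
Z = linkage(X, method="ward")

# Cut the dendrogram into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # one cluster label per customer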

Section B

Q. 2 Write about your term project, clearly mentioning the objectives/goals of the project, the feature extraction/addition/deletion techniques, the machine learning algorithms, and the experimental results.
Ans : Title: Automated Speech Emotion Recognition

Objectives/Goals:

▪ Develop a system capable of accurately recognizing and classifying emotions from spoken
language.
▪ Improve human-computer interaction by enabling machines to understand and respond to
users' emotional states.
▪ Enhance applications such as customer service and virtual assistants with emotion-aware
functionalities.

Feature Extraction/Addition/Deletion Techniques:

▪ Utilize signal processing techniques to extract acoustic features such as pitch, intensity, and
formants from speech signals.
▪ Explore natural language processing (NLP) techniques for extracting linguistic features, including
sentiment-related words and tone.
▪ Investigate the addition of prosodic features like speech rate and pauses to capture emotional
nuances.
▪ Experiment with feature scaling and normalization for better model convergence. (A minimal feature-extraction sketch follows.)
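A minimal sketch of acoustic feature extraction with librosa (an illustrative addition; "sample.wav", the MFCC count of 13, and the 50-300 Hz pitch range are assumptions, not details from the project):

import numpy as np
import librosa

# Load a hypothetical speech clip
y, sr = librosa.load("sample.wav")

# MFCCs summarize the spectral envelope; 13 coefficients is a common choice
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# The YIN fundamental-frequency track approximates pitch
f0 = librosa.yin(y, fmin=50, fmax=300)

# One fixed-length feature vector per clip: mean MFCCs plus mean pitch
features = np.concatenate([mfcc.mean(axis=1), [np.nanmean(f0)]])
print(features.shape)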

Machine Learning Algorithms:

▪ Employ deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent
Neural Networks (RNNs), for end-to-end emotion recognition from audio signals.
▪ Combine acoustic and linguistic features using ensemble methods like Stacking or Fusion
models.
▪ Implement transfer learning approaches to leverage pre-trained models on large speech
datasets.

Experimental Results:

▪ Evaluate the model on diverse datasets containing a variety of emotional expressions in speech.
▪ Measure accuracy, precision, recall, and F1 score to assess the model's performance across
different emotion categories.
▪ Conduct user studies to evaluate the system's effectiveness in real-world scenarios.
▪ Showcase the system's potential applications in improving user experience in virtual
environments, customer service applications, and other relevant domains.

OR
Title: Fraud Detection in Financial Transactions

Objectives/Goals:

▪ Develop a robust system for real-time detection of fraudulent activities in financial transactions.
▪ Enhance security measures and protect customers from unauthorized access and financial loss.
▪ Improve the efficiency of fraud detection systems to minimize false positives and negatives.

Feature Extraction/Addition/Deletion Techniques:

▪ Implement feature scaling and normalization to standardize numerical variables in transaction data.
▪ Explore dimensionality reduction techniques such as Principal Component Analysis (PCA) to handle high-dimensional data.
▪ Investigate the addition of derived features like transaction frequency, geographical location, and user behavior patterns.
▪ Experiment with anomaly detection methods to identify irregularities in transaction patterns. (A scaling/PCA sketch follows.)
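A minimal scikit-learn sketch of scaling followed by PCA (an illustrative addition; the synthetic feature matrix and n_components=5 are assumptions):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # hypothetical transaction feature matrix

# Standardize each feature to zero mean and unit variance, then project
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())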

Machine Learning Algorithms:

▪ Utilize supervised learning algorithms, including Logistic Regression and Random Forests, for
binary classification of transactions into fraudulent and non-fraudulent categories.
▪ Implement unsupervised learning algorithms such as Isolation Forest and One-Class SVM for
anomaly detection in transaction data.
▪ Combine multiple models using ensemble techniques like Bagging or Boosting to improve overall system performance. (A minimal anomaly-detection sketch follows.)
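A minimal scikit-learn sketch of unsupervised anomaly detection with Isolation Forest (an illustrative addition; the synthetic transactions, the injected anomalies, and the contamination rate are assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(100, 15, size=(1000, 2))  # hypothetical transaction features
X[:10] += 300  # inject a few abnormally large transactions

iso = IsolationForest(contamination=0.01, random_state=0)
flags = iso.fit_predict(X)  # -1 marks suspected fraud, 1 marks normal
print((flags == -1).sum(), "transactions flagged")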

Experimental Results:

▪ Evaluate the model on a large and diverse dataset of financial transactions to assess its accuracy
and efficiency.
▪ Measure metrics such as precision, recall, and F1 score to quantify the system's ability to detect
fraudulent activities.
▪ Analyze the system's performance in real-time scenarios to ensure timely and accurate fraud
detection.
▪ Showcase the reduction in financial losses and false alarms achieved through the implemented
fraud detection system.

Q. 3 With the help of the confusion matrix below:

                 Predicted (YES)    Predicted (NO)
Actual YES       44 (TP)            36 (FN)
Actual NO        59 (FP)            101 (TN)

Find :

a) F1 Score
b) Accuracy
c) Precision

Solution:

From the question, we have the following values:

▪ TP = 44
▪ FP = 59
▪ FN = 36
▪ TN = 101

Formulas used:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Putting the values into the formulas:

Precision = 44 / (44 + 59) = 44 / 103 ≈ 0.427
Recall = 44 / (44 + 36) = 44 / 80 = 0.550

a) F1 Score = 2 × (0.427 × 0.550) / (0.427 + 0.550) ≈ 0.481
b) Accuracy = (44 + 101) / (44 + 101 + 59 + 36) = 145 / 240 ≈ 0.604
c) Precision = 44 / 103 ≈ 0.427

(A quick Python check follows.)

Note : Watch this video to understand other related concepts regarding the confusion matrix.
Link : https://www.youtube.com/watch?si=71QU5QS1dMLWqGEt&v=AyP85ocS-8Y&feature=youtu.be
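A quick plain-Python check of the computed metrics (an illustrative addition):

TP, FP, FN, TN = 44, 59, 36, 101

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"Precision = {precision:.3f}")  # 0.427
print(f"Recall    = {recall:.3f}")     # 0.550
print(f"F1 Score  = {f1:.3f}")         # 0.481
print(f"Accuracy  = {accuracy:.3f}")   # 0.604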
Q. 4 Use the computed conditional probabilities to predict the class label for a test sample (A=1, B=0, C=0) using the Naïve Bayes classifier.

A B C Q
1 0 1 1
1 1 1 1
0 1 1 0
1 1 0 0
1 0 1 0
0 0 0 1
0 0 0 1
0 0 1 0

Ans :

Step 1: Count the class instances.

▪ Instances where Q = 1 (positive class): 4
▪ Instances where Q = 0 (negative class): 4

Step 2: Calculate the prior probabilities (number of instances in the class divided by the total instances).

P(Q=1) = 4/8 = 0.5
P(Q=0) = 4/8 = 0.5

Step 3: Calculate the conditional probabilities for the test sample (A=1, B=0, C=0).

For Q = 1: P(A=1|Q=1) = 2/4 = 0.5, P(B=0|Q=1) = 3/4 = 0.75, P(C=0|Q=1) = 2/4 = 0.5
For Q = 0: P(A=1|Q=0) = 2/4 = 0.5, P(B=0|Q=0) = 2/4 = 0.5, P(C=0|Q=0) = 1/4 = 0.25

Step 4: Compute the unnormalized posterior probabilities for the test sample.

Q = 1: 0.5 × 0.5 × 0.75 × 0.5 = 0.09375
Q = 0: 0.5 × 0.5 × 0.5 × 0.25 = 0.03125

Step 5: Normalize the probabilities. The normalization constant is the sum of the unnormalized posteriors:

0.09375 + 0.03125 = 0.125

Step 6: Calculate the actual probabilities by dividing each posterior by the normalization constant:

P(Q=1 | A=1, B=0, C=0) = 0.09375 / 0.125 = 0.75
P(Q=0 | A=1, B=0, C=0) = 0.03125 / 0.125 = 0.25

These values represent the likelihood of the test sample belonging to each class given the observed attributes A=1, B=0, C=0. Since 0.75 > 0.25, the predicted class label is Q = 1. (A small Python check follows.)
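A minimal Python check of the hand calculation (an illustrative addition; it recomputes the priors, likelihoods, and normalized posteriors directly from the table above):

data = [
    (1, 0, 1, 1), (1, 1, 1, 1), (0, 1, 1, 0), (1, 1, 0, 0),
    (1, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 1), (0, 0, 1, 0),
]  # rows of (A, B, C, Q) from the question
test = (1, 0, 0)  # (A, B, C)

scores = {}
for q in (0, 1):
    rows = [r for r in data if r[3] == q]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for i, v in enumerate(test):
        # P(attribute i = v | Q = q), estimated by counting
        likelihood *= sum(1 for r in rows if r[i] == v) / len(rows)
    scores[q] = prior * likelihood

total = sum(scores.values())  # normalization constant (0.125)
posteriors = {q: s / total for q, s in scores.items()}
print(posteriors)  # {0: 0.25, 1: 0.75} -> predict Q = 1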

You might also like