Devesh
DATA MINING
(Under the supervision of Dr. Bhavya Deep sir)
DEVESH MEENA
2302016
2nd YEAR 4th SEMESTER
B.SC. (H) COMPUTER SCIENCE
INDEX
Sr. No.   Practical Question   Sign
1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include
   handling missing values, outliers, inconsistent values. A set of validation rules can be prepared
   based on the dataset and validations can be performed.
3. Run Apriori algorithm to find frequent item sets and association rules on 2 real datasets and use
   appropriate evaluation measures to compute correctness of obtained patterns.
   a) Use minimum support as 50% and minimum confidence as 75%.
   b) Use minimum support as 60% and minimum confidence as 60%.
4. Use Naive Bayes, K-Nearest, and Decision Tree classification algorithms and build classifiers on
   any two datasets. Divide the dataset into training and test sets. Compare the accuracy of the
   different classifiers under the following situations:
   I. a) Training set = 75%, Test set = 25%.
      b) Training set = 66.6%, Test set = 33.3%.
   II. Training set is chosen by:
      i) Hold-out method
      ii) Random subsampling
      iii) Cross-validation.
   Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.
5. Use Simple K-Means algorithm for clustering on any dataset. Compare the performance of clusters
   by changing the parameters involved in the algorithm. Plot MSE computed after each iteration
   using a line plot for any set of parameters.
Q1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include
handling missing values, outliers, and inconsistent values. A set of validation rules can be prepared
based on the dataset and validations can be performed.
Code:
import pandas as pd
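# `df` is the DataFrame loaded in the preceding steps.
# 3. Handle missing values (if any appear; this dataset has none by default)
# Fill gaps in numeric columns with the column mean.
df.fillna(df.mean(numeric_only=True), inplace=True)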
Output:
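For reference, the three cleaning steps named in Q1 (missing values, outliers, inconsistent values) can be sketched on a small hand-made table. This is only an illustration, not the listing used for the practical: the column names, the sample values, the median fill, the 1.5 * IQR outlier rule and the two validation rules are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value, one outlier (45.0), one invalid quality (-1).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.0, 9.5, 9.9, 10.2, 45.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.40, 3.30, 3.35, 3.39],
    "quality": [5, 5, 6, 6, 5, 7, -1, 6],
})

# 1. Missing values: fill gaps with the column median.
df["alcohol"] = df["alcohol"].fillna(df["alcohol"].median())

# 2. Outliers: drop rows whose alcohol lies outside 1.5 * IQR of the quartiles.
q1, q3 = df["alcohol"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["alcohol"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range]

# 3. Inconsistent values: each validation rule is a boolean Series (True = row passes).
rules = {
    "quality in 0..10": df["quality"].between(0, 10),
    "pH in 0..14": df["pH"].between(0, 14),
}
violations = {name: int((~ok).sum()) for name, ok in rules.items()}
print("Rule violations:", violations)

# Keep only the rows that pass every rule.
valid = pd.concat(rules, axis=1).all(axis=1)
clean = df[valid].reset_index(drop=True)
print(clean)
```

Keeping the rules in a dictionary makes it easy to report how many rows break each rule before deciding whether to drop or repair them.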
Q2. Apply data pre-processing techniques such as standardization/normalization, transformation,
aggregation, discretization/binarization, sampling, etc. on any dataset.
Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# 1. Load dataset
df = pd.read_csv("winequality-red.csv", sep=';')
Output:
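Each of the pre-processing families listed in Q2 can be sketched in a line or two. The example below is a minimal illustration on a hand-made table, not the wine-quality file itself; the sample values, the bin labels, the quality threshold of 6 and the 50% sample size are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical stand-in for a few wine-quality columns.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "residual sugar": [1.9, 2.6, 2.3, 1.8, 6.1, 2.0],
    "quality": [5, 5, 6, 6, 7, 8],
})
num_cols = ["alcohol", "residual sugar"]

# Standardization: zero mean and unit variance per column.
standardized = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]), columns=num_cols)

# Normalization: rescale each column to the range [0, 1].
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df[num_cols]), columns=num_cols)

# Transformation: log(1 + x) to reduce right skew.
df["log_sugar"] = np.log1p(df["residual sugar"])

# Discretization: three equal-width bins.
df["alcohol_level"] = pd.cut(df["alcohol"], bins=3, labels=["low", "medium", "high"])

# Binarization: 1 when quality is at least 6, else 0.
df["good"] = (df["quality"] >= 6).astype(int)

# Aggregation: mean alcohol per quality score.
mean_alcohol = df.groupby("quality")["alcohol"].mean()

# Sampling: reproducible 50% random sample without replacement.
sample = df.sample(frac=0.5, random_state=42)

print(standardized.round(3))
print(normalized.round(3))
print(df)
print(mean_alcohol)
print(sample)
```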
Q3. Run Apriori algorithm to find frequent item sets and association rules on 2 real datasets and
use appropriate evaluation measures to compute correctness of obtained patterns.
Code:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Load dataset
df = pd.read_csv("winequality-red.csv", sep=';')
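# Discretize selected features
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'alcohol']
for col in features:
    # Assumed binning: three equal-width bins per feature, labelled low/medium/high
    df[col] = pd.cut(df[col], bins=3, labels=[f'{col}_low', f'{col}_medium', f'{col}_high'])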
Output:
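To make the evaluation measures in Q3 concrete, the sketch below computes support, confidence and lift by hand with a small level-wise Apriori search, and runs it under both threshold settings from the question (50%/75% and 60%/60%). It uses only the standard library rather than mlxtend. The transactions and the helper names `find_frequent_itemsets` and `derive_rules` are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Hypothetical market-basket transactions, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"jam"},
]


def find_frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: number of supporting transactions}."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k - 1))}
    return frequent


def derive_rules(frequent, n, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / (frequent[consequent] / n)
                    rules.append((set(antecedent), set(consequent), count / n, confidence, lift))
    return rules


for min_support, min_confidence in [(0.5, 0.75), (0.6, 0.6)]:
    frequent = find_frequent_itemsets(transactions, min_support)
    rules = derive_rules(frequent, len(transactions), min_confidence)
    print(f"min_support={min_support}, min_confidence={min_confidence}: "
          f"{len(frequent)} frequent itemsets, {len(rules)} rules")
    for antecedent, consequent, sup, conf, lift in rules:
        print(f"  {sorted(antecedent)} -> {sorted(consequent)}  "
              f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

Confidence is computed from integer counts rather than from rounded support fractions, which avoids floating-point surprises when a rule sits exactly on the threshold.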
Q4. Use Naive Bayes, K-Nearest Neighbours, and Decision Tree classification algorithms and build
classifiers on any two datasets. Divide the dataset into training and test sets. Compare the
accuracy of the different classifiers under the following situations:
I. a) Training set = 75%, Test set = 25%.
b) Training set = 66.6% (2/3 of total), Test set = 33.3%.
II. Training set is chosen by:
i) Hold-out method
ii) Random subsampling
iii) Cross-validation.
Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.
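Code:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
# The hold-out/subsampling test size (0.3) and the fold count (5) are assumed values.
def evaluate(name, X, y):
    results = []
    classifiers = {
        'Naive Bayes': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier()
    }
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    for clf_name, clf in classifiers.items():
        # I. a) 75/25 split and b) 66.6/33.3 split
        for test_size in (0.25, 1 / 3):
            X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=test_size, random_state=42)
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            results.append((name, clf_name, f'split, test={test_size:.3f}', accuracy_score(y_test, y_pred)))
        # II. i) Hold-out: a single train/test split
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        results.append((name, clf_name, 'hold-out', acc))
        # II. ii) Random subsampling: average accuracy over 5 random splits
        scores = []
        for _ in range(5):
            X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)
            clf.fit(X_train, y_train)
            scores.append(clf.score(X_test, y_test))
        results.append((name, clf_name, 'random subsampling', np.mean(scores)))
        # II. iii) Cross-validation
        results.append((name, clf_name, 'cross-validation', cross_val_score(clf, X_scaled, y, cv=5).mean()))
    return results
# Load datasets
iris = load_iris()
wine = load_wine()
# Run evaluation
rows = evaluate('Iris', iris.data, iris.target) + evaluate('Wine', wine.data, wine.target)
print(pd.DataFrame(rows, columns=['Dataset', 'Classifier', 'Method', 'Accuracy']))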
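The Q4 listing standardizes the whole feature matrix once (X_scaled = scaler.fit_transform(X)) before any split, so the test rows contribute to the scaling statistics. One possible alternative, sketched here for the cross-validation case only, wraps the scaler and the classifier in a scikit-learn pipeline so that the scaler is refit on the training part of each fold. The use of load_wine, cv=5 and random_state=0 are assumptions made for this sketch.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_accuracy = {}
for name, clf in classifiers.items():
    # The pipeline refits StandardScaler on the training part of every fold.
    model = make_pipeline(StandardScaler(), clf)
    cv_accuracy[name] = cross_val_score(model, X, y, cv=5).mean()

for name, acc in cv_accuracy.items():
    print(f"{name}: mean 5-fold accuracy = {acc:.3f}")
```

Scaling matters most for KNN, which is distance-based; a decision tree is largely insensitive to feature scale, so its score should change little either way.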
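Q5. Use Simple K-Means algorithm for clustering on any dataset. Compare the performance of clusters
by changing the parameters involved in the algorithm. Plot MSE computed after each iteration
using a line plot for any set of parameters.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
def kmeans_mse(X, k, max_iter, seed=0):
    # Simple K-Means that records the MSE after each iteration (random initial centroids assumed)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    mse_list = []
    for i in range(max_iter):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if its cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
        # MSE: mean squared distance from each point to its assigned centroid
        mse = np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))
        mse_list.append(mse)
    return mse_list
# Parameters
clusters = 3
iterations = 10
mse_values = kmeans_mse(X, clusters, iterations)
plt.figure(figsize=(8, 5))
plt.plot(range(1, iterations + 1), mse_values, marker='o', linestyle='-', color='blue')
plt.xlabel('Iteration')
plt.ylabel('MSE')
plt.grid(True)
plt.tight_layout()
plt.show()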
Output:
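For the "compare the performance of clusters by changing the parameters" part of the K-Means question, one possible approach is to refit the model for several settings and tabulate the final MSE. The sketch below uses scikit-learn's KMeans rather than a hand-written loop; the parameter grid (k = 2, 3, 4 and two initialisation methods), the standardization step and n_init=10 are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Standardize so that no single feature dominates the distance computation.
X = StandardScaler().fit_transform(load_wine().data)

mse_by_setting = {}
for init in ("k-means++", "random"):
    for k in (2, 3, 4):
        model = KMeans(n_clusters=k, init=init, n_init=10, random_state=0).fit(X)
        # inertia_ is the sum of squared distances to the nearest centroid;
        # dividing by the number of samples gives the MSE.
        mse_by_setting[(init, k)] = model.inertia_ / len(X)

for (init, k), mse in mse_by_setting.items():
    print(f"init={init:<10} k={k}  MSE={mse:.3f}")
```

MSE normally falls as k grows, so it is more useful for comparing settings at the same k, or for spotting the "elbow" where extra clusters stop helping much.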