Devesh

The document outlines practical exercises in data mining under the supervision of Dr. Bhavya Deep. It includes tasks such as data cleaning, pre-processing, applying the Apriori algorithm, using classification algorithms, and clustering with K-Means. Each section provides code examples and expected outputs for datasets, primarily focusing on the wine dataset.

Uploaded by

kavyachauhan374

PRACTICAL RECORD FILE

DATA MINING
(Under the supervision of Dr. Bhavya Deep sir)

DEVESH MEENA
2302016
2nd YEAR 4th SEMESTER
BSC(H).COMPUTER SCIENCE
INDEX

Sr. No.   Practical Question   Sign

1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include
   handling missing values, outliers, and inconsistent values. A set of validation rules can be
   prepared based on the dataset and validations can be performed.

2. Apply data pre-processing techniques such as standardization/normalization, transformation,
   aggregation, discretization/binarization, sampling, etc. on any dataset.

3. Run the Apriori algorithm to find frequent itemsets and association rules on 2 real datasets
   and use appropriate evaluation measures to compute correctness of obtained patterns.
   a) Use minimum support as 50% and minimum confidence as 75%.
   b) Use minimum support as 60% and minimum confidence as 60%.

4. Use Naive Bayes, K-Nearest, and Decision Tree classification algorithms and build classifiers
   on any two datasets. Divide the dataset into training and test sets. Compare the accuracy of
   the different classifiers under the following situations:
   I.  a) Training set = 75%, Test set = 25%.
       b) Training set = 66.6%, Test set = 33.3%.
   II. Training set is chosen by:
       i) Hold-out method
       ii) Random subsampling
       iii) Cross-validation.
   Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.

5. Use the Simple K-Means algorithm for clustering on any dataset. Compare the performance of
   clusters by changing the parameters involved in the algorithm. Plot MSE computed after each
   iteration using a line plot for any set of parameters.

Q1. Apply data cleaning techniques on any dataset (e.g., wine dataset). Techniques may include
handling missing values, outliers, and inconsistent values. A set of validation rules can be prepared
based on the dataset and validations can be performed.

Code:
import pandas as pd

# 1. Load dataset (semicolon-delimited)
df = pd.read_csv("winequality-red.csv", sep=';')

# 2. Check for missing values
print("Missing values per column:\n", df.isna().sum())

# 3. Handle missing values (this dataset has none by default):
#    fill gaps in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

# 4. Normalize text columns (lowercase 'type' only if that column is present)
if 'type' in df.columns:
    df['type'] = df['type'].str.lower()

# 5. Summarize the cleaned data
print("\nPost-cleaning summary statistics:")
print(df.describe())
print("\nData cleaning completed.")

Output:
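
The question also mentions outliers, inconsistent values, and validation rules, which the listing above does not cover. A minimal sketch of one possible approach follows. It assumes IQR-based clipping for outliers and a small set of range checks as validation rules; the miniature DataFrame, the rule thresholds, and the helper name clip_outliers_iqr are illustrative assumptions, not part of the original record.

```python
import pandas as pd

# Hypothetical miniature frame standing in for winequality-red.csv;
# the values are made up for illustration only.
df = pd.DataFrame({
    "pH": [3.2, 3.4, 3.3, 3.1, 9.9],        # 9.9 is far from the other readings
    "alcohol": [9.4, 9.8, 10.0, 9.5, 9.6],
    "quality": [5, 6, 5, 11, 6],            # 11 falls outside the assumed 0-10 scale
})

def clip_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Assumed validation rules: each maps a column to a boolean check.
rules = {
    "pH": lambda s: s.between(0, 14),
    "alcohol": lambda s: s > 0,
    "quality": lambda s: s.between(0, 10),
}

# Count rule violations per column before cleaning.
violations = {col: int((~check(df[col])).sum()) for col, check in rules.items()}
print("Violations per column:", violations)

# Drop rows that break any rule, then clip remaining outliers in the features.
valid_mask = pd.concat(
    [check(df[col]) for col, check in rules.items()], axis=1
).all(axis=1)
cleaned = df[valid_mask].copy()
for col in ["pH", "alcohol"]:
    cleaned[col] = clip_outliers_iqr(cleaned[col])

print(cleaned)
```

Clipping keeps every row while limiting the influence of extreme values; dropping the rows instead would be an equally reasonable choice when the extreme values are known to be recording errors.
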
Q2. Apply data pre-processing techniques such as standardization/normalization, transformation,
aggregation, discretization/binarization, sampling, etc. on any dataset.

Code:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load dataset
df = pd.read_csv("winequality-red.csv", sep=';')

# 2. Select all numeric feature columns for scaling (exclude the target)
numeric_cols = [c for c in df.columns if df[c].dtype in ['float64', 'int64'] and c != 'quality']

# 3. Initialize the scaler
scaler = StandardScaler()

# 4. Fit & transform the numeric features
scaled_array = scaler.fit_transform(df[numeric_cols])

# 5. Convert back to a DataFrame
df_scaled = pd.DataFrame(scaled_array, columns=numeric_cols)

# 6. Re-attach the target column
df_scaled['quality'] = df['quality']

# 7. Each scaled feature should have mean close to 0 and std close to 1
print("Standardized feature summary:")
print(df_scaled.describe().loc[['mean', 'std']])

output:
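
The listing above covers standardization only. The other techniques named in the question (normalization, transformation, discretization, binarization, aggregation, sampling) could be sketched as below. The small DataFrame, the quality threshold of 6, and the choice of three bins are assumptions made for illustration; they do not come from the original record.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample resembling a few wine-quality columns (made-up values).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 13.1],
    "residual sugar": [1.9, 2.6, 2.3, 1.8, 6.1, 2.0],
    "quality": [5, 5, 6, 6, 7, 7],
})

# Normalization: rescale each feature to the [0, 1] range.
features = ["alcohol", "residual sugar"]
df_norm = pd.DataFrame(MinMaxScaler().fit_transform(df[features]), columns=features)

# Transformation: log(1 + x) to reduce right skew in residual sugar.
df["log_sugar"] = np.log1p(df["residual sugar"])

# Discretization: three equal-frequency bins for alcohol.
df["alcohol_bin"] = pd.qcut(df["alcohol"], q=3, labels=["low", "medium", "high"])

# Binarization: 1 if quality is at least 6 (threshold is an assumption), else 0.
df["good"] = (df["quality"] >= 6).astype(int)

# Aggregation: mean alcohol per quality score.
mean_alcohol = df.groupby("quality")["alcohol"].mean()

# Sampling: simple random sample of half the rows, without replacement.
sample = df.sample(frac=0.5, random_state=42)

print(df_norm)
print(df)
print(mean_alcohol)
print(sample)
```
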

Q3. Run the Apriori algorithm to find frequent itemsets and association rules on 2 real datasets and
use appropriate evaluation measures to compute correctness of obtained patterns.

a) Use minimum support as 50% and minimum confidence as 75%.
b) Use minimum support as 60% and minimum confidence as 60%.

Code:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load dataset
df = pd.read_csv("winequality-red.csv", sep=';')

# Discretize selected features into low / medium / high using the quartiles
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'alcohol']
for col in features:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    bins = [df[col].min() - 1, q1, q3, df[col].max() + 1]
    # Prefix each label with the feature name so that items from different
    # features stay distinct (e.g. 'alcohol=low' vs 'citric acid=low')
    labels = [f"{col}={level}" for level in ('low', 'medium', 'high')]
    df[col + '_cat'] = pd.cut(df[col], bins=bins, labels=labels)

# Each row becomes one transaction: a list of 'feature=level' items
transactions = df[[c + '_cat' for c in features]].astype(str).values.tolist()

# Encode transactions as a one-hot boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_trans = pd.DataFrame(te_ary, columns=te.columns_)

# (a) Support ≥ 50%, Confidence ≥ 75%
itemsets_50 = apriori(df_trans, min_support=0.50, use_colnames=True)
rules_50 = association_rules(itemsets_50, metric="confidence", min_threshold=0.75)
print("Support ≥ 50%, Confidence ≥ 75%")
print(rules_50[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# (b) Support ≥ 60%, Confidence ≥ 60%
itemsets_60 = apriori(df_trans, min_support=0.60, use_colnames=True)
rules_60 = association_rules(itemsets_60, metric="confidence", min_threshold=0.60)
print("Support ≥ 60%, Confidence ≥ 60%")
print(rules_60[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT:
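
The evaluation measures reported by the rules tables (support, confidence, lift) can be checked by hand on a small example. The sketch below uses plain Python and a made-up list of four transactions; the item names and the helper functions are hypothetical and only illustrate the standard definitions: support is the fraction of transactions containing an itemset, confidence of A → C is support(A ∪ C) / support(A), and lift is confidence divided by support(C).

```python
# Hypothetical market-basket transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's support; 1.0 means independence."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"bread"}, {"milk"}
print("support:", support(a | c))
print("confidence:", confidence(a, c))
print("lift:", lift(a, c))
```

In this toy data the rule bread → milk has a lift below 1, so buying bread makes milk slightly less likely than its overall frequency would suggest; a rule can pass a confidence threshold and still be uninformative, which is why lift is worth reporting alongside confidence.
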
Q4. Use Naive Bayes, K-Nearest, and Decision Tree classification algorithms and build classifiers on
any two datasets. Divide the dataset into training and test sets. Compare the accuracy of the
different classifiers under the following situations:
I.  a) Training set = 75%, Test set = 25%.
    b) Training set = 66.6% (2/3 of total), Test set = 33.3%.
II. Training set is chosen by:
    i) Hold-out method
    ii) Random subsampling
    iii) Cross-validation.
Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.

Code:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Function to evaluate classifiers
def evaluate_models(X, y, dataset_name):
    results = []
    classifiers = {
        'Naive Bayes': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier()
    }

    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # I.a) 75/25 split
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append((dataset_name, name, "75/25 Split", acc))

    # I.b) 66.6/33.3 split
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.333, random_state=42)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append((dataset_name, name, "66.6/33.3 Split", acc))

    # II.i) Hold Out Method
    for name, clf in classifiers.items():
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        results.append((dataset_name, name, "Hold Out", acc))

    # II.ii) Random Subsampling (avg of 5 unseeded random splits)
    for name, clf in classifiers.items():
        scores = []
        for _ in range(5):
            X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)
            clf.fit(X_train, y_train)
            scores.append(clf.score(X_test, y_test))
        results.append((dataset_name, name, "Random Subsampling", np.mean(scores)))

    # II.iii) Cross Validation (5-fold)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X_scaled, y, cv=5)
        results.append((dataset_name, name, "5-Fold CV", np.mean(scores)))

    return results

# Load datasets
iris = load_iris()
wine = load_wine()

# Run evaluation
iris_results = evaluate_models(iris.data, iris.target, "Iris")
wine_results = evaluate_models(wine.data, wine.target, "Wine")

# Combine all results
combined_results = pd.DataFrame(iris_results + wine_results,
                                columns=["Dataset", "Classifier", "Evaluation Method", "Accuracy"])
print(combined_results)
output:
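
The listing above fits the scaler on the full dataset before splitting, so the test portions influence the scaling statistics. One possible variant, sketched below, wraps the scaler and classifier in a scikit-learn pipeline so that scaling is refit on each training portion only. It uses KNN on the Wine dataset as a single example; the 70/30 split size and the seed are assumptions, and this is an alternative arrangement rather than the method used in the record.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The pipeline refits the scaler on each training portion only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Random subsampling: 5 independent 70/30 splits (sizes and seed are assumptions).
subsampling = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
subsample_scores = cross_val_score(model, X, y, cv=subsampling)

# 5-fold cross-validation (stratified by default for classifiers).
cv_scores = cross_val_score(model, X, y, cv=5)

print("Random subsampling mean accuracy:", subsample_scores.mean())
print("5-fold CV mean accuracy:", cv_scores.mean())
```

Passing a ShuffleSplit object as cv turns cross_val_score into a random-subsampling evaluator, so both methods share one code path and differ only in how the splits are drawn.
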
Q5. Use the Simple K-Means algorithm for clustering on any dataset. Compare the performance of
clusters by changing the parameters involved in the algorithm. Plot MSE computed after each
iteration using a line plot for any set of parameters.
Code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.metrics import mean_squared_error

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Function to run KMeans and collect MSE after each iteration
def kmeans_with_mse(X, n_clusters=3, max_iter=10):
    mse_list = []
    # Fixed random_state and n_init=1 give the same starting centroids on every fit,
    # so refitting with max_iter = 1, 2, ... replays the same run one step further
    kmeans = KMeans(n_clusters=n_clusters, init='random', n_init=1, max_iter=1, random_state=42)
    for i in range(max_iter):
        kmeans.max_iter = i + 1  # Increase iterations step by step
        kmeans.fit(X)
        labels = kmeans.predict(X)
        # MSE between each point and its assigned centroid
        mse = mean_squared_error(X, kmeans.cluster_centers_[labels])
        mse_list.append(mse)
    return mse_list

# Parameters
clusters = 3
iterations = 10

# Run and collect MSEs
mse_values = kmeans_with_mse(X_scaled, n_clusters=clusters, max_iter=iterations)

# Plotting MSE vs Iterations
plt.figure(figsize=(8, 5))
plt.plot(range(1, iterations + 1), mse_values, marker='o', linestyle='-', color='blue')
plt.title(f'K-Means Clustering MSE vs Iterations (k={clusters})')
plt.xlabel('Iteration')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)
plt.tight_layout()
plt.show()

Output:
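
The question also asks for a comparison of cluster performance under different parameters, which the listing above does not show. A minimal sketch of one way to do this follows, assuming that the number of clusters k and the initialization method are the parameters being varied, and that inertia (within-cluster sum of squares) and the silhouette score are acceptable comparison measures. These choices are assumptions for illustration and are not taken from the original record.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize the Wine features, as in the listing above.
X_scaled = StandardScaler().fit_transform(load_wine().data)

# Try several values of k with both initialization strategies.
results = []
for k in (2, 3, 4, 5):
    for init in ("random", "k-means++"):
        km = KMeans(n_clusters=k, init=init, n_init=10, random_state=42).fit(X_scaled)
        results.append((k, init, km.inertia_, silhouette_score(X_scaled, km.labels_)))

# Lower inertia means tighter clusters; higher silhouette means better separation.
for k, init, inertia, sil in results:
    print(f"k={k} init={init:<10} inertia={inertia:9.2f} silhouette={sil:.3f}")
```

Inertia always tends to fall as k grows, so it is most useful for spotting an elbow, while the silhouette score can be compared directly across values of k.
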
