DMML Lab

The document outlines the vision and mission of an institute and its Computer Science and Engineering department, emphasizing the importance of technical competence, research, and industry collaboration. It details the program outcomes, specific outcomes, and educational objectives for students, alongside a structured laboratory curriculum for data mining and machine learning. Additionally, it provides specific lab exercises aimed at enhancing practical skills in handling data, implementing algorithms, and understanding machine learning concepts.

INSTITUTE VISION AND MISSION

VISION
To emerge as an institute of eminence in the fields of engineering, technology and management
in serving the industry and the nation by empowering students with a high degree of technical,
managerial and practical competence.

MISSION

• To strengthen the theoretical, practical and ethical dimensions of the learning process by
fostering a culture of research and innovation among faculty members and students

• To encourage long-term interaction between the academia and industry through their
involvement in the design of curriculum and its hands-on implementation

• To strengthen and mould students in professional, ethical, social and environmental
dimensions by encouraging participation in co-curricular and extracurricular activities

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VISION
To emerge as a department of eminence in Computer Science and Engineering in serving the
Information Technology Industry and the nation by empowering students with a high degree
of technical and practical competence.

MISSION
• To strengthen the theoretical and practical aspects of the learning process by strongly
encouraging a culture of research, innovation and hands-on learning in Computer Science
and Engineering

• To encourage long-term interaction between the department and the IT industry, through
the involvement of the IT industry in the design of the curriculum and its hands-on
implementation

• To widen the awareness of students in professional, ethical, social and environmental
dimensions by encouraging their participation in co-curricular and extracurricular activities

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

PROGRAM OUTCOMES (POs)

The student will be able to:

PO1: Engineering Knowledge: Apply knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex Computer Science
and engineering problems.

PO2: Problem Analysis: Identify, formulate, review research literature and analyze complex
engineering problems in Computer Science and Engineering reaching substantiated conclusions
using first principles of mathematics, natural sciences and engineering sciences.

PO3: Design / Development of Solutions: Design solutions for complex engineering problems
and design system components or processes of Computer Science and Engineering that meet
the specified needs with appropriate consideration for public health and safety, cultural, societal
and environmental considerations.

PO4: Conduct Investigations of Complex Problems: Use research-based knowledge and
research methods including design of experiments in Computer Science and Engineering,
analysis and interpretation of data, and synthesis of the information to provide valid
conclusions.

PO5: Modern tool usage: Create, select and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities related to Computer Science and Engineering with an understanding of the
limitations.

PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice in Computer Science and Engineering.

PO7: Environment and sustainability: Understand the impact of the professional
engineering solutions of Computer Science and Engineering in societal and environmental
contexts, and demonstrate the knowledge of, and need for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.

PO9: Individual and Team Work: Function effectively as an individual and as a member or
leader in diverse teams, and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

PO12: Life-Long Learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)

The student will be able to:

PSO1: Ability to design, develop, implement computer programs and use knowledge in
various domains to identify research gaps and hence to provide solution to new ideas and
innovations.

PSO2: Work with and communicate effectively with professionals in various fields and pursue
lifelong professional development in computing.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

The Graduate of the program will be able to:

PEO1: Develop proficiency as computer scientists with an ability to solve a wide range of
computational problems in industry, government, or other work environments.

PEO2: Attain the ability to adapt quickly to new environments and technologies, assimilate
new information, and work in multi-disciplinary areas with a strong focus on innovation and
entrepreneurship.

PEO3: Possess the ability to think logically and the capacity to understand technical problems
with computational systems.

PEO4: Possess the ability to collaborate as team members and team leaders to facilitate
cutting-edge technical solutions for computing systems, thereby providing improved functionality.
CONTENT

Exp. No   List of Experiments

PART A
1    Given a dataset, analyze whether there is missing data in the dataset and handle it with
     different data preprocessing methods.
2    Given a dataset, perform the required data standardization and normalization on the data.
3    Explore Label encoding and other encoding methods on various attributes of the data.
4    Perform Oversampling, under sampling and SMOTE algorithm to handle imbalanced dataset.
5    Implement Apriori algorithm to identify the frequent itemset and association rule from
     suitable transaction data.
6    Implement FP Growth Tree algorithm to identify the frequent itemset and association rule
     from a suitable transaction data.

PART B
7    Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use
     an appropriate data set for building the decision tree and apply this knowledge to
     classify a new sample.
8    Write a program to implement the naïve Bayesian classifier for a sample training data set
     stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.
9    Write a program to implement the support vector machine classifier for a sample training
     data set stored as a .CSV file. Compute the accuracy of the classifier, considering few
     test data sets.
10   Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
     Print both correct and wrong predictions. Java/Python ML library classes can be used for
     this problem.
11   Build an Artificial Neural Network by implementing the Backpropagation algorithm and test
     the same using appropriate data sets.
12   Build a classifier using any ensemble learning method and compare the results against the
     classic learning models.

Department of Computer Science & Engineering.
DATA MINING AND MACHINE LEARNING LABORATORY
[22CSL61]
LAB RUBRICS

Internal Assessment Marks: 50

Divided into two components:
Continuous Assessment : 30 Marks
Internal Test : 20 Marks

Continuous Assessment:
i) Will be carried out in every lab (for 11 labs - 12 experiments)
ii) Each lab will be evaluated for 10 marks
iii) Totally for 12 labs it will be 120 marks. This will be scaled down to 30.

Break up of 10 marks (in every lab):

Will be carried out in every lab (for 10 labs – 11 experiments)

Attributes / Descriptors                                                              Scores

Conduction of experiment / Writing the program (3)
    Student can neatly type the program and is able to explain the working of the program   3
    Student can explain 75% of their program                                                2
    Student can explain 50% of their program                                                1
    Student can't explain program                                                           0

Execution of program / output (3)
    Execution of program without error                                                      3
    Partial execution of program without error                                              2
    Student can't execute program                                                           0

Result & Record completion and submission (4)
    Submits in time and completed (during subsequent lab) and answers all viva questions    4
    Fails to submit the record (during subsequent lab) and answers all viva questions       3
    Submits in time and completed (during subsequent lab) and partially answers
    viva questions                                                                          2
    Fails to submit the record in time / incomplete submission / does not answer
    viva questions                                                                          0

Internal Test 1 + 2 : 10 + 10 Marks

SN   EXPLANATION             CIE1 (MARKS)   CIE2 (MARKS)
01   Write up                2.5            2.5
02   Execution and results   5              5
03   Viva Voce               2.5            2.5
     TOTAL                   10             10

Attributes / Descriptors                                                              Scores

Conduction of experiment / Writing the program (2.5)
    Student can neatly type the program and is able to explain the working of the program   2.5
    Student can explain 75% of their program                                                2
    Student can explain 50% of their program                                                1.5
    Student can't explain program                                                           1

Execution of program / output (5)
    Execution of program without error                                                      5
    Partial execution of program without error                                              3
    Student can't execute program                                                           0

Viva Voce (2.5)
    Answers correctly                                                                       2.5
    Answers satisfactorily                                                                  1.5
    Does not answer any question                                                            0

SEE Assessment Marks: 50

Attributes / Descriptors                                                              Scores

Conduction of experiment / Writing the program (10)
    Student can neatly type the program and is able to explain the working of the program   10
    Student can explain 75% of their program                                                8-9
    Student can explain 50% of their program                                                3-5
    Student can't explain program                                                           0-2

Execution of program / output (30)
    Execution of program without error                                                      20-30
    Partial execution of program without error                                              10-19
    Student can't execute program                                                           0-9

Viva Voce (10)
    Answers correctly                                                                       10
    Answers satisfactorily                                                                  5-9
    Does not answer any question                                                            0-4

Lab Course Faculty Course Coordinator

DATAMINING AND MACHINE LEARNING LAB 22CSL61

Program no: 1

Handling Missing Values

Aim: Given a dataset, analyze whether there is missing data in the dataset and handle it with
different data preprocessing methods
Algorithm:
Basics
1. Import pandas.
2. Read the provided CSV file using the read_csv function and assign the result to a
dataframe (df).
3. Call the describe function on the dataframe and check the following details:
#count: Total number of non-empty values
#mean: Mean of the column values
#std: Standard deviation of the column values
#min: Minimum value from the column
#25%: 25th percentile
#50%: 50th percentile (median)
#75%: 75th percentile
#max: Maximum value from the column
4. Check the number of rows and columns using the shape attribute of the dataframe.
5. Check column metadata using df.info()

Handling Missing values

1. Calculate the count of null values for each column using df.isnull().sum()
2. Calculate the count of null values for the column 'salary': df['salary'].isnull().sum()
3. Import seaborn and plot a histogram for salary: sns.histplot(data=df, x="salary")
4. Part 1: handle missing values by imputation. Replace null salary values with the median
value of salary.
5. Part 2: handle missing values by dropping rows: df.dropna(how='any')
6. Check that no null values remain in the result: df['salary'].isnull().sum()

Program:

# Missing values can be handled by (1) imputation or (2) dropping rows

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/content/Placement_Dataset.csv')
df.head()
df.describe()   # describe is a method, so it must be called to get the summary
df.shape
df.info()
df.isnull().sum()
df['salary'].isnull().sum()
# analyze how the salary values are distributed
sns.histplot(data=df, x='salary')
plt.show()
# Part 1: replace the missing values with the median value
df['salary'] = df['salary'].fillna(df['salary'].median())
# now check the dataset
df.info()
df['salary'].isnull().sum()
# Part 2: drop every row that still contains a missing value
df = df.dropna(how='any')
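
The two strategies used in the program above (median imputation and row dropping) can also be illustrated without the placement file. The following is a minimal stand-alone sketch in plain Python, not the lab's prescribed solution: the `salaries` list and the helper names `impute_median` and `drop_missing` are hypothetical, and `None` stands in for a missing value (pandas would show it as NaN).

```python
from statistics import median

# Hypothetical salary column; None marks a missing value
salaries = [270000, None, 250000, 425000, None, 200000]

def impute_median(values):
    """Replace every None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Keep only the observed values (the analogue of dropna)."""
    return [v for v in values if v is not None]

missing_count = sum(v is None for v in salaries)   # analogue of isnull().sum()
imputed = impute_median(salaries)
dropped = drop_missing(salaries)

print("missing:", missing_count)
print("imputed:", imputed)
print("dropped:", dropped)
```

Imputation keeps all six rows, while dropping keeps only the four observed ones; which is preferable depends on how much data can be sacrificed.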


Program no: 2
Data Standardization and Normalization
Aim: Given a dataset, perform the required data standardization and normalization on the
data.
Algorithm
Data Standardization
Data standardization rescales each feature to mean 0 and standard deviation 1, which puts the
features on a comparable scale before machine learning algorithms are applied.
1. Imports
import numpy as np
import pandas as pd
import sklearn.datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
2. Use the sklearn.datasets.load_breast_cancer() function; it returns the dataset (ds)
3. Convert the dataset to a dataframe using the DataFrame function:
df = pd.DataFrame(ds.data, columns=ds.feature_names)
4. Check the standard deviation before applying standardization
5. Create scaler = StandardScaler() and call its fit_transform function, passing df as the parameter
6. Check the new standard deviation of the output data (of step 5). It should be close to 1.0
Data Normalization
Data normalization (min-max scaling) rescales all values of each feature to the range 0 to 1.
Small values on a common scale make many algorithms easier to apply.
1. Import MinMaxScaler from sklearn.preprocessing
2. Use the sklearn.datasets.load_breast_cancer() function; it returns the dataset (ds)
3. Convert the dataset to a dataframe using the DataFrame function:
df = pd.DataFrame(ds.data, columns=ds.feature_names)
4. Normalize features between 0 and 1. Use MinMaxScaler with the fit_transform function,
passing df as the parameter.
5. Print the first 5 rows of the transformed data

Program:

# DATA STANDARDIZATION: rescale each feature to mean 0 and standard deviation 1
import pandas as pd
import sklearn.datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ds = sklearn.datasets.load_breast_cancer()
ds
ds.target_names
df = pd.DataFrame(ds.data, columns=ds.feature_names)
df.head()
x = df
y = ds.target
print(ds.data.std())
# Standardize the data, then check the standard deviation;
# the result should be close to 1
scaler = StandardScaler()
x_standard = scaler.fit_transform(x)
print(x_standard)
print(x_standard.std())

# DATA NORMALIZATION: MinMaxScaler transforms each feature to the range [0, 1]
minmax = MinMaxScaler()
x_minmax = minmax.fit_transform(x)
print(x_minmax)
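
To make clear what the two scalers compute, the sketch below applies the underlying formulas to a single column in plain Python. It is an illustration under stated assumptions rather than the library's implementation: the `values` list and the function names `standardize` and `min_max` are hypothetical. Standardization uses z = (x - mean) / std with the population standard deviation, and min-max scaling uses (x - min) / (max - min).

```python
from statistics import mean, pstdev

# Hypothetical feature column
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def standardize(column):
    """z-score: subtract the mean, divide by the population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population std (ddof=0), the convention StandardScaler uses
    return [(v - mu) / sigma for v in column]

def min_max(column):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

z = standardize(values)
scaled = min_max(values)

print("standardized:", z)
print("mean, std after standardizing:", mean(z), pstdev(z))
print("min-max scaled:", scaled)
```

After standardizing, the column has mean 0 and standard deviation 1; after min-max scaling, its values span exactly 0 to 1.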

OUTPUT:


Program no: 3
Label Encoding
Aim: Explore Label encoding and other encoding methods on various attributes of the data

Algorithms
Label Encoding
Label encoding is a technique used to convert categorical columns into numerical ones so that
they can be used by machine learning models that only accept numerical data. It is an
important preprocessing step in a machine-learning project.
Use LabelEncoder
1. Import pandas and sklearn.preprocessing.LabelEncoder
2. Read the given dataset iris.csv
3. Print the dataset. Observe the values of the 'variety' column; they are descriptive (text) labels.
4. Count the unique values of the column 'variety' using the value_counts function
5. Initialize LabelEncoder and call its fit_transform function, passing the descriptive column
df['variety'] as the parameter
6. Assign the output of step 5 to df['variety']
7. Print the first 5 rows of df; the 'variety' column values are now replaced by numerical
codes (0, 1, 2, etc.).
8. Print the dataset and check that the encoded labels are shown.

Program:

# LABEL ENCODING OF THE BREAST CANCER DATASET
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('/content/brest_cancer.csv')
df.head()
# check the labels of the diagnosis column
df.info()
df['diagnosis'].unique()
df['diagnosis'].value_counts()
# assign numerical values for B and M; this is label encoding
label_encoding = LabelEncoder()
diag_label = label_encoding.fit_transform(df['diagnosis'])
print(diag_label)
df['diagnosis'] = diag_label  # copy the encoded values back into the diagnosis column
# print first 5 rows
df.head()
# OneHotEncoder method
# (scikit-learn 1.2 and later use sparse_output; older releases used sparse=False)
onehotencoder = OneHotEncoder(sparse_output=False)
onehot_label = onehotencoder.fit_transform(df[['diagnosis']])
print(onehot_label)

# LABEL ENCODING FOR THE IRIS DATASET
df1 = pd.read_csv('/content/iris_data.csv')
df1.head()
df1['Species'].value_counts()
label_encoder2 = LabelEncoder()
species_encoded = label_encoder2.fit_transform(df1['Species'])
print(species_encoded)
df1['Species'] = species_encoded  # overwrite the text labels with their codes
print(df1.head())
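
What the two encoders produce can be reproduced by hand on a tiny column. The sketch below is a plain-Python illustration, assuming a hypothetical five-value `labels` column; it mirrors the behaviour of assigning integer codes to the sorted class names and then expanding each code into a one-hot row.

```python
# Hypothetical diagnosis column (B = benign, M = malignant)
labels = ['M', 'B', 'B', 'M', 'B']

# Label encoding: sorted unique classes get the codes 0, 1, 2, ...
classes = sorted(set(labels))
index = {name: code for code, name in enumerate(classes)}
encoded = [index[value] for value in labels]

# One-hot encoding: one column per class, a single 1 in the column of the row's class
one_hot = [[1 if column == code else 0 for column in range(len(classes))]
           for code in encoded]

print("classes:", classes)
print("label encoded:", encoded)
print("one-hot encoded:", one_hot)
```

Label encoding implies an ordering between the codes, which is harmless for a target column but can mislead a model when applied to an unordered input feature; one-hot encoding avoids that at the cost of extra columns.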

OUTPUT For BREAST CANCER DATASET:

OUTPUT For IRIS DATASET:


Program no: 4
SMOTE Algorithm
Aim: Perform Oversampling, under sampling and SMOTE algorithm to handle imbalanced
dataset

Algorithm:

Handling Imbalanced DataSet

1. Read the creditcard.csv file using the pandas library
2. Clean the data. Drop null or empty values, if any.
3. Check the unique values of the column 'Class'
4. Count the rows for each unique value of the column 'Class'
5. The oversampling and undersampling functions need two inputs, so create x and y:
x is the original dataframe and y = df['Class'].
6. Drop the 'Class' column from x, as it has been copied into y for sampling
7. Import the over_sampling and under_sampling packages from the imblearn library
8. Import RandomOverSampler from imblearn.over_sampling
9. Initialize RandomOverSampler with the parameter sampling_strategy='minority'
10. Call the fit_resample function on the RandomOverSampler object, passing x and y; the
output will be x_resampled and y_resampled
11. Check the unique value counts in the output of step 10
12. Repeat steps 7 to 11 using RandomUnderSampler with the parameter
sampling_strategy='majority'

Use the SMOTE (Synthetic Minority Oversampling Technique) algorithm for oversampling

1. Import SMOTE from imblearn.over_sampling
2. Initialize SMOTE
3. Call the fit_resample function, passing x and y as parameters. The output will be
x_smote_resampled and y_smote_resampled
4. Check the unique value counts in the output of step 3.


Program:

# import the required packages
import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# read csv using pandas
df = pd.read_csv('creditcard.csv')
df.shape
# clean data
df.dropna(inplace=True)
df.shape
# check unique values for column 'Class'
df['Class'].unique()
# count rows for each unique value
df['Class'].value_counts()
# create x (features) and y (labels); 'Class' is left out of x because it is copied into y
x = df.drop('Class', axis=1)
y = df['Class']
x.shape
y.value_counts()
y.shape

# Oversampling: if the majority class has 284315 records and the minority class has 492,
# this strategy oversamples the minority class so that it also has 284315 examples.
ros = RandomOverSampler(sampling_strategy='minority')
x_resampled, y_resampled = ros.fit_resample(x, y)
y_resampled.value_counts()

# Undersampling: x_resampled and y_resampled now hold the undersampled data,
# which can be used for further analysis or modeling
rus = RandomUnderSampler(sampling_strategy='majority')
x_resampled, y_resampled = rus.fit_resample(x, y)
y_resampled.value_counts()

# SMOTE (Synthetic Minority Oversampling Technique)
# Setting random_state to a fixed value makes the algorithm generate the same
# synthetic samples on each run, which makes the results reproducible.
smote = SMOTE(random_state=42)
x_smote_resampled, y_smote_resampled = smote.fit_resample(x, y)
y_smote_resampled.value_counts()
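
The idea behind the three resamplers can be sketched from scratch on a toy dataset. This is a simplified illustration, not the imblearn implementation: the data, the function names, and the use of a single nearest neighbour in `smote_like` are assumptions made for brevity (SMOTE normally picks among k nearest minority neighbours).

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical imbalanced data: 8 majority points (class 0) and 3 minority points (class 1)
X = [[float(i), float(i % 3)] for i in range(8)] + [[10.0, 10.0], [11.0, 12.0], [12.0, 11.0]]
y = [0] * 8 + [1] * 3

def split_by_class(X, y):
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups

def random_oversample(X, y):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    groups = split_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [random.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

def random_undersample(X, y):
    """Keep a random subset of each class, sized to match the smallest one."""
    groups = split_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(random.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out

def smote_like(X, y, minority):
    """Add synthetic minority points on the segment between a minority point
    and its nearest minority neighbour until the classes are balanced."""
    groups = split_by_class(X, y)
    rows = groups[minority]
    needed = max(len(r) for r in groups.values()) - len(rows)
    X_out, y_out = list(X), list(y)
    for _ in range(needed):
        a = random.choice(rows)
        b = min((r for r in rows if r is not a),
                key=lambda r: sum((p - q) ** 2 for p, q in zip(a, r)))
        gap = random.random()  # position along the segment from a to b
        X_out.append([p + gap * (q - p) for p, q in zip(a, b)])
        y_out.append(minority)
    return X_out, y_out

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
X_smote, y_smote = smote_like(X, y, minority=1)
print(Counter(y_over), Counter(y_under), Counter(y_smote))
```

Random oversampling only repeats existing rows, whereas the SMOTE-style step creates new points that lie between existing minority points.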
OUTPUT FOR UNDERSAMPLING


OUTPUT FOR OVERSAMPLING

Program no: 5
Apriori Algorithm
Aim: Implement Apriori algorithm to identify the frequent itemset and association rule from
suitable transaction data.

Apriori Algorithm

THEORY:
• Import TransactionEncoder and use .fit_transform() to convert the list of transactions into
a one-hot encoded (True/False) table with one column per item.
• Import apriori and use apriori(df, min_support, use_colnames) to find the frequent itemsets
that meet the minimum support.
• Import association_rules and use association_rules(freq_itemset, metric="confidence",
min_threshold) to find the association rules.

Dataset = [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'],
['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]

Pseudocode To implement Apriori using Python

1. Use the given dataset. Assume support 0.6 and confidence 0.8
2. Import apriori from mlxtend.frequent_patterns
3. Call apriori, passing df, min_support and use_colnames=True. The output will be
freq_itemset
4. Print freq_itemset and verify the result
5. Import association_rules from mlxtend.frequent_patterns
6. Call association_rules, passing freq_itemset, metric="confidence" and min_threshold
with the confidence value
7. Print and verify the output of step 6

Program:

#Dataset
dataset = [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'],
['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]
print(dataset)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_data = te.fit_transform(dataset)

te_data
df = pd.DataFrame(te_data, columns=te.columns_)
df
#import apriori and use function apriori and find frequent itemset
from mlxtend.frequent_patterns import apriori
freq_itemset = apriori(df, min_support=0.6 , use_colnames= True)
freq_itemset

#import association_rules
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_itemset, metric = "confidence", min_threshold=0.8)
rules
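
The library call hides the level-wise search that gives Apriori its name. The sketch below reimplements that search in plain Python on the same five transactions; it is a simplified illustration rather than the mlxtend implementation, and the helper name `support` and the variable names are assumptions. It also computes the confidence of one rule by hand.

```python
from itertools import combinations

dataset = [['milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['dairy milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['milk', 'apple', 'kitkat', 'eggs'],
           ['milk', 'uday', 'corn', 'kitkat', 'yogurt'],
           ['corn', 'onion', 'onion', 'kitkat', 'ice cream', 'eggs']]
min_support = 0.6

transactions = [set(t) for t in dataset]  # sets ignore the repeated 'onion'
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Level 1: frequent single items
items = sorted({item for t in transactions for item in t})
current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

frequent = {}
k = 1
while current:
    for itemset in current:
        frequent[itemset] = support(itemset)
    k += 1
    # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    # Apriori property: every (k-1)-subset of a frequent itemset must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    current = [c for c in candidates if support(c) >= min_support]

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)

# Confidence of the rule {eggs} -> {kitkat}: support(both) / support(antecedent)
confidence = support(frozenset(['eggs', 'kitkat'])) / support(frozenset(['eggs']))
print("confidence(eggs -> kitkat) =", confidence)
```

The pruning step is what keeps the search tractable: a candidate is only counted against the data if all of its smaller subsets were already found to be frequent.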


OUTPUT:

Program no: 6
FP Growth Algorithm
Aim: Implement FP Growth Tree algorithm to identify the frequent itemset and association rule
from a suitable transaction data.

FP Growth Algorithm

THEORY:
• Import TransactionEncoder and use .fit_transform() to convert the list of transactions into
a one-hot encoded (True/False) table with one column per item.
• Import fpgrowth and use fpgrowth(df, min_support, use_colnames) to find the frequent
itemsets that meet the minimum support.
• Import association_rules and use association_rules(freq_itemset, metric="confidence",
min_threshold) to find the association rules.


Pseudocode To implement FP Growth using Python

1. Use the given dataset. Assume support 0.6 and confidence 0.8
2. Import fpgrowth from mlxtend.frequent_patterns
3. Call fpgrowth, passing df, min_support and use_colnames=True. The output will be
freq_itemset
4. Print freq_itemset and verify the result
5. Import association_rules from mlxtend.frequent_patterns
6. Call association_rules, passing freq_itemset, metric="confidence" and min_threshold
with the confidence value
7. Print and verify the output of step 6

Program:

# Small dataset
dataset = [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'],
['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]
print(dataset)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_data = te.fit_transform(dataset)


te_data
df = pd.DataFrame(te_data, columns=te.columns_)
df
from mlxtend.frequent_patterns import fpgrowth
freq_itemset = fpgrowth(df, min_support=0.6 , use_colnames= True)
freq_itemset
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_itemset, metric = "confidence", min_threshold=0.8)
rules
# to check the running time for fpgrowth and compare
from mlxtend.frequent_patterns import fpgrowth
%timeit fpgrowth(df, min_support=0.6)
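
FP Growth avoids candidate generation by first compressing the transactions into a prefix tree (the FP-tree). The sketch below builds that tree for the same dataset in plain Python. It covers only the tree construction, not the recursive mining of conditional trees, and the class name `FPNode`, the helper names, and the alphabetical tie-breaking are assumptions made for this illustration.

```python
from collections import Counter

dataset = [['milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['dairy milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['milk', 'apple', 'kitkat', 'eggs'],
           ['milk', 'uday', 'corn', 'kitkat', 'yogurt'],
           ['corn', 'onion', 'onion', 'kitkat', 'ice cream', 'eggs']]
min_support = 0.6
n = len(dataset)

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

# Pass 1: count the number of transactions containing each item, keep the frequent ones
item_counts = Counter(item for t in dataset for item in set(t))
frequent_items = {i: c for i, c in item_counts.items() if c / n >= min_support}

def ordered(transaction):
    """Keep frequent items only, most frequent first (ties broken alphabetically)."""
    kept = {i for i in transaction if i in frequent_items}
    return sorted(kept, key=lambda i: (-frequent_items[i], i))

# Pass 2: insert each ordered transaction into the tree, sharing common prefixes
root = FPNode(None)
for t in dataset:
    node = root
    for item in ordered(t):
        if item not in node.children:
            node.children[item] = FPNode(item, node)
        node = node.children[item]
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

show(root)
```

Because every transaction contains kitkat, all five share the same first node; the counts along each path record how many transactions pass through it.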
OUTPUT

Program no: 7
Decision tree using ID3 algorithm
Aim: Write a program to implement decision tree using ID3 algorithm. Use an appropriate data
set for building the decision tree and apply this knowledge to classify a new sample.

ID3 Algorithm


1. Read PlayTennis.csv Using Pandas.

2. Write a function to calculate entropy.
3. Write a function to calculate information gain
4. Define function id3
5. Call id3 function. Pass parameters df and class label.
6. Print decision tree(output of step 5).

Part1
Define function to calculate entropy.
1. Create new function calculate_entropy
2. Pass parameters df(data) and class label.
3. Calculate unique values of class labels.
4. Initialize entropy variable to 0
5. For each unique value of class label
a. Compute its proportion p in the data and add -p * log2(p) to entropy.

6. Return entropy

Part2
Define function to calculate information gain.

1. Create new function calculate_information_gain

2. Pass data, feature and class label.
3. Calculate unique values of feature
4. Calculate average information entropy.
for value in unique_values of feature:
    subset = data[data[feature] == value]
    proportion = len(subset) / len(data)
    average_information_entropy += proportion * calculate_entropy(subset, target_column)
5. Calculate information gain as the entropy of the whole data minus the average
information entropy.
6. Return information gain.

Part3
Define ID3 function.
1. Create function id3 and pass parameters data and class label.
2. If all data points belong to the same class, create a leaf node and return.
3. Calculate entropy for the current dataset.
4. Iterate over features and calculate information gain. Find a feature with max information gain.
5. Declare variable decision_tree. Create a decision node based on the best feature found in step
4.
6. For all unique values of best feature , make recursive call to id3 function
7. Return decision_tree.

Part 4:
Plotting Decision Tree

1. Import from sklearn.tree import DecisionTreeClassifier, plot_tree

import matplotlib.pyplot as plt
2. Convert categorical variables to numerical values
data_numerical = pd.get_dummies(df.iloc[:, :-1])
3. Initialize DecisionTreeClassifier (clf ) by passing parameter criterion='entropy'. Call clf.fit
function by passing data_numerical,and class label
4. Plot Decision Tree using plot_tree function by passing clf,
filled=True, feature_names=data_numerical.columns, class_names=['No', 'Yes'] as parameters.

5. Verify Decision tree plotted.
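
Parts 1 to 3 above can be sketched in plain Python as follows. This is a minimal illustration, not the reference solution: it assumes the classic 14-row play-tennis table (written inline instead of being read from PlayTennis.csv), represents the tree as nested dictionaries, and uses the function names `entropy`, `information_gain` and `id3`, all of which are choices made for this example.

```python
import math
from collections import Counter

# Assumed play-tennis table: (outlook, temperature, humidity, wind, play)
FEATURES = ['outlook', 'temperature', 'humidity', 'wind']
ROWS = [
    ('sunny', 'hot', 'high', 'weak', 'no'),
    ('sunny', 'hot', 'high', 'strong', 'no'),
    ('overcast', 'hot', 'high', 'weak', 'yes'),
    ('rainy', 'mild', 'high', 'weak', 'yes'),
    ('rainy', 'cool', 'normal', 'weak', 'yes'),
    ('rainy', 'cool', 'normal', 'strong', 'no'),
    ('overcast', 'cool', 'normal', 'strong', 'yes'),
    ('sunny', 'mild', 'high', 'weak', 'no'),
    ('sunny', 'cool', 'normal', 'weak', 'yes'),
    ('rainy', 'mild', 'normal', 'weak', 'yes'),
    ('sunny', 'mild', 'normal', 'strong', 'yes'),
    ('overcast', 'mild', 'high', 'strong', 'yes'),
    ('overcast', 'hot', 'normal', 'weak', 'yes'),
    ('rainy', 'mild', 'high', 'strong', 'no'),
]
data = [dict(zip(FEATURES + ['play'], row)) for row in ROWS]

def entropy(rows, target):
    """Part 1: -sum(p * log2(p)) over the class proportions p."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature, target):
    """Part 2: entropy of the data minus the weighted entropy of its subsets."""
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        weighted += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - weighted

def id3(rows, features, target):
    """Part 3: recursively split on the feature with the highest information gain."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # all rows belong to one class: leaf node
        return labels[0]
    if not features:               # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

tree = id3(data, FEATURES, 'play')
print(tree)
```

On this table the root split is outlook, with humidity deciding the sunny branch and wind deciding the rainy branch.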

Program:

# Importing the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree


# Function to import the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)
    # Displaying dataset information
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Dataset: ", balance_data.head())
    return balance_data


# Function to split the dataset into features and target variables
def splitdataset(balance_data):
    # Separating the target variable (column 0) from the features (columns 1 to 4)
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test


# Training with Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini


# Training with entropy (the splitting criterion used by ID3)
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy


# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred


# Function to report the confusion matrix, accuracy and classification report
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))


# Function to plot the decision tree
def plot_decision_tree(clf_object, feature_names, class_names):
    plt.figure(figsize=(15, 10))
    plot_tree(clf_object, filled=True, feature_names=feature_names,
              class_names=class_names, rounded=True)
    plt.show()


# Example driver (not part of the original listing): calls the functions above in order.
# The feature names X1..X4 are placeholders; B, L and R are the balance-scale classes.
data = importdata()
X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
clf_entropy = train_using_entropy(X_train, X_test, y_train)
y_pred = prediction(X_test, clf_entropy)
cal_accuracy(y_test, y_pred)
plot_decision_tree(clf_entropy, ['X1', 'X2', 'X3', 'X4'], ['B', 'L', 'R'])

OUTPUT:


Program no: 8
Naïve Bayesian classifier
Aim: Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.

Dataset = [['r', 't', 'd', 'yes'],
['r', 't', 'd', 'no'],
['r', 't', 'd', 'yes'],
['y', 't', 'd', 'no'],
['y', 't', 'i', 'yes'],
['y', 'v', 'i', 'no'],
['y', 'v', 'i', 'yes'],
['y', 'v', 'd', 'no'],
['r', 'v', 'i', 'no'],
['r', 't', 'i', 'yes']]

Pseudocode

1. Initialize dataset
2. Separate features and labels (X and y)
3. Define class ‘BayesianClassifier’
a. Define function fit (Calculate prior probabilities and conditional
probabilities)
b. Define predict function


4. Create and train the Bayesian classifier

a. classifier = BayesianClassifier()
b. classifier.fit(X, y)
5. Initialize New instance to predict :['r', 'v', 'd']
6. Predict the class label for the new instance. Call classifier.predict() by passing the new
instance as parameter.
7. Print result of step 6
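
The pseudocode above can be sketched directly on the ten-row dataset. The following is one possible minimal implementation, offered as an illustration under stated assumptions: the internal attribute names and the extra `scores` method are choices made here, and no Laplace smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

dataset = [['r', 't', 'd', 'yes'], ['r', 't', 'd', 'no'], ['r', 't', 'd', 'yes'],
           ['y', 't', 'd', 'no'], ['y', 't', 'i', 'yes'], ['y', 'v', 'i', 'no'],
           ['y', 'v', 'i', 'yes'], ['y', 'v', 'd', 'no'], ['r', 'v', 'i', 'no'],
           ['r', 't', 'i', 'yes']]

# Separate features and labels
X = [row[:-1] for row in dataset]
y = [row[-1] for row in dataset]

class BayesianClassifier:
    def fit(self, X, y):
        # Prior probabilities P(class)
        self.class_counts = Counter(y)
        self.priors = {c: count / len(y) for c, count in self.class_counts.items()}
        # feature_counts[(feature index, value, class)] = number of matching rows
        self.feature_counts = defaultdict(int)
        for features, label in zip(X, y):
            for i, value in enumerate(features):
                self.feature_counts[(i, value, label)] += 1
        return self

    def scores(self, instance):
        """P(class) * product of P(feature value | class), for each class."""
        result = {}
        for c, prior in self.priors.items():
            score = prior
            for i, value in enumerate(instance):
                score *= self.feature_counts[(i, value, c)] / self.class_counts[c]
            result[c] = score
        return result

    def predict(self, instance):
        scores = self.scores(instance)
        return max(scores, key=scores.get)

classifier = BayesianClassifier()
classifier.fit(X, y)

new_instance = ['r', 'v', 'd']
class_scores = classifier.scores(new_instance)
predicted = classifier.predict(new_instance)
print("scores:", class_scores)
print("predicted class:", predicted)
```

For the instance ['r', 'v', 'd'] the unnormalized scores are 0.5 * 3/5 * 1/5 * 2/5 = 0.024 for yes and 0.5 * 2/5 * 3/5 * 3/5 = 0.072 for no, so the predicted class is no.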
Program:
import pandas as pd

# Sample data (14 rows)
data = {
    'outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy'],
    'temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                    'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild'],
    'humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'wind': ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
             'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong'],
    'play_tennis': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                    'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
}

tennis_df = pd.DataFrame(data)

# Function to calculate conditional probabilities P(feature value | class)

def calculate_conditional_probabilities(df, feature, target):
    conditional_probabilities = {}
    target_classes = df[target].unique()

    for feature_value in df[feature].unique():
        conditional_probabilities[feature_value] = {}
        for target_class in target_classes:
            # Naive Bayes conditions on the class, so divide by the class count
            # (not by the number of rows with this feature value)
            class_count = len(df[df[target] == target_class])
            feature_count = len(df[(df[feature] == feature_value) &
                                   (df[target] == target_class)])
            conditional_probabilities[feature_value][target_class] = feature_count / class_count

    return conditional_probabilities

# Calculate conditional probabilities for each feature

outlook_probabilities = calculate_conditional_probabilities(tennis_df, 'outlook',
                                                            'play_tennis')
temperature_probabilities = calculate_conditional_probabilities(tennis_df, 'temperature',
                                                                'play_tennis')
humidity_probabilities = calculate_conditional_probabilities(tennis_df, 'humidity',
                                                             'play_tennis')
wind_probabilities = calculate_conditional_probabilities(tennis_df, 'wind', 'play_tennis')

# Function to predict class based on conditional probabilities

def predict(outlook, temperature, humidity, wind):
    probabilities = {}
    for target_class in tennis_df['play_tennis'].unique():
        probabilities[target_class] = (outlook_probabilities[outlook][target_class] *
                                       temperature_probabilities[temperature][target_class] *
