DMML Lab


INSTITUTE VISION AND MISSION

VISION
To emerge as an institute of eminence in the fields of engineering, technology and management
in serving the industry and the nation by empowering students with a high degree of technical,
managerial and practical competence.

MISSION

• To strengthen the theoretical, practical and ethical dimensions of the learning process by
fostering a culture of research and innovation among faculty members and students

• To encourage long-term interaction between the academia and industry through their
involvement in the design of curriculum and its hands-on implementation

• To strengthen and mould students in professional, ethical, social and environmental
dimensions by encouraging participation in co-curricular and extracurricular activities

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VISION
To emerge as a department of eminence in Computer Science and Engineering in serving the
Information Technology Industry and the nation by empowering students with a high degree
of technical and practical competence.

MISSION
• To strengthen the theoretical and practical aspects of the learning process by strongly
encouraging a culture of research, innovation and hands-on learning in Computer Science
and Engineering

• To encourage long-term interaction between the department and the IT industry, through
the involvement of the IT industry in the design of the curriculum and its hands-on
implementation

• To widen the awareness of students in professional, ethical, social and environmental
dimensions by encouraging their participation in co-curricular and extracurricular activities

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

PROGRAM OUTCOMES (POs)


The student will be able to:

PO1: Engineering Knowledge: Apply knowledge of mathematics, science, engineering
fundamentals and an engineering specialization to the solution of complex Computer Science
and engineering problems.

PO2: Problem Analysis: Identify, formulate, review research literature and analyze complex
engineering problems in Computer Science and Engineering reaching substantiated conclusions
using first principles of mathematics, natural sciences and engineering sciences.

PO3: Design / Development of Solutions: Design solutions for complex engineering problems
and design system components or processes of Computer Science and Engineering that meet
the specified needs with appropriate consideration for public health and safety, cultural, societal
and environmental considerations.

PO4: Conduct Investigations of Complex Problems: Use research-based knowledge and
research methods including design of experiments in Computer Science and Engineering,
analysis and interpretation of data, and synthesis of the information to provide valid
conclusions.

PO5: Modern tool usage: Create, select and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities related to Computer Science and Engineering with an understanding of the
limitations.

PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice in Computer Science and Engineering.

PO7: Environment and sustainability: Understand the impact of the professional
engineering solutions of Computer Science and Engineering in societal and environmental
contexts, and demonstrate the knowledge of, and need for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.

PO9: Individual and Team Work: Function effectively as an individual and as a member or
leader in diverse teams, and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

PO12: Life-Long Learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)


The student will be able to:

PSO1: Ability to design, develop, implement computer programs and use knowledge in
various domains to identify research gaps and hence to provide solution to new ideas and
innovations.

PSO2: Work with and communicate effectively with professionals in various fields and pursue
lifelong professional development in computing.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

The Graduate of the program will be able to:

PEO1: Develop proficiency as computer scientists with an ability to solve a wide range of
computational problems in industry, government, or other work environments.

PEO2: Attain the ability to adapt quickly to new environments and technologies, assimilate
new information, and work in multi-disciplinary areas with a strong focus on innovation and
entrepreneurship.

PEO3: Possess the ability to think logically and the capacity to understand technical problems
with computational systems.

PEO4: Possess the ability to collaborate as team members and team leaders to facilitate
cutting-edge technical solutions for computing systems, thereby providing improved functionality.

CONTENT

Exp. No. | List of Experiments

PART A
1 | Given a dataset, analyze whether there is missing data in the dataset and handle it with different data preprocessing methods
2 | Given a dataset, perform the required data standardization and normalization on the data.
3 | Explore Label encoding and other encoding methods on various attributes of the data
4 | Perform Oversampling, under sampling and SMOTE algorithm to handle imbalanced dataset.
5 | Implement Apriori algorithm to identify the frequent itemset and association rule from suitable transaction data.
6 | Implement FP Growth Tree algorithm to identify the frequent itemset and association rule from a suitable transaction data.

PART B
7 | Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.
8 | Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.
9 | Write a program to implement the support vector machine classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.
10 | Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions. Java/Python ML library classes can be used for this problem.
11 | Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.
12 | Build a classifier using any ensemble learning method and compare the results against the classic learning models

Department of Computer Science & Engineering.
DATA MINING AND MACHINE LEARNING LABORATORY
[22CSL61]
LAB RUBRICS

Internal Assessment Marks: 50

Divided into two components:
Continuous Assessment: 30 Marks
Internal Test: 20 Marks

Continuous Assessment:

i) Will be carried out in every lab (for 11 labs – 12 experiments)
ii) Each lab will be evaluated for 10 marks
iii) Totally for 12 labs it will be 120 marks. This will be scaled down to 30.

Break up of 10 marks (in every lab):

Will be carried out in every lab (for 10 labs – 11 experiments)

Attributes | Descriptors | Scores
Conduction of experiment / Writing the program (3) | Student can neatly type the program and is able to explain the working of the program | 3
 | Student can explain 75% of their program | 2
 | Student can explain 50% of their program | 1
 | Student can’t explain program | 0
Execution of program / output (3) | Execution of program without error | 3
 | Partial execution of program without error | 2
 | Student can’t execute program | 0
Result & Record completion and submission (4) | Submits in time and completed (during subsequent lab) and answers all viva questions | 4
 | Fails to submit the record (during subsequent lab) and answers all viva questions | 3
 | Submits in time and completed (during subsequent lab) and partially answers viva questions | 2
 | Fails to submit the record in time / incomplete submission / does not answer viva questions | 0

Internal Test -1+2 : 10+10 Marks

SN | EXPLANATION | CIE1 (MARKS) | CIE2 (MARKS)
01 | Write up | 2.5 | 2.5
02 | Execution and results | 5 | 5
03 | Viva Voce | 2.5 | 2.5
 | TOTAL | 10 | 10

Attributes | Descriptors | Scores
Conduction of experiment / Writing the program (2.5) | Student can neatly type the program and is able to explain the working of the program | 2.5
 | Student can explain 75% of their program | 2
 | Student can explain 50% of their program | 1.5
 | Student can’t explain program | 1
Execution of program / output (5) | Execution of program without error | 5
 | Partial execution of program without error | 3
 | Student can’t execute program | 0
Viva Voce (2.5) | Answers correctly | 2.5
 | Answers satisfactorily | 1.5
 | Does not answer any question | 0

SEE Assessment Marks: 50

Attributes | Descriptors | Scores
Conduction of experiment / Writing the program (10) | Student can neatly type the program and is able to explain the working of the program | 10
 | Student can explain 75% of their program | 8-9
 | Student can explain 50% of their program | 3-5
 | Student can’t explain program | 0-2
Execution of program / output (30) | Execution of program without error | 20-30
 | Partial execution of program without error | 10-19
 | Student can’t execute program | 0-9
Viva Voce (10) | Answers correctly | 10
 | Answers satisfactorily | 5-9
 | Does not answer any question | 0-4

Lab Course Faculty Course Coordinator


DATAMINING AND MACHINE LEARNING LAB 22CSL61

Program no: 1

Handling Missing Values


Aim: Given a dataset, analyze whether there is missing data in the dataset and handle it with
different data preprocessing methods
Algorithm:
Basics
1. Import Pandas
2. Read the provided csv file using the read_csv function and assign the result to a
dataframe (df)
3. Call the describe function on the dataframe and check the following details
#count: Total number of non-empty values
#mean: Mean of the column values
#std: Standard deviation of the column values
#min: Minimum value from the column
#25%: 25th percentile
#50%: 50th percentile (the median)
#75%: 75th percentile
#max: Maximum value from the column
4. Check Number of rows and columns using the shape function on the dataframe.
5. Check column metadata using df.info()

Handling Missing values


1. Calculate the count of null values for each column using df.isnull().sum()
2. Calculate the count of null values for column 'salary': df['salary'].isnull().sum()
3. Import seaborn and plot a histogram for salary: sns.histplot(data=df, x="salary")
4. Part 1: handle missing values by replacing them with the median. Replace null salary values
with the median value of salary.
5. Part 2: handle missing values by dropping rows: df.dropna(how='any')
6. Check that no null values remain in the result: df['salary'].isnull().sum()
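
The two handling options above can be tried on a small in-memory table before touching the real file. The following is a minimal sketch, assuming a hypothetical five-row dataframe in place of the placement CSV; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the placement CSV.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "salary": [200000.0, np.nan, 300000.0, np.nan, 250000.0],
})

null_counts = df.isnull().sum()        # null count per column
median_salary = df["salary"].median()  # NaN values are skipped by median()

# Option 1 (imputation): fill the gaps with the median salary.
df_imputed = df.copy()
df_imputed["salary"] = df_imputed["salary"].fillna(median_salary)

# Option 2 (dropping): remove every row that contains at least one null.
df_dropped = df.dropna(how="any")

print(null_counts["salary"])                  # nulls before handling
print(median_salary)                          # value used for imputation
print(df_imputed["salary"].isnull().sum())    # nulls after imputation
print(len(df_dropped))                        # rows left after dropping
```

Imputation keeps all five rows, while dropping keeps only the rows that were complete to begin with; which one is preferable depends on how much data can be spared.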


Program:

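# MISSING VALUES CAN BE HANDLED BY 1. IMPUTATION 2. DROPPING

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('/content/Placement_Dataset.csv')
df.head()
df.describe()   # describe is a method, so it must be called
df.shape
df.info()
df.isnull().sum()
df['salary'].isnull().sum()
# analyze the value of salary, how it is distributed
# (histplot replaces the deprecated distplot)
sns.histplot(data=df, x='salary')
plt.show()
# replacing the missing values with median value
# (assigning the result back avoids the chained inplace fillna, which newer pandas warns about)
df['salary'] = df['salary'].fillna(df['salary'].median())
# now check the dataset
df.info()
df['salary'].isnull().sum()
# dropping na: dropna returns a new dataframe, so keep the result
df_dropped = df.dropna(how='any')
df_dropped.shape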


Program no: 2
Data Standardization and Normalization
Aim: Given a dataset, perform the required data standardization and normalization on the
data.
Algorithm
Data Standardization
Data standardization rescales each feature to zero mean and a standard deviation of 1, which
makes the data easier for machine learning algorithms to work with.
1. Imports
import numpy as np
import pandas as pd
import sklearn.datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
2. Use the sklearn.datasets.load_breast_cancer() function; it returns the dataset (ds)
3. Convert the dataset to a dataframe using the DataFrame function. df =pd.DataFrame(ds.data ,
columns=ds.feature_names)
4. Check the standard deviation before applying standardization
5. Use scaler=StandardScaler() and call the fit_transform function, passing df as parameter
6. Check the new standard deviation of the output data (of step 5). It should be close to 1.0
Data Normalization
Data normalization rescales all values into the range 0 to 1.
Small values make it easier to apply algorithms.
1. Import MinMaxScaler from sklearn.preprocessing
2. Use the sklearn.datasets.load_breast_cancer() function; it returns the dataset (ds)
3. Convert the dataset to a dataframe using the DataFrame function. df =pd.DataFrame(ds.data ,
columns=ds.feature_names)
4. Normalize features between 0 and 1. Use MinMaxScaler with the fit_transform function. Pass df
as parameter.
5. Print the first 5 rows of the transformed data
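
To see what the two scalers compute, the same arithmetic can be written out by hand. This is a minimal sketch with NumPy, assuming a hypothetical 4x2 feature matrix; it mirrors the per-column formulas (z-score and min-max) rather than reproducing the scikit-learn classes themselves.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Standardization (z-score): subtract the column mean, divide by the column std.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: map each column linearly onto the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_standard.mean(axis=0), X_standard.std(axis=0))  # per-column mean and std
print(X_minmax.min(axis=0), X_minmax.max(axis=0))       # per-column min and max
```

After standardization every column has mean 0 and standard deviation 1; after normalization every column spans exactly 0 to 1, whatever its original scale.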

Program:

# DATA STANDARDIZATION---THE PROCESS OF CONVERTING DATA TO A COMMON FORMAT
import pandas as pd
import sklearn.datasets
from sklearn.preprocessing import StandardScaler

ds = sklearn.datasets.load_breast_cancer()
ds
ds.target_names
df = pd.DataFrame(ds.data, columns=ds.feature_names)
df.head()
x = df
y = ds.target
print(ds.data.std())
# standardize the data before splitting; after standardizing, check the standard deviation,
# the result should be close to 1
scaler = StandardScaler()
x_standard = scaler.fit_transform(x)
print(x_standard)
print(x_standard.std())

# MinMaxScaler transforms data to the range [0,1]
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
x_minmax = minmax.fit_transform(x)
print(x_minmax)

OUTPUT:


Program no: 3
Label Encoding
Aim: Explore Label encoding and other encoding methods on various attributes of the data

Algorithms
Label Encoding
Label encoding is a technique used to convert categorical columns into numerical ones so that
they can be fitted by machine learning models that only take numerical data. It is an important
preprocessing step in a machine-learning project.
Use LabelEncoder
1. Import pandas and sklearn.preprocessing.LabelEncoder
2. Read the given dataset iris.csv
3. Print the dataset. Observe the values of the ‘variety’ column; they are descriptive.
4. Count unique values for column ‘variety’. Use the value_counts function on column ‘variety’
5. Initialize LabelEncoder and call the fit_transform function. Pass the descriptive column
df['variety'] as parameter
6. Assign the output of step 5 to df['variety']
7. Print the first 5 rows of df; the ‘variety’ column values are now replaced by numerical
values (0, 1, 2, etc.).
8. Print the dataset and check that the encoded labels are shown.
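
What the encoders do to a descriptive column can be shown without any library. The following is a minimal pure-Python sketch, assuming a hypothetical list of ‘variety’ values; it follows the same convention as LabelEncoder (distinct classes sorted, then numbered from 0) and also shows the one-hot alternative.

```python
# Hypothetical values for a descriptive column such as 'variety'.
varieties = ["Setosa", "Virginica", "Versicolor", "Setosa", "Virginica"]

# Label encoding: sort the distinct classes and number them from 0.
classes = sorted(set(varieties))
class_to_code = {name: code for code, name in enumerate(classes)}
label_encoded = [class_to_code[v] for v in varieties]

# One-hot encoding: one 0/1 column per class, in the same class order.
one_hot = [[1 if v == name else 0 for name in classes] for v in varieties]

print(classes)        # class order used for the codes
print(label_encoded)  # one integer per row
print(one_hot[0])     # one 0/1 vector per row
```

Label encoding implies an order between classes (0 < 1 < 2) that may not exist in the data, which is the usual reason for preferring one-hot encoding on nominal attributes.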

Program:

# LABEL ENCODING OF BREAST CANCER DATASET
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('/content/brest_cancer.csv')
df.head()
# check the label of diagnosis
df.info()
df['diagnosis'].unique()
df['diagnosis'].value_counts()
# assign numerical values for B and M; that is label encoding
label_encoding = LabelEncoder()
diag_label = label_encoding.fit_transform(df['diagnosis'])
print(diag_label)
df['diagnosis'] = diag_label  # copy the new labeled value of diagnosis to the column diagnosis
# print first 5 rows
df.head()
# OneHotEncoder method
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (scikit-learn 1.2 and later; older releases use sparse=False)
onehotencoder = OneHotEncoder(sparse_output=False)
onehot_label = onehotencoder.fit_transform(df[['diagnosis']])
print(onehot_label)

# LABEL ENCODING FOR IRIS DATASET
df1 = pd.read_csv('/content/iris_data.csv')
df1.head()
df1['Species'].value_counts()
label_encoder2 = LabelEncoder()
species_encoded = label_encoder2.fit_transform(df1['Species'])
print(species_encoded)
# write the codes back to the same column (column names are case-sensitive)
df1['Species'] = species_encoded
print(df1.head())

OUTPUT For BREAST CANCER DATASET:

OUTPUT For IRIS DATASET:



Program no: 4
SMOTE Algorithm
Aim: Perform Oversampling, under sampling and SMOTE algorithm to handle imbalanced
dataset

Algorithm:

Handling Imbalanced DataSet

1. Read the creditcard.csv file using the pandas library
2. Clean the data. Drop null or empty values if any.
3. Check the unique values for column ‘Class’
4. Count the rows for each unique value of column ‘Class’
5. The oversampling and undersampling functions take two inputs, so create x and y:
x starts as a copy of the original dataframe and y = df['Class'].
6. Drop the ‘Class’ column in x, as it has been copied into y for sampling purposes
7. Import the over_sampling and under_sampling packages from the imblearn library
8. Import RandomOverSampler from imblearn.over_sampling
9. Initialize RandomOverSampler with the parameter sampling_strategy='minority'
10. Call the fit_resample function on the RandomOverSampler variable and pass x and y; the
output will be x_resampled and y_resampled
11. Check the output of step 10 for the count of each unique value
12. Repeat steps 7 to 11 using RandomUnderSampler with the parameter
sampling_strategy='majority'

Use SMOTE(Synthetic Minority Oversampling Technique) Algorithm for OverSampling

1. Import SMOTE from imblearn.over_sampling
2. Initialize SMOTE
3. Call the fit_resample function by passing x and y as parameters. The output will be
x_smote_resampled and y_smote_resampled
4. Check the output of step 3 for the count of each unique value.
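
Unlike random oversampling, which duplicates existing rows, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea in NumPy, not the imblearn implementation; the data, the function name smote_sketch and the choice k=3 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 20 majority points and 5 minority points.
X_major = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
X_minor = rng.normal(loc=5.0, scale=0.5, size=(5, 2))

def smote_sketch(X_min, n_new, k=3, rng=rng):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample.
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Generate enough synthetic points to match the majority class size.
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(len(X_major), len(X_minor_balanced))
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.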


Program:

# import packages required
import numpy as np
import pandas as pd
# read csv using pandas
df = pd.read_csv('creditcard.csv')
df.shape
# clean data
df.dropna(inplace=True)
df.shape
# check unique values for column 'Class'
df['Class'].unique()
# count rows for each unique value
df['Class'].value_counts()
# creating x and y
x = df.copy()
y = df['Class']
x.shape
y.value_counts()
y.shape
# drop the 'Class' column in x as we are copying it in y for sampling purpose
x = x.drop('Class', axis=1)
x.shape

# import over sampling and under sampling packages
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy='minority' means that if the majority class had 284315 records and the
# minority class had 492, the minority class is oversampled until it has 284315 examples.
ros = RandomOverSampler(sampling_strategy='minority')
x_resampled, y_resampled = ros.fit_resample(x, y)
y_resampled.value_counts()
# undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')
x_resampled, y_resampled = rus.fit_resample(x, y)
y_resampled.value_counts()
# x_resampled and y_resampled are now the undersampled data
# You can use them for further analysis or modeling
# SMOTE: Synthetic Minority Oversampling Technique
from imblearn.over_sampling import SMOTE
# Instantiate SMOTE
# By setting the random_state parameter to a fixed value, you can ensure that the algorithm
# generates the same synthetic samples each time it is run,
# which can make the results more reproducible (42 is an arbitrary choice).
smote = SMOTE(random_state=42)
# Resample the data
x_smote_resampled, y_smote_resampled = smote.fit_resample(x, y)
y_smote_resampled.value_counts()
OUTPUT FOR UNDERSAMPLING


OUTPUT FOR OVERSAMPLING

Program no: 5
Apriori Algorithm
Aim: Implement Apriori algorithm to identify the frequent itemset and association rule from
suitable transaction data.

Apriori Algorithm

THEORY:
• Import TransactionEncoder and use .fit_transform() to convert the list of transactions into
a True/False array with one column per item.
• Import apriori and use apriori(df, min_support, use_colnames) to find the frequent itemsets
that meet the minimum support.
• Import association_rules and use association_rules(freq_itemset, metric="confidence",
min_threshold) to find the association rules.

Dataset:= [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'], ['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]

Pseudocode To implement Apriori using Python

1. Use the given dataset. Assume support 0.6 and confidence 0.8
2. Import apriori from mlxtend.frequent_patterns
3. Call apriori by passing the parameters df, min_support and use_colnames=True.
The output will be freq_itemset
4. Print freq_itemset and verify the result
5. Import association_rules from mlxtend.frequent_patterns
6. Call association_rules. Pass freq_itemset, metric = "confidence"
and min_threshold with the confidence value
7. Print and verify the output of step 6
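
The level-wise search that apriori performs can also be written out directly, which helps when verifying the library output by hand. This is a minimal pure-Python sketch on the dataset given above with support 0.6; the helper name support and the variable names are illustrative and are not part of mlxtend.

```python
from itertools import combinations

dataset = [['milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['dairy milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['milk', 'apple', 'kitkat', 'eggs'],
           ['milk', 'uday', 'corn', 'kitkat', 'yogurt'],
           ['corn', 'onion', 'onion', 'kitkat', 'ice cream', 'eggs']]
transactions = [set(t) for t in dataset]
min_support = 0.6

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = sorted(set().union(*transactions))
current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

frequent = {}
k = 1
while current:
    for itemset in current:
        frequent[itemset] = support(itemset)
    k += 1
    # Join frequent (k-1)-itemsets into k-item candidates, then prune every
    # candidate that has an infrequent (k-1)-subset (the Apriori property).
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    current = [c for c in candidates
               if all(frozenset(s) in frequent for s in combinations(c, k - 1))
               and support(c) >= min_support]

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the transactions if all of its smaller subsets were already found frequent.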

Program:

#Dataset
dataset = [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'],
['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]
print(dataset)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_data = te.fit_transform(dataset)
te_data
df = pd.DataFrame(te_data, columns=te.columns_)
df
#import apriori and use function apriori and find frequent itemset
from mlxtend.frequent_patterns import apriori
freq_itemset = apriori(df, min_support=0.6 , use_colnames= True)
freq_itemset

#import association_rules
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_itemset, metric = "confidence", min_threshold=0.8)
rules


OUTPUT:

Program no: 6
FP Growth Algorithm
Aim: Implement FP Growth Tree algorithm to identify the frequent itemset and association rule
from a suitable transaction data.

FP Growth Algorithm

THEORY:
• Import TransactionEncoder and use .fit_transform() to convert the list of transactions into
a True/False array with one column per item.
• Import fpgrowth and use fpgrowth(df, min_support, use_colnames) to find the frequent itemsets
that meet the minimum support.
• Import association_rules and use association_rules(freq_itemset, metric="confidence",
min_threshold) to find the association rules.

Dataset:= [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'], ['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]


Pseudocode To implement FP Growth using Python

1. Use the given dataset. Assume support 0.6 and confidence 0.8
2. Import fpgrowth from mlxtend.frequent_patterns
3. Call fpgrowth by passing the parameters df, min_support and use_colnames=True.
The output will be freq_itemset
4. Print freq_itemset and verify the result
5. Import association_rules from mlxtend.frequent_patterns
6. Call association_rules. Pass freq_itemset, metric = "confidence"
and min_threshold with the confidence value
7. Print and verify the output of step 6
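
FP Growth avoids candidate generation by first compressing the transactions into a prefix tree (the FP-tree). The following is a minimal pure-Python sketch of the tree-building phase only, on the dataset above; the FPNode class, the alphabetical tie-break and the show helper are assumptions for illustration, and the recursive mining of the tree is left to the library.

```python
from collections import Counter

dataset = [['milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['dairy milk', 'onion', 'nescafe', 'kitkat', 'eggs', 'yogurt'],
           ['milk', 'apple', 'kitkat', 'eggs'],
           ['milk', 'uday', 'corn', 'kitkat', 'yogurt'],
           ['corn', 'onion', 'onion', 'kitkat', 'ice cream', 'eggs']]
min_count = 3  # support 0.6 on 5 transactions

class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

# Pass 1: count in how many transactions each item occurs.
item_counts = Counter(item for t in dataset for item in set(t))
frequent_items = {i: c for i, c in item_counts.items() if c >= min_count}

# Pass 2: insert each transaction with infrequent items removed and the rest
# sorted by descending count (ties broken alphabetically), sharing prefixes.
root = FPNode(None)
for t in dataset:
    ordered = sorted((i for i in set(t) if i in frequent_items),
                     key=lambda i: (-frequent_items[i], i))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1

def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}: {child.count}")
        show(child, depth + 1)

show(root)
```

Because every transaction contains kitkat, all five paths share the same first node, which is the kind of compression that lets FP Growth scan the data only twice.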

Program:

# Small dataset
dataset = [['milk','onion','nescafe','kitkat','eggs','yogurt'],
['dairy milk','onion','nescafe','kitkat','eggs','yogurt'],
['milk','apple','kitkat','eggs'],
['milk','uday','corn','kitkat','yogurt'],
['corn','onion','onion','kitkat','ice cream','eggs']]
print(dataset)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_data = te.fit_transform(dataset)

te_data
df = pd.DataFrame(te_data, columns=te.columns_)
df
from mlxtend.frequent_patterns import fpgrowth
freq_itemset = fpgrowth(df, min_support=0.6 , use_colnames= True)
freq_itemset
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_itemset, metric = "confidence", min_threshold=0.8)
rules
# to check the running time for fpgrowth and compare it with apriori
# (%timeit is an IPython/Jupyter magic command; it does not work in a plain Python script)
%timeit fpgrowth(df, min_support=0.6)
OUTPUT

Program no: 7
Decision tree using ID3 algorithm
Aim: Write a program to implement decision tree using ID3 algorithm. Use an appropriate data
set for building the decision tree and apply this knowledge to classify a new sample.

ID3 Algorithm


1. Read PlayTennis.csv using Pandas.
2. Write a function to calculate entropy.
3. Write a function to calculate information gain.
4. Define function id3.
5. Call id3 function. Pass parameters df and class label.
6. Print decision tree (output of step 5).

Part1
Define function to calculate entropy.
1. Create new function calculate_entropy
2. Pass parameters df(data) and class label.
3. Calculate unique values of class labels.
4. Initialize entropy variable to 0
5. For each unique value of class label
a. Calculate entropy using entropy formula.

6. Return entropy
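
The entropy formula referred to in step 5a is not written out above; the standard definition, with $S$ the current set of rows, $C$ the set of class labels and $p_c$ the fraction of rows in $S$ that carry label $c$, is:

```latex
H(S) = -\sum_{c \in C} p_c \log_2 p_c ,
\qquad p_c = \frac{|S_c|}{|S|}
```

As a worked example, a set with 9 rows labelled Yes and 5 labelled No has $H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$. A set in which every row has the same label has entropy 0.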

Part2
Define function to calculate information gain.

1. Create new function calculate_information_gain
2. Pass data, feature and class label.
3. Calculate unique values of feature
4. Calculate average information entropy.
for value in unique_values of feature:
    subset = data[data[feature] == value]
    proportion = len(subset) / len(data)
    average_information_entropy += proportion * calculate_entropy(subset, target_column)
5. Calculate information gain.
6. Return information gain.
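
Steps 4 and 5 correspond to the standard information gain formula. Writing $S_v$ for the subset of $S$ in which feature $A$ takes the value $v$, and $H$ for the entropy function of Part1:

```latex
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

The sum is the average information entropy accumulated in the loop of step 4, and the gain is the entropy of the whole set minus that weighted average.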

Part3
Define ID3 function.
1. Create function id3 and pass parameters data and class label.
2. If all data points belong to the same class, create a leaf node and return.
3. Calculate entropy for the current dataset.
4. Iterate over features and calculate information gain. Find a feature with max information gain.
5. Declare variable decision_tree. Create a decision node based on the best feature found in step
4.
6. For all unique values of best feature , make recursive call to id3 function
7. Return decision_tree.
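
Parts 1 to 3 can be put together as follows. This is a minimal pure-Python sketch rather than the pandas version the steps describe; because the contents of PlayTennis.csv are not shown here, the widely used 14-row PlayTennis table is assumed, and the names ROWS, FEATURES and classify are illustrative.

```python
import math
from collections import Counter

FEATURES = ["outlook", "temperature", "humidity", "wind"]
# Assumed data: the classic PlayTennis table; the last field is the class label.
ROWS = [
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]

def entropy(rows):
    """Entropy of the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, idx):
    """Entropy of rows minus the weighted entropy after splitting on feature idx."""
    remainder = 0.0
    for value in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, feature_idxs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # pure node: return a leaf
        return labels[0]
    if not feature_idxs:               # no features left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(feature_idxs, key=lambda i: information_gain(rows, i))
    remaining = [i for i in feature_idxs if i != best]
    return {FEATURES[best]: {
        value: id3([r for r in rows if r[best] == value], remaining)
        for value in sorted(set(r[best] for r in rows))}}

def classify(tree, sample):
    """Walk the nested dict until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][sample[feature]]
    return tree

tree = id3(ROWS, list(range(len(FEATURES))))
print(tree)
new_sample = {"outlook": "sunny", "temperature": "cool", "humidity": "high", "wind": "strong"}
print(classify(tree, new_sample))
```

The tree is returned as nested dictionaries, so classifying a new sample is just a chain of lookups. A feature value that never occurred in training would raise a KeyError in this sketch; a fuller implementation would fall back to the majority label.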

Part 4:
Plotting Decision Tree

1. Imports
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
2. Convert categorical variables to numerical values
data_numerical = pd.get_dummies(df.iloc[:, :-1])
3. Initialize DecisionTreeClassifier (clf) by passing the parameter criterion='entropy'. Call the
clf.fit function by passing data_numerical and the class label
4. Plot the decision tree using the plot_tree function by passing clf,
filled=True, feature_names=data_numerical.columns, class_names=['No', 'Yes'] as parameters.
5. Verify the decision tree plotted.


Program:

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Data Import and Exploration

# Function to import the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)
    # Displaying dataset information
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Dataset: ", balance_data.head())
    return balance_data

# Data Splitting

# Function to split the dataset into features and target variables
def splitdataset(balance_data):
    # Separating the target variable (column 0) from the features (columns 1 to 4)
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

# Training with Gini Index:
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Training with Entropy:
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred)*100)
    print("Report : ",
          classification_report(y_test, y_pred))

# Function to plot the decision tree
def plot_decision_tree(clf_object, feature_names, class_names):
    plt.figure(figsize=(15, 10))
    plot_tree(clf_object, filled=True, feature_names=feature_names,
              class_names=class_names, rounded=True)
    plt.show()

# Driver code: train both trees, evaluate them on the test split and plot one tree
# (the feature names X1..X4 are placeholders for the four unnamed columns)
if __name__ == "__main__":
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)
    for clf in (clf_gini, clf_entropy):
        cal_accuracy(y_test, prediction(X_test, clf))
    plot_decision_tree(clf_entropy, ['X1', 'X2', 'X3', 'X4'], list(clf_entropy.classes_))

OUTPUT:


Program no: 8
Naïve Bayesian classifier
Aim: Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.

Dataset = [['r', 't', 'd', 'yes'],
['r', 't', 'd', 'no'],
['r', 't', 'd', 'yes'],
['y', 't', 'd', 'no'],
['y', 't', 'i', 'yes'],
['y', 'v', 'i', 'no'],
['y', 'v', 'i', 'yes'],
['y', 'v', 'd', 'no'],
['r', 'v', 'i', 'no'],
['r', 't', 'i', 'yes']]

Pseudocode

1. Initialize dataset
2. Separate features and labels (X and y)
3. Define class ‘BayesianClassifier’
a. Define function fit (Calculate prior probabilities and conditional
probabilities)
b. Define predict function

4. Create and train the Bayesian classifier
a. classifier = BayesianClassifier()
b. classifier.fit(X, y)
5. Initialize the new instance to predict: ['r', 'v', 'd']
6. Predict the class label for the new instance. Call classifier.predict() by passing the
new instance as parameter.
7. Print the result of step 6
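
The pseudocode above can be sketched directly on the ten-row dataset. This is a minimal pure-Python illustration, assuming no smoothing (a feature value never seen with a class gets probability 0); the scores helper is an addition for inspecting the intermediate products and is not part of the pseudocode.

```python
from collections import Counter

dataset = [['r', 't', 'd', 'yes'], ['r', 't', 'd', 'no'], ['r', 't', 'd', 'yes'],
           ['y', 't', 'd', 'no'], ['y', 't', 'i', 'yes'], ['y', 'v', 'i', 'no'],
           ['y', 'v', 'i', 'yes'], ['y', 'v', 'd', 'no'], ['r', 'v', 'i', 'no'],
           ['r', 't', 'i', 'yes']]

# Separate features and labels.
X = [row[:-1] for row in dataset]
y = [row[-1] for row in dataset]

class BayesianClassifier:
    def fit(self, X, y):
        # Prior probability P(class) from the label frequencies.
        self.class_counts = Counter(y)
        self.priors = {c: n / len(y) for c, n in self.class_counts.items()}
        # counts[(feature index, value, class)] = number of matching training rows.
        self.counts = Counter()
        for row, label in zip(X, y):
            for idx, value in enumerate(row):
                self.counts[(idx, value, label)] += 1
        return self

    def scores(self, instance):
        """P(class) times the product of P(feature value | class), per class."""
        result = {}
        for c, prior in self.priors.items():
            score = prior
            for idx, value in enumerate(instance):
                score *= self.counts[(idx, value, c)] / self.class_counts[c]
            result[c] = score
        return result

    def predict(self, instance):
        scores = self.scores(instance)
        return max(scores, key=scores.get)

classifier = BayesianClassifier().fit(X, y)
new_instance = ['r', 'v', 'd']
print(classifier.scores(new_instance))
print(classifier.predict(new_instance))
```

For ['r', 'v', 'd'] the score for yes is 0.5 x 3/5 x 1/5 x 2/5 = 0.024 and the score for no is 0.5 x 2/5 x 3/5 x 3/5 = 0.072, so the instance is classified as no.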
Program:
import pandas as pd

# Sample data
data = {
'outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny',
'sunny', 'rainy',
'sunny', 'overcast', 'overcast', 'rainy'],
'temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool', 'mild', 'cool', 'mild', 'mild',
'mild', 'hot', 'mild'],
'humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal', 'high', 'normal',
'normal', 'normal',
'high', 'normal', 'high'],
'wind': ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'weak', 'weak',
'strong', 'strong',
'weak', 'strong'],
'play_tennis': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes',
'no']
}

tennis_df = pd.DataFrame(data)

# Function to calculate conditional probabilities P(feature value | class)
def calculate_conditional_probabilities(df, feature, target):
    conditional_probabilities = {}
    target_classes = df[target].unique()

    for feature_value in df[feature].unique():
        conditional_probabilities[feature_value] = {}
        for target_class in target_classes:
            # naive Bayes needs P(feature value | class), so divide by the class count
            class_count = len(df[df[target] == target_class])
            feature_count = len(df[(df[feature] == feature_value) &
                                   (df[target] == target_class)])
            conditional_probabilities[feature_value][target_class] = feature_count / class_count

    return conditional_probabilities

# Calculate conditional probabilities for each feature


outlook_probabilities = calculate_conditional_probabilities(tennis_df, 'outlook',
'play_tennis')
temperature_probabilities = calculate_conditional_probabilities(tennis_df, 'temperature',
'play_tennis')
humidity_probabilities = calculate_conditional_probabilities(tennis_df, 'humidity',
'play_tennis')
wind_probabilities = calculate_conditional_probabilities(tennis_df, 'wind', 'play_tennis')

# Function to predict class based on conditional probabilities
def predict(outlook, temperature, humidity, wind):
    probabilities = {}
    for target_class in tennis_df['play_tennis'].unique():
        probabilities[target_class] = (outlook_probabilities[outlook][target_class] *
                                       temperature_probabilities[temperature][target_class] *
