
Decision tree classifier

Dishant Kumar Yadav 2021BCS0136

Implementation:

General Terms: Let us first discuss a few statistical concepts used in this post.

Entropy: The entropy of a dataset is a measure of the impurity of the dataset. Entropy can also be
thought of as a measure of uncertainty, and we should try to minimize it. The goal of
machine learning models is to reduce uncertainty, or entropy, as far as possible.
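For reference, if p_i is the proportion of samples belonging to class i, the entropy of a dataset S is

H(S) = -\sum_{i} p_i \log_2 p_i

A pure node (all samples from one class) has entropy 0, while a 50/50 split of two classes has entropy 1.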

Information Gain: Information gain is a measure of how much information a feature gives us
about the classes. The decision tree algorithm always tries to maximize information gain.
A feature that perfectly partitions the data gives the maximum information, so the feature with the
highest information gain is used for the first split.
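In symbols, the information gain of splitting a dataset S on a feature A is the entropy of S minus the weighted average entropy of the subsets S_v produced by each value v of A:

IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)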

Import Libraries:


We are going to import NumPy and the pandas library.

# Import the required libraries


import pandas as pd
import numpy as np

from google.colab import files

uploaded = files.upload()

diabetes11.csv (text/csv) - 7491 bytes, last modified: 17/1/2024
Saving diabetes11.csv to diabetes11.csv

import shutil

# Move the uploaded file 'diabetes11.csv' into the /content directory


shutil.move('diabetes11.csv', '/content/diabetes11.csv')
'/content/diabetes11.csv'

import os

# List files in the /content directory


os.listdir('/content')

['.config',
'diabetes (1).csv',
'diabetes.csv',
'diabetes11.csv',
'sample_data']

import pandas as pd

# Read the CSV file into a DataFrame


df = pd.read_csv('/content/diabetes11.csv')

# Display the first few rows of the DataFrame


df.head()
index  Glucose  BloodPressure  diabetes
0      148      72             1
1      85       66             0
2      183      64             1
3      89       66             0
4      137      40             1


# Define the calculate entropy function
def calculate_entropy(df_label):
    # Unique class labels and how often each one occurs
    classes, class_counts = np.unique(df_label, return_counts=True)
    # Entropy: -sum(p_i * log2(p_i)) over all classes
    entropy_value = np.sum([(-class_counts[i] / np.sum(class_counts)) *
                            np.log2(class_counts[i] / np.sum(class_counts))
                            for i in range(len(classes))])
    return entropy_value
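As a quick check (a minimal sketch, assuming df has been loaded as above), the entropy of the 'diabetes' label column can be computed directly; the value depends on the class balance in the uploaded file:

# Entropy of the label column for the whole dataset
calculate_entropy(df['diabetes'])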
# Define the calculate information gain function
def calculate_information_gain(dataset, feature, label):
    # Calculate the dataset entropy
    dataset_entropy = calculate_entropy(dataset[label])
    values, feat_counts = np.unique(dataset[feature], return_counts=True)

    # Calculate the weighted feature entropy: for each value of the feature,
    # take the entropy of the matching rows, weighted by how often that value occurs
    weighted_feature_entropy = np.sum([(feat_counts[i] / np.sum(feat_counts)) *
                                       calculate_entropy(dataset.where(dataset[feature]
                                       == values[i]).dropna()[label])
                                       for i in range(len(values))])
    feature_info_gain = dataset_entropy - weighted_feature_entropy
    return feature_info_gain
# Set the features and label
features = df.columns[:-1]
label = 'diabetes'
parent=None
features

Index(['Glucose', 'BloodPressure'], dtype='object')
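Before building the tree, it can help to look at the information gain of each feature on its own (a small sketch using the functions defined above; the exact numbers depend on the uploaded data):

# Information gain of each feature with respect to the 'diabetes' label
for feature in features:
    print(feature, calculate_information_gain(df, feature, label))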

import numpy as np

def create_decision_tree(dataset, df, features, label, parent=None):

    # Class counts over the full dataframe and the unique classes in the current subset
    datum = np.unique(df[label], return_counts=True)
    unique_data = np.unique(dataset[label])

    if len(unique_data) <= 1:
        # Pure subset: return the single remaining class as a leaf
        return unique_data[0]

    elif len(dataset) == 0:
        # Empty subset: fall back to the most common label in the full dataframe
        return datum[0][np.argmax(datum[1])]

    elif len(features) == 0:
        # No features left to split on: return the parent node's majority class
        return parent

    else:
        # Majority class of the current node, passed down as the fallback for child nodes
        parent = unique_data[np.argmax(datum[1])]

        # Call the calculate_information_gain function for every remaining feature
        item_values = [calculate_information_gain(dataset, feature, label) for feature in features]

        # Split on the feature with the highest information gain
        optimum_feature = features[np.argmax(item_values)]
