OCS353 Data Science Fundamentals Laboratory - EEE
OPEN ELECTIVE
COMMON TO ALL BRANCHES
OCS353 DATA SCIENCE FUNDAMENTALS    L T P C
                                    2 0 2 3
COURSE OBJECTIVES:
LIST OF EXPERIMENTS:
1. Download, install and explore the features of Python for data analytics.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Basic plots using Matplotlib
5. Statistical and Probability measures
a) Frequency distributions
b) Mean, Mode, Standard Deviation
c) Variability
d) Normal curves
e) Correlation and scatter plots
f) Correlation coefficient
g) Regression
6. Use the standard benchmark data set for performing the following:
a) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b) Bivariate Analysis: Linear and logistic regression modelling.
7. Apply supervised learning algorithms and unsupervised learning algorithms on any data set.
8. Apply and explore various plotting functions on any data set.
Note: Example data sets: UCI repository data sets, Iris, Pima Indians Diabetes, etc.
TOTAL = 30 PERIODS
TEXT BOOKS:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.
2. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
2. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1 2 2 1 2 2 - - - 1 1 1 2 2 2 2
CO2 3 2 2 1 2 - - - 1 1 2 2 3 3 2
CO3 2 1 2 1 1 - - - 2 1 1 3 1 1 1
CO4 2 2 1 2 2 - - - 1 1 1 2 2 2 2
CO5 2 1 2 1 1 - - - 2 1 1 3 1 1 1
Average 2.2 1.6 1.6 1.4 1.6 - - - 1.4 1 1.2 2.4 1.8 1.8 1.6
NAME OF THE SUBJECT : DATA SCIENCE FUNDAMENTALS
SUBJECT CODE : OCS353
YEAR/BRANCH/SEMESTER : IV-B.E-ELECTRICAL AND ELECTRONICS ENGINEERING/VII
COURSE OUTCOMES
At the end of this course, the students will be able to:
CO1: Use appropriate search algorithms for problem solving
CO2: Apply reasoning under uncertainty
CO3: Build supervised learning models
CO4: Build ensembling and unsupervised models
CO5: Build deep learning neural network models
LIST OF EXPERIMENTS
S.NO  NAME OF EXPERIMENT                                                        BLOOMS LEVEL  COs
1     Download, install and explore the features of Python for data analytics  L3            CO1
2     Working with Numpy arrays                                                L3            CO1
3     Working with Pandas data frames                                          L3            CO2
4     Basic plots using Matplotlib                                             L3            CO2
5     Statistical and Probability measures:
      a) Frequency distributions b) Mean, Mode, Standard Deviation
      c) Variability d) Normal curves e) Correlation and scatter plots
      f) Correlation coefficient g) Regression                                 L4            CO3
6     Use the standard benchmark data set for performing the following:
      a) Univariate Analysis: Frequency, Mean, Median, Mode, Variance,
      Standard Deviation, Skewness and Kurtosis.
      b) Bivariate Analysis: Linear and logistic regression modelling.         L4            CO4
7     Apply supervised learning algorithms and unsupervised learning
      algorithms on any data set.                                              L4            CO4
8     Apply and explore various plotting functions on any data set.            L3            CO5
Content Beyond Experiment
9     Reading data from text files, Excel and the web, and exploring various
      commands for descriptive analytics on the Iris data set.                 L3            CO4
10    The decision tree based ID3 algorithm                                    L3            CO5
*BL: (L1- Remembering, L2- Understanding, L3- Applying, L4- Analysing, L5- Evaluating, L6-Creating)
Ex No: 01
Download, install and explore the features of Python for data analytics
Date :
AIM:
To download and install Python packages for data analytics and to explore their features.
OBJECTIVES:
a. List the installed packages:
pip list
b. Install packages:
pip install <package_name>
c. Uninstall packages:
pip uninstall <package_name>
d. Deactivate the conda environment:
conda deactivate
FEATURES OF PACKAGES:
A. Numpy (Numerical Python):
NumPy (Numerical Python) is an open source Python library that's used in almost every field
of science and engineering.
It is the core library for numeric and scientific computing.
The NumPy library contains multidimensional array and matrix data structures.
It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently
operate on it.
NumPy can be used to perform a wide variety of mathematical operations on arrays.
It adds powerful data structures to Python that guarantee efficient calculations with arrays
and matrices and it supplies an enormous library of high-level mathematical functions that
operate on these arrays and matrices.
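As a quick illustration of these capabilities (a minimal sketch, not taken from the original manual):
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # a homogeneous 2-D ndarray
print(a.shape)                        # (2, 3)
print(a * 2)                          # elementwise arithmetic
print(a.T @ a)                        # matrix multiplication with the transpose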
C. Jupyter Notebook:
Jupyter Notebook is an open-source web application for creating and sharing documents that
combine live code, equations, visualizations, and narrative text; it is widely used for
interactive data analytics.
D. Statsmodels:
Statsmodels is a Python library that provides classes and functions for estimating statistical
models and for conducting statistical tests and statistical data exploration.
RESULT:
The Python packages were installed and their features were studied successfully.
Ex No: 02
Working With Numpy Arrays
Date :
AIM:
To learn the various functions of the NumPy package and to perform mathematical and
logical operations on NumPy arrays.
OBJECTIVES:
1. Check the Numpy version.
import numpy as np
print(np.__version__)
OUTPUT
1.23.5
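2. Creating numpy arrays. (The creation code for this step is missing from the manual; the sketch below is reconstructed from the output that follows, using standard NumPy constructors.)
print(np.array([[1, 2], [3, 4], [5, 6]]))  # explicit 2-D array
print(np.zeros((2, 4), dtype=int))         # all-zero array
print(np.ones((2, 4)))                     # all-ones array
print(np.ones((2, 4)))                     # second all-ones block in the output
print(np.arange(2, 20, 2))                 # evenly spaced values by step
print(np.linspace(1, 20, 5))               # evenly spaced values by count
print(np.full((2, 2), 3))                  # constant array
print(np.eye(3))                           # identity matrix
print(np.repeat([1, 2, 3, 4, 5, 6], 2))    # repeated elements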
OUTPUT
[[1 2]
[3 4]
[5 6]]
[[0 0 0 0]
[0 0 0 0]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[ 2 4 6 8 10 12 14 16 18]
[ 1. 5.75 10.5 15.25 20. ]
[[3 3]
[3 3]]
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
[1 1 2 2 3 3 4 4 5 5 6 6]
3. Numpy Array Indexing & Slicing.
arr = np.array([11,12,13,14,15])
print(arr)
print("Ele 1 : ",arr[0])
print("Ele 3 : ",arr[2])
print("Ele 1 to 3 : ",arr[0:3])
print("All elements : ",arr[:])
print("All elements except first 2 : ",arr[2:])
arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print("Array : ",arr)
print("Dim : ",arr.ndim)
print("Ele (1,1) : ",arr[1][1])
print("Ele (0,0) : ",arr[0][0])
print("Ele (1,3) : ",arr[1][3])
print("First three Ele in first row : ",arr[0][:3])
print("Ele in second row : ",arr[1][:])
print("All the elements in the matrix : \n",arr[:][:])
arr = np.array([20,21,22,23,24])
print("Last Ele : ",arr[-1])
print("Last three element : ",arr[-3:])
OUTPUT
[11 12 13 14 15]
Ele 1 : 11
Ele 3 : 13
Ele 1 to 3 : [11 12 13]
All elements : [11 12 13 14 15]
All elements except first 2 : [13 14 15]
Array : [[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Dim : 2
Ele (1,1) : 7
Ele (0,0) : 1
Ele (1,3) : 9
First three Ele in first row : [1 2 3]
Ele in second row : [ 6 7 8 9 10]
All the elements in the matrix :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Last Ele : 24
Last three element : [22 23 24]
4. Attributes of the numpy array.
arr=np.array([[1,2,3],[4,5,6],[7,8,9]])
print("Dimension of the ndarray:",arr.ndim)
print("Size of the ndarray in each dimension:",arr.shape)
print("Total number of elements in the ndarray:",arr.size)
print("The data type of the elements of a NumPy array:",arr.dtype)
print("Returns the size (in bytes) of each element of a ndarray:",arr.itemsize)
OUTPUT
Dimension of the ndarray: 2
Size of the ndarray in each dimension: (3, 3)
Total number of elements in the ndarray: 9
The data type of the elements of a NumPy array: int64
Returns the size (in bytes) of each element of a ndarray: 8
5. Reshaping array
np.reshape(), np.flatten()
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8,9,10,11,12])
newarr = arr.reshape(2, 2, 3)
print(newarr)
a = np.array([[1,2], [3,4]])
arr = a.flatten()
print(arr)
OUTPUT
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
[1 2 3 4]
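6. Sorting, splitting and joining arrays. (The code for this step is missing from the manual; the sketch below is consistent with the output that follows, assuming np.sort, np.array_split and np.concatenate.)
a = np.array([3, 5, 89, 34, 6, 5, 34, 6])
print("Sorting the array:", np.sort(a))          # ascending sort
print("Splitting the array:", np.array_split(a, 2))  # split into two halves
x = np.array([[1, 2, 3], [7, 8, 9]])
y = np.array([[4, 5, 6], [10, 11, 12]])
print("Joining two arrays:", np.concatenate((x, y)))  # stack rows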
OUTPUT
Sorting the array: [ 3 5 5 6 6 34 34 89]
Splitting the array: [array([ 3, 5, 89, 34]), array([ 6, 5, 34, 6])]
Joining two arrays: [[ 1 2 3]
[ 7 8 9]
[ 4 5 6]
[10 11 12]]
7. Basic mathematical operations on arrays.
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print("Addition:",np.add(a,b))
print("Multiplication:",np.multiply(a,b))
print("Subtraction:",np.subtract(a,b))
print("Power:",np.power(a,b))
print("Division:",np.divide(a,b))
print("Modulo Devision:",np.remainder(a,b))
OUTPUT
Addition: [10 7 9 11 8]
Multiplication: [21 12 20 30 7]
Subtraction: [ 4 -1 -1 -1 -6]
Power: [ 343 81 1024 15625 1]
Division: [2.33333333 0.75 0.8 0.83333333 0.14285714]
Modulo Division: [1 3 4 5 1]
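8. Copy of an array. (The code for this step is missing from the manual; the sketch below, which changes an element of the original and then restores it, reproduces the Original/Copy lines in the output that follows.)
a = np.array([1, 2, 3, 4, 5, 6])
b = a.copy()           # a copy does not track later changes to the original
a[4] = 90
print("Original:", a)
print("Copy:", b)
a[4] = 5               # restore the original value
print("Original:", a)
print("Copy:", b)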
OUTPUT
[1 2 3 4 5 6 7 8 9]
Original: [ 1 2 3 4 90 6]
Copy: [1 2 3 4 5 6]
Original: [1 2 3 4 5 6]
Copy: [1 2 3 4 5 6]
9. Statistical operations on arrays.
a = np.array([21,22,34,45,56,67,31,78])
print("Sum:",np.sum(a))
print("Minimum:",np.min(a))
print("Maximum:",np.max(a))
print("Mean:",np.mean(a))
print("Standard Deviation:",np.std(a))
print("Varience:",np.var(a))
print("Exponent:",np.exp(a))
print("Square:",np.sqrt(a))
print("Percentile:",np.percentile(a,25))
OUTPUT
Sum: 354
Minimum: 21
Maximum: 78
Mean: 44.25
Standard Deviation: 19.721498421773127
Variance: 388.9375
Exponent: [1.31881573e+09 3.58491285e+09 5.83461743e+14 3.49342711e+19
2.09165950e+24 1.25236317e+29 2.90488497e+13 7.49841700e+33]
Square Root: [4.58257569 4.69041576 5.83095189 6.70820393 7.48331477 8.18535277
5.56776436 8.83176087]
Percentile: 28.75
10. Iterating over an array.
a=np.array([3,4,6,7,78,8,98,9,9])
for x in np.nditer(a):
print(x)
OUTPUT
3
4
6
7
78
8
98
9
9
11. Save and Load numpy array.
a=np.array([[1,2,3,4],[3,4,5,6]])
np.savetxt('one.txt',a,delimiter=',')
a=np.array([[1,2,3,4],[3,4,5,6]])
np.loadtxt('one.txt',delimiter=',')
OUTPUT
array([[1., 2., 3., 4.],
[3., 4., 5., 6.]])
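12. Generating a range of values. (The heading and code for this step are missing from the manual; np.arange, as sketched below, reproduces the output that follows.)
a = np.arange(11, 21)  # integers from 11 up to but not including 21
print(a)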
OUTPUT
[11 12 13 14 15 16 17 18 19 20]
13. Reverse an array.
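(The code for this step is missing from the manual; np.flip on a 3x4 array, as sketched below, reproduces the output that follows.)
a = np.arange(1, 13).reshape(3, 4)
print(np.flip(a))  # reverse the array along every axis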
OUTPUT
[[12 11 10 9]
[ 8 7 6 5]
[ 4 3 2 1]]
14. Generate a random array.
a=np.random.randint(1,7,size=10)
print(a)
OUTPUT
[5 2 1 3 5 2 1 6 6 5]
RESULT:
In this exercise, the NumPy package functions were studied and executed successfully.
Ex No: 03
Working With Pandas Data Frames
Date :
AIM:
The aim of this exercise is to acquire knowledge of the Pandas package for data manipulation
and analysis.
OBJECTIVE:
To understand DataFrame object creation for data manipulation with integrated indexing.
To understand data alignment and integrated handling of missing data.
To understand reshaping and pivoting of data sets.
To understand data set merging and joining.
To understand data filtration.
To understand the group-by engine allowing split-apply-combine operations on data sets.
To understand data structure column and row insertion and deletion.
To understand how to convert NumPy data structures into DataFrame objects and
DataFrame objects back into NumPy data structures.
To understand reading and writing data between in-memory data structures and different
file formats.
PROCEDURE:
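1. Check the Pandas version. (The code for this step is missing from the manual; the standard check below matches the output that follows.)
import pandas as pd
print(pd.__version__)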
OUTPUT
1.5.3
2. Object Creation (DataFrame and Series).
values = [91, 7, 2,10,14,15]
myseries = pd.Series(values, index = ["a", "b", "c","d","e","f"])
print(myseries)
stu_personal = pd.DataFrame({'Rollno':[1001,1002,1003,1004,1005],
                             'Name':['A','B','C','D','E'],
                             'Address':['salem','erode','covai','chennai','namakkal']})
stu_personal
college=pd.Series(['GCT','GCE','GCT','GCE','GCT'])
mark=pd.Series([456,345,399,421,367])
rollno=[1001,1002,1003,1004,1005]
per=pd.Series([91.2,69,79.8,84.2,73.4])
stu_college = pd.DataFrame({'Rollno':rollno, 'College':college, 'Mark':mark, 'Percentage':per})
stu_college
stu_fees = pd.DataFrame({'Rollno':[1001,1002,1003,1004,1005],
                         'Fees':[25000,3000,15000,35000,17500]})
stu_fees
OUTPUT:
a 91
b 7
c 2
d 10
e 14
f 15
dtype: int64
OUTPUT:
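3. Merging the data frames. (The code for this step is not present in the manual, but later steps use a combined frame stu_all; the sketch below builds it with the standard DataFrame.merge on the common Rollno column.)
stu_all = stu_personal.merge(stu_college, on='Rollno').merge(stu_fees, on='Rollno')
stu_all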
4. Viewing data.
stu_personal.head()
stu_college.tail(3)
stu_personal.index
stu_college.columns
stu_personal.to_numpy()
stu_college.describe()
stu_personal.sort_index(axis=1, ascending=False)
stu_college.sort_values(by="Mark")
stu_college.info()
stu_college.value_counts()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Rollno      5 non-null      int64
 1   College     5 non-null      object
 2   Mark        5 non-null      int64
 3   Percentage  5 non-null      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 288.0+ bytes
5. Selection.
a. Selection by Label.
stu_college[["Mark"]]
stu_personal[0:3]
stu_personal.loc[:, ["Name", "Address"]]
OUTPUT:
b. Selection by position.
stu_personal.iloc[3]
stu_personal.iloc[3:5, 0:2]
stu_personal.iloc[[1, 2, 4], [0, 2]]
stu_personal.iloc[1:3, :]
OUTPUT:
6. Operations.
stu_fees['Fees'].mean()
stu_fees['Fees'].sum()
stu_fees['Fees'].max()
stu_fees['Fees'].min()
OUTPUT:
3000
7. Apply.
new = stu_college['Mark'].apply(lambda num : num + 5)
new
stu_personal.applymap(lambda x: len(str(x)))
OUTPUT:
OUTPUT:
9. Grouping.
stu_all.groupby(['Address'])[['Fees']].sum()
10. Reshaping.
stacked = stu_all.stack()
stacked
stacked.unstack()
pd.pivot_table(stu_all, values=["Fees"], index=["Rollno"], columns=["Address"])
OUTPUT:
11. Correlation.
stu_all.corr(numeric_only=True)
OUTPUT:
RESULT:
In this exercise, the Pandas package functions were studied and executed successfully.
Ex No: 04
Basic plots using Matplotlib
Date :
AIM:
The aim of this exercise is to acquire knowledge of basic plots using Matplotlib and to
implement programs for them.
OBJECTIVE:
Line Plot: A line plot is used to visualize data points connected by straight lines.
Scatter Plot: A scatter plot displays individual data points.
Bar Plot: A bar plot is used to compare different categories.
Histogram: A histogram shows the distribution of data over bins.
Pie Chart: A pie chart displays proportions of a whole.
Subplots: You can create multiple plots in a single figure using subplots.
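1. Line Plot Program: (this listing is missing from the manual; a minimal sketch in the same style as the other programs, with assumed sample data)
import matplotlib.pyplot as plt
# Sample data (assumed)
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a line plot
plt.plot(x, y)
# Add title and labels
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show plot
plt.show()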
OUTPUT:
2. Scatter Plot Program:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a scatter plot
plt.scatter(x, y)
# Add title and labels
plt.title('Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show plot
plt.show()
OUTPUT:
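3. Bar Plot Program: (this listing is missing from the manual; a minimal sketch with hypothetical categories and values)
import matplotlib.pyplot as plt
# Sample data (hypothetical)
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]
# Create a bar plot
plt.bar(categories, values)
# Add title and labels
plt.title('Bar Plot')
plt.xlabel('Category')
plt.ylabel('Value')
# Show plot
plt.show()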
4. Histogram Program:
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7]
# Create a histogram
plt.hist(data, bins=5, edgecolor='black')
# Add title and labels
plt.title('Histogram')
plt.xlabel('Bins')
plt.ylabel('Frequency')
# Show plot
plt.show()
OUTPUT:
5. Pie Chart Program:
import matplotlib.pyplot as plt
# Sample data
labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
# Create a pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
# Add title
plt.title('Pie Chart')
# Show plot
plt.show()
OUTPUT:
6. Subplots Program:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]
# Create a figure and axes
fig, (ax1, ax2) = plt.subplots(1, 2)
# Plot on the first subplot
ax1.plot(x, y1, 'r-')
ax1.set_title('Line Plot 1')
# Plot on the second subplot
ax2.scatter(x, y2, c='b')
ax2.set_title('Scatter Plot 2')
# Show plot
plt.show()
OUTPUT:
RESULT:
In this exercise, basic plots using Matplotlib were studied and executed successfully.
Ex No: 05
Statistical and Probability measures
Date :
AIM:
The aim of this exercise is to acquire knowledge of statistical and probability measures and
to implement programs for them.
OBJECTIVE:
a) Frequency Distributions
Frequency distribution is a summary of how often different values occur in a data set. It helps to
organize data into a more interpretable form.
Absolute Frequency: The count of how many times a value appears.
Relative Frequency: The percentage of the total count represented by the absolute frequency.
Cumulative Frequency: The sum of the frequencies for all values up to a certain point.
b) Mean, Mode, Standard Deviation
The mean is the arithmetic average of the values, the mode is the most frequently occurring
value, and the standard deviation measures the spread of the values around the mean.
c) Variability
Variability describes how spread out the data are; it is commonly summarized by the range,
the variance, and the standard deviation.
d) Normal Curves
A Normal Distribution (or Gaussian distribution) is a bell-shaped curve where the mean,
median, and mode are all equal.
e) Correlation and Scatter Plots
Correlation measures the strength and direction of a linear relationship between two
variables. A scatter plot is a graph that shows the relationship between two variables.
f) Correlation Coefficient
The correlation coefficient (r) quantifies the degree of linear relationship between two
variables, ranging from -1 (perfect negative) to +1 (perfect positive).
g) Regression Using Python
Linear regression is a method for modeling the relationship between a dependent variable
and one or more independent variables.
PROGRAM FOR FREQUENCY DISTRIBUTION:
import pandas as pd
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
# Creating a frequency distribution table
frequency_distribution = pd.Series(data).value_counts().sort_index()
print(frequency_distribution)
OUTPUT:
1 1
2 2
3 3
4 4
dtype: int64
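PROGRAM FOR MEAN, MODE, STANDARD DEVIATION: (this listing is missing from the manual; the sketch below reproduces the output that follows, using NumPy's population standard deviation)
import numpy as np
from statistics import mode
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
# np.std computes the population standard deviation by default
print(f"Mean: {np.mean(data)}, Mode: {mode(data)}, Standard Deviation: {np.std(data)}")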
OUTPUT
Mean: 3.0, Mode: 4, Standard Deviation: 1.0
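PROGRAM FOR NORMAL CURVES: (this listing is missing from the manual; a minimal sketch plotting the standard normal density with scipy.stats.norm)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot the bell-shaped standard normal curve
x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=0, scale=1))
plt.title("Normal Curve")
plt.xlabel("x")
plt.ylabel("Density")
plt.show()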
OUTPUT
PROGRAM FOR CORRELATION AND SCATTER PLOTS:
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)
# Scatter plot
plt.scatter(x, y)
plt.title("Scatter Plot of X vs Y")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
OUTPUT
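PROGRAMS FOR CORRELATION COEFFICIENT AND REGRESSION: (these listings are missing from the manual; the sketch below reuses the synthetic data from the scatter-plot program, with the standard np.corrcoef and scipy.stats.linregress)
import numpy as np
from scipy import stats
# Same synthetic data as the scatter-plot program
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)
# Correlation coefficient (expected to be close to +1 for this data)
r = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", r)
# Simple linear regression: slope should be near 2, intercept near 0
result = stats.linregress(x, y)
print("Slope:", result.slope, "Intercept:", result.intercept)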
RESULT:
In this exercise, statistical and probability measures were studied and the programs executed successfully.
Ex No: 06.a
Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
Date :
AIM:
To implement the program for Univariate Analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.
PROGRAM
import pandas as pd
from sklearn.datasets import load_iris
import numpy as np
from scipy import stats
# Load the Iris dataset into a data frame
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Univariate Analysis
# Frequency for the categorical variable 'species'
frequency = df['species'].value_counts()
# Central tendency, dispersion and shape measures for every column
mean = df.mean()
median = df.median()
mode = df.mode().iloc[0]
variance = df.var()
std_dev = df.std()
skewness = df.skew()
kurtosis = df.kurt()
# Display results
print("Frequency:\n", frequency, "\n")
print("Mean:\n", mean, "\n")
print("Median:\n", median, "\n")
print("Mode:\n", mode, "\n")
print("Variance:\n", variance, "\n")
print("Standard Deviation:\n", std_dev, "\n")
print("Skewness:\n", skewness, "\n")
print("Kurtosis:\n", kurtosis, "\n")
OUTPUT:
Frequency:
0 50
1 50
2 50
Name: species, dtype: int64
Mean:
sepal length (cm) 5.843333
sepal width (cm) 3.057333
petal length (cm) 3.758000
petal width (cm) 1.199333
species 1.000000
dtype: float64
Median:
sepal length (cm) 5.80
sepal width (cm) 3.00
petal length (cm) 4.35
petal width (cm) 1.30
species 1.00
dtype: float64
Mode:
sepal length (cm) 5.0
sepal width (cm) 3.0
petal length (cm) 1.4
petal width (cm) 0.2
species 0.0
Name: 0, dtype: float64
Variance:
sepal length (cm) 0.685694
sepal width (cm) 0.189979
petal length (cm) 3.116278
petal width (cm) 0.581006
species 0.671141
dtype: float64
Standard Deviation:
sepal length (cm) 0.828066
sepal width (cm) 0.435866
petal length (cm) 1.765298
petal width (cm) 0.762238
species 0.819232
dtype: float64
Skewness:
sepal length (cm) 0.314911
sepal width (cm) 0.318966
petal length (cm) -0.274884
petal width (cm) -0.102967
species 0.000000
dtype: float64
Kurtosis:
sepal length (cm) -0.552064
sepal width (cm) 0.228249
petal length (cm) -1.402103
petal width (cm) -1.340604
species -1.510135
dtype: float64
RESULT:
Thus the program for Univariate Analysis (Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis) has been executed successfully and the output was
verified.
Ex No: 06.b
Bivariate Analysis: Linear and logistic regression modelling
Date :
AIM:
To implement the program for Bivariate Analysis: Linear and logistic regression modelling.
PROGRAM:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Bivariate Analysis
# 1. Linear Regression: Predicting petal length using sepal length
X_linear = df[['sepal length (cm)']]
y_linear = df['petal length (cm)']
# Split data into training and testing sets
X_train_linear, X_test_linear, y_train_linear, y_test_linear = train_test_split(X_linear, y_linear,
test_size=0.3, random_state=42)
# Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train_linear, y_train_linear)
# Predictions and performance metrics for Linear Regression
y_pred_linear = linear_reg.predict(X_test_linear)
linear_reg_mse = mean_squared_error(y_test_linear, y_pred_linear)
linear_reg_r2 = r2_score(y_test_linear, y_pred_linear)
# 2. Logistic Regression: Predicting species using sepal and petal measurements
X_logistic = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y_logistic = df['species']
# Split data into training and testing sets
X_train_logistic, X_test_logistic, y_train_logistic, y_test_logistic = train_test_split(X_logistic,
y_logistic, test_size=0.3, random_state=42)
# Logistic Regression model
logistic_reg = LogisticRegression(max_iter=200)
logistic_reg.fit(X_train_logistic, y_train_logistic)
# Predictions and performance metrics for Logistic Regression
y_pred_logistic = logistic_reg.predict(X_test_logistic)
logistic_reg_accuracy = accuracy_score(y_test_logistic, y_pred_logistic)
# Display results
print("Linear Regression:")
print(f"Mean Squared Error: {linear_reg_mse}")
print(f"R² Score: {linear_reg_r2}\n")
print("Logistic Regression:")
print(f"Accuracy: {logistic_reg_accuracy}")
OUTPUT:
Linear Regression:
Mean Squared Error: 0.827694417849684
R² Score: 0.7545375840776991
Logistic Regression:
Accuracy: 1.0
RESULT:
Thus the program for developing Bivariate Analysis: Linear and logistic regression modelling
has been executed successfully and output was verified.
Ex No: 07
Apply supervised learning algorithms and unsupervised learning algorithms on any data set
Date :
AIM:
To apply supervised learning algorithms and unsupervised learning algorithms on a data set.
OBJECTIVE:
Supervised Learning: Classification using k-Nearest Neighbors (k-NN)
Unsupervised Learning: Clustering using k-Means
Supervised Learning (k-NN):
The k-NN algorithm is trained on the Iris dataset and achieves 100% accuracy on the test set.
The confusion matrix shows that all test samples are correctly classified.
Unsupervised Learning (k-Means):
k-Means clustering groups the iris samples into 3 clusters. The output shows the centers of these
clusters and the distribution of samples among the clusters.
The visualization plot shows how the samples are grouped in the reduced 2D space (using PCA).
PROGRAM:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
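# --- The training and clustering steps were missing from this listing. ---
# The reconstruction below is a sketch consistent with the imports above and
# the printed results; the hyperparameters (n_neighbors=3, test_size=0.3,
# random_state=42) are assumptions, not taken from the original manual.

# Supervised learning: k-Nearest Neighbors
X = df[iris.feature_names]
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Unsupervised learning: k-Means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df[iris.feature_names])

# PCA reduction to 2-D for visualizing the clusters
pca = PCA(n_components=2)
X_2d = pca.fit_transform(df[iris.feature_names])
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['cluster'])
plt.title('k-Means Clusters (PCA-reduced)')
plt.show()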
# Display results
print("Supervised Learning: k-Nearest Neighbors (k-NN) Results")
print(f"Accuracy: {knn_accuracy}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
print("\nUnsupervised Learning: k-Means Clustering Results")
print("Cluster Centers:\n", kmeans.cluster_centers_)
print("Labels:\n", df['cluster'].value_counts())
OUTPUT:
Supervised Learning: k-Nearest Neighbors (k-NN) Results
Accuracy: 1.0
Confusion Matrix:
[[19 0 0]
[ 0 16 0]
[ 0 0 10]]
Classification Report:
precision recall f1-score support
RESULT:
Thus the program applying supervised and unsupervised learning algorithms on a data set has
been executed successfully and the output was verified.
Ex No: 08
Apply and explore various plotting functions on any data set
Date :
AIM:
To implement a program that applies and explores various plotting functions on a data set.
OBJECTIVE:
Scatter Plot: To visualize the relationship between two features.
Pair Plot: To visualize pairwise relationships between all features.
Histogram: To visualize the distribution of a feature.
Box Plot: To visualize the distribution and outliers for each species.
Heatmap: To visualize the correlation between features.
Step 1: Install the Required Libraries
Step 2: Load the Dataset and Libraries
Step 3: Create the Plots
Step 4: Run the Code
PROGRAM:
Install the Required Libraries
pip install matplotlib seaborn pandas scikit-learn
#Load the Dataset and Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset from sklearn
iris_sklearn = load_iris()
iris_df = pd.DataFrame(data=iris_sklearn.data, columns=iris_sklearn.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris_sklearn.target, iris_sklearn.target_names)
# Display the first few rows of the dataset
print(iris_df.head())
#Create the Plots
# Scatter Plot
plt.figure(figsize=(6, 4))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', hue='species', data=iris_df)
plt.title('Scatter Plot: Sepal Length vs Sepal Width')
plt.show()
# Pair Plot
sns.pairplot(iris_df, hue='species')
plt.suptitle('Pair Plot of Iris Dataset', y=1.02)
plt.show()
# Histogram
plt.figure(figsize=(6, 4))
sns.histplot(iris_df['petal length (cm)'], kde=True, bins=20)
plt.title('Histogram: Petal Length')
plt.show()
# Box Plot
plt.figure(figsize=(6, 4))
sns.boxplot(x='species', y='sepal width (cm)', data=iris_df)
plt.title('Box Plot: Sepal Width by Species')
plt.show()
# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(iris_df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Heatmap of Feature Correlations')
plt.show()
OUTPUT:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
species
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
RESULT:
Thus the program that applies and explores various plotting functions on a data set has been
executed successfully and the output was verified.
CONTENT BEYOND SYLLABUS
Ex No: 01
Reading data from text files, Excel and the web, and exploring various commands for descriptive analytics on the Iris data set
Date :
AIM:
The aim of this exercise is to acquire knowledge of reading data sets in different file formats
and of analyzing the Iris data set.
OBJECTIVE:
PROGRAM:
Step 1: Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
Step 2: Reading data from text files
df=pd.read_csv("Iris.csv")
df
OUTPUT:
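Steps 3 and 4 (reading data from Excel and from the web) are missing from the manual; minimal sketches, with a hypothetical file name and URL, might look like this:
# Step 3: Reading data from an Excel file (file name assumed; requires openpyxl)
df_excel = pd.read_excel("Iris.xlsx")
df_excel

# Step 4: Reading a table from the web (URL hypothetical; pd.read_html requires lxml)
tables = pd.read_html("https://en.wikipedia.org/wiki/Iris_flower_data_set")
tables[0]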
Step 5: Exploring various commands for doing descriptive analytics on the Iris dataset
Display the dataset information
df.info()
OUTPUT:
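Other commonly used descriptive commands (standard pandas API; the Species column name assumes the usual Iris.csv layout):
df.describe()                  # summary statistics for numeric columns
df.head()                      # first five rows
df['Species'].value_counts()   # class distribution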
RESULT:
In this experiment, the analysis of the Iris data set has been executed successfully.
Ex No: 02
The decision tree based ID3 algorithm
Date :
AIM:
To demonstrate the working of the decision tree based ID3 algorithm.
PROCEDURE:
Decision trees are one of the simplest non-linear supervised algorithms in the machine
learning world.
As the name suggests, they are used for making decisions; in ML terms we call this
classification (although they can be used for regression as well).
The decision trees have a unidirectional tree structure i.e. at every node the algorithm
makes a decision to split into child nodes based on certain stopping criteria.
Most commonly, DTs use entropy, information gain, the Gini index, etc. as splitting criteria.
There are a few known algorithms in DTs such as ID3, C4.5, CART, C5.0, CHAID, QUEST,
CRUISE.
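As a quick illustration of the entropy criterion mentioned above (the standard definition, not taken from the manual): for a node with p positive and n negative samples, Entropy = -(p/(p+n))*log2(p/(p+n)) - (n/(p+n))*log2(n/(p+n)). For example, a node with 9 positive and 5 negative samples, as in the classic play-tennis data:
import math
p, n = 9, 5                       # hypothetical class counts at a node
p_ratio = p / (p + n)
n_ratio = n / (p + n)
entropy = -p_ratio * math.log2(p_ratio) - n_ratio * math.log2(n_ratio)
print(round(entropy, 3))          # 0.94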
a. Import Libraries
import math
import pandas as pd
from operator import itemgetter
b. Simple Decision Tree Class
class DecisionTree:
    def __init__(self, df, target, positive, parent_val, parent):
        self.data = df
        self.target = target
        self.positive = positive
        self.parent_val = parent_val
        self.parent = parent
        self.childs = []
        self.decision = ''

    def _get_entropy(self, data):
        p = sum(data[self.target]==self.positive)
        n = data.shape[0] - p
        p_ratio = p/(p+n)
        n_ratio = 1 - p_ratio
        entropy_p = -p_ratio*math.log2(p_ratio) if p_ratio != 0 else 0
        entropy_n = -n_ratio*math.log2(n_ratio) if n_ratio != 0 else 0
        return entropy_p + entropy_n

    def _get_gain(self, feat):
        avg_info = 0
        for val in self.data[feat].unique():
            avg_info += self._get_entropy(self.data[self.data[feat]==val]) * sum(self.data[feat]==val) / self.data.shape[0]
        return self._get_entropy(self.data) - avg_info

    def _get_splitter(self):
        self.splitter = max(self.gains, key=itemgetter(1))[0]

    def update_nodes(self):
        self.features = [col for col in self.data.columns if col != self.target]
        self.entropy = self._get_entropy(self.data)
        if self.entropy != 0:
            self.gains = [(feat, self._get_gain(feat)) for feat in self.features]
            self._get_splitter()
            residual_columns = [k for k in self.data.columns if k != self.splitter]
            for val in self.data[self.splitter].unique():
                df_tmp = self.data[self.data[self.splitter]==val][residual_columns]
                tmp_node = DecisionTree(df_tmp, self.target, self.positive, val, self.splitter)
                tmp_node.update_nodes()
                self.childs.append(tmp_node)
c. Printing the Tree
def print_tree(n):
    for child in n.childs:
        if child:
            print(child.__dict__.get('parent', ''))
            print(child.__dict__.get('parent_val', ''), '\n')
            print_tree(child)
df = pd.read_csv('../input/dt-id3-examplecsv/dt_id3_example.csv')
dt = DecisionTree(df, 'Play', 'Yes', '', '')
dt.update_nodes()
print_tree(dt)
OUTPUT:
Outlook
Sunny
Humidity
High
Humidity
Normal
Outlook
Overcast
Outlook
Rainy
WindSpeed
Weak
WindSpeed
Strong
RESULT:
In this experiment, the working of the decision tree based ID3 algorithm was demonstrated and
executed successfully.