OCS353 Data Science Fundamentals Laboratory - EEE
OPEN ELECTIVE
COMMON TO ALL BRANCHES
OCS353 DATA SCIENCE FUNDAMENTALS    L T P C
                                    2 0 2 3
COURSE OBJECTIVES:
LIST OF EXPERIMENTS:
1. Download, install and explore the features of Python for data analytics.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Basic plots using Matplotlib
5. Statistical and Probability measures
a) Frequency distributions
b) Mean, Mode, Standard Deviation
c) Variability
d) Normal curves
e) Correlation and scatter plots
f) Correlation coefficient
g) Regression
6. Use the standard benchmark data set for performing the following:
a) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b) Bivariate Analysis: Linear and logistic regression modelling.
7. Apply supervised learning algorithms and unsupervised learning algorithms on any data set.
8. Apply and explore various plotting functions on any data set.
Note: Example data sets: UCI repository data sets, Iris, Pima Indians Diabetes, etc.
TOTAL = 30 PERIODS
TEXT BOOKS:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.
2. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
2. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1 2 2 1 2 2 - - - 1 1 1 2 2 2 2
CO2 3 2 2 1 2 - - - 1 1 2 2 3 3 2
CO3 2 1 2 1 1 - - - 2 1 1 3 1 1 1
CO4 2 2 1 2 2 - - - 1 1 1 2 2 2 2
CO5 2 1 2 1 1 - - - 2 1 1 3 1 1 1
Average 2.2 1.6 1.6 1.4 1.6 - - - 1.4 1 1.2 2.4 1.8 1.8 1.6
NAME OF THE SUBJECT : DATA SCIENCE FUNDAMENTALS
SUBJECT CODE : OCS353
YEAR/BRANCH/SEMESTER : IV-B.E-ELECTRICAL AND ELECTRONICS ENGINEERING/VII
COURSE OUTCOMES
At the end of this course, the students will be able to:
CO1: Use appropriate search algorithms for problem solving
CO2: Apply reasoning under uncertainty
CO3: Build supervised learning models
CO4: Build ensembling and unsupervised models
CO5: Build deep learning neural network models
LIST OF EXPERIMENTS
S.NO  NAME OF EXPERIMENT                                                        BLOOMS LEVEL  COs
1     Download, install and explore the features of Python for data analytics  L3            CO1
2     Working with Numpy arrays                                                L3            CO1
3     Working with Pandas data frames                                          L3            CO2
4     Basic plots using Matplotlib                                             L3            CO2
5     Statistical and Probability measures:
      a) Frequency distributions b) Mean, Mode, Standard Deviation
      c) Variability d) Normal curves e) Correlation and scatter plots
      f) Correlation coefficient g) Regression                                 L4            CO3
6     Use the standard benchmark data set for performing the following:
      a) Univariate Analysis: Frequency, Mean, Median, Mode, Variance,
      Standard Deviation, Skewness and Kurtosis.
      b) Bivariate Analysis: Linear and logistic regression modelling.         L4            CO4
7     Apply supervised learning algorithms and unsupervised learning
      algorithms on any data set.                                              L4            CO4
8     Apply and explore various plotting functions on any data set.            L3            CO5
Content Beyond Experiment
9     Reading data from text files, Excel and the web, and exploring various
      commands for descriptive analytics on the Iris data set.                 L3            CO4
10    The decision tree based ID3 algorithm                                    L3            CO5
*BL: (L1- Remembering, L2- Understanding, L3- Applying, L4- Analysing, L5- Evaluating, L6-Creating)
Ex No: 01
Download, install and explore the features of Python for data analytics
Date :
AIM:
To download and install Python packages for data analytics and to explore their features.
OBJECTIVES:
a. List the installed packages:
pip list
b. Install packages:
pip install <package_name>
c. Uninstall packages:
pip uninstall <package_name>
d. Deactivate the conda environment:
conda deactivate
FEATURES OF PACKAGES:
A. Numpy (Numerical Python):
NumPy (Numerical Python) is an open source Python library that's used in almost every field
of science and engineering.
It is the core library for numeric and scientific computing.
The NumPy library contains multidimensional array and matrix data structures.
It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently
operate on it.
NumPy can be used to perform a wide variety of mathematical operations on arrays.
It adds powerful data structures to Python that guarantee efficient calculations with arrays
and matrices and it supplies an enormous library of high-level mathematical functions that
operate on these arrays and matrices.
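As a quick illustration of these capabilities (a minimal sketch, not taken from the original manual):
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # a homogeneous 2-D ndarray
print(a.shape)                        # (2, 3)
print(a * 2)                          # elementwise arithmetic
print(a.T @ a)                        # matrix multiplication with the transpose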
C. Jupyter Notebook:
Jupyter Notebook is an open-source web application for creating and sharing documents that
combine live code, equations, visualizations, and narrative text; it is widely used for
interactive data analytics.
D. Statsmodels:
Statsmodels is a Python library that provides classes and functions for estimating statistical
models and for conducting statistical tests and statistical data exploration.
RESULT:
The Python packages were installed and their features were studied successfully.
Ex No: 02
Working With Numpy Arrays
Date :
AIM:
To learn the various functions of the NumPy package and to perform mathematical and
logical operations on NumPy arrays.
OBJECTIVES:
1. Check the Numpy version.
import numpy as np
print(np.__version__)
OUTPUT
1.23.5
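2. Creating numpy arrays. (The creation code for this step is missing from the manual; the sketch below is reconstructed from the output that follows, using standard NumPy constructors.)
print(np.array([[1, 2], [3, 4], [5, 6]]))  # explicit 2-D array
print(np.zeros((2, 4), dtype=int))         # all-zero array
print(np.ones((2, 4)))                     # all-ones array
print(np.ones((2, 4)))                     # second all-ones block in the output
print(np.arange(2, 20, 2))                 # evenly spaced values by step
print(np.linspace(1, 20, 5))               # evenly spaced values by count
print(np.full((2, 2), 3))                  # constant array
print(np.eye(3))                           # identity matrix
print(np.repeat([1, 2, 3, 4, 5, 6], 2))    # repeated elements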
OUTPUT
[[1 2]
[3 4]
[5 6]]
[[0 0 0 0]
[0 0 0 0]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[ 2 4 6 8 10 12 14 16 18]
[ 1. 5.75 10.5 15.25 20. ]
[[3 3]
[3 3]]
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
[1 1 2 2 3 3 4 4 5 5 6 6]
3. Numpy Array Indexing & Slicing.
arr = np.array([11,12,13,14,15])
print(arr)
print("Ele 1 : ",arr[0])
print("Ele 3 : ",arr[2])
print("Ele 1 to 3 : ",arr[0:3])
print("All elements : ",arr[:])
print("All elements except first 2 : ",arr[2:])
arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print("Array : ",arr)
print("Dim : ",arr.ndim)
print("Ele (1,1) : ",arr[1][1])
print("Ele (0,0) : ",arr[0][0])
print("Ele (1,3) : ",arr[1][3])
print("First three Ele in first row : ",arr[0][:3])
print("Ele in second row : ",arr[1][:])
print("All the elements in the matrix : \n",arr[:][:])
arr = np.array([20,21,22,23,24])
print("Last Ele : ",arr[-1])
print("Last three element : ",arr[-3:])
OUTPUT
[11 12 13 14 15]
Ele 1 : 11
Ele 3 : 13
Ele 1 to 3 : [11 12 13]
All elements : [11 12 13 14 15]
All elements except first 2 : [13 14 15]
Array : [[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Dim : 2
Ele (1,1) : 7
Ele (0,0) : 1
Ele (1,3) : 9
First three Ele in first row : [1 2 3]
Ele in second row : [ 6 7 8 9 10]
All the elements in the matrix :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Last Ele : 24
Last three element : [22 23 24]
4. Attributes of the numpy array.
arr=np.array([[1,2,3],[4,5,6],[7,8,9]])
print("Dimension of the ndarray:",arr.ndim)
print("Size of the ndarray in each dimension:",arr.shape)
print("Total number of elements in the ndarray:",arr.size)
print("The data type of the elements of a NumPy array:",arr.dtype)
print("Returns the size (in bytes) of each element of a ndarray:",arr.itemsize)
OUTPUT
Dimension of the ndarray: 2
Size of the ndarray in each dimension: (3, 3)
Total number of elements in the ndarray: 9
The data type of the elements of a NumPy array: int64
Returns the size (in bytes) of each element of a ndarray: 8
5. Reshaping array
np.reshape(), np.flatten()
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8,9,10,11,12])
newarr = arr.reshape(2, 2, 3)
print(newarr)
a = np.array([[1,2], [3,4]])
arr = a.flatten()
print(arr)
OUTPUT
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
[1 2 3 4]
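6. Sorting, splitting and joining arrays. (The code for this step is missing from the manual; the sketch below is consistent with the output that follows, assuming np.sort, np.array_split and np.concatenate.)
a = np.array([3, 5, 89, 34, 6, 5, 34, 6])
print("Sorting the array:", np.sort(a))          # ascending sort
print("Splitting the array:", np.array_split(a, 2))  # split into two halves
x = np.array([[1, 2, 3], [7, 8, 9]])
y = np.array([[4, 5, 6], [10, 11, 12]])
print("Joining two arrays:", np.concatenate((x, y)))  # stack rows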
OUTPUT
Sorting the array: [ 3 5 5 6 6 34 34 89]
Splitting the array: [array([ 3, 5, 89, 34]), array([ 6, 5, 34, 6])]
Joining two arrays: [[ 1 2 3]
[ 7 8 9]
[ 4 5 6]
[10 11 12]]
7. Basic mathematical operations on arrays.
a = np.array([7,3,4,5,1])
b = np.array([3,4,5,6,7])
print("Addition:",np.add(a,b))
print("Multiplication:",np.multiply(a,b))
print("Subtraction:",np.subtract(a,b))
print("Power:",np.power(a,b))
print("Division:",np.divide(a,b))
print("Modulo Devision:",np.remainder(a,b))
OUTPUT
Addition: [10 7 9 11 8]
Multiplication: [21 12 20 30 7]
Subtraction: [ 4 -1 -1 -1 -6]
Power: [ 343 81 1024 15625 1]
Division: [2.33333333 0.75 0.8 0.83333333 0.14285714]
Modulo Division: [1 3 4 5 1]
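8. Copy of an array. (The code for this step is missing from the manual; the sketch below, which changes an element of the original and then restores it, reproduces the Original/Copy lines in the output that follows.)
a = np.array([1, 2, 3, 4, 5, 6])
b = a.copy()           # a copy does not track later changes to the original
a[4] = 90
print("Original:", a)
print("Copy:", b)
a[4] = 5               # restore the original value
print("Original:", a)
print("Copy:", b)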
OUTPUT
[1 2 3 4 5 6 7 8 9]
Original: [ 1 2 3 4 90 6]
Copy: [1 2 3 4 5 6]
Original: [1 2 3 4 5 6]
Copy: [1 2 3 4 5 6]
9. Statistical operations on arrays.
a = np.array([21,22,34,45,56,67,31,78])
print("Sum:",np.sum(a))
print("Minimum:",np.min(a))
print("Maximum:",np.max(a))
print("Mean:",np.mean(a))
print("Standard Deviation:",np.std(a))
print("Varience:",np.var(a))
print("Exponent:",np.exp(a))
print("Square:",np.sqrt(a))
print("Percentile:",np.percentile(a,25))
OUTPUT
Sum: 354
Minimum: 21
Maximum: 78
Mean: 44.25
Standard Deviation: 19.721498421773127
Variance: 388.9375
Exponent: [1.31881573e+09 3.58491285e+09 5.83461743e+14 3.49342711e+19
2.09165950e+24 1.25236317e+29 2.90488497e+13 7.49841700e+33]
Square Root: [4.58257569 4.69041576 5.83095189 6.70820393 7.48331477 8.18535277
5.56776436 8.83176087]
Percentile: 28.75
10. Iterating over an array.
a=np.array([3,4,6,7,78,8,98,9,9])
for x in np.nditer(a):
print(x)
OUTPUT
3
4
6
7
78
8
98
9
9
11. Save and Load numpy array.
a=np.array([[1,2,3,4],[3,4,5,6]])
np.savetxt('one.txt',a,delimiter=',')
a=np.array([[1,2,3,4],[3,4,5,6]])
np.loadtxt('one.txt',delimiter=',')
OUTPUT
array([[1., 2., 3., 4.],
[3., 4., 5., 6.]])
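12. Generating a range of values. (The heading and code for this step are missing from the manual; np.arange, as sketched below, reproduces the output that follows.)
a = np.arange(11, 21)  # integers from 11 up to but not including 21
print(a)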
OUTPUT
[11 12 13 14 15 16 17 18 19 20]
13. Reverse an array.
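(The code for this step is missing from the manual; np.flip on a 3x4 array, as sketched below, reproduces the output that follows.)
a = np.arange(1, 13).reshape(3, 4)
print(np.flip(a))  # reverse the array along every axis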
OUTPUT
[[12 11 10 9]
[ 8 7 6 5]
[ 4 3 2 1]]
14. Generate a random array.
a=np.random.randint(1,7,size=10)
print(a)
OUTPUT
[5 2 1 3 5 2 1 6 6 5]
RESULT:
In this exercise, the NumPy package functions were studied and executed successfully.
Ex No: 03
Working With Pandas Data Frames
Date :
AIM:
The aim of this exercise is to acquire knowledge of the Pandas package for data manipulation
and analysis.
OBJECTIVE:
To understand DataFrame object creation for data manipulation with integrated indexing.
To understand data alignment and integrated handling of missing data.
To understand reshaping and pivoting of data sets.
To understand data set merging and joining.
To understand data filtration.
To understand the group-by engine allowing split-apply-combine operations on data sets.
To understand data structure column and row insertion and deletion.
To understand how to convert NumPy data structures into DataFrame objects and
DataFrame objects back into NumPy data structures.
To understand reading and writing data between in-memory data structures and different
file formats.
PROCEDURE:
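1. Check the Pandas version. (The code for this step is missing from the manual; the standard check below matches the output that follows.)
import pandas as pd
print(pd.__version__)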
OUTPUT
1.5.3
2. Object Creation (DataFrame and Series).
values = [91, 7, 2,10,14,15]
myseries = pd.Series(values, index = ["a", "b", "c","d","e","f"])
print(myseries)
stu_personal = pd.DataFrame({'Rollno':[1001,1002,1003,1004,1005],
                             'Name':['A','B','C','D','E'],
                             'Address':['salem','erode','covai','chennai','namakkal']})
stu_personal
college=pd.Series(['GCT','GCE','GCT','GCE','GCT'])
mark=pd.Series([456,345,399,421,367])
rollno=[1001,1002,1003,1004,1005]
per=pd.Series([91.2,69,79.8,84.2,73.4])
stu_college = pd.DataFrame({'Rollno':rollno, 'College':college, 'Mark':mark, 'Percentage':per})
stu_college
stu_fees = pd.DataFrame({'Rollno':[1001,1002,1003,1004,1005],
                         'Fees':[25000,3000,15000,35000,17500]})
stu_fees
OUTPUT:
a 91
b 7
c 2
d 10
e 14
f 15
dtype: int64
OUTPUT:
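3. Merging the data frames. (The code for this step is not present in the manual, but later steps use a combined frame stu_all; the sketch below builds it with the standard DataFrame.merge on the common Rollno column.)
stu_all = stu_personal.merge(stu_college, on='Rollno').merge(stu_fees, on='Rollno')
stu_all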
4. Viewing data.
stu_personal.head()
stu_college.tail(3)
stu_personal.index
stu_college.columns
stu_personal.to_numpy()
stu_college.describe()
stu_personal.sort_index(axis=1, ascending=False)
stu_college.sort_values(by="Mark")
stu_college.info()
stu_college.value_counts()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Rollno      5 non-null      int64
 1   College     5 non-null      object
 2   Mark        5 non-null      int64
 3   Percentage  5 non-null      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 288.0+ bytes
5. Selection.
a. Selection by Label.
stu_college[["Mark"]]
stu_personal[0:3]
stu_personal.loc[:, ["Name", "Address"]]
OUTPUT:
b. Selection by position.
stu_personal.iloc[3]
stu_personal.iloc[3:5, 0:2]
stu_personal.iloc[[1, 2, 4], [0, 2]]
stu_personal.iloc[1:3, :]
OUTPUT:
6. Operations.
stu_fees['Fees'].mean()
stu_fees['Fees'].sum()
stu_fees['Fees'].max()
stu_fees['Fees'].min()
OUTPUT:
3000
7. Apply.
new = stu_college['Mark'].apply(lambda num : num + 5)
new
stu_personal.applymap(lambda x: len(str(x)))
OUTPUT:
OUTPUT:
9. Grouping.
stu_all.groupby(['Address'])[['Fees']].sum()
10. Reshaping.
stacked = stu_all.stack()
stacked
stacked.unstack()
pd.pivot_table(stu_all, values=["Fees"], index=["Rollno"], columns=["Address"])
OUTPUT:
11. Correlation.
stu_all.corr(numeric_only=True)
OUTPUT:
RESULT:
In this exercise, the Pandas package functions were studied and executed successfully.
Ex No: 04
Basic plots using Matplotlib
Date :
AIM:
The aim of this exercise is to acquire knowledge of basic plots using Matplotlib and to
implement programs for them.
OBJECTIVE:
Line Plot: A line plot is used to visualize data points connected by straight lines.
Scatter Plot: A scatter plot displays individual data points.
Bar Plot: A bar plot is used to compare different categories.
Histogram: A histogram shows the distribution of data over bins.
Pie Chart: A pie chart displays proportions of a whole.
Subplots: You can create multiple plots in a single figure using subplots.
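1. Line Plot Program: (this listing is missing from the manual; a minimal sketch in the same style as the other programs, with assumed sample data)
import matplotlib.pyplot as plt
# Sample data (assumed)
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a line plot
plt.plot(x, y)
# Add title and labels
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show plot
plt.show()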
OUTPUT:
2. Scatter Plot Program:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create a scatter plot
plt.scatter(x, y)
# Add title and labels
plt.title('Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show plot
plt.show()
OUTPUT:
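3. Bar Plot Program: (this listing is missing from the manual; a minimal sketch with hypothetical categories and values)
import matplotlib.pyplot as plt
# Sample data (hypothetical)
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]
# Create a bar plot
plt.bar(categories, values)
# Add title and labels
plt.title('Bar Plot')
plt.xlabel('Category')
plt.ylabel('Value')
# Show plot
plt.show()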
4. Histogram Program:
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7]
# Create a histogram
plt.hist(data, bins=5, edgecolor='black')
# Add title and labels
plt.title('Histogram')
plt.xlabel('Bins')
plt.ylabel('Frequency')
# Show plot
plt.show()
OUTPUT:
5. Pie Chart Program:
import matplotlib.pyplot as plt
# Sample data
labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
# Create a pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
# Add title
plt.title('Pie Chart')
# Show plot
plt.show()
OUTPUT:
6. Subplots Program:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]
# Create a figure and axes
fig, (ax1, ax2) = plt.subplots(1, 2)
# Plot on the first subplot
ax1.plot(x, y1, 'r-')
ax1.set_title('Line Plot 1')
# Plot on the second subplot
ax2.scatter(x, y2, c='b')
ax2.set_title('Scatter Plot 2')
# Show plot
plt.show()
OUTPUT:
RESULT:
In this exercise, basic plots using Matplotlib were studied and executed successfully.
Ex No: 05
Statistical and Probability measures
Date :
AIM:
The aim of this exercise is to acquire knowledge of statistical and probability measures and
to implement programs for them.
OBJECTIVE:
a) Frequency Distributions
Frequency distribution is a summary of how often different values occur in a data set. It helps to
organize data into a more interpretable form.
Absolute Frequency: The count of how many times a value appears.
Relative Frequency: The percentage of the total count represented by the absolute frequency.
Cumulative Frequency: The sum of the frequencies for all values up to a certain point.
b) Mean, Mode, Standard Deviation
The mean is the arithmetic average of the values, the mode is the most frequently occurring
value, and the standard deviation measures the spread of the values around the mean.
c) Variability
Variability describes how spread out the data are; it is commonly summarized by the range,
the variance, and the standard deviation.
d) Normal Curves
A Normal Distribution (or Gaussian distribution) is a bell-shaped curve where the mean,
median, and mode are all equal.
e) Correlation and Scatter Plots
Correlation measures the strength and direction of a linear relationship between two
variables. A scatter plot is a graph that shows the relationship between two variables.
f) Correlation Coefficient
The correlation coefficient (r) quantifies the degree of linear relationship between two
variables, ranging from -1 (perfect negative) to +1 (perfect positive).
g) Regression Using Python
Linear regression is a method for modeling the relationship between a dependent variable
and one or more independent variables.
PROGRAM FOR FREQUENCY DISTRIBUTION:
import pandas as pd
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
# Creating a frequency distribution table
frequency_distribution = pd.Series(data).value_counts().sort_index()
print(frequency_distribution)
OUTPUT:
1 1
2 2
3 3
4 4
dtype: int64
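PROGRAM FOR MEAN, MODE, STANDARD DEVIATION: (this listing is missing from the manual; the sketch below reproduces the output that follows, using NumPy's population standard deviation)
import numpy as np
from statistics import mode
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
# np.std computes the population standard deviation by default
print(f"Mean: {np.mean(data)}, Mode: {mode(data)}, Standard Deviation: {np.std(data)}")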
OUTPUT
Mean: 3.0, Mode: 4, Standard Deviation: 1.0
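PROGRAM FOR NORMAL CURVES: (this listing is missing from the manual; a minimal sketch plotting the standard normal density with scipy.stats.norm)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot the bell-shaped standard normal curve
x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=0, scale=1))
plt.title("Normal Curve")
plt.xlabel("x")
plt.ylabel("Density")
plt.show()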
OUTPUT
PROGRAM FOR CORRELATION AND SCATTER PLOTS:
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)
# Scatter plot
plt.scatter(x, y)
plt.title("Scatter Plot of X vs Y")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
OUTPUT
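PROGRAMS FOR CORRELATION COEFFICIENT AND REGRESSION: (these listings are missing from the manual; the sketch below reuses the synthetic data from the scatter-plot program, with the standard np.corrcoef and scipy.stats.linregress)
import numpy as np
from scipy import stats
# Same synthetic data as the scatter-plot program
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)
# Correlation coefficient (expected to be close to +1 for this data)
r = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", r)
# Simple linear regression: slope should be near 2, intercept near 0
result = stats.linregress(x, y)
print("Slope:", result.slope, "Intercept:", result.intercept)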
RESULT:
In this exercise, statistical and probability measures were studied and the programs executed successfully.
Ex No: 06.a
Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
Date :
AIM:
To implement the program for Univariate Analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.
PROGRAM
import pandas as pd
from sklearn.datasets import load_iris
import numpy as np
from scipy import stats
# Load the Iris dataset into a data frame
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Univariate Analysis
# Frequency for the categorical variable 'species'
frequency = df['species'].value_counts()
# Central tendency, dispersion and shape measures for every column
mean = df.mean()
median = df.median()
mode = df.mode().iloc[0]
variance = df.var()
std_dev = df.std()
skewness = df.skew()
kurtosis = df.kurt()
# Display results
print("Frequency:\n", frequency, "\n")
print("Mean:\n", mean, "\n")
print("Median:\n", median, "\n")
print("Mode:\n", mode, "\n")
print("Variance:\n", variance, "\n")
print("Standard Deviation:\n", std_dev, "\n")
print("Skewness:\n", skewness, "\n")
print("Kurtosis:\n", kurtosis, "\n")
OUTPUT:
Frequency:
0 50
1 50
2 50
Name: species, dtype: int64
Mean:
sepal length (cm) 5.843333
sepal width (cm) 3.057333
petal length (cm) 3.758000
petal width (cm) 1.199333
species 1.000000
dtype: float64
Median:
sepal length (cm) 5.80
sepal width (cm) 3.00
petal length (cm) 4.35
petal width (cm) 1.30
species 1.00
dtype: float64
Mode:
sepal length (cm) 5.0
sepal width (cm) 3.0
petal length (cm) 1.4
petal width (cm) 0.2
species 0.0
Name: 0, dtype: float64
Variance:
sepal length (cm) 0.685694
sepal width (cm) 0.189979
petal length (cm) 3.116278
petal width (cm) 0.581006
species 0.671141
dtype: float64
Standard Deviation:
sepal length (cm) 0.828066
sepal width (cm) 0.435866
petal length (cm) 1.765298
petal width (cm) 0.762238
species 0.819232
dtype: float64
Skewness:
sepal length (cm) 0.314911
sepal width (cm) 0.318966
petal length (cm) -0.274884
petal width (cm) -0.102967
species 0.000000
dtype: float64
Kurtosis:
sepal length (cm) -0.552064
sepal width (cm) 0.228249
petal length (cm) -1.402103
petal width (cm) -1.340604
species -1.510135
dtype: float64
RESULT:
Thus the program for Univariate Analysis (Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis) has been executed successfully and the output was
verified.
Ex No: 06.b
Bivariate Analysis: Linear and logistic regression modelling
Date :
AIM:
To implement the program for Bivariate Analysis: Linear and logistic regression modelling.
PROGRAM:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
# Bivariate Analysis
# 1. Linear Regression: Predicting petal length using sepal length
X_linear = df[['sepal length (cm)']]
y_linear = df['petal length (cm)']
# Split data into training and testing sets
X_train_linear, X_test_linear, y_train_linear, y_test_linear = train_test_split(X_linear, y_linear,
test_size=0.3, random_state=42)
# Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train_linear, y_train_linear)
# Predictions and performance metrics for Linear Regression
y_pred_linear = linear_reg.predict(X_test_linear)
linear_reg_mse = mean_squared_error(y_test_linear, y_pred_linear)
linear_reg_r2 = r2_score(y_test_linear, y_pred_linear)
# 2. Logistic Regression: Predicting species using sepal and petal measurements
X_logistic = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y_logistic = df['species']
# Split data into training and testing sets
X_train_logistic, X_test_logistic, y_train_logistic, y_test_logistic = train_test_split(X_logistic,
y_logistic, test_size=0.3, random_state=42)
# Logistic Regression model
logistic_reg = LogisticRegression(max_iter=200)
logistic_reg.fit(X_train_logistic, y_train_logistic)
# Predictions and performance metrics for Logistic Regression
y_pred_logistic = logistic_reg.predict(X_test_logistic)
logistic_reg_accuracy = accuracy_score(y_test_logistic, y_pred_logistic)
# Display results
print("Linear Regression:")
print(f"Mean Squared Error: {linear_reg_mse}")
print(f"R² Score: {linear_reg_r2}\n")
print("Logistic Regression:")
print(f"Accuracy: {logistic_reg_accuracy}")
OUTPUT:
Linear Regression:
Mean Squared Error: 0.827694417849684
R² Score: 0.7545375840776991
Logistic Regression:
Accuracy: 1.0
RESULT:
Thus the program for developing Bivariate Analysis: Linear and logistic regression modelling
has been executed successfully and output was verified.
Ex No: 07
Apply supervised learning algorithms and unsupervised learning algorithms on any data set
Date :
AIM:
To apply supervised learning algorithms and unsupervised learning algorithms on a data set.
OBJECTIVE:
Supervised Learning: Classification using k-Nearest Neighbors (k-NN)
Unsupervised Learning: Clustering using k-Means
Supervised Learning (k-NN):
The k-NN algorithm is trained on the Iris dataset and achieves 100% accuracy on the test set.
The confusion matrix shows that all test samples are correctly classified.
Unsupervised Learning (k-Means):
k-Means clustering groups the iris samples into 3 clusters. The output shows the centers of these
clusters and the distribution of samples among the clusters.
The visualization plot shows how the samples are grouped in the reduced 2D space (using PCA).
PROGRAM:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
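# --- The training and clustering steps were missing from this listing. ---
# The reconstruction below is a sketch consistent with the imports above and
# the printed results; the hyperparameters (n_neighbors=3, test_size=0.3,
# random_state=42) are assumptions, not taken from the original manual.

# Supervised learning: k-Nearest Neighbors
X = df[iris.feature_names]
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Unsupervised learning: k-Means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df[iris.feature_names])

# PCA reduction to 2-D for visualizing the clusters
pca = PCA(n_components=2)
X_2d = pca.fit_transform(df[iris.feature_names])
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['cluster'])
plt.title('k-Means Clusters (PCA-reduced)')
plt.show()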
# Display results
print("Supervised Learning: k-Nearest Neighbors (k-NN) Results")
print(f"Accuracy: {knn_accuracy}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
print("\nUnsupervised Learning: k-Means Clustering Results")
print("Cluster Centers:\n", kmeans.cluster_centers_)
print("Labels:\n", df['cluster'].value_counts())
OUTPUT:
Supervised Learning: k-Nearest Neighbors (k-NN) Results
Accuracy: 1.0
Confusion Matrix:
[[19 0 0]
[ 0 16 0]
[ 0 0 10]]
Classification Report:
precision recall f1-score support
RESULT:
Thus the program applying supervised and unsupervised learning algorithms on a data set has
been executed successfully and the output was verified.
Ex No: 08
Apply and explore various plotting functions on any data set
Date :
AIM:
To implement a program that applies and explores various plotting functions on a data set.
OBJECTIVE:
Scatter Plot: To visualize the relationship between two features.
Pair Plot: To visualize pairwise relationships between all features.
Histogram: To visualize the distribution of a feature.
Box Plot: To visualize the distribution and outliers for each species.
Heatmap: To visualize the correlation between features.
Step 1: Install the Required Libraries
Step 2: Load the Dataset and Libraries
Step 3: Create the Plots
Step 4: Run the Code
PROGRAM:
Install the Required Libraries
pip install matplotlib seaborn pandas scikit-learn
#Load the Dataset and Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset from sklearn
iris_sklearn = load_iris()
iris_df = pd.DataFrame(data=iris_sklearn.data, columns=iris_sklearn.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris_sklearn.target, iris_sklearn.target_names)
# Display the first few rows of the dataset
print(iris_df.head())
#Create the Plots
# Scatter Plot
plt.figure(figsize=(6, 4))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', hue='species', data=iris_df)
plt.title('Scatter Plot: Sepal Length vs Sepal Width')
plt.show()
# Pair Plot
sns.pairplot(iris_df, hue='species')
plt.suptitle('Pair Plot of Iris Dataset', y=1.02)
plt.show()
# Histogram
plt.figure(figsize=(6, 4))
sns.histplot(iris_df['petal length (cm)'], kde=True, bins=20)
plt.title('Histogram: Petal Length')
plt.show()
# Box Plot
plt.figure(figsize=(6, 4))
sns.boxplot(x='species', y='sepal width (cm)', data=iris_df)
plt.title('Box Plot: Sepal Width by Species')
plt.show()
# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(iris_df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Heatmap of Feature Correlations')
plt.show()
OUTPUT:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
species
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
RESULT:
Thus the program that applies and explores various plotting functions on a data set has been
executed successfully and the output was verified.
CONTENT BEYOND SYLLABUS
Ex No: 01
Reading data from text files, Excel and the web, and exploring various commands for descriptive analytics on the Iris data set
Date :
AIM:
The aim of this exercise is to acquire knowledge of reading data sets in different file formats
and of analyzing the Iris data set.
OBJECTIVE:
PROGRAM:
Step 1: Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
Step 2: Reading data from text files
df=pd.read_csv("Iris.csv")
df
OUTPUT:
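Steps 3 and 4 (reading data from Excel and from the web) are missing from the manual; minimal sketches, with a hypothetical file name and URL, might look like this:
# Step 3: Reading data from an Excel file (file name assumed; requires openpyxl)
df_excel = pd.read_excel("Iris.xlsx")
df_excel

# Step 4: Reading a table from the web (URL hypothetical; pd.read_html requires lxml)
tables = pd.read_html("https://en.wikipedia.org/wiki/Iris_flower_data_set")
tables[0]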
Step 5: Exploring various commands for doing descriptive analytics on the Iris dataset
Display the dataset information
df.info()
OUTPUT:
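Other commonly used descriptive commands (standard pandas API; the Species column name assumes the usual Iris.csv layout):
df.describe()                  # summary statistics for numeric columns
df.head()                      # first five rows
df['Species'].value_counts()   # class distribution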
RESULT:
In this experiment, the analysis of the Iris data set has been executed successfully.
Ex No: 02
The decision tree based ID3 algorithm
Date :
AIM:
To demonstrate the working of the decision tree based ID3 algorithm.
PROCEDURE:
Decision trees are one of the simplest non-linear supervised algorithms in the machine
learning world.
As the name suggests, they are used for making decisions; in ML terms we call this
classification (although they can be used for regression as well).
The decision trees have a unidirectional tree structure i.e. at every node the algorithm
makes a decision to split into child nodes based on certain stopping criteria.
Most commonly, DTs use entropy, information gain, the Gini index, etc. as splitting criteria.
There are a few known algorithms in DTs such as ID3, C4.5, CART, C5.0, CHAID, QUEST,
CRUISE.
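As a quick illustration of the entropy criterion mentioned above (the standard definition, not taken from the manual): for a node with p positive and n negative samples, Entropy = -(p/(p+n))*log2(p/(p+n)) - (n/(p+n))*log2(n/(p+n)). For example, a node with 9 positive and 5 negative samples, as in the classic play-tennis data:
import math
p, n = 9, 5                       # hypothetical class counts at a node
p_ratio = p / (p + n)
n_ratio = n / (p + n)
entropy = -p_ratio * math.log2(p_ratio) - n_ratio * math.log2(n_ratio)
print(round(entropy, 3))          # 0.94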
a. Import Libraries
import math
import pandas as pd
from operator import itemgetter
b. Simple Decision Tree Class
class DecisionTree:
    def __init__(self, df, target, positive, parent_val, parent):
        self.data = df
        self.target = target
        self.positive = positive
        self.parent_val = parent_val
        self.parent = parent
        self.childs = []
        self.decision = ''

    def _get_entropy(self, data):
        p = sum(data[self.target]==self.positive)
        n = data.shape[0] - p
        p_ratio = p/(p+n)
        n_ratio = 1 - p_ratio
        entropy_p = -p_ratio*math.log2(p_ratio) if p_ratio != 0 else 0
        entropy_n = -n_ratio*math.log2(n_ratio) if n_ratio != 0 else 0
        return entropy_p + entropy_n

    def _get_gain(self, feat):
        avg_info = 0
        for val in self.data[feat].unique():
            avg_info += self._get_entropy(self.data[self.data[feat]==val]) * sum(self.data[feat]==val) / self.data.shape[0]
        return self._get_entropy(self.data) - avg_info

    def _get_splitter(self):
        self.splitter = max(self.gains, key=itemgetter(1))[0]

    def update_nodes(self):
        self.features = [col for col in self.data.columns if col != self.target]
        self.entropy = self._get_entropy(self.data)
        if self.entropy != 0:
            self.gains = [(feat, self._get_gain(feat)) for feat in self.features]
            self._get_splitter()
            residual_columns = [k for k in self.data.columns if k != self.splitter]
            for val in self.data[self.splitter].unique():
                df_tmp = self.data[self.data[self.splitter]==val][residual_columns]
                tmp_node = DecisionTree(df_tmp, self.target, self.positive, val, self.splitter)
                tmp_node.update_nodes()
                self.childs.append(tmp_node)
c. Printing the Tree
def print_tree(n):
    for child in n.childs:
        if child:
            print(child.__dict__.get('parent', ''))
            print(child.__dict__.get('parent_val', ''), '\n')
            print_tree(child)
df = pd.read_csv('../input/dt-id3-examplecsv/dt_id3_example.csv')
dt = DecisionTree(df, 'Play', 'Yes', '', '')
dt.update_nodes()
print_tree(dt)
OUTPUT:
Outlook
Sunny
Humidity
High
Humidity
Normal
Outlook
Overcast
Outlook
Rainy
WindSpeed
Weak
WindSpeed
Strong
RESULT:
In this experiment, the working of the decision tree based ID3 algorithm was demonstrated and
executed successfully.