
R19AD251 DATA SCIENCE LABORATORY

DEPARTMENT OF COMPUTER AND COMMUNICATION ENGINEERING

SRI ESHWAR COLLEGE OF ENGINEERING


KINATHUKADAVU COIMBATORE - 641202
DEPARTMENT OF COMPUTER AND COMMUNICATION ENGINEERING

BONAFIDE CERTIFICATE

Certified that this is the bonafide record of work done by

Name: Mr. /Ms. ...…………………………………………………………………………………….

Register No: ……………………………………………………………………………... of 3rd Year

B.E – COMPUTER AND COMMUNICATION ENGINEERING in the R19AD251 DATA SCIENCE

LABORATORY during the 5th Semester of the academic year 2024 – 2025 (Odd Semester).

Signature of Faculty In-charge                                                        Head of the Department

Submitted for the practical examinations of Anna University, held on …………….

Internal Examiner                                                                      External Examiner


Data Science Laboratory

Contents

S.No   Date   Name of the Experiment                              Page Number   Marks (50)   Signature of the Faculty Member

1             Web Scraping
2             Probability Distribution Plots
3             Normal Distribution using Q-Q Plot
4             Exploratory Data Analysis
5             House Price Prediction – Linear Regression
6             Medical Diagnosis Pattern – Logistic Regression
7             Customer Segmentation
8             Customer Churn
9             Online Shoppers Behaviour Analysis – kNN
10            Tableau Dashboard

Average:
Average (in words)                                                Signature of the Faculty

Experiential Learning 1
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Web Scraping
Objectives Perform web scraping and create a DataFrame by collecting data from a
suitable resource.
Learning Outcomes Web scraping to collect data
Problem Statement Terminology Theory Code Input Output Conclusion
You are tasked with extracting specific data from a publicly accessible website, such as product prices, reviews,
or job listings. After collecting the data, you need to structure it in a DataFrame, perform data cleaning if
necessary, and analyze the data to answer a set of predefined questions.

Problem Statement Terminology Theory Code Input Output Conclusion


Web Scraping: The process of automatically extracting data from websites.
HTML: HyperText Markup Language, the standard language for documents designed to be displayed in a web
browser.
HTTP Request: A request sent by a client to a server to retrieve information, often used in web scraping.
BeautifulSoup: A Python library used for parsing HTML and XML documents.
DataFrame: A two-dimensional labeled data structure in Pandas, used for data manipulation.
Pandas: A Python library used for data manipulation and analysis, providing data structures like DataFrames.
CSS Selector: A pattern used in CSS to select and style HTML elements, also used in web scraping to locate
elements in the HTML structure.
HTTP Status Code: A code returned by the server indicating the status of the HTTP request (e.g., 200 for
success, 404 for not found).
Data Cleaning: The process of correcting or removing inaccurate records from a dataset.
JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for humans to read and
write and easy for machines to parse and generate, often used for transmitting data in web scraping.
API (Application Programming Interface): A set of rules and protocols for building and interacting with
software applications, sometimes used as an alternative to web scraping for data extraction.
Captcha: A type of challenge-response test used in computing to determine whether the user is human, often
encountered when performing web scraping to prevent automated access.
Regex (Regular Expression): A sequence of characters that defines a search pattern, often used in web scraping
to extract specific text patterns from HTML.
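
To make the CSS-selector idea above concrete, here is a small optional sketch (not part of the recorded experiment; the toy HTML string is made up for illustration) showing how BeautifulSoup's select() can locate the same kind of cells that find() reaches through attribute filters:

from bs4 import BeautifulSoup

html = "<table><tbody><tr><td aria-label='Symbol'>AAPL</td></tr></tbody></table>"  # toy markup
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <td> carrying aria-label="Symbol" inside the table body
for cell in soup.select("tbody td[aria-label='Symbol']"):
    print(cell.get_text())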

Problem Statement Terminology Theory Code Input Output Conclusion


Web scraping involves programmatically navigating a website's HTML structure to retrieve and extract relevant
data. This data is usually unstructured and requires cleaning and formatting to make it useful for analysis. Python
offers several powerful libraries for web scraping, such as requests for sending HTTP requests and
BeautifulSoup for parsing HTML content. Once the data is extracted, it is often stored in a Pandas DataFrame,
which allows for efficient data manipulation and analysis.

Problem Statement Terminology Theory Code Input Output Conclusion
from bs4 import BeautifulSoup as bs
import requests

url="https://finance.yahoo.com/most-active"

resp=requests.get(url)
print(resp)

htmlcont=resp.content
htmlcont

soup=bs(htmlcont,"html.parser")

title=soup.find("title")
title

title.get_text()

stock_table=soup.find("tbody")
stock_table

for i in stock_table.find_all("tr"):
    print(i)

# Collect the stock symbol from each table row
symbol=[]
for i in stock_table.find_all("tr"):
    sym=i.find("td",attrs={"aria-label":"Symbol"})
    symbol.append(sym.text)
symbol

print(type(symbol))

for i in stock_table.find_all("tr"):
    print(i.prettify())

# Collect the company name from each table row
name=[]
for i in stock_table.find_all("tr"):
    nam=i.find("td",attrs={"aria-label":"Name"})
    name.append(nam.text)
name

# Collect the intraday price from each table row
price=[]
for i in stock_table.find_all("tr"):
    pr=i.find("td",attrs={"aria-label":"Price (Intraday)"})
    price.append(pr.text)
price

# Collect the absolute change from each table row
change=[]
for i in stock_table.find_all("tr"):
    chan=i.find("td",attrs={"aria-label":"Change"})
    change.append(chan.text)
change

# Collect the percentage change; use a separate loop variable so the
# 'change' list built above is not overwritten
per_change=[]
for i in stock_table.find_all("tr"):
    pchan=i.find("td",attrs={"aria-label":"% Change"})
    per_change.append(pchan.text)
per_change

data={"Symbol":symbol,
      "Name":name,
      "Price(Intraday)":price,
      "% Change":per_change
      }

data

import pandas as pd
df=pd.DataFrame(data)
df

df.to_excel('Web Scraping.xlsx', index=False)

Problem Statement Terminology Theory Code Input Output Conclusion

url="https://finance.yahoo.com/most-active"

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions
The script scrapes the most-active stocks table from the Yahoo Finance website, extracting each stock's symbol, name,
intraday price, and percentage change, and organizes this data into a DataFrame. The DataFrame is then printed and
saved as an Excel file named 'Web Scraping.xlsx'.

Experiential Learning 2
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Study and Visualization of Different Probability Distributions
Objectives To understand the concepts of various probability distributions. To generate and
visualize different types of probability distributions using Python. To compare
the shapes and properties of different distributions.
Learning Outcomes Gain a solid understanding of fundamental probability distributions and their
applications. Be able to generate and visualize common distributions (e.g.,
Normal, Binomial, Poisson) using Python.
Problem Statement Terminology Theory Code Input Output Conclusion
To generate the plot for all the probability distributions and get the overall idea about distribution.

Problem Statement Terminology Theory Code Input Output Conclusion


Probability Distribution: A function that describes the likelihood of different outcomes in a random experiment.
Mean (μ): The average or expected value of the distribution.
Variance (σ²): A measure of how spread out the values are around the mean.
Standard Deviation (σ): The square root of variance, indicating the dispersion of a dataset.
KDE (Kernel Density Estimate): A method for estimating the probability density function of a random variable.
Uniform Distribution: A distribution where all outcomes are equally likely.
Normal Distribution (Gaussian): A continuous probability distribution with a bell-shaped curve.
Binomial Distribution: Describes the number of successes in a fixed number of independent binary trials.
Poisson Distribution: Describes the probability of a given number of events happening in a fixed interval.
Exponential Distribution: Describes the time between events in a Poisson process.
Gamma Distribution: A two-parameter distribution often used in queuing models.
Beta Distribution: Models outcomes in the interval [0, 1] and is used in Bayesian statistics.
Chi-Square Distribution: Used in hypothesis testing, especially for variance analysis.
Log-Normal Distribution: Describes a random variable whose logarithm is normally distributed.
Student's t-Distribution: Similar to the normal distribution but with heavier tails, used when sample sizes are
small.

Problem Statement Terminology Theory Code Input Output Conclusion


Uniform Distribution: It is characterized by having a constant probability across a defined range. It is used to
model situations where every outcome has an equal chance.
Normal Distribution (Gaussian): It is widely used in statistics and is symmetric about the mean. Its key properties
include being fully described by two parameters: mean (μ) and standard deviation (σ).
Binomial Distribution: It models the number of successes in a fixed number of independent trials of a binary
experiment (e.g., flipping a coin). It is defined by two parameters: the number of trials (n) and the probability of
success (p).
Poisson Distribution: It is useful for modeling the number of events that occur in a fixed interval of time or
space, where the events occur independently.
Exponential Distribution: It models the time between events in a Poisson process, such as the time until a
radioactive particle decays.

Gamma Distribution: It generalizes the exponential distribution and is used for waiting times between events in
Poisson processes.
Beta Distribution: This is often used in Bayesian statistics and is defined on the interval [0, 1], making it useful
for modeling proportions.
Chi-Square Distribution: This is commonly used in hypothesis testing, especially in tests of independence and
goodness-of-fit.
Student's t-Distribution: Used in situations where the sample size is small, and the population standard deviation
is unknown.
Log-Normal Distribution: This distribution is used when the variable in question is a product of many
independent random variables.

Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, uniform, binom, poisson, expon, gamma, beta, chi2, t, lognorm

# Set the style for seaborn


sns.set(style="whitegrid")

# Create a figure to hold the subplots (4 rows and 3 columns)


fig, axes = plt.subplots(4, 3, figsize=(18, 16))
fig.suptitle('Different Probability Distributions', fontsize=20)

# 1. Uniform Distribution
uniform_data = np.random.uniform(low=0, high=10, size=1000)
sns.histplot(uniform_data, kde=True, color='b', ax=axes[0, 0])
axes[0, 0].set_title('Uniform Distribution')

# 2. Normal (Gaussian) Distribution


normal_data = np.random.normal(loc=0, scale=1, size=1000)
sns.histplot(normal_data, kde=True, color='g', ax=axes[0, 1])
axes[0, 1].set_title('Normal Distribution')

# 3. Binomial Distribution
n, p = 10, 0.5
binomial_data = np.random.binomial(n, p, 1000)
sns.histplot(binomial_data, kde=False, color='r', ax=axes[0, 2])
axes[0, 2].set_title('Binomial Distribution')

# 4. Poisson Distribution
poisson_data = np.random.poisson(lam=3, size=1000)
sns.histplot(poisson_data, kde=False, color='y', ax=axes[1, 0])
axes[1, 0].set_title('Poisson Distribution')

# 5. Exponential Distribution
exponential_data = np.random.exponential(scale=1, size=1000)
sns.histplot(exponential_data, kde=True, color='m', ax=axes[1, 1])
axes[1, 1].set_title('Exponential Distribution')

# 6. Gamma Distribution
gamma_data = np.random.gamma(shape=2, scale=1, size=1000)
sns.histplot(gamma_data, kde=True, color='c', ax=axes[1, 2])
axes[1, 2].set_title('Gamma Distribution')

# 7. Beta Distribution
a, b = 2, 5
beta_data = np.random.beta(a, b, size=1000)
sns.histplot(beta_data, kde=True, color='orange', ax=axes[2, 0])
axes[2, 0].set_title('Beta Distribution')

# 8. Chi-Square Distribution
df = 2
chi_square_data = np.random.chisquare(df, size=1000)
sns.histplot(chi_square_data, kde=True, color='purple', ax=axes[2, 1])
axes[2, 1].set_title('Chi-Square Distribution')

# 9. Student's t-Distribution
df = 10
t_data = np.random.standard_t(df, size=1000)
sns.histplot(t_data, kde=True, color='teal', ax=axes[2, 2])
axes[2, 2].set_title("Student's t-Distribution")

# 10. Log-Normal Distribution


lognormal_data = np.random.lognormal(mean=0, sigma=1, size=1000)
sns.histplot(lognormal_data, kde=True, color='brown', ax=axes[3, 0])
axes[3, 0].set_title('Log-Normal Distribution')

# Hide empty subplots


axes[3, 1].axis('off')
axes[3, 2].axis('off')

# Adjust layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
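
The scipy.stats objects imported at the top of this code are not used above; as a small optional extension (assuming the same np, plt, sns, and norm names are still in scope), the theoretical density can be overlaid on a sampled histogram for comparison:

# Optional: compare a sampled histogram against the theoretical N(0, 1) density
x = np.linspace(-4, 4, 200)
plt.figure(figsize=(8, 5))
sns.histplot(np.random.normal(loc=0, scale=1, size=1000), stat='density', color='g')
plt.plot(x, norm.pdf(x, loc=0, scale=1), 'k--', label='theoretical N(0, 1) pdf')
plt.legend()
plt.show()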

Problem Statement Terminology Theory Code Input Output Conclusion


Input Data

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


Therefore, the study of probability distributions is successfully completed.


Experiential Learning 3
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Plotting Normal Distribution and Performing Normality Test using Q-Q Plot
Objectives To understand the concept of a normal distribution and its significance in statistical
analysis.
Learning Outcomes Be able to generate and visualize a normal distribution using Python. Understand
how Q-Q plots are used to check the normality of data.
Problem Statement Terminology Theory Code Input Output Conclusion
How can we visualize a normal distribution and assess whether a dataset is normally distributed using a Q-Q
plot and statistical tests?

Problem Statement Terminology Theory Code Input Output Conclusion


Normal Distribution (Gaussian Distribution): A continuous probability distribution that is symmetric about
the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. It is
often referred to as the bell curve.
KDE (Kernel Density Estimate): A non-parametric way to estimate the probability density function of a
random variable.
Q-Q Plot (Quantile-Quantile Plot): A graphical technique to determine if two datasets come from
populations with a common distribution. In this experiment, it is used to assess normality.
Shapiro-Wilk Test: A test to assess whether a given sample comes from a normally distributed population.
P-Value: The probability of observing a test statistic at least as extreme as the one observed, assuming the
null hypothesis is true.
Null Hypothesis (H0): The assumption that the data follows a normal distribution. In the Shapiro-Wilk test,
rejecting the null hypothesis suggests the data is not normally distributed.
Alpha (α): The significance level, often set at 0.05. If the p-value is less than α, the null hypothesis is
rejected.

Problem Statement Terminology Theory Code Input Output Conclusion


A normal distribution is one of the most widely used probability distributions in statistics. It has the following
key properties:
• Symmetry: It is symmetric around its mean (μ).
• Bell-shaped curve: The shape of the distribution is bell-like.
• Empirical Rule: In a normal distribution, approximately 68% of the data lies within one standard deviation
from the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
A Q-Q plot is a scatter plot that compares the quantiles of the sample data with the quantiles of a theoretical
distribution (in this case, a normal distribution). If the data is normally distributed, the points should fall along
a straight line. Deviations from the line indicate departures from normality.
A Shapiro-Wilk test is a statistical test used to assess the normality of the data. It calculates a test statistic and
a p-value:
• If p > 0.05: The data is likely to follow a normal distribution (fail to reject H0).
• If p < 0.05: The data does not follow a normal distribution (reject H0).

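A quick numerical check of the empirical rule stated above can be done on a large random sample (an illustrative sketch assuming only NumPy; the exact fractions vary slightly from run to run):

import numpy as np

data = np.random.normal(loc=0, scale=1, size=100000)
for k in (1, 2, 3):
    frac = np.mean(np.abs(data) <= k)   # fraction within k standard deviations of the mean
    print(f"within {k} standard deviation(s): {frac:.3f}")
# expected roughly 0.683, 0.954 and 0.997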

Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
sns.set(style="whitegrid")

mu, sigma = 0, 1
normal_data = np.random.normal(mu, sigma, 1000)

plt.figure(figsize=(10, 6))
sns.histplot(normal_data, kde=True, color='blue')
plt.title('Normal Distribution (μ=0, σ=1)', fontsize=15)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Perform Normality Test using Q-Q Plot


plt.figure(figsize=(8, 8))
sm.qqplot(normal_data, line ='45')
plt.title('Q-Q Plot for Normal Distribution', fontsize=15)
plt.grid(True)
plt.show()

# Shapiro-Wilk Test for Normality (optional)


stat, p = stats.shapiro(normal_data)
print(f'Statistics={stat:.3f}, p-value={p:.3f}')

# Interpret the p-value


alpha = 0.05
if p > alpha:
    print('Sample looks normal (fail to reject H0)')
else:
    print('Sample does not look normal (reject H0)')

Problem Statement Terminology Theory Code Input Output Conclusion


Random Input

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


Therefore, the study of the normal distribution and the Q-Q plot is successfully completed.


Experiential Learning 4
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Exploratory Data Analysis
Objectives The objective of this lab is to introduce students to the concept of Exploratory Data
Analysis (EDA). Students will learn how to use various tools and techniques to
understand the underlying structure of a dataset, identify patterns, spot anomalies,
test hypotheses, and check assumptions through visual and quantitative methods.
Learning Outcomes Understand the importance of EDA, EDA techniques such as Outlier handling,
missing data handling, transformation, etc.
Problem Statement Terminology Theory Code Input Output Conclusion
You are provided with a dataset containing various attributes related to customer information for a retail
company. The goal is to perform an Exploratory Data Analysis to uncover insights about customer behavior,
identify key factors that influence sales, and highlight any potential issues within the dataset that need to be
addressed before proceeding to further analysis.

Problem Statement Terminology Theory Code Input Output Conclusion


Z Score = (x − x̅ )/σ
IQR = Q3 - Q1
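
These two formulas are the usual basis for flagging outliers during EDA; a minimal sketch (using a hypothetical NumPy sample, not data from the experiment) shows both rules side by side:

import numpy as np

# Hypothetical sample: 200 typical values plus two extreme ones
values = np.append(np.random.normal(loc=50, scale=5, size=200), [120, 130])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)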

Problem Statement Terminology Theory Code Input Output Conclusion


Data Cleaning: Identifying and dealing with missing data, outliers, and errors.
Hypothesis Generation: Formulating hypotheses that can be tested using statistical methods.
Normalization: The process of scaling data into a standard range, often between 0 and 1.
Pandas: A Python library widely used for data manipulation and analysis.
Matplotlib/Seaborn: Python libraries used for data visualization.
Data Distribution: The way in which data points are spread across different values or ranges.
Outliers: Data points that deviate significantly from other observations, potentially indicating variability in the
measurement or an error.
Missing Data: Instances where no data value is stored for a variable in an observation.
Correlation: A measure that describes the extent to which two variables are linearly related.
Box Plot: A graphical representation of data that shows the distribution through their quartiles and highlights
outliers.
Histogram: A type of bar chart that represents the frequency distribution of a dataset.
Scatter Plot: A graph that uses dots to represent the values of two different variables, often used to observe
relationships.

Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np # Numerical Python for Linear Algebra
from numpy import array # NumPy for creating Array
from numpy import argmax # NumPy for argmax to get indices of maximum element
import pandas as pd # For Data Manipulation, Indexing, Slicing, Chopping
import matplotlib.pyplot as plt # For Data Visualization
import seaborn as sns # For Data Visualization, Support for Matplotlib


from sklearn.preprocessing import LabelEncoder # To normalize the data


from sklearn.preprocessing import OneHotEncoder # For Categorical data to represent in Binary form
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
#from wordcloud import WordCloud, STOPWORDS
import numpy as npy
from PIL import Image
from sklearn import linear_model, metrics

df = pd.read_excel(r'Netflix Dataset.xlsx')
df

df.shape
df.describe()
df.head(10)
df.tail(10)

df.isnull().values.any()
df.isnull().sum()

sns.heatmap(df.isnull(), yticklabels=False, cmap="viridis")

df.drop(['Title', 'Tags', 'View Rating', 'Release Date', 'Netflix Release Date', 'Production House',
         'Netflix Link', 'IMDb Link', 'Metacritic Score', 'Boxoffice', 'Summary', 'Poster', 'TMDb Trailer',
         'Trailer Site'], axis=1, inplace=True)

sns.heatmap(df.isnull(), yticklabels=False, cmap="Paired")

df.dropna(subset=['Director'], inplace=True)
df.dropna(subset=['Writer'], inplace=True)
df.dropna(subset=['Languages'], inplace=True)
df.dropna(subset=['Actors'], inplace=True)
df.dropna(subset=['Country Availability'], inplace=True)
df.dropna(subset=['Genre'], inplace=True)

df.isna().sum()

sns.heatmap(df.isnull(), yticklabels=False, cmap="tab20c")


df['Hidden Gem Score'] = df['Hidden Gem Score'].ffill()   # forward-fill missing scores
df['IMDb Score'] = df['IMDb Score'].bfill()               # backward-fill missing scores
df['IMDb Votes'] = df['IMDb Votes'].fillna(df['IMDb Votes'].mode()[0])
df.isna().sum()
sns.heatmap(df.isnull(), yticklabels=False, cmap="autumn")

df['Rotten Tomatoes Score'].value_counts().sum()


df['Rotten Tomatoes Score'].isna().sum()
df['Rotten Tomatoes Score'].value_counts().sum() - df['Rotten Tomatoes Score'].isna().sum()
df['Rotten Tomatoes Score'].fillna(df['Rotten Tomatoes Score'].mean(), inplace= True)
df['Awards Received'] = df['Awards Received'].fillna(0)
df['Awards Nominated For'] = df['Awards Nominated For'].replace(np.nan, 0)

df.rename(columns={'Awards Nominated For': 'Award Nominated'}, inplace=True)  # rename by label rather than writing into columns.values
sns.heatmap(df.isnull(), yticklabels=False, cmap="Accent")
df.reset_index(inplace = True)

df

df.rename(columns = {'Series or Movie':'Series_Movie'}, inplace = True)

df.shape

pd.get_dummies(df.Series_Movie)
dataa=pd.concat([df, pd.get_dummies(df.Series_Movie)], axis=1)

p1=df['Series_Movie'].value_counts()
colors = ('violet', 'c')
plt.pie(p1, labels=p1.index, explode=(0.5, 0), autopct='%.1f%%', shadow = True, colors = colors)
plt.title('Series & Movie')
plt.show()

p2 = df['Director'].sort_values(ascending=True).value_counts().head(20)
explode = (0.1, 0.175, 0.2, 0.3, 0.05, 0.05, 0.275, 0.2, 0.26, 0.3, 0.1, 0.15, 0.19, 0.1, 0.15, 0.3, 0.25,
0.15, 0.16, 0.2)
colors = ('b', 'c', 'r', 'g', 'lime', 'khaki', 'pink', 'olive', 'gray', 'orange')
wp = { 'linewidth' : 1, 'edgecolor' : "green" }
def func(pct, allvalues):
    absolute = int(pct / 100.*np.sum(allvalues))
    return "{:.1f}".format(pct, absolute)
fig, ax = plt.subplots(figsize =(15, 12))
wedges, texts, autotexts = ax.pie(p2,
    autopct = lambda pct: func(pct, p2), explode = explode, labels = p2.index,
    shadow = True, colors = colors, startangle = 90, wedgeprops = wp,
    textprops = dict(color ="deeppink"))
ax.legend(wedges, p2, title ="Director", loc ="center left", bbox_to_anchor =(1.15, 0.1, 0.5, 1))
plt.setp(autotexts, size = 10, weight ="bold")
plt.title("Top 20 Director")
plt.show()

df.groupby('Director')['Director']
p3 = pd.DataFrame(df['Languages'].sort_values(ascending=True).value_counts().head(20))
p3

xs = p2.index
labels = p2
plt.bar(xs,p2.values, color='red')
plt.xlabel("Director")
plt.ylabel("Counts")
plt.xticks(rotation=90)

p5 = pd.DataFrame(df['Actors'].sort_values(ascending=True).value_counts().head(20))


# Hard-coded sample of lead-actor names (repeated cast lists), written compactly;
# equivalent to the original flat list of 208 names.
p6 = (['Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson'] * 12
      + ['Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki'] * 12
      + ['Isha Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal'] * 11
      + ['Finn Wolfhard', 'Jaeden Martell', 'Sophia Lillis', 'Jeremy Ray Taylor'] * 6
      + ['Chris Pine', 'Dale Dickey', 'Ben Foster', 'William Sterchi'] * 6
      + ['Eric Idle', 'Terry Gilliam', 'John Cleese', 'Graham Chapman'] * 5)

from wordcloud import WordCloud, STOPWORDS


# Keep the actor list in its own variable so the Netflix DataFrame 'df' is not overwritten
actors = p6

comment_words = ''
stopwords = set(STOPWORDS)
for val in actors:
    val = str(val)
    tokens = val.split()
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens)+" "
wordcloud = WordCloud(width = 800, height = 800,
                      background_color ='white',
                      stopwords = stopwords,
                      min_font_size = 10).generate(comment_words)
plt.figure(figsize = (8, 8), facecolor = None)


plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

p7 = pd.DataFrame(dataa['Country Availability'].sort_values(ascending=True).value_counts().head(20))
p7
dataa['IMDb Score'].plot()

import pandas as pd
import plotly as py
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)

# Build a choropleth of the 20 most common country-availability values,
# using the counts as the colour value (z)
country_counts = dataa['Country Availability'].sort_values(ascending=True).value_counts().head(20)
ata = dict(type = 'choropleth',
           locations = country_counts.index.tolist(),
           locationmode = 'country names',
           z = country_counts.values,
           colorscale = 'Picnic',
           colorbar = {'title':'Country Colours'})
layout = dict(geo={'scope':'world'})
import plotly.graph_objs as go
chmap = go.Figure(data=[ata],layout=layout)
iplot(chmap)

dataa['Runtime'].value_counts()

dataa.replace(to_replace ="1-2 hour", value =1.5, inplace=True)


dataa.replace(to_replace ="< 30 minutes", value =0.5, inplace=True)
dataa.replace(to_replace ="30-60 mins", value =0.5, inplace=True)
dataa.replace(to_replace ="> 2 hrs", value =2, inplace=True)

i2=dataa['Runtime'].value_counts()
i2

dataa['Runtime'] = dataa['Runtime'].astype(float)

s = pd.Series([np.random.randint(1,100) for i in range(1,100)])


p = s.plot(kind='hist', color='r', alpha=0.5)
gfg = pd.Series(dataa['Runtime'])
gfg.plot(kind='kde')
plt.show()

sns.distplot(dataa["Hidden Gem Score"], hist=True, kde=True, rug=False )


sns.pairplot(dataa)
plt.show()

dataa.columns


plt.figure(figsize=(12,8))
sns.boxplot(dataa["IMDb Score"])

plt.figure(figsize=(12,8))
sns.boxplot(dataa["Awards Received"])

p21 = dataa['Genre'].sort_values(ascending=True).value_counts().head(20)

sns.boxplot(dataa["Director"])
plt.xticks(rotation=90)
plt.figure(figsize=(30, 30))

sns.violinplot(dataa["Director"])

from PIL import Image


import requests
from io import BytesIO
url = "https://occ-0-56-55.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-
QjKc/AAAABTptq7lNjL6UKqv8dT80V5Tc2Hck6ltdkW6z5JMk4e2kLjyipJjDxNCGghdN1Hi_V3sbVlgDqQXT8G
XMgDMhM6aHEw.jpg?r=053"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img

Problem Statement Terminology Theory Code Input Output Conclusion


Netflix International.xlsx: https://docs.google.com/spreadsheets/d/1ndGvsYZN-tDG5lbpdSSjmtXerqgpMeFx/edit?usp=drive_link&ouid=110963070738200462224&rtpof=true&sd=true

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


The dataset is successfully cleaned, transformed, and visualized. The data preprocessing and data wrangling
steps help prepare the data for further analysis or machine learning tasks. The script handles missing values,
drops unneeded columns, corrects data types, performs feature engineering, and provides basic exploratory data
analysis through visualizations.

Experiential Learning 5
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Home Price Prediction
Objectives To understand the concept and working of linear regression in supervised machine
learning. To predict home prices based on various independent features using linear
regression. To evaluate the performance of the linear regression model using
metrics such as Mean Squared Error (MSE) and R-squared.
Learning Outcomes Gain an understanding of how linear regression can be used to predict continuous
variables. Understand how to interpret the coefficients of a linear regression model
and how features affect the target variable.
Problem Statement Terminology Theory Code Input Output Conclusion
To use Linear Regression to predict house price.

Problem Statement Terminology Theory Code Input Output Conclusion


Linear Regression: A supervised learning algorithm used to model the relationship between a dependent
variable (target) and one or more independent variables (features) by fitting a linear equation.
Dependent Variable: The variable that you are trying to predict (in this case, home price).
Independent Variables: The input features used to predict the target variable (e.g., number of bedrooms,
square footage, location).
Mean Squared Error (MSE): A common evaluation metric for regression models, calculated as the average
of the squared differences between the predicted and actual values.
R-squared (R²): A statistical measure that explains the proportion of variance in the dependent variable
explained by the independent variables.
Training Set: The portion of the dataset used to train the machine learning model.
Test Set: The portion of the dataset used to evaluate the performance of the trained model.
Feature Scaling: Normalizing or standardizing independent variables so they are on the same scale.

Problem Statement Terminology Theory Code Input Output Conclusion


Linear regression models the relationship between the dependent variable and one or more independent
variables using a straight line (in the case of simple linear regression) or a hyperplane (in the case of multiple
regression).
The general form of the linear regression equation is:
Y=β0+β1X1+β2X2+⋯+βnXn+ϵ

Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)
data.head()

X = data.drop("medv", axis=1)
y = data["medv"]

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns


categorical_features = X.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='mean'), numeric_features),
('cat', OneHotEncoder(), categorical_features)
])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('scaler', StandardScaler())])

X_preprocessed = pipeline.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-Squared: {r2}")

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs Predicted Home Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
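
The learning outcomes mention interpreting the regression coefficients; the short optional sketch below (assuming the pipeline and model defined above have already been fitted) lists each coefficient. For this dataset there are no categorical columns, so the coefficient order matches the numeric features, and because the features were standardized, each value is the change in predicted price per one standard deviation increase in that feature.

# Optional: inspect the learned coefficients of the fitted model
feature_names = list(numeric_features) + list(categorical_features)
coef_table = pd.DataFrame({"feature": feature_names, "coefficient": model.coef_})
print(coef_table.sort_values("coefficient", key=abs, ascending=False))  # sort by absolute size (pandas >= 1.1)
print("Intercept:", model.intercept_)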

Problem Statement Terminology Theory Code Input Output Conclusion
https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


Thus, the study of Linear Regression for House Price Prediction is successfully done.


Experiential Learning 6
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Medical diagnosis for disease spread pattern
Objectives To implement Logistic Regression to analyze and predict disease spread patterns.
The aim is to understand how Logistic Regression can model the probability of
disease occurrence based on input features and assess its accuracy in medical
diagnosis.
Learning Outcomes Understand the basic concepts of Logistic Regression and its application in binary
classification problems. Apply Logistic Regression to predict disease spread patterns
using real-world datasets.
Problem Statement Terminology Theory Code Input Output Conclusion
Given a dataset containing various health-related features of patients, the task is to predict whether a patient is
likely to be infected with a certain disease. The model should classify the patients as either infected or not
infected based on the input features. You are required to use Logistic Regression to build the model and evaluate
its effectiveness.

Problem Statement Terminology Theory Code Input Output Conclusion


Logistic Regression: A supervised machine learning algorithm used for binary classification tasks. It models
the probability of the dependent variable belonging to one of two classes.
Sigmoid Function: The logistic function that maps predicted values to a probability between 0 and 1. The
formula is: P(y=1 | X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + ⋯ + βnXn)))
Binary Classification: Classification of data into two categories (e.g., diseased or healthy).
Confusion Matrix: A table used to evaluate the performance of a classification model by comparing predicted
and actual values.
Precision: The proportion of true positive predictions out of all positive predictions (i.e., the accuracy of the
positive predictions).
Recall (Sensitivity): The proportion of actual positives that were correctly identified by the model.
F1 Score: The harmonic mean of precision and recall.
ROC-AUC (Receiver Operating Characteristic - Area Under Curve): A metric used to evaluate how well a
model distinguishes between classes, with values closer to 1 indicating better performance.

Problem Statement Terminology Theory Code Input Output Conclusion


Logistic regression is a classification algorithm used to predict binary outcomes (e.g., disease present vs.
disease absent) based on one or more independent variables (e.g., age, symptoms). Instead of fitting a straight
line (like in linear regression), logistic regression fits an S-shaped curve called the logistic function.
The logistic function ensures that the output is between 0 and 1, which can be interpreted as a probability:
• 0 indicates no disease (healthy).
• 1 indicates the presence of disease.
The goal is to estimate the probability that a patient has a disease given their medical features. The logistic
regression model outputs this probability, which can be used to classify patients as either healthy or diseased
using a decision threshold (typically 0.5).
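
As a small illustration of this decision rule (a sketch with made-up scores, separate from the recorded program), the sigmoid can be applied to a linear score and thresholded at 0.5:

import numpy as np

def sigmoid(z):
    # maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.3, 1.7])   # hypothetical values of b0 + b1*x1 + ... for three patients
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)   # 1 = diseased, 0 = healthy at the 0.5 threshold
print(probs, labels)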


Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigree', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)

data.head()

X = data.drop("Outcome", axis=1) # 'Outcome' is the target column representing disease (1) or no disease (0)
y = data["Outcome"]

numeric_features = X.columns

preprocessor = ColumnTransformer(
transformers=[('num', SimpleImputer(strategy='mean'), numeric_features)])
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('scaler', StandardScaler())])

X_preprocessed = pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

accuracy = accuracy_score(y_test, y_pred)


precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)


roc_auc = roc_auc_score(y_test, y_pred_prob)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC: {roc_auc}")

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)


plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

Problem Statement Terminology Theory Code Input Output Conclusion


"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

Problem Statement Terminology Theory Code Input Output Conclusion


Problem Statement Terminology Theory Code Input Output Conclusion


Hence, the study is successfully completed.


Experiential Learning 7
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Customer Segmentation
Objectives To apply customer segmentation techniques by analyzing demographic,
psychographic, and behavioral data using Python. The goal is to cluster customers
into distinct groups based on their similarities, which can help businesses in targeting
the right customers with personalized marketing strategies and improving customer
retention.
Learning Outcomes Understand and apply different types of customer segmentation, including
demographic, psychographic, and behavioral segmentation. Implement K-Means
clustering and use the Elbow Method to determine the optimal number of clusters.
Problem Statement Terminology Theory Code Input Output Conclusion
Customer segmentation for a business based on the customers' demographic, psychographic, and behavioural data.

Problem Statement Terminology Theory Code Input Output Conclusion


Customer Segmentation: The process of dividing a customer base into distinct groups that share common
characteristics, such as demographics or behaviors.
Demographic Segmentation: Dividing customers based on variables such as age, gender, income, and
education level.
Psychographic Segmentation: Grouping customers based on their lifestyles, values, interests, or attitudes.
Behavioral Segmentation: Categorizing customers based on their purchasing habits, spending behavior, and
brand loyalty.
K-Means Clustering: A machine learning algorithm that partitions data into k distinct, non-overlapping
groups or clusters.
Elbow Method: A technique to determine the optimal number of clusters by plotting the within-cluster sum
of squares (WCSS) for different cluster counts and identifying the "elbow point."
Silhouette Score: A measure of how similar an object is to its own cluster compared to other clusters. Higher
scores indicate better-defined clusters.
Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a
lower-dimensional space while retaining most of the variance.
Centroid: The center of a cluster, representing the mean values of the data points in that cluster.
Standardization: The process of scaling data to have a mean of 0 and a standard deviation of 1, ensuring all
features contribute equally to the analysis.

Problem Statement Terminology Theory Code Input Output Conclusion


K-Means Clustering Algorithm (see the sketch after this list):
• Initialization: The algorithm begins by randomly selecting k centroids (or cluster centers) in the dataset.
• Assignment: Each data point is assigned to the nearest centroid based on the Euclidean distance.
• Update: After assigning all data points to a cluster, the centroid of each cluster is recalculated by
averaging the positions of all data points within the cluster.
• Repeat: Steps 2 and 3 are repeated until the centroids no longer change significantly (convergence).

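The steps listed above can be traced with a minimal NumPy sketch on toy two-dimensional data (illustrative only; the experiment itself uses scikit-learn's KMeans):

import numpy as np

np.random.seed(0)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])   # toy data
centroids = X[np.random.choice(len(X), size=2, replace=False)]   # initialization: pick 2 random points

for _ in range(10):                                               # repeat until convergence
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                                 # assignment step: nearest centroid
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break                                                     # centroids stopped moving
    centroids = new_centroids                                     # update step

print(labels, centroids)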
Problem Statement Terminology Theory Code Input Output Conclusion
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("customer_data.csv")
print("Dataset preview:\n", df.head())

df = df.drop('country', axis=1)

df['education'] = df['education'].fillna('Unknown')
df['education'] = df['education'].map({'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3, 'Unknown': 4})

df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

df.isnull().sum()

X = df[['age', 'gender', 'education', 'income', 'purchase_frequency', 'spending']]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-', color='b')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS (Within-cluster Sum of Squares)')
plt.show()

optimal_clusters = 4

kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10,
                random_state=42)
df['segment'] = kmeans.fit_predict(X_scaled)

sil_score = silhouette_score(X_scaled, df['segment'])


print(f'Silhouette Score: {sil_score}')

cluster_centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=X.columns)
cluster_centers['Cluster'] = range(optimal_clusters)
print("\nCluster Centers:\n", cluster_centers)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['segment'], cmap='viridis', s=100)
plt.title(f'Customer Segmentation (n_clusters={optimal_clusters})')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Segment')
plt.show()

print("\nSegmented customer data:\n", df[['name', 'age', 'income', 'purchase_frequency', 'spending',


'segment']].head())

Problem Statement Terminology Theory Code Input Output Conclusion


‘customer_data.csv’:
https://drive.google.com/file/d/1o0UdOspy6RDimSWqBtgv6hcCFrXCnboQ/view?usp=sharing

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


Customer segmentation is a powerful tool in business analytics, allowing companies to identify distinct groups
of customers and tailor their strategies accordingly. By analyzing demographic, psychographic, and behavioral
data, businesses can improve customer engagement, increase sales, and optimize marketing campaigns. Using
machine learning techniques like K-Means clustering, businesses can automate and enhance their
segmentation efforts, making data-driven decisions with greater confidence.


Experiential Learning 8
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Customer churn classification using decision tree and random forest on telecom data
Objectives The primary objective of this lab is to implement Decision Tree and Random Forest
classifiers to predict customer churn in a telecom dataset. You will preprocess the
data, train the models, evaluate their performance, and compare their accuracy using
various metrics such as confusion matrix, classification report, and ROC-AUC curve.
Learning Outcomes Understand how to handle real-world datasets by performing data preprocessing
including handling missing values, encoding categorical variables, and feature
scaling. Identify and visualize feature importance to understand the impact of
different variables on customer churn prediction.
Problem Statement Terminology Theory Code Input Output Conclusion
The telecom industry faces a significant challenge with customer churn, where users leave for competitors,
impacting revenue. The objective is to build a predictive model using Decision Tree and Random Forest
algorithms to identify customers likely to churn. This model will enable the company to take proactive
retention measures, improving customer satisfaction and reducing churn rates.

Problem Statement Terminology Theory Code Input Output Conclusion


Customer Churn in Telecom: Telecom companies lose significant revenue due to customer churn. Predicting
which customers are likely to leave helps businesses take retention measures, such as offering discounts, better
service, or targeted campaigns.
Decision Tree Algorithm: Decision trees are simple yet powerful for classification tasks. The model splits data
based on feature values that maximize the separation between classes (using criteria like Gini impurity or
entropy).
Random Forest Algorithm: Random Forest addresses the problem of overfitting that may occur in a single
Decision Tree. It builds multiple trees using random samples and features, improving both accuracy and
robustness.
Evaluation Metrics: Since telecom churn prediction often deals with imbalanced datasets, precision, recall,
and ROC-AUC are critical in addition to accuracy.
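To make the Gini splitting criterion mentioned above concrete, a short sketch (with made-up class counts, not taken from the telecom dataset) computes the impurity of a node and of a candidate split:

import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])   # hypothetical node: 6 churners, 4 non-churners
left = np.array([1, 1, 1, 1, 1, 0])                  # candidate split: left child
right = np.array([1, 0, 0, 0])                       # candidate split: right child

weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted)   # the split is worthwhile if the weighted impurity is lower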

Problem Statement Terminology Theory Code Input Output Conclusion


Customer Churn: The process when customers stop using a company’s service or product. In a telecom setting,
churn indicates that a customer has left the service provider.
Decision Tree: A tree-like model used for classification and regression tasks, where each internal node
represents a feature, each branch represents a decision, and each leaf node represents an outcome.
Random Forest: An ensemble of Decision Trees, where each tree is trained on a random subset of the data.
The final prediction is based on the majority vote from all trees.
Confusion Matrix: A table that describes the performance of a classification model by comparing predicted
and actual labels.
Precision and Recall: Precision is the ratio of true positive predictions to all positive predictions. Recall is the
ratio of true positives to all actual positives.
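
As a quick illustration of these definitions, precision and recall can be read directly from a 2x2 confusion matrix; the labels below are a made-up toy example, not values from the Telco dataset:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual churn labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predicted churn labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Precision:", tp / (tp + fp))   # 3 / 4 = 0.75
print("Recall:   ", tp / (tp + fn))   # 3 / 4 = 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))   # same values via scikit-learn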


Problem Statement Terminology Theory Code Input Output Conclusion


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv(r"Telco-Customer-Churn.csv")

df.dtypes

# TotalCharges is read as text; coerce it to numeric and impute missing values with the column mean
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].fillna(df['TotalCharges'].mean(), inplace=True)

# Encode every categorical column as integer labels (a fresh fit per column)
label_enc = LabelEncoder()
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                    'PaperlessBilling', 'PaymentMethod', 'Churn']
for col in categorical_cols:
    df[col] = label_enc.fit_transform(df[col])

# Features and target: drop the label and the customerID identifier from the feature matrix
X = df.drop(['Churn', 'customerID'], axis=1)
y = df['Churn']

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

print("Decision Tree Classifier Report:\n", classification_report(y_test, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))

# Train a Random Forest of 100 trees for comparison
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)


rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Classifier Report:\n", classification_report(y_test, y_pred_rf))


print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

feat_importances = pd.Series(rf_model.feature_importances_, index=X.columns)


feat_importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Important Features - Random Forest')
plt.show()

# ROC curve and AUC for the Random Forest, based on predicted churn probabilities
y_pred_prob_rf = rf_model.predict_proba(X_test)[:,1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_prob_rf)

plt.figure(figsize=(6,4))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest AUC = {roc_auc_rf:.2f}', color='darkorange')
plt.plot([0, 1], [0, 1], linestyle='--', color='navy')  # chance-level diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend(loc='lower right')
plt.show()

test_results = X_test.copy()
test_results['Actual_Churn'] = y_test
test_results['Predicted_Churn'] = y_pred_rf
test_results['Churn_Probability'] = y_pred_prob_rf
print("Final Churn Predictions on Test Set:\n")
print(test_results)
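
For a direct AUC comparison between the two models, a short optional addition (a sketch that reuses the fitted dt_model and rf_model from above) is:

# Optional: compare the two classifiers by ROC-AUC on the same test set
y_pred_prob_dt = dt_model.predict_proba(X_test)[:, 1]
roc_auc_dt = roc_auc_score(y_test, y_pred_prob_dt)
print(f"Decision Tree AUC : {roc_auc_dt:.3f}")
print(f"Random Forest AUC : {roc_auc_rf:.3f}")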

Problem Statement Terminology Theory Code Input Output Conclusion


Input dataset: Telco-Customer-Churn.csv

Problem Statement Terminology Theory Code Input Output Conclusion


Problem Statement Terminology Theory Code Input Output Conclusions


The Random Forest classifier outperformed the Decision Tree in terms of both accuracy and AUC score,
making it a better choice for predicting customer churn. Additionally, the analysis of feature importance offers
valuable business insights for customer retention strategies. This lab provided hands-on experience with
classification techniques and churn prediction, a crucial application in the telecom industry.


Experiential Learning 9
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Behavioural analysis of online shoppers’ intention for online purchase using the KNN model.
Objectives To perform behavioral analysis of online shoppers to predict their intention for
online purchases using the K-Nearest Neighbors (KNN) classification algorithm.
The focus is on understanding how different shopping behaviors can be used to
classify shoppers' purchase intentions.
Learning Outcomes Understand the working principles of the K-Nearest Neighbors (KNN) algorithm.
Apply KNN for behavioral analysis and classification problems.
Welcome to Data Science Laboratory
Problem Statement Terminology Theory Code Input Output Conclusion
The task is to classify online shoppers based on their behavior and predict whether they intend to make an
online purchase or not. Given a dataset with features such as the number of pages visited, time spent on the
site, bounce rate, and revenue, you are required to build a classification model using KNN to predict shoppers'
purchase intentions.

Problem Statement Terminology Theory Code Input Output Conclusion


K-Nearest Neighbors (KNN): A non-parametric, instance-based learning algorithm used for classification and
regression tasks.
Euclidean Distance: A common distance metric used in KNN, calculated as the straight-line distance between
two points in Euclidean space.
Manhattan Distance: A distance metric where the distance between two points is the sum of the absolute
differences of their coordinates.
Classification: The process of predicting the category or class of a data point from a set of labeled data.
Hyperparameters: Parameters like the value of k in KNN, which must be set before the learning process begins
and optimized for better results.
Confusion Matrix: A matrix used to evaluate the performance of a classification model, displaying the counts
of true positives, true negatives, false positives, and false negatives.
Precision, Recall, F1 Score: Metrics used to evaluate classification models.
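
A small illustrative sketch of the two distance metrics defined above, using made-up points rather than dataset values:

import numpy as np

a = np.array([2.0, 3.0])   # hypothetical feature vector of one session
b = np.array([5.0, 7.0])   # hypothetical feature vector of another session

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance = 5.0
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences = 7.0
print(euclidean, manhattan)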

Problem Statement Terminology Theory Code Input Output Conclusion


K-Nearest Neighbors (KNN) is a simple yet effective machine learning algorithm used primarily for
classification tasks. The algorithm works by finding the k-nearest data points (neighbors) to a given test point
and assigning the class label based on the majority class among these neighbors.
Working Mechanism:
• Feature Selection: KNN relies on a set of features, such as the number of pages visited, bounce rate, and
session duration in the context of online shopping behavior.
• Distance Metric: The algorithm uses a distance metric, typically Euclidean distance or Manhattan distance.
The formula for the Euclidean distance between two points A = (x1, y1) and B = (x2, y2) is:
d(A, B) = √((x2 − x1)² + (y2 − y1)²)
• Choosing K Neighbors: The value of k is a hyperparameter that defines how many neighbors to consider
for the classification. A small value of k might lead to a more sensitive model (prone to noise), while a
large k value might smooth out important patterns.
• Majority Voting: After identifying the k nearest neighbors, the class label is determined based on the
majority class among these neighbors. If most of the k neighbors indicate an intention to purchase, the
test point will be classified as "intending to purchase." (A minimal from-scratch sketch of these steps
follows this list.)
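
The following is a minimal from-scratch sketch of the mechanism described above (distance computation, picking the k nearest neighbours, and majority voting); the toy data is made up for illustration, and the lab itself uses scikit-learn's KNeighborsClassifier:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two behavioural features, label 1 = intends to purchase
X_toy = np.array([[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [5.0, 2.0], [5.2, 1.8]])
y_toy = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([5.1, 1.9]), k=3))   # expected output: 1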


Problem Statement Terminology Theory Code Input Output Conclusion


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore', category = FutureWarning)
warnings.filterwarnings('ignore', category = DeprecationWarning)
%matplotlib inline

online_shoppers_intention = pd.read_csv("online_shoppers_intention.csv")
online_shoppers_intention
online_shoppers_intention.shape
online_shoppers_intention.info()
online_shoppers_intention.describe()
online_shoppers_intention.isnull().sum()

online_shoppers_intention["Month"].value_counts()
duplicateRows = online_shoppers_intention.duplicated().sum()
print("Total number of duplicate rows:", duplicateRows)
online_shoppers_intention.drop_duplicates(inplace=True)

duplicateRows = online_shoppers_intention.duplicated().sum()
print("Total number of duplicate rows:", duplicateRows)

online_shoppers_intention["Revenue"].value_counts()
sns.set(style="darkgrid") #style the plot background to become a grid
sns.countplot(online_shoppers_intention['Revenue'])
plt.ylim(0,12000)
plt.title('Revenue', fontsize= 18)
plt.xlabel('Transaction Completed', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.text(x=-.115, y=9200 ,s='10,422', fontsize=10, color="white")
plt.text(x=.899, y=1000, s='1,908', fontsize=10, color= "white")
plt.show()

MonthlyValue = online_shoppers_intention['Month'].value_counts()

sns.set(style="darkgrid") # Style the plot background as a grid


sns.countplot(x='Month', data=online_shoppers_intention, order=MonthlyValue.index)  # x='Month' specifies the column to count
plt.title('Months', fontsize=18)
plt.xlabel('Month', fontsize=12)
plt.show()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
online_shoppers_intention['Month'] = le.fit_transform(online_shoppers_intention['Month'])
online_shoppers_intention['VisitorType'] = le.fit_transform(online_shoppers_intention['VisitorType'])
online_shoppers_intention['Weekend'] = le.fit_transform(online_shoppers_intention['Weekend'])
online_shoppers_intention['Revenue'] = le.fit_transform(online_shoppers_intention['Revenue'])

data_correlations= online_shoppers_intention.corr(method = "pearson")


print(data_correlations)

sns.heatmap(data_correlations, annot=True, cmap="coolwarm")


plt.gcf().set_size_inches(20,8)
plt.show()

X = online_shoppers_intention.drop('Revenue',axis=1).values
y = online_shoppers_intention["Revenue"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

resampler = RandomUnderSampler(random_state=0)
X_train_undersampled, y_train_undersampled = resampler.fit_resample(X_train, y_train)

sns.countplot(x=y_train_undersampled)

from sklearn.feature_selection import VarianceThreshold


variance_selector = VarianceThreshold(threshold=0)
X_train_fs = variance_selector.fit_transform(X_train_undersampled)
X_test_fs = variance_selector.transform(X_test)
print(f"{X_train.shape[1]-X_train_fs.shape[1]} features have been removed, {X_train_fs.shape[1]} features
remain")

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
X_train_s = sc.fit_transform(X_train_undersampled)
X_test_s = sc.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train_undersampled)
y_pred = knn.predict(X_test_s)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score : {accuracy*100:.2f}%")
print("Classification Report:")
print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, cmap="flare", annot=True, fmt='d')
plt.xlabel("Predicted Class",fontsize=12)
plt.ylabel("True Class",fontsize=12)
plt.title("Confusion Matrix",fontsize=12)

Problem Statement Terminology Theory Code Input Output Conclusion
https://drive.google.com/file/d/1xWcxluaMehZTPTi9V99F7GFpqIMCwSTe/view?usp=sharing

Problem Statement Terminology Theory Code Input Output Conclusion


Problem Statement Terminology Theory Code Input Output Conclusion


The KNN classifier, trained on the encoded, undersampled, and standardized behavioural features, predicts whether an
online shopping session ends in a purchase. The accuracy score, classification report, and confusion matrix show how
well the model separates purchasing from non-purchasing sessions, and the undersampling step helps counter the strong
class imbalance in the Revenue label. This lab provided hands-on experience with distance-based classification and with
the preprocessing steps (encoding, resampling, and scaling) that the KNN model requires.

Experiential Learning 10
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Sales Performance Analysis
Objectives Analyze sales data to identify top-performing products and regions for strategic
decision-making. Experiment: Analyze sales data using Microsoft Excel to uncover
insights into sales performance and trends. Utilize Excel's data manipulation,
visualization, and analysis tools to examine total sales revenue, product performance,
regional sales distribution, and sales trends over time
Learning Outcomes Understand the concept of stock market volatility and its significance in financial
analysis. Import and prepare stock price data for analysis. Calculate key metrics related
to stock market volatility, such as moving averages and standard deviation.
Welcome to Data Science Laboratory
Problem Statement Terminology Theory Code Input Output Conclusion
You are provided with historical stock price data for several companies. The goal is to analyze the data to
understand the volatility of stock prices over a specified period. By visualizing this data, you will be able to
identify periods of high volatility, compare different stocks, and gain insights into market behavior. This analysis
can help investors and financial analysts make better investment decisions.

Problem Statement Terminology Theory Code Input Output Conclusion


Volatility, Standard Deviation, Moving Average (MA), Exponential Moving Average (EMA), Candlestick
Chart, Relative Strength Index (RSI), VIX (Volatility Index)

Problem Statement Terminology Theory Code Input Output Conclusion


Historical Volatility: A measure of how much the price of a stock has fluctuated in the past. It is often calculated
as the standard deviation of the stock's returns over a specific period.
Implied Volatility: A metric derived from the price of options, reflecting the market's expectation of future
volatility.
Moving Average: A commonly used indicator in technical analysis that smooths out price data to identify the
direction of the trend.
Bollinger Bands: A volatility indicator that consists of a middle band (usually a moving average) and two outer
bands that are standard deviations away from the middle band. It helps identify overbought or oversold
conditions.
Candlestick Chart: A type of financial chart that shows the opening, closing, high, and low prices of a stock for
a specific period. It is commonly used to visualize price movements and patterns.
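
Although this experiment works in Excel, the same quantities can be computed programmatically for reference; the following is a minimal pandas sketch using a synthetic price series (purely illustrative, not part of this experiment's input files):

import numpy as np
import pandas as pd

# Synthetic daily closing prices (illustrative only)
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))))

returns = close.pct_change()
volatility_20d = returns.rolling(window=20).std()        # rolling historical volatility
ma_20 = close.rolling(window=20).mean()                  # 20-day simple moving average
bb_upper = ma_20 + 2 * close.rolling(window=20).std()    # Bollinger upper band
bb_lower = ma_20 - 2 * close.rolling(window=20).std()    # Bollinger lower band
print(pd.DataFrame({'close': close, 'ma_20': ma_20, 'bb_upper': bb_upper, 'bb_lower': bb_lower}).tail())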

Problem Statement Terminology Theory Code Input Output Conclusion


1. Prepare Your Data
• Clean your data: Ensure your data is well-structured and free of errors. Each column should
  represent a specific variable, and each row should represent an individual record.
• Organize data by category: Use pivot tables if necessary to organize large data sets.

2. Insert Pivot Tables and Charts
• Create a Pivot Table:
  - Go to Insert → PivotTable.
  - Select your data range and decide whether to place it on a new worksheet.
  - Drag and drop fields into the "Rows", "Columns", "Values", and "Filters" sections to create
    summaries of your data.
• Create Pivot Charts: Once your Pivot Table is ready:
  - Select the Pivot Table.
  - Go to Insert → choose the type of chart (e.g., bar, pie, line).
  - Customize the chart format and design.

3. Design Your Dashboard


• Create the dashboard layout:
  - Use a new worksheet to organize your dashboard components.
  - Leave space for your charts, tables, and key figures.
  - Create headings and color-coded sections to make it user-friendly.
• Add Pivot Charts: Copy charts and tables from other sheets or sections into the dashboard sheet.
  - Go to the worksheet with the chart.
  - Copy the chart (Ctrl+C).
  - Paste it into your dashboard sheet.

4. Add Interactivity
• Insert Slicers: To add interactivity (filtering options):
  - Click on any Pivot Table or Pivot Chart.
  - Go to PivotTable Analyze → Insert Slicer.
  - Choose fields to filter by (e.g., categories like Date, Region).
• Insert Timelines (for date ranges):
  - Click on the Pivot Table.
  - Go to PivotTable Analyze → Insert Timeline.
  - Select the date field to filter data over time.

5. Customize the Dashboard


• Modify Chart Styles: Select charts and use the Chart Tools in the Ribbon to modify colors, labels,
  axes, and legends.
• Use Conditional Formatting for tables:
  - Select the data range.
  - Go to Home → Conditional Formatting → Choose a rule (like Data Bars or Color Scales).
• Insert Icons and Shapes to highlight important metrics:
  - Go to Insert → Shapes or Icons and place them strategically.

6. Finishing Touches
• Hide Unnecessary Gridlines: Go to View → Uncheck Gridlines.


• Hide Extra Sheets: If you don’t want other sheets visible, right-click the tab and choose Hide.
• Protect the Dashboard: Optionally, you can protect the sheet so that users can't alter the layout
  by going to Review → Protect Sheet.

7. Refresh Data
• Refresh Pivot Tables when data updates by right-clicking the Pivot Table and selecting Refresh.
• Refresh All: To refresh all PivotTables and charts at once, go to Data → Refresh All.

Problem Statement Terminology Theory Code Input Output Conclusion


https://drive.google.com/file/d/1fUkLmV0l8EiF5go-eyVIDGjbj-2psL_M/view?usp=sharing

https://docs.google.com/spreadsheets/d/1KNihku9JpFVsntKLOS6DfBtEeQEwAqwh/edit?usp=sharing&ouid=110963070738200462224&rtpof=true&sd=true

Problem Statement Terminology Theory Code Input Output Conclusion

Problem Statement Terminology Theory Code Input Output Conclusions


The analysis provides valuable insights into sales performance and trends. By visualizing total sales revenue,
product performance, regional sales distribution, and sales trends over time in Excel, the financial analysts can
make informed strategic decisions. The top-performing products and regions are identified, and trends are
analyzed to help guide future sales strategies and resource allocation.
