BONAFIDE CERTIFICATE
LABORATORY during the 5th Semester of the academic year 2024 – 2025 (Odd Semester).
Contents
S.No   Date   Name of the Experiment   Page Number   Marks (50)   Signature of the Faculty Member
1 Web Scraping
Average:
Average (in words) Signature of the Faculty
url="https://finance.yahoo.com/most-active"
resp=requests.get(url)
print(resp)
htmlcont=resp.content
htmlcont
soup=bs(htmlcont,"html.parser")
title=soup.find("title")
title
title.get_text()
stock_table=soup.find("tbody")
stock_table
for i in stock_table.find_all("tr"):
    print(i)
symbol = []
for i in stock_table.find_all("tr"):
    sym = i.find("td", attrs={"aria-label": "Symbol"})
    symbol.append(sym.text)
symbol
print(type(symbol))
for i in stock_table.find_all("tr"):
    print(i.prettify())
name = []
for i in stock_table.find_all("tr"):
    nam = i.find("td", attrs={"aria-label": "Name"})
    name.append(nam.text)
name
price = []
for i in stock_table.find_all("tr"):
    pr = i.find("td", attrs={"aria-label": "Price (Intraday)"})
    price.append(pr.text)
price
change = []
for i in stock_table.find_all("tr"):
    chan = i.find("td", attrs={"aria-label": "Change"})
    change.append(chan.text)
per_change = []
for i in stock_table.find_all("tr"):
    pchan = i.find("td", attrs={"aria-label": "% Change"})
    per_change.append(pchan.text)
per_change
data={"Symbol":symbol,
"Name":name,
"Price(Intraday)":price,
"% Change":per_change
}
data
import pandas as pd
df=pd.DataFrame(data)
df
url="https://finance.yahoo.com/most-active"
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 3x3 grid of axes for the nine distributions
fig, axes = plt.subplots(3, 3, figsize=(15, 12))

# 1. Uniform Distribution
uniform_data = np.random.uniform(low=0, high=10, size=1000)
sns.histplot(uniform_data, kde=True, color='b', ax=axes[0, 0])
axes[0, 0].set_title('Uniform Distribution')
# 3. Binomial Distribution
n, p = 10, 0.5
binomial_data = np.random.binomial(n, p, 1000)
sns.histplot(binomial_data, kde=False, color='r', ax=axes[0, 2])
axes[0, 2].set_title('Binomial Distribution')
# 4. Poisson Distribution
poisson_data = np.random.poisson(lam=3, size=1000)
sns.histplot(poisson_data, kde=False, color='y', ax=axes[1, 0])
axes[1, 0].set_title('Poisson Distribution')
# 5. Exponential Distribution
exponential_data = np.random.exponential(scale=1, size=1000)
sns.histplot(exponential_data, kde=True, color='m', ax=axes[1, 1])
axes[1, 1].set_title('Exponential Distribution')
# 7. Beta Distribution
a, b = 2, 5
beta_data = np.random.beta(a, b, size=1000)
sns.histplot(beta_data, kde=True, color='orange', ax=axes[2, 0])
axes[2, 0].set_title('Beta Distribution')
# 8. Chi-Square Distribution
dof = 2  # degrees of freedom
chi_square_data = np.random.chisquare(dof, size=1000)
sns.histplot(chi_square_data, kde=True, color='purple', ax=axes[2, 1])
axes[2, 1].set_title('Chi-Square Distribution')
# 9. Student's t-Distribution
dof = 10  # degrees of freedom
t_data = np.random.standard_t(dof, size=1000)
sns.histplot(t_data, kde=True, color='teal', ax=axes[2, 2])
axes[2, 2].set_title("Student's t-Distribution")
# Adjust layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
Experiential Learning 3
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Plotting Normal Distribution and Performing Normality Test using Q-Q Plot
Objectives To understand the concept of a normal distribution and its significance in statistical
analysis.
Learning Outcomes Be able to generate and visualize a normal distribution using Python. Understand
how Q-Q plots are used to check the normality of data.
Problem Statement
How can we visualize a normal distribution and assess whether a dataset is normally distributed using a Q-Q
plot and statistical tests?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

mu, sigma = 0, 1
normal_data = np.random.normal(mu, sigma, 1000)
plt.figure(figsize=(10, 6))
sns.histplot(normal_data, kde=True, color='blue')
plt.title('Normal Distribution (μ=0, σ=1)', fontsize=15)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
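The histogram alone is only a visual check. Below is a minimal sketch of the Q-Q plot and a Shapiro-Wilk normality test that the experiment calls for, assuming scipy is available:

from scipy import stats

# Q-Q plot: sample quantiles against theoretical normal quantiles;
# points close to the reference line suggest normality
plt.figure(figsize=(8, 6))
stats.probplot(normal_data, dist="norm", plot=plt)
plt.title('Q-Q Plot of Normal Data')
plt.show()

# Shapiro-Wilk test: a p-value above 0.05 means normality cannot be rejected
stat, p_value = stats.shapiro(normal_data)
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p_value:.4f}")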
Experiential Learning 4
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Exploratory Data Analysis
Objectives The objective of this lab is to introduce students to the concept of Exploratory Data
Analysis (EDA). Students will learn how to use various tools and techniques to
understand the underlying structure of a dataset, identify patterns, spot anomalies,
test hypotheses, and check assumptions through visual and quantitative methods.
Learning Outcomes Understand the importance of EDA, EDA techniques such as Outlier handling,
missing data handling, transformation, etc.
Problem Statement
You are provided with a dataset containing various attributes related to customer information for a retail
company. The goal is to perform an Exploratory Data Analysis to uncover insights about customer behavior,
identify key factors that influence sales, and highlight any potential issues within the dataset that need to be
addressed before proceeding to further analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel(r'Netflix Dataset.xlsx')
df
df.shape
df.describe()
df.head(10)
df.tail(10)
df.isnull().values.any()
df.isnull().sum()
df.dropna(subset=['Director', 'Writer', 'Languages', 'Actors', 'Country Availability', 'Genre'], inplace=True)
df.isna().sum()
df
df.shape
pd.get_dummies(df.Series_Movie)
dataa=pd.concat([df, pd.get_dummies(df.Series_Movie)], axis=1)
p1=df['Series_Movie'].value_counts()
colors = ('violet', 'c')
plt.pie(p1, labels=p1.index, explode=(0.5, 0), autopct='%.1f%%', shadow = True, colors = colors)
plt.title('Series & Movie')
plt.show()
p2 = df['Director'].value_counts().head(20)
explode = (0.1, 0.175, 0.2, 0.3, 0.05, 0.05, 0.275, 0.2, 0.26, 0.3, 0.1, 0.15, 0.19, 0.1, 0.15, 0.3, 0.25,
0.15, 0.16, 0.2)
colors = ('b', 'c', 'r', 'g', 'lime', 'khaki', 'pink', 'olive', 'gray', 'orange')
wp = { 'linewidth' : 1, 'edgecolor' : "green" }
def func(pct, allvalues):
    absolute = int(pct / 100. * np.sum(allvalues))
    return "{:.1f}%\n({:d})".format(pct, absolute)
fig, ax = plt.subplots(figsize =(15, 12))
wedges, texts, autotexts = ax.pie(p2,
autopct = lambda pct: func(pct, p2), explode = explode, labels = p2.index,
shadow = True, colors = colors, startangle = 90, wedgeprops = wp,
textprops = dict(color ="deeppink"))
ax.legend(wedges, p2.index, title="Director", loc="center left", bbox_to_anchor=(1.15, 0.1, 0.5, 1))
plt.setp(autotexts, size = 10, weight ="bold")
plt.title("Top 20 Director")
plt.show()
df.groupby('Director')['Director'].count()
p3 = pd.DataFrame(df['Languages'].value_counts().head(20))
p3
xs = p2.index
plt.bar(xs, p2.values, color='red')
plt.xlabel("Director")
plt.ylabel("Counts")
plt.xticks(rotation=90)
plt.show()
p5 = pd.DataFrame(df['Actors'].value_counts().head(20))
p6 = ['Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman',
'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan
McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd',
'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie
Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson',
'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake
Lloyd', 'Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Ewan McGregor',
'Natalie Portman', 'Jake Lloyd', 'Liam Neeson', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta
Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara',
'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko
Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki',
'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama',
'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta
Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara',
'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko
Nomura', 'Kaneta Kimotsuki', 'Noriko Ohara', 'Nobuyo Ôyama', 'Michiko Nomura', 'Kaneta Kimotsuki', 'Isha
Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah
Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha
Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah
Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha
Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah
Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha
Koppikar', 'Priyanka Chopra', 'Shah Rukh Khan', 'Arjun Rampal', 'Isha Koppikar', 'Priyanka Chopra', 'Shah
Rukh Khan', 'Arjun Rampal', 'Finn Wolfhard', 'Jaeden Martell', 'Sophia Lillis', 'Jeremy Ray Taylor', 'Finn
Wolfhard', 'Jaeden Martell', 'Sophia Lillis', 'Jeremy Ray Taylor', 'Finn Wolfhard', 'Jaeden Martell', 'Sophia
Lillis', 'Jeremy Ray Taylor', 'Finn Wolfhard', 'Jaeden Martell', 'Sophia Lillis', 'Jeremy Ray Taylor', 'Finn
Wolfhard', 'Jaeden Martell', 'Sophia Lillis', 'Jeremy Ray Taylor', 'Finn Wolfhard', 'Jaeden Martell', 'Sophia
Lillis', 'Jeremy Ray Taylor', 'Chris Pine', 'Dale Dickey', 'Ben Foster', 'William Sterchi', 'Chris Pine', 'Dale
Dickey', 'Ben Foster', 'William Sterchi', 'Chris Pine', 'Dale Dickey', 'Ben Foster', 'William Sterchi', 'Chris
Pine', 'Dale Dickey', 'Ben Foster', 'William Sterchi', 'Chris Pine', 'Dale Dickey', 'Ben Foster', 'William
Sterchi', 'Chris Pine', 'Dale Dickey', 'Ben Foster', 'William Sterchi', 'Eric Idle', 'Terry Gilliam', 'John Cleese',
'Graham Chapman', 'Eric Idle', 'Terry Gilliam', 'John Cleese', 'Graham Chapman', 'Eric Idle', 'Terry Gilliam',
'John Cleese', 'Graham Chapman', 'Eric Idle', 'Terry Gilliam', 'John Cleese', 'Graham Chapman', 'Eric Idle',
'Terry Gilliam', 'John Cleese', 'Graham Chapman']
from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)
# build the word-cloud text from the actor list p6 above
for val in p6:
    val = str(val)
    tokens = val.split()
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(comment_words)
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
p7 = pd.DataFrame(dataa['Country Availability'].value_counts().head(20))
p7
dataa['IMDb Score'].plot()
import pandas as pd
import plotly as py
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
dataa['Runtime'].value_counts()
i2=dataa['Runtime'].value_counts()
i2
# Runtime may hold text labels (e.g. '1-2 hour'); pd.to_numeric with
# errors='coerce' avoids a hard failure by mapping them to NaN
dataa['Runtime'] = pd.to_numeric(dataa['Runtime'], errors='coerce')
dataa.columns
plt.figure(figsize=(12,8))
sns.boxplot(dataa["IMDb Score"])
plt.figure(figsize=(12,8))
sns.boxplot(dataa["Awards Received"])
p21 = dataa['Genre'].value_counts().head(20)
sns.boxplot(dataa["Director"])
plt.xticks(rotation=90)
plt.figure(figsize=(30, 30))
sns.violinplot(dataa["Director"])
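The boxplots above expose outliers in 'IMDb Score' and 'Awards Received', and the learning outcomes list outlier handling. A hedged sketch of the standard IQR rule applied to one numeric column:

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = dataa['IMDb Score'].quantile(0.25)
q3 = dataa['IMDb Score'].quantile(0.75)
iqr = q3 - q1
mask = dataa['IMDb Score'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
dataa_no_outliers = dataa[mask]
print("Rows before:", len(dataa), "after:", len(dataa_no_outliers))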
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)
data.head()
X = data.drop("medv", axis=1)
y = data["medv"]
preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='mean'), numeric_features),
('cat', OneHotEncoder(), categorical_features)
])
X_preprocessed = pipeline.fit_transform(X)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs Predicted Home Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
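The scatter plot is a visual check; a short sketch of a quantitative evaluation using the standard sklearn.metrics functions:

from sklearn.metrics import mean_squared_error, r2_score

# summarize fit quality on the held-out test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R^2: {r2:.2f}")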
Experiential Learning 6
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Medical diagnosis for disease spread pattern
Objectives To implement Logistic Regression to analyze and predict disease spread patterns.
The aim is to understand how Logistic Regression can model the probability of
disease occurrence based on input features and assess its accuracy in medical
diagnosis.
Learning Outcomes Understand the basic concepts of Logistic Regression and its application in binary
classification problems. Apply Logistic Regression to predict disease spread patterns
using real-world datasets.
Problem Statement
Given a dataset containing various health-related features of patients, the task is to predict whether a patient is
likely to be infected with a certain disease. The model should classify the patients as either infected or not
infected based on the input features. You are required to use Logistic Regression to build the model and evaluate
its effectiveness.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigree', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)
data.head()
X = data.drop("Outcome", axis=1) # 'Outcome' is the target column representing disease (1) or no disease (0)
y = data["Outcome"]
numeric_features = X.columns
preprocessor = ColumnTransformer(
transformers=[('num', SimpleImputer(strategy='mean'), numeric_features)])
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('scaler', StandardScaler())])
X_preprocessed = pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC: {roc_auc}")
Experiential Learning 7
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Customer Segmentation
Objectives To apply customer segmentation techniques by analyzing demographic,
psychographic, and behavioral data using Python. The goal is to cluster customers
into distinct groups based on their similarities, which can help businesses in targeting
the right customers with personalized marketing strategies and improving customer
retention.
Learning Outcomes Understand and apply different types of customer segmentation, including
demographic, psychographic, and behavioral segmentation. Implement K-Means
clustering and use the Elbow Method to determine the optimal number of clusters.
Problem Statement
Segment a business's customers into distinct groups based on their demographic, psychographic, and behavioural data.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("customer_data.csv")
print("Dataset preview:\n", df.head())
df = df.drop('country', axis=1)
df['education'] = df['education'].fillna('Unknown')
df['education'] = df['education'].map({'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3, 'Unknown': 4})
df.isnull().sum()
scaler = StandardScaler()
X = df  # assume all remaining columns are numeric features after the encoding above
X_scaled = scaler.fit_transform(X)
# Elbow Method: fit K-Means for k = 1..10 and record the within-cluster sum of squares
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-', color='b')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS (Within-cluster Sum of Squares)')
plt.show()
optimal_clusters = 4
# fit the final model and attach a cluster label to each customer
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(X_scaled)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['segment'], cmap='viridis', s=100)
plt.title(f'Customer Segmentation (n_clusters={optimal_clusters})')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Segment')
plt.show()
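With segments assigned, a short profiling step makes the clusters actionable; a hedged sketch that averages each numeric feature per segment (the exact columns depend on the assumed customer_data.csv):

# mean feature values per segment give a quick profile of each cluster
segment_profile = df.groupby('segment').mean(numeric_only=True)
print(segment_profile)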
Experiential Learning 8
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Customer churn classification using decision tree and random forest on telecom data
Objectives The primary objective of this lab is to implement Decision Tree and Random Forest
classifiers to predict customer churn in a telecom dataset. You will preprocess the
data, train the models, evaluate their performance, and compare their accuracy using
various metrics such as confusion matrix, classification report, and ROC-AUC curve.
Learning Outcomes Understand how to handle real-world datasets by performing data preprocessing
including handling missing values, encoding categorical variables, and feature
scaling. Identify and visualize feature importance to understand the impact of
different variables on customer churn prediction.
Problem Statement
The telecom industry faces a significant challenge with customer churn, where users leave for competitors,
impacting revenue. The objective is to build a predictive model using Decision Tree and Random Forest
algorithms to identify customers likely to churn. This model will enable the company to take proactive
retention measures, improving customer satisfaction and reducing churn rates.
df = pd.read_csv(r"Telco-Customer-Churn.csv")
df.dtypes
label_enc = LabelEncoder()
# encode every categorical column in place
for col in ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
            'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
            'PaperlessBilling', 'PaymentMethod', 'Churn']:
    df[col] = label_enc.fit_transform(df[col])
# TotalCharges is stored as text in the common version of this dataset; coerce it to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
# split features and target, dropping the identifier column if present
X = df.drop(columns=['Churn', 'customerID'], errors='ignore')
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
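# A quick side-by-side check of the two models, as the objectives ask;
# accuracy_score and classification_report from sklearn.metrics are assumed
from sklearn.metrics import accuracy_score, classification_report
print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_dt))
print("Random Forest accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))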
y_pred_prob_rf = rf_model.predict_proba(X_test)[:,1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf)
roc_auc_rf = roc_auc_score(y_test, y_pred_prob_rf)
plt.figure(figsize=(6,4))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest AUC = {roc_auc_rf:.2f}', color='darkorange')
plt.plot([0, 1], [0, 1], '--', color='navy')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Random Forest')
plt.legend(loc='lower right')
plt.show()
test_results = X_test.copy()
test_results['Actual_Churn'] = y_test
test_results['Predicted_Churn'] = y_pred_rf
test_results['Churn_Probability'] = y_pred_prob_rf
print("Final Churn Predictions on Test Set:\n")
print(test_results)
Experiential Learning 9
Course Code R19AD251
Competency Data Science
Discipline Data Science
Domain Data Science
Title of the Experiment Behavioural analysis of online shoppers' purchase intention using the KNN model
Objectives To perform behavioral analysis of online shoppers to predict their intention for
online purchases using the K-Nearest Neighbors (KNN) classification algorithm.
The focus is on understanding how different shopping behaviors can be used to
classify shoppers' purchase intentions.
Learning Outcomes Understand the working principles of the K-Nearest Neighbors (KNN) algorithm.
Apply KNN for behavioral analysis and classification problems.
Problem Statement
The task is to classify online shoppers based on their behavior and predict whether they intend to make an
online purchase or not. Given a dataset with features such as the number of pages visited, time spent on the
site, bounce rate, and revenue, you are required to build a classification model using KNN to predict shoppers'
purchase intentions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

online_shoppers_intention = pd.read_csv("online_shoppers_intention.csv")
online_shoppers_intention
online_shoppers_intention.shape
online_shoppers_intention.info()
online_shoppers_intention.describe()
online_shoppers_intention.isnull().sum()
online_shoppers_intention["Month"].value_counts()
duplicateRows = online_shoppers_intention.duplicated().sum()
print("Total number of duplicate rows:", duplicateRows)
online_shoppers_intention.drop_duplicates(inplace=True)
duplicateRows = online_shoppers_intention.duplicated().sum()
print("Total number of duplicate rows:", duplicateRows)
online_shoppers_intention["Revenue"].value_counts()
sns.set(style="darkgrid") #style the plot background to become a grid
sns.countplot(x='Revenue', data=online_shoppers_intention)
plt.ylim(0,12000)
plt.title('Revenue', fontsize= 18)
plt.xlabel('Transaction Completed', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.text(x=-.115, y=9200 ,s='10,422', fontsize=10, color="white")
plt.text(x=.899, y=1000, s='1,908', fontsize=10, color= "white")
plt.show()
MonthlyValue = online_shoppers_intention['Month'].value_counts()
le = LabelEncoder()
online_shoppers_intention['Month'] = le.fit_transform(online_shoppers_intention['Month'])
online_shoppers_intention['VisitorType'] = le.fit_transform(online_shoppers_intention['VisitorType'])
online_shoppers_intention['Weekend'] = le.fit_transform(online_shoppers_intention['Weekend'])
online_shoppers_intention['Revenue'] = le.fit_transform(online_shoppers_intention['Revenue'])
X = online_shoppers_intention.drop('Revenue',axis=1).values
y = online_shoppers_intention["Revenue"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
resampler = RandomUnderSampler(random_state=0)
X_train_undersampled, y_train_undersampled = resampler.fit_resample(X_train, y_train)
sns.countplot(x=y_train_undersampled)
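The experiment's title calls for a KNN classifier, which is missing from the extracted code; a minimal sketch of the remaining steps, assuming scikit-learn's KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# KNN is distance-based, so scaling the features first would normally be advisable;
# k=5 is the scikit-learn default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_undersampled, y_train_undersampled)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))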
4. Add Interactivity
- Insert Slicers (to add filtering options):
  - Click on any Pivot Table or Pivot Chart.
  - Go to PivotTable Analyze → Insert Slicer.
  - Choose fields to filter by (e.g., categories like Date, Region).
- Insert Timelines (for date ranges):
  - Click on the Pivot Table.
  - Go to PivotTable Analyze → Insert Timeline.
  - Select the date field to filter data over time.
6. Finishing Touches
- Hide Unnecessary Gridlines: Go to View → uncheck Gridlines.
- Hide Extra Sheets: If you don't want other sheets visible, right-click the tab and choose Hide.
- Protect the Dashboard: Optionally, protect the sheet so that users can't alter the layout by going to Review → Protect Sheet.
7. Refresh Data
- Refresh a Pivot Table when its data updates by right-clicking it and selecting Refresh.
- Refresh All: To refresh all Pivot Tables and charts at once, go to Data → Refresh All.
https://docs.google.com/spreadsheets/d/1KNihku9JpFVsntKLOS6DfBtEeQEwAqwh/edit?usp=sharing&ouid=110963070738200462224&rtpof=true&sd=true