EDA LAB

The document outlines a series of exercises for a B.Sc. in Data Science program focused on Exploratory Data Analysis (EDA) using various tools and libraries such as WEKA, NumPy, Matplotlib, and Pandas. Each exercise includes aims, apparatus, algorithms, and programming examples for tasks like data visualization, histogram analysis, and generating different charts. The document serves as a practical guide for students to develop their skills in data analysis and visualization techniques.

Uploaded by

karishmasuga

PG DEPARTMENT OF COMPUTATIONAL STUDIES, B.Sc.(DSA) SCHOOL OF ARTS AND SCIENCE, SMVEC

INDEX

S.NO.   DATE   PARTICULARS   PAGE NO.   SIGN

01   Download, install and practice open-source tools for EDA – WEKA
02   Visualize the data using various graphs
03   Perform histogram analysis using NumPy, Matplotlib, pandas
04   Write a program to generate different charts and plots
05   Write a program to generate pivot using groupby() method
06   Perform Time Series analysis and test with a predictive model
07   Write a program to identify the correlation of the features/parameters in the Titanic Dataset
08   Perform EDA on Wine Data
09   Demonstrate different visualizations based on Exercise 7
10   Develop and evaluate ML models on open dataset

EXPLORATORY DATA ANALYSIS LABORATORY(A24DAS202D) PAGE NO. ( )


EX. No. 01 Download, install and practice open-source tools for EDA – WEKA DATE:

AIM:

To download, install and practice open-source tools for EDA – WEKA

APPARATUS:

EDA WEKA

ALGORITHM:

Step-1: Open the Weka download page in any web browser and click Free Download.
Step-2: On the page that opens, click Start Download. The executable file is about
118 MB, so the download may take a few minutes.
Step-3: Locate the executable file in the Downloads folder of your system and run it.
Step-4: When prompted to allow changes to your system, click Yes.
Step-5: When the Setup screen appears, click Next.
Step-6: On the License Agreement screen, click I Agree.
Step-7: On the Choose Components screen, all components are already selected. Leave
them unchanged and click the Install button.
Step-8: On the install location screen, choose a drive with sufficient free disk space.
The installation needs about 301 MB.
Step-9: On the Start Menu folder screen, leave the default unchanged and click the
Install button.
Step-10: The installation starts and takes about a minute to complete.
Step-11: Click the Next button after the installation is complete.
Step-12: Click Finish to exit the installer.
Step-13: Weka is now installed on the system and an icon is created on the desktop.
Step-14: Run the software and explore the interface.


PROGRAM:
1) Installing Weka on Windows:
Follow Step-1 to Step-14 listed under ALGORITHM above to install Weka on Windows.

Congratulations! At this point, you have successfully installed Weka on your Windows
system.


S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:


EX. No. 02 Visualize the data using various graphs DATE:

AIM:

To visualize the data using various graphs

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Matplotlib:
Step-1: Import matplotlib.pyplot and numpy.
Step-2: Create the data: x (np.linspace) and y (np.sin(x)).
Step-3: Create the plot figure.
Step-4: Plot x against y as a line.
Step-5: Add the title, axis labels, legend, and grid.
Step-6: Display the plot.
Seaborn:
Step-1: Import seaborn, pandas, numpy, and matplotlib.pyplot.
Step-2: Create a DataFrame (x, y, category).
Step-3: Create the plot figure.
Step-4: Plot using sns.scatterplot (hue for the category, s for the marker size).
Step-5: Add the title.
Step-6: Display the plot.
Plotly:
Step-1: Import plotly.express.
Step-2: Create a data dictionary (x, y).
Step-3: Create the plot with px.scatter.
Step-4: Save the figure as a PNG with fig.write_image.
Step-5: Print a confirmation message.
Pandas Built-in Plotting:
Step-1: Import pandas, numpy, and matplotlib.pyplot.
Step-2: Create a DataFrame of random data.
Step-3: Plot with df.plot.
Step-4: Add the axis labels and grid.
Step-5: Display the plot.

PROGRAM:

Matplotlib:
Matplotlib is the foundation of data visualization in Python. It's versatile but requires
more code for complex visualizations.

Matplotlib Program:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100) # Sample data
y = np.sin(x)
plt.figure(figsize=(10, 6)) # Create a simple line plot
plt.plot(x, y, 'b-', linewidth=2, label='sin(x)')
plt.title('Simple Line Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.legend()
plt.grid(True)
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Seaborn:
Seaborn is built on Matplotlib but provides a higher-level interface for statistical
visualizations.

Seaborn Program:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # Create sample dataset
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})
plt.figure(figsize=(10, 6)) # Create a scatter plot with additional dimensions
sns.scatterplot(x='x', y='y', hue='category', data=df, s=100)
plt.title('Seaborn Scatter Plot')
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Plotly:
Plotly creates interactive visualizations that are great for dashboards and web
applications.

Plotly Program:
import plotly.express as px
data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 5, 1, 8, 3]
}
fig = px.scatter(data, x="x", y="y", title="Simple Plotly Scatter Plot")
# Save as PNG image (static image export requires the kaleido package)
fig.write_image("plotly_plot.png")
print("Plot saved as plotly_plot.png. Download and view.")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pandas Built-in Plotting:


Pandas has built-in plotting capabilities based on Matplotlib that make it easy to quickly
visualize dataframes.

Pandas Built-in Plotting Program:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create sample dataframe
df = pd.DataFrame(np.random.randn(20, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])
# Simple line plot of all columns
ax = df.plot(figsize=(10, 6), title='Pandas DataFrame Plot')
ax.set_xlabel('Index')
ax.set_ylabel('Value')
plt.grid(True)
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


INPUT & OUTPUT:

Matplotlib Output:

Seaborn Output:

Plotly Output:

Pandas Built-in Plotting Output:

S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 03 Perform histogram analysis using NumPy, Matplotlib, pandas. DATE:

AIM:

To Perform histogram analysis using NumPy, Matplotlib, Pandas

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step-1: Import Required Libraries


Step-2: Generate Sample Data Using NumPy
Step-3: Create Histogram Using NumPy + Matplotlib
Step-4: Use Matplotlib’s Built-in hist() Function
Step-5: Convert to Pandas Series and Plot Using Pandas
Step-6: Customize the Pandas Histogram

PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# 1. Using NumPy directly
# Generate sample data
data_numpy = np.random.randn(1000) # Normally distributed data
# Calculate histogram
hist, bins = np.histogram(data_numpy, bins=30) # 30 bins
# Plot using Matplotlib
plt.figure(figsize=(8, 6))
plt.bar(bins[:-1], hist, width=(bins[1] - bins[0]), align='edge') # start each bar at its left bin edge
plt.title("Histogram using NumPy and Matplotlib")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
# 2. Using Matplotlib's hist() function
plt.figure(figsize=(8, 6))
plt.hist(data_numpy, bins=30)
plt.title("Histogram using Matplotlib hist()")
plt.xlabel("Value")
plt.ylabel("Frequency")

plt.show()
# 3. Using pandas
# Create a pandas Series from the NumPy array
data_pandas = pd.Series(data_numpy)
# Plot histogram using pandas' hist() function
plt.figure(figsize=(8, 6))
data_pandas.hist(bins=30)
plt.title("Histogram using pandas")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
# 4. Pandas' hist() with more customization.
plt.figure(figsize=(8, 6))
data_pandas.hist(bins=30, edgecolor='black', alpha=0.7) # adding edge color and transparency
plt.title("Customized Histogram using pandas")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(axis='y', alpha=0.75) # adding grid lines on the y axis.
plt.show()
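
As a rough check on the NumPy step in the program above, the following sketch (an illustrative addition, not part of the original record) shows what np.histogram returns: one count per bin and one more edge than there are bins. The seed value and the names counts, edges, and centers are assumptions chosen for illustration.

```python
import numpy as np

# Assumption: a fixed seed so that the run is repeatable.
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# np.histogram returns the per-bin counts and the bin edges.
counts, edges = np.histogram(data, bins=30)

# Bin centers are the midpoints of consecutive edges.
centers = (edges[:-1] + edges[1:]) / 2

print(len(counts), len(edges))  # 30 counts, 31 edges
print(counts.sum())             # every sample falls in exactly one bin
```

Because the edges mark where each bin starts and ends, a bar chart built from them should either be drawn from the left edges or be centered on the midpoints.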


INPUT & OUTPUT:

S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 04 Write a program to generate different charts and plots. DATE:

AIM:

To write a program to generate different charts and plots

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Line Chart:
Step-1: Import Matplotlib
Step-2: Prepare the Data
Step-3: Create the Plot to draw a line
Step-4: Add Labels and Title
Step-5: Add Grid (Optional but Recommended)
Step-6: Display the Plot
Bar Chart:
Step-1: Import Matplotlib
Step-2: Prepare Your Data
Step-3: Create the Bar Chart
Step-4: Add Labels and Title
Step-5: Add Grid Lines
Step-6: Show the Chart
Pie Chart:
Step 1: Import Matplotlib
Step 2: Prepare Your Data
Step 3: Create the Pie Chart
Step 4: Add Title and Fix Aspect Ratio
Step 5: Display the Chart


Scatter Plot:
Step 1: Import Required Libraries
Step 2: Generate Random Data
Step 3: Create the Scatter Plot
Step 4: Add Labels and Title
Step 5: Add Color Bar
Step 6: Add Grid and Show the Plot
Histogram:
Step 1: Import Libraries
Step 2: Generate Random Data
Step 3: Create the Histogram
Step 4: Add Labels and Title
Step 5: Add Grid Lines
Step 6: Show the Plot

PROGRAM:
Line Chart:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5, 6, 7]
y = [10, 15, 13, 17, 20, 18, 22]
# Create the line chart
plt.figure(figsize=(8, 5))
plt.plot(x, y, color='blue', marker='o', linestyle='-', linewidth=2)
# Add labels and title
plt.title('Line Chart Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.grid(True)
# Show the chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Bar Chart:
import matplotlib.pyplot as plt
# Sample data
categories = ['Apples', 'Bananas', 'Cherries', 'Dates', 'Elderberries']
values = [23, 45, 56, 12, 34]
# Create the bar chart
plt.figure(figsize=(8, 5))
plt.bar(categories, values, color='skyblue', edgecolor='black')
# Add labels and title
plt.title('Fruit Sales Bar Chart')
plt.xlabel('Fruits')
plt.ylabel('Quantity Sold')
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pie Chart:
import matplotlib.pyplot as plt
# Sample data
labels = ['Python', 'Java', 'C++', 'JavaScript']
sizes = [35, 30, 20, 15]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = (0.1, 0, 0, 0) # "explode" the 1st slice
# Create pie chart
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Programming Language Popularity')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
# Show chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Scatter Plot:
import matplotlib.pyplot as plt
import numpy as np
# Generate some random data
np.random.seed(42) # For reproducibility
x = np.random.rand(50) * 100
y = np.random.rand(50) * 100
colors = np.random.rand(50)
sizes = 100 * np.random.rand(50)
# Create scatter plot
plt.figure(figsize=(8, 6))
scatter = plt.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')

# Add labels and title
plt.title('Scatter Plot Example')
plt.xlabel('X Value')
plt.ylabel('Y Value')
plt.colorbar(scatter, label='Color Intensity') # Add color legend
# Show the plot
plt.grid(True)
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Histogram:
import matplotlib.pyplot as plt
import numpy as np
# Generate random data (normally distributed)
np.random.seed(0) # For reproducibility
data = np.random.randn(1000)
# Create histogram
plt.figure(figsize=(8, 5))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
# Add labels and title
plt.title('Histogram of Normally Distributed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
# Show the histogram
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

INPUT & OUTPUT:

Line Chart:

Bar Chart:

Pie Chart:

Scatter Plot:

Histogram:

S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 05 Write a program to generate pivot using groupby() method DATE:

AIM:

To write a program to generate a pivot table using the groupby() method

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries


Step 2: Define Sample Data
Step 3: Group and Aggregate Sales
Step 4: Create a Pivot Table
Step 5: Plot the Pivot Table as an Image
Step 6: Save and Show the Table

PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
'Sales': [250, 150, 200, 300, 400, 120, 330, 500]
}
# Create DataFrame
df = pd.DataFrame(data)
# Group and pivot
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')
# Plot the pivot table as an image
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=pivot_table.fillna(0).values,
colLabels=pivot_table.columns,
rowLabels=pivot_table.index,
cellLoc='center',
loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)


table.scale(1.2, 1.2)
# Save as image
plt.title("Pivot Table: Total Sales by Region and Product", fontsize=14)
plt.savefig("pivot_table.png", bbox_inches='tight')
plt.show()
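
The group-and-pivot step in the program above can be cross-checked against pandas' pivot_table. The sketch below is an illustrative addition that reuses the sample data from this exercise; the names via_groupby and via_pivot are assumptions, and missing Region/Product combinations are filled with 0.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [250, 150, 200, 300, 400, 120, 330, 500],
})

# Route 1: group, sum, then move Product from the row index to the columns.
via_groupby = df.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)

# Route 2: let pivot_table do the grouping and aggregation in one call.
via_pivot = df.pivot_table(index='Region', columns='Product',
                           values='Sales', aggfunc='sum', fill_value=0)

print(via_groupby)
print((via_groupby.values == via_pivot.values).all())
```

Both routes give one row per Region and one column per Product; the groupby route makes the aggregation step explicit, which is the point of this exercise.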

INPUT & OUTPUT:

S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 06 Perform Time Series analysis and test with a predictive model DATE:

AIM:

To Perform Time Series analysis and test with a predictive model

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries


Step 2: Generate Synthetic Time Series Data
Step 3: Visualize the Time Series
Step 4: Decompose the Time Series
Step 5: Test for Stationarity (ADF Test)
Step 6: Difference the Data (If Needed)
Step 7: Re-Test for Stationarity
Step 8: Plot ACF and PACF
Step 9: Train/Test Split
Step 10: Fit ARIMA Model
Step 11: Forecast Future Values
Step 12: Convert Forecast Back to Original Scale
Step 13: Plot Forecast vs Actual
Step 14: Evaluate Model Performance

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# 1. Generate Synthetic Time Series Data
date_rng = pd.date_range(start='2020-01-01', end='2023-12-01', freq='MS')
np.random.seed(42)
sales_data = (np.random.poisson(lam=200, size=len(date_rng))
              + np.linspace(0, 50, len(date_rng)))
df = pd.DataFrame({'Date': date_rng, 'Sales': sales_data})
df.set_index('Date', inplace=True)
# 2. Visualize Time Series
plt.figure(figsize=(10, 5))
sns.lineplot(data=df, x=df.index, y='Sales')
plt.title("Monthly Sales Over Time")
plt.ylabel("Sales")
plt.xlabel("Date")
plt.grid(True)
plt.show()
# 3. Time Series Decomposition
decomposition = seasonal_decompose(df['Sales'], model='additive')
decomposition.plot()
plt.show()
# 4. Stationarity Test (Augmented Dickey-Fuller)
result = adfuller(df['Sales'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
    print("Series is stationary")
else:
    print("Series is not stationary. Differencing required")
# 5. Differencing (if needed)
df['Sales_diff'] = df['Sales'].diff()
df.dropna(inplace=True) # remove the first NaN value from differencing
# 6. Stationarity Test after Differencing
result_diff = adfuller(df['Sales_diff'])
print('ADF Statistic (Differenced):', result_diff[0])
print('p-value (Differenced):', result_diff[1])
print('Critical Values (Differenced):', result_diff[4])
if result_diff[1] <= 0.05:
    print("Differenced series is stationary")
else:
    print("Differenced series is not stationary.")
# 7. ACF and PACF Plots
plot_acf(df['Sales_diff'].dropna(), lags=20)
plt.show()
plot_pacf(df['Sales_diff'].dropna(), lags=20)
plt.show()
# 8. Train/Test Split
train = df['Sales_diff'].iloc[:-12]
test = df['Sales_diff'].iloc[-12:]
# 9. ARIMA Model Fitting (with frequency)
try:
    model = ARIMA(train, order=(1, 0, 1), freq='MS') # Explicitly set frequency
    fitted_model = model.fit()
    # 10. Forecast
    forecast = fitted_model.forecast(steps=12)
    forecast = pd.Series(forecast, index=test.index)
    # 11. Reverse Differencing for Forecast
    forecast_original_scale = forecast.cumsum() + df['Sales'].iloc[-13]
    actual_original_scale = df['Sales'].iloc[-12:]
    # 12. Plot Forecast vs. Actual
    plt.figure(figsize=(10, 5))
    plt.plot(df['Sales'].iloc[:-12], label='Training')
    plt.plot(actual_original_scale, label='Actual')
    plt.plot(forecast_original_scale, label='Forecast', linestyle='--')
    plt.title("ARIMA Forecast vs Actual Sales")
    plt.legend()
    plt.grid(True)
    plt.show()
    # 13. Evaluate Model
    rmse = sqrt(mean_squared_error(actual_original_scale, forecast_original_scale))
    print(f"Root Mean Squared Error: {rmse:.2f}")
except Exception as e:
    print(f"An error occurred: {e}")
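
The reverse-differencing step in the program above (a cumulative sum plus the last known level) can be checked on a small series. The sketch below is an illustrative addition; the series values and the names sales, diffs, and restored are hypothetical and are used only to show the arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical short series, used only to illustrate the arithmetic.
sales = pd.Series([200.0, 210.0, 205.0, 220.0, 230.0])

# First-order differencing keeps the changes and drops the level.
diffs = sales.diff().dropna()

# Reverse differencing: accumulate the changes and add back the
# last known level (here, the first observation).
restored = diffs.cumsum() + sales.iloc[0]

print(np.allclose(restored.values, sales.iloc[1:].values))
```

In the forecast above, the same idea is applied with the last training observation as the known level, so that the RMSE is measured on the original sales scale.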


INPUT & OUTPUT:

S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 07 Write a program to identify the correlation of the features/parameters in the Titanic Dataset. DATE:

AIM:

To write a program to identify the correlation of the features/parameters in the Titanic Dataset.

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries


Step 2: Load Titanic Dataset from URL
Step 3: Select Numerical Features for Correlation
Step 4: Handle Missing Values
Step 5: Compute Correlation Matrix
Step 6: Visualize Correlation with a Heatmap
Step 7: Print Additional Correlation Insights
Step 8: Plot Scatterplot of Two Variables
Step 9: Plot a Boxplot for Category vs Numeric

PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import io
import requests
# Load the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')))
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# Select numerical features for correlation analysis
numerical_features = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
numerical_df = df[numerical_features].copy()
# Handle missing values (e.g., fill with mean for 'Age')
numerical_df['Age'] = numerical_df['Age'].fillna(numerical_df['Age'].mean())
# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()
# Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Titanic Features")
plt.show()
# Additional insights (optional)
print("\nCorrelation Insights:")
print(correlation_matrix['Survived'].sort_values(ascending=False))
# Example of plotting scatterplots of highly correlated pairs:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Pclass', y='Fare', data=numerical_df)
plt.title("Pclass vs. Fare")
plt.show()
# Example of plotting a boxplot of a categorical feature against a numerical one.
plt.figure(figsize=(8, 6))
sns.boxplot(x='Pclass', y='Age', data=numerical_df)
plt.title("Pclass vs Age")
plt.show()
INPUT & OUTPUT:

Correlation Insights:
Survived 1.000000
Fare 0.257307
Parch 0.081629
SibSp -0.035322
Age -0.069809
Pclass -0.338481
Name: Survived, dtype: float64
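
The values printed above are Pearson correlation coefficients, which is what DataFrame.corr() computes by default. The sketch below is an illustrative addition that works the formula by hand on a small frame and compares it with pandas; the toy data and the names toy, r_manual, and r_pandas are hypothetical and are not taken from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Pearson r: sum of products of deviations, divided by the square root
# of the product of the two sums of squared deviations.
da = toy['a'] - toy['a'].mean()
db = toy['b'] - toy['b'].mean()
r_manual = (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

# DataFrame.corr() uses the Pearson method by default.
r_pandas = toy.corr().loc['a', 'b']

print(round(r_manual, 4), round(r_pandas, 4))
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0, such as SibSp against Survived in the output above, indicates a weak one.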


S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:


EX. No. 08 Perform EDA on Wine Data DATE:

AIM:

To Perform EDA on Wine Data

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries


Step 2: Load Wine Dataset from URL
Step 3: Basic Data Exploration
Step 4: Visualize Feature Distributions
Step 5: Correlation Analysis
Step 6: Scatter Plots
Step 7: Pair Plot
Step 8: Violin Plot
Step 9: Density Plots

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
# Load the Wine dataset from a URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
column_names = [
    "Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"
]
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# Basic Information
print("--- Basic Information ---")


print(df.info())
print("\n--- Descriptive Statistics ---")
print(df.describe())
print("\n--- Class Distribution ---")
print(df['Class'].value_counts())
# Visualizations
# Histograms for numerical features
df.hist(figsize=(15, 10))
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()
# Boxplots for numerical features
plt.figure(figsize=(15, 10))
sns.boxplot(data=df.drop('Class', axis=1))
plt.title("Boxplots of Numerical Features")
plt.xticks(rotation=45)
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
# Scatter plots (example: Alcohol vs. Color intensity)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Alcohol', y='Color intensity', hue='Class', data=df)
plt.title("Alcohol vs. Color Intensity")
plt.show()
# Pair plot (sample)
sns.pairplot(df[['Alcohol', 'Malic acid', 'Color intensity', 'Class']], hue='Class')
plt.suptitle("Pair Plot (Sample)", fontsize=16)
plt.show()
#Violin Plots
plt.figure(figsize = (15,8))
sns.violinplot(x = 'Class', y='Proline', data = df)
plt.title("Proline distribution by wine class")
plt.show()
#Density Plots
for column in df.drop('Class', axis=1).columns:
    plt.figure(figsize=(8, 6))
    for wine_class in df['Class'].unique():
        sns.kdeplot(df[df['Class'] == wine_class][column], label=f"Class {wine_class}")
    plt.title(f'Density Plot of {column} by Wine Class')
    plt.legend()
    plt.show()
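
The boxplots in the program above show points beyond the whiskers as individual markers. With seaborn's default setting, the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below is an illustrative addition that applies the same rule numerically; the values array is hypothetical and is not taken from the Wine data.

```python
import numpy as np

# Hypothetical values, chosen so that one point is clearly extreme.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

Applying the same rule to a column such as Proline would list the observations that the boxplot draws as separate points.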


INPUT & OUTPUT:


S.No. Particulars Max. Marks Marks Secured

01 Aim and Algorithm 5

02 Program and Execution 10

03 Viva Voce 5

04 Output 5

05 Total 25

RESULT:

EX. No. 09 Demonstrate different visualizations based on Exercise 7. DATE:

AIM:

To Perform EDA on Wine Data

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries


Step 2: Load the Wine Dataset
Step 3: Explore the Dataset
Step 4: Plot Histograms
Step 5: Create Boxplots
Step 6: Correlation Heatmap
Step 7: Scatter Plot (Example)
Step 8: Pair Plot
Step 9: Violin Plot
Step 10: KDE (Density) Plots

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
# Load the Wine dataset from a URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
column_names = [
"Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
"Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
"Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"
]
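try:
    response = requests.get(url)
    response.raise_for_status()  # treat HTTP errors as failures instead of parsing an error page
    df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    raise SystemExit(1)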
# Basic Information
print("--- Basic Information ---")
df.info()  # info() prints its summary directly and returns None
print("\n--- Descriptive Statistics ---")
print(df.describe())
print("\n--- Class Distribution ---")
print(df['Class'].value_counts())
# Visualizations
# Histograms for numerical features
df.hist(figsize=(15, 10))
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()
# Boxplots for numerical features
plt.figure(figsize=(15, 10))
sns.boxplot(data=df.drop('Class', axis=1))
plt.title("Boxplots of Numerical Features")
plt.xticks(rotation=45)
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
# Scatter plots (example: Alcohol vs. Color intensity)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Alcohol', y='Color intensity', hue='Class', data=df)
plt.title("Alcohol vs. Color Intensity")
plt.show()
# Pair plot (sample)
sns.pairplot(df[['Alcohol', 'Malic acid', 'Color intensity', 'Class']], hue='Class')
plt.suptitle("Pair Plot (Sample)", fontsize=16)
plt.show()
#Violin Plots
plt.figure(figsize = (15,8))
sns.violinplot(x = 'Class', y='Proline', data = df)
plt.title("Proline distribution by wine class")
plt.show()
#Density Plots
for column in df.drop('Class', axis=1).columns:
    plt.figure(figsize=(8, 6))
    # Overlay one density curve per wine class on the same axes
    for wine_class in df['Class'].unique():
        sns.kdeplot(df[df['Class'] == wine_class][column], label=f"Class {wine_class}")
    plt.title(f'Density Plot of {column} by Wine Class')
    plt.legend()
    plt.show()
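
NOTE (offline alternative): If the UCI URL cannot be reached from the lab machine, the same 178-sample Wine data is bundled with scikit-learn. The sketch below is not part of the original program. It assumes scikit-learn is installed, and the helper name load_wine_frame is hypothetical. scikit-learn labels the classes 0 to 2 and uses lowercase column names (for example 'proline' rather than 'Proline'), so the sketch shifts the labels by 1 to match the UCI file. It also prints a per-class numeric summary of proline as a counterpart to the violin plot.

```python
import pandas as pd
from sklearn.datasets import load_wine


def load_wine_frame() -> pd.DataFrame:
    """Hypothetical helper: return the bundled Wine data with a 'Class' column."""
    bunch = load_wine(as_frame=True)
    frame = bunch.frame.rename(columns={"target": "Class"})
    # scikit-learn labels the classes 0-2; the UCI file uses 1-3.
    frame["Class"] = frame["Class"] + 1
    return frame


df_local = load_wine_frame()
print(df_local.shape)
print(df_local["Class"].value_counts().sort_index())

# Numeric counterpart to the violin plot: proline statistics per class.
proline_by_class = df_local.groupby("Class")["proline"].agg(["mean", "median", "std"])
print(proline_by_class.round(1))
```

Because the column names differ from the UCI version, the plotting calls in the program above would need the lowercase names (or a rename) before they run against df_local.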

INPUT & OUTPUT:

S.No.   Particulars             Max. Marks   Marks Secured
01      Aim and Algorithm       5
02      Program and Execution   10
03      Viva Voce               5
04      Output                  5
05      Total                   25

RESULT:

EX. No. 10 DATE:

Develop and evaluate ML models on open dataset

AIM:

To develop and evaluate ML models on an open dataset (the Iris dataset).

APPARATUS:

TRINKET (Online Compiler)

ALGORITHM:

Step 1: Import Required Libraries
Step 2: Load the Dataset
Step 3: Data Preprocessing
Step 4: Train-Test Split
Step 5: Train Models
Step 6: Hyperparameter Tuning (Grid Search)
Step 7: Feature Importance Visualization

PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import requests
import io
import matplotlib.image as mpimg
from io import BytesIO
# 1. Load the Iris dataset (example)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
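try:
    response = requests.get(url)
    response.raise_for_status()  # treat HTTP errors as failures instead of parsing an error page
    df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    raise SystemExit(1)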

# 2. Preprocessing
X = df.drop('species', axis=1)
y = df['species']
# Impute missing values (if any)
imputer = SimpleImputer(strategy='mean') # using mean imputation
X = imputer.fit_transform(X)
# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y) # 0, 1, 2 for Iris-setosa, Iris-versicolor, Iris-virginica
# 3. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 4. Model Development and Evaluation
# Logistic Regression
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Gradient Boosting
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
# 5. Hyperparameter Tuning (Example: Random Forest)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
# 7. Visualization (Example: Feature Importance from Random Forest)
feature_importance = best_rf.feature_importances_
plt.figure(figsize=(8, 6))
sns.barplot(x=feature_importance, y=column_names[:-1])
plt.title("Feature Importance (Random Forest)")
#Save the plot to a BytesIO object, then convert to an image.
buffer = BytesIO()
plt.savefig(buffer, format='png')

buffer.seek(0)
image = mpimg.imread(buffer)
plt.close()  # close the bar-plot figure so the saved image is drawn on a fresh one
plt.imshow(image)
plt.axis('off')  # Turn off axis labels.
plt.show()
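
NOTE (evaluation sketch): The program above imports accuracy_score, classification_report, confusion_matrix and cross_val_score but never calls them, so none of the trained models is actually scored. The following is a minimal sketch of how the evaluation could look, not the original author's method. It assumes scikit-learn's bundled Iris data (load_iris) in place of the UCI download so that it runs offline, and it wraps each model in a pipeline so the scaler is fitted on the training data only. The dictionary names models and results are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the UCI download.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # The pipeline refits the scaler inside each CV fold and on the training set only.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "confusion": confusion_matrix(y_test, y_pred),
    }
    print(f"--- {name} ---")
    print(f"5-fold CV accuracy: {results[name]['cv_mean']:.3f}")
    print(f"Test accuracy: {results[name]['test_accuracy']:.3f}")
    print(results[name]["confusion"])
    print(classification_report(y_test, y_pred))
```

The same calls can be applied to y_pred_logreg, y_pred_rf, y_pred_gb and y_pred_best_rf from the program above; grid_search.best_params_ and grid_search.best_score_ report the tuned settings and their cross-validated accuracy.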

INPUT & OUTPUT:

S.No.   Particulars             Max. Marks   Marks Secured
01      Aim and Algorithm       5
02      Program and Execution   10
03      Viva Voce               5
04      Output                  5
05      Total                   25

RESULT:
