EDA LAB
INDEX
SL. NO.    DATE    PARTICULARS    PAGE NO.    SIGN
AIM:
APPARATUS:
EDA WEKA
ALGORITHM:
Step-1: Visit this website using any web browser and click on Free Download.
Step-2: You will be redirected to a new webpage; click on Start Download. The
executable file will start downloading shortly. The file is about 118 MB, so the
download may take a few minutes.
Step-3: Locate the executable file in the Downloads folder of your system and run it.
Step-4: When prompted to allow changes to your system, click on Yes.
Step-5: The Setup screen will appear; click on Next.
Step-6: The next screen shows the License Agreement; click on I Agree.
Step-7: The next screen lists the components to install. All components are already
selected, so leave them unchanged and click on the Install button.
Step-8: The next screen asks for the installation location. Choose a drive with
sufficient free disk space; the installation needs about 301 MB.
Step-9: The next screen asks for the Start menu folder. Leave it unchanged and
click on the Install button.
Step-10: The installation process will now start; it should take no more than about a
minute to complete.
Step-11: Click on the Next button after the installation process is complete.
Step-12: Click on Finish to close the installer.
Step-13: Weka is now installed on the system and an icon is created on the
desktop.
Step-14: Run the software and explore the interface.
PROGRAM:
1) Installing Weka on Windows:
Follow Step-1 to Step-14 listed under ALGORITHM above to install Weka on Windows.
Congratulations! At this point, you have successfully installed Weka on your Windows
system.
03 Viva Voce 5
04 Output 5
05 Total 25
RESULT:
AIM:
APPARATUS:
ALGORITHM:
Matplotlib:
Step-1: Import matplotlib.pyplot and numpy.
Step-2: Create the data: x using np.linspace and y = np.sin(x).
Step-3: Create the plot figure.
Step-4: Plot y against x as a line.
Step-5: Add the title, axis labels, legend, and grid.
Step-6: Display the plot.
Seaborn:
Step-1: Import seaborn, pandas, numpy, and matplotlib.pyplot.
Step-2: Create a DataFrame with the columns x, y, and category.
Step-3: Create the plot figure.
Step-4: Plot the data using sns.scatterplot (with hue and size).
Step-5: Customize the title.
Step-6: Display the plot.
Plotly:
Step-1: Import plotly.express.
Step-2: Create a data dictionary with the keys x and y.
Step-3: Plot the data using px.scatter.
Step-4: Save the figure as a PNG image using fig.write_image.
Step-5: Print a confirmation message.
PROGRAM:
Matplotlib:
Matplotlib is the foundation of data visualization in Python. It's versatile but requires
more code for complex visualizations.
Matplotlib Program:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100) # Sample data
y = np.sin(x)
plt.figure(figsize=(10, 6)) # Create a simple line plot
plt.plot(x, y, 'b-', linewidth=2, label='sin(x)')
plt.title('Simple Line Plot')
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.legend()
plt.grid(True)
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Seaborn:
Seaborn is built on Matplotlib but provides a higher-level interface for statistical
visualizations.
Seaborn Program:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)  # Create sample dataset
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})
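The Seaborn listing stops once the DataFrame is built. A minimal sketch of the remaining algorithm steps (figure, sns.scatterplot with hue and size, title, display) could look like the following. Using the category column for both hue and size, and the title text, are assumptions made here and are not taken from the original program.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(42)  # same sample dataset as in the Seaborn listing
df = pd.DataFrame({
    'x': np.random.normal(0, 1, 100),
    'y': np.random.normal(0, 1, 100),
    'category': np.random.choice(['A', 'B', 'C'], 100)
})

# Create the plot figure
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot; assumption: the category column drives both hue and size
sns.scatterplot(data=df, x='x', y='y', hue='category', size='category', ax=ax)

# Customize the title (assumed wording) and display the plot
ax.set_title('Seaborn Scatter Plot')
plt.show()
```

If marker size should encode a numeric quantity instead, a numeric column would have to be added to the DataFrame and passed as size.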
Plotly:
Plotly creates interactive visualizations that are great for dashboards and web
applications.
Plotly Program:
import plotly.express as px
data = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 5, 1, 8, 3]
}
fig = px.scatter(data, x="x", y="y", title="Simple Plotly Scatter Plot")
# Save as PNG image (fig.write_image needs the kaleido package to be installed)
fig.write_image("plotly_plot.png")
print("Plot saved as plotly_plot.png. Download and view.")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matplotlib Output:
Seaborn Output:
Plotly Output:
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# 1. Using NumPy directly
# Generate sample data
data_numpy = np.random.randn(1000) # Normally distributed data
# Calculate histogram
hist, bins = np.histogram(data_numpy, bins=30) # 30 bins
# Plot using Matplotlib
plt.figure(figsize=(8, 6))
plt.bar(bins[:-1], hist, width=np.diff(bins), align='edge')  # bins[:-1] are the left bin edges
plt.title("Histogram using NumPy and Matplotlib")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
# 2. Using Matplotlib's hist() function
plt.figure(figsize=(8, 6))
plt.hist(data_numpy, bins=30)
plt.title("Histogram using Matplotlib hist()")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
# 3. Using pandas
# Create a pandas Series from the NumPy array
data_pandas = pd.Series(data_numpy)
# Plot histogram using pandas' hist() function
plt.figure(figsize=(8, 6))
data_pandas.hist(bins=30)
plt.title("Histogram using pandas")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
# 4. Pandas' hist() with more customization.
plt.figure(figsize=(8, 6))
# adding edge color and transparency
data_pandas.hist(bins=30, edgecolor='black', alpha=0.7)
plt.title("Customized Histogram using pandas")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(axis='y', alpha=0.75) # adding grid lines on the y axis.
plt.show()
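As a small check on method 1, np.histogram returns one more bin edge than it returns counts, which is why the bar plot passes bins[:-1] as the bar positions. The sketch below confirms that relationship and that the counts add up to the sample size. The seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)            # assumed seed, only for repeatability
data = rng.standard_normal(1000)          # normally distributed sample

counts, edges = np.histogram(data, bins=30)

print(len(counts), len(edges))            # 30 counts, 31 edges
print(counts.sum())                       # every sample falls into exactly one bin

widths = np.diff(edges)                   # width of each bin
print(np.allclose(widths, widths[0]))     # bins=30 gives equal-width bins
```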
RESULT:
AIM:
APPARATUS:
ALGORITHM:
Line Chart:
Step-1: Import Matplotlib
Step-2: Prepare the Data
Step-3: Create the Plot to draw a line
Step-4: Add Labels and Title
Step-5: Add Grid (Optional but Recommended)
Step-6: Display the Plot
Bar Chart:
Step-1: Import Matplotlib
Step-2: Prepare Your Data
Step-3: Create the Bar Chart
Step-4: Add Labels and Title
Step-5: Add Grid Lines
Step-6: Show the Chart
Pie Chart:
Step-1: Import Matplotlib
Step-2: Prepare Your Data
Step-3: Create the Pie Chart
Step-4: Add Title and Fix Aspect Ratio
Step-5: Display the Chart
Scatter Plot:
Step-1: Import Required Libraries
Step-2: Generate Random Data
Step-3: Create the Scatter Plot
Step-4: Add Labels and Title
Step-5: Add Color Bar
Step-6: Add Grid and Show the Plot
Histogram:
Step-1: Import Libraries
Step-2: Generate Random Data
Step-3: Create the Histogram
Step-4: Add Labels and Title
Step-5: Add Grid Lines
Step-6: Show the Plot
PROGRAM:
Line Chart:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5, 6, 7]
y = [10, 15, 13, 17, 20, 18, 22]
# Create the line chart
plt.figure(figsize=(8, 5))
plt.plot(x, y, color='blue', marker='o', linestyle='-', linewidth=2)
# Add labels and title
plt.title('Line Chart Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.grid(True)
# Show the chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Bar Chart:
import matplotlib.pyplot as plt
# Sample data
categories = ['Apples', 'Bananas', 'Cherries', 'Dates', 'Elderberries']
values = [23, 45, 56, 12, 34]
# Create the bar chart
plt.figure(figsize=(8, 5))
plt.bar(categories, values, color='skyblue', edgecolor='black')
# Add labels and title
plt.title('Fruit Sales Bar Chart')
plt.xlabel('Fruits')
plt.ylabel('Quantity Sold')
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pie Chart:
import matplotlib.pyplot as plt
# Sample data
labels = ['Python', 'Java', 'C++', 'JavaScript']
sizes = [35, 30, 20, 15]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = (0.1, 0, 0, 0) # "explode" the 1st slice
# Create pie chart
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Programming Language Popularity')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
# Show chart
plt.show()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Scatter Plot:
import matplotlib.pyplot as plt
import numpy as np
# Generate some random data
np.random.seed(42) # For reproducibility
x = np.random.rand(50) * 100
y = np.random.rand(50) * 100
colors = np.random.rand(50)
sizes = 100 * np.random.rand(50)
# Create scatter plot
plt.figure(figsize=(8, 6))
scatter = plt.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')
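The scatter plot listing breaks off after the plt.scatter call, and no histogram program appears even though the algorithm lists one. A minimal sketch that follows the remaining algorithm steps is given below. The titles, axis labels, bin count, and histogram data are assumptions and are not taken from the original program.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)  # For reproducibility

# Scatter plot: same data generation as in the scatter plot listing
x = np.random.rand(50) * 100
y = np.random.rand(50) * 100
colors = np.random.rand(50)
sizes = 100 * np.random.rand(50)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')
plt.title('Scatter Plot Example')             # assumed title
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.colorbar(scatter, label='Color value')    # color bar for the c= values
plt.grid(True)
plt.show()

# Histogram: assumed data, 1000 normally distributed values
data = np.random.randn(1000)

plt.figure(figsize=(8, 5))
counts, edges, patches = plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram Example')                # assumed title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
```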
Bar Chart:
Pie Chart:
Scatter Plot:
Histogram:
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [250, 150, 200, 300, 400, 120, 330, 500]
}
# Create DataFrame
df = pd.DataFrame(data)
# Group and pivot
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')
# Plot the pivot table as an image
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=pivot_table.fillna(0).values,
                 colLabels=pivot_table.columns,
                 rowLabels=pivot_table.index,
                 cellLoc='center',
                 loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2)
# Save as image
plt.title("Pivot Table: Total Sales by Region and Product", fontsize=14)
plt.savefig("pivot_table.png", bbox_inches='tight')
plt.show()
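The group-then-pivot sequence can also be written as a single pd.pivot_table call. The sketch below is an alternative formulation on the same sample data, not the program's own method. The choices aggfunc='sum' and fill_value=0 are made here to mirror the groupby sum and the fillna(0) applied before drawing the table.

```python
import pandas as pd

data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [250, 150, 200, 300, 400, 120, 330, 500]
}
df = pd.DataFrame(data)

# One call replaces groupby(...).sum() followed by pivot(...)
pivot_table = pd.pivot_table(df, index='Region', columns='Product',
                             values='Sales', aggfunc='sum', fill_value=0)
print(pivot_table)
```

In this sample, North has sales only for product A (250 + 400 = 650) and West only for product B (300 + 500 = 800), so the remaining cells in those rows are filled with 0.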
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# 1. Generate Synthetic Time Series Data
date_rng = pd.date_range(start='2020-01-01', end='2023-12-01', freq='MS')
np.random.seed(42)
sales_data = (np.random.poisson(lam=200, size=len(date_rng))
              + np.linspace(0, 50, len(date_rng)))
df = pd.DataFrame({'Date': date_rng, 'Sales': sales_data})
df.set_index('Date', inplace=True)
# 2. Visualize Time Series
plt.figure(figsize=(10, 5))
sns.lineplot(data=df, x=df.index, y='Sales')
plt.title("Monthly Sales Over Time")
plt.ylabel("Sales")
plt.xlabel("Date")
plt.grid(True)
plt.show()
# 3. Time Series Decomposition
decomposition = seasonal_decompose(df['Sales'], model='additive')
decomposition.plot()
plt.show()
# 4. Stationarity Test (Augmented Dickey-Fuller)
result = adfuller(df['Sales'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
if result[1] <= 0.05:
    print("Series is stationary")
else:
    print("Series is not stationary. Differencing required")
# 5. Differencing (if needed)
df['Sales_diff'] = df['Sales'].diff()
df.dropna(inplace=True) # remove the first NaN value from differencing
# 6. Stationarity Test after Differencing
result_diff = adfuller(df['Sales_diff'])
print('ADF Statistic (Differenced):', result_diff[0])
print('p-value (Differenced):', result_diff[1])
print('Critical Values (Differenced):', result_diff[4])
if result_diff[1] <= 0.05:
    print("Differenced series is stationary")
else:
    print("Differenced series is not stationary.")
# 7. ACF and PACF Plots
plot_acf(df['Sales_diff'].dropna(), lags=20)
plt.show()
plot_pacf(df['Sales_diff'].dropna(), lags=20)
plt.show()
# 8. Train/Test Split
train = df['Sales_diff'].iloc[:-12]
test = df['Sales_diff'].iloc[-12:]
# 9. ARIMA Model Fitting (with frequency)
try:
    model = ARIMA(train, order=(1, 0, 1), freq='MS')  # Explicitly set frequency
    fitted_model = model.fit()
    # 10. Forecast
    forecast = fitted_model.forecast(steps=12)
    forecast = pd.Series(forecast, index=test.index)
    # 11. Reverse Differencing for Forecast
    forecast_original_scale = forecast.cumsum() + df['Sales'].iloc[-13]
    actual_original_scale = df['Sales'].iloc[-12:]
    # 12. Plot Forecast vs. Actual
    plt.figure(figsize=(10, 5))
    plt.plot(df['Sales'].iloc[:-12], label='Training')
    plt.plot(actual_original_scale, label='Actual')
    plt.plot(forecast_original_scale, label='Forecast', linestyle='--')
    plt.title("ARIMA Forecast vs Actual Sales")
    plt.legend()
    plt.grid(True)
    plt.show()
    # 13. Evaluate Model
    rmse = sqrt(mean_squared_error(actual_original_scale, forecast_original_scale))
    print(f"Root Mean Squared Error: {rmse:.2f}")
except Exception as e:
    print(f"An error occurred: {e}")
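Step 11 relies on the fact that a differenced series can be restored by cumulatively summing the differences and adding the last known level. A minimal sketch of that identity on a small made-up series follows. The values are illustrative assumptions and are unrelated to the synthetic sales data.

```python
import pandas as pd

sales = pd.Series([200.0, 210.0, 205.0, 220.0, 230.0])  # illustrative values
diff = sales.diff().dropna()            # first differences: 10, -5, 15, 10

# Treat the first two points as history and the last three differences
# as if they were forecasts of the differenced series.
last_known = sales.iloc[1]              # level just before the forecast horizon
restored = diff.iloc[1:].cumsum() + last_known

print(restored.tolist())                # [205.0, 220.0, 230.0]
print(restored.tolist() == sales.iloc[2:].tolist())
```

In the program, df['Sales'].iloc[-13] plays the role of last_known: it is the last sales value before the 12-month test window.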
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import io
import requests
# Load the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')))
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# Select numerical features for correlation analysis
numerical_features = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
numerical_df = df[numerical_features].copy()
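# Handle missing values (e.g., fill with mean for 'Age').
# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame.
numerical_df['Age'] = numerical_df['Age'].fillna(numerical_df['Age'].mean())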
# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Titanic Features")
plt.show()
# Additional insights (optional)
print("\nCorrelation Insights:")
print(correlation_matrix['Survived'].sort_values(ascending=False))
# Example of plotting scatterplots of highly correlated pairs:
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Pclass', y='Fare', data=numerical_df)
plt.title("Pclass vs. Fare")
plt.show()
# Example of plotting a boxplot of a categorical feature against a numerical one.
plt.figure(figsize=(8, 6))
sns.boxplot(x='Pclass', y='Age', data=numerical_df)
plt.title("Pclass vs Age")
plt.show()
INPUT & OUTPUT:
Correlation Insights:
Survived 1.000000
Fare 0.257307
Parch 0.081629
SibSp -0.035322
Age -0.069809
Pclass -0.338481
Name: Survived, dtype: float64
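The coefficients reported above are Pearson correlations, which is what DataFrame.corr() computes by default. A minimal sketch on a tiny made-up frame shows mean imputation followed by the same calculation, cross-checked against np.corrcoef. The column names are borrowed from the Titanic data, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only; not taken from the Titanic dataset
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Fare':   [80.0, 60.0, 25.0, 8.0, 7.5, 9.0],
    'Age':    [38.0, np.nan, 30.0, 22.0, np.nan, 26.0],
})

# Mean imputation by assignment
df['Age'] = df['Age'].fillna(df['Age'].mean())

corr = df.corr()                                   # Pearson by default
manual = np.corrcoef(df['Pclass'], df['Fare'])[0, 1]

print(corr.round(2))
print(np.isclose(corr.loc['Pclass', 'Fare'], manual))
```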
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
# Load the Wine dataset from a URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
column_names = [
    "Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"
]
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# Basic Information
print("--- Basic Information ---")
print(df.info())
print("\n--- Descriptive Statistics ---")
print(df.describe())
print("\n--- Class Distribution ---")
print(df['Class'].value_counts())
# Visualizations
# Histograms for numerical features
df.hist(figsize=(15, 10))
plt.suptitle("Histograms of Numerical Features", fontsize=16)
plt.show()
# Boxplots for numerical features
plt.figure(figsize=(15, 10))
sns.boxplot(data=df.drop('Class', axis=1))
plt.title("Boxplots of Numerical Features")
plt.xticks(rotation=45)
plt.show()
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
# Scatter plots (example: Alcohol vs. Color intensity)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Alcohol', y='Color intensity', hue='Class', data=df)
plt.title("Alcohol vs. Color Intensity")
plt.show()
# Pair plot (sample)
sns.pairplot(df[['Alcohol', 'Malic acid', 'Color intensity', 'Class']], hue='Class')
plt.suptitle("Pair Plot (Sample)", fontsize=16)
plt.show()
#Violin Plots
plt.figure(figsize = (15,8))
sns.violinplot(x = 'Class', y='Proline', data = df)
plt.title("Proline distribution by wine class")
plt.show()
#Density Plots
for column in df.drop('Class', axis=1).columns:
    plt.figure(figsize=(8, 6))
    for wine_class in df['Class'].unique():
        sns.kdeplot(df[df['Class'] == wine_class][column], label=f"Class {wine_class}")
    plt.title(f'Density Plot of {column} by Wine Class')
    plt.legend()
    plt.show()
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
# Load the Wine dataset from a URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
column_names = [
    "Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"
]
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# Basic Information
RESULT:
AIM:
APPARATUS:
ALGORITHM:
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import requests
import io
import matplotlib.image as mpimg
from io import BytesIO
# 1. Load the Iris dataset (example)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
try:
    s = requests.get(url).content
    df = pd.read_csv(io.StringIO(s.decode('utf-8')), names=column_names)
except requests.exceptions.RequestException as e:
    print(f"Error loading dataset from URL: {e}")
    exit()
# 2. Preprocessing
X = df.drop('species', axis=1)
y = df['species']
# Impute missing values (if any)
imputer = SimpleImputer(strategy='mean') # using mean imputation
X = imputer.fit_transform(X)
# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y) # 0, 1, 2 for Iris-setosa, Iris-versicolor, Iris-virginica
# 3. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 4. Model Development and Evaluation
# Logistic Regression
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Gradient Boosting
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
# 5. Hyperparameter Tuning (Example: Random Forest)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                           cv=5)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
# 7. Visualization (Example: Feature Importance from Random Forest)
feature_importance = best_rf.feature_importances_
plt.figure(figsize=(8, 6))
sns.barplot(x=feature_importance, y=column_names[:-1])
plt.title("Feature Importance (Random Forest)")
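# Save the plot to a BytesIO object, then display the saved image in a new figure.
buffer = BytesIO()
plt.savefig(buffer, format='png', bbox_inches='tight')
plt.close()  # close the bar plot so the image is not drawn on top of it
buffer.seek(0)
image = mpimg.imread(buffer)
plt.figure(figsize=(8, 6))
plt.imshow(image)
plt.axis('off')  # Turn off axis labels.
plt.show()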
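The listing imports accuracy_score, classification_report, confusion_matrix and cross_val_score but jumps from step 5 to step 7 without using them. A minimal sketch of what an evaluation step could look like is given below. It is an assumption about the missing step, not the original code. It loads Iris through sklearn.datasets.load_iris so that it runs without network access, and it wraps the scaler and the model in a pipeline, which is a design choice made here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling inside a pipeline means the scaler is fitted on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(f"Test accuracy: {accuracy:.3f}")
print("Confusion matrix:")
print(cm)
print(classification_report(y_test, y_pred))
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The same three metric calls can be applied to y_pred_logreg, y_pred_rf, y_pred_gb, and y_pred_best_rf from the program to compare the models on the held-out test set.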
RESULT: