Visvesvaraya Technological University (VTU)
Machine Learning
Laboratory
BCSL606
2025
Faculty Incharge
Prof. Bhagyashri Wakde
RAJIV GANDHI INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Bengaluru - 560032
2025
Machine Learning Laboratory (BCSL606)
1. Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify any
outliers. Use California Housing dataset.
DATASET
California Housing dataset
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
california_housing.frame.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252        41  6.984127   1.023810         322  2.555556     37.88    -122.23        4.526
1  8.3014        21  6.238137   0.971880        2401  2.109842     37.86    -122.22        3.585
2  7.2574        52  8.288136   1.073446         496  2.802260     37.85    -122.24        3.521
3  5.6431        52  5.817352   1.073059         558  2.547945     37.85    -122.25        3.413
4  3.8462        52  6.281853   1.081081         565  2.181467     37.85    -122.25        3.422
print(california_housing.DESCR)
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. rubric:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
PROGRAM 1
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
def load_data():
    data = fetch_california_housing(as_frame=True)
    df = data['data']
    df['Target'] = data['target']  # Add the target (house value) to the DataFrame
    return df

# Function to plot histograms for numerical features
def plot_histograms(df):
    numerical_features = df.select_dtypes(include='number').columns
    df[numerical_features].hist(bins=20, figsize=(15, 10), color='skyblue', edgecolor='black')
    plt.suptitle('Histograms of Numerical Features', fontsize=16)
    plt.tight_layout()
    plt.show()

# Function to generate box plots and identify outliers
def plot_boxplots(df):
    numerical_features = df.select_dtypes(include='number').columns
    for feature in numerical_features:
        plt.figure(figsize=(8, 6))
        sns.boxplot(x=df[feature], color='lightblue')
        plt.title(f'Box Plot of {feature}', fontsize=14)
        plt.xlabel(feature, fontsize=12)
        plt.tight_layout()
        plt.show()

        # Identify outliers using the IQR method
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
        print(f"{feature}: {len(outliers)} outliers found.")

# Main function to run the analysis
def main():
    df = load_data()
    print("Dataset loaded successfully!")
    print("\nGenerating histograms for numerical features...")
    plot_histograms(df)
    print("\nGenerating box plots and identifying outliers...")
    plot_boxplots(df)

if __name__ == "__main__":
    main()
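As a quick check of the IQR rule with illustrative numbers (not taken from the dataset): if Q1 = 2.0 and Q3 = 4.0, then IQR = 2.0, and any value below 2.0 - 1.5 * 2.0 = -1.0 or above 4.0 + 1.5 * 2.0 = 7.0 is flagged as an outlier.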
OUTPUT:
=====================RESTART: C:/Users/USER/first.py ======================
Dataset loaded successfully!
Generating histograms for numerical features...
2. Develop a program to compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to identify
which variables have strong positive/negative correlations. Create a pair plot to visualize
pairwise relationships between features. Use the California Housing dataset.
PROGRAM 2
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
def load_data():
    data = fetch_california_housing(as_frame=True)
    df = data['data']
    df['Target'] = data['target']  # Add the target (house value) as a column
    return df

# Function to compute and visualize the correlation matrix
def visualize_correlation_matrix(df):
    # Compute the correlation matrix
    corr_matrix = df.corr()

    # Plot the heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(
        corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, square=True,
        linewidths=0.5, annot_kws={"size": 10}
    )
    plt.title("Correlation Matrix Heatmap", fontsize=16)
    plt.show()

# Function to create a pair plot
def create_pairplot(df):
    # Select the numerical features (subset them here if the dataset is large)
    selected_features = df.select_dtypes(include='number').columns
    sns.pairplot(df[selected_features], diag_kind='kde', corner=True, height=2.0)
    plt.suptitle("Pair Plot of Numerical Features", y=1.02, fontsize=16)
    plt.show()

# Main function to run the analysis
def main():
    # Load dataset
    df = load_data()
    print("Dataset loaded successfully!")

    # Correlation matrix heatmap
    print("\nVisualizing the correlation matrix...")
    visualize_correlation_matrix(df)

    # Pair plot
    print("\nCreating pair plot for numerical features...")
    create_pairplot(df)

if __name__ == "__main__":
    main()
OUTPUT:
==================== RESTART: C:/Users/USER/Second.py ======================
Dataset loaded successfully!
Visualizing the correlation matrix...
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
PROGRAM 3.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
def load_data():
    iris = load_iris()
    data = pd.DataFrame(iris.data, columns=iris.feature_names)
    data['target'] = iris.target
    data['target_names'] = [iris.target_names[i] for i in iris.target]
    return data, iris.target_names

# Perform PCA to reduce dimensions
def perform_pca(data, n_components=2):
    # Extract features and standardize them
    features = data.iloc[:, :-2]  # Exclude target and target_names
    scaler = StandardScaler()
    standardized_features = scaler.fit_transform(features)

    # Apply PCA
    pca = PCA(n_components=n_components)
    principal_components = pca.fit_transform(standardized_features)

    # Create a DataFrame with the PCA results
    pca_data = pd.DataFrame(
        principal_components, columns=[f'PC{i+1}' for i in range(n_components)]
    )
    pca_data['target'] = data['target']
    pca_data['target_names'] = data['target_names']
    return pca_data, pca

# Visualize PCA results
def plot_pca_results(pca_data, explained_variance, target_names):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        x='PC1', y='PC2', hue='target_names', data=pca_data,
        palette='Set1', s=100, alpha=0.7, edgecolor='k'
    )
    plt.title(
        f'PCA Results (Explained Variance: PC1={explained_variance[0]:.2f}, '
        f'PC2={explained_variance[1]:.2f})', fontsize=14
    )
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(title='Classes', loc='upper right')
    plt.grid(True, alpha=0.4)
    plt.tight_layout()
    plt.show()

# Main function to run the program
def main():
    # Load the Iris dataset
    data, target_names = load_data()
    print("Iris dataset loaded successfully!")

    # Perform PCA
    pca_data, pca = perform_pca(data, n_components=2)
    explained_variance = pca.explained_variance_ratio_

    # Print explained variance
    print(f"Explained Variance Ratio by Principal Components: {explained_variance}")

    # Visualize PCA results
    print("\nVisualizing PCA results...")
    plot_pca_results(pca_data, explained_variance, target_names)

if __name__ == "__main__":
    main()
OUTPUT:
====================== RESTART: C:/Users/USER/third.py ======================
Iris dataset loaded successfully!
Explained Variance Ratio by Principal Components: [0.72962445 0.22850762]
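Together, the first two principal components explain about 0.7296 + 0.2285 ≈ 0.958 of the variance, so the 2-D projection retains roughly 96% of the information in the four standardized features.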
Visualizing PCA results...
4. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples.
DATASET:
training_data.csv
Sky Temp Humidity Wind Water Forecast PlayTennis
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
PROGRAM 4.
import pandas as pd

# Function to implement the Find-S algorithm
def find_s_algorithm(data, target_col):
    # Initialize the most specific hypothesis
    attributes = data.columns[:-1]  # Exclude the target column
    hypothesis = ['ϕ'] * len(attributes)

    # Iterate through the training examples
    for _, row in data.iterrows():
        if row[target_col] == "Yes":  # Only consider positive examples
            for i, value in enumerate(row[:-1]):  # Iterate through attributes
                if hypothesis[i] == 'ϕ':  # Update when hypothesis is 'ϕ'
                    hypothesis[i] = value
                elif hypothesis[i] != value:  # Generalize the hypothesis
                    hypothesis[i] = '?'
    return hypothesis

# Main function to load data and run the Find-S algorithm
def main():
    # Load the dataset
    try:
        data = pd.read_csv(r'C:\Users\USER\Desktop\training_data.csv')
        print("Training Data Loaded Successfully!\n")
        print(data, "\n")

        # Ensure the target column exists
        target_col = data.columns[-1]
        print(f"Target column identified: {target_col}\n")

        # Run the Find-S algorithm
        final_hypothesis = find_s_algorithm(data, target_col)
        print(f"Final Hypothesis Consistent with Positive Examples: {final_hypothesis}")
    except Exception as e:
        print(f"Error loading data: {e}")

# Run the program
if __name__ == "__main__":
    main()
OUTPUT:
====================== RESTART: C:/Users/USER/fourth.py ======================
Training Data Loaded Successfully!
Sky Temp Humidity Wind Water Forecast PlayTennis
0 Sunny Warm Normal Strong Warm Same Yes
1 Sunny Warm High Strong Warm Same Yes
2 Rainy Cold High Strong Warm Change No
3 Sunny Warm High Strong Cool Change Yes
Target column identified: PlayTennis
Final Hypothesis Consistent with Positive Examples: ['Sunny', 'Warm', '?', 'Strong', '?', '?']
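Trace of the hypothesis updates: the first positive example initializes the hypothesis to ['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same']; the second positive example generalizes Humidity to '?'; the third (negative) example is ignored by Find-S; and the fourth positive example generalizes Water and Forecast to '?', yielding the final hypothesis above.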
5. Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly
generated values of x in the range [0, 1]. Perform the following based on the dataset
generated. a. Label the first 50 points {x1, …, x50} as follows: if (xi ≤ 0.5), then xi ∊
Class1, else xi ∊ Class2. b. Classify the remaining points, x51, …, x100, using KNN.
Perform this for k = 1, 2, 3, 4, 5, 20, 30.
PROGRAM 5
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Generate 100 random values in the range [0, 1]
np.random.seed(42)  # For reproducibility
x = np.random.rand(100)

# Label the first 50 points
y = np.array(["Class1" if xi <= 0.5 else "Class2" for xi in x[:50]])

# Prepare training and test datasets
x_train = x[:50].reshape(-1, 1)  # Training features (first 50 points)
y_train = y                      # Training labels (first 50 points)
x_test = x[50:].reshape(-1, 1)   # Test features (remaining 50 points)

# Function to classify and visualize results for different k values
def classify_knn(k_values):
    for k in k_values:
        # Initialize and fit the KNN classifier
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(x_train, y_train)

        # Predict the classes for test data
        y_pred = knn.predict(x_test)

        # Print the results
        print(f"\nResults for k={k}:")
        print(f"Predicted Classes for Test Data: {y_pred}")

        # Visualization
        plt.figure(figsize=(8, 5))
        plt.scatter(x[:50], [0] * 50, c=['red' if label == "Class1" else 'blue' for label in y],
                    label="Training Data (Class1=Red, Class2=Blue)")
        plt.scatter(x[50:], [0] * 50, c=['red' if label == "Class1" else 'blue' for label in y_pred],
                    marker='x', label="Test Data Predictions")
        plt.axvline(x=0.5, color='gray', linestyle='--', label="Decision Boundary (x=0.5)")
        plt.title(f"KNN Classification with k={k}")
        plt.xlabel("x values")
        plt.yticks([])
        plt.legend()
        plt.grid(alpha=0.4)
        plt.tight_layout()
        plt.show()

# Specify k values for the KNN algorithm
k_values = [1, 2, 3, 4, 5, 20, 30]

# Perform classification and visualization for specified k values
classify_knn(k_values)
OUTPUT:
====================== RESTART: C:/Users/USER/fifth.py ======================
Results for k=1:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=2:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=3:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=4:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=5:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=20:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class1']
Results for k=30:
Predicted Classes for Test Data: ['Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class1'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class1' 'Class1' 'Class2'
'Class1' 'Class2' 'Class1' 'Class2' 'Class2' 'Class1' 'Class1' 'Class2'
'Class2' 'Class2' 'Class2' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2'
'Class1' 'Class1' 'Class1' 'Class1' 'Class2' 'Class2' 'Class2' 'Class1'
'Class1' 'Class2' 'Class2' 'Class2' 'Class2' 'Class1' 'Class2' 'Class1'
'Class1' 'Class1']
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
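For each query point x, LWR fits a separate weighted linear model by solving the weighted normal equation theta = (X^T W X)^(-1) X^T W y, where W is a diagonal matrix of Gaussian weights w_j = exp(-(x_j - x)^2 / (2 tau^2)). The bandwidth tau controls how local the fit is: a small tau tracks the data closely (risking a wiggly fit), while a large tau approaches ordinary least squares. The program below implements exactly this per-query solve.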
PROGRAM 6
import numpy as np
import matplotlib.pyplot as plt

# Locally Weighted Regression (LWR) function
def locally_weighted_regression(x_train, y_train, x_test, tau):
    """
    Perform Locally Weighted Regression (LWR).

    Parameters:
    x_train: np.array, shape (n,)
        Training data features.
    y_train: np.array, shape (n,)
        Training data labels.
    x_test: np.array, shape (m,)
        Test data features.
    tau: float
        Bandwidth parameter (controls the weight decay).

    Returns:
    y_pred: np.array, shape (m,)
        Predicted values for x_test.
    """
    m = len(x_test)
    y_pred = np.zeros(m)
    X = np.c_[np.ones(len(x_train)), x_train]  # Add intercept term
    for i in range(m):
        weights = np.exp(-np.square(x_train - x_test[i]) / (2 * tau**2))  # Gaussian weights
        W = np.diag(weights)  # Diagonal weight matrix
        theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y_train  # Weighted normal equation
        y_pred[i] = np.array([1, x_test[i]]) @ theta  # Predict y for x_test[i]
    return y_pred

# Generate synthetic dataset
np.random.seed(42)
x_train = np.linspace(0, 10, 50)
y_train = 2 * np.sin(x_train) + np.random.normal(0, 0.5, size=len(x_train))

# Test points (finer granularity for a smoother graph)
x_test = np.linspace(0, 10, 200)

# Apply LWR for different tau values
tau_values = [0.1, 0.5, 1, 5]
plt.figure(figsize=(12, 8))
for i, tau in enumerate(tau_values, 1):
    y_pred = locally_weighted_regression(x_train, y_train, x_test, tau)
    plt.subplot(2, 2, i)
    plt.scatter(x_train, y_train, label="Training Data", color="red")
    plt.plot(x_test, y_pred, label=f"LWR (tau={tau})", color="blue")
    plt.title(f"Locally Weighted Regression (tau={tau})")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.grid(alpha=0.4)
plt.tight_layout()
plt.show()
OUTPUT:
====================== RESTART: C:/Users/USER/Sixth.py ======================
7. Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset
(for vehicle fuel efficiency prediction) for Polynomial Regression.
PROGRAM 7
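The program listing for this experiment is missing from the source. Note that the recorded output below reports a California Housing MSE: the Boston Housing dataset was removed from scikit-learn (version 1.2), so California Housing is the usual stand-in for the linear-regression half. The following is a minimal sketch under that assumption; the Auto MPG loading details (UCI URL, column names, '?' as the missing-value marker) are likewise assumptions of the sketch, not the original author's code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# ---- Linear Regression (California Housing, assumed stand-in for Boston) ----
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)
lin_reg = LinearRegression().fit(X_train, y_train)
print("California Housing Dataset Linear Regression MSE:",
      mean_squared_error(y_test, lin_reg.predict(X_test)))

# ---- Polynomial Regression (Auto MPG; URL and parsing are assumptions) ----
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
cols = ["mpg", "cylinders", "displacement", "horsepower", "weight",
        "acceleration", "model_year", "origin"]
auto = pd.read_csv(url, names=cols, na_values="?", comment="\t",
                   sep=" ", skipinitialspace=True).dropna()
X = auto[["horsepower"]].values  # Predict fuel efficiency from horsepower
y = auto["mpg"].values
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Plot the fitted degree-2 curve against the data
x_line = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X, y, alpha=0.5, label="Auto MPG data")
plt.plot(x_line, poly_reg.predict(x_line), color="red", label="Degree-2 polynomial fit")
plt.xlabel("Horsepower")
plt.ylabel("MPG")
plt.title("Polynomial Regression on Auto MPG")
plt.legend()
plt.show()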
OUTPUT:
==================== RESTART: C:/Users/USER/Seventh.py =====================
California Housing Dataset Linear Regression MSE: 0.5558915986952442
8. Develop a program to demonstrate the working of the decision tree algorithm. Use
Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.
PROGRAM 8
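The program listing for this experiment is missing from the source. Below is a minimal sketch, assuming scikit-learn's built-in load_breast_cancer, a held-out test split, and the first test sample standing in for the "new sample" to classify; these choices are assumptions of the sketch, not the original author's code.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset and split it
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Build the decision tree and evaluate it on held-out data
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Classify a "new" sample (here, the first test sample; an assumption of this sketch)
new_sample = X_test[0].reshape(1, -1)
prediction = clf.predict(new_sample)[0]
print("Predicted class for the new sample:", data.target_names[prediction])

# Visualize the top of the tree (depth limited for readability)
plt.figure(figsize=(12, 6))
plot_tree(clf, max_depth=2, feature_names=data.feature_names,
          class_names=list(data.target_names), filled=True, fontsize=8)
plt.show()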
OUTPUT:
9. Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training. Compute the accuracy of the classifier, considering a few test data
sets.
PROGRAM 9
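The program listing for this experiment is missing from the source. A minimal sketch follows, assuming Gaussian Naive Bayes (GaussianNB) on raw pixel intensities from fetch_olivetti_faces with a stratified train/test split; the split ratio and random seed are arbitrary choices of the sketch.

from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Olivetti Faces dataset (400 images of 40 subjects, 64x64 pixels)
faces = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = faces.data, faces.target  # X is flattened to 4096 features per image

# Stratified split keeps all 40 subjects in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Train a Gaussian Naive Bayes classifier on the raw pixels
gnb = GaussianNB().fit(X_train, y_train)

# Compute accuracy on the held-out test data
y_pred = gnb.predict(X_test)
print(f"Naive Bayes accuracy on Olivetti test set: {accuracy_score(y_test, y_pred):.3f}")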
OUTPUT:
10. Develop a program to implement k-means clustering using the Wisconsin Breast Cancer
data set and visualize the clustering result.
PROGRAM 10
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# Load the Wisconsin Breast Cancer Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Diagnosis"])  # Target labels

# Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)  # 2 clusters for malignant/benign
clusters = kmeans.fit_predict(X_scaled)

# Add clustering labels to the dataset
X["Cluster"] = clusters
y["Cluster"] = clusters

# Evaluate clustering using the silhouette score
silhouette_avg = silhouette_score(X_scaled, clusters)
print(f"Silhouette Score: {silhouette_avg:.3f}")

# Visualize clustering with PCA (2D projection)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', s=50, alpha=0.6,
            label='Clustered Data')
plt.title("K-Means Clustering on Wisconsin Breast Cancer Dataset")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Cluster')
plt.show()

# Optional: Compare clusters with actual diagnosis
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target, cmap='coolwarm', s=50, alpha=0.6,
            label='Actual Labels')
plt.title("Actual Diagnosis Labels (for comparison)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Diagnosis (0=Malignant, 1=Benign)')
plt.show()
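The comparison above is only visual. As an optional extra check (an addition to the listing above, not part of the original program), agreement between the cluster assignments and the actual diagnosis labels can be quantified with the adjusted Rand index:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters reproduce the diagnosis labels exactly;
# values near 0.0 mean agreement no better than chance.
ari = adjusted_rand_score(data.target, clusters)
print(f"Adjusted Rand Index vs. actual diagnosis: {ari:.3f}")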
OUTPUT:
===================== RESTART: C:/Users/USER/tenth.py ======================
Silhouette Score: 0.345