ISMLA_Module5

Intelligent Systems and Machine Learning Algorithms (BEC515A)
Prepared by: Chetan B V, Asst. Prof., ECE, GMIT, Davangere

End-to-end Machine Learning Project


Working with Real Data
Working with real-world data is essential for learning machine learning. There are many open datasets
available through:
1. Popular Repositories:
o UC Irvine ML Repository: A collection of datasets for research and teaching.
o Kaggle Datasets: Offers datasets for various domains, often with code notebooks.
o AWS Datasets: Large datasets for big data and machine learning.
2. Meta Portals:
o dataportals.org: Lists global open data portals.
o opendatamonitor.eu: Focuses on European datasets.
o Quandl: Provides economic, financial, and alternative datasets.
3. Other Resources:
o Wikipedia’s List of ML Datasets: Comprehensive dataset list.
o Quora & Datasets Subreddit: Communities sharing datasets.
The California Housing Prices dataset is based on the 1990 U.S. Census data and is used for predictive
modeling, helping to estimate housing prices based on factors like demographics and geography. For
teaching, the dataset is often modified by adding categorical attributes and removing some features,
which helps illustrate key concepts in data manipulation and machine learning.

Fig. California housing prices


Look at the Big Picture


You're tasked with predicting California housing prices using census data, such as population and
median income, to aid investment decisions. This prediction will be part of a larger pipeline affecting
revenue.

Problem Framing
• Business Goal: Predict district median housing prices for investment decisions.
• Current Solution: Experts manually estimate prices, but it’s costly and often inaccurate (off by
20%).
• Machine Learning Task: This is a supervised learning regression problem, specifically multiple
regression, as you’re predicting a continuous value (median housing price) from multiple features.

Fig. A Machine Learning pipeline for real estate investments

Data Pipeline and Learning Type


• Pipeline: Data flows through multiple components, each processing and passing data
asynchronously, making the system scalable and robust.
• Learning Type: Batch learning is suitable due to manageable data size and stability, but online
learning could be used for larger, dynamic datasets.

Performance Measure
• RMSE: Measures prediction accuracy, penalizing large errors, and is given by:
RMSE(X, h) = sqrt( (1/m) * Σ (h(x_i) - y_i)^2 ), with the sum running over all m instances,
where:
h(x_i) is the predicted value for the i-th instance, and y_i is the actual value.


• MAE: An alternative that is less sensitive to outliers than RMSE.
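
Both metrics are available in Scikit-Learn. A minimal sketch with made-up numbers (the values below are purely illustrative):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_true = np.array([200000., 350000., 150000.])      # hypothetical actual prices
y_pred = np.array([210000., 330000., 180000.])      # hypothetical predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # less sensitive to outliers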

Assumptions Check
• Confirmed that the downstream system needs actual prices, not categories, validating the
regression approach.

Get the Data


The full Jupyter notebook is available at https://github.com/ageron/handson-ml2.

Create the Workspace


To set up your machine learning workspace:
1. Install Python
Check if Python is installed by running:
$ python3 --version
If not, install it from python.org.

2. Create Workspace Directory


Create a directory for your projects:
$ export ML_PATH="$HOME/ml"
$ mkdir -p $ML_PATH

3. Install Required Libraries


Ensure pip is installed and up to date:
$ python3 -m pip --version
$ python3 -m pip install --user -U pip
Install the required libraries:
$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn

4. Check Installation
Verify libraries by running:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
No output or errors means successful installation.


5. Launch Jupyter Notebook


Start Jupyter by running:
$ jupyter notebook
This will open Jupyter at http://localhost:8888/.

6. Create and Rename Notebook


In Jupyter, click New, select Python, and rename the notebook (e.g., to "Housing").

7. Test Your Notebook


Type the following code in the first cell:
print("Hello world!")
Press Shift + Enter to execute and see the output below the cell.

Now you're ready to start coding in your machine learning workspace!

Download the Data


The process of downloading, extracting, and loading the housing data can be automated with a Python
script. Below is a complete version of the solution you can use. The script will download the
compressed housing.tgz file, extract its contents, and then load the housing.csv file into a Pandas
DataFrame.

1. Fetching the Data:


The fetch_housing_data function downloads and extracts housing.tgz:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if it does not already exist
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # download the archive and extract its contents
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    print("Data downloaded and extracted.")

2. Loading the Data:


The load_housing_data function loads the extracted CSV into a Pandas DataFrame:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

3. Putting It All Together:


Call the functions to download, extract, and load the data:
fetch_housing_data()
housing_data = load_housing_data()
print(housing_data.head())
This script automates the process of fetching and loading the housing dataset into a Pandas
DataFrame.

Take a Quick Look at the Data Structure


The dataset contains 20,640 rows, each representing a district with 10 attributes, including numerical
features (longitude, latitude, median house value) and one categorical feature (ocean_proximity).
Attributes: There are 10 attributes (columns) in the dataset:
• longitude: Longitude of the district.
• latitude: Latitude of the district.
• housing_median_age: The median age of houses in the district.
• total_rooms: The total number of rooms in the district.
• total_bedrooms: The total number of bedrooms in the district.
• population: The population of the district.
• households: The number of households in the district.
• median_income: The median income of the district's residents.
• median_house_value: The median value of homes in the district (target variable).


• ocean_proximity: The proximity to the ocean, which is a categorical variable.

Key steps in the data exploration include:


1. Missing Data: The total_bedrooms attribute has 207 missing values.
2. Data Types: All attributes are numerical, except ocean_proximity, which is categorical with five
values: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY, and ISLAND.
3. Statistical Summary: The describe() method provides key metrics (count, mean, min, max, std, and
percentiles). For example, 25% of districts have a median housing age lower than 18.
4. Histograms: Plotted for numerical features, showing skewed distributions. Notably:
o Median Income: Scaled and capped, representing about $30,000 for values near 3.
o Median House Value: Capped at $500,000, potentially impacting model predictions.
o Distributions: Many features are tail-heavy, which may require transformation for better
model performance.
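
These inspection steps correspond to a few one-liners (a sketch following the book's notebook):
import matplotlib.pyplot as plt
housing = load_housing_data()
housing.info()                             # column dtypes and missing-value counts
housing["ocean_proximity"].value_counts()  # frequency of each category
housing.describe()                         # statistical summary of numerical attributes
housing.hist(bins=50, figsize=(20, 15))    # histograms of all numerical attributes
plt.show()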

Create a Test Set


To prevent data snooping bias (where test set patterns influence model selection), it’s important to
set aside a test set early. Typically, 20% of the data is used as the test set.

Basic Test Set Creation


You can randomly split the dataset using a function like split_train_test(). To ensure consistent test
sets, set a random seed or use a unique identifier (e.g., row index) for splitting:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
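
For example, to hold out 20% of the data:
train_set, test_set = split_train_test(housing, 0.2)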

Alternatively, use Scikit-Learn’s train_test_split() with a random_state for reproducibility:


from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)


Stratified Sampling
For small datasets, stratified sampling ensures the test set is representative. For example, bucket
median income into an income_cat attribute (see the sketch below) and use StratifiedShuffleSplit.
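The income_cat attribute is not created by the code in this section; a minimal sketch using pd.cut (the bin edges follow the book's example and are an assumption here):
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
With income_cat in place, perform the stratified split: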
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Check the income proportions in the test set:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

Final Cleanup
Remove the income_cat attribute to restore the dataset:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Conclusion
Setting aside a test set early prevents bias, and stratified sampling ensures the test set accurately
represents key features, especially in smaller datasets. This is crucial for unbiased model evaluation.

Discover and Visualize the Data to Gain Insights


After setting aside the test set, explore the training set to uncover patterns. If the dataset is large,
sample a subset; otherwise, work with the full set. Start by creating a copy for experimentation:
housing = strat_train_set.copy()

Visualizing Geographical Data


Use scatterplots to visualize latitude and longitude. Adjusting the transparency (alpha=0.1) helps
highlight high-density areas like the Bay Area, Los Angeles, and Central Valley.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


Visualizing Housing Prices


Size the points based on population and color them by housing prices using the jet color map to
visualize price distribution:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10, 7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
This reveals that housing prices are influenced by location and population density.

Correlations Between Features


Compute the correlation matrix to examine relationships between features, particularly how they
correlate with median house value:
corr_matrix = housing.corr()  # on pandas >= 2.0, use housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
Median income has the highest positive correlation with housing prices.

Scatter Matrix
Use a scatter matrix to visualize relationships between numerical features:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Exploring Median Income vs. Median House Value


A scatter plot between median income and median house value shows a strong correlation and
distinct price caps at certain values:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)

Experimenting with Attribute Combinations


Create new attributes like rooms per household and bedrooms per room for better insights:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Recompute the correlation matrix to assess the impact of these new features:


corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
These new features show stronger correlations with median house value.

Conclusion
Exploration through visualizations, correlation analysis, and attribute combinations provides valuable
insights into feature relationships, helping you better prepare the data for machine learning.

Prepare the Data for Machine Learning Algorithms


Preparing data for machine learning involves several steps that ensure your model receives clean and
well-processed data. Writing reusable functions for these steps is essential for reproducibility,
efficiency, and consistency. These functions help streamline the data preparation process for new
datasets and allow for easy experimentation.
Step 1: Separate Predictors and Labels
Start by splitting the dataset into predictors (features) and labels (target values):
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Step 2: Handle Missing Values


Machine learning algorithms require clean data, so missing values must be addressed. There are
several ways to handle missing data:
1. Remove rows with missing values:
housing.dropna(subset=["total_bedrooms"])

2. Drop the column:


housing.drop("total_bedrooms", axis=1)

3. Replace missing values with the median:


median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
For more automated handling, use Scikit-learn’s SimpleImputer to fill missing values with the median
of the column:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # exclude the categorical column
imputer.fit(housing_num)
housing_num = pd.DataFrame(imputer.transform(housing_num),
                           columns=housing_num.columns)

Step 3: Handle Categorical Attributes


Categorical attributes like ocean_proximity need to be converted into numerical values:
1. Ordinal Encoding: Use OrdinalEncoder to assign integer values to categories:
from sklearn.preprocessing import OrdinalEncoder
housing_cat = housing[["ocean_proximity"]]  # the categorical column as a DataFrame
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
2. One-Hot Encoding: Convert categorical values into binary columns (one per category) using
OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
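Note that OneHotEncoder returns a SciPy sparse matrix by default; convert it when a dense array is needed:
housing_cat_1hot.toarray()  # dense NumPy array, one binary column per category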

Step 4: Feature Scaling


Most machine learning models perform better when the features are scaled. Common scaling methods
are:
1. Min-Max Scaling: Rescale features to a [0, 1] range using MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
housing_scaled = scaler.fit_transform(housing_num)
2. Standardization: Scale features to have a zero mean and unit variance using StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
housing_standardized = scaler.fit_transform(housing_num)

Step 5: Combine Transformers Using Pipelines


A pipeline automates the application of multiple transformations to the data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
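
The CombinedAttributesAdder used below is a custom transformer, not a Scikit-Learn class. A minimal sketch consistent with the attribute combinations created earlier (the column indices assume the housing_num column order and follow the book's implementation):
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# column indices in the NumPy array produced from housing_num
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]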


num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

Step 6: Apply All Transformations Using ColumnTransformer


To apply different transformations to different columns, use ColumnTransformer:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)   # names of the numerical columns
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
This will apply numerical transformations (imputation, attribute addition, scaling) to numerical
columns and one-hot encoding to categorical columns.
By organizing these transformations into functions and pipelines, you ensure that your data is
preprocessed correctly and consistently for machine learning models.

Select and Train a Model


After completing the data preparation steps, the next phase is to select and train a machine learning
model. Let’s go through the process step by step.
1. Training a Linear Regression Model
The first model to try is a Linear Regression. This model works well for regression tasks where the
relationship between features and labels is linear. Here’s how to train the model:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Once the model is trained, you can test it on some data from the training set:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))


print("Labels:", list(some_labels))
After predictions, evaluate the model's performance using the Root Mean Squared Error (RMSE):
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

2. Underfitting and Choosing a Better Model


A model with a high RMSE (e.g., $68,628 in this case) can be a sign of underfitting. This may occur if
the model is too simple or the features are not informative enough. In this case, trying a more complex
model could help.

3. Training a Decision Tree Model


To capture more complex relationships in the data, a Decision Tree Regressor can be used:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
The Decision Tree model may achieve perfect results on the training set (RMSE = 0), but this is often a
sign of overfitting. Overfitting happens when the model learns to perform well on training data but
generalizes poorly to new data.

4. Better Evaluation Using Cross-Validation


Instead of evaluating the model on a single validation set, K-fold cross-validation provides a more
robust evaluation by splitting the data into 10 distinct folds and training the model multiple times:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
This cross-validation approach provides not only the performance score but also the standard
deviation of the scores, which helps understand the model’s stability.
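
A small helper (following the book's notebook) prints the scores along with their mean and standard deviation:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)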

5. Evaluating the Linear Regression Model


To compare the Decision Tree model with the Linear Regression model, you can run cross-validation
on the Linear Regression model as well:


lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

6. Random Forest Regressor


The Random Forest Regressor is an ensemble method that builds multiple Decision Trees and averages
their predictions. It often performs better than individual Decision Trees:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

7. Overfitting in Random Forest


Even though Random Forest typically performs better than individual Decision Trees, overfitting may
still occur if the model is too complex. Therefore, adjusting hyperparameters, such as the number of
trees or the maximum depth, may be necessary.

8. Saving and Reusing Models


Once a promising model is selected, it’s essential to save it for future use. Scikit-learn models can be
saved using joblib:
import joblib  # sklearn.externals.joblib was removed; import joblib directly
joblib.dump(my_model, "my_model.pkl")
Later, the saved model can be loaded and used again:
my_model_loaded = joblib.load("my_model.pkl")

Fine-Tune Your Model


After narrowing down your model options, fine-tuning them is crucial to improve performance. Here
are a few ways to do this:
1. Grid Search
• GridSearchCV automates hyperparameter optimization by exploring all possible combinations of
hyperparameters using cross-validation.
• Example for tuning a RandomForestRegressor:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
• After training, you can retrieve the best parameters and model:
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_
• This method can take time but helps find the best hyperparameter combination through
exhaustive search.

2. Randomized Search
• For larger hyperparameter spaces, RandomizedSearchCV is faster. It samples random
combinations of hyperparameters, which can explore a broader search space in fewer iterations
(see the sketch after this list).
• Benefits:
o Explores more hyperparameter combinations.
o Provides better control over the search budget (e.g., running for 1,000 iterations).
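
A minimal sketch of RandomizedSearchCV for the same model (forest_reg as defined above; the parameter distributions here are illustrative assumptions):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)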

3. Ensemble Methods
• Combine the best models to improve performance. For example, Random Forests combine
multiple decision trees, leading to better accuracy than a single tree.
• Ensembles typically perform better when individual models make different types of errors.

4. Analyzing Best Models and Their Errors


• Check feature importance using models like RandomForestRegressor (see the sketch after this list):
feature_importances = grid_search.best_estimator_.feature_importances_
• Investigate which features contribute most to predictions and consider dropping less useful ones.
• Analyze errors to identify patterns and opportunities for improving the model (e.g., adjusting
features, handling outliers).
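
To pair each importance with a readable attribute name, the full attribute list can be rebuilt from the pipeline (a sketch following the book's notebook; the extra attribute names match those added by the CombinedAttributesAdder sketch earlier):
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)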

5. Evaluating on the Test Set


• Once the model is fine-tuned, evaluate it on the test set:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
• For a more precise performance estimate, compute the confidence interval of the generalization
error:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

6. Final Considerations
• The model's performance on the test set can be slightly worse than the cross-validation scores due
to overfitting to the validation set.
• Avoid tweaking the model solely to improve test set performance, as this may not generalize well
to new data.

Launch, Monitor, and Maintain Your System


Once approved for launch, focus on the following steps:
1. Production Readiness: Integrate production input data and write tests to ensure system
functionality.
2. Monitoring: Track live performance regularly to detect failures or performance degradation, as
models may "rot" over time without retraining.
3. Human Evaluation: Use human analysis (e.g., experts or crowdsourcing platforms) to evaluate
predictions periodically.
4. Input Data Quality: Monitor input data to identify issues like malfunctioning sensors or outdated
sources, especially for online learning systems.
5. Regular Retraining: Automate model retraining with fresh data to prevent performance
fluctuations. For online systems, save snapshots of the model to allow easy rollbacks.
These steps ensure the system remains effective and adaptable over time.


Classification
MNIST
The MNIST dataset contains 70,000 handwritten digit images, each 28x28 pixels, labeled with the digit
they represent. It is widely used for evaluating classification algorithms and is often referred to as the
"Hello World" of Machine Learning.
• Data Structure:
o Features (X): 70,000 images, each with 784 features (one for each pixel).
o Labels (y): 70,000 labels, where each label is the digit (0-9) the image represents.
To load the dataset:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)  # on recent Scikit-Learn, pass as_frame=False to get NumPy arrays
X, y = mnist["data"], mnist["target"]
• Splitting the Dataset: The MNIST dataset is pre-split into training (60,000 images) and test sets
(10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Shuffling is already done, ensuring that the data is well mixed for training.
To visualize a sample:
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary", interpolation="nearest")
plt.axis("off")
plt.show()
• Label Conversion: Labels are initially strings; convert them to integers for easier handling:
y = y.astype(np.uint8)
The dataset is essential for developing and testing machine learning models on image classification
tasks.

Fig. A few digits from the MNIST dataset


Training a Binary Classifier


In this step, we’ll train a binary classifier to detect the digit 5. This simplifies the problem into a "5 vs.
not-5" classification.
1. Creating the Target Vectors:
o For training and testing, we create binary target vectors:
y_train_5 = (y_train == 5) # True for 5s, False for other digits
y_test_5 = (y_test == 5)
2. Choosing a Classifier: We use Stochastic Gradient Descent (SGD) because it's efficient for large
datasets and can handle online learning. The SGDClassifier in Scikit-Learn is a good choice.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
o The random_state ensures reproducible results due to the stochastic nature of the
algorithm.
3. Making Predictions: Once trained, the model can predict whether an image represents a 5 or not.
sgd_clf.predict([some_digit]) # Predict for a single instance
# Output: array([ True]) # Predicts "True" if it is a 5
This setup creates a binary classifier to detect the digit 5, which we can later evaluate for performance.

Performance Measures
Evaluating classifiers is more complex than evaluating regressors. There are several performance
metrics to assess classifier performance.

Measuring Accuracy Using Cross-Validation


Cross-Validation: A useful method for evaluating models. You can implement it manually with
StratifiedKFold for more control.
Manual Cross-Validation:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # shuffle=True is required when setting random_state

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # accuracy for this fold
Using cross_val_score:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Accuracy can be misleading with skewed datasets. For instance, a classifier that always predicts
"not-5" can achieve over 90% accuracy due to the imbalance in class distribution, highlighting why
accuracy is not always the best performance measure.

Confusion Matrix
The confusion matrix evaluates classifier performance by showing how often instances of one class are
misclassified as another. It reveals errors like mistaking one digit (e.g., 5) for another (e.g., 3).
Steps:
1. Use cross_val_predict() for K-fold predictions:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
2. Compute the confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
Interpretation:
• True Negatives (TN): Correctly predicted non-5 images.
• False Positives (FP): Non-5 images incorrectly predicted as 5.
• False Negatives (FN): 5 images incorrectly predicted as non-5.
• True Positives (TP): Correctly predicted 5 images.
Example output:
[[53057, 1522],
 [1325, 4096]]
A perfect classifier:
[[54579, 0],
 [0, 5421]]


Metrics:
• Precision measures predicted positives that are true:
Precision=TP / (TP+FP)
• Recall measures true positives detected:
Recall= TP / (TP+FN)
The confusion matrix provides detailed classifier performance, while precision and recall focus on
specific evaluation metrics.

Fig. An illustrated confusion matrix

Precision and Recall


Precision measures the accuracy of positive predictions:
Precision=TP / (TP+FP)
Recall measures the classifier’s ability to detect actual positives:
Recall= TP / (TP+FN)
Example:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # ~ 0.729
recall_score(y_train_5, y_train_pred) # ~ 0.756
F1 Score combines precision and recall as the harmonic mean:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Example:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred) # ~ 0.742
Precision/Recall Tradeoff:
• High Precision, Low Recall: Fewer false positives, but miss some true positives (e.g., safe video
detection).
• High Recall, Low Precision: More true positives detected, but includes false positives (e.g.,
shoplifter detection).
Improving one usually reduces the other.


Precision/Recall Tradeoff

Fig. Decision threshold and precision/recall tradeoff


The precision/recall tradeoff occurs because increasing one often decreases the other. The decision
threshold determines the classification: raising it improves precision (fewer false positives) but
reduces recall (more false negatives), and vice versa.
You can control the threshold by using the classifier's decision_function(), which returns a score for each instance:
y_scores = sgd_clf.decision_function([some_digit])
y_some_digit_pred = (y_scores > threshold)
To analyze performance across all thresholds, first obtain decision scores for every training instance, then use precision_recall_curve():
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
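
The threshold plot shown in the figure below can be produced with a small helper (a sketch following the book's figure code):
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center right")

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()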

Fig. Precision and recall versus the decision threshold


You can plot Precision vs. Threshold and Precision vs. Recall to visualize the tradeoff. For example, to
achieve 90% precision, find the threshold:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)] # ~7816
y_train_pred_90 = (y_scores >= threshold_90_precision)

Fig. Precision versus recall


This results in 90% precision, but recall decreases, showing the need to balance both metrics
depending on the project's goals.

The ROC Curve


The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) (recall) against
the False Positive Rate (FPR). It helps evaluate binary classifiers, with a good classifier staying far from
the diagonal (random classifier) and closer to the top-left corner.
To plot the ROC curve, calculate the FPR and TPR for various thresholds using roc_curve():
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Fig. ROC curve


A classifier with an AUC of 1 is perfect, while a random classifier has an AUC of 0.5. You can compute
AUC with roc_auc_score():
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
The ROC curve is useful when the positive class is not rare, while the Precision-Recall curve is better
when the positive class is rare or false positives are more important.
To compare classifiers, such as SGDClassifier and RandomForestClassifier, plot their ROC curves. The
RandomForestClassifier typically has a better AUC. For this classifier, use predict_proba() to get the
probability for the positive class:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
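
The plot_roc_curve() used below is a small plotting helper, not a Scikit-Learn function; a minimal sketch following the book:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")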
Plot both classifiers' ROC curves to compare their performance:


plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

Fig. Comparing ROC curves

Multiclass Classification
Multiclass classifiers distinguish between more than two classes. Some algorithms like Random Forest
or Naive Bayes directly handle multiple classes, while others (e.g., SVMs or Linear classifiers) are
binary. For binary classifiers, there are two main strategies:
1. One-vs-All (OvA): Train one classifier per class, selecting the class with the highest score.
2. One-vs-One (OvO): Train a classifier for each pair of classes, requiring N(N-1)/2 classifiers.
Scikit-Learn defaults to OvA for most classifiers, except for SVMs which use OvO. For example, the
SGDClassifier automatically trains 10 binary classifiers for digits 0-9:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
To see decision scores:
some_digit_scores = sgd_clf.decision_function([some_digit])
You can also force OvA or OvO using OneVsOneClassifier or OneVsRestClassifier.
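For example, to force OvO with the SGDClassifier (a sketch following the book):
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])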
RandomForestClassifier can directly handle multiclass classification without OvA/OvO:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])
For evaluation, use cross-validation:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Scaling input features with StandardScaler improves accuracy:
scaler = StandardScaler()


X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

Error Analysis
To improve a model, you can analyze the errors it makes by using the confusion matrix.
1. Generate the confusion matrix using cross_val_predict() and confusion_matrix():
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
2. Visualize the matrix with Matplotlib:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
3. Normalize the confusion matrix to focus on error rates by dividing each value by the total number
of instances per class:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0) # Focus on errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
4. Analyze misclassifications by visualizing examples of misclassified digits:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

5. Improve the model: For digits like 3 and 5, which are often confused, preprocess the images to
ensure they are well-centered and not rotated, reducing such errors.


Multilabel Classification
In multilabel classification, each instance can be assigned multiple labels. For example, a
face-recognition system might output multiple tags for a single image.
Example:
For digit classification, we can assign two labels to each digit:
1. Whether the digit is large (7, 8, or 9).
2. Whether the digit is odd.
The KNN classifier can be trained as follows:
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
When predicting, it returns multiple labels for an image, like [False, True], indicating the digit is not
large but odd.
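For example, for the first digit (a 5, which is odd but not large):
knn_clf.predict([some_digit])  # array([[False, True]])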
Evaluation:
To evaluate a multilabel classifier, compute the F1 score for each label and average them:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
You can use weighted averaging by setting average="weighted" to account for class imbalances.

Multioutput Classification
Multioutput classification is an extension of multilabel classification, where each label can have
multiple values (multiclass). It involves predicting multiple outputs, each potentially belonging to
different classes.
Example:
In a denoising system for digit images, the noisy image is the input, and the clean image (same as the
original MNIST image) is the output. Each pixel in the output is a label with a value range from 0 to
255.
Process:
1. Add noise to training and test data:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train  # target is the clean image
y_test_mod = X_test

2. Train a classifier (e.g., KNN) to predict clean images:


knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])  # some_index: any test-set index
plot_digit(clean_digit)  # plot_digit is a small display helper defined in the book's notebook

This approach can be used for tasks involving both class labels and value labels, blending classification
and regression.
