ISMLA_Module5

Intelligent Systems and Machine Learning Algorithms (BEC515A)
Prepared by: Chetan B V, Asst. Prof., ECE, GMIT, Davangere

End-to-end Machine Learning Project


Working with Real Data
Working with real-world data is essential for learning machine learning. There are many open datasets
available through:
1. Popular Repositories:
o UC Irvine ML Repository: A collection of datasets for research and teaching.
o Kaggle Datasets: Offers datasets for various domains, often with code notebooks.
o AWS Datasets: Large datasets for big data and machine learning.
2. Meta Portals:
o dataportals.org: Lists global open data portals.
o opendatamonitor.eu: Focuses on European datasets.
o Quandl: Provides economic, financial, and alternative datasets.
3. Other Resources:
o Wikipedia’s List of ML Datasets: Comprehensive dataset list.
o Quora & Datasets Subreddit: Communities sharing datasets.
The California Housing Prices dataset is based on the 1990 U.S. Census data and is used for predictive
modeling, helping to estimate housing prices based on factors like demographics and geography. For
teaching, the dataset is often modified by adding categorical attributes and removing some features,
which helps illustrate key concepts in data manipulation and machine learning.

Fig. California housing prices


Look at the Big Picture


You're tasked with predicting California housing prices using census data, such as population and
median income, to aid investment decisions. This prediction will be part of a larger pipeline affecting
revenue.

Problem Framing
• Business Goal: Predict district median housing prices for investment decisions.
• Current Solution: Experts manually estimate prices, but it’s costly and often inaccurate (off by
20%).
• Machine Learning Task: This is a supervised learning regression problem, specifically multiple
regression, as you’re predicting a continuous value (median housing price) from multiple features.

Fig. A Machine Learning pipeline for real estate investments

Data Pipeline and Learning Type


• Pipeline: Data flows through multiple components, each processing and passing data
asynchronously, making the system scalable and robust.
• Learning Type: Batch learning is suitable due to manageable data size and stability, but online
learning could be used for larger, dynamic datasets.

Performance Measure
• RMSE: Measures prediction accuracy, penalizing large errors, and is given by:
RMSE(X, h) = sqrt( (1/m) * Σ (h(x_i) - y_i)^2 ), with the sum running over all m instances,
where:
h(x_i) is the predicted value for the i-th instance, and y_i is the actual value.


• MAE: An alternative that is less sensitive to outliers than RMSE.
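
Both metrics are available in Scikit-Learn. A minimal sketch with made-up numbers (the values below are purely illustrative):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_true = np.array([200000., 350000., 150000.])      # hypothetical actual prices
y_pred = np.array([210000., 330000., 180000.])      # hypothetical predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # less sensitive to outliers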

Assumptions Check
• Confirmed that the downstream system needs actual prices, not categories, validating the
regression approach.

Get the Data


The full Jupyter notebook is available at https://github.com/ageron/handson-ml2.

Create the Workspace


To set up your machine learning workspace:
1. Install Python
Check if Python is installed by running:
$ python3 --version
If not, install it from python.org.

2. Create Workspace Directory


Create a directory for your projects:
$ export ML_PATH="$HOME/ml"
$ mkdir -p $ML_PATH

3. Install Required Libraries


Ensure pip is installed and up to date:
$ python3 -m pip --version
$ python3 -m pip install --user -U pip
Install the required libraries:
$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn

4. Check Installation
Verify libraries by running:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
No output or errors means successful installation.


5. Launch Jupyter Notebook


Start Jupyter by running:
$ jupyter notebook
This will open Jupyter at http://localhost:8888/.

6. Create and Rename Notebook


In Jupyter, click New, select Python, and rename the notebook (e.g., to "Housing").

7. Test Your Notebook


Type the following code in the first cell:
print("Hello world!")
Press Shift + Enter to execute and see the output below the cell.

Now you're ready to start coding in your machine learning workspace!

Download the Data


The process of downloading, extracting, and loading the housing data can be automated with a Python
script. Below is a complete version of the solution you can use. The script will download the
compressed housing.tgz file, extract its contents, and then load the housing.csv file into a Pandas
DataFrame.

1. Fetching the Data:


The fetch_housing_data function downloads and extracts housing.tgz:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if it does not already exist
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # download the archive and extract its contents
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    print("Data downloaded and extracted.")

2. Loading the Data:


The load_housing_data function loads the extracted CSV into a Pandas DataFrame:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

3. Putting It All Together:


Call the functions to download, extract, and load the data:
fetch_housing_data()
housing_data = load_housing_data()
print(housing_data.head())
This script automates the process of fetching and loading the housing dataset into a Pandas
DataFrame.

Take a Quick Look at the Data Structure


The dataset contains 20,640 rows, each representing a district with 10 attributes, including numerical
features (longitude, latitude, median house value) and one categorical feature (ocean_proximity).
Attributes: There are 10 attributes (columns) in the dataset:
• longitude: Longitude of the district.
• latitude: Latitude of the district.
• housing_median_age: The median age of houses in the district.
• total_rooms: The total number of rooms in the district.
• total_bedrooms: The total number of bedrooms in the district.
• population: The population of the district.
• households: The number of households in the district.
• median_income: The median income of the district's residents.
• median_house_value: The median value of homes in the district (target variable).


• ocean_proximity: The proximity to the ocean, which is a categorical variable.

Key steps in the data exploration include:


1. Missing Data: The total_bedrooms attribute has 207 missing values.
2. Data Types: All attributes are numerical, except ocean_proximity, which is categorical with five
values: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY, and ISLAND.
3. Statistical Summary: The describe() method provides key metrics (count, mean, min, max, std, and
percentiles). For example, 25% of districts have a median housing age lower than 18.
4. Histograms: Plotted for numerical features, showing skewed distributions. Notably:
o Median Income: Scaled and capped, representing about $30,000 for values near 3.
o Median House Value: Capped at $500,000, potentially impacting model predictions.
o Distributions: Many features are tail-heavy, which may require transformation for better
model performance.
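
These inspection steps correspond to a few one-liners (a sketch following the book's notebook):
import matplotlib.pyplot as plt
housing = load_housing_data()
housing.info()                             # column dtypes and missing-value counts
housing["ocean_proximity"].value_counts()  # frequency of each category
housing.describe()                         # statistical summary of numerical attributes
housing.hist(bins=50, figsize=(20, 15))    # histograms of all numerical attributes
plt.show()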

Create a Test Set


To prevent data snooping bias (where test set patterns influence model selection), it’s important to
set aside a test set early. Typically, 20% of the data is used as the test set.

Basic Test Set Creation


You can randomly split the dataset using a function like split_train_test(). To ensure consistent test
sets, set a random seed or use a unique identifier (e.g., row index) for splitting:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
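
For example, to hold out 20% of the data:
train_set, test_set = split_train_test(housing, 0.2)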

Alternatively, use Scikit-Learn’s train_test_split() with a random_state for reproducibility:


from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)


Stratified Sampling
For small datasets, stratified sampling ensures the test set is representative. For example, bucket
median income into an income_cat attribute (see the sketch below) and use StratifiedShuffleSplit.
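The income_cat attribute is not created by the code in this section; a minimal sketch using pd.cut (the bin edges follow the book's example and are an assumption here):
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
With income_cat in place, perform the stratified split: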
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Check the income proportions in the test set:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

Final Cleanup
Remove the income_cat attribute to restore the dataset:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Conclusion
Setting aside a test set early prevents bias, and stratified sampling ensures the test set accurately
represents key features, especially in smaller datasets. This is crucial for unbiased model evaluation.

Discover and Visualize the Data to Gain Insights


After setting aside the test set, explore the training set to uncover patterns. If the dataset is large,
sample a subset; otherwise, work with the full set. Start by creating a copy for experimentation:
housing = strat_train_set.copy()

Visualizing Geographical Data


Use scatterplots to visualize latitude and longitude. Adjusting the transparency (alpha=0.1) helps
highlight high-density areas like the Bay Area, Los Angeles, and Central Valley.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


Visualizing Housing Prices


Size the points based on population and color them by housing prices using the jet color map to
visualize price distribution:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10, 7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
This reveals that housing prices are influenced by location and population density.

Correlations Between Features


Compute the correlation matrix to examine relationships between features, particularly how they
correlate with median house value:
corr_matrix = housing.corr()  # on pandas >= 2.0, use housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
Median income has the highest positive correlation with housing prices.

Scatter Matrix
Use a scatter matrix to visualize relationships between numerical features:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Exploring Median Income vs. Median House Value


A scatter plot between median income and median house value shows a strong correlation and
distinct price caps at certain values:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)

Experimenting with Attribute Combinations


Create new attributes like rooms per household and bedrooms per room for better insights:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Recompute the correlation matrix to assess the impact of these new features:


corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
These new features show stronger correlations with median house value.

Conclusion
Exploration through visualizations, correlation analysis, and attribute combinations provides valuable
insights into feature relationships, helping you better prepare the data for machine learning.

Prepare the Data for Machine Learning Algorithms


Preparing data for machine learning involves several steps that ensure your model receives clean and
well-processed data. Writing reusable functions for these steps is essential for reproducibility,
efficiency, and consistency. These functions help streamline the data preparation process for new
datasets and allow for easy experimentation.
Step 1: Separate Predictors and Labels
Start by splitting the dataset into predictors (features) and labels (target values):
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Step 2: Handle Missing Values


Machine learning algorithms require clean data, so missing values must be addressed. There are
several ways to handle missing data:
1. Remove rows with missing values:
housing.dropna(subset=["total_bedrooms"])

2. Drop the column:


housing.drop("total_bedrooms", axis=1)

3. Replace missing values with the median:


median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
For more automated handling, use Scikit-learn’s SimpleImputer to fill missing values with the median
of the column:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # exclude the categorical column
imputer.fit(housing_num)
housing_num = pd.DataFrame(imputer.transform(housing_num),
                           columns=housing_num.columns)

Step 3: Handle Categorical Attributes


Categorical attributes like ocean_proximity need to be converted into numerical values:
1. Ordinal Encoding: Use OrdinalEncoder to assign integer values to categories:
from sklearn.preprocessing import OrdinalEncoder
housing_cat = housing[["ocean_proximity"]]  # the categorical column as a DataFrame
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
2. One-Hot Encoding: Convert categorical values into binary columns (one per category) using
OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
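Note that OneHotEncoder returns a SciPy sparse matrix by default; convert it when a dense array is needed:
housing_cat_1hot.toarray()  # dense NumPy array, one binary column per category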

Step 4: Feature Scaling


Most machine learning models perform better when the features are scaled. Common scaling methods
are:
1. Min-Max Scaling: Rescale features to a [0, 1] range using MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
housing_scaled = scaler.fit_transform(housing_num)
2. Standardization: Scale features to have a zero mean and unit variance using StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
housing_standardized = scaler.fit_transform(housing_num)

Step 5: Combine Transformers Using Pipelines


A pipeline automates the application of multiple transformations to the data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
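
The CombinedAttributesAdder used below is a custom transformer, not a Scikit-Learn class. A minimal sketch consistent with the attribute combinations created earlier (the column indices assume the housing_num column order and follow the book's implementation):
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# column indices in the NumPy array produced from housing_num
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]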


num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

Step 6: Apply All Transformations Using ColumnTransformer


To apply different transformations to different columns, use ColumnTransformer:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)   # names of the numerical columns
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
This will apply numerical transformations (imputation, attribute addition, scaling) to numerical
columns and one-hot encoding to categorical columns.
By organizing these transformations into functions and pipelines, you ensure that your data is
preprocessed correctly and consistently for machine learning models.

Select and Train a Model


After completing the data preparation steps, the next phase is to select and train a machine learning
model. Let’s go through the process step by step.
1. Training a Linear Regression Model
The first model to try is a Linear Regression. This model works well for regression tasks where the
relationship between features and labels is linear. Here’s how to train the model:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Once the model is trained, you can test it on some data from the training set:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))


print("Labels:", list(some_labels))
After predictions, evaluate the model's performance using the Root Mean Squared Error (RMSE):
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

2. Underfitting and Choosing a Better Model


A model with a high RMSE (e.g., $68,628 in this case) can be a sign of underfitting. This may occur if
the model is too simple or the features are not informative enough. In this case, trying a more complex
model could help.

3. Training a Decision Tree Model


To capture more complex relationships in the data, a Decision Tree Regressor can be used:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
The Decision Tree model may achieve perfect results on the training set (RMSE = 0), but this is often a
sign of overfitting. Overfitting happens when the model learns to perform well on training data but
generalizes poorly to new data.

4. Better Evaluation Using Cross-Validation


Instead of evaluating the model on a single validation set, K-fold cross-validation provides a more
robust evaluation by splitting the data into 10 distinct folds and training the model multiple times:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
This cross-validation approach provides not only the performance score but also the standard
deviation of the scores, which helps understand the model’s stability.
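
A small helper (following the book's notebook) prints the scores along with their mean and standard deviation:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)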

5. Evaluating the Linear Regression Model


To compare the Decision Tree model with the Linear Regression model, you can run cross-validation
on the Linear Regression model as well:


lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

6. Random Forest Regressor


The Random Forest Regressor is an ensemble method that builds multiple Decision Trees and averages
their predictions. It often performs better than individual Decision Trees:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

7. Overfitting in Random Forest


Even though Random Forest typically performs better than individual Decision Trees, overfitting may
still occur if the model is too complex. Therefore, adjusting hyperparameters, such as the number of
trees or the maximum depth, may be necessary.

8. Saving and Reusing Models


Once a promising model is selected, it’s essential to save it for future use. Scikit-learn models can be
saved using joblib:
import joblib  # sklearn.externals.joblib was removed; import joblib directly
joblib.dump(my_model, "my_model.pkl")
Later, the saved model can be loaded and used again:
my_model_loaded = joblib.load("my_model.pkl")

Fine-Tune Your Model


After narrowing down your model options, fine-tuning them is crucial to improve performance. Here
are a few ways to do this:
1. Grid Search
• GridSearchCV automates hyperparameter optimization by exploring all possible combinations of
hyperparameters using cross-validation.
• Example for tuning a RandomForestRegressor:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
• After training, you can retrieve the best parameters and model:
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_
• This method can take time but helps find the best hyperparameter combination through
exhaustive search.

2. Randomized Search
• For larger hyperparameter spaces, RandomizedSearchCV is faster. It samples random
combinations of hyperparameters, which can explore a broader search space in fewer iterations
(see the sketch after this list).
• Benefits:
o Explores more hyperparameter combinations.
o Provides better control over the search budget (e.g., running for 1,000 iterations).
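
A minimal sketch of RandomizedSearchCV for the same model (forest_reg as defined above; the parameter distributions here are illustrative assumptions):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)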

3. Ensemble Methods
• Combine the best models to improve performance. For example, Random Forests combine
multiple decision trees, leading to better accuracy than a single tree.
• Ensembles typically perform better when individual models make different types of errors.

4. Analyzing Best Models and Their Errors


• Check feature importance using models like RandomForestRegressor (see the sketch after this list):
feature_importances = grid_search.best_estimator_.feature_importances_
• Investigate which features contribute most to predictions and consider dropping less useful ones.
• Analyze errors to identify patterns and opportunities for improving the model (e.g., adjusting
features, handling outliers).
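
To pair each importance with a readable attribute name, the full attribute list can be rebuilt from the pipeline (a sketch following the book's notebook; the extra attribute names match those added by the CombinedAttributesAdder sketch earlier):
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)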

5. Evaluating on the Test Set


• Once the model is fine-tuned, evaluate it on the test set:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
• For a more precise performance estimate, compute the confidence interval of the generalization
error:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

6. Final Considerations
• The model's performance on the test set can be slightly worse than the cross-validation scores due
to overfitting to the validation set.
• Avoid tweaking the model solely to improve test set performance, as this may not generalize well
to new data.

Launch, Monitor, and Maintain Your System


Once approved for launch, focus on the following steps:
1. Production Readiness: Integrate production input data and write tests to ensure system
functionality.
2. Monitoring: Track live performance regularly to detect failures or performance degradation, as
models may "rot" over time without retraining.
3. Human Evaluation: Use human analysis (e.g., experts or crowdsourcing platforms) to evaluate
predictions periodically.
4. Input Data Quality: Monitor input data to identify issues like malfunctioning sensors or outdated
sources, especially for online learning systems.
5. Regular Retraining: Automate model retraining with fresh data to prevent performance
fluctuations. For online systems, save snapshots of the model to allow easy rollbacks.
These steps ensure the system remains effective and adaptable over time.


Classification
MNIST
The MNIST dataset contains 70,000 handwritten digit images, each 28x28 pixels, labeled with the digit
they represent. It is widely used for evaluating classification algorithms and is often referred to as the
"Hello World" of Machine Learning.
• Data Structure:
o Features (X): 70,000 images, each with 784 features (one for each pixel).
o Labels (y): 70,000 labels, where each label is the digit (0-9) the image represents.
To load the dataset:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)  # on recent Scikit-Learn, pass as_frame=False to get NumPy arrays
X, y = mnist["data"], mnist["target"]
• Splitting the Dataset: The MNIST dataset is pre-split into training (60,000 images) and test sets
(10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Shuffling is already done, ensuring that the data is well mixed for training.
To visualize a sample:
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary", interpolation="nearest")
plt.axis("off")
plt.show()
• Label Conversion: Labels are initially strings; convert them to integers for easier handling:
y = y.astype(np.uint8)
The dataset is essential for developing and testing machine learning models on image classification
tasks.

Fig. A few digits from the MNIST dataset


Training a Binary Classifier


In this step, we’ll train a binary classifier to detect the digit 5. This simplifies the problem into a "5 vs.
not-5" classification.
1. Creating the Target Vectors:
o For training and testing, we create binary target vectors:
y_train_5 = (y_train == 5) # True for 5s, False for other digits
y_test_5 = (y_test == 5)
2. Choosing a Classifier: We use Stochastic Gradient Descent (SGD) because it's efficient for large
datasets and can handle online learning. The SGDClassifier in Scikit-Learn is a good choice.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
o The random_state ensures reproducible results due to the stochastic nature of the
algorithm.
3. Making Predictions: Once trained, the model can predict whether an image represents a 5 or not.
sgd_clf.predict([some_digit]) # Predict for a single instance
# Output: array([ True]) # Predicts "True" if it is a 5
This setup creates a binary classifier to detect the digit 5, which we can later evaluate for performance.

Performance Measures
Evaluating classifiers is more complex than evaluating regressors. There are several performance
metrics to assess classifier performance.

Measuring Accuracy Using Cross-Validation


Cross-Validation: A useful method for evaluating models. You can implement it manually with
StratifiedKFold for more control.
Manual Cross-Validation:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # shuffle=True is required when setting random_state

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # accuracy for this fold
Using cross_val_score:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Accuracy can be misleading with skewed datasets. For instance, a classifier that always predicts
"not-5" can achieve over 90% accuracy due to the imbalance in class distribution, highlighting why
accuracy is not always the best performance measure.

Confusion Matrix
The confusion matrix evaluates classifier performance by showing how often instances of one class are
misclassified as another. It reveals errors like mistaking one digit (e.g., 5) for another (e.g., 3).
Steps:
1. Use cross_val_predict() for K-fold predictions:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
2. Compute the confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
Interpretation:
• True Negatives (TN): Correctly predicted non-5 images.
• False Positives (FP): Non-5 images incorrectly predicted as 5.
• False Negatives (FN): 5 images incorrectly predicted as non-5.
• True Positives (TP): Correctly predicted 5 images.
Example output:
[[53057, 1522],
 [1325, 4096]]
A perfect classifier:
[[54579, 0],
 [0, 5421]]


Metrics:
• Precision measures predicted positives that are true:
Precision=TP / (TP+FP)
• Recall measures true positives detected:
Recall= TP / (TP+FN)
The confusion matrix provides detailed classifier performance, while precision and recall focus on
specific evaluation metrics.

Fig. An illustrated confusion matrix

Precision and Recall


Precision measures the accuracy of positive predictions:
Precision=TP / (TP+FP)
Recall measures the classifier’s ability to detect actual positives:
Recall= TP / (TP+FN)
Example:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # ~ 0.729
recall_score(y_train_5, y_train_pred) # ~ 0.756
F1 Score combines precision and recall as the harmonic mean:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Example:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred) # ~ 0.742
Precision/Recall Tradeoff:
• High Precision, Low Recall: Fewer false positives, but miss some true positives (e.g., safe video
detection).
• High Recall, Low Precision: More true positives detected, but includes false positives (e.g.,
shoplifter detection).
Improving one usually reduces the other.


Precision/Recall Tradeoff

Fig. Decision threshold and precision/recall tradeoff


The precision/recall tradeoff occurs because increasing one often decreases the other. The decision
threshold determines the classification: raising it improves precision (fewer false positives) but
reduces recall (more false negatives), and vice versa.
You can control the threshold by using the classifier's decision_function(), which returns a score for each instance:
y_scores = sgd_clf.decision_function([some_digit])
y_some_digit_pred = (y_scores > threshold)
To analyze performance across all thresholds, first obtain decision scores for every training instance, then use precision_recall_curve():
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
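
The threshold plot shown in the figure below can be produced with a small helper (a sketch following the book's figure code):
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center right")

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()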

Fig. Precision and recall versus the decision threshold


You can plot Precision vs. Threshold and Precision vs. Recall to visualize the tradeoff. For example, to
achieve 90% precision, find the threshold:
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)] # ~7816
y_train_pred_90 = (y_scores >= threshold_90_precision)

Fig. Precision versus recall


This results in 90% precision, but recall decreases, showing the need to balance both metrics
depending on the project's goals.

The ROC Curve


The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) (recall) against
the False Positive Rate (FPR). It helps evaluate binary classifiers, with a good classifier staying far from
the diagonal (random classifier) and closer to the top-left corner.
To plot the ROC curve, calculate the FPR and TPR for various thresholds using roc_curve():
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Fig. ROC curve


A classifier with an AUC of 1 is perfect, while a random classifier has an AUC of 0.5. You can compute
AUC with roc_auc_score():
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
The ROC curve is useful when the positive class is not rare, while the Precision-Recall curve is better
when the positive class is rare or false positives are more important.
To compare classifiers, such as SGDClassifier and RandomForestClassifier, plot their ROC curves. The
RandomForestClassifier typically has a better AUC. For this classifier, use predict_proba() to get the
probability for the positive class:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
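
The plot_roc_curve() used below is a small plotting helper, not a Scikit-Learn function; a minimal sketch following the book:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")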
Plot both classifiers' ROC curves to compare their performance:


plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

Fig. Comparing ROC curves

Multiclass Classification
Multiclass classifiers distinguish between more than two classes. Some algorithms like Random Forest
or Naive Bayes directly handle multiple classes, while others (e.g., SVMs or Linear classifiers) are
binary. For binary classifiers, there are two main strategies:
1. One-vs-All (OvA): Train one classifier per class, selecting the class with the highest score.
2. One-vs-One (OvO): Train a classifier for each pair of classes, requiring N(N-1)/2 classifiers.
Scikit-Learn defaults to OvA for most classifiers, except for SVMs which use OvO. For example, the
SGDClassifier automatically trains 10 binary classifiers for digits 0-9:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
To see decision scores:
some_digit_scores = sgd_clf.decision_function([some_digit])
You can also force OvA or OvO using OneVsOneClassifier or OneVsRestClassifier.
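For example, to force OvO with the SGDClassifier (a sketch following the book):
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])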
RandomForestClassifier can directly handle multiclass classification without OvA/OvO:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])
For evaluation, use cross-validation:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Scaling input features with StandardScaler improves accuracy:
scaler = StandardScaler()


X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

Error Analysis
To improve a model, you can analyze the errors it makes by using the confusion matrix.
1. Generate the confusion matrix using cross_val_predict() and confusion_matrix():
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
2. Visualize the matrix with Matplotlib:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
3. Normalize the confusion matrix to focus on error rates by dividing each value by the total number
of instances per class:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0) # Focus on errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
4. Analyze misclassifications by visualizing examples of misclassified digits:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

5. Improve the model: For digits like 3 and 5, which are often confused, preprocess the images to
ensure they are well-centered and not rotated, reducing such errors.


Multilabel Classification
In multilabel classification, each instance can be assigned multiple labels. For example, a
face-recognition system might output multiple tags for a single image.
Example:
For digit classification, we can assign two labels to each digit:
1. Whether the digit is large (7, 8, or 9).
2. Whether the digit is odd.
The KNN classifier can be trained as follows:
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
When predicting, it returns multiple labels for an image, like [False, True], indicating the digit is not
large but odd.
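For example, for the first digit (a 5, which is odd but not large):
knn_clf.predict([some_digit])  # array([[False, True]])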
Evaluation:
To evaluate a multilabel classifier, compute the F1 score for each label and average them:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
You can use weighted averaging by setting average="weighted" to account for class imbalances.

Multioutput Classification
Multioutput classification is an extension of multilabel classification, where each label can have
multiple values (multiclass). It involves predicting multiple outputs, each potentially belonging to
different classes.
Example:
In a denoising system for digit images, the noisy image is the input, and the clean image (same as the
original MNIST image) is the output. Each pixel in the output is a label with a value range from 0 to
255.
Process:
1. Add noise to training and test data:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train  # target is the clean image
y_test_mod = X_test

2. Train a classifier (e.g., KNN) to predict clean images:


knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])  # some_index: any test-set index
plot_digit(clean_digit)  # plot_digit is a small display helper defined in the book's notebook

This approach can be used for tasks involving both class labels and value labels, blending classification
and regression.
