ISMLA_Module5
Intelligent Systems and Machine Learning Algorithms (BEC515A) Prepared by: Chetan B V, Asst. Prof., ECE, GMIT, Davangere
Problem Framing
• Business Goal: Predict district median housing prices for investment decisions.
• Current Solution: Experts manually estimate prices, but it’s costly and often inaccurate (off by
20%).
• Machine Learning Task: This is a supervised learning regression problem, specifically multiple
regression, as you’re predicting a continuous value (median housing price) from multiple features.
Performance Measure
• RMSE: Measures prediction accuracy, penalizing large errors. It is given by:

RMSE(X, h) = sqrt( (1/m) * Σ_{i=1..m} ( h(x^(i)) - y^(i) )^2 )

where:
m is the number of instances, h(x^(i)) is the predicted value for the i-th instance, and y^(i) is its actual value.
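As a quick check of the formula, RMSE can be computed directly with NumPy (the values below are purely illustrative, not from the housing data):

import numpy as np

# Toy example: actual vs. predicted district prices (illustrative numbers)
y_actual = np.array([200000., 310000., 150000., 425000.])
y_pred = np.array([210000., 295000., 160000., 400000.])

rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
print(rmse)   # square root of the mean squared prediction error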
Assumptions Check
• Confirmed that the downstream system needs actual prices, not categories, validating the
regression approach.
4. Check Installation
Verify libraries by running:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
No output or errors means successful installation.
import os

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")   # local directory for the extracted data
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
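A minimal sketch of the download and loading helpers built on these constants (using urllib, tarfile, and pandas from the usual stack; the original notes may define them slightly differently):

import tarfile
import urllib.request
import pandas as pd

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory, download the archive, and extract it
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    # Load the extracted CSV file into a pandas DataFrame
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))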
Stratified Sampling
For small datasets, stratified sampling ensures the test set is representative. For example, categorize
median income and use StratifiedShuffleSplit:
from sklearn.model_selection import StratifiedShuffleSplit
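A minimal sketch of the idea (assuming the housing DataFrame from the loading step; the income bin edges are illustrative):

import numpy as np
import pandas as pd

# Bucket median income into a few categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]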
Final Cleanup
Remove the income_cat attribute to restore the dataset:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
Conclusion
Setting aside a test set early prevents bias, and stratified sampling ensures the test set accurately
represents key features, especially in smaller datasets. This is crucial for unbiased model evaluation.
Scatter Matrix
Use a scatter matrix to visualize relationships between numerical features:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Combined attributes such as rooms per household or bedrooms per room typically show stronger correlations with the median house value than the raw counts they are derived from.
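A minimal sketch of how such combined attributes might be created before re-running the correlation check (column names assumed from the California housing dataset):

# Derive ratio attributes from the raw counts
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Recompute correlations against the target
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))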
Conclusion
Exploration through visualizations, correlation analysis, and attribute combinations provides valuable
insights into feature relationships, helping you better prepare the data for machine learning.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),      # fill missing values with the median
    ('attribs_adder', CombinedAttributesAdder()),       # custom transformer adding ratio attributes
    ('std_scaler', StandardScaler()),                   # standardize the numerical features
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
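The pipeline assumes a custom transformer named CombinedAttributesAdder. A minimal sketch of such a transformer (the column positions are assumptions based on the order of the housing features):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of the relevant columns in the numerical feature matrix
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self   # nothing to learn

    def transform(self, X):
        # Append the ratio attributes as extra columns
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]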
print("Labels:", list(some_labels))
After predictions, evaluate the model's performance using the Root Mean Squared Error (RMSE):
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)   # training-set RMSE of the linear model
2. Randomized Search
• For larger hyperparameter spaces, RandomizedSearchCV is faster than an exhaustive grid search. It samples random combinations of hyperparameters, exploring a broader search space in fewer iterations (see the sketch after this list).
• Benefits:
o Explores more hyperparameter combinations.
o Provides better control over the search budget (e.g., running for 1,000 iterations).
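A minimal sketch of a randomized search over a random forest regressor (the parameter distributions and names are illustrative assumptions, not taken from the original notes):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)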
3. Ensemble Methods
• Combine the best models to improve performance. For example, Random Forests combine
multiple decision trees, leading to better accuracy than a single tree.
• Ensembles typically perform better when individual models make different types of errors.
6. Final Considerations
• The model's performance on the test set can be slightly worse than the cross-validation scores due
to overfitting to the validation set.
• Avoid tweaking the model solely to improve test-set performance, as such tweaks are unlikely to generalize to new data; a sketch of a typical final test-set evaluation follows.
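A minimal sketch of that final evaluation, assuming final_model is the best estimator found by the search and full_pipeline is the complete data-preparation pipeline (both names are assumptions here):

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = rnd_search.best_estimator_             # assumed: best model from the search

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)    # transform only; never fit on the test set
final_predictions = final_model.predict(X_test_prepared)

final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)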
Classification
MNIST
The MNIST dataset contains 70,000 handwritten digit images, each 28x28 pixels, labeled with the digit
they represent. It is widely used for evaluating classification algorithms and is often referred to as the
"Hello World" of Machine Learning.
• Data Structure:
o Features (X): 70,000 images, each with 784 features (one for each pixel).
o Labels (y): 70,000 labels, where each label is the digit (0-9) the image represents.
To load the dataset and separate features from labels:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)   # as_frame=False keeps X and y as NumPy arrays
X, y = mnist["data"], mnist["target"]
• Splitting the Dataset: The MNIST dataset is pre-split into training (60,000 images) and test sets
(10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Shuffling is already done, ensuring that the data is well mixed for training.
To visualize a sample:
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary", interpolation="nearest")
plt.axis("off")
plt.show()
• Label Conversion: Labels are initially strings; convert them to integers for easier handling:
import numpy as np
y = y.astype(np.uint8)
The dataset is essential for developing and testing machine learning models on image classification
tasks.
Performance Measures
Evaluating classifiers is more complex than evaluating regressors. There are several performance
metrics to assess classifier performance.
The following implements 3-fold cross-validation manually (the binary classifier sgd_clf and the binary target y_train_5 = (y_train == 5) are assumed to have been defined earlier):
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)                  # fresh copy of the classifier for each fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))              # accuracy on this fold
Using cross_val_score:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Accuracy can be misleading with skewed datasets. For instance, a classifier that always predicts "not-
5" can achieve over 90% accuracy due to the imbalance in class distribution, highlighting why accuracy
is not always the best performance measure.
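To see this concretely, a baseline that never predicts "5" can be evaluated with scikit-learn's DummyClassifier (this baseline is an illustration, not part of the original notes):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predicts the most frequent class ("not-5")
never_5_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# Accuracy is high only because roughly 90% of the images are not 5s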
Confusion Matrix
The confusion matrix evaluates classifier performance by showing how often instances of one class are
misclassified as another. It reveals errors like mistaking one digit (e.g., 5) for another (e.g., 3).
Steps:
1. Use cross_val_predict() for K-fold predictions:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
2. Compute the confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
Interpretation:
• True Negatives (TN): Correctly predicted non-5 images.
• False Positives (FP): Non-5 images incorrectly predicted as 5.
• False Negatives (FN): 5 images incorrectly predicted as non-5.
• True Positives (TP): Correctly predicted 5 images.
Example output:
[[53057, 1522], [1325, 4096]]
A perfect classifier:
[[54579, 0], [0, 5421]]
Metrics:
• Precision is the fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
• Recall is the fraction of actual positives that the classifier detects:
Recall = TP / (TP + FN)
The confusion matrix gives a detailed picture of classifier performance, while precision and recall summarize it in single numbers. The F1 score combines the two into one metric, their harmonic mean:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Example:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)   # ~ 0.742
Precision/Recall Tradeoff:
• High Precision, Low Recall: Fewer false positives, but miss some true positives (e.g., safe video
detection).
• High Recall, Low Precision: More true positives detected, but includes false positives (e.g.,
shoplifter detection).
Improving one usually reduces the other.
Precision/Recall Tradeoff
Raising the classifier's decision threshold can push precision up to, say, 90%, but recall drops as a result, showing the need to balance both metrics depending on the project's goals.
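A minimal sketch of how such a threshold might be chosen from cross-validated decision scores (variable names follow the earlier snippets; the 90% target is just an example):

import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Decision scores instead of hard predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Lowest threshold that reaches at least 90% precision
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90_precision)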
Multiclass Classification
Multiclass classifiers distinguish between more than two classes. Some algorithms like Random Forest
or Naive Bayes directly handle multiple classes, while others (e.g., SVMs or Linear classifiers) are
binary. For binary classifiers, there are two main strategies:
1. One-vs-All (OvA): Train one classifier per class, selecting the class with the highest score.
2. One-vs-One (OvO): Train a classifier for each pair of classes, requiring N(N-1)/2 classifiers.
Scikit-Learn defaults to OvA for most classifiers, except for SVMs which use OvO. For example, the
SGDClassifier automatically trains 10 binary classifiers for digits 0-9:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
To see decision scores:
some_digit_scores = sgd_clf.decision_function([some_digit])
You can also force a particular strategy by wrapping a binary classifier in OneVsRestClassifier (OvA) or OneVsOneClassifier (OvO).
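For example, a sketch of forcing the OvA strategy with OneVsRestClassifier (illustrative, not from the original notes):

from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X_train, y_train)
ovr_clf.predict([some_digit])
len(ovr_clf.estimators_)   # one underlying binary classifier per class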
RandomForestClassifier can directly handle multiclass classification without OvA/OvO:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
forest_clf.predict_proba([some_digit])
For evaluation, use cross-validation:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Scaling input features with StandardScaler improves accuracy:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Error Analysis
To improve a model, you can analyze the errors it makes by using the confusion matrix.
1. Generate the confusion matrix using cross_val_predict() and confusion_matrix():
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
2. Visualize the matrix with Matplotlib:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
3. Normalize the confusion matrix to focus on error rates by dividing each value by the total number
of instances per class:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0) # Focus on errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
4. Analyze misclassifications by visualizing examples of misclassified digits:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
5. Improve the model: For digits like 3 and 5, which are often confused, preprocess the images to
ensure they are well-centered and not rotated, reducing such errors.
Multilabel Classification
In multilabel classification, each instance can be assigned multiple labels. For example, a face-
recognition system might output multiple tags for a single image.
Example:
For digit classification, we can assign two labels to each digit:
1. Whether the digit is large (7, 8, or 9).
2. Whether the digit is odd.
The KNN classifier can be trained as follows:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)        # label 1: is the digit "large" (7, 8, or 9)?
y_train_odd = (y_train % 2 == 1)      # label 2: is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
When predicting, it returns multiple labels for an image, like [False, True], indicating the digit is not
large but odd.
Evaluation:
To evaluate a multilabel classifier, compute the F1 score for each label and average them:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
You can use weighted averaging by setting average="weighted" to account for class imbalances.
Multioutput Classification
Multioutput classification is an extension of multilabel classification, where each label can have
multiple values (multiclass). It involves predicting multiple outputs, each potentially belonging to
different classes.
Example:
In a denoising system for digit images, the noisy image is the input, and the clean image (same as the
original MNIST image) is the output. Each pixel in the output is a label with a value range from 0 to
255.
Process:
1. Add noise to training and test data:
import numpy as np
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
y_train_mod = X_train        # the targets are the original, clean images
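A minimal sketch of the rest of the denoising setup (test-set noise, training, and cleaning one image; a hedged reconstruction of the standard approach, not copied from the original notes):

from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Corrupt the test images in the same way
noise_test = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise_test
y_test_mod = X_test                  # clean test images are the targets

# Train a KNN classifier to map noisy pixel vectors to clean ones
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)

# Clean one noisy test image and display it
clean_digit = knn_clf.predict([X_test_mod[0]])
plt.imshow(clean_digit.reshape(28, 28), cmap="binary")
plt.axis("off")
plt.show()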
This approach can be used for tasks involving both class labels and value labels, blending classification
and regression.