Machine Learning Basics & Linear Regression
Introduction to Machine Learning
Definition:
○ Machine Learning (ML) is a branch of artificial intelligence that enables
systems to learn from data and make predictions without being explicitly
programmed.
Types of ML:
○ Supervised Learning – Uses labeled data (e.g., classification, regression).
○ Unsupervised Learning – Works with unlabeled data (e.g., clustering,
dimensionality reduction).
○ Reinforcement Learning – Learning through rewards and penalties (e.g.,
self-driving cars).
Key Components of ML Models:
○ Features (X) – Input variables used for prediction.
○ Target (Y) – The output or label the model aims to predict.
○ Training and Testing Data – Splitting data to train and evaluate the
model.
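The three components above can be sketched without any library. This is a minimal illustration, not how scikit-learn implements train_test_split; the helper name split_train_test and the toy house data are assumptions made up for the example.

```python
import random

def split_train_test(features, targets, test_size=0.2, seed=42):
    """Shuffle row indices, then hold out a fraction of the rows for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Features (X): each row is [size_sqft, rooms]. Target (y): price. Values are made up.
X = [[800, 2], [1200, 3], [1500, 3], [2000, 4], [2400, 5],
     [900, 2], [1100, 3], [1700, 4], [2100, 4], [3000, 5]]
y = [150, 210, 250, 320, 390, 160, 200, 280, 330, 480]

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 8 2
```

Each feature row stays paired with its target because both lists are indexed by the same shuffled positions.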
What is Linear Regression?
○ A supervised learning algorithm used for predicting continuous values.
○ Example: Predicting house prices based on features like size, location, etc.
Mathematical Representation:
○ Y = mX + b (where m is the slope and b is the intercept).
Cost Function & Gradient Descent:
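○ The model minimizes prediction error by optimizing a cost function,
typically Mean Squared Error (MSE).
○ Gradient descent lowers the cost iteratively by nudging m and b in the
direction opposite to the gradient of the MSE.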
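As a rough sketch of how gradient descent can minimize MSE for a single-feature line Y = mX + b, the loop below updates m and b from the partial derivatives of the cost. The function names, learning rate, step count, and toy data are all assumptions chosen for illustration.

```python
def mse(m, b, xs, ys):
    """Mean Squared Error of the line y = m*x + b on the given points."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit m and b by repeatedly stepping against the gradient of the MSE."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))  # dMSE/dm
        grad_b = (2 / n) * sum(errors)                             # dMSE/db
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Toy data generated from y = 2x + 1, so the fit should recover m ≈ 2, b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = gradient_descent(xs, ys)
print(f"m={m:.4f}, b={b:.4f}, MSE={mse(m, b, xs, ys):.6f}")
```

Note that scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent; the loop here only illustrates the optimization idea.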
Overfitting vs. Underfitting:
○ Overfitting – Model fits noise in the training data and generalizes poorly
to new data (low bias, high variance).
○ Underfitting – Model is too simple and fails to capture patterns (high bias,
low variance).
Implementing Linear Regression
Implementation Steps:
● Load the dataset (California housing data).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (California Housing Prices)
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Price'] = data.target
df.head()
● Preprocess the data (select features and target variable).
# Select Features and Target
X = df[['MedInc', 'HouseAge', 'AveRooms']] # Selecting some features
y = df['Price']
● Split data into training and testing sets.
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
● Train a Linear Regression model using Scikit-learn.
model = LinearRegression()
model.fit(X_train, y_train)
● Make predictions on test data.
● Evaluate model performance using RMSE and R² score.
y_pred = model.predict(X_test)
# Performance Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'RMSE: {rmse:.2f}')
print(f'R-squared: {r2:.2f}')
● Visualize predictions using a scatter plot.
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()
Understanding Logistic Regression
● What is Logistic Regression?
○ A supervised learning algorithm used for classification problems (predicts
categorical values).
○ Example: Predicting if an email is spam or not.
● Sigmoid Function & Decision Boundary:
○ The sigmoid function σ(z) = 1 / (1 + e^(-z)) converts any continuous
output z into a probability between 0 and 1.
○ If probability > 0.5 → Class 1, otherwise Class 0.
● Difference from Linear Regression:
○ Linear Regression predicts continuous values, while Logistic Regression
predicts class probabilities that are thresholded into discrete labels.
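The sigmoid and the 0.5 decision rule can be sketched in a few lines. The function names and the sample inputs are assumptions for illustration; z stands for the linear score the model computes before the sigmoid is applied.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1), avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z is negative, so exp(z) cannot overflow
    return exp_z / (1.0 + exp_z)

def predict_class(z, threshold=0.5):
    """Class 1 if the probability is strictly above the threshold, else Class 0."""
    return 1 if sigmoid(z) > threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_class(z))
```

A score of exactly 0 gives a probability of 0.5, which falls into Class 0 under the strict "greater than" rule stated above.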
Implementing Logistic Regression
Implementation Steps:
● Load the dataset (Breast Cancer Wisconsin dataset from scikit-learn).
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
df.head()
● Preprocess data (select features and target).
● Split the dataset into training and testing sets.
# Features and Target
X = df.iloc[:, :-1] # All features except target
y = df['Target']
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
● Train a Logistic Regression model using Scikit-learn.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
● Make predictions and evaluate model performance.
● Use classification metrics such as accuracy, confusion matrix, and
classification report.
y_pred = model.predict(X_test)
# Performance Metrics
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
● Visualize the confusion matrix using a heatmap.
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
Model Evaluation & Comparison
Regression Model (Linear Regression) Evaluation Metrics:
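● Root Mean Squared Error (RMSE): Square root of the average squared
difference between actual and predicted values, in the same units as the
target. Lower is better.
● R² Score: Proportion of the variance in the target that the model explains.
A score of 1.0 is a perfect fit.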
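To make the two regression metrics concrete, here is a small hand computation. The function names and the four actual/predicted values are made up for illustration; on real data, mean_squared_error and r2_score from scikit-learn do the same arithmetic.

```python
import math

def rmse_manual(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

print(f"RMSE: {rmse_manual(actual, predicted):.3f}")     # RMSE: 0.354
print(f"R-squared: {r2_manual(actual, predicted):.3f}")  # R-squared: 0.975
```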
Classification Model (Logistic Regression) Evaluation Metrics:
● Accuracy Score: Measures overall correctness of predictions.
● Confusion Matrix: Shows True Positives (TP), False Positives (FP), True
Negatives (TN), and False Negatives (FN).
● Precision, Recall, and F1 Score: Useful for evaluating models on imbalanced
datasets, where accuracy alone can be misleading.
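The classification metrics all derive from the four confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive class; the function names and the imbalanced toy labels are invented for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1; ratios with a zero denominator are 0.0."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Imbalanced toy labels: 8 negatives, 2 positives; the model misses one positive.
true_labels = [0] * 8 + [1, 1]
pred_labels = [0] * 8 + [1, 0]

tp, fp, tn, fn = confusion_counts(true_labels, pred_labels)
accuracy, precision, recall, f1 = classification_metrics(tp, fp, tn, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy is 0.90 here even though half of the positive cases are missed (recall 0.50), which is why precision, recall, and F1 matter on imbalanced data.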
Hyperparameter Tuning
Optimizing Logistic Regression using GridSearchCV:
● GridSearchCV tries every candidate hyperparameter value with
cross-validation and keeps the best one (here C, the inverse regularization
strength in Logistic Regression).
● Example:
○ Try different values of C = [0.1, 1, 10, 100].
○ Select the best performing model.
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best Parameters: {grid.best_params_}")
# best_score_ is the mean cross-validated accuracy for the best C
print(f"Best CV Accuracy: {grid.best_score_:.2f}")
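The idea behind a cross-validated grid search can be sketched without scikit-learn. To keep the example short, it swaps in a much simpler model than Logistic Regression: a one-feature ridge regression through the origin, scored by validation MSE instead of accuracy. The function names, the toy data, and the contiguous (unshuffled, unstratified) folds are all simplifying assumptions.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope for 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=5):
    """Mean validation MSE over k contiguous folds for one candidate lam."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        val_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in val_idx]
        train_y = [y for i, y in enumerate(ys) if i not in val_idx]
        slope = fit_ridge_slope(train_x, train_y, lam)  # fit on the other folds
        errors = [(slope * xs[i] - ys[i]) ** 2 for i in val_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k

def grid_search(xs, ys, candidates, k=5):
    """Score every candidate with k-fold CV and return the one with lowest MSE."""
    scores = {lam: cv_mse(xs, ys, lam, k) for lam in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Toy data: roughly y = 3x with small fixed offsets standing in for noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
offsets = [0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.0, 0.1]
ys = [3.0 * x + o for x, o in zip(xs, offsets)]

# Regularization strength lam plays the role of 1 / C from the grid above.
lam_grid = [1 / c for c in (0.1, 1, 10, 100)]
best_lam, cv_scores = grid_search(xs, ys, lam_grid)
print(f"Best lam: {best_lam}, CV MSE: {cv_scores[best_lam]:.4f}")
```

Every candidate is trained and scored k times, so the cost grows with the grid size times the number of folds.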