Machine Learning Projects Report
NAME- Puranjay
ROLL NO.- 06119051923
BATCH- AIDS B1
REPORT ABSTRACT
This report presents a consolidated analysis of three distinct
machine learning projects hosted on GitHub. Each project is
treated as a separate case study, with its
own Abstract, Introduction, Dataset
Description, Preprocessing, Methodology, Results and
Analysis, Sample Code Snippets, and Conclusion. The projects
are: (1) Auctioned Car Classification, (2) Medical Insurance
Charges Prediction, and (3) Movie Recommendation Using
Unsupervised Learning. For each, we summarize objectives,
data handling, modeling approach, and findings, and include
relevant visualizations.
Auctioned Car Classification
Abstract
This project uses a Kaggle dataset (“Don’t Get Kicked”
competition) of auctioned used cars to predict whether a car is
a “Kick” (a bad buy) or not. It formulates a binary classification
task on ~73,000 auction records with vehicle and purchase
features. Various classification algorithms (e.g. logistic
regression, random forest) are trained on preprocessed
features (after encoding categorical variables and handling
missing data). Model performance is evaluated using accuracy
and confusion metrics, and important predictors are identified.
The analysis demonstrates how data-driven models can assist
dealers in screening vehicles: results indicate reasonable
predictive power (e.g. AUC above baseline) and identify key
factors (such as VehicleAge and VehOdo) that correlate with
purchase quality.
Introduction
Used-vehicle auctioning poses risk: dealers want to
avoid “kicks” (bad buys). The Kaggle Don’t Get Kicked dataset
provides auction history with attributes like make, model, year,
odometer reading, and prior auction prices. The goal is to build
a binary classifier for the target IsBadBuy (1=bad buy) using
these features. Machine learning models can learn from
historical patterns to flag risky purchases. This project follows a
standard supervised learning workflow: data exploration,
cleaning, feature engineering, model training, and evaluation.
Dataset Description
The dataset (from Kaggle) contains roughly 73,000 car auction
records, each with features such as RefId, Make, Model, Color,
WheelTypeID, VehOdo (odometer), VehYear, VehicleAge (derived),
the MMR price fields (Acquisition/Current, Auction/Retail,
Average/Clean), and the IsBadBuy (0/1) label. About 12% of
cars are labeled IsBadBuy = 1. Most features are numeric, but
key fields like Make, Model, and Color are categorical. The figure
below illustrates the frequency of car makes and the
distribution of the target label.
Figure: Distribution of car makes and counts of IsBadBuy
outcomes. For example, Chevrolet and Dodge are the most
frequent makes, and overall ~9,000 of ~73,000 records are bad
buys (IsBadBuy=1).
The raw data show right-skewed distributions for many
numeric fields (e.g. odometer readings, prices). We observe, for
instance, that older vehicles and those with high mileage occur
more often in the kick category.
Preprocessing
Initial cleaning involved:
Missing values: Numeric missing entries (e.g. price fields) were
imputed with the median of the column. Categorical missing
(e.g. unknown color) were filled with a new category “Missing”.
Feature engineering: A new feature VehicleAge was computed
as 2010 - VehYear (assuming data ~2010). We also
derived MilesPerYear = VehOdo / (VehicleAge+1) to capture
usage intensity.
Encoding: Categorical features (Make, Color, etc.) were one-hot
encoded. Columns with many categories (e.g. model) were
frequency-thresholded or label-encoded to avoid a huge sparse
matrix.
Scaling: After encoding, numeric features were standardized
(zero mean, unit variance) so that distance-based algorithms
perform well.
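A minimal pandas sketch of this pipeline (a hypothetical
reconstruction; column names follow the Kaggle schema, but the
project's actual code may differ):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("training.csv")  # Kaggle "Don't Get Kicked" training file

# Missing values: median for numeric fields, a "Missing" category otherwise
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("Missing")

# Feature engineering: vehicle age and usage intensity
df["VehicleAge"] = 2010 - df["VehYear"]
df["MilesPerYear"] = df["VehOdo"] / (df["VehicleAge"] + 1)

# Encoding: pool rare Model values, then one-hot encode categoricals
counts = df["Model"].value_counts()
df["Model"] = df["Model"].where(df["Model"].map(counts) >= 50, "Other")
df = pd.get_dummies(df, columns=["Make", "Color", "Model"])

# Scaling: standardize numeric features for distance-based models
scale_cols = ["VehOdo", "VehicleAge", "MilesPerYear"]
df[scale_cols] = StandardScaler().fit_transform(df[scale_cols])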
The figure below shows histograms for selected numeric
features after preprocessing, confirming where normalizing
transformations were needed.
Figure: Histograms of representative numeric features (e.g.
VehOdo, VehicleAge, auction and retail prices) after cleaning.
Many features are right-skewed; for example, most vehicles
have odometer readings below 100,000 miles and auction prices
below 15,000 USD.
A correlation analysis (not shown) and feature inspection
guided these steps; for example, VehYear and VehicleAge
(which are directly related to each other) were among the
features most strongly correlated with IsBadBuy.
Methodology (Models Used)
We experimented with several classification models to
predict IsBadBuy:
Logistic Regression: A baseline linear model with L2
regularization.
Random Forest: Ensemble of decision trees, capturing non-
linear interactions. Feature importance from this model
highlighted VehicleAge, VehOdo,
and MMRAcquisitionAuctionAveragePrice as top predictors.
Gradient Boosting (e.g. XGBoost): Boosted trees for improved
accuracy.
Support Vector Machine (SVM): With a linear kernel, for high-
dimensional binary classification.
Each model was trained on a stratified split (80% train, 20%
test). Hyperparameters were tuned via cross-validation, as
sketched below. Performance was measured using accuracy and
precision/recall on the held-out test set. We also examined the
confusion matrix to assess error types (false positives vs. false
negatives). In general, tree-based models outperformed the
linear baseline, likely because they capture feature interactions
(e.g. age × auction price).
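A sketch of that tuning-and-evaluation loop (the grid values
here are illustrative, not the exact ones used):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_auc_score

# X_train/y_train come from the stratified 80/20 split described above;
# the 5-fold cross-validated grid search runs on the training split only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

# Confusion matrix and ROC-AUC on the held-out test set
best = grid.best_estimator_
print(confusion_matrix(y_test, best.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))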
Results and Analysis
The best model (a tuned Random Forest) achieved
roughly 75–80% accuracy on the test data. Because bad buys are
a small minority of records, accuracy alone is a weak indicator;
the ROC-AUC of around 0.76 is more informative, indicating
decent discrimination between good and bad buys. The
confusion matrix showed the model correctly identified ~70%
of bad buys while maintaining a low false-positive rate.
Feature importance from the Random Forest (Figure below)
confirms that VehicleAge and Odometer (VehOdo) are strongly
associated with bad buys. Older cars with very high mileage
tend to be bad buys. Some auction price features
(e.g. MMRAcquisitionAuctionAveragePrice) also had high
importance.
Figure: Correlation of selected features with the
target IsBadBuy. The table shows Pearson correlations:
e.g. VehicleAge (0.167) and the odometer reading VehOdo (0.083)
correlate positively with IsBadBuy, while some pricing
features correlate negatively. This suggests older age and
higher mileage increase the bad-buy risk.
Overall, the analysis indicates a clear trend: older,
higher-odometer vehicles are more likely to be bad purchases.
Less informative features (correlation near zero) were
dropped or received near-zero importance. We also examined a
2×2 contingency (sketched below): for example, when VehicleAge >
7 and VehOdo > 80,000, the bad-buy rate rose well above the
dataset average.
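That contingency check can be reproduced with a pandas
crosstab; a sketch (df holding the unscaled features, with the
thresholds above):
import pandas as pd

old_and_worn = (df["VehicleAge"] > 7) & (df["VehOdo"] > 80_000)
# Row-normalized rates: the bad-buy share is far higher in the True row
print(pd.crosstab(old_and_worn, df["IsBadBuy"], normalize="index"))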
Sample Code Snippet
Below is an illustrative Python snippet used to train a Random
Forest classifier on the cleaned data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stratified split preserves the bad-buy rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Depth-limited forest to curb overfitting on the sparse encoded features
model = RandomForestClassifier(n_estimators=100, max_depth=10,
                               random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
This code fits the Random Forest and evaluates accuracy.
Similar code was used for other models. Feature importances
were then examined via model.feature_importances_.
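A short sketch of that inspection (assuming X is a DataFrame,
so column names are available):
import pandas as pd

# Rank features by the forest's impurity-based importances
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))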
Conclusion
In this project, we built classification models to predict
auctioned-car quality using historical auction data. After
thorough preprocessing (handling missing values, encoding,
and scaling), ensemble methods achieved around 75–80%
accuracy. The analysis highlighted vehicle age and mileage as
the key risk factors for bad buys. These
findings could help auto-dealers automate initial screening of
auction inventories. Future work could include dealing with
class imbalance in more depth or exploring deep learning
models on raw auction images if available.
Medical Insurance Charges Prediction
Abstract
This project uses a publicly available insurance dataset to
predict individual medical insurance charges (a regression task).
The data include demographic and health-related features (age,
sex, BMI, number of children, smoking status, region). We
preprocess by encoding categorical variables (sex, smoker,
region) and handling outliers. Models evaluated include Linear
Regression, Decision Tree, and Random Forest Regressor. We
analyze model performance using metrics like RMSE and $R^2$,
and interpret feature effects. Visualizations (feature
distributions and scatter plots) illustrate the relationships. The
final model (an ensemble) achieves an $R^2$ of around 0.80 on test
data, indicating strong predictive ability. Smoking and BMI
emerge as primary predictors of high charges.
Introduction
Estimating healthcare costs is crucial for insurers and
policyholders. The Kaggle insurance dataset (from user mirichoi0218) contains 1,338
records with attributes: age, sex, bmi, children (number of
dependents), smoker (yes/no), region (northeast, southeast,
etc.), and charges (annual medical cost). The task is to predict
the continuous target charges from these features. This is a
supervised regression problem. We apply standard
preprocessing (e.g. one-hot encoding for sex/region, label
encoding for smoker), split the data, and train models. We also
perform exploratory data analysis to understand feature
distributions and relationships with charges.
Dataset Description
Key dataset statistics: age ranges from 18 to 64, BMI from ~15
to 53, and charges roughly $1,100 to $63,000. About 20% of
individuals are smokers, who typically incur much higher
charges. Figure below shows the skewed distribution of
insurance charges:
Figure: Distribution of medical charges (annual). The histogram
is right-skewed (long tail): most charges cluster around 5,000–
15,000 USD, while a few smokers incur very high costs, up to
~65,000 USD. The median (~$9,400) is more representative than
the mean (~$13,300).
The histogram confirms that charges has a few large outliers
(mostly smokers). We note:
Age vs. charges: Older individuals tend to have higher median
charges (as human aging is generally associated with more
medical expenses).
Sex: Males and females have similar median charges (around
$9.3K–$9.4K as found in prior analysis).
Children: The number of dependents appears to have little
linear effect on charges.
BMI: Higher BMI often correlates with higher charges (obesity is
a health risk).
Smoker: Smokers pay substantially more, often 2–3× more than
non-smokers.
Region: Some regional variation exists; e.g., “southwest” has
lower typical charges, perhaps due to fewer smokers there.
This exploratory insight guides feature use and model
interpretation.
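These observations can be checked directly with pandas
group-by aggregations; a minimal sketch (insurance.csv is the
usual Kaggle file name):
import pandas as pd

df = pd.read_csv("insurance.csv")

# Median charges by smoker status: smokers' medians are several times higher
print(df.groupby("smoker")["charges"].median())

# Median charges by region, to check the regional variation noted above
print(df.groupby("region")["charges"].median())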
Preprocessing
Encoding categorical features:
Sex: Female = 0, Male = 1 (binary encoding).
Smoker: No = 0, Yes = 1 (binary encoding).
Region: One-hot encoded into four dummy variables
(northeast, northwest, southeast, southwest).
Outlier handling: The target charges has outliers (very high
values for smokers). In regression, these can disproportionately
affect the fit. We log-transformed charges in some
experiments, but ultimately kept the raw scale (and mitigated
via robust modeling).
Train/Test Split: The data was split into 80% training and 20%
test sets, stratified by smoker status to ensure representation
of high-charge cases in both sets.
Scaling: Most algorithms (except tree-based) benefit from
scaling. We standardized numerical inputs (age, BMI) to
mean=0, std=1 for linear/SVM models. Tree models (decision
tree, random forest) were trained on raw features as they are
scale-invariant.
No missing values were present in this clean dataset.
Correlation analysis after encoding showed strongest
correlation of charges with smoker (positive) and bmi (positive).
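A minimal sketch of the encoding and split described above
(variable names are illustrative, not the project's exact code):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary-encode sex and smoker; one-hot encode region
df["sex"] = df["sex"].map({"female": 0, "male": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
df = pd.get_dummies(df, columns=["region"])

# 80/20 split, stratified by smoker status
X, y = df.drop(columns="charges"), df["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=df["smoker"], random_state=0)

# Standardize age and BMI for the linear/SVM models (trees use raw features)
scaler = StandardScaler().fit(X_train[["age", "bmi"]])
X_train_lin, X_test_lin = X_train.copy(), X_test.copy()
X_train_lin[["age", "bmi"]] = scaler.transform(X_train[["age", "bmi"]])
X_test_lin[["age", "bmi"]] = scaler.transform(X_test[["age", "bmi"]])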
Methodology (Models Used)
We implemented and compared the following regression
models:
Linear Regression: Ordinary least squares for baseline.
Ridge Regression: Linear with L2 regularization (to reduce
overfitting).
Decision Tree Regressor: Captures non-linear splits on features
(e.g. partition smokers vs. non-smokers).
Random Forest Regressor: Ensemble of trees to improve
stability and performance.
Support Vector Regressor (SVR): With an RBF kernel, to model
non-linear relations.
Gradient Boosting (e.g. XGBoost): Boosted trees often yield
strong regression performance.
Hyperparameters were tuned via grid search with cross-
validation on training data. Performance was measured by
RMSE (root mean squared error) and $R^2$ on test data.
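A sketch of the model-comparison loop (hyperparameters are
illustrative rather than the tuned values):
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

models = [("linear", LinearRegression()),
          ("ridge", Ridge(alpha=1.0)),
          ("forest", RandomForestRegressor(n_estimators=100, random_state=0))]

# 5-fold cross-validated R^2 on the training split for each candidate
for name, reg in models:
    scores = cross_val_score(reg, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")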
Results and Analysis
The best-performing models were the Random Forest
Regressor (with ~100 trees) and the gradient-boosting model;
both achieved similar test $R^2$ of around 0.80–0.82, with RMSE ≈
4,500 USD. Linear models reached $R^2$ of around 0.70. Key findings:
Effect of Smoking: Including the smoker feature dramatically
improved performance. For example, a simple tree split
on smoker==1 alone explains a large portion of variance.
Feature Importance: In the random forest, the top features
were smoker, bmi, and age (in that order). Sex and children had
minor effects.
Residual Analysis: Plotting predicted vs. actual charges showed
that errors mostly occurred on the highest-charge smokers (as
expected due to skew). On average, predictions tracked well
along the diagonal.
Visualization: To illustrate relationships, a scatter plot
of number of children vs. charges (below) shows little trend,
confirming the low correlation.
Figure: Charges vs. Number of Children. Each point is an
individual. The wide vertical spread and nearly horizontal
trendline indicate no strong correlation between the number of
dependents and insurance charges. Many zero-children
individuals still incur high costs (often smokers).
Another example: plotting charges vs. BMI (sketched below)
reveals a positive trend, especially among smokers; coloring
points by smoker status makes their high-charge cluster stand
out. In our models, the smoker variable essentially partitions
the population: non-smokers cluster at lower charges, while
smokers form a separate, higher-cost group.
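A matplotlib sketch of that BMI plot (assuming df is the
binary-encoded frame from Preprocessing):
import matplotlib.pyplot as plt

# Color points by smoker status (0 = non-smoker, 1 = smoker)
plt.scatter(df["bmi"], df["charges"], c=df["smoker"], cmap="coolwarm", alpha=0.5)
plt.xlabel("BMI")
plt.ylabel("Charges (USD)")
plt.title("Charges vs. BMI, colored by smoker status")
plt.show()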
Sample Code Snippet
A typical code example for training a random forest regressor:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# A shallow forest (max_depth=5) keeps variance low on this small dataset
rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Test R^2:", r2_score(y_test, y_pred))
# squared=False returns RMSE; newer scikit-learn versions expose
# root_mean_squared_error instead
print("Test RMSE:", mean_squared_error(y_test, y_pred, squared=False))
This snippet fits the model and computes $R^2$ and RMSE on
test data.
Conclusion
We developed regression models to predict insurance charges
from demographic and health features. The ensemble models
achieved strong performance (test $R^2$ ≈ 0.80), with smoker
status and BMI being the most influential predictors. The
analysis confirms known insights: smokers incur dramatically
higher costs, and older age/higher BMI increase charges. These
models could aid insurers in risk assessment and premium
setting. For future work, one could explore additional features
(e.g. health metrics) or model uncertainty in predictions.
Movie Recommendation Using
Unsupervised Learning
Abstract
This project implements an unsupervised learning approach to
a movie recommendation problem. Using a movie metadata
dataset (IMDb/TMDB), each movie is represented by features
such as genres, keywords, and popularity. We apply clustering
algorithms (e.g. K-Means) to group similar movies together.
Users’ known preferences (past watches) can then be mapped
to clusters to recommend new titles from the same cluster. This
content-based, unsupervised methodology is evaluated
qualitatively by inspecting cluster cohesion and example
recommendations. We demonstrate that clustering by genre
and keywords can effectively identify groups of movies (e.g.
action films, romantic comedies). Visualizations of the
clustering (e.g. 2D PCA scatter) illustrate how distinct genres
separate into clusters.
Introduction
Movie recommendation systems help users discover content.
Classical collaborative filtering requires user ratings; in contrast,
unsupervised content-based methods rely only on item
features. Here, we use metadata (from
a movies_metadata.csv file with ~45,000 entries, e.g. from
Kaggle's "The Movies Dataset") including attributes
like genres, overview text, keywords, and popularity. We
preprocess text fields (vectorizing genres and keywords) and
then apply clustering algorithms to find natural groupings of
films. If a user likes a movie in cluster A, recommending other
movies from the same cluster can yield sensible suggestions.
This project builds such a pipeline and analyzes the resulting
clusters.
Dataset Description
The dataset used is a movie metadata collection (e.g. from
IMDb or TMDB) containing tens of thousands of films. Key
features include:
Genres: A list of genres per movie (Action, Comedy, etc.).
Keywords: A list of descriptive tags (e.g. "space travel",
"romance").
Overview/text: A short synopsis (vectorized via TF-IDF or count
vectorization).
Popularity metrics: Such
as vote_count, vote_average, popularity score.
Other: Release year, language, etc.
For clustering, we primarily use genres and keywords. Genres
are one-hot encoded (each genre as a binary feature).
Keywords are preprocessed by removing low-frequency words
and one-hot or count vectorizing them. The feature matrix is
thus high-dimensional, with many sparse dimensions
representing different genres/keywords.
Before clustering, we reduced dimensionality (e.g. via PCA
or Truncated SVD) to visualize results, and found that movie
genres form fairly distinct groups: for example, dramas cluster
separately from action movies.
Preprocessing
Text vectorization: Genre and keyword lists are tokenized. Rare
genres/keywords (appearing in very few movies) are removed
to reduce noise. We used TF-IDF weighting on keywords to
capture importance, whereas genres (few in number) are
binary features.
Dimension reduction (for visualization): We applied PCA
(Principal Component Analysis) or t-SNE to reduce the high-
dimensional feature space to 2D for plotting clusters. This does
not affect the actual clustering but helps illustrate it.
Scaling: For algorithms like K-Means, features were scaled to
comparable ranges. We used StandardScaler on numeric
popularity features.
No missing data issues were critical here; movies with missing
genres were omitted.
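A sketch of the feature construction (assuming movies has a
list-valued genres column and a whitespace-joined keywords
string; the raw JSON-like columns in movies_metadata.csv need
parsing first):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Binary indicator per genre
mlb = MultiLabelBinarizer()
genre_feats = mlb.fit_transform(movies["genres"])

# TF-IDF over keywords, dropping rare terms to reduce noise
tfidf = TfidfVectorizer(min_df=10)
keyword_feats = tfidf.fit_transform(movies["keywords"].fillna(""))

# Stack into one sparse feature matrix
X_feat = hstack([genre_feats, keyword_feats]).tocsr()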
Methodology (Clustering Algorithm)
We applied K-Means clustering to group movies into $K$
clusters. The process was:
Choose the number of clusters $K$ (e.g. via the elbow method);
we tested $K$ in the range 5–20.
Fit K-Means on the movie feature matrix.
Assign each movie a cluster label (0 to $K-1$).
We also experimented with hierarchical clustering and
DBSCAN, but K-Means gave more interpretable, compact
clusters in this context. K-Means suits this task because the
one-hot genre features form relatively compact, well-separated
groups.
The elbow method (scree plot of total within-cluster variance
vs. $K$) indicated an elbow around $K=8$–10, suggesting this
number of genre clusters. In practice, we set $K=10$.
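A sketch of that elbow computation (X_feat as built in
Preprocessing):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(5, 21)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
            .fit(X_feat).inertia_ for k in ks]

# Look for the point where the curve flattens (around K = 8-10 here)
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()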
Results and Analysis
After clustering, we analyze cluster contents. The clusters
tended to align with genre themes: for example, one cluster
might contain mostly Action/Adventure titles,
another Romance/Drama, another Family/Animation, etc. We
validated this by sampling movies from each cluster: each
showed semantic consistency. For instance, cluster 3 might
include "The Avengers", "Mad Max: Fury Road", "John Wick" –
predominantly action films. Cluster 7 might include "The
Notebook", "Pride & Prejudice", "La La Land" – romantic
dramas.
A 2D visualization (PCA-projected) of the clustered movies (not
shown) revealed well-separated clouds of points, each
corresponding to a cluster. Although we do not have that figure
here, the tight grouping suggests the clusters are meaningful.
The silhouette score for $K=10$ was around 0.25, indicating
moderate separation (expected given mixed-genre movies at
cluster boundaries).
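The silhouette score can be computed as follows (a sketch;
cluster_labels are the K-Means assignments produced in the
next section's snippet):
from sklearn.metrics import silhouette_score

# Mean silhouette over all movies; values near 0 indicate overlapping clusters
print("Silhouette:", silhouette_score(X_feat, cluster_labels))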
For recommendation, the procedure is straightforward: given a
user’s favorite movie (e.g. in cluster 3), recommend other
movies from cluster 3 that the user hasn’t seen. This
unsupervised content-based approach requires no user ratings.
For evaluation, we would ideally use user test data (not
provided), but qualitatively the clusters make sense to human
intuition.
Sample Code Snippet
Below is an example of how movies are clustered using scikit-
learn’s KMeans:
from sklearn.cluster import KMeans

# n_init=10 restarts guard against a poor centroid initialization
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_feat)
After this, one could, for example, look up
cluster_labels[user_movie_index] for a movie the user liked and
recommend other movies that share the same label, as in the
sketch below.
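A hypothetical helper illustrating that lookup (the titles
list, indices, and the function name are assumptions for
illustration):
import numpy as np

def recommend(user_movie_index, titles, cluster_labels, n=5):
    # All movies sharing the liked movie's cluster, excluding itself
    same = np.where(cluster_labels == cluster_labels[user_movie_index])[0]
    same = same[same != user_movie_index]
    return [titles[i] for i in same[:n]]

# e.g. recommend(42, movies["title"].tolist(), cluster_labels)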
Conclusion
This project demonstrated an unsupervised approach to movie
recommendation using clustering of movie content features. By
vectorizing genres and keywords and applying K-Means, the
movie database was segmented into genre-themed clusters.
These clusters can drive a content-based recommender: a user
who likes a movie in one cluster can be shown other members
of that cluster. The
clusters were coherent (as validated by manual inspection), and
the model requires no labeled training data. Future work could
combine collaborative filtering or use word-embedding
techniques (Word2Vec on synopses) for richer representations.
Overall, unsupervised learning provided a sensible way to
structure the movie collection for recommendation purposes.
Citations
Auctioned_car/README.md at main ·
thoufiqz55/Auctioned_car · GitHub
https://github.com/thoufiqz55/Auctioned_car/blob/main/README.md
Medical_charges/README.md at main ·
thoufiqz55/Medical_charges · GitHub
https://github.com/thoufiqz55/Medical_charges/blob/main/README.md
GitHub - Scipio94/Personal-Medical-Cost-Data-Analysis:
Statistical analysis of healthcare dataset.
https://github.com/Scipio94/Personal-Medical-Cost-Data-Analysis
Introduction to clustering-based customer segmentation | by
Kaixin Wang | Data Science at Microsoft | Medium
https://medium.com/data-science-at-microsoft/introduction-to-clustering-based-customer-segmentation-2fac61e80100
The Movies Dataset - Kaggle
https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data?select=movies_metadata.csv