Machine Learning Projects Report
NAME- Puranjay
ROLL NO.- 06119051923
BATCH- AIDS B1
REPORT ABSTRACT
This report presents a consolidated analysis of three distinct
machine learning projects hosted on GitHub. Each project is
treated as a separate case study, with its
own Abstract, Introduction, Dataset
Description, Preprocessing, Methodology, Results and
Analysis, Sample Code Snippets, and Conclusion. The projects
are: (1) Auctioned Car Classification, (2) Medical Insurance
Charges Prediction, and (3) Movie Recommendation Using
Unsupervised Learning. For each, we summarize objectives,
data handling, modeling approach, and findings, and include
relevant visualizations.
Auctioned Car Classification
Abstract
This project uses a Kaggle dataset (“Don’t Get Kicked”
competition) of auctioned used cars to predict whether a car is
a “Kick” (a bad buy) or not. It formulates a binary classification
task on ~73,000 auction records with vehicle and purchase
features. Various classification algorithms (e.g. logistic
regression, random forest) are trained on preprocessed
features (after encoding categorical variables and handling
missing data). Model performance is evaluated using accuracy
and confusion metrics, and important predictors are identified.
The analysis demonstrates how data-driven models can assist
dealers in screening vehicles: results indicate reasonable
predictive power (e.g. AUC above baseline) and identify key
factors (such as VehicleAge and VehOdo) that correlate with
purchase quality.
Introduction
Used-vehicle auctioning poses risk: dealers want to
avoid “kicks” (bad buys). The Kaggle Don’t Get Kicked dataset
provides auction history with attributes like make, model, year,
odometer reading, and prior auction prices. The goal is to build
a binary classifier for the target IsBadBuy (1=bad buy) using
these features. Machine learning models can learn from
historical patterns to flag risky purchases. This project follows a
standard supervised learning workflow: data exploration,
cleaning, feature engineering, model training, and evaluation.
Dataset Description
The dataset (from Kaggle) contains roughly 73,000 car auction
records, each with features such as RefId, Make, Model, Color,
WheelTypeID, VehOdo (odometer), VehYear, VehicleAge (derived),
the MMR price fields (Acquisition/Current, Auction/Retail,
Average/Clean), and the IsBadBuy (0/1) label. About 12% of
cars are labeled IsBadBuy = 1. Most features are numeric, but
key fields like Make, Model, and Color are categorical. The figure
below illustrates the frequency of car makes and the
distribution of the target label.
Figure: Distribution of car makes and counts of IsBadBuy
outcomes. For example, Chevrolet and Dodge are the most
frequent makes, and overall ~9,000 of ~73,000 records are bad
buys (IsBadBuy=1).
The raw data show right-skewed distributions for many
numeric fields (e.g. odometer readings, prices). We observe, for
instance, that older vehicles and those with high mileage occur
more often in the kick category.
Preprocessing
Initial cleaning involved:
Missing values: Numeric missing entries (e.g. price fields) were
imputed with the median of the column. Categorical missing
(e.g. unknown color) were filled with a new category “Missing”.
Feature engineering: A new feature VehicleAge was computed
as 2010 - VehYear (assuming data ~2010). We also
derived MilesPerYear = VehOdo / (VehicleAge+1) to capture
usage intensity.
Encoding: Categorical features (Make, Color, etc.) were one-hot
encoded. Columns with many categories (e.g. model) were
frequency-thresholded or label-encoded to avoid a huge sparse
matrix.
Scaling: After encoding, numeric features were standardized
(zero mean, unit variance) so that distance-based algorithms
perform well.
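A minimal pandas sketch of this pipeline (a hypothetical
reconstruction; column names follow the Kaggle schema, but the
project's actual code may differ):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("training.csv")  # Kaggle "Don't Get Kicked" training file

# Missing values: median for numeric fields, a "Missing" category otherwise
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("Missing")

# Feature engineering: vehicle age and usage intensity
df["VehicleAge"] = 2010 - df["VehYear"]
df["MilesPerYear"] = df["VehOdo"] / (df["VehicleAge"] + 1)

# Encoding: pool rare Model values, then one-hot encode categoricals
counts = df["Model"].value_counts()
df["Model"] = df["Model"].where(df["Model"].map(counts) >= 50, "Other")
df = pd.get_dummies(df, columns=["Make", "Color", "Model"])

# Scaling: standardize numeric features for distance-based models
scale_cols = ["VehOdo", "VehicleAge", "MilesPerYear"]
df[scale_cols] = StandardScaler().fit_transform(df[scale_cols])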
The figure below shows histograms for selected numeric
features after preprocessing, confirming where normalizing
transformations were needed.
Figure: Histograms of representative numeric features (e.g.
VehOdo, VehicleAge, auction and retail prices) after cleaning.
Many features are right-skewed; for example, most vehicles
have odometer readings below 100,000 miles and auction prices
below 15,000 USD.
A correlation analysis (not shown) and feature inspection
guided these steps; for example, VehYear and VehicleAge
(which are directly related to each other) were among the
features most strongly correlated with IsBadBuy.
Methodology (Models Used)
We experimented with several classification models to
predict IsBadBuy:
Logistic Regression: A baseline linear model with L2
regularization.
Random Forest: Ensemble of decision trees, capturing non-
linear interactions. Feature importance from this model
highlighted VehicleAge, VehOdo,
and MMRAcquisitionAuctionAveragePrice as top predictors.
Gradient Boosting (e.g. XGBoost): Boosted trees for improved
accuracy.
Support Vector Machine (SVM): With a linear kernel, for high-
dimensional binary classification.
Each model was trained on a stratified split (80% train, 20%
test). Hyperparameters were tuned via cross-validation, as
sketched below. Performance was measured using accuracy and
precision/recall on the held-out test set. We also examined the
confusion matrix to assess error types (false positives vs. false
negatives). In general, tree-based models outperformed the
linear baseline, likely because they capture feature interactions
(e.g. age × auction price).
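A sketch of that tuning-and-evaluation loop (the grid values
here are illustrative, not the exact ones used):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_auc_score

# X_train/y_train come from the stratified 80/20 split described above;
# the 5-fold cross-validated grid search runs on the training split only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

# Confusion matrix and ROC-AUC on the held-out test set
best = grid.best_estimator_
print(confusion_matrix(y_test, best.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))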
Results and Analysis
The best model (a tuned Random Forest) achieved
roughly 75–80% accuracy on the test data. Because bad buys are
a small minority of records, accuracy alone is a weak indicator;
the ROC-AUC of around 0.76 is more informative, indicating
decent discrimination between good and bad buys. The
confusion matrix showed the model correctly identified ~70%
of bad buys while maintaining a low false-positive rate.
Feature importance from the Random Forest (Figure below)
confirms that VehicleAge and Odometer (VehOdo) are strongly
associated with bad buys. Older cars with very high mileage
tend to be bad buys. Some auction price features
(e.g. MMRAcquisitionAuctionAveragePrice) also had high
importance.
Figure: Correlation of selected features with the
target IsBadBuy. The table shows Pearson correlations:
e.g. VehicleAge (0.167) and the odometer reading VehOdo (0.083)
correlate positively with IsBadBuy, while some pricing
features correlate negatively. This suggests older age and
higher mileage increase the bad-buy risk.
Overall, the analysis indicates a clear trend: older,
higher-odometer vehicles are more likely to be bad purchases.
Less informative features (correlation near zero) were
dropped or received near-zero importance. We also examined a
2×2 contingency (sketched below): for example, when VehicleAge >
7 and VehOdo > 80,000, the bad-buy rate rose well above the
dataset average.
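That contingency check can be reproduced with a pandas
crosstab; a sketch (df holding the unscaled features, with the
thresholds above):
import pandas as pd

old_and_worn = (df["VehicleAge"] > 7) & (df["VehOdo"] > 80_000)
# Row-normalized rates: the bad-buy share is far higher in the True row
print(pd.crosstab(old_and_worn, df["IsBadBuy"], normalize="index"))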
Sample Code Snippet
Below is an illustrative Python snippet used to train a Random
Forest classifier on the cleaned data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stratified split preserves the bad-buy rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Depth-limited forest to curb overfitting on the sparse encoded features
model = RandomForestClassifier(n_estimators=100, max_depth=10,
                               random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
This code fits the Random Forest and evaluates accuracy.
Similar code was used for other models. Feature importances
were then examined via model.feature_importances_.
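A short sketch of that inspection (assuming X is a DataFrame,
so column names are available):
import pandas as pd

# Rank features by the forest's impurity-based importances
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))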
Conclusion
In this project, we built classification models to predict
auctioned-car quality using historical auction data. After
thorough preprocessing (handling missing values, encoding,
and scaling), ensemble methods achieved around 75–80%
accuracy. The analysis highlighted vehicle age and mileage as
the key risk factors for bad buys. These
findings could help auto-dealers automate initial screening of
auction inventories. Future work could include dealing with
class imbalance in more depth or exploring deep learning
models on raw auction images if available.
Medical Insurance Charges Prediction
Abstract
This project uses a publicly available insurance dataset to
predict individual medical insurance charges (a regression task).
The data include demographic and health-related features (age,
sex, BMI, number of children, smoking status, region). We
preprocess by encoding categorical variables (sex, smoker,
region) and handling outliers. Models evaluated include Linear
Regression, Decision Tree, and Random Forest Regressor. We
analyze model performance using metrics like RMSE and $R^2$,
and interpret feature effects. Visualizations (feature
distributions and scatter plots) illustrate the relationships. The
final model (an ensemble) achieves an $R^2$ of around 0.80 on test
data, indicating strong predictive ability. Smoking and BMI
emerge as primary predictors of high charges.
Introduction
Estimating healthcare costs is crucial for insurers and
policyholders. The Kaggle insurance dataset (from user mirichoi0218) contains 1,338
records with attributes: age, sex, bmi, children (number of
dependents), smoker (yes/no), region (northeast, southeast,
etc.), and charges (annual medical cost). The task is to predict
the continuous target charges from these features. This is a
supervised regression problem. We apply standard
preprocessing (e.g. one-hot encoding for sex/region, label
encoding for smoker), split the data, and train models. We also
perform exploratory data analysis to understand feature
distributions and relationships with charges.
Dataset Description
Key dataset statistics: age ranges from 18 to 64, BMI from ~15
to 53, and charges roughly $1,100 to $63,000. About 20% of
individuals are smokers, who typically incur much higher
charges. Figure below shows the skewed distribution of
insurance charges:
Figure: Distribution of medical charges (annual). The histogram
is right-skewed (long tail): most charges cluster around 5,000–
15,000 USD, while a few smokers incur very high costs, up to
~65,000 USD. The median (~$9,400) is more representative than
the mean (~$13,300).
The histogram confirms that charges has a few large outliers
(mostly smokers). We note:
Age vs. charges: Older individuals tend to have higher median
charges (as human aging is generally associated with more
medical expenses).
Sex: Males and females have similar median charges (around
$9.3K–$9.4K as found in prior analysis).
Children: The number of dependents appears to have little
linear effect on charges.
BMI: Higher BMI often correlates with higher charges (obesity is
a health risk).
Smoker: Smokers pay substantially more, often 2–3× more than
non-smokers.
Region: Some regional variation exists; e.g., “southwest” has
lower typical charges, perhaps due to fewer smokers there.
This exploratory insight guides feature use and model
interpretation.
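These observations can be checked directly with pandas
group-by aggregations; a minimal sketch (insurance.csv is the
usual Kaggle file name):
import pandas as pd

df = pd.read_csv("insurance.csv")

# Median charges by smoker status: smokers' medians are several times higher
print(df.groupby("smoker")["charges"].median())

# Median charges by region, to check the regional variation noted above
print(df.groupby("region")["charges"].median())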
Preprocessing
Encoding categorical features:
Sex: Female = 0, Male = 1 (binary encoding).
Smoker: No = 0, Yes = 1 (binary encoding).
Region: One-hot encoded into four dummy variables
(northeast, northwest, southeast, southwest).
Outlier handling: The target charges has outliers (very high
values for smokers). In regression, these can disproportionately
affect the fit. We log-transformed charges in some
experiments, but ultimately kept the raw scale (and mitigated
via robust modeling).
Train/Test Split: The data was split into 80% training and 20%
test sets, stratified by smoker status to ensure representation
of high-charge cases in both sets.
Scaling: Most algorithms (except tree-based) benefit from
scaling. We standardized numerical inputs (age, BMI) to
mean=0, std=1 for linear/SVM models. Tree models (decision
tree, random forest) were trained on raw features as they are
scale-invariant.
No missing values were present in this clean dataset.
Correlation analysis after encoding showed strongest
correlation of charges with smoker (positive) and bmi (positive).
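A minimal sketch of the encoding and split described above
(variable names are illustrative, not the project's exact code):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary-encode sex and smoker; one-hot encode region
df["sex"] = df["sex"].map({"female": 0, "male": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
df = pd.get_dummies(df, columns=["region"])

# 80/20 split, stratified by smoker status
X, y = df.drop(columns="charges"), df["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=df["smoker"], random_state=0)

# Standardize age and BMI for the linear/SVM models (trees use raw features)
scaler = StandardScaler().fit(X_train[["age", "bmi"]])
X_train_lin, X_test_lin = X_train.copy(), X_test.copy()
X_train_lin[["age", "bmi"]] = scaler.transform(X_train[["age", "bmi"]])
X_test_lin[["age", "bmi"]] = scaler.transform(X_test[["age", "bmi"]])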
Methodology (Models Used)
We implemented and compared the following regression
models:
Linear Regression: Ordinary least squares for baseline.
Ridge Regression: Linear with L2 regularization (to reduce
overfitting).
Decision Tree Regressor: Captures non-linear splits on features
(e.g. partition smokers vs. non-smokers).
Random Forest Regressor: Ensemble of trees to improve
stability and performance.
Support Vector Regressor (SVR): With an RBF kernel, to model
non-linear relations.
Gradient Boosting (e.g. XGBoost): Boosted trees often yield
strong regression performance.
Hyperparameters were tuned via grid search with cross-
validation on training data. Performance was measured by
RMSE (root mean squared error) and $R^2$ on test data.
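A sketch of the model-comparison loop (hyperparameters are
illustrative rather than the tuned values):
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

models = [("linear", LinearRegression()),
          ("ridge", Ridge(alpha=1.0)),
          ("forest", RandomForestRegressor(n_estimators=100, random_state=0))]

# 5-fold cross-validated R^2 on the training split for each candidate
for name, reg in models:
    scores = cross_val_score(reg, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")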
Results and Analysis
The best-performing models were the Random Forest
Regressor (with ~100 trees) and the gradient-boosting model;
both achieved similar test $R^2$ of around 0.80–0.82, with RMSE ≈
4,500 USD. Linear models reached $R^2$ of around 0.70. Key findings:
Effect of Smoking: Including the smoker feature dramatically
improved performance. For example, a simple tree split
on smoker==1 alone explains a large portion of variance.
Feature Importance: In the random forest, the top features
were smoker, bmi, and age (in that order). Sex and children had
minor effects.
Residual Analysis: Plotting predicted vs. actual charges showed
that errors mostly occurred on the highest-charge smokers (as
expected due to skew). On average, predictions tracked well
along the diagonal.
Visualization: To illustrate relationships, a scatter plot
of number of children vs. charges (below) shows little trend,
confirming the low correlation.
Figure: Charges vs. Number of Children. Each point is an
individual. The wide vertical spread and nearly horizontal
trendline indicate no strong correlation between the number of
dependents and insurance charges. Many zero-children
individuals still incur high costs (often smokers).
Another example: plotting charges vs. BMI (sketched below)
reveals a positive trend, especially among smokers; coloring
points by smoker status makes their high-charge cluster stand
out. In our models, the smoker variable essentially partitions
the population: non-smokers cluster at lower charges, while
smokers form a separate, higher-cost group.
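A matplotlib sketch of that BMI plot (assuming df is the
binary-encoded frame from Preprocessing):
import matplotlib.pyplot as plt

# Color points by smoker status (0 = non-smoker, 1 = smoker)
plt.scatter(df["bmi"], df["charges"], c=df["smoker"], cmap="coolwarm", alpha=0.5)
plt.xlabel("BMI")
plt.ylabel("Charges (USD)")
plt.title("Charges vs. BMI, colored by smoker status")
plt.show()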
Sample Code Snippet
A typical code example for training a random forest regressor:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# A shallow forest (max_depth=5) keeps variance low on this small dataset
rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Test R^2:", r2_score(y_test, y_pred))
# squared=False returns RMSE; newer scikit-learn versions expose
# root_mean_squared_error instead
print("Test RMSE:", mean_squared_error(y_test, y_pred, squared=False))
This snippet fits the model and computes $R^2$ and RMSE on
test data.
Conclusion
We developed regression models to predict insurance charges
from demographic and health features. The ensemble models
achieved strong performance (test $R^2$ ≈ 0.80), with smoker
status and BMI being the most influential predictors. The
analysis confirms known insights: smokers incur dramatically
higher costs, and older age/higher BMI increase charges. These
models could aid insurers in risk assessment and premium
setting. For future work, one could explore additional features
(e.g. health metrics) or model uncertainty in predictions.
Movie Recommendation Using
Unsupervised Learning
Abstract
This project implements an unsupervised learning approach to
a movie recommendation problem. Using a movie metadata
dataset (IMDb/TMDB), each movie is represented by features
such as genres, keywords, and popularity. We apply clustering
algorithms (e.g. K-Means) to group similar movies together.
Users’ known preferences (past watches) can then be mapped
to clusters to recommend new titles from the same cluster. This
content-based, unsupervised methodology is evaluated
qualitatively by inspecting cluster cohesion and example
recommendations. We demonstrate that clustering by genre
and keywords can effectively identify groups of movies (e.g.
action films, romantic comedies). Visualizations of the
clustering (e.g. 2D PCA scatter) illustrate how distinct genres
separate into clusters.
Introduction
Movie recommendation systems help users discover content.
Classical collaborative filtering requires user ratings; in contrast,
unsupervised content-based methods rely only on item
features. Here, we use metadata (from
a movies_metadata.csv file with ~45,000 entries, e.g. from
Kaggle's "The Movies Dataset") including attributes
like genres, overview text, keywords, and popularity. We
preprocess text fields (vectorizing genres and keywords) and
then apply clustering algorithms to find natural groupings of
films. If a user likes a movie in cluster A, recommending other
movies from the same cluster can yield sensible suggestions.
This project builds such a pipeline and analyzes the resulting
clusters.
Dataset Description
The dataset used is a movie metadata collection (e.g. from
IMDb or TMDB) containing tens of thousands of films. Key
features include:
Genres: A list of genres per movie (Action, Comedy, etc.).
Keywords: A list of descriptive tags (e.g. "space travel",
"romance").
Overview/text: A short synopsis (vectorized via TF-IDF or count
vectorization).
Popularity metrics: Such
as vote_count, vote_average, popularity score.
Other: Release year, language, etc.
For clustering, we primarily use genres and keywords. Genres
are one-hot encoded (each genre as a binary feature).
Keywords are preprocessed by removing low-frequency words
and one-hot or count vectorizing them. The feature matrix is
thus high-dimensional, with many sparse dimensions
representing different genres/keywords.
Before clustering, we reduced dimensionality (e.g. via PCA
or Truncated SVD) to visualize results, and found that movie
genres form fairly distinct groups: for example, dramas cluster
separately from action movies.
Preprocessing
Text vectorization: Genre and keyword lists are tokenized. Rare
genres/keywords (appearing in very few movies) are removed
to reduce noise. We used TF-IDF weighting on keywords to
capture importance, whereas genres (few in number) are
binary features.
Dimension reduction (for visualization): We applied PCA
(Principal Component Analysis) or t-SNE to reduce the high-
dimensional feature space to 2D for plotting clusters. This does
not affect the actual clustering but helps illustrate it.
Scaling: For algorithms like K-Means, features were scaled to
comparable ranges. We used StandardScaler on numeric
popularity features.
No missing data issues were critical here; movies with missing
genres were omitted.
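A sketch of the feature construction (assuming movies has a
list-valued genres column and a whitespace-joined keywords
string; the raw JSON-like columns in movies_metadata.csv need
parsing first):
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Binary indicator per genre
mlb = MultiLabelBinarizer()
genre_feats = mlb.fit_transform(movies["genres"])

# TF-IDF over keywords, dropping rare terms to reduce noise
tfidf = TfidfVectorizer(min_df=10)
keyword_feats = tfidf.fit_transform(movies["keywords"].fillna(""))

# Stack into one sparse feature matrix
X_feat = hstack([genre_feats, keyword_feats]).tocsr()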
Methodology (Clustering Algorithm)
We applied K-Means clustering to group movies into $K$
clusters. The process was:
Choose the number of clusters $K$ (e.g. via the elbow method);
we tested $K$ in the range 5–20.
Fit K-Means on the movie feature matrix.
Assign each movie a cluster label (0 to $K-1$).
We also experimented with hierarchical clustering and
DBSCAN, but K-Means gave more interpretable, compact
clusters in this context. K-Means suits this task because the
one-hot genre features form relatively compact, well-separated
groups.
The elbow method (scree plot of total within-cluster variance
vs. $K$) indicated an elbow around $K=8$–10, suggesting this
number of genre clusters. In practice, we set $K=10$.
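A sketch of that elbow computation (X_feat as built in
Preprocessing):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(5, 21)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
            .fit(X_feat).inertia_ for k in ks]

# Look for the point where the curve flattens (around K = 8-10 here)
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()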
Results and Analysis
After clustering, we analyze cluster contents. The clusters
tended to align with genre themes: for example, one cluster
might contain mostly Action/Adventure titles,
another Romance/Drama, another Family/Animation, etc. We
validated this by sampling movies from each cluster: each
showed semantic consistency. For instance, cluster 3 might
include "The Avengers", "Mad Max: Fury Road", "John Wick" –
predominantly action films. Cluster 7 might include "The
Notebook", "Pride & Prejudice", "La La Land" – romantic
dramas.
A 2D visualization (PCA-projected) of the clustered movies (not
shown) revealed well-separated clouds of points, each
corresponding to a cluster. Although we do not have that figure
here, the tight grouping suggests the clusters are meaningful.
The silhouette score for $K=10$ was around 0.25, indicating
moderate separation (expected given mixed-genre movies at
cluster boundaries).
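The silhouette score can be computed as follows (a sketch;
cluster_labels are the K-Means assignments produced in the
next section's snippet):
from sklearn.metrics import silhouette_score

# Mean silhouette over all movies; values near 0 indicate overlapping clusters
print("Silhouette:", silhouette_score(X_feat, cluster_labels))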
For recommendation, the procedure is straightforward: given a
user’s favorite movie (e.g. in cluster 3), recommend other
movies from cluster 3 that the user hasn’t seen. This
unsupervised content-based approach requires no user ratings.
For evaluation, we would ideally use user test data (not
provided), but qualitatively the clusters make sense to human
intuition.
Sample Code Snippet
Below is an example of how movies are clustered using scikit-
learn’s KMeans:
from sklearn.cluster import KMeans

# n_init=10 restarts guard against a poor centroid initialization
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_feat)
After this, one could, for example, look up
cluster_labels[user_movie_index] for a movie the user liked and
recommend other movies that share the same label, as in the
sketch below.
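A hypothetical helper illustrating that lookup (the titles
list, indices, and the function name are assumptions for
illustration):
import numpy as np

def recommend(user_movie_index, titles, cluster_labels, n=5):
    # All movies sharing the liked movie's cluster, excluding itself
    same = np.where(cluster_labels == cluster_labels[user_movie_index])[0]
    same = same[same != user_movie_index]
    return [titles[i] for i in same[:n]]

# e.g. recommend(42, movies["title"].tolist(), cluster_labels)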
Conclusion
This project demonstrated an unsupervised approach to movie
recommendation using clustering of movie content features. By
vectorizing genres and keywords and applying K-Means, the
movie database was segmented into genre-themed clusters.
These clusters can drive a content-based recommender: a user
who likes a movie in one cluster can be shown other members
of that cluster. The
clusters were coherent (as validated by manual inspection), and
the model requires no labeled training data. Future work could
combine collaborative filtering or use word-embedding
techniques (Word2Vec on synopses) for richer representations.
Overall, unsupervised learning provided a sensible way to
structure the movie collection for recommendation purposes.
Citations
Auctioned_car/README.md at main ·
thoufiqz55/Auctioned_car · GitHub
https://github.com/thoufiqz55/Auctioned_car/blob/main/README.md
Medical_charges/README.md at main ·
thoufiqz55/Medical_charges · GitHub
https://github.com/thoufiqz55/Medical_charges/blob/main/README.md
GitHub - Scipio94/Personal-Medical-Cost-Data-Analysis:
Statistical analysis of healthcare dataset.
https://github.com/Scipio94/Personal-Medical-Cost-Data-Analysis
Introduction to clustering-based customer segmentation | by
Kaixin Wang | Data Science at Microsoft | Medium
https://medium.com/data-science-at-microsoft/introduction-to-clustering-based-customer-segmentation-2fac61e80100
The Movies Dataset - Kaggle
https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data?select=movies_metadata.csv