Machine Learning-Based Models for Accurate Car Pri
Highlights in Business, Economics and Management FTMM 2024
Volume 40 (2024)
1. Introduction
A middle-class family that wants to buy a car has two choices: a new car or a used one. If the
car serves only as an everyday means of transportation, there is little practical difference between
the two. A new car is relatively expensive and its value depreciates after purchase, so a used car is
a reliable option. The surge in new car prices caused by the pandemic has led to an increase in used
car sales in Singapore, which are expected to grow at a compound annual growth rate of 11.1% from
2020 to 2025 [1].
Machine learning models can capture the complex relationships between various features and car
prices, handle large datasets efficiently, and achieve higher prediction accuracy with their advanced
algorithms, thus supporting better decisions. The effectiveness of machine learning algorithms has
been demonstrated in many domains, including medical diagnosis, telecommunication and sentiment
analysis [2-5]. It is therefore valuable to use machine learning models to predict used car prices:
they can provide an estimated value for cars in different conditions, helping buyers and sellers gain
a clear understanding of used car prices [6].
Previous research used traditional methods, such as single-variable linear regression and
multiple-variable regression, to predict automobile prices. However, these traditional models have
predictive limitations. For one thing, they can only capture linear relationships between the
independent variables and the dependent variable (i.e., car price), whereas in reality many
relationships are non-linear. Moreover, outliers or extreme values in the data have a significant
impact on the fitted model and thus reduce the accuracy of the prediction.
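The sensitivity of a least-squares line to a single extreme value can be illustrated with a minimal sketch. The numbers below are hypothetical toy data, not values from the dataset used in this paper, and `np.polyfit` is used only as a convenient ordinary least squares fit.

```python
import numpy as np

# Hypothetical toy data: price falls linearly with mileage (arbitrary units).
mileage = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
price = 40.0 - 0.5 * mileage  # exact linear relationship with slope -0.5

# Ordinary least squares fit of a straight line (degree-1 polynomial).
# np.polyfit returns coefficients from the highest degree down: slope, intercept.
slope_clean, intercept_clean = np.polyfit(mileage, price, deg=1)

# Corrupt a single observation with an extreme value and refit.
price_outlier = price.copy()
price_outlier[-1] = 90.0  # one mispriced high-mileage car
slope_outlier, intercept_outlier = np.polyfit(mileage, price_outlier, deg=1)

print(f"slope without outlier:  {slope_clean:.3f}")
print(f"slope with one outlier: {slope_outlier:.3f}")
```

In this toy case one corrupted point out of six is enough to turn the fitted slope from negative to positive, which is the kind of distortion the paragraph above describes.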
With the further development of machine learning in recent years, a variety of relatively
advanced algorithms have emerged, including random forests and neural networks. For example, in
stock market price prediction, advanced models such as fuzzy logic and neural networks are more
efficient and accurate, whereas using traditional methods to evaluate the return on investment of the
stock market is a cumbersome, inefficient and expensive process [7]. In automobile price prediction,
applying three advanced machine learning techniques (Support Vector Machine, Artificial Neural
Network and Random Forest) achieved an accuracy as high as 87.38% [8]. The excellent results
achieved in these fields demonstrate the powerful predictive ability of advanced machine learning
methods.
This paper uses three regression models, Linear Regression, Decision Tree Regressor and Random
Forest Regressor, to predict used car prices. The used car dataset comes from Kaggle and contains
2,065 records from four popular brands: BMW, Tesla, Volkswagen and Acura. The dataset includes
three important variables that affect the price of a used car: the name of the car, the distance the
car has been driven, and the number of years the car has been in use. The research goal of this paper
is to use the results of the three models to judge which model is best for predicting used car prices
and which factor has the greatest impact on the price of used cars.
2. Methods
2.1. Dataset Preparation
The dataset used in this study for predicting car prices was web scraped from carvana.com/cars and
obtained from Kaggle [9]. The original dataset has a total of 22,000 records covering 36 brands and
different models of each brand. In this study, the target variable is the car price. Three features
affect it: the name of the car (the name column includes the make and model of the car), the year of
the car, and the mileage the car has been driven.
The collected dataset may not be in the most suitable format to be handed to a machine learning
model, so data preprocessing is necessary to ensure the efficiency and accuracy of the model. In the
data preprocessing step, this study selected four popular car brands from the original dataset to
form a new dataset: Volkswagen, Acura, BMW and Tesla. The new dataset contains a total of 2,064
records.
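The brand selection step could look roughly like the sketch below. The column names (`Name`, `Year`, `Miles`, `Price`), the sample rows, and the rule that the make is the first word of the name are all assumptions made for illustration; they are not confirmed by the source.

```python
import pandas as pd

# Hypothetical miniature of the scraped table; real column names may differ.
cars = pd.DataFrame(
    {
        "Name": ["BMW 3 Series", "Tesla Model 3", "Ford Focus",
                 "Acura MDX", "Volkswagen Jetta", "Toyota Camry"],
        "Year": [2018, 2020, 2016, 2019, 2017, 2015],
        "Miles": [42000, 18000, 76000, 30500, 58000, 91000],
        "Price": [27990, 41590, 13990, 33990, 17590, 15990],
    }
)

SELECTED_BRANDS = ["Volkswagen", "Acura", "BMW", "Tesla"]

# Assumption: the make is the first word of the name string.
cars["Brand"] = cars["Name"].str.split().str[0]

# Keep only the rows whose make is one of the four selected brands.
subset = cars[cars["Brand"].isin(SELECTED_BRANDS)].reset_index(drop=True)
print(subset)
```

Splitting on the first word would fail for makes with multi-word names, so a real pipeline would probably need a more careful mapping from name to brand.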
In the subsequent steps, some outliers were detected in the year column with the help of tqdm from
tqdm.notebook [10]. After dropping these outliers, this study explored how the mean car price
relates to the name (brand) of the car, the year of the car, and the mileage.
The charts show three trends: 1. Among the four car brands, Tesla has the highest mean price,
followed by BMW, Acura and Volkswagen, as shown in Fig. 1. 2. All else being equal, the mean price
of a used car tends to decrease as its mileage increases, as shown in Fig. 2 (a) and (b). 3. All else
being equal, the mean price of a car tends to increase with the year of the car, as shown in Fig. 2
(c) and (d).
Figure 2. (a) Mean price (US$) by mileage group; (b) price (US$) by mileage; (c) mean price (US$)
by year; (d) price (US$) by year of car. (Picture credit: Original)
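Group means of the kind plotted in the figures can be computed with a few pandas calls. The sketch below uses a small hypothetical table (column names, values and mileage bin edges are assumptions chosen for illustration), so the numbers it prints are not the paper's results.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
cars = pd.DataFrame(
    {
        "Brand": ["Tesla", "Tesla", "BMW", "BMW",
                  "Acura", "Acura", "Volkswagen", "Volkswagen"],
        "Year": [2020, 2018, 2019, 2016, 2019, 2016, 2018, 2015],
        "Miles": [15000, 48000, 30000, 82000, 26000, 70000, 41000, 95000],
        "Price": [45000, 36000, 33000, 21000, 30000, 19000, 20000, 12000],
    }
)

# Mean price per brand, highest first.
mean_by_brand = cars.groupby("Brand")["Price"].mean().sort_values(ascending=False)

# Mean price per mileage group; the bin edges are an assumption.
cars["MileageGroup"] = pd.cut(
    cars["Miles"],
    bins=[0, 25000, 50000, 75000, 100000],
    labels=["0-25k", "25k-50k", "50k-75k", "75k-100k"],
)
mean_by_mileage = cars.groupby("MileageGroup", observed=True)["Price"].mean()

# Mean price per model year.
mean_by_year = cars.groupby("Year")["Price"].mean()

print(mean_by_brand, mean_by_mileage, mean_by_year, sep="\n\n")
```

Each resulting Series can be passed directly to a bar or line plot to produce charts similar in form to the panels described above.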
This dataset contains many categorical values, e.g., brand and model, each of which may take many
different values. To enable machine learning models to use these data, this paper represents them in
numerical form with LabelEncoder [11] from scikit-learn [12]; label encoding maps each category to
an integer. For the price prediction itself, this study begins by dividing the dataset into a
training set and a test set with a random state of 45: the test set is 25% of the dataset and the
training set is 75%. A heat map based on Pearson's correlation coefficient visualizes the
correlation between the features, as shown in Fig. 3 [13].
Figure 3. Heat map between every pair of features (Picture credit: Original)
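The encoding, splitting and correlation steps described above can be sketched as follows. `LabelEncoder`, `train_test_split` and `DataFrame.corr` are real scikit-learn and pandas calls, and the split parameters (25% test set, random state 45) follow the text; the column names and sample rows are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; column names and values are illustrative only.
cars = pd.DataFrame(
    {
        "Name": ["BMW 3 Series", "Tesla Model 3", "Acura MDX", "Volkswagen Jetta",
                 "BMW X5", "Tesla Model S", "Acura TLX", "Volkswagen Tiguan"],
        "Year": [2018, 2020, 2019, 2017, 2016, 2019, 2018, 2015],
        "Miles": [42000, 18000, 30500, 58000, 80000, 27000, 39000, 91000],
        "Price": [27990, 41590, 33990, 17590, 29990, 62990, 26590, 15990],
    }
)

# Map each distinct name string to an integer code (sorted order).
encoder = LabelEncoder()
cars["NameCode"] = encoder.fit_transform(cars["Name"])

features = cars[["NameCode", "Year", "Miles"]]
target = cars["Price"]

# 75% training set, 25% test set, random state 45 as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=45
)

# Pearson correlation between every pair of numeric columns; this matrix
# is what a heat-map plotting function would take as input.
corr = cars[["NameCode", "Year", "Miles", "Price"]].corr(method="pearson")
print(corr.round(2))
```

Integer codes imply an ordering between names that has no real meaning, which tends to matter more for linear models than for tree-based ones.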
[19]. This reduces the overfitting of individual decision trees and the effect of a single extreme
value on the prediction, and it generalizes better to unseen data. Therefore, the random forest has
the best performance.
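A comparison of the three model families can be set up along the lines of the sketch below. It runs on synthetic data generated inside the script (the price formula, noise level and feature ranges are invented for illustration), so the printed scores say nothing about the results reported in this paper; only the 75/25 split with random state 45 is taken from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): price decays non-linearly with age,
# falls linearly with mileage, and carries Gaussian noise.
rng = np.random.default_rng(0)
n_samples = 400
age = rng.uniform(0, 12, n_samples)
miles = rng.uniform(5_000, 120_000, n_samples)
price = 45_000 * np.exp(-0.12 * age) - 0.08 * miles + rng.normal(0, 1_500, n_samples)

X = np.column_stack([age, miles])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=45
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=45),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=45),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

The random forest averages the predictions of many trees trained on bootstrap samples, which is the mechanism behind the reduced overfitting discussed above.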
This study also has some limitations that need to be addressed. 1. Only three machine learning
methods are used: decision tree regressor, linear regression and random forest regressor. Although
some of these methods have good prediction performance, comparing more methods would make the
results more diverse and objective and might reveal better methods. 2. The automobile price dataset
comes from a single source, so the data may not be sufficiently diverse and may not cover all
automobile models and makes on the market. 3. Only three of the more important factors affecting
automobile prices were analyzed: the name of the automobile, its age, and its mileage. Other factors
that affect price include engine capacity, fuel type and whether the vehicle is a hybrid. 4. The
study might be more objective if cars were divided into gasoline, diesel, electric and hybrid
vehicles and studied separately.
Future research therefore needs to do four things: 1. Use more expressive machine learning methods
for analysis and comparison, such as Support Vector Regression, Neural Networks, ElasticNet
Regression and K-Nearest Neighbors (KNN). 2. Collect automobile price datasets from multiple
sources, so that the data cover as many of the models and makes available on the market as
possible. 3. Expand the set of factors considered by including variables such as engine capacity,
fuel type and hybrid status. 4. Divide the data by vehicle type and analyze each type individually
to make the results more accurate and objective.
4. Conclusion
This study addresses the inaccuracy of car price prediction on used car platforms and proposes
using machine learning methods to increase the prediction accuracy of these platforms, so as to
provide valuable references for sellers and buyers. This paper uses three machine learning methods,
Linear Regression, Decision Tree Regressor and Random Forest Regressor, to predict car prices.
Comparing the methods consistently shows that Random Forest Regressor performs best among the
three and is therefore the most suitable of them for providing car price predictions on used car
platforms. In the future, this research will use more car categorization criteria and better machine
learning methods to predict car prices in multiple ways, in an effort to provide meaningful
information.
References
[1] Singapore Used Car Market Outlook Report 2022-2025: Increasing Used Cars Demand Due to the
Pandemic Contributes to Increase in Used Cars Sales During the Economic Crisis -
ResearchAndMarkets.com. Businesswire, 2022, June 6. Available at:
https://www.businesswire.com/news/home/20220606005609/en/.
[2] Qiu Y, Yang Y, Lin Z, Chen P, Luo Y, Huang W. Improved denoising autoencoder for maritime image
denoising and semantic segmentation of USV. China Communications. 2020 Mar;17 (3): 46-57.
[3] Wu Y, Jin Z, Shi C, Liang P, Zhan T. Research on the Application of Deep Learning-based BERT Model
in Sentiment Analysis. arXiv preprint arXiv:2403.08217. 2024 Mar 13.
[4] Wang H, Zhou Y, Perez E, Roemer F. Jointly Learning Selection Matrices for Transmitters, Receivers
and Fourier Coefficients in Multichannel Imaging. arXiv preprint arXiv:2402.19023. 2024 Feb 29.
[5] Li M, He J, Jiang G, Wang H. DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM with Joint
Semantic Encoding. arXiv preprint arXiv:2401.01545. 2024 Jan 3.
[6] Gajera P, Gondaliya A, Kavathiya J. Old car price prediction with machine learning. Int. Res. J. Mod.
Eng. Technol. Sci, 2021, 3: 284-290.
[7] Miah MBA, Hossain MZ, Hossain MA, Islam MM. Price prediction of stock market using hybrid model
of artificial intelligence. International Journal of Computer Applications, 2015, 111 (3).
[8] Gegic E, Isakovic B, Keco D, Masetic Z, Kevric J. Car price prediction using machine learning techniques.
TEM Journal, 2019, 8 (1): 113.
[9] Kaggle Carvana - Predict Car Prices. Available at:
https://www.kaggle.com/datasets/ravishah1/carvana-predict-car-prices.
[10] Costa-Luis C da. tqdm.notebook - tqdm documentation. Available at:
https://tqdm.github.io/docs/notebook/.
[11] scikit-learn developers. sklearn.preprocessing.LabelEncoder - scikit-learn 0.22.1 documentation. 2019.
Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html.
[12] Scikit-learn. scikit-learn: machine learning in Python. 2019. Available at: https://scikit-learn.org/stable/.
[13] Benesty J, Chen J, Huang Y, Cohen I. Noise reduction in speech processing. Springer Science & Business
Media, 2009, Vol. 2.
[14] Anguita D, Ghelardoni L, Ghio A, Oneto L, Ridella S. The 'K' in K-fold Cross Validation. ESANN, 2012,
Vol. 102, pp. 441-446.
[15] Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis. John Wiley & Sons,
2021.
[16] Quinlan JR. Induction of decision trees. Machine learning, 1986, 1: 81-106.
[17] Liaw A, Wiener M. Classification and regression by randomForest. R news, 2002, 2 (3): 18-22.
[18] Shanmugasundar G, Vanitha M, Čep R, Kumar V, Kalita K, Ramachandran M. A comparative study of
linear, random forest and adaboost regressions for modeling non-traditional machining. Processes, 2021,
9(11): 2015.
[19] Oshiro TM, Perez PS, Baranauskas JA. How many trees in a random forest? In Machine Learning and
Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July
13-20, 2012. Proceedings 8, Springer Berlin Heidelberg, 2012, pp. 154-168.