INTERNATIONAL UNIVERSITY
Project
Toyota Corolla
Châu Khắc Đình Phong – ITDSIU20076
Nguyễn Ảnh Đô – ITDSIU20059
CONTENTS
1. Introduction
2. Tool Used
INTRODUCTION
In this topic, we use a data set called Toyota Corolla, which contains information on the sale of used cars in Asia.
The data set records a variety of characteristics for each car: the price the car was sold for (in euros), the model, the manufacturing month and year, the fuel type, the age, the color, whether the transmission is automatic, and the kilometers the car has been driven.
First, this report shows how we clean the dataset and detect suspicious records.
Second, we compute statistics for each value across every car model in the dataset.
Third, we analyze the data by checking the correlation between attributes, plotting the distributions, and calculating the variance, mode, and mean of each attribute.
Fourth, we predict the price with several different models and compare the predictions with the actual prices in the dataset.
Finally, we compare and evaluate the results of the models.
TOOL USED
R is a programming language and free software developed by Ross Ihaka and Robert Gentleman in 1993. R possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series, and statistical inference, to name a few.
Excel is a spreadsheet program from
Microsoft and a component of its
Office product group for business
applications. Microsoft Excel enables
users to format, organize and
calculate data in a spreadsheet.
PROJECT
1. Visualize, Clean and Transform Data / Excel
- There are more than 60 cars in the price ranges 7800$-7995$ and 10900$-11290$
- More than 100 cars in the price range 8800$-8995$
- More than 80 cars in the price range from 9900$
- At the remaining prices, the count drops back to about 30 cars. There is no car in the range of 32000.
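The counts above come from an Excel chart. As a rough illustration of the same idea, the following Python sketch groups prices into fixed-width bins and counts the cars in each bin. The function name `bin_prices`, the bin width, and the sample prices are assumptions made for this example and are not taken from the data set.

```python
from collections import Counter


def bin_prices(prices, width):
    """Count how many prices fall into each bin [k*width, (k+1)*width)."""
    counts = Counter((p // width) * width for p in prices)
    # Sort by the lower edge of each bin so the output reads like a histogram.
    return dict(sorted(counts.items()))


# Hypothetical prices for illustration only (not rows from the real data set).
sample_prices = [7800, 7950, 8800, 8950, 8995, 9900, 10900, 11290]
price_counts = bin_prices(sample_prices, width=1000)

for lower_edge, n_cars in price_counts.items():
    print(f"{lower_edge}-{lower_edge + 999}: {n_cars} car(s)")
```

A narrower `width` gives a more detailed picture of where prices cluster, at the cost of noisier counts.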
The chart shows the number of cars by age, from 1 to 80 months.
The chart shows the sum of kilometers for each Toyota car type.
According to the chart:
- The car with the most KM is TOYOTA Corolla 1.6 16V HATCHB LINEA TERRA 2/3-Doors: 7704759 km
- The cars with the least KM (only 1 km) are:
+ TOYOTA Corolla 1.6-16v VVT-i Linea Terra Comfort AIRCO NIEUW 5DRS 4/5-Doors.
+ TOYOTA Corolla VERSO 2.0 D4D SOL (7) BNS MPV.
+ TOYOTA Corolla 1.4-16v VVT-i Linea Terra Comfort NIEUW AIRCO 4/5-Doors.
+ TOYOTA Corolla 1.6-16v VVT-i Linea Terra Comfort NIEUW AIRCO 5drs 4/5-Doors.
+ TOYOTA Corolla 2.0 d HB Diesel 2/3-Doors.
+ TOYOTA Corolla 1.4-16v VVT-i Linea Terra Comfort NIEUW AIRCO 4/5-Doors.
According to the chart, more than 1264
cars use Petrol fuel.
Over 155 cars use Diesel fuel.
Only 17 cars use CNG.
According to the chart, Grey is the most popular color, with more than 300 cars.
Next are Red and Blue, with more than 280 cars.
Beige, Violet, and Yellow are the least popular colors.
According to the chart, most of the Toyota cars have Automatic: 0 (no automatic transmission).
Fewer than 200 cars have Automatic: 1.
Checking for missing values by using R
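The report does this check in R. As a language-neutral illustration, the Python sketch below counts missing entries per column. The helper `count_missing`, the set of values treated as missing, and the sample rows are assumptions made for this example; the column name `Fuel_Type` is inferred from the `Fuel_Type_Diesel` and `Fuel_Type_Petrol` attributes mentioned later.

```python
def count_missing(rows, missing_markers=("", "NA", None)):
    """Return a dict mapping each column name to its number of missing values."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value in missing_markers:
                counts[column] += 1
    return counts


# Hypothetical rows for illustration only (not taken from the real data set).
sample_rows = [
    {"Price": 13500, "KM": 46986, "Fuel_Type": "Diesel"},
    {"Price": 13750, "KM": None, "Fuel_Type": "Diesel"},
    {"Price": "NA", "KM": 41711, "Fuel_Type": ""},
]
missing = count_missing(sample_rows)
print(missing)
```

A column whose count is zero needs no imputation; any other column has to be filled in or its rows dropped before modeling.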
Transform the values to
ordinal codes by using Excel
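The transformation itself was done in Excel. A minimal Python sketch of the same idea, assuming each distinct category is simply given an integer code in alphabetical order, could look like this. The function name `ordinal_encode` and the ordering rule are assumptions; the report does not state which codes were assigned.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# The fuel types named in the report, in a hypothetical order of appearance.
fuel_types = ["Petrol", "Diesel", "CNG", "Petrol"]
codes, mapping = ordinal_encode(fuel_types)

print(mapping)
print(codes)
```

Integer codes imply an order between categories, so for attributes without a natural order (such as color) the chosen codes are arbitrary.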
Processed data for analysis
2. Analyze the data
Use R to plot the
distribution for
each attribute
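Alongside the plots, the introduction also lists mean, variance, mode, and correlation as part of the analysis. The report computes these in R; the Python sketch below shows the same quantities using only the standard library. The `pearson` helper and all sample values are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, mode, variance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


# Hypothetical values for illustration only (not taken from the real data set).
age_months = [12, 24, 36, 48, 60]
price = [20000, 17000, 14000, 11000, 8000]
doors = [3, 5, 5, 3, 5]

print("mean age:", mean(age_months))
print("sample variance of age:", variance(age_months))
print("mode of doors:", mode(doors))
print("correlation(age, price):", pearson(age_months, price))
```

A correlation close to -1 means price falls almost linearly as age rises, which is the pattern the report later describes for Age and KM.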
Predict Price with 3 Models:
Regression Analysis, Random Forest, Gradient Boosting
Before using R to
predict, we split the data.
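The report does not state the split ratio or how the split was made. As a sketch under those missing details, the Python example below shuffles the rows with a fixed seed and holds out a fraction for testing. The 30% test fraction, the seed, and the function name `train_test_split` are all assumptions made for this example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the real records.
all_rows = list(range(10))
train_rows, test_rows = train_test_split(all_rows)

print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fitting on the training rows and measuring RMSE on the held-out rows gives an estimate of how the model behaves on cars it has not seen.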
Result
CASE 1
Regression Analysis
Plot Regression
Analysis Model
We then use the fitted model to predict the prices in the data set.
The Regression Analysis Model gives an RMSE of 1567.824.
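RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted prices, so it is expressed in the same unit as the price. The Python sketch below shows the calculation; the report computes it in R, and the sample prices here are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices for illustration only (not taken from the real data set).
actual_prices = [10000, 12000, 9000]
predicted_prices = [11000, 11000, 9000]
print(round(rmse(actual_prices, predicted_prices), 2))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.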
Plot predicted
price compared
to actual price
Result
CASE 2
Random Forest
Feature importance
Random forest data
We again use the model to predict the price.
The Random Forest Model gives an RMSE of 2101.017.
Plot predicted
price compared
to actual price
CASE 3
Gradient Boosting
Result
Find the
important
variable
We use the model to predict the price of each car.
The Gradient Boosting Model gives an RMSE of 1329.32.
Plot predicted
price compared
to actual price
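To illustrate the idea behind gradient boosting, the Python sketch below builds a toy version for a single predictor: it starts from the mean price and repeatedly fits a one-split tree (a "stump") to the current residuals, adding a small fraction of each stump to the prediction. This is not the implementation used in the report, which relies on R; the function names, the number of rounds, the learning rate, and the sample data are all assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals."""
    best = None
    # Candidate thresholds: every distinct x except the largest, so that
    # both sides of the split are non-empty (needs at least 2 distinct xs).
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared error; returns a prediction function."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def sum_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical ages (months) and prices for illustration only.
ages = [10, 20, 30, 40, 50, 60]
prices = [20000, 18000, 15000, 13000, 11000, 9000]

boosted = gradient_boost(ages, prices)
mean_price = sum(prices) / len(prices)
baseline = lambda x: mean_price

print("baseline SSE:", sum_squared_error(baseline, ages, prices))
print("boosted SSE:", sum_squared_error(boosted, ages, prices))
```

More rounds or a larger learning rate fit the training data more closely, which is one of the settings that tuning would adjust.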
Conclusion
According to the 3 RMSE values:
Gradient Boosting: RMSE_gbm: 1329.32
Linear Regression: RMSE_lr: 1567.824
Random Forest: RMSE_rf: 2101.017
The Gradient Boosting Model is better than the others at predicting the price, because the lower the RMSE, the better a given model can “fit” a dataset. The variables most useful for predicting the price are Age_08_04, Mfg_year, and KM. The model predicts with an RMSE of 1341.965. To improve the RMSE, the model can be further tuned.
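The comparison of the three reported RMSE values can be expressed in a few lines of Python. The values are the ones listed in this conclusion; the variable names are chosen for this example.

```python
# RMSE values as reported in the conclusion.
rmse_by_model = {
    "Gradient Boosting": 1329.32,
    "Linear Regression": 1567.824,
    "Random Forest": 2101.017,
}

# The model with the lowest RMSE has the smallest typical prediction error.
best_model = min(rmse_by_model, key=rmse_by_model.get)

for name, value in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")
print("Lowest RMSE:", best_model)
```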
Analyze whether the collected attributes have a positive or a negative effect on the target variable.
According to the Linear Regression we calculated:
- Based on the calculated data table, the attributes with positive effects are Fuel_Type_Diesel, Fuel_Type_Petrol and Automatic. As we would anticipate, a car's price increases if it has an automatic transmission. More specifically, Diesel makes the price of the car higher than Petrol does.
- Next, Age and KM have negative coefficients, showing that they negatively affect the price of a car. That means the older the car, the lower the price.
- In the p-value column (Pr(>|t|)), 4 attributes have a p-value of less than 0.05, so we can confirm that these attributes are statistically significant. In this case, we find Age_08_04, KM, Automatic statistically significant.
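The sign of a regression coefficient is what decides whether an attribute counts as a positive or a negative effect. The report fits a multiple regression in R; the Python sketch below uses a single predictor and ordinary least squares only to show how a negative slope arises. The function name `fit_line` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages (months) and prices for illustration only.
age_in_months = [12, 24, 36, 48, 60]
sale_price = [20000, 17000, 14000, 11000, 8000]

intercept, slope = fit_line(age_in_months, sale_price)
print("intercept:", intercept)
print("slope per month of age:", slope)
```

A negative slope means each extra month of age lowers the predicted price, matching the direction the report finds for Age and KM.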
Reference
https://www.toyota.com/corolla/
https://en.wikipedia.org/wiki/Toyota_Corolla
THANK YOU FOR FOLLOWING