HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION COMMUNICATION TECHNOLOGY
CAPSTONE PROJECT REPORT:
House Prices Prediction
Supervised by:
Than Quang Khoat
Presented by:
Lang Hong Nguyet Anh - 20235885
Tran Tien Dung – 20235921
Ha Duc Duong – 20235923
Le Minh Tuan – 20236007
Tran Sy Nguyen – 20235985
Hanoi - Vietnam
2025
Contents
1 Introduction
  1.1 Problem’s Background
  1.2 About the dataset
2 Preprocessing
  2.1 Data Validation and Missing Value Preprocessing
  2.2 Skewness and Outliers
3 Methodology: Machine Learning-based Approaches
  3.1 Linear Regression
  3.2 Random Forest
4 Experimental results
  4.1 Linear Regression
  4.2 Random Forest
5 References
1 Introduction
1.1 Problem’s Background
When people imagine their dream home, they often think of aesthetic or surface-level features, such as a beautiful front yard, a large kitchen, or the number of bedrooms. Rarely do they mention the height of a basement ceiling or how close the property is to a railroad. In reality, however, home prices are influenced by a wide range of factors, some obvious and some hidden.
Accurately predicting house prices is a complex and valuable task in the field of data science,
particularly in real estate analytics. Understanding what drives the price of a home can benefit
buyers, sellers, realtors, and urban planners. The goal is not only to determine the most important features that affect house prices but also to build models that generalize well to unseen properties.
This problem offers an excellent opportunity to explore both the art and science of price prediction, through creative feature engineering and the application of advanced machine learning techniques. In this project, we aim to develop models that predict house prices using Linear Regression and Random Forests.
1.2 About the dataset
To support this task, we use the Ames Housing dataset, a modern, rich, and comprehensive
collection of housing data compiled by Dean De Cock. It serves as a well-structured alternative
to the classic Boston Housing dataset, specifically designed for education and experimentation
in predictive modeling.
The dataset contains 79 explanatory variables that describe nearly every aspect of residential
homes in Ames, Iowa. These include:
Features Explanation
SalePrice The property’s sale price in dollars (target variable)
MSSubClass The building class
MSZoning The general zoning classification
LotFrontage Linear feet of street connected to property
LotArea Lot size in square feet
Street Type of road access
Alley Type of alley access
LotShape General shape of property
LandContour Flatness of the property
Utilities Type of utilities available
LotConfig Lot configuration
LandSlope Slope of property
Neighborhood Physical locations within Ames city limits
Condition1 Proximity to main road or railroad
Condition2 Proximity to main road or railroad (if a second is present)
BldgType Type of dwelling
HouseStyle Style of dwelling
OverallQual Overall material and finish quality
OverallCond Overall condition rating
YearBuilt Original construction date
YearRemodAdd Remodel date
RoofStyle Type of roof
RoofMatl Roof material
Exterior1st Exterior covering on house
Exterior2nd Exterior covering on house (if more than one material)
MasVnrType Masonry veneer type
MasVnrArea Masonry veneer area in square feet
ExterQual Exterior material quality
ExterCond Present condition of the material on the exterior
Foundation Type of foundation
BsmtQual Height of the basement
BsmtCond General condition of the basement
BsmtExposure Walkout or garden level basement walls
BsmtFinType1 Quality of basement finished area
BsmtFinSF1 Type 1 finished square feet
BsmtFinType2 Quality of second finished area (if present)
BsmtFinSF2 Type 2 finished square feet
BsmtUnfSF Unfinished square feet of basement area
TotalBsmtSF Total square feet of basement area
Heating Type of heating
HeatingQC Heating quality and condition
CentralAir Central air conditioning
Electrical Electrical system
1stFlrSF First Floor square feet
2ndFlrSF Second floor square feet
LowQualFinSF Low quality finished square feet (all floors)
GrLivArea Above grade (ground) living area square feet
BsmtFullBath Basement full bathrooms
BsmtHalfBath Basement half bathrooms
FullBath Full bathrooms above grade
HalfBath Half baths above grade
Bedroom Number of bedrooms above basement level
Kitchen Number of kitchens
KitchenQual Kitchen quality
TotRmsAbvGrd Total rooms above grade (does not include bathrooms)
Functional Home functionality rating
Fireplaces Number of fireplaces
FireplaceQu Fireplace quality
GarageType Garage location
GarageYrBlt Year garage was built
GarageFinish Interior finish of the garage
GarageCars Size of garage in car capacity
GarageArea Size of garage in square feet
GarageQual Garage quality
GarageCond Garage condition
PavedDrive Paved driveway
WoodDeckSF Wood deck area in square feet
OpenPorchSF Open porch area in square feet
EnclosedPorch Enclosed porch area in square feet
3SsnPorch Three season porch area in square feet
ScreenPorch Screen porch area in square feet
PoolArea Pool area in square feet
PoolQC Pool quality
Fence Fence quality
MiscFeature Miscellaneous feature not covered in other categories
MiscVal Value of miscellaneous feature, in dollars
MoSold Month Sold
YrSold Year Sold
SaleType Type of sale
SaleCondition Condition of sale
In this section, we focus on the general characteristics shared across the training and test sets, rather than examining each one individually, since they exhibit very similar patterns and trends. This allows for a more streamlined and cohesive analysis without repeating details for each dataset.
Column Types: The dataset contains a mix of numerical (both integers and floats) and cate-
gorical (object) features. Out of 81 columns, 35 are integer-type, 3 are float-type, and 43 are
categorical.
Missing Data: Several features exhibit a significant amount of missing data. For example:
• Alley, PoolQC, and MiscFeature have very high proportions of missing values, sug-
gesting either rare occurrence or lack of entry.
• Features such as GarageYrBlt, FireplaceQu, MasVnrType, and several basement-
related features also contain moderate levels of missing data.
Data Consistency: Most columns are well-populated, and data types are appropriate, although
float fields like LotFrontage and GarageYrBlt with missing values may require imputation.
Target Variable: SalePrice is the target variable for prediction. It is numeric and non-null,
making it suitable for regression analysis.
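The checks above can be reproduced with a short pandas inspection. The snippet below is a minimal sketch; the file name train.csv and the exact calls are illustrative assumptions rather than the code used in this project.

import pandas as pd

# Load the Ames Housing training data (file name is an assumption).
df = pd.read_csv("train.csv")

# Column types: counts of integer, float, and object (categorical) columns.
print(df.dtypes.value_counts())

# Missing data: fraction of missing values per feature, highest first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Target variable: SalePrice should be numeric with no missing values.
print(df["SalePrice"].describe())
print(df["SalePrice"].isna().sum())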
2 Preprocessing
2.1 Data Validation and Missing Value Preprocessing
The dataset was validated to identify illogical entries, such as negative values in numeric
columns, while missing values were handled as follows:
• Features with more than 80% missing values were removed from the dataset to avoid
introducing bias from excessive imputation.
• Features with less than or equal to 80% missing values were retained, with missing val-
ues filled using the median (for numerical features) or the mode (for categorical features).
Figure 1: Percentage of missing values per feature
As shown in Figure 1, the following features were removed due to their high percentage of missing values: Alley, PoolQC, Fence, MiscFeature. In addition, MasVnrType, FireplaceQu, LotFrontage, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond, BsmtExposure, BsmtFinType2, BsmtQual, BsmtCond, BsmtFinType1, MasVnrArea, and Electrical were imputed.
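The rules above can be expressed as a short preprocessing sketch. The 80% threshold and the median/mode imputation follow the text; the function name and implementation details are illustrative assumptions.

import pandas as pd

def handle_missing(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    # Drop features whose missing-value ratio exceeds the threshold (80%).
    missing_ratio = df.isna().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > threshold].index)

    # Impute the remaining gaps: median for numeric, mode for categorical features.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode()[0])
    return df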
2.2 Skewness and Outliers
For numerical features, we performed a skewness analysis to examine the symmetry of their
distributions. The results are presented in Table 2, where we observe that several features exhibit
a high degree of positive skewness (i.e., right-skewed distributions), indicating a long tail of
large values.
Table 2: Skewness of numerical features
Feature: Skewness   Feature: Skewness   Feature: Skewness
MSSubClass: 1.408 LotFrontage: 2.409 LotArea: 12.208
OverallQual: 0.217 OverallCond: 0.693 YearBuilt: -0.613
YearRemodAdd: -0.504 MasVnrArea: 2.678 BsmtFinSF1: 1.686
BsmtFinSF2: 4.255 BsmtUnfSF: 0.920 TotalBsmtSF: 1.524
1stFlrSF: 1.377 2ndFlrSF: 0.813 LowQualFinSF: 9.011
GrLivArea: 1.367 BsmtFullBath: 0.596 BsmtHalfBath: 4.103
FullBath: 0.037 HalfBath: 0.676 BedroomAbvGr: 0.212
KitchenAbvGr: 4.488 TotRmsAbvGrd: 0.676 Fireplaces: 0.650
GarageYrBlt: -0.678 GarageCars: -0.343 GarageArea: 0.180
WoodDeckSF: 1.541 OpenPorchSF: 2.364 EnclosedPorch: 3.090
3SsnPorch: 10.304 ScreenPorch: 4.122 PoolArea: 14.828
MiscVal: 24.477 MoSold: 0.212 YrSold: 0.096
The substantial right skew in these variables can adversely affect the performance of machine learning models that are sensitive to extreme values or that work best with roughly normally distributed inputs. To address this, we applied a logarithmic transformation to selected features with high skewness and strictly positive values. This transformation helps reduce the impact of outliers, compress the range of large values, and bring the distributions closer to normality.
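The transformation can be sketched as follows. Only the idea (log-transforming highly skewed, non-negative features) comes from the text; the skewness cutoff of 1.0, the use of log1p, and the function name are illustrative assumptions.

import numpy as np
import pandas as pd

def log_transform_skewed(df: pd.DataFrame, cutoff: float = 1.0) -> pd.DataFrame:
    # Measure skewness of the numeric columns only.
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    skew = df[numeric_cols].skew()

    # Apply log1p to strongly right-skewed features with non-negative values.
    for col in skew[skew > cutoff].index:
        if (df[col] >= 0).all():
            df[col] = np.log1p(df[col])
    return df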
3 Methodology: Machine Learning-based Approaches
We apply a range of Machine Learning algorithms in this section. For each algorithm, we set up a family of models, each with different hyper-parameters. We then compare these models using GridSearchCV with cross-validation from Scikit-learn to find the best-performing configuration, before comparing those "best" implementations to gain broader insight into how each algorithm interacts with the dataset.
3.1 Linear Regression
Linear regression is one of the simplest regression algorithms. Assuming a linear relationship
between the features and the target values, linear regression aims to fit a hyperplane that maps
the features to the target values.
The simplest form of the problem is to find the weights w and bias b that minimize the
following loss function:
J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - (x_i^T w + b) \right)^2
This optimization can be accomplished using the gradient descent method.
However, linear regression is often combined with regularization of the weights to control the trade-off between model complexity and generalization. In this project, we use three types of regularization: Ridge, LASSO, and Elastic Net.
• Ridge: adds a squared penalty term to the loss function
J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - (x_i^T w + b) \right)^2 + \frac{\alpha}{2} \|w\|_2^2
• LASSO: adds an absolute penalty term to the loss function
J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - (x_i^T w + b) \right)^2 + \alpha \|w\|_1
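• Elastic Net: combines both penalties. In the standard scikit-learn parameterization (shown here for reference),

J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - (x_i^T w + b) \right)^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1 - \rho)}{2} \|w\|_2^2

where \rho (the l1_ratio) controls the mix between the L1 and L2 terms.

The grid search over these regularized models can be sketched as follows; the grid values, the 5-fold cross-validation, and the helper name tune_linear_models are illustrative assumptions rather than the exact settings used in our experiments.

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV

def tune_linear_models(X_train, y_train):
    # Candidate models with illustrative hyperparameter grids.
    models = {
        "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}),
        "lasso": (Lasso(max_iter=10000), {"alpha": [0.0005, 0.001, 0.01, 0.1]}),
        "elastic_net": (ElasticNet(max_iter=10000),
                        {"alpha": [0.001, 0.01, 0.1], "l1_ratio": [0.2, 0.5, 0.8]}),
    }
    best = {}
    for name, (estimator, grid) in models.items():
        # 5-fold cross-validated grid search, scored by negated MAPE
        # (this scoring string requires scikit-learn >= 0.24).
        search = GridSearchCV(estimator, grid, cv=5,
                              scoring="neg_mean_absolute_percentage_error")
        search.fit(X_train, y_train)
        best[name] = (search.best_params_, -search.best_score_)
    return best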
3.2 Random Forest
3.2.1 How does a Decision Tree work for regression?
A decision tree follows a tree-like flowchart structure in which predictions result from a series of feature-based splits. It starts at a root node and ends with a decision made at a leaf. When a new data point is received, it is pushed through the tree from the root. At each internal node, the tree inspects the value of the chosen attribute and decides which child node the data point is passed to. The flow stops at a leaf, where the data point receives the "decision" made by the tree; for regression, this is typically the average target value of the training samples in that leaf.
The node split process identifies the optimal feature and threshold value to divide a node.
It computes the CART cost function for various features k and thresholds t_k, then selects the
feature and threshold that minimize the cost function for the split.
J(k, t_k) = \frac{m_{left}}{m} \, MSE_{left} + \frac{m_{right}}{m} \, MSE_{right},

where

MSE_{node} = \sum_{i \in node} \left( \hat{y}_{node} - y^{(i)} \right)^2, \qquad \hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}
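For concreteness, the cost of a single candidate split can be computed as in the sketch below, which follows the formula above; the function and variable names are illustrative.

import numpy as np

def split_cost(x: np.ndarray, y: np.ndarray, threshold: float) -> float:
    # CART cost J(k, t_k) of splitting feature values x at the given threshold.
    left, right = y[x <= threshold], y[x > threshold]
    m = len(y)

    def node_error(values: np.ndarray) -> float:
        # Squared deviations from the node's mean prediction (the MSE term above).
        return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

    # Weighted combination of the left and right node errors.
    return (len(left) / m) * node_error(left) + (len(right) / m) * node_error(right)

# The best split for one feature minimizes this cost over candidate thresholds, e.g.
# best_t = min(np.unique(x), key=lambda t: split_cost(x, y, t))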
3.2.2 Random Forest’s core
Ensemble Model:
Ensemble learning is a machine learning technique that enhances accuracy and resilience in
forecasting by merging predictions from multiple models. It aims to mitigate errors or biases
that may exist in individual models by leveraging the collective intelligence of the ensemble.
The underlying concept behind ensemble learning is to combine the outputs of diverse models to create a more precise prediction. By considering multiple perspectives and utilizing the strengths of different models, ensemble learning improves the overall performance of the learning system. This approach not only enhances accuracy but also provides resilience against uncertainties in the data. By effectively merging predictions from multiple models, ensemble learning has proven to be a powerful tool in various domains, offering more robust and reliable forecasts.
Bagging:
Bagging, also known as bootstrap aggregation, is an ensemble learning technique that combines the benefits of bootstrapping and aggregation to yield a stable model and improve the prediction performance of a machine learning model.
In detail, each model is trained on a random subset of the data sampled with replacement,
meaning that the individual data points can be chosen more than once. This random subset is
known as a bootstrap sample. By training models on different bootstraps, bagging reduces the
variance of the individual models. It also avoids overfitting by exposing the constituent models
to different parts of the dataset.
The predictions from all the sampled models are then combined through simple averaging to
make the overall prediction. This way, the aggregated model incorporates the strengths of the
individual ones and cancels out their errors.
Random Forest’s Core Ideas:
The random forest algorithm is an extension of the bagging method as it uses both bagging
and feature randomness to create an uncorrelated forest of decision trees.
Feature randomness means that each split considers only a random subset of the features, which keeps the correlation among the decision trees low. This is a key difference between individual decision trees and random forests: whereas a decision tree considers every feature when searching for the best split, a random forest restricts each split to a random subset of those features.
3.2.3 Implementation
We optimize the model's hyperparameters using GridSearchCV over the following parameters (a tuning sketch follows the list below):
• n_estimators: The number of trees in the forest
• max_features: The number of features to consider when looking for the best split
• max_depth: The maximum depth of the tree
• min_samples_split: The minimum number of samples required to split an internal node
• min_samples_leaf : The minimum number of samples required to be at a leaf node. A split
point at any depth will only be considered if it leaves at least min_samples_leaf training
samples in each of the left and right branches.
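A tuning sketch consistent with this list is given below; the specific grid values, the 5-fold cross-validation, and the random_state are illustrative assumptions rather than the exact values used in our experiments.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X_train, y_train):
    # Illustrative grid over the hyperparameters listed above.
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_features": ["sqrt", 0.5, 1.0],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    }
    search = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid,
        cv=5,
        scoring="neg_mean_absolute_percentage_error",  # minimize MAPE
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_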
4 Experimental results
4.1 Linear Regression
To evaluate the performance of the Linear Regression model, we conducted experiments using a subset of selected features, including OverallQual, GrLivArea, TotalBsmtSF, 1stFlrSF, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, WoodDeckSF, 2ndFlrSF, OpenPorchSF, and LotArea.
The input features were standardized using StandardScaler, and the target variable was reshaped into a column vector. We implemented Linear Regression from scratch using Gradient Descent with a learning rate of 0.001 and trained the model over 500 and 1000 iterations. To evaluate the model's accuracy on unseen data, we computed the mean absolute percentage error (MAPE) on the validation set.
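A minimal from-scratch sketch of this procedure is shown below. The learning rate of 0.001 and the cost function follow the description above; the initialization, helper names, and MAPE helper are assumptions.

import numpy as np

def train_linear_regression(X, y, lr=0.001, n_iters=500):
    # Gradient descent on J(w, b) = 1/(2N) * sum_i (y_i - (x_i^T w + b))^2.
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    costs = []
    for _ in range(n_iters):
        error = X @ w + b - y          # prediction error for all samples
        w -= lr * (X.T @ error) / N    # gradient step for the weights
        b -= lr * error.mean()         # gradient step for the bias
        costs.append((error ** 2).sum() / (2 * N))
    return w, b, costs

def mape(y_true, y_pred):
    # Mean absolute percentage error used for validation.
    return np.mean(np.abs((y_true - y_pred) / y_true))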
Figure 2: Cost vs. iterations during training (500 iterations)
Figure 3: Cost vs. iterations during training (1000 iterations)
Figures 2 and 3 both illustrate a consistent decline in the cost function, suggesting that the model is learning effectively during training. However, when evaluated on the test set, the model trained with 1000 iterations yields a significantly higher error of 6.36, compared to only 0.276 for the model trained with 500 iterations. This sharp increase in test error, despite improved performance on the training and validation sets, suggests that the model may have over-fitted to the training data when trained for too many iterations.
Figure 4: Actual vs. Predicted house prices (Validation set)
As shown in Figure 4, the predicted values correlate well with the actual house prices, although there is some deviation, particularly in the higher price ranges.
Overall, the Linear Regression model performed reasonably well given its simplicity, and it
serves as a strong baseline for comparison with more complex models.
4.2 Random Forest
To evaluate the performance of the Random Forest model, we conducted experiments using the same subset of selected features: OverallQual, GrLivArea, TotalBsmtSF, 1stFlrSF, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, WoodDeckSF, 2ndFlrSF, OpenPorchSF, and LotArea.
After reshaping the training and validation sets (X_train, X_val, y_train, and y_val), we evaluated the model's accuracy on unseen data by computing the mean absolute percentage error (MAPE) on the validation set.
To optimize the performance of the model, we performed hyperparameter tuning using GridSearchCV to identify the best combination of parameters, including n_estimators, max_depth, and max_features, aiming to minimize the MAPE.
Figure 5: Actual vs. Predicted Prices
As shown in Figure 5, the predicted values closely follow the actual house prices, indicating strong model performance. Most predictions lie near the ideal diagonal line, with a mean absolute percentage error (MAPE) of 0.0081 (0.81%) confirming the model's high accuracy, although there is slight underestimation at higher price levels. This suggests that the Random Forest model captures the general trend well, with some deviation in the upper price ranges.
Figure 6: Residual Plot
Figure 6 shows that the residuals are mostly centered around zero, with no clear pattern,
indicating that the errors are randomly distributed and homoscedastic. Although a few outliers
exist, the residual spread remains consistent across predicted price levels, confirming the model’s
stability.
Overall, the Random Forest model yields accurate predictions with minimal bias and residual variance, outperforming simpler models like Linear Regression and demonstrating strong suitability for house price prediction.
5 References
• Khoat Than. (2024, Semester 2). IT3190E - Machine Learning. Lecture presented at Hanoi University of Science and Technology, March 2024.
• Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Blondel, M. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
• Vu Huu Tiep. (2018, March 27). Machine Learning cơ bản.
• Nihal Desai and Vatsal Patel. (2016, July). Linear Decision Tree Regressor: Decision Tree Regressor Combined with Linear Regressor.
• Debasish Basak and Srimanta Pal. (2007, November). Support Vector Regression.