Kaggle House Price Prediction Challenge

The document discusses a Kaggle challenge to predict housing prices. It imports libraries and loads training and test housing datasets, then explores the datasets and describes the variables in the data.


HOUSE PRICES - Kaggle challenge

Saturday, August 19th, 2023


Author: Romaric I. C. ASSOGBA
Kaggle-id: Romaric-kg

Importing libraries

[1]:
import pandas as pd
#pd.set_option("display.max_columns", 500)

import numpy as np
import scipy as sp
from sklearn.ensemble import RandomForestRegressor # SalePrice is continuous, so this is a regression task
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler # for preprocessing
from sklearn.metrics import r2_score
# for visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore") # in order to ignore warnings

[2]:
sns.set(style="whitegrid", color_codes=True)
sns.set(rc={'figure.figsize':(15, 8)})

Loading the datasets

[3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

[4]:
train.head()

[4]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

[5]:
test.head()

[5]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCond

0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal

1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal

2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal

3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal

4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

Note that the test dataset lacks the "SalePrice" column, which is the target we have to predict. So, let's go ...

[6]:
print("Train dataset shape:", train.shape)
print("Test dataset shape:", test.shape)

[6]:
Train dataset shape: (1460, 81)
Test dataset shape: (1459, 80)
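
The one-column difference between the two shapes can also be checked programmatically rather than by eye. A minimal sketch, assuming two tiny stand-in frames (`train_demo` and `test_demo` are hypothetical and only mimic the real files):

```python
import pandas as pd

# Tiny stand-ins for train.csv / test.csv (hypothetical subset of columns).
train_demo = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test_demo = pd.DataFrame({"Id": [1461, 1462], "LotArea": [11622, 14267]})

# Columns present in one frame but not in the other.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
only_in_test = test_demo.columns.difference(train_demo.columns).tolist()

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

On the real frames the same two lines would be called with `train` and `test`, and should report only the target column.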

Let's discover what each column means

[8]:
with open("data_description.txt") as file:
    for line in file:
        print(line, end='')

[8]:
MSSubClass: Identifies the type of dwelling involved in the sale.

20 1-STORY 1946 & NEWER ALL STYLES


30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

Grvl Gravel
Pave Paved

Alley: Type of alley access to property

Grvl Gravel
Pave Paved
NA No alley access

LotShape: General shape of property

Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular

LandContour: Flatness of the property

Lvl Near Flat/Level


Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression

Utilities: Type of utilities available

AllPub All public Utilities (E,G,W,& S)


NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only

LotConfig: Lot configuration

Inside Inside lot


Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property

LandSlope: Slope of property

Gtl Gentle slope


Mod Moderate Slope
Sev Severe Slope

Neighborhood: Physical locations within Ames city limits

Blmngtn Bloomington Heights


Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
Names North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker

Condition1: Proximity to various conditions

Artery Adjacent to arterial street


Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

Artery Adjacent to arterial street


Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad

BldgType: Type of dwelling

1Fam Single-family Detached


2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit

HouseStyle: Style of dwelling

1Story One story


1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level

OverallQual: Rates the overall material and finish of the house

10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor

OverallCond: Rates the overall condition of the house

10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

Flat Flat
Gable Gable
Gambrel Gabrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed

RoofMatl: Roof material

ClyTile Clay or Tile


CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles

Exterior1st: Exterior covering on house

AsbShng Asbestos Shingles


AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

AsbShng Asbestos Shingles


AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles

MasVnrType: Masonry veneer type

BrkCmn Brick Common


BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor

ExterCond: Evaluates the present condition of the material on the exterior

Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor

Foundation: Type of foundation

BrkTil Brick & Tile


CBlock Cinder Block
PConc Poured Contrete
Slab Slab
Stone Stone
Wood Wood

BsmtQual: Evaluates the height of the basement

Ex Excellent (100+ inches)


Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches
NA No Basement

BsmtCond: Evaluates the general condition of the basement

Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement

BsmtExposure: Refers to walkout or garden level walls

Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Mimimum Exposure
No No Exposure
NA No Basement

BsmtFinType1: Rating of basement finished area

GLQ Good Living Quarters


ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

GLQ Good Living Quarters


ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

Floor Floor Furnace


GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace

HeatingQC: Heating quality and condition

Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor

CentralAir: Central air conditioning

N No
Y Yes

Electrical: Electrical system

SBrkr Standard Circuit Breakers & Romex


FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

Typ Typical Functionality


Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

Ex Excellent - Exceptional Masonry Fireplace


Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace

GarageType: Garage location

2Types More than one type of garage


Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage

GarageCond: Garage condition

Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage

PavedDrive: Paved driveway

Y Paved
P Partial Pavement
N Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool

Fence: Fence quality

GdPrv Good Privacy


MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence

MiscFeature: Miscellaneous feature not covered in other categories

Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

WD Warranty Deed - Conventional


CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other

SaleCondition: Condition of sale

Normal Normal Sale


Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)

Datasets inspection

[9]:
train.info()

[9]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

[10]:
test.info()

[10]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1459 non-null int64
1 MSSubClass 1459 non-null int64
2 MSZoning 1455 non-null object
3 LotFrontage 1232 non-null float64
4 LotArea 1459 non-null int64
5 Street 1459 non-null object
6 Alley 107 non-null object
7 LotShape 1459 non-null object
8 LandContour 1459 non-null object
9 Utilities 1457 non-null object
10 LotConfig 1459 non-null object
11 LandSlope 1459 non-null object
12 Neighborhood 1459 non-null object
13 Condition1 1459 non-null object
14 Condition2 1459 non-null object
15 BldgType 1459 non-null object
16 HouseStyle 1459 non-null object
17 OverallQual 1459 non-null int64
18 OverallCond 1459 non-null int64
19 YearBuilt 1459 non-null int64
20 YearRemodAdd 1459 non-null int64
21 RoofStyle 1459 non-null object
22 RoofMatl 1459 non-null object
23 Exterior1st 1458 non-null object
24 Exterior2nd 1458 non-null object
25 MasVnrType 1443 non-null object
26 MasVnrArea 1444 non-null float64
27 ExterQual 1459 non-null object
28 ExterCond 1459 non-null object
29 Foundation 1459 non-null object
30 BsmtQual 1415 non-null object
31 BsmtCond 1414 non-null object
32 BsmtExposure 1415 non-null object
33 BsmtFinType1 1417 non-null object
34 BsmtFinSF1 1458 non-null float64
35 BsmtFinType2 1417 non-null object
36 BsmtFinSF2 1458 non-null float64
37 BsmtUnfSF 1458 non-null float64
38 TotalBsmtSF 1458 non-null float64
39 Heating 1459 non-null object
40 HeatingQC 1459 non-null object
41 CentralAir 1459 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1459 non-null int64
44 2ndFlrSF 1459 non-null int64
45 LowQualFinSF 1459 non-null int64
46 GrLivArea 1459 non-null int64
47 BsmtFullBath 1457 non-null float64
48 BsmtHalfBath 1457 non-null float64
49 FullBath 1459 non-null int64
50 HalfBath 1459 non-null int64
51 BedroomAbvGr 1459 non-null int64
52 KitchenAbvGr 1459 non-null int64
53 KitchenQual 1458 non-null object
54 TotRmsAbvGrd 1459 non-null int64
55 Functional 1457 non-null object
56 Fireplaces 1459 non-null int64
57 FireplaceQu 729 non-null object
58 GarageType 1383 non-null object
59 GarageYrBlt 1381 non-null float64
60 GarageFinish 1381 non-null object
61 GarageCars 1458 non-null float64
62 GarageArea 1458 non-null float64
63 GarageQual 1381 non-null object
64 GarageCond 1381 non-null object
65 PavedDrive 1459 non-null object
66 WoodDeckSF 1459 non-null int64
67 OpenPorchSF 1459 non-null int64
68 EnclosedPorch 1459 non-null int64
69 3SsnPorch 1459 non-null int64
70 ScreenPorch 1459 non-null int64
71 PoolArea 1459 non-null int64
72 PoolQC 3 non-null object
73 Fence 290 non-null object
74 MiscFeature 51 non-null object
75 MiscVal 1459 non-null int64
76 MoSold 1459 non-null int64
77 YrSold 1459 non-null int64
78 SaleType 1458 non-null object
79 SaleCondition 1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB
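
The `info()` listings are long, so a compact summary of only the columns with gaps can be easier to read. A minimal sketch on a hypothetical three-column frame (`demo` is made up for illustration). Note that `pd.read_csv` treats the string `NA` as missing by default, so codes the description lists as "NA No alley access" or "NA No Pool" most likely show up here as missing values rather than as a category:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps similar to the real data.
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column, keep only columns that have any,
# and sort so the worst columns come first.
missing = demo.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
summary = pd.DataFrame({"n_missing": missing, "share": missing / len(demo)})

print(summary)
```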

[11]:
# Stack train and test into one frame; the index keeps only the 'train' / 'test' label
#df = train.fillna(train.mean())
data = pd.concat([train, test], keys=['train', 'test'])
data.index = data.index.droplevel(level=1)
#df = data.drop(columns=["Alley","PoolQC", "MiscFeature", "Fence", "FireplaceQu"])
#df = df.dropna()
df = data
df.info()

[11]:
<class 'pandas.core.frame.DataFrame'>
Index: 2919 entries, train to test
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 2919 non-null int64
1 MSSubClass 2919 non-null int64
2 MSZoning 2915 non-null object
3 LotFrontage 2433 non-null float64
4 LotArea 2919 non-null int64
5 Street 2919 non-null object
6 Alley 198 non-null object
7 LotShape 2919 non-null object
8 LandContour 2919 non-null object
9 Utilities 2917 non-null object
10 LotConfig 2919 non-null object
11 LandSlope 2919 non-null object
12 Neighborhood 2919 non-null object
13 Condition1 2919 non-null object
14 Condition2 2919 non-null object
15 BldgType 2919 non-null object
16 HouseStyle 2919 non-null object
17 OverallQual 2919 non-null int64
18 OverallCond 2919 non-null int64
19 YearBuilt 2919 non-null int64
20 YearRemodAdd 2919 non-null int64
21 RoofStyle 2919 non-null object
22 RoofMatl 2919 non-null object
23 Exterior1st 2918 non-null object
24 Exterior2nd 2918 non-null object
25 MasVnrType 2895 non-null object
26 MasVnrArea 2896 non-null float64
27 ExterQual 2919 non-null object
28 ExterCond 2919 non-null object
29 Foundation 2919 non-null object
30 BsmtQual 2838 non-null object
31 BsmtCond 2837 non-null object
32 BsmtExposure 2837 non-null object
33 BsmtFinType1 2840 non-null object
34 BsmtFinSF1 2918 non-null float64
35 BsmtFinType2 2839 non-null object
36 BsmtFinSF2 2918 non-null float64
37 BsmtUnfSF 2918 non-null float64
38 TotalBsmtSF 2918 non-null float64
39 Heating 2919 non-null object
40 HeatingQC 2919 non-null object
41 CentralAir 2919 non-null object
42 Electrical 2918 non-null object
43 1stFlrSF 2919 non-null int64
44 2ndFlrSF 2919 non-null int64
45 LowQualFinSF 2919 non-null int64
46 GrLivArea 2919 non-null int64
47 BsmtFullBath 2917 non-null float64
48 BsmtHalfBath 2917 non-null float64
49 FullBath 2919 non-null int64
50 HalfBath 2919 non-null int64
51 BedroomAbvGr 2919 non-null int64
52 KitchenAbvGr 2919 non-null int64
53 KitchenQual 2918 non-null object
54 TotRmsAbvGrd 2919 non-null int64
55 Functional 2917 non-null object
56 Fireplaces 2919 non-null int64
57 FireplaceQu 1499 non-null object
58 GarageType 2762 non-null object
59 GarageYrBlt 2760 non-null float64
60 GarageFinish 2760 non-null object
61 GarageCars 2918 non-null float64
62 GarageArea 2918 non-null float64
63 GarageQual 2760 non-null object
64 GarageCond 2760 non-null object
65 PavedDrive 2919 non-null object
66 WoodDeckSF 2919 non-null int64
67 OpenPorchSF 2919 non-null int64
68 EnclosedPorch 2919 non-null int64
69 3SsnPorch 2919 non-null int64
70 ScreenPorch 2919 non-null int64
71 PoolArea 2919 non-null int64
72 PoolQC 10 non-null object
73 Fence 571 non-null object
74 MiscFeature 105 non-null object
75 MiscVal 2919 non-null int64
76 MoSold 2919 non-null int64
77 YrSold 2919 non-null int64
78 SaleType 2918 non-null object
79 SaleCondition 2919 non-null object
80 SalePrice 1460 non-null float64
dtypes: float64(12), int64(26), object(43)
memory usage: 1.8+ MB

[12]:
data

[12]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SaleP

train 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 2085

train 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 1815

train 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 2235

train 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 1400

train 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 2500

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

test 2915 160 RM 21.0 1936 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2006 WD Normal NaN

test 2916 160 RM 21.0 1894 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2006 WD Abnorml NaN

test 2917 20 RL 160.0 20000 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 9 2006 WD Abnorml NaN

test 2918 85 RL 62.0 10441 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv Shed 700 7 2006 WD Normal NaN

test 2919 60 RL 74.0 9627 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 11 2006 WD Normal NaN

2919 rows × 81 columns

[13]:
# Columns to label-encode.
# Note: 'LotArea' is numeric (int64 in train.info()) but is listed here, so it is
# label-encoded too; train.head() further down shows codes such as 327.0 instead of 8450.
non_num = ['MSZoning', 'LotArea', 'Street',
           'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
           'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
           'HouseStyle',
           'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
           'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
           'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
           'BsmtFinType2', 'Heating',
           'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
           'Functional', 'FireplaceQu', 'GarageType',
           'GarageFinish', 'GarageQual',
           'GarageCond', 'PavedDrive', 'PoolQC',
           'Fence', 'MiscFeature', 'SaleType',
           'SaleCondition']

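[14]:
# Encode each listed column as integer codes.
# Caveat: train and test get separate encoders, so the same category can map to
# different codes in the two sets whenever their category sets differ.
for feature in non_num:
    train[feature] = LabelEncoder().fit_transform(train[feature])
    test[feature] = LabelEncoder().fit_transform(test[feature])

# Drop the five columns with the most missing values
train.drop(columns=["Alley","PoolQC", "MiscFeature", "Fence", "FireplaceQu"], inplace=True)
test.drop(columns=["Alley","PoolQC", "MiscFeature", "Fence", "FireplaceQu"], inplace=True)
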
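
Because the loop above fits one `LabelEncoder` on train and another on test, a category such as `RL` is not guaranteed to receive the same integer in both sets. A minimal sketch of one possible alternative, fitting a single encoder on the values of both sets (the two small frames are hypothetical; this is not what the notebook ran):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature columns: test lacks the 'FV' category seen in train.
train_demo = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})
test_demo = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})

# Separate encoder on test only: 'RL' becomes 0 because 'FV' is unknown to it.
separate_test_codes = LabelEncoder().fit_transform(test_demo["MSZoning"])

# Shared encoder fitted on all observed values: codes agree across both sets.
encoder = LabelEncoder().fit(pd.concat([train_demo["MSZoning"], test_demo["MSZoning"]]))
train_codes = encoder.transform(train_demo["MSZoning"])
test_codes = encoder.transform(test_demo["MSZoning"])

print("separate test codes:", separate_test_codes.tolist())
print("shared train codes: ", train_codes.tolist())
print("shared test codes:  ", test_codes.tolist())
```

With the shared encoder `RL` is 1 in both frames, whereas the test-only encoder assigns it 0. For feature columns, scikit-learn's `OrdinalEncoder` or `OneHotEncoder` are the more usual tools; `LabelEncoder` is kept here only to stay close to the cell above.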
[20]:
from sklearn.impute import SimpleImputer

# Columns imputed with the mean (unless they are also listed in non_num);
# every other column is imputed with the median
normal = ['LotFrontage', 'LotArea', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'SalePrice']

for data in [train, test]:
    for col in data.columns:
        if (col in non_num) or (col not in normal):
            strategy = 'median'
        else:
            strategy = 'mean'
        imp_mean = SimpleImputer(missing_values=np.nan, strategy=strategy)
        data[col] = imp_mean.fit_transform(data[[col]]).squeeze()
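
The cell above fits a fresh imputer on each dataset, so test values are filled with statistics computed from the test set itself. A common alternative is to learn the fill value on train only and reuse it on test. A minimal sketch under that assumption, on hypothetical one-column frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frames with one numeric column each.
train_demo = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 100.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# Learn the fill value (the train mean, 80.0 here) on train only ...
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
train_filled = imputer.fit_transform(train_demo[["LotFrontage"]]).ravel()

# ... and reuse that same value when filling the test set.
test_filled = imputer.transform(test_demo[["LotFrontage"]]).ravel()

print(train_filled.tolist())
print(test_filled.tolist())
```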

[22]:
train.head()

[22]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual Overall

0 1.0 60.0 3.0 65.0 327.0 1.0 2.0 3.0 3.0 0.0 4.0 0.0 5.0 2.0 2.0 0.0 5.0 7.0 5.0

1 2.0 20.0 3.0 80.0 498.0 1.0 2.0 3.0 3.0 0.0 2.0 0.0 24.0 1.0 2.0 0.0 2.0 6.0 8.0
2 3.0 60.0 3.0 68.0 702.0 1.0 2.0 0.0 3.0 0.0 4.0 0.0 5.0 2.0 2.0 0.0 5.0 7.0 5.0

3 4.0 70.0 3.0 60.0 489.0 1.0 2.0 0.0 3.0 0.0 0.0 0.0 6.0 2.0 2.0 0.0 5.0 7.0 5.0

4 5.0 60.0 3.0 84.0 925.0 1.0 2.0 0.0 3.0 0.0 2.0 0.0 15.0 2.0 2.0 0.0 5.0 8.0 5.0

[23]:
pd.set_option("display.max_columns", 500)
train.corr().tail(1)

[23]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType Hous

SalePrice -0.021917 -0.084284 -0.166872 0.334901 0.454564 0.041036 0.139868 -0.25558 0.015453 -0.014314 -0.067396 0.051152 0.210851 0.091155 0.007513 -0.085591 0.180
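
A single wide row of correlations is hard to scan. Sorting the same numbers by absolute value puts the most informative features first. A minimal sketch on a hypothetical frame (`demo` and its `Age` column are invented for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical miniature frame; 'Age' is an invented column.
demo = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "Age": [50, 45, 20, 15, 2],
    "SalePrice": [120000, 150000, 200000, 260000, 330000],
})

# Correlation of every feature with the target, target itself removed.
corr_with_target = demo.corr()["SalePrice"].drop("SalePrice")

# Order by strength of the relationship, keeping the sign for reading.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```

Keep in mind that for label-encoded categorical columns the integer codes are arbitrary, so their Pearson correlation with the price is only a rough indication.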

[26]:
from sklearn.model_selection import train_test_split

X = train.drop(columns=['Id', 'SalePrice'])
y = train.SalePrice

x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, shuffle=True,random_state=42)

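[28]:
from sklearn.ensemble import RandomForestRegressor

# Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for a regressor it meant "use all features", which newer versions write as max_features=1.0.
model = RandomForestRegressor(n_estimators=1100,
                              max_depth=5,
                              min_samples_split=4,
                              min_samples_leaf=5,
                              max_features='auto',
                              oob_score=True,
                              random_state=42,
                              n_jobs=-1,
                              verbose=1)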

[29]:
model.fit(x_train, y_train)

[29]:
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 2.2s
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 3.9s
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed: 5.4s finished

[29]:
RandomForestRegressor(max_depth=5, max_features='auto', min_samples_leaf=5,
min_samples_split=4, n_estimators=1100, n_jobs=-1,
oob_score=True, random_state=42, verbose=1)
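
The model is created with `oob_score=True`, yet the out-of-bag estimate is never read back. After fitting, scikit-learn exposes it as the `oob_score_` attribute, which for a regressor is an R² computed on the samples each tree did not see. A minimal sketch on synthetic data (the arrays and the smaller forest are assumptions made to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: the target depends mostly on the first feature.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=300)

forest = RandomForestRegressor(n_estimators=100,
                               max_depth=5,
                               min_samples_leaf=5,
                               oob_score=True,
                               random_state=42)
forest.fit(X_demo, y_demo)

# R^2 estimated from out-of-bag samples, available without a separate hold-out set.
oob_r2 = forest.oob_score_
print("OOB R^2: %.3f" % oob_r2)
```

On the notebook's own model the equivalent call would be `model.oob_score_` after `model.fit(x_train, y_train)`.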

[33]:
from sklearn.metrics import mean_absolute_error

def evaluate_model(model):
    print("Model name: ", type(model).__name__)
    print("Model parameters: ", model.get_params())

    # Mean absolute error on the training split and on the held-out evaluation split
    # (the messages below call the evaluation split "the test set")
    model_train_mae = mean_absolute_error(y_train, model.predict(x_train))
    model_test_mae = mean_absolute_error(y_eval, model.predict(x_eval))

    print("Model Mean Absolute error on the train set : %.2f" % model_train_mae)
    print("Model Mean Absolute error on the test set : %.2f" % model_test_mae)

[34]:
evaluate_model(model)

[34]:
Model name: RandomForestRegressor
Model parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 5, 'max_features': 'auto', 'max_leaf_nodes': None,
'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 4, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 1100,
'n_jobs': -1, 'oob_score': True, 'random_state': 42, 'verbose': 1, 'warm_start': False}

[34]:
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 0.2s
[Parallel(n_jobs=4)]: Done 1100 out of 1100 | elapsed: 0.3s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 0.0s

[34]:
Model Mean Absolute error on the train set : 16645.95
Model Mean Absolute error on the test set : 20689.10

[34]:
[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 1100 out of 1100 | elapsed: 0.2s finished
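
The mean absolute error is expressed in dollars, which makes it easy to read, but it is not the only useful view. `r2_score` is imported at the top of the notebook and never used, and, as far as I know, the competition leaderboard ranks submissions by the RMSE between the logarithms of predicted and observed prices. A minimal sketch of a helper that reports all three (`regression_report` is a hypothetical name, and the prices in the example are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, R^2 and RMSE of log prices for strictly positive prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_rmse = float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
        "log_rmse": log_rmse,
    }

# Made-up prices: two predictions are off by 10,000, one is exact.
report = regression_report([100000, 200000, 300000], [110000, 190000, 300000])
print(report)
```

With the notebook's variables this would be called as `regression_report(y_eval, model.predict(x_eval))`.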

[36]:
X_test = test.drop(columns=['Id'])
y_pred = model.predict(X_test)

# The imputation step turned every column into floats, so cast Id back to integers
df_predictions = pd.DataFrame({
    'Id': test.Id.astype('Int64'),
    'SalePrice': y_pred,
})

df_predictions.to_csv('predictions.csv', index=False, sep=',')

df_predictions.head()

[36]:
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 0.2s
[Parallel(n_jobs=4)]: Done 1100 out of 1100 | elapsed: 0.2s finished

[36]: Id SalePrice

0 1461 124368.331432

1 1462 147987.206042

2 1463 175034.028728

3 1464 182615.175441

4 1465 222405.623241
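
Before uploading, it can be worth running a few sanity checks on the predictions frame. A minimal sketch, assuming the expected layout is exactly the two columns written above (`check_submission` is a hypothetical helper, and the frames below are made up):

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ["Id", "SalePrice"]:
        return ["columns must be exactly ['Id', 'SalePrice']"]
    problems = []
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(submission)}")
    if submission["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if submission["SalePrice"].isna().any():
        problems.append("missing SalePrice values")
    if (submission["SalePrice"] <= 0).any():
        problems.append("non-positive SalePrice values")
    return problems

# Made-up frames: one clean, one with a repeated Id and a missing price.
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [124368.33, 147987.21]})
bad = pd.DataFrame({"Id": [1461, 1461], "SalePrice": [124368.33, None]})

print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

For the real file the call would be `check_submission(df_predictions, expected_rows=len(test))`.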
