Prediction
Learning Objectives
At the end of this session you will be able to:
● Understand the process of making predictions with machine learning
● Apply correlation analysis for feature selection
● Build a machine learning predictor for the Boston Housing dataset
● Learn how to evaluate the predictor
Introduction
Making predictions with machine learning isn't just about grabbing the data and feeding it to an
algorithm. The algorithm might spit out some prediction, but that's not what you are aiming for.
The difference between good data science professionals and naive data science aspirants is
that the former follow this process religiously:
1. Understand the problem: Before getting the data, we need to understand the problem we are
trying to solve. If you know the domain, think about which factors could play a major role in
solving the problem. If you don't know the domain, read about it.
2. Hypothesis Generation: This is quite important, yet it is often forgotten. In simple words,
hypothesis generation refers to creating a set of features which could influence the target
variable at a given confidence level (commonly taken as 95%). We do this before looking at the
data to avoid biased thinking. This step often helps in creating new features.
3. Get Data: Now we download the data and look at it. Determine which features are available
and which aren't, how many of the features we generated during hypothesis generation hit the
mark, and which ones could still be created. Answering these questions will set us on the right
track.
4. Data Exploration: We can't determine everything by just looking at the data; we need to dig
deeper. This step helps us understand the nature of the variables (missing values, zero-variance
features) so that they can be treated properly. It involves creating charts, graphs (univariate and
bivariate analysis), and cross-tables to understand the behavior of the features.
5. Data Preprocessing: Here we impute missing values and clean string variables (remove
spaces, irregular tabs, inconsistent date-time formats) and anything else that shouldn't be there.
This step is usually performed alongside the data exploration stage.
6. Feature Engineering: Now we create and add new features to the data set. Most of the ideas
for these features come from the hypothesis generation stage.
7. Model Training: Using a suitable algorithm, we train the model on the given data set.
8. Model Evaluation: Once the model is trained, we evaluate its performance using a suitable
error metric. Here we also look at variable importance, i.e., which variables have proved to be
significant in determining the target variable. Accordingly, we can shortlist the best variables and
train the model again.
9. Model Testing: Finally, we test the model on the unseen (test) data set.
We'll follow this process in the project to arrive at our final predictions. Let's get started.
1. Understand the problem
This lab aims at predicting residential house prices in Boston, USA. The problem statement is
quite self-explanatory, so we move on to the next step.
2. Hypothesis Generation
Well, this is going to be interesting. What factors can you think of right now that could influence
house prices? As you read this, write down your factors as well, and then we can match them
with the data set. Defining a hypothesis has two parts: the Null Hypothesis (H0) and the
Alternate Hypothesis (Ha). They can be understood as:
H0 - A particular feature has no impact on the dependent variable.
Ha - A particular feature has a direct impact on the dependent variable.
Based on a decision criterion (say, a 5% significance level), in statistical parlance we either
'reject' or 'fail to reject' the null hypothesis. Practically, while building the model we look at
probability (p) values: if the p-value is < 0.05, we reject the null hypothesis; if it is > 0.05, we fail
to reject it. Some factors which I can think of that directly influence house prices are the
following:
Per capita crime rate by town
Proportion of residential land zoned for lots over 25,000 sq. ft
Proportion of non-retail business acres per town
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
Nitric oxide concentration (parts per 10 million)
Average number of rooms per dwelling
Proportion of owner-occupied units built prior to 1940
Weighted distances to five Boston employment centers
Index of accessibility to radial highways
Full-value property tax rate per $10,000
Pupil-teacher ratio by town
1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
Median value of owner-occupied homes in $1000s
…keep thinking. I am sure you can come up with many more apart from these.
3. Get Data
You can download the data from https://www.kaggle.com/altavish/boston-housing-dataset
and load it into your Python IDE. Also check the dataset page, where all the details about the
data and variables are given. The data set consists of 13 explanatory variables. Yes, it's going
to be one heck of a data exploration ride, but we'll learn how to deal with so many variables.
The target variable is MEDV. As you can see, the data set comprises numeric, categorical, and
ordinal variables.
4. Data Exploration
Data Exploration is the key to getting insights from data. Practitioners say a good data exploration
strategy can solve even complicated problems in a few hours. A good data exploration strategy
comprises the following:
1. Univariate Analysis - It is used to visualize one variable in one plot. Examples: histogram,
density plot, etc.
2. Bivariate Analysis - It is used to visualize two variables (x and y axis) in one plot. Examples:
bar chart, line chart, area chart, etc.
3. Multivariate Analysis - As the name suggests, it is used to visualize more than two variables
at once. Examples: stacked bar chart, dodged bar chart, etc.
4. Cross Tables - They are used to compare the behavior of two categorical variables (used in
pivot tables as well).
Let's load the necessary libraries and data and start coding.
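A minimal sketch of this setup, assuming the Kaggle download is saved as HousingData.csv in the working directory (adjust the file name and path to match your copy):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Boston Housing data; the file name is an assumption based on the
# Kaggle download -- change it to match your local copy.
df = pd.read_csv("HousingData.csv")

# Peek at the shape and the first few rows
print(df.shape)
print(df.head())
```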
Alternatively, you can also check the data set information using the info() command.
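For example, continuing with the DataFrame df from above:

```python
# Column names, non-null counts, and data types in one summary
df.info()
```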
5. Data Preprocessing
After loading the data, it's good practice to check whether there are any missing values. We
count the number of missing values for each feature using isnull().
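A sketch of this check, continuing with df:

```python
# Number of missing values in each column
print(df.isnull().sum())
```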
Out of 14 features, 6 features have missing values. Let's check the percentage of missing values in
these columns.
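One way to compute this, as a sketch:

```python
# Percentage of missing values per column, restricted to the affected columns
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct[missing_pct > 0])
```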
We can infer that each of these variables has about 3.9% missing values. Let's visualize these
missing values with a bar plot.
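A sketch of such a bar plot:

```python
# Bar plot of missing-value counts for the affected columns
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values()
missing.plot.bar()
plt.ylabel("Number of missing values")
plt.title("Missing values per feature")
plt.tight_layout()
plt.show()
```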
Let's proceed and check the distribution of the target variable.
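A sketch using seaborn (histplot is available in seaborn 0.11+; older versions use distplot instead):

```python
# Histogram with a kernel density estimate for the target variable
sns.histplot(df["MEDV"], bins=30, kde=True)
plt.xlabel("MEDV (median home value in $1000s)")
plt.show()
```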
We see that the values of MEDV are distributed normally with few
outliers.
Next, we create a correlation matrix that measures the linear relationships between the
variables. The correlation matrix can be computed with the corr method of the pandas
DataFrame, and we will use the heatmap function from the seaborn library to plot it.
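A sketch of the correlation heatmap, followed by scatter plots of the two features most strongly correlated with MEDV (RM and LSTAT), which the observations below refer to:

```python
# Correlation matrix of all numeric features, rounded for readability
corr_matrix = df.corr().round(2)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()

# Scatter plots of MEDV against RM and LSTAT
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, feature in zip(axes, ["RM", "LSTAT"]):
    ax.scatter(df[feature], df["MEDV"], alpha=0.5)
    ax.set_xlabel(feature)
    ax.set_ylabel("MEDV")
plt.show()
```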
Observations:
Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the
data appears to be capped at 50.
Prices tend to decrease as LSTAT increases, though the relationship does not look exactly
linear.
7. Model Training
We concatenate the LSTAT and RM columns using np.c_
provided by the numpy library.
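A sketch of the training step, assuming an ordinary least squares LinearRegression from scikit-learn (the handout does not name the algorithm here) and a simple median imputation for the remaining missing values; the split ratio and random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Impute remaining missing values with column medians (one simple strategy;
# the preprocessing step above leaves the exact choice open)
df_clean = df.fillna(df.median())

# Stack the two selected features column-wise into the feature matrix
X = np.c_[df_clean["LSTAT"], df_clean["RM"]]
y = df_clean["MEDV"]

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the linear regression model on the training split
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
```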
8. Model Evaluation
We will evaluate our model using RMSE and R2-score.
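A sketch of the evaluation, reusing the split and model from the training step above:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Report RMSE and R2 on both the training and the held-out test data
for name, X_part, y_part in [("train", X_train, y_train),
                             ("test", X_test, y_test)]:
    y_pred = lin_model.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, y_pred))
    r2 = r2_score(y_part, y_pred)
    print(f"{name}: RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```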
Assignment
This assignment uses the Boston Housing dataset once again. Create a test set consisting of
1/2 of the data, using the rest for training.
1. Build and evaluate the model using one additional feature that has the next highest
correlation with the target after RM and LSTAT.
2. Fit a polynomial regression model to the training data.
3. Predict the labels for the corresponding test data.
4. Evaluate and generate the model parameters.
5. Out of the predictors used in this assignment, which would you choose as the final model for
the Boston housing data?