Prediction
Learning Objectives
At the end of this session you will be able to:
● Understand the process of making predictions with machine learning
● Apply correlation analysis for feature selection
● Build a machine learning predictor for the Boston Housing dataset
● Learn how to evaluate the predictor
Introduction
Making predictions with machine learning isn't just about grabbing the data and feeding it to an
algorithm. The algorithm might spit out some prediction, but that's not what you are aiming for.
The difference between good data science professionals and naive data science aspirants is
that the former follow this process religiously:
1. Understand the problem: Before getting the data, we need to understand the problem we are
trying to solve. If you know the domain, think about which factors could play a major role in
solving the problem. If you don't know the domain, read about it.
2. Hypothesis Generation: This is quite important, yet it is often forgotten. In simple words,
hypothesis generation refers to creating a set of features which could influence the target
variable at a given confidence level (commonly taken as 95%). We do this before looking at the
data to avoid biased thinking. This step often helps in creating new features.
3. Get Data: Now we download the data and look at it. Determine which features are available
and which aren't, how many of the features we generated during hypothesis generation hit the
mark, and which ones could still be created. Answering these questions will set us on the right
track.
4. Data Exploration: We can't determine everything by just looking at the data; we need to dig
deeper. This step helps us understand the nature of the variables (missing values, zero-variance
features) so that they can be treated properly. It involves creating charts, graphs (univariate and
bivariate analysis), and cross-tables to understand the behavior of the features.
5. Data Preprocessing: Here we impute missing values and clean string variables (remove
spaces, irregular tabs, inconsistent date-time formats) and anything else that shouldn't be there.
This step is usually performed alongside the data exploration stage.
6. Feature Engineering: Now we create and add new features to the data set. Most of the ideas
for these features come from the hypothesis generation stage.
7. Model Training: Using a suitable algorithm, we train the model on the given data set.
8. Model Evaluation: Once the model is trained, we evaluate its performance using a suitable
error metric. Here we also look at variable importance, i.e., which variables have proved to be
significant in determining the target variable. Accordingly, we can shortlist the best variables and
train the model again.
9. Model Testing: Finally, we test the model on the unseen (test) data set.
We'll follow this process in the project to arrive at our final predictions. Let's get started.
1. Understand the problem
This lab aims at predicting residential house prices in Boston, USA. The problem statement is
quite self-explanatory, so we move on to the next step.
2. Hypothesis Generation
Well, this is going to be interesting. What factors can you think of right now that could influence
house prices? As you read this, write down your factors as well, and then we can match them
with the data set. Defining a hypothesis has two parts: the Null Hypothesis (H0) and the
Alternate Hypothesis (Ha). They can be understood as:
H0 - A particular feature has no impact on the dependent variable.
Ha - A particular feature has a direct impact on the dependent variable.
Based on a decision criterion (say, a 5% significance level), in statistical parlance we either
'reject' or 'fail to reject' the null hypothesis. Practically, while building the model we look at
probability (p) values: if the p-value is < 0.05, we reject the null hypothesis; if it is > 0.05, we fail
to reject it. Some factors which I can think of that directly influence house prices are the
following:
Per capita crime rate by town
Proportion of residential land zoned for lots over 25,000 sq. ft
Proportion of non-retail business acres per town
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
Nitric oxide concentration (parts per 10 million)
Average number of rooms per dwelling
Proportion of owner-occupied units built prior to 1940
Weighted distances to five Boston employment centers
Index of accessibility to radial highways
Full-value property tax rate per $10,000
Pupil-teacher ratio by town
1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
Median value of owner-occupied homes in $1000s
…keep thinking. I am sure you can come up with many more apart from these.
3. Get Data
You can download the data from https://www.kaggle.com/altavish/boston-housing-dataset
and load it into your Python IDE. Also check the dataset page, where all the details about the
data and variables are given. The data set consists of 13 explanatory variables. Yes, it's going
to be one heck of a data exploration ride, but we'll learn how to deal with so many variables.
The target variable is MEDV. As you can see, the data set comprises numeric, categorical, and
ordinal variables.
4. Data Exploration
Data Exploration is the key to getting insights from data. Practitioners say a good data exploration
strategy can solve even complicated problems in a few hours. A good data exploration strategy
comprises the following:
1. Univariate Analysis - It is used to visualize one variable in one plot. Examples: histogram,
density plot, etc.
2. Bivariate Analysis - It is used to visualize two variables (x and y axis) in one plot. Examples:
bar chart, line chart, area chart, etc.
3. Multivariate Analysis - As the name suggests, it is used to visualize more than two variables
at once. Examples: stacked bar chart, dodged bar chart, etc.
4. Cross Tables - They are used to compare the behavior of two categorical variables (used in
pivot tables as well).
Let's load the necessary libraries and data and start coding.
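A minimal sketch of this setup, assuming the Kaggle download is saved as HousingData.csv in the working directory (adjust the file name and path to match your copy):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Boston Housing data; the file name is an assumption based on the
# Kaggle download -- change it to match your local copy.
df = pd.read_csv("HousingData.csv")

# Peek at the shape and the first few rows
print(df.shape)
print(df.head())
```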
Alternatively, you can also check the data set information using the info() command.
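For example, continuing with the DataFrame df from above:

```python
# Column names, non-null counts, and data types in one summary
df.info()
```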
5. Data Preprocessing
After loading the data, it's good practice to check whether there are any missing values. We
count the number of missing values for each feature using isnull().
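A sketch of this check, continuing with df:

```python
# Number of missing values in each column
print(df.isnull().sum())
```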
Out of 14 features, 6 features have missing values. Let's check the percentage of missing values in
these columns.
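One way to compute this, as a sketch:

```python
# Percentage of missing values per column, restricted to the affected columns
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct[missing_pct > 0])
```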
We can infer that each of these variables has about 3.9% missing values. Let's visualize these
missing values with a bar plot.
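A sketch of such a bar plot:

```python
# Bar plot of missing-value counts for the affected columns
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values()
missing.plot.bar()
plt.ylabel("Number of missing values")
plt.title("Missing values per feature")
plt.tight_layout()
plt.show()
```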
Let's proceed and check the distribution of the target variable.
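A sketch using seaborn (histplot is available in seaborn 0.11+; older versions use distplot instead):

```python
# Histogram with a kernel density estimate for the target variable
sns.histplot(df["MEDV"], bins=30, kde=True)
plt.xlabel("MEDV (median home value in $1000s)")
plt.show()
```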
We see that the values of MEDV are distributed normally with few
outliers.
Next, we create a correlation matrix that measures the linear relationships between the
variables. The correlation matrix can be computed with the corr method of the pandas
DataFrame, and we will use the heatmap function from the seaborn library to plot it.
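A sketch of the correlation heatmap, followed by scatter plots of the two features most strongly correlated with MEDV (RM and LSTAT), which the observations below refer to:

```python
# Correlation matrix of all numeric features, rounded for readability
corr_matrix = df.corr().round(2)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()

# Scatter plots of MEDV against RM and LSTAT
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, feature in zip(axes, ["RM", "LSTAT"]):
    ax.scatter(df[feature], df["MEDV"], alpha=0.5)
    ax.set_xlabel(feature)
    ax.set_ylabel("MEDV")
plt.show()
```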
Observations:
Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the
data appears to be capped at 50.
Prices tend to decrease as LSTAT increases, though the relationship does not look exactly
linear.
7. Model Training
We concatenate the LSTAT and RM columns using np.c_
provided by the numpy library.
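A sketch of the training step, assuming an ordinary least squares LinearRegression from scikit-learn (the handout does not name the algorithm here) and a simple median imputation for the remaining missing values; the split ratio and random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Impute remaining missing values with column medians (one simple strategy;
# the preprocessing step above leaves the exact choice open)
df_clean = df.fillna(df.median())

# Stack the two selected features column-wise into the feature matrix
X = np.c_[df_clean["LSTAT"], df_clean["RM"]]
y = df_clean["MEDV"]

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the linear regression model on the training split
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
```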
8. Model Evaluation
We will evaluate our model using RMSE and R2-score.
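A sketch of the evaluation, reusing the split and model from the training step above:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Report RMSE and R2 on both the training and the held-out test data
for name, X_part, y_part in [("train", X_train, y_train),
                             ("test", X_test, y_test)]:
    y_pred = lin_model.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, y_pred))
    r2 = r2_score(y_part, y_pred)
    print(f"{name}: RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```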
Assignment
This assignment uses the Boston Housing dataset once again. Create a test set consisting of
1/2 of the data, using the rest for training.
1. Build and evaluate the model using one additional feature that has the next highest
correlation with the target after RM and LSTAT.
2. Fit a polynomial regression model to the training data.
3. Predict the labels for the corresponding test data.
4. Evaluate and generate the model parameters.
5. Out of the predictors used in this assignment, which would you choose as the final model for
the Boston housing data?