Machine learning
Linear regression
Dr. Darkhan Zholtayev
Assistant professor at the Department of Computational and Data Science
[email protected]
Topics to cover
• What is regression
• Linear regression
• Least squares error
General graph
AI map
Joseph, B. (2020, June 17). Linear Regression Made Easy: How Does It Work and How to Use It in Python. Towards Data Science.
https://towardsdatascience.com/linear-regression-made-easy-how-does-it-work-and-how-to-use-it-in-python-be0799d2f159
Linear Regression
• Technique used for the modeling and analysis of
numerical data
• Exploits the relationship between two or more variables
so that we can gain information about one of them
through knowing values of the other
• Regression can be used for prediction, estimation,
hypothesis testing, and modeling causal relationships
Problem
Data
Xie, Y. (2013). Lecture 11: Simple Linear Regression. H. Milton Stewart School of Industrial
and Systems Engineering, Georgia Institute of Technology. Retrieved from
Linear Regression
Linear regression: different forms
Estimate regression parameters
Method of least squares
Least squares estimates
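A minimal sketch of the least squares estimates for a simple linear regression y = b0 + b1·x. The symbols b0 and b1, the function name and the toy data are assumptions made for illustration. The slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the two means.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # s_xy: sum of cross-deviations; s_xx: sum of squared deviations of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (illustrative values only)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares_fit(x, y)
print(b0, b1)
```

Because the toy points lie exactly on a line, the fit should recover an intercept of 1 and a slope of 2.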
Alternative notation
Example: oxygen and hydrocarbon level
Calculation 2
Calculation
Interpretation of regression model
Estimation of variance
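A common estimator of the error variance in simple linear regression divides the residual sum of squares by n − 2, because two parameters (intercept and slope) are estimated from the data. The sketch below assumes that estimator; the function name and the toy data are hypothetical.

```python
def residual_variance(x, y):
    """Estimate the error variance as SSE / (n - 2) for a simple linear fit."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    # Residual = observed value minus fitted value
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)
    return sse / (n - 2)


# Toy data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sigma2_hat = residual_variance(x, y)
print(round(sigma2_hat, 4))
```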
Summary
Example
import pandas as pd  # for data manipulation
import numpy as np  # for data manipulation
from sklearn.linear_model import LinearRegression  # for creating a model
import plotly.graph_objects as go  # for visualizations
import plotly.express as px  # for visualizations

# Read data into a Pandas DataFrame
df = pd.read_csv('Real estate.csv', encoding='utf-8')
# Print DataFrame
df
Joseph, B. (2020, June 17). Linear Regression Made Easy: How Does It Work and How to Use It in Python. Towards Data Science.
https://towardsdatascience.com/linear-regression-made-easy-how-does-it-work-and-how-to-use-it-in-python-be0799d2f159
Data
Code 1
# Create a scatter plot
fig = px.scatter(df, x=df['X3 distance to the nearest MRT station'], y=df['Y house price of unit area'],
                 opacity=0.8, color_discrete_sequence=['black'])
# Change chart background color
fig.update_layout(dict(plot_bgcolor='white'))
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black')
# Set figure title
fig.update_layout(title_text="Scatter Plot")
# Update marker size
fig.update_traces(marker=dict(size=3))
fig.show()
Scatter plot
Training
# Select variables that we want to use in a model
# Note, we need X to be a 2D array, hence reshape
X = df['X3 distance to the nearest MRT station'].values.reshape(-1, 1)
y = df['Y house price of unit area'].values
# Fit linear regression model
model = LinearRegression()
reg = model.fit(X, y)
# Print the slope and intercept of the best-fit line
print(reg.coef_)
print(reg.intercept_)
Code 2
# We will use the values below to draw a best-fit line on a chart
# Create 20 evenly spaced points from smallest X to largest X
x_range = np.linspace(X.min(), X.max(), 20)
# Predict y values for our set of X values
y_range = model.predict(x_range.reshape(-1, 1))
# Create a scatter plot
fig = px.scatter(df, x=df['X3 distance to the nearest MRT station'], y=df['Y house price of unit area'],
                 opacity=0.8, color_discrete_sequence=['black'])
# Add a best-fit line
fig.add_traces(go.Scatter(x=x_range, y=y_range, name='Regression Fit'))
# Change chart background color
fig.update_layout(dict(plot_bgcolor='white'))
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black')
# Set figure title
fig.update_layout(title_text="Scatter Plot with Linear Regression Line")
# Update marker size
fig.update_traces(marker=dict(size=3))
fig.show()
Prediction line
Multiple linear regression
# Select variables that we want to use in a model
# Note, X in this case is already a 2D array, hence no reshape
X = df[['X3 distance to the nearest MRT station', 'X2 house age']]
y = df['Y house price of unit area'].values
# Fit linear regression model
model = LinearRegression()
reg = model.fit(X, y)
# Print slope(s) and intercept
print(reg.coef_)
print(reg.intercept_)
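To make the printed slope(s) and intercept concrete, here is a small sketch of how a multiple linear regression prediction is formed from them: the intercept plus each coefficient times its feature. The numbers and the function name are hypothetical and are not fitted from the real-estate data.

```python
def predict(intercept, coefs, features):
    """Linear prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * f for c, f in zip(coefs, features))


# Hypothetical values for illustration only (not the fitted model)
intercept = 50.0
coefs = [-0.01, -0.2]      # one slope per input column, in column order
features = [1000.0, 10.0]  # distance to MRT station, house age
price = predict(intercept, coefs, features)
print(price)
```

With a fitted scikit-learn model, `model.predict` performs the same calculation using `reg.intercept_` and `reg.coef_`.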
Multiple linear regression: Python example
Fitted multiple linear regression
Basic statistics
• The sample mean is the sum of all the observations (∑Xi)
divided by the number of observations (n):
ΣXi = X1 + X2 + X3 + X4 + … + Xn
• Example. 1, 2, 2, 4, 5, 10. Calculate the mean. Note: n =
6 (six observations)
∑Xi = 1 + 2 + 2 + 4 + 5 + 10 = 24
$\bar{X} = \sum X_i / n = 24 / 6 = 4.0$
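The same calculation as a short Python sketch, using the six observations from the example. The variable names are illustrative, and the standard-library `statistics.mean` is used only as a cross-check.

```python
import statistics

data = [1, 2, 2, 4, 5, 10]
total = sum(data)   # sum of all observations
n = len(data)       # number of observations
mean = total / n
print(total, n, mean)  # 24 6 4.0

# Cross-check against the standard library
assert mean == statistics.mean(data)
```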
The median
• The median is the middle value of the ordered data.
• To get the median, we must first rearrange the data into an ordered array (in ascending or descending order). Generally, we order the data from the lowest value to the highest value.
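A minimal sketch of the procedure: sort the data, then take the middle value. For an even number of observations the sketch averages the two middle values, which is the usual convention but an assumption here, since the slide does not state it. The function name and sample data are illustrative.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)  # ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [5, 1, 3, 2, 4]
even = [10, 2, 1, 5, 4, 2]
print(median(odd))   # 3
print(median(even))  # 3.0
```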
The mode
• The mode is the value of the data that occurs with the
greatest frequency.
Example. 1, 1, 1, 2, 3, 4, 5
Answer. The mode is 1 since it occurs three times. The other values
each appear only once in the data set.
Example. 5, 5, 5, 6, 8, 10, 10, 10.
Answer. The mode is: 5, 10.
There are two modes. This is a bi-modal dataset.
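Both examples can be checked with a short sketch that counts how often each value occurs and returns every value with the highest count. The function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 1, 1, 2, 3, 4, 5]))        # [1]
print(modes([5, 5, 5, 6, 8, 10, 10, 10]))  # [5, 10]
```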
Standard deviation
• The standard deviation, s, measures a kind of “average” deviation about the
mean. It is not really the “average” deviation, even though we may think of
it that way.
• Why can’t we simply compute the average deviation about the mean, if
that’s what we want?
• If you take a simple mean, and then add up the deviations about the mean,
as above, this sum will be equal to 0. Therefore, a measure of “average
deviation” will not work.
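A quick numerical check of this point, reusing the observations 1, 2, 2, 4, 5, 10 from the mean example (the variable names are illustrative):

```python
data = [1, 2, 2, 4, 5, 10]
mean = sum(data) / len(data)

# Deviation of each observation about the mean
deviations = [x - mean for x in data]
print(deviations)       # [-3.0, -2.0, -2.0, 0.0, 1.0, 6.0]
print(sum(deviations))  # 0.0
```

The negative and positive deviations cancel exactly, so their average carries no information about spread.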
Standard Deviation
• Instead, we use: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
• This is the “definitional formula” for standard deviation.
• The standard deviation has lots of nice properties, including:
• By squaring the deviation, we eliminate the problem of the deviations
summing to zero.
• In addition, this sum is a minimum. No other value subtracted from X and
squared will result in a smaller sum of the deviation squared. This is called
the “least squares property.”
• Note we divide by (n-1), not n. This will be referred to as a loss of
one degree of freedom.
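A small sketch of these properties, assuming the sample standard deviation is the square root of the sum of squared deviations divided by n − 1. It reuses the observations 1, 2, 2, 4, 5, 10. The helper name `sum_sq_about` and the trial values are illustrative.

```python
import math
import statistics

data = [1, 2, 2, 4, 5, 10]
n = len(data)
mean = sum(data) / n


def sum_sq_about(c):
    """Sum of squared deviations of the data about an arbitrary value c."""
    return sum((x - c) ** 2 for x in data)


# Sample standard deviation: divide by n - 1, not n
s = math.sqrt(sum_sq_about(mean) / (n - 1))
print(round(s, 4))

# Least squares property: no other value gives a smaller sum than the mean
for c in [3, 3.5, 4.5, 5]:
    assert sum_sq_about(mean) < sum_sq_about(c)

# Cross-check against the standard library (which also divides by n - 1)
assert abs(s - statistics.stdev(data)) < 1e-12
```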
Variance
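The variance, $s^2$, is the standard deviation (s) squared.
Conversely, $s = \sqrt{s^2}$.
Definitional formula: $s^2 = \dfrac{\sum (X_i - \bar{X})^2}{n-1}$
Computational formula: $s^2 = \dfrac{\sum X_i^2 - (\sum X_i)^2 / n}{n-1}$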
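A sketch comparing the two forms on the observations 1, 2, 2, 4, 5, 10, assuming the usual sample-variance formulas. The definitional form sums squared deviations about the mean. The computational form uses only the sum of the values and the sum of their squares. Both divide by n − 1. The variable names are illustrative.

```python
data = [1, 2, 2, 4, 5, 10]
n = len(data)
mean = sum(data) / n

# Definitional form: squared deviations about the mean
var_def = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational form: sum of squares minus (sum squared) / n
var_comp = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(var_def, var_comp)  # 10.8 10.8
```

The computational form avoids computing each deviation, which is convenient by hand, though it can lose precision in floating point when the mean is large relative to the spread.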
Thank you for your attention