CURVE FITTING
Describes techniques to fit curves (curve fitting) to discrete
data to obtain intermediate estimates.
There are two general approaches for curve fitting:
• Least Squares regression:
Data exhibit a significant degree of scatter. The strategy is
to derive a single curve that represents the general trend
of the data.
• Interpolation:
Data are very precise. The strategy is to pass a curve or a
series of curves through each of the points.
Introduction
In engineering, two types of applications are
encountered:
– Trend analysis. Predicting values of the dependent
variable; may include extrapolation beyond the data
points or interpolation between data points.
– Hypothesis testing. Comparing existing
mathematical model with measured data.
Mathematical Background
• Arithmetic mean. The sum of the individual data
points (yi) divided by the number of points (n).
$$\bar{y} = \frac{\sum y_i}{n}, \qquad i = 1, \ldots, n$$
• Standard deviation. The most common measure of spread for a sample:
$$s_y = \sqrt{\frac{S_t}{n-1}}, \qquad S_t = \sum (y_i - \bar{y})^2$$
Mathematical Background (cont’d)
• Variance. Representation of spread by the square of
the standard deviation:
$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} \quad \text{or} \quad s_y^2 = \frac{\sum y_i^2 - \left(\sum y_i\right)^2 / n}{n-1}$$
• Coefficient of variation. Quantifies the spread of the
data relative to the mean:
$$\text{c.v.} = \frac{s_y}{\bar{y}} \times 100\%$$
Least Squares Regression
Linear Regression
Fitting a straight line to a set of paired
observations: (x1, y1), (x2, y2),…,(xn, yn).
$$y = a_0 + a_1 x + e$$
a1 - slope
a0 - intercept
e - error, or residual, between the model and
the observations
Linear Regression: Residual
Linear Regression: Question
How to find a0 and a1 so that the error would be
minimum?
Linear Regression: Criteria for a “Best” Fit
$$\min \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)$$
This criterion is inadequate: positive and negative errors cancel (for example, $e_1 = -e_2$ sums to zero for a poor fit).
Linear Regression: Criteria for a “Best” Fit
$$\min \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|$$
Linear Regression: Criteria for a “Best” Fit
$$\min \max_i |e_i|, \qquad |e_i| = |y_i - a_0 - a_1 x_i|$$
Linear Regression: Least Squares Fit
The least-squares criterion minimizes the sum of the squares of the residuals:
$$\min S_r, \qquad S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_{i,\text{measured}} - y_{i,\text{model}}\right)^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
Yields a unique line for a given set of data.
Linear Regression: Least Squares Fit
$$\min S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
The coefficients a0 and a1 that minimize Sr must satisfy
the following conditions:
$$\frac{\partial S_r}{\partial a_0} = 0, \qquad \frac{\partial S_r}{\partial a_1} = 0$$
Linear Regression:
Determination of ao and a1
$$\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i) = 0$$
$$\frac{\partial S_r}{\partial a_1} = -2 \sum (y_i - a_0 - a_1 x_i)\, x_i = 0$$
Expanding and noting that $\sum a_0 = n a_0$:
$$0 = \sum y_i - n a_0 - a_1 \sum x_i$$
$$0 = \sum x_i y_i - a_0 \sum x_i - a_1 \sum x_i^2$$
which gives the normal equations:
$$n a_0 + \left(\sum x_i\right) a_1 = \sum y_i$$
$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i$$
Two equations with two unknowns, which can be solved simultaneously.
Linear Regression:
Determination of ao and a1
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
$$a_0 = \bar{y} - a_1 \bar{x}$$
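As a minimal sketch of these closed-form formulas (in Python; the function name fit_line and the variable names are ours, not from the lecture):

```python
def fit_line(x, y):
    """Intercept a0 and slope a1 of the least-squares straight line."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = sy / n - a1 * sx / n          # a0 = y_bar - a1 * x_bar
    return a0, a1

# Data from the worked example that follows:
print(fit_line([1, 2, 3, 4, 5, 6, 7], [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]))
# -> (0.0714285..., 0.8392857...)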
Error Quantification of Linear Regression
• The total sum of the squares around the mean for the
dependent variable, y, is St:
$$S_t = \sum (y_i - \bar{y})^2$$
• The sum of the squares of the residuals around the
regression line is Sr:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
Error Quantification of Linear Regression
• St − Sr quantifies the improvement, or error
reduction, due to describing the data in terms of a
straight line rather than as an average value.
$$r^2 = \frac{S_t - S_r}{S_t}$$
r²: coefficient of determination
r: correlation coefficient
Error Quantification of Linear Regression
For a perfect fit:
• Sr = 0 and r = r² = 1, signifying that the line
explains 100 percent of the variability of the
data.
• For r = r² = 0, Sr = St and the fit represents no
improvement over the mean.
Least Squares Fit of a Straight Line:
Example
Fit a straight line to the x and y values in the
following Table:
xi     yi     xi·yi    xi²
1      0.5    0.5      1
2      2.5    5        4
3      2      6        9
4      4      16       16
5      3.5    17.5     25
6      6      36       36
7      5.5    38.5     49
Σ 28   24     119.5    140

$$\sum x_i = 28, \quad \sum y_i = 24.0, \quad \sum x_i y_i = 119.5, \quad \sum x_i^2 = 140$$
Least Squares Fit of a Straight Line: Example
(cont’d)
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \frac{7(119.5) - 28(24)}{7(140) - 28^2} = 0.8392857$$
$$a_0 = \bar{y} - a_1 \bar{x} = 3.428571 - 0.8392857(4) = 0.07142857$$
$$y = 0.07142857 + 0.8392857\,x$$
Least Squares Fit of a Straight Line: Example
(Error Analysis)
xi     yi     (yi − ȳ)²   ei² = (yi − ŷi)²
1      0.5    8.5765      0.1687
2      2.5    0.8622      0.5625
3      2.0    2.0408      0.3473
4      4.0    0.3265      0.3265
5      3.5    0.0051      0.5896
6      6.0    6.6122      0.7972
7      5.5    4.2908      0.1993
Σ 28   24.0   22.7143     2.9911

$$S_t = \sum (y_i - \bar{y})^2 = 22.7143, \qquad S_r = \sum e_i^2 = 2.9911$$
$$r^2 = \frac{S_t - S_r}{S_t} = \frac{22.7143 - 2.9911}{22.7143} = 0.868, \qquad r = \sqrt{0.868} = 0.932$$
Least Squares Fit of a Straight Line:
Example (Error Analysis)
• The standard deviation (quantifies the spread around the mean):
$$s_y = \sqrt{\frac{S_t}{n-1}} = \sqrt{\frac{22.7143}{7-1}} = 1.9457$$
• The standard error of estimate (quantifies the spread around the
regression line):
$$s_{y/x} = \sqrt{\frac{S_r}{n-2}} = \sqrt{\frac{2.9911}{7-2}} = 0.7735$$
Because $s_{y/x} < s_y$, the linear regression model has merit: the line describes the data better than the mean does.
Algorithm for linear regression
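The algorithm's flowchart is not reproduced here; the following Python routine is a possible rendering of the same steps (accumulate the sums, compute the slope and intercept, then the error statistics), not the lecture's own code:

```python
import math

def linear_regression(x, y):
    """Least-squares line y = a0 + a1*x plus the error measures above.
    Returns (a0, a1, s_yx, r2)."""
    n = len(x)
    # Step 1: accumulate the sums.
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Step 2: slope and intercept from the normal equations.
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = sy / n - a1 * sx / n
    # Step 3: error statistics.
    ybar = sy / n
    st = sum((yi - ybar) ** 2 for yi in y)                      # around the mean
    sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))  # around the line
    s_yx = math.sqrt(sr / (n - 2))   # standard error of estimate
    r2 = (st - sr) / st              # coefficient of determination
    return a0, a1, s_yx, r2

# Worked example above: s_yx ≈ 0.773, r2 ≈ 0.868
print(linear_regression([1, 2, 3, 4, 5, 6, 7], [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]))
```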
Linearization of Nonlinear Relationships
• Linear regression assumes that the relationship between
the dependent and independent variables is linear.
• However, a few types of nonlinear functions
can be transformed into linear regression
problems:
The exponential equation.
The power equation.
The saturation-growth-rate equation.
Linearization of Nonlinear Relationships
1. The exponential equation:
$$y = a_1 e^{b_1 x}$$
$$\ln y = \ln a_1 + b_1 x$$
which has the linear form $y^* = a_0 + a_1 x$ with $y^* = \ln y$.
Linearization of Nonlinear Relationships
2. The power equation:
$$y = a_2 x^{b_2}$$
$$\log y = \log a_2 + b_2 \log x$$
which has the linear form $y^* = a_0 + a_1 x^*$ with $y^* = \log y$ and $x^* = \log x$.
Linearization of Nonlinear Relationships
3. The saturation-growth-rate equation:
$$y = a_3 \frac{x}{b_3 + x}$$
$$\frac{1}{y} = \frac{1}{a_3} + \frac{b_3}{a_3} \cdot \frac{1}{x}$$
which has the linear form $y^* = a_0 + a_1 x^*$ with $y^* = 1/y$, $x^* = 1/x$, $a_0 = 1/a_3$, and $a_1 = b_3/a_3$.
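As a sketch of how the third transform is used in practice (the data below are synthetic, generated from a3 = 1.0, b3 = 1.5 for illustration; they are not from the lecture):

```python
import numpy as np

# Hypothetical data assumed to follow y = a3 * x / (b3 + x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.40, 0.57, 0.66, 0.73, 0.76])

# Regress y* = 1/y on x* = 1/x: intercept = 1/a3, slope = b3/a3.
slope, intercept = np.polyfit(1 / x, 1 / y, 1)
a3 = 1 / intercept
b3 = slope * a3
print(a3, b3)   # ≈ 0.99 and ≈ 1.47, close to the generating values (1.0, 1.5)
```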
Example
Fit the equation
$$y = a_2 x^{b_2}$$
to the data in the following table. Taking logarithms:
$$\log y = \log(a_2 x^{b_2}) = \log a_2 + b_2 \log x$$
Let $Y^* = \log y$, $X^* = \log x$, $a_0 = \log a_2$, $a_1 = b_2$; then
$$Y^* = a_0 + a_1 X^*$$

xi     yi     X* = log xi   Y* = log yi
1      0.5    0             −0.301
2      1.7    0.301         0.226
3      3.4    0.477         0.534
4      5.7    0.602         0.753
5      8.4    0.699         0.922
Σ 15   19.7   2.079         2.141
Example
Xi     Yi      X*i = log Xi   Y*i = log Yi   X*·Y*    X*²
1      0.5     0.0000         −0.3010        0.0000   0.0000
2      1.7     0.3010         0.2304         0.0694   0.0906
3      3.4     0.4771         0.5315         0.2536   0.2276
4      5.7     0.6021         0.7559         0.4551   0.3625
5      8.4     0.6990         0.9243         0.6460   0.4886
Sum    15      2.079          2.141          1.424    1.169

$$a_1 = \frac{n \sum X^* Y^* - \sum X^* \sum Y^*}{n \sum X^{*2} - \left(\sum X^*\right)^2} = \frac{5(1.424) - 2.079(2.141)}{5(1.169) - (2.079)^2} = 1.75$$
$$a_0 = \bar{Y}^* - a_1 \bar{X}^* = 0.4282 - 1.75(0.41584) = -0.300$$
Linearization of Nonlinear
Functions: Example (cont'd)
$$\log y = -0.300 + 1.75 \log x$$
$$y = 10^{-0.300}\, x^{1.75} = 0.5\, x^{1.75}$$
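The same fit can be checked numerically. The sketch below uses numpy's polyfit on the log-transformed data (a convenience of ours, not the lecture's own code):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0.5, 1.7, 3.4, 5.7, 8.4])

# Linear fit in log-log space: slope = b2, intercept = log10(a2).
b2, log_a2 = np.polyfit(np.log10(x), np.log10(y), 1)
a2 = 10 ** log_a2
print(a2, b2)   # ≈ 0.50 and ≈ 1.75, matching the worked example
```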
Polynomial Regression
• Some engineering data is poorly represented
by a straight line.
• For these cases a curve is better suited to fit
the data.
• The least squares method can readily be
extended to fit the data to higher order
polynomials.
Polynomial Regression (cont’d)
A parabola is preferable
Polynomial Regression (cont’d)
• A 2nd order polynomial (quadratic) is defined by:
$$y = a_0 + a_1 x + a_2 x^2 + e$$
• The residuals between the model and the data:
$$e_i = y_i - a_0 - a_1 x_i - a_2 x_i^2$$
• The sum of the squares of the residuals:
$$S_r = \sum e_i^2 = \sum \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)^2$$
Polynomial Regression (cont’d)
$$\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2) = 0$$
$$\frac{\partial S_r}{\partial a_1} = -2 \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i = 0$$
$$\frac{\partial S_r}{\partial a_2} = -2 \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2)\, x_i^2 = 0$$
These yield the normal equations:
$$n a_0 + \left(\sum x_i\right) a_1 + \left(\sum x_i^2\right) a_2 = \sum y_i$$
$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 + \left(\sum x_i^3\right) a_2 = \sum x_i y_i$$
$$\left(\sum x_i^2\right) a_0 + \left(\sum x_i^3\right) a_1 + \left(\sum x_i^4\right) a_2 = \sum x_i^2 y_i$$
Three linear equations with three unknowns (a0, a1, a2), which can be solved simultaneously.
Polynomial Regression (cont’d)
• A 3×3 system of equations needs to be solved to determine the
coefficients of the polynomial:
$$\begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}$$
• The standard error and the coefficient of determination:
$$s_{y/x} = \sqrt{\frac{S_r}{n-3}}, \qquad r^2 = \frac{S_t - S_r}{S_t}$$
Polynomial Regression (cont’d)
General: the mth-order polynomial
$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m + e$$
• A system of (m+1)×(m+1) linear equations must be solved to
determine the coefficients of the mth-order polynomial.
• The standard error:
$$s_{y/x} = \sqrt{\frac{S_r}{n - (m+1)}}$$
• The coefficient of determination:
$$r^2 = \frac{S_t - S_r}{S_t}$$
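A sketch of the general procedure (the function name and structure are ours): assemble the (m+1)×(m+1) normal equations from the sums of powers of x and solve the linear system.

```python
import numpy as np

def polyfit_normal(x, y, m):
    """Fit an mth-order polynomial by solving the (m+1)x(m+1) normal
    equations. Returns coefficients [a0, a1, ..., am]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A[j][k] = sum(x**(j+k));  b[j] = sum(x**j * y)
    A = np.array([[np.sum(x ** (j + k)) for k in range(m + 1)]
                  for j in range(m + 1)])
    b = np.array([np.sum((x ** j) * y) for j in range(m + 1)])
    return np.linalg.solve(A, b)

# Example from the next slides:
print(polyfit_normal([0, 1, 2, 3, 4, 5], [2.1, 7.7, 13.6, 27.2, 40.9, 61.1], 2))
# -> ≈ [2.47857, 2.35929, 1.86071]
```

Note that the normal-equation matrix becomes ill-conditioned as m grows; in practice a library routine such as np.polyfit, which solves the least-squares problem directly, is numerically safer.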
Polynomial Regression- Example
Fit a second order polynomial to data:
xi     yi      xi²   xi³   xi⁴   xi·yi   xi²·yi
0      2.1     0     0     0     0       0
1      7.7     1     1     1     7.7     7.7
2      13.6    4     8     16    27.2    54.4
3      27.2    9     27    81    81.6    244.8
4      40.9    16    64    256   163.6   654.4
5      61.1    25    125   625   305.5   1527.5
Σ 15   152.6   55    225   979   585.6   2488.8

$$\sum x_i = 15, \; \sum y_i = 152.6, \; \sum x_i^2 = 55, \; \sum x_i^3 = 225, \; \sum x_i^4 = 979, \; \sum x_i y_i = 585.6, \; \sum x_i^2 y_i = 2488.8$$
Polynomial Regression- Example (cont’d)
• The system of simultaneous linear equations:
$$\begin{bmatrix} 6 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 152.6 \\ 585.6 \\ 2488.8 \end{bmatrix}$$
$$a_0 = 2.47857, \quad a_1 = 2.35929, \quad a_2 = 1.86071$$
$$y = 2.47857 + 2.35929\,x + 1.86071\,x^2$$
$$S_r = \sum e_i^2 = 3.74657, \qquad S_t = \sum (y_i - \bar{y})^2 = 2513.39$$
Polynomial Regression- Example (cont’d)
xi     yi      y_model   ei²       (yi − ȳ)²
0      2.1     2.4786    0.14332   544.42889
1      7.7     6.6986    1.00286   314.45929
2      13.6    14.6400   1.08158   140.01989
3      27.2    26.3030   0.80491   3.12229
4      40.9    41.6870   0.61951   239.22809
5      61.1    60.7930   0.09439   1272.13489
Σ 15   152.6             3.74657   2513.39333

• The standard error of estimate:
$$s_{y/x} = \sqrt{\frac{3.74657}{6-3}} = 1.12$$
• The coefficient of determination:
$$r^2 = \frac{2513.39 - 3.74657}{2513.39} = 0.99851, \qquad r = \sqrt{r^2} = 0.99925$$
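These numbers can be cross-checked with numpy's built-in fit (a convenience of ours, not the lecture's code; polyfit returns the highest power first, hence the reversal):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 7.7, 13.6, 27.2, 40.9, 61.1])

a2, a1, a0 = np.polyfit(x, y, 2)
print(a0, a1, a2)                    # ≈ 2.47857, 2.35929, 1.86071

resid = y - (a0 + a1 * x + a2 * x ** 2)
sr = np.sum(resid ** 2)              # ≈ 3.74657
st = np.sum((y - y.mean()) ** 2)     # ≈ 2513.39
s_yx = np.sqrt(sr / (len(x) - 3))    # ≈ 1.12   (n - (m+1) = 6 - 3)
r2 = (st - sr) / st                  # ≈ 0.99851
print(s_yx, r2)
```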
Using the Regression Equation
• Before using the regression model, we
need to assess how well it fits the data.
• If we are satisfied with how well the
model fits the data, we can use it to
predict the values of y.
• To make a prediction we use
– Point prediction, and
– Interval prediction
Point Prediction
• Example
– Predict the selling price of a three-year-old
Taurus with 40,000 miles on the odometer.
– A point prediction:
$$\hat{y} = 17{,}067 - 0.0623x = 17{,}067 - 0.0623(40{,}000) = 14{,}575$$
– It is predicted that a car with 40,000 miles on the odometer would sell
for $14,575.
– How close is this prediction to the real price?
Interval Estimates
• Two intervals can be used to discover how closely
the predicted value will match the true value of y.
– Prediction interval – predicts y for a given value of x,
– Confidence interval – estimates the average y for a given x.
– The prediction interval:
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
– The confidence interval:
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
Interval Estimates: Example
• Example - continued
– Provide an interval estimate for the bidding price
on a Ford Taurus with 40,000 miles on the
odometer.
– Two types of predictions are required:
• A prediction for a specific car
• An estimate for the average price per car
Interval Estimates: Example (cont'd)
• Solution
– A prediction interval provides the price
estimate for a single car:
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
With $t_{0.025,\,98} \approx 1.984$:
$$[17{,}067 - 0.0623(40{,}000)] \pm 1.984(303.1)\sqrt{1 + \frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}310}} = 14{,}575 \pm 605$$
Interval Estimates: Example (cont'd)
• Solution – continued
– A confidence interval provides the estimate of
the mean price per car for a Ford Taurus with
40,000 miles reading on the odometer.
• The 95% confidence interval:
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$[17{,}067 - 0.0623(40{,}000)] \pm 1.984(303.1)\sqrt{\frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}310}} = 14{,}575 \pm 70$$
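Both intervals can be computed directly from the summary statistics given on these slides (n = 100, x̄ = 36,009, Σ(xi − x̄)² = 4,309,340,310, s = 303.1). The sketch below is ours, using scipy for the t critical value:

```python
import math
from scipy import stats

# Summary statistics from the Ford Taurus example above.
n = 100
xbar = 36_009.0
sxx = 4_309_340_310.0        # sum of (x_i - xbar)^2
s = 303.1                    # standard error of estimate
b0, b1 = 17_067.0, -0.0623   # fitted line: price = 17067 - 0.0623 * odometer

xg = 40_000.0
yhat = b0 + b1 * xg
t = stats.t.ppf(0.975, df=n - 2)                # ≈ 1.984

common = 1 / n + (xg - xbar) ** 2 / sxx
pred_half = t * s * math.sqrt(1 + common)       # prediction interval half-width
conf_half = t * s * math.sqrt(common)           # confidence interval half-width
print(yhat, pred_half, conf_half)               # ≈ 14575, ±605, ±70
```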
The effect of the given xg on the
length of the interval
– As $x_g$ moves away from $\bar{x}$, the interval becomes longer;
the shortest interval is found at $x_g = \bar{x}$.
$$\hat{y} = b_0 + b_1 x_g, \qquad \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1) s_x^2}}$$
– For example, at $x_g = \bar{x} \pm 1$ the interval is
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{1^2}{(n-1) s_x^2}}$$
while at $x_g = \bar{x} \pm 2$ it widens to
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{2^2}{(n-1) s_x^2}}$$
Regression Diagnostics - I
• The three conditions required for the validity
of the regression analysis are:
– the error variable is normally distributed,
– the error variance is constant for all values of x,
– the errors are independent of each other.
• How can we diagnose violations of these
conditions?
Residual Analysis
• Examining the residuals (or standardized
residuals) helps detect violations of the
required conditions.
• Example – continued:
– Nonnormality.
• Use Excel to obtain the standardized residual
histogram.
• Examine the histogram and look for a bell-shaped
distribution with a mean close to zero.
Residual Analysis
Observation   Predicted Price   Residuals   Standard Residuals
1             14736.91          -100.91     -0.33
2             14277.65          -155.65     -0.52
3             14210.66          -194.66     -0.65
4             15143.59           446.41      1.48
5             15091.05           476.95      1.58
(A partial list of standardized residuals.)

For each residual we calculate the standard deviation as follows:
$$s_{r_i} = s \sqrt{1 - h_i}, \qquad h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1) s_x^2}$$
Standardized residual $i$ = residual $i$ / standard deviation $s_{r_i}$.
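A sketch of this computation (the function and variable names are ours; s is the standard error of estimate of the regression):

```python
import numpy as np

def standardized_residuals(x, y, y_pred, s):
    """Standardized residual i = residual i / (s * sqrt(1 - h_i)), with
    leverage h_i = 1/n + (x_i - xbar)^2 / ((n - 1) * s_x^2) as above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # (n - 1) * s_x^2 equals the sum of (x_i - xbar)^2.
    h = 1.0 / n + (x - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
    return (np.asarray(y) - np.asarray(y_pred)) / (s * np.sqrt(1 - h))
```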
Residual Analysis
[Histogram of the standardized residuals, with bins from −2 to 2 and beyond.]
It seems the residuals are normally distributed with a mean of zero.
Heteroscedasticity
• When the requirement of a constant variance is violated
we have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residuals
against the predicted values ŷ.
[Figure: residuals plotted against ŷ; the spread of the residuals increases with ŷ.]
Homoscedasticity
• When the requirement of a constant variance is not violated we have
a condition of homoscedasticity.
• Example - continued
[Figure: residuals plotted against the predicted price (13,500 to 16,000); the residuals scatter evenly between −1000 and +1000 with no visible pattern.]
Non-Independence of the Error Variables
– A time series is constituted if data were collected
over time.
– Examining the residuals over time, no pattern
should be observed if the errors are independent.
– When a pattern is detected, the errors are said to
be autocorrelated.
– Autocorrelation can be detected by graphing the
residuals against time.
Non-Independence of the Error Variables
Patterns in the appearance of the residuals over
time indicate that autocorrelation exists.
[Figure: two residual-versus-time plots. Left: note the runs of positive residuals replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]
Outliers
• An outlier is an observation that is unusually small or
large.
• Several possibilities need to be investigated when an
outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier
if its |standard residual| > 2
An outlier and an influential observation
[Figure: left, an outlier causes a shift in the regression line; right, some outliers may be very influential.]
Procedure for Regression
Diagnostics…
• Develop a model that has a theoretical basis.
• Gather data for the two variables in the model.
• Draw the scatter diagram to determine whether a linear
model appears to be appropriate.
• Determine the regression equation.
• Check the required conditions for the errors.
• Check for the existence of outliers and influential observations.
• Assess the model fit.
• If the model fits the data, use the regression
equation.