Linear Regression - Boston Housing Price Prediction

EAI工程笔记

Last modified: 2024-04-23 16:08:41


License: CC 4.0 BY-SA

Category: SKLearn. Tags: linear regression, python, machine learning, sklearn

First published: 2023-02-25 10:13:17

Article link: https://blog.csdn.net/lovechris00/article/details/129211751


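This article describes the features of the Boston housing dataset and uses sklearn to process the data: loading, inspecting, splitting, and standardizing it. It then trains two models, ordinary linear regression (LinearRegression) and stochastic gradient descent regression (SGDRegressor), and computes the mean squared error of their predictions.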


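Project Description

Boston Dataset

Because of ethical concerns about race (feature B is derived from the proportion of Black residents), the Boston housing dataset was deprecated in scikit-learn 1.0 and removed in version 1.2.
This article therefore uses an older version of sklearn: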

!pip3 install scikit-learn==0.24.1

load_boston has been removed from scikit-learn since version 1.2.
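One way to make this version requirement explicit is to check the version string before attempting the import. The sketch below is only an illustration: `has_load_boston` is a hypothetical helper, not part of scikit-learn, and it assumes a plain numeric `major.minor` prefix in the version string.

```python
def has_load_boston(version: str) -> bool:
    """Return True if this scikit-learn version string predates 1.2.

    Hypothetical helper for illustration; it compares major.minor only.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 2)


# In a real session you could pass sklearn.__version__ instead of a literal.
print(has_load_boston("0.24.1"))  # True
print(has_load_boston("1.2.0"))   # False
```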

The dataset contains 506 samples. Its attributes (13 features plus the target MEDV) are:

CRIM: crime rate; per capita crime rate by town
ZN: share of large residential lots; proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: share of non-retail business land; proportion of non-retail business acres per town
CHAS: whether the tract borders the Charles River; Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration; nitric oxides concentration (parts per 10 million)
RM: number of rooms; average number of rooms per dwelling
AGE: share of older housing; proportion of owner-occupied units built prior to 1940
DIS: distance to employment centres; weighted distances to five Boston employment centres
RAD: highway accessibility; index of accessibility to radial highways
TAX: property tax rate; full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio; pupil-teacher ratio by town
B: statistic computed from the proportion of Black residents; 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
LSTAT: share of lower-income population; % lower status of the population
MEDV: median house price (the target); Median value of owner-occupied homes in $1000's

Implementation

Data Processing

Loading and Inspecting the Data

from sklearn.datasets import load_boston

data = load_boston()
data

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
          4.9800e+00],
         [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
          9.1400e+00],
         [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
          4.0300e+00],
         ..., 
         [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
          7.8800e+00]]),
          
  'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
         18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6, 
         ...
         23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9]),
  'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
         'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),
  'DESCR': ".. _boston_dataset:
 
 Boston house prices dataset
 ---------------------------
 
 **Data Set Characteristics:**  
 
     :Number of Instances: 506 
 
     :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
 
     :Attribute Information (in order):
         - CRIM     per capita crime rate by town
         - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
         - INDUS    proportion of non-retail business acres per town
         - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
         - NOX      nitric oxides concentration (parts per 10 million)
         - RM       average number of rooms per dwelling
         - AGE      proportion of owner-occupied units built prior to 1940
         - DIS      weighted distances to five Boston employment centres
         - RAD      index of accessibility to radial highways
         - TAX      full-value property-tax rate per $10,000
         - PTRATIO  pupil-teacher ratio by town
         - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
         - LSTAT    % lower status of the population
         - MEDV     Median value of owner-occupied homes in $1000's
 
     :Missing Attribute Values: None
 
     :Creator: Harrison, D. and Rubinfeld, D.L.
 
 This is a copy of UCI ML housing dataset.
 https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
 
 This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
 
 The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
 prices and the demand for clean air', J. Environ. Economics & Management,
 vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
 ...', Wiley, 1980.   N.B. Various transformations are used in the table on
 pages 244-261 of the latter.
 
 The Boston house-price data has been used in many machine learning papers that address regression
 problems.   
      
 .. topic:: References
 
    - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
    - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
 ",
  'filename': '/Users/xx/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data/boston_house_prices.csv'}

# View the dataset description
data.DESCR

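data is an instance of sklearn.utils.Bunch, a class that inherits from dict, so its fields can be read either as keys or as attributes.
In this sklearn version the class is defined in sklearn/utils/__init__.py.

type(data) # sklearn.utils.Bunch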
list(data.keys())   # ['data', 'target', 'feature_names', 'DESCR', 'filename']
len(data.data)  # 506
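To illustrate why both `data["target"]` and `data.target` work on such an object, here is a minimal sketch of a dict subclass with attribute access. `MiniBunch` is a simplified, hypothetical stand-in written for this note; it is not the actual sklearn implementation.

```python
class MiniBunch(dict):
    """Simplified, hypothetical stand-in for sklearn.utils.Bunch."""

    def __getattr__(self, key):
        # __getattr__ runs only when normal attribute lookup fails,
        # so dict methods such as keys() keep working as usual.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


toy = MiniBunch(data=[[1.0, 2.0], [3.0, 4.0]], target=[10.0, 20.0])
print(list(toy.keys()))             # ['data', 'target']
print(toy.target is toy["target"])  # True
```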

# data.data
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00], 
       ..., 
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

Splitting the Data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

len(X_train), len(X_test), len(y_train), len(y_test)
# (379, 127, 379, 127)

len(X_train)/ len(X_test) # 2.984251968503937
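The 379/127 split follows from the default `test_size=0.25` of `train_test_split`, with the test count rounded up (0.25 × 506 = 126.5, so 127). The sketch below reproduces the same sizes with NumPy only. `simple_split` is a hypothetical helper and the arrays are placeholder values, not the Boston data; the rows it selects will not match sklearn's shuffling.

```python
import math

import numpy as np


def simple_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices and split them (hypothetical stand-in for train_test_split)."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    order = np.random.default_rng(seed).permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Placeholder arrays with the same shape as the Boston data (506 rows, 13 columns).
X_toy = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_toy = np.arange(506, dtype=float)

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(len(X_tr), len(X_te))  # 379 127
```

Because no `random_state` is passed in the article's call, the rows that land in each set, and therefore all numbers further down, change from run to run.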

Standardization

from sklearn.preprocessing import StandardScaler


# Standardize the features (zero mean and unit variance per column)
sds = StandardScaler()

x_train = sds.fit_transform(X_train)
x_test = sds.transform(X_test)

# x_train

array([[-0.40503224,  0.02669292, -0.73056566, ...,  0.22105105,
         0.42774266, -0.52050075],
       [ 1.94694906, -0.49157634,  1.01370896, ...,  0.81392993,
        -3.69945075,  3.11194771],
       [-0.41724734, -0.49157634, -0.97415513, ...,  0.17544498,
         0.30149725, -0.27341473],
       ..., 
       [ 0.60505563, -0.49157634,  1.01370896, ...,  0.81392993,
         0.42774266,  1.19206096]])
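What `fit_transform` on the training set followed by `transform` on the test set amounts to can be sketched in plain NumPy: the column means and standard deviations are learned from the training rows only and then reused for the test rows. The data below is synthetic and the variable names are chosen just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (not the Boston values): two features on very different scales.
X_train_toy = rng.normal(loc=[5.0, 300.0], scale=[2.0, 50.0], size=(100, 2))
X_test_toy = rng.normal(loc=[5.0, 300.0], scale=[2.0, 50.0], size=(30, 2))

# "fit": per-column mean and population standard deviation (ddof=0), training set only
mean_ = X_train_toy.mean(axis=0)
scale_ = X_train_toy.std(axis=0)

# "transform": apply the same training statistics to both sets
x_train_toy = (X_train_toy - mean_) / scale_
x_test_toy = (X_test_toy - mean_) / scale_

print(np.allclose(x_train_toy.mean(axis=0), 0.0))  # True
print(np.allclose(x_train_toy.std(axis=0), 1.0))   # True
```

The test set is deliberately not re-fitted, so its columns will only be approximately centred; this keeps information from the test rows out of the preprocessing step.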

y_test = y_test.reshape(-1,1)
y_train = y_train.reshape(-1,1)

# y_test
array([[22.3],
        [17.4],
        [27.1],
        [22. ],
        ...
        [10.9]])

sds_y = StandardScaler()
y_train = sds_y.fit_transform(y_train)
y_test = sds_y.transform(y_test)
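Scaling the target means every prediction later has to be mapped back to the original units with `inverse_transform`. A minimal NumPy sketch of that round trip, using the first five target values shown earlier purely as sample input:

```python
import numpy as np

# First five target values from the output above, as a column vector.
y_toy = np.array([24.0, 21.6, 34.7, 33.4, 36.2]).reshape(-1, 1)

mu, sigma = y_toy.mean(), y_toy.std()  # what the scaler learns in fit
z = (y_toy - mu) / sigma               # transform: standardized target
y_back = z * sigma + mu                # inverse_transform: back to $1000's

print(np.allclose(y_back, y_toy))  # True
```

Standardizing the target is optional for LinearRegression; it is done here so that both models are trained on the same scaled values.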

Training the Models

Method 1: LinearRegression

from sklearn.linear_model import LinearRegression,SGDRegressor
lr = LinearRegression()
lr.fit(x_train,y_train) # LinearRegression()

# Weights estimated by linear regression; the array has shape (n_targets, n_features)
lr.coef_

array([[-0.0612378 ,  0.16416119,  0.00767045,  0.09201928, -0.22140224,
         0.23731323,  0.02417785, -0.34593363,  0.2620663 , -0.18835647,
        -0.2258351 ,  0.08609841, -0.46284107]])
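LinearRegression fits ordinary least squares. As a rough sketch of the same idea (not necessarily the solver sklearn uses internally), the weights and intercept can be recovered with `np.linalg.lstsq` by appending a column of ones to the features. The data and the "true" coefficients below are synthetic assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_w = np.array([0.5, -1.0, 2.0])  # assumed coefficients for the synthetic data
true_b = 0.3
y_toy = X_toy @ true_w + true_b      # noise-free, so recovery is exact up to rounding

# Append a column of ones so the last solved coefficient plays the role of the intercept.
A = np.hstack([X_toy, np.ones((len(X_toy), 1))])
solution, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

print(np.allclose(w_hat, true_w), np.allclose(b_hat, true_b))  # True True
```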

y_predict = lr.predict(x_test)

# y_predict
array([[ 0.52472231],
        [-0.59075883],
        [ 0.43991597],
        [ 0.49699826],
        ...
        [-0.80045313]])

y_predict_lr = sds_y.inverse_transform(y_predict)

# y_predict_lr 
    array([[27.56580687],
           [17.29643043],
           [26.78506015],
           ...
           [15.36593638]])

Method 2: SGDRegressor

# SGD
sgd = SGDRegressor()
# SGDRegressor expects y with shape (n_samples,). Passing the (n_samples, 1) column vector
# directly triggers a DataConversionWarning, so flatten it with ravel().
sgd.fit(x_train, y_train.ravel())  # SGDRegressor()

sgd.coef_

array([-0.04768824,  0.13608779, -0.02603741,  0.09983776, -0.17376075,
        0.26927606,  0.00684786, -0.31192725,  0.16107088, -0.09118234,
       -0.21531016,  0.09122253, -0.44832677])
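The coefficients are close to, but not identical to, those from LinearRegression because SGD approaches the solution iteratively. The sketch below shows the basic per-sample update for squared loss on synthetic data. It is a simplification: `sgd_linear_regression` is a hypothetical function, it uses a constant learning rate, and it omits the L2 penalty and learning-rate schedule that SGDRegressor applies by default.

```python
import numpy as np


def sgd_linear_regression(X, y, lr=0.01, epochs=100, seed=0):
    """Plain SGD for squared loss (hypothetical sketch, constant learning rate)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit samples in a fresh random order
            error = X[i] @ w + b - y[i]    # prediction error on one sample
            w -= lr * error * X[i]         # gradient step for the weights
            b -= lr * error                # gradient step for the intercept
    return w, b


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
true_w = np.array([0.5, -1.0, 2.0])  # assumed coefficients for the synthetic data
y_toy = X_toy @ true_w + 0.3         # noise-free target

w_hat, b_hat = sgd_linear_regression(X_toy, y_toy)
print(np.round(w_hat, 2), round(b_hat, 2))
```

The features are already on a unit scale here, which is the same reason the article standardizes the inputs before calling SGDRegressor: gradient steps behave poorly when columns differ by orders of magnitude.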

y_predict = sgd.predict(x_test)

# y_predict

array([ 0.46684306, -0.56863864,  0.41733968,  0.50252795, -0.9806062 ,
        -0.23795072, -1.52966828, -1.8598664 , -1.44979186, -1.6725667 ,
        ...
        -0.45739942, -0.42882668, -1.06541168,  0.11113028,  0.29365225,
        -0.75703282, -0.7820252 ])

y_predict_sgd = sds_y.inverse_transform(y_predict)

# y_predict_sgd

array([27.03295709, 17.50007403, 26.57721759, 27.36148043, 13.70740567,
           20.54446321,  8.65261371,  5.61273368,  9.38797442,  7.33705788,
           ...
           18.52416787, 18.78721513, 12.92666688, 23.75818334, 25.43852267,
           15.76567378, 15.53558811])

from sklearn.metrics import mean_squared_error

print('LR MSE', mean_squared_error(sds_y.inverse_transform(y_test), y_predict_lr))   # 23.811573271484313
print('SGD MSE', mean_squared_error(sds_y.inverse_transform(y_test), y_predict_sgd)) # 23.77573271117358
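The errors above are computed after mapping the test targets back with `inverse_transform`, so they are in squared units of $1000. As a small sketch with made-up numbers, the following computes the MSE by hand and checks that an MSE measured on standardized values equals the original-unit MSE divided by the squared scale factor.

```python
import numpy as np

# Made-up values chosen so the arithmetic is exact in floating point.
y_true = np.array([24.0, 21.5, 34.75, 33.5])
y_pred = np.array([25.0, 20.5, 32.75, 33.5])

# Mean squared error: average of the squared differences.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 1.5

# The same error measured on standardized values is smaller by a factor of sigma**2.
mu, sigma = y_true.mean(), y_true.std()
mse_scaled = np.mean(((y_true - mu) / sigma - (y_pred - mu) / sigma) ** 2)
print(np.isclose(mse_scaled * sigma ** 2, mse))  # True
```

Taking the square root of the reported values (about 23.8) gives a root mean squared error of roughly 4.9, that is, a typical error of around $4,900 on the median house price. The two models perform almost identically on this split.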