Datawhale - Task 2 (EDA: Exploratory Data Analysis)

weixin_43901423

License: CC 4.0 BY-SA

Categories: data mining, python

Article link: https://blog.csdn.net/weixin_43901423/article/details/105044709


1. Goals

[Figure: goals of this task (image not available)]

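2. Key concepts

(1) EDA stands for Exploratory Data Analysis.
(2) What EDA is for:
    Get familiar with the dataset and validate it, to confirm that it can be used for the machine learning or deep learning steps that follow.
    Understand how the variables relate to one another and how each of them relates to the prediction target.
    Guide data processing and feature engineering, so that the structure of the dataset and the feature set make the downstream prediction problem more reliable.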

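3. What is covered

Load the data science and visualization libraries:
    Data science libraries: pandas, numpy, scipy;
    Visualization libraries: matplotlib, seaborn;
    Others.
Load the data:
    Load the training set and the test set;
    Take a quick look at the data (head() + shape).
Data overview:
    Use describe() to get familiar with the summary statistics;
    Use info() to get familiar with the data types.
Check for missing and abnormal data:
    Check each column for NaN values;
    Detect abnormal values.
Understand the distribution of the prediction target:
    Overall distribution (unbounded Johnson distribution, etc.);
    Skewness and kurtosis;
    Frequency counts of the target values.
Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones.
Numeric feature analysis:
    Correlation analysis;
    Skewness and kurtosis of several features;
    Distribution plot for each numeric feature;
    Pairwise relationships between numeric features;
    Multi-variable regression plots.
Categorical feature analysis:
    Unique-value distribution;
    Box plots of the categorical features;
    Violin plots of the categorical features;
    Bar plots of the categorical features;
    Per-category frequency plots of the categorical features (count_plot).
Generate a data report with pandas_profiling.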

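I. What sort_values() does
The pandas sort_values() function works much like ORDER BY in SQL: it sorts a dataset by the values in a given field. It can sort by the values of a specified column or by the values of a specified row.
II. Parameters of sort_values()
Usage:
DataFrame.sort_values(by='##', axis=0, ascending=True, inplace=False, na_position='last')

Parameters
Parameter      Description
by             Column name(s) when axis=0 or 'index'; index label(s) when axis=1 or 'columns'
axis           With axis=0 or 'index', sort by the values in the given column(s); with axis=1 or 'columns', sort by the values in the given row(s). Default: axis=0
ascending      Whether to sort in ascending order. Default: True (ascending)
inplace        Whether to replace the original data with the sorted data. Default: False (do not replace)
na_position    {'first', 'last'}: where to place missing values. Default: 'last'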
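The parameters above can be tried out on a small frame. This is a minimal sketch on made-up data (the `brand` and `price` values are hypothetical and not taken from the used-car dataset); it shows `by`, `ascending`, `na_position`, and the fact that the default `inplace=False` leaves the original frame untouched.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the used-car dataset).
df = pd.DataFrame({
    "brand": ["b", "a", "c", "d"],
    "price": [3500.0, np.nan, 1200.0, 8000.0],
})

# Sort rows by the 'price' column, highest first; NaN rows go last by default.
by_price_desc = df.sort_values(by="price", ascending=False)

# Same sort, but place rows with a missing price first.
nan_first = df.sort_values(by="price", ascending=False, na_position="first")

# inplace=False (the default) returns a new frame and leaves df unchanged.
print(by_price_desc)
print(nan_first)
print(df)
```

Passing `inplace=True` would instead reorder `df` itself and return `None`, which is why the result of an in-place call should not be assigned or printed.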

4、代码实现

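# coding: utf-8
# 1. Load the data science and visualization libraries

import warnings
# Use the warnings filter to silence warning messages.
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # plotting
# Note: since seaborn 0.8, importing seaborn no longer changes matplotlib's default style;
# call sns.set() (done later in this script) to apply the seaborn style.
import seaborn as sns

"""
Seaborn is a Python visualization package built on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Seaborn wraps matplotlib in a higher-level API, which makes plotting easier. In most cases seaborn produces attractive plots with little effort, while matplotlib allows more customized figures. Treat seaborn as a complement to matplotlib, not a replacement. It also works well with numpy and pandas data structures and with the statistical routines in scipy and statsmodels.
"""
import missingno as msno
# missingno visualizes missing values, which helps when preprocessing data.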
# 2. Load the data
Train_data = pd.read_csv('/home/ysn7/PycharmProjects/Datawhale/used_car_train_20200313.csv', sep=' ')  # sep: the field separator
Test_data = pd.read_csv('/home/ysn7/PycharmProjects/Datawhale/used_car_testA_20200313.csv', sep=' ')

"""
All features have been anonymized (listed here for easy reference)
name - car code
regDate - car registration date
model - model code
brand - brand
bodyType - body type
fuelType - fuel type
gearbox - gearbox
power - engine power
kilometer - kilometers driven
notRepairedDamage - the car has damage that has not been repaired
regionCode - code of the region where the car is viewed
seller - seller
offerType - offer type
creatDate - date the ad was posted
price - car price
'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from car reviews, tags and other information) [artificially constructed anonymous features]

"""
## 2) Take a quick look at the data (head() + shape)
# First 5 rows followed by the last 5 rows.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
print(pd.concat([Train_data.head(), Train_data.tail()]))

print(Train_data.head())
print(Train_data.shape)

print(pd.concat([Test_data.head(), Test_data.tail()]))
print(Test_data.shape)


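# 3. Data overview
"""
1. describe() reports per-column statistics: count, mean, standard deviation (std), minimum (min),
the quartiles (25%, 50%, 75%; the 50% value is the median) and maximum (max). It gives a quick feel for the
range of each column and helps spot abnormal values. For example, values such as 999, 9999 or -1 are
sometimes just another way of encoding NaN and deserve attention.
2. info() reports the dtype of each column, which helps reveal special symbols other than NaN that mark abnormal entries.
"""
print(Train_data.describe())
print(Train_data.info())

print(Test_data.describe())  # one column fewer than the training set: no price
print(Test_data.info())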
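The remark above about 999, 9999 or -1 standing in for NaN can be illustrated on a toy frame. This is only a sketch: the column values and the choice of which codes count as sentinels are assumptions made for the example, not facts about the used-car dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); 9999 and -1 act as "missing" markers here.
toy = pd.DataFrame({
    "power": [75, 9999, 110, 90],
    "kilometer": [12.5, 15.0, -1, 8.0],
})

# Before cleaning, the sentinels distort the min and max reported by describe().
print(toy.describe().loc[["min", "max"]])

# Assumed sentinel codes per column; mask() sets matching cells to NaN.
sentinels = {"power": [9999], "kilometer": [-1]}
cleaned = toy.copy()
for column, codes in sentinels.items():
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))

# NaN counts per column after cleaning.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

Once the sentinels are NaN, `isnull().sum()` counts them together with the genuinely missing cells, and `describe()` no longer reports them as extreme values.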

# 4. Check for missing and abnormal data

## 1) Check each column for NaN values
print("---------")
print(Train_data.isnull().sum())

# Visualize the NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]  # keep only the columns that have missing values
# inplace=True sorts the Series itself instead of returning a sorted copy (the default is inplace=False)
missing.sort_values(inplace=True)
# Draw a bar chart with pandas' plot.bar()
missing.plot.bar()
plt.show()

print("======")
print(type(Train_data))  # <class 'pandas.core.frame.DataFrame'>

# Visualize the missing values
# Train_data.sample(250) draws 250 random rows with pandas
msno.matrix(Train_data.sample(250))
# Layout differs: pandas stacks its plots vertically and the position cannot be chosen,
# while plt lets you place them yourself.
plt.show()

# Bar chart: msno.bar is a simple visualization of nullity by column
msno.bar(Train_data.sample(1000))
plt.show()

msno.matrix(Test_data.sample(250))
plt.show()
msno.bar(Test_data.sample(1000))
plt.show()  # if both figures are defined and plt.show() is called only once, the two plots overlap

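## 2) Check for abnormal values
print("\\\\\\\\")
print(Train_data.info())

# Every column is numeric except notRepairedDamage, which has dtype object.
# Listing its distinct values shows why.

print("------")
print(Train_data['notRepairedDamage'].value_counts())
"""
0.0    111361
-       24324
1.0     14315

"""
# value_counts() is a quick way to see which distinct values a column holds
# and how many times each of them occurs.
# value_counts() is a Series method, so on a DataFrame
# you normally select the column (or row) to apply it to first.

# The '-' entries are missing values too. Many models handle NaN directly,
# so for now we only replace '-' with NaN and do no further processing.
# Assigning the result back avoids an in-place replace on a column selection,
# which returns None and is not reliable in recent pandas versions.
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
print(Train_data['notRepairedDamage'].value_counts())

print(Train_data.isnull().sum())

Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)
print(Test_data['notRepairedDamage'].value_counts())

print(Test_data.isnull().sum())

# The following two categorical features are extremely imbalanced (almost every row has the same value),
# so they are unlikely to help prediction and are dropped here. You can dig into them further,
# but it is rarely worthwhile.

print(Train_data['seller'].value_counts())

print(Train_data["offerType"].value_counts())

del Train_data['seller']
del Train_data['offerType']

del Test_data['seller']
del Test_data['offerType']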
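The decision to drop `seller` and `offerType` was made by reading their value_counts() output. As a rough sketch of how the same check could be automated, the helper below flags columns whose most frequent value covers nearly all rows. The function name `dominant_share`, the 0.95 threshold and the toy data are all assumptions introduced for illustration.

```python
import pandas as pd

def dominant_share(series: pd.Series) -> float:
    """Return the fraction of non-null rows taken by the most frequent value."""
    counts = series.value_counts(normalize=True)  # sorted, largest share first
    return float(counts.iloc[0]) if len(counts) else 0.0

# Toy frame (hypothetical values) with one near-constant column.
toy = pd.DataFrame({
    "seller": [0] * 99 + [1],
    "gearbox": [0] * 60 + [1] * 40,
})

threshold = 0.95  # assumed cut-off; tune it for the task at hand
near_constant = [col for col in toy.columns if dominant_share(toy[col]) >= threshold]
print(near_constant)

# Drop the flagged columns in one call instead of several del statements.
reduced = toy.drop(columns=near_constant)
print(reduced.columns.tolist())
```

A threshold like this is only a screening aid; a rare category can still carry signal, so flagged columns are worth a quick look before they are removed.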

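# 5. Understand the distribution of the prediction target
print(Train_data['price'])
print(Train_data['price'].value_counts())

# 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st  # statistical distributions
y = Train_data['price']
# Note: sns.distplot is deprecated since seaborn 0.11; it is kept here for its fit= option.
# By default distplot overlays a kernel density estimate (KDE) curve on the histogram,
# which is the main difference from a plain matplotlib histogram.
# kde=False hides that curve, so only the fitted distribution is drawn.
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
plt.show()
"""
Price is not normally distributed, so it has to be transformed before regression.
The log transform does a good job, but the best fit is the unbounded Johnson distribution.
"""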

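# 2) Skewness and kurtosis
"""
Skewness and kurtosis are commonly used to describe the shape of a distribution, usually in comparison
with the normal distribution, whose skewness and excess kurtosis are both taken as zero. If the computed
skewness or kurtosis is non-zero, the variable is skewed to the left or right, or has a sharper or flatter
peak than the normal distribution.
I. Skewness

Definition: a statistic that describes the symmetry of a distribution, in other words how asymmetric the data are.
Skewness is computed from the third central moment (divided by the cube of the standard deviation).
(1) Skewness = 0: same skewness as the normal distribution.
(2) Skewness > 0: positive (right) skew. The long tail is on the right, with more extreme values at the high end.
(3) Skewness < 0: negative (left) skew. The long tail is on the left, with more extreme values at the low end.
(4) The larger the absolute value, the more asymmetric the distribution.

II. Kurtosis

Definition: kurtosis is a statistic that describes how steep or flat the distribution of a variable is,
in other words how sharp its peak is.
Kurtosis is computed from the fourth standardized moment. The values below refer to excess kurtosis
(kurtosis minus 3), which is what pandas kurt() returns.
(1) Kurtosis = 0: as steep as the normal distribution.
(2) Kurtosis > 0: a sharper peak than the normal distribution (peaked).
(3) Kurtosis < 0: a flatter peak than the normal distribution (flat-topped).

"""

sns.distplot(Train_data['price'])
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
plt.show()

# numeric_only=True skips the object column notRepairedDamage
skewness = Train_data.skew(numeric_only=True)
kurtosis = Train_data.kurt(numeric_only=True)

sns.distplot(skewness, color='blue', axlabel='Skewness')
plt.show()
sns.distplot(kurtosis, color='orange', axlabel='Kurtosis')
plt.show()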
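To make the moment-based definitions concrete, here is a small sketch that computes skewness and excess kurtosis directly from central moments with numpy. It uses the plain population formulas, which is an assumption of this example: pandas skew() and kurt() apply bias corrections, so their values differ somewhat on small samples. The sample values are made up.

```python
import numpy as np

def central_moment(values: np.ndarray, order: int) -> float:
    """Population central moment of the given order."""
    return float(np.mean((values - values.mean()) ** order))

def skewness(values) -> float:
    """Third central moment divided by the cube of the standard deviation."""
    x = np.asarray(values, dtype=float)
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(values) -> float:
    """Fourth central moment over the squared variance, minus 3 (normal -> 0)."""
    x = np.asarray(values, dtype=float)
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 50]  # hypothetical prices with one expensive outlier

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness and a negative excess kurtosis (flatter than normal), while the sample with a single large value is right-skewed, which is the same pattern the price column shows.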

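# 3) Frequency counts of the target values
plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
"""
Looking at the frequencies, prices above 20000 are very rare.
These could be treated as special (abnormal) values and either imputed or dropped
in an earlier step.

"""

# After a log transform the distribution is much more even, so the model can predict log(price);
# this is a common trick for prediction problems.
plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()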
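The effect of the log transform can also be checked numerically rather than by eye. The sketch below uses synthetic log-normal "prices" (not the real dataset) and a population skewness helper defined only for this example. It uses np.log1p and np.expm1 instead of np.log so that a price of 0 would not produce -inf; that substitution is a choice of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal draw, not the real dataset).
prices = np.exp(rng.normal(loc=8.0, scale=1.0, size=10_000))

def skewness(values: np.ndarray) -> float:
    """Population skewness: third central moment over std cubed."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

log_prices = np.log1p(prices)    # target a model would be trained on
restored = np.expm1(log_prices)  # maps predictions back to the price scale

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))
```

When a model is trained on the transformed target, its predictions have to be mapped back with the inverse transform (np.expm1 here) before they are compared with real prices.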

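# 6. Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones
"""
Data columns
name - car code
regDate - car registration date
model - model code
brand - brand
bodyType - body type
fuelType - fuel type
gearbox - gearbox
power - engine power
kilometer - kilometers driven
notRepairedDamage - the car has damage that has not been repaired
regionCode - code of the region where the car is viewed
seller - seller [deleted]
offerType - offer type [deleted]
creatDate - date the ad was posted
price - car price
'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from car reviews, tags and other information) [artificially constructed anonymous features]

"""
# Separate the label, i.e. the prediction target
Y_train = Train_data['price']

# Splitting by dtype only works when the categorical columns have not already been label-encoded as numbers.
# That is not the case here, so the split has to be made by hand from the meaning of each column.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=['object'])
# categorical_features.columns

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']

# nunique distribution of the categorical features (training set)
for cat_fea in categorical_features:
    print("Distribution of feature {}:".format(cat_fea))
    print("Feature {} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

# nunique distribution of the categorical features (test set)
for cat_fea in categorical_features:
    print("Distribution of feature {}:".format(cat_fea))
    print("Feature {} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())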

# 7. Numeric feature analysis
# list.append works in place and returns None, so it is not wrapped in print()
numeric_features.append('price')
print(Train_data.head())

## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()  # corr() computes the pairwise correlation coefficients
print(correlation['price'].sort_values(ascending=False), '\n')

f, ax = plt.subplots(figsize=(7, 7))

plt.title('Correlation of Numeric Features with Price', y=1, size=16)

# The heatmap encodes each value of the 2-D correlation matrix as a color; the numbers are the same
sns.heatmap(correlation, square=True, vmax=0.8)
plt.show()
del price_numeric['price']

## 2) Skewness and kurtosis of several features
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
         )

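## 3) Distribution plot for each numeric feature
"""
df.melt() is the inverse of df.pivot().

It turns column names into column values (column names -> column values) and reshapes the DataFrame.

Where df.pivot() converts a long dataset into a wide one, df.melt() converts a wide dataset into a long one.

"""
f = pd.melt(Train_data, value_vars=numeric_features)
"""
    sns.FacetGrid first lays out the grid of panels,
    then map() fills each panel with content.
"""
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
plt.show()
# The anonymous features turn out to be fairly evenly distributed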
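Because the wide-to-long reshaping is easy to misread on a 150000-row frame, here is a minimal sketch of pd.melt and its inverse on a two-row toy frame. The values are hypothetical; `id_vars` is added here so that each long row can be traced back to its car, which the script above does not need for plotting.

```python
import pandas as pd

# Toy wide frame (hypothetical values): one row per car, one column per feature.
wide = pd.DataFrame({
    "SaleID": [0, 1],
    "power": [60, 75],
    "kilometer": [12.5, 15.0],
})

# Wide -> long: column names move into 'variable', cell values into 'value'.
long = pd.melt(wide, id_vars="SaleID", value_vars=["power", "kilometer"])
print(long)

# Long -> wide again with pivot, which undoes the melt.
restored = long.pivot(index="SaleID", columns="variable", values="value")
print(restored)
```

The long layout is what FacetGrid expects: the `variable` column selects the panel and the `value` column supplies the data drawn in it.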



print("11111111")
## 4) Pairwise relationships between numeric features
sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# 'height' replaces the 'size' argument, which was renamed in seaborn 0.9
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()

print(Train_data.columns)
print(Y_train)

## 5) Multi-variable regression plots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# One panel per feature, filled row by row: price regressed on that feature
reg_features = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
for ax, feature in zip(axes.flatten(), reg_features):
    scatter_data = pd.concat([Y_train, Train_data[feature]], axis=1)
    sns.regplot(x=feature, y='price', data=scatter_data, scatter=True, fit_reg=True, ax=ax)
plt.show()

5. Results

/home/ysn7/anaconda3/bin/python /home/ysn7/PycharmProjects/Datawhale/eda_task2.py
        SaleID    name   regDate    ...         v_12      v_13      v_14
0            0     736  20040402    ...    -2.420821  0.795292  0.914762
1            1    2262  20030301    ...    -1.030483 -1.722674  0.245522
2            2   14874  20040403    ...     1.565330 -0.832687 -0.229963
3            3   71865  19960908    ...    -0.501868 -2.438353 -0.478699
4            4  111080  20120103    ...     0.931110  2.834518  1.923482
149995  149995  163978  20000607    ...     0.589167 -1.304370 -0.302592
149996  149996  184535  20091102    ...     2.553994  0.924196 -0.272160
149997  149997  147587  20101003    ...     2.290197  1.891922  0.414931
149998  149998   45907  20060312    ...     1.414937  0.431981 -1.659014
149999  149999  177672  19990204    ...     0.031724 -1.483350 -0.342674

[10 rows x 31 columns]
   SaleID    name   regDate    ...         v_12      v_13      v_14
0       0     736  20040402    ...    -2.420821  0.795292  0.914762
1       1    2262  20030301    ...    -1.030483 -1.722674  0.245522
2       2   14874  20040403    ...     1.565330 -0.832687 -0.229963
3       3   71865  19960908    ...    -0.501868 -2.438353 -0.478699
4       4  111080  20120103    ...     0.931110  2.834518  1.923482

[5 rows x 31 columns]
(150000, 31)
       SaleID    name   regDate    ...         v_12      v_13      v_14
0      150000   66932  20111212    ...     4.800151  0.620011 -3.664654
1      150001  174960  19990211    ...    -3.796107 -1.541230 -0.757055
2      150002    5356  20090304    ...     0.826562  0.138226  0.754033
3      150003   50688  20100405    ...     1.870379  0.366038  1.312775
4      150004  161428  19970703    ...    -3.197685 -0.025678 -0.101290
49995  199995   20903  19960503    ...    -1.207191 -1.981240 -0.357695
49996  199996     708  19991011    ...    -2.075658 -1.154847  0.169073
49997  199997    6693  20040412    ...     1.137756 -1.390531  0.254420
49998  199998   96900  20020008    ...     2.465630 -0.911682 -2.057353
49999  199999  193384  20041109    ...     0.547628  2.094057 -1.552150

[10 rows x 30 columns]
(50000, 30)
              SaleID           name      ...                 v_13           v_14
count  150000.000000  150000.000000      ...        150000.000000  150000.000000
mean    74999.500000   68349.172873      ...             0.000313      -0.000688
std     43301.414527   61103.875095      ...             1.288988       1.038685
min         0.000000       0.000000      ...            -4.153899      -6.546556
25%     37499.750000   11156.000000      ...            -1.057789      -0.437034
50%     74999.500000   51638.000000      ...            -0.036245       0.141246
75%    112499.250000  118841.250000      ...             0.942813       0.680378
max    149999.000000  196812.000000      ...            11.147669       8.658418

[8 rows x 30 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
None
              SaleID           name      ...               v_13          v_14
count   50000.000000   50000.000000      ...       50000.000000  50000.000000
mean   174999.500000   68542.223280      ...          -0.003147      0.001516
std     14433.901067   61052.808133      ...           1.286597      1.027360
min    150000.000000       0.000000      ...          -4.123333     -6.112667
25%    162499.750000   11203.500000      ...          -1.060428     -0.437920
50%    174999.500000   52248.500000      ...          -0.035956      0.138799
75%    187499.250000  118856.500000      ...           0.941469      0.681163
max    199999.000000  196805.000000      ...           5.913273      2.624622

[8 rows x 29 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
None
---------
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
======
<class 'pandas.core.frame.DataFrame'>
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/figure.py:2267: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  warnings.warn("This figure includes Axes that are not compatible "
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/figure.py:2267: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  warnings.warn("This figure includes Axes that are not compatible "
\\\\
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
None
------
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
None
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
None
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage    8031
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
0    149999
1         1
Name: seller, dtype: int64
0    150000
Name: offerType, dtype: int64
0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
          ...  
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64
500      2337
1500     2158
1200     1922
1000     1850
2500     1821
600      1535
3500     1533
800      1513
2000     1378
999      1356
750      1279
4500     1271
650      1257
1800     1223
2200     1201
850      1198
700      1174
900      1107
1300     1105
950      1104
3000     1098
1100     1079
5500     1079
1600     1074
300      1071
550      1042
350      1005
1250     1003
6500      973
1999      929
         ... 
21560       1
7859        1
3120        1
2279        1
6066        1
6322        1
4275        1
10420       1
43300       1
305         1
1765        1
15970       1
44400       1
8885        1
2992        1
31850       1
15413       1
13495       1
9525        1
7270        1
13879       1
3760        1
24250       1
11360       1
10295       1
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Skewness: 3.346487
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
Kurtosis: 18.995183
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Distribution of feature name:
Feature name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
53        221
713       217
290       197
1186      184
911       182
2044      176
1513      160
1180      158
631       157
893       153
2765      147
473       141
1139      137
1108      132
444       129
306       127
2866      123
2402      116
533       114
1479      113
422       113
4635      110
725       110
964       109
1373      104
         ... 
89083       1
95230       1
164864      1
173060      1
179207      1
181256      1
185354      1
25564       1
19417       1
189324      1
162719      1
191373      1
193422      1
136082      1
140180      1
144278      1
146327      1
148376      1
158621      1
1404        1
15319       1
46022       1
64463       1
976         1
3025        1
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
Feature model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
48.0      5052
40.0      4502
26.0      4496
8.0       4391
31.0      3827
13.0      3762
17.0      3121
65.0      2730
49.0      2608
46.0      2454
30.0      2342
44.0      2195
5.0       2063
10.0      2004
21.0      1872
73.0      1789
11.0      1775
23.0      1696
22.0      1524
69.0      1522
63.0      1469
7.0       1460
16.0      1349
88.0      1309
66.0      1250
         ...  
141.0       37
133.0       35
216.0       30
202.0       28
151.0       26
226.0       26
231.0       23
234.0       23
233.0       20
198.0       18
224.0       18
227.0       17
237.0       17
220.0       16
230.0       16
239.0       14
223.0       13
236.0       11
241.0       10
232.0       10
229.0       10
235.0        7
246.0        7
243.0        4
244.0        3
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
Feature brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
Feature bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
Feature fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
Feature gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
Feature notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
Feature regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
428     132
24      130
1184    130
122     129
828     126
70      125
827     120
207     118
1222    117
2418    117
85      116
2615    115
2222    113
759     112
188     111
1757    110
1157    109
2401    107
1069    107
3545    107
424     107
272     107
451     106
450     105
129     105
       ... 
6324      1
7372      1
7500      1
8107      1
2453      1
7942      1
5135      1
6760      1
8070      1
7220      1
8041      1
8012      1
5965      1
823       1
7401      1
8106      1
5224      1
8117      1
7507      1
7989      1
6505      1
6377      1
8042      1
7763      1
7786      1
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
Distribution of feature name:
Feature name has 37453 distinct values
55        97
708       96
387       95
1541      88
713       74
53        72
1186      67
203       67
631       65
911       64
2044      62
2866      60
1139      57
893       54
1180      52
2765      50
1108      50
290       48
1513      47
691       45
473       44
299       43
444       41
422       39
964       39
1479      38
1273      38
306       36
725       35
4635      35
          ..
46786      1
48835      1
165572     1
68204      1
171719     1
59080      1
186062     1
11985      1
147155     1
134869     1
138967     1
173792     1
114403     1
59098      1
59144      1
40679      1
61161      1
128746     1
55022      1
143089     1
14066      1
147187     1
112892     1
46598      1
159481     1
22270      1
89855      1
42752      1
48899      1
11808      1
Name: name, Length: 37453, dtype: int64
Distribution of feature model:
Feature model has 247 distinct values
0.0      3896
19.0     3245
4.0      3007
1.0      1981
29.0     1742
48.0     1685
26.0     1525
40.0     1409
8.0      1397
31.0     1292
13.0     1210
17.0     1087
65.0      915
49.0      866
46.0      831
30.0      803
10.0      709
5.0       696
44.0      676
21.0      659
11.0      603
23.0      591
73.0      561
69.0      555
7.0       526
63.0      493
22.0      443
16.0      412
66.0      411
88.0      391
         ... 
124.0       9
193.0       9
151.0       8
198.0       8
181.0       8
239.0       7
233.0       7
216.0       7
231.0       6
133.0       6
236.0       6
227.0       6
220.0       5
230.0       5
234.0       4
224.0       4
241.0       4
223.0       4
229.0       3
189.0       3
232.0       3
237.0       3
235.0       2
245.0       2
209.0       2
242.0       1
240.0       1
244.0       1
243.0       1
246.0       1
Name: model, Length: 247, dtype: int64
Distribution of feature brand:
Feature brand has 40 distinct values
0     10348
4      5763
14     5314
10     4766
1      4532
6      3502
9      2423
5      1569
13     1245
11      919
7       795
3       773
16      771
8       704
25      695
27      650
21      544
15      511
20      450
19      450
12      389
22      363
30      324
17      317
26      303
24      268
28      225
32      193
29      117
31      115
18      106
2       104
37       92
34       77
33       76
36       67
23       62
35       53
38       23
39        2
Name: brand, dtype: int64
Distribution of feature bodyType:
Feature bodyType has 8 distinct values
0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
Distribution of feature fuelType:
Feature fuelType has 7 distinct values
0.0    30656
1.0    15544
2.0      774
3.0       72
4.0       37
6.0       14
5.0       10
Name: fuelType, dtype: int64
Distribution of feature gearbox:
Feature gearbox has 2 distinct values
0.0    37301
1.0    10789
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
Feature notRepairedDamage has 2 distinct values
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
Feature regionCode has 6971 distinct values
419     146
764      78
188      52
125      51
759      51
2615     50
462      49
542      44
85       44
1069     43
451      41
828      40
757      39
1688     39
2154     39
1947     39
24       39
2690     38
238      38
2418     38
827      38
1184     38
272      38
233      38
70       37
703      37
2067     37
509      37
360      37
176      37
       ... 
5512      1
7465      1
1290      1
3717      1
1258      1
7401      1
7920      1
7925      1
5151      1
7527      1
7689      1
8114      1
3237      1
6003      1
7335      1
3984      1
7367      1
6001      1
8021      1
3691      1
4920      1
6035      1
3333      1
5382      1
6969      1
7753      1
7463      1
7230      1
826       1
112       1
Name: regionCode, Length: 6971, dtype: int64
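
Listings like the ones above (a distinct-value count followed by a frequency table for each categorical feature) can be produced with `nunique()` and `value_counts()`. The following is only a minimal sketch on a made-up toy frame; the helper name `describe_categories` and the toy data are assumptions for illustration, not the script that generated this output. Note that `value_counts()` excludes NaN by default, so for a column with missing values the counts sum to less than the number of rows.

```python
import pandas as pd


def describe_categories(df, columns):
    """Return {column: (number of distinct values, frequency table)}.

    Hypothetical helper for illustration only.
    """
    summary = {}
    for col in columns:
        counts = df[col].value_counts()  # sorted by frequency, NaN excluded by default
        summary[col] = (df[col].nunique(), counts)
    return summary


# Toy data (made up): one missing value in gearbox
toy = pd.DataFrame({
    "gearbox": [0.0, 0.0, 1.0, None, 0.0],
    "brand": [0, 4, 4, 4, 1],
})

summary = describe_categories(toy, ["gearbox", "brand"])
for col, (n_unique, counts) in summary.items():
    print(f"Feature {col} has {n_unique} distinct values")
    print(counts)
```

Passing `dropna=False` to `value_counts()` would add a separate row for the missing values, which makes gaps in a column visible directly in the frequency table.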
None
   SaleID    name   regDate    ...         v_12      v_13      v_14
0       0     736  20040402    ...    -2.420821  0.795292  0.914762
1       1    2262  20030301    ...    -1.030483 -1.722674  0.245522
2       2   14874  20040403    ...     1.565330 -0.832687 -0.229963
3       3   71865  19960908    ...    -0.501868 -2.438353 -0.478699
4       4  111080  20120103    ...     0.931110  2.834518  1.923482

[5 rows x 29 columns]
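
The `...` in the preview above is pandas truncating the middle columns to fit the display, so only 6 of the 29 columns are visible. A minimal sketch of how the display options control this, using a made-up 30-column frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with 30 columns named v_0 ... v_29 (made-up values)
toy = pd.DataFrame({f"v_{i}": [0.1 * i, 0.2 * i] for i in range(30)})

# With a small column limit, pandas elides the middle columns with "..."
with pd.option_context("display.max_columns", 5):
    truncated = repr(toy.head())

# With max_columns=None, every column is printed (wrapped across lines if needed)
with pd.option_context("display.max_columns", None, "display.width", 200):
    full = repr(toy.head())

print(truncated)
print(full)
```

`pd.option_context` restores the previous settings when the block exits; `pd.set_option("display.max_columns", None)` would apply the same setting for the rest of the session.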
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 
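
The ranking above lists each numeric feature's correlation with `price`, sorted from most positive to most negative. A minimal sketch of how such a ranking can be computed, assuming Pearson correlation (the default of `DataFrame.corr`); the helper name and the toy values are made up for illustration:

```python
import pandas as pd


def rank_correlation_with_target(df, numeric_columns, target="price"):
    """Correlation of each column with the target, sorted descending.

    Hypothetical helper; uses the pandas default (Pearson) correlation.
    """
    corr = df[numeric_columns + [target]].corr()
    return corr[target].sort_values(ascending=False)


# Toy data (made up): price rises with power and falls with kilometer
toy = pd.DataFrame({
    "power": [60, 75, 90, 110, 150],
    "kilometer": [15.0, 12.5, 10.0, 6.0, 2.0],
    "price": [1000, 1800, 2500, 4000, 7000],
})

ranking = rank_correlation_with_target(toy, ["power", "kilometer"])
print(ranking)
```

When reading the real output, the sign matters as much as the magnitude: `v_3` (about -0.73) is as strongly related to `price` as `v_12` (about 0.69), just in the opposite direction.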

power           Skewness: 65.86     Kurtosis: 5733.45
kilometer       Skewness: -1.53     Kurtosis: 001.14
v_0             Skewness: -1.32     Kurtosis: 003.99
v_1             Skewness: 00.36     Kurtosis: -01.75
v_2             Skewness: 04.84     Kurtosis: 023.86
v_3             Skewness: 00.11     Kurtosis: -00.42
v_4             Skewness: 00.37     Kurtosis: -00.20
v_5             Skewness: -4.74     Kurtosis: 022.93
v_6             Skewness: 00.37     Kurtosis: -01.74
v_7             Skewness: 05.13     Kurtosis: 025.85
v_8             Skewness: 00.20     Kurtosis: -00.64
v_9             Skewness: 00.42     Kurtosis: -00.32
v_10            Skewness: 00.03     Kurtosis: -00.58
v_11            Skewness: 03.03     Kurtosis: 012.57
v_12            Skewness: 00.37     Kurtosis: 000.27
v_13            Skewness: 00.27     Kurtosis: -00.44
v_14            Skewness: -1.19     Kurtosis: 002.39
price           Skewness: 03.35     Kurtosis: 019.00
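
The table above reports skewness and kurtosis per numeric feature. In pandas, `skew()` returns the sample skewness and `kurt()` returns the excess kurtosis, for which a normal distribution scores about 0. The sketch below reproduces the same layout on made-up data; the helper name and toy columns are assumptions for illustration. The zero-padded format specifiers (`05.2f`, `06.2f`) are what produce values such as `001.14` and `-01.75`.

```python
import pandas as pd


def skew_kurt_table(df, columns):
    """Skewness and excess kurtosis per column (hypothetical helper)."""
    return pd.DataFrame({
        "skewness": df[columns].skew(),
        "kurtosis": df[columns].kurt(),
    })


# Toy data (made up): one symmetric column, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 2.0, 2.0, 50.0],
})

table = skew_kurt_table(toy, ["symmetric", "right_tailed"])
for col, row in table.iterrows():
    print(f"{col:15} Skewness: {row['skewness']:05.2f}     Kurtosis: {row['kurtosis']:06.2f}")
```

In the real output, `power` (skewness 65.86, kurtosis 5733.45) stands out: a few extreme values dominate its distribution, which usually calls for outlier handling or clipping before modelling. `price` is also clearly right-skewed (3.35).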
/home/ysn7/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
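
The warning above says that matplotlib's histogram argument `normed` has been replaced by `density`. It is presumably triggered by a histogram call inside the plotting code (or inside an older plotting library that the script uses), so upgrading that library or calling `hist` with `density=True` directly avoids it. A minimal sketch with made-up data, assuming a matplotlib version that supports `density`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample standing in for one numeric feature
rng = np.random.default_rng(0)
values = rng.normal(size=1000)

fig, ax = plt.subplots()
# density=True scales the bars so that the total bar area is 1
heights, bin_edges, _ = ax.hist(values, bins=20, density=True)

area = float(np.sum(heights * np.diff(bin_edges)))
print(round(area, 6))
plt.close(fig)
```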
11111111
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
0          1850
1          3600
2          6222
3          2400
4          5200
5          8000
6          3500
7          1000
8          2850
9           650
10         3100
11         5450
12         1600
13         3100
14         6900
15         3200
16        10500
17         3700
18          790
19         1450
20          990
21         2800
22          350
23          599
24         9250
25         3650
26         2800
27         2399
28         4900
29         2999
          ...  
149970      900
149971     3400
149972      999
149973     3500
149974     4500
149975     3990
149976     1200
149977      330
149978     3350
149979     5000
149980     4350
149981     9000
149982     2000
149983    12000
149984     6700
149985     4200
149986     2800
149987     3000
149988     7500
149989     1150
149990      450
149991    24950
149992      950
149993     4399
149994    14780
149995     5900
149996     9500
149997     7500
149998     4999
149999     4700
Name: price, Length: 150000, dtype: int64

Process finished with exit code 0