“联创黔线”杯大数据应用创新大赛

本文介绍了‘联创黔线’杯大数据应用创新大赛,比赛旨在预测贵阳市居民2018年8月前往黔东南旅游的可能性。参赛者需利用2017年的历史数据建立预测模型,采用AUC作为评分标准。文中提到了特征工程的步骤,包括用户行为、手机品牌、联络圈规模等,并展示了部分训练和测试数据。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

赛题地址:https://www.kesci.com/home/competition/5be92233954d6e001063649a

又打了个酱油,最终成绩是39/205。说出来挺丢人的,因为本次比赛采用AUC来评判模型的效果,不用建模一半预测为去,另一半预测为不去就能得0.5分。
在这里插入图片描述

赛题介绍

赛题描述
参赛选手需要根据2017年贵阳市常住居民的部分用户的历史数据(训练集),以及2018年6月、7月的数据(测试集),对2018年8月贵阳市常住居民前往黔东南州进行省内旅游的可能性进行预测。

本比赛任务为:

训练:使用所提供的训练集,即用户使用2017年6、7月的历史数据与8月是否前往黔东南州进行省内旅游的数据,建立预测模型
输出结果:使用所提供的测试集,即用户使用2018年6月、7月的历史数据,通过所建立的模型,预测用户在2018年8月是否会前往黔东南州进行省内旅游的概率。在科赛网,提交测评,得到AUC分数
数据说明
训练集(training_set)约2.3G,其中包含 201708n,201708q 和 weather_data_2017三个文件夹,分别记录了对应的2017年6、7月用户历史数据和天气历史数据。

在201708n和201708q两个文件夹中,各包含7个txt文件,201708n文件夹中的用户在2017年8月都没有去过黔东南目标区域,201708q文件夹中的用户在2017年8月都去过黔东南目标景区
训练集中,除以下列示字段外,最后还有一个字段“label”:“0”表示其为负样本,即该用户在2017年8月没有去过黔东南目标区域;“1”表示其为正样本,即该用户在2017年8月去过黔东南目标区域
用户身份属性表(201708n1.txt, 201708q1.txt)
用户手机终端信息表(201708n2.txt, 201708q2.txt)
用户漫游行为表(201708n3.txt, 201708q3.txt)
用户漫出省份表(201708n4.txt, 201708q4.txt)
用户地理位置表(201708n6.txt, 201708q6.txt)
用户APP使用情况表(201708n7.txt, 201708q7.txt)
在weather_data_2017文件夹中包含两个txt文件,“weather_reported_2017”记录了2017年6月、7月的实际天气,“weather_forecast_2017”,记录了2017年6月、7月的预报天气,以及一个“天气现象编码表.xlsx”文件。
2017实况天气表(weather_reported_2017.txt)
2017预测天气表(weather_forecast_2017.txt)
测试集(testing_set)共约1G,其中包含201808和weather_data_2018两个文件夹

在201808文件夹中包含7个txt文件,命名依次为2018_1.txt,2018_2.txt, … ,2018_7.txt,字段信息与训练集相对应
在weather_data_2018文件夹中包含两个txt文件,命“weather_reported_2018”记录了2018年6月、7月的实际天气,“weather_forecast_2018”记录了2018年6月、7月的预报天气,字段信息与训练集相对应。
备注:

每个文件夹中的7个表可以通过虚拟ID互相关联;但不是每个虚拟ID都可以被关联,选手自行判断如何处理和使用
不同表中的虚拟ID存在格式不同的情况,需选手自行处理,并保证提交虚拟ID格式为string
由于表的数量较多,信息维度不同,应用方法多种,数据可能存在异常和缺失,选手需自行处理可能遇到的异常状况
欢迎选手用不同的方法进行尝试,如迁移学习等前沿方法
本次竞赛数据经过了脱敏处理,数据和实际信息有一定差距,但是不会影响问题的解决
评审说明
1、初赛评分规则

本次比赛采用AUC来评判模型的效果。AUC即以False Positive Rate为横轴,True Positive Rate为纵轴的ROC (Receiver Operating Characteristic)曲线下方的面积大小。

2、评审说明

测评排行榜采用Private/Public机制,其中,Private榜对应所提交结果文件中一定比例数据的成绩,Public榜对应剩余数据的成绩。

提供给每个队伍每天5次提交与测评排名的机会,实时更新Public排行榜,从高到低排序,若队伍一天内多次提交结果,新结果版本将覆盖原版本。
由于受到使用模型的泛化性能的影响,在 Public 榜获得最高分的提交在 Private 的分数不一定最高,因此需要选手从自己的有效提交里,选择两个觉得兼顾了泛化性能与模型评分的结果文件进入 Private 榜测评
Private 排行榜在比赛结束后会揭晓,比赛的最终有效成绩与有效排名将以 Private 榜为准。

代码

# 显示cell运行时长
%load_ext klab-autotime
import pandas as pd
import numpy as np
time: 311 ms
# 减少内存使用

def reduce_mem_usage(df, verbose=True):

    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

    start_mem = df.memory_usage().sum() / 1024 ** 2

    for col in df.columns:

        col_type = df[col].dtypes

        if col_type in numerics:

            c_min = df[col].min()

            c_max = df[col].max()

            if str(col_type)[:3] == 'int':

                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:

                    df[col] = df[col].astype(np.int8)

                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:

                    df[col] = df[col].astype(np.int16)

                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:

                    df[col] = df[col].astype(np.int32)

                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:

                    df[col] = df[col].astype(np.int64)

            else:

                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:

                    df[col] = df[col].astype(np.float16)

                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:

                    df[col] = df[col].astype(np.float32)

                else:

                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024 ** 2

    if verbose:

        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))

    return df
time: 3.85 ms

1 特征工程

正样本

q1
将两月金额相加

q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
Mem. usage decreased to  0.16 Mb (53.1% reduction)
time: 39.2 ms
q1.describe()
year_monthidconsumelabel
count11200.0000001.120000e+041.086500e+0411200.0
mean201706.5000005.416583e+15inf1.0
std0.5000222.642827e+15inf0.0
min201706.0000001.448104e+124.998779e-021.0
25%201706.0000003.117220e+154.068750e+011.0
50%201706.5000005.456254e+159.837500e+011.0
75%201707.0000007.702940e+151.785000e+021.0
max201707.0000009.997949e+151.324000e+031.0
time: 37.3 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       10865 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.91 ms
q1.consume.min()
0.05



time: 2.64 ms
q1 = q1.fillna(98.0)
time: 2.75 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       11200 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.71 ms
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 709 ms

q2
特征1 使用过的top9+其它手机品牌 共10个
特征2 使用的不同品牌数量

q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 2.46 s
q2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289203 entries, 0 to 289202
Data columns (total 6 columns):
id                 289203 non-null int64
brand              197376 non-null object
type               197380 non-null object
first_use_time     289203 non-null int64
recent_use_time    289203 non-null int64
label              289203 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 11.3+ MB
time: 62.6 ms
q2.type = q2.type.fillna('其它')
time: 18.4 ms
brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})

q2.brand = q2.brand.map(brand_series)
time: 42.4 ms
q2.brand = q2.brand.fillna('其它')
time: 17.4 ms
q2.head()
idbrandtypefirst_use_timerecent_use_timelabel
01752398069509000其它其它20161209134530201612091906361
11752398069509000huaweiPLK-AL1020170609223138201706092243451
21752398069509000leLETV X50120160924102711201609241124251
31752398069509000jinli金立 GN80020150331210255201506301312321
41752398069509000jinliGIONEE M520170508191216201706051923471
time: 18.7 ms
q2['brand_type'] = q2['brand'] + q2['type']
time: 109 ms
q2.head()
idbrandtypefirst_use_timerecent_use_timelabelbrand_type
01752398069509000其它其它20161209134530201612091906361其它其它
11752398069509000huaweiPLK-AL1020170609223138201706092243451huaweiPLK-AL10
21752398069509000leLETV X50120160924102711201609241124251leLETV X501
31752398069509000jinli金立 GN80020150331210255201506301312321jinli金立 GN800
41752398069509000jinliGIONEE M520170508191216201706051923471jinliGIONEE M5
time: 9.75 ms
groupbybrand_type = q2['brand_type'].value_counts()
time: 51.8 ms
groupbybrand_type.head(10)
其它其它                     91823
iphoneA1586              14898
iphoneA1524              10330
iphoneA1700               9246
iphoneA1699               8277
iphoneIPHONE6S(A1633)     6271
oppoOPPO R9M              4725
iphoneA1530               4640
oppoOPPO R9TM             2978
vivoVIVO X7               2516
Name: brand_type, dtype: int64



time: 3.44 ms
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_groupbyid.head()
idphone_nums
08707678197418467422
19196501153454276409
23900535090108175389
34104535378288025352
41106540188374027350
time: 90 ms
q2_groupbyid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5600 entries, 0 to 5599
Data columns (total 2 columns):
id            5600 non-null int64
phone_nums    5600 non-null int64
dtypes: int64(2)
memory usage: 87.6 KB
time: 5.91 ms
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_brand_one_hot.head()
idbrand_huaweibrand_iphonebrand_jinlibrand_lebrand_meibrand_mibrand_oppobrand_sanbrand_sijibrand_vivobrand_其它
0175239806950900000000000001
1175239806950900010000000000
2175239806950900000010000000
3175239806950900000100000000
8175239806950900000000001000
time: 48.9 ms
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 
                                                   'brand_iphone': pd.Series.max,
                                                   'brand_jinli': pd.Series.max, 
                                                   'brand_le': pd.Series.max,
                                                   'brand_mei': pd.Series.max, 
                                                   'brand_mi': pd.Series.max,
                                                   'brand_oppo': pd.Series.max, 
                                                   'brand_san': pd.Series.max,
                                                   'brand_siji': pd.Series.max, 
                                                   'brand_vivo': pd.Series.max,
                                                   'brand_其它': pd.Series.max
})
q2_one_hot.head()
brand_huaweibrand_iphonebrand_jinlibrand_lebrand_meibrand_mibrand_oppobrand_sanbrand_sijibrand_vivobrand_其它
id
144810399800011011011001
1739871881373011111111011
6113262348600010000000001
6815659667552001110000001
7681933457643011101111011
time: 6.57 s
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 3 columns):
id            5600 non-null int64
consume       5600 non-null float16
phone_nums    5600 non-null int64
dtypes: float16(1), int64(2)
memory usage: 142.2 KB
time: 11.6 ms
pos_set = pos_set.merge(q2_one_hot, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 14 columns):
id              5600 non-null int64
consume         5600 non-null float16
phone_nums      5600 non-null int64
brand_huawei    5600 non-null uint8
brand_iphone    5600 non-null uint8
brand_jinli     5600 non-null uint8
brand_le        5600 non-null uint8
brand_mei       5600 non-null uint8
brand_mi        5600 non-null uint8
brand_oppo      5600 non-null uint8
brand_san       5600 non-null uint8
brand_siji      5600 non-null uint8
brand_vivo      5600 non-null uint8
brand_其它        5600 non-null uint8
dtypes: float16(1), int64(2), uint8(11)
memory usage: 202.3 KB
time: 98.6 ms

q3
1.将两月联络圈规模求和
2.将两月出省求和 是:1 否:0
3.将两月出国求和 是:1 否:0

q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
Mem. usage decreased to  0.18 Mb (64.6% reduction)
time: 85.8 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month             11200 non-null int32
id                     11200 non-null int64
call_nums              11200 non-null int16
is_trans_provincial    11200 non-null int8
is_transnational       11200 non-null int8
label                  11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 186.0 KB
time: 7.49 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})

pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 17 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 224.2 KB
time: 1.95 s

q4
1.两月内漫出省次数
2.所有省份one-hot或top10省份+其它省份
3.两月内漫出不同省个数

q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.info()
Mem. usage decreased to  0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7218 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 18.4 ms
q4.head()
year_monthidprovincelabel
02017076062475264825100广东1
12017075627768389537500北京1
22017072000900444179600山西1
32017075304502776817600四川1
42017075304502776817600四川1
time: 7.16 ms
q4_groupbyid = q4.groupby(['province']).size()
time: 61.3 ms
q4_groupbyid.sort_values()
province
宁夏      15
吉林      20
内蒙古     22
黑龙江     27
青海      35
天津      39
辽宁      44
西藏      69
山西      70
甘肃      73
新疆      74
安徽      86
海南     100
陕西     114
山东     121
福建     150
河北     168
江苏     182
湖北     208
上海     215
河南     237
北京     247
江西     364
重庆     428
浙江     483
云南     530
广西     536
四川     793
广东     835
湖南     933
dtype: int64



time: 4.04 ms
q4.province = q4.province.fillna('湖南')
q4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7289 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 8.09 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']

pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 18 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       1942 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 268.0 KB
time: 19.6 ms
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       5600 non-null float64
label                  5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 12.7 ms

q6 暂时忽略
q7
1.使用总流量
2.使用不同APP数量
3.某些特定(旅游相关)APP是否使用

1.1 正样本
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))

q1.columns = ['year_month', 'id', 'consume', 'label']

q1 = q1.fillna(98.0)

q1 = q1[['id', 'consume']]

q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})



q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))

q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']

q2.type = q2.type.fillna('其它')

brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})

q2.brand = q2.brand.map(brand_series)

q2.brand = q2.brand.fillna('其它')

q2['brand_type'] = q2['brand'] + q2['type']

q2_brand_type = q2[['id', 'brand_type']]

q2_brand_type = q2_brand_type.drop_duplicates()

q2_groupbyid = q2_brand_type['id'].value_counts()

q2_groupbyid = q2_groupbyid.reset_index()

q2_groupbyid.columns = ['id', 'phone_nums']

q2_brand = q2[['id', 'brand']]

q2_brand = q2_brand.drop_duplicates()

q2_brand_one_hot = pd.get_dummies(q2_brand)

q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 

                                                   'brand_iphone': pd.Series.max,

                                                   'brand_jinli': pd.Series.max, 

                                                   'brand_le': pd.Series.max,

                                                   'brand_mei': pd.Series.max, 

                                                   'brand_mi': pd.Series.max,

                                                   'brand_oppo': pd.Series.max, 

                                                   'brand_san': pd.Series.max,

                                                   'brand_siji': pd.Series.max, 

                                                   'brand_vivo': pd.Series.max,

                                                   'brand_其它': pd.Series.max

})

q2_one_hot.head()

pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])

pos_set = pos_set.merge(q2_one_hot, on=['id'])



q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))

q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']

q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})

q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})

q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})



pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])

pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])

pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])



q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))

q4.columns = ['year_month', 'id', 'province', 'label']

q4.province = q4.province.fillna('湖南')

q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()

q4_groupbyid = q4_groupbyid.reset_index()

q4_groupbyid.columns = ['id', 'province_out_cnt']



pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])

pos_set = pos_set.fillna(0)

pos_set['label'] = 1

pos_set.info()
Mem. usage decreased to  0.16 Mb (53.1% reduction)
Mem. usage decreased to 11.31 Mb (14.6% reduction)
Mem. usage decreased to  0.18 Mb (64.6% reduction)
Mem. usage decreased to  0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       5600 non-null float64
label                  5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 10.1 s
1.2 负样本
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))

n1.columns = ['year_month', 'id', 'consume', 'label']

n1 = n1.fillna(98.0)

n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})



n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))

n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']

n2.type = n2.type.fillna('其它')

brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})

n2.brand = n2.brand.map(brand_series)

n2.brand = n2.brand.fillna('其它')

n2['brand_type'] = n2['brand'] + n2['type']

n2_brand_type = n2[['id', 'brand_type']]

n2_brand_type = n2_brand_type.drop_duplicates()

n2_groupbyid = n2_brand_type['id'].value_counts()

n2_groupbyid = n2_groupbyid.reset_index()

n2_groupbyid.columns = ['id', 'phone_nums']

n2_brand = n2[['id', 'brand']]

n2_brand = n2_brand.drop_duplicates()

n2_brand_one_hot = pd.get_dummies(n2_brand)

n2_one_hot = n2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 

                                                   'brand_iphone': pd.Series.max,

                                                   'brand_jinli': pd.Series.max, 

                                                   'brand_le': pd.Series.max,

                                                   'brand_mei': pd.Series.max, 

                                                   'brand_mi': pd.Series.max,

                                                   'brand_oppo': pd.Series.max, 

                                                   'brand_san': pd.Series.max,

                                                   'brand_siji': pd.Series.max, 

                                                   'brand_vivo': pd.Series.max,

                                                   'brand_其它': pd.Series.max

})



neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])

neg_set = neg_set.merge(n2_one_hot, on=['id'])

n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))

n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']



n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})

n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})

n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})

neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])

neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])

neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])



n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))

n4.columns = ['year_month', 'id', 'province', 'label']

n4.province = n4.province.fillna('湖南')

n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()

n4_groupbyid = n4_groupbyid.reset_index()

n4_groupbyid.columns = ['id', 'province_out_cnt']

neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])

neg_set = neg_set.fillna(0)



neg_set['label'] = 0

neg_set.info()
Mem. usage decreased to  2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
Mem. usage decreased to  3.03 Mb (64.6% reduction)
Mem. usage decreased to  0.73 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 93375 entries, 0 to 93374
Data columns (total 19 columns):
id                     93375 non-null int64
consume                93375 non-null float16
phone_nums             93375 non-null int64
brand_huawei           93375 non-null uint8
brand_iphone           93375 non-null uint8
brand_jinli            93375 non-null uint8
brand_le               93375 non-null uint8
brand_mei              93375 non-null uint8
brand_mi               93375 non-null uint8
brand_oppo             93375 non-null uint8
brand_san              93375 non-null uint8
brand_siji             93375 non-null uint8
brand_vivo             93375 non-null uint8
brand_其它               93375 non-null uint8
call_nums              93375 non-null int16
is_trans_provincial    93375 non-null int8
is_transnational       93375 non-null int8
province_out_cnt       93375 non-null float64
label                  93375 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.1 MB
time: 2min 48s
train_set = pos_set.append(neg_set)
train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98975 entries, 0 to 93374
Data columns (total 19 columns):
id                     98975 non-null int64
consume                98975 non-null float16
phone_nums             98975 non-null int64
brand_huawei           98975 non-null uint8
brand_iphone           98975 non-null uint8
brand_jinli            98975 non-null uint8
brand_le               98975 non-null uint8
brand_mei              98975 non-null uint8
brand_mi               98975 non-null uint8
brand_oppo             98975 non-null uint8
brand_san              98975 non-null uint8
brand_siji             98975 non-null uint8
brand_vivo             98975 non-null uint8
brand_其它               98975 non-null uint8
call_nums              98975 non-null int16
is_trans_provincial    98975 non-null int8
is_transnational       98975 non-null int8
province_out_cnt       98975 non-null float64
label                  98975 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.4 MB
time: 62.5 ms

2 建模

import numpy as np

import pandas as pd

import lightgbm as lgb

from sklearn import metrics

from sklearn.model_selection import train_test_split



X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values

y = train_set['label'].values



x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)



lgb_train = lgb.Dataset(x_train, y_train)

lgb_eval = lgb.Dataset(x_test, y_test, reference = lgb_train)

params = {

        'boosting_type':'gbdt',  #提升器的类型

        'objective':'binary',   

        'metric':{'auc'},

        'num_leaves':100,

        'reg_alpha':0,

        'reg_lambda':0.01,

        'max_depth':6,

        'n_estimators':100,

        'subsample':0.9,

        'colsample_bytree':0.85,

        'subsample_freq':1,

        'min_child_samples':25,

        'learning_rate':0.1,

        'random_state':2019

        #'feature_fraction':0.9,  #每棵树训练之前选择90%的特征

        #'bagging_fraction':0.8,  #类似于feature_fraction,加速训练,处理过拟合

        #'bagging_freq':5,

        #'verbose':0

}

gbm = lgb.train(params,

                lgb_train,

                num_boost_round = 2000, # 4000 number of boosting iterations,

                valid_sets = lgb_eval,

                verbose_eval=250,

                early_stopping_rounds=50)

                

y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)

print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))



y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration)

print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
Training until validation scores don't improve for 50 rounds.
Early stopping, best iteration is:
[18]	valid_0's auc: 0.786865
AUC: 0.7981
Test AUC: 0.7869
time: 772 ms
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier

from collections import Counter



X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values

y = train_set['label'].values



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)



c = Counter(y_train)

'''

params={'booster':'gbtree',

    'objective': 'binary:logistic',

    'eval_metric': 'auc',

    'max_depth':4,

    'lambda':10,

    'subsample':0.75,

    'colsample_bytree':0.75,

    'min_child_weight':2,

    'eta': 0.025,

    'seed':0,

    'nthread':8,

     'silent':1}

'''

clf = XGBClassifier(max_depth=5, eval_metric='auc', min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1], 

                    nthread=12, num_boost_round=1000, seed=2019

                    )

                    

print('fit start...')

clf.fit(X_train, y_train)

print('fit finish')



'''

train_score = clf.score(X_train, y_train)

test_score = clf.score(X_test, y_test)

print('train score:{}\ntest score:{}'.format(train_score, test_score))

'''



y_pred=clf.predict(X)

from sklearn import metrics

print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))



y_pred=clf.predict(X_test)

print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
fit start...
fit finish
AUC: 0.5134
Test AUC: 0.5082
time: 3.11 s
import xgboost as xgb

import pandas as pd

from sklearn.model_selection import GridSearchCV

from collections import Counter





X_train = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values

y_train = train_set['label'].values

c = Counter(y_train)



# n = c[0] / c[1]  # 13.98

# nn = c[0] / 16 / c[1] # 0.8738

# print(n, nn)



parameters = {

    'max_depth': [5, 10, 15],

    'learning_rate': [0.01, 0.02, 0.05],

    'n_estimators': [500, 1000, 2000],

    'min_child_weight': [0, 2, 5],

    'max_delta_step': [0, 0.2, 0.6],

    'subsample': [0.6, 0.7, 0.8],

    'colsample_bytree': [0.5, 0.6, 0.7],

    'reg_alpha': [0, 0.25, 0.5],

    'reg_lambda': [0.2, 0.4, 0.6],

    'scale_pos_weight': [0.8, 8, 14]



}



xlf = xgb.XGBClassifier(max_depth=10,

                        learning_rate=0.01,

                        n_estimators=2000,

                        silent=True,

                        objective='binary:logistic',

                        nthread=12,

                        gamma=0,

                        min_child_weight=1,

                        max_delta_step=0,

                        subsample=0.85,

                        colsample_bytree=0.7,

                        colsample_bylevel=1,

                        reg_alpha=0,

                        reg_lambda=1,

                        scale_pos_weight=1,

                        seed=2019,

                        missing=None)



# 有了gridsearch我们便不需要fit函数

gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)

gsearch.fit(X_train, y_train)



print("Best score: %0.3f" % gsearch.best_score_)

print("Best parameters set:")

best_parameters = gsearch.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):

    print("\t%s: %r" % (param_name, best_parameters[param_name]))


3 预测

3.1 测试集
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))

t1.columns = ['year_month', 'id', 'consume']

t1 = t1.fillna(81.0)

# t1 = t1.dropna(axis=0)

t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})



t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))

t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']

t2 = t2.fillna('其它')

# t2 = t2.dropna(axis=0)

brand_series = pd.Series({'苹果' : 'iphone', '华为' : "huawei", '欧珀' : 'oppo', '维沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '乐视' : 'le', '四季恒美' : 'siji'})

t2.brand = t2.brand.map(brand_series)

t2.brand = t2.brand.fillna('其它')

t2['brand_type'] = t2['brand'] + t2['type']

t2_brand_type = t2[['id', 'brand_type']]

t2_brand_type = t2_brand_type.drop_duplicates()

t2_groupbyid = t2_brand_type['id'].value_counts()

t2_groupbyid = t2_groupbyid.reset_index()

t2_groupbyid.columns = ['id', 'phone_nums']

t2_brand = t2[['id', 'brand']]

t2_brand = t2_brand.drop_duplicates()

t2_brand_one_hot = pd.get_dummies(t2_brand)

t2_one_hot = t2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 

                                                   'brand_iphone': pd.Series.max,

                                                   'brand_jinli': pd.Series.max, 

                                                   'brand_le': pd.Series.max,

                                                   'brand_mei': pd.Series.max, 

                                                   'brand_mi': pd.Series.max,

                                                   'brand_oppo': pd.Series.max, 

                                                   'brand_san': pd.Series.max,

                                                   'brand_siji': pd.Series.max, 

                                                   'brand_vivo': pd.Series.max,

                                                   'brand_其它': pd.Series.max

})



test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])

test_set = test_set.merge(t2_one_hot, on=['id'])



t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))

t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']



t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})

t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})

t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})

test_set = test_set.merge(t3_groupbyid_call, on=['id'])

test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])

test_set = test_set.merge(t3_groupbyid_trans, on=['id'])



t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))

t4.columns = ['year_month', 'id', 'province']

t4 = t4.fillna('湖南')

# t4 = t4.dropna(axis=0)

t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()

t4_groupbyid = t4_groupbyid.reset_index()

t4_groupbyid.columns = ['id', 'province_out_cnt']

test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])



test_set = test_set.fillna(0)

test_set.info()
Mem. usage decreased to  1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
Mem. usage decreased to  1.53 Mb (60.0% reduction)
Mem. usage decreased to  0.85 Mb (16.7% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 18 columns):
id                     48668 non-null int64
consume                48668 non-null float16
phone_nums             48668 non-null int64
brand_huawei           48668 non-null uint8
brand_iphone           48668 non-null uint8
brand_jinli            48668 non-null uint8
brand_le               48668 non-null uint8
brand_mei              48668 non-null uint8
brand_mi               48668 non-null uint8
brand_oppo             48668 non-null uint8
brand_san              48668 non-null uint8
brand_siji             48668 non-null uint8
brand_vivo             48668 non-null uint8
brand_其它               48668 non-null uint8
call_nums              48668 non-null int16
is_trans_provincial    48668 non-null int8
is_transnational       48668 non-null int8
province_out_cnt       48668 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 2.3 MB
time: 1min 39s
# lightgbm
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = gbm.predict(X_test, num_iteration=gbm.best_iteration)
submit = test_set[['id']]
submit['pred'] = y_predict
time: 108 ms


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
type(y_predict)
numpy.ndarray



time: 2.3 ms
y_predict[:5]
array([0.10280227, 0.08214867, 0.06905468, 0.07655945, 0.11238844])



time: 2.9 ms
# xgboost
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = clf.predict_proba(X_test)[:, 1]
submit_xgb = test_set[['id']]
submit_xgb['pred'] = y_predict
time: 208 ms


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """

4 提交结果

tt1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
tt1.columns = ['year_month', 'id', 'consume']
time: 41.6 ms
xgb_t1_id = tt1[['id']].drop_duplicates()
time: 13 ms
xgb_t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id    50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.46 ms
t1_id = tt1[['id']].drop_duplicates()
time: 12.5 ms
t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id    50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.67 ms
submit_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 950.5 KB
time: 7.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.33 ms
tt_xgb = t1_id.merge(submit_xgb, on=['id'], how='left')
time: 17.6 ms
tt_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 980.5 KB
time: 8.14 ms
tt = t1_id.merge(submit, on=['id'], how='left')
time: 19.3 ms
tt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.06 ms

xgboost

# fill 0 0.469  dropna-0.50005  addfeat-0.46  addfeat dropna-0.4549
# fill 1 0.436  addfeat dropna-0.419
# fill mean0.088458 0.43048757
submit_xgb = tt_xgb.fillna(0.0)
time: 1.92 ms

lightgbm

# fill 0 addfeat-0.4491 0.4539  addfeat dropna-0.4512
submit_gbm = tt.fillna(0.0)
time: 1.96 ms

1.模型融合 求和 得分0.4558
2.全为1.0/0.0 得分0.5
3.大于0.5改为1.0,小于0.5改为0.0 应有2800人左右去 xgb0.26 得分0.50153 gbm0.17 得分0.50554

submit_xgb.describe()
idpred
count5.020000e+0450200.000000
mean5.449990e+150.092590
std2.628886e+150.088487
min5.959412e+110.000000
25%3.177008e+150.034837
50%5.441108e+150.063993
75%7.726328e+150.125547
max9.999920e+150.754152
time: 22.4 ms
submit_xgb[submit_xgb['pred']>=0.26].describe()
idpred
count2.818000e+032818.000000
mean5.523494e+150.350387
std2.632627e+150.083545
min7.736480e+130.260060
25%3.193231e+150.287803
50%5.528103e+150.324941
75%7.801996e+150.386373
max9.999505e+150.754152
time: 16.7 ms
xgb_yes = submit_xgb[submit_xgb['pred']>=0.26] 
xgb_yes['pred'] = 1.0
xgb_yes.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
idpred
count2.818000e+032818.0
mean5.523494e+151.0
std2.632627e+150.0
min7.736480e+131.0
25%3.193231e+151.0
50%5.528103e+151.0
75%7.801996e+151.0
max9.999505e+151.0
time: 347 ms
xgb_no = submit_xgb[submit_xgb['pred']<0.26] 
xgb_no['pred'] = 0.0
xgb_no.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
idpred
count4.738200e+0447382.0
mean5.445619e+150.0
std2.628626e+150.0
min5.959412e+110.0
25%3.175890e+150.0
50%5.435288e+150.0
75%7.722863e+150.0
max9.999920e+150.0
time: 380 ms
submit = xgb_yes.append(xgb_no)
time: 2.29 ms
submit.describe()
idpred
count5.020000e+0450200.000000
mean5.449990e+150.056135
std2.628886e+150.230185
min5.959412e+110.000000
25%3.177008e+150.000000
50%5.441108e+150.000000
75%7.726328e+150.000000
max9.999920e+151.000000
time: 19.6 ms
submit_xgb[submit_xgb['pred']>=0.2].describe()
idpred
count5.547000e+035547.000000
mean5.508672e+150.289829
std2.641133e+150.086438
min5.399382e+120.200014
25%3.195841e+150.225862
50%5.489831e+150.261552
75%7.813588e+150.326278
max9.999505e+150.754152
time: 18.5 ms
5600/98975*50200
2840.3132104066685



time: 2.17 ms
submit_gbm[submit_gbm['pred']>=0.23].describe()
idpred
count2.539000e+032539.000000
mean5.482621e+150.298836
std2.625965e+150.062903
min7.736480e+130.230013
25%3.200866e+150.253366
50%5.471503e+150.279145
75%7.742764e+150.326900
max9.999505e+150.632138
time: 19 ms
submit_gbm[submit_gbm['pred']>=0.22].describe()
idpred
count2.859000e+032859.000000
mean5.493943e+150.290563
std2.630246e+150.063701
min7.736480e+130.220121
25%3.195841e+150.244933
50%5.501943e+150.270700
75%7.743865e+150.321506
max9.999505e+150.632138
time: 19.6 ms
gbm_yes = submit_gbm[submit_gbm['pred']>=0.23] 
gbm_yes['pred'] = 1.0
gbm_yes.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
idpred
count2.539000e+032539.0
mean5.482621e+151.0
std2.625965e+150.0
min7.736480e+131.0
25%3.200866e+151.0
50%5.471503e+151.0
75%7.742764e+151.0
max9.999505e+151.0
time: 82.2 ms
gbm_no = submit_gbm[submit_gbm['pred']<0.23] 
gbm_no['pred'] = 0.0
gbm_no.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
idpred
count4.766100e+0447661.0
mean5.448252e+150.0
std2.629058e+150.0
min5.959412e+110.0
25%3.175232e+150.0
50%5.439911e+150.0
75%7.725629e+150.0
max9.999920e+150.0
time: 58.7 ms
submit = gbm_yes.append(gbm_no)
time: 4.19 ms
submit.describe()
idpred
count5.020000e+0450200.000000
mean5.449990e+150.018745
std2.628886e+150.135625
min5.959412e+110.000000
25%3.177008e+150.000000
50%5.441108e+150.000000
75%7.726328e+150.000000
max9.999920e+151.000000
time: 20.4 ms
submit_gbm.describe()
idpred
count5.020000e+0450200.000000
mean5.449990e+150.085097
std2.628886e+150.071304
min5.959412e+110.000000
25%3.177008e+150.036845
50%5.441108e+150.062206
75%7.726328e+150.113462
max9.999920e+150.632138
time: 20.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 91 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 9.36 ms
submit = submit_xgb.append(submit_gbm)
submit = submit.groupby(by='id').sum().reset_index()
submit.describe()
idpred
count5.020000e+0450200.000000
mean5.449990e+150.169012
std2.628886e+150.139313
min5.959412e+110.000000
25%3.177008e+150.076237
50%5.441108e+150.125893
75%7.726328e+150.222622
max9.999920e+151.124561
time: 41.7 ms
submit.head()
idpred
492971650665915581.0
1481681810970535421.0
1864735155056435551.0
2546412331710055601.0
2967597570360246821.0
time: 6.16 ms
submit_xgb[submit_xgb['id']==595941207920]
idpred
80485959412079200.185561
time: 7.07 ms
submit_gbm[submit_gbm['id']==595941207920]
idpred
80485959412079200.114782
time: 6.33 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8 ms

全为1

t1_id['pred'] = 1.0

submit = t1_id.copy()
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.79 ms
submit.head()
idpred
064018241600107481.0
165061345481354991.0
259969208846199541.0
311872094245437131.0
492971650665915581.0
time: 13.1 ms

submit.columns = ['ID', 'Pred']
submit['ID'] = submit['ID'].astype(str)
time: 36.7 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
ID      50200 non-null object
Pred    50200 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB
time: 10.1 ms
submit.to_csv('../submit.csv')
time: 126 ms
!wget -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
--2019-07-31 08:15:56--  https://www.heywhale.com/kesci_submit
Resolving www.heywhale.com (www.heywhale.com)... 106.15.25.147
Connecting to www.heywhale.com (www.heywhale.com)|106.15.25.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6528405 (6.2M) [application/octet-stream]
Saving to: ‘kesci_submit’

kesci_submit        100%[===================>]   6.23M  12.1MB/s    in 0.5s    

2019-07-31 08:15:57 (12.1 MB/s) - ‘kesci_submit’ saved [6528405/6528405]

time: 1.83 s
!https_proxy="http://klab-external-proxy" ./kesci_submit -file ../submit.csv -token 578549794d544bff
Kesci Submit Tool 3.0

> 已验证Token
> 提交文件 ../submit.csv (1312.26 KiB)
> 文件已上传        
> 提交完成
time: 1.7 s

!./kesci_submit -token 578549794d544bff -file ../submit.csv
Kesci Submit Tool
Result File: ../submit.csv (1.28 MiB)
Uploading: 7%====================
Submit Failed.
Serevr Response:
 400 - {"message":"当前提交工具版本过旧,请参考比赛提交页面信息下载新的提交工具"}

time: 1 s
!ls ../
input  pred.csv  work
time: 665 ms
!wget -nv -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
2019-07-02 08:08:23 URL:https://www.heywhale.com/kesci_submit [7842088/7842088] -> "kesci_submit" [1]
time: 1.47 s

0 查看数据

0.1 训练数据
0.1.1 正样本
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
Mem. usage decreased to  0.16 Mb (53.1% reduction)
time: 23 ms
q1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.21 ms
q1 = q1.dropna(axis=0)
time: 6.72 ms
q1.head()
year_monthidconsumelabel
2201706816082995131430082.750001
3201707816082995131430037.687501
4201706150807569852140068.000001
5201707150807569852140049.593751
62017061686251204809800200.750001
time: 6.82 ms
q1.describe()
year_monthidconsumelabel
count10865.0000001.086500e+041.086500e+0410865.0
mean201706.4996785.417732e+15inf1.0
std0.5000232.635784e+15inf0.0
min201706.0000001.448104e+124.998779e-021.0
25%201706.0000003.118365e+154.068750e+011.0
50%201706.0000005.456594e+159.837500e+011.0
75%201707.0000007.687339e+151.785000e+021.0
max201707.0000009.997949e+151.324000e+031.0
time: 37.1 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 2 to 11199
Data columns (total 4 columns):
year_month    10865 non-null int32
id            10865 non-null int64
consume       10865 non-null float16
label         10865 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 244.0 KB
time: 6.9 ms
%matplotlib inline

# 按index日期排序

q1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.





<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c0659b70>
time: 11.3 s
q1[q1.consume == 1323.74]
year_monthidconsumelabel
486720170755109776033570001324.01
time: 11.1 ms
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 291 ms
q2 = q2.dropna(axis=0)
time: 77.7 ms
q2.head()
012345
11752398069509000华为PLK-AL1020170609223138201706092243451
21752398069509000乐视LETV X50120160924102711201609241124251
31752398069509000金立金立 GN80020150331210255201506301312321
41752398069509000金立GIONEE M520170508191216201706051923471
51752398069509000华为PLK-AL1020160618182839201707312359591
time: 8.16 ms
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.15 ms
q2.head()
idbrandtypefirst_use_timerecent_use_timelabel
11752398069509000华为PLK-AL1020170609223138201706092243451
21752398069509000乐视LETV X50120160924102711201609241124251
31752398069509000金立金立 GN80020150331210255201506301312321
41752398069509000金立GIONEE M520170508191216201706051923471
51752398069509000华为PLK-AL1020160618182839201707312359591
time: 8.58 ms
q2.describe()
idfirst_use_timerecent_use_timelabel
count1.973760e+051.973760e+051.973760e+05197376.0
mean5.436228e+152.015597e+132.015684e+131.0
std2.642924e+152.685010e+112.685124e+110.0
min1.448104e+12-1.000000e+00-1.000000e+001.0
25%3.227267e+152.015122e+132.016013e+131.0
50%5.353833e+152.016052e+132.016060e+131.0
75%7.764521e+152.016102e+132.016112e+131.0
max9.997949e+152.017073e+132.017073e+131.0
time: 64.7 ms
q2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197376 entries, 1 to 289201
Data columns (total 6 columns):
id                 197376 non-null int64
brand              197376 non-null object
type               197376 non-null object
first_use_time     197376 non-null int64
recent_use_time    197376 non-null int64
label              197376 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 9.2+ MB
time: 41.7 ms
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
Mem. usage decreased to  0.18 Mb (64.6% reduction)
time: 18.4 ms
q3 = q3.dropna(axis=0)
time: 6.41 ms
q3.head()
012345
0201707606247526482510088101
1201707816082995131430027001
2201707150807569852140019001
32017071686251204809800207001
42017075627768389537500133101
time: 7.62 ms
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.16 ms
q3.head()
year_monthidcall_numsis_trans_provincialis_transnationallabel
0201707606247526482510088101
1201707816082995131430027001
2201707150807569852140019001
32017071686251204809800207001
42017075627768389537500133101
time: 7.37 ms
q3.describe()
year_monthidcall_numsis_trans_provincialis_transnationallabel
count11200.0000001.120000e+0411200.00000011200.00000011200.00000011200.0
mean201706.5000005.416583e+1570.5622320.2354460.0144641.0
std0.5000222.642827e+1561.8201440.4242960.1194000.0
min201706.0000001.448104e+12-1.0000000.0000000.0000001.0
25%201706.0000003.117220e+1525.0000000.0000000.0000001.0
50%201706.5000005.456254e+1554.0000000.0000000.0000001.0
75%201707.0000007.702940e+1599.2500000.0000000.0000001.0
max201707.0000009.997949e+15727.0000001.0000001.0000001.0
time: 79.6 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month             11200 non-null int32
id                     11200 non-null int64
call_nums              11200 non-null int16
is_trans_provincial    11200 non-null int8
is_transnational       11200 non-null int8
label                  11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 273.4 KB
time: 7.47 ms
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4 = q4.dropna(axis=0)
q4.columns = ['year_month', 'id', 'province', 'label']
time: 935 µs
q4.head()
year_monthidprovincelabel
02017076062475264825100广东1
12017075627768389537500北京1
22017072000900444179600山西1
32017075304502776817600四川1
42017075304502776817600四川1
time: 6.84 ms
q4.describe()
year_monthidlabel
count7218.0000007.218000e+037218.0
mean201706.5385155.341915e+151.0
std0.4985492.631231e+150.0
min201706.0000001.739872e+131.0
25%201706.0000003.037311e+151.0
50%201707.0000005.367106e+151.0
75%201707.0000007.545199e+151.0
max201707.0000009.987407e+151.0
time: 22.2 ms
q4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7218 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7218 non-null int32
id            7218 non-null int64
province      7218 non-null object
label         7218 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 204.4+ KB
time: 6.74 ms
!ls /home/kesci/input/gzlt/train_set/201708q/
201708q1.txt  201708q3.txt  201708q6.txt
201708q2.txt  201708q4.txt  201708q7.txt
time: 667 ms
q6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q6.txt', sep='\t', header=None))
Mem. usage decreased to 62.58 Mb (52.1% reduction)
time: 3.9 s
q6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 868 µs
q6.head()
datehouriduser_longitudeuser_latitudelabel
02017-07-188.09239265006758100106.46754526.586251
12017-07-100.03859201812337600106.70821326.578541
22017-07-1618.03859201812337600106.54569026.567241
32017-07-178.03859201812337600106.54569026.567241
42017-07-2716.03859201812337600106.54569026.567241
time: 16.7 ms
q6.describe()
houriduser_longitudeuser_latitudelabel
count2.852871e+062.852871e+062.851527e+062.851527e+062852871.0
mean1.141897e+015.415213e+151.068143e+022.659968e+011.0
std6.632995e+002.634349e+155.580043e-012.852525e-010.0
min0.000000e+001.448104e+121.036700e+022.470664e+011.0
25%6.000000e+003.135488e+151.066656e+022.654610e+011.0
50%1.200000e+015.442594e+151.067027e+022.658143e+011.0
75%1.800000e+017.687963e+151.067373e+022.662629e+011.0
max2.200000e+019.997949e+151.095277e+022.909348e+011.0
time: 775 ms
q6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2852871 entries, 0 to 2852870
Data columns (total 6 columns):
date              object
hour              float64
id                int64
user_longitude    float64
user_latitude     float64
label             int64
dtypes: float64(3), int64(2), object(1)
memory usage: 130.6+ MB
time: 3.24 ms
q7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q7.txt', sep='\t', header=None))
Mem. usage decreased to  3.80 Mb (42.5% reduction)
time: 137 ms
q7 = q7.dropna(axis=0)
time: 35.4 ms
q7.columns = ['year_month', 'id', 'app', 'flow', 'label']
time: 1.54 ms
q7.head()
year_monthidappflowlabel
02017076610350034824100腾讯手机管家0.0100021
12017076997210664840100喜马拉雅FM27.3906251
22017073198621664927300网易新闻0.0299991
32017079987406611703100喜马拉雅FM0.0000001
42017071785540174324200天气通0.0200041
time: 8.14 ms
q7.describe()
year_monthidflowlabel
count173117.0000001.731170e+05173117.000000173117.0
mean201706.5396995.403100e+15NaN1.0
std0.4984232.667026e+15NaN0.0
min201706.0000001.448104e+120.0000001.0
25%201706.0000003.056260e+150.0100021.0
50%201707.0000005.429056e+150.0800171.0
75%201707.0000007.730223e+151.5996091.0
max201707.0000009.997949e+157828.0000001.0
time: 70.4 ms
q7.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 173117 entries, 0 to 173116
Data columns (total 5 columns):
year_month    173117 non-null int32
id            173117 non-null int64
app           173117 non-null object
flow          173117 non-null float16
label         173117 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1), object(1)
memory usage: 5.1+ MB
time: 29.8 ms

q1
将两月金额相加

q1.head()
year_monthidconsumelabel
2201706816082995131430082.750001
3201707816082995131430037.687501
4201706150807569852140068.000001
5201707150807569852140049.593751
62017061686251204809800200.750001
time: 7.05 ms
q1 = q1[['id', 'consume']]
time: 2.91 ms
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 747 ms
len(q1)
10865



time: 8.1 ms
q1[q1['id']==1448103998000]
idconsume
3532144810399800018.09375
3533144810399800044.28125
time: 8.84 ms
q1_groupbyid[:10]
consume
id
144810399800062.37500
17398718813730460.75000
6113262348600012.28125
68156596675520903.50000
76819334576430282.25000
78745100940550531.00000
110229638660000253.00000
122134826301000138.75000
13292326930400026.81250
138204830829320387.50000
time: 5.8 ms

q2
特征1 使用过的top9+其它手机品牌 共10个
特征2 使用的不同品牌数量

q2 = q2[['id', 'brand']]
time: 4.86 ms
q2.head(10)
idbrand
11752398069509000华为
21752398069509000乐视
31752398069509000金立
41752398069509000金立
51752398069509000华为
61752398069509000华为
71752398069509000金立
81752398069509000三星
94799656026499908三星
104799656026499908华为
time: 6.36 ms
groupbybrand = q2['brand'].value_counts()
time: 18.7 ms
len(groupbybrand)
750



time: 2.09 ms
%matplotlib inline

groupbybrand.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c00ea7b8>
time: 454 ms
groupbybrand[:10]
苹果      62347
华为      22266
欧珀      20516
维沃      17158
三星      13435
小米      10632
金立       9922
魅族       9708
乐视       5609
四季恒美     2163
Name: brand, dtype: int64



time: 3.52 ms
q2 = q2.drop_duplicates()
groupbyid = q2['id'].value_counts()
time: 19.6 ms
len(groupbyid)
5597



time: 2.23 ms
%matplotlib inline

groupbyid.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1bb56e048>
time: 294 ms
groupbyid[:10]
4104535378288025    115
8707678197418467    108
3900535090108175    104
3986280749497468     93
9196501153454276     88
5510977603357000     84
8569492566715454     78
1106540188374027     71
4091371962011072     71
4874962666674313     71
Name: id, dtype: int64



time: 3.27 ms
q1[q1['id']==4104535378288025]
year_monthidconsumelabel
105762017064104535378288025208.0001
105772017074104535378288025205.1251
time: 7.63 ms
# q2[q2['id']==4104535378288025]
time: 364 µs
type(groupbyid)
pandas.core.series.Series



time: 2.14 ms
type(groupbyid.to_frame())
pandas.core.frame.DataFrame



time: 3.13 ms
q2_groupbyid = groupbyid.reset_index()
time: 2.34 ms
q2_groupbyid.columns = ['id', 'phone_nums']
time: 1.19 ms
q2_groupbyid.head()
idphone_nums
04104535378288025115
18707678197418467108
23900535090108175104
3398628074949746893
4919650115345427688
time: 6.12 ms
type(q1_groupbyid)
pandas.core.frame.DataFrame



time: 2.15 ms
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
time: 6.42 ms
pos_set.head()
idconsumephone_nums
0144810399800062.375006
117398718813730460.7500023
26113262348600012.281251
368156596675520903.500004
476819334576430282.2500021
time: 7.11 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 3 columns):
id            5473 non-null int64
consume       5473 non-null float16
phone_nums    5473 non-null int64
dtypes: float16(1), int64(2)
memory usage: 139.0 KB
time: 6.27 ms

q3
1.将两月联络圈规模求和
2.将两月出省求和 是:1 否:0
3.将两月出国求和 是:1 否:0

q3.head()
year_monthidcall_numsis_trans_provincialis_transnationallabel
0201707606247526482510088101
1201707816082995131430027001
2201707150807569852140019001
32017071686251204809800207001
42017075627768389537500133101
time: 7.69 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
time: 1.95 s
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
time: 5.14 ms
pos_set.head()
idconsumephone_numscall_nums
0144810399800062.37500621
117398718813730460.7500023217
26113262348600012.28125161
368156596675520903.500004353
476819334576430282.2500021431
time: 7.94 ms
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
time: 9.61 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 6 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2)
memory usage: 160.3 KB
time: 7.3 ms

q4
1.两月内漫出省次数
2.所有省份one-hot或top10省份+其它省份
3.两月内漫出不同省个数

q4.head(10)
year_monthidprovincelabel
02017076062475264825100广东1
12017075627768389537500北京1
22017072000900444179600山西1
32017075304502776817600四川1
42017075304502776817600四川1
52017075304502776817600四川1
62017075304502776817600重庆1
72017078594396491246200广西1
82017078594396491246200广西1
92017078594396491246200广西1
time: 8.78 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).agg({'province': pd.Series.unique})
q4_groupbyid.head()
province
id
17398718813730重庆
61132623486000[福建, 河南, 江苏, 安徽]
68156596675520[辽宁, 广东]
132923269304000江西
138204830829320浙江
time: 322 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid.head()
id
17398718813730     1
61132623486000     8
68156596675520     3
132923269304000    1
138204830829320    2
dtype: int64



time: 6.52 ms
q4[q4['id']==61132623486000]
year_monthidprovincelabel
46120170761132623486000福建1
46220170761132623486000福建1
46320170761132623486000福建1
436320170661132623486000河南1
436420170661132623486000江苏1
436520170661132623486000安徽1
436620170661132623486000安徽1
436720170661132623486000江苏1
time: 8.26 ms
type(q4_groupbyid.reset_index())
pandas.core.frame.DataFrame



time: 4.03 ms
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
time: 2.73 ms
q4_groupbyid.head()
idprovince_out_cnt
0173987188137301
1611326234860008
2681565966755203
31329232693040001
41382048308293202
time: 5.73 ms
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.head()
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cnt
0144810399800062.3750062100NaN
117398718813730460.7500023217101.0
26113262348600012.28125161208.0
368156596675520903.500004353203.0
476819334576430282.250002143100NaN
time: 14.6 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
province_out_cnt       1913 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 7.53 ms
pos_set = pos_set.fillna(0)
time: 2.46 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
province_out_cnt       5473 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 8.02 ms
# inner
# pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1913 entries, 0 to 1912
Data columns (total 7 columns):
id                     1913 non-null int64
consume                1913 non-null float16
phone_nums             1913 non-null int64
call_nums              1913 non-null int16
is_trans_provincial    1913 non-null int8
is_transnational       1913 non-null int8
province_out_cnt       1913 non-null int64
dtypes: float16(1), int16(1), int64(3), int8(2)
memory usage: 71.0 KB
time: 6.67 ms

q6 暂时忽略
q7
1.使用总流量
2.使用不同APP数量
3.某些特定(旅游相关)APP是否使用

q7.head()
year_monthidappflowlabel
02017076610350034824100腾讯手机管家0.0100021
12017076997210664840100喜马拉雅FM27.3906251
22017073198621664927300网易新闻0.0299991
32017079987406611703100喜马拉雅FM0.0000001
42017071785540174324200天气通0.0200041
time: 7.94 ms
q7_groupbyapp = q7.groupby(['app']).agg({'flow': pd.Series.sum})
time: 135 ms
len(q7_groupbyapp)
762



time: 2.04 ms
q7_groupbyapp.sort_values(by='flow', ascending=False)
flow
app
网易云音乐inf
爱奇艺视频inf
微信inf
新浪微博inf
QQ音乐inf
今日头条inf
QQ57856.0
手机百度53408.0
陌陌43488.0
iTunes35392.0
腾讯新闻25952.0
快手24256.0
手机淘宝18400.0
UC浏览器16608.0
酷狗音乐15360.0
高德地图14984.0
酷我音乐13488.0
新浪新闻13432.0
唯品会11504.0
腾讯视频10760.0
优酷视频10736.0
汽车之家9984.0
百度地图9816.0
美团9400.0
网易新闻8648.0
AppStore7776.0
中国联通手机营业厅6736.0
百度贴吧6104.0
凤凰新闻5504.0
虾米音乐5020.0
......
百才招聘网0.0
碰碰0.0
禾文阿思看图购0.0
科学作息时间表0.0
章鱼输入法0.0
米折0.0
约会吧0.0
网易微博0.0
表情大全0.0
欢乐互娱0.0
博客大巴0.0
查快递0.0
邮储银行0.0
号簿助手0.0
司机邦0.0
壁纸多多0.0
天天聊0.0
天翼阅读0.0
安全管家0.0
安卓游戏盒子0.0
安软市场0.0
车网互联0.0
宜搜搜索0.0
工程师爸爸0.0
彩票控0.0
贝瓦儿歌0.0
搜狗壁纸0.0
智远一户通0.0
诚品快拍0.0
07073手游中心0.0

762 rows × 1 columns

time: 12.4 ms
pos_set.describe()
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cnt
count5.473000e+035473.0000005473.0000005473.0000005473.0000005473.0000005473.000000
mean5.417038e+15inf8.228942141.2019000.4745110.0296001.300018
std2.637784e+15inf8.551830121.2628260.7061620.1879043.110401
min1.448104e+120.0999761.000000-2.0000000.0000000.0000000.000000
25%3.113785e+1582.0000003.00000052.0000000.0000000.0000000.000000
50%5.457364e+15198.2500006.000000108.0000000.0000000.0000000.000000
75%7.688781e+15355.25000010.000000198.0000001.0000000.0000001.000000
max9.997949e+152392.000000115.0000001035.0000002.0000002.00000042.000000
time: 126 ms
pos_set['label'] = 1
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cntlabel
0144810399800062.3750062100NaN1
117398718813730460.7500023217101.01
26113262348600012.28125161208.01
368156596675520903.500004353203.01
476819334576430282.250002143100NaN1
time: 10.5 ms
pos_set.fillna(0)
pos_set.head()
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cntlabel
0144810399800062.3750062100NaN1
117398718813730460.7500023217101.01
26113262348600012.28125161208.01
368156596675520903.500004353203.01
476819334576430282.250002143100NaN1
time: 23.5 ms
0.1.2 负样本
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.dropna(axis=0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})

n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2 = n2.dropna(axis=0)
n2 = n2[['id', 'brand']]
n2 = n2.drop_duplicates()
n2_groupbyid = n2['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']

neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set.head()
Mem. usage decreased to  2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
idconsumephone_nums
01009387204000225.0000004
111673163030001.1992194
21883071709000213.5000008
33393143830010517.5000006
4456897316200018.0781253
time: 10.8 s
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 3 columns):
id            76515 non-null int64
consume       76515 non-null float16
phone_nums    76515 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.9 MB
time: 11.1 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']

n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])

n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']

n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set.head()
Mem. usage decreased to  3.03 Mb (64.6% reduction)
Mem. usage decreased to  0.73 Mb (34.4% reduction)
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cnt
01009387204000225.000000419000.0
111673163030001.19921946000.0
21883071709000213.500000840000.0
33393143830010517.5000006205102.0
4456897316200018.078125317000.0
time: 32.5 s
neg_set['label'] = 0
time: 1.83 ms
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 8 columns):
id                     76515 non-null int64
consume                76515 non-null float16
phone_nums             76515 non-null int64
call_nums              76515 non-null int16
is_trans_provincial    76515 non-null int8
is_transnational       76515 non-null int8
province_out_cnt       76515 non-null float64
label                  76515 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2)
memory usage: 3.4 MB
time: 18.9 ms
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
Mem. usage decreased to  2.67 Mb (53.1% reduction)
time: 484 ms
n1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.28 ms
n1.head()
year_monthidconsumelabel
020170785705188329061009.000
1201707218264093871870010.000
22017077836143444290008.380
32017072007036960106400100.000
42017079482847959399300226.050
time: 7.22 ms
n1.describe()
year_monthidconsumelabel
count186800.0000001.868000e+05150750.000000186800.0
mean201706.5000005.464219e+1563.5800280.0
std0.5000012.633848e+1584.0636000.0
min201706.0000001.009387e+12-70.6600000.0
25%201706.0000003.192389e+1512.9300000.0
50%201706.5000005.486486e+1534.0000000.0
75%201707.0000007.744140e+1582.5000000.0
max201707.0000009.999717e+153979.9400000.0
time: 52.5 ms
n1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 4 columns):
year_month    186800 non-null int64
id            186800 non-null int64
consume       150750 non-null float64
label         186800 non-null int64
dtypes: float64(1), int64(3)
memory usage: 5.7 MB
time: 21.7 ms
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
Mem. usage decreased to 51.13 Mb (14.6% reduction)
time: 7.76 s
n2.head()
012345
05227696575283900苹果A169920150331210636201507010630170
16279759720262000NaNNaN20160725112240201707312359590
26279759720262000NaNNaN20161205220417201612052204170
36279759720262000三星SM-A900020161128231001201611282310010
46279759720262000NaNNaN20161220102623201703061737130
time: 8.15 ms
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.2 ms
n2.head()
idbrandtypefirst_use_timerecent_use_timelabel
05227696575283900苹果A169920150331210636201507010630170
16279759720262000NaNNaN20160725112240201707312359590
26279759720262000NaNNaN20161205220417201612052204170
36279759720262000三星SM-A900020161128231001201611282310010
46279759720262000NaNNaN20161220102623201703061737130
time: 8.3 ms
n2.describe()
idfirst_use_timerecent_use_timelabel
count1.307608e+061.307608e+061.307608e+061307608.0
mean5.460966e+151.999810e+131.999992e+130.0
std2.619222e+151.801007e+121.801171e+120.0
min1.009387e+12-1.000000e+00-1.000000e+000.0
25%3.196695e+152.015112e+132.016022e+130.0
50%5.477102e+152.016071e+132.016101e+130.0
75%7.728047e+152.016123e+132.017023e+130.0
max9.999717e+152.017073e+132.017073e+130.0
time: 252 ms
n2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307608 entries, 0 to 1307607
Data columns (total 6 columns):
id                 1307608 non-null int64
brand              894190 non-null object
type               894205 non-null object
first_use_time     1307608 non-null int64
recent_use_time    1307608 non-null int64
label              1307608 non-null int64
dtypes: int64(4), object(2)
memory usage: 59.9+ MB
time: 251 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
Mem. usage decreased to  3.03 Mb (64.6% reduction)
time: 584 ms
n3.head()
012345
0201707429527767743700036100
1201707912133596906200037000
22017079438277095447300-1000
3201707674985487653250020000
4201707154536180938140026000
time: 7.82 ms
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.13 ms
n3.head()
year_monthidcall_numsis_trans_provincialis_transnationallabel
0201707429527767743700036100
1201707912133596906200037000
22017079438277095447300-1000
3201707674985487653250020000
4201707154536180938140026000
time: 7.49 ms
n3.describe()
year_monthidcall_numsis_trans_provincialis_transnationallabel
count186800.0000001.868000e+05186800.000000186800.000000186800.000000186800.0
mean201706.5000005.464219e+1532.6747970.0932920.0050540.0
std0.5000012.633848e+1546.0549290.2908420.0709090.0
min201706.0000001.009387e+12-1.0000000.0000000.0000000.0
25%201706.0000003.192389e+154.0000000.0000000.0000000.0
50%201706.5000005.486486e+1519.0000000.0000000.0000000.0
75%201707.0000007.744140e+1543.0000000.0000000.0000000.0
max201707.0000009.999717e+151807.0000001.0000001.0000000.0
time: 75.7 ms
n3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 6 columns):
year_month             186800 non-null int64
id                     186800 non-null int64
call_nums              186800 non-null int64
is_trans_provincial    186800 non-null int64
is_transnational       186800 non-null int64
label                  186800 non-null int64
dtypes: int64(6)
memory usage: 8.6 MB
time: 26.6 ms
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
Mem. usage decreased to  0.73 Mb (34.4% reduction)
time: 88.8 ms
n4.columns = ['year_month', 'id', 'province', 'label']
time: 1.15 ms
n4.head()
year_monthidprovincelabel
02017074295277677437000重庆0
12017075560109665240300广西0
22017075560109665240300广东0
32017075560109665240300广东0
42017075705601521649600重庆0
time: 7.14 ms
n4.describe()
year_monthidlabel
count36499.0000003.649900e+0436499.0
mean201706.5391935.471019e+150.0
std0.4984682.639006e+150.0
min201706.0000003.393144e+120.0
25%201706.0000003.203830e+150.0
50%201707.0000005.468480e+150.0
75%201707.0000007.753756e+150.0
max201707.0000009.999305e+150.0
time: 24.4 ms
n4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36499 entries, 0 to 36498
Data columns (total 4 columns):
year_month    36499 non-null int64
id            36499 non-null int64
province      36099 non-null object
label         36499 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.1+ MB
time: 9.97 ms
!ls /home/kesci/input/gzlt/train_set/201708n/
201708n1.txt  201708n3.txt  201708n6.txt
201708n2.txt  201708n4.txt  201708n7.txt
time: 669 ms
n6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n6.txt', sep='\t', header=None))
Mem. usage decreased to 798.26 Mb (52.1% reduction)
time: 2min 59s
n6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 1.51 ms
n6.head()
datehouriduser_longitudeuser_latitudelabel
02017-07-0210.07748777616409800106.68081626.5636500
12017-07-100.07748777616409800106.71952026.5763700
22017-07-3114.07748777616409800106.68306026.6546630
32017-07-010.06633710902197900106.69744026.6139300
42017-07-0814.06633710902197900106.71570026.6097100
time: 9.14 ms
q6.describe()
houriduser_longitudeuser_latitudelabel
count2.852871e+062.852871e+062.851527e+062.851527e+062852871.0
mean1.141897e+015.415213e+151.068143e+022.659968e+011.0
std6.632995e+002.634349e+155.580043e-012.852525e-010.0
min0.000000e+001.448104e+121.036700e+022.470664e+011.0
25%6.000000e+003.135488e+151.066656e+022.654610e+011.0
50%1.200000e+015.442594e+151.067027e+022.658143e+011.0
75%1.800000e+017.687963e+151.067373e+022.662629e+011.0
max2.200000e+019.997949e+151.095277e+022.909348e+011.0
time: 979 ms
n6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36393070 entries, 0 to 36393069
Data columns (total 6 columns):
date              object
hour              float64
id                int64
user_longitude    float64
user_latitude     float64
label             int64
dtypes: float64(3), int64(2), object(1)
memory usage: 1.6+ GB
time: 3.76 ms
n7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n7.txt', sep='\t', header=None))
Mem. usage decreased to 17.98 Mb (31.2% reduction)
time: 3.14 s
n7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.44 ms
n7.head()
year_monthidappflow
02017074011022166491000米聊0.01
12017078544172893207700百度地图2.07
22017079856572220983403搜狗输入法0.00
32017076441300393946200爱奇艺视频0.00
42017078751918977379700开心消消乐0.03
time: 7.51 ms
# n7['label'] = 0
time: 2.94 ms
# n7.head()
year_monthidappflowlabel
02017074011022166491000米聊0.010
12017078544172893207700百度地图2.070
22017079856572220983403搜狗输入法0.000
32017076441300393946200爱奇艺视频0.000
42017078751918977379700开心消消乐0.030
time: 8.46 ms
n7.describe()
year_monthidflowlabel
count856961.0000008.569610e+05856961.000000856961.0
mean201706.5358815.432556e+159.9425330.0
std0.4987112.643712e+1568.0969440.0
min201706.0000001.009387e+120.0000000.0
25%201706.0000003.134290e+150.0000000.0
50%201707.0000005.440495e+150.0600000.0
75%201707.0000007.727765e+151.1300000.0
max201707.0000009.999717e+1510986.1500000.0
time: 170 ms
n7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856961 entries, 0 to 856960
Data columns (total 5 columns):
year_month    856961 non-null int64
id            856961 non-null int64
app           856961 non-null object
flow          856961 non-null float64
label         856961 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 32.7+ MB
time: 116 ms
0.1.3 天气数据
!ls /home/kesci/input/gzlt/train_set/weather_data_2017/
weather_forecast_2017.txt  weather_reported_2017.txt  天气现象编码.xlsx
time: 669 ms
weather_reported = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_reported_2017.txt', sep='\t')
time: 6.15 ms
weather_reported.head()
Station_NameVACODEYearMonthDayTEM_AvgTEM_MaxTEM_MinPRE_Time_2020WEP_Record
0麻江52263520176123.0024.520.90.6( 01 60 ) 60 .
1三穗52262420176121.1325.619.49.0( 01 10 80 ) 80 60 .
2镇远52262520176122.6826.521.38.9( 60 ) 60 .
3雷山52263420176123.8026.120.45.1( 10 ) 60 .
4剑河52262920176123.5327.122.06.8( 01 10 80 ) 80 10 .
time: 12.2 ms
# weather_reported.columns = ['Station_Name', 'VACODE', 'Year', 'Month', 'Day', 'TEM_Avg', 'TEM_Max', 'TEM_Min', 'PRE_Time_2020', 'WEP_Record']
time: 1.25 ms
weather_reported.describe()
Station_NameVACODEYearMonthDayTEM_AvgTEM_MaxTEM_MinPRE_Time_2020WEP_Record
count1404140414041404140414041404140414041404
unique24252332448214109330305
top贵阳52000020177422.8330.520.50.0( 01 ) 01 .
freq61360140371346101835625197
time: 49.9 ms
weather_reported.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 10 columns):
Station_Name     1404 non-null object
VACODE           1404 non-null object
Year             1404 non-null object
Month            1404 non-null object
Day              1404 non-null object
TEM_Avg          1404 non-null object
TEM_Max          1404 non-null object
TEM_Min          1404 non-null object
PRE_Time_2020    1404 non-null object
WEP_Record       1404 non-null object
dtypes: object(10)
memory usage: 109.8+ KB
time: 6.32 ms
weather_forecast = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_forecast_2017.txt', sep='\t')
time: 10.8 ms
weather_forecast.head()
Station_NameVACODEYearMonDayTEM_Max_24hTEM_Min_24hWEP_24hTEM_Max_48hTEM_Min_48h...TEM_Max_120hTEM_Min_120hWEP_120hTEM_Max_144hTEM_Min_144hWEP_144hTEM_Max_168hTEM_Min_168h,WEP_168hUnnamed: 24Unnamed: 25
0白云52011320176125.017.0(2)124.019.0...(4)225.015.0(2)127.015.0(1)026.016.0(1)0
1岑巩52262620176131.319.4(1)131.022.0...(4)132.019.4(1)132.022.8(1)132.021.0(1)1
2从江52263320176133.422.0(1)130.023.0...(4)334.022.0(1)134.023.8(1)134.022.0(1)1
3丹寨52263620176127.518.0(1)124.520.0...(4)128.518.0(1)128.521.0(1)128.520.0(1)1
4贵阳52010320176126.018.0(2)125.020.0...(4)226.016.0(2)128.016.0(1)027.017.0(1)0

5 rows × 26 columns

time: 86.4 ms
weather_forecast.describe()
VACODEYearMonDayTEM_Max_24hTEM_Min_24hTEM_Max_48hTEM_Min_48hTEM_Max_72hTEM_Min_72h,WEP_72hTEM_Min_96hWEP_96hTEM_Min_120hWEP_120hTEM_Min_144hWEP_144hTEM_Min_168h,WEP_168hUnnamed: 24
count1464.0000001464.01464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.0000001464.000000
mean521792.5833332017.06.50819715.75409828.37465820.72158528.37582020.87281428.28381121.11243228.53948121.40812828.70225421.45471329.14262321.48565629.13162621.589003
std1180.8911630.00.5001048.8099664.3003912.2908504.3797712.2327884.3291322.2049804.1541885.2035254.1674415.2382574.1240262.1802224.0332272.391945
min520103.0000002017.06.0000001.00000017.30000013.80000017.30000013.60000017.00000010.00000019.00000014.30000019.00000015.00000018.00000015.00000018.0000002.000000
25%520122.7500002017.06.0000008.00000025.00000019.00000025.00000019.40000025.00000019.60000025.50000019.70000026.00000019.70000026.00000020.00000026.50000020.000000
50%522624.5000002017.07.00000016.00000028.50000021.00000028.50000021.00000028.00000021.00000028.50000021.50000028.50000021.50000029.00000022.00000029.00000022.000000
75%522630.2500002017.07.00000023.00000031.80000022.50000031.60000022.50000031.50000023.00000031.50000023.00000032.00000023.00000032.00000023.00000032.00000023.500000
max522636.0000002017.07.00000031.00000039.00000025.70000039.50000025.50000038.00000025.80000039.000000200.00000039.000000202.00000038.80000025.80000037.50000026.000000
time: 121 ms
weather_forecast.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1464 entries, 0 to 1463
Data columns (total 26 columns):
Station_Name             1464 non-null object
VACODE                   1464 non-null int64
Year                     1464 non-null int64
Mon                      1464 non-null int64
Day                      1464 non-null int64
TEM_Max_24h              1464 non-null float64
TEM_Min_24h              1464 non-null float64
WEP_24h                  1464 non-null object
TEM_Max_48h              1464 non-null float64
TEM_Min_48h              1464 non-null float64
WEP_48h                  1464 non-null object
TEM_Max_72h              1464 non-null float64
TEM_Min_72h,WEP_72h      1464 non-null float64
TEM_Max_96h              1464 non-null object
TEM_Min_96h              1464 non-null float64
WEP_96h                  1464 non-null float64
TEM_Max_120h             1464 non-null object
TEM_Min_120h             1464 non-null float64
WEP_120h                 1464 non-null float64
TEM_Max_144h             1464 non-null object
TEM_Min_144h             1464 non-null float64
WEP_144h                 1464 non-null float64
TEM_Max_168h             1464 non-null object
TEM_Min_168h,WEP_168h    1464 non-null float64
Unnamed: 24              1464 non-null float64
Unnamed: 25              1464 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.5+ KB
time: 9.2 ms
0.2 测试数据
0.2.1 测试集
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.dropna(axis=0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})

t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.dropna(axis=0)
t2 = t2[['id', 'brand']]
t2 = t2.drop_duplicates()
t2_groupbyid = t2['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']

test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set.head()
Mem. usage decreased to  1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
idconsumephone_nums
0595941207920220.00010
1901845022650662.0006
21868765858840143.3754
35058794512580200.0007
45399381591230192.00029
time: 7.86 s
test_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43977 entries, 0 to 43976
Data columns (total 3 columns):
id            43977 non-null int64
consume       43977 non-null float16
phone_nums    43977 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.1 MB
time: 9.02 ms
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']

t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])

t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']

t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.head()
Mem. usage decreased to  1.53 Mb (60.0% reduction)
Mem. usage decreased to  0.85 Mb (16.7% reduction)
idconsumephone_numscall_numsis_trans_provincialis_transnationalprovince_out_cnt
0595941207920220.0001068101.0
1901845022650662.0006278000.0
21868765858840143.3754107203.0
35058794512580200.0007128000.0
45399381591230192.0002961000.0
time: 17.4 s
!ls /home/kesci/input/gzlt/test_set/
201808	weather_data_2018
time: 704 ms
!ls /home/kesci/input/gzlt/test_set/201808
2018_1.txt  2018_2.txt	2018_3.txt  2018_4.txt	2018_6.txt  2018_7.txt
time: 702 ms
t1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
time: 527 ms
t1.columns = ['year_month', 'id', 'consume']
time: 1.27 ms
t1.head()
year_monthidconsume
02018076401824160010748618.40
12018076506134548135499NaN
2201807599692088461995422.05
320180611872094245437137.20
42018079297165066591558124.00
time: 99.9 ms
t1.describe()
year_monthidconsume
count100402.0000001.004020e+0586787.000000
mean201806.5000005.449905e+15103.357399
std0.5000022.628916e+15311.428596
min201806.0000005.959412e+110.010000
25%201806.0000003.176902e+1536.500000
50%201806.5000005.440931e+1581.000000
75%201807.0000007.726318e+15132.125000
max201807.0000009.999920e+1561465.900000
time: 50.6 ms
t1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100402 entries, 0 to 100401
Data columns (total 3 columns):
year_month    100402 non-null int64
id            100402 non-null int64
consume       86787 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
time: 12.6 ms
%matplotlib inline

# 按index日期排序

t1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.





<matplotlib.axes._subplots.AxesSubplot at 0x7fbd4cd3c978>
time: 17 s
t1[t1.consume == 61465.9]
year_monthidconsume
11962201807482780686030130761465.9
time: 7.15 ms
t2 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None)
time: 11.8 s
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
time: 1.18 ms
t2.head()
idbrandtypefirst_use_timerecent_use_time
03179771753483280魅族M5752018060115105220180601151054
14185007692177509NaNNaN2017102118291520171021183000
24972845789896505NaNNaN2018062400364720180624003656
34207293827582218NaNNaN2017122416590220180306175444
42628020151876580NaNNaN2017082011105320171207020159
time: 7.95 ms
t2.describe()
idfirst_use_timerecent_use_time
count1.586024e+061.586024e+061.586024e+06
mean5.410516e+152.017033e+132.017156e+13
std2.618994e+156.902153e+096.865591e+09
min5.959412e+112.016032e+132.016033e+13
25%3.140763e+152.016122e+132.017021e+13
50%5.389338e+152.017063e+132.017080e+13
75%7.660413e+152.017122e+132.018013e+13
max9.999920e+152.018073e+132.018073e+13
time: 353 ms
t2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586024 entries, 0 to 1586023
Data columns (total 5 columns):
id                 1586024 non-null int64
brand              1098244 non-null object
type               1098250 non-null object
first_use_time     1586024 non-null int64
recent_use_time    1586024 non-null int64
dtypes: int64(3), object(2)
memory usage: 60.5+ MB
time: 291 ms
t3 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None)
time: 451 ms
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
time: 1.14 ms
t3.head()
year_monthidcall_numsis_trans_provincialis_transnational
020180636908147030033614900
12018074315823592069831-100
22018065199170013029443-100
320180613876582058952033500
42018073280240784164442-100
time: 7.12 ms
t3.describe()
year_monthidcall_numsis_trans_provincialis_transnational
count100400.0000001.004000e+05100400.000000100400.000000100400.000000
mean201806.5000005.449990e+1551.6421020.2061160.012809
std0.5000022.628873e+1590.7059570.4045160.112449
min201806.0000005.959412e+11-1.0000000.0000000.000000
25%201806.0000003.177008e+156.0000000.0000000.000000
50%201806.5000005.441108e+1531.0000000.0000000.000000
75%201807.0000007.726328e+1571.0000000.0000000.000000
max201807.0000009.999920e+156537.0000001.0000001.000000
time: 46.4 ms
t3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100400 entries, 0 to 100399
Data columns (total 5 columns):
year_month             100400 non-null int64
id                     100400 non-null int64
call_nums              100400 non-null int64
is_trans_provincial    100400 non-null int64
is_transnational       100400 non-null int64
dtypes: int64(5)
memory usage: 3.8 MB
time: 15.1 ms
t4 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None)
time: 240 ms
t4.columns = ['year_month', 'id', 'province']
time: 1.2 ms
t4.head()
year_monthidprovince
02018078445647072009305广东
12018069414872397547413浙江
22018062272887111818372广东
3201807224368910874770湖北
42018076081677258986878NaN
time: 6.81 ms
t4.describe()
year_monthid
count44543.0000004.454300e+04
mean201806.5303195.448788e+15
std0.4990862.640390e+15
min201806.0000005.959412e+11
25%201806.0000003.118911e+15
50%201807.0000005.430117e+15
75%201807.0000007.751481e+15
max201807.0000009.999505e+15
time: 20.3 ms
t4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44543 entries, 0 to 44542
Data columns (total 3 columns):
year_month    44543 non-null int64
id            44543 non-null int64
province      44119 non-null object
dtypes: int64(2), object(1)
memory usage: 1.0+ MB
time: 9.73 ms
t6 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_6.txt', sep='\t', header=None)
time: 2min 7s
t6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude']
time: 1.22 ms
t6.head()
datehouriduser_longitudeuser_latitude
02018-06-10201929821481825935106.28990226.837687
12018-07-14185450093661688579106.64197526.627846
22018-07-1624617571498633816106.23042027.466980
32018-06-15222826359445811398106.69361026.591110
42018-06-22103526202744290054107.03257027.715830
time: 8.4 ms
t6.describe()
houriduser_longitudeuser_latitude
count1.655899e+071.655899e+071.655081e+071.655081e+07
mean1.144987e+015.461505e+151.066642e+022.662386e+01
std6.742805e+002.629564e+154.626476e-013.195807e-01
min0.000000e+005.959412e+111.036700e+022.469706e+01
25%6.000000e+003.191837e+151.066328e+022.655164e+01
50%1.200000e+015.475087e+151.066902e+022.658444e+01
75%1.800000e+017.732384e+151.067199e+022.663778e+01
max2.200000e+019.999920e+151.095534e+022.916468e+01
time: 6.3 s
t6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16558993 entries, 0 to 16558992
Data columns (total 5 columns):
date              object
hour              int64
id                int64
user_longitude    float64
user_latitude     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 631.7+ MB
time: 3.04 ms
t7 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_7.txt', sep='\t', header=None)
time: 8.75 s
t7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.18 ms
t7.head()
year_monthidappflow
02018069813651010156104OPPO软件商店14545.00
12018062338567014163500腾讯新闻0.19
22018071133512913801798讯飞输入法0.01
32018077739596338372898手机百度1615.00
42018075724269192271018百度贴吧1301953.00
time: 15.6 ms
t7.describe()
year_monthidflow
count1.493733e+061.493733e+061.492434e+06
mean2.018065e+055.468351e+158.991198e+07
std4.999895e-012.628382e+158.503798e+08
min2.018060e+055.959412e+110.000000e+00
25%2.018060e+053.196619e+156.519000e+03
50%2.018070e+055.477012e+152.883350e+05
75%2.018070e+057.737568e+157.842132e+06
max2.018070e+059.999920e+153.341152e+11
time: 226 ms
t7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1493733 entries, 0 to 1493732
Data columns (total 4 columns):
year_month    1493733 non-null int64
id            1493733 non-null int64
app           1457137 non-null object
flow          1492434 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 45.6+ MB
time: 178 ms
0.2.2 天气数据
!ls /home/kesci/input/gzlt/test_set/weather_data_2018/
weather_forecast_2018.txt  weather_reported_2018.txt
time: 830 ms
weather_reported_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_reported_2018.txt', sep='\t')
time: 8.57 ms
weather_reported_2018.head()
Station_NameVACODEYearMonthDayTEM_AvgTEM_MaxTEM_MinPRE_Time_2020WEP_Record
0镇远52262520186119.021.017.80.1( 60 01 ) 01 60 10 .
1丹寨52263620186117.019.915.34.3( 60 80 ) 80 .
2三穗52262420186117.819.217.00.6( 80 10 ) 60 10 .
3台江52263020186118.821.117.51.4( 60 01 ) 01 60 10 .
4剑河52262920186119.221.617.92.1( 60 ) 60 10 .
time: 12.6 ms
weather_reported_2018.describe()
VACODEYearMonthDayTEM_AvgTEM_MaxTEM_MinPRE_Time_2020
count1403.0000001403.01403.0000001403.0000001403.0000001403.0000001403.0000001403.000000
mean521862.9344262018.06.50819715.754098737.393799742.297577734.0111194.922594
std1155.9721440.00.5001118.81009726696.85026826696.71941526696.94060415.090986
min520103.0000002018.06.0000001.00000015.10000016.20000011.8000000.000000
25%520122.0000002018.06.0000008.00000022.90000027.30000020.0000000.000000
50%522625.0000002018.07.00000016.00000025.10000030.10000021.6000000.000000
75%522631.0000002018.07.00000023.00000026.90000032.55000023.0500002.100000
max522636.0000002018.07.00000031.000000999999.000000999999.000000999999.000000281.700000
time: 118 ms
weather_reported_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 10 columns):
Station_Name     1403 non-null object
 VACODE          1403 non-null int64
 Year            1403 non-null int64
 Month           1403 non-null int64
 Day             1403 non-null int64
 TEM_Avg         1403 non-null float64
 TEM_Max         1403 non-null float64
 TEM_Min         1403 non-null float64
PRE_Time_2020    1403 non-null float64
WEP_Record       1403 non-null object
dtypes: float64(4), int64(4), object(2)
memory usage: 109.7+ KB
time: 6.7 ms
weather_forecast_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt', sep='\t')
time: 12 ms
weather_forecast_2018.head()
Station_NameVACODEYearMonDayTEM_Max_24hTEM_Min_24hWEP_24hTEM_Max_48hTEM_Min_48h...TEM_Max_120hTEM_Min_120hWEP_120hTEM_Max_144hTEM_Min_144hWEP_144hTEM_Max_168hTEM_Min_168h,WEP_168hUnnamed: 24Unnamed: 25
0白云52011320186120.214.8(3)223.215.8...(2)127.513.5(1)126.014.0(2)124.016.0(1)1
1岑巩52262620186125.517.5(2)228.520.2...(2)031.017.0(0)031.018.5(0)131.021.5(1)1
2从江52263320186127.319.0(7)229.522.0...(21)033.519.6(0)033.520.2(0)131.523.0(1)1
3丹寨52263620186123.015.5(2)226.019.2...(2)028.016.2(0)028.017.2(0)127.019.5(1)1
4贵阳52010320186120.914.9(3)224.016.4...(2)128.014.0(1)126.014.0(2)124.016.0(1)1

5 rows × 26 columns

time: 54.2 ms
weather_forecast_2018.describe()
VACODEYearMonDayTEM_Max_24hTEM_Min_24hTEM_Max_48hTEM_Min_48hTEM_Max_72hTEM_Min_72h,WEP_72hTEM_Min_96hWEP_96hTEM_Min_120hWEP_120hTEM_Min_144hWEP_144hTEM_Min_168h,WEP_168hUnnamed: 24
count1463.0000001463.01463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.000000
mean521793.7382092018.06.50854415.75939829.72460721.24470329.72447021.38523629.69446321.65543429.92494921.88694529.89118322.01093630.02734122.05522930.19296021.985373
std1180.4676380.00.5000988.8106433.4701282.5361033.2327372.3852373.1677892.2705053.1308862.1310203.1917212.0666403.1994602.0921553.1676762.227871
min520103.0000002018.06.0000001.00000017.80000010.80000018.00000012.00000016.50000012.50000016.50000014.00000014.50000013.00000017.00000013.20000016.00000015.000000
25%520123.0000002018.06.0000008.00000027.50000020.00000027.50000020.00000027.50000020.20000028.00000020.50000027.50000021.00000028.00000021.00000028.00000020.850000
50%522625.0000002018.07.00000016.00000030.00000022.00000029.90000022.00000029.50000022.00000030.00000022.00000030.00000022.20000030.00000022.10000030.00000022.200000
75%522630.5000002018.07.00000023.00000032.35000023.00000032.00000023.00000032.30000023.30000032.50000023.50000032.50000023.50000032.50000023.70000032.60000024.000000
max522636.0000002018.07.00000031.00000037.50000027.00000037.00000025.90000036.50000026.00000036.50000026.00000036.50000026.20000037.00000026.00000037.00000030.000000
time: 74 ms
weather_forecast_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1463 entries, 0 to 1462
Data columns (total 26 columns):
Station_Name             1463 non-null object
VACODE                   1463 non-null int64
Year                     1463 non-null int64
Mon                      1463 non-null int64
Day                      1463 non-null int64
TEM_Max_24h              1463 non-null float64
TEM_Min_24h              1463 non-null float64
WEP_24h                  1463 non-null object
TEM_Max_48h              1463 non-null float64
TEM_Min_48h              1463 non-null float64
WEP_48h                  1463 non-null object
TEM_Max_72h              1463 non-null float64
TEM_Min_72h,WEP_72h      1463 non-null float64
TEM_Max_96h              1463 non-null object
TEM_Min_96h              1463 non-null float64
WEP_96h                  1463 non-null float64
TEM_Max_120h             1463 non-null object
TEM_Min_120h             1463 non-null float64
WEP_120h                 1463 non-null float64
TEM_Max_144h             1463 non-null object
TEM_Min_144h             1463 non-null float64
WEP_144h                 1463 non-null float64
TEM_Max_168h             1463 non-null object
TEM_Min_168h,WEP_168h    1463 non-null float64
Unnamed: 24              1463 non-null float64
Unnamed: 25              1463 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.2+ KB
time: 11 ms
!jupyter nbconvert --to markdown "“联创黔线”杯大数据应用创新大赛.ipynb"

0.000000


25%
520122.000000
2018.0
6.000000
8.000000
22.900000
27.300000
20.000000
0.000000


50%
522625.000000
2018.0
7.000000
16.000000
25.100000
30.100000
21.600000
0.000000


75%
522631.000000
2018.0
7.000000
23.000000
26.900000
32.550000
23.050000
2.100000


max
522636.000000
2018.0
7.000000
31.000000
999999.000000
999999.000000
999999.000000
281.700000

time: 118 ms
weather_reported_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 10 columns):
Station_Name     1403 non-null object
 VACODE          1403 non-null int64
 Year            1403 non-null int64
 Month           1403 non-null int64
 Day             1403 non-null int64
 TEM_Avg         1403 non-null float64
 TEM_Max         1403 non-null float64
 TEM_Min         1403 non-null float64
PRE_Time_2020    1403 non-null float64
WEP_Record       1403 non-null object
dtypes: float64(4), int64(4), object(2)
memory usage: 109.7+ KB
time: 6.7 ms
weather_forecast_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt', sep='\t')
time: 12 ms
weather_forecast_2018.head()
Station_NameVACODEYearMonDayTEM_Max_24hTEM_Min_24hWEP_24hTEM_Max_48hTEM_Min_48h...TEM_Max_120hTEM_Min_120hWEP_120hTEM_Max_144hTEM_Min_144hWEP_144hTEM_Max_168hTEM_Min_168h,WEP_168hUnnamed: 24Unnamed: 25
0白云52011320186120.214.8(3)223.215.8...(2)127.513.5(1)126.014.0(2)124.016.0(1)1
1岑巩52262620186125.517.5(2)228.520.2...(2)031.017.0(0)031.018.5(0)131.021.5(1)1
2从江52263320186127.319.0(7)229.522.0...(21)033.519.6(0)033.520.2(0)131.523.0(1)1
3丹寨52263620186123.015.5(2)226.019.2...(2)028.016.2(0)028.017.2(0)127.019.5(1)1
4贵阳52010320186120.914.9(3)224.016.4...(2)128.014.0(1)126.014.0(2)124.016.0(1)1

5 rows × 26 columns

time: 54.2 ms
weather_forecast_2018.describe()
VACODEYearMonDayTEM_Max_24hTEM_Min_24hTEM_Max_48hTEM_Min_48hTEM_Max_72hTEM_Min_72h,WEP_72hTEM_Min_96hWEP_96hTEM_Min_120hWEP_120hTEM_Min_144hWEP_144hTEM_Min_168h,WEP_168hUnnamed: 24
count1463.0000001463.01463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.0000001463.000000
mean521793.7382092018.06.50854415.75939829.72460721.24470329.72447021.38523629.69446321.65543429.92494921.88694529.89118322.01093630.02734122.05522930.19296021.985373
std1180.4676380.00.5000988.8106433.4701282.5361033.2327372.3852373.1677892.2705053.1308862.1310203.1917212.0666403.1994602.0921553.1676762.227871
min520103.0000002018.06.0000001.00000017.80000010.80000018.00000012.00000016.50000012.50000016.50000014.00000014.50000013.00000017.00000013.20000016.00000015.000000
25%520123.0000002018.06.0000008.00000027.50000020.00000027.50000020.00000027.50000020.20000028.00000020.50000027.50000021.00000028.00000021.00000028.00000020.850000
50%522625.0000002018.07.00000016.00000030.00000022.00000029.90000022.00000029.50000022.00000030.00000022.00000030.00000022.20000030.00000022.10000030.00000022.200000
75%522630.5000002018.07.00000023.00000032.35000023.00000032.00000023.00000032.30000023.30000032.50000023.50000032.50000023.50000032.50000023.70000032.60000024.000000
max522636.0000002018.07.00000031.00000037.50000027.00000037.00000025.90000036.50000026.00000036.50000026.00000036.50000026.20000037.00000026.00000037.00000030.000000
time: 74 ms
weather_forecast_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1463 entries, 0 to 1462
Data columns (total 26 columns):
Station_Name             1463 non-null object
VACODE                   1463 non-null int64
Year                     1463 non-null int64
Mon                      1463 non-null int64
Day                      1463 non-null int64
TEM_Max_24h              1463 non-null float64
TEM_Min_24h              1463 non-null float64
WEP_24h                  1463 non-null object
TEM_Max_48h              1463 non-null float64
TEM_Min_48h              1463 non-null float64
WEP_48h                  1463 non-null object
TEM_Max_72h              1463 non-null float64
TEM_Min_72h,WEP_72h      1463 non-null float64
TEM_Max_96h              1463 non-null object
TEM_Min_96h              1463 non-null float64
WEP_96h                  1463 non-null float64
TEM_Max_120h             1463 non-null object
TEM_Min_120h             1463 non-null float64
WEP_120h                 1463 non-null float64
TEM_Max_144h             1463 non-null object
TEM_Min_144h             1463 non-null float64
WEP_144h                 1463 non-null float64
TEM_Max_168h             1463 non-null object
TEM_Min_168h,WEP_168h    1463 non-null float64
Unnamed: 24              1463 non-null float64
Unnamed: 25              1463 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.2+ KB
time: 11 ms
!jupyter nbconvert --to markdown "“联创黔线”杯大数据应用创新大赛.ipynb"
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

光于前裕于后

您的打赏将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值