Regression Analysis of the Relationship Between Promotion Strategies and Sales

缘源园

CC 4.0 BY-SA license

Category: Data Analysis. Tags: machine learning, data analysis, python, logistic regression, big data

Article link: https://blog.csdn.net/weixin_48135624/article/details/113476393

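This article looks at how a traditional fast-moving consumer goods (FMCG) company can use data analysis to predict sales revenue at its supermarket stores. Regression analysis of promotional spending (TV advertising, online promotion, offline activities, and so on) against revenue quantifies the effect of each promotion strategy. The data overview shows that local_tv, person, and instore are the features most strongly correlated with revenue. After filling missing values, converting the categorical variable into dummy variables, and fitting a linear regression model, local_tv, person, and instore are found to have a notable effect on revenue. RMSE and MAE serve as error metrics for judging the model's prediction accuracy.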

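Sales revenue prediction analysis

FMCG company: goals of the analysis
- Predict sales revenue for supermarket stores
- Quantify the effect of each promotional factor the company can control
- Plan marketing resources sensibly
Traditional FMCG company: characteristics of the data
- The data is aggregated
- There are many channels, so individual customers cannot be understood precisely
In this example, regression analysis is used to assess the return on investment of each factor
- Data analyzed
  - The relationship between revenue and promotional spending on TV advertising, online, offline, in-store, and WeChat channels
- Data description (one observation per month)
  - Revenue: store sales revenue
  - Reach: number of WeChat ads
  - Local_tv: local TV advertising spend
  - Online: online advertising spend
  - Instore: spend on in-store posters and similar material
  - Person: in-store promotion staff
  - Event: promotional event
    - cobranding: joint brand promotion
    - holiday: public holiday
    - special: store-specific special promotion
    - non_event: no promotional activity
- Analysis workflow: data overview -> univariate analysis -> correlation analysis and visualization -> regression model
  - Data overview
    - Number of rows and columns
    - Distribution of missing values
  - Univariate analysis
    - Descriptive statistics for numeric variables (mean, max, min, standard deviation)
    - Categorical variables (number of categories and the share of each)
  - Correlation analysis and visualization
    - Cross-comparison by category
    - Correlation between variables
    - Scatter plots and heatmaps
  - Regression analysis
    - Building the model
    - Evaluating and improving the model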
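
The first two steps of the workflow above (data overview and univariate analysis) can be sketched on a small made-up table. The values below are illustrative only and are not taken from store_rev.csv; only the column names mirror the data description.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the article's dataset)
toy = pd.DataFrame({
    "revenue": [45000.0, 63000.0, 23000.0, 46000.0],
    "local_tv": [31000.0, np.nan, 30000.0, 29000.0],
    "person": [8, 14, 6, 12],
    "event": ["non_event", "special", "special", "cobranding"],
})

# Data overview: number of rows/columns and missing values per column
n_rows, n_cols = toy.shape
missing = toy.isna().sum()

# Univariate analysis: summary statistics for the numeric columns
numeric_summary = toy.describe()

# Univariate analysis: share of each category in the categorical column
event_share = toy["event"].value_counts(normalize=True)

print(n_rows, n_cols)
print(missing)
print(numeric_summary)
print(event_share)
```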

import pandas as pd

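# Load the data
# index_col=0: the first column of the file holds the index, so use it as the index column; this avoids having to drop an "Unnamed: 0" column later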
store=pd.read_csv('store_rev.csv',index_col=0)

store.head()
	revenue	reach	local_tv	online	instore	person	event
845	45860.28	2	31694.91	2115	3296	8	non_event
483	63588.23	2	35040.17	1826	2501	14	special
513	23272.69	4	30992.82	1851	2524	6	special
599	45911.23	2	29417.78	2437	3049	12	special
120	36644.23	2	35611.11	1122	1142	13	cobranding

store.info()  # shows that event needs a type conversion and local_tv has missing values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 985 entries, 845 to 26
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   revenue   985 non-null    float64
 1   reach     985 non-null    int64  
 2   local_tv  929 non-null    float64
 3   online    985 non-null    int64  
 4   instore   985 non-null    int64  
 5   person    985 non-null    int64  
 6   event     985 non-null    object 
dtypes: float64(2), int64(4), object(1)
memory usage: 61.6+ KB

store.describe()  

            revenue       reach      local_tv       online      instore      person
count    985.000000  985.000000    929.000000   985.000000   985.000000  985.000000
mean   38357.355025    3.395939  31324.061109  1596.527919  3350.962437   11.053807
std    11675.603883    1.011913   3970.934733   496.131586   976.546381    3.041740
min     5000.000000    0.000000  20000.000000     0.000000     0.000000    0.000000
25%    30223.600000    3.000000  28733.830000  1253.000000  2690.000000    9.000000
50%    38159.110000    3.000000  31104.520000  1607.000000  3351.000000   11.000000
75%    45826.520000    4.000000  33972.410000  1921.000000  4011.000000   13.000000
max    79342.070000    7.000000  43676.900000  3280.000000  6489.000000   24.000000

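# What does local_tv (local TV ad spend) look like for each event category?
store.groupby(['event'])['local_tv'].describe()

# Convert the categorical variable into dummy variables
# dtype='uint8' reproduces the 0/1 columns shown below; pandas 2.0+ would otherwise default to bool
store=pd.get_dummies(store,dtype='uint8')
# This creates four indicator columns for event, each taking the value 0 or 1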
store.head(10)

	revenue	reach	local_tv	online	instore	person	event_cobranding	event_holiday	event_non_event	event_special
845	45860.28	2	31694.91	2115	3296	8	0	0	1	0
483	63588.23	2	35040.17	1826	2501	14	0	0	0	1
513	23272.69	4	30992.82	1851	2524	6	0	0	0	1
599	45911.23	2	29417.78	2437	3049	12	0	0	0	1
120	36644.23	2	35611.11	1122	1142	13	1	0	0	0
867	36172.81	4	22372.59	2001	1881	17	1	0	0	0
847	43797.03	3	31443.74	1667	1846	15	1	0	0	0
950	41629.80	4	35775.75	1155	2715	12	0	0	0	1
942	21303.48	2	24888.31	1853	3677	4	0	0	1	0
550	20746.15	4	26623.48	1497	3075	9	0	1	0	0
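
To make the dummy encoding concrete, here is a minimal sketch on four rows copied from the output above (only the revenue and event columns). The `drop_first=True` variant is an addition for illustration and is not used in the article: because the four indicators always sum to 1, dropping one of them avoids perfect collinearity with the intercept if the dummies are ever fed into a linear model.

```python
import pandas as pd

# Four rows taken from the head() output above, reduced to two columns
toy = pd.DataFrame({
    "revenue": [45860.28, 63588.23, 36644.23, 20746.15],
    "event": ["non_event", "special", "cobranding", "holiday"],
})

# One 0/1 indicator column per category; numeric columns pass through unchanged
dummies = pd.get_dummies(toy, dtype=int)

# Optional variant (not in the article): drop the first category as the baseline
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(dummies)
print(reduced.columns.tolist())
```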


# Confirm that the categorical variable has been converted to numeric columns
store.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 985 entries, 845 to 26
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   revenue           985 non-null    float64
 1   reach             985 non-null    int64  
 2   local_tv          929 non-null    float64
 3   online            985 non-null    int64  
 4   instore           985 non-null    int64  
 5   person            985 non-null    int64  
 6   event_cobranding  985 non-null    uint8  
 7   event_holiday     985 non-null    uint8  
 8   event_non_event   985 non-null    uint8  
 9   event_special     985 non-null    uint8  
dtypes: float64(2), int64(4), uint8(4)
memory usage: 57.7 KB

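Check whether the target is strongly related to the features

# Pairwise correlation between all variables
# local_tv, person, and instore are good predictors: they correlate strongly with revenue
store.corr()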

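Correlation between the target and the features

# Correlation of every other variable with revenue
# sort_values sorts by the revenue column; ascending defaults to True, so ascending=False sorts in descending order
# The three most correlated variables are local_tv, person, and instore
# Rule of thumb: a coefficient of 0.2 to 0.3 indicates a noticeable correlation, above 0.3 a clear one, and above 0.5 a strong one
store.corr()[['revenue']].sort_values('revenue',ascending=False)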

                   revenue
revenue           1.000000
local_tv          0.602114
person            0.559208
instore           0.311739
online            0.171227
event_special     0.033752
event_cobranding -0.005623
event_holiday    -0.016559
event_non_event  -0.019155
reach            -0.155314
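
The coefficients above are Pearson correlations, which is what `DataFrame.corr()` computes by default. As a rough sketch of the calculation, the hypothetical helper `pearson_r` below computes the coefficient by hand on made-up numbers and compares it with pandas. Note that `corr()` excludes missing values pairwise, so the local_tv figure above is based only on the rows where local_tv is present.

```python
import numpy as np
import pandas as pd

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # center both series
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Made-up numbers for illustration; "spend" and "noise" are hypothetical columns
toy = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "spend": [1.0, 2.0, 2.5, 4.0, 5.5],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})

r_manual = pearson_r(toy["spend"], toy["revenue"])
r_pandas = toy.corr().loc["spend", "revenue"]

# Same ranking step as in the article: sort all columns by correlation with revenue
ranked = toy.corr()[["revenue"]].sort_values("revenue", ascending=False)

print(r_manual, r_pandas)
print(ranked)
```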

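# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# %matplotlib inline is an IPython/Jupyter magic command: plots are rendered inline in the notebook, so plt.show() can be omitted.
# Visualize the linear relationship
# The slope of the fitted line is related to the correlation coefficient; sns.regplot() plots the data together with a fitted linear regression line
# Keyword arguments are used because recent seaborn versions no longer accept x, y, data positionally
sns.regplot(x='local_tv',y='revenue',data=store)

# Visualize the linear relationship
sns.regplot(x='person',y='revenue',data=store)

# Visualize the linear relationship
sns.regplot(x='instore',y='revenue',data=store)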

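# Missing-value handling (local_tv is the only column with missing values)
# Option 1: fill with 0
store=store.fillna(0)
# Option 2: fill with the column mean
# Note: run as written after option 1, this line has no effect because no missing values remain; to use mean imputation, skip option 1
store=store.fillna(store.local_tv.mean())
store.info()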

<class 'pandas.core.frame.DataFrame'>
Int64Index: 985 entries, 845 to 26
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   revenue           985 non-null    float64
 1   reach             985 non-null    int64  
 2   local_tv          985 non-null    float64
 3   online            985 non-null    int64  
 4   instore           985 non-null    int64  
 5   person            985 non-null    int64  
 6   event_cobranding  985 non-null    uint8  
 7   event_holiday     985 non-null    uint8  
 8   event_non_event   985 non-null    uint8  
 9   event_special     985 non-null    uint8  
dtypes: float64(2), int64(4), uint8(4)
memory usage: 57.7 KB
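
The two fill strategies behave quite differently, which a small made-up column makes visible. This is only a sketch with toy values: filling with 0 pulls the column mean well below any observed value, while mean imputation leaves it unchanged. The fill is applied to the one column rather than to the whole frame, so a local_tv mean could never leak into other columns.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "local_tv": [30000.0, np.nan, 32000.0, 34000.0],
    "person": [8, 14, 6, 12],
})

# mean() skips NaN by default: (30000 + 32000 + 34000) / 3
col_mean = toy["local_tv"].mean()

# Fill only the local_tv column, leaving the original frame untouched
filled_zero = toy.assign(local_tv=toy["local_tv"].fillna(0))
filled_mean = toy.assign(local_tv=toy["local_tv"].fillna(col_mean))

print(filled_zero["local_tv"].mean())  # 24000.0
print(filled_mean["local_tv"].mean())  # 32000.0
```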

Linear regression analysis

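# Linear regression analysis
from sklearn.linear_model import LinearRegression
model=LinearRegression()  # fits y = k1*x1 + ... + kn*xn + b
# Set the independent variables and the dependent variable
y=store['revenue']
# First run: three features
x=store[['local_tv','person','instore']]
model.fit(x,y)
# Coefficients of the independent variables
model.coef_
# Intercept of the model
model.intercept_
# Score the model: score() returns R^2 on the given x and y
score=model.score(x,y)
# Evaluate the model; x is 'local_tv','person','instore'
predictions=model.predict(x)  # predicted y values from the features
error=predictions-y  # prediction error
rmse=(error**2).mean()**.5  # root mean squared error (RMSE)
mae=abs(error).mean()  # mean absolute error (MAE)

print(rmse)
print(mae)
#8321.491623472051
#6556.036999600779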
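
For reference, the three metrics used here can be written as small standalone functions. This is a sketch on made-up numbers; the function names `rmse`, `mae`, and `r_squared` are chosen for illustration. `r_squared` is the quantity that `LinearRegression.score` returns. RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(err)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up values: the errors are 10, -10, 30, 0
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 400.0]

print(rmse(y_true, y_pred))       # sqrt(275), about 16.58
print(mae(y_true, y_pred))        # 12.5
print(r_squared(y_true, y_pred))  # 1 - 1100/50000 = 0.978
```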



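# Second run: four features
x=store[['local_tv','person','instore','online']]
model.fit(x,y)
# Coefficients of the independent variables
model.coef_
# Intercept of the model
model.intercept_
score=model.score(x,y)  # R^2 on x and y
score
#0.517440904944027

# Evaluate the model; x is 'local_tv','person','instore','online'
predictions=model.predict(x)  # predicted y values from the features
error=predictions-y  # prediction error
rmse=(error**2).mean()**.5  # recompute RMSE for the new model
mae=abs(error).mean()  # recompute MAE for the new model
print(rmse)
print(mae)  # the smaller the RMSE and MAE, the better
#8106.512169325369
#6402.202883441895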
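
Both runs above measure error on the same rows the model was fitted on, and adding a feature can never increase the training error of a least-squares fit. One common check, which the article does not perform, is to hold out part of the data. The sketch below uses synthetic data and plain NumPy least squares instead of the article's dataset and scikit-learn; all names and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on two features plus noise
n = 200
X = rng.uniform(0, 10, size=(n, 2))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 5.0 + rng.normal(0, 1.0, size=n)

# Shuffle the row indices and split 80/20 into training and test sets
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]

def add_intercept(M):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(M)), M])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Ordinary least squares on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Compare the error on seen and unseen rows
rmse_train = rmse(y[train], add_intercept(X[train]) @ coef)
rmse_test = rmse(y[test], add_intercept(X[test]) @ coef)

print(coef)  # intercept first, then the two slopes
print(rmse_train, rmse_test)
```

A test error far above the training error would suggest overfitting; similar values suggest the model generalizes to unseen rows.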

When should linear regression be used?