Easily Overlooked Uses
1. Formulas
1.1 Expectation
$$\mathbb{E}_{x\sim P}[f(x)]=\sum_x P(x)f(x)$$
$$\mathbb{E}_{x\sim p}[f(x)]=\int p(x)f(x)\,dx$$
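A minimal numpy sketch of both forms; the distribution, the grid, and the function `f` below are hypothetical illustration values, not anything from the text:

```python
import numpy as np

def f(x):
    return x ** 2  # an arbitrary f(x) for illustration

# Discrete case: E_{x~P}[f(x)] = sum_x P(x) f(x)
xs = np.array([0, 1, 2, 3])            # example support of x
P = np.array([0.1, 0.2, 0.3, 0.4])     # P(x), sums to 1
E_discrete = np.sum(P * f(xs))

# Continuous case: E_{x~p}[f(x)] = \int p(x) f(x) dx, approximated on a grid
grid = np.linspace(-10, 10, 100_001)
p = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
E_continuous = np.trapz(p * f(grid), grid)         # ~ E[x^2] = 1 for N(0, 1)

print(E_discrete, E_continuous)
```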
1.2 Variance
$$Var(f(x))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])^2\big]$$
When the variance is small, the values of f(x) cluster close to their expected value.
For a random variable X, if $E\{[X-E(X)]^2\}$ exists, it is called the variance of X, written as $D(X)=\sigma^2(X)=E\{[X-E(X)]^2\}$.
For a constant C:
$$D(CX)=C^2D(X) \tag{1}$$
$$D(X+C)=D(X) \tag{2}$$
For two random variables:
$$D(X+Y)=D(X)+D(Y)+2\,Cov(X,Y) \tag{3}$$
When X and Y are independent (hence uncorrelated):
$$D(X+Y)=D(X)+D(Y) \\ D(XY)=E(X)^2D(Y)+E(Y)^2D(X)+D(X)D(Y) \tag{4}$$
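A quick Monte Carlo sketch that numerically checks identities (1), (2), and (4) for independent X and Y; the sample size, distributions, and constant below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(loc=1.0, scale=2.0, size=n)   # independent samples of X
Y = rng.normal(loc=-0.5, scale=1.5, size=n)  # independent samples of Y
C = 3.0

print(np.var(C * X), C**2 * np.var(X))        # (1) D(CX) = C^2 D(X)
print(np.var(X + C), np.var(X))               # (2) D(X+C) = D(X)
print(np.var(X + Y), np.var(X) + np.var(Y))   # (4) D(X+Y) = D(X)+D(Y), X independent of Y
print(np.var(X * Y),
      np.mean(X)**2 * np.var(Y) + np.mean(Y)**2 * np.var(X)
      + np.var(X) * np.var(Y))                # (4) D(XY) for independent X, Y
```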
1.3 Covariance
In some sense, the covariance measures the strength of the linear relationship between two variables, as well as their scales:
$$Cov(f(x),g(y))=\mathbb{E}\big[(f(x)-\mathbb{E}[f(x)])(g(y)-\mathbb{E}[g(y)])\big]$$
A large absolute value of the covariance means that the variable values vary a lot and are simultaneously far from their respective means.
1.4 Covariance matrix
The covariance matrix of an n-dimensional random vector $x\in\mathbb{R}^n$ is an $n\times n$ matrix satisfying:
$$\Sigma_{i,j}=Cov(x)_{i,j}=Cov(x_i,x_j)$$
The diagonal elements of the covariance matrix are the variances:
$$Cov(x_i,x_i)=Var(x_i)$$
Corollaries (checked numerically in the sketch below):
- If the dimensions of the vector x are only weakly correlated, the covariance matrix is close to zero everywhere except on the main diagonal.
- The covariance matrix is positive semi-definite.
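A small numpy sketch that builds a covariance matrix from random data and checks the two corollaries; the data and dimensions are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of a 4-dimensional random vector x (independent dimensions)
x = rng.normal(size=(10_000, 4))

Sigma = np.cov(x, rowvar=False)   # Sigma_{i,j} = Cov(x_i, x_j)

# Diagonal elements equal the per-dimension variances
print(np.allclose(np.diag(Sigma), np.var(x, axis=0, ddof=1)))
# All eigenvalues are (numerically) non-negative -> positive semi-definite
print(np.all(np.linalg.eigvalsh(Sigma) >= -1e-10))
```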
2. Xavier Initialization
Glorot, Xavier, and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
- Normal-distribution initialization: $N\left(0,\ \frac{2}{n_{in}+n_{out}}\right)$
- Uniform-distribution initialization: $U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\ +\sqrt{\frac{6}{n_{in}+n_{out}}}\right)$
For a linear model:
$$Y=W_1X_1+W_2X_2+\cdots+W_nX_n$$
Assuming W and X are independent, we have:
$$Var(W_iX_i)=E[X_i]^2Var(W_i)+E[W_i]^2Var(X_i)+Var(W_i)Var(X_i)$$
Assuming the input X follows a zero-mean Gaussian distribution, and the weights W also have zero mean, then:
$$Var(W_iX_i)=Var(W_i)Var(X_i)$$
Therefore, for the output Y of the linear model:
$$Var(Y)=Var(W_1X_1+W_2X_2+\cdots+W_nX_n)=n\,Var(W_i)Var(X_i)$$
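A hedged Monte Carlo check of this step; n and the variances below are arbitrary example values, and both W and X are drawn with zero mean as the derivation assumes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 256, 200_000
W = rng.normal(0.0, 0.1, size=(trials, n))   # zero-mean weights, Var(W_i) = 0.01
X = rng.normal(0.0, 1.0, size=(trials, n))   # zero-mean inputs,  Var(X_i) = 1.0
Y = np.sum(W * X, axis=1)                    # Y = sum_i W_i X_i, one value per trial

print(np.var(Y), n * 0.01 * 1.0)             # empirical Var(Y) vs. n * Var(W_i) * Var(X_i)
```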
Then, to keep the distribution (variance) of the output Y the same as that of the input X:
$$Var(W_i)=\frac{1}{n}=\frac{1}{n_{in}}$$
Similarly, during backpropagation the roles of output and input are swapped, giving:
$$Var(W_i)=\frac{1}{n_{out}}$$
Taking a compromise between the two, which is exact at least when the input and output dimensions are equal:
$$Var(W_i)=\frac{2}{n_{in}+n_{out}}$$
A few more words about initialization:
Xavier initialization assumes the activation function is linear, which does not hold for nonlinear activations such as ReLU and sigmoid; it also assumes the activations are symmetric around 0, which again does not hold for sigmoid or ReLU.
Kaiming initialization (a small sketch of both schemes follows the list below):
- Normal-distribution initialization: $N\left(0,\ \frac{2}{n_{in}}\right)$
- Uniform-distribution initialization: $U\left(-\sqrt{\frac{6}{n_{in}}},\ +\sqrt{\frac{6}{n_{in}}}\right)$
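A minimal numpy sketch of the sampling rules above; the layer sizes and test inputs are arbitrary, only the variance formulas come from the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    # W ~ N(0, 2 / (n_in + n_out))
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

def xavier_uniform(n_in, n_out):
    # W ~ U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def kaiming_normal(n_in, n_out):
    # W ~ N(0, 2 / n_in), intended for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

# Example: a 512 -> 256 linear layer with zero-mean, unit-variance inputs
W = xavier_normal(512, 256)
x = rng.normal(size=(1000, 512))
print(np.var(x @ W))   # ≈ 2*n_in/(n_in+n_out) ≈ 1.33 here; exactly 1 when n_in == n_out
```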
3. Transformer
$$Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Note that the input to the softmax, the product $QK^T$, is scaled: $\frac{QK^T}{\sqrt{d_k}}$.
Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
Assume that the $q_i$ and $k_i$ are independent random variables with mean 0 and variance 1.
so
$$Var(qk)=Var\left(\sum_{i=1}^{N} q_ik_i\right)=N\,Var(q)\,Var(k),$$
and
$$Var\left(\frac{qk}{\sqrt{N}}\right)=Var(q)\,Var(k)$$
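A short numpy check of this scaling (the dimension and sample count are arbitrary example values): without the $\sqrt{d_k}$ factor the dot-product variance grows with $d_k$; with it, the variance stays near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, samples = 64, 100_000
q = rng.normal(0.0, 1.0, size=(samples, d_k))   # entries with mean 0, variance 1
k = rng.normal(0.0, 1.0, size=(samples, d_k))

scores = np.sum(q * k, axis=1)                  # q . k for each sample
print(np.var(scores))                           # ≈ d_k = 64
print(np.var(scores / np.sqrt(d_k)))            # ≈ 1 after scaling
```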
4. SVM
The linear function in an SVM can be written as:
$$f(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+b=b+\sum_{i=1}^{N}\alpha_i y^{(i)}\langle\boldsymbol{x},\boldsymbol{x}^{(i)}\rangle$$
where $\boldsymbol{x}^{(i)}$ are the training samples, and among them only the support vectors actually contribute to the prediction.
Introducing a kernel function
$$k(\boldsymbol{x},\boldsymbol{x}^{(i)})=\phi(\boldsymbol{x})\cdot\phi(\boldsymbol{x}^{(i)}),$$
the function can then be written as:
$$f(\boldsymbol{x})=b+\sum_{i=1}^{N}\alpha_i y^{(i)}\,k(\boldsymbol{x},\boldsymbol{x}^{(i)})$$
The kernel can thus be understood as $\phi(\boldsymbol{x})$ applying a nonlinear transformation to x, so that f(x) is nonlinear in x but linear in $\phi(\boldsymbol{x})$.
A drawback of SVMs is that the computational cost grows quickly as the number of training samples grows.
The SVM kernel is a kind of inner product between two vectors, and can be interpreted as a similarity (or covariance) measured in some feature space.
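A hedged sketch of the kernelized decision function above; the RBF kernel choice, the support vectors, the dual coefficients, and the bias are all made-up illustration values rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x, x_i, gamma=0.5):
    # k(x, x_i) = exp(-gamma * ||x - x_i||^2), one common choice of kernel
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

# Toy "support vectors": in a real SVM these come from training
X_sv = rng.normal(size=(5, 2))                # 5 support vectors in R^2
y_sv = np.array([1, -1, 1, 1, -1])            # their labels y^(i)
alpha = np.array([0.3, 0.7, 0.2, 0.5, 0.4])   # dual coefficients alpha_i
b = 0.1

def f(x):
    # f(x) = b + sum_i alpha_i * y^(i) * k(x, x^(i))
    return b + sum(a * y * rbf_kernel(x, x_i)
                   for a, y, x_i in zip(alpha, y_sv, X_sv))

print(f(np.array([0.2, -0.1])))   # the sign of f(x) gives the predicted class
```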
References
[1] Xavier Initialization
[2] Parameter initialization (1)
[3] Parameter initialization (2)
[4] SVM