1. Introduction
This post covers the second problem (Training a Support Vector Machine) of the first assignment of Fei-Fei Li's cs231n-2022. For a walkthrough of the first problem, see cs231n-2022-assignment1#Q1: kNN image classifier experiment.
The course material relevant to this assignment: CS231n Convolutional Neural Networks for Visual Recognition.
The requirements of Assignment 1: Assignment 1 (cs231n.github.io).
Interested readers are encouraged to read the originals; they are too good to reproduce here. This post is meant as supplementary reading, covering the key points encountered while completing the assignment and walking through the crucial code. The original starter code can be downloaded from the course website. Only the code that needs to be modified to complete the assignment is discussed, and the modifications are confined to the following files:
- (1) root/cs231n/classifiers/linear_svm.py
- (2) root/cs231n/classifiers/linear_classifier.py
- (3) root/svm.ipynb
The assignment consists of the following tasks (filling in the key code on top of the starter code, cloze-style, until a complete SVM model is obtained):
- implement a fully-vectorized loss function for the SVM
- implement the fully-vectorized expression for its analytic gradient
- check your implementation using numerical gradient
- use a validation set to tune the learning rate and regularization strength
- optimize the loss function with SGD
- visualize the final learned weights
2. Data Loading
As in the previous post, tensorflow.keras is used here to load the CIFAR-10 dataset.
One issue is worth pointing out, though. It is a small detail, yet it can cost a lot of debugging time.
import numpy as np
import tensorflow.keras as keras

(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
y_train = np.squeeze(y_train)
y_test = np.squeeze(y_test)
The y_train and y_test returned by keras.datasets.cifar10.load_data() have shape (N, 1).
However, all subsequent processing in the starter code assumes shape (N,). This caused a string of disasters, especially during vectorization. One example: without the squeeze above, the later predict() function returns an array of shape (N,), while the labels remain (N, 1). The original starter code computes the accuracy with the following comparison:
np.mean(y_train == y_train_pred)
If this statement is left unchanged, numpy broadcasts the two differently shaped arrays and returns a wrong result. After finishing my implementation, with the loss and gradient verified to be correct, training and prediction kept yielding accuracy around 10% (i.e., random guessing); the culprit turned out to be this shape mismatch. It needs to be changed to:
np.mean(np.squeeze(y_train) == y_train_pred)
np.squeeze() removes the redundant dimension from y_train. The original code is presumably written this way because the load_CIFAR10 function in the starter code returns y_train (and friends) with shape (N,). I had in fact already run into this problem in cs231n-2022-assignment1#Q1; I fixed it back then but forgot to make a note of it, and so stepped into the same pile of dung twice ^-^.
Even more headaches came up during vectorization and wasted a lot of time, embarrassingly. Recording it here to avoid repeating the mistake.
So here the redundant dimension of y_train and y_test is squeezed out with np.squeeze() up front.
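A minimal illustration of the pitfall (the label values here are made up purely for the demo):

import numpy as np

y_true = np.array([[1], [2], [3]])   # shape (3, 1), like labels returned by keras
y_pred = np.array([1, 0, 3])         # shape (3,), like the output of predict()

print(np.mean(y_true == y_pred))               # 0.222..., a (3, 3) broadcast comparison
print(np.mean(np.squeeze(y_true) == y_pred))   # 0.666..., the intended accuracy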
Incidentally, one of the most common mistakes when working with numpy multi-dimensional arrays (tensors) is a shape-matching problem, especially during vectorization, so keep an eye on tensor shapes at all times; this will come up repeatedly below.
3. Gradient Implementation
The first task is to add the gradient computation to svm_loss_naive(), in which the loss computation is already implemented.
def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)  # Inner product between X[i] and weight vector
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] -= X[i, :]
                dW[:, j] += X[i, :]

    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train
    dW /= num_train

    # Add regularization to the loss.
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W

    return loss, dW
The loss of a single sample (excluding the regularization term) is given below. In the implementation above we take $\Delta = 1$, because the effect of $\Delta$ is interchangeable with that of the regularization strength $\lambda$, so a normalized value is used here and the hyperparameter tuning is left entirely to $\lambda$:

$$L_i = \sum_{j \neq y_i} \max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\right) \tag{1}$$

Note that writing $w_j^T x_i$ above treats $w_j$ and $x_i$ as column vectors. In machine-learning code, however, the first dimension of the dataset variable $X$ is conventionally the sample dimension, and a single sample X[i] has shape (D,), so the implementation expresses this as scores = X[i].dot(W), which computes the scores of sample $x_i$ for all C classes in one shot. This is another instance of the shape-matching issue mentioned earlier. Note that x.dot(y) is equivalent to np.dot(x, y) and, for two vectors, computes their inner product.

Taking the gradient of this loss with respect to $w_{y_i}$ and with respect to $w_j$ (some matrix calculus is involved here) gives, respectively:

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \neq y_i} \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right)\right) x_i \tag{2}$$

$$\nabla_{w_j} L_i = \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) x_i, \qquad j \neq y_i \tag{3}$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, which returns 1 when its argument is true and 0 otherwise. As the formulas show, the gradients of $L_i$ with respect to all the $w_j$ sum to zero, i.e.:

$$\nabla_{w_{y_i}} L_i + \sum_{j \neq y_i} \nabla_{w_j} L_i = 0$$
The gradient computation above corresponds to the following code:
if margin > 0:
    loss += margin
    dW[:, y[i]] -= X[i, :]
    dW[:, j] += X[i, :]
Accounting for the regularization term, the loss and the gradient dW are adjusted as follows:
# Add regularization to the loss.
loss += 0.5 * reg * np.sum(W * W)
dW += reg * W
The factor 0.5 is multiplied into the loss because differentiating the squared term produces a factor of 2, which then cancels exactly (the derivative of 0.5 * reg * sum(W * W) with respect to W is reg * W). This factor is not strictly necessary, though; any constant is ultimately absorbed into the regularization coefficient (reg in the code above).
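Before moving on to vectorization, it is worth checking the analytic gradient against a numerical estimate, as svm.ipynb does. Below is a sketch under assumed inputs: a synthetic X_dev/y_dev stands in for the notebook's preprocessed dev split (3072 pixel values plus a bias column), and grad_check_sparse is the helper shipped with the starter code:

import numpy as np
from cs231n.classifiers.linear_svm import svm_loss_naive
from cs231n.gradient_check import grad_check_sparse

# Synthetic stand-in for the notebook's preprocessed dev split.
X_dev = np.random.randn(500, 3073)
y_dev = np.random.randint(10, size=500)

W = np.random.randn(3073, 10) * 0.0001
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)

# Numerically check a few randomly chosen entries of the analytic gradient.
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_check_sparse(f, W, grad)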
4. Vectorized Implementation
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the structured SVM loss, storing the    #
    # result in loss.                                                           #
    #############################################################################
    num_train = X.shape[0]
    num_classes = W.shape[1]
    # scores = X @ W  # shape (num_train, num_classes), each row corresponding to one sample
    scores = np.dot(X, W)  # shape: (N, C) = (num_train, num_classes)
    correct_class_scores = scores[np.arange(num_train), y]
    margins = scores - np.expand_dims(correct_class_scores, axis=1) + 1  # Broadcasting
    margins[np.arange(num_train), y] = 0.0
    margins[margins <= 0] = 0.0
    loss = np.sum(margins) / num_train + 0.5 * reg * np.sum(W * W)  # No need of nested np.sum()
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    #############################################################################
    # TODO:                                                                     #
    # Implement a vectorized version of the gradient for the structured SVM     #
    # loss, storing the result in dW.                                           #
    #                                                                           #
    # Hint: Instead of computing the gradient from scratch, it may be easier    #
    # to reuse some of the intermediate values that you used to compute the     #
    # loss.                                                                     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    margins[margins > 0] = 1.0
    row_sum = np.sum(margins, axis=1)
    # margins[np.arange(num_train), y] = -row_sum
    for i in range(num_train):
        margins[i, y[i]] = -row_sum[i]
    dW = np.dot(X.T, margins) / num_train + reg * W
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
The key implementation points are explained one by one below.
4.1 Vectorizing the score computation
svm_loss_naive() is in fact already partially vectorized: for each sample, the scores over the C classes are computed in a vectorized way.
scores = X[i].dot(W) # Inner product between X[i] and weight vector
The vectorization referred to here also eliminates the loop over samples, computing the scores for the entire dataset in one go. Since X.shape = (N, D) and W.shape = (D, C), the shapes already match, so this vectorization is particularly straightforward:
scores = np.dot(X,W) # shape: (N,C) = (num_train,num_classes)
np.dot() works, as does np.matmul(), and so does the "@" operator.
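A quick sanity check (illustrative only) confirming that the three spellings give identical results:

import numpy as np

X = np.random.randn(5, 4)
W = np.random.randn(4, 3)
s1 = np.dot(X, W)
s2 = np.matmul(X, W)
s3 = X @ W
print(np.allclose(s1, s2), np.allclose(s1, s3))   # True True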
4.2 correct_class_scores
correct_class_scores = scores[np.arange(num_train),y]
This statement extracts, for every sample, the score of its correct class. It is equivalent to:
correct_class_scores = np.zeros(num_train)
for i in range(num_train):
    correct_class_scores[i] = scores[i, y[i]]
The same indexing technique is used again later to zero out the correct-class entries of margins.
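A toy example of this integer-array ("fancy") indexing, picking one entry per row:

import numpy as np

scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
y = np.array([0, 1, 2])            # correct class of each row
print(scores[np.arange(3), y])     # [ 3.2  4.9 -3.1]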
4.3 margins
margins = scores - np.expand_dims(correct_class_scores,axis=1) + 1 # Broadcasting
margins[np.arange(num_train),y] = 0.0
margins[margins<=0] = 0.0
In the initial computation of margins, np.expand_dims() expands the dimensions of correct_class_scores, again for shape matching. np.expand_dims() is the inverse of np.squeeze(): the latter removes redundant dimensions of size 1, while the former adds an extra dimension of size 1; both are frequently needed for shape matching, which in turn prepares the arrays for broadcasting.
The second statement zeroes out, for each sample, the margins entry at its correct label, because the previous statement added the +1 term uniformly to every entry, including the correct-class ones.
The third statement sets all negative elements of the margins matrix to zero, which implements the max(0, ·) hinge in Eq. (1).
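As a side note (my own alternative, not from the starter code), the same thresholding can be written with np.maximum:

margins = np.maximum(0.0, margins)   # same effect as margins[margins <= 0] = 0.0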
4.4 Vectorizing the loss computation
With the groundwork above, vectorizing the loss computation follows naturally:
loss = np.sum(margins) / num_train + 0.5 * reg * np.sum(W * W)  # No need of nested np.sum()
Note that even though margins is a two-dimensional array here, no nested np.sum() calls are needed: when the axis argument is not specified, numpy's sum reduces over all elements of the input tensor at once.
Also, multiplying a tensor by itself (e.g., W * W), or multiplying two same-shaped tensors with "*", performs element-wise multiplication, not a matrix product.
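A small illustration of both points:

import numpy as np

A = np.arange(6).reshape(2, 3)
print(np.sum(A))                    # 15: all elements summed when no axis is given
print(np.sum(np.sum(A, axis=1)))    # 15: the nested form gives the same result

W = np.array([[1., 2.],
              [3., 4.]])
print(W * W)                        # element-wise square: [[1., 4.], [9., 16.]]
print(np.sum(W * W))                # 30.0, the quantity used in the regularization term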
4.5 Vectorizing the gradient
margins[margins > 0] = 1.0
row_sum = np.sum(margins, axis=1)
# margins[np.arange(num_train), y] = -row_sum
for i in range(num_train):
    margins[i, y[i]] = -row_sum[i]
dW = np.dot(X.T, margins) / num_train + reg * W
First, margins is hard-thresholded, which corresponds to evaluating the indicator $\mathbb{1}(\text{margin} > 0)$ in Eqs. (2) and (3).
Then, for each sample, the margins entry at its correct-label position is set to the negative of the corresponding row sum, which matches Eq. (2) above.
After this preprocessing, the final gradient is simply the matrix product of X.T and margins, averaged over the training set, plus the regularization term. Admittedly, the vectorized implementation is elegant and concise, but reaching that elegance takes a fair amount of pencil-and-paper derivation beforehand, as well as fluency with tensor operations.
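As svm.ipynb also does, the naive and vectorized versions can be cross-checked on the same inputs. A sketch, using the same synthetic stand-in for X_dev/y_dev as in section 3:

import numpy as np
from cs231n.classifiers.linear_svm import svm_loss_naive, svm_loss_vectorized

# Synthetic stand-in for the notebook's X_dev/y_dev (3072 pixels + 1 bias column).
X_dev = np.random.randn(500, 3073)
y_dev = np.random.randint(10, size=500)
W = np.random.randn(3073, 10) * 0.0001

loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 5e-6)
loss_vec, grad_vec = svm_loss_vectorized(W, X_dev, y_dev, 5e-6)

print('loss difference:', abs(loss_naive - loss_vec))                        # expected ~0
print('grad difference:', np.linalg.norm(grad_naive - grad_vec, ord='fro'))  # expected ~0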
5. Training and Prediction Results
With the implementation above in place, training with the following statement gives the results shown below:
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4, num_iters=1500, verbose=True)
Evaluating the accuracy of the trained model on the training and validation sets gives:
(49000,) (49000,) training accuracy: 0.383122
(1000,) (1000,) validation accuracy: 0.384000
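As an aside, linear_classifier.py is listed among the modified files in section 1 but is not shown in this post. The two blanks to fill there are minibatch sampling plus the SGD weight update in train(), and the argmax over class scores in predict(). A minimal sketch of those fills, assuming the starter code's surrounding API (num_train, batch_size, learning_rate, self.W and self.loss are names from the starter code; details may differ):

# Inside LinearClassifier.train(), executed once per iteration:
batch_idx = np.random.choice(num_train, batch_size, replace=True)  # sample a minibatch
X_batch = X[batch_idx]                      # shape (batch_size, D)
y_batch = y[batch_idx]                      # shape (batch_size,)
loss, grad = self.loss(X_batch, y_batch, reg)
self.W -= learning_rate * grad              # vanilla SGD update

# Inside LinearClassifier.predict():
y_pred = np.argmax(X.dot(self.W), axis=1)   # highest-scoring class per sample, shape (N,)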
6. Hyperparameter Tuning
The learning rate and regularization strength are swept and optimized as follows:
results = {}
best_val = -1    # The highest validation accuracy that we have seen so far.
best_svm = None  # The LinearSVM object that achieved the highest validation rate.

# Provided as a reference. You may or may not want to change these hyperparameters
learning_rates = [1e-7, 1e-6, 5e-5]
regularization_strengths = [2.5e4, 5e4]

# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
for lr in learning_rates:
    for reg in regularization_strengths:
        print('lr = {0}, reg = {1}'.format(lr, reg))
        tic = time.time()
        svm = LinearSVM()
        # loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4, num_iters=5000, verbose=True)
        loss_hist = svm.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=1500, verbose=True)
        y_train_pred = svm.predict(X_train)
        training_acc = np.mean(np.squeeze(y_train) == y_train_pred)
        # print('training accuracy: %f' % (np.mean(y_train == y_train_pred), ))
        y_val_pred = svm.predict(X_val)
        val_acc = np.mean(np.squeeze(y_val) == y_val_pred)
        # print('validation accuracy: %f' % (np.mean(y_val == y_val_pred), ))
        results[(lr, reg)] = (training_acc, val_acc)
        toc = time.time()
        print('That took %fs' % (toc - tic))
        if val_acc > best_val:
            best_val = val_acc
            best_svm = svm
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)
lr 1.000000e-07 reg 2.500000e+04 train accuracy: 0.381816 val accuracy: 0.378000
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.368082 val accuracy: 0.370000
lr 1.000000e-06 reg 2.500000e+04 train accuracy: 0.315735 val accuracy: 0.327000
lr 1.000000e-06 reg 5.000000e+04 train accuracy: 0.289592 val accuracy: 0.293000
lr 5.000000e-05 reg 2.500000e+04 train accuracy: 0.122857 val accuracy: 0.122000
lr 5.000000e-05 reg 5.000000e+04 train accuracy: 0.049673 val accuracy: 0.047000
The results show that a learning rate that is too large (5e-5) prevents the model from converging; the best parameters are lr, reg = {1e-7, 2.5e4}. The model trained with these parameters reaches 36.9% accuracy on the test set, which is still far from satisfactory but is a significant improvement over the kNN classifier of Q1.
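The 36.9% test-set figure corresponds to an evaluation along these lines (X_test here being the preprocessed/flattened test data from the notebook, with the same np.squeeze() caveat as before when the labels come from keras):

y_test_pred = best_svm.predict(X_test)
test_accuracy = np.mean(np.squeeze(y_test) == y_test_pred)
print('linear SVM final test set accuracy: %f' % test_accuracy)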