The dropout algorithm paper for machine learning and deep learning

PDF, 1.59 MB, updated 2019-01-01

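In machine learning and deep learning, dropout is an important technique for preventing neural networks from overfitting. During training it randomly drops (temporarily removes) a subset of the network's neurons, which forces the network to learn more robust features and reduces its reliance on particular combinations of neurons. Proposed by Geoffrey Hinton and colleagues in 2012, dropout has produced significant improvements in areas including image recognition and speech recognition.

Understanding dropout requires understanding overfitting in deep neural networks. A model overfits when it predicts its training data very well but performs poorly on unseen (test) data: the model is complex enough to capture noise and idiosyncratic features of the training set, and so fails to generalize. Overfitting is especially pronounced when a large feedforward network is trained on a small dataset. To address it, Hinton and colleagues proposed randomly "dropping" neurons. In the paper, each hidden unit is randomly omitted with probability 50% on each presentation of a training case. A neuron therefore cannot rely on the presence of other neurons and has to learn features that are useful on their own, that is, features that generally help produce the correct answer regardless of the internal context.

Dropout is effective at preventing complex co-adaptations between feature detectors, in which a feature detector is only helpful in the context of specific other feature detectors. Because every neuron may be removed on any training case, each one has to learn a more general function, which improves generalization to test data.

The paper reports large improvements on many benchmark tasks and new records for speech and object recognition. Dropout can be viewed as a very effective regularization technique. Regularization refers to any method that constrains a model to reduce overfitting, with a penalty term added to the loss function being the most familiar example. Dropout adds no explicit penalty term; it regularizes by randomly removing neurons, which forces the network to learn more robust feature representations.

There are two common ways to implement dropout. In the original formulation, units are dropped during training without any rescaling, and at test time all units are used with their outgoing weights multiplied by the retention probability (halved for a dropout rate of 0.5). In the "inverted" formulation, the activations of the retained units are divided by the retention probability during training (multiplied by 2 for a dropout rate of 0.5), so the expected activation is unchanged and no adjustment is needed at test time. In both cases backpropagation uses the same mask as the forward pass: a dropped unit receives zero gradient, and a retained unit's gradient carries the same scale factor as its activation.

Dropout is applied only during training. When the model is used for prediction, dropout is turned off so that all neurons contribute to inference. Dropout is also commonly combined with other regularization techniques, such as weight decay and data augmentation, to improve generalization further. Overall, dropout is a simple yet powerful technique that can be applied to a wide range of deep neural network models to improve generalization and reduce overfitting. As deep learning spreads to more fields, dropout and its variants remain an active topic in both research and practice.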

Improving neural networks by preventing co-adaptation of feature detectors

G. E. Hinton∗, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov

Department of Computer Science, University of Toronto,
6 King’s College Rd, Toronto, Ontario M5S 3G4, Canada

∗To whom correspondence should be addressed; E-mail: [email protected]

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This “overfitting” is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

A feedforward, artificial neural network uses layers of non-linear “hidden” units between its inputs and its outputs. By adapting the weights on the incoming connections of these hidden units it learns feature detectors that enable it to predict the correct output when given an input vector (1). If the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.

arXiv:1207.0580v1 [cs.NE] 3 Jul 2012

Overfitting can be reduced by using “dropout” to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case but all of these networks share the same weights for the hidden units that are present.
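The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names dropout_mask and apply_mask, the activation values and the random seed are all hypothetical, and only the omission probability of 0.5 comes from the text.

```python
import random


def dropout_mask(n_units, drop_prob=0.5, rng=random):
    """Return a list of 0/1 flags; 0 means the unit is omitted for this case."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(n_units)]


def apply_mask(activations, mask):
    """Zero the activations of omitted units; kept units pass through unchanged."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable (assumption)
hidden = [0.3, 1.2, 0.7, 0.9, 0.1, 0.5]  # made-up hidden activations

# A fresh mask is drawn on every presentation of every training case, so each
# presentation trains a different "thinned" network. The weights themselves
# are shared, because only the mask changes between presentations.
for presentation in range(3):
    mask = dropout_mask(len(hidden), drop_prob=0.5, rng=rng)
    print(mask, apply_mask(hidden, mask))
```

With 6 hidden units there are already 2^6 = 64 possible masks, which is why almost every presentation ends up using a different network.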

We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division. Using a constraint rather than a penalty prevents weights from growing very large no matter how large the proposed weight-update is. This makes it possible to start with a very large learning rate which decays during learning, thus allowing a far more thorough search of the weight-space than methods that start with small weights and use a small learning rate.
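A minimal sketch of the per-unit norm constraint follows, assuming plain SGD without momentum. The function names, the bound max_norm=2.0, the learning rate and the example weights are hypothetical choices for illustration; the text above does not give the values used in the experiments.

```python
import math


def clip_incoming_norm(weights, max_norm):
    """Rescale one hidden unit's incoming weights if their L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm  # renormalize by division so the norm equals max_norm
    return [w * scale for w in weights]


def sgd_step_with_constraint(weight_rows, grad_rows, learning_rate, max_norm):
    """Apply one SGD update per hidden unit, then enforce the norm bound per unit."""
    updated = []
    for weights, grads in zip(weight_rows, grad_rows):
        stepped = [w - learning_rate * g for w, g in zip(weights, grads)]
        updated.append(clip_incoming_norm(stepped, max_norm))
    return updated


# Each row holds the incoming weights of one hidden unit (made-up numbers).
weights = [[0.5, -0.5], [3.0, 4.0]]
grads = [[0.0, 0.0], [-30.0, -40.0]]  # a deliberately huge update for unit 2
new_weights = sgd_step_with_constraint(weights, grads, learning_rate=1.0, max_norm=2.0)
print(new_weights)
```

The second unit's proposed update would give it a norm of 55, but the constraint pulls it back to the bound, which is what allows a very large initial learning rate without the weights blowing up.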

At test time, we use the “mean network” that contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active. In practice, this gives very similar performance to averaging over a large number of dropout networks. In networks with a single hidden layer of N units and a “softmax” output layer for computing the probabilities of the class labels, using the mean network is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all 2^N possible networks. Assuming the dropout networks do not all make identical predictions, the prediction of the mean network is guaranteed to assign a higher log probability to the correct answer than the mean of the log probabilities assigned by the individual dropout networks (2). Similarly, for regression with linear output units, the squared error of the mean network is always better than the average of the squared errors of the dropout networks.
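The equivalence for a single hidden layer can be checked numerically on a tiny example. The sketch below is an illustration under assumptions, not the authors' code: the sizes (4 hidden units, 3 classes), the random weights and the helper names are hypothetical. It enumerates all 2^N masks, takes the renormalized geometric mean of the softmax outputs, and compares it with the output of the mean network whose outgoing weights are halved.

```python
import itertools
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_probs(hidden, out_weights, out_bias, unit_scale):
    """Softmax output when hidden unit j contributes unit_scale[j] * hidden[j]."""
    logits = [
        b + sum(w * s * h for w, s, h in zip(row, unit_scale, hidden))
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(logits)


rng = random.Random(0)
n_hidden, n_classes = 4, 3  # small enough to enumerate every mask
hidden = [rng.random() for _ in range(n_hidden)]
out_weights = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_classes)]
out_bias = [rng.uniform(-1, 1) for _ in range(n_classes)]

# Mean network: every hidden unit present, outgoing weights halved.
mean_net = output_probs(hidden, out_weights, out_bias, [0.5] * n_hidden)

# All 2^N dropout networks, one per binary mask over the hidden units.
masks = list(itertools.product([0, 1], repeat=n_hidden))
all_probs = [output_probs(hidden, out_weights, out_bias, m) for m in masks]

# Mean log probability per class; exponentiating and renormalizing it gives
# the normalized geometric mean of the 2^N predicted distributions.
log_geo = [sum(math.log(p[c]) for p in all_probs) / len(masks) for c in range(n_classes)]
geo = softmax(log_geo)

print("max difference:", max(abs(a - b) for a, b in zip(mean_net, geo)))
print("mean-net log prob is higher for every class:",
      all(math.log(mean_net[c]) > log_geo[c] for c in range(n_classes)))
```

The check works because averaging the logits over all masks halves each hidden unit's contribution, and the log-normalizer of each softmax is the same for every class, so it cancels when the geometric mean is renormalized.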

We initially explored the effectiveness of dropout using MNIST, a widely used benchmark for machine learning algorithms. It contains 60,000 28x28 training images of individual handwritten digits and 10,000 test images. Performance on the test set can be greatly improved by enhancing the training data with transformed images (3) or by wiring knowledge about spatial transformations into a convolutional neural network (4) or by using generative pre-training to extract useful features from the training images without using the labels (5). Without using any of these tricks, the best published result for a standard feedforward neural network is 160 errors on the test set. This can be reduced to about 130 errors by using 50% dropout with separate L2 constraints on the incoming weights of each hidden unit and further reduced to about 110 errors by also dropping out a random 20% of the pixels (see figure 1).

Dropout can also be combined with generative pre-training, but in this case we use a small learning rate and no weight constraints to avoid losing the feature detectors discovered by the pre-training. The publicly available, pre-trained deep belief net described in (5) got 118 errors when it was fine-tuned using standard back-propagation and 92 errors when fine-tuned using 50% dropout of the hidden units. When the publicly available code at URL was used to pre-
