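Natural language processing (NLP) helps intelligent machines better understand human language and enables language-based human-computer communication. Recent developments in computing power and the emergence of large amounts of linguistic data have increased the need for automated semantic analysis using data-driven methods. Because deep learning methods have brought significant progress in computer vision, automatic speech recognition, and especially NLP, data-driven strategies are now very common. This survey categorizes and discusses the different aspects and applications of NLP that have benefited from deep learning. It covers core NLP tasks and applications, and describes how deep learning methods and models advance these areas. We further analyze and compare different approaches and state-of-the-art models.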

Natural Language Processing Advancements By
Deep Learning: A Survey
Amirsina Torfi, Member, IEEE, Rouzbeh A. Shirvani, Yaser Keneshloo, Nader Tavaf,
and Edward A. Fox, Fellow, IEEE
Abstract—Natural Language Processing (NLP) helps empower
intelligent machines by enabling a better understanding of
human language for language-based human-computer communi-
cation. Recent developments in computational power and the ad-
vent of large amounts of linguistic data have heightened the need
and demand for automating semantic analysis using data-driven
approaches. The utilization of data-driven strategies is pervasive
now due to the significant improvements demonstrated through
the usage of deep learning methods in areas such as Computer
Vision, Automatic Speech Recognition, and in particular, NLP.
This survey categorizes and addresses the different aspects and
applications of NLP that have benefited from deep learning. It
covers core NLP tasks and applications, and describes how deep
learning methods and models advance these areas. We further
analyze and compare different approaches and state-of-the-art
models.
Index Terms—Natural Language Processing, Deep Learning,
Artificial Intelligence
I. INTRODUCTION
Natural Language Processing (NLP) is a sub-discipline
of computer science providing a bridge between natural
languages and computers. It helps empower machines to un-
derstand, process, and analyze human language [1]. NLP’s sig-
nificance as a tool aiding comprehension of human-generated
data is a logical consequence of the context-dependency
of data. Data becomes more meaningful through a deeper
understanding of its context, which in turn facilitates text
analysis and mining. NLP enables this by modeling the communication
structures and patterns of humans.
Development of NLP methods is increasingly reliant on
data-driven approaches which help with building more power-
ful and robust models [2], [3]. Recent advances in computa-
tional power, as well as greater availability of big data, enable
deep learning, one of the most appealing approaches in the
NLP domain [2]–[4], especially given that deep learning has
already demonstrated superior performance in adjoining fields
like Computer Vision [5]–[7] and Speech Recognition [8], [9].
These developments led to a paradigm shift from traditional
to novel data-driven approaches aimed at advancing NLP. The
reason behind this shift was simple: new approaches are more
promising regarding results, and are easier to engineer.
Amirsina Torfi, Yaser Keneshloo, and Edward A. Fox were with the
Department of Computer Science, Virginia Polytechnic Institute and State
University, Blacksburg, VA, 24060 USA e-mail: (amirsina.torfi@gmail.com,
University of Minnesota Twin Cities, Minneapolis, MN, 55455 USA e-mail:
(tav[email protected]).
As a sequitur to remarkable progress achieved in adjacent
disciplines utilizing deep learning methods, deep neural net-
works have been applied to various NLP tasks, including part-
of-speech tagging [10]–[12], named entity recognition [13],
[14], and semantic role labeling [15]–[17]. Most of the
research efforts in deep learning associated with NLP appli-
cations involve either supervised learning¹ or unsupervised
learning².
This survey covers the emerging role of deep learning in the
area of NLP, across a broad range of categories. The research
presented in [18] is primarily focused on architectures, with
little discussion of applications. On the other hand, this paper
describes the challenges, opportunities, and evaluations of the
impact of applying deep learning to NLP problems.
This survey has six sections, including this introduction.
Section 2 lays out the theoretical dimensions of NLP and
artificial intelligence, and looks at deep learning as an ap-
proach to solving real-world problems. It motivates this study
by addressing the question: Why use deep learning in NLP?
The third section discusses fundamental concepts necessary
to understand NLP, covering exemplary issues in representa-
tion, frameworks, and machine learning. The fourth section
summarizes benchmark datasets employed in the NLP domain.
Section 5 focuses on some of the NLP applications where deep
learning has demonstrated significant benefit. Finally, Section
6 provides a conclusion, also addressing some open problems
and promising areas for improvement.
II. BACKGROUND
NLP has long been viewed as one aspect of artificial
intelligence (AI), since understanding and generating natural
language are high-level indications of intelligence. Deep learn-
ing is an effective AI tool, so we next situate deep learning in
the AI world. After that we explain motivations for applying
deep learning to NLP.
A. Artificial Intelligence and Deep Learning
There have been “islands of success” where big data are
processed via AI capabilities to produce information to achieve
critical operational goals (e.g., fraud detection). Accordingly,
scientists and consumers anticipate enhancement across a
¹ Learning from training data to predict the type of new unseen test examples
by mapping them to known pre-defined labels.
² Making sense of data without sticking to specific tasks and supervisory
signals.
arXiv:2003.01200v2 [cs.CL] 4 Mar 2020

variety of applications. However, achieving this requires un-
derstanding of AI and its mechanisms and means (e.g., algo-
rithms). Ted Greenwald, explaining AI to those who are not
AI experts, comments: “Generally AI is anything a computer
can do that formerly was considered a job for a human” [19].
One goal of AI is to extend the capabilities of information
technology (IT) beyond (1) generating, communicating,
and storing data, to also (2) processing data into the knowledge
that decision makers and others need [20]. One reason is
that the available data volume is increasing so rapidly that
it is now impossible for people to process all available data.
This leaves two choices: (1) much or even most existing data
must be ignored or (2) AI must be developed to process the
vast volumes of available data into the essential pieces of
information that decision-makers and others can comprehend.
Deep learning is a bridge between the massive amounts of
data and AI.
1) Definitions: Deep learning refers to applying deep neu-
ral networks to massive amounts of data to learn a procedure
aimed at handling a task. The task can range from simple
classification to complex reasoning. In other words, deep
learning is a set of mechanisms ideally capable of deriving an
optimum solution to any problem given a sufficiently extensive
and relevant input dataset. Loosely speaking, deep learning
is detecting and analyzing important structures/features in the
data aimed at formulating a solution to a given problem. Here,
AI and deep learning meet. One version of the goal or ambition
behind AI is enabling a machine to outperform what the human
brain does. Deep learning is a means to this end.
2) Deep Learning Architectures: Numerous deep learning
architectures have been developed in different research areas,
e.g., in NLP applications employing recurrent neural networks
(RNNs) [21], convolutional neural networks (CNNs) [22], and
more recently, recursive neural networks [23]. We focus our
discussion on a review of the essential models, explained in
relevant seminal publications.
Multi Layer Perceptron: A multilayer perceptron (MLP)
has at least three layers (input, hidden, and output layers). A
layer is simply a collection of neurons operating to transform
information from the previous layer to the next layer. In the
MLP architecture, the neurons in a layer do not communicate
with each other. An MLP employs nonlinear activation func-
tions. Every node in a layer connects to all nodes in the next
layer, creating a fully connected network (Fig. 1). MLPs are
the simplest type of Feed-Forward Neural Networks (FNNs).
FNNs represent a general category of neural networks in which
the connections between the nodes do not create any cycle, i.e.,
in an FNN there is no cycle of information flow.
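The forward pass of the MLP described above can be sketched as follows. This is a minimal illustration, not code from the survey: the layer sizes, the tanh and sigmoid activations, and the helper names (`dense`, `init_layer`, `mlp_forward`) are assumptions chosen for brevity.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer: every input feeds every neuron of the layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, hidden, output):
    """Input -> hidden (tanh) -> output (sigmoid); information never flows back."""
    h = dense(x, hidden[0], hidden[1], math.tanh)
    return dense(h, output[0], output[1], lambda z: 1.0 / (1.0 + math.exp(-z)))

rng = random.Random(0)
hidden_layer = init_layer(3, 4, rng)  # 3 inputs -> 4 hidden neurons
output_layer = init_layer(4, 2, rng)  # 4 hidden neurons -> 2 outputs
y = mlp_forward([0.5, -1.0, 2.0], hidden_layer, output_layer)
```

Neurons inside a layer never read each other's outputs; each layer only consumes the previous layer's values, which is the acyclic property that defines an FNN.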
Convolutional Neural Networks: Convolutional neural
networks (CNNs), whose architecture is inspired by the human
visual cortex, are a subclass of feed-forward neural networks.
CNNs are named after the underlying mathematical operation,
convolution, which yields a measure of the overlap between
its input functions. Convolutional neural networks are usually
employed in situations where data is or needs to be represented
with a 2D or 3D data map. In the data map representation,
the proximity of data points usually corresponds to their
information correlation.
Fig. 1. The general architecture of an MLP.
In convolutional neural networks where the input is an
image, the data map indicates that image pixels are highly cor-
related to their neighboring pixels. Consequently, the convolu-
tional layers have 3 dimensions: width, height, and depth. That
assumption possibly explains why the majority of research
efforts dedicated to CNNs are conducted in the Computer
Vision field [24].
A CNN takes an image represented as an array of numeric
values. After performing specific mathematical operations, it
represents the image in a new output space. This operation is
also called feature extraction, and helps to capture and rep-
resent key image content. The extracted features can be used
for further analysis, for different tasks. One example is image
classification, which aims to categorize images according to
some predefined classes. Other examples include determining
which objects are present in an image and where they are
located. See Fig. 2.
In the case of utilizing CNNs for NLP, the inputs are sen-
tences or documents represented as matrices. Each row of the
matrix is associated with a language element such as a word
or a character. The majority of CNN architectures learn word
or sentence representations in their training phase. A variety
of CNN architectures were used in various classification tasks
such as Sentiment Analysis and Topic Categorization [22],
[25]–[27]. CNNs were employed for Relation Extraction and
Relation Classification as well [28], [29].
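To make the sentence-as-matrix idea concrete, here is a minimal sketch of one convolutional filter sliding over consecutive word rows, followed by max-over-time pooling. The toy 2-dimensional word vectors, the filter values, and the function names are hypothetical; real models learn many filters of several widths.

```python
def conv1d_over_words(sentence_matrix, kernel):
    """Slide a filter spanning len(kernel) consecutive word rows over the sentence."""
    width = len(kernel)
    features = []
    for start in range(len(sentence_matrix) - width + 1):
        window = sentence_matrix[start:start + width]
        total = sum(
            k * v
            for kernel_row, word_row in zip(kernel, window)
            for k, v in zip(kernel_row, word_row)
        )
        features.append(max(0.0, total))  # ReLU non-linearity
    return features

def max_over_time(features):
    """Keep the strongest filter response, wherever it occurred in the sentence."""
    return max(features)

# Each row is a (hypothetical) 2-dimensional vector for one word.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter covering two consecutive words

feature_map = conv1d_over_words(sentence, kernel)
pooled = max_over_time(feature_map)
```

A sentence of four words and a filter of width two yield three window positions; pooling reduces them to a single feature, so sentences of different lengths map to vectors of the same size.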
Recurrent Neural Network: If we line up a sequence of
FNNs and feed the output of each FNN as an input to the next
one, a recurrent neural network (RNN) will be constructed.
Like FNNs, layers in an RNN can be categorized into input,
hidden, and output layers. In discrete time frames, sequences
of input vectors are fed as the input, one vector at a time,
e.g., after inputting each batch of vectors, conducting some
operations and updating the network weights, the next input
batch will be fed to the network. Thus, as shown in Fig. 3,
at each time step we make predictions and use the state of
the current hidden layer as input to the next time step.
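The recurrence can be sketched in a few lines. The update rule below (a tanh of the weighted input plus the weighted previous hidden state) is a common textbook form and is an assumption here, as are the weight values and function names.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        z = b_h[i]
        z += sum(w * x for w, x in zip(w_xh[i], x_t))
        z += sum(w * h for w, h in zip(w_hh[i], h_prev))
        h_t.append(math.tanh(z))
    return h_t

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Feed the sequence one vector at a time, carrying the hidden state along."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Hypothetical weights for 2-dimensional inputs and a 2-unit hidden layer.
w_xh = [[0.5, -0.3], [0.8, 0.1]]
w_hh = [[0.1, 0.4], [-0.2, 0.3]]
b_h = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(sequence, w_xh, w_hh, b_h)
last_alone = rnn_forward([[1.0, 1.0]], w_xh, w_hh, b_h)[-1]
```

The same weights are reused at every time step. Comparing `states[-1]` with `last_alone` shows that the final state depends on the earlier inputs, not only on the last one.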
Hidden layers in recurrent neural networks can carry infor-
mation from the past, in other words, memory. This character-
istic makes them specifically useful for applications that deal
with a sequence of inputs such as language modeling [30], i.e.,
representing language in a way that the machine understands.
This concept will be described later in detail.
Fig. 2. A typical CNN architecture for object detection. The network provides a feature representation with attention to the specific region of an image
(example shown on the left) that contains the object of interest. Out of the multiple regions represented (see an ordering of the image blocks, giving image
pixel intensity, on the right) by the network, the one with the highest score will be selected as the main candidate.
Fig. 3. Recurrent Neural Network (RNN), summarized on the left, expanded
on the right, for N timesteps, with X indicating input, h hidden layer, and
O output.
Fig. 4. Schematic of an Autoencoder.
RNNs can carry rich information from the past. Consider
the sentence: “Michael Jackson was a singer; some people
consider him King of Pop.” It’s easy for a human to identify
him as referring to Michael Jackson. The pronoun him happens
seven words after Michael Jackson; capturing this dependency
is one of the benefits of RNNs, where the hidden layers in an
RNN act as memory units. Long Short Term Memory Network
(LSTM) [31] is one of the most widely used classes of RNNs.
LSTMs are designed to capture long-term dependencies between
inputs from different time steps. Modern Machine Translation
and Speech Recognition often rely on LSTMs.
Autoencoders: Autoencoders implement unsupervised
methods in deep learning. They are widely used in dimension-
ality reduction³ or NLP applications which consist of sequence
to sequence modeling (see Section III-B) [30]. Fig. 4 illustrates
the schematic of an Autoencoder. Since autoencoders are
unsupervised, there is no label corresponding to each input.
They aim to learn a code representation for each input. The
encoder is like a feed-forward neural network in which the
input gets encoded into a vector (code). The decoder operates
similarly to the encoder, but in reverse, i.e., constructing
an output based on the encoded input. In data compression
applications, we want the created output to be as close as
possible to the original input. Autoencoders are lossy, meaning
the output is an approximate reconstruction of the input.
³ Dimensionality reduction is an unsupervised learning approach that
reduces the number of variables used to represent
the data by identifying the most crucial information.
Fig. 5. Generative Adversarial Networks
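The encode-then-decode structure of an autoencoder, and the sense in which it is lossy, can be illustrated with a hand-set linear example. The weights below are assumptions chosen by hand rather than learned; a real autoencoder would obtain them by minimizing the reconstruction error over a dataset.

```python
# Hypothetical, hand-set weights: 2-dimensional input, 1-dimensional code.
ENCODER_WEIGHTS = [0.2, 0.4]
DECODER_WEIGHTS = [1.0, 2.0]

def encode(x):
    """Compress the input into a single number (the code)."""
    return sum(w * v for w, v in zip(ENCODER_WEIGHTS, x))

def decode(code):
    """Expand the code back to the input dimensionality."""
    return [w * code for w in DECODER_WEIGHTS]

def reconstruction_error(x):
    """Mean squared difference between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

on_line = reconstruction_error([1.0, 2.0])   # lies along the direction (1, 2)
off_line = reconstruction_error([1.0, 0.0])  # does not lie along that direction
```

Inputs along the direction the code can represent are reconstructed almost exactly, while other inputs lose information. There is no label anywhere: the input itself is the target.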
Generative Adversarial Networks: Goodfellow et al. [32] intro-
duced Generative Adversarial Networks (GANs). As shown in
Fig. 5, a GAN is a combination of two neural networks, a
discriminator and a generator. The whole network is trained
in an iterative process. First, the generator network generates a
fake sample. Then the discriminator network tries to determine
whether this sample (e.g., an input image) is real or fake, i.e.,
whether it came from the real training data (data used for
building the model) or not. The goal of the generator is to fool
the discriminator so that it believes the
artificial (i.e., generated) samples synthesized by the generator
are real.
This iterative process continues until the generator produces
samples that the discriminator cannot distinguish from real ones. In
other words, the probability of classifying a sample as fake
or real becomes like flipping a fair coin for the discriminator.
The goal of the generative model is to capture the distribution
of real data while the discriminator tries to identify the fake
data. One useful property of GANs as generative models is that,
once the training phase is finished, there is no
need for the discriminator network, so we can work solely
with the generator network. In other words, having access to
the trained generative model is sufficient.
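The "fair coin" end state can be made concrete by writing down the two losses. The sketch below uses the binary cross-entropy form of the discriminator loss and the commonly used non-saturating generator loss; both are standard formulations assumed here, not formulas quoted from the survey, and the probability lists stand in for the outputs of a real discriminator.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Penalize low scores on real samples and high scores on generated ones."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator is rewarded when the discriminator scores its samples as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Discriminator outputs are probabilities that a sample is real.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
coin_flip = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
fooled = generator_loss([0.9, 0.8])
caught = generator_loss([0.1, 0.2])
```

When the discriminator outputs 0.5 for every sample, its loss equals 2 ln 2, which corresponds to pure guessing. Only the generator is needed after training, so neither loss is evaluated at generation time.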
Different forms of GANs have been introduced, e.g.,
SimGAN [7], Wasserstein GAN [33], InfoGAN [34], and
DCGAN [35]. In one of the most elegant GAN implementations
[36], entirely artificial, yet almost perfect, celebrity faces are
generated; the pictures are not real, but fake photos produced
by the network. In the NLP domain, GANs often are used for
text generation [37], [38].
B. Motivation for Deep Learning in NLP
Deep learning applications are predicated on the choices
of (1) feature representation and (2) deep learning algo-
rithm alongside architecture. These are associated with data
representation and learning structure, respectively. For data
representation, surprisingly, there usually is a disjunction
between what information is thought to be important for
the task at hand, versus what representation actually yields
good results. For instance, in sentiment analysis, lexicon
semantics, syntactic structure, and context are assumed by
some linguists to be of primary significance. Nevertheless,
previous studies based on the bag-of-words (BoW) model
demonstrated acceptable performance [39]. The bag-of-words
model [40], often viewed as the vector space model, involves
a representation which accounts only for the words and
their frequency of occurrence. BoW ignores the order and
interaction of words, and treats each word as a unique feature.
BoW disregards syntactic structure, yet provides decent results
for what some would consider syntax-dependent applications.
This observation suggests that simple representations, when
coupled with large amounts of data, may work as well or better
than more complex representations. These findings corroborate
the argument in favor of the importance of deep learning
algorithms and architectures.
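A short sketch shows what BoW keeps and what it discards. The tokenizer (lowercasing and whitespace splitting) and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")
same_representation = (a == b)
```

The two sentences express opposite sentiment, yet `same_representation` is true because the representation records only which words occur and how often.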
Often the progress of NLP is bound to effective language
modeling. A goal of statistical language modeling is the prob-
abilistic representation of word sequences in language, which
is a complicated task due to the curse of dimensionality. The
research presented in [41] was a breakthrough for language
modeling with neural networks aimed at overcoming the curse
of dimensionality by (1) learning a distributed representation
of words and (2) providing a probability function for se-
quences.
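For reference, the probability that a statistical language model assigns to a word sequence is usually factorized with the chain rule and then approximated by truncating the history. The notation below ($w_t$ for the word at position $t$, $n$ for the context size) is an assumption and is not taken from the survey.

```latex
P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})
```

With a vocabulary of size $|V|$, a table of probabilities over $n$-word contexts has on the order of $|V|^n$ entries. For example, $|V| = 10^5$ and $n = 3$ already give $10^{15}$ possible trigrams, which is one way to see the curse of dimensionality that distributed word representations are meant to mitigate.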
A key challenge in NLP research, compared to other do-
mains such as Computer Vision, seems to be the complexity
of achieving an in-depth representation of language using
statistical models. A primary task in NLP applications is to
provide a representation of texts, such as documents. This in-
volves feature learning, i.e., extracting meaningful information
to enable further processing and analysis of the raw data.
Fig. 6. Considering a given sequence, the skip-thought model generates the
surrounding sequences using the trained encoder. The assumption is that the
surrounding sentences are closely related, contextually.
Traditional methods begin with time-consuming hand-
crafting of features, through careful human analysis of a
specific application, and are followed by development of
algorithms to extract and utilize instances of those features.
On the other hand, deep supervised feature learning methods
are highly data-driven and can be used in more general efforts
aimed at providing a robust data representation.
Due to the vast amounts of unlabeled data, unsupervised
feature learning is considered to be a crucial task in NLP. Un-
supervised feature learning is, in essence, learning the features
from unlabeled data to provide a low-dimensional representa-
tion of a high-dimensional data space. Several approaches such
as K-means clustering and principal component analysis have
been proposed and successfully implemented to this end. With
the advent of deep learning, unsupervised feature learning
has become central to
representation learning, a precursor to NLP applications. Cur-
rently, most NLP tasks rely on annotated data, while the
preponderance of unannotated data further motivates research
in leveraging deep data-driven unsupervised methods.
Given the potential superiority of deep learning approaches
in NLP applications, it seems crucial to perform a com-
prehensive analysis of various deep learning methods and
architectures with particular attention to NLP applications.
III. CORE CONCEPTS IN NLP
A. Feature Representation
Distributed representations are a series of compact, low
dimensional representations of data, each representing some
distinct informative property. For NLP systems, due to issues
related to the atomic representation of the symbols, it is
imperative to learn word representations.
First, we concentrate on how the features are rep-
resented, and then we turn to different approaches for
learning word representations. The encoded input features can
be characters, words [23], sentences [42], or other linguistic
elements. Generally, it is more desirable to provide a compact
representation of the words than a sparse one.
How to select the structure and level of text representa-
tion used to be an unresolved question. After the
word2vec approach [43] was proposed, doc2vec was introduced
in [42] as an unsupervised algorithm called Paragraph
Vector (PV). The goal behind PV is to learn fixed-length rep-
resentations from variable-length text parts such as sentences
and documents. One of the main objectives of doc2vec is
to overcome the drawbacks of models such as BoW and to
provide promising results for applications such as text classi-
fication and sentiment analysis. A more recent approach is the
skip-thought model which applies word2vec at the sentence-
level [44]. By utilizing an encoder-decoder architecture, this
model generates the surrounding sentences using the given
sentence (Fig. 6). Next, let’s investigate different kinds of
feature representation.
1) One-Hot Representation: In one-hot encoding, each
unique element that needs to be represented has its own dimen-
sion, which results in a very high-dimensional, very sparse
representation. Assume the words are represented with the
one-hot encoding method. Regarding representation structure,
there is no meaningful connection between different words in
the feature space. For example, highly correlated words such
as ‘ocean’ and ‘water’ will not be closer to each other (in the
representation space) compared to less correlated pairs such as
‘ocean’ and ‘fire.’ Nevertheless, some research efforts present
promising results using one-hot encoding [2].
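The 'ocean', 'water', and 'fire' example can be checked directly. The three-word vocabulary and the helper names are assumptions for illustration.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's own dimension."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(u, v):
    """Inner product, used here as a simple similarity measure."""
    return sum(a * b for a, b in zip(u, v))

vocabulary = ["ocean", "water", "fire"]
ocean = one_hot("ocean", vocabulary)
water = one_hot("water", vocabulary)
fire = one_hot("fire", vocabulary)

related = dot(ocean, water)    # semantically close pair
unrelated = dot(ocean, fire)   # semantically distant pair
```

Both similarities are zero, so the encoding carries no notion of relatedness. The vector length also grows with the vocabulary, which is the source of the sparsity.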
2) Continuous Bag of Words: Continuous Bag-of-Words
model (CBOW) has frequently been used in NLP applica-
tions. CBOW tries to predict a word given its surrounding
context, which usually consists of a few nearby words [45].
CBOW is neither dependent on the sequential order of words
nor necessarily on probabilistic characteristics. So it is not
generally used for language modeling. This model is typi-
cally trained to be utilized as a pre-trained model for more
sophisticated tasks. An alternative to CBOW is the weighted
CBOW (WCBOW) [46] in which different vectors get different
weights reflective of relative importance in context. The sim-
plest example can be document categorization where features
are words and weights are TF-IDF scores [47] of the associated
words.
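The context-aggregation step of CBOW and its weighted variant can be sketched as below. This shows only how context vectors are combined, not the prediction layer or the training procedure; the 2-dimensional embeddings and the weights standing in for TF-IDF scores are hypothetical values.

```python
def context_vector(context, embeddings, weights=None):
    """Average the context word vectors, optionally weighting each word."""
    if weights is None:
        weights = {word: 1.0 for word in context}  # plain CBOW: uniform weights
    total = sum(weights[word] for word in context)
    dim = len(embeddings[context[0]])
    return [
        sum(weights[word] * embeddings[word][i] for word in context) / total
        for i in range(dim)
    ]

# Hypothetical embeddings and importance weights.
embeddings = {"the": [1.0, 0.0], "ocean": [0.0, 1.0]}
importance = {"the": 0.25, "ocean": 0.75}

uniform = context_vector(["the", "ocean"], embeddings)
weighted = context_vector(["the", "ocean"], embeddings, importance)
```

Swapping the order of the context words leaves both results unchanged, which reflects the order independence noted above. The weighted version shifts the result toward the more informative word.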
3) Word-Level Embedding: Word embedding is a learned
representation for context elements in which, ideally, words
with related semantics become highly correlated in the rep-
resentation space. One of the main incentives behind word
embedding representations is the high generalization power
as opposed to sparse, higher dimensional representations [48].
Unlike the traditional bag-of-words model in which different
words have entirely different representations regardless of their
usage or collocations, learning a distributed representation
takes advantage of word usage in context to provide similar
representations for semantically correlated words. There are
different approaches to create word embeddings. Several re-
search efforts, including [43], [45], used random initialization
by uniformly sampling random numbers with the objective of
training an efficient representation of the model on a large
dataset. This setup is intuitively acceptable for initialization
of the embedding for common features such as part-of-speech
tags. However, this may not be the optimum method for rep-
resentation of less frequent features such as individual words.
For the latter, pre-trained models, trained in a supervised or
unsupervised manner, are usually leveraged for increasing the
performance.
4) Character-Level Embedding: The methods mentioned
earlier are mostly at higher levels of representation. Lower-
level representations such as character-level representation
require special attention as well, due to their simplicity of
representation and the potential for correction of unusual
character combinations such as misspellings [2]. For generat-
ing character-level embeddings, CNNs have successfully been
utilized [10].
Character-level embeddings have been used in different
NLP applications [49]. One of the main advantages is the
ability to use small model sizes and represent words with
lower-level language elements [10]. Here word embeddings
are models utilizing CNNs over the characters. Another mo-
tivation for employing character-level embeddings is the out-
of-vocabulary word (OOV) issue which is usually encountered
when, for the given word, there is no equivalent vector in
the word embedding. The character-level approach may sig-
nificantly alleviate this problem. Nevertheless, this approach
suffers from a weak correlation between characters and se-
mantic and syntactic parts of the language. So, considering
the aforementioned pros and cons of utilizing character-level
embeddings, several research efforts tried to propose and im-
plement higher-level approaches such as using sub-words [50]
to create word embeddings for OOV instances as well as
creating a semantic bridge between the correlated words [51].
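One way sub-word units help with OOV words is through shared character n-grams. The sketch below uses boundary-marked character trigrams; this particular scheme is an assumption chosen for illustration and is not necessarily the method of [50] or [51].

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking its boundaries."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

known_word = "walking"  # assumed to be in the vocabulary
oov_word = "walked"     # assumed to be out of vocabulary

shared = set(char_ngrams(known_word)) & set(char_ngrams(oov_word))
```

The OOV word shares the trigrams of its stem with the known word, so a vector for it could be composed from n-gram vectors learned on in-vocabulary words instead of being left undefined.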
B. Seq2Seq Framework
Most underlying frameworks in NLP applications rely on
sequence-to-sequence (seq2seq) models in which not only the
input but also the output is represented as a sequence. These
models are common in various applications including machine
translation⁴, text summarization⁵, speech-to-text, and text-to-
speech applications⁶.
The most common seq2seq framework is comprised of an
encoder and a decoder. The encoder ingests the sequence of
input data and generates a mid-level output which is subse-
quently consumed by the decoder to produce the series of final
outputs. The encoder and decoder are usually implemented via
a series of Recurrent Neural Networks or LSTM [31] cells.
The encoder takes a sequence of length $T$, $X = \{x_1, x_2, \cdots, x_T\}$,
where $x_t \in V = \{1, \cdots, |V|\}$ is the
representation of a single input coming from the vocabulary
$V$, and then generates the output state $h_t$. Subsequently, the
decoder takes the last state from the encoder, i.e., $h_t$, and
starts generating an output of size $L$, $Y' = \{y'_1, y'_2, \cdots, y'_L\}$,
based on its current state, $s_t$, and the ground-truth output $y_t$.
In different applications, the decoder could take advantage
of more information such as a context vector [52] or intra-
attention vectors [53] to generate better outputs.
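The encoder-decoder data flow can be sketched with scalar states. Everything numeric here is a toy assumption: inputs and outputs are real numbers rather than vocabulary indices, the cells are single tanh units rather than LSTMs, and the weights are hand-set. The point is only the structure, in which the decoder starts from the last encoder state and is fed the previous ground-truth output at each step.

```python
import math

def encode(xs, w_x=0.5, w_h=0.9):
    """Read the input sequence step by step and return the last encoder state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def decode(h_last, targets, w_y=0.3, w_s=0.8, w_out=2.0):
    """Produce one output per target position, starting from the encoder state."""
    s = h_last       # decoder state initialized with the last encoder state
    previous = 0.0   # stands in for a start-of-sequence symbol
    outputs = []
    for y_true in targets:
        s = math.tanh(w_y * previous + w_s * s)
        outputs.append(w_out * s)
        previous = y_true  # feed the ground-truth output to the next step
    return outputs

source = [1.0, -0.5, 0.25, 2.0]  # input sequence of length T = 4
targets = [0.7, -0.2]            # ground-truth output sequence of length L = 2
predictions = decode(encode(source), targets)
```

The input and output lengths differ, which is what distinguishes this framework from models that emit one label per input element.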
One of the most widely used training approaches for seq2seq
models is called Teacher Forcing [54]. Let us define
$y = \{y_1, y_2, \cdots, y_L\}$ as the ground-truth output sequence corre-
sponding to a given input sequence $X$. The model training
⁴ The input is a sequence of words from one language (e.g., English) and
the output is the translation to another language (e.g., French).
⁵ The input is a complete document (sequence of words) and the output is
a summary of it (sequence of words).
⁶ The input is an audio recording of a speech (sequence of audible elements)
and the output is the speech text (sequence of words).