

Natural Language Processing Advancements By Deep Learning: A Survey

Amirsina Torfi, Member, IEEE, Rouzbeh A. Shirvani, Yaser Keneshloo, Nader Tavaf, and Edward A. Fox, Fellow, IEEE

Abstract: Natural Language Processing (NLP) helps empower intelligent machines by improving their understanding of human language for language-based human-computer communication. Recent developments in computational power and the advent of large amounts of linguistic data have heightened the need and demand for automating semantic analysis using data-driven approaches. The utilization of data-driven strategies is now pervasive due to the significant improvements demonstrated through the usage of deep learning methods in areas such as Computer Vision, Automatic Speech Recognition, and, in particular, NLP. This survey categorizes and addresses the different aspects and applications of NLP that have benefited from deep learning. It covers core NLP tasks and applications, and describes how deep learning methods and models advance these areas. We further analyze and compare different approaches and state-of-the-art models.

Index Terms: Natural Language Processing, Deep Learning, Artificial Intelligence

I. INTRODUCTION

Natural Language Processing (NLP) is a sub-discipline of computer science providing a bridge between natural languages and computers. It helps empower machines to understand, process, and analyze human language [1]. NLP's significance as a tool aiding comprehension of human-generated data is a logical consequence of the context-dependency of data. Data becomes more meaningful through a deeper understanding of its context, which in turn facilitates text analysis and mining. NLP enables this by exploiting the communication structures and patterns of humans.

Development of NLP methods is increasingly reliant on

data-driven approaches which help with building more power-

ful and robust models [2], [3]. Recent advances in computa-

tional power, as well as greater availability of big data, enable

deep learning, one of the most appealing approaches in the

NLP domain [2]–[4], especially given that deep learning has

already demonstrated superior performance in adjoining ﬁelds

like Computer Vision [5]–[7] and Speech Recognition [8], [9].

These developments led to a paradigm shift from traditional

to novel data-driven approaches aimed at advancing NLP. The

reason behind this shift was simple: new approaches are more

promising regarding results, and are easier to engineer.

Amirsina Torfi, Yaser Keneshloo, and Edward A. Fox were with the Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24060 USA, e-mail: (amirsina.torfi@gmail.com, [email protected], [email protected]). Rouzbeh A. Shirvani is an independent researcher, e-mail: ([email protected]). Nader Tavaf was with the University of Minnesota Twin Cities, Minneapolis, MN, 55455 USA, e-mail: (tav[email protected]).

Following the remarkable progress achieved in adjacent disciplines utilizing deep learning methods, deep neural networks have been applied to various NLP tasks, including part-of-speech tagging [10]–[12], named entity recognition [13], [14], and semantic role labeling [15]–[17]. Most of the research efforts in deep learning associated with NLP applications involve either supervised learning or unsupervised learning.

This survey covers the emerging role of deep learning in the

area of NLP, across a broad range of categories. The research

presented in [18] is primarily focused on architectures, with

little discussion of applications. On the other hand, this paper

describes the challenges, opportunities, and evaluations of the

impact of applying deep learning to NLP problems.

This survey has six sections, including this introduction.

Section 2 lays out the theoretical dimensions of NLP and

artiﬁcial intelligence, and looks at deep learning as an ap-

proach to solving real-world problems. It motivates this study

by addressing the question: Why use deep learning in NLP?

The third section discusses fundamental concepts necessary

to understand NLP, covering exemplary issues in representa-

tion, frameworks, and machine learning. The fourth section

summarizes benchmark datasets employed in the NLP domain.

Section 5 focuses on some of the NLP applications where deep

learning has demonstrated signiﬁcant beneﬁt. Finally, Section

6 provides a conclusion, also addressing some open problems

and promising areas for improvement.

II. BACKGROUND

NLP has long been viewed as one aspect of artiﬁcial

intelligence (AI), since understanding and generating natural

language are high-level indications of intelligence. Deep learn-

ing is an effective AI tool, so we next situate deep learning in

the AI world. After that we explain motivations for applying

deep learning to NLP.

A. Artiﬁcial Intelligence and Deep Learning

There have been “islands of success” where big data are processed via AI capabilities to produce information to achieve critical operational goals (e.g., fraud detection). Accordingly, scientists and consumers anticipate enhancement across a variety of applications. However, achieving this requires understanding of AI and its mechanisms and means (e.g., algorithms). Ted Greenwald, explaining AI to those who are not AI experts, comments: “Generally AI is anything a computer can do that formerly was considered a job for a human” [19].

Footnote (supervised learning, Section I): Learning from training data to predict the type of new unseen test examples by mapping them to known pre-defined labels.

Footnote (unsupervised learning, Section I): Making sense of data without sticking to specific tasks and supervisory signals.

arXiv:2003.01200v2 [cs.CL] 4 Mar 2020

One goal of AI is to extend the capabilities of information technology (IT) beyond (1) generating, communicating, and storing data, to also (2) processing data into the knowledge that decision makers and others need [20]. One reason is that the available data volume is increasing so rapidly that it is now impossible for people to process all of it. This leaves two choices: (1) much or even most existing data must be ignored, or (2) AI must be developed to process the vast volumes of available data into the essential pieces of information that decision-makers and others can comprehend. Deep learning is a bridge between the massive amounts of data and AI.

1) Deﬁnitions: Deep learning refers to applying deep neu-

ral networks to massive amounts of data to learn a procedure

aimed at handling a task. The task can range from simple

classiﬁcation to complex reasoning. In other words, deep

learning is a set of mechanisms ideally capable of deriving an

optimum solution to any problem given a sufﬁciently extensive

and relevant input dataset. Loosely speaking, deep learning

is detecting and analyzing important structures/features in the

data aimed at formulating a solution to a given problem. Here,

AI and deep learning meet. One version of the goal or ambition

behind AI is enabling a machine to outperform what the human

brain does. Deep learning is a means to this end.

2) Deep Learning Architectures: Numerous deep learning

architectures have been developed in different research areas,

e.g., in NLP applications employing recurrent neural networks

(RNNs) [21], convolutional neural networks (CNNs) [22], and

more recently, recursive neural networks [23]. We focus our

discussion on a review of the essential models, explained in

relevant seminal publications.

Multi Layer Perceptron: A multilayer perceptron (MLP)

has at least three layers (input, hidden, and output layers). A

layer is simply a collection of neurons operating to transform

information from the previous layer to the next layer. In the

MLP architecture, the neurons in a layer do not communicate

with each other. An MLP employs nonlinear activation func-

tions. Every node in a layer connects to all nodes in the next

layer, creating a fully connected network (Fig. 1). MLPs are

the simplest type of Feed-Forward Neural Networks (FNNs).

FNNs represent a general category of neural networks in which

the connections between the nodes do not create any cycle, i.e.,

in an FNN there is no cycle of information flow.
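The layer-by-layer flow described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the layer sizes (3, 4, 2), the `tanh` activation, the random weights, and the helper names `dense`, `init_layer`, and `mlp_forward` are all hypothetical choices made for this sketch, not something the survey specifies.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer: every input feeds every neuron of the layer."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer with n_in inputs, n_out neurons."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def mlp_forward(x, layers):
    """Information only moves forward from layer to layer; there is no cycle."""
    h = x
    for weights, biases, activation in layers:
        h = dense(h, weights, biases, activation)
    return h

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
w1, b1 = init_layer(3, 4, rng)         # input layer (3 values) -> hidden layer (4 neurons)
w2, b2 = init_layer(4, 2, rng)         # hidden layer (4 neurons) -> output layer (2 neurons)
layers = [(w1, b1, math.tanh),         # nonlinear activation in the hidden layer
          (w2, b2, lambda z: z)]       # identity activation at the output
output = mlp_forward([0.5, -1.0, 2.0], layers)
```

Note that neurons inside `dense` never read each other's outputs, which mirrors the statement that neurons within a layer do not communicate.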

Convolutional Neural Networks: Convolutional neural

networks (CNNs), whose architecture is inspired by the human

visual cortex, are a subclass of feed-forward neural networks.

CNNs are named after the underlying mathematical operation,

convolution, which yields a measure of the interoperability of

its input functions. Convolutional neural networks are usually

employed in situations where data is or needs to be represented

with a 2D or 3D data map. In the data map representation,

the proximity of data points usually corresponds to their

information correlation.

Fig. 1. The general architecture of an MLP.

In convolutional neural networks where the input is an

image, the data map indicates that image pixels are highly cor-

related to their neighboring pixels. Consequently, the convolu-

tional layers have 3 dimensions: width, height, and depth. That

assumption possibly explains why the majority of research

efforts dedicated to CNNs are conducted in the Computer

Vision ﬁeld [24].

A CNN takes an image represented as an array of numeric

values. After performing speciﬁc mathematical operations, it

represents the image in a new output space. This operation is

also called feature extraction, and helps to capture and rep-

resent key image content. The extracted features can be used

for further analysis, for different tasks. One example is image

classiﬁcation, which aims to categorize images according to

some predeﬁned classes. Other examples include determining

which objects are present in an image and where they are

located. See Fig. 2.

In the case of utilizing CNNs for NLP, the inputs are sen-

tences or documents represented as matrices. Each row of the

matrix is associated with a language element such as a word

or a character. The majority of CNN architectures learn word

or sentence representations in their training phase. A variety

of CNN architectures were used in various classiﬁcation tasks

such as Sentiment Analysis and Topic Categorization [22],

[25]–[27]. CNNs were employed for Relation Extraction and

Relation Classiﬁcation as well [28], [29].
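To make the sentence-as-matrix idea concrete, the following sketch slides one filter over consecutive word rows. It assumes made-up 3-dimensional word vectors, a hand-picked filter spanning two words, a ReLU nonlinearity, and max-over-time pooling; none of these specifics come from the survey, and `conv1d_over_words` is a hypothetical helper name. As in most deep learning code, the "convolution" here is implemented as a sliding dot product.

```python
def conv1d_over_words(sentence_matrix, kernel):
    """Slide a filter that spans len(kernel) consecutive rows (words)."""
    window = len(kernel)
    features = []
    for start in range(len(sentence_matrix) - window + 1):
        total = 0.0
        for offset in range(window):
            row = sentence_matrix[start + offset]
            total += sum(k * v for k, v in zip(kernel[offset], row))
        features.append(max(0.0, total))  # ReLU nonlinearity (an assumption)
    return features

# Each row stands for one word of a 4-word sentence (toy 3-dimensional vectors).
sentence = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# One filter covering two consecutive words.
kernel = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
feature_map = conv1d_over_words(sentence, kernel)  # one value per word window
pooled = max(feature_map)                          # max-over-time pooling
```

A classifier for a task such as sentiment analysis would typically use many such filters and feed the pooled values to a final layer.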

Recurrent Neural Network: If we line up a sequence of

FNNs and feed the output of each FNN as an input to the next

one, a recurrent neural network (RNN) will be constructed.

Like FNNs, layers in an RNN can be categorized into input,

hidden, and output layers. In discrete time frames, sequences

of input vectors are fed as the input, one vector at a time,

e.g., after inputting each batch of vectors, conducting some

operations and updating the network weights, the next input

batch will be fed to the network. Thus, as shown in Fig. 3,

at each time step we make predictions and use parameters of

the current hidden layer as input to the next time step.

Hidden layers in recurrent neural networks can carry information from the past, in other words, memory. This characteristic makes them specifically useful for applications that deal with a sequence of inputs such as language modeling [30], i.e., representing language in a way that the machine understands. This concept will be described later in detail.

Fig. 2. A typical CNN architecture for object detection. The network provides a feature representation with attention to the specific region of an image (example shown on the left) that contains the object of interest. Out of the multiple regions represented (see an ordering of the image blocks, giving image pixel intensity, on the right) by the network, the one with the highest score will be selected as the main candidate.

Fig. 3. Recurrent Neural Network (RNN), summarized on the left, expanded on the right, for N timesteps, with X indicating input, h hidden layer, and O output.

Fig. 4. Schematic of an Autoencoder.

RNNs can carry rich information from the past. Consider

the sentence: “Michael Jackson was a singer; some people

consider him King of Pop.” It’s easy for a human to identify

him as referring to Michael Jackson. The pronoun him happens

seven words after Michael Jackson; capturing this dependency

is one of the beneﬁts of RNNs, where the hidden layers in an

RNN act as memory units. Long Short Term Memory Network

(LSTM) [31] is one of the most widely used classes of RNNs.

LSTMs try to capture even long time dependencies between

inputs from different time steps. Modern Machine Translation

and Speech Recognition often rely on LSTMs.
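The memory role of the hidden layer can be illustrated with a minimal recurrent step. The sketch below assumes the common update h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), which the survey does not spell out, and uses hand-picked weights so that the effect is easy to trace; `rnn_step` and `run_rnn` are hypothetical names.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        z = b_h[i]
        z += sum(w * x for w, x in zip(w_xh[i], x_t))
        z += sum(w * h for w, h in zip(w_hh[i], h_prev))
        h_t.append(math.tanh(z))
    return h_t

def run_rnn(inputs, w_xh, w_hh, b_h):
    """Feed the input vectors one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Hand-picked weights: hidden unit 0 reads the input,
# hidden unit 1 reads the previous value of unit 0.
w_xh = [[1.0], [0.0]]
w_hh = [[0.0, 0.0], [1.0, 0.0]]
b_h = [0.0, 0.0]
states = run_rnn([[1.0], [0.0], [0.0]], w_xh, w_hh, b_h)
```

At the second step the input is zero, yet hidden unit 1 is non-zero because it received the earlier input through the recurrent connection. With these particular weights the trace disappears by the third step, which hints at why gated variants such as LSTMs are used for longer dependencies.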

Autoencoders: Autoencoders implement unsupervised methods in deep learning. They are widely used in dimensionality reduction or NLP applications which consist of sequence to sequence modeling (see Section III-B) [30]. Fig. 4 illustrates the schematic of an Autoencoder. Since autoencoders are unsupervised, there is no label corresponding to each input. They aim to learn a code representation for each input. The encoder is like a feed-forward neural network in which the input gets encoded into a vector (code). The decoder operates similarly to the encoder, but in reverse, i.e., constructing an output based on the encoded input. In data compression applications, we want the created output to be as close as possible to the original input. Autoencoders are lossy, meaning the output is an approximate reconstruction of the input.

Footnote (dimensionality reduction): Dimensionality reduction is an unsupervised learning approach which is the process of reducing the number of variables that were used to represent the data by identifying the most crucial information.

Fig. 5. Generative Adversarial Networks
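For the autoencoder described above, the encode-then-reconstruct objective can be written compactly. The notation is an assumption made for this sketch: $f_\theta$ denotes the encoder, $g_\phi$ the decoder, $z$ the code, and squared error is only one common choice of reconstruction loss.

```latex
z = f_{\theta}(x), \qquad \hat{x} = g_{\phi}(z)

\min_{\theta, \phi} \; \frac{1}{N} \sum_{i=1}^{N}
  \left\lVert x_i - g_{\phi}\big(f_{\theta}(x_i)\big) \right\rVert_2^{2}
```

No label appears anywhere in the objective: the input itself serves as the target. When the code $z$ has fewer dimensions than $x$, the reconstruction $\hat{x}$ is in general only approximate, which is the sense in which the text calls autoencoders lossy.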

Generative Adversarial Networks: Goodfellow [32] intro-

duced Generative Adversarial Networks (GANs). As shown in

Fig. 5, a GAN is a combination of two neural networks, a

discriminator and a generator. The whole network is trained

in an iterative process. First, the generator network generates a

fake sample. Then the discriminator network tries to determine

whether this sample (e.g., an input image) is real or fake, i.e.,

whether it came from the real training data (data used for

building the model) or not. The goal of the generator is to fool

the discriminator in a way that the discriminator believes the

artiﬁcial (i.e., generated) samples synthesized by the generator

are real.

This iterative process continues until the generator produces

samples that are indistinguishable by the discriminator. In

other words, the probability of classifying a sample as fake

or real becomes like ﬂipping a fair coin for the discriminator.

The goal of the generative model is to capture the distribution

of real data while the discriminator tries to identify the fake

data. One of the interesting features of GANs (regarding being

generative) is: once the training phase is ﬁnished, there is no

need for the discrimination network, so we solely can work

with the generation network. In other words, having access to

the trained generative model is sufﬁcient.
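The adversarial game sketched above is commonly written as the following minimax objective, using the notation of the original GAN formulation [32]: $D$ is the discriminator, $G$ the generator, $p_{\text{data}}$ the real data distribution, $p_z$ the noise prior, and $p_g$ the distribution of generated samples.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]

% For a fixed generator, the optimal discriminator is
D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)}
```

When the generator matches the data distribution ($p_g = p_{\text{data}}$), the optimal discriminator outputs $D^{*}(x) = 1/2$ for every sample, which is the fair coin flip mentioned in the text.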

Different forms of GANs have been introduced, e.g., SimGAN [7], Wasserstein GAN [33], InfoGAN [34], and DCGAN [35]. In one of the most elegant GAN implementations [36], entirely artificial, yet almost perfect, celebrity faces are generated; the pictures are not real, but fake photos produced by the network. In the NLP domain, GANs often are used for text generation [37], [38].

B. Motivation for Deep Learning in NLP

Deep learning applications are predicated on the choices

of (1) feature representation and (2) deep learning algo-

rithm alongside architecture. These are associated with data

representation and learning structure, respectively. For data

representation, surprisingly, there usually is a disjunction

between what information is thought to be important for

the task at hand, versus what representation actually yields

good results. For instance, in sentiment analysis, lexicon

semantics, syntactic structure, and context are assumed by

some linguists to be of primary signiﬁcance. Nevertheless,

previous studies based on the bag-of-words (BoW) model

demonstrated acceptable performance [39]. The bag-of-words

model [40], often viewed as the vector space model, involves

a representation which accounts only for the words and

their frequency of occurrence. BoW ignores the order and

interaction of words, and treats each word as a unique feature.

BoW disregards syntactic structure, yet provides decent results

for what some would consider syntax-dependent applications.

This observation suggests that simple representations, when

coupled with large amounts of data, may work as well or better

than more complex representations. These ﬁndings corroborate

the argument in favor of the importance of deep learning

algorithms and architectures.
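A tiny example shows what the bag-of-words model keeps and what it discards. It is a sketch that assumes whitespace tokenization and lowercasing; the two sentences and the helper name `bag_of_words` are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

same_representation = (a == b)   # the two sentences become indistinguishable
count_of_the = a["the"]          # frequency is the only information kept per word
```

The two sentences mean different things, yet they receive the same representation, which is exactly the loss of order and syntactic structure described above.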

Often the progress of NLP is bound to effective language

modeling. A goal of statistical language modeling is the prob-

abilistic representation of word sequences in language, which

is a complicated task due to the curse of dimensionality. The

research presented in [41] was a breakthrough for language

modeling with neural networks aimed at overcoming the curse

of dimensionality by (1) learning a distributed representation

of words and (2) providing a probability function for se-

quences.
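The probabilistic representation of a word sequence referred to above is usually factorized with the chain rule, and classical n-gram models truncate the history to the previous $n-1$ words. The symbols $w_t$, $T$, and $n$ are notation chosen for this sketch.

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```

A table-based model over a vocabulary $V$ may need up to $|V|^{n}$ entries for the truncated conditionals, so the number of parameters grows exponentially with the context length. This growth is the curse of dimensionality that distributed word representations are meant to ease.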

A key challenge in NLP research, compared to other do-

mains such as Computer Vision, seems to be the complexity

of achieving an in-depth representation of language using

statistical models. A primary task in NLP applications is to

provide a representation of texts, such as documents. This in-

volves feature learning, i.e., extracting meaningful information

to enable further processing and analysis of the raw data.

Fig. 6. Considering a given sequence, the skip-thought model generates the

surrounding sequences using the trained encoder. The assumption is that the

surrounding sentences are closely related, contextually.

Traditional methods begin with time-consuming hand-

crafting of features, through careful human analysis of a

speciﬁc application, and are followed by development of

algorithms to extract and utilize instances of those features.

On the other hand, deep supervised feature learning methods

are highly data-driven and can be used in more general efforts

aimed at providing a robust data representation.

Due to the vast amounts of unlabeled data, unsupervised

feature learning is considered to be a crucial task in NLP. Un-

supervised feature learning is, in essence, learning the features

from unlabeled data to provide a low-dimensional representa-

tion of a high-dimensional data space. Several approaches such

as K-means clustering and principal component analysis have

been proposed and successfully implemented to this end. With

the advent of deep learning and abundance of unlabeled

data, unsupervised feature learning becomes a crucial task for

representation learning, a precursor in NLP applications. Cur-

rently, most of the NLP tasks rely on annotated data, while a

preponderance of unannotated data further motivates research

in leveraging deep data-driven unsupervised methods.

Given the potential superiority of deep learning approaches

in NLP applications, it seems crucial to perform a com-

prehensive analysis of various deep learning methods and

architectures with particular attention to NLP applications.

III. CORE CONCEPTS IN NLP

A. Feature Representation

Distributed representations are a series of compact, low

dimensional representations of data, each representing some

distinct informative property. For NLP systems, due to issues

related to the atomic representation of the symbols, it is

imperative to learn word representations.

At ﬁrst, let’s concentrate on how the features are rep-

resented, and then we focus on different approaches for

learning word representations. The encoded input features can

be characters, words [23], sentences [42], or other linguistic

elements. Generally, it is more desirable to provide a compact

representation of the words than a sparse one.

How to select the structure and level of text representation used to be an unresolved question. After the word2vec approach was proposed [43], doc2vec was subsequently proposed in [42] as an unsupervised algorithm and was called Paragraph Vector (PV). The goal behind PV is to learn fixed-length representations from variable-length text parts such as sentences and documents. One of the main objectives of doc2vec is to overcome the drawbacks of models such as BoW and to provide promising results for applications such as text classification and sentiment analysis. A more recent approach is the skip-thought model, which applies the word2vec idea at the sentence level [44]. By utilizing an encoder-decoder architecture, this model generates the surrounding sentences using the given sentence (Fig. 6). Next, let's investigate different kinds of feature representation.

1) One-Hot Representation: In one-hot encoding, each unique element that needs to be represented has its own dimension, which results in a very high dimensional, very sparse representation. Assume the words are represented with the one-hot encoding method. Regarding representation structure, there is no meaningful connection between different words in the feature space. For example, highly correlated words such as ‘ocean’ and ‘water’ will not be closer to each other (in the representation space) compared to less correlated pairs such as ‘ocean’ and ‘fire.’ Nevertheless, some research efforts present promising results using one-hot encoding [2].
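The 'ocean', 'water', 'fire' example can be checked directly. The sketch below assumes a toy three-word vocabulary and Euclidean distance as the measure of closeness; `one_hot` and `euclidean` are hypothetical helper names.

```python
import math

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

vocab = ["ocean", "water", "fire"]
index_of = {word: i for i, word in enumerate(vocab)}
vectors = {word: one_hot(index_of[word], len(vocab)) for word in vocab}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_ocean_water = euclidean(vectors["ocean"], vectors["water"])
d_ocean_fire = euclidean(vectors["ocean"], vectors["fire"])
```

Both distances are equal, so the encoding carries no information about which pair is more related. The vector length also equals the vocabulary size, which is why realistic vocabularies lead to very high dimensional, sparse representations.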

2) Continuous Bag of Words: Continuous Bag-of-Words

model (CBOW) has frequently been used in NLP applica-

tions. CBOW tries to predict a word given its surrounding

context, which usually consists of a few nearby words [45].

CBOW is neither dependent on the sequential order of words

nor necessarily on probabilistic characteristics. So it is not

generally used for language modeling. This model is typi-

cally trained to be utilized as a pre-trained model for more

sophisticated tasks. An alternative to CBOW is the weighted

CBOW (WCBOW) [46] in which different vectors get different

weights reﬂective of relative importance in context. The sim-

plest example can be document categorization where features

are words and weights are TF-IDF scores [47] of the associated

words.
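The difference between plain and weighted combination of context vectors can be sketched as follows. This is an illustration only: the 2-dimensional vectors are made up, the weights are hypothetical stand-ins for TF-IDF scores, and averaging the context vectors is assumed as the way the context is combined before predicting the target word.

```python
def weighted_context_vector(context_words, embeddings, weights=None):
    """Combine context word vectors; uniform weights give a plain average."""
    if weights is None:
        weights = {word: 1.0 for word in context_words}
    dim = len(next(iter(embeddings.values())))
    total_weight = sum(weights[word] for word in context_words)
    combined = [0.0] * dim
    for word in context_words:
        for i, value in enumerate(embeddings[word]):
            combined[i] += weights[word] * value
    return [c / total_weight for c in combined]

# Toy vectors and weights (assumptions for illustration).
embeddings = {"the": [1.0, 0.0], "deep": [0.0, 1.0], "ocean": [1.0, 1.0]}
plain = weighted_context_vector(["the", "ocean"], embeddings)
reordered = weighted_context_vector(["ocean", "the"], embeddings)
weighted = weighted_context_vector(
    ["the", "ocean"], embeddings, {"the": 1.0, "ocean": 3.0}
)
```

Reordering the context words leaves the result unchanged, consistent with the statement that CBOW does not depend on the sequential order of words, while a larger weight pulls the combined vector toward the more informative word.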

3) Word-Level Embedding: Word embedding is a learned

representation for context elements in which, ideally, words

with related semantics become highly correlated in the rep-

resentation space. One of the main incentives behind word

embedding representations is the high generalization power

as opposed to sparse, higher dimensional representations [48].

Unlike the traditional bag-of-words model in which different

words have entirely different representations regardless of their

usage or collocations, learning a distributed representation

takes advantage of word usage in context to provide similar

representations for semantically correlated words. There are

different approaches to create word embeddings. Several re-

search efforts, including [43], [45], used random initialization

by uniformly sampling random numbers with the objective of

training an efﬁcient representation of the model on a large

dataset. This setup is intuitively acceptable for initialization

of the embedding for common features such as part-of-speech

tags. However, this may not be the optimum method for rep-

resentation of less frequent features such as individual words.

For the latter, pre-trained models, trained in a supervised or

unsupervised manner, are usually leveraged for increasing the

performance.
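What "highly correlated in the representation space" means can be illustrated with cosine similarity over dense vectors. The 3-dimensional vectors below are invented for this sketch and are not learned embeddings; a trained model would produce vectors with far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical dense vectors chosen so that related words point the same way.
embedding = {
    "ocean": [0.9, 0.8, 0.1],
    "water": [0.8, 0.9, 0.2],
    "fire": [0.1, 0.2, 0.9],
}
sim_ocean_water = cosine(embedding["ocean"], embedding["water"])
sim_ocean_fire = cosine(embedding["ocean"], embedding["fire"])
```

With these toy values the related pair scores much higher than the unrelated pair, which is the behavior a learned embedding aims for.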

4) Character-Level Embedding: The methods mentioned

earlier are mostly at higher levels of representation. Lower-

level representations such as character-level representation

require special attention as well, due to their simplicity of

representation and the potential for correction of unusual

character combinations such as misspellings [2]. For generat-

ing character-level embeddings, CNNs have successfully been

utilized [10].

Character-level embeddings have been used in different

NLP applications [49]. One of the main advantages is the

ability to use small model sizes and represent words with

lower-level language elements [10]. Here word embeddings

are models utilizing CNNs over the characters. Another mo-

tivation for employing character-level embeddings is the out-

of-vocabulary word (OOV) issue which is usually encountered

when, for the given word, there is no equivalent vector in

the word embedding. The character-level approach may sig-

niﬁcantly alleviate this problem. Nevertheless, this approach

suffers from a weak correlation between characters and se-

mantic and syntactic parts of the language. So, considering

the aforementioned pros and cons of utilizing character-level

embeddings, several research efforts tried to propose and im-

plement higher-level approaches such as using sub-words [50]

to create word embeddings for OOV instances as well as

creating a semantic bridge between the correlated words [51].
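One way to picture the sub-word idea is to build a vector for an unseen word from the character n-grams it shares with known words. The sketch below is a generic illustration in that spirit, not the method of [50] or [51]: the boundary markers `<` and `>`, the trigram size, the averaging rule, and the toy n-gram vectors are all assumptions.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def subword_vector(word, ngram_vectors, dim):
    """Average the vectors of the n-grams that are known; zeros if none are."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][i] for g in grams) / len(grams) for i in range(dim)]

# Hypothetical n-gram vectors, as if learned from the in-vocabulary word "cat".
ngram_vectors = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0]}

grams_of_cat = char_ngrams("cat")
oov_vector = subword_vector("cats", ngram_vectors, 2)     # "cats" was never seen as a word
unknown_vector = subword_vector("xyz", ngram_vectors, 2)  # shares no n-gram with known words
```

The unseen word "cats" still receives a non-trivial vector because it shares n-grams with "cat", which is how sub-word units can link correlated words and reduce the OOV problem.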

B. Seq2Seq Framework

Most underlying frameworks in NLP applications rely on sequence-to-sequence (seq2seq) models in which not only the input but also the output is represented as a sequence. These models are common in various applications including machine translation, text summarization, speech-to-text, and text-to-speech applications.

The most common seq2seq framework is comprised of an

encoder and a decoder. The encoder ingests the sequence of

input data and generates a mid-level output which is subse-

quently consumed by the decoder to produce the series of ﬁnal

outputs. The encoder and decoder are usually implemented via

a series of Recurrent Neural Networks or LSTM [31] cells.

The encoder takes a sequence of length $T$, $X = \{x_1, x_2, \cdots, x_T\}$, where $x_t \in V = \{1, \cdots, |V|\}$ is the representation of a single input coming from the vocabulary $V$, and then generates the output state $h_t$. Subsequently, the decoder takes the last state from the encoder, i.e., $h_T$, and starts generating an output of size $L$, $Y' = \{y'_1, y'_2, \cdots, y'_L\}$, based on its current state, $s_t$, and the ground-truth output $y_t$.

In different applications, the decoder could take advantage

of more information such as a context vector [52] or intra-

attention vectors [53] to generate better outputs.

One of the most widely used training approaches for seq2seq models is called Teacher Forcing [54]. Let us define $y = \{y_1, y_2, \cdots, y_L\}$ as the ground-truth output sequence corresponding to a given input sequence $X$. The model training …

Footnote (machine translation): The input is a sequence of words from one language (e.g., English) and the output is the translation to another language (e.g., French).

Footnote (text summarization): The input is a complete document (sequence of words) and the output is a summary of it (sequence of words).

Footnote (speech-to-text): The input is an audio recording of a speech (sequence of audible elements) and the output is the speech text (sequence of words).
