xLSTM: Extending Long Short-Term Memory Networks to Billions of Parameters


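Key point 1: Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture proposed by Hochreiter and Schmidhuber in 1997. The now-standard LSTM is built around three gates: a forget gate, an input gate, and an output gate, which together let it learn long-term dependencies. The forget gate decides when to discard information from the memory cell, the input gate decides when to update the memory cell, and the output gate decides when to expose the memory cell's contents as output. These gates mitigate the vanishing gradient problem of conventional RNNs, so LSTM keeps performing well over longer sequences.

Key point 2: Limitations of LSTM

LSTM played an important role in many deep learning successes, including the construction of large language models (LLMs), but it has limitations. Its computation is not fully parallelizable, which limits training efficiency on large datasets. Its gating mechanism addresses long-term dependencies but also adds parameters and computational cost. These limitations became more apparent with the arrival of the Transformer, which handles long-range dependencies more efficiently and is highly parallelizable.

Key point 3: The rise of the Transformer

The Transformer is a neural network architecture that relies entirely on the attention mechanism to capture relationships within a sequence, rather than on sequential computation as RNNs do. Its core is self-attention, which lets the model consider every element of the sequence while processing each individual element. This makes the Transformer more efficient at handling long-range dependencies. Because its computation parallelizes well, it trains far more efficiently than LSTM on large datasets and has become the dominant architecture in deep learning.

Key point 4: The xLSTM proposal

To keep LSTM competitive in language modeling, researchers proposed the Extended Long Short-Term Memory (xLSTM). xLSTM combines LSTM with recent techniques to mitigate its known limitations. The document names exponential gating as one of the new xLSTM components, alongside matrix memory. The excerpt does not spell out the mechanism in detail, but it describes an exponential input gate, which differs from the sigmoid gates of the standard LSTM and changes how information is passed into the memory.

Key point 5: Scaling LSTM to large parameter counts

The document also poses a key question: how far can we get in language modeling when we scale LSTM to billions of parameters, use the latest techniques from modern LLMs, and mitigate the known limitations of LSTM? This shows that the xLSTM work is about more than improved gating; it also covers training and scaling on large datasets, for example by increasing model capacity or adapting training and optimization to larger parameter counts.

Summary

As an extension of the classic LSTM, xLSTM shows that the LSTM architecture is worth revisiting for large language models. Although the Transformer dominates current research and practice, xLSTM suggests that LSTM and its variants can still offer advantages in particular scenarios and applications. Future deep learning research may bring more attempts of this kind, aimed at better performance and efficiency across tasks.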
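The three-gate update described above (forget, input, and output gates acting on a memory cell) can be sketched as a single time step. This is a minimal NumPy illustration of the textbook LSTM cell, not code from the xLSTM paper; the function name `lstm_step`, the stacked weight layout, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One step of a textbook LSTM cell (illustrative layout, assumed here).

    W: (4*d, input_dim) input weights, R: (4*d, d) recurrent weights,
    b: (4*d,) biases, stacked in the order forget, input, output, cell input.
    """
    d = h_prev.shape[0]
    pre = W @ x + R @ h_prev + b
    f = sigmoid(pre[0:d])          # forget gate: how much old memory to keep
    i = sigmoid(pre[d:2 * d])      # input gate: how much new content to write
    o = sigmoid(pre[2 * d:3 * d])  # output gate: how much memory to expose
    z = np.tanh(pre[3 * d:4 * d])  # candidate cell input
    c = f * c_prev + i * z         # memory cell update
    h = o * np.tanh(c)             # hidden state / output
    return h, c

# Toy usage: run a short random sequence through the cell.
rng = np.random.default_rng(0)
input_dim, d = 5, 3
W = rng.standard_normal((4 * d, input_dim)) * 0.1
R = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for x in rng.standard_normal((10, input_dim)):
    h, c = lstm_step(x, h, c, W, R, b)
```

The loop makes the parallelization limit from key point 2 visible: each step needs the `h` and `c` produced by the previous step, so the time dimension is processed sequentially.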

Figure 6: Method comparison on next token prediction when trained on 15B tokens from SlimPajama. Performance is measured in validation perplexity and reported for the best methods of each model class (see Table 1). The performance degradation of xLSTM[7:1] at 2.7B is due to initially slower training convergence that leads to an especially undertrained model. xLSTM is the best method at all sizes.

Ablation studies on the new xLSTM components.

Model       Modification                   Exponential Gating  Matrix Memory  #Params  SlimPajama (15B) ppl ↓
LSTM        Vanilla Multi-Layer LSTM       ✗                   ✗              607.8    2417.86
LSTM        Adding Resnet Backbone         ✗                   ✗              506.1    35.46
LSTM        Adding Up-Projection Backbone  ✗                   ✗              505.9    26.01
xLSTM[0:1]  Adding Exponential Gating      ✓                   ✗              427.3    17.70
xLSTM[7:1]  Adding Matrix Memory           ✓                   ✓              408.4    13.48

Ablation studies on different gating techniques.

Learnable Gates                 Forget Gate                                 Input Gate                                  SlimPajama (15B) ppl ↓
                                Input Dependent  Learnable Bias  Bias Init  Input Dependent  Learnable Bias  Bias Init
No Gates                        ✗                ✗               +∞         ✗                ✗               0          NaN
No Gates                        ✗                ✗               [3, 6]     ✗                ✗               0          13.95
Forget Gate                     ✓                ✓               [3, 6]     ✗                ✗               0          13.58
Input Gate                      ✗                ✗               [3, 6]     ✓                ✓               N(0, 0.1)  13.69
Forget Gate Bias                ✗                ✓               [3, 6]     ✗                ✗               0          13.76
Forget + Input Gate Bias        ✗                ✓               [3, 6]     ✗                ✓               N(0, 0.1)  13.73
Forget Gate + Input Gate Bias   ✓                ✓               [3, 6]     ✗                ✓               N(0, 0.1)  13.55
Forget Gate + Input Gate        ✓                ✓               [3, 6]     ✓                ✓               N(0, 0.1)  13.43

Table 2: Ablation studies. Top: Ablation studies on the new xLSTM components, attributing the strong performance improvement of xLSTM over vanilla LSTM to both the exponential gating and the matrix memory. Bottom: Ablation studies on different gating techniques. We consider an xLSTM[1:0] with sigmoid forget gate and exponential input gate. Bias initialization ∞ means that the forget gate is set to one, [3, 6] indicates that values are taken equidistant in the respective interval, and N(0, 0.1) that values are randomly chosen from a Gaussian with mean 0 and std 0.1. PPL denotes validation perplexity. The first two lines correspond to models similar to linearized attention, line four to Retention, line five to RWKV-5, and line six to RWKV-6. Making the gates depend on the input leads to better performance.
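The bias initializations named in the caption can be illustrated with a small NumPy sketch. It only shows what "equidistant in [3, 6]", "N(0, 0.1)", and "bias ∞" mean for a sigmoid forget gate and an exponential input gate; the function name `init_gate_biases`, the choice of four gate units, and the seed are assumptions for this example, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gate_biases(num_units, seed=0):
    """Hypothetical helper: draw gate biases as described in the caption."""
    rng = np.random.default_rng(seed)
    # [3, 6]: forget-gate biases spaced equidistantly over the interval.
    forget_bias = np.linspace(3.0, 6.0, num_units)
    # N(0, 0.1): input-gate biases from a Gaussian with mean 0 and std 0.1.
    input_bias = rng.normal(loc=0.0, scale=0.1, size=num_units)
    return forget_bias, input_bias

forget_bias, input_bias = init_gate_biases(num_units=4)

# With no input dependence, the gate value is determined by the bias alone.
forget_gate = sigmoid(forget_bias)  # sigmoid forget gate, values close to 1
input_gate = np.exp(input_bias)     # exponential input gate, values close to 1

# Bias initialization +inf pins the sigmoid forget gate at exactly one.
forget_gate_inf = sigmoid(np.inf)
```

Biases in [3, 6] put the sigmoid forget gate between roughly 0.95 and 1, so the memory decays slowly at initialization, whereas a bias of +∞ removes decay altogether.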

4.3 xLSTM as Large Language Model

We culminate this study in large-scale language modeling experiments, testing the potential of xLSTM as an LLM. We therefore increase the amount of training data and train on 300B tokens from SlimPajama. The same number of tokens is used in, e.g., Mamba (Gu & Dao, 2023) and Griffin (De et al., 2024). We compare xLSTM to RWKV-4, Llama, and Mamba – one method from each respective method class in Section 4.2. We select RWKV-4 as RNN representative since for RWKV-5, RWKV-6 and HGRN2 a reasonable training precision setting has been found only after the training start of the 300B token experiments (see Appendix B.2). We train different model sizes (125M, 350M, 760M, 1.3B), test all models for length extrapolation capabilities and evaluate their performance on the validation set. We assess their performance on downstream tasks, test their performance in language modeling on 571 text domains of the PALOMA benchmark, and, finally, investigate their scaling law behavior.

Sequence Length Extrapolation. Firstly, we test the sequence length extrapolation for 1.3B-sized, large models of xLSTM, RWKV-4, Llama, and Mamba. All models are trained on context length 2048, and then tested for context lengths up to 16384. See Figure 7 for the results. In contrast to other methods, xLSTM models maintain low perplexities for longer contexts.

Model       SlimPajama (300B) ppl ↓ at 16k
Llama       337.83
Mamba       14.00
RWKV-4      13.75
xLSTM[7:1]  8.92
xLSTM[1:0]  9.01

Figure 7: Sequence extrapolation in language modeling. This is a comparison of 1.3B-sized, large models of xLSTM, RWKV-4, Llama, and Mamba at next token prediction on the SlimPajama validation set after training on 300B tokens from SlimPajama. Models are trained with context length 2048 and then tested for context lengths up to 16384. Left: Token perplexities evaluated at different context lengths. In contrast to other methods, xLSTM models remain at low perplexities for longer contexts. Right: Prediction quality when extrapolating to long context sizes in terms of validation perplexity (PPL). xLSTM yields the best PPL values (best in bold, second best underlined).
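For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below uses that standard definition and evaluates it on prefixes of increasing length; the function names and the toy log-probabilities are assumptions for illustration, and the paper's exact evaluation protocol is not given in this excerpt.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def perplexity_by_context_length(token_log_probs, context_lengths):
    """Hypothetical helper: perplexity over the first n tokens for each n."""
    return {n: perplexity(token_log_probs[:n]) for n in context_lengths}

# Toy data: a model that assigns probability 1/4 to every token, then
# degrades to 1/16 beyond the first 8 positions.
log_probs = [math.log(1 / 4)] * 8 + [math.log(1 / 16)] * 8
results = perplexity_by_context_length(log_probs, [8, 16])
```

In this toy case the perplexity over the first 8 tokens is about 4 and rises to about 8 over all 16, mirroring how a model that extrapolates poorly shows higher perplexity at longer contexts.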

Validation Perplexity and Downstream Tasks. Secondly, for all model sizes, we evaluate the performance of xLSTM, RWKV-4, Llama, and Mamba models on the SlimPajama validation set for next token prediction and on downstream tasks that measure common sense reasoning. The third column of Table 3 lists the validation set perplexities of different methods. Both xLSTM[1:0] and xLSTM[7:1] are the best models for all model sizes with respect to the validation set perplexity. The other columns of Table 3 provide the performance on downstream tasks. In the vast majority of tasks and across all model sizes xLSTM is the best method; only on the ARC task is Mamba in some cases the best method. For details see Appendix B.3.

Performance on PALOMA Language Tasks. Thirdly, for all model sizes, we test the next token prediction performance of xLSTM, RWKV-4, Llama, and Mamba models on PALOMA language tasks (Magnusson et al., 2023). We measure the performance by the perplexity for next token prediction on 571 text domains, which range from nytimes.com to r/depression on Reddit. Table 4 shows token prediction perplexity grouped into language modeling (first seven columns) and fine-grained domain benchmarks (last 5 columns). xLSTM[1:0] performs better than xLSTM[7:1] on these language tasks. xLSTM[1:0] has in 568 out of 571 (99.5%) text domains a lower perplexity
