Deep Delta Learning and Matrix Hidden States
Deep Delta Learning (DDL) represents a paradigm shift in residual network design. It generalizes the standard additive residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation known as the Delta Operator.
Authors: Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Affiliations: Princeton University, UCLA
Date: January 1st, 2026
By reinterpreting the residual block as a rank-1 Householder update, DDL unifies identity mapping, orthogonal projection, and geometric reflection into a single, continuously differentiable module. This allows the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.
The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions.
In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\mathbf{k}_l$ and a scalar gate $\beta_l$.
Standard residual networks approximate the ODE $\frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t), t)$ via the forward-Euler discretization $\mathbf{h}_{l+1} = \mathbf{h}_l + f(\mathbf{h}_l)$, which fixes the layer-wise transition operator at the identity. DDL replaces this fixed identity with the learnable Delta Operator.
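To make the ODE view concrete, a residual stack with unit step size is exactly forward Euler. A minimal numpy sketch (the linear map `W` is our illustrative stand-in for an arbitrary residual branch $f$, not a construct from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 0.1 * rng.standard_normal((d, d))  # stand-in residual branch f(h) = W @ h

def resnet_forward(h, num_layers):
    # Each residual block is one forward-Euler step of dh/dt = f(h).
    for _ in range(num_layers):
        h = h + W @ h
    return h

h0 = rng.standard_normal(d)
out = resnet_forward(h0, 3)
```

Every layer applies the identity to the carried state and merely adds a correction; DDL's point of departure is to make that identity itself learnable.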
The Delta-Res block update rule is defined as:

$$
\mathbf{H}_{l+1} = \underbrace{\left(\mathbf{I} - \beta_l \mathbf{k}_l \mathbf{k}_l^{\top}\right)}_{\Delta(\mathbf{k}_l,\, \beta_l)} \mathbf{H}_l + \beta_l \mathbf{k}_l \mathbf{v}_l^{\top},
$$

Where:

- $\mathbf{H}_l \in \mathbb{R}^{d \times d_v}$: the matrix hidden state at layer $l$.
- $\mathbf{k}_l \in \mathbb{R}^d$: the learned Reflection Direction (strictly normalized, $\|\mathbf{k}_l\|_2 = 1$).
- $\beta_l \in \mathbb{R}$: the learned Scalar Gate, mapped to $[0, 2]$.
- $\mathbf{v}_l \in \mathbb{R}^{d_v}$: the Residual Value Vector carrying new information.
This formulation couples the "erasure" of old information (via projection onto $\mathbf{k}_l$) with the "writing" of new information (via $\mathbf{v}_l$), both controlled by the shared gate $\beta_l$.
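A minimal sketch of one Delta-Res step under the definitions above (the function name `delta_res_step` is ours, and we take the hidden state to be a matrix $\mathbf{H}_l \in \mathbb{R}^{d \times d_v}$ as suggested by the title):

```python
import numpy as np

def delta_res_step(H, k, beta, v):
    """One Delta-Res block: H_{l+1} = (I - beta k k^T) H + beta k v^T.

    H    : (d, d_v) matrix hidden state
    k    : (d,) reflection direction (normalized inside)
    beta : scalar gate in [0, 2]
    v    : (d_v,) residual value vector
    """
    k = k / np.linalg.norm(k)          # enforce strict normalization
    erase = beta * np.outer(k, k) @ H  # remove the component of H along k
    write = beta * np.outer(k, v)      # write new content along k
    return H - erase + write

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
k = rng.standard_normal(4)
v = rng.standard_normal(3)

# beta = 0 recovers the identity shortcut exactly.
assert np.allclose(delta_res_step(H, k, 0.0, v), H)
```

Note the coupling: at $\beta_l = 1$, reading the updated state along $\mathbf{k}_l$ returns exactly $\mathbf{v}_l^{\top}$, i.e. the old content in that direction has been fully overwritten.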
The expressive power of DDL stems from the spectral properties of the Delta Operator $\Delta(\mathbf{k}_l, \beta_l) = \mathbf{I} - \beta_l \mathbf{k}_l \mathbf{k}_l^{\top}$.
Theorem 1 in the paper demonstrates that the eigenvalues of $\Delta(\mathbf{k}_l, \beta_l)$ are $1$ with multiplicity $d - 1$ (on the hyperplane orthogonal to $\mathbf{k}_l$) and $1 - \beta_l$ (along $\mathbf{k}_l$), so the gate $\beta_l \in [0, 2]$ sweeps this eigenvalue continuously from $1$ to $-1$, yielding three characteristic regimes:
| Regime | $\beta_l$ | Spectrum | Behavior | Interpretation |
|---|---|---|---|---|
| Identity | $\beta_l = 0$ | $\{1, \dots, 1\}$ | Skip Connection | Signal preservation for deep propagation. |
| Projection | $\beta_l = 1$ | $\{1, \dots, 1, 0\}$ | Forgetting | Orthogonal projection onto the hyperplane $\mathbf{k}_l^{\perp}$. |
| Reflection | $\beta_l = 2$ | $\{1, \dots, 1, -1\}$ | Householder Reflection | Inverts the state along $\mathbf{k}_l$. |
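The spectrum claim is easy to verify numerically. A short check of the three regimes, assuming the operator $\Delta = \mathbf{I} - \beta\, \mathbf{k}\mathbf{k}^{\top}$ as defined above (`delta_operator` is our helper name):

```python
import numpy as np

def delta_operator(k, beta):
    # Rank-1 perturbation of the identity: I - beta * k k^T, with k unit-norm.
    k = k / np.linalg.norm(k)
    return np.eye(k.size) - beta * np.outer(k, k)

rng = np.random.default_rng(0)
k = rng.standard_normal(5)
for beta in (0.0, 1.0, 2.0):
    eigs = np.sort(np.linalg.eigvalsh(delta_operator(k, beta)))
    # Expect d-1 eigenvalues equal to 1 and a single eigenvalue 1 - beta.
    print(beta, np.round(eigs, 6))
```

The single moving eigenvalue $1 - \beta$ is what lets one layer interpolate smoothly between preserving, deleting, and negating the component of the state along $\mathbf{k}_l$.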
DDL establishes a theoretical link to efficient sequence models like DeltaNet. While DeltaNet applies the "Delta Rule" ($\text{New} = \text{Old} + \beta(\text{Target} - \text{Old})$) over the time dimension, Deep Delta Learning applies it over the depth dimension.
Expanding the DDL update reveals the classic Delta Rule structure:

$$
\mathbf{H}_{l+1} = \mathbf{H}_l + \beta_l \mathbf{k}_l \left(\mathbf{v}_l^{\top} - \mathbf{k}_l^{\top} \mathbf{H}_l\right),
$$

where $\mathbf{k}_l^{\top} \mathbf{H}_l$ reads the old content stored along $\mathbf{k}_l$ and $\mathbf{v}_l^{\top}$ is its replacement target.
This allows the network to selectively "clean" or "rewrite" specific feature subspaces layer-by-layer, preventing the accumulation of interference common in standard additive ResNets.
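The algebraic identity behind this expansion, $(\mathbf{I} - \beta \mathbf{k}\mathbf{k}^{\top})\mathbf{H} + \beta \mathbf{k}\mathbf{v}^{\top} = \mathbf{H} + \beta \mathbf{k}(\mathbf{v}^{\top} - \mathbf{k}^{\top}\mathbf{H})$, can be confirmed directly; a numpy sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_v = 4, 3
H = rng.standard_normal((d, d_v))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
beta = 0.7

# Gated-shortcut form: erase along k, then write beta * k v^T.
gated_form = (np.eye(d) - beta * np.outer(k, k)) @ H + beta * np.outer(k, v)

# Delta Rule form: New = Old + beta * direction * (Target - Old reading).
delta_rule = H + beta * np.outer(k, v - k @ H)

assert np.allclose(gated_form, delta_rule)
```

Because the correction acts only in the rank-1 subspace spanned by $\mathbf{k}_l$, each layer edits one feature direction while leaving its orthogonal complement untouched.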
If you find this work useful in your research, please cite:
@article{zhang2026deep,
title = {Deep Delta Learning},
author = {Zhang, Yifan and Liu, Yifeng and Wang, Mengdi and Gu, Quanquan},
journal = {arXiv preprint arXiv:2601.00417},
year = {2026}
}