Near-Optimal MIMO Detection Using Gradient-Based MCMC in Discrete Spaces

Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, and Shi Jin

X. Zhou, L. Liang, J. Zhang, and S. Jin are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). L. Liang is also with the Purple Mountain Laboratories, Nanjing 211111, China. C.-K. Wen is with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (e-mail: [email protected]).

Abstract

The discrete nature of transmitted symbols poses challenges for achieving optimal detection in multiple-input multiple-output (MIMO) systems associated with a large number of antennas. Recently, the combination of two powerful machine learning methods, Markov chain Monte Carlo (MCMC) sampling and gradient descent, has emerged as a highly efficient solution to address this issue. However, existing gradient-based MCMC detectors are heuristically designed and thus are theoretically untenable. To bridge this gap, we introduce a novel sampling algorithm tailored for discrete spaces. This algorithm leverages gradients from the underlying continuous spaces for acceleration while maintaining the validity of probabilistic sampling. We prove the convergence of this method and also analyze its convergence rate using both MCMC theory and empirical diagnostics. On this basis, we develop a MIMO detector that precisely samples from the target discrete distribution and generates posterior Bayesian estimates using these samples, whose performance is thereby theoretically guaranteed. Furthermore, our proposed detector is highly parallelizable and scalable to large MIMO dimensions, positioning it as a compelling candidate for next-generation wireless networks. Simulation results show that our detector achieves near-optimal performance, significantly outperforms state-of-the-art baselines, and showcases resilience to various system setups.

Index Terms:

MIMO detection, Markov chain Monte Carlo, gradient descent, Langevin algorithms, discrete sample space.

I Introduction

Over the past two decades, the multiple-input multiple-output (MIMO) technology has played a crucial role in improving spectral efficiency and network throughput of wireless communication systems [1]. To enhance data rates and coverage even further, future generations of wireless networks are expected to witness an unprecedented rise in the number of antennas, leading to extra-large-scale MIMO systems [2, 3]. However, this dramatically increased number of antennas results in the curse of dimensionality in symbol detection—a significant roadblock to realizing the full potential of MIMO.

The optimal maximum a posteriori (MAP) detection involves exhaustive enumeration and is tractable only for the most trivial cases [4]. Sphere decoding (SD) has been introduced to reduce the complexity of MAP detection while approaching near-optimal performance [5, 6]. Nonetheless, the substantial tree searches still entail complexity that exponentially grows with the system dimension. Linear detectors, such as the linear minimum mean square error (MMSE) method, are widely acknowledged for their low complexity. However, they are prone to severe performance degradation in high-order modulation or correlated channels.

Posterior inference problems like MIMO detection generally involve multidimensional function integration or maximization and become challenging as the dimension increases. Recently, stochastic sampling techniques [7], also known as Monte Carlo methods, have shown great potential in addressing the curse of dimensionality in posterior inference. These methods sample from certain probability distributions of interest and generate reliable estimates based on the drawn samples. Among them, the Markov chain Monte Carlo (MCMC) method [8] has been popular and has found extensive applications in various communication signal processing tasks that can be regarded as posterior inferences [9, 10]. In MCMC, statistical inferences are developed by simulating a Markov chain to generate a sequence of samples that converge to the target posterior distribution and performing Monte Carlo summation using the converged samples. This approach allows for the reduction of the exponential complexity associated with the dimension to a polynomial level. In particular, for MIMO detection, the MCMC method is highly acclaimed for its hardware-friendly architecture [11] and attainment of remarkable performance with relatively low complexity [12, 13].

Discrete data is ubiquitous in real-world applications, ranging from genomes in biology and texts in linguistics to bits and symbols in wireless communications. Exact inference on this type of data, such as optimal MIMO detection, requires sampling from discrete distributions, which has long been recognized as more challenging compared to sampling from continuous distributions [14]. Gibbs sampling [15], a classical MCMC method, is well known as a generic solution to this task; however, this method is subject to the sequential update of variables and hence suffers from low efficiency when the dimension is large or the distribution is highly correlated among its variables. As data dimensions scale up, developing efficient sampling algorithms for discrete distributions is urgent.

In recent years, MCMC has been combined with the gradient descent method, forming a promising machine learning framework for nonconvex optimization [16]. This gradient-based MCMC paradigm was initially developed for continuous spaces and has demonstrated higher efficiency than conventional sampling and optimization methods in solving inference problems. Owing to its superiority in continuous spaces, new research trends have revolved around generalizing gradient-based MCMC to discrete spaces. The core concept is to leverage gradients of the underlying continuous function to steer the sampling toward the target discrete distribution. Based on this idea, gradients have been used to accelerate Gibbs sampling by indicating the variables that should be prioritized for updating [14]. To further improve sampling efficiency, the Langevin algorithm [17, 18], which is an advanced gradient-based MCMC method originally developed for continuous spaces, has been extended to discrete spaces in [19, 20]. Unlike the Gibbs sampler, the Langevin method updates all variables in parallel to enable large movements and utilizes gradients to direct the sampling to high probability regions of the target distribution, thus showcasing orders-of-magnitude improvements in efficiency.

Given that MIMO detection is a typical posterior inference problem within the discrete space of the transmitted symbols, it is intuitively appealing to apply advanced gradient-based MCMC methods that have been developed for discrete spaces. In fact, this application has been extensively explored in previous works such as [21, 22, 23, 24], achieving both exceptional performance and high efficiency. Specifically, Newton’s method [25] was leveraged to accelerate MCMC sampling in [21]. In their proposed detector, MCMC’s exploration was conducted along the preconditioned gradient descent direction in the continuous-relaxed space, followed by a Metropolis-Hastings (MH) correction [29]. Nevertheless, the derivation of the correction step was based on heuristics. In [22], additional improvements were investigated by employing Nesterov’s accelerated gradient method [26] to expedite MCMC sampling. This method avoided the computationally intensive matrix inversion required by Newton’s method and hence enhanced the scalability of the detector. Moreover, an annealed Langevin algorithm was developed in [23] by setting multiple noise levels with decreasing variances to mimic the discrete prior of the transmitted symbols. This approach enabled the computation of gradients and the transformation of the intricate discrete approximation into a simplified continuous approximation. Additionally, the Langevin-based MIMO detector was independently investigated in [24]. However, their proposed detector did not consider the discrete nature of symbols and is limited to a specific modulation scheme.

Despite the rapid development, a notable deficiency of existing gradient-based MCMC detectors [21, 22, 23, 24] is their lack of theoretical guarantees. These schemes rely on heuristic designs due to the challenges posed by exact sampling in discrete spaces [14, 19] and give no convergence guarantees. More importantly, as shown in this paper, these heuristics can undermine the reliability of MCMC by generating samples that deviate from the target distribution for MIMO detection. Such divergence can lead to substantial errors in inference and severe performance degradation. Hence, the development of a gradient-based MCMC method that ensures precise sampling from the target discrete distribution holds great significance.

In this paper, we propose a near-optimal MIMO detector based on gradient-based MCMC that exactly samples in discrete spaces, ensuring the validity of stochastic sampling. The proposed detector offers an efficient solution for computing posterior Bayesian estimates via the generation of important samples and the subsequent Monte Carlo summation. The performance is guaranteed by the convergence of the sampling algorithm and the law of large numbers behind Monte Carlo methods. We perform extensive numerical experiments to verify the effectiveness of our proposed detector. Results show that our method achieves near-optimal performance with a limited number of samples and significantly outperforms state-of-the-art detectors.¹ Moreover, our proposed detector exhibits remarkable robustness to various channel environments and resilience to imperfect channel state information (CSI). Additionally, our detector is scalable to large MIMO dimensions and highly parallelizable, making it a promising solution for extra-large-scale MIMO systems in the next-generation wireless networks. We summarize the contributions of this paper as follows.

¹ The term “near-optimal” in this context refers to the proposed detector’s ability, theoretically guaranteed, to approach optimal detection performance given sufficient computational resources. Importantly, the proposed detector offers significant complexity savings compared to the optimal MAP detector, rendering it a practically viable solution for MIMO systems with a large number of antennas.

• Gradient-Based MCMC Sampling for Discrete Distributions: We develop a discrete analog of the Metropolis-adjusted Langevin algorithm (MALA) [18] for the target discrete distributions within the context of the MIMO detection problem. This novel algorithm, referred to as DMALA, achieves high-quality sampling from discrete distributions by leveraging gradients from the underlying continuous function for acceleration, while strictly adhering to the systematic steps of MCMC.

• Convergence Proof and Analysis: We provide a theoretical proof of DMALA’s convergence and analyze its convergence rate. This theoretically guaranteed property distinguishes our proposed algorithm from heuristic sampling algorithms used in existing gradient-based MCMC detectors. Additionally, we offer empirical diagnostics of the convergence (rate) to validate the theoretical findings.

• Achievement of Near-Optimal Detection: Utilizing DMALA, we propose a MIMO detector that first samples from the target discrete distribution and then computes Bayesian estimates (soft decisions) using the converged samples. The method employed for computing soft decisions is designed in alignment with Monte Carlo theory, setting it apart from conventional heuristic approaches. Consequently, the performance of our proposed detector is theoretically underpinned by the convergence of DMALA and the law of large numbers inherent in Monte Carlo methods. This near-optimal performance has been further corroborated through extensive numerical studies.

Notations: Lowercase and uppercase boldface letters denote column vectors and matrices, respectively. $\mathbf{A}^{-1}$ and $\mathbf{A}^{T}$ represent the inverse and transpose of a matrix $\mathbf{A}$, respectively. $\mathbf{I}_{N}$ represents an $N\times N$ identity matrix. $\mathbb{R}$ is the set of real numbers. $\mathbb{E}[\cdot]$ denotes the expectation operation. $\mathcal{U}(a,b)$ denotes a uniform distribution over the interval $[a,b]$. $\mathcal{N}(\mu,\sigma^{2})$ indicates a real-valued Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. $\|\cdot\|_{F}$ denotes the Frobenius norm, and $\|\cdot\|$ denotes the $l_{2}$ norm. $|\cdot|$ represents the cardinality of a set.

II System Model and Preliminaries

This section commences with an introduction to the system model of the MIMO detection problem. It then proceeds to provide an overview of the basic concepts underlying MCMC.

II-A System Model

Consider a MIMO system with $N_{\rm t}$ antennas for transmitting data streams and $N_{\rm r}$ antennas for receiving. The message bits are encoded and interleaved to generate the transmitted codeword, which is then partitioned into bit vectors. Each bit vector $\mathbf{b}\in\{\pm 1\}^{N_{\rm b}}$ is mapped into quadrature amplitude modulation (QAM) symbols with unit average power, constituting the transmitted vector. The equivalent real-valued model for the MIMO transmission is given by

\mathbf{y}=\mathbf{Hx}+\mathbf{n},

(1)

where $\mathbf{x}\in\mathcal{A}^{N\times 1}$ denotes the equivalent real-valued symbol vector, $N=2N_{\rm t}$, and $\mathcal{A}$ is the finite set of real-valued transmitted symbols with a cardinality of $|\mathcal{A}|=Q$. Therefore, we have $N_{\rm b}=N\log_{2}Q$.² Moreover, in (1), $\mathbf{y}\in\mathbb{R}^{M\times 1}$ is the received real-valued signal with $M=2N_{\rm r}$, $\mathbf{H}\in\mathbb{R}^{M\times N}$ is the real-valued channel matrix, and $\mathbf{n}\in\mathbb{R}^{M\times 1}$ is the noise vector whose elements independently follow $\mathcal{N}(0,\sigma^{2}/2)$, where $\sigma^{2}/2$ is the noise variance per real element.

² We assume that all elements of $\mathbf{x}$ are selected from a common finite set $\mathcal{A}$, implying the use of a single modulation scheme. However, extending the proposed method to handle cases where elements of $\mathbf{x}$ are drawn from different discrete spaces is straightforward.

Given the observation $\mathbf{y}$ and assuming that the channel $\mathbf{H}$ is known, the posterior distribution of $\mathbf{x}$ is given by the discrete distribution $\pi(\mathbf{x})=p(\mathbf{x}|\mathbf{y})\propto p(\mathbf{x})p(\mathbf{y}|\mathbf{x})$, where $p(\mathbf{x})$ is the prior distribution and $p(\mathbf{y}|\mathbf{x})$ is the likelihood. When the prior distribution is uniform, the posterior distribution can be further expressed as

\pi(\mathbf{x})=\frac{1}{Z}\exp\big(f(\mathbf{x})\big)\prod_{n=1}^{N}\mathbb{I}_{x_{n}\in\mathcal{A}},

(2)

where $f(\mathbf{x})$ is a metric function given by

f(\mathbf{x})=-\frac{1}{\sigma^{2}}\|\mathbf{y}-\mathbf{Hx}\|^{2},

(3)

$Z$ is a normalization constant whose computation involves multidimensional summation and is generally intractable, $x_{n}$ denotes the $n$-th entry of $\mathbf{x}$, and $\mathbb{I}_{x_{n}\in\mathcal{A}}$ is an indicator function that takes the value of one if $x_{n}\in\mathcal{A}$ and zero otherwise.

The MIMO detector forwards soft outputs in terms of log-likelihood ratios (LLRs) to the subsequent channel decoder. The optimal MAP detector computes the posterior LLR for the $k$-th element $b_{k}$ of $\mathbf{b}$ as

L_{k}=\log\frac{p(b_{k}=+1|\mathbf{y})}{p(b_{k}=-1|\mathbf{y})}=\log\frac{\sum_{\mathbf{x}\in\mathcal{A}^{N\times 1}_{k+}}\exp\big(f(\mathbf{x})\big)}{\sum_{\mathbf{x}\in\mathcal{A}^{N\times 1}_{k-}}\exp\big(f(\mathbf{x})\big)},

(4)

where $\mathcal{A}_{k+}^{N\times 1}$ and $\mathcal{A}_{k-}^{N\times 1}$ denote the subsets of $\mathcal{A}^{N\times 1}$ for the transmitted vector $\mathbf{x}$ mapped from $\mathbf{b}$ , where $b_{k}$ corresponds to $+1$ and $-1$ , respectively. This computation involves evaluating two a posteriori probabilities (APPs) that both require iteration over all possible values of $\mathbf{b}_{-k}=[b_{1},\ldots,b_{k-1},b_{k+1},\ldots,b_{N_{\rm b}}]^{T}$ and the summation of $2^{N_{\rm b}-1}$ terms, causing prohibitive complexity for large $N$ and/or $Q$ . Therefore, a low-complexity alternative should be developed.
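To make the cost of (4) concrete, the following is a minimal brute-force sketch under stated assumptions: a binary real-valued alphabet ($Q=2$, i.e., 4-QAM with real symbols $\pm 1/\sqrt{2}$), so that bit $k$ corresponds directly to entry $x_k$, and a hypothetical bit mapping $b_k=+1 \leftrightarrow x_k>0$ that the text does not specify. The function name `exact_llr` is illustrative only.

```python
import itertools
import numpy as np

# Real-valued 4-QAM alphabet (Q = 2); an assumption made for this sketch.
A_4QAM = (-2 ** -0.5, 2 ** -0.5)

def exact_llr(y, H, sigma2, alphabet=A_4QAM):
    """Exact LLRs of (4) by enumerating all Q^N candidates (binary alphabet only)."""
    N = H.shape[1]
    # Every candidate vector in A^N, one per row: this is the exponential part.
    cands = np.array(list(itertools.product(alphabet, repeat=N)))
    # Metric f(x) = -||y - Hx||^2 / sigma^2 from (3), for all candidates at once.
    f = -np.sum((y - cands @ H.T) ** 2, axis=1) / sigma2
    llr = np.empty(N)
    for k in range(N):
        plus = cands[:, k] > 0  # assumed mapping: b_k = +1 <-> x_k > 0
        # log-sum-exp of the numerator and denominator of (4)
        llr[k] = np.logaddexp.reduce(f[plus]) - np.logaddexp.reduce(f[~plus])
    return llr
```

The candidate table has $Q^{N}$ rows, so this is usable only for very small $N$; it is intended as a reference against which approximate detectors can be checked, not as a practical detector.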

II-B Basics of MCMC

We begin with the basics of Monte Carlo methods. Let $X$ denote a discrete random variable, possibly multidimensional, with a distribution $g(X)$ . Consider evaluating the expectation of some function $h(X)$ with respect to $g(X)$ , i.e.,

\mathbb{E}_{g}[h(X)]=\sum_{x\in\mathcal{X}}h({x})g(x),

(5)

where $\mathcal{X}$ is the domain of $X$ . Suppose that a set of $S$ samples $\{x^{[s]}\}_{s=1}^{S}$ with $x^{[s]}\sim g(X)$ is available, then the empirical average

\bar{h}=\frac{1}{S}\sum_{s=1}^{S}h\left({x}^{[s]}\right)

(6)

is an unbiased estimate of the expectation $\mathbb{E}_{g}[h(X)]$ and, by the strong law of large numbers, converges to it almost surely, i.e., $\bar{h}\xrightarrow{\text{a.s.}}\mathbb{E}_{g}[h(X)]$ as $S\to\infty$. A notable feature of this approach is that the number of samples $S$ required to achieve acceptable accuracy depends only weakly on the dimension of $X$ [12]. Therefore, the exponential complexity that is often encountered when computing the summation in (5) can be mitigated.

Importance sampling (IS) is a classical Monte Carlo method that has the potential to further reduce the variance of $\bar{h}$ [12]. IS introduces an auxiliary distribution $g_{\rm a}(X)$ on $\mathcal{X}$ , which has the same support as $g(X)$ , and approximates the expectation by

\frac{1}{S}\sum_{s=1}^{S}\frac{g(x^{[s]})}{g_{\rm a}(x^{[s]})}h\left(x^{[s]}\right),\quad x^{[s]}\sim g_{\rm a}(X).

(7)

It turns out that by a wise selection of $g_{\rm a}(X)$ , an accurate approximation can be obtained using fewer samples than the scheme in (6) [7, 27].
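The estimators in (6) and (7) can be illustrated on a toy discrete distribution. The sketch below is only an illustration: the four-point pmf `g`, the test function `h`, and the uniform auxiliary pmf `g_a` are made-up values, and a uniform `g_a` is not claimed to be a wise choice.

```python
import numpy as np

rng = np.random.default_rng(0)

support = np.arange(4)
g = np.array([0.1, 0.2, 0.3, 0.4])   # target pmf g(X), illustrative values
g_a = np.full(4, 0.25)               # auxiliary pmf with the same support

def h(x):
    return x.astype(float) ** 2      # function whose expectation is sought

exact = np.sum(h(support) * g)       # exact expectation, as in (5)

S = 200_000
xs = rng.choice(support, size=S, p=g)
mc = h(xs).mean()                    # plain Monte Carlo average, as in (6)

xa = rng.choice(support, size=S, p=g_a)
is_est = np.mean(g[xa] / g_a[xa] * h(xa))   # importance-sampling estimate, as in (7)
```

Both `mc` and `is_est` approach `exact` as `S` grows; which of the two has the lower variance depends on how well `g_a` matches the shape of $h(x)g(x)$.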

The above Monte Carlo methods require samples from a target distribution. MCMC is a popular and generic method for this purpose: it is effective for a wide range of distributions and scales well with the dimension of the space [7]. Specifically, the MCMC method simulates a Markov chain $x^{(1)},x^{(2)},\ldots,x^{(t)},\ldots$ governed by a transition kernel $P(\cdot|\cdot)$, where $t$ is the time step index. Each state in the chain corresponds to a sample, and given the current state $x^{(t)}$, the new state is generated as $x^{(t+1)}\sim P(\cdot|x^{(t)})$. The transition kernel is designed to ensure that the chain’s stationary distribution coincides with the target distribution, e.g., $g$ or $g_{\rm a}$. In this manner, the distribution of the samples $\{x^{(t)}\}$ converges asymptotically to the target distribution as $t$ grows. These converged samples can then be used for the Monte Carlo summation in (6) or (7). As the most popular MCMC-type scheme, the MH algorithm first draws a candidate $x^{\prime}\sim q(\cdot|x^{(t)})$ from a simple proposal distribution. To ensure convergence to the target distribution, taken as $g=\pi$ without loss of generality, the proposal $x^{\prime}$ is then accepted as the new state $x^{(t+1)}$ with probability [29]

\min\left\{1,\frac{\pi({x}^{\prime})q({x}^{(t)}|{x}^{\prime})}{\pi({x}^{(t)})q({x}^{\prime}|{x}^{(t)})}\right\}.

(8)

Otherwise, the current state is retained as the new state, i.e., $x^{(t+1)}=x^{(t)}$ . With this criterion, the corresponding Markov chain admits $\pi$ as the stationary distribution [9]. Meanwhile, this algorithm eliminates the need to evaluate the normalization constant of $\pi$ since $\pi$ appears as a ratio in (8).
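As a small illustration of the MH rule in (8), the sketch below runs a chain on a finite state space using only unnormalized weights, which shows why the normalization constant is never needed. The proposal used here (uniform over all other states) is an assumption chosen for simplicity; it is symmetric, so the $q$ terms cancel in (8). The function name `mh_chain` is hypothetical.

```python
import numpy as np

def mh_chain(log_w, T, rng):
    """Metropolis-Hastings on states {0, ..., K-1} with unnormalized log-weights."""
    K = len(log_w)
    x = 0
    out = np.empty(T, dtype=int)
    for t in range(T):
        out[t] = x
        # Symmetric proposal: pick one of the K-1 other states uniformly,
        # so q(x'|x) = q(x|x') and the ratio in (8) reduces to pi(x')/pi(x).
        x_new = (x + rng.integers(1, K)) % K
        # Accept with probability min{1, w(x')/w(x)}; only the ratio is used.
        if np.log(rng.uniform()) < log_w[x_new] - log_w[x]:
            x = x_new
    return out
```

For example, with weights proportional to `[1, 2, 3, 4]`, the empirical state frequencies of a long chain should approach `[0.1, 0.2, 0.3, 0.4]`.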

III DMALA-Based MIMO Detection

In this section, we derive the DMALA-based MIMO detector. Initially, we develop a highly efficient gradient-based MCMC sampling algorithm tailored to discrete spaces. Subsequently, we establish the convergence of the proposed sampling algorithm and substantiate our claims through empirical verification. Then, we discuss the utilization of the samples for soft decisions, specifically in terms of LLR computation, and examine the computational complexity of our proposed detector. Overall, the proposed detector achieves high accuracy in approximating the exact posterior Bayesian estimate in (4), while mitigating the prohibitive complexity.

III-A Gradient-Based MCMC Sampling in Discrete Spaces

For ease of exposition, we target the posterior distribution $\pi(\mathbf{x})$ in (2) for sample generation. It is noteworthy that this probability mass function can be regarded as a restriction of a continuous distribution defined over the domain $\mathbb{R}^{N\times 1}$ to the discrete subset $\mathcal{A}^{N\times 1}$. Therefore, gradients from the logarithm of the underlying continuous distribution, i.e., $f(\mathbf{x})$ in (3), are informative for the sampling of the discrete distribution.³

³ The proposed algorithm and the subsequent theoretical analysis are not limited to the distribution in (2) and can be generalized to different $f(\mathbf{x})$. Moreover, we consider gradients from the log-probability to simplify calculation.

We start with introducing the Langevin algorithm [17, 18], which is a powerful gradient-based MCMC method in continuous spaces. Initialized with $\mathbf{x}^{(1)}\in\mathbb{R}^{N\times 1}$ , each sampling iteration of this algorithm generates a proposal vector [17, 18, 28]

\mathbf{x}^{\prime}=\mathbf{x}^{(t)}+\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})+\sqrt{\alpha}\mathbf{w}^{(t)},

(9)

where $t$ is the sampling iteration index (equivalent to the time step index of the underlying Markov chain), $\alpha>0$ is the step size, and $\mathbf{w}^{(t)}$ is a random perturbation that follows $\mathcal{N}(\mathbf{0},\mathbf{I}_{N})$ , where $\mathbf{0}$ is a zero vector. $\nabla f$ is the gradient of $f(\mathbf{x})$ given by

\nabla f(\mathbf{x})=\frac{2}{\sigma^{2}}\mathbf{H}^{T}(\mathbf{y}-\mathbf{Hx}).

(10)

This gradient facilitates efficient exploration of high probability regions of the sample space. It should be noted that the update rule in (9) can be viewed as drawing the proposal vector from the Gaussian distribution

\mathcal{N}\left(\mathbf{x}^{(t)}+\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)}),\alpha\mathbf{I}_{N}\right).

(11)
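The continuous-space quantities in (3), (9), and (10) translate almost line by line into code. The following is a minimal NumPy sketch; the function names are illustrative, and the random generator is passed in explicitly as a design choice of this sketch.

```python
import numpy as np

def f(x, y, H, sigma2):
    """Metric function (3): f(x) = -||y - Hx||^2 / sigma^2."""
    r = y - H @ x
    return -(r @ r) / sigma2

def grad_f(x, y, H, sigma2):
    """Gradient (10): (2 / sigma^2) H^T (y - Hx)."""
    return 2.0 / sigma2 * H.T @ (y - H @ x)

def langevin_proposal(x, y, H, sigma2, alpha, rng):
    """One continuous-space Langevin proposal (9): gradient step plus Gaussian noise."""
    w = rng.standard_normal(x.shape)   # w ~ N(0, I_N)
    return x + 0.5 * alpha * grad_f(x, y, H, sigma2) + np.sqrt(alpha) * w
```

Since $f$ is quadratic, a central finite difference of `f` should reproduce `grad_f` up to rounding error, which gives a quick sanity check of the sign and the factor $2/\sigma^{2}$.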

Based on (9) and considering the variable domain $\mathcal{A}^{N\times 1}$ , we derive the discrete proposal function

q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})=\frac{\exp\left(-\frac{1}{2\alpha}\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})\|^{2}\right)}{Z_{\mathcal{A}}(\mathbf{x}^{(t)})}\prod_{n=1}^{N}\mathbb{I}_{x_{n}^{\prime}\in\mathcal{A}},

(12)

where $x_{n}^{\prime}$ is the $n$-th element of $\mathbf{x}^{\prime}$, and the normalization constant $Z_{\mathcal{A}}(\mathbf{x}^{(t)})$ is given by

Z_{\mathcal{A}}(\mathbf{x}^{(t)})=\sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}\exp\Big(-\frac{1}{2\alpha}\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})\|^{2}\Big),

(13)

whose computation is generally intractable since it requires traversal over the full space of size $Q^{N}$ . However, a distinct feature of the proposal in (12) is that it enjoys an elementwise factorization as [19]:

q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})=\prod_{n=1}^{N}q_{n}(x^{\prime}_{n}|x^{(t)}_{n}),

(14)

where $q_{n}(x^{\prime}_{n}|x^{(t)}_{n})$ is a categorical distribution given by

q_{n}(x^{\prime}_{n}|x^{(t)}_{n})=\varsigma\big(\mu(x^{\prime}_{n})\big)\mathbb{I}_{x_{n}^{\prime}\in\mathcal{A}},

(15)

where $\mu(\cdot)$ denotes one term in the factorization of the $l_{2}$ norm in (12), and $\varsigma(\cdot)$ is a softmax function. These two functions are given by

\mu(x^{\prime}_{n})=\frac{1}{2}\left[\nabla f(\mathbf{x}^{(t)})\right]_{n}(x^{\prime}_{n}-x^{(t)}_{n})-\frac{(x^{\prime}_{n}-x^{(t)}_{n})^{2}}{2\alpha}

(16)

and

\varsigma\big(\mu(x^{\prime}_{n})\big)=\frac{\exp\big(\mu(x^{\prime}_{n})\big)}{\sum_{a\in\mathcal{A}}\exp\big(\mu(a)\big)},

(17)

where $\left[\nabla f(\mathbf{x}^{(t)})\right]_{n}$ is the $n$ -th element of the gradient vector $\nabla f(\mathbf{x}^{(t)})$ . This factorization enables the parallel update of each element in $\mathbf{x}^{(t)}$ , i.e., ${x^{\prime}_{n}\sim q_{n}(\cdot|x^{(t)}_{n})},n=1,\ldots,N$ , after gradient computation, whose cost is only $\mathcal{O}(N^{2})$ . Therefore, the overall computational cost of constructing the proposal in (15) and parallel updating scales polynomially, rather than exponentially, with $N$ .
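The elementwise proposal in (15)-(17) can be sketched as one softmax per coordinate over the alphabet. The code below is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the log-domain normalization is added for numerical stability, and the Gumbel-max trick is one possible way to draw all $N$ categorical variables at once.

```python
import numpy as np

def proposal_logprobs(x, grad, alpha, alphabet):
    """Row n holds log q_n(a | x_n) for every symbol a in the alphabet, per (15)-(17)."""
    d = alphabet[None, :] - x[:, None]                    # x'_n - x_n for all candidates
    mu = 0.5 * grad[:, None] * d - d ** 2 / (2 * alpha)   # exponent (16)
    mu = mu - mu.max(axis=1, keepdims=True)               # stabilize the softmax
    return mu - np.log(np.exp(mu).sum(axis=1, keepdims=True))   # log of (17)

def sample_proposal(logq, alphabet, rng):
    """Draw every coordinate independently and in parallel (Gumbel-max trick)."""
    idx = np.argmax(logq + rng.gumbel(size=logq.shape), axis=1)
    return alphabet[idx], idx
```

The table has shape $N\times Q$, so building it costs $\mathcal{O}(NQ)$ once the gradient is available. The product of the per-coordinate probabilities coincides with the normalized joint proposal in (12), because the term $\frac{\alpha}{8}\|\nabla f(\mathbf{x}^{(t)})\|^{2}$ that is dropped from the exponent does not depend on $\mathbf{x}^{\prime}$.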

The update rule in (9) originates from discretizing the stochastic differential equation of Langevin diffusion [17]. Due to discretization errors, the unadjusted Langevin algorithm, i.e., directly letting $\mathbf{x}^{(t+1)}=\mathbf{x}^{\prime}$, typically suffers from an asymptotic bias with respect to the target distribution [18, 19, 28]. To deal with this issue, we integrate the MH adjustment with the discrete Langevin proposal (15) to ensure the reversibility of the Markov chain $\{\mathbf{x}^{(t)}\}$ and the convergence to the target distribution [29], resulting in DMALA. Specifically, after generating $\mathbf{x}^{\prime}$ using the proposal distribution in (15), the MH adjustment introduced in Sec. II-B is performed to accept $\mathbf{x}^{\prime}$ with a probability given by

A(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})=\min\left\{1,\exp\big(f(\mathbf{x}^{\prime})-f(\mathbf{x}^{(t)})\big)\frac{q(\mathbf{x}^{(t)}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})}\right\},

(18)

where $q(\mathbf{x}^{(t)}|\mathbf{x}^{\prime})$ is the reverse proposal calculated similarly as the forward proposal $q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})$ . This equation is derived by substituting the target distribution $\pi$ (2) and the discrete proposal $q$ (14) into (8). Note that the indicator $\mathbb{I}$ is omitted for simplicity.
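For completeness, the substitution that leads from (8) to (18) can be written out as follows. This is a worked restatement of the step described above, not an additional result; the indicator functions are omitted because both $\mathbf{x}^{(t)}$ and $\mathbf{x}^{\prime}$ lie in $\mathcal{A}^{N\times 1}$.

```latex
\frac{\pi(\mathbf{x}^{\prime})\,q(\mathbf{x}^{(t)}|\mathbf{x}^{\prime})}
     {\pi(\mathbf{x}^{(t)})\,q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})}
=\frac{\frac{1}{Z}\exp\big(f(\mathbf{x}^{\prime})\big)}
      {\frac{1}{Z}\exp\big(f(\mathbf{x}^{(t)})\big)}
 \cdot\frac{q(\mathbf{x}^{(t)}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})}
=\exp\big(f(\mathbf{x}^{\prime})-f(\mathbf{x}^{(t)})\big)
 \prod_{n=1}^{N}\frac{q_{n}(x^{(t)}_{n}|x^{\prime}_{n})}{q_{n}(x^{\prime}_{n}|x^{(t)}_{n})}.
```

The intractable constant $Z$ cancels. The proposal normalizers do not cancel in general, since the forward proposal is built from $\nabla f(\mathbf{x}^{(t)})$ and the reverse proposal from $\nabla f(\mathbf{x}^{\prime})$, but by (14) each of them factorizes into $N$ softmax denominators over $\mathcal{A}$ and is therefore cheap to evaluate.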

Algorithm 1 DMALA Sampler

Input: $\mathbf{y}$, $\mathbf{H}$, $\sigma^{2}$, $\alpha$.

1: Initialize: $\mathbf{x}^{(1)}$, $f(\mathbf{x}^{(1)})$, and $\nabla f(\mathbf{x}^{(1)})$.

2: Construct the proposal $q_{n}(\cdot|x_{n}^{(1)}),n=1,\ldots,N$ in (15).

3: for $t=1$ to $T-1$ do

4:   Sample $x^{\prime}_{n}\sim q_{n}(\cdot|x^{(t)}_{n})$ for $n=1$ to $N$ in parallel.

5:   Compute $q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})=\prod_{n=1}^{N}q_{n}(x^{\prime}_{n}|x^{(t)}_{n})$.

6:   Compute $\nabla f(\mathbf{x}^{\prime})$ via (10).

7:   Construct the proposal $q_{n}(\cdot|x^{\prime}_{n}),n=1,\ldots,N$ in (15) and compute $q(\mathbf{x}^{(t)}|\mathbf{x}^{\prime})=\prod_{n=1}^{N}q_{n}(x^{(t)}_{n}|x^{\prime}_{n})$.

8:   Compute $f(\mathbf{x}^{\prime})$ via (3).

9:   Compute $A(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})$ via (18) and sample $u\sim\mathcal{U}(0,1)$.

10:   if $A(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})>u$ then

11:     $\mathbf{x}^{(t+1)}=\mathbf{x}^{\prime}$, $f(\mathbf{x}^{(t+1)})=f(\mathbf{x}^{\prime})$, and $q_{n}(\cdot|x^{(t+1)}_{n})=q_{n}(\cdot|x^{\prime}_{n}),n=1,\ldots,N$.

12:   else

13:     $\mathbf{x}^{(t+1)}=\mathbf{x}^{(t)}$, $f(\mathbf{x}^{(t+1)})=f(\mathbf{x}^{(t)})$, and $q_{n}(\cdot|x^{(t+1)}_{n})=q_{n}(\cdot|x^{(t)}_{n}),n=1,\ldots,N$.

14:   end if

15: end for

Output: Samples $\{\mathbf{x}^{(t)}\}_{t=1}^{T}$.

The sampler using DMALA is outlined in Algorithm 1. Parallel samplers can be employed for sample generation, facilitating efficient utilization of computational resources while reducing correlation among the generated samples. Utilizing these samples, hard decisions are determined by selecting the sample with the smallest residual norm $\|\mathbf{y}-\mathbf{Hx}\|$ , whereas soft decisions are inferred using the Monte Carlo methods discussed in Sec. II-B. In this paper, we focus on the latter approach, with the developed method detailed in Sec. III-C.
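The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal single-chain illustration under stated assumptions, not the authors' implementation: the function name `dmala` is hypothetical, the chain starts from a uniformly random point of $\mathcal{A}^{N\times 1}$ (the initialization is not specified above), all probabilities are kept in the log domain, and the Gumbel-max trick is used to sample the $N$ categorical variables in parallel.

```python
import numpy as np

def dmala(y, H, sigma2, alpha, alphabet, T, rng):
    """Sketch of Algorithm 1; returns a (T, N) array holding the chain."""
    N = H.shape[1]
    rows = np.arange(N)

    def f(x):                                   # metric (3)
        r = y - H @ x
        return -(r @ r) / sigma2

    def log_q_table(x):                         # log q_n(a | x_n) for all n and a
        grad = 2.0 / sigma2 * H.T @ (y - H @ x)               # gradient (10)
        d = alphabet[None, :] - x[:, None]
        mu = 0.5 * grad[:, None] * d - d ** 2 / (2 * alpha)   # exponent (16)
        mu = mu - mu.max(axis=1, keepdims=True)
        return mu - np.log(np.exp(mu).sum(axis=1, keepdims=True))

    # Steps 1-2: random initial state (an assumption) and its proposal table.
    idx = rng.integers(len(alphabet), size=N)
    x = alphabet[idx]
    fx, logq_x = f(x), log_q_table(x)
    samples = np.empty((T, N))
    samples[0] = x
    for t in range(1, T):
        # Step 4: draw all coordinates in parallel.
        idx_new = np.argmax(logq_x + rng.gumbel(size=logq_x.shape), axis=1)
        x_new = alphabet[idx_new]
        log_fwd = logq_x[rows, idx_new].sum()   # step 5: log q(x'|x)
        logq_new = log_q_table(x_new)           # steps 6-7
        log_rev = logq_new[rows, idx].sum()     # log q(x|x')
        f_new = f(x_new)                        # step 8
        log_A = min(0.0, f_new - fx + log_rev - log_fwd)   # log of (18)
        # Steps 9-13: accept or keep the current state.
        if np.log(rng.uniform()) < log_A:
            x, idx, fx, logq_x = x_new, idx_new, f_new, logq_new
        samples[t] = x
    return samples
```

On a system small enough to enumerate, the empirical frequencies of the returned states can be compared with the exact posterior in (2) as a sanity check. Several independent chains can be run by calling the function with different generators.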

Remark 1

The target distribution may demonstrate strong second-order correlations across dimensions. In such cases, using naive gradients in (10) leads to slow convergence [21, 19, 20]. To accelerate convergence, we explore a preconditioned variant of DMALA. Specifically, we precondition the naive gradients using the inverse of a damped Hessian, given by

\mathbf{M}=(\mathbf{H}^{T}\mathbf{H}+\gamma\mathbf{I}_{N})^{-1},

(19)

where $\gamma>0$ is a Tikhonov damping parameter. This preconditioner $\mathbf{M}$ utilizes the second-order information from the target distribution to expedite the gradient descent process. With this preconditioner, the Langevin algorithm (in the continuous space) proceeds as

\mathbf{x}^{\prime}=\mathbf{x}^{(t)}+\frac{\alpha}{2}\mathbf{M}\nabla f(\mathbf{x}^{(t)})+\sqrt{\alpha\beta}\mathbf{w}^{(t)}.

(20)

In this variant, we also scale the random perturbation with an additional parameter $\beta>0$ , which allows for adjusting the gradient to perturbation intensity ratio. Correspondingly, in the preconditioned DMALA, the proposal in (15) is adapted by substituting $\mu(x^{\prime}_{n})$ with $\nu(x^{\prime}_{n})$ , represented by

\nu(x^{\prime}_{n})=\frac{1}{2\beta}\left[\mathbf{M}\nabla f(\mathbf{x}^{(t)})\right]_{n}(x^{\prime}_{n}-x^{(t)}_{n})-\frac{(x^{\prime}_{n}-x^{(t)}_{n})^{2}}{2\alpha\beta}.

(21)
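The expression in (21) follows from (20) in the same way that (16) follows from (9). As a worked check, note that (20) draws $\mathbf{x}^{\prime}$ from a Gaussian with mean $\mathbf{x}^{(t)}+\frac{\alpha}{2}\mathbf{M}\nabla f(\mathbf{x}^{(t)})$ and covariance $\alpha\beta\mathbf{I}_{N}$, so the exponent of the corresponding discrete proposal expands as

```latex
-\frac{1}{2\alpha\beta}\Big\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\mathbf{M}\nabla f(\mathbf{x}^{(t)})\Big\|^{2}
=\sum_{n=1}^{N}\underbrace{\Big(\frac{1}{2\beta}\big[\mathbf{M}\nabla f(\mathbf{x}^{(t)})\big]_{n}(x^{\prime}_{n}-x^{(t)}_{n})
 -\frac{(x^{\prime}_{n}-x^{(t)}_{n})^{2}}{2\alpha\beta}\Big)}_{\nu(x^{\prime}_{n})}
 -\frac{\alpha}{8\beta}\big\|\mathbf{M}\nabla f(\mathbf{x}^{(t)})\big\|^{2}.
```

The last term does not depend on $\mathbf{x}^{\prime}$ and is absorbed by the softmax normalization, so the proposal still factorizes across coordinates. Setting $\mathbf{M}=\mathbf{I}_{N}$ and $\beta=1$ recovers $\mu(x^{\prime}_{n})$ in (16).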

Remark 2

We present a visual comparison between existing gradient-based MCMC methods for MIMO detection [21, 22] and the proposed DMALA, as illustrated in Fig. 1. Fig. 1(a) depicts how existing methods relax the problem to the continuous space, where they generate a candidate sample through a sequence of steps: gradient descent, random walk (by adding a random Gaussian perturbation), and QAM mapping. Regrettably, the QAM mapping, which discretizes the continuous update to its nearest lattice point, is deterministic and lacks an associated proposal probability. This absence makes the exact MH correction unattainable, leading, as shown later, to inexact sampling and significant inference errors. Conversely, as depicted in Fig. 1(b), DMALA constructs the proposal with guidance from gradients of the underlying continuous function, allowing for an exact MH correction. This key differentiation ensures the convergence of our method, a point elaborated upon in the following subsection.

III-B Convergence of the DMALA Sampler

A distinct feature of the DMALA sampler, compared to existing gradient-based MCMC methods [21, 22], is its guaranteed (asymptotic) convergence to the target distribution. To begin with, we analyze the statistical properties of the Markov chain induced by DMALA. For two states $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}$ , the transition probability $P(\mathbf{x}^{\prime}|\mathbf{x})$ of this chain is given by

P(\mathbf{x}^{\prime}|\mathbf{x})=\begin{cases}q(\mathbf{x}^{\prime}|\mathbf{x})A(\mathbf{x}^{\prime}|\mathbf{x}),&\text{if }\mathbf{x}^{\prime}\neq\mathbf{x},\\ q(\mathbf{x}|\mathbf{x})+\sum_{\mathbf{z}\neq\mathbf{x}}q(\mathbf{z}|\mathbf{x})\big(1-A(\mathbf{z}|\mathbf{x})\big),&\text{otherwise},\end{cases}

(24)

where $q(\cdot|\cdot)$ represents the proposal probability, computed as per (14), and $A(\cdot|\cdot)$ denotes the acceptance probability according to (18). With this transition kernel in place, we present a lemma on the statistical properties of the Markov chain, which is proved in Appendix A.

Lemma 1

Assume that the probability density function of the noise $\mathbf{n}$ satisfies $0<p_{\bf n}(\cdot)<+\infty$, i.e., $\sigma^{2}>0$. Then, the transition kernel of the Markov chain induced by DMALA ($\alpha>0$) is irreducible and aperiodic. Moreover, the target posterior distribution $\pi$ is the unique stationary distribution of this chain.

Building upon Lemma 1 and the convergence theorem of MCMC, we state the following theorem on the exponential convergence of the Markov chain induced by DMALA. The proof is omitted here; a complete proof is available in [30, Theorem 4.9].

Theorem 1

Given that Lemma 1 holds, the Markov chain induced by DMALA exhibits exponential convergence to its unique stationary distribution $\pi$ in total variation distance. Specifically, there exist constants $C>0$ and $0<r<1$ such that

	$\displaystyle\|P^{(t)}(\mathbf{x},\cdot)-\pi\|_{\rm TV}\leq Cr^{t},\;\forall\mathbf{x}\in\mathcal{A}^{N\times 1}.$		(25)

Herein, $r$ represents the convergence rate [10], with a smaller $r$ indicating faster convergence; $P^{(t)}(\mathbf{x},\cdot)$ denotes the probability distribution induced by the $t$ -step transition function of the Markov chain with the initial state $\mathbf{x}$ , defined as [10]

	$\displaystyle P^{(t)}(\mathbf{x},\mathbf{x}^{\prime})=P\big(\mathbf{x}^{(t+1)}=\mathbf{x}^{\prime}|\mathbf{x}^{(1)}=\mathbf{x}\big),$		(26)

and $\|\pi_{1}-\pi_{2}\|_{\rm TV}$ signifies the total variation distance between two distributions ${\pi}_{1}$ and $\pi_{2}$ over $\mathcal{A}^{N\times 1}$ , defined as

	$\displaystyle\|\pi_{1}-\pi_{2}\|_{\rm TV}=\frac{1}{2}\sum_{\mathbf{x}\in\mathcal{A}^{N\times 1}}|\pi_{1}(\mathbf{x})-\pi_{2}(\mathbf{x})|.$		(27)

Theorem 1 implies that the distribution of the samples generated by DMALA exponentially converges to the target distribution for large $t$ . Furthermore, this theorem sheds light on the rate of convergence, denoted as $r$ . Delving into the convergence rate $r$ , we consider the discrete space $\mathcal{A}^{N\times 1}$ represented by $\left\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{Q^{N}}\right\}$ . The transition kernel of the Markov chain, induced by DMALA, can be modeled as a $Q^{N}\times Q^{N}$ matrix $\mathbf{P}=[P_{ij}]$ . Here, the entry $P_{ij}$ signifies the transition probability from $\mathbf{x}_{i}$ to $\mathbf{x}_{j}$ , calculated as per (24). It is established that the convergence rate $r$ corresponds to the second largest eigenvalue ${\lambda_{2}}$ of $\mathbf{P}$ [10, 31]. Although deriving analytical expressions for this eigenvalue poses challenges, it is feasible to exactly compute $r$ in small sample spaces, aiding in convergence diagnostics. Such an analysis not only enhances comprehension of the algorithm’s performance but also illuminates paths for further improvement.
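The eigenvalue-based diagnostic described above can be sketched numerically for a small chain. In the sketch below, the three-state Metropolis chain with a uniform proposal is a hypothetical toy example rather than the DMALA kernel, and the helper names are illustrative. The rate is computed as the second largest eigenvalue modulus, an assumption that also covers kernels with negative eigenvalues.

```python
import numpy as np

def second_largest_eigen_modulus(P):
    """Second largest eigenvalue modulus of a transition matrix P."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return moduli[1]

def tv_curve(P, pi, start, num_steps):
    """Total variation distance (27) between the t-step distribution and pi."""
    dist = np.zeros(len(pi))
    dist[start] = 1.0                    # chain starts deterministically in `start`
    curve = []
    for _ in range(num_steps):
        dist = dist @ P                  # one more transition
        curve.append(0.5 * np.abs(dist - pi).sum())
    return np.array(curve)

# Hypothetical toy chain: Metropolis with a uniform proposal over three states.
pi = np.array([0.5, 0.3, 0.2])
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = (1.0 / 3.0) * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()           # rejected and self-proposed mass

r = second_largest_eigen_modulus(P)
curve = tv_curve(P, pi, start=2, num_steps=30)
```

Plotting `curve` against `r ** t` on a logarithmic axis gives the same kind of comparison between an empirical total variation curve and the exponential reference as described for Fig. 2.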

We conducted an empirical verification of Theorem 1 using a ${\text{2}\times\text{2}}$ MIMO system with quadrature phase shift keying (QPSK) modulation ( ${Q^{N}=2^{4}=16}$ ) and a signal-to-noise ratio (SNR) per receive antenna of 8 dB, where the SNR is defined as $\text{SNR}=\mathbb{E}[\|\mathbf{Hx}\|^{2}]/\mathbb{E}[\|\mathbf{n}\|^{2}]$ . Fig. 2 shows the total variation distance between the sampled distribution $P^{(t)}(\mathbf{x},\cdot)$ from DMALA and the true target distribution as a function of the sampling iteration index $t$ . This sampled distribution was derived by collecting the generated samples of $10^{5}$ independent samplers at each time step $t$ and counting the number of occurrences to reflect the probability of each state in the ${\text{2}\times\text{2}}$ MIMO-QPSK space. For this empirical study, we employed the preconditioned version of DMALA. We benchmarked against a state-of-the-art gradient-based MCMC method for MIMO detection, specifically MHGD [21]. An exponential convergence curve of $r^{t}$ serves as a reference. For this specific channel realization, $r=0.657$ was determined through the eigendecomposition of the $16\times 16$ transition matrix $\mathbf{P}$ associated with DMALA.

The analysis yields several key insights. First, the total variation distance for DMALA approaches zero, indicating successful convergence to the target distribution, whereas for MHGD it levels off at approximately 0.2, indicating a persistent deviation from the target. This outcome underscores the benefit of DMALA’s exact probabilistic sampling. Second, the empirical convergence curve of DMALA’s total variation distance (depicted by a solid line) closely aligns with the theoretical exponential convergence curve (dashed line), thereby corroborating Theorem 1. Additionally, when comparing the sampled distributions to the true distribution, as shown in Fig. 3 via histograms, it is evident that MHGD’s sampled distribution exhibits bias, while DMALA’s distribution closely matches the true distribution.

In Fig. 4, we extend our investigation to examine the performance differences between naive and preconditioned DMALA configurations under varying SNR conditions. The system setup remains identical to that described in Fig. 2, with SNR levels adjusted to 4, 6, 8, and 10 dB. For each SNR setting, we simulate 100 independent channel realizations. The transition matrix $\mathbf{P}$ is constructed for each realization, from which we compute the second largest eigenvalue, i.e., the convergence rate $r$ . Fig. 4 shows the boxplots of these convergence rates, distinguishing between the results for naive DMALA in Fig. 4(a) and the preconditioned variant in Fig. 4(b). (A boxplot displays the distribution of a dataset based on the five-number summary: the minimum, first quartile, median, third quartile, and maximum. The first and third quartiles are denoted by the two ends of the box, and the median is represented by the middle bar. Points outside the upper and lower bounds are considered outliers.) From Fig. 4, naive DMALA exhibits increasingly slow convergence ( $r\to 1$ ) at SNR values above 8 dB, a phenomenon recognized in the literature as the high SNR stalling issue [12, 13]. Conversely, the preconditioned DMALA demonstrates steady convergence rates across the explored SNR range, indicating a notable improvement in convergence behavior.

We further investigate the convergence rate beyond empirical analysis, drawing inspiration from [32]. We first introduce a lemma on the proposal distribution of DMALA, with the proof provided in Appendix B.

Lemma 2

For the Markov chain of the proposed DMALA, there exists a constant $\xi>0$ such that

	$\displaystyle\frac{q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})}{\pi(\mathbf{x}^{\prime})}\geq\xi\cdot G(\mathbf{x}^{(t)},\mathbf{x}^{\prime}),\;\forall\mathbf{x}^{(t)}\in\mathcal{A}^{N\times 1},$		(28)

where

	$\displaystyle\xi=\frac{\sum_{\mathbf{s}\in\mathcal{A}^{N\times 1}}\exp\big(f(\mathbf{s})\big)}{\prod_{n=1}^{N}\sum_{x_{n}^{\prime}\in\mathbb{Z}}\exp\big(-\frac{1}{2\alpha}|x_{n}^{\prime}|^{2}\big)}$		(29)

with $\mathbb{Z}$ denoting the set of all integers (encompassing the QAM lattice $\mathcal{A}$ ), and the function $G$ is given by

	$\displaystyle G(\mathbf{x}^{(t)},\mathbf{x}^{\prime})=\frac{\exp\big(-\frac{1}{2\alpha}\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})\|^{2}\big)}{\exp\big(f(\mathbf{x}^{\prime})\big)}.$		(30)

Based on this lemma, we have

	$\displaystyle q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})\geq\xi\cdot G(\mathbf{x}^{(t)},\mathbf{x}^{\prime})\pi(\mathbf{x}^{\prime})\geq\delta\pi(\mathbf{x}^{\prime})$		(31)

for all $\mathbf{x}^{(t)},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}$ , where

	$\displaystyle\delta=\xi\cdot\underset{\mathbf{x}^{(t)},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}{\min}[G(\mathbf{x}^{(t)},\mathbf{x}^{\prime})]>0.$		(32)

Following this relationship, the coupling technique can be employed to establish the following theorem that delineates the exponential convergence rate of DMALA. The related proof is available in [33, Theorem 1].

Theorem 2

The convergence rate $r$ in Theorem 1 can be specified as $r=1-\delta$ , where $\delta=\xi\cdot\underset{\mathbf{x}^{(t)},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}{\min}[G(\mathbf{x}^{(t)},\mathbf{x}^{\prime})]$ also provides a lower bound for the spectral gap (i.e., $1-\lambda_{2}$ ) of DMALA’s transition kernel $\mathbf{P}$ .

This theorem shows that the convergence rate depends on factors such as the step size $\alpha$ and the system setup. Optimizing these parameters, such as selecting a suitable $\alpha$ to make $\xi$ and $G(\mathbf{x}^{(t)},\mathbf{x}^{\prime})$ approach 1 and improve the convergence rate, is beyond the scope of this work and is left for future research.
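The relationship between the minorization condition (31) and the spectral gap can be checked numerically on a small chain. The sketch below is illustrative only: the proposal `q` and target `pi` are randomly generated (hypothetical), the acceptance rule is assumed to be the standard Metropolis-Hastings one, and `delta` is taken as the largest constant satisfying $q(\mathbf{x}^{\prime}|\mathbf{x})\geq\delta\pi(\mathbf{x}^{\prime})$ rather than the closed-form value in (32).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # hypothetical number of states

# Hypothetical positive proposal matrix (rows sum to 1) and target distribution.
q = rng.random((n, n)) + 0.1
q /= q.sum(axis=1, keepdims=True)
pi = rng.random(n) + 0.1
pi /= pi.sum()

# MH kernel: P[i, j] = q[i, j] * min(1, pi_j q[j, i] / (pi_i q[i, j])) for j != i.
P = np.minimum(q, pi[None, :] * q.T / pi[:, None])
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

# Largest delta such that q(x'|x) >= delta * pi(x') for every pair of states.
delta = np.min(q / pi[None, :])

# Second largest eigenvalue modulus of the kernel.
slem = np.sort(np.abs(np.linalg.eigvals(P)))[-2]

print(f"delta = {delta:.4f}, 1 - delta = {1 - delta:.4f}, SLEM = {slem:.4f}")
```

In this toy setting, every entry of `P` dominates `delta * pi`, and the computed eigenvalue modulus does not exceed `1 - delta`, in line with the lower bound on the spectral gap stated in Theorem 2.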

III-C LLR Computation Methods

After deriving the sample list $\mathcal{L}$ , various methods can be utilized for LLR computation. Conventional list-based soft-output detectors [5, 6, 34] approximate the APPs in (4) by using $\mathcal{L}$ to replace the entire set $\mathcal{A}^{N\times 1}$ of all possible vectors. Since the list size is generally much smaller than $Q^{N}$ , the exponentially increased complexity can be alleviated. Consequently, the LLR as defined in (4) is approximated by

	$\displaystyle\hat{L}_{k}=\log\frac{\sum_{\mathbf{x}\in{\mathcal{L}}\cap\mathcal{A}^{N\times 1}_{k+}}\exp\big(f(\mathbf{x})\big)}{\sum_{\mathbf{x}\in{\mathcal{L}}\cap\mathcal{A}^{N\times 1}_{k-}}\exp\big(f(\mathbf{x})\big)}.$		(33)

However, this approach is not tailored to the proposed sampling-based detector since the distribution information of the generated samples is not fully harnessed.
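A minimal sketch of the list-based approximation (33) follows. The function names, the $\pm 1$ bit convention for the array `bits`, and the clipping value used when one hypothesis is absent from the list are assumptions made for illustration; the values `f_vals` stand for $f(\mathbf{x})$ evaluated at each list member.

```python
import numpy as np

def logsumexp(values):
    """Numerically stable log(sum(exp(values))); -inf for an empty set."""
    if values.size == 0:
        return -np.inf
    peak = values.max()
    return peak + np.log(np.exp(values - peak).sum())

def list_llr(bits, f_vals, clip=20.0):
    """List-based LLRs in the spirit of (33).

    bits:   (S, K) array with entries +1/-1, one row per list member.
    f_vals: (S,) array holding f(x) for each list member.
    clip:   assumed saturation value when one hypothesis is missing.
    """
    bits = np.asarray(bits)
    f_vals = np.asarray(f_vals, dtype=float)
    # (33) sums over distinct list members, so repeated entries are dropped.
    bits, first = np.unique(bits, axis=0, return_index=True)
    f_vals = f_vals[first]
    llr = np.empty(bits.shape[1])
    for k in range(bits.shape[1]):
        num = logsumexp(f_vals[bits[:, k] == 1])
        den = logsumexp(f_vals[bits[:, k] == -1])
        llr[k] = num - den
    return np.clip(llr, -clip, clip)

# Hypothetical list with one repeated member.
bits = np.array([[1, 1], [1, -1], [-1, 1], [1, 1]])
f_vals = np.array([0.0, -1.0, -2.0, 0.0])
llr = list_llr(bits, f_vals)
```

Because repeated members are collapsed before the sums are formed, the occurrence frequencies of the samples play no role here, which is the limitation noted above.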

To better accommodate the proposed sampling algorithm, we adopt an alternative approach for LLR computation based on the IS-based Monte Carlo method [12, 35]. Initially, to facilitate a Monte Carlo summation for approximation, we reformulate the APPs $p(b_{k}=\pm 1|\mathbf{y})$ in (4) into the form of expectation. Observe that the APPs can be expressed as

	$\displaystyle p(b_{k}=\pm 1|\mathbf{y})$	$\displaystyle=\sum_{\mathbf{b}_{-k}}p(b_{k}=\pm 1,\mathbf{b}_{-k}|\mathbf{y})$
		$\displaystyle=\sum_{\mathbf{b}_{-k}}p(b_{k}=\pm 1|\mathbf{y},\mathbf{b}_{-k})p(\mathbf{b}_{-k}|\mathbf{y}),$		(34)

where $p(\mathbf{b}_{-k}|\mathbf{y})$ is considered as $g(x)$ and $p(b_{k}=\pm 1|\mathbf{y},\mathbf{b}_{-k})$ as $h(x)$ in (5). Thus, the computation of APPs translates to evaluating $\mathbb{E}_{p(\mathbf{b}_{-k}|\mathbf{y})}[p(b_{k}=\pm 1|\mathbf{y},\mathbf{b}_{-k})]$ , where IS-based Monte Carlo summation becomes applicable.

Furthermore, to implement the IS-based method, we introduce an auxiliary distribution as the target according to [35]:

	$\displaystyle\pi_{\rm a}(\mathbf{x})\propto\exp\big({f(\mathbf{x})/\tau}\big)\prod_{n=1}^{N}\mathbb{I}_{x_{n}\in\mathcal{A}},$		(35)

where $\tau>1$ is a temperature parameter for enhancing the mobility of the Markov chain [12, 34, 36]. Given that $\pi_{\rm a}$ represents a tempered posterior and maintains a form analogous to the posterior distribution $\pi$ in (2), our proposed DMALA is well-suited to sample from it, following the derivations in Sec. III-A.

Let $\mathcal{L}=\{\mathbf{x}^{[s]}\}_{s=1}^{S}$ denote the sample list utilized for LLR computation, where $S$ is the size of the sample list (including repetitive samples). To reflect our focus on employing samples nearing convergence for decisions rather than all samples generated during the sampling iterations, we shift the index notation from $(t)$ to $[s]$ . Building upon (7) and (35) and through some algebraic manipulation, the computation of (4) can be approximated by

	$\displaystyle\hat{L}_{k}=\log\frac{\sum_{s=1}^{S}\frac{1}{1+\exp(-\gamma_{k}^{[s]})}\exp\left(\frac{\tau-1}{\tau}f(\mathbf{x}_{+1}^{[s]})\right)}{\sum_{s=1}^{S}\frac{1}{1+\exp(+\gamma_{k}^{[s]})}\exp\left(\frac{\tau-1}{\tau}f(\mathbf{x}_{-1}^{[s]})\right)},$		(36)

where $\gamma_{k}^{[s]}=\frac{1}{\tau}\left(f(\mathbf{x}_{+1}^{[s]})-f(\mathbf{x}_{-1}^{[s]})\right)$ , and $\mathbf{x}_{+1}^{[s]}$ and $\mathbf{x}_{-1}^{[s]}$ are mapped from $\mathbf{b}^{[s]}$ with its $k$ -th bit ${b}_{k}^{[s]}$ corresponding to $+1$ or $-1$ , respectively. The bit vector $\mathbf{b}^{[s]}$ is demapped from the sample $\mathbf{x}^{[s]}$ drawn from $\pi_{\rm a}$ by DMALA. A detailed derivation of (36) is provided in Appendix C. The complete process of DMALA-based soft MIMO detection, which integrates the proposed DMALA sampler and IS-based LLR computation, is detailed in Algorithm 2.
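As a rough illustration, (36) can be transcribed directly once the values $f(\mathbf{x}_{+1}^{[s]})$ and $f(\mathbf{x}_{-1}^{[s]})$ are available. In the sketch below, the function name `is_llr` and the array layout (one row per sample, one column per bit) are assumptions. The direct form is used for readability and may overflow or underflow for large $|f|$.

```python
import numpy as np

def is_llr(f_plus, f_minus, tau):
    """Direct transcription of the IS-based LLR estimate in (36).

    f_plus[s, k]  = f(x_{+1}^{[s]}) for bit k (sample s with bit k forced to +1)
    f_minus[s, k] = f(x_{-1}^{[s]}) for bit k (sample s with bit k forced to -1)
    tau           = temperature of the auxiliary target (35), tau > 1
    """
    f_plus = np.asarray(f_plus, dtype=float)
    f_minus = np.asarray(f_minus, dtype=float)
    gamma = (f_plus - f_minus) / tau
    scale = (tau - 1.0) / tau
    w_plus = np.exp(scale * f_plus) / (1.0 + np.exp(-gamma))
    w_minus = np.exp(scale * f_minus) / (1.0 + np.exp(gamma))
    # Sum over the S samples (rows), one LLR per bit (column).
    return np.log(w_plus.sum(axis=0)) - np.log(w_minus.sum(axis=0))

# Hypothetical single-sample, single-bit check.
llr_single = is_llr([[-1.0]], [[-3.0]], tau=2.0)
```

With a single sample, the estimate reduces to $f(\mathbf{x}_{+1})-f(\mathbf{x}_{-1})$ for any $\tau$, since $\log\frac{1+e^{\gamma}}{1+e^{-\gamma}}=\gamma$; this gives a quick sanity check on an implementation.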

Algorithm 2 DMALA-Based Soft MIMO Detection

Input: $\mathbf{y}$ , $\mathbf{H}$ , $\sigma^{2}$ , $\alpha$ , $\tau$

1: Run the DMALA sampler (Algorithm 1 with $f(\cdot)$ replaced by $f(\cdot)/\tau$ ) to sample from $\pi_{\rm a}(\mathbf{x})$ in (35).

2: Select the converged samples $\mathcal{L}=\{\mathbf{x}^{[s]}\}_{s=1}^{S}$ for LLR computation.

3: Demap $\{\mathbf{x}^{[s]}\}_{s=1}^{S}$ to derive $\{\mathbf{b}^{[s]}\}_{s=1}^{S}$ .

4: Find $\{\mathbf{x}_{+1}^{[s]}\}_{s=1}^{S}$ and $\{\mathbf{x}_{-1}^{[s]}\}_{s=1}^{S}$ based on $\{\mathbf{b}^{[s]}\}_{s=1}^{S}$ .

5: for $k=1$ to $N\log_{2}Q$ do

6: Compute $\gamma_{k}^{[s]}=\frac{1}{\tau}\left(f(\mathbf{x}_{+1}^{[s]})-f(\mathbf{x}_{-1}^{[s]})\right),s=1,\ldots,S$ .

7: Compute $\hat{L}_{k}$ based on (36).

8: end for

Output: LLRs $\{\hat{L}_{k}\}_{k=1}^{N\log_{2}Q}$

Remark 3

In alignment with established practices within the MCMC literature, we run the samplers for a sufficiently large $T$ to ensure the Markov chains have (approximately) converged. Following this, we gather a list of converged samples (typically a limited number) for LLR computation. This strategy aids in excluding potentially inaccurate samples from the initial burn-in phase [12, 10] and also reduces LLR computation complexity. Moreover, we observe that utilizing independent samples from parallel samplers significantly enhances the accuracy of LLR computations over using correlated samples from a single sampler. Therefore, we opt to gather converged samples from a set of parallel samplers. Although each sampler necessitates a burn-in period, this strategy can be more efficient than extending the run time of a single sampler well beyond the necessary $T$ for convergence to reduce sample correlation.

Remark 4

We also evaluated the LLR computation using Monte Carlo summation in (6) with samples drawn from $\pi$ . We found that this method requires a larger sample size to match the accuracy afforded by the approximation in (7). This observation aligns with the findings in [35] and can be ascribed to the variance reduction effect inherent in IS. Nonetheless, both methods are theoretically guaranteed to asymptotically converge to the exact LLR calculation in (4), a consequence of the law of large numbers. Furthermore, unlike the approach in (33), these methods not only capitalize on the samples collected in $\mathcal{L}$ but also leverage the occurrence frequencies of these samples, effectively utilizing the probability distribution across the sample space. This aspect of the methodology contributes to performance improvements, as evidenced by our simulation results.

III-D Computational Complexity

We analyze the computational complexity of our proposed detector, focusing on the number of dominant arithmetic operations involved, including multiplications and exponential functions. The analysis is structured into two primary segments: sampling and LLR computation. In this analysis, we denote the number of sampling iterations as $T$ , the number of parallel samplers as $N_{\rm p}$ , and the number of samples selected for LLR computation as $S$ .

III-D1 Sampling Complexity

The computational cost during the sampling stage is twofold: initialization and iteration per sample generation. The initialization cost of the preconditioned DMALA is dominated by the matrix inversion required to calculate the preconditioner $\mathbf{M}$ , leading to $\mathcal{O}(N^{3})$ real number multiplications. Nonetheless, this preconditioner needs to be computed only once, allowing for reuse across all sampling iterations, which mitigates its impact on overall complexity.

Each sampling iteration’s complexity primarily involves three matrix-vector multiplications: two for computing the gradient $\nabla f$ as shown in (10), and one for applying the preconditioner to the gradient as $\mathbf{M}\nabla f$ , resulting in $\mathcal{O}(N^{2}+MN)$ multiplications. Specifically, the matrix-vector multiplication $\mathbf{Hx}^{\prime}$ in the gradient computation can be reduced to a summation of $\mathbf{H}$ ’s columns scaled by distinct QAM magnitudes since the elements of $\mathbf{x}^{\prime}$ are from $\mathcal{A}$ . Additional per-iteration operations, such as proposal construction and probability calculation, impose a modest $\mathcal{O}(NQ)$ on the computational load.
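The multiplication-saving evaluation of $\mathbf{Hx}^{\prime}$ mentioned above can be sketched as follows. The sketch assumes a real-valued system representation in which each entry of the vector takes values $\pm m$ for a small set of magnitudes $m$ (for example, $\{1,3\}$ per real dimension of 16-QAM); the function name `matvec_by_magnitude` is hypothetical.

```python
import numpy as np

def matvec_by_magnitude(H, x, magnitudes):
    """Compute H @ x by adding/subtracting columns and scaling once per magnitude."""
    out = np.zeros(H.shape[0])
    for m in magnitudes:
        pos = H[:, x == m].sum(axis=1)    # columns whose symbol equals +m
        neg = H[:, x == -m].sum(axis=1)   # columns whose symbol equals -m
        out += m * (pos - neg)            # one scaling per distinct magnitude
    return out

# Hypothetical 8x8 real-valued example with per-dimension alphabet {-3, -1, 1, 3}.
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 8))
x = rng.choice(np.array([-3.0, -1.0, 1.0, 3.0]), size=8)
y_fast = matvec_by_magnitude(H, x, magnitudes=(1.0, 3.0))
```

The result matches the ordinary product `H @ x`, while the number of multiplications per output entry equals the number of distinct magnitudes rather than the vector length.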

Collectively, the total complexity in the sampling stage is $\mathcal{O}\big{(}N^{3}+(N^{2}+MN+NQ)TN_{\rm p}\big{)}$ , effectively curtailing the exponential rise in complexity with increases in $N$ and $Q$ . Furthermore, the parallelization of sampling across multiple samplers significantly diminishes computation delay.

III-D2 LLR Computation Complexity

When using the IS-based LLR computation described in (36), the computational demand for each bit includes the evaluation of $\gamma_{k}^{[s]}$ and $4S$ exponential operations. Specifically, the calculation of $\gamma_{k}^{[s]}$ involves $f(\mathbf{x}_{+1}^{[s]})$ and $f(\mathbf{x}_{-1}^{[s]})$ , with at least one of these evaluations already performed in the sampling stage. The remaining evaluation can also be executed efficiently, as $\mathbf{x}_{+1}^{[s]}$ and $\mathbf{x}_{-1}^{[s]}$ differ in only one entry [11]. To further reduce the need for exponential function computations, (36) can be reformulated as

	$\displaystyle\hat{L}_{k}=$	$\displaystyle\log\left[\sum_{s=1}^{S}\exp\left(\frac{\tau-1}{\tau}f(\mathbf{x}_{+1}^{[s]})-F(-\gamma_{k}^{[s]})\right)\right]$
		$\displaystyle-\log\left[\sum_{s=1}^{S}\exp\left(\frac{\tau-1}{\tau}f(\mathbf{x}_{-1}^{[s]})-F(+\gamma_{k}^{[s]})\right)\right],$		(37)

where ${F(a)=\log(1+e^{a}),\;a\in\mathbb{R}}$ . For $a\leq 0$ , $F(a)$ can be approximated using pre-computed values in a lookup table; while for ${a>0}$ , the identity ${F(a)=a+F(-a)}$ is utilized [35]. On this basis, the computation of $v=\log\sum_{s=1}^{S}e^{a_{s}}$ in (37) is simplified using the following recursive operation for $s=1$ to $S$ , initializing $v$ as $-\infty$ [5]:

	$\displaystyle v\leftarrow\log(e^{v}+e^{a_{s}})=\max(v,a_{s})+\underbrace{\log(1+e^{-|v-a_{s}|})}_{F(-|v-a_{s}|)}.$		(38)
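The reformulation (37) together with the recursion (38) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and $F(\cdot)$ is evaluated with `np.log1p` in place of the lookup table mentioned above.

```python
import numpy as np

def softplus(a):
    """F(a) = log(1 + e^a), using F(a) = a + F(-a) for a > 0."""
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def recursive_logsumexp(values):
    """log(sum_s exp(a_s)) via the pairwise recursion (38), starting from -inf."""
    v = -np.inf
    for a in values:
        v = max(v, a) + softplus(-abs(v - a))
    return v

def is_llr_stable(f_plus, f_minus, tau):
    """LLR estimate in the log-domain form (37); arrays are (S, K)."""
    f_plus = np.asarray(f_plus, dtype=float)
    f_minus = np.asarray(f_minus, dtype=float)
    gamma = (f_plus - f_minus) / tau
    scale = (tau - 1.0) / tau
    a_plus = scale * f_plus - softplus(-gamma)    # exponents of the first sum
    a_minus = scale * f_minus - softplus(gamma)   # exponents of the second sum
    num_bits = f_plus.shape[1]
    return np.array([
        recursive_logsumexp(a_plus[:, k]) - recursive_logsumexp(a_minus[:, k])
        for k in range(num_bits)
    ])

# Hypothetical values far in the tail, where a direct evaluation would underflow.
llr_tail = is_llr_stable([[-1000.0]], [[-1003.0]], tau=2.0)
```

Working entirely in the log domain keeps the computation finite even when $f$ takes large negative values, which is common at high SNR.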

IV Simulation Results

In this section, we demonstrate the near-optimal bit error rate (BER) performance achieved by our proposed DMALA-based detector within coded MIMO systems. Initially, we outline the common configurations employed across our simulation studies. Subsequently, we delve into the assessment of coded BER performance, exploring a variety of MIMO dimensions, channel models, and scenarios with imperfect CSI. For the sake of conciseness, we refer to our proposed detector simply as “DMALA” throughout this section.

IV-A Simulation Setups

In our simulations, we consistently employ a rate- $3/4$ low-density parity-check code with a block length of 1944 bits for channel coding, alongside belief propagation for channel decoding. The BER is assessed over 70,000 blocks transmitted across independent channel realizations. Our primary focus is on Rayleigh fading channels, characterized by independent Gaussian-distributed channel coefficients with zero mean and unit variance. Furthermore, to ascertain the robustness of our proposed DMALA detector, we extend our analysis to include different channel models. This encompasses Kronecker spatially correlated channels [37] and the more practical 3rd generation partnership project (3GPP) technical report (TR) 36.873 3D channels [38], both of which are prevalent in the literature for evaluating MIMO detectors. Unless explicitly noted otherwise, we assume perfect receiver knowledge of the channel matrix $\mathbf{H}$ .

In light of its demonstrated superiority in Sec. III, we opt for the preconditioned DMALA for our evaluation. For its parameter setups, the step size is set to $\alpha=\sigma^{2}$ . Following [21], the perturbation scaling parameter $\beta$ and the damping parameter $\gamma$ are selected as $\beta={d_{\min}^{2}}/{{\sigma^{2}}}$ and $\gamma={\sigma^{2}}/({2d_{\min}^{2}})$ , respectively, where $d_{\min}$ represents half of the minimum distance between any two QAM lattice points. Benchmark algorithms for comparative analysis encompass the expectation propagation (EP) detector [39], the $K$ -best SD detector [6], the state-of-the-art MHGD detector [21], and the optimal MAP detector that calculates exact LLRs using (4). The EP detector is configured to run for 10 iterations as per [39] with LLRs calculated from posterior marginals approximated by Gaussian distributions [40, Eq. (3)]. Given the deterministic nature of its Gaussian approximation, further increases in iterations do not enhance performance. For the $K$ -best SD, we set a large list size (denoted by $K$ ) to establish a near-optimal baseline and utilize (33) for LLR computation due to its deterministic list-based nature. For both MHGD and DMALA, $N_{\rm p}$ parallel samplers are employed, each initialized randomly and running for $T$ iterations, with the final sample from each sampler selected for soft decisions. Therefore, the parameter $N_{\rm p}$ also indicates the number of samples used in LLR computation. The rationale for this selection strategy has been discussed in Sec. III-C. IS-based LLR computation is applied to both MHGD and DMALA, with a temperature parameter $\tau=2$ as suggested by [35], unless specified otherwise.

IV-B Small- and Medium-Sized MIMO

Fig. 5 illustrates the BER performance of DMALA with different numbers of sampling iterations $T\in\{25,100\}$ in a ${\text{4}\times\text{4}}$ MIMO system with 16-QAM and Rayleigh fading channels. The number of parallel samplers is set as $N_{\rm p}=128$ . DMALA’s BER improves as $T$ increases, consistent with the convergence properties detailed in Theorem 1. This improvement reflects the increased accuracy of the Monte Carlo approximation for LLRs, as the sampled distribution aligns more closely with the target distribution for larger $T$ . Nonetheless, the gains become marginal as the performance approaches that of the optimal detector. In contrast, MHGD’s performance does not exhibit similar improvements with increased $T$ , and its best achievable BER is inferior to that of the optimal detector. This result is in accordance with the findings in Fig. 2, where MHGD’s sampled distribution consistently exhibits bias compared to the target distribution, leading to a persistent gap in soft decisions.

For subsequent simulations, we set $T=100$ , unless noted otherwise, to ensure ample convergence of the Markov chain using DMALA, affirming the performance gains achieved via precise posterior sampling. For a fair comparison, we also set $T=100$ for MHGD. This setup offers a balance between achieving high accuracy and maintaining moderate computational complexity, as each iteration’s complexity is merely $\mathcal{O}(N^{2})$ . (Similar to DMALA, the per-iteration complexity of MHGD is dominated by gradient computation, which is on the order of $\mathcal{O}(N^{2})$ .)

Fig. 6 explores the BER performance in a medium-sized ${\text{8}\times\text{8}}$ MIMO system across different modulation schemes under Rayleigh fading channels. As shown in Fig. 6(a), DMALA significantly outperforms both the EP and MHGD detectors and approaches the optimal detector’s performance with QPSK modulation. Such gains are attributed to DMALA’s efficacious sampling from the target distribution, bolstering the accuracy of statistical inference. Given DMALA’s comparable complexity to MHGD and substantially lower complexity relative to the optimal detector, the proposed method demonstrates substantial promise for efficient and accurate MIMO detection.

Additionally, in the context of 64-QAM modulation, as shown in Fig. 6(b), the optimal MAP detector is computationally prohibitive. Consequently, the $K$ -best SD detector, with a large list size of $K=2048$ , is used as a near-optimal surrogate. We present the BER performance of DMALA with various numbers of sampling iterations $T\in\{25,50,100,150\}$ under this high-order modulation. The analysis reveals that the BER of DMALA converges quickly as $T$ increases. Notably, the DMALA detector outperforms the MHGD detector using only half the iterations ( $T=50$ vs. $T=100$ ) and approaches the near-optimal $K$ -best SD detector with $T=150$ iterations. These results demonstrate that increasing the QAM order does not significantly raise the required number of iterations for convergence, highlighting the robustness and efficiency of the proposed DMALA in handling complex modulation scenarios.

IV-C Large MIMO

This subsection delves into the performance evaluation of the proposed detector within MIMO systems featuring an increased antenna count. Fig. 7 showcases a performance comparison in a large ${\text{16}\times\text{16}}$ MIMO system using 16-QAM under Rayleigh fading channels. This comparison specifically focuses on the performance distinction between different LLR computation methods as outlined in Sec. III-C: the conventional method, as per (33) (“w/ (33)”), and the IS-based method, as per (36). These methods are evaluated for both MHGD and DMALA detectors, each utilizing $N_{\rm p}=128$ parallel samplers. Notably, the IS-based method outperforms the conventional method for both detectors, particularly at high SNRs. This advantage is largely due to the comprehensive exploitation of distribution information from the generated samples in Monte Carlo approximation, as opposed to solely relying on distinct samples within the sample list. Furthermore, DMALA’s performance benefits become even more evident within this large MIMO setting. Specifically, the DMALA with IS significantly outperforms the $K$ -best SD with $K=256$ while employing merely half the number of samples (128 vs. 256), underscoring the efficiency of sampling from critical regions of the sample space.

Fig. 8 shows the BER performance of DMALA in an even larger ${\text{24}\times\text{24}}$ MIMO system, again with 16-QAM. We consider this system setup because enabling support for more than 24 spatial streams is a key objective in advancing MIMO technologies for the 5.5G era and has been actively pursued in numerous practical massive MIMO testbeds [41, 42]. This configuration reflects a balanced increase in spatial streams and complexity that is representative of advanced MIMO configurations, while still being feasible within the constraints of current computational resources and hardware capabilities. The analysis compares MHGD, DMALA, and $K$ -best SD under two different parameter settings. Unsurprisingly, all methods exhibit performance improvements with an increased sample list size ( $N_{\rm p}$ and $K$ ) for soft decision-making. The figure distinctly illustrates DMALA’s superior performance over the benchmark methods across both parameter sets. Notably, DMALA maintains the same number of sampling iterations $T$ as previously demonstrated to achieve these gains, whereas $K$ -best SD necessitates a substantial increase in $K$ to sustain competitive performance. This distinction highlights the remarkable scalability of our proposed approach.

IV-D Correlated MIMO Channels

This subsection delves into the detection performance within Kronecker spatially correlated MIMO channels, following the model introduced in [37]. The channel model is defined as

	$\displaystyle\mathbf{H}=\mathbf{R}_{\rm r}^{1/2}\mathbf{G}\mathbf{R}_{\rm t}^{1/2},$		(39)

where $\mathbf{G}$ represents a Rayleigh fading channel, while $\mathbf{R}_{\rm r}$ and $\mathbf{R}_{\rm t}$ denote the spatial correlation matrices at the receiver and transmitter, respectively. The exponential correlation model [37] is utilized to generate these correlation matrices, with the correlation coefficient denoted by $\rho$ .
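A minimal sketch of generating a channel according to (39) is given below. It assumes that the exponential correlation model takes the common real-valued form $[\mathbf{R}]_{ij}=\rho^{|i-j|}$ and that $\mathbf{G}$ has i.i.d. circularly symmetric complex Gaussian entries with unit variance; the function names are hypothetical.

```python
import numpy as np

def exp_corr(n, rho):
    """Exponential correlation matrix with entries rho^|i-j| (assumed form)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sqrtm_psd(R):
    """Symmetric square root of a positive semidefinite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(R)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def kronecker_channel(n_r, n_t, rho, rng):
    """Draw H = R_r^{1/2} G R_t^{1/2} as in (39) with a common coefficient rho."""
    G = (rng.standard_normal((n_r, n_t))
         + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2.0)
    return sqrtm_psd(exp_corr(n_r, rho)) @ G @ sqrtm_psd(exp_corr(n_t, rho))

# Hypothetical 16x16 draw with a correlation coefficient of 0.5.
H = kronecker_channel(16, 16, 0.5, np.random.default_rng(0))
```

Setting `rho = 0` makes both correlation matrices the identity, so the draw reduces to the uncorrelated Rayleigh fading case.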

For our simulations, we consider a ${\text{16}\times\text{16}}$ MIMO system employing 16-QAM modulation, setting the spatial correlation coefficient as $\rho=0.5$ . The selection of 0.5 as the spatial correlation coefficient represents a typical value indicating a moderate level of spatial correlation. This level is commonly encountered in practical MIMO systems and is widely adopted in the literature to assess the effectiveness of MIMO detectors [43, 44]. We explore different sample list sizes (128 and 256) for MHGD, DMALA, and $K$ -best SD. Fig. 9 shows the comparative performance analysis. In this challenging channel scenario, all detectors undergo performance degradation—approximately a 3 dB loss in SNR compared to their performance in uncorrelated Rayleigh fading channels (as shown in Fig. 7). Despite this, the proposed DMALA consistently achieves performance improvements over the baseline detectors with varying sample sizes, underscoring its robustness to spatial correlation effects.

IV-E 3GPP Massive MIMO Channels

To further attest to the effectiveness of our proposed detector, we delve into performance validation within practical 3GPP 3D MIMO channels [38], employing the QuaDRiGa channel simulator (Version 2.2.0) [45] for this purpose. We simulate the non-line-of-sight massive MIMO uplink transmission ( $N_{\rm r}\gg N_{\rm t}$ ) in an urban macrocell environment, where the base station (BS) is equipped with 32 dual-polarized ( $N_{\rm r}=64$ ) antennas, arranged in a planar array with half-wavelength spacing and positioned at a height of 25 m. The BS services a cell sector extending to a 500 m radius, with 16 single-antenna users uniformly dispersed within this area, resulting in an effective $N_{\rm r}\times N_{\rm t}=\text{64}\times\text{16}$ MIMO system configuration. Channel sampling is performed at a center frequency of 2.53 GHz, with transformation to the frequency domain via 256 effective subcarriers within a 20 MHz bandwidth. A total of 600 channel realizations are generated, each characterized by independent user locations, creating a channel dataset comprising ${\text{600}\times\text{256}}$ channel matrices for performance evaluation.

Fig. 10 presents the performance comparison within these 3GPP MIMO channels. Notably, the MHGD detector underperforms, even lagging behind the deterministic EP method, indicative of MHGD’s sampled distribution deviating significantly from the target distribution in realistic channel conditions, thereby precipitating considerable errors in final inference (LLR computation). In contrast, the DMALA detector, thanks to its precise sampling capabilities, registers significant enhancements in performance relative to both MHGD and EP. Remarkably, compared to the near-optimal $K$ -best SD, DMALA achieves a 0.2 dB performance gain when using 128 samples and maintains comparable performance when using 256 samples. These findings affirm the resilience of our proposed detector in realistic massive MIMO channels.

IV-F Imperfect CSI

This subsection delves into the impact of imperfect CSI on the efficacy of our proposed detector. For the evaluation, we incorporate a linear MMSE channel estimator at the receiver to model the imperfect CSI, resulting in an estimated channel matrix $\hat{\mathbf{H}}=\mathbf{H}+\mathbf{E}$ , where $\mathbf{E}$ is the channel estimation error matrix with entries following $\mathcal{N}(0,\sigma_{\rm e}^{2})$ , with $\sigma_{\rm e}^{2}$ denoting the error variance [46]. Fig. 11 presents the BER performance across different levels of $\sigma_{\rm e}^{2}$ , employing the normalized mean square error (NMSE) as a metric on the $x$ -axis, defined by

	$\displaystyle\text{NMSE}=\frac{\mathbb{E}[\|\mathbf{E}\|_{F}^{2}]}{\mathbb{E}[\|\mathbf{H}\|_{F}^{2}]}.$		(40)
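For illustration, the additive error model $\hat{\mathbf{H}}=\mathbf{H}+\mathbf{E}$ can be simulated for a prescribed NMSE as sketched below. The sketch assumes complex Gaussian error entries and replaces the expectation of the channel power in (40) with the average power of the given realization; the function name `perturb_channel` is hypothetical.

```python
import numpy as np

def perturb_channel(H, nmse_db, rng):
    """Return H_hat = H + E with i.i.d. Gaussian E scaled to a target NMSE (in dB)."""
    nmse = 10.0 ** (nmse_db / 10.0)
    # Per-entry error variance; the realized channel power stands in for E[|h|^2].
    sigma_e2 = nmse * np.mean(np.abs(H) ** 2)
    E = np.sqrt(sigma_e2 / 2.0) * (rng.standard_normal(H.shape)
                                   + 1j * rng.standard_normal(H.shape))
    return H + E

# Hypothetical check on a large Rayleigh channel with a -10 dB target.
rng = np.random.default_rng(0)
H = (rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200))) / np.sqrt(2.0)
H_hat = perturb_channel(H, -10.0, rng)
nmse_emp_db = 10.0 * np.log10(np.sum(np.abs(H_hat - H) ** 2) / np.sum(np.abs(H) ** 2))
```

For unit-variance channel entries, the NMSE in (40) equals $\sigma_{\rm e}^{2}$, so sweeping `nmse_db` corresponds to sweeping the error variance on the $x$ -axis.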

Herein, we consider a ${\text{16}\times\text{16}}$ MIMO system under Rayleigh fading channels with 16-QAM modulation at ${\text{SNR}=\text{16.5}\;\text{dB}}$ . Consistent with the results observed under perfect CSI, the DMALA-based detector notably outperforms baseline detectors, including the $K$ -best SD configured with $K=256$ , across various levels of CSI accuracy. Specifically, DMALA achieves the same BER ( $\text{BER}=10^{-3}$ ) as MHGD and $K$ -best SD while tolerating 4 dB and 1 dB higher channel estimation NMSE, respectively. This comparison demonstrates the robustness of our proposed detector to imperfect CSI.

IV-G Performance and Complexity Comparisons with Various Baselines

In this subsection, we compare performance and computational complexity against additional baselines, including the MMSE detector and the Excited MCMC (X-MCMC) detector [13], an advanced Gibbs sampling-based MIMO detector. Furthermore, the performance of the graph neural network-enhanced EP detector (GEPNet) [47] is included to illustrate the comparison against state-of-the-art deep learning-based MIMO detection; we implement GEPNet using the code released at https://github.com/GNN-based-MIMO-Detection/GNN-based-MIMO-Detection. We perform the comparison in two typical MIMO setups: a small ${\text{4}\times\text{4}}$ MIMO system and a large ${\text{24}\times\text{24}}$ MIMO system. Fig. 12 presents the comparison results, showing that the proposed DMALA exhibits the most competitive performance across both system setups.

TABLE I: Computational Complexity Comparison for the MIMO Detectors in Fig. 12

Algorithms	Number of FLOPs (${\text{4}\times\text{4}}$ MIMO)	Number of FLOPs (${\text{24}\times\text{24}}$ MIMO)
MMSE [4]	$6.60\times 10^{2}$	$2.76\times 10^{4}$
EP [39]	$3.14\times 10^{4}$	$2.04\times 10^{5}$
$K$ -best SD [6]	$1.54\times 10^{5}$ ${\it\langle 1.26\times 10^{7}\rangle}$	${1.66\times 10^{7}}$ ${\it\langle 5.40\times 10^{9}\rangle}$
GEPNet [47]	$4.90\times 10^{6}$	$1.71\times 10^{7}$
X-MCMC [13]	$9.20\times 10^{7}\;(8.14\times 10^{5})$	$6.03\times 10^{9}\;(2.38\times 10^{7})$
MHGD [21]	$2.63\times 10^{7}\;(2.32\times 10^{5})$	$6.67\times 10^{8}\;(2.64\times 10^{6})$
DMALA	${1.45\times 10^{7}\;(1.47\times 10^{5})}$	${1.53\times 10^{8}\;({6.53\times 10^{5}})}$

Note: ( $\sim$ ) denotes the FLOPs count of a single MCMC sampler. $\langle\sim\rangle$ represents the number of comparison operations within the $K$ -best SD detector, which is not included in the FLOPs count.

Subsequently, we report the computational complexity of the compared detectors in Table I, using the number of floating point operations (FLOPs) as the metric. This metric was profiled using the Python package for the performance application programming interface library [48]. The simulations were performed on a machine equipped with an Intel Xeon E5-4627 CPU at 2.6 GHz and 512 GB of memory. For MCMC-based detectors, the FLOPs counts for both the whole detection process and a single sampler are presented. Furthermore, it is important to note that the considerable comparison operations involved in the list administration of the $K$-best SD detector are not included in the FLOPs count, resulting in a complexity metric that favors the SD detector [6, 49]. The number of these comparison operations is recorded separately in Table I, marked with $\langle\sim\rangle$. Based on Table I, several observations are derived:

• Compared to other MCMC-based detectors, including X-MCMC and MHGD, the proposed DMALA exhibits lower complexity in both the ${\text{4}\times\text{4}}$ and ${\text{24}\times\text{24}}$ MIMO systems. The advantage over X-MCMC can be attributed to the parallel update of all variables and the utilization of gradient information for acceleration. Additionally, the proposed method circumvents the high-complexity matrix inversion and line search involved in MHGD [21], thereby exhibiting higher computational efficiency. This superiority is particularly pronounced in the larger ${\text{24}\times\text{24}}$ MIMO setup, where DMALA achieves approximately a 77% reduction in total complexity compared to MHGD.

• The overall computational complexity of the proposed DMALA is relatively higher than the $K$-best SD and GEPNet. Nonetheless, the proposed method benefits from parallel implementation, and the complexity of a single sampler is notably reduced. Particularly, in the ${\text{24}\times\text{24}}$ MIMO system, the complexity of a single DMALA sampler is significantly lower compared to both $K$-best SD and GEPNet. Therefore, by distributing the sampling process over these parallel samplers, the computation latency can be substantially reduced.

• The complexity increase from the ${\text{4}\times\text{4}}$ to the ${\text{24}\times\text{24}}$ MIMO setup in the proposed method is less pronounced when compared to the $K$-best SD. This favorable complexity scaling property of DMALA becomes evident especially when considering that a significant complexity component of the $K$-best SD is disregarded.
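The ratios behind these observations can be recomputed directly from the entries of Table I. The short sketch below does so; the dictionary layout and variable names are hypothetical, while the FLOPs values are copied from the table.

```python
# FLOPs from Table I, stored as (total, single sampler) per MIMO size.
MCMC_FLOPS = {
    "X-MCMC": {"4x4": (9.20e7, 8.14e5), "24x24": (6.03e9, 2.38e7)},
    "MHGD": {"4x4": (2.63e7, 2.32e5), "24x24": (6.67e8, 2.64e6)},
    "DMALA": {"4x4": (1.45e7, 1.47e5), "24x24": (1.53e8, 6.53e5)},
}
# Detectors without a per-sampler count (comparison operations excluded).
OTHER_FLOPS = {
    "K-best SD": {"4x4": 1.54e5, "24x24": 1.66e7},
    "GEPNet": {"4x4": 4.90e6, "24x24": 1.71e7},
}


def reduction_percent(ours, baseline):
    """Relative complexity reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ours / baseline)


dmala_total_24, dmala_single_24 = MCMC_FLOPS["DMALA"]["24x24"]
mhgd_total_24 = MCMC_FLOPS["MHGD"]["24x24"][0]

# First observation: total-complexity reduction relative to MHGD at 24 x 24.
reduction_vs_mhgd = reduction_percent(dmala_total_24, mhgd_total_24)

# Second observation: how much cheaper one DMALA sampler is at 24 x 24.
single_vs_kbest = OTHER_FLOPS["K-best SD"]["24x24"] / dmala_single_24
single_vs_gepnet = OTHER_FLOPS["GEPNet"]["24x24"] / dmala_single_24

# Third observation: growth factor from 4 x 4 to 24 x 24.
dmala_growth = dmala_total_24 / MCMC_FLOPS["DMALA"]["4x4"][0]
kbest_growth = OTHER_FLOPS["K-best SD"]["24x24"] / OTHER_FLOPS["K-best SD"]["4x4"]

if __name__ == "__main__":
    print(f"DMALA vs MHGD (24x24, total): {reduction_vs_mhgd:.1f}% fewer FLOPs")
    print(f"Single DMALA sampler vs K-best SD: {single_vs_kbest:.1f}x cheaper")
    print(f"Single DMALA sampler vs GEPNet: {single_vs_gepnet:.1f}x cheaper")
    print(f"Growth 4x4 -> 24x24: DMALA {dmala_growth:.1f}x, K-best SD {kbest_growth:.1f}x")
```

With the tabulated values, the total FLOPs of DMALA grow by roughly one order of magnitude between the two setups, whereas those of the $K$-best SD grow by roughly two orders of magnitude, even before its comparison operations are counted.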

Taking a holistic view of the performance illustrated in Fig. 12 and the complexity presented in Table I, our proposed detector showcases significant potential in striking a desirable balance between performance and complexity.

V Conclusion

In this study, we introduced a near-optimal MIMO detection scheme leveraging a novel gradient-based MCMC algorithm, designed for exact sampling within discrete spaces. This advancement enables the computation of highly accurate Bayesian estimates (soft decisions) through Monte Carlo techniques. We have rigorously proved the convergence of our proposed sampling algorithm, DMALA, and conducted a thorough analysis of its convergence rate. These theoretical underpinnings not only guarantee the performance of our proposed detector but also clearly delineate it from existing heuristic approaches. Our extensive numerical investigations empirically validate the near-optimality and resilience of the proposed method, affirming its superior performance compared to contemporary state-of-the-art MIMO detectors. Moreover, the proposed detector exhibits exceptional parallelization capabilities and scalability to the number of antennas, rendering it a viable and effective option for next-generation wireless communication systems.

The current limitations of the proposed MCMC detector include increased computational overhead to achieve near-optimal performance and potential sensitivity to model mismatches. Future research should focus on improving the complexity scaling to accommodate ultra-large MIMO configurations and developing efficient hardware implementations for accelerated processing. Additionally, exploring a holistic approach that combines model-driven and data-driven methods would enhance the detector’s adaptability to diverse channel conditions and system configurations.

Appendix A Proof of Lemma 1

To avoid cluttered expressions, we consider the DMALA using the proposal in (15). Nonetheless, the proof also holds for the preconditioned version. We first show that the transition kernel of the Markov chain using DMALA is irreducible and aperiodic. For any two distinct states $\mathbf{x}$ and $\mathbf{x}^{\prime}$ from $\mathcal{A}^{N\times 1}$ , the transition probability $P(\mathbf{x}^{(t+1)}=\mathbf{x}^{\prime}|\mathbf{x}^{(t)}=\mathbf{x})$ is given by $q(\mathbf{x}^{\prime}|\mathbf{x})A(\mathbf{x}^{\prime}|\mathbf{x})$ , as stated in (24), where the proposal probability $q(\mathbf{x}^{\prime}|\mathbf{x})$ and the acceptance probability $A(\mathbf{x}^{\prime}|\mathbf{x})$ are given in (14) and (18), respectively. The proposal probability $q(\mathbf{x}^{\prime}|\mathbf{x})$ satisfies $0<q(\mathbf{x}^{\prime}|\mathbf{x})<1$ as long as the step size $\alpha>0$ , given that $\left[\nabla f(\mathbf{x})\right]_{n}$ and $(x^{\prime}_{n}-x_{n})$ in (16) are bounded and the proposal probability is normalized. Moreover, based on (3) and (18), the acceptance probability can be written as

\begin{aligned}
A(\mathbf{x}^{\prime}|\mathbf{x}) &= \min\left\{1,\exp\big(f(\mathbf{x}^{\prime})-f(\mathbf{x})\big)\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}\right\} \\
&= \min\left\{1,\exp\left(\frac{\|\mathbf{y}-\mathbf{Hx}\|^{2}-\|\mathbf{y}-\mathbf{Hx}^{\prime}\|^{2}}{\sigma^{2}}\right)\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}\right\},
\end{aligned}

(41)

which remains nonzero as long as $\sigma^{2}>0$ . This ensures that

P\big(\mathbf{x}^{(t+1)}=\mathbf{x}^{\prime}|\mathbf{x}^{(t)}=\mathbf{x}\big)>0,\;\forall\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1},\mathbf{x}^{\prime}\neq\mathbf{x},

(42)

thereby confirming the chain’s irreducibility by definition [50].

On the other hand, when $\mathbf{x}^{\prime}=\mathbf{x}$ , according to (24) we have

\begin{aligned}
P\big(\mathbf{x}^{(t+1)}=\mathbf{x}|\mathbf{x}^{(t)}=\mathbf{x}\big) &= q(\mathbf{x}|\mathbf{x})+\sum_{\mathbf{z}\neq\mathbf{x}}q(\mathbf{z}|\mathbf{x})\big(1-A(\mathbf{z}|\mathbf{x})\big) \\
&\geq q(\mathbf{x}|\mathbf{x})>0,
\end{aligned}

(43)

where the first equality is due to the fact that the chain stays in state $\mathbf{x}$ either when the proposal is $\mathbf{x}$, or when the proposal is $\mathbf{z}\neq\mathbf{x}$ but is rejected. The first inequality holds because $0<q(\mathbf{z}|\mathbf{x})<1$ based on the preceding analysis, and $0<A(\mathbf{z}|\mathbf{x})\leq 1$ according to (41), so that every term in the sum is nonnegative. The second inequality, $q(\mathbf{x}|\mathbf{x})>0$, holds for the same reason that $q(\mathbf{x}^{\prime}|\mathbf{x})>0$ for $\mathbf{x}^{\prime}\neq\mathbf{x}$. Leveraging (43), we deduce that

\text{gcd}\left\{m:P\big(\mathbf{x}^{(t+m)}=\mathbf{x}|\mathbf{x}^{(t)}=\mathbf{x}\big)>0\right\}=1

(44)

where $m$ ranges over the positive integers and “gcd” denotes the greatest common divisor; the set contains $m=1$ because the one-step return probability in (43) is positive. This implies that the chain is aperiodic by definition [50].

We then show that the chain induced by DMALA is reversible with respect to the target posterior distribution $\pi$ , i.e., given two states $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}$ , the following equation (also called detailed balance condition [7]) holds:

\pi(\mathbf{x})P(\mathbf{x}^{\prime}|\mathbf{x})=\pi(\mathbf{x}^{\prime})P(\mathbf{x}|\mathbf{x}^{\prime}),

(45)

which is a sufficient condition for guaranteeing that $\pi$ is a stationary distribution of the Markov chain. Since (45) naturally holds when $\mathbf{x}^{\prime}=\mathbf{x}$ , we consider the case where $\mathbf{x}^{\prime}\neq\mathbf{x}$ . In this case, we have

\begin{aligned}
\pi(\mathbf{x})P(\mathbf{x}^{\prime}|\mathbf{x}) &= \pi(\mathbf{x})q(\mathbf{x}^{\prime}|\mathbf{x})A(\mathbf{x}^{\prime}|\mathbf{x}) \\
&= \min\left\{\pi(\mathbf{x})q(\mathbf{x}^{\prime}|\mathbf{x}),\pi(\mathbf{x}^{\prime})q(\mathbf{x}|\mathbf{x}^{\prime})\right\},
\end{aligned}

(46)

and by virtue of the same rationale,

\begin{aligned}
\pi(\mathbf{x}^{\prime})P(\mathbf{x}|\mathbf{x}^{\prime}) &= \pi(\mathbf{x}^{\prime})q(\mathbf{x}|\mathbf{x}^{\prime})A(\mathbf{x}|\mathbf{x}^{\prime}) \\
&= \min\left\{\pi(\mathbf{x}^{\prime})q(\mathbf{x}|\mathbf{x}^{\prime}),\pi(\mathbf{x})q(\mathbf{x}^{\prime}|\mathbf{x})\right\}.
\end{aligned}

(47)

Since the right-hand sides of (46) and (47) coincide, (45) holds. With this equation, we have

\begin{aligned}
\sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}\pi(\mathbf{x}^{\prime})P(\mathbf{x}|\mathbf{x}^{\prime}) &= \sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}\pi(\mathbf{x})P(\mathbf{x}^{\prime}|\mathbf{x}) \\
&= \pi(\mathbf{x})\sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{N\times 1}}P(\mathbf{x}^{\prime}|\mathbf{x})=\pi(\mathbf{x}),
\end{aligned}

(48)

completing the proof of $\pi$ being a stationary distribution of the chain using DMALA. Furthermore, since the chain is irreducible, this stationary distribution is unique [30, Corollary 1.17], completing the proof of Lemma 1.
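As a numerical sanity check on the argument above, the following sketch enumerates the full transition matrix for a toy real-valued problem and verifies the positivity in (42) and (43), the detailed balance condition (45), and the stationarity in (48). It illustrates the proof and does not replace it. The 4-PAM alphabet, the $2\times 2$ channel, the observation, $\sigma^{2}$, and $\alpha$ are all illustrative assumptions, and the gradient expression assumes $f(\mathbf{x})=-\|\mathbf{y}-\mathbf{Hx}\|^{2}/\sigma^{2}$, as implied by (41), with the non-preconditioned, coordinate-wise normalized proposal.

```python
import itertools
import math

# All numerical values below are illustrative assumptions.
ALPHABET = (-3.0, -1.0, 1.0, 3.0)      # toy real-valued 4-PAM alphabet
H = ((1.0, 0.3), (0.2, 0.9))           # toy 2 x 2 real channel
Y = (0.8, -0.6)                        # toy observation
SIGMA2 = 2.0                           # noise variance
ALPHA = 0.5                            # step size
N = 2
STATES = list(itertools.product(ALPHABET, repeat=N))


def residual(x):
    """Return y - Hx."""
    return [Y[i] - sum(H[i][j] * x[j] for j in range(N)) for i in range(N)]


def f(x):
    """Unnormalized log-posterior, f(x) = -||y - Hx||^2 / sigma^2."""
    return -sum(r * r for r in residual(x)) / SIGMA2


def grad_f(x):
    """Gradient of f on the continuous relaxation: 2 H^T (y - Hx) / sigma^2."""
    r = residual(x)
    return [2.0 * sum(H[i][n] * r[i] for i in range(N)) / SIGMA2 for n in range(N)]


def q(x_new, x):
    """Factorized proposal, normalized per coordinate over the alphabet."""
    g = grad_f(x)
    prob = 1.0
    for n in range(N):
        center = x[n] + 0.5 * ALPHA * g[n]
        weights = {a: math.exp(-((a - center) ** 2) / (2.0 * ALPHA)) for a in ALPHABET}
        prob *= weights[x_new[n]] / sum(weights.values())
    return prob


def accept(x_new, x):
    """Metropolis-Hastings acceptance probability, as in (41)."""
    return min(1.0, math.exp(f(x_new) - f(x)) * q(x, x_new) / q(x_new, x))


def transition(x_new, x):
    """P(x_new | x): an accepted move, or the total probability of staying put."""
    if x_new != x:
        return q(x_new, x) * accept(x_new, x)
    return 1.0 - sum(transition(z, x) for z in STATES if z != x)


Z = sum(math.exp(f(s)) for s in STATES)
PI = {s: math.exp(f(s)) / Z for s in STATES}   # target distribution

# (42): every off-diagonal transition probability is positive.
irreducible = all(transition(b, a) > 0.0 for a in STATES for b in STATES if a != b)
# (43): every state has a positive self-transition probability.
aperiodic = all(transition(s, s) > 0.0 for s in STATES)
# (45): detailed balance for every pair of states.
reversible = all(
    math.isclose(PI[a] * transition(b, a), PI[b] * transition(a, b), rel_tol=1e-9)
    for a in STATES for b in STATES
)
# (48): pi is left invariant by the transition kernel.
stationary = all(
    math.isclose(sum(PI[b] * transition(a, b) for b in STATES), PI[a], rel_tol=1e-9)
    for a in STATES
)
```

Exhaustive enumeration is feasible here only because the toy state space has 16 elements; for realistic MIMO dimensions the proof, not enumeration, provides the guarantee.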

Appendix B Proof of Lemma 2

The proposal distribution of DMALA can be expressed as

q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})=\frac{\exp\big(-\frac{1}{2\alpha}\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})\|^{2}\big)}{\prod_{n=1}^{N}\sum_{x_{n}^{\prime}\in\mathcal{A}}\exp\big(-\frac{1}{2\alpha}|x_{n}^{\prime}-x_{n}^{(t)}-\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}|^{2}\big)}.

(49)

Combining this equation with the expression of the target posterior distribution, we have

\begin{aligned}
\frac{q(\mathbf{x}^{\prime}|\mathbf{x}^{(t)})}{\pi(\mathbf{x}^{\prime})}
&= \frac{\exp\big(-\frac{1}{2\alpha}\|\mathbf{x}^{\prime}-\mathbf{x}^{(t)}-\frac{\alpha}{2}\nabla f(\mathbf{x}^{(t)})\|^{2}\big)}{\prod_{n=1}^{N}\sum_{x_{n}^{\prime}\in\mathcal{A}}\exp\big(-\frac{1}{2\alpha}|x_{n}^{\prime}-x_{n}^{(t)}-\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}|^{2}\big)}\cdot\frac{\sum_{\mathbf{s}\in\mathcal{A}^{N\times 1}}\exp\big(f(\mathbf{s})\big)}{\exp\big(f(\mathbf{x}^{\prime})\big)} \\
&\geq \frac{\sum_{\mathbf{s}\in\mathcal{A}^{N\times 1}}\exp\big(f(\mathbf{s})\big)}{\prod_{n=1}^{N}\sum_{x_{n}^{\prime}\in\mathbb{Z}}\exp\big(-\frac{1}{2\alpha}|x_{n}^{\prime}-x_{n}^{(t)}-\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}|^{2}\big)}\cdot G(\mathbf{x}^{(t)},\mathbf{x}^{\prime}) \\
&\geq \frac{\sum_{\mathbf{s}\in\mathcal{A}^{N\times 1}}\exp\big(f(\mathbf{s})\big)}{\prod_{n=1}^{N}\sum_{x_{n}^{\prime}\in\mathbb{Z}}\exp\big(-\frac{1}{2\alpha}|x_{n}^{\prime}|^{2}\big)}\cdot G(\mathbf{x}^{(t)},\mathbf{x}^{\prime}) \\
&= \xi\cdot G(\mathbf{x}^{(t)},\mathbf{x}^{\prime}),
\end{aligned}

(50)

where the two inequalities follow from the following fact from lattice theory [51]:

\begin{aligned}
\sum_{x_{n}^{\prime}\in\mathcal{A}}\exp\bigg(-\frac{1}{2\alpha}|x_{n}^{\prime}-x_{n}^{(t)}-\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}|^{2}\bigg)
&\leq \sum_{x_{n}^{\prime}\in\mathbb{Z}}\exp\bigg(-\frac{1}{2\alpha}|x_{n}^{\prime}-x_{n}^{(t)}-\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}|^{2}\bigg) \\
&\leq \sum_{x_{n}^{\prime}\in\mathbb{Z}}\exp\bigg(-\frac{1}{2\alpha}|x_{n}^{\prime}|^{2}\bigg),
\end{aligned}

(51)

completing the proof of Lemma 2.
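The two inequalities in (51) can also be checked numerically. The sketch below is a hedged illustration under simplifying assumptions rather than part of the proof: it uses a real-valued alphabet contained in the integers, truncates the sum over $\mathbb{Z}$ to a finite window, and scans a grid of shifts that play the role of $x_{n}^{(t)}+\frac{\alpha}{2}[\nabla f(\mathbf{x}^{(t)})]_{n}$. The function names, the grid, and the tolerance are hypothetical.

```python
import math

ALPHABET = (-3, -1, 1, 3)        # toy real-valued alphabet inside Z (assumption)
INTEGERS = range(-200, 201)      # finite truncation of Z (assumption)


def gaussian_sum(points, center, alpha):
    """Sum of exp(-|p - center|^2 / (2 * alpha)) over the given points."""
    return sum(math.exp(-((p - center) ** 2) / (2.0 * alpha)) for p in points)


def check_lattice_bounds(alpha, tol=1e-9):
    """Check both inequalities of (51) on a grid of shifts in [-4, 4]."""
    upper = gaussian_sum(INTEGERS, 0.0, alpha)       # right-hand side of (51)
    for step in range(-80, 81):
        center = step * 0.05                         # shifted center of the Gaussian
        over_alphabet = gaussian_sum(ALPHABET, center, alpha)
        over_integers = gaussian_sum(INTEGERS, center, alpha)
        # First inequality: the alphabet is a subset of the integers.
        # Second inequality: the unshifted lattice sum is the largest.
        if over_alphabet > over_integers + tol or over_integers > upper + tol:
            return False
    return True


results = {alpha: check_lattice_bounds(alpha) for alpha in (0.1, 0.5, 2.0)}
```

The small tolerance absorbs floating-point rounding: for larger $\alpha$ the shifted and unshifted lattice sums differ by less than double precision can resolve.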

Appendix C Derivation of (36)

According to the IS-based Monte Carlo summation in (7), by treating $p(b_{k}=\pm 1|\mathbf{y})$ as the expectation $\mathbb{E}_{p(\mathbf{b}_{-k}|\mathbf{y})}[p(b_{k}=\pm 1|\mathbf{y},\mathbf{b}_{-k})]$, these APPs can be approximated by

\frac{1}{S}\sum_{s=1}^{S}p(b_{k}=\pm 1|\mathbf{y},\mathbf{b}_{-k}^{[s]})\frac{p(\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(\mathbf{b}_{-k}^{[s]})},

(52)

where $\mathbf{b}_{-k}^{[s]}$ is obtained by dropping the $k$ -th bit after demapping the sample $\mathbf{x}^{[s]}$ drawn from $\pi_{\rm a}$ . Note that herein we use the argument $\mathbf{b}$ for $\pi_{\rm a}$ to simplify the derivation of the LLR expression. Consequently, the posterior LLR of the $k$ -th bit can be approximated as

\begin{aligned}
\hat{L}_{k} &= \log\frac{\sum_{s=1}^{S}p(b_{k}=+1|\mathbf{y},\mathbf{b}_{-k}^{[s]})\frac{p(\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(\mathbf{b}_{-k}^{[s]})}}{\sum_{s=1}^{S}p(b_{k}=-1|\mathbf{y},\mathbf{b}_{-k}^{[s]})\frac{p(\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(\mathbf{b}_{-k}^{[s]})}} \\
&= \log\frac{\sum_{s=1}^{S}\pi_{\rm a}(b_{k}=+1|\mathbf{b}_{-k}^{[s]})\frac{p(b_{k}=+1,\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(b_{k}=+1,\mathbf{b}_{-k}^{[s]})}}{\sum_{s=1}^{S}\pi_{\rm a}(b_{k}=-1|\mathbf{b}_{-k}^{[s]})\frac{p(b_{k}=-1,\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(b_{k}=-1,\mathbf{b}_{-k}^{[s]})}},
\end{aligned}

(53)

where the second equality follows from the product rule of probability, $p(a_{1},a_{2})=p(a_{2}|a_{1})p(a_{1})$, applied to both $p$ and $\pi_{\rm a}$. For the conditional probability $\pi_{\rm a}(b_{k}=\pm 1|\mathbf{b}_{-k}^{[s]})$, to avoid numerical instability, we consider a log-domain implementation by defining

\begin{aligned}
\gamma_{k}^{[s]} &= \log\frac{\pi_{\rm a}(b_{k}=+1|\mathbf{b}_{-k}^{[s]})}{\pi_{\rm a}(b_{k}=-1|\mathbf{b}_{-k}^{[s]})}=\log\frac{\pi_{\rm a}(b_{k}=+1,\mathbf{b}_{-k}^{[s]})}{\pi_{\rm a}(b_{k}=-1,\mathbf{b}_{-k}^{[s]})} \\
&= \frac{1}{\tau}\left(f(\mathbf{x}_{+1}^{[s]})-f(\mathbf{x}_{-1}^{[s]})\right).
\end{aligned}

(54)

Therefore, we have

\pi_{\rm a}(b_{k}=\pm 1|\mathbf{b}_{-k}^{[s]})=\frac{1}{1+\exp(\mp\gamma_{k}^{[s]})}.

(55)

Moreover, based on the definition of $\pi$ and $\pi_{\rm a}$ , we have

\frac{p(b_{k}=\pm 1,\mathbf{b}_{-k}^{[s]}|\mathbf{y})}{\pi_{\rm a}(b_{k}=\pm 1,\mathbf{b}_{-k}^{[s]})}\propto\frac{\pi(\mathbf{x}_{\pm 1}^{[s]})}{\pi(\mathbf{x}_{\pm 1}^{[s]})^{1/\tau}}=\exp\left(\frac{\tau-1}{\tau}f(\mathbf{x}_{\pm 1}^{[s]})\right).

(56)

Substituting (55) and (56) into (53), we derive (36).
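The substitution can be sketched in the log domain as follows. This is a minimal illustration under stated assumptions, not the implementation used in the simulations: the function names are hypothetical, and the inputs `f_plus[s]` and `f_minus[s]` are assumed to hold $f(\mathbf{x}_{+1}^{[s]})$ and $f(\mathbf{x}_{-1}^{[s]})$, i.e., $f$ evaluated at the $s$-th sample with the $k$-th bit set to $+1$ and $-1$, respectively. The proportionality constant in (56) is common to both bit values and cancels in the ratio.

```python
import math


def log_sigmoid(v):
    """Numerically stable log(1 / (1 + exp(-v))), i.e., the log of (55)."""
    return -(max(-v, 0.0) + math.log1p(math.exp(-abs(v))))


def log_sum_exp(values):
    """Stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def llr_estimate(f_plus, f_minus, tau):
    """Approximate LLR of one bit from S samples, combining (53)-(56)."""
    scale = (tau - 1.0) / tau                            # exponent factor in (56)
    log_num, log_den = [], []
    for fp, fm in zip(f_plus, f_minus):
        gamma = (fp - fm) / tau                          # (54)
        log_num.append(log_sigmoid(gamma) + scale * fp)  # b_k = +1 term
        log_den.append(log_sigmoid(-gamma) + scale * fm) # b_k = -1 term
    return log_sum_exp(log_num) - log_sum_exp(log_den)


if __name__ == "__main__":
    # Hypothetical values of f for three samples.
    print(llr_estimate([-1.0, -4.0, -2.5], [-3.5, -0.5, -2.0], tau=1.5))
```

A quick consistency check of this form: with a single sample, the estimate reduces to $f(\mathbf{x}_{+1}^{[1]})-f(\mathbf{x}_{-1}^{[1]})$ for any $\tau>0$, since the contributions of (55) and (56) add up to exactly that difference.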

References

[1] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
[2] E. Björnson, L. Sanguinetti, H. Wymeersch, J. Hoydis, and T. L. Marzetta, “Massive MIMO is a reality—What is next?: Five promising research directions for antenna arrays,” Digit. Signal Process., vol. 94, pp. 3–20, Nov. 2019.
[3] J. Zhang, E. Björnson, M. Matthaiou, D. W. K. Ng, H. Yang, and D. J. Love, “Prospective multiple antenna technologies for beyond 5G,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1637–1660, Aug. 2020.
[4] S. Yang and L. Hanzo, “Fifty years of MIMO detection: The road to large-scale MIMOs,” IEEE Commun. Surv. Tuts., vol. 17, no. 4, pp. 1941–1988, 2015.
[5] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399, Mar. 2003.
[6] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
[7] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[8] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov Chain Monte Carlo. Boca Raton, FL, USA: Chapman & Hall, 2011.
[9] A. Doucet and X. Wang, “Monte Carlo methods for signal processing: A review in the statistical signal processing context,” IEEE Signal Process. Mag., vol. 22, no. 6, pp. 152–170, Nov. 2005.
[10] R. Chen, J. Liu, and X. Wang, “Convergence analyses and comparisons of Markov chain Monte Carlo algorithms in digital communications,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 255–270, Feb. 2002.
[11] S. A. Laraway and B. Farhang-Boroujeny, “Implementation of a Markov chain Monte Carlo based multiuser/MIMO detector,” IEEE Trans. Circuits and Syst. I: Reg. Papers, vol. 56, no. 1, pp. 246–255, Jan. 2009.
[12] B. Farhang-Boroujeny, H. Zhu, and Z. Shi, “Markov chain Monte Carlo algorithms for CDMA and MIMO communication systems,” IEEE Trans. Signal Process., vol. 54, no. 5, pp. 1896–1909, May 2006.
[13] J. C. Hedstrom, C. H. Yuen, R.-R. Chen, and B. Farhang-Boroujeny, “Achieving near MAP performance with an excited Markov chain Monte Carlo MIMO detector,” IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 7718–7732, Dec. 2017.
[14] W. Grathwohl, K. Swersky, M. Hashemi, D. Duvenaud, and C. Maddison, “Oops I took a gradient: Scalable sampling for discrete distributions,” in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2021, pp. 3831–3841.
[15] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[16] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, “Sampling can be faster than optimization,” Proc. Nat. Acad. Sci., vol. 116, no. 42, pp. 20 881–20 885, Sep. 2019.
[17] G. O. Roberts and R. L. Tweedie, “Exponential convergence of Langevin distributions and their discrete approximations,” Bernoulli, vol. 2, no. 4, pp. 341–363, 1996.
[18] G. O. Roberts and O. Stramer, “Langevin diffusions and Metropolis-Hastings algorithms,” Methodol. Comput. Appl. Probab., vol. 4, no. 4, pp. 337–357, Dec. 2002.
[19] R. Zhang, X. Liu, and Q. Liu, “A Langevin-like sampler for discrete distributions,” in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2022, pp. 26 375–26 396.
[20] B. Rhodes and M. Gutmann, “Enhanced gradient-based MCMC in discrete spaces,” Trans. Mach. Learn. Res., pp. 1–30, Oct. 2022.
[21] N. M. Gowda, S. Krishnamurthy, and A. Belogolovy, “Metropolis-Hastings random walk along the gradient descent direction for MIMO detection,” in Proc. IEEE Int. Conf. Commun. (ICC), Montreal, QC, Canada, Jun. 2021, pp. 1–7.
[22] X. Zhou, L. Liang, J. Zhang, C.-K. Wen, and S. Jin, “Gradient-based Markov chain Monte Carlo for MIMO detection,” IEEE Trans. Wireless Commun., vol. 23, no. 7, pp. 7566–7581, Jul. 2024.
[23] N. Zilberstein, C. Dick, R. Doost-Mohammady, A. Sabharwal, and S. Segarra, “Annealed Langevin dynamics for massive MIMO detection,” IEEE Trans. Wireless Commun., vol. 22, no. 6, pp. 3762–3776, Jun. 2023.
[24] Z. Wu and H. Li, “Stochastic gradient Langevin dynamics for massive MIMO detection,” IEEE Commun. Lett., vol. 26, no. 5, pp. 1062–1065, May 2022.
[25] D. G. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA, USA: Addison-Wesley, 1973.
[26] Y. Nesterov, “A method for solving the convex programming problem with convergence rate $O(1/k^{2})$,” Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547, 1983.
[27] F. M. Kashif, H. Wymeersch, and M. Z. Win, “Monte Carlo equalization for nonlinear dispersive satellite channels,” IEEE J. Sel. Areas Commun., vol. 26, no. 2, pp. 245–255, Feb. 2008.
[28] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. Int. Conf. Mach. Learn. (ICML), 2011, pp. 681–688.
[29] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr. 1970.
[30] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. Providence, RI, USA: Amer. Math. Soc., 2008.
[31] I. Kontoyiannis and S. P. Meyn, “Geometric ergodicity and the spectral gap of non-reversible Markov chains,” Probab. Theory Related Fields, vol. 154, no. 1-2, pp. 327–339, 2012.
[32] Z. Wang, Y. Xia, C. Ling, and Y. Huang, “Randomized iterative sampling decoding algorithm for large-scale MIMO detection,” IEEE Trans. Signal Process., vol. 72, pp. 580–593, 2024.
[33] Z. Wang and C. Ling, “On the geometric ergodicity of Metropolis-Hastings algorithms for lattice Gaussian sampling,” IEEE Trans. Inf. Theory, vol. 64, no. 2, pp. 738–751, Feb. 2018.
[34] Z. Wang, S. Lyu, Y. Xia, and Q. Wu, “Expectation propagation-based sampling decoding: Enhancement and optimization,” IEEE Trans. Signal Process., vol. 69, pp. 195–209, 2021.
[35] M. Senst and G. Ascheid, “A Rao-Blackwellized Markov chain Monte Carlo algorithm for efficient MIMO detection,” in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2011, pp. 1–6.
[36] B. Hassibi, M. Hansen, A. G. Dimakis, H. A. J. Alshamary, and W. Xu, “Optimized Markov chain Monte Carlo for signal detection in MIMO systems: An analysis of the stationary distribution and mixing time,” IEEE Trans. Signal Process., vol. 62, no. 17, pp. 4436–4450, Sep. 2014.
[37] S. Loyka, “Channel capacity of MIMO architecture using the exponential correlation matrix,” IEEE Commun. Lett., vol. 5, no. 9, pp. 369–371, Sep. 2001.
[38] 3GPP TR 36.873, “Study on 3D channel model for LTE (Release 12),” Tech. Rep., Jun. 2017. [Online]. Available: https://www.3gpp.org/ftp/Specs/archive/36_series/36.873/.
[39] J. Céspedes, P. M. Olmos, M. Sánchez-Fernández, and F. Perez-Cruz, “Probabilistic MIMO symbol detection with expectation consistency approximate inference,” IEEE Trans. Veh. Technol., vol. 67, no. 4, pp. 3481–3494, Apr. 2018.
[40] X. Zhou, J. Zhang, C.-K. Wen, S. Jin, and S. Han, “Graph neural network-enhanced expectation propagation algorithm for MIMO turbo receivers,” IEEE Trans. Signal Process., vol. 71, pp. 3458–3473, 2023.
[41] S. Suyama, J. Shen, H. Suzuki, K. Fukawa, and Y. Okumura, “Evaluation of 30 Gbps super high bit rate mobile communications using channel data in 11 GHz band 24 $\times$ 24 MIMO experiment,” in Proc. IEEE Int. Conf. Commun., Jun. 2014, pp. 5203–5208.
[42] P. Gröschel et al., “An ultra-versatile massive MIMO transceiver testbed for multi-Gb/s communication,” in Proc. IEEE 5G World Forum, Sep. 2019, pp. 1–6.
[43] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for MIMO detection,” IEEE Trans. Signal Process., vol. 68, pp. 1702–1715, Mar. 2020.
[44] Y. Wei, M.-M. Zhao, M. Hong, M.-J. Zhao, and M. Lei, “Learned conjugate gradient descent network for massive MIMO detection,” IEEE Trans. Signal Process., vol. 68, pp. 6336–6349, Nov. 2020.
[45] S. Jaeckel, L. Raschkowski, K. Börner, L. Thiele, F. Burkhardt, and E. Eberlein, “QuaDRiGa—Quasi deterministic radio channel generator user manual and documentation,” Fraunhofer Heinrich Hertz Inst., Berlin, Germany, Tech. Rep. V2.2.0, Aug. 2017.
[46] T. Weber, A. Sklavos, and M. Meurer, “Imperfect channel-state information in MIMO transmission,” IEEE Trans. Commun., vol. 54, no. 3, pp. 543–552, Mar. 2006.
[47] A. Kosasih, V. Onasis, V. Miloslavskaya, W. Hardjawana, V. Andrean, and B. Vucetic, “Graph neural network aided MU-MIMO detectors,” IEEE J. Sel. Areas Commun., vol. 40, no. 9, pp. 2540–2555, Sep. 2022.
[48] Flozz/PYPAPI: Python binding for the PAPI (Performance Application Programming Interface) library. Accessed: Aug. 30, 2024. [Online]. Available: https://github.com/flozz/pypapi.
[49] C. Studer and H. Bolcskei, “Soft-input soft-output single tree-search sphere decoding,” IEEE Trans. Inf. Theory, vol. 56, no. 10, pp. 4827–4842, Oct. 2010.
[50] J. R. Norris, Markov Chains. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[51] D. Micciancio and O. Regev, “Worst-case to average-case reductions based on Gaussian measures,” in Proc. Ann. Symp. Found. Computer Science, Rome, Italy, Oct. 2004, pp. 372–381.
