On the Strong Converse Exponent and Error Exponent of the Classical Soft Covering

Xingyi He and S. Sandeep Pradhan

Andreas Winter
A preliminary version of this work was presented in part at the 2025 IEEE International Symposium on Information Theory (ISIT), 22-27 June 2025, Ann Arbor MI, doi:10.1109/ISIT63088.2025.11195309 [HePradhanWinter:ISIT]. XH’s and SSP’s work was supported in part by NSF grant CCF-2132815. AW’s work was supported by the European Commission QuantERA project ExTRaQT (Spanish MICIN grant no. PCI2022-132965); by the Spanish MICIN (project PID2022-141283NB-I00) with the support of FEDER funds; by the Spanish MICIN with funding from European Union NextGenerationEU (PRTR-C17.I1) and the Generalitat de Catalunya; by the Spanish MTDFP through the QUANTUM ENIA project: Quantum Spain, funded by the European Union NextGenerationEU within the framework of the “Digital Spain 2026 Agenda”; by the Alexander von Humboldt Foundation; and by the Institute for Advanced Study of the Technical University Munich.

Abstract

This paper establishes the exact strong converse exponent of the soft covering problem in the classical setting. This exponent characterizes the slowest achievable convergence speed of the total variation to one when a code of rate below mutual information is applied to a discrete memoryless channel for synthesizing a product output distribution. The proposed exponent is expressed through a new two-parameter information quantity, differing from the more commonly studied Rényi divergence or Rényi mutual information. In addition, we demonstrate the non-tightness of random coding for rates both below and above mutual information. Discussions on the latter start with noiseless channels, where we develop a deterministic code construction that outperforms random codes in error exponents. We further observe that the conventional formulation, which assumes a uniform distribution over messages, inherently introduces a discrepancy in error exponents depending on whether the components of the target distribution are rational or irrational numbers. To eliminate this discrepancy, we propose a new formulation in which messages are allowed to be distributed non-uniformly, and the rate is given by the negative logarithm of the smallest nonzero message probability (corresponding to Rényi entropy $H_{-\infty}$ of order $-\infty$ ). The exact error exponent is characterized in this formulation for noiseless channels. Furthermore, for noisy channels, we provide a high-rate improvement in achievability and derive a converse bound on the error exponent.

I Introduction

The soft covering lemma is a fundamental lemma used in various information-theoretic problems, such as channel resolvability, channel simulation, and lossy source coding. It first appeared in [wyner1975common, Thm. 6.3] where Wyner used the normalized (with a factor $1/n$ ) relative entropy to quantify probabilistic closeness and derived the achievability proof of the common information as the optimal rate of shared randomness between two agents. The concept of soft covering has subsequently been applied in wiretap channels [wyner1975wire, hayashi2006general, parizi2016exact, yu2018renyi] and in other problems including channel synthesis [cuff2010coordination, cuff2013distributed, hsieh2016channel] and channel resolvability [han1993approximation, bloch2013strong, watanabe2014strong, liu2016e_][koga2013information, Ch. 6] with tighter measures of probabilistic closeness such as total variation and relative entropy [winter2005secret, hou2013informational, hou2014effective, cuff2015stronger, cuff2016soft]. It is worth noting that channel resolvability is closely related to soft covering in the sense that the former deals with simulating all output distributions including non-product ones, while the latter focuses solely on the product ones. More recently, developments in quantum information theory have prompted extensive research into the soft covering [ahlswede2002strong, hayashi2015quantum, cheng2023error, atif2023lossy, atif2024quantum, shen2024optimal, he2024quantum, hayashi2025resolvability][wilde2011classical, Ch. 17][hayashi2016quantum, Sec. 9.4] and channel simulation [massar2000amount, winter2001compression, winter2004extrinsic, luo2009channel, wilde2012information, bennett2014quantum, radhakrishnan2017one, anshu2019convex] for quantum channels.

The general idea of soft covering is to simulate a given output distribution using a specific channel $n$ times. Suppose we are given a discrete memoryless channel (DMC) with transition probability distribution $W_{Y|X}$ and a desired output distribution $P_{Y}$ . Consider any code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$ , with $|\mathcal{C}|=M=:2^{nR}$ , and with each encoded message being drawn from the uniform distribution on $\{1,2,\ldots,M\}$ . The output distribution induced by the code $\mathcal{C}$ is

\tilde{P}_{Y^{n}|\mathcal{C}}(\cdot)=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(\cdot|X^{n}(i)).

We aim for $\tilde{P}_{Y^{n}|\mathcal{C}}$ to effectively cover the space of $Y^{n}\sim P_{Y}^{n}$ . In other words, we want the non-product distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ to asymptotically approximate the product distribution $P_{Y}^{n}$ under the criterion of the total variation $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ . Let $\mathcal{S}:=\{P_{X}:P_{X}W=P_{Y}\}$ be the set of all 1-shot input distributions whose outputs are $P_{Y}$ under the DMC $W_{Y|X}$ . According to the soft covering lemma, e.g., [moser2019advanced, Ch. 19], when $R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , we can achieve a good covering: there exists a code for which $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ vanishes exponentially fast. The achievable error exponent has been studied extensively in the literature both in the classical and the classical-quantum settings [hayashi2006general, parizi2016exact, yu2018renyi][cheng2023error][yassaee2019almost][yagli2019exact]. All of these studies employ a random coding strategy when investigating the decaying behavior of the total variation and relative entropy at high rates. However, it is not guaranteed that random coding yields a tight exponent; even if it does, it only establishes an achievability result. An important open problem is to develop a converse bound showing that no code can make the total variation decay too fast. In this work, we demonstrate the non-tightness of random coding and provide a converse bound that holds for all codes.

On the other hand, when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , we expect that the covering error $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ should approach $1$ exponentially fast, $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}=1-2^{-n\mathit{\Gamma}(\mathcal{C},n)}$ . We are particularly interested in the minimal achievable exponent:

\mathit{\Gamma}(R):=\liminf_{n\to\infty}\ \min_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\mathit{\Gamma}(\mathcal{C},n),

which characterizes the slowest convergence of $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ to one. It can be understood as an optimal performance among all the soft covering codes: given a low rate, how well a code can possibly behave to avoid poor covering. Therefore, for an arbitrary code $\mathcal{C}$ , we can deduce that $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\geq 1-2^{-n\mathit{\Gamma}(R)}$ . In this sense, we refer to $\mathit{\Gamma}(R)$ as the strong converse exponent, as it holds for all codes, not just for random codes.

The dual problem of the soft covering lemma is the packing lemma in channel coding, where a similar exponent, called the reliability function [shannon1959probability], has been thoroughly studied at rates both below [gallager1965simple, haroutunian1968estimates, sibson1969information, blahut1974hypothesis, csiszar1972class, arimoto1977information, augustin1978noisy, verdu2021error][gallager1968information, Ch. 5][csiszar2011information, Ch. 10] and above [arimoto1973converse, dueck1979reliability, oohama2015two, mosonyi2017strong] the capacity. The latter case is known as the strong converse of channel coding [wolfowitz1957coding, wolfowitz1960note, winter1999coding], as the corresponding exponent holds for all codes when the probability of error approaches one. However, in the soft covering, the strong converse exponent $\mathit{\Gamma}(R)$ has not been explored in the literature in terms of either lower or upper bounds. It may be noted that [cheng2023error] provides a lower bound on the performance of the random code ensemble at rates below mutual information. The works [watanabe2014strong] and [hayashi2025resolvability] prove the asymptotic strong converse for classical and classical-quantum channels but do not characterize the exponent.

In this work, we characterize the exact strong converse exponent $\mathit{\Gamma}(R)$ of the classical soft covering problem. It is stated in Theorem 17, and a novel two-parameter Rényi-type information quantity $J_{\alpha,\beta}(W_{Y|X}\|P_{Y})$ is introduced in the process. The converse is derived using a hypothesis testing perspective, and the achievability is established through a deterministic code construction technique combined with the type covering lemma. In addition, we provide a random coding achievability bound for the strong converse exponent in Theorem 18 that is expressed using the Rényi mutual information. By comparing the exact $\mathit{\Gamma}(R)$ , the proposed random coding achievability, and the random coding converse in [cheng2023error], we conclude that random coding fails to produce the tight exponent.

Building on the observation of non-tightness of random coding in the strong converse exponents, we provide a new characterization of the error exponents for $R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . In particular, we establish a new lower bound on the error exponents (see Theorem 20) in terms of Rényi entropy for noiseless channels using a new deterministic code construction. We also derive a new upper bound on the error exponents (see Theorem 19) that matches the lower bound until they both reach a slope of unity. The lower bound strictly outperforms random codes for noiseless channels. Furthermore, this lower bound can be extended to noisy channels (see Theorem 27), achieving a high-rate improvement over the random coding exponent given in [yagli2019exact, Thm. 1]. We also provide a new sphere-packing style upper bound on the error exponents of the noisy channels, given in Theorem 28.

While investigating the behavior of error exponents for noiseless channels, we uncover an interesting phenomenon: the conventional formulation that assumes a uniform distribution $1/M$ on messages yields different exponents depending on whether the components of the target distribution $P_{Y}$ are rational or irrational numbers. These discrepancies are characterized in Theorems 25 and 26. The reason for this is that the code-induced distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ takes values in multiples of $1/M$ , so soft covering essentially approximates $P_{Y}^{n}$ within a quantization grid of step size $1/M$ . If $P_{Y}^{n}$ is rational, a perfect covering with zero error is achievable at sufficiently high rates. In contrast, if $P_{Y}^{n}$ is irrational, a nonzero covering error may persist due to limitations imposed by Diophantine approximation [queffelec2013diophantine, Ch. 3]. To address this discrepancy, we propose a new formulation of the soft covering problem, in which the distribution of the messages is allowed to be non-uniform, and the covering rate is defined not by the number of messages but by the inverse of the smallest nonzero message probability. We call this formulation $H_{-\infty}$ -constrained soft covering (see Definition 11). We provide an exact characterization of the error exponents for this formulation without any discrepancy for all rates for noiseless channels (see Theorem 22). It aligns with that of the uniform distribution at low rates. In summary, the error exponents of the soft covering problems exhibit a very complex behavior.

This paper is organized as follows. Section II introduces useful definitions. All the main results are presented in Section III. Section IV provides a detailed proof of the characterization of the strong converse exponent. In Section V, we provide a detailed discussion of error exponents for noiseless channels under different formulations, and in Section VI we generalize the corresponding results to noisy channels.

II Preliminaries

II-A Notation

This paper applies the method of types, as well as basic notions from discrete probability theory and a range of information quantities. Basic properties of types can be found in many papers and textbooks, e.g. [csiszar2011information, Ch. 2] and [csiszar1998method]. Basic notations we will use are collected below.

Let $\mathcal{M}$ denote a finite message set, $\mathcal{X}$ and $\mathcal{Y}$ denote the finite input and output alphabets of a channel, respectively, and $\mathcal{P}(\mathcal{M})$ , $\mathcal{P}(\mathcal{X})$ , and $\mathcal{P}(\mathcal{X}\mathcal{Y})$ denote the sets of all probability distributions on $\mathcal{M}$ , $\mathcal{X}$ , and $\mathcal{X}\times\mathcal{Y}$ , respectively. Likewise, let $\mathcal{P}(\mathcal{Y}|\mathcal{X})$ denote the set of all conditional distributions on $\mathcal{Y}$ given $\mathcal{X}$ . For any block length $n$ , let $\mathcal{P}_{n}(\mathcal{X})$ and $\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ denote the sets of all $n$ -types and joint $n$ -types on $\mathcal{X}^{n}$ and $\mathcal{X}^{n}\times\mathcal{Y}^{n}$ , respectively, and let $\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$ denote the set of all conditional $n$ -types on $\mathcal{Y}^{n}$ given an input type $Q_{X}\in\mathcal{P}_{n}(\mathcal{X})$ . Denote by $\mathcal{T}_{Q}$ the type class corresponding to an $n$ -type $Q$ , and by $\mathcal{T}_{V}(x^{n})$ the conditional type class (or $V$ -shell) given $x^{n}\in\mathcal{T}_{Q_{X}}$ and a conditional type $V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$ . Let $H(V|Q_{X})$ and $I(Q_{X};V)$ denote the conditional entropy $H(Y|X)$ and the mutual information $I(X;Y)$ under the joint distribution $Q_{XY}=Q_{X}V_{Y|X}$ , respectively, and define $D(V\|W|Q_{X}):=D(Q_{X}V_{Y|X}\|Q_{X}W_{Y|X})$ . Let $\mathcal{S}:=\{P_{X}\in\mathcal{P}(\mathcal{X}):P_{X}W=P_{Y}\}$ denote the set of all input distributions whose induced output distribution under $W_{Y|X}$ is $P_{Y}$ . For a distribution $V$ , let $\mathrm{supp}(V)$ denote its support, and write $V\ll W$ if $\mathrm{supp}(V)\subseteq\mathrm{supp}(W)$ . Finally, define $|x|^{+}:=\max\{x,0\}$ .

Consider a joint distribution $Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$ with marginal distributions $Q_{X}\in\mathcal{P}(\mathcal{X})$ and $Q_{Y}\in\mathcal{P}(\mathcal{Y})$ . In this paper, we follow the notation that $Q_{XY}=Q_{X}V_{Y|X}=Q_{Y}\widebar{V}_{X|Y}$ , where $V_{Y|X}:=Q_{XY}/Q_{X}\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ and $\widebar{V}_{X|Y}=Q_{XY}/Q_{Y}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})$ are the forward and backward transition probabilities, respectively. In other words, the letters $Q$ and $V$ are consistently associated throughout this paper’s notations: whenever a joint distribution $Q_{XY}$ appears, the corresponding marginals $Q_{X},Q_{Y}$ and conditionals $V_{Y|X},\widebar{V}_{X|Y}$ are all implicitly defined and need not be restated. Moreover, we write $Q_{Y}=Q_{X}V$ .

In this work, the desired output distribution is denoted by $P_{Y}\in\mathcal{P}(\mathcal{Y})$ and the DMC used for covering is denoted by $W_{Y|X}\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ . Definitions of some useful information-theoretic quantities are listed below.

Definition 1 (Information density).

The information density is defined as

\iota_{X;Y}(x,y):=\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}.

Definition 2 (Expectation of the information density).

Given $Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$ , the expectation of the information density under $Q_{XY}$ is defined as

\iota(Q_{XY}):=\mathbb{E}_{Q_{XY}}[\iota_{X;Y}]=\begin{cases}\displaystyle{\sum_{x,y}Q_{XY}(x,y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}}&\text{if }V\ll W\\ -\infty&\text{otherwise}\end{cases}.

Lemma 3.

The following identity is easily verified, so we state it without proof.

D(V\|W|Q_{X})+\iota(Q_{XY})=D(Q_{X}V\|P_{Y})+I(Q_{X};V).
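The identity in Lemma 3 can be checked numerically. The sketch below is illustrative only: the function name `lemma3_sides` and the example distributions are assumptions, logarithms are taken to base 2, and the code assumes $V\ll W$ and a target $P_{Y}$ with full support.

```python
import math

def lemma3_sides(Q_X, V, W, P_Y):
    """Return (lhs, rhs) of D(V||W|Q_X) + iota(Q_XY) = D(Q_X V||P_Y) + I(Q_X;V), in bits.

    Q_X: list of input probabilities; V, W: row-stochastic matrices indexed [x][y];
    P_Y: target output distribution. Assumes V << W and P_Y(y) > 0 for all y.
    """
    nx, ny = len(Q_X), len(P_Y)
    # Output marginal Q_Y = Q_X V
    Q_Y = [sum(Q_X[x] * V[x][y] for x in range(nx)) for y in range(ny)]
    cond_div = info_density = mutual_info = 0.0
    for x in range(nx):
        for y in range(ny):
            q = Q_X[x] * V[x][y]
            if q == 0.0:
                continue  # zero-probability pairs contribute nothing
            cond_div += q * math.log2(V[x][y] / W[x][y])
            info_density += q * math.log2(W[x][y] / P_Y[y])
            mutual_info += q * math.log2(V[x][y] / Q_Y[y])
    out_div = sum(Q_Y[y] * math.log2(Q_Y[y] / P_Y[y]) for y in range(ny) if Q_Y[y] > 0)
    return cond_div + info_density, out_div + mutual_info

# Hypothetical example values (not taken from the paper)
lhs, rhs = lemma3_sides(
    Q_X=[0.3, 0.7],
    V=[[0.9, 0.1], [0.2, 0.8]],
    W=[[0.8, 0.2], [0.3, 0.7]],
    P_Y=[0.4, 0.6],
)
print(lhs, rhs)
```

Both sides reduce to $\sum_{x,y}Q_{XY}(x,y)\log\frac{V_{Y|X}(y|x)}{P_{Y}(y)}$, which is why the two printed values agree up to floating-point error.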

Definition 4 ( $\alpha$ -Rényi entropy).

Given $Q_{X}\in\mathcal{P}(\mathcal{X})$ , the $\alpha$ -Rényi entropy is defined as

H_{\alpha}(Q_{X}):=\frac{1}{1-\alpha}\log\left(\sum_{x}Q_{X}^{\alpha}(x)\right).

Remark 5.

$H_{-\infty}(Q_{X})=-\log\displaystyle{\min_{x\in\mathrm{supp}(Q_{X})}}Q_{X}(x)$ .
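As a numerical illustration of Definition 4 and Remark 5, the following sketch evaluates the Rényi entropy in bits and treats the orders $1$ and $\pm\infty$ as limits. The function name `renyi_entropy` is hypothetical, and restricting the sum to the support of $Q_{X}$ for all orders is an assumption made here so that negative orders are well defined, in line with the minimum over $\mathrm{supp}(Q_{X})$ in Remark 5.

```python
import math

def renyi_entropy(Q, alpha):
    """H_alpha(Q) in bits; alpha = 1 and alpha = +/-inf are handled as limits.

    The sum runs over the support of Q only (an assumption of this sketch).
    """
    support = [q for q in Q if q > 0]
    if alpha == 1:
        return -sum(q * math.log2(q) for q in support)  # Shannon entropy
    if alpha == -math.inf:
        return -math.log2(min(support))  # Remark 5
    if alpha == math.inf:
        return -math.log2(max(support))  # min-entropy
    return math.log2(sum(q ** alpha for q in support)) / (1.0 - alpha)

Q = [0.5, 0.25, 0.25, 0.0]
print(renyi_entropy(Q, -math.inf))  # -log2(0.25)
print(renyi_entropy(Q, -200))       # approaches the value above as alpha -> -inf
```

For very negative orders the smallest nonzero mass dominates the sum, which is the reason the limit depends only on $\min_{x\in\mathrm{supp}(Q_{X})}Q_{X}(x)$.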

Definition 6 ( $\alpha$ -Rényi divergence).

Given $Q_{X},P_{X}\in\mathcal{P}(\mathcal{X})$ , their $\alpha$ -Rényi divergence is defined as

D_{\alpha}(Q_{X}\|P_{X}):=\frac{1}{\alpha-1}\log\left(\sum_{x}Q_{X}^{\alpha}(x)P_{X}^{1-\alpha}(x)\right).

Definition 7 ( $\alpha$ -Rényi mutual information).

Given $P_{X}\in\mathcal{S}$ , the $\alpha$ -Rényi mutual information is defined as

\begin{aligned}
I_{\alpha}(P_{X};W_{Y|X})&:=\frac{\alpha}{\alpha-1}\log\sum_{y}\left(\sum_{x}P_{X}(x)W^{\alpha}_{Y|X}(y|x)\right)^{\frac{1}{\alpha}}\\
&=\frac{\alpha}{\alpha-1}\log\left(\mathbb{E}_{P_{Y}}\left[\mathbb{E}_{\widebar{W}_{X|Y}}^{1/\alpha}\left[2^{(\alpha-1)\iota_{X;Y}}|Y\right]\right]\right).
\end{aligned}

The second equality follows from [yagli2019exact, Rmk. 24][verdu2021error, Item 20], and $\widebar{W}_{X|Y}:=P_{X}W_{Y|X}/{P_{Y}}$ .

Remark 8.

If $W_{Y|X}$ is noiseless, i.e., for each $x\in\mathcal{X}$ there exists a symbol $y\in\mathcal{Y}$ such that $W_{Y|X}(y|x)=1$ , then $I_{\alpha}(P_{X};W_{Y|X})=H_{\frac{1}{\alpha}}(P_{X}W)$ [verdu2021error, Eq. (85)].
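The relation in Remark 8 can be illustrated numerically with the first expression in Definition 7. This is a minimal sketch under stated assumptions: the function names and the three-input, two-output noiseless channel are hypothetical examples, and logarithms are base 2.

```python
import math

def renyi_mutual_information(P_X, W, alpha):
    """First expression of Definition 7, in bits; W is indexed [x][y], alpha > 0, alpha != 1."""
    nx, ny = len(P_X), len(W[0])
    total = sum(
        sum(P_X[x] * W[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
        for y in range(ny)
    )
    return alpha / (alpha - 1.0) * math.log2(total)

def renyi_entropy(Q, order):
    """Renyi entropy of the given order (order != 1), in bits, summed over the support."""
    return math.log2(sum(q ** order for q in Q if q > 0)) / (1.0 - order)

# Hypothetical noiseless channel: inputs 0 and 2 map to output 0, input 1 maps to output 1
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
P_X = [0.2, 0.7, 0.1]
P_Y = [sum(P_X[x] * W[x][y] for x in range(3)) for y in range(2)]  # P_X W

for alpha in (0.5, 2.0):
    print(alpha, renyi_mutual_information(P_X, W, alpha), renyi_entropy(P_Y, 1.0 / alpha))
```

Because each row of a noiseless $W_{Y|X}$ is a point mass, $W^{\alpha}_{Y|X}=W_{Y|X}$ and the inner sum collapses to $(P_{X}W)(y)$, so the two printed columns coincide.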

II-B Notions of soft covering

The precise formulations of the soft covering problem are given in this subsection. We begin with two cases: uniform and non-uniform, depending on whether the message set $\mathcal{M}$ has a uniform distribution or not.

Definition 9 (Uniform soft covering).

An $(R,n,P_{Y},W_{Y|X})$ (uniform) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$ , where each message is drawn from the uniform distribution, and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$ . This soft-covering scheme induces the following distribution at the output of the DMC $W_{Y|X}$ :

\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}):=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(y^{n}|X^{n}(i)),\quad\forall y^{n}\in\mathcal{Y}^{n}.

(1)

The covering error is defined as the total variation between the achieved non-product output distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ and the product desired one $P_{Y}^{n}$ :

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}:=\frac{1}{2}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-P_{Y}^{n}(y^{n})\right|.
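For small block lengths, the induced distribution (1) and the covering error of Definition 9 can be evaluated by brute-force enumeration of $\mathcal{Y}^{n}$. The sketch below is only an illustration: the function names, the binary symmetric channel with crossover probability $0.1$, and the two example codes are assumptions, not constructions from this paper.

```python
import itertools
import math

def channel_n(W, xn, yn):
    """Memoryless extension W^n(y^n | x^n) of a single-letter channel W indexed [x][y]."""
    return math.prod(W[x][y] for x, y in zip(xn, yn))

def covering_error(code, W, P_Y, n):
    """Total variation between the code-induced output distribution (1) and P_Y^n."""
    M = len(code)
    tv = 0.0
    for yn in itertools.product(range(len(P_Y)), repeat=n):
        induced = sum(channel_n(W, xn, yn) for xn in code) / M  # uniform messages
        target = math.prod(P_Y[y] for y in yn)                  # product target
        tv += abs(induced - target)
    return 0.5 * tv

# Hypothetical example: BSC(0.1), uniform target, n = 2
W = [[0.9, 0.1], [0.1, 0.9]]
P_Y = [0.5, 0.5]
full_code = list(itertools.product(range(2), repeat=2))  # all four input sequences
print(covering_error(full_code, W, P_Y, 2))  # induced distribution is uniform
print(covering_error([(0, 0)], W, P_Y, 2))   # a single codeword covers poorly
```

The enumeration costs $|\mathcal{Y}|^{n}\cdot M\cdot n$ operations, so it is only meant for toy examples.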

Definition 10 (Non-uniform soft covering).

An $(R,n,P_{Y},W_{Y|X},q)$ (non-uniform) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$ , where messages are drawn from distribution $q\in\mathcal{P}(\mathcal{M})$ , and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$ . This soft-covering scheme induces the following distribution at the output of the DMC $W_{Y|X}$ :

\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n}):=\sum_{i=1}^{M}q(i)\ W_{Y|X}^{n}(y^{n}|X^{n}(i)),\quad\forall y^{n}\in\mathcal{Y}^{n}.

(2)

The covering error is defined as the total variation between the achieved non-product output distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ and the product desired one $P_{Y}^{n}$ :

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}:=\frac{1}{2}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n})-P_{Y}^{n}(y^{n})\right|.

(3)

Besides these two formulations, we introduce a variation based on the non-uniform case: in addition to assuming that the messages have a non-uniform distribution $q$ , we further impose an extra condition that the smallest nonzero probability among all messages is at least $1/M=:2^{-nR}$ . This condition can be equivalently written as

H_{-\infty}(q)=-\log\min_{i\in\mathcal{M}:q(i)>0}q(i)\leq nR,

(4)

according to Remark 5. Under this condition, the size of the message set need not be $M$ ; instead, $|\mathcal{M}|=|\mathrm{supp}(q)|\leq M$ . Moreover, given a rate $R$ , define

\mathcal{Q}(R,n):=\left\{q\in\mathcal{P}(\{1,\dots,M\}):H_{-\infty}(q)\leq nR\right\}

(5)

to be the set of all message distributions satisfying condition (4). Soft covering under this condition is referred to as the $H_{-\infty}$ -constrained formulation, and is given by the following definition.

Definition 11 ( $H_{-\infty}$ -constrained soft covering).

An $(R,n,P_{Y},W_{Y|X},q)_{-\infty}$ ( $H_{-\infty}$ -constrained) soft-covering scheme consists of a message set $\mathcal{M}$ , where messages are drawn from a distribution $q\in\mathcal{Q}(R,n)$ , and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(|\mathcal{M}|)\}$ . This soft-covering scheme induces the distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ at the output of the DMC $W_{Y|X}$ , given by the same expression as in (2), and the covering error is defined as in (3).
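Condition (4) amounts to a simple test on the smallest nonzero message probability. The following sketch checks membership in $\mathcal{Q}(R,n)$; the function name `in_Q` and the example distributions are hypothetical, and a small tolerance is added as an assumption to absorb floating-point rounding.

```python
import math

def in_Q(q, R, n, tol=1e-12):
    """Check H_{-inf}(q) <= nR, i.e. the smallest nonzero mass of q is at least 2^{-nR}."""
    smallest = min(p for p in q if p > 0)
    return -math.log2(smallest) <= n * R + tol

n, R = 2, 1.0  # so M = 2^{nR} = 4 and the threshold is 1/4
print(in_Q([0.25, 0.25, 0.25, 0.25], R, n))  # uniform messages always qualify
print(in_Q([0.5, 0.25, 0.25, 0.0], R, n))    # non-uniform, support of size 3
print(in_Q([0.7, 0.2, 0.1, 0.0], R, n))      # smallest mass 0.1 < 1/4
```

Since every nonzero mass is at least $1/M$ and the masses sum to one, any $q$ passing the test has at most $M$ messages in its support.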

The reason for this new formulation of the soft covering problem is that, when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , the conventional uniform formulation would lead to an inevitable discrepancy depending on whether $P_{Y}(y)$ ’s are rational or irrational. This discrepancy arises even in the simplest case, when the DMC $W_{Y|X}$ is noiseless, because in this case, the uniform formulation attempts to approximate $P_{Y}$ using only rational probabilities which are multiples of $1/M$ . Switching to the non-uniform formulation in Definition 10 can certainly eliminate this rational-irrational discrepancy, but alters the problem substantially and thereby shifts the error exponent relative to the uniform case. The $H_{-\infty}$ -constrained formulation in Definition 11, on the other hand, resolves the discrepancy while still sharing an overlapping error-exponent region with the uniform formulation. A detailed discussion of this point is presented in Section III-B and Section V.
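The quantization effect can be seen in a toy case. Assume, purely for illustration, a noiseless binary identity channel with $n=1$: a uniform code with $k$ codewords equal to symbol $0$ and $M-k$ equal to symbol $1$ induces $(k/M,1-k/M)$, so the best covering error for a target $(p,1-p)$ is $\min_{k}|k/M-p|$. The function name below is hypothetical, and the irrational target is represented by its floating-point approximation.

```python
import math

def best_uniform_error(p, M):
    """Smallest total variation between (p, 1-p) and any (k/M, 1-k/M), k = 0..M."""
    return min(abs(k / M - p) for k in range(M + 1))

# Rational target: zero error as soon as 1/M divides the grid finely enough
print(best_uniform_error(0.25, 4))

# Target with an irrational entry (float approximation of 1/sqrt(2)):
# the error stays positive for every M tried
p = 1 / math.sqrt(2)
for M in (4, 16, 64, 128):
    print(M, best_uniform_error(p, M))
```

Allowing non-uniform message probabilities removes the grid entirely in this toy case, since two messages with probabilities $(p,1-p)$ reproduce the target exactly.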

In information theory, proofs of achievability are commonly facilitated by the technique of random coding; accordingly, we also define the random-coding soft covering below. In this definition, the message set $\mathcal{M}$ is uniformly distributed.

Definition 12 (Random-coding soft covering).

An $(R,n,W_{Y|X})_{P_{X}}$ (random-coding) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$ , where each message is drawn from the uniform distribution, and a random code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$ , where each codeword $X^{n}(i)$ is generated independently according to i.i.d. $P_{X}$ , i.e., $X^{n}(i)\sim P_{X}^{n}$ for all $i\in\mathcal{M}$ . The induced output distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ is given by (1), and the covering error is defined as the expectation of the total variation between $\tilde{P}_{Y^{n}|\mathcal{C}}$ and $\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}$ :

\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}:=\frac{1}{2}\mathbb{E}_{\mathcal{C}}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})\right|,

where the expectation $\mathbb{E}_{\mathcal{C}}$ is taken with respect to i.i.d. $P_{X}$ .

Remark 13.

The notation $\mathcal{C}\sim q$ indicates that $\mathcal{C}$ is the code corresponding to a message set with distribution $q$ . If $q(i)=1/M$ is uniform, we omit the notation $\sim q$ , as in Definitions 9 and 12 above. This convention will be used throughout the paper: specifically, every occurrence of $\mathcal{C}\sim q$ or $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ refers to the non-uniform formulation (including Definitions 10 and 11), whereas $\tilde{P}_{Y^{n}|\mathcal{C}}$ refers to the uniform formulation (including Definitions 9 and 12), unless otherwise specified.

Next, we define the error exponent (denoted by $E$ ) and the strong converse exponent (denoted by $\mathit{\Gamma}$ ) for the above four formulations of the soft covering problem.

Definition 14 (Error exponent and strong converse exponent for soft covering).

(a)

For uniform soft-covering:

\begin{aligned}
E_{\mathrm{uni}}(R)&:=\limsup_{n\to\infty}\ \max_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\right)\right],\\
\mathit{\Gamma}_{\mathrm{uni}}(R)&:=\liminf_{n\to\infty}\ \min_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\right)\right].
\end{aligned}

(b)

For non-uniform soft-covering:

\begin{aligned}
E_{\mathrm{non}}(R)&:=\limsup_{n\to\infty}\ \max_{q\in\mathcal{P}(\{1,\dots,2^{nR}\})}\ \max_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],\\
\mathit{\Gamma}_{\mathrm{non}}(R)&:=\liminf_{n\to\infty}\ \min_{q\in\mathcal{P}(\{1,\dots,2^{nR}\})}\ \min_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],
\end{aligned}

where the notation $\mathcal{C}\sim q$ is defined in Remark 13.

(c)

For $H_{-\infty}$ -constrained soft covering:

\begin{aligned}
E_{\mathrm{ren}}(R)&:=\limsup_{n\to\infty}\ \max_{q\in\mathcal{Q}(R,n)}\ \max_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],\\
\mathit{\Gamma}_{\mathrm{ren}}(R)&:=\liminf_{n\to\infty}\ \min_{q\in\mathcal{Q}(R,n)}\ \min_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],
\end{aligned}

where $\mathcal{Q}(R,n)$ is defined in (5) and the notation $\mathcal{C}\sim q$ is defined in Remark 13.

(d)

For random-coding soft covering with codewords randomly drawn from i.i.d. $P_{X}$ :

\begin{aligned}
E_{\mathrm{rc}}(R,P_{X})&:=\limsup_{n\to\infty}\left[-\frac{1}{n}\log\left(\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}\right)\right],\\
\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})&:=\liminf_{n\to\infty}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}\right)\right].
\end{aligned}
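To make the optimization over codes in Definition 14(a) concrete, the sketch below evaluates the bracketed quantities at a fixed, very small block length by exhaustive search over all codes of size $M$. It says nothing about the limits in $n$; the function names and the binary symmetric channel example are assumptions used only for illustration.

```python
import itertools
import math

def tv_error(code, W, P_Y, n):
    """Covering error of a uniform code: total variation between induced output and P_Y^n."""
    tv = 0.0
    for yn in itertools.product(range(len(P_Y)), repeat=n):
        induced = sum(math.prod(W[x][y] for x, y in zip(xn, yn)) for xn in code) / len(code)
        tv += abs(induced - math.prod(P_Y[y] for y in yn))
    return 0.5 * tv

def finite_n_exponents(W, P_Y, n, M):
    """Best covering error over all size-M codes, with the two normalized exponents at this n."""
    sequences = list(itertools.product(range(len(W)), repeat=n))
    # Codes are unordered and may repeat codewords, hence combinations with replacement
    best_tv = min(
        tv_error(code, W, P_Y, n)
        for code in itertools.combinations_with_replacement(sequences, M)
    )
    E_n = math.inf if best_tv == 0 else -math.log2(best_tv) / n
    Gamma_n = math.inf if best_tv >= 1 else -math.log2(1.0 - best_tv) / n
    return best_tv, E_n, Gamma_n

# Hypothetical example: BSC(0.1), uniform target, n = 2
W = [[0.9, 0.1], [0.1, 0.9]]
P_Y = [0.5, 0.5]
for M in (1, 2, 4):
    print(M, finite_n_exponents(W, P_Y, 2, M))
```

Both the maximum in $E_{\mathrm{uni}}$ and the minimum in $\mathit{\Gamma}_{\mathrm{uni}}$ are attained by the code with the smallest covering error, which is why a single search suffices here. The search grows as $|\mathcal{X}|^{nM}$ and is not practical beyond toy sizes.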

Remark 15.

It is clear that $E_{\mathrm{uni}}(R)\leq E_{\mathrm{ren}}(R)\leq E_{\mathrm{non}}(R)$ and $\mathit{\Gamma}_{\mathrm{uni}}(R)\geq\mathit{\Gamma}_{\mathrm{ren}}(R)\geq\mathit{\Gamma}_{\mathrm{non}}(R)$ , since the $H_{-\infty}$ -constrained formulation is a special case of the non-uniform formulation, and the uniform formulation is a special case of the $H_{-\infty}$ -constrained formulation.

Remark 16.

It is shown in [yagli2019exact, Thm. 1] that

\begin{aligned}
E_{\mathrm{rc}}(R,P_{X})&=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{X}W_{Y|X})+\frac{1}{2}\big|R-D(Q_{XY}\|P_{X}Q_{Y})\big|^{+}\right]\\
&=\max_{\alpha\in[1,2]}\left\{\frac{\alpha-1}{\alpha}\left[R-I_{\alpha}(P_{X};W_{Y|X})\right]\right\},
\end{aligned}

where $I_{\alpha}(P_{X};W_{Y|X})$ is the $\alpha$ -Rényi mutual information, defined in Definition 7. Consequently, the optimal random coding error exponent is $E_{\mathrm{rc}}(R):=\max_{P_{X}\in\mathcal{S}}E_{\mathrm{rc}}(R,P_{X})$ . In particular, by Remark 8, when the DMC $W_{Y|X}$ is noiseless, $E_{\mathrm{rc}}(R)$ reduces to

\begin{aligned}
E_{\mathrm{rc}}^{\mathrm{nl}}(R)&=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\frac{1}{2}\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&=\max_{\alpha\in[1,2]}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{aligned}
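The Rényi-entropy form of $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ is straightforward to evaluate numerically. The sketch below approximates the maximization over $\alpha\in[1,2]$ by a grid search; the function names, the grid resolution, and the example target $P_{Y}=(0.3,0.7)$ are assumptions, and logarithms are base 2.

```python
import math

def renyi_entropy(P, order):
    """Renyi entropy of the given order in bits (order = 1 gives Shannon entropy)."""
    support = [p for p in P if p > 0]
    if order == 1:
        return -sum(p * math.log2(p) for p in support)
    return math.log2(sum(p ** order for p in support)) / (1.0 - order)

def noiseless_random_coding_exponent(R, P_Y, steps=1000):
    """Grid approximation of max over alpha in [1, 2] of (alpha-1)/alpha * [R - H_{1/alpha}(P_Y)]."""
    best = 0.0  # alpha = 1 contributes exactly zero
    for i in range(1, steps + 1):
        alpha = 1.0 + i / steps
        value = (alpha - 1.0) / alpha * (R - renyi_entropy(P_Y, 1.0 / alpha))
        best = max(best, value)
    return best

P_Y = [0.3, 0.7]
for R in (0.5, 1.0, 1.5, 2.0):
    print(R, noiseless_random_coding_exponent(R, P_Y))
```

For rates below $H(P_{Y})$ every term in the bracket is non-positive, so the search returns zero; for sufficiently large rates the maximum sits at $\alpha=2$ and the exponent grows with slope $1/2$ in $R$.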

III Main Results

This section summarizes the main results of this work, including the strong converse exponent in Section III-A and bounds for the error exponent in Sections III-B and III-C.

III-A Results on the strong converse exponent

To begin with, the following Theorem 17 characterizes the exact strong converse exponents of the soft covering problem in the uniform, non-uniform, and $H_{-\infty}$ -constrained formulations, and shows that they are all identical.

Theorem 17 (The exact strong converse exponent).

The exact strong converse exponents for the uniform, non-uniform, and the $H_{-\infty}$ -constrained soft-covering formulations coincide, and are given by

\mathit{\Gamma}_{\mathrm{uni}}(R)=\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}(R):=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right],

where

J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right):=-\log\left[\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\right].

(6)
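For a binary-input channel, the expression for $\mathit{\Gamma}(R)$ and the quantity (6) can be approximated by nested grid searches. The following is a rough sketch under several assumptions that are not part of the theorem: the input alphabet is binary, $W_{Y|X}$ has strictly positive entries, the point $\alpha=0$ is excluded from the grid to avoid the ratio $\beta/\alpha$ being undefined, and the grid resolutions and function names are arbitrary choices. Grid errors make the output an approximation only.

```python
import math

def J(alpha, beta, W, P_Y, q_steps=200):
    """Grid approximation of (6) for a binary-input channel W indexed [x][y], alpha > 0."""
    best = 0.0
    for i in range(q_steps + 1):
        q0 = i / q_steps  # candidate input distribution Q_X = (q0, 1 - q0)
        value = sum(
            P_Y[y] ** (1.0 - beta)
            * (q0 * W[0][y] ** (beta / alpha) + (1.0 - q0) * W[1][y] ** (beta / alpha)) ** alpha
            for y in range(len(P_Y))
        )
        best = max(best, value)
    return -math.log2(best)

def strong_converse_exponent(R, W, P_Y, steps=20):
    """Grid approximation of the maximum over 0 < alpha <= 1, 0 <= beta <= alpha."""
    best = -math.inf
    for a in range(1, steps + 1):
        alpha = a / steps
        for b in range(a + 1):
            beta = b / steps
            best = max(best, J(alpha, beta, W, P_Y) + (beta - alpha) * R)
    return best

# Hypothetical example: P_Y is the output of W under the uniform input
W = [[0.9, 0.1], [0.2, 0.8]]
P_Y = [0.55, 0.45]
for R in (0.0, 0.1, 0.3, 1.0):
    print(R, strong_converse_exponent(R, W, P_Y))
```

Each term $J_{\alpha,\beta}+(\beta-\alpha)R$ is non-increasing in $R$ because $\beta\leq\alpha$, so the approximated exponent is non-increasing as well, and the choice $\alpha=\beta=1$ shows that it is never negative.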

Theorem 17 is proved in Section IV-C. The quantity $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$ , defined in (6), involves two parameters, which is uncommon in the literature of error exponents. Similar exponents taking two parameters have appeared in the context of lossy source coding, e.g., [blahut1974hypothesis, jitsumatsu2025computation]; however, lossy source coding and soft covering address fundamentally different problems. The former employs a distortion function as a criterion for loss and directly evaluates the probability of error, whereas the latter centers on a forward channel and quantifies the error through a probabilistic divergence. Although the notion of covering is intrinsically revealed in source coding, the two settings are structurally distinct. Typically, when a channel is involved in the problem formulation, the error exponent takes the form of $\max_{\alpha}\frac{1-\alpha}{\alpha}\left(I_{\alpha}-R\right)$ , where $I_{\alpha}$ is the Rényi mutual information with respect to the given channel, with different domains of maximization over $\alpha$ . This form is observed in both packing [gallager1965simple][sibson1969information][arimoto1973converse][arimoto1977information][augustin1978noisy][csiszar1995generalized][verdu2015alpha][verdu2021error] and covering (when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ ) [hayashi2006general][parizi2016exact][yassaee2019almost][yagli2019exact] problems. In this work, we give an achievability bound for random codes that exactly takes this form, as formally stated in Theorem 18.

Theorem 18 (A random coding achievability bound).

The strong converse exponent for the random coding formulation has the following upper bound.

\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})\leq\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}):=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{X}W_{Y|X})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right]

(7)

=\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[I_{\alpha}(P_{X};W_{Y|X})-R\right]\right\},

(8)

where $I_{\alpha}(P_{X};W_{Y|X})$ is the $\alpha$ -Rényi mutual information, defined in Definition 7.

See Appendix A for a proof of Theorem 18. On the other hand, a lower bound for $\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})$ is given in [cheng2023error, Thm. 3]:

\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})\geq\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}):=\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[D_{2-\frac{1}{\alpha}}\left(P_{X}W_{Y|X}\|P_{X}P_{Y}\right)-R\right]\right\},

where $D_{2-\frac{1}{\alpha}}\left(P_{X}W_{Y|X}\|P_{X}P_{Y}\right)$ is the Rényi divergence in Definition 6. Hence, defining

\mathit{\Gamma}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X}),

it can be further bounded via

\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)\leq\mathit{\Gamma}_{\mathrm{rc}}(R)\leq\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)

(9)

with

\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}),\quad\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}).
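As a numerical illustration, the lower bound $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X})$ can be approximated by a grid search over $\alpha\in[\frac{1}{2},1]$. The following Python sketch rests on several assumptions that are not taken from the text: the Rényi divergence of Definition 6 is assumed to be the standard one, $D_{\beta}(P\|Q)=\frac{1}{\beta-1}\log_{2}\sum P^{\beta}Q^{1-\beta}$ in bits; a single fixed $P_{X}$ is used instead of the minimization over $\mathcal{S}$; and the function names and the binary symmetric channel are hypothetical choices made for illustration only.

```python
import math

def renyi_divergence_bits(p, q, order):
    """Standard Renyi divergence D_order(p || q) in bits (assumed definition).

    order = 1 is treated as the Kullback-Leibler divergence.
    """
    if abs(order - 1.0) < 1e-12:
        return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)
    total = sum((a ** order) * (b ** (1.0 - order)) for a, b in zip(p, q) if a > 0)
    return math.log2(total) / (order - 1.0)

def rc_lower_bound(p_x, w, rate, steps=2000):
    """Grid approximation of
    max_{alpha in [1/2, 1]} (1 - alpha)/alpha * (D_{2 - 1/alpha}(P_X W || P_X P_Y) - rate)
    for one fixed input distribution p_x and channel matrix w[x][y]."""
    n_x, n_y = len(p_x), len(w[0])
    p_y = [sum(p_x[x] * w[x][y] for x in range(n_x)) for y in range(n_y)]
    joint = [p_x[x] * w[x][y] for x in range(n_x) for y in range(n_y)]
    product = [p_x[x] * p_y[y] for x in range(n_x) for y in range(n_y)]
    best = 0.0  # alpha = 1 contributes exactly zero
    for k in range(steps):
        alpha = 0.5 + 0.5 * k / steps          # alpha in [1/2, 1)
        order = 2.0 - 1.0 / alpha              # Renyi order in [0, 1)
        value = (1.0 - alpha) / alpha * (
            renyi_divergence_bits(joint, product, order) - rate
        )
        best = max(best, value)
    return best

# Hypothetical example: binary symmetric channel with crossover 0.1, uniform input.
bsc = [[0.9, 0.1], [0.1, 0.9]]
print(rc_lower_bound([0.5, 0.5], bsc, 0.2))
print(rc_lower_bound([0.5, 0.5], bsc, 0.6))
```

For this example channel the mutual information is about 0.531 bits, so the sketch returns zero at rate 0.6 and a strictly positive value at rate 0.2, in line with the bound being informative only below the mutual information.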

It is also noteworthy that $\mathit{\Gamma}_{\mathrm{rc}}(R)$ provides an achievability bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$; specifically, $\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\mathit{\Gamma}_{\mathrm{rc}}(R)$. Examples of the exact exponent $\mathit{\Gamma}(R)$ (in Theorem 17) and the aforementioned random coding bounds $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ and $\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ are illustrated in Figure 1, where the matrix entry $W(x,y)$ represents $W_{Y|X}(y|x)$. These examples include channels with fully noisy, fully noiseless, and hybrid input symbols (i.e., a mix of noisy and noiseless inputs). Figure 1 shows that random coding is not tight in general in the converse regime of the soft covering problem: due to (9), the random-coding exponent $\mathit{\Gamma}_{\mathrm{rc}}(R)$ must lie between $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ and $\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$, while the exact exponent $\mathit{\Gamma}(R)$ is lower and hence tighter.

Figure 1: (a) A fully noisy binary channel.

III-B Results on error exponents for noiseless channels

The previous subsection reveals an intriguing phenomenon: when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , random coding fails to achieve a tight strong converse exponent. A natural question is whether such non-tightness also arises in the error exponent regime for $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . In this subsection, we show that the answer is yes when $W_{Y|X}$ is noiseless. The following Theorems 19, 20, 21, and 22 summarize our results on error exponents for noiseless channels, where exponents for different formulations are defined in Definition 14, with the superscript ‘nl’ indicating ‘noiseless’.

Theorem 19 (Converse for noiseless channels).

For noiseless channels under the uniform formulation, we have the following upper bound on $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$ .

\begin{align*}
E_{\mathrm{uni}}^{\mathrm{nl}}(R)\leq\overline{E}^{\mathrm{nl}}(R)
&:=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})\\
&=\max_{\alpha\in(-\infty,0)\cup[1,\infty)}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{align*}

Theorem 20 (Achievability for noiseless channels).

For noiseless channels under the uniform formulation, we have the following lower bound on $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$ .

\begin{align*}
E_{\mathrm{uni}}^{\mathrm{nl}}(R)\geq\underline{E}^{\mathrm{nl}}(R)
&:=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&=\max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{align*}

Theorem 21 (Error exponent for non-uniform case).

For noiseless channels under the non-uniform formulation, the exact error exponent is given by

E_{\mathrm{non}}^{\mathrm{nl}}(R)=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})=\max_{\alpha\geq 1}\left\{(\alpha-1)\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.

Theorem 22 (Error exponent for $H_{-\infty}$ -constrained case).

For noiseless channels under the $H_{-\infty}$ -constrained formulation, the exact error exponent is given by $E_{\mathrm{ren}}^{\mathrm{nl}}(R)=\overline{E}^{\mathrm{nl}}(R),$ where $\overline{E}^{\mathrm{nl}}(R)$ is defined in Theorem 19.

Theorem 19 and Theorem 20 are proved in Section V-A; Theorem 21 is proved in Section V-C; and Theorem 22 is proved in Section V-D. Figure 2 presents plots of these proposed bounds or exponents as well as the random coding exponent $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ (see Remark 16) using a binary and a ternary target output distribution, respectively. Under the non-uniform and the $H_{-\infty}$-constrained formulations, an exact error exponent is established. Under the uniform formulation, the error exponent $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$ lies between the two bounds $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$. However, these two bounds coincide in the vicinity of $R=H(P_{Y})$, as made precise by the following proposition.

Proposition 23.

Let $R_{\alpha}^{s}$ be the value where the function $\overline{E}^{\mathrm{nl}}(\cdot)$ has a tangent line of slope $\alpha$ . Then

\begin{align*}
\underline{E}^{\mathrm{nl}}(R)&=\begin{cases}\overline{E}^{\mathrm{nl}}(R)&\text{if }0\leq R\leq R_{1}^{s},\\ \overline{E}^{\mathrm{nl}}(R_{1}^{s})+R-R_{1}^{s}&\text{if }R>R_{1}^{s},\end{cases}\\
E_{\mathrm{rc}}^{\mathrm{nl}}(R)&=\begin{cases}\overline{E}^{\mathrm{nl}}(R)&\text{if }0\leq R\leq R_{1/2}^{s},\\ \overline{E}^{\mathrm{nl}}(R_{1/2}^{s})+\dfrac{1}{2}\left(R-R_{1/2}^{s}\right)&\text{if }R>R_{1/2}^{s}.\end{cases}
\end{align*}

Here, $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ denotes the random-coding error exponent for noiseless channels and is given in Remark 16.

Proposition 23 is proved in Appendix D-b. In summary, for $R\in[H(P_{Y}),R_{1/2}^{s}]$ , all three curves – $\overline{E}^{\mathrm{nl}}(R)$ , $\underline{E}^{\mathrm{nl}}(R)$ , and $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ – overlap, implying that random coding is tight. For $R\in[R_{1/2}^{s},R_{1}^{s}]$ , random coding is not tight; nevertheless, $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ still coincide, and an exact exponent exists via the construction of a deterministic code that is provided in the proof of Theorem 20. For $R>R_{1}^{s}$ , $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ no longer coincide. It is also noteworthy that $\overline{E}^{\mathrm{nl}}(R)$ diverges when $R>H_{-\infty}(P_{Y})$ ; see Lemma 38.
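The regimes described above can be explored numerically. The Python sketch below evaluates the dual forms of Theorem 19 and Theorem 20 by a grid search after the substitution $\lambda=(\alpha-1)/\alpha$, under which $\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]=\lambda R-\log_{2}\sum_{y}P_{Y}(y)^{1-\lambda}$. This relies on the assumption that $H_{\beta}$ is the standard Rényi entropy $\frac{1}{1-\beta}\log_{2}\sum_{y}P_{Y}(y)^{\beta}$, extended to negative orders; the variable $\lambda$, the function names, the truncation `lam_max`, and the binary example are all hypothetical and introduced only for illustration.

```python
import math

def tilted_value(p_y, rate, lam):
    # Equals lam * (rate - H_{1 - lam}(P_Y)) in bits, with lam = (alpha - 1) / alpha
    # and H_beta(P) = log2(sum_y P(y)^beta) / (1 - beta) (assumed standard definition).
    return lam * rate - math.log2(sum(p ** (1.0 - lam) for p in p_y if p > 0))

def dual_bound(p_y, rate, lam_max, steps=20000):
    # Grid search over lam in [0, lam_max]; lam = 0 contributes the value 0.
    return max(tilted_value(p_y, rate, lam_max * k / steps) for k in range(steps + 1))

def lower_bound_nl(p_y, rate):
    # alpha >= 1 corresponds to lam in [0, 1); lam = 1 is the limit alpha -> infinity.
    return dual_bound(p_y, rate, 1.0)

def upper_bound_nl(p_y, rate, lam_max=50.0):
    # alpha in (-inf, 0) U [1, inf) corresponds to lam >= 0. The search is truncated
    # at lam_max, so the result is meaningful only if the maximizer lies below it.
    return dual_bound(p_y, rate, lam_max)

def binary_primal_upper(p, rate):
    # Closed form of min D(Q||P) subject to D(Q||P) + H(Q) >= rate
    # for P = (p, 1 - p) with 0 < p < 1/2.
    low, high = -math.log2(1.0 - p), -math.log2(p)
    shannon = p * high + (1.0 - p) * low
    if rate <= shannon:
        return 0.0
    if rate > high:
        return math.inf
    q = (rate - low) / (high - low)  # the constraint is linear in Q(first symbol)
    terms = [(q, p), (1.0 - q, 1.0 - p)]
    return sum(a * math.log2(a / b) for a, b in terms if a > 0)

p_y = [0.2, 0.8]
for rate in (1.0, 2.0):
    print(rate, lower_bound_nl(p_y, rate), upper_bound_nl(p_y, rate),
          binary_primal_upper(0.2, rate))
```

For this hypothetical $P_{Y}=(0.2,0.8)$, the two bounds agree at $R=1$ and separate at $R=2$, where the lower bound has already entered its slope-one regime.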

Assuming uniform messages is conventional in many problems in information theory. Nonetheless, in the soft covering problem, adhering to this uniform formulation highlights an interesting discrepancy between rational and irrational output distributions for noiseless channels. This is because $\tilde{P}_{Y^{n}|\mathcal{C}}$ in (1) can only take rational values as it is a multiple of $1/M$ . If $P_{Y}$ is irrational in some symbols, then approximating $P_{Y}^{n}$ by $\tilde{P}_{Y^{n}|\mathcal{C}}$ amounts to a Diophantine approximation, which necessarily incurs nonzero errors for arbitrarily large $M$ according to studies in number theory. We summarize the rational-irrational discrepancy in Theorem 25 and Theorem 26.

Definition 24.

Given a distribution $P_{Y}\in\mathcal{P}(\mathcal{Y})$ , we say that $P_{Y}$ is rational, denoted $P_{Y}\in\mathbb{Q}^{|\mathcal{Y}|}$ , if $P_{Y}(y)\in\mathbb{Q}$ for all $y\in\mathcal{Y}$ . Otherwise, we say that $P_{Y}$ is irrational, denoted $P_{Y}\notin\mathbb{Q}^{|\mathcal{Y}|}$ .

Theorem 25 (Linear converse for irrational $P_{Y}$ ).

Suppose $W_{Y|X}$ is noiseless. Then for every $\epsilon>0$ and almost every $P_{Y}\notin\mathbb{Q}^{|\mathcal{Y}|}$ (in the sense of Lebesgue measure on $[0,1]^{|\mathcal{Y}|}$ ), we have $E_{\mathrm{uni}}^{\mathrm{nl}}(R)\leq(2+\epsilon)R$ .

Theorem 26 (Infinite achievability for rational $P_{Y}$ ).

Suppose $W_{Y|X}$ is noiseless and $P_{Y}\in\mathbb{Q}^{|\mathcal{Y}|}$ , say, $P_{Y}(y)=\frac{A_{y}}{B_{y}}$ with coprime $A_{y},B_{y}\in\mathbb{Z}$ for $y\in\mathcal{Y}$ . Then $E_{\mathrm{uni}}^{\mathrm{nl}}(R)=\infty$ when $R\geq\log(\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}}))$ , where $\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}})$ refers to the least common multiple of $B_{y}$ ’s among all $y\in\mathcal{Y}$ .

Theorem 25 and Theorem 26 are proved in Section V-B. Evidently, the discrepancy concerns whether an infinite exponent, corresponding to perfect covering with no error, is achievable. For rational $P_{Y}$’s, this is achievable at high rates, whereas for most irrational $P_{Y}$’s, it is not. It is noteworthy that the linear converse in Theorem 25 applies to most irrational probabilities, which are dense in $[0,1]$. However, it may not be possible to claim a universal converse for an arbitrary irrational $P_{Y}$. Let $\widebar{y}\in\mathcal{Y}$ be an irrational symbol, i.e., $P_{Y}(\widebar{y})\notin\mathbb{Q}$. If $P_{Y}(\widebar{y})$ is an irrational algebraic number, then the linear converse of $(2+\epsilon)R$ holds, by virtue of the Thue-Siegel-Roth theorem [queffelec2013diophantine, Thm. 3.1.4]. On the other hand, $P_{Y}(\widebar{y})$ can also be a ‘good’ irrational number (i.e., one that is well approximable by rationals), known as a Liouville number [queffelec2013diophantine, Def. 3.1.8], which admits infinitely many integer pairs $K,M\in\mathbb{Z}$ such that $\left|K/M-P_{Y}(\widebar{y})\right|\leq M^{-(2+\epsilon)}$; consequently, it is not clear how to establish a linear converse in that case.
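To make the rational case of Theorem 26 concrete, the Python sketch below treats block length $n=1$ over a noiseless channel: with $M=\mathrm{lcm}(\{B_{y}\})$ messages, repeating each symbol $y$ exactly $M\cdot P_{Y}(y)$ times yields a uniform-message codebook whose induced output distribution equals $P_{Y}$ exactly. The function name and the example distribution are hypothetical, and this is only an illustration of why the threshold $\log(\mathrm{lcm}(\{B_{y}\}))$ appears, not the construction used in the proof.

```python
from collections import Counter
from fractions import Fraction
from math import lcm, log2

def exact_uniform_codebook(p_y):
    """Return (M, codebook) for a rational distribution p_y (symbol -> Fraction).

    The codebook is a list of M single-letter codewords; drawing a message
    uniformly at random reproduces p_y exactly over a noiseless channel.
    """
    m = lcm(*(f.denominator for f in p_y.values()))
    codebook = []
    for symbol, prob in p_y.items():
        codebook.extend([symbol] * int(prob * m))  # prob * m is an integer here
    return m, codebook

# Hypothetical example distribution.
p_y = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
m, codebook = exact_uniform_codebook(p_y)
induced = {y: Fraction(c, m) for y, c in Counter(codebook).items()}
print(m, log2(m), induced == p_y)
```

In this example $M=6$, so a rate of $\log_{2}6$ bits per symbol already suffices for zero total variation at $n=1$.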

To eliminate such a discrepancy, $\tilde{P}_{Y^{n}|\mathcal{C}}$ must be allowed to also take irrational values; hence, messages cannot be uniformly distributed. Switching from the uniform to the non-uniform formulation indeed removes this discrepancy and yields an exact exponent. However, the error exponent $E_{\mathrm{non}}^{\mathrm{nl}}(R)$ under the non-uniform formulation deviates from both $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ for all $R$ , including in the low-rate regime near $H(P_{Y})$ . In fact, for noiseless channels, soft covering under the non-uniform formulation is equivalent to lossless source coding (see Lemma 43 in Section V-C). In order to remove the rational-irrational discrepancy without substantially altering the feature of soft covering, we proposed the $H_{-\infty}$ -constrained formulation in Definition 11. Its exponent $E_{\mathrm{ren}}^{\mathrm{nl}}(R)$ is exact, and is identical to $\overline{E}^{\mathrm{nl}}(R)$ by Theorem 22. Thus, in the neighborhood of $H(P_{Y})$ , this new formulation aligns with the uniform formulation.

III-C Results on error exponents for noisy channels

More generally, in this subsection, we consider noisy channels. Note that the achievability bound $\underline{E}^{\mathrm{nl}}(R)$ in Theorem 20 becomes a straight line of slope 1 for large $R$ (see Proposition 23). Consequently, at high rates, this noiseless achievability can exceed the random-coding exponent $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$, whose slope is $1/2$, thus demonstrating a high-rate improvement over random coding under the uniform formulation for noisy channels. This noisy achievability is summarized in Theorem 27. Furthermore, we provide a general converse bound for noisy channels under the non-uniform formulation in Theorem 28.

Theorem 27 (High-rate achievability).

For noisy channels under the uniform formulation, we have the following lower bound on $E_{\mathrm{uni}}(R)$ .

\begin{align*}
E_{\mathrm{uni}}(R)\geq\underline{E}(R)
&:=\max_{P_{X}\in\mathcal{S}}\ \min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\left\{D(Q_{X}\|P_{X})+\big|R-D(Q_{X}\|P_{X})-H(Q_{X})\big|^{+}\right\}\\
&=\max_{P_{X}\in\mathcal{S}}\ \max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{X})\right]\right\}.
\end{align*}

Theorem 28 (Converse).

For noisy channels under the non-uniform formulation, we have the following upper bound on $E_{\mathrm{non}}(R)$ .

\begin{align*}
E_{\mathrm{non}}(R)\leq\overline{E}(R)
&:=\max_{P_{X}\in\mathcal{S}}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X})\\
&=\max_{P_{X}\in\mathcal{S}}\ \max_{\alpha\geq 0}\left[\alpha\left(R-\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})\right)\right].
\end{align*}

Theorem 27 is proved in Section VI-A and Theorem 28 is proved in Section VI-B. Figure 3 exhibits examples of these two proposed bounds as well as the random coding error exponent $E_{\mathrm{rc}}(R)$ in Remark 16. Immediately following Theorem 26 and Theorem 27, we obtain Corollary 29 below, which demonstrates an achievable infinite error exponent for noisy channels, provided that there exists a rational input distribution $P_{X}\in\mathcal{S}$ and the rate is sufficiently large.

Corollary 29 (Infinite achievability for noisy channels).

Suppose some $P_{X}\in\mathcal{S}$ is rational, say, $P_{X}(x)=\frac{A_{x}}{B_{x}}$ with coprime $A_{x},B_{x}\in\mathbb{Z}$ for $x\in\mathcal{X}$ . Then $E_{\mathrm{uni}}(R)=\infty$ when $R\geq\log(\mathrm{lcm}(\{B_{x}\}_{x\in\mathcal{X}}))$ .

IV Strong Converse Exponent

In this section, we prove Theorem 17. The sketch of the proof is as follows. In Section IV-A, we establish a lower bound for $\mathit{\Gamma}_{\mathrm{non}}(R)$, serving as a converse result, while Section IV-B presents an achievability result that provides an upper bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$. Both bounds are expressed in variational forms as optimizations over joint distributions. We convert them into dual representations in Section IV-C and show that they coincide when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. Since $\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)$, this coincidence characterizes an exact exponent for both uniform and non-uniform formulations. Additionally, in Appendix C, we provide an alternative proof of the converse part of Theorem 17, for the uniform formulation only, following Arimoto’s techniques in [arimoto1973converse].

IV-A A lower bound for the strong converse exponent

In the following, we show a lower bound for $\mathit{\Gamma}_{\mathrm{non}}(R)$ for the non-uniform formulation, denoted by $\underline{\mathit{\Gamma}}(R)$ .

Proposition 30 (Converse).

The strong converse exponent $\mathit{\Gamma}_{\mathrm{non}}(R)$ in the non-uniform formulation can be bounded from below as follows:

\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R):=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max\big\{s,\mathit{\Gamma}(s,Q_{X},R)\big\},

where

\mathit{\Gamma}(s,Q_{X},R):=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right].

(10)

Proof.

We take a hypothesis testing perspective and apply the following well-known property of the total variation (e.g., [moser2019advanced, Def. 2.24]) in terms of decision regions:

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\left|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|\quad\forall\mathcal{A}\subseteq\mathcal{Y}^{n}.

(11)
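Property (11) can be checked by brute force on a small alphabet. The Python sketch below uses two arbitrary example distributions (hypothetical, not taken from the paper) and confirms that no event $\mathcal{A}$ separates them by more than half the $\ell_{1}$ distance, with equality attained by the best event.

```python
from itertools import combinations

def total_variation(p, q):
    # Half the l1 distance between two distributions on the same finite alphabet.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def best_event_gap(p, q):
    # Maximum of |P(A) - Q(A)| over all subsets A of the alphabet (brute force).
    indices = range(len(p))
    best = 0.0
    for size in range(len(p) + 1):
        for event in combinations(indices, size):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
print(total_variation(p, q), best_event_gap(p, q))
```

The proof only needs the inequality direction: any particular choice of $\mathcal{A}$, such as the union of $V$-shells constructed next, gives a valid lower bound on the total variation.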

We take a collection of conditional types around each codeword to create a set $\mathcal{A}$ that could be used to discriminate $\tilde{P}_{Y^{n}|\mathcal{C}}$ from $P_{Y}^{n}$ , as follows:

\mathcal{A}=\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i)).

That is, we restrict to the output type classes whose types are of the form $Q_{Y}=Q_{X}V$, where $Q_{X}$ is the type of some codeword and $V\in\mathcal{V}_{n}(Q_{X})$. Here $\mathcal{V}_{n}(Q_{X})$ can be any subset of $\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$. Later, we will choose an appropriate subset to yield the optimal bound. When a codeword $X^{n}(i)$ is fixed, so is its type. Then the $V$-shells $\mathcal{T}_{V}(X^{n}(i))$ for different $V$ must be disjoint. We therefore write the disjoint union $\bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}$ in $\mathcal{A}$.

Consider a codeword $X^{n}(i)$ whose type is $Q_{X}$ , i.e., $X^{n}(i)\in\mathcal{T}_{Q_{X}}$ for some $Q_{X}\in\mathcal{P}_{n}(\mathcal{X})$ . We have

\begin{align*}
W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))
&=W_{Y|X}^{n}\left(\bigcup_{Q_{X}^{\prime}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{j:X^{n}(j)\in\mathcal{T}_{Q_{X}^{\prime}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X}^{\prime})}\mathcal{T}_{V}(X^{n}(j))\bigg|X^{n}(i)\right)\\
&\overset{a}{\geq}W_{Y|X}^{n}\left(\bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\\
&\overset{b}{=}1-W_{Y|X}^{n}\left(\bigsqcup_{V\notin\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\\
&\overset{c}{\geq}1-\sum_{V\notin\mathcal{V}_{n}(Q_{X})}2^{-nD(V\|W|Q_{X})}\\
&\overset{d}{\geq}1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\max_{V\notin\mathcal{V}_{n}(Q_{X})}2^{-nD(V\|W|Q_{X})}\\
&=1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\},
\end{align*}

where in $(a)$ we restrict to the subset in which $j=i$ ; $(b)$ follows from the fact that $\bigsqcup_{V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X})}\mathcal{T}_{V}(X^{n}(i))=\mathcal{Y}^{n}$ ; $(c)$ and $(d)$ follow from properties of types. By averaging over all codewords, we obtain that

\begin{align*}
\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})
&=\sum_{i=1}^{M}q(i)\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq 1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&\overset{a}{\geq}1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&\overset{b}{\geq}1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&=1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\exp_{2}\left\{-n\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\},
\end{align*}

where $(a)$ follows from a property of types (the number of types in $\mathcal{P}_{n}(\mathcal{X})$ is at most $(n+1)^{|\mathcal{X}|}$) and $(b)$ from $\sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\leq 1$. On the other hand,

\begin{align*}
P_{Y}^{n}(\mathcal{A})
&\overset{a}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\mathcal{A}\cap\mathcal{T}_{Q_{Y}}\right|\ 2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\left(\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\right)\bigcap\mathcal{T}_{Q_{Y}}\right|2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{Y}}\right|2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&\overset{b}{\leq}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\left|\bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{X}V}\right|2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&\leq\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\min\left\{\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\big|\mathcal{T}_{V}(X^{n}(i))\big|,\big|\mathcal{T}_{Q_{X}V}\big|\right\}2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&\overset{c}{\leq}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\min\left\{M\cdot 2^{nH(V|Q_{X})},2^{nH(Q_{X}V)}\right\}2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&=\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}2^{-n\left[D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}\right]}\\
&\overset{d}{\leq}(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \max_{V\in\mathcal{V}_{n}(Q_{X})}2^{-n\left[D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}\right]}\\
&=(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\exp_{2}\left\{-n\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\},
\end{align*}

where $(a)$ follows from a property of types; in $(b)$, the summation $\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}$ disappears because the non-emptiness of $\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{Y}}$ implies that $Q_{Y}$ can only take the unique value $Q_{Y}=Q_{X}V$; $(c)$ and $(d)$ follow from properties of types. Inserting the above expressions into (11), we obtain that

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\geq 1-2(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}2^{-n\underline{\mathit{\Gamma}}(R,n)},

where the exponent is

\begin{align*}
\underline{\mathit{\Gamma}}(R,n)
&=\min\left\{\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X}),\ \min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}\\
&=\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min\left\{\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X}),\ \min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}.
\end{align*}
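The properties of types invoked in the steps above can be sanity-checked in the simplest setting. The Python sketch below, for a binary alphabet and a hypothetical source parameter, verifies two of them for every type $Q$ of length $n$: each sequence of type $Q$ has probability exactly $2^{-n[D(Q\|P)+H(Q)]}$ under $P^{n}$, and the type class contains at most $2^{nH(Q)}$ sequences. It covers only the unconditional binary case, not the conditional $V$-shell bounds used in the proof, and the function names are illustrative.

```python
import math

def binary_entropy_bits(q):
    # Shannon entropy of (q, 1 - q) in bits, with the convention 0 log 0 = 0.
    return -sum(t * math.log2(t) for t in (q, 1.0 - q) if t > 0)

def check_binary_type_properties(p, n):
    """Check, for every binary type Q = (k/n, 1 - k/n) with P = (p, 1 - p), that
    (i) a sequence of type Q has probability 2^{-n[D(Q||P) + H(Q)]} under P^n, and
    (ii) the type class has at most 2^{n H(Q)} elements."""
    for k in range(n + 1):
        q = k / n
        entropy = binary_entropy_bits(q)
        # D(Q||P) + H(Q) equals the cross-entropy -sum_y Q(y) log2 P(y).
        cross = -(q * math.log2(p) + (1.0 - q) * math.log2(1.0 - p))
        seq_prob = p ** k * (1.0 - p) ** (n - k)
        if not math.isclose(seq_prob, 2.0 ** (-n * cross), rel_tol=1e-9):
            return False
        if math.comb(n, k) > 2.0 ** (n * entropy) * (1.0 + 1e-9):
            return False
    return True

print(check_binary_type_properties(0.3, 12))
```

The number of binary types of length $n$ is $n+1$, which is the source of the polynomial prefactors of the form $(n+1)^{|\mathcal{X}||\mathcal{Y}|}$ in the general case.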

It remains to choose a suitable $\mathcal{V}_{n}(Q_{X})$ . We choose it to be the relative-entropy-ball centered at $W$ with radius $s$ :

\mathcal{V}_{n}(Q_{X})=\mathcal{V}_{n}(s,Q_{X}):=\left\{V\in\mathcal{P}_{n}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s\right\}

parameterized by $s\geq 0$ . As $n\to\infty$ , $\mathcal{P}_{n}(\mathcal{X})$ becomes dense in $\mathcal{P}(\mathcal{X})$ and $\mathcal{V}_{n}(Q_{X})$ boils down to

\mathcal{V}(Q_{X})=\mathcal{V}(s,Q_{X}):=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s\right\}.

Consequently, the exponent $\underline{\mathit{\Gamma}}(R,n)$ converges as follows:

\begin{align*}
\liminf_{n\to\infty}\underline{\mathit{\Gamma}}(R,n)
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \min_{V\in\mathcal{V}(s,Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}, \tag{12}
\end{align*}

where $\mathit{\Gamma}(s,Q_{X},R)$ is defined in (10). Note that (12) is also parameterized by $s\geq 0$. We can further write

\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R):=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}.

(13)

The reason why we put $\sup_{s\in[0,\infty)}$ in (13) is that since the choice of $\mathcal{V}(Q_{X})$ is free, in principle we can select any non-negative $s$ and construct the corresponding $\mathcal{V}(s,Q_{X})$ as our $\mathcal{V}(Q_{X})$ . Hence, there exists some $s$ value for each $Q_{X}$ that generates the maximal (thus the tightest) lower bound $\underline{\mathit{\Gamma}}(R)$ .

Furthermore, in (13), it might seem natural to write $\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X})=s$ since $\mathcal{V}(s,Q_{X})$ describes a relative entropy ball. However, this is true only if there exist conditional distributions $V$ with $V\ll W$ and $V\neq W$ under $Q_{X}$. To elaborate, our discussion and the derived bounds should apply to all channels. In particular, the channel $W_{Y|X}$ may contain hybrid input symbols, meaning that $W_{Y|X}(y|x)=1$ for some $x,y$ and $W_{Y|X}(y|x)<1$ for others. If $Q_{X}$ is supported only on the noiseless input symbols, then $W$ is deterministic on $\mathrm{supp}(Q_{X})$ and hence, for any finite $s\geq 0$,

D(V\|W|Q_{X})=\begin{cases}\infty&\text{if }D(V\|W|Q_{X})>s,\\ 0&\text{if }D(V\|W|Q_{X})\leq s.\end{cases}

The case $D(V\|W|Q_{X})=0$ implies that $V|_{\mathrm{supp}(Q_{X})}=W$, i.e., $V$ is also noiseless when restricted to $x\in\mathrm{supp}(Q_{X})$. As a result, $\mathit{\Gamma}(s,Q_{X},R)=D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}$, which is a constant independent of $s$, as illustrated in Figure 4(a). Then in (13) we have

\begin{align*}
\sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}
&=\sup_{s\in[0,\infty)}\min\left\{\infty,D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\right\}\\
&=D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\\
&\overset{a}{=}\sup_{s\in[0,\infty)}\min\left\{s,D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\right\}\\
&=\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\},
\end{align*}

where $(a)$ can be observed from Figure 4(a). Interestingly, even though $\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X})=\infty\neq s$, in (13) it is still valid to write

\sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}=\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}.

(14)

If $Q_{X}$ has support on at least one noisy input symbol, then $D(\cdot\|W|Q_{X})$ can take a continuum of values, and (14) holds directly. According to Lemma 48 in Appendix D-a, $\mathit{\Gamma}(s,Q_{X},R)$ is non-increasing in $s$: it decreases over a certain interval and then remains constant. Therefore, $\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}$ is attained at a unique fixed point where $s^{*}=\mathit{\Gamma}(s^{*},Q_{X},R)$, i.e., the intersection point of the curve $\mathit{\Gamma}(s,Q_{X},R)$ and the straight line $s$. Such an intersection can occur either in the decreasing regime of $\mathit{\Gamma}(s,Q_{X},R)$ (Figure 4(b)) or in the constant regime (Figure 4(c)).

To summarize, for any $Q_{X}$ supported on $\mathcal{X}$ , (14) is always true. Furthermore, since (14) is achieved at the intersection point $s^{*}$ , we can equivalently write

\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}=\inf_{s\in[0,\infty)}\max\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}

(15)

from Figure 4. The reason for doing so is that the right-hand side of (15) facilitates a convenient derivation from variational forms to dual forms in Section IV-C. Substituting (14) and (15) into (13) completes the proof. ∎
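The fixed-point argument behind (15) can be illustrated numerically. The Python sketch below uses a hypothetical stand-in for $\mathit{\Gamma}(s,Q_{X},R)$, namely a continuous non-increasing function that decreases and then stays constant (continuity is an assumption made here, and the specific function is not derived from any channel). It locates $s^{*}$ by bisection and compares it with grid evaluations of both sides of (15).

```python
def fixed_point(gamma, upper, tol=1e-12):
    """Bisection for s* with gamma(s*) = s*, assuming gamma is continuous and
    non-increasing on [0, upper], gamma(0) >= 0, and gamma(upper) <= upper."""
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma(mid) > mid:
            lo = mid   # still left of the intersection with the line s
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sup_min(gamma, upper, steps=10000):
    # Grid approximation of sup_s min{s, gamma(s)} over [0, upper].
    return max(min(upper * k / steps, gamma(upper * k / steps)) for k in range(steps + 1))

def inf_max(gamma, upper, steps=10000):
    # Grid approximation of inf_s max{s, gamma(s)} over [0, upper].
    return min(max(upper * k / steps, gamma(upper * k / steps)) for k in range(steps + 1))

# Hypothetical example: decreasing with slope -2 until it reaches the floor 0.3.
def example_gamma(s):
    return max(0.3, 1.0 - 2.0 * s)

s_star = fixed_point(example_gamma, 2.0)
print(s_star, sup_min(example_gamma, 2.0), inf_max(example_gamma, 2.0))
```

Here the intersection lies in the decreasing regime at $s^{*}=1/3$; lowering the slope or raising the floor of the example moves it into the constant regime, mirroring the two cases discussed above.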

The idea of proving Proposition 30 is closely related to hypothesis testing. We are in fact looking for a decision region $\mathcal{A}$ such that $\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A}^{c})\leq 2^{-ns}$ and $P_{Y}^{n}(\mathcal{A})\leq 2^{-n\mathit{\Gamma}(s)}$ with some exponent $\mathit{\Gamma}(s)$ . The average probability of testing error is $\frac{1}{2}\big[\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A}^{c})+P_{Y}^{n}(\mathcal{A})\big]=\frac{1}{2}-\frac{1}{4}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ . Then we optimize over $s$ to find the optimal decision region. Observing the expression in Proposition 30, the following properties of $\underline{\mathit{\Gamma}}(R)$ can be established, which are proved in Appendix D-c.

Lemma 31.

We have the following properties of $\underline{\mathit{\Gamma}}(R)$ :

(i) $\underline{\mathit{\Gamma}}(R)=0$ when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . (ii) $\underline{\mathit{\Gamma}}(R)>0$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ .

IV-B An achievability bound for the strong converse exponent

In this subsection, we show an upper bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$ (the uniform formulation), denoted by $\overline{\mathit{\Gamma}}(R)$ . We develop a novel deterministic code construction technique based on the type covering lemma. The key observation is that since we are operating below $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , we cannot afford codeword repetitions, while random coding produces repeated codewords, albeit rarely. In contrast, we only cover joint types with mutual information less than $R$ , and cover only once.

Proposition 32 (Achievability).

The strong converse exponent $\mathit{\Gamma}_{\mathrm{uni}}(R)$ in the uniform formulation can be bounded from above as follows:

\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\overline{\mathit{\Gamma}}(R):=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R}\left[D(Q_{Y}\|P_{Y})+\big|R-\iota(Q_{XY})\big|^{+}\right],

where $\iota(Q_{XY})$ is defined in Definition 2.

Proof.

Consider any joint type $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$. Following our notation convention, we introduce the backward conditional type $\widebar{V}_{X|Y}=Q_{XY}/Q_{Y}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})$. Take any $y^{n}\in\mathcal{T}_{Q_{Y}}$. Equation (1) yields that

\begin{align*}
\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})
&=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\ \sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{V}}(y^{n})\}\ W_{Y|X}^{n}(y^{n}|X^{n}(i))\\
&\overset{a}{=}\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{-n\left[D(V\|W|Q_{X})+H(V|Q_{X})\right]},
\end{align*}

where $(a)$ follows from a property of types. Furthermore, in $(a)$ we have defined

k_{\widebar{V}}(y^{n}):=\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{V}}(y^{n})\},

(16)

which is the number of codewords that lie in the $\widebar{V}$-shell of $y^{n}$. In other words, $k_{\widebar{V}}(y^{n})$ counts the number of codewords that have joint type $Q_{XY}$ with $y^{n}$. Since $P_{Y}^{n}(y^{n})=2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}$ by a property of types, we can evaluate the following ratio:

\begin{align*}
\frac{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})}{P_{Y}^{n}(y^{n})}
&=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})-D(V\|W|Q_{X})-H(V|Q_{X})\right]}\\
&=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[D(Q_{Y}\|P_{Y})+I(Q_{X};V)-D(V\|W|Q_{X})\right]}\\
&=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}, \tag{17}
\end{align*}

where the last equality follows from Lemma 3.

Using the identity $|a-b|=a+b-2\min\{a,b\}$ , the total variation satisfies the following property:

1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}=\sum_{y^{n}}\min\left\{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}),P_{Y}^{n}(y^{n})\right\}.

(18)
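For completeness, identity (18) follows in one line from $|a-b|=a+b-2\min\{a,b\}$ together with the fact that both $\tilde{P}_{Y^{n}|\mathcal{C}}$ and $P_{Y}^{n}$ sum to one. The derivation below is a short worked version of this step, written out under no assumptions beyond those two facts:

```latex
\begin{align*}
\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}
&=\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-P_{Y}^{n}(y^{n})\right|\\
&=\sum_{y^{n}}\left[\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})+P_{Y}^{n}(y^{n})
  -2\min\left\{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}),P_{Y}^{n}(y^{n})\right\}\right]\\
&=2-2\sum_{y^{n}}\min\left\{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}),P_{Y}^{n}(y^{n})\right\}.
\end{align*}
```

Dividing by 2 and rearranging gives (18).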

Combining (18) and (17), we obtain that

\begin{align*}
&1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}P_{Y}^{n}(y^{n})\min\left\{\frac{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})}{P_{Y}^{n}(y^{n})},1\right\}\\
&\overset{a}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},1\right\}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\frac{1}{|\mathcal{P}_{n}(\mathcal{X}|Q_{Y})|}\right\}\\
&\overset{b}{\geq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\min\left\{k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},\frac{1}{|\mathcal{P}_{n}(\mathcal{X}|Q_{Y})|}\right\}\\
&\overset{c}{\geq}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]},(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\right\}\\
&\overset{d}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]},1\right\}, \tag{19}
\end{align*}

where $(a)$ follows from the properties of types and (17); $(b)$ follows from the fact that the minimum of two sums is greater than or equal to the sum of the termwise minima (equivalent to the triangle inequality); and $(c)$ follows from the properties of types. In $(d)$, we artificially multiply $k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}$ by the factor $(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\leq 1$, which can only decrease the minimum, and then pull this factor outside the minimum. The reason for doing so is that $k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}$ is exponential in $n$, so comparing it with 1 rather than with a polynomial factor does not affect the exponential order.

For the purpose of designing a good code, we need to maximize (19), and it suffices to make $k_{\widebar{V}}(y^{n})$ large (or at least nonzero) for all joint types. However, due to the low rate and hence the limited number of codewords, this can only be achieved for certain joint types, and even for those, a large $k_{\widebar{V}}(y^{n})$ would be too greedy. A more reasonable approach is to set $k_{\widebar{V}}(y^{n})=1$, meaning that there exists a single codeword that covers $y^{n}$. The type covering lemma guarantees the existence of such a covering for all $y^{n}$, provided that the mutual information is less than the code rate.

Based on the above analysis, we construct our code $\mathcal{C}$ in the following way. Take any $\epsilon>0$ and examine all joint types $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ : if its mutual information satisfies $I(Q_{X};V)\leq R-2\epsilon$ , then according to the type covering lemma (e.g., [moser2019advanced, Lem. 3.34]), there exists a covering code with $2^{n[I(Q_{X};V)+\epsilon]}$ codewords in $\mathcal{T}_{Q_{X}}$ that ensures $k_{\widebar{V}}(y^{n})\geq 1$ for all the corresponding output sequences $y^{n}\in\mathcal{T}_{Q_{Y}}$ . We collect all codewords in this covering code into our code $\mathcal{C}$ . It is noteworthy that such a collection may contain repeated codewords. Under this strategy, the number of codewords we have assigned so far is

	$\displaystyle|\mathcal{C}_{\text{eff}}|$	$\displaystyle=\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{n[I(Q_{X};V)+\epsilon]}$
		$\displaystyle\overset{a}{\leq}(n+1)^{|\mathcal{X}||\mathcal{Y}|}\max_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{n[I(Q_{X};V)+\epsilon]}$
		$\displaystyle\leq(n+1)^{|\mathcal{X}||\mathcal{Y}|}2^{n(R-\epsilon)}\leq 2^{nR}$

for sufficiently large $n$, where $(a)$ follows from the properties of types. The inequality $|\mathcal{C}_{\text{eff}}|\leq 2^{nR}$ confirms that $\mathcal{C}$ is a valid code. Specifically, each of the $M=2^{nR}$ codewords is selected with uniform probability $1/M=2^{-nR}$, and $|\mathcal{C}_{\text{eff}}|$ of them are used to cover all joint types $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ satisfying $I(Q_{X};V)\leq R-2\epsilon$; as a result, $k_{\widebar{V}}(y^{n})\geq 1$ for all $y^{n}$ in those joint type classes. The remaining $M-|\mathcal{C}_{\text{eff}}|$ codewords could be used to cover the other joint types. However, we do not exploit them and simply apply the trivial lower bound of zero to their contribution. To sum up, in (19) we have

k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}\geq\begin{cases}2^{n\left[\iota(Q_{XY})-R\right]}&\text{if }I(Q_{X};V)\leq R-2\epsilon\\ 0&\text{if }I(Q_{X};V)>R-2\epsilon\end{cases}

for all $y^{n}\in\mathcal{T}_{Q_{Y}}$. Consequently, (19) simplifies to

	$\displaystyle 1-\frac{1}{2}$	$\displaystyle\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}$
		$\displaystyle\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\ |\mathcal{T}_{Q_{Y}}|\ \min\left\{2^{n\left[\iota(Q_{XY})-R\right]},1\right\}$
		$\displaystyle\overset{a}{\geq}(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+\left|R-\iota(Q_{XY})\right|^{+}\right]}$
		$\displaystyle\geq(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\max_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+\left|R-\iota(Q_{XY})\right|^{+}\right]}$
		$\displaystyle=(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}\left[D(Q_{Y}\|P_{Y})+\big|R-\iota(Q_{XY})\big|^{+}\right]\right\},$

where $(a)$ follows from property of types. Taking $\epsilon\to 0$ and $n\to\infty$ completes the proof. ∎

Based on Proposition 32, the following property of $\overline{\mathit{\Gamma}}(R)$ holds, which is proved in Appendix D-d.

Lemma 33.

If $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , then $\overline{\mathit{\Gamma}}(R)=0$ .

IV-C Dual forms of $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$ and their equality

So far, we have obtained a lower bound $\underline{\mathit{\Gamma}}(R)$ for $\mathit{\Gamma}_{\mathrm{non}}(R)$ and an upper bound $\overline{\mathit{\Gamma}}(R)$ for $\mathit{\Gamma}_{\mathrm{uni}}(R)$, stated in Propositions 30 and 32, respectively. Figure 5 presents examples of these two bounds, using the same channel models as in Figure 1. Figure 5 raises a natural question: do the two proposed bounds coincide when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$? If they do, then the exact exponent is determined, since $\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)$ is squeezed between them. In this subsection, we show that the answer is yes by reformulating $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$ into their dual forms in terms of $J_{\alpha,\beta}$ defined by (6). These dual forms are presented in the following Propositions 34 and 35, respectively.

Proposition 34.

$\underline{\mathit{\Gamma}}(R)$ has the following dual form:

	$\displaystyle\underline{\mathit{\Gamma}}(R)\geq\underline{\underline{\mathit{\Gamma}}}(R):=$	$\displaystyle\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}$		(20)
	$\displaystyle=$	$\displaystyle\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].$		(21)

Proposition 35.

$\overline{\mathit{\Gamma}}(R)$ has the following dual form:

\overline{\mathit{\Gamma}}(R)=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].

Proposition 34 is proved in Appendix B-a and Proposition 35 is proved in Appendix B-b. Comparing the dual forms of $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ , we establish their equality in the following Proposition 36.

Proposition 36.

$\overline{\mathit{\Gamma}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ .

Proposition 36 is proved in Appendix B-c. Equipped with the propositions in this section, the proof of the main result, Theorem 17, follows by chaining the inequalities established above.

Proof of Theorem 17.

First, $\underline{\underline{\mathit{\Gamma}}}(R)\leq\underline{\mathit{\Gamma}}(R)\leq\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{ren}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\overline{\mathit{\Gamma}}(R)$ holds for all $R$ by combining Remark 15, Proposition 30, Proposition 32, and Proposition 34.

When $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , by Proposition 36, $\underline{\underline{\mathit{\Gamma}}}(R)=\underline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=\overline{\mathit{\Gamma}}(R)$ .

When $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, by Proposition 30, Proposition 34, and Lemma 46, $\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)=0$. On the other hand, according to the soft covering lemma, e.g., [moser2019advanced, Ch. 19], under the uniform formulation there exists a sequence of codes for which $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\to 0$ as $n\to\infty$, so that $\left(1-\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\right)\to 1=2^{-n\cdot 0}$: a zero strong converse exponent is achievable and thus $\mathit{\Gamma}_{\mathrm{uni}}(R)\leq 0$. Ergo, $\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=0=\underline{\underline{\mathit{\Gamma}}}(R)$.

To sum up, we can conclude that $\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)$ for all $R$ . Combining it with Proposition 34 completes the proof. Since $\underline{\underline{\mathit{\Gamma}}}(R)$ is now exact, rather than merely a lower bound, we drop the underline and denote it by $\mathit{\Gamma}(R)$ to simplify notation. ∎

V Error Exponent for Noiseless Channels

In this section, we discuss the soft-covering error exponent when $W_{Y|X}$ is noiseless: for each $x\in\mathcal{X}$ , there exists a symbol $y\in\mathcal{Y}$ such that $W_{Y|X}(y|x)=1$ . Then we have $W_{Y|X}(y|x)=\mathbbm{1}\{y=w(x)\}$ for all $x\in\mathcal{X}$ , $y\in\mathcal{Y}$ , and for some function $w$ .

Remark 37 (Choice of alphabet).

If $w(\cdot)$ is a bijection, then, without loss of generality, we assume that $\mathcal{X}=\mathcal{Y}$. If $w(\cdot)$ is a surjection, then in the achievability proof we may select, for each $y\in\mathcal{Y}$, a single representative $x\in w^{-1}(y)$ and construct an alphabet $\mathcal{X}_{\mathcal{C}}\subset\mathcal{X}$ consisting of these representatives, which satisfies $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$. In the converse proof, given any code $\mathcal{C}$, we first restrict the symbols: if $y_{1}=w(x_{1})=w(x_{2})$, we replace every occurrence of $x_{2}$ in the codewords with $x_{1}$. Since both $x_{1}$ and $x_{2}$ map to the same output symbol $y_{1}$, this modification leaves the output sequences unchanged and, consequently, does not affect the induced distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ or $\tilde{P}_{Y^{n}|\mathcal{C}}$. Under this restriction, the code alphabet reduces to some $\mathcal{X}_{\mathcal{C}}\subset\mathcal{X}$ such that $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$ and $w(\cdot)$ can be regarded as bijective. Thus, in both cases, one can always identify an input alphabet $\mathcal{X}_{\mathcal{C}}$ with $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$ for encoding.

V-A Uniform formulation

Under a noiseless channel, the output distribution (1) reduces to

\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\frac{1}{M}\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)=x^{n}\}=:\frac{k(y^{n})}{M},

(22)

where $k(y^{n})\in\mathbb{Z}^{+}$ counts the number of codewords mapping to $y^{n}$ . We begin by showing the converse result stated by Theorem 19, which is valid for both rational and irrational $P_{Y}$ .

Proof of Theorem 19.

When $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , it is relatively easier to approximate large probabilities, while harder to approximate small probabilities, in particular those smaller than the step size $1/M$ in (22). This leads us to the consideration of the following ‘bad’ set:

\mathcal{B}:=\left\{y^{n}\in\mathcal{Y}^{n}:\frac{1}{M}\geq 2P_{Y}^{n}(y^{n})\right\}=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R+\frac{1}{n}}\mathcal{T}_{Q_{Y}}.

(23)

For any $y^{n}\in\mathcal{B}$ , we have

\left|\frac{k(y^{n})}{M}-P_{Y}^{n}(y^{n})\right|\begin{cases}\geq\dfrac{1}{M}-P_{Y}^{n}(y^{n})\geq P_{Y}^{n}(y^{n}),&\text{if }k(y^{n})\geq 1\\ =P_{Y}^{n}(y^{n}),&\text{if }k(y^{n})=0\end{cases}.

An error of value $P_{Y}^{n}(y^{n})$ is unavoidable for all $y^{n}\in\mathcal{B}$ . In this sense, $\mathcal{B}$ is regarded as ‘bad’, producing errors no matter what code is applied. As a result,

	$\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}$	$\displaystyle\geq\frac{1}{2}\sum_{y^{n}\in\mathcal{B}}P_{Y}^{n}(y^{n})$
		$\displaystyle\overset{a}{\geq}\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R+\frac{1}{n}}2^{-nD(Q_{Y}\|P_{Y})}$
		$\displaystyle\geq\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R+\frac{1}{n}}D(Q_{Y}\|P_{Y})\right\},$

where $(a)$ follows from property of types. Taking $n\to\infty$ completes the proof of the variational form.
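The unavoidable error contributed by the ‘bad’ set $\mathcal{B}$ can be illustrated by brute force on a toy instance. The sketch below enumerates every multiset of $M$ output sequences (equivalently, every assignment of the counts $k(y^{n})$ under a bijective $w$) and compares the best achievable total variation with $\frac{1}{2}P_{Y}^{n}(\mathcal{B})$. The parameters `p_y`, `n`, `m` and the helper names are hypothetical choices made only for illustration.

```python
import itertools
import math
from collections import Counter

def product_pmf(p_y, n):
    """Map each length-n sequence to its i.i.d. probability under p_y."""
    return {seq: math.prod(p_y[s] for s in seq)
            for seq in itertools.product(range(len(p_y)), repeat=n)}

def tv_of_code(counts, pmf, m):
    """Total variation between the induced law k(y^n)/M and the target pmf."""
    return 0.5 * sum(abs(counts.get(seq, 0) / m - prob)
                     for seq, prob in pmf.items())

p_y, n, m = [0.9, 0.1], 3, 4           # hypothetical toy parameters
pmf = product_pmf(p_y, n)

# 'Bad' set as in (23): sequences with 1/M >= 2 * P(y^n).
bad_mass = sum(prob for prob in pmf.values() if 1 / m >= 2 * prob)
lower_bound = 0.5 * bad_mass

# Exhaustive search over all codes, viewed as multisets of M output sequences.
best_tv = min(tv_of_code(Counter(code), pmf, m)
              for code in itertools.combinations_with_replacement(list(pmf), m))
print(best_tv >= lower_bound)
```

Exhaustive search is feasible only for very small $n$ and $M$; it is meant to illustrate the bound, not to replace the type-based argument.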

To obtain the dual form, consider the following.

	$\displaystyle\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})\overset{a}{=}$	$\displaystyle\max_{\delta\geq 0}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\delta\left[R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\right]\big\}$
	$\displaystyle=$	$\displaystyle\max_{\delta\geq 0}\left\{\delta R+\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\delta\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]\right\}$
	$\displaystyle\overset{b}{=}$	$\displaystyle\max_{\delta\geq 0}\left\{\delta R-\log\sum_{y}P_{Y}^{1-\delta}(y)\right\}$
	$\displaystyle\overset{c}{=}$	$\displaystyle\max_{\alpha\in(-\infty,0)\cup[1,\infty)}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\},$

where $(a)$ follows from the convexity of the optimization problem; $(b)$ follows directly from (46); and $(c)$ follows by setting $\alpha:=\frac{1}{1-\delta}$ . ∎
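As a sanity check of the duality above, the variational form and the Lagrangian form $\max_{\delta\geq 0}\{\delta R-\log\sum_{y}P_{Y}^{1-\delta}(y)\}$ can be compared numerically for a binary target. The following is a rough grid-search sketch; the values `p1 = 0.2` and `rate = 1.5` (in bits), the grid sizes, and the function names are assumptions chosen for illustration only.

```python
import math

def primal_value(p1, rate, grid=20000):
    """Grid-search min D(Q||P) over binary Q with D(Q||P) + H(Q) >= rate (bits)."""
    best = math.inf
    for i in range(1, grid):
        q1 = i / grid
        # D(Q||P) + H(Q) equals the cross-entropy -sum_y Q(y) log2 P(y).
        cross = -(1 - q1) * math.log2(1 - p1) - q1 * math.log2(p1)
        if cross >= rate:
            div = (q1 * math.log2(q1 / p1)
                   + (1 - q1) * math.log2((1 - q1) / (1 - p1)))
            best = min(best, div)
    return best

def dual_value(p1, rate, delta_max=5.0, grid=5000):
    """Grid-search max over delta in [0, delta_max] of delta*R - log2 sum_y P(y)^(1-delta)."""
    best = -math.inf
    for i in range(grid + 1):
        delta = delta_max * i / grid
        log_sum = math.log2((1 - p1) ** (1 - delta) + p1 ** (1 - delta))
        best = max(best, delta * rate - log_sum)
    return best

p1, rate = 0.2, 1.5                     # hypothetical P_Y = (0.8, 0.2), R in bits
primal, dual = primal_value(p1, rate), dual_value(p1, rate)
print(abs(primal - dual) < 1e-3)
```

The rate is chosen between $H(P_{Y})$ and $H_{-\infty}(P_{Y})$ so that the constraint is active and the optimal $\delta$ lies inside the search interval.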

It is noteworthy that $\overline{E}^{\mathrm{nl}}(R)$ is finite only in a certain region, as specified precisely in the following Lemma 38, which is proved in Appendix D-e.

Lemma 38.

$\overline{E}^{\mathrm{nl}}(R)<\infty$ when $R\leq H_{-\infty}(P_{Y})$ , and $\overline{E}^{\mathrm{nl}}(R)=\infty$ when $R>H_{-\infty}(P_{Y})$ .

Next, we prove the achievability result stated by Theorem 20, which is also valid for both rational and irrational $P_{Y}$ . Towards proving it, we need the following lemma.

Lemma 39.

When $R\geq H(P_{Y})$ , we have $N_{1}(R)\geq N_{2}(R)$ , where

	$\displaystyle N_{1}(R)$	$\displaystyle=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}H(Q_{Y}),$
	$\displaystyle N_{2}(R)$	$\displaystyle=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}\left[R-D(Q_{Y}\|P_{Y})\right].$

Lemma 39 is proved in Appendix D-f. Recall (22) and observe that a strategy for constructing a good code is to make $k(y^{n})\approx MP_{Y}^{n}(y^{n})$ so that $k(y^{n})/M$ approximates $P_{Y}^{n}(y^{n})$ as closely as possible. More precisely, to establish the achievability result in Theorem 20, we show that there exists a code in which $k(y^{n})$ differs from either $\lfloor MP_{Y}^{n}(y^{n})\rfloor$ or $\lceil MP_{Y}^{n}(y^{n})\rceil$ by at most a polynomial quantity in $n$.

Proof of Theorem 20.

Taking inspiration from the proof of Proposition 32 regarding the achievability of the strong converse exponent, we provide a deterministic code construction for the problem at hand. Let us first assume that $R\geq H(P_{Y})$ . Define

\mathcal{G}:=\left\{y^{n}\in\mathcal{Y}^{n}:MP_{Y}^{n}(y^{n})\geq 1\right\}=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}\mathcal{T}_{Q_{Y}}.

(24)

$\mathcal{G}$ is actually a ‘good’ set in the sense that all sequences in $\mathcal{G}$ satisfy that $\lfloor MP_{Y}^{n}(y^{n})\rfloor\geq 1$ . Thus, it is always preferable to cover these sequences, since doing so will yield nonzero $k(y^{n})$ values that can align with $MP_{Y}^{n}(y^{n})$ . In fact, ignoring the factor of two, $\mathcal{G}$ is approximately the complement of the ‘bad’ set $\mathcal{B}$ defined in (23). Take any $\epsilon>0$ . The size of $\mathcal{G}$ is bounded by

	$\displaystyle|\mathcal{G}|$	$\displaystyle=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}|\mathcal{T}_{Q_{Y}}|\geq\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}|\mathcal{T}_{Q_{Y}}|$
		$\displaystyle\overset{a}{\geq}(n+1)^{-|\mathcal{Y}|}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}2^{nH(Q_{Y})}$
		$\displaystyle\overset{b}{\geq}(n+1)^{-|\mathcal{Y}|}\ 2^{n\left[N_{1}(R)-\epsilon\right]},$

where $(a)$ follows from property of types. In $(b)$ , the maximization is over all types, so it differs from the maximization over distributions (namely $N_{1}(R)$ as defined in Lemma 39) by at most $\epsilon$ for sufficiently large $n$ .
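The two properties of types used repeatedly above, namely $(n+1)^{-|\mathcal{Y}|}2^{nH(Q_{Y})}\leq|\mathcal{T}_{Q_{Y}}|\leq 2^{nH(Q_{Y})}$ and $P_{Y}^{n}(y^{n})=2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}$ for $y^{n}\in\mathcal{T}_{Q_{Y}}$, can be verified directly for a concrete type. The sketch below uses a hypothetical ternary example; the symbol counts and the distribution `p_y` are illustrative assumptions.

```python
import math

def type_class_size(counts):
    """Exact size of the type class with the given symbol counts (multinomial coefficient)."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)
    return size

def entropy_bits(counts):
    """Entropy (bits) of the type with the given symbol counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def kl_bits(counts, p_y):
    """Divergence D(Q||P) in bits, where Q is the type given by counts."""
    n = sum(counts)
    return sum(c / n * math.log2((c / n) / p)
               for c, p in zip(counts, p_y) if c > 0)

counts = (6, 3, 1)                      # hypothetical type with n = 10
p_y = [0.5, 0.3, 0.2]                   # hypothetical target distribution
n, alphabet = sum(counts), len(counts)

size = type_class_size(counts)
upper = 2 ** (n * entropy_bits(counts))
lower = upper / (n + 1) ** alphabet

# Probability of any single sequence of this type, computed two ways.
direct = math.prod(p ** c for p, c in zip(p_y, counts))
via_types = 2 ** (-n * (kl_bits(counts, p_y) + entropy_bits(counts)))
print(lower <= size <= upper, math.isclose(direct, via_types))
```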

To design a good code, we adopt the alphabet choice described in Remark 37. With this choice, $k(y^{n})$ represents the number of codewords mapped one-to-one to $y^{n}$ , and hence can be directly constructed via a repetition of codewords. Our goal is to design $k(y^{n})$ so that it satisfies $\lfloor MP_{Y}^{n}(y^{n})\rfloor\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+\mathrm{poly}(n)$ under the constraint that $\sum_{y^{n}}k(y^{n})=M$ . Namely, consider the following two codes:

	$\displaystyle\mathcal{C}_{1}:\quad$	$\displaystyle k(y^{n})=\lfloor MP_{Y}^{n}(y^{n})\rfloor,\quad\forall y^{n}\in\mathcal{Y}^{n},\quad M_{1}:=|\mathcal{C}_{1}|=\sum_{y^{n}}\ \lfloor MP_{Y}^{n}(y^{n})\rfloor.$
	$\displaystyle\mathcal{C}_{2}:\quad$	$\displaystyle k(y^{n})=\begin{cases}\lceil MP_{Y}^{n}(y^{n})\rceil&\text{if }y^{n}\in\mathcal{G}\\ 0&\text{if }y^{n}\notin\mathcal{G}\end{cases},\quad M_{2}:=|\mathcal{C}_{2}|=\sum_{y^{n}\in\mathcal{G}}\lceil MP_{Y}^{n}(y^{n})\rceil.$

Clearly $M_{1}\leq M$ , meaning that we can use $\mathcal{C}_{1}$ to cover all sequences $y^{n}\in\mathcal{Y}^{n}$ and there will be some codewords left. Note that $\mathcal{G}^{c}:=\mathcal{Y}^{n}\backslash\mathcal{G}$ is not covered since $\lfloor MP_{Y}^{n}(y^{n})\rfloor=0$ for all $y^{n}\in\mathcal{G}^{c}$ . If $M_{2}\geq M$ , then after applying $\mathcal{C}_{1}$ , each sequence in $\mathcal{G}$ can be covered at most once more. Hence, there exists a code $\mathcal{C}_{3}$ as follows.

\displaystyle\mathcal{C}_{3}:\quad

\displaystyle k(y^{n})=\begin{cases}\lfloor MP_{Y}^{n}(y^{n})\rfloor\text{ or }\lceil MP_{Y}^{n}(y^{n})\rceil&\text{if }y^{n}\in\mathcal{G}\\ 0&\text{if }y^{n}\notin\mathcal{G}\end{cases},\quad|\mathcal{C}_{3}|=M.

If $M_{2}<M$ , then after applying $\mathcal{C}_{1}$ , each sequence in $\mathcal{G}$ can still be covered more than once. Equivalently, we use $\mathcal{C}_{2}$ to cover $\mathcal{Y}^{n}$ , and there will be some codewords left. The number of those codewords that are left is

	$\displaystyle\Delta M$	$\displaystyle:=M-M_{2}=M-\sum_{y^{n}\in\mathcal{G}}\lceil MP_{Y}^{n}(y^{n})\rceil\leq M-\sum_{y^{n}\in\mathcal{G}}MP_{Y}^{n}(y^{n})=MP_{Y}^{n}(\mathcal{G}^{c})$
		$\displaystyle=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}MP_{Y}^{n}(\mathcal{T}_{Q_{Y}})$
		$\displaystyle\overset{a}{\leq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}2^{n\left[R-D(Q_{Y}\|P_{Y})\right]}$
		$\displaystyle\overset{b}{\leq}(n+1)^{|\mathcal{Y}|}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}2^{n\left[R-D(Q_{Y}\|P_{Y})\right]}$
		$\displaystyle=(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{n\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}\left[R-D(Q_{Y}\|P_{Y})\right]\right\}$
		$\displaystyle\leq(n+1)^{|\mathcal{Y}|}\ 2^{nN_{2}(R)},$

where $(a)$ and $(b)$ follow from properties of types, and $N_{2}(R)$ is defined in Lemma 39. Now let us consider using these $\Delta M$ codewords to continue covering $\mathcal{G}$ after $\mathcal{C}_{2}$ is applied. We can cover each sequence in $\mathcal{G}$ by $\frac{\Delta M}{|\mathcal{G}|}$ more times, which is at most

\frac{\Delta M}{|\mathcal{G}|}\leq(n+1)^{2|\mathcal{Y}|}2^{n\left[N_{2}(R)-N_{1}(R)+\epsilon\right]}\leq(n+1)^{2|\mathcal{Y}|}2^{n\epsilon}\leq 2^{2n\epsilon}

for all sufficiently large $n$ . Here we have used $N_{1}(R)\geq N_{2}(R)$ when $R\geq H(P_{Y})$ from Lemma 39. Hence, there exists a code $\mathcal{C}_{4}$ as follows, with $|\mathcal{C}_{4}|=M$ :

\displaystyle\mathcal{C}_{4}:\quad

\displaystyle\begin{cases}\lceil MP_{Y}^{n}(y^{n})\rceil\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+2^{2n\epsilon}&y^{n}\in\mathcal{G},\\ k(y^{n})=0&y^{n}\notin\mathcal{G}.\end{cases}

According to the above discussion, one of $\mathcal{C}_{3}$ and $\mathcal{C}_{4}$ always exists and has exactly $M$ codewords. Therefore, for all $\epsilon>0$ and sufficiently large $n$, there exists a code $\mathcal{C}$ as follows, with $|\mathcal{C}|=\sum_{y^{n}}k(y^{n})=M$:

\displaystyle\mathcal{C}:\quad

\displaystyle\begin{cases}\lfloor MP_{Y}^{n}(y^{n})\rfloor\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+2^{2n\epsilon}&y^{n}\in\mathcal{G},\\ k(y^{n})=0&y^{n}\notin\mathcal{G}.\end{cases}

Choose $\mathcal{C}$ to be our covering code, which yields that

\left|\frac{k(y^{n})}{M}-P_{Y}^{n}(y^{n})\right|\leq\left\{\begin{array}[]{@{}l l@{}}\dfrac{1+2^{2n\epsilon}}{M}&\text{if }y^{n}\in\mathcal{G}\\ P_{Y}^{n}(y^{n})&\text{if }y^{n}\notin\mathcal{G}\end{array}\right\}\leq 2^{3n\epsilon}\min\left\{\frac{1}{M},P_{Y}^{n}(y^{n})\right\}.

Consequently, we have

	$\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}$	$\displaystyle\leq 2^{3n\epsilon-1}\sum_{y^{n}}\min\left\{\frac{1}{M},P_{Y}^{n}(y^{n})\right\}$
		$\displaystyle=2^{3n\epsilon-1}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}|\mathcal{T}_{Q_{Y}}|\min\left\{2^{-nR},2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\right\}$
		$\displaystyle\overset{a}{\leq}2^{3n\epsilon-1}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{nH(Q_{Y})}\min\left\{2^{-nR},2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\right\}$
		$\displaystyle\overset{b}{\leq}\frac{1}{2}(n+1)^{|\mathcal{Y}|}\ 2^{3n\epsilon}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\min\left\{2^{-n[R-H(Q_{Y})]},2^{-nD(Q_{Y}\|P_{Y})}\right\}$
		$\displaystyle=\frac{1}{2}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}-3\epsilon\right]\right\},$

where $(a)$ and $(b)$ follow from properties of types. The exponent is exactly $\underline{E}^{\mathrm{nl}}(R)$ as $\epsilon\to 0$ and $n\to\infty$ .

Note that all the above discussions are based on the assumption that $R\geq H(P_{Y})$ . If $R<H(P_{Y})$ , we can just apply a trivial bound $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\leq 1$ and the error exponent is $0$ . Since $\underline{E}^{\mathrm{nl}}(R)$ also vanishes for $R<H(P_{Y})$ , it is thus an achievable error exponent for $R$ values both below and above $H(P_{Y})$ . Hence, the variational form in the theorem is proven.

To obtain the dual form, consider the following.

	$\displaystyle\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}$	$\displaystyle\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}$
	$\displaystyle=$	$\displaystyle\max_{\lambda\in[0,1]}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\lambda\left[R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\right]\big\}$
	$\displaystyle=$	$\displaystyle\max_{\lambda\in[0,1]}\left\{\lambda R+\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\lambda\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]\right\}$
	$\displaystyle\overset{a}{=}$	$\displaystyle\max_{\lambda\in[0,1]}\left\{\lambda R-\log\sum_{y}P_{Y}^{1-\lambda}(y)\right\}$
	$\displaystyle\overset{b}{=}$	$\displaystyle\max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\},$

where $(a)$ follows from (46) and $(b)$ follows by setting $\alpha:=\frac{1}{1-\lambda}$ . ∎

The idea behind this proof is a quantization of $P_{Y}^{n}$ with step size $1/M$. For each sequence in the good set $\mathcal{G}$, the resulting error is at most $1/M$ up to a subexponential factor. For sequences outside $\mathcal{G}$, each probability is less than $1/M$, so we choose not to cover them, and the error equals their probability. The key step is to show that such a construction can indeed be realized under the constraint that the number of codewords is exactly $M$.
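The quantization idea can be sketched in code for a toy instance. The routine below starts from $k(y^{n})=\lfloor MP_{Y}^{n}(y^{n})\rfloor$ and hands out the leftover codewords round-robin over the good set $\mathcal{G}$, beginning with the largest fractional parts. This largest-remainder rule is one convenient way to realize a code of the $\mathcal{C}_{3}$ or $\mathcal{C}_{4}$ kind and is an assumption of this sketch, as are the parameters `p_y`, `n`, `m` and the function name `build_counts`.

```python
import itertools
import math

def build_counts(p_y, n, m):
    """Return (pmf, counts) with sum(counts) == m and counts zero outside the good set."""
    pmf = {seq: math.prod(p_y[s] for s in seq)
           for seq in itertools.product(range(len(p_y)), repeat=n)}
    # Start from the floor code; sequences with M*P < 1 automatically get k = 0.
    counts = {seq: math.floor(m * prob) for seq, prob in pmf.items()}
    good = [seq for seq, prob in pmf.items() if m * prob >= 1]
    if not good:
        raise ValueError("good set is empty; the rate is too low for this sketch")
    # Largest fractional parts first (hypothetical tie-breaking rule).
    good.sort(key=lambda seq: m * pmf[seq] - counts[seq], reverse=True)
    leftover = m - sum(counts.values())
    for i in range(leftover):
        counts[good[i % len(good)]] += 1
    return pmf, counts

p_y, n, m = [0.7, 0.3], 4, 64           # hypothetical toy parameters, R = 1.5 bits
pmf, counts = build_counts(p_y, n, m)
tv = 0.5 * sum(abs(counts[seq] / m - prob) for seq, prob in pmf.items())
bound = 0.5 * sum(min(1 / m, prob) for prob in pmf.values())
print(sum(counts.values()) == m, tv <= bound)
```

In this instance the leftover does not exceed $|\mathcal{G}|$, so every count in $\mathcal{G}$ is a floor or a ceiling and the total variation stays below $\frac{1}{2}\sum_{y^{n}}\min\{1/M,P_{Y}^{n}(y^{n})\}$. Floating-point products are adequate here but would need exact arithmetic near integer boundaries.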

V-B Rational-irrational discrepancy under the uniform formulation

Recalling (22), under the uniform formulation, every value $\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=k(y^{n})/M$ is an integer multiple of $1/M$ and is therefore rational. This gives rise to the rational-irrational discrepancy. In the following, we state Khintchine’s theorem and use it to prove the linear converse in Theorem 25.

Lemma 40 (Khintchine’s theorem [bugeaud2004approximation, Thm. 1.10], [queffelec2013diophantine, Thm. 3.3.2]).

Let $\Psi:\mathbb{R}_{\geq 1}\to\mathbb{R}_{>0}$ be a continuous function such that $M\mapsto M^{2}\Psi(M)$ is non-increasing and that $\sum_{M=1}^{\infty}M\Psi(M)<\infty$ . Then for almost every $p\notin\mathbb{Q}$ (in the sense of full Lebesgue measure in $\mathbb{R}$ ), we have $\left|K/M-p\right|<\Psi(M)$ for only finitely many $K,M\in\mathbb{Z}$ .

Proof of Theorem 25.

Let $\widebar{y}\in\mathcal{Y}$ be an irrational symbol, i.e., $P_{Y}(\widebar{y})\notin\mathbb{Q}$. Define $\mathcal{A}:=\left\{y^{n}\in\mathcal{Y}^{n}:y_{1}=\widebar{y}\right\}$ to be the set of output sequences in which $\widebar{y}$ appears in the first position. According to (11),

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\geq\left|\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|,

(25)

where

	$\displaystyle\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})$	$\displaystyle=\sum_{y^{n}\in\mathcal{A}}\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\frac{1}{M}\sum_{y^{n}\in\mathcal{A}}k(y^{n})=:\frac{K}{M},$
	$\displaystyle P_{Y}^{n}(\mathcal{A})$	$\displaystyle=\sum_{y^{n}\in\mathcal{A}}P_{Y}^{n}(y^{n})=P_{Y}(\widebar{y})\sum_{y^{n-1}\in\mathcal{Y}^{n-1}}P_{Y}^{n-1}(y^{n-1})=P_{Y}(\widebar{y}).$

The quantity $k(y^{n})$ is defined in (22) and $K:=\sum_{y^{n}\in\mathcal{A}}k(y^{n})$ is an integer. Hence, $\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})\in\mathbb{Q}$ and $P_{Y}^{n}(\mathcal{A})\notin\mathbb{Q}$ , and (25) describes a problem of Diophantine approximation. Taking $\Psi(M)=M^{-(2+\epsilon)}$ , it follows immediately from Lemma 40 that

\left|\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|=\left|\frac{K}{M}-P_{Y}(\widebar{y})\right|\geq\frac{1}{M^{2+\epsilon}}=2^{-n(2+\epsilon)R}

for all sufficiently large $n$: Lemma 40 guarantees that the reverse inequality holds for only finitely many pairs $(K,M)$, whereas $M=2^{nR}$ takes infinitely many values as the number of channel uses $n$ grows. ∎
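The Diophantine step can be illustrated numerically. The sketch below takes $R=1$ bit, so that $M=2^{n}$, and checks that the best rational approximation $K/M$ stays at distance at least $M^{-(2+\epsilon)}$ from a fixed irrational target. The choice $p=1/\sqrt{2}$ is a hypothetical stand-in: Khintchine’s theorem concerns almost every $p$, while this particular quadratic irrational is known to be badly approximable, so the bound is expected to hold here for all $M\geq 16$ with $\epsilon=0.5$. Double precision limits the range of $n$ that can be tested this way.

```python
import math

def nearest_fraction_gap(p, m):
    """min over integers K of |K/m - p|, attained at the nearest integer to m*p."""
    return abs(round(m * p) / m - p)

p = 1 / math.sqrt(2)                    # hypothetical irrational probability
eps = 0.5                               # hypothetical slack in the exponent
gaps = {n: nearest_fraction_gap(p, 2 ** n) for n in range(4, 21)}
holds = all(gap >= (2 ** n) ** -(2 + eps) for n, gap in gaps.items())
print(holds)
```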

Alternatively, if $P_{Y}$ is rational, a perfect covering can be achieved at high rates, as stated in Theorem 26. This is because $MP_{Y}^{n}(y^{n})$ can always be made an integer for sufficiently large $n$, making $k(y^{n})=MP_{Y}^{n}(y^{n})$ a valid code construction. The detailed proof is as follows.

Proof of Theorem 26.

Given $P_{Y}(y)=\frac{A_{y}}{B_{y}}$ , we have

MP_{Y}^{n}(y^{n})=M\prod_{i=1}^{n}\frac{A_{y_{i}}}{B_{y_{i}}}=\prod_{i=1}^{n}\left(2^{\frac{\log M}{n}}\frac{A_{y_{i}}}{B_{y_{i}}}\right).

When $R\geq\log(\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}}))$ , there always exist sufficiently large $M$ and $n$ such that $2^{\frac{\log M}{n}}$ is a multiple of $B_{y}$ for each $y\in\mathcal{Y}$ and $\frac{\log M}{n}\to R$ as $n\to\infty$ . Therefore, we can employ the alphabet choice in Remark 37 and set a repetition number $k(y^{n})=MP_{Y}^{n}(y^{n})$ . ∎

Example 41.

Consider a ternary output $P_{Y}=[\frac{1}{2},\frac{1}{3},\frac{1}{6}]$ . Then $E_{\mathrm{uni}}^{\mathrm{nl}}(R)=\infty$ when $R\geq\log 6$ .
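Example 41 can be verified with exact rational arithmetic. The sketch below fixes a hypothetical block length $n=3$ and $M=6^{n}$ (that is, $R=\log 6$), and checks that $k(y^{n})=MP_{Y}^{n}(y^{n})$ is an integer for every output sequence and that the counts sum to $M$, so the induced distribution equals $P_{Y}^{n}$ exactly.

```python
import itertools
import math
from fractions import Fraction

p_y = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 3                                   # hypothetical block length
m = 6 ** n                              # M = 2^{nR} with R = log2(6)

counts = {}
for seq in itertools.product(range(len(p_y)), repeat=n):
    prob = math.prod(p_y[s] for s in seq)      # exact Fraction
    counts[seq] = m * prob                     # candidate repetition number k(y^n)

all_integers = all(k.denominator == 1 for k in counts.values())
total = sum(counts.values())
print(all_integers, total == m)
```

Each count factorizes as $\prod_{i}6P_{Y}(y_{i})$ with $6P_{Y}(y)\in\{3,2,1\}$, which is why integrality holds for every $n$ in this example.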

V-C Non-uniform formulation

If the messages are non-uniformly distributed, then in the achievability proof, one must address not only how to construct a good covering code, but also how to select the optimal message distribution. It turns out that an effective choice is to treat $P_{Y}^{n}$ as a source and apply lossless source coding, with the decoding index serving as the message index in the covering setting. More generally, after formulating the lossless source coding problem in Definition 42, we show the equivalence between noiseless soft covering and lossless source coding in Lemma 43.

Definition 42 (Lossless source coding).

Let $\mathcal{M}=\{1,\dots,M\}$ with $M=2^{nR}$. Consider a discrete memoryless source $Y^{n}$ distributed i.i.d. according to $P_{Y}$. An $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme consists of an encoder $\mathcal{E}:\mathcal{Y}^{n}\to\mathcal{M}$ and a decoder $\mathcal{D}:\mathcal{M}\to\hat{\mathcal{Y}}^{n}$, with $\mathrm{Pr_{e}}:=\Pr\{Y^{n}\neq\hat{Y}^{n}\}$ denoting the probability of decoding error.

Lemma 43 (Equivalence between noiseless soft covering and lossless source coding).

Let $w:\mathcal{X}\to\mathcal{Y}$ be a bijective function. Consider a noiseless channel $W_{Y|X}(y|x)=\mathbbm{1}\{y=w(x)\}$ .

(a)

For any $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme, there exists an $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme such that $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}\leq\mathrm{Pr_{e}}$ .
(b)

For any $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme, there exists an $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme such that $\mathrm{Pr_{e}}\leq\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}$.

Lemma 43 is proved in Appendix D-g. Now we employ this lemma to show Theorem 21.

Proof of Theorem 21.

We first show achievability. Follow the alphabet choice in Remark 37 so that $w(\cdot)$ is a bijection. To apply Lemma 43, we need to specify a lossless source coding scheme for the i.i.d. source $P_{Y}^{n}$ . Take any $\epsilon>0$ and define

\mathcal{A}:=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})\leq R-\epsilon}\mathcal{T}_{Q_{Y}}.

Clearly $|\mathcal{A}|\leq M=2^{nR}$ for sufficiently large $n$, meaning that we can assign a distinct label to every sequence in $\mathcal{A}$ using at most $M$ codewords. Therefore, there exists an $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme in which all sequences in $\mathcal{A}$ are correctly decoded, and hence $\mathrm{Pr_{e}}\leq P_{Y}^{n}(\mathcal{A}^{c})$ with $\mathcal{A}^{c}=\mathcal{Y}^{n}\backslash\mathcal{A}$. According to Lemma 43 (a), there exists an $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme such that

	$\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}$	$\displaystyle\leq\mathrm{Pr_{e}}\leq P_{Y}^{n}(\mathcal{A}^{c})=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R-\epsilon}P_{Y}^{n}(\mathcal{T}_{Q_{Y}})$
		$\displaystyle\overset{a}{\leq}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R-\epsilon}D(Q_{Y}\|P_{Y})\right\},$

where $(a)$ follows from the properties of types. Taking $\epsilon\to 0$ and $n\to\infty$ completes the proof of achievability.

For the converse, given any covering code, by the alphabet restriction in Remark 37, $w(\cdot)$ can be regarded as bijective. Hence, according to Lemma 43 (b), for any $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme, there exists a corresponding $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme such that

\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\frac{1}{2}\mathrm{Pr_{e}}\overset{a}{\geq}\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R+\epsilon}D(Q_{Y}\|P_{Y})\right\},

where $(a)$ follows from the converse exponent of the lossless source coding, e.g., [csiszar2011information, Thm. 2.15]. Taking $\epsilon\to 0$ and $n\to\infty$ completes the proof of the converse.

So far, we have proven the variational form, which is exactly the error exponent of the lossless source coding problem. The dual form thus follows immediately from [csiszar2011information, Problem 2.15]. ∎
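As a numerical illustration of the agreement between the variational and dual forms, the following Python sketch evaluates both for a binary source. It assumes the classical dual expression $\max_{\rho\geq 0}\{\rho R-(1+\rho)\log_{2}\sum_{y}P_{Y}(y)^{1/(1+\rho)}\}$ for the lossless source coding exponent, which is not restated in this section. The function names, the grid resolutions, and the truncation `rho_max` are illustrative choices, and both optimizations are only approximated on finite grids.

```python
from math import log2


def binary_entropy(q: float) -> float:
    """Entropy (bits) of a Bernoulli(q) distribution, 0 < q < 1."""
    return -q * log2(q) - (1.0 - q) * log2(1.0 - q)


def binary_kl(q: float, p: float) -> float:
    """Relative entropy D(Bern(q) || Bern(p)) in bits, 0 < q, p < 1."""
    return q * log2(q / p) + (1.0 - q) * log2((1.0 - q) / (1.0 - p))


def exponent_variational(p: float, R: float, grid: int = 20000) -> float:
    """Grid approximation of min D(Q||P) over Q with H(Q) >= R."""
    feasible = [
        binary_kl(j / grid, p)
        for j in range(1, grid)
        if binary_entropy(j / grid) >= R
    ]
    return min(feasible) if feasible else float("inf")


def exponent_dual(p: float, R: float, rho_max: float = 20.0, grid: int = 20000) -> float:
    """Grid approximation of max over 0 <= rho <= rho_max (assumed dual form)."""
    best = 0.0  # rho = 0 contributes exactly 0
    for j in range(1, grid + 1):
        rho = rho_max * j / grid
        s = 1.0 / (1.0 + rho)
        value = rho * R - (1.0 + rho) * log2(p**s + (1.0 - p) ** s)
        best = max(best, value)
    return best
```

For instance, with $P_{Y}=(0.1,0.9)$ and $R=0.8$ bits, `exponent_variational(0.1, 0.8)` and `exponent_dual(0.1, 0.8)` agree up to the grid resolution, and both vanish for rates below $H(P_{Y})$.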

V-D $H_{-\infty}$ -constrained formulation

Proof of Theorem 22.

We begin with the converse. In fact, the proof of the converse is identical to that of Theorem 19 in Section V-A. The ‘bad’ set $\mathcal{B}$ defined in (23) applies here as well, because in the $H_{-\infty}$ -constrained formulation, the minimal probability in the message set is also at least $1/M$ . All output sequences with probability $P_{Y}^{n}(y^{n})\leq 1/2M$ contribute to the covering error. Ergo, $\overline{E}^{\mathrm{nl}}(R)$ in Theorem 19, which serves as a converse for the uniform formulation, also constitutes a converse for the $H_{-\infty}$ -constrained formulation, even though the latter is a more general formulation that encompasses the former.

For the achievability, we follow the alphabet choice in Remark 37 so that $w(\cdot)$ is a bijection. Similar to the non-uniform case, we consider a lossless source coding scheme for $P_{Y}^{n}$ , where only sequences with probability greater than $1/M$ are encoded. This can be interpreted as a lossless source coding scheme where the rate is measured by $H_{-\infty}$ of the encoded messages, instead of $H_{0}$ , which is equal to the logarithm of the size of the message set. Those sequences are in fact from the good set $\mathcal{G}$ defined in (24). The corresponding source coding scheme is given by

\mathcal{E}(y^{n})=\begin{cases}1,2,\dots,|\mathcal{G}|&\text{if }y^{n}\in\mathcal{G},\\ 0&\text{if }y^{n}\notin\mathcal{G},\end{cases}\qquad\mathcal{D}(i)=\begin{cases}\mathcal{E}^{-1}(i)&\text{if }i=1,2,\dots,|\mathcal{G}|,\\ \text{declare error}&\text{if }i=0.\end{cases}

(26)

Clearly, all sequences in $\mathcal{G}^{c}=\mathcal{Y}^{n}\backslash\mathcal{G}$ contribute to the decoding error. According to Lemma 43 (a), there exists an $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme such that

\begin{aligned}
\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}&\leq\mathrm{Pr_{e}}=P_{Y}^{n}(\mathcal{G}^{c})=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}P_{Y}^{n}(\mathcal{T}_{Q_{Y}})\\
&\overset{a}{\leq}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}D(Q_{Y}\|P_{Y})\right\},
\end{aligned}

where $(a)$ follows from the properties of types. The exponent here is exactly $\overline{E}^{\mathrm{nl}}(R)$ as $n\to\infty$ . However, Lemma 43 (a) only yields a soft-covering scheme under the non-uniform formulation. We need to show that the construction in (26) indeed results in a soft-covering scheme under the $H_{-\infty}$ -constrained formulation. Specifically, the corresponding message distribution in (66) (see Appendix D-g) needs to satisfy $q(i)\geq 1/M$ for all $i=0,1,\dots,|\mathcal{G}|$ . First, the definition of $\mathcal{G}$ in (24) ensures that $q(i)\geq 1/M$ for $i=1,\dots,|\mathcal{G}|$ . It remains to verify that $q(0)\geq 1/M$ . Since $q(0)=\mathrm{Pr_{e}}$ , it is equivalent to check $\overline{E}^{\mathrm{nl}}(R)\leq R$ . By (65) in Appendix D-b, when $\overline{E}^{\mathrm{nl}}(R)$ is finite, we have $\overline{E}^{\mathrm{nl}}(R)=D(Q_{Y}^{*}\|P_{Y})$ for some optimizer $Q_{Y}^{*}\in\mathcal{P}(\mathcal{Y})$ such that $D(Q_{Y}^{*}\|P_{Y})+H(Q_{Y}^{*})=R$ . Therefore, $\overline{E}^{\mathrm{nl}}(R)\leq R$ . This completes the proof. ∎

VI Error Exponent for Noisy Channels

This section addresses soft covering error exponents for noisy channels. We prove Theorem 27 and Theorem 28.

VI-A Achievability

Proof of Theorem 27.

We can extend soft covering for noiseless channels to noisy channels by covering the input distribution instead of the output distribution. This leads to a high-rate improvement in the achievable error exponent compared with random coding. The output distribution in (1) can be written as

\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\sum_{x^{n}}\tilde{P}_{X^{n}|\mathcal{C}}(x^{n})\ W(y^{n}|x^{n}),

where $\tilde{P}_{X^{n}|\mathcal{C}}$ is the code-induced input distribution. Pick any $P_{X}\in\mathcal{S}$ . Then $P_{Y}=P_{X}W$ . We have

\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\overset{a}{\leq}\frac{1}{2}\left\|\tilde{P}_{X^{n}|\mathcal{C}}-P_{X}^{n}\right\|_{1}\overset{b}{\leq}2^{-n\underline{E}^{\mathrm{nl}}(R)},

where $(a)$ follows from the data processing inequality of the total variation, and $(b)$ follows from Theorem 20. Here $\underline{E}^{\mathrm{nl}}(R)$ is the noiseless bound of the $(R,n,P_{X})$ covering problem. Hence, we can follow the code construction in the proof of Theorem 20 in Section V-A and then cover the input distribution $P_{X}^{n}$ . The best such bound is obtained by maximizing over $P_{X}\in\mathcal{S}$ . ∎

VI-B Converse

Proof of Theorem 28.

Consider any code $\mathcal{C}$ and fix any $\epsilon>0$ . We claim that there exists a type $\widebar{Q}_{X}\in\mathcal{P}_{n}(\mathcal{X})$ such that

\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\geq 2^{-n\epsilon}.

(27)

To see this, suppose $\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}<2^{-n\epsilon}$ for all types $Q_{X}\in\mathcal{P}_{n}(\mathcal{X})$ . Then we have

\begin{aligned}
\sum_{i=1}^{M}q(i)&=\sum_{i=1}^{M}q(i)\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}=\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\\
&\leq|\mathcal{P}_{n}(\mathcal{X})|\,2^{-n\epsilon}\leq(n+1)^{|\mathcal{X}|}2^{-n\epsilon}<1
\end{aligned}

for sufficiently large $n$ , thereby leading to a contradiction. Hence, (27) is proven, which implies that even though $\mathcal{C}$ is not necessarily a constant composition code, it has a constant composition subset that includes most probable codewords.
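The pigeonhole argument behind (27) can be checked on a toy instance. The following Python sketch groups codewords by type, sums the message probabilities within each type class, and returns the heaviest class; its mass is at least $(n+1)^{-|\mathcal{X}|}$, which exceeds $2^{-n\epsilon}$ once $n$ is large. The helper names and the random instance generator are hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict


def dominant_type(codewords, q, alphabet):
    """Return (type, mass) of the type class carrying the most message probability.

    A type is represented by the tuple of symbol counts over `alphabet`.
    """
    mass = defaultdict(float)
    for word, prob in zip(codewords, q):
        counts = tuple(word.count(a) for a in alphabet)
        mass[counts] += prob
    return max(mass.items(), key=lambda item: item[1])


def random_instance(n, M, alphabet, seed=0):
    """Illustrative random code with a random (non-uniform) message distribution."""
    rng = random.Random(seed)
    codewords = [tuple(rng.choice(alphabet) for _ in range(n)) for _ in range(M)]
    weights = [rng.random() for _ in range(M)]
    total = sum(weights)
    q = [w / total for w in weights]
    return codewords, q
```

Calling `dominant_type(*random_instance(8, 50, (0, 1, 2)), (0, 1, 2))` returns a count vector summing to $n=8$ together with a mass of at least $9^{-3}$.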

We follow (11) and set

\mathcal{A}=\bigcup_{i:X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}}\bigsqcup_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\mathcal{T}_{V}(X^{n}(i)).

That is, we restrict attention to codewords of type $\widebar{Q}_{X}$ and to $V$ -shells with $V\in\mathcal{V}_{n}(\widebar{Q}_{X})$ , where $\mathcal{V}_{n}(\widebar{Q}_{X})$ is a subset of $\mathcal{P}_{n}(\mathcal{Y}|\widebar{Q}_{X})$ to be chosen later. Following the proof of Proposition 30, for any codeword $X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}$ we obtain

\begin{aligned}
W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))&\geq W_{Y|X}^{n}\left(\bigsqcup_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\geq\max_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}W_{Y|X}^{n}\left(\mathcal{T}_{V}(X^{n}(i))\big|X^{n}(i)\right)\\
&\overset{a}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\},
\end{aligned}

where $(a)$ follows from the properties of types. Averaging over all codewords further yields

\begin{aligned}
\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})&=\sum_{i=1}^{M}q(i)\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\}\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\\
&\overset{a}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\},
\end{aligned}

where $(a)$ follows from (27). On the other hand, similarly to the calculation of $P_{Y}^{n}(\mathcal{A})$ in the proof of Proposition 30, here we have

\begin{aligned}
P_{Y}^{n}(\mathcal{A})&\leq(n+1)^{|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\left[D(\widebar{Q}_{X}V\|P_{Y})+\big|I(\widebar{Q}_{X};V)-R\big|^{+}\right]\right\}\\
&\leq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{n\epsilon}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\left[D(\widebar{Q}_{X}V\|P_{Y})+\big|I(\widebar{Q}_{X};V)-R\big|^{+}\right]\right\}.
\end{aligned}

Inserting the above expressions into (11), it is then reasonable to choose

\mathcal{V}_{n}(Q_{X})=\left\{V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X}):D(V\|W|Q_{X})+3\epsilon<D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right\}

and hence

\begin{aligned}
\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}&\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}(1-2^{-n\epsilon})\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\}\\
&\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}(1-2^{-n\epsilon})\exp_{2}\left\{-n\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\in\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}.
\end{aligned}

Taking $n\to\infty$ and $\epsilon\to 0$ , the set $\mathcal{V}_{n}(Q_{X})$ reduces to

\begin{align}
\mathcal{V}(Q_{X})&=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})<D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right\}\tag{28}\\
&=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(Q_{XY})>\min[R,I(Q_{X};V)]\right\},\tag{29}
\end{align}

where Lemma 3 is used and thus we have acquired a bound

\overline{E}(R):=\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{V\in\mathcal{V}(Q_{X})}D(V\|W|Q_{X}),

(30)

which contains an unconstrained maximization over all input distributions. Clearly $\overline{E}(R)\geq 0$ . Furthermore, take any $Q_{X}\notin\mathcal{S}$ ; then $D(Q_{X}W\|P_{Y})>0$ and hence (28) implies that $W\in\mathcal{V}(Q_{X})$ . Ergo, for any $Q_{X}\notin\mathcal{S}$ , we have $\min_{V\in\mathcal{V}(Q_{X})}D(V\|W|Q_{X})=0$ . As a result, the maximization $\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}$ in (30) can be restricted to $Q_{X}\in\mathcal{S}$ . To be consistent with our notation ( $Q$ associated with $V$ and $P$ associated with $W$ ), we rewrite (30) as

\overline{E}(R):=\max_{P_{X}\in\mathcal{S}}\ \inf_{V\in\mathcal{V}(P_{X})}D(V\|W|P_{X}).

Moreover, Lemma 3 implies that

I(P_{X};V)-\iota(P_{X}V_{Y|X})=D(V\|W|P_{X})-D(P_{X}V\|P_{Y})\geq 0,

where the inequality is simply the data processing inequality of the relative entropy. Consequently, when $P_{X}\in\mathcal{S}$ is taken, (29) reduces to

\mathcal{V}(P_{X})=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})>R\right\},

which completes the proof of the variational form in the theorem.

To reach the dual form, fix $P_{X}\in\mathcal{S}$ and consider the following arguments.

\begin{aligned}
\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X})&=\max_{\alpha\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(V\|W|P_{X})+\alpha\left[R-\iota(P_{X}V_{Y|X})\right]\right\}\\
&\overset{a}{=}\max_{\alpha\geq 0}\left\{\alpha R-\sum_{x}P_{X}(x)\log\left(\sum_{y}\frac{W_{Y|X}^{1+\alpha}(y|x)}{P_{Y}^{\alpha}(y)}\right)\right\}\\
&\overset{b}{=}\max_{\alpha\geq 0}\left[\alpha\left(R-\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})\right)\right],
\end{aligned}

where $(a)$ follows from (38) in Appendix A and $(b)$ follows from Definition 6. ∎
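To make the dual form concrete, the following Python sketch evaluates $\max_{\alpha}\alpha\big(R-\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})\big)$ for a fixed $P_{X}$ on a finite grid of $\alpha$ values, using $D_{1+\alpha}(W\|P)=\frac{1}{\alpha}\log_{2}\sum_{y}W^{1+\alpha}(y)P^{-\alpha}(y)$. The function names, the truncation `alpha_max`, and the grid size are assumptions made for illustration; the sketch also assumes that $P_{Y}$ has full support and does not perform the outer maximization over $P_{X}\in\mathcal{S}$.

```python
from math import log2


def renyi_divergence(w_row, p_y, alpha):
    """Renyi divergence D_{1+alpha}(w_row || p_y) in bits, for alpha > 0."""
    total = sum(
        w ** (1.0 + alpha) / p**alpha for w, p in zip(w_row, p_y) if w > 0.0
    )
    return log2(total) / alpha


def dual_converse_bound(p_x, W, R, alpha_max=50.0, grid=5000):
    """Grid approximation of max over 0 <= alpha <= alpha_max of
    alpha * (R - E_{P_X} D_{1+alpha}(W || P_Y)), for a fixed input distribution.
    """
    num_inputs, num_outputs = len(W), len(W[0])
    # Output distribution induced by p_x through the channel W.
    p_y = [sum(p_x[x] * W[x][y] for x in range(num_inputs)) for y in range(num_outputs)]
    best = 0.0  # alpha = 0 contributes exactly 0
    for j in range(1, grid + 1):
        alpha = alpha_max * j / grid
        avg = sum(
            p_x[x] * renyi_divergence(W[x], p_y, alpha) for x in range(num_inputs)
        )
        best = max(best, alpha * (R - avg))
    return best
```

For a binary symmetric channel with crossover probability $0.1$ and uniform input, $I(P_{X};W)\approx 0.531$ bits; the sketch returns $0$ at $R=0.4$ and a positive value at $R=0.7$. Because of the truncation at `alpha_max`, rates at which the exponent is infinite show up only as a large finite value.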

From Theorem 28, we can summarize the following properties of the proposed converse bound $\overline{E}(R)$ , which are proved in Appendix D-h.

Lemma 44.

We have the following properties of $\overline{E}(R)$ :

(i) $\overline{E}(R)=0$ when $R\leq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ .

(ii) $\overline{E}(R)>0$ when $R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ .

(iii) $\overline{E}(R)=\infty$ when $R>\min_{P_{X}\in\mathcal{S}}\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\iota(P_{X}V_{Y|X})$ .

VII Conclusion

In the present work, we have characterized the exact strong converse exponent of the classical soft covering for rates below the mutual information. This exponent is expressed using a two-parameter information quantity that, to the best of our knowledge, has not been studied in the literature on error exponents with respect to a given channel. A promising direction for future work is a deeper investigation into the implications, properties, and potential applications of this new quantity.

Moreover, this work reveals that the conventional random coding bound is generally not tight in the achievability regime of rates above the mutual information, and that the traditional formulation assuming uniformly distributed words in the code can inherently lead to a rational–irrational discrepancy: even in the noiseless channel case, the reliability exponent diverges at sufficiently high rate for target distributions with all rational values, whereas for typical irrational probability values the exponent remains finite for all rates. Future work includes designing a well-behaved deterministic code for noisy channels that outperforms random coding in both high-rate and low-rate regimes, or even an optimal code that achieves the exact error exponent. Our observations also raise an important open question: can a similar rational–irrational discrepancy arise in other information-theoretic settings?

Appendix A A Random Coding Achievability for the Strong Converse Exponent

In this appendix, we prove Theorem 18. We start with the following lemma.

Lemma 45.

Let $K\sim\text{Binomial}(M,p)$ be a binomial random variable with $M\geq 2$ . Then

\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|\leq p-\frac{1}{2}p\min\{Mp,1\}.

Proof.

Write

\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|=\frac{1}{2}\sum_{k=0}^{\lfloor Mp\rfloor}\Pr\{K=k\}\left(p-\frac{k}{M}\right)+\frac{1}{2}\sum_{k=\lceil Mp\rceil}^{M}\Pr\{K=k\}\left(\frac{k}{M}-p\right)

(31)

and consider the following two cases.

$Mp<1$ : In this case $\lfloor Mp\rfloor=0$ and $\lceil Mp\rceil=1$ , so (31) reduces to

\begin{aligned}
\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|&=\frac{1}{2}\Pr\{K=0\}p+\frac{1}{2}\sum_{k=1}^{M}\Pr\{K=k\}\left(\frac{k}{M}-p\right)\\
&=\frac{1}{2}(1-p)^{M}p+\frac{\mathbb{E}K}{2M}-\frac{1}{2}[1-(1-p)^{M}]p=(1-p)^{M}p\\
&\overset{a}{\leq}p-\frac{1}{2}Mp^{2},
\end{aligned}

where $(a)$ holds because, for $M\geq 2$ and $Mp<1$ ,

(1-p)^{M}\leq 1-Mp+\frac{1}{2}M(M-1)p^{2}\leq 1-Mp+\frac{1}{2}M^{2}p^{2}\leq 1-Mp+\frac{1}{2}Mp=1-\frac{1}{2}Mp.

$Mp\geq 1$ : In this case $\lfloor Mp\rfloor\geq 1$ and analyzing (31) directly becomes challenging. Instead, we apply Jensen’s inequality for the square root, a commonly used technique in the soft covering problem, together with $1/M\leq p$ in the last step:

\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|=\frac{\mathbb{E}\left|K-\mathbb{E}K\right|}{2M}\leq\frac{\sqrt{\text{Var}(K)}}{2M}=\frac{1}{2}\sqrt{\frac{p(1-p)}{M}}\leq\frac{1}{2}\sqrt{\frac{p}{M}}\leq\frac{1}{2}p.

Combining these two cases gives the result. ∎
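The bound of Lemma 45 can be sanity-checked numerically. The following Python sketch computes $\frac{1}{2}\mathbb{E}|K/M-p|$ exactly from the binomial probability mass function and compares it with $p-\frac{1}{2}p\min\{Mp,1\}$ on a finite grid. The function names and the grid are illustrative, and a finite check is of course no substitute for the proof above.

```python
from math import comb


def half_mean_abs_dev(M: int, p: float) -> float:
    """Exact value of (1/2) * E|K/M - p| for K ~ Binomial(M, p)."""
    total = 0.0
    for k in range(M + 1):
        pmf = comb(M, k) * p**k * (1.0 - p) ** (M - k)
        total += pmf * abs(k / M - p)
    return 0.5 * total


def lemma_bound(M: int, p: float) -> float:
    """Right-hand side of the lemma: p - (1/2) * p * min{M p, 1}."""
    return p - 0.5 * p * min(M * p, 1.0)


def bound_holds_on_grid(M_values, p_values, tol: float = 1e-12) -> bool:
    """Check the inequality for every (M, p) pair on a finite grid."""
    return all(
        half_mean_abs_dev(M, p) <= lemma_bound(M, p) + tol
        for M in M_values
        for p in p_values
    )
```

In the regime $Mp<1$ the left-hand side equals $(1-p)^{M}p$, as derived in the first case of the proof; for example, `half_mean_abs_dev(4, 0.1)` evaluates to $0.9^{4}\cdot 0.1$ up to floating-point error.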

Now we prove Theorem 18 using the random coding strategy.

Proof of Theorem 18.

We first prove the variational form (7). Take any $P_{X}\in\mathcal{S}$ and generate a random code $\mathcal{C}=\{X^{n}(1),\dots,X^{n}(M)\}$ with $M=2^{nR}$ , where each codeword $X^{n}(i)$ , $i=1,\dots,M$ , is drawn i.i.d. from $P_{X}$ . Consider any joint type $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ . Similar to the proof of Proposition 32, introduce the backward conditional type $\widebar{V}_{X|Y}=Q_{XY}/Q_{Y}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})$ . We define

\begin{align}
\omega(Q_{XY})&:=W_{Y|X}^{n}(y^{n}|x^{n}),\quad\qquad\forall(x^{n},y^{n})\in\mathcal{T}_{Q_{XY}},\tag{32}\\
p_{\widebar{V}}(Q_{Y})&:=\Pr\{X^{n}\in\mathcal{T}_{\widebar{V}}(y^{n})\},\ \quad\forall y^{n}\in\mathcal{T}_{Q_{Y}}.\tag{33}
\end{align}

The values of $\omega(Q_{XY})$ and $p_{\widebar{V}}(Q_{Y})$ are uniquely determined by the joint type $Q_{XY}$ , independently of the particular sequences $x^{n},y^{n}$ . Take any $y^{n}\in\mathcal{T}_{Q_{Y}}$ . Then we can write

\begin{align}
\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})&=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(y^{n}|X^{n}(i))=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\frac{k_{\widebar{V}}(y^{n})}{M}\omega(Q_{XY}),\notag\\
P_{Y}^{n}(y^{n})&=\sum_{x^{n}}P_{X}^{n}(x^{n})W_{Y|X}^{n}(y^{n}|x^{n})=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}p_{\widebar{V}}(Q_{Y})\omega(Q_{XY}),\tag{34}
\end{align}

where $k_{\widebar{V}}(y^{n})$ is defined in (16). Under random coding,

\begin{align}
\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}&=\frac{1}{2}\sum_{y^{n}}\mathbb{E}_{\mathcal{C}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-P_{Y}^{n}(y^{n})\right|\notag\\
&=\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\mathbb{E}_{\mathcal{C}}\left|\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\left(\frac{k_{\widebar{V}}(y^{n})}{M}-p_{\widebar{V}}(Q_{Y})\right)\right|\notag\\
&\leq\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\ \mathbb{E}_{\mathcal{C}}\left|\frac{k_{\widebar{V}}(y^{n})}{M}-p_{\widebar{V}}(Q_{Y})\right|\notag\\
&\overset{a}{\leq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\left(p_{\widebar{V}}(Q_{Y})-\frac{1}{2}p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\}\right)\notag\\
&\overset{b}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\left(P_{Y}^{n}(y^{n})-\frac{1}{2}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\ p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\}\right)\notag\\
&=1-\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}|\mathcal{T}_{Q_{Y}}|\ \omega(Q_{XY})\ p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\},\tag{35}
\end{align}

where $(a)$ follows from Lemma 45 because $k_{\widebar{V}}(y^{n})\sim\text{Binomial}(M,p_{\widebar{V}}(Q_{Y}))$ according to its definition (16); and we identify $P_{Y}^{n}(y^{n})$ in $(b)$ from (34). Furthermore, from (32) and (33) we have

\begin{aligned}
|\mathcal{T}_{Q_{Y}}|\ \omega(Q_{XY})&\overset{a}{\geq}(n+1)^{-|\mathcal{Y}|}\ 2^{-n[D(V\|W|Q_{X})+H(V|Q_{X})-H(Q_{Y})]}\\
&=(n+1)^{-|\mathcal{Y}|}\ 2^{-n[D(Q_{XY}\|P_{XY})-D(Q_{XY}\|P_{X}Q_{Y})]},\\
p_{\widebar{V}}(Q_{Y})&\overset{b}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n[D(Q_{Y}\widebar{V}\|P_{X})+H(Q_{Y}\widebar{V})-H(\widebar{V}|Q_{Y})]}\\
&=(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-nD(Q_{XY}\|P_{X}Q_{Y})},
\end{aligned}

where $(a)$ and $(b)$ follow from general properties of types. Then (35) simplifies to

\begin{aligned}
\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}&\leq 1-\frac{1}{2}(n+1)^{-\left(|\mathcal{X}|+1\right)|\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{XY}\|P_{XY})+|D(Q_{XY}\|P_{X}Q_{Y})-R|^{+}\right]}\\
&\leq 1-\frac{1}{2}(n+1)^{-\left(|\mathcal{X}|+1\right)|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{XY})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right]\right\}.
\end{aligned}

Taking $n\to\infty$ completes the proof of (7).

Next, we prove the dual form (8). Pick any $P_{X}\in\mathcal{S}$ and introduce the corresponding joint distribution $P_{XY}=P_{X}W_{Y|X}$ . In line with our notation convention, we also define the associated backward channel $\widebar{W}_{X|Y}:=P_{XY}/P_{Y}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})$ . For any other distribution $Q_{XY}$ , it is straightforward to verify the following identities:

\begin{align}
D(Q_{XY}\|P_{XY})&=D(\widebar{V}\|\widebar{W}|Q_{Y})+D(Q_{Y}\|P_{Y}),\tag{36}\\
D(Q_{XY}\|P_{X}Q_{Y})&=D(\widebar{V}\|\widebar{W}|Q_{Y})+\iota(Q_{XY}),\tag{37}
\end{align}

where $\iota(Q_{XY})$ is defined in Definition 2. It is noteworthy that $\iota(Q_{XY})$ satisfies the following property [yagli2019exact, Cor. 1], which can be viewed as a corollary of (46):

\min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\left\{D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\iota(Q_{XY})\right\}=-\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}[2^{-\lambda\iota_{X;Y}}|Y]\right)\right]

(38)

for any $\lambda\in\mathbb{R}$ . Now consider the following chain of transformations:

\begin{aligned}
&\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left\{D(Q_{XY}\|P_{XY})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right\}\\
&\quad=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\ \max_{\lambda\in[0,1]}\big\{D(Q_{XY}\|P_{XY})+\lambda\left[D(Q_{XY}\|P_{X}Q_{Y})-R\right]\big\}\\
&\quad\overset{a}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\ \max_{\lambda\in[0,1]}\left\{D(Q_{Y}\|P_{Y})+(1+\lambda)D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\left[\iota(Q_{XY})-R\right]\right\}\\
&\quad\overset{b}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \max_{\lambda\in[0,1]}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+(1+\lambda)D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\left[\iota(Q_{XY})-R\right]\right\}\\
&\quad\overset{c}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \max_{\lambda\in[0,1]}\left\{D(Q_{Y}\|P_{Y})-(1+\lambda)\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right)\right]-\lambda R\right\}\\
&\quad\overset{d}{=}\max_{\lambda\in[0,1]}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})-(1+\lambda)\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right)\right]-\lambda R\right\}\\
&\quad\overset{e}{=}\max_{\lambda\in[0,1]}\left\{-\log\left(\mathbb{E}_{P_{Y}}\left[\mathbb{E}_{\widebar{W}_{X|Y}}^{1+\lambda}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right]\right)-\lambda R\right\}\\
&\quad\overset{f}{=}\max_{\lambda\in[0,1]}\left\{\lambda\left[I_{\frac{1}{1+\lambda}}(P_{X};W_{Y|X})-R\right]\right\}\\
&\quad\overset{g}{=}\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[I_{\alpha}(P_{X};W_{Y|X})-R\right]\right\},
\end{aligned}

where $(a)$ follows from (36) and (37); $(b)$ follows from the minimax theorem [rockafellar1970convex, Thm 36.3]: the expression is convex in $\widebar{V}$ and linear in $\lambda$ , so we can swap $\min_{\widebar{V}}$ and $\max_{\lambda}$ ; and $(c)$ follows from (38). In $(d)$ , the minimax theorem is invoked again: the expression is clearly convex in $Q_{Y}$ , but its concavity in $\lambda$ is not obvious. To see it, return to the right-hand side of $(b)$ : for fixed $\widebar{V}$ , the expression is linear in $\lambda$ (a trivial case of being concave), so its minimum over $\widebar{V}$ is a concave function of $\lambda$ . $(e)$ is a direct result of (46); $(f)$ follows from Definition 7; and $(g)$ is obtained by setting $\alpha:=\frac{1}{1+\lambda}$ , so that $\lambda=\frac{1-\alpha}{\alpha}$ and $\lambda\in[0,1]$ corresponds to $\alpha\in[\frac{1}{2},1]$ . This completes the proof of (8). ∎

Appendix B Dual forms of the Strong Converse Bounds $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$

In this appendix, we establish the dual forms of the strong converse bounds $\underline{\mathit{\Gamma}}(R)$ (in Proposition 30) and $\overline{\mathit{\Gamma}}(R)$ (in Proposition 32). That is, we prove Proposition 34 and Proposition 35 that are stated in Section IV-C.

B-a Proof of Proposition 34

Proof.

We first show (20) in Proposition 34. Fix $Q_{X}\in\mathcal{P}(\mathcal{X})$ and $R\geq 0$ . The function $\mathit{\Gamma}(s,Q_{X},R)$ defined in (10) is the optimal value of a convex program in $V$ : the objective function is $D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}=\max\{D(Q_{X}V\|P_{Y}),D(V\|P_{Y}|Q_{X})-R\}$ , which is convex in $V$ ; the inequality constraint $D(V\|W|Q_{X})\leq s$ is also convex in $V$ ; and the equality constraint is $\sum_{y}V_{Y|X}(y|x)=1$ , which is linear in $V$ . Therefore, $\mathit{\Gamma}(s,Q_{X},R)$ equals the optimal value of its Lagrangian dual by strong duality [boyd2004convex, Ch. 5]. Explicitly,

\begin{aligned}
\mathit{\Gamma}(s,Q_{X},R)&=\max_{\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}+\delta\left[D(V\|W|Q_{X})-s\right]\right\}\\
&=\max_{\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\ \max_{\mu\in[0,1]}\left\{D(Q_{X}V\|P_{Y})+\mu\big[I(Q_{X};V)-R\big]+\delta\big[D(V\|W|Q_{X})-s\big]\right\}\\
&\overset{a}{=}\max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{X}V\|P_{Y})+\mu\big[I(Q_{X};V)-R\big]+\delta\big[D(V\|W|Q_{X})-s\big]\right\}\\
&\overset{b}{=}\max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\big\{(1+\delta)D(Q_{X}V\|P_{Y})+(\mu+\delta)\left[I(Q_{X};V)-R\right]+\delta\left[R-\iota(Q_{XY})\right]-\delta s\big\},
\end{aligned}

where $(a)$ follows from the minimax theorem: the objective function is convex in $V$ and linear in $\mu$ , so we can swap $\min_{V}$ and $\max_{\mu}$ . $(b)$ follows from Lemma 3. Plugging this expression of $\mathit{\Gamma}(s,Q_{X},R)$ into Proposition 30 yields that

\begin{aligned}
\underline{\mathit{\Gamma}}(R)&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max\big\{s,\mathit{\Gamma}(s,Q_{X},R)\big\}=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\left\{s+\big|\mathit{\Gamma}(s,Q_{X},R)-s\big|^{+}\right\}\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max_{\lambda\in[0,1]}\left\{s+\lambda\big[\mathit{\Gamma}(s,Q_{X},R)-s\big]\right\}\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max_{\lambda,\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\\
&\qquad\big\{\lambda(1+\delta)D(Q_{X}V\|P_{Y})+\lambda(\mu+\delta)\left[I(Q_{X};V)-R\right]+\lambda\delta\left[R-\iota(Q_{XY})\right]+(1-\lambda-\lambda\delta)s\big\}\\
&\overset{a}{\geq}\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{Y}\|P_{Y})+\frac{\mu+\delta}{1+\delta}\left[I(Q_{X};V)-R\right]+\frac{\delta}{1+\delta}\left[R-\iota(Q_{XY})\right]\right\}\\
&\overset{b}{=}\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}\\
&\overset{c}{=}\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}.
\end{aligned}

Here $(a)$ follows from choosing $\lambda=\frac{1}{1+\delta}$ ; this is a valid choice because $\delta\geq 0$ gives $\frac{1}{1+\delta}\in[0,1]$ . $(b)$ follows from defining $\alpha:=\frac{\mu+\delta}{1+\delta}$ and $\beta:=\frac{\delta}{1+\delta}$ (and clearly $\alpha\geq\beta$ ). $(c)$ follows again from the minimax theorem: the objective function is convex in $V$ and linear in $(\alpha,\beta)$ , so we can swap $\min_{V}$ and $\max_{\alpha,\beta}$ . Finally, we merge $\min_{Q_{X}}$ and $\min_{V}$ together as $\min_{Q_{XY}}$ . This completes the proof of (20).

Next, we prove (21) in Proposition 34. Observe that $D(Q_{Y}\|P_{Y})+I(Q_{X};V)=D(Q_{XY}\|Q_{X}P_{Y})$ is convex in $Q_{XY}$ due to the joint convexity of $D(\cdot\|\cdot)$ . Then for $\alpha\leq 1$ , the expression $D(Q_{Y}\|P_{Y})+\alpha I(Q_{X};V)=(1-\alpha)D(Q_{Y}\|P_{Y})+\alpha D(Q_{XY}\|Q_{X}P_{Y})$ is also convex in $Q_{XY}$ . Moreover, $\iota(Q_{XY})$ is linear in $Q_{XY}$ . Hence, the objective function of $\underline{\underline{\mathit{\Gamma}}}(R)$ is convex in $Q_{XY}$ and linear in $(\alpha,\beta)$ , so by the minimax theorem, we can swap $\min_{Q_{XY}}$ and $\max_{\alpha,\beta}$ and obtain that

\begin{align}
\underline{\underline{\mathit{\Gamma}}}(R)&=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{Y};\widebar{V})-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}\tag{39}\\
&=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+(\beta-\alpha)R+K_{\alpha,\beta}(Q_{Y})\right\},\tag{40}
\end{align}

where we have defined

K_{\alpha,\beta}(Q_{Y})=\min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\big[\alpha I(Q_{Y};\widebar{V})-\beta\iota(Q_{XY})\big],

(41)

which is a convex optimization over $\widebar{V}$ and can be solved using the method of Lagrange multipliers. However, directly plugging in the derivative $\frac{\partial I(Q_{Y};\widebar{V})}{\partial\widebar{V}_{X|Y}(x|y)}$ leads to an equation too complicated to solve. Instead, we employ the following variational form of the mutual information $I(Q_{Y};\widebar{V})$ :

I(Q_{Y};\widebar{V})=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y}).

(42)

Therefore, (41) can be rewritten as

K_{\alpha,\beta}(Q_{Y})=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\big[\alpha D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})-\beta\iota(Q_{XY})\big].

(43)

Now, to optimize over $\widebar{V}$ , we can introduce Lagrange multipliers $\gamma_{y}$ for each $y\in\mathcal{Y}$ and write the Lagrangian as

\mathcal{L}=\alpha D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})-\beta\iota(Q_{XY})+\sum_{y}\gamma_{y}\left(\sum_{x}\widebar{V}_{X|Y}(x|y)-1\right).

Substituting

\frac{\partial D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})}{\partial\widebar{V}_{X|Y}(x|y)}=Q_{Y}(y)\left(1+\log\frac{\widebar{V}_{X|Y}(x|y)}{S_{X}(x)}\right),\quad\frac{\partial\iota(Q_{XY})}{\partial\widebar{V}_{X|Y}(x|y)}=Q_{Y}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}

into $\frac{\partial\mathcal{L}}{\partial\widebar{V}_{X|Y}(x|y)}=0$ gives that

Q_{Y}(y)\left(\alpha+\alpha\log\frac{\widebar{V}_{X|Y}(x|y)}{S_{X}(x)}-\beta\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)+\gamma_{y}=0,

further generating the following solution of the optimizer $\widebar{V}^{*}$ :

\widebar{V}^{*}_{X|Y}(x|y)=C(y)S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}

(44)

with some $y$ -dependent normalization factor

C(y)=\left[\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right]^{-1}.

Hence, (43) reduces to

\begin{aligned}
K_{\alpha,\beta}(Q_{Y})&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\log\frac{\widebar{V}^{*}_{X|Y}(x|y)}{S_{X}(x)}-\beta\iota(\widebar{V}^{*}_{X|Y}Q_{Y})\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\left(\log C(y)+\frac{\beta}{\alpha}\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)-\beta\iota(\widebar{V}^{*}_{X|Y}Q_{Y})\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\log C(y)\right]=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{y}Q_{Y}(y)\log C(y)\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[-\alpha\sum_{y}Q_{Y}(y)\log\left(\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right)\right].
\end{aligned}
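The inner minimization over $\widebar{V}$ for a fixed $S_{X}$ can be checked numerically. The sketch below, under assumed toy values for $W_{Y|X}$, $P_{Y}$, $Q_{Y}$, $S_{X}$, $\alpha$ and $\beta$ (all hypothetical, as are the function names), evaluates the objective of (43) at the tilted reverse channel of (44), compares it with the closed form $-\alpha\sum_{y}Q_{Y}(y)\log(1/C(y))^{-1}$ written out in the last line of the reduction, and confirms that randomly drawn reverse channels never do better.

```python
import math
import random

ALPHA, BETA = 0.8, 0.5                      # hypothetical parameters, alpha >= beta
W = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]      # W[x][y] = W(y|x), hypothetical channel
P_Y = [0.4, 0.25, 0.35]
Q_Y = [0.3, 0.3, 0.4]
S_X = [0.6, 0.4]
NX, NY = 2, 3

def ratio(x, y):
    return W[x][y] / P_Y[y]

def objective(V):
    """alpha * D(V Q_Y || S_X Q_Y) - beta * iota(V Q_Y), with V[y][x] = V(x|y)."""
    total = 0.0
    for y in range(NY):
        for x in range(NX):
            v = V[y][x]
            if v > 0:
                total += Q_Y[y] * v * (ALPHA * math.log2(v / S_X[x])
                                       - BETA * math.log2(ratio(x, y)))
    return total

def v_star():
    """Tilted reverse channel of (44); the row sum z equals 1 / C(y)."""
    V = []
    for y in range(NY):
        w = [S_X[x] * ratio(x, y) ** (BETA / ALPHA) for x in range(NX)]
        z = sum(w)
        V.append([wi / z for wi in w])
    return V

def closed_form():
    """-alpha * sum_y Q_Y(y) * log2( sum_x S_X(x) (W/P_Y)^(beta/alpha) )."""
    return -ALPHA * sum(
        Q_Y[y] * math.log2(sum(S_X[x] * ratio(x, y) ** (BETA / ALPHA)
                               for x in range(NX)))
        for y in range(NY))

rng = random.Random(1)

def random_reverse_channel():
    rows = []
    for _ in range(NY):
        a = rng.random()
        rows.append([a, 1 - a])
    return rows

min_random = min(objective(random_reverse_channel()) for _ in range(2000))
print(objective(v_star()), closed_form(), min_random)
```

For each $y$ the objective equals $\alpha$ times a relative entropy to $\widebar{V}^{*}(\cdot|y)$ plus $\alpha\log C(y)$, so the random candidates can only lie above the closed form.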

Inserting this result into (40) yields that

\begin{aligned}
\underline{\underline{\mathit{\Gamma}}}(R)={}&\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{S_{X}\in\mathcal{P}(\mathcal{X})}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\\
&\left\{D(Q_{Y}\|P_{Y})+(\beta-\alpha)R-\alpha\sum_{y}Q_{Y}(y)\log\left(\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right)\right\}\\
\overset{a}{=}{}&\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left\{-\log\left(\sum_{y}P_{Y}^{1-\beta}(y)\left[\sum_{x}S_{X}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right]^{\alpha}\right)+(\beta-\alpha)R\right\}&&\text{(45)}\\
={}&\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right],
\end{aligned}

where $(a)$ is due to the following identity [yagli2019exact, Lem. 20] [verdu2021error, Thm. 1]:

\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})-\sum_{y}Q_{Y}(y)F(y)\right]=-\log\left(\sum_{y}P_{Y}(y)\ 2^{F(y)}\right)

(46)

for any function $F(\cdot)$ with a unique minimizer $Q_{Y}^{*}(y)\ \propto\ 2^{F(y)}P_{Y}(y)$ . This completes the proof of (21). ∎
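The identity (46) is easy to probe numerically. The following minimal sketch, using an arbitrary three-letter $P_{Y}$ and an arbitrary $F$ (both hypothetical, as are the helper names), evaluates the left-hand objective at the tilted distribution $Q_{Y}^{*}\propto 2^{F}P_{Y}$ and at random distributions, and compares with the right-hand side.

```python
import math
import random

def kl(q, p):
    """D(q||p) in bits, with the convention 0 * log 0 = 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def objective(q, p, f):
    """D(q||p) - sum_y q(y) F(y), the quantity minimized in (46)."""
    return kl(q, p) - sum(qi * fi for qi, fi in zip(q, f))

def closed_form(p, f):
    """-log2 sum_y p(y) 2^{F(y)}, the right-hand side of (46)."""
    return -math.log2(sum(pi * 2.0 ** fi for pi, fi in zip(p, f)))

def tilted(p, f):
    """Candidate minimizer Q*(y) proportional to 2^{F(y)} p(y)."""
    w = [pi * 2.0 ** fi for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

p = [0.5, 0.3, 0.2]        # hypothetical P_Y
f = [0.4, -1.0, 2.0]       # hypothetical F
gap_at_optimum = objective(tilted(p, f), p, f) - closed_form(p, f)

rng = random.Random(0)

def random_dist(n):
    w = [rng.random() + 1e-9 for _ in range(n)]
    z = sum(w)
    return [wi / z for wi in w]

min_random_gap = min(objective(random_dist(3), p, f) - closed_form(p, f)
                     for _ in range(1000))
print(gap_at_optimum, min_random_gap)
```

The gap of any candidate $Q_{Y}$ equals $D(Q_{Y}\|Q_{Y}^{*})$, which is non-negative and vanishes only at the tilted distribution.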

For later convenience, we identify the optimizers here. The minimizer $Q_{Y}^{*}$ that yields (45) is

Q_{Y}^{*}(y)=\frac{P_{Y}^{1-\beta}(y)\left(\displaystyle{\sum_{x}S_{X}(x)\ }W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}}{\displaystyle{\sum_{y^{\prime}}P_{Y}^{1-\beta}(y^{\prime})}\left(\displaystyle{\sum_{x^{\prime}}\ }S_{X}(x^{\prime})\ W_{Y|X}^{\frac{\beta}{\alpha}}(y^{\prime}|x^{\prime})\right)^{\alpha}}.

(47)

Recall that we introduced $S_{X}$ merely to apply the variational form of mutual information (42), so the minimizer $S_{X}^{*}$ in (45) is exactly the marginal $Q_{X}^{*}$ of the minimizer $Q_{XY}^{*}$ of (39). Hence, for fixed $\alpha$ and $\beta$, if $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$ in (6) yields an optimizer $Q_{X}^{*}$ (which must satisfy (63) in Appendix C), then we can plug $S_{X}=Q_{X}^{*}$ into (44) and (47) and obtain the corresponding optimizer $Q_{XY}^{*}$ of (39) as

Q_{XY}^{*}(x,y)=\frac{Q_{X}^{*}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ P_{Y}^{1-\beta}(y)\left(\displaystyle{\sum_{x^{\prime}}Q_{X}^{*}(x^{\prime})\ }W_{Y|X}^{\frac{\beta}{\alpha}}(y|x^{\prime})\right)^{\alpha-1}}{\displaystyle{\sum_{y^{\prime}}P_{Y}^{1-\beta}(y^{\prime})}\left(\displaystyle{\sum_{x^{\prime}}Q_{X}^{*}(x^{\prime})\ }W_{Y|X}^{\frac{\beta}{\alpha}}(y^{\prime}|x^{\prime})\right)^{\alpha}}.

(48)

Analogous to Lemma 31, one can observe the following properties of $\underline{\underline{\mathit{\Gamma}}}(R)$ given its definition in (20).

Lemma 46.

We have the following properties of $\underline{\underline{\mathit{\Gamma}}}(R)$ :

(i) $\underline{\underline{\mathit{\Gamma}}}(R)=0$ when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

(ii) $\underline{\underline{\mathit{\Gamma}}}(R)>0$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

Proof.

For simplicity, write $\underline{\underline{\mathit{\Gamma}}}(R)=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\gamma(R,Q_{XY})$ with

\gamma(R,Q_{XY}):=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}.

Given $R$, let $Q_{XY}^{*}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$ minimize $\gamma(R,Q_{XY})$, i.e., $\underline{\underline{\mathit{\Gamma}}}(R)=\gamma(R,Q_{XY}^{*})$. Setting $\alpha=\beta=0$ gives $\underline{\underline{\mathit{\Gamma}}}(R)=\gamma(R,Q_{XY}^{*})\geq D(Q_{Y}^{*}\|P_{Y})\geq 0$. Hence, $\underline{\underline{\mathit{\Gamma}}}(R)$ is non-negative for all $R$. Part (i) follows immediately from Lemma 31 and the fact that $\underline{\underline{\mathit{\Gamma}}}(R)\leq\underline{\mathit{\Gamma}}(R)$.

For (ii), suppose not. Then $\gamma(R,Q_{XY}^{*})=0$, and hence plugging any $(\alpha,\beta)\in[0,1]^{2}$ with $\alpha\geq\beta$ into the objective function of $\gamma(R,Q_{XY}^{*})$ yields a non-positive value. First, plugging $\alpha=\beta=0$ gives $D(Q_{Y}^{*}\|P_{Y})\leq 0$, so $Q_{Y}^{*}=P_{Y}$. Second, plugging $\alpha=\beta=1$ and combining it with Lemma 3 gives $D(V^{*}\|W|Q_{X}^{*})=0$, so $V^{*}=W$ and hence $Q_{X}^{*}\in\mathcal{S}$. Finally, plugging $\alpha=1,\beta=0$ gives $I(Q_{X}^{*};V^{*})\leq R$. To sum up, we obtain $R\geq I(Q_{X}^{*};W)$ with $Q_{X}^{*}\in\mathcal{S}$, contradicting the condition that $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. ∎

B-b Proof of Proposition 35

Proof.

Given any $R>0$ , define

\begin{aligned}
\mathit{\Gamma}_{1}(R)&:=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}+\big|R-\iota(Q_{XY})\big|^{+}\right]&&\text{(49)}\\
&=:\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\Delta(Q_{XY})=\Delta(Q_{XY}^{\star}).&&\text{(50)}
\end{aligned}

That is, let $\Delta(\cdot)$ denote the objective function in $\mathit{\Gamma}_{1}(R)$ and let $Q_{XY}^{\star}$ be one of the optimizers. We claim that $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$.

We first show this is true when $\mathit{\Gamma}_{1}(R)$ vanishes. Suppose $\mathit{\Gamma}_{1}(R)=0$ . Then $Q_{Y}^{\star}=P_{Y}$ and $I(Q_{X}^{\star};V^{\star})\leq R\leq\iota(Q_{XY}^{\star})$ . However, Lemma 3 gives that $I(Q_{X}^{\star};V^{\star})-\iota(Q_{XY}^{\star})=D(V^{\star}\|W|Q_{X}^{\star})\geq 0$ , so $V^{\star}=W$ , and hence $Q_{X}^{\star}\in\mathcal{S}$ and $R=I(Q_{X}^{\star};W)$ . Due to the continuity of $I(\cdot\ ;W)$ , we have $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . By Lemma 33, we further have $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)=0$ .

Next, we show that $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ also holds in their positive regimes. In the rest of our discussions, we may assume that $\mathit{\Gamma}_{1}(R)>0$ . We further write $\mathit{\Gamma}_{1}(R)$ as

\begin{aligned}
\mathit{\Gamma}_{1}(R)&=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1]}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}\\
&=:\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1]}F(Q_{XY},\alpha,\beta)&&\text{(51)}
\end{aligned}

where we have used $F(Q_{XY},\alpha,\beta)$ to denote the objective function. It is shown in the proof of Proposition 34 that $Q_{XY}\mapsto F(Q_{XY},\alpha,\beta)$ is convex and $(\alpha,\beta)\mapsto F(Q_{XY},\alpha,\beta)$ is linear (a trivial case of being concave). Hence, according to the minimax theorem, the optimal value in (51) is attained at a saddle point (other, non-saddle points may also produce the optimal value, but at least one saddle point exists). Now, suppose we have a saddle point, denoted by $(Q_{XY}^{\prime},\alpha^{\prime},\beta^{\prime})$. It must satisfy [rockafellar1970convex, Lem. 36.2]

F(Q_{XY}^{\prime},\alpha,\beta)\leq F(Q_{XY}^{\prime},\alpha^{\prime},\beta^{\prime})\leq F(Q_{XY},\alpha^{\prime},\beta^{\prime}),\quad\forall\ Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}),\ \alpha,\beta\in[0,1].

(52)

It should be noted that the saddle points belong to both minimax ( $\min_{Q_{XY}}\max_{\alpha,\beta}F(Q_{XY},\alpha,\beta)$ ) and maximin ( $\max_{\alpha,\beta}\min_{Q_{XY}}F(Q_{XY},\alpha,\beta)$ ) solutions. Now we make the following claim:

Claim 1. If $\alpha^{\prime}<1$ , then $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ .

Proof of claim. If $\alpha^{\prime}<1$, then $Q_{XY}^{\prime}$ must satisfy $I(Q_{X}^{\prime};V^{\prime})\leq R$; otherwise the first inequality in (52) is violated, since $F(Q_{XY}^{\prime},\alpha=1,\beta=\beta^{\prime})>F(Q_{XY}^{\prime},\alpha^{\prime},\beta^{\prime})$. Since this saddle point $(Q_{XY}^{\prime},\alpha^{\prime},\beta^{\prime})$ is also a minimax solution, we can take $Q_{XY}^{\star}=Q_{XY}^{\prime}$ in (50), and hence $\mathit{\Gamma}_{1}(R)=\Delta(Q_{XY}^{\prime})$. In other words, the minimization $\min_{Q_{XY}}$ in (49) is attained at a point where $I(Q_{X}^{\prime};V^{\prime})\leq R$, so it reduces to the constrained optimization $\min_{Q_{XY}:I(Q_{X};V)\leq R}$, and thus $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$. $\square$

It remains to discuss the case in which $\alpha^{\prime}=1$. Note that the condition $\alpha\geq\beta$ is not used in the proof of (21). Hence, comparing (51) and (20), we can immediately write from (21) that

\mathit{\Gamma}_{1}(R)=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].

(53)

Since the saddle point $(Q_{XY}^{\prime},\alpha^{\prime}=1,\beta^{\prime})$ belongs to the maximin solutions, $(\alpha^{\prime}=1,\beta^{\prime})$ must be the maximizers of (53), and correspondingly the marginal $Q_{X}^{\prime}$ of $Q_{XY}^{\prime}$ belongs to the set of optimizers $Q_{X}^{*}$ of $J_{1,\beta^{\prime}}$ in (6). Specifically, $Q_{X}^{\prime}$ is in the set

\mathcal{P}(\mathcal{X}_{\beta^{\prime}})=\operatorname*{argmax}_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \left[\sum_{x,y}Q_{X}(x)\ W_{Y|X}^{\beta^{\prime}}(y|x)\ P_{Y}^{1-\beta^{\prime}}(y)\right],

(54)

which is a linear optimization over $Q_{X}$. Here $\mathcal{X}_{\beta^{\prime}}:=\operatorname*{argmax}_{x}\left[\sum_{y}W_{Y|X}^{\beta^{\prime}}(y|x)\ P_{Y}^{1-\beta^{\prime}}(y)\right]$ is the set of optimal symbols and $\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ is the set of input distributions supported on $\mathcal{X}_{\beta^{\prime}}$.

Since $\alpha^{\prime}=1$ , we can first exclude two trivial cases: $\beta^{\prime}=0$ or 1. If $\beta^{\prime}=0$ , then (53) gives $\mathit{\Gamma}_{1}(R)=-R$ , while (49) implies that $\mathit{\Gamma}_{1}(R)\geq 0$ . Thus, this case can never happen. Similarly, if $\beta^{\prime}=1$ , then (53) gives $\mathit{\Gamma}_{1}(R)=0$ , while we have assumed the strict positivity of $\mathit{\Gamma}_{1}(R)$ . In conclusion, as long as $\mathit{\Gamma}_{1}(R)>0$ , we have $\beta^{\prime}\in(0,1)$ and can make the following claim.

Claim 2. A saddle point $(Q_{XY}^{\prime},\alpha^{\prime}=1,\beta^{\prime}\in(0,1))$ must satisfy that $I(Q_{X}^{\prime};V^{\prime})\geq R=\iota(Q_{XY}^{\prime})$ .

Proof of claim. This follows directly from the first inequality in (52). $\square$

In the rest of our discussions, we may assume that $\beta^{\prime}\in(0,1)$ . In most circumstances, $\mathcal{X}_{\beta^{\prime}}$ may contain only one symbol, and the corresponding maximizer of (54) is a deterministic distribution. More generally, however, $\mathcal{X}_{\beta^{\prime}}$ may contain multiple symbols, and all distributions supported on these symbols are maximizers of (54); the marginal $Q_{X}^{\prime}$ of the saddle point $Q_{XY}^{\prime}$ is only one of those distributions. Let us consider this general situation: if $\mathcal{X}_{\beta^{\prime}}$ contains multiple symbols, we take any two symbols, say, $x_{1},x_{2}\in\mathcal{X}_{\beta^{\prime}}$ . Denote

G_{x}(\beta):=2^{(\beta-1)D_{\beta}(W_{Y|x}\|P_{Y})}=\sum_{y}W_{Y|X}^{\beta}(y|x)\ P_{Y}^{1-\beta}(y),

where $D_{\beta}$ is the $\beta$ -Rényi divergence in Definition 6. Then we have $G_{x_{1}}(\beta^{\prime})=G_{x_{2}}(\beta^{\prime})$ , meaning that these two functions of $\beta$ intersect at $\beta^{\prime}$ . Since they are both analytic, in the neighborhood of $\beta^{\prime}$ , they either overlap or intersect only at $\beta^{\prime}$ . The former indicates a certain symmetry in $W_{Y|X}$ and $P_{Y}$ (for example, $W_{Y|X}$ is a binary symmetric channel and $P_{Y}$ is uniform). We define that $x_{1}$ and $x_{2}$ belong to the same class, if $G_{x_{1}}(\beta)=G_{x_{2}}(\beta)$ for all $\beta$ in the neighborhood of $\beta^{\prime}$ . We now conduct our discussions based on how many classes $\mathcal{X}_{\beta^{\prime}}$ contains, which can be classified into the following three cases.
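To illustrate the notion of a class, the sketch below evaluates $G_{x}(\beta)$ for the binary symmetric channel with uniform $P_{Y}$ mentioned above, where the two input symbols give the same function of $\beta$, and for an asymmetric channel, where they do not. The crossover probability, the asymmetric channel `ASYM`, and the helper names are illustrative assumptions.

```python
import math

def G(w_row, p_y, beta):
    """G_x(beta) = sum_y W(y|x)^beta * P_Y(y)^(1 - beta)."""
    return sum(w ** beta * p ** (1 - beta) for w, p in zip(w_row, p_y))

def renyi_divergence(w_row, p_y, beta):
    """Renyi divergence D_beta(W_{Y|x} || P_Y) in bits, for beta in (0, 1)."""
    return math.log2(G(w_row, p_y, beta)) / (beta - 1)

EPS = 0.1                                    # hypothetical crossover probability
BSC = [[1 - EPS, EPS], [EPS, 1 - EPS]]       # rows are W(.|x)
ASYM = [[0.9, 0.1], [0.3, 0.7]]              # hypothetical asymmetric channel
UNIFORM = [0.5, 0.5]
BETAS = [k / 10 for k in range(1, 10)]

# Same class: the two BSC inputs give identical G_x on the whole grid.
bsc_gap = max(abs(G(BSC[0], UNIFORM, b) - G(BSC[1], UNIFORM, b)) for b in BETAS)

# Different functions: the asymmetric inputs already differ at beta = 0.5.
asym_gap = abs(G(ASYM[0], UNIFORM, 0.5) - G(ASYM[1], UNIFORM, 0.5))

# Consistency of the two expressions for G_x(beta).
identity_gap = max(
    abs(G(BSC[0], UNIFORM, b)
        - 2.0 ** ((b - 1) * renyi_divergence(BSC[0], UNIFORM, b)))
    for b in BETAS)
print(bsc_gap, asym_gap, identity_gap)
```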

Case 1. $\mathcal{X}_{\beta^{\prime}}$ contains only one class, and there is only one symbol in that class.

In this case, $\mathcal{X}_{\beta^{\prime}}$ basically contains a single symbol, say, $\mathcal{X}_{\beta^{\prime}}=\{x_{0}\}$ for some $x_{0}\in\mathcal{X}$ . Then $\mathcal{P}(\mathcal{X}_{\beta^{\prime}})=\{\delta_{x,x_{0}}\}$ , so (54) has a unique solution which is a deterministic distribution. Since the saddle marginal $Q_{X}^{\prime}\in\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ is in this solution set, we must have $Q_{X}^{\prime}=\delta_{x,x_{0}}$ ; however, this gives $I(Q_{X}^{\prime};V^{\prime})=0<R$ , contradicting $I(Q_{X}^{\prime};V^{\prime})\geq R$ in Claim 2. Ergo, there exists no saddle point in the form of $(Q_{XY}^{\prime},\alpha^{\prime}=1,\beta^{\prime})$ . The saddle point must have $\alpha^{\prime}<1$ . We are back to Claim 1 and hence $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ .

Case 2. $\mathcal{X}_{\beta^{\prime}}$ contains only one class, and there is more than one symbol in that class.

In this case, take any two symbols, say, $x_{1},x_{2}\in\mathcal{X}_{\beta^{\prime}}$ . Then $G_{x_{1}}(\beta)$ and $G_{x_{2}}(\beta)$ are locally the same function. We can further take their derivatives at $\beta^{\prime}$ and write $G_{x_{1}}^{\prime}(\beta^{\prime})=G_{x_{2}}^{\prime}(\beta^{\prime})$ . As a result, the expression

\sum_{y}W_{Y|X}^{\beta^{\prime}}(y|x)\ P_{Y}^{1-\beta^{\prime}}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}

(55)

will be independent of symbol choice in that class, i.e., (55) is identical for all $x\in\mathcal{X}_{\beta^{\prime}}$ . Then, take any $Q_{X}^{*}\in\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ (so $Q_{X}^{*}$ is an optimizer in $J_{1,\beta^{\prime}}$ ) and its corresponding joint optimizer $Q_{XY}^{*}$ for $\min_{Q_{XY}}F(Q_{XY},1,\beta^{\prime})$ is given by (48) (under $\alpha=1$ and $\beta=\beta^{\prime}$ ). The expression

\iota(Q_{XY}^{*})=\frac{\displaystyle{\sum_{x}Q_{X}^{*}(x)\sum_{y}W_{Y|X}^{\beta^{\prime}}(y|x)\ P_{Y}^{1-\beta^{\prime}}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}}}{\displaystyle{\sum_{x}Q_{X}^{*}(x)\sum_{y}W_{Y|X}^{\beta^{\prime}}(y|x)\ P_{Y}^{1-\beta^{\prime}}(y)}}

will be independent of $Q_{X}^{*}$. Since the saddle point $Q_{XY}^{\prime}$ also has marginal $Q_{X}^{\prime}\in\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ and satisfies $\iota(Q_{XY}^{\prime})=R$ according to Claim 2, we must have $\iota(Q_{XY}^{*})=R$ for all $Q_{X}^{*}\in\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$. Note that $\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ contains a deterministic distribution (which yields zero mutual information) and the saddle-point marginal (which yields $I(Q_{X}^{\prime};V^{\prime})\geq R$ by Claim 2). Since $\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ is convex and the mutual information of the joint optimizer (48) varies continuously with $Q_{X}^{*}$, there must exist some $Q_{X}^{\star}\in\mathcal{P}(\mathcal{X}_{\beta^{\prime}})$ such that $I(Q_{X}^{\star};V^{\star})=R=\iota(Q_{XY}^{\star})$. Plugging this $Q_{XY}^{\star}$ into (50) gives

\begin{aligned}
\Delta(Q_{XY}^{\star})&=D(Q_{Y}^{\star}\|P_{Y})+\big|I(Q_{X}^{\star};V^{\star})-R\big|^{+}+\big|R-\iota(Q_{XY}^{\star})\big|^{+}\\
&=D(Q_{Y}^{\star}\|P_{Y})+\left[I(Q_{X}^{\star};V^{\star})-R\right]+\beta^{\prime}\left[R-\iota(Q_{XY}^{\star})\right]\\
&=J_{1,\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-1)R=\mathit{\Gamma}_{1}(R).
\end{aligned}

This means the minimization $\min_{Q_{XY}}$ in (49) can occur at a point where $I(Q_{X}^{\star};V^{\star})=R$ , so $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ .

Case 3. $\mathcal{X}_{\beta^{\prime}}$ contains more than one class.

In this case, we cannot deduce much about $\beta^{\prime}$, and hence about the corresponding $R$. However, let us perturb $\beta^{\prime}$ within its neighborhood, say, to some $\beta^{\prime\prime}\in(0,1)$. Since there are only finitely many symbols, the functions $G_{x}(\beta)$ for $x$ from different classes intersect only finitely many times in the neighborhood of $\beta^{\prime}$, so $\mathcal{X}_{\beta^{\prime\prime}}$ must contain only one class. Take any symbol, say, $x_{1}$ from this class. Then we can write $\beta^{\prime\prime}$ as

\begin{aligned}
\beta^{\prime\prime}&=\operatorname*{argmax}_{\beta\in[0,1]}\left\{J_{1,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-1)(R+\Delta R)\right\}\\
&=\operatorname*{argmax}_{\beta\in[0,1]}\left\{(1-\beta)\left[D_{\beta}\left(W_{Y|x_{1}}\|P_{Y}\right)-(R+\Delta R)\right]\right\}
\end{aligned}

for some $R+\Delta R$ in the neighborhood of $R$ . Note that this is a strictly concave maximization due to the strict concavity of $\beta\mapsto(1-\beta)D_{\beta}$ (if $(1-\beta)D_{\beta}$ is linear in $\beta$ , then the maximizer $\beta^{\prime\prime}$ must be 0 or 1). Thus, this new $R+\Delta R$ is uniquely determined by $\beta^{\prime\prime}$ . If the corresponding saddle point at $R+\Delta R$ has $\alpha^{\prime\prime}<1$ , then we are back to Claim 1. If it has $\alpha^{\prime\prime}=1$ , then that saddle point must be $(Q_{XY}^{\prime\prime},\alpha^{\prime\prime}=1,\beta^{\prime\prime})$ , and we are back to Case 1 or Case 2 and can obtain that $\overline{\mathit{\Gamma}}(R+\Delta R)=\mathit{\Gamma}_{1}(R+\Delta R)$ . In summary, for all $R+\Delta R$ (where $\Delta R\neq 0$ ) in the neighborhood of $R$ , we have $\overline{\mathit{\Gamma}}(R+\Delta R)=\mathit{\Gamma}_{1}(R+\Delta R)$ . Ergo, $\overline{\mathit{\Gamma}}(\cdot)=\mathit{\Gamma}_{1}(\cdot)$ holds almost everywhere in the neighborhood of $R$ . On the other hand, $\overline{\mathit{\Gamma}}(\cdot)$ and $\mathit{\Gamma}_{1}(\cdot)$ are both continuous due to the Berge maximum theorem [aliprantis2006infinite, Item 17.31] (note that in $\overline{\mathit{\Gamma}}(\cdot)$ , $R\twoheadrightarrow\{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R\}$ is a continuous and compact correspondence), so $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ also holds at $R$ .

To sum up, $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ holds everywhere, in both vanishing and positive regimes. The dual form of $\overline{\mathit{\Gamma}}(R)$ follows immediately from (53). This completes the proof. ∎

B-c Proof of Proposition 36

Proof.

We first show that both $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ are convex in $R$. Combining Propositions 34 and 35, both $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ take the dual form

\max_{\alpha,\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].

(56)

Their only difference is the optimization range of $(\alpha,\beta)$. For a fixed pair $(\alpha,\beta)$, the objective function in (56) is linear in $R$ (a trivial case of being convex). Since a pointwise maximum of convex functions is convex, (56) remains convex in $R$ after maximizing over $(\alpha,\beta)$.

Next, we show that both $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ are monotone decreasing when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . Combining convexity and Lemma 33, $\overline{\mathit{\Gamma}}(R)$ is convex in $R$ and positive when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . As $R$ grows, $\overline{\mathit{\Gamma}}(R)$ remains convex and must vanish when $R$ reaches $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . Ergo, $\overline{\mathit{\Gamma}}(R)$ must be monotone decreasing in $R$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ . Similar arguments can be established for $\underline{\underline{\mathit{\Gamma}}}(R)$ by combining its convexity and Lemma 46.

Finally, to prove our conclusion, it suffices to show that when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, the optimization over $\alpha,\beta\in[0,1]$ in $\overline{\mathit{\Gamma}}(R)$ is attained at some $\alpha\geq\beta$. Let $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ and let the optimizers in $\overline{\mathit{\Gamma}}(R)$ be $\alpha^{\prime},\beta^{\prime}$, i.e., $\overline{\mathit{\Gamma}}(R)=J_{\alpha^{\prime},\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-\alpha^{\prime})R$. Suppose $\alpha^{\prime}<\beta^{\prime}$. Then, for any sufficiently small $\Delta R>0$, we have

\begin{aligned}
\overline{\mathit{\Gamma}}(R+\Delta R)&=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)(R+\Delta R)\right]\\
&\geq J_{\alpha^{\prime},\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-\alpha^{\prime})(R+\Delta R)\\
&=\overline{\mathit{\Gamma}}(R)+(\beta^{\prime}-\alpha^{\prime})\Delta R>\overline{\mathit{\Gamma}}(R),
\end{aligned}

meaning that $\overline{\mathit{\Gamma}}(R)$ is locally increasing in $R$ , which contradicts the decreasing property that we just showed. Hence, the optimizers $\alpha^{\prime},\beta^{\prime}$ for $\overline{\mathit{\Gamma}}(R)$ must satisfy that $\alpha^{\prime}\geq\beta^{\prime}$ . ∎

Appendix C A Different Approach to Proving the Converse Part of Theorem 17

Based on Arimoto’s techniques in [arimoto1973converse], we provide a different proof of the converse part of Theorem 17 for the uniform formulation. We begin with the following lemma about the additivity of $J_{\alpha,\beta}$ .

Lemma 47 (Additivity of $J_{\alpha,\beta}$ ).

$J_{\alpha,\beta}\big(W_{Y|X}^{n}\|P_{Y}^{n}\big)=nJ_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right).$

Proof.

We follow the arguments in [gallager1965simple, Thm. 4] and [arimoto1973converse, Lem. 1]. According to (6), $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$ is defined via solving the following optimization problem:

\displaystyle\max_{Q_{X}}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\ \text{ s.t. }\ Q_{X}(x)\geq 0\ \forall x\in\mathcal{X},\text{ and }\ \sum_{x}Q_{X}(x)=1.

(57)

The Lagrangian of this problem is

\mathcal{L}=-\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}-\sum_{x}\lambda_{x}Q_{X}(x)+\mu\left(\sum_{x}Q_{X}(x)-1\right).

Since $\alpha\in[0,1]$ , problem (57) is convex. The KKT conditions are necessary and sufficient for the optimizer $Q_{X}$ and multipliers $\lambda_{x},\mu$ [boyd2004convex, Ch. 5.5.3]. Specifically, letting $Q_{X}^{*},\lambda_{x}^{*},\mu^{*}$ be the optimizers, we have

\begin{aligned}
\sum_{x}Q_{X}^{*}(x)&=1;&&&&\text{(58)}\\
\lambda_{x}^{*}&\geq 0,\ \ \lambda_{x}^{*}Q_{X}^{*}(x)=0&&\forall x\in\mathcal{X};&&\text{(59)}\\
0&=\frac{\partial\mathcal{L}}{\partial Q_{X}(x)}\bigg|_{Q_{X}^{*}(x)}=-\alpha\sum_{y}P_{Y}^{1-\beta}(y)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ \eta_{y}^{\alpha-1}-\lambda_{x}^{*}+\mu^{*}&&\forall x\in\mathcal{X},&&\text{(60)}
\end{aligned}

where

\eta_{y}=\sum_{x}Q_{X}^{*}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x).

(61)

Multiplying (60) by $Q_{X}^{*}(x)$ and summing over $x$ gives

\mu^{*}=\alpha\sum_{y}P_{Y}^{1-\beta}(y)\ \eta_{y}^{\alpha}

(62)

where we have plugged in (58) and (59). Comparing (59), (60), and (62), $Q_{X}^{*}$ must satisfy

\sum_{y}P_{Y}^{1-\beta}(y)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ \eta_{y}^{\alpha-1}\leq\sum_{y}P_{Y}^{1-\beta}(y)\ \eta_{y}^{\alpha}\quad\forall x\in\mathcal{X},

(63)

with equality for every $x\in\mathrm{supp}(Q_{X}^{*})$ due to complementary slackness (59). The direction of the inequality follows from (60) and (62): the left-hand side equals $(\mu^{*}-\lambda_{x}^{*})/\alpha$ with $\lambda_{x}^{*}\geq 0$, while the right-hand side equals $\mu^{*}/\alpha$.

If $Q_{X}^{*}$ satisfies (63), then the product distribution $Q_{X}^{*n}$ also satisfies the $n$-shot version of (63) (replacing $W_{Y|X}$ by $W_{Y|X}^{n}$ and $P_{Y}$ by $P_{Y}^{n}$). To see this, inserting $Q_{X}^{*n}$ into (61) gives

\eta_{y^{n}}=\sum_{x_{1},\dots,x_{n}}\prod_{i=1}^{n}Q_{X}^{*}(x_{i})W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})=\prod_{i=1}^{n}\left[\sum_{x_{i}}Q_{X}^{*}(x_{i})W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})\right]=\prod_{i=1}^{n}\eta_{y_{i}}

and hence (63) becomes

\prod_{i=1}^{n}\left[\sum_{y_{i}}P_{Y}^{1-\beta}(y_{i})\ W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})\ \eta_{y_{i}}^{\alpha-1}\right]\leq\prod_{i=1}^{n}\left[\sum_{y_{i}}P_{Y}^{1-\beta}(y_{i})\ \eta_{y_{i}}^{\alpha}\right]\quad\forall x^{n}\in\mathcal{X}^{n},

which is true as long as the single-letter inequality holds for each factor $i$ in the product (all factors are non-negative), with equality whenever every $x_{i}\in\mathrm{supp}(Q_{X}^{*})$. Since (63) is a necessary and sufficient condition for the optimizer, $Q_{X}^{*n}$ is optimal for the $n$-shot problem, and the optimal value of (57) for $(W_{Y|X}^{n},P_{Y}^{n})$ is the $n$-th power of the single-letter optimal value. Our conclusion is thus proven. ∎
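The tensorization in Lemma 47 can be probed numerically for $n=2$. The following minimal sketch, under an assumed binary channel and assumed values of $\alpha$, $\beta$ and $P_{Y}$ (all hypothetical, as are the function names), maximizes the single-letter objective of (57) by ternary search, checks that the product input attains the square of that value on the two-shot problem, and checks on a coarse grid that no joint input distribution on $\mathcal{X}^{2}$ exceeds it.

```python
import itertools

ALPHA, BETA = 0.7, 0.4                 # hypothetical parameters, alpha >= beta
W = [[0.8, 0.2], [0.3, 0.7]]           # W[x][y] = W(y|x), hypothetical channel
P_Y = [0.6, 0.4]

def phi(q, w, p):
    """Objective of (57): sum_y p(y)^(1-beta) (sum_x q(x) w(y|x)^(beta/alpha))^alpha."""
    total = 0.0
    for y in range(len(p)):
        inner = sum(q[x] * w[x][y] ** (BETA / ALPHA) for x in range(len(q)))
        total += p[y] ** (1 - BETA) * inner ** ALPHA
    return total

def best_single_letter():
    """Ternary search over q = Q_X(0); valid because phi is concave in Q_X."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi([m1, 1 - m1], W, P_Y) < phi([m2, 1 - m2], W, P_Y):
            lo = m1
        else:
            hi = m2
    q = (lo + hi) / 2
    return [q, 1 - q]

# Two-shot channel and target, indexed by pairs (x1, x2) and (y1, y2).
pairs = list(itertools.product(range(2), repeat=2))
W2 = [[W[x1][y1] * W[x2][y2] for (y1, y2) in pairs] for (x1, x2) in pairs]
P2 = [P_Y[y1] * P_Y[y2] for (y1, y2) in pairs]

q1 = best_single_letter()
phi1 = phi(q1, W, P_Y)
phi2_product = phi([q1[x1] * q1[x2] for (x1, x2) in pairs], W2, P2)

# Coarse grid over all joint input distributions on the four input pairs.
STEPS = 20
grid_max = 0.0
for a, b, c in itertools.product(range(STEPS + 1), repeat=3):
    d = STEPS - a - b - c
    if d >= 0:
        q = [a / STEPS, b / STEPS, c / STEPS, d / STEPS]
        grid_max = max(grid_max, phi(q, W2, P2))
print(phi1 ** 2, phi2_product, grid_max)
```

Taking $-\log$ of these values turns multiplicativity of the optimal objective into additivity of $J_{\alpha,\beta}$.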

Now we prove the converse part of Theorem 17 for the uniform formulation, i.e., we show $\mathit{\Gamma}_{\mathrm{uni}}(R)\geq\mathit{\Gamma}(R)$ .

Proof.

For simplicity, we first restrict our discussion to the 1-shot case ($n=1$). Let $\mathcal{C}^{*}=\{X^{*}(1),\ldots,X^{*}(M)\}$ be an optimal code that achieves the minimal total variation using $M$ codewords, i.e.,

\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}^{*}}-P_{Y}\right\|_{1}=\min_{\mathcal{C}:|\mathcal{C}|=M}\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}.

Then there exists a probability distribution over codes, say, $\tilde{Q}_{\mathcal{C}}:=\tilde{Q}_{\mathcal{C}}(X(1),\ldots,X(M))$, that satisfies the following two properties.

(i)

The expected total variation under $\tilde{Q}_{\mathcal{C}}$ equals the minimal total variation:

\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}\right]=\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}^{*}}-P_{Y}\right\|_{1}.

(ii)

The marginal distribution of each codeword is identical, i.e.,

\tilde{Q}_{X}(X(i)):=\sum_{X(1)}\cdots\sum_{X(i-1)}\sum_{X(i+1)}\cdots\sum_{X(M)}\tilde{Q}_{\mathcal{C}}(X(1),\ldots,X(M))

is the same for each $i=1,\ldots,M$ .

One example of such a $\tilde{Q}_{\mathcal{C}}$ is [arimoto1973converse]:

\tilde{Q}_{\mathcal{C}}=\begin{cases}\dfrac{1}{M!}&\text{if }\mathcal{C}=\mathcal{C}^{*}\text{ up to permutations of codewords},\\ 0&\text{otherwise}.\end{cases}

Take any $\alpha,\beta\in[0,1]$ with $\alpha\geq\beta$ , then we have the following:

\begin{aligned}
1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}&\leq 1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}^{*}}-P_{Y}\right\|_{1}=\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}\right]\\
&\overset{a}{=}\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\sum_{y}P_{Y}(y)\min\left\{\frac{1}{M}\sum_{i=1}^{M}\frac{W_{Y|X}(y|X(i))}{P_{Y}(y)},1\right\}\\
&\overset{b}{\leq}\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\sum_{y}P_{Y}(y)\left(\frac{1}{M}\sum_{i=1}^{M}\frac{W_{Y|X}(y|X(i))}{P_{Y}(y)}\right)^{\beta}\\
&=M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\ \mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left(\sum_{i=1}^{M}W_{Y|X}(y|X(i))\right)^{\frac{\beta}{\alpha}\cdot\alpha}\\
&\overset{c}{\leq}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\sum_{i=1}^{M}W_{Y|X}(y|X(i))\right]^{\frac{\beta}{\alpha}}\right)^{\alpha}\\
&\overset{d}{\leq}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\sum_{i=1}^{M}W_{Y|X}^{\frac{\beta}{\alpha}}(y|X(i))\right]\right)^{\alpha}\\
&\overset{e}{=}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(M\ \mathbb{E}_{\tilde{Q}_{X}(X(i))}\left[W_{Y|X}^{\frac{\beta}{\alpha}}(y|X(i))\right]\right)^{\alpha}\\
&=M^{-(\beta-\alpha)}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}\tilde{Q}_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\\
&\leq M^{-(\beta-\alpha)}\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\\
&=2^{-\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right]},
\end{aligned}

where $(a)$ follows from (18); $(b)$ holds because $\min\{x,1\}\leq x^{\beta}$ for $x\geq 0$ and $\beta\in[0,1]$; $(c)$ applies Jensen's inequality to the concave function $x\mapsto x^{\alpha}$; $(d)$ follows from the relation $(x+y)^{s}\leq x^{s}+y^{s}$ for $x,y\geq 0$ and $0\leq s\leq 1$, applied with $s=\beta/\alpha$; and $(e)$ follows from the second property of $\tilde{Q}_{\mathcal{C}}$. The last equality uses $M=2^{R}$ and the definition of $J_{\alpha,\beta}$ in (6). Extending this bound to the $n$-shot case, our conclusion immediately follows from Lemma 47. ∎
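For a single fixed code, the same chain of steps $(b)$ and $(d)$ gives a deterministic bound with $\tilde{Q}_{X}$ replaced by the empirical distribution of the codewords (which is what the permutation-symmetrized distribution has as its marginal). The sketch below checks that bound on a grid of $(\alpha,\beta)$ with $\alpha\geq\beta$ and $\alpha>0$. The channel, the target $P_{Y}$, the three-codeword code, and the function names are all illustrative assumptions.

```python
W = [[0.8, 0.2], [0.3, 0.7]]     # W[x][y] = W(y|x), hypothetical channel
P_Y = [0.55, 0.45]               # output of the uniform input through W
CODE = [0, 0, 1]                 # hypothetical 1-shot code with M = 3 codewords

def one_minus_tv(code):
    """1 - (1/2)||P~_{Y|C} - P_Y||_1 = sum_y min{P~_{Y|C}(y), P_Y(y)}."""
    m = len(code)
    induced = [sum(W[x][y] for x in code) / m for y in range(2)]
    return sum(min(a, b) for a, b in zip(induced, P_Y))

def bound(code, alpha, beta):
    """M^(alpha-beta) * sum_y P_Y^(1-beta) (sum_x Q~(x) W^(beta/alpha))^alpha."""
    m = len(code)
    q_emp = [code.count(x) / m for x in range(2)]   # empirical codeword distribution
    s = beta / alpha
    total = 0.0
    for y in range(2):
        inner = sum(q_emp[x] * W[x][y] ** s for x in range(2))
        total += P_Y[y] ** (1 - beta) * inner ** alpha
    return m ** (alpha - beta) * total

# All grid pairs with 0 < alpha <= 1 and 0 <= beta <= alpha.
grid = [(a / 10, b / 10) for a in range(1, 11) for b in range(0, a + 1)]
worst_slack = min(bound(CODE, a, b) - one_minus_tv(CODE) for a, b in grid)
print(one_minus_tv(CODE), worst_slack)
```

Maximizing the bracketed sum over input distributions and minimizing over $(\alpha,\beta)$ then recovers the exponent form in the last line of the chain.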

Appendix D Proofs of Some Lemmas and Propositions

D-a Properties of $\mathit{\Gamma}(s,Q_{X},R)$

Lemma 48.

Fix any $Q_{X}\in\mathcal{P}(\mathcal{X})$ and $R\geq 0$ . For simplicity, write $f(s):=\mathit{\Gamma}(s,Q_{X},R)$ , which is defined in (10). Then $f(s)$ is a continuous, non-increasing function of $s$ . It is strictly decreasing on the interval $[0,s_{0}]$ for some $s_{0}\geq 0$ and constant on $[s_{0},\infty)$ .

Proof.

For convenience, define the objective function in $f(s)$ to be $g(V):=D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}=\max\big\{D(Q_{X}V\|P_{Y}),D(V\|P_{Y}|Q_{X})-R\big\}$, where the second equality uses $D(Q_{X}V\|P_{Y})+I(Q_{X};V)=D(V\|P_{Y}|Q_{X})$. Then $f(s)=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s}g(V)$.

We first show continuity. Clearly, $g(\cdot)$, being the maximum of two continuous functions, is continuous. Also, $s\twoheadrightarrow\{D(V\|W|Q_{X})\leq s\}$ is a continuous correspondence with compact and non-empty values (i.e., for any $s$, the feasible set $\{D(V\|W|Q_{X})\leq s\}$ is a compact and non-empty set of $V$, and it varies continuously with $s$). By the Berge maximum theorem [aliprantis2006infinite, Item 17.31], $f(s)$ is continuous.

Next, the non-increasing property of $f(s)$ is straightforward: the feasible set $D(V\|W|Q_{X})\leq s$ expands as $s$ grows, and thus the minimization value can never increase. When $s=0$ , the only feasible $V$ is $V=W$ and then $f(0)=D(Q_{X}W\|P_{Y})+\big|I(Q_{X};W)-R\big|^{+}\geq 0$ . On the other hand, note that $D(V\|W|Q_{X})$ is bounded when $V\ll W$ , so when $s\geq s_{0}:=\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):V\ll W}D(V\|W|Q_{X})$ , we have

f(s)=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):V\ll W}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right],

which is a constant. Therefore, $f(s)\equiv f(s_{0})$ on $[s_{0},\infty)$ .

Now, before moving to the decreasing property on $[0,s_{0}]$ , we prove the convexity of $f(s)$ . Take $s_{1},s_{2}\geq 0$ . Suppose that the optimizers for $f(s_{1})$ and $f(s_{2})$ are $V_{1}$ and $V_{2}$ , respectively; that is,

f(s_{i})=g(V_{i}),\quad D(V_{i}\|W|Q_{X})\leq s_{i},\quad i=1,2.

For any $t\in[0,1]$ , take $s_{t}=ts_{1}+(1-t)s_{2}$ and $V_{t}=tV_{1}+(1-t)V_{2}$ . By the convexity of $D(\cdot\|W|Q_{X})$ , we have $D(V_{t}\|W|Q_{X})=D\big([tV_{1}+(1-t)V_{2}]\big\|W\big|Q_{X}\big)\leq tD(V_{1}\|W|Q_{X})+(1-t)D(V_{2}\|W|Q_{X})\leq ts_{1}+(1-t)s_{2}=s_{t}$ . Now since $D(V_{t}\|W|Q_{X})\leq s_{t}$ , we further have

\begin{aligned}
f(s_{t})&=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s_{t}}g(V)\leq g(V_{t})=g\big([tV_{1}+(1-t)V_{2}]\big)\\
&\overset{a}{\leq}tg(V_{1})+(1-t)g(V_{2})=tf(s_{1})+(1-t)f(s_{2}),&&\text{(64)}
\end{aligned}

where $(a)$ follows from the convexity of $g(\cdot)$ . To see this, observe that both $D(Q_{X}V\|P_{Y})$ and $D(V\|P_{Y}|Q_{X})$ are convex in $V$ . Thus, $g(\cdot)$ being the maximum of two convex functions, is also convex. (64) thereby indicates the convexity of $f(\cdot)$ .

Equipped with convexity and the non-increasing monotonicity, $f(s)$ must be strictly decreasing on the interval $[0,s_{0}]$ . It is also noteworthy that on this interval, the optimization of $V$ is achieved at the boundary where $D(V\|W|Q_{X})=s$ . Similar arguments can be found in the proof of [csiszar2011information, Cor. 10.4]. ∎
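The convexity of $g(\cdot)$ used in step $(a)$ can be sanity-checked numerically. The following is a minimal sketch, assuming $g(V)=D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}$ as in the expression for $f(0)$ above; the alphabet sizes, the randomly drawn $Q_{X}$, $P_{Y}$, $V_{1}$, $V_{2}$, the rate $R=0.2$, and all function names are illustrative assumptions rather than quantities from the paper.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in bits, with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def g_sum_form(V, Q_X, P_Y, R):
    """g(V) = D(Q_X V || P_Y) + |I(Q_X; V) - R|^+."""
    out = Q_X @ V  # output marginal Q_X V
    mi = sum(Q_X[x] * kl(V[x], out) for x in range(len(Q_X)))
    return kl(out, P_Y) + max(mi - R, 0.0)

def g_max_form(V, Q_X, P_Y, R):
    """g(V) = max{ D(Q_X V || P_Y), D(V || P_Y | Q_X) - R }."""
    out = Q_X @ V
    d_cond = sum(Q_X[x] * kl(V[x], P_Y) for x in range(len(Q_X)))
    return max(kl(out, P_Y), d_cond - R)

def max_convexity_violation(seed=0, R=0.2, num_t=101):
    """Largest g(V_t) - [t g(V_1) + (1-t) g(V_2)] along one random segment."""
    rng = np.random.default_rng(seed)
    Q_X = rng.dirichlet(np.ones(3))         # hypothetical input distribution
    P_Y = rng.dirichlet(np.ones(4))         # hypothetical target distribution
    V1 = rng.dirichlet(np.ones(4), size=3)  # random channels, one row per input
    V2 = rng.dirichlet(np.ones(4), size=3)
    g1 = g_sum_form(V1, Q_X, P_Y, R)
    g2 = g_sum_form(V2, Q_X, P_Y, R)
    worst = -np.inf
    for t in np.linspace(0.0, 1.0, num_t):
        Vt = t * V1 + (1 - t) * V2
        gt = g_sum_form(Vt, Q_X, P_Y, R)
        # The two forms of g should agree up to floating-point error.
        assert abs(gt - g_max_form(Vt, Q_X, P_Y, R)) < 1e-9
        worst = max(worst, gt - (t * g1 + (1 - t) * g2))
    return worst
```

Calling `max_convexity_violation()` for a few seeds should return values that are non-positive up to rounding. This only probes individual segments and is not a substitute for the argument above.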

D-b Proof of Proposition 23

Proof.

We only show the expression for $\underline{E}^{\mathrm{nl}}(R)$, while that for $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ follows from analogous reasoning. We apply the method in [csiszar2011information, Cor. 10.4] and first show the convexity of $\overline{E}^{\mathrm{nl}}(\cdot)$. Let $Q_{Y1}$ and $Q_{Y2}$ be the optimizers for $\overline{E}^{\mathrm{nl}}(R_{1})$ and $\overline{E}^{\mathrm{nl}}(R_{2})$, respectively, i.e.,

\overline{E}^{\mathrm{nl}}(R_{i})=D(Q_{Yi}\|P_{Y}),\quad D(Q_{Yi}\|P_{Y})+H(Q_{Yi})\geq R_{i},\quad i=1,2.

Take any $t\in[0,1]$ . Note that $D(Q_{Y}\|P_{Y})+H(Q_{Y})$ is linear in $Q_{Y}$ . Then $D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R$ implies that $D(tQ_{Y}\|P_{Y})+H(tQ_{Y})\geq tR$ for the unnormalized distribution $tQ_{Y}$ . Thus,

\begin{aligned}
\overline{E}^{\mathrm{nl}}\left(tR_{1}+(1-t)R_{2}\right)&=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq tR_{1}+(1-t)R_{2}}D(Q_{Y}\|P_{Y})\\
&\leq D\left(\left[tQ_{Y1}+(1-t)Q_{Y2}\right]\big\|P_{Y}\right)\\
&\overset{a}{\leq}tD(Q_{Y1}\|P_{Y})+(1-t)D(Q_{Y2}\|P_{Y})\\
&=t\overline{E}^{\mathrm{nl}}(R_{1})+(1-t)\overline{E}^{\mathrm{nl}}(R_{2}),
\end{aligned}

where $(a)$ follows from the convexity of $D(\cdot\|P_{Y})$ . So far, we have shown that $\overline{E}^{\mathrm{nl}}(R)$ is convex in $R$ .

Moreover, observe that $\overline{E}^{\mathrm{nl}}(R)$ is non-decreasing in $R$ and non-negative. Due to its convexity, $\overline{E}^{\mathrm{nl}}(R)$ is strictly increasing in $R$ in the interval where it is finite and positive. Then the optimization of $Q_{Y}$ must be achieved at the boundary:

\overline{E}^{\mathrm{nl}}(R)=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}D(Q_{Y}\|P_{Y}),\quad\text{when }R\leq H_{-\infty}(P_{Y}).

(65)

Hence,

\begin{aligned}
\underline{E}^{\mathrm{nl}}(R)&=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&=\min_{R^{\prime}\leq H_{-\infty}(P_{Y})}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R^{\prime}}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&=\min_{R^{\prime}\leq H_{-\infty}(P_{Y})}\left\{\overline{E}^{\mathrm{nl}}(R^{\prime})+\big|R-R^{\prime}\big|^{+}\right\}\overset{a}{=}\min_{R^{\prime}\leq R}\left\{\overline{E}^{\mathrm{nl}}(R^{\prime})+\big|R-R^{\prime}\big|^{+}\right\},
\end{aligned}

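where $(a)$ follows from the fact that $\overline{E}^{\mathrm{nl}}(\cdot)$ is monotonically increasing: any $R^{\prime}>R$ incurs no penalty term but gives $\overline{E}^{\mathrm{nl}}(R^{\prime})\geq\overline{E}^{\mathrm{nl}}(R)$, so it cannot improve on the choice $R^{\prime}=R$. ∎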
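As an illustration of the resulting identity $\underline{E}^{\mathrm{nl}}(R)=\min_{R^{\prime}\leq R}\{\overline{E}^{\mathrm{nl}}(R^{\prime})+R-R^{\prime}\}$, the sketch below evaluates both sides by brute force on a binary alphabet. It assumes that $\overline{E}^{\mathrm{nl}}(R)$ is the minimum of $D(Q_{Y}\|P_{Y})$ subject to $D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R$, as suggested by the optimizer conditions above; the distribution $P_{Y}=(0.7,0.3)$, the grid resolutions, and the function names are hypothetical choices made only for this example.

```python
import numpy as np

P_Y = np.array([0.7, 0.3])          # hypothetical binary target distribution
q = np.linspace(0.0, 1.0, 2001)
Q = np.stack([q, 1.0 - q], axis=1)  # grid over distributions on a binary alphabet

cross = -(Q @ np.log2(P_Y))         # D(Q||P_Y) + H(Q), linear in Q
ent = -np.sum(Q * np.log2(np.where(Q > 0, Q, 1.0)), axis=1)  # H(Q), 0 log 0 = 0
div = np.maximum(cross - ent, 0.0)  # D(Q||P_Y), clipped against rounding

def E_upper(R):
    """Grid version of min{ D(Q||P_Y) : D(Q||P_Y) + H(Q) >= R }."""
    feasible = cross >= R
    return float(div[feasible].min()) if feasible.any() else np.inf

def E_lower_direct(R):
    """Grid version of min_Q { D(Q||P_Y) + |R - D(Q||P_Y) - H(Q)|^+ }."""
    return float(np.min(div + np.maximum(R - cross, 0.0)))

def E_lower_via_upper(R, step=1e-3):
    """Grid version of min_{R' <= R} { E_upper(R') + R - R' }."""
    num = int(np.ceil(R / step)) + 1
    return min(E_upper(Rp) + (R - Rp) for Rp in np.linspace(0.0, R, num))
```

For rates $R\leq H_{-\infty}(P_{Y})$, `E_lower_direct(R)` and `E_lower_via_upper(R)` should agree up to the spacing of the $R^{\prime}$ grid, since rounding $R^{\prime}$ down by at most `step` changes the objective by at most `step`.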

D-c Proof of Lemma 31

Proof.

Let $P_{X}^{*}$ be the minimizer of $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$ , i.e., $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)=I(P_{X}^{*};W)$ .

For (i), clearly $\underline{\mathit{\Gamma}}(R)$ is non-negative. Recall the expression of $\mathit{\Gamma}(s,Q_{X},R)$ in (10). Take $Q_{X}=P_{X}^{*}$ . Then $\mathit{\Gamma}(s,P_{X}^{*},R)\equiv 0$ for all $s\geq 0$ , because $V=W$ is the optimizer. Then $\underline{\mathit{\Gamma}}(R)\leq\inf_{s\in[0,\infty)}\max\big\{s,\mathit{\Gamma}(s,P_{X}^{*},R)\big\}=0$ , and hence $\underline{\mathit{\Gamma}}(R)=0$ .

For (ii), suppose not. Owing to (15), there exists some $Q_{X}^{*}\in\mathcal{P}(\mathcal{X})$ such that $\mathit{\Gamma}(s,Q_{X}^{*},R)\equiv 0$ for all $s\in[0,\infty)$. Taking $s=0$, the only optimizer is $V^{*}=W$, and thus $0=\mathit{\Gamma}(0,Q_{X}^{*},R)=D(Q_{X}^{*}W\|P_{Y})+\big|I(Q_{X}^{*};W)-R\big|^{+}$, indicating that $Q_{X}^{*}\in\mathcal{S}$ and $I(Q_{X}^{*};W)\leq R$. This contradicts the condition $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. ∎

D-d Proof of Lemma 33

Proof.

Since $I(\cdot\,;W)$ is continuous, when $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$ we can write $R=I(P_{X};W)$ for some $P_{X}\in\mathcal{S}$. Taking $Q_{XY}=P_{X}W_{Y|X}$, we have $\overline{\mathit{\Gamma}}(R)\leq D(P_{Y}\|P_{Y})+\big|R-\iota(P_{X}W_{Y|X})\big|^{+}=0$ because $\iota(P_{X}W_{Y|X})=I(P_{X};W)=R$. On the other hand, $\overline{\mathit{\Gamma}}(R)$ is non-negative, so $\overline{\mathit{\Gamma}}(R)=0$. ∎

D-e Proof of Lemma 38

Proof.

Observe that

\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big[D(Q_{Y}\|P_{Y})+H(Q_{Y})\big]=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[-\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]=-\min_{y}\log P_{Y}(y)=H_{-\infty}(P_{Y}),

where the last equality follows from Remark 5. Therefore, when $R>H_{-\infty}(P_{Y})$, no $Q_{Y}$ satisfies $D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R$, so the minimization defining $\overline{E}^{\mathrm{nl}}(R)$ is not well-defined and we have $\overline{E}^{\mathrm{nl}}(R)=\infty$. In fact, the bad set $\mathcal{B}$ in (23) becomes empty, and hence we can only apply the trivial bound $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\geq 0$, yielding an infinite exponent. ∎
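The maximization above is that of a linear functional over the probability simplex, so it is attained at a vertex. A small numerical sketch of this observation follows, assuming $P_{Y}$ has full support; the example distribution and the helper names are illustrative only.

```python
import numpy as np

def cross_entropy(Q, P):
    """D(Q||P) + H(Q) = -sum_y Q(y) log2 P(y), which is linear in Q."""
    return float(-(np.asarray(Q, float) @ np.log2(np.asarray(P, float))))

def h_minus_inf(P):
    """Renyi entropy of order -infinity: -log2 of the smallest probability."""
    return float(-np.log2(np.min(P)))

def check_max_is_h_minus_inf(P, trials=10_000, seed=0):
    """Compare random points and the extremal vertex against H_{-inf}(P)."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, float)
    target = h_minus_inf(P)
    # Randomly drawn distributions should never exceed the claimed maximum ...
    samples = rng.dirichlet(np.ones(len(P)), size=trials)
    best_random = max(cross_entropy(Q, P) for Q in samples)
    # ... while the point mass on the least likely symbol attains it.
    vertex = np.eye(len(P))[np.argmin(P)]
    return best_random, cross_entropy(vertex, P), target
```

For instance, `check_max_is_h_minus_inf([0.5, 0.3, 0.2])` returns the best value found by random search, the value at the point mass on the least likely symbol, and $H_{-\infty}(P_{Y})$; the first should not exceed the third, and the second should match it.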

D-f Proof of Lemma 39

Proof.

Let $Q_{Y}^{*}:=\text{argmax}_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}H(Q_{Y})$ . Note that the optimization here has an equality constraint $D(Q_{Y}\|P_{Y})+H(Q_{Y})=R$ .

Observe that $R-D(Q_{Y}\|P_{Y})$ is concave in $Q_{Y}$ , with its maximum attained at $Q_{Y}=P_{Y}$ . However, the feasible set of $N_{2}(R)$ does not contain $P_{Y}$ . Then the maximization of $R-D(Q_{Y}\|P_{Y})$ occurs at the boundary of the linear equation $D(Q_{Y}\|P_{Y})+H(Q_{Y})=R$ , yielding that

\begin{aligned}
N_{2}(R)&=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}\left[R-D(Q_{Y}\|P_{Y})\right]\\
&=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}H(Q_{Y})=H(Q_{Y}^{*}).
\end{aligned}

Similarly, $H(Q_{Y})$ is also concave in $Q_{Y}$ , but the feasible set for $N_{1}(R)$ expands with $R$ . This implies that the maximizer for $N_{1}(R)$ lies either on the boundary $D(Q_{Y}\|P_{Y})+H(Q_{Y})=R$ or at the uniform distribution $Q_{Y}=1/|\mathcal{Y}|$ . In summary, $N_{1}(R)=H(Q_{Y}^{*})$ or $N_{1}(R)=\log|\mathcal{Y}|$ , and therefore $N_{1}(R)\geq N_{2}(R)$ . ∎

D-g Proof of Lemma 43

Proof.

For $(a)$ , let $P_{Y^{n},\hat{Y}^{n},I}$ denote the joint distribution of the source sequence $Y^{n}$ , the reconstruction sequence $\hat{Y}^{n}$ , and the random index $I$ taking values in the set $\{1,\dots,M\}$ ; namely,

P_{Y^{n},\hat{Y}^{n},I}(y^{n},\hat{y}^{n},i):=P_{Y}^{n}(y^{n})\mathbbm{1}\{i=\mathcal{E}(y^{n})\}\ \mathbbm{1}\{\hat{y}^{n}=\mathcal{D}(i)\}.

Its marginal on $I$ determines a message distribution

q(i):=P_{I}(i)=\sum_{y^{n}}P_{Y}^{n}(y^{n})\mathbbm{1}\{i=\mathcal{E}(y^{n})\}.

(66)

We choose $q$ as the message distribution, and $w^{-1}\circ\mathcal{D}$ as the covering code $\mathcal{C}$ . Since $w$ is a bijection, $w^{-1}$ is well-defined. The induced distribution in (2) then becomes

\begin{aligned}
\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n})&=\sum_{i=1}^{M}q(i)\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\mathbbm{1}\{x^{n}=w^{-1}(\mathcal{D}(i))\}\\
&=\sum_{i=1}^{M}\sum_{\bar{y}^{n}}P_{Y}^{n}(\bar{y}^{n})\mathbbm{1}\{i=\mathcal{E}(\bar{y}^{n})\}\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\mathbbm{1}\{w(x^{n})=\mathcal{D}(i)\}\\
&=\sum_{\hat{y}^{n}}P_{\hat{Y}^{n}}(\hat{y}^{n})\mathbbm{1}\{y^{n}=\hat{y}^{n}\}.
\end{aligned}

Here $P_{\hat{Y}^{n}}$ is the marginal of $P_{Y^{n},\hat{Y}^{n},I}$ . Also, note that $P_{Y^{n}}$ is naturally the marginal of $P_{Y^{n},\hat{Y}^{n},I}$ . We obtain that

\begin{aligned}
\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}&=\left\|\sum_{\hat{y}^{n}}P_{\hat{Y}^{n}}(\hat{y}^{n})\mathbbm{1}\{Y^{n}=\hat{y}^{n}\}-P_{Y}^{n}\right\|_{1}\\
&\leq\left\|P_{\hat{Y}^{n}}\mathbbm{1}\{Y^{n}=\hat{Y}^{n}\}-P_{Y^{n},\hat{Y}^{n}}\right\|_{1}\\
&=\sum_{y^{n},\hat{y}^{n}}\left|P_{\hat{Y}^{n}}(y^{n})\mathbbm{1}\{y^{n}=\hat{y}^{n}\}-P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})\right|\\
&=\sum_{y^{n}=\hat{y}^{n}}\left[P_{\hat{Y}^{n}}(y^{n})-P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})\right]+\sum_{y^{n}\neq\hat{y}^{n}}P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})\\
&=1-\Pr\{Y^{n}=\hat{Y}^{n}\}+\Pr\{Y^{n}\neq\hat{Y}^{n}\}=2\mathrm{Pr_{e}},
\end{aligned}

where the inequality follows from the monotonicity of the total variation under marginalization.
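The construction in part $(a)$ can be traced on a toy example. The sketch below is only illustrative: it takes $w$ to be the identity map, uses a hypothetical source code that assigns separate indices to the $M$ most likely sequences and sends every other sequence to index 0, and then compares $\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}$ with $2\mathrm{Pr_{e}}$. The function name and the chosen parameters are assumptions made for this example.

```python
import itertools
import numpy as np

def source_code_to_soft_covering(P_Y, n, M):
    """Return (L1 distance of the induced distribution, source coding error)."""
    seqs = list(itertools.product(range(len(P_Y)), repeat=n))
    prob = {y: float(np.prod([P_Y[a] for a in y])) for y in seqs}
    ranked = sorted(seqs, key=lambda y: -prob[y])

    # Hypothetical lossless source code: the M most likely sequences get their
    # own index; every other sequence is (arbitrarily) sent to index 0.
    index_of = {y: i for i, y in enumerate(ranked[:M])}

    def encode(y):
        return index_of.get(y, 0)

    def decode(i):
        return ranked[i]

    pr_e = sum(prob[y] for y in seqs if decode(encode(y)) != y)

    # Message distribution q(i) as in (66).
    q = np.zeros(M)
    for y in seqs:
        q[encode(y)] += prob[y]

    # Induced output distribution for the codebook w^{-1} o D with w = identity.
    induced = {y: 0.0 for y in seqs}
    for i in range(M):
        induced[decode(i)] += q[i]

    l1 = sum(abs(induced[y] - prob[y]) for y in seqs)
    return l1, pr_e
```

For example, `source_code_to_soft_covering(np.array([0.8, 0.2]), n=4, M=5)` returns a pair `(l1, pr_e)` that should satisfy `l1 <= 2 * pr_e`, in line with the chain of inequalities above.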

For $(b)$ , let $\mathcal{C}:\mathcal{M}\to\mathcal{X}^{n}$ be the covering code. Define

\mathcal{J}=\{y^{n}\in\mathcal{Y}^{n}:\text{there exists }i\in\mathcal{M}\text{ such that }y^{n}=w(\mathcal{C}(i))\}=\{y^{n}\in\mathcal{Y}^{n}:\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n})>0\}.

We choose our encoder and decoder in the lossless source coding scheme to be

\begin{aligned}
\mathcal{E}(y^{n})&=\begin{cases}i\text{ for some }i\in\mathcal{M}\text{ such that }y^{n}=w(\mathcal{C}(i))&\text{if }y^{n}\in\mathcal{J}\\ 0&\text{if }y^{n}\notin\mathcal{J}\end{cases},\\
\mathcal{D}(i)&=\begin{cases}w(\mathcal{C}(i))&\text{if }i\in\mathcal{M}\cap\mathcal{E}(\mathcal{Y}^{n})\\ \text{declare error}&\text{if }i=0\end{cases}.
\end{aligned}

It is noteworthy that the given soft covering code $\mathcal{C}$ may contain repeated codewords. As a result, for some $y^{n}\in\mathcal{J}$, there may exist more than one index $i$ such that $y^{n}=w(\mathcal{C}(i))$, and we simply select one representative index among them as the encoder output $\mathcal{E}(y^{n})$. Consequently, only $y^{n}\in\mathcal{J}$ can be reconstructed correctly. We obtain that

\displaystyle\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\sum_{y^{n}\notin\mathcal{J}}P_{Y}^{n}(y^{n})=\mathrm{Pr_{e}}.

Observe that the lossless source coding scheme uses $M+1$ indices (some of them may be unused if $\mathcal{C}$ contains repeated codewords), and its rate satisfies $\frac{1}{n}\log(M+1)\to R$ as $n\to\infty$. ∎
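The reverse direction in part $(b)$ admits a similar toy illustration. The sketch below assumes a noiseless channel with $w$ equal to the identity map and an arbitrary, hypothetical covering code with a repeated codeword; it builds the set $\mathcal{J}$ through a table of representative indices and compares $\Pr_{\mathrm{e}}=P_{Y}^{n}(\mathcal{J}^{c})$ with the $L_{1}$ distance. The function name and inputs are assumptions for this example only.

```python
import itertools
import numpy as np

def soft_covering_to_source_code(P_Y, n, codewords, q):
    """Return (L1 distance, source coding error, number of indices used)."""
    seqs = list(itertools.product(range(len(P_Y)), repeat=n))
    prob = {y: float(np.prod([P_Y[a] for a in y])) for y in seqs}

    # Induced output distribution for a noiseless channel with w = identity.
    induced = {y: 0.0 for y in seqs}
    for i, x in enumerate(codewords):
        induced[tuple(x)] += q[i]

    # Encoder table: one representative index in 1..M for each y in J
    # (the first codeword index that hits y); index 0 is reserved for errors.
    representative = {}
    for i, x in enumerate(codewords, start=1):
        representative.setdefault(tuple(x), i)

    # Only sequences in J (the keys of the table) are reconstructed correctly.
    pr_e = sum(prob[y] for y in seqs if y not in representative)
    l1 = sum(abs(induced[y] - prob[y]) for y in seqs)
    return l1, pr_e, len(codewords) + 1
```

With, say, `codewords = [(0, 0, 0), (0, 0, 1), (0, 0, 0), (1, 0, 0)]` and `q = [0.4, 0.2, 0.1, 0.3]`, the returned `l1` should be at least `pr_e`, and the scheme uses $M+1=5$ indices even though one of them is never produced by the encoder.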

D-h Proof of Lemma 44

Proof.

Fix $P_{X}\in\mathcal{S}$ and define

\overline{E}(R,P_{X}):=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X}).

Let $V^{*}$ be its optimizer, i.e., $\overline{E}(R,P_{X})=D(V^{*}\|W|P_{X})$ with $\iota(P_{X}V^{*}_{Y|X})\geq R$ . When $R\leq I(P_{X};W)$ , we have $V^{*}=W$ and $\overline{E}(R,P_{X})=0$ . When $R>I(P_{X};W)$ , note that $\iota(P_{X}W_{Y|X})=I(P_{X};W)$ , so $V^{*}\neq W$ and hence $\overline{E}(R,P_{X})>0$ . When $R>\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\iota(P_{X}V_{Y|X})$ , $V^{*}$ does not exist; we can only apply a trivial bound $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}\geq 0$ and the exponent is $\overline{E}(R,P_{X})=\infty$ .

In view of $\overline{E}(R)=\max_{P_{X}\in\mathcal{S}}\overline{E}(R,P_{X})$, the above turning points of $\overline{E}(R)$, namely the one from zero to a nonzero value and the one from a finite to an infinite value, are obtained by minimizing the corresponding thresholds over $P_{X}\in\mathcal{S}$. ∎
