On the Strong Converse Exponent and Error Exponent of the Classical Soft Covering

Xingyi He and S. Sandeep Pradhan

Andreas Winter
A preliminary version of this work was presented in part at the 2025 IEEE International Symposium on Information Theory (ISIT), 22-27 June 2025, Ann Arbor MI, doi:10.1109/ISIT63088.2025.11195309 [HePradhanWinter:ISIT]. XH’s and SSP’s work was supported in part by NSF grant CCF-2132815. AW’s work was supported by the European Commission QuantERA project ExTRaQT (Spanish MICIN grant no. PCI2022-132965); by the Spanish MICIN (project PID2022-141283NB-I00) with the support of FEDER funds; by the Spanish MICIN with funding from European Union NextGenerationEU (PRTR-C17.I1) and the Generalitat de Catalunya; by the Spanish MTDFP through the QUANTUM ENIA project: Quantum Spain, funded by the European Union NextGenerationEU within the framework of the “Digital Spain 2026 Agenda”; by the Alexander von Humboldt Foundation; and by the Institute for Advanced Study of the Technical University Munich.
Abstract

This paper establishes the exact strong converse exponent of the soft covering problem in the classical setting. This exponent characterizes the slowest achievable convergence speed of the total variation to one when a code of rate below mutual information is applied to a discrete memoryless channel for synthesizing a product output distribution. The proposed exponent is expressed through a new two-parameter information quantity, differing from the more commonly studied Rényi divergence or Rényi mutual information. In addition, we demonstrate the non-tightness of random coding for rates both below and above mutual information. Discussions on the latter start with noiseless channels, where we develop a deterministic code construction that outperforms random codes in error exponents. We further observe that the conventional formulation, which assumes a uniform distribution over messages, inherently introduces a discrepancy in error exponents depending on whether the components of the target distribution are rational or irrational numbers. To eliminate this discrepancy, we propose a new formulation in which messages are allowed to be distributed non-uniformly, and the rate is given by the negative logarithm of the smallest nonzero message probability (corresponding to the Rényi entropy $H_{-\infty}$ of order $-\infty$). The exact error exponent is characterized in this formulation for noiseless channels. Furthermore, for noisy channels, we provide a high-rate improvement in achievability and derive a converse bound on the error exponent.

I Introduction

The soft covering lemma is a fundamental tool used in various information-theoretic problems, such as channel resolvability, channel simulation, and lossy source coding. It first appeared in [wyner1975common, Thm. 6.3], where Wyner used the normalized (with a factor $1/n$) relative entropy to quantify probabilistic closeness and derived the achievability proof of the common information as the optimal rate of shared randomness between two agents. The concept of soft covering has subsequently been applied in wiretap channels [wyner1975wire, hayashi2006general, parizi2016exact, yu2018renyi] and in other problems including channel synthesis [cuff2010coordination, cuff2013distributed, hsieh2016channel] and channel resolvability [han1993approximation, bloch2013strong, watanabe2014strong, liu2016e_][koga2013information, Ch. 6] with tighter measures of probabilistic closeness such as total variation and relative entropy [winter2005secret, hou2013informational, hou2014effective, cuff2015stronger, cuff2016soft]. It is worth noting that channel resolvability is closely related to soft covering in the sense that the former deals with simulating all output distributions including non-product ones, while the latter focuses solely on the product ones. More recently, developments in quantum information theory have prompted extensive research into soft covering [ahlswede2002strong, hayashi2015quantum, cheng2023error, atif2023lossy, atif2024quantum, shen2024optimal, he2024quantum, hayashi2025resolvability][wilde2011classical, Ch. 17][hayashi2016quantum, Sec. 9.4] and channel simulation [massar2000amount, winter2001compression, winter2004extrinsic, luo2009channel, wilde2012information, bennett2014quantum, radhakrishnan2017one, anshu2019convex] for quantum channels.

The general idea of soft covering is to simulate a given output distribution using a specific channel $n$ times. Suppose we are given a discrete memoryless channel (DMC) with transition probability distribution $W_{Y|X}$ and a desired output distribution $P_{Y}$. Consider any code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$, with $|\mathcal{C}|=M=:2^{nR}$, and with each encoded message being drawn from the uniform distribution on $\{1,2,\ldots,M\}$. The output distribution induced by the code $\mathcal{C}$ is

\[\tilde{P}_{Y^{n}|\mathcal{C}}(\cdot)=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(\cdot|X^{n}(i)).\]

We aim for $\tilde{P}_{Y^{n}|\mathcal{C}}$ to effectively cover the space of $Y^{n}\sim P_{Y}^{n}$. In other words, we want the non-product distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ to asymptotically approximate the product distribution $P_{Y}^{n}$ under the criterion of the total variation $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$. Let $\mathcal{S}:=\{P_{X}:P_{X}W=P_{Y}\}$ be the set of all 1-shot input distributions whose outputs are $P_{Y}$ under the DMC $W_{Y|X}$. According to the soft covering lemma, e.g., [moser2019advanced, Ch. 19], when $R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, we can achieve a good covering: there exists a code for which $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ vanishes exponentially fast. The achievable error exponent has been studied extensively in the literature both in the classical and the classical-quantum settings [hayashi2006general, parizi2016exact, yu2018renyi][cheng2023error][yassaee2019almost][yagli2019exact]. All of these studies employ a random coding strategy when investigating the decaying behavior of the total variation and relative entropy at high rates. However, it is not guaranteed that random coding yields a tight exponent; even if it does, it only establishes an achievability result. An important open problem is to develop a converse bound showing that no code can make the total variation decay too fast. In this work, we demonstrate the non-tightness of random coding and provide a converse bound that holds for all codes.

On the other hand, when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, we expect that the covering error $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ should approach $1$ exponentially fast, that is, $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}=1-2^{-n\mathit{\Gamma}(\mathcal{C},n)}$. We are particularly interested in the minimal achievable exponent:

\[\mathit{\Gamma}(R):=\liminf_{n\to\infty}\ \min_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\mathit{\Gamma}(\mathcal{C},n),\]

which characterizes the slowest convergence of $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$ to one. It can be understood as the optimal performance among all soft covering codes: given a low rate, how well a code can possibly behave to avoid poor covering. Therefore, for an arbitrary code $\mathcal{C}$, we can deduce that $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\geq 1-2^{-n\mathit{\Gamma}(R)}$. In this sense, we refer to $\mathit{\Gamma}(R)$ as the strong converse exponent, as it holds for all codes, not just for random codes.

The dual problem of the soft covering lemma is the packing lemma in channel coding, where a similar exponent, called the reliability function [shannon1959probability], has been thoroughly studied at rates both below [gallager1965simple, haroutunian1968estimates, sibson1969information, blahut1974hypothesis, csiszar1972class, arimoto1977information, augustin1978noisy, verdu2021error][gallager1968information, Ch. 5][csiszar2011information, Ch. 10] and above [arimoto1973converse, dueck1979reliability, oohama2015two, mosonyi2017strong] the capacity. The latter case is known as the strong converse of channel coding [wolfowitz1957coding, wolfowitz1960note, winter1999coding], as the corresponding exponent holds for all codes when the probability of error approaches one. However, in soft covering, the strong converse exponent $\mathit{\Gamma}(R)$ has not been explored in the literature in terms of either lower or upper bounds. It may be noted that [cheng2023error] provides a lower bound on the performance of the random code ensemble at rates below mutual information. [watanabe2014strong] and [hayashi2025resolvability] prove the asymptotic strong converse for classical and classical-quantum channels but do not characterize the exponent.

In this work, we characterize the exact strong converse exponent $\mathit{\Gamma}(R)$ of the classical soft covering problem. It is stated in Theorem 17, and a novel two-parameter Rényi-type information quantity $J_{\alpha,\beta}(W_{Y|X}\|P_{Y})$ is introduced in the process. The converse is derived using a hypothesis testing perspective, and the achievability is established through a deterministic code construction technique combined with the type covering lemma. In addition, we provide a random coding achievability bound for the strong converse exponent in Theorem 18 that is expressed using the Rényi mutual information. By comparing the exact $\mathit{\Gamma}(R)$, the proposed random coding achievability, and the random coding converse in [cheng2023error], we conclude that random coding fails to produce the tight exponent.

Building on the observation of non-tightness of random coding in the strong converse exponents, we provide a new characterization of the error exponents for $R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. In particular, we establish a new lower bound on the error exponents (see Theorem 20) in terms of Rényi entropy for noiseless channels using a new deterministic code construction. We also derive a new upper bound on the error exponents (see Theorem 19) that matches the lower bound until they both reach a slope of unity. The lower bound strictly outperforms random codes for noiseless channels. Furthermore, this lower bound can be extended to noisy channels (see Theorem 27), achieving a high-rate improvement over the random coding exponent given in [yagli2019exact, Thm. 1]. We also provide a new sphere-packing style upper bound on the error exponents of noisy channels, given in Theorem 28.

While investigating the behavior of error exponents for noiseless channels, we uncover an interesting phenomenon: the conventional formulation that assumes a uniform distribution $1/M$ on messages yields different exponents depending on whether the components of the target distribution $P_{Y}$ are rational or irrational numbers. These discrepancies are characterized in Theorems 25 and 26. The reason for this is that the code-induced distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ takes values in multiples of $1/M$, so soft covering essentially approximates $P_{Y}^{n}$ within a quantization grid of step size $1/M$. If $P_{Y}^{n}$ is rational, a perfect covering with zero error is achievable at sufficiently high rates. In contrast, if $P_{Y}^{n}$ is irrational, a nonzero covering error may persist due to limitations imposed by Diophantine approximation [queffelec2013diophantine, Ch. 3]. To address this discrepancy, we propose a new formulation of the soft covering problem, in which the distribution of the messages is allowed to be non-uniform, and the covering rate is defined not by the number of messages but by the inverse of the smallest nonzero message probability. We call this formulation $H_{-\infty}$-constrained soft covering (see Definition 11). For noiseless channels, we provide an exact characterization of the error exponents for this formulation at all rates, without any discrepancy (see Theorem 22). It aligns with that of the uniform distribution at low rates. In summary, the error exponents of the soft covering problems exhibit a very complex behavior.

This paper is organized as follows. Section II introduces useful definitions. All the main results are presented in Section III. Section IV provides a detailed proof of the characterization of the strong converse exponent. In Section V, we provide a detailed discussion of error exponents for noiseless channels under different formulations, and in Section VI we generalize the corresponding results to noisy channels.

II Preliminaries

II-A Notation

This paper applies the method of types, as well as basic notions from discrete probability theory and a range of information quantities. Basic properties of types can be found in many papers and textbooks, e.g. [csiszar2011information, Ch. 2] and [csiszar1998method]. Basic notations we will use are collected below.

Let $\mathcal{M}$ denote a finite message set, $\mathcal{X}$ and $\mathcal{Y}$ denote the finite input and output alphabets of a channel, respectively, and $\mathcal{P}(\mathcal{M})$, $\mathcal{P}(\mathcal{X})$, and $\mathcal{P}(\mathcal{X}\mathcal{Y})$ denote the sets of all probability distributions on $\mathcal{M}$, $\mathcal{X}$, and $\mathcal{X}\times\mathcal{Y}$, respectively. Likewise, let $\mathcal{P}(\mathcal{Y}|\mathcal{X})$ denote the set of all conditional distributions on $\mathcal{Y}$ given $\mathcal{X}$. For any block length $n$, let $\mathcal{P}_{n}(\mathcal{X})$ and $\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ denote the sets of all $n$-types and joint $n$-types on $\mathcal{X}^{n}$ and $\mathcal{X}^{n}\times\mathcal{Y}^{n}$, respectively, and let $\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$ denote the set of all conditional $n$-types on $\mathcal{Y}^{n}$ given an input type $Q_{X}\in\mathcal{P}_{n}(\mathcal{X})$. Denote by $\mathcal{T}_{Q}$ the type class corresponding to an $n$-type $Q$, and by $\mathcal{T}_{V}(x^{n})$ the conditional type class (or $V$-shell) given $x^{n}\in\mathcal{T}_{Q_{X}}$ and a conditional type $V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$. Let $H(V|Q_{X})$ and $I(Q_{X};V)$ denote the conditional entropy $H(Y|X)$ and the mutual information $I(X;Y)$ under the joint distribution $Q_{XY}=Q_{X}V_{Y|X}$, respectively, and define $D(V\|W|Q_{X}):=D(Q_{X}V_{Y|X}\|Q_{X}W_{Y|X})$. Let $\mathcal{S}:=\{P_{X}\in\mathcal{P}(\mathcal{X}):P_{X}W=P_{Y}\}$ denote the set of all input distributions whose induced output distribution under $W_{Y|X}$ is $P_{Y}$. For a distribution $V$, let $\mathrm{supp}(V)$ denote its support, and write $V\ll W$ if $\mathrm{supp}(V)\subseteq\mathrm{supp}(W)$. Finally, define $|x|^{+}:=\max\{x,0\}$.

Consider a joint distribution $Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$ with marginal distributions $Q_{X}\in\mathcal{P}(\mathcal{X})$ and $Q_{Y}\in\mathcal{P}(\mathcal{Y})$. In this paper, we follow the notation that $Q_{XY}=Q_{X}V_{Y|X}=Q_{Y}\widebar{V}_{X|Y}$, where $V_{Y|X}:=Q_{XY}/Q_{X}\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$ and $\widebar{V}_{X|Y}:=Q_{XY}/Q_{Y}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})$ are the forward and backward transition probabilities, respectively. In other words, the letters $Q$ and $V$ are consistently associated throughout this paper’s notation: whenever a joint distribution $Q_{XY}$ appears, the corresponding marginals $Q_{X},Q_{Y}$ and conditionals $V_{Y|X},\widebar{V}_{X|Y}$ are all implicitly defined and need not be restated. Moreover, we write $Q_{Y}=Q_{X}V$.

In this work, the desired output distribution is denoted by $P_{Y}\in\mathcal{P}(\mathcal{Y})$ and the DMC used for covering is denoted by $W_{Y|X}\in\mathcal{P}(\mathcal{Y}|\mathcal{X})$. Definitions of some useful information-theoretic quantities are listed below.

Definition 1 (Information density).

The information density is defined as

\[\iota_{X;Y}(x,y):=\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}.\]

Definition 2 (Expectation of the information density).

Given $Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$, the expectation of the information density under $Q_{XY}$ is defined as

\[\iota(Q_{XY}):=\mathbb{E}_{Q_{XY}}[\iota_{X;Y}]=\begin{cases}\displaystyle{\sum_{x,y}Q_{XY}(x,y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}}&\text{if }V\ll W\\ -\infty&\text{otherwise}\end{cases}.\]
Lemma 3.

The following identity is easily verified by expanding both sides, so we state it without proof:

\[D(V\|W|Q_{X})+\iota(Q_{XY})=D(Q_{X}V\|P_{Y})+I(Q_{X};V).\]
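As a numerical sanity check of Lemma 3, the following minimal sketch evaluates both sides of the identity on a toy binary example. The distributions `Q_X`, `V`, `W`, `P_Y` and the helper names are hypothetical choices made for illustration only; logarithms are taken to base 2, in line with the exponents $2^{-n(\cdot)}$ used in this paper.

```python
import math

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits; assumes supp(p) is contained in supp(q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def lemma3_sides(Q_X, V, W, P_Y):
    """Return (lhs, rhs) of Lemma 3; V[x][y] and W[x][y] are row-stochastic lists."""
    nx, ny = len(Q_X), len(P_Y)
    # Output marginal Q_Y = Q_X V
    Q_Y = [sum(Q_X[x] * V[x][y] for x in range(nx)) for y in range(ny)]
    # D(V||W|Q_X): conditional relative entropy averaged over Q_X
    cond_div = sum(Q_X[x] * kl_bits(V[x], W[x]) for x in range(nx))
    # iota(Q_XY): expected information density under Q_XY = Q_X V
    iota = sum(Q_X[x] * V[x][y] * math.log2(W[x][y] / P_Y[y])
               for x in range(nx) for y in range(ny) if Q_X[x] * V[x][y] > 0)
    # D(Q_X V || P_Y) and I(Q_X; V)
    out_div = kl_bits(Q_Y, P_Y)
    mutual_info = sum(Q_X[x] * kl_bits(V[x], Q_Y) for x in range(nx))
    return cond_div + iota, out_div + mutual_info

# Hypothetical toy inputs (not taken from the paper)
Q_X = [0.3, 0.7]
V = [[0.8, 0.2], [0.4, 0.6]]
W = [[0.9, 0.1], [0.2, 0.8]]
P_Y = [0.5, 0.5]
lhs, rhs = lemma3_sides(Q_X, V, W, P_Y)
print(lhs, rhs)
```

Both sides reduce to $\sum_{x,y}Q_{XY}(x,y)\log\frac{V_{Y|X}(y|x)}{P_{Y}(y)}$, which is why they agree up to floating-point error.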
Definition 4 ($\alpha$-Rényi entropy).

Given $Q_{X}\in\mathcal{P}(\mathcal{X})$, the $\alpha$-Rényi entropy is defined as

\[H_{\alpha}(Q_{X}):=\frac{1}{1-\alpha}\log\left(\sum_{x}Q_{X}^{\alpha}(x)\right).\]
Remark 5.

$H_{-\infty}(Q_{X})=-\log\displaystyle{\min_{x\in\mathrm{supp}(Q_{X})}}Q_{X}(x)$.
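The limit in Remark 5 can be illustrated numerically. The sketch below evaluates $H_{\alpha}$ for a strongly negative order and compares it with the closed form for $H_{-\infty}$. It assumes that the sum in Definition 4 is restricted to the support of the distribution (consistent with Remark 5), uses base-2 logarithms, and the function names and the example distribution are hypothetical.

```python
import math

def renyi_entropy(q, alpha):
    """H_alpha(q) in bits for alpha != 1, summing over the support of q only."""
    logs = [alpha * math.log2(p) for p in q if p > 0]
    m = max(logs)
    # log-sum-exp in base 2, to avoid overflow for very negative alpha
    log_sum = m + math.log2(sum(2.0 ** (l - m) for l in logs))
    return log_sum / (1.0 - alpha)

def renyi_entropy_neg_inf(q):
    """H_{-infinity}(q) = -log2 of the smallest nonzero probability."""
    return -math.log2(min(p for p in q if p > 0))

q = [0.5, 0.3, 0.2]  # hypothetical example
for alpha in (-1.0, -10.0, -1000.0):
    print(alpha, renyi_entropy(q, alpha))
print("limit:", renyi_entropy_neg_inf(q))
```

As the order decreases, the term with the smallest nonzero probability dominates the sum, so the printed values approach $-\log\min_{x}Q_{X}(x)$.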

Definition 6 ($\alpha$-Rényi divergence).

Given $Q_{X},P_{X}\in\mathcal{P}(\mathcal{X})$, their $\alpha$-Rényi divergence is defined as

\[D_{\alpha}(Q_{X}\|P_{X}):=\frac{1}{\alpha-1}\log\left(\sum_{x}Q_{X}^{\alpha}(x)P_{X}^{1-\alpha}(x)\right).\]
Definition 7 ($\alpha$-Rényi mutual information).

Given $P_{X}\in\mathcal{S}$, the $\alpha$-Rényi mutual information is defined as

\begin{align*}
I_{\alpha}(P_{X};W_{Y|X}):=\ &\frac{\alpha}{\alpha-1}\log\sum_{y}\left(\sum_{x}P_{X}(x)W^{\alpha}_{Y|X}(y|x)\right)^{\frac{1}{\alpha}}\\
=\ &\frac{\alpha}{\alpha-1}\log\left(\mathbb{E}_{P_{Y}}\left[\mathbb{E}_{\widebar{W}_{X|Y}}^{1/\alpha}\left[2^{(\alpha-1)\iota_{X;Y}}|Y\right]\right]\right).
\end{align*}

The second equality follows from [yagli2019exact, Rmk. 24][verdu2021error, Item 20], where $\widebar{W}_{X|Y}:=P_{X}W_{Y|X}/{P_{Y}}$.

Remark 8.

If $W_{Y|X}$ is noiseless, i.e., for each $x\in\mathcal{X}$ there exists a symbol $y\in\mathcal{Y}$ such that $W_{Y|X}(y|x)=1$, then $I_{\alpha}(P_{X};W_{Y|X})=H_{\frac{1}{\alpha}}(P_{X}W)$ [verdu2021error, Eq. (85)].
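The reduction in Remark 8 can be checked directly from Definitions 4 and 7: for a noiseless channel the inner sum equals $P_{Y}(y)$, so both quantities become $\frac{\alpha}{\alpha-1}\log\sum_{y}P_{Y}^{1/\alpha}(y)$. The following sketch confirms this numerically on a hypothetical three-input, two-output deterministic channel; the names and the example values are assumptions made for illustration.

```python
import math

def renyi_mutual_information(P_X, W, alpha):
    """I_alpha(P_X; W) in bits, first expression of Definition 7 (alpha > 0, alpha != 1)."""
    nx, ny = len(P_X), len(W[0])
    total = sum(
        sum(P_X[x] * W[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
        for y in range(ny)
    )
    return alpha / (alpha - 1.0) * math.log2(total)

def renyi_entropy(q, alpha):
    """H_alpha(q) in bits (alpha > 0, alpha != 1)."""
    return math.log2(sum(p ** alpha for p in q if p > 0)) / (1.0 - alpha)

# Hypothetical noiseless channel: inputs 0 and 1 map to output 0, input 2 maps to output 1
W = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
P_X = [0.1, 0.2, 0.7]
P_Y = [sum(P_X[x] * W[x][y] for x in range(3)) for y in range(2)]

for alpha in (0.5, 2.0, 3.0):
    print(alpha, renyi_mutual_information(P_X, W, alpha), renyi_entropy(P_Y, 1.0 / alpha))
```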

II-B Notions of soft covering

The precise formulations of the soft covering problem are given in this subsection. We begin with two cases: uniform and non-uniform, depending on whether the message set $\mathcal{M}$ has a uniform distribution or not.

Definition 9 (Uniform soft covering).

An $(R,n,P_{Y},W_{Y|X})$ (uniform) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$, where each message is drawn from the uniform distribution, and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$. This soft-covering scheme induces the following distribution at the output of the DMC $W_{Y|X}$:

\[\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}):=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(y^{n}|X^{n}(i)),\quad\forall y^{n}\in\mathcal{Y}^{n}. \tag{1}\]

The covering error is defined as the total variation between the achieved non-product output distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ and the desired product distribution $P_{Y}^{n}$:

\[\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}:=\frac{1}{2}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-P_{Y}^{n}(y^{n})\right|.\]
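To make Definition 9 concrete, the following brute-force sketch computes the induced output distribution (1) and the covering error for two small codes. The channel (a binary symmetric channel with crossover probability 0.1), the uniform target, the block length, and the two codes are hypothetical choices; the enumeration over all of $\mathcal{Y}^{n}$ is only feasible for very small $n$.

```python
import itertools
import math

def induced_output(code, W, n, out_symbols):
    """Induced distribution (1): average of W^n(y^n | x^n(i)) over the codewords."""
    M = len(code)
    return {
        y: sum(math.prod(W[xi][yi] for xi, yi in zip(x, y)) for x in code) / M
        for y in itertools.product(out_symbols, repeat=n)
    }

def covering_error(code, W, P_Y, n, out_symbols):
    """Total variation between the induced distribution and the product target P_Y^n."""
    P_tilde = induced_output(code, W, n, out_symbols)
    return 0.5 * sum(
        abs(p - math.prod(P_Y[yi] for yi in y)) for y, p in P_tilde.items()
    )

# Hypothetical example: BSC(0.1), uniform target, block length 3
W = [[0.9, 0.1], [0.1, 0.9]]
P_Y = [0.5, 0.5]
n = 3
repetition = [(0, 0, 0), (1, 1, 1)]               # M = 2, rate R = 1/3
full = list(itertools.product((0, 1), repeat=n))  # M = 8, rate R = 1

tv_rep = covering_error(repetition, W, P_Y, n, (0, 1))
tv_full = covering_error(full, W, P_Y, n, (0, 1))
print(tv_rep, tv_full)
```

In this example the two-codeword repetition code leaves a covering error of 0.48, while the code containing all eight input sequences reproduces the uniform product target exactly.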
Definition 10 (Non-uniform soft covering).

An $(R,n,P_{Y},W_{Y|X},q)$ (non-uniform) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$, where messages are drawn from a distribution $q\in\mathcal{P}(\mathcal{M})$, and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$. This soft-covering scheme induces the following distribution at the output of the DMC $W_{Y|X}$:

\[\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n}):=\sum_{i=1}^{M}q(i)\ W_{Y|X}^{n}(y^{n}|X^{n}(i)),\quad\forall y^{n}\in\mathcal{Y}^{n}. \tag{2}\]

The covering error is defined as the total variation between the achieved non-product output distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ and the desired product distribution $P_{Y}^{n}$:

\[\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}:=\frac{1}{2}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n})-P_{Y}^{n}(y^{n})\right|. \tag{3}\]

Besides these two formulations, we introduce a variation based on the non-uniform case: in addition to assuming that the messages have a non-uniform distribution $q$, we further impose the condition that the smallest nonzero probability among all messages is at least $1/M=:2^{-nR}$. This condition can be equivalently written as

\[H_{-\infty}(q)=-\log\min_{i\in\mathcal{M}:q(i)>0}q(i)\leq nR, \tag{4}\]

according to Remark 5. Under this condition, the size of the message set need not be $M$; instead, $|\mathcal{M}|=|\mathrm{supp}(q)|\leq M$. Moreover, given a rate $R$, define

\[\mathcal{Q}(R,n):=\left\{q\in\mathcal{P}(\{1,\dots,M\}):H_{-\infty}(q)\leq nR\right\} \tag{5}\]

to be the set of all message distributions satisfying condition (4). Soft covering under this condition is referred to as the $H_{-\infty}$-constrained formulation, and is given by the following definition.

Definition 11 ($H_{-\infty}$-constrained soft covering).

An $(R,n,P_{Y},W_{Y|X},q)_{-\infty}$ ($H_{-\infty}$-constrained) soft-covering scheme consists of a message set $\mathcal{M}$, where messages are drawn from a distribution $q\in\mathcal{Q}(R,n)$, and a code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(|\mathcal{M}|)\}$. This soft-covering scheme induces the distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ at the output of the DMC $W_{Y|X}$, given by the same expression as in (2), and the covering error is defined as in (3).
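Condition (4) is simple to test for a given message distribution. The sketch below checks membership in $\mathcal{Q}(R,n)$ for a few hypothetical distributions with $n=2$ and $R=1$, so that $1/M=1/4$; the function names and the numerical tolerance are assumptions made for illustration.

```python
import math

def h_neg_inf(q):
    """H_{-infinity}(q) in bits: minus log2 of the smallest nonzero probability."""
    return -math.log2(min(p for p in q if p > 0))

def in_Q(q, R, n, tol=1e-9):
    """Check condition (4): q is a distribution whose smallest nonzero mass is at least 2^{-nR}."""
    is_distribution = all(p >= 0 for p in q) and abs(sum(q) - 1.0) < tol
    return is_distribution and h_neg_inf(q) <= n * R + tol

n, R = 2, 1.0  # hypothetical parameters, so M = 2^{nR} = 4
print(in_Q([0.25, 0.25, 0.25, 0.25], R, n))  # uniform on M messages
print(in_Q([1 / 3, 1 / 3, 1 / 3], R, n))     # masses are not multiples of 1/M
print(in_Q([0.7, 0.2, 0.1], R, n))           # smallest mass is below 1/M
```

The second example is admissible even though its probabilities are not multiples of $1/M$, which is the extra freedom that the $H_{-\infty}$-constrained formulation offers over the uniform one; the third is rejected because its smallest mass falls below $1/M$.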

The reason for this new formulation of the soft covering problem is that, when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, the conventional uniform formulation leads to an inevitable discrepancy depending on whether the values $P_{Y}(y)$ are rational or irrational. This discrepancy arises even in the simplest case, when the DMC $W_{Y|X}$ is noiseless, because in this case the uniform formulation attempts to approximate $P_{Y}$ using only rational probabilities that are multiples of $1/M$. Switching to the non-uniform formulation in Definition 10 eliminates this rational-irrational discrepancy, but alters the problem substantially and thereby shifts the error exponent relative to the uniform case. The $H_{-\infty}$-constrained formulation in Definition 11, on the other hand, resolves the discrepancy while still sharing an overlapping error-exponent region with the uniform formulation. A detailed discussion of this point is presented in Section III-B and Section V.

In information theory, proofs of achievability are commonly facilitated by the technique of random coding; accordingly, we also define the random-coding soft covering below. In this definition, the message set $\mathcal{M}$ is uniformly distributed.

Definition 12 (Random-coding soft covering).

An $(R,n,W_{Y|X})_{P_{X}}$ (random-coding) soft-covering scheme consists of a message set $\mathcal{M}=\{1,\dots,M\}$ with $M=:2^{nR}$, where each message is drawn from the uniform distribution, and a random code $\mathcal{C}=\{X^{n}(1),X^{n}(2),\dots,X^{n}(M)\}$, where each codeword $X^{n}(i)$ is generated i.i.d. according to $P_{X}$, i.e., $X^{n}(i)\sim P_{X}^{n}$ for all $i\in\mathcal{M}$. The induced output distribution $\tilde{P}_{Y^{n}|\mathcal{C}}$ is given by (1), and the covering error is defined as the expectation of the total variation between $\tilde{P}_{Y^{n}|\mathcal{C}}$ and $\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}$:

\[\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}:=\frac{1}{2}\mathbb{E}_{\mathcal{C}}\sum_{y^{n}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})\right|,\]

where the expectation $\mathbb{E}_{\mathcal{C}}$ is taken with respect to the i.i.d. $P_{X}$ codeword distribution.

Remark 13.

The notation $\mathcal{C}\sim q$ indicates that $\mathcal{C}$ is the code corresponding to a message set with distribution $q$. If $q(i)=1/M$ is uniform, we omit the notation $\sim q$, as in Definitions 9 and 12 above. This convention will be used throughout the paper: specifically, every occurrence of $\mathcal{C}\sim q$ or $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ refers to the non-uniform formulation (including Definitions 10 and 11), whereas $\tilde{P}_{Y^{n}|\mathcal{C}}$ refers to the uniform formulation (including Definitions 9 and 12), unless otherwise specified.

Next, we define the error exponent (denoted by $E$) and the strong converse exponent (denoted by $\mathit{\Gamma}$) for the above four formulations of the soft covering problem.

Definition 14 (Error exponent and strong converse exponent for soft covering).
  • (a) For uniform soft-covering:

    \begin{align*}
    E_{\mathrm{uni}}(R) &:=\limsup_{n\to\infty}\ \max_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\right)\right],\\
    \mathit{\Gamma}_{\mathrm{uni}}(R) &:=\liminf_{n\to\infty}\ \min_{\mathcal{C}:|\mathcal{C}|=2^{nR}}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\right)\right].
    \end{align*}

  • (b) For non-uniform soft-covering:

    \begin{align*}
    E_{\mathrm{non}}(R) &:=\limsup_{n\to\infty}\ \max_{q\in\mathcal{P}(\{1,\dots,2^{nR}\})}\ \max_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],\\
    \mathit{\Gamma}_{\mathrm{non}}(R) &:=\liminf_{n\to\infty}\ \min_{q\in\mathcal{P}(\{1,\dots,2^{nR}\})}\ \min_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],
    \end{align*}

    where the notation $\mathcal{C}\sim q$ is defined in Remark 13.

  • (c) For $H_{-\infty}$-constrained soft covering:

    \begin{align*}
    E_{\mathrm{ren}}(R) &:=\limsup_{n\to\infty}\ \max_{q\in\mathcal{Q}(R,n)}\ \max_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],\\
    \mathit{\Gamma}_{\mathrm{ren}}(R) &:=\liminf_{n\to\infty}\ \min_{q\in\mathcal{Q}(R,n)}\ \min_{\mathcal{C}:\mathcal{C}\sim q}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\right)\right],
    \end{align*}

    where $\mathcal{Q}(R,n)$ is defined in (5) and the notation $\mathcal{C}\sim q$ is defined in Remark 13.

  • (d) For random-coding soft covering with codewords drawn i.i.d. from $P_{X}$:

    \begin{align*}
    E_{\mathrm{rc}}(R,P_{X}) &:=\limsup_{n\to\infty}\left[-\frac{1}{n}\log\left(\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}\right)\right],\\
    \mathit{\Gamma}_{\mathrm{rc}}(R,P_{X}) &:=\liminf_{n\to\infty}\left[-\frac{1}{n}\log\left(1-\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-\mathbb{E}_{\mathcal{C}}\tilde{P}_{Y^{n}|\mathcal{C}}\right\|_{1}\right)\right].
    \end{align*}

Remark 15.

It is clear that $E_{\mathrm{uni}}(R)\leq E_{\mathrm{ren}}(R)\leq E_{\mathrm{non}}(R)$ and $\mathit{\Gamma}_{\mathrm{uni}}(R)\geq\mathit{\Gamma}_{\mathrm{ren}}(R)\geq\mathit{\Gamma}_{\mathrm{non}}(R)$, since the $H_{-\infty}$-constrained formulation is a special case of the non-uniform formulation, and the uniform formulation is a special case of the $H_{-\infty}$-constrained formulation.

Remark 16.

It is shown in [yagli2019exact, Thm. 1] that

\begin{align*}
E_{\mathrm{rc}}(R,P_{X}) &=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{X}W_{Y|X})+\frac{1}{2}\big|R-D(Q_{XY}\|P_{X}Q_{Y})\big|^{+}\right]\\
&=\max_{\alpha\in[1,2]}\left\{\frac{\alpha-1}{\alpha}\left[R-I_{\alpha}(P_{X};W_{Y|X})\right]\right\},
\end{align*}

where $I_{\alpha}(P_{X};W_{Y|X})$ is the $\alpha$-Rényi mutual information, defined in Definition 7. Consequently, the optimal random coding error exponent is $E_{\mathrm{rc}}(R):=\max_{P_{X}\in\mathcal{S}}E_{\mathrm{rc}}(R,P_{X})$. In particular, by Remark 8, when the DMC $W_{Y|X}$ is noiseless, $E_{\mathrm{rc}}(R)$ reduces to

\begin{align*}
E_{\mathrm{rc}}^{\mathrm{nl}}(R) &=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\frac{1}{2}\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&=\max_{\alpha\in[1,2]}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{align*}
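The noiseless random-coding exponent in Remark 16 is easy to evaluate numerically. Substituting $\rho=(\alpha-1)/\alpha\in[0,\frac{1}{2}]$ and using Definition 4 gives $\frac{\alpha-1}{\alpha}H_{\frac{1}{\alpha}}(P_{Y})=\log\sum_{y}P_{Y}^{1-\rho}(y)$, so the Rényi form becomes $\max_{\rho\in[0,1/2]}\big[\rho R-\log\sum_{y}P_{Y}^{1-\rho}(y)\big]$, which has no singularity at $\alpha=1$. The sketch below evaluates this expression by a plain grid search; the grid size, function names, and the example target are hypothetical, and the grid only approximates the maximum from below.

```python
import math

def shannon_entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def rc_exponent_noiseless(R, P_Y, grid=2001):
    """Grid approximation of max over rho in [0, 1/2] of rho*R - log2(sum_y P_Y(y)^(1-rho))."""
    best = 0.0  # value at rho = 0, i.e. alpha = 1
    for k in range(grid):
        rho = 0.5 * k / (grid - 1)
        val = rho * R - math.log2(sum(p ** (1.0 - rho) for p in P_Y if p > 0))
        best = max(best, val)
    return best

P_Y = [0.3, 0.7]  # hypothetical target distribution
print("H(P_Y) =", shannon_entropy(P_Y))
for R in (0.5, 1.0, 1.5):
    print(R, rc_exponent_noiseless(R, P_Y))
```

The objective is concave in $\rho$ with slope $R-H(P_{Y})$ at $\rho=0$, so the computed exponent is zero for $R\leq H(P_{Y})$ and strictly positive above it.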

III Main Results

This section summarizes the main results of this work, including the strong converse exponent in Section III-A and bounds for the error exponent in Sections III-B and III-C.

III-A Results on the strong converse exponent

To begin with, the following Theorem 17 characterizes the exact strong converse exponents of the soft covering problem in the uniform, non-uniform, and $H_{-\infty}$-constrained formulations, and shows that they are all identical.

Theorem 17 (The exact strong converse exponent).

The exact strong converse exponents for the uniform, non-uniform, and $H_{-\infty}$-constrained soft-covering formulations coincide, and are given by

\[\mathit{\Gamma}_{\mathrm{uni}}(R)=\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}(R):=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right],\]

where

\[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right):=-\log\left[\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\right]. \tag{6}\]

Theorem 17 is proved in Section IV-C. The quantity $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$, defined in (6), involves two parameters, which is uncommon in the literature on error exponents. Similar two-parameter exponents have appeared in the context of lossy source coding, e.g., [blahut1974hypothesis, jitsumatsu2025computation]; however, lossy source coding and soft covering address fundamentally different problems. The former employs a distortion function as a criterion for loss and directly evaluates the probability of error, whereas the latter centers on a forward channel and quantifies the error through a probabilistic divergence. Although the notion of covering is intrinsic to source coding, the two settings are structurally distinct. Typically, when a channel is involved in the problem formulation, the error exponent takes the form of $\max_{\alpha}\frac{1-\alpha}{\alpha}\left(I_{\alpha}-R\right)$, where $I_{\alpha}$ is the Rényi mutual information with respect to the given channel, with different domains of maximization over $\alpha$. This form is observed in both packing [gallager1965simple][sibson1969information][arimoto1973converse][arimoto1977information][augustin1978noisy][csiszar1995generalized][verdu2015alpha][verdu2021error] and covering (when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$) [hayashi2006general][parizi2016exact][yassaee2019almost][yagli2019exact] problems. In this work, we give an achievability bound for random codes that takes exactly this form, as formally stated in Theorem 18.
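For small alphabets, the expression in Theorem 17 can be explored by brute force. The following is a rough sketch under several assumptions that are not part of the paper: the input alphabet is binary so that $Q_{X}$ can be scanned on a one-dimensional grid, the channel has strictly positive entries, $\alpha$ is restricted to $(0,1]$ to avoid the ratio $\beta/\alpha$ at $\alpha=0$, and coarse grids replace the exact optimizations. Because the inner maximum over $Q_{X}$ is only approximated from below, the returned $J_{\alpha,\beta}$ values may slightly overestimate the true ones.

```python
import math

def J(alpha, beta, W, P_Y, q_grid=201):
    """Grid approximation of J_{alpha,beta}(W||P_Y) in (6) for a binary input alphabet."""
    best = 0.0
    for k in range(q_grid):
        q0 = k / (q_grid - 1)
        Q = [q0, 1.0 - q0]
        s = sum(
            P_Y[y] ** (1.0 - beta)
            * sum(Q[x] * W[x][y] ** (beta / alpha) for x in range(2)) ** alpha
            for y in range(len(P_Y))
        )
        best = max(best, s)
    return -math.log2(best)

def strong_converse_exponent(R, W, P_Y, grid=21):
    """Grid approximation of max over 0 < alpha <= 1, 0 <= beta <= alpha of J + (beta - alpha) R."""
    best = -math.inf
    for i in range(1, grid):
        alpha = i / (grid - 1)
        for j in range(i + 1):
            beta = j / (grid - 1)
            best = max(best, J(alpha, beta, W, P_Y) + (beta - alpha) * R)
    return best

# Hypothetical example: BSC(0.1) with uniform target, mutual information about 0.531 bits
W = [[0.9, 0.1], [0.1, 0.9]]
P_Y = [0.5, 0.5]
g_low = strong_converse_exponent(0.2, W, P_Y)
g_mid = strong_converse_exponent(0.4, W, P_Y)
print(g_low, g_mid)
```

Since $\beta-\alpha\leq 0$ on the optimization domain, every term is non-increasing in $R$, and the computed exponent inherits this monotonicity.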

Theorem 18 (A random coding achievability bound).

The strong converse exponent for the random coding formulation has the following upper bound:

\begin{align}
\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})\leq\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}):={}&\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{X}W_{Y|X})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right] \tag{7}\\
={}&\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[I_{\alpha}(P_{X};W_{Y|X})-R\right]\right\}, \tag{8}
\end{align}

where $I_{\alpha}(P_{X};W_{Y|X})$ is the $\alpha$-Rényi mutual information, defined in Definition 7.

See Appendix A for a proof of Theorem 18. On the other hand, a lower bound for $\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})$ is given in [cheng2023error, Thm. 3]:

\[
\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X})\geq\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}):=\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[D_{2-\frac{1}{\alpha}}\left(P_{X}W_{Y|X}\|P_{X}P_{Y}\right)-R\right]\right\},
\]

where $D_{2-\frac{1}{\alpha}}\left(P_{X}W_{Y|X}\|P_{X}P_{Y}\right)$ is the Rényi divergence in Definition 6. Hence, defining

\[
\mathit{\Gamma}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\mathit{\Gamma}_{\mathrm{rc}}(R,P_{X}),
\]

it can be further bounded via

\[
\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)\leq\mathit{\Gamma}_{\mathrm{rc}}(R)\leq\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R) \tag{9}
\]

with

\[
\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}),\quad\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R):=\min_{P_{X}\in\mathcal{S}}\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R,P_{X}).
\]

It is also noteworthy that $\mathit{\Gamma}_{\mathrm{rc}}(R)$ provides an achievability bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$; specifically, $\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\mathit{\Gamma}_{\mathrm{rc}}(R)$. Examples of the exact exponent $\mathit{\Gamma}(R)$ (in Theorem 17) and the aforementioned random coding bounds $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ and $\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ are illustrated in Figure 1, where the matrix entry $W(x,y)$ represents $W_{Y|X}(y|x)$. These examples include channels with fully noisy, fully noiseless, and hybrid input symbols (i.e., a mix of noisy and noiseless inputs). Figure 1 clearly shows that random coding is not tight in general in the converse regime of the soft covering problem: due to (9), the random-coding exponent $\mathit{\Gamma}_{\mathrm{rc}}(R)$ must lie between $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$ and $\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$, while the exact exponent $\mathit{\Gamma}(R)$ is lower and hence tighter.

Figure 1: Examples of the exact strong converse exponent $\mathit{\Gamma}(R)$, the random coding achievability $\overline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$, and the random coding converse $\underline{\mathit{\Gamma}}_{\mathrm{rc}}(R)$. Panels: (a) a fully noisy binary channel; (b) a fully noisy ternary channel; (c) a fully noiseless ternary channel; (d) a hybrid ternary channel.

III-B Results on error exponents for noiseless channels

The previous subsection reveals an intriguing phenomenon: when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, random coding fails to achieve a tight strong converse exponent. A natural question is whether such non-tightness also arises in the error exponent regime $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. In this subsection, we show that the answer is yes when $W_{Y|X}$ is noiseless. Theorems 19, 20, 21, and 22 below summarize our results on error exponents for noiseless channels, where the exponents for the different formulations are defined in Definition 14, with the superscript 'nl' indicating 'noiseless'.

Theorem 19 (Converse for noiseless channels).

For noiseless channels under the uniform formulation, we have the following upper bound on $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$:

\begin{align*}
E_{\mathrm{uni}}^{\mathrm{nl}}(R)\leq\overline{E}^{\mathrm{nl}}(R)&:=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})\\
&=\max_{\alpha\in(-\infty,0)\cup[1,\infty)}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{align*}
Theorem 20 (Achievability for noiseless channels).

For noiseless channels under the uniform formulation, we have the following lower bound on $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$:

\begin{align*}
E_{\mathrm{uni}}^{\mathrm{nl}}(R)\geq\underline{E}^{\mathrm{nl}}(R):={}&\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
={}&\max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\end{align*}
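The agreement between the variational and the Rényi-entropy forms of the achievability bound in Theorem 20 can be checked numerically on a small example. The sketch below is illustrative only: it assumes a binary output alphabet (so the minimization over $Q_{Y}$ is a one-dimensional grid search), logarithms to base 2, the standard definition $H_{a}(P)=\frac{1}{1-a}\log\sum_{y}P^{a}(y)$ of the Rényi entropy, and a finite grid over $\alpha$. All function names and the distribution $(0.3,0.7)$ are hypothetical.

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def renyi_entropy(p, order):
    """Renyi entropy in bits; order 1 falls back to the Shannon entropy."""
    if abs(order - 1.0) < 1e-12:
        return shannon_entropy(p)
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.log2(np.sum(nz ** order)) / (1.0 - order))

def achievability_dual(p, rate, alphas):
    """max over alpha >= 1 of (alpha-1)/alpha * [R - H_{1/alpha}(P_Y)] on a grid."""
    return max((a - 1.0) / a * (rate - renyi_entropy(p, 1.0 / a)) for a in alphas)

def achievability_primal_binary(p, rate, points=20001):
    """min over Q_Y of D(Q||P) + |R - D(Q||P) - H(Q)|^+ for a binary alphabet."""
    q = np.linspace(1e-9, 1.0 - 1e-9, points)
    Q = np.stack([q, 1.0 - q])                      # shape (2, points)
    P = np.asarray(p, dtype=float)[:, None]
    div = np.sum(Q * np.log2(Q / P), axis=0)        # D(Q || P)
    ent = -np.sum(Q * np.log2(Q), axis=0)           # H(Q)
    return float(np.min(div + np.maximum(rate - div - ent, 0.0)))

p = [0.3, 0.7]
alphas = np.linspace(1.0, 50.0, 4901)
rate = 1.0                                          # slightly above H(P_Y)
dual = achievability_dual(p, rate, alphas)
primal = achievability_primal_binary(p, rate)
```

On this example both values agree up to grid error, and the dual form vanishes at $R=H(P_{Y})$, where the maximum is attained at $\alpha=1$.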
Theorem 21 (Error exponent for non-uniform case).

For noiseless channels under the non-uniform formulation, the exact error exponent is given by

\[
E_{\mathrm{non}}^{\mathrm{nl}}(R)=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})=\max_{\alpha\geq 1}\left\{(\alpha-1)\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\}.
\]
Theorem 22 (Error exponent for the $H_{-\infty}$-constrained case).

For noiseless channels under the $H_{-\infty}$-constrained formulation, the exact error exponent is given by $E_{\mathrm{ren}}^{\mathrm{nl}}(R)=\overline{E}^{\mathrm{nl}}(R)$, where $\overline{E}^{\mathrm{nl}}(R)$ is defined in Theorem 19.

Figure 2: Examples of error exponents for noiseless channels under the uniform, non-uniform, and $H_{-\infty}$-constrained formulations. Panels: (a) a binary output; (b) a ternary output.

Theorem 19 and Theorem 20 are proved in Section V-A; Theorem 21 is proved in Section V-C; and Theorem 22 is proved in Section V-D. Figure 2 plots these bounds and exponents, together with the random coding exponent $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ (see Remark 16), for a binary and a ternary target output distribution, respectively. Under the non-uniform and the $H_{-\infty}$-constrained formulations, an exact error exponent is established. Under the uniform formulation, the error exponent $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$ lies between the two bounds $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$. Nevertheless, these two bounds coincide in the vicinity of $R=H(P_{Y})$, as made precise by the following proposition.

Proposition 23.

Let $R_{\alpha}^{s}$ be the value where the function $\overline{E}^{\mathrm{nl}}(\cdot)$ has a tangent line of slope $\alpha$. Then

\begin{align*}
\underline{E}^{\mathrm{nl}}(R)&=\begin{cases}\overline{E}^{\mathrm{nl}}(R)&\text{if }0\leq R\leq R_{1}^{s},\\ \overline{E}^{\mathrm{nl}}(R_{1}^{s})+R-R_{1}^{s}&\text{if }R>R_{1}^{s},\end{cases}\\
E_{\mathrm{rc}}^{\mathrm{nl}}(R)&=\begin{cases}\overline{E}^{\mathrm{nl}}(R)&\text{if }0\leq R\leq R_{1/2}^{s},\\ \overline{E}^{\mathrm{nl}}(R_{1/2}^{s})+\dfrac{1}{2}\left(R-R_{1/2}^{s}\right)&\text{if }R>R_{1/2}^{s}.\end{cases}
\end{align*}

Here, $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ denotes the random-coding error exponent for noiseless channels and is given in Remark 16.

Proposition 23 is proved in Appendix D-b. In summary, for $R\in[H(P_{Y}),R_{1/2}^{s}]$, all three curves ($\overline{E}^{\mathrm{nl}}(R)$, $\underline{E}^{\mathrm{nl}}(R)$, and $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$) overlap, implying that random coding is tight. For $R\in(R_{1/2}^{s},R_{1}^{s}]$, random coding is not tight; nevertheless, $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ still coincide, and the exact exponent is attained by the deterministic code constructed in the proof of Theorem 20. For $R>R_{1}^{s}$, $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ no longer coincide. It is also noteworthy that $\overline{E}^{\mathrm{nl}}(R)$ diverges when $R>H_{-\infty}(P_{Y})$; see Lemma 38.

Assuming uniform messages is conventional in many problems in information theory. Nonetheless, in the soft covering problem, adhering to the uniform formulation exposes an interesting discrepancy between rational and irrational output distributions for noiseless channels. This is because $\tilde{P}_{Y^{n}|\mathcal{C}}$ in (1) can only take rational values, as it is a multiple of $1/M$. If $P_{Y}$ is irrational in some symbols, then approximating $P_{Y}^{n}$ by $\tilde{P}_{Y^{n}|\mathcal{C}}$ amounts to a Diophantine approximation, which necessarily incurs a nonzero error for every $M$, however large, and whose decay rate is governed by results from number theory. We summarize the rational-irrational discrepancy in Theorem 25 and Theorem 26.
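The Diophantine obstruction can be made concrete on a single probability value. The sketch below is only an illustration on one coordinate, not a computation of the total variation for $P_{Y}^{n}$: it takes the hypothetical value $p=\sqrt{2}-1$ (an irrational algebraic number) and measures how close the nearest multiple of $1/M$ can get, scaled by $M^{2}$. The function name `scaled_gap` and the range of $M$ are assumptions made for the example.

```python
import math

def scaled_gap(p, M):
    """M^2 times the distance from p to the nearest multiple of 1/M."""
    K = round(p * M)
    return M * M * abs(K / M - p)

p_irrational = math.sqrt(2.0) - 1.0
worst_irrational = min(scaled_gap(p_irrational, M) for M in range(1, 2001))

# For contrast, a rational value is hit exactly once M is a multiple of its denominator.
p_rational = 1.0 / 3.0
worst_rational = min(scaled_gap(p_rational, M) for M in range(1, 2001))
```

For this particular $p$ the scaled gap stays bounded away from zero over the tested range, so the approximation error cannot decay faster than order $M^{-2}$, whereas the rational value is matched with zero error.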

Definition 24.

Given a distribution $P_{Y}\in\mathcal{P}(\mathcal{Y})$, we say that $P_{Y}$ is rational, denoted $P_{Y}\in\mathbb{Q}^{|\mathcal{Y}|}$, if $P_{Y}(y)\in\mathbb{Q}$ for all $y\in\mathcal{Y}$. Otherwise, we say that $P_{Y}$ is irrational, denoted $P_{Y}\notin\mathbb{Q}^{|\mathcal{Y}|}$.

Theorem 25 (Linear converse for irrational $P_{Y}$).

Suppose $W_{Y|X}$ is noiseless. Then for every $\epsilon>0$ and almost every $P_{Y}\notin\mathbb{Q}^{|\mathcal{Y}|}$ (in the sense of Lebesgue measure on $[0,1]^{|\mathcal{Y}|}$), we have $E_{\mathrm{uni}}^{\mathrm{nl}}(R)\leq(2+\epsilon)R$.

Theorem 26 (Infinite achievability for rational $P_{Y}$).

Suppose $W_{Y|X}$ is noiseless and $P_{Y}\in\mathbb{Q}^{|\mathcal{Y}|}$, say, $P_{Y}(y)=\frac{A_{y}}{B_{y}}$ with coprime $A_{y},B_{y}\in\mathbb{Z}$ for $y\in\mathcal{Y}$. Then $E_{\mathrm{uni}}^{\mathrm{nl}}(R)=\infty$ when $R\geq\log(\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}}))$, where $\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}})$ denotes the least common multiple of the $B_{y}$'s over all $y\in\mathcal{Y}$.
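The rate threshold in Theorem 26 is straightforward to compute. The following sketch, with the hypothetical helper `perfect_covering_threshold` and a made-up ternary distribution, returns the least common multiple $L$ of the denominators, the threshold $\log_{2}L$ (base 2 is an assumption), and the integer counts $L\cdot P_{Y}(y)$. One natural reading of these counts, not taken from the proof in Section V-B, is that at blocklength one, $L$ equiprobable messages with $L\cdot P_{Y}(y)$ of them mapped to symbol $y$ reproduce $P_{Y}$ exactly.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, log2

def lcm(a, b):
    return a * b // gcd(a, b)

def perfect_covering_threshold(p_y):
    """Return (L, log2 L, counts) with L the lcm of the reduced denominators."""
    fracs = [Fraction(p) for p in p_y]           # Fraction keeps A_y / B_y in lowest terms
    L = reduce(lcm, (f.denominator for f in fracs), 1)
    counts = [int(f * L) for f in fracs]         # number of messages assigned to each y
    return L, log2(L), counts

L, rate, counts = perfect_covering_threshold(
    [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
)
```

Here the denominators 2, 3, and 6 give $L=6$, so the threshold is $\log_{2}6\approx 2.585$ bits.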

Theorem 25 and Theorem 26 are proved in Section V-B. Evidently, the discrepancy concerns whether an infinite exponent, corresponding to perfect covering with no error, is achievable. For rational $P_{Y}$, this is achievable at high rates, whereas for most irrational $P_{Y}$, it is not. It is noteworthy that the linear converse in Theorem 25 applies to almost every irrational probability, and such probabilities are dense in $[0,1]$. However, it may not be possible to claim a universal converse for an arbitrary irrational $P_{Y}$. Let $\widebar{y}\in\mathcal{Y}$ be an irrational symbol, i.e., $P_{Y}(\widebar{y})\notin\mathbb{Q}$. If $P_{Y}(\widebar{y})$ is an irrational algebraic number, then the linear converse of $(2+\epsilon)R$ holds, by virtue of the Thue-Siegel-Roth theorem [queffelec2013diophantine, Thm. 3.1.4]. On the other hand, $P_{Y}(\widebar{y})$ can also be a 'good' irrational number, known as a Liouville number [queffelec2013diophantine, Def. 3.1.8], which admits infinitely many integer pairs $K,M\in\mathbb{Z}$ such that $\left|K/M-P_{Y}(\widebar{y})\right|\leq M^{-(2+\epsilon)}$; consequently, it is not clear how to establish a linear converse in this case.

To eliminate this discrepancy, $\tilde{P}_{Y^{n}|\mathcal{C}}$ must also be allowed to take irrational values; hence, messages cannot be uniformly distributed. Switching from the uniform to the non-uniform formulation indeed removes the discrepancy and yields an exact exponent. However, the error exponent $E_{\mathrm{non}}^{\mathrm{nl}}(R)$ under the non-uniform formulation deviates from both $\overline{E}^{\mathrm{nl}}(R)$ and $\underline{E}^{\mathrm{nl}}(R)$ for all $R$, including in the low-rate regime near $H(P_{Y})$. In fact, for noiseless channels, soft covering under the non-uniform formulation is equivalent to lossless source coding (see Lemma 43 in Section V-C). In order to remove the rational-irrational discrepancy without substantially altering the character of soft covering, we propose the $H_{-\infty}$-constrained formulation in Definition 11. Its exponent $E_{\mathrm{ren}}^{\mathrm{nl}}(R)$ is exact and is identical to $\overline{E}^{\mathrm{nl}}(R)$ by Theorem 22. Thus, in the neighborhood of $H(P_{Y})$, this new formulation aligns with the uniform formulation.

III-C Results on error exponents for noisy channels

More generally, in this subsection, we consider noisy channels. Note that the achievability bound $\underline{E}^{\mathrm{nl}}(R)$ on $E_{\mathrm{uni}}^{\mathrm{nl}}(R)$ in Theorem 20 becomes a straight line of slope 1 for large $R$. Consequently, at high rates, this noiseless achievability can exceed the random-coding exponent $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$, whose slope is $1/2$, which leads to a high-rate improvement over random coding under the uniform formulation for noisy channels as well. This noisy achievability is summarized in Theorem 27. Furthermore, we provide a general converse bound for noisy channels under the non-uniform formulation in Theorem 28.

Theorem 27 (High-rate achievability).

For noisy channels under the uniform formulation, we have the following lower bound on $E_{\mathrm{uni}}(R)$:

\begin{align*}
E_{\mathrm{uni}}(R)\geq\underline{E}(R)&:=\max_{P_{X}\in\mathcal{S}}\ \min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\left\{D(Q_{X}\|P_{X})+\big|R-D(Q_{X}\|P_{X})-H(Q_{X})\big|^{+}\right\}\\
&=\max_{P_{X}\in\mathcal{S}}\ \max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{X})\right]\right\}.
\end{align*}
Theorem 28 (Converse).

For noisy channels under the non-uniform formulation, we have the following upper bound on $E_{\mathrm{non}}(R)$:

\begin{align*}
E_{\mathrm{non}}(R)\leq\overline{E}(R)&:=\max_{P_{X}\in\mathcal{S}}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X})\\
&=\max_{P_{X}\in\mathcal{S}}\ \max_{\alpha\geq 0}\left[\alpha\left(R-\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})\right)\right].
\end{align*}
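The Rényi-divergence form of the converse bound in Theorem 28 can be evaluated on a grid for a single fixed input distribution. The sketch below makes several assumptions that are not stated in this section: it reads $\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})$ as $\sum_{x}P_{X}(x)D_{1+\alpha}(W_{Y|X}(\cdot|x)\|P_{Y})$, uses the standard Rényi divergence with base-2 logarithms, takes $P_{Y}$ to be the output induced by $P_{X}$, requires strictly positive channel entries, and omits the outer maximization over $\mathcal{S}$. The names `renyi_divergence` and `converse_dual` are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def renyi_divergence(p, q, order):
    """Renyi divergence in bits for order > 0; order 1 is the KL divergence."""
    if abs(order - 1.0) < 1e-12:
        return kl_divergence(p, q)
    return float(np.log2(np.sum(p ** order * q ** (1.0 - order))) / (order - 1.0))

def converse_dual(rate, P_X, W, alphas):
    """max over alpha >= 0 of alpha * (R - E_{P_X} D_{1+alpha}(W || P_Y)) on a grid."""
    P_Y = P_X @ W                                   # output induced by P_X
    def term(a):
        avg = sum(P_X[x] * renyi_divergence(W[x], P_Y, 1.0 + a)
                  for x in range(len(P_X)))
        return a * (rate - avg)
    return max(term(a) for a in alphas)

W = np.array([[0.9, 0.1], [0.1, 0.9]])              # hypothetical BSC(0.1)
P_X = np.array([0.5, 0.5])
alphas = np.linspace(0.0, 20.0, 2001)
mutual_information = sum(P_X[x] * kl_divergence(W[x], P_X @ W) for x in range(2))
at_capacity = converse_dual(mutual_information, P_X, W, alphas)
above = converse_dual(0.8, P_X, W, alphas)
```

Since the Rényi divergence is non-decreasing in its order, the expression vanishes at $R=I(P_{X};W)$ and becomes positive above it, which the two evaluations illustrate.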

Theorem 27 is proved in Section VI-A and Theorem 28 is proved in Section VI-B. Figure 3 shows examples of these two bounds as well as the random coding error exponent $E_{\mathrm{rc}}(R)$ in Remark 16. Combining Theorem 26 and Theorem 27, we immediately obtain Corollary 29 below, which shows that an infinite error exponent is achievable for noisy channels, provided that some input distribution $P_{X}\in\mathcal{S}$ is rational and the rate is sufficiently large.

Corollary 29 (Infinite achievability for noisy channels).

Suppose some $P_{X}\in\mathcal{S}$ is rational, say, $P_{X}(x)=\frac{A_{x}}{B_{x}}$ with coprime $A_{x},B_{x}\in\mathbb{Z}$ for $x\in\mathcal{X}$. Then $E_{\mathrm{uni}}(R)=\infty$ when $R\geq\log(\mathrm{lcm}(\{B_{x}\}_{x\in\mathcal{X}}))$.

Figure 3: Examples of the converse bound $\overline{E}(R)$, the achievability bound $\underline{E}(R)$, and the random-coding error exponent $E_{\mathrm{rc}}(R)$ for noisy channels. Panels: (a) a fully noisy binary channel; (b) a fully noisy ternary channel.

IV Strong Converse Exponent

In this section, we prove Theorem 17. The proof is sketched as follows. In Section IV-A, we establish a lower bound for $\mathit{\Gamma}_{\mathrm{non}}(R)$, serving as a converse result, while Section IV-B presents an achievability result that provides an upper bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$. Both bounds are expressed in variational form, as optimizations over joint distributions. We convert them into dual representations in Section IV-C and show that they coincide when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. Since $\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)$, this coincidence characterizes the exact exponent for both the uniform and the non-uniform formulations. Additionally, in Appendix C, we provide an alternative proof of the converse part of Theorem 17, for the uniform formulation only, following Arimoto's techniques in [arimoto1973converse].

IV-A A lower bound for the strong converse exponent

In the following, we show a lower bound for $\mathit{\Gamma}_{\mathrm{non}}(R)$ under the non-uniform formulation, denoted by $\underline{\mathit{\Gamma}}(R)$.

Proposition 30 (Converse).

The strong converse exponent $\mathit{\Gamma}_{\mathrm{non}}(R)$ in the non-uniform formulation can be bounded from below as follows:

\[
\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R):=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max\big\{s,\mathit{\Gamma}(s,Q_{X},R)\big\},
\]

where

\[
\mathit{\Gamma}(s,Q_{X},R):=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]. \tag{10}
\]
Proof.

We take a hypothesis testing perspective and apply the following well-known property of the total variation (e.g., [moser2019advanced, Def. 2.24]) in terms of decision regions:

\[
\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\left|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|\quad\forall\mathcal{A}\subseteq\mathcal{Y}^{n}. \tag{11}
\]
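Property (11) holds for any pair of distributions and any decision region, and it is tight for the region on which the first distribution exceeds the second. The sketch below checks both facts exhaustively on a small alphabet; the two distributions are randomly generated stand-ins (hypothetical, not the induced and target distributions of the paper), and the alphabet size 6 is arbitrary.

```python
import numpy as np

def total_variation(p, q):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * float(np.sum(np.abs(p - q)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))
tv = total_variation(p, q)

# Enumerate every subset A of the 6 outcomes, encoded as a bitmask.
gaps = []
for mask in range(1 << 6):
    A = np.array([(mask >> k) & 1 for k in range(6)], dtype=bool)
    gaps.append(abs(float(p[A].sum() - q[A].sum())))

# The region {p > q} attains the bound with equality.
best_region = p > q
best_gap = float(p[best_region].sum() - q[best_region].sum())
```

In the proof, the region $\mathcal{A}$ is not the optimal one but a union of conditional type classes, chosen so that both probabilities admit explicit exponential bounds.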

We take a collection of conditional types around each codeword to create a set $\mathcal{A}$ that can be used to discriminate $\tilde{P}_{Y^{n}|\mathcal{C}}$ from $P_{Y}^{n}$, as follows:

\[
\mathcal{A}=\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i)).
\]

That is, we restrict to the output type classes whose types are of the form $Q_{Y}=Q_{X}V$, where $Q_{X}$ is the type of some codeword and $V\in\mathcal{V}_{n}(Q_{X})$. Here $\mathcal{V}_{n}(Q_{X})$ can be any subset of $\mathcal{P}_{n}(\mathcal{Y}|Q_{X})$; later, we will choose an appropriate subset to yield the optimal bound. When a codeword $X^{n}(i)$ is fixed, so is its type, and the $V$-shells $\mathcal{T}_{V}(X^{n}(i))$ for different $V$ must be disjoint. We therefore write the disjoint union $\bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}$ in $\mathcal{A}$.
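The $V$-shells and the standard type bounds that the proof relies on can be verified by brute force at a small blocklength. The sketch below is illustrative only: it uses a hypothetical binary channel and a single fixed codeword of length 6, partitions all output sequences into $V$-shells, and checks the bound $W_{Y|X}^{n}(\mathcal{T}_{V}(x^{n})|x^{n})\leq 2^{-nD(V\|W|Q_{X})}$ together with the polynomial bound on the number of shells.

```python
import itertools
import math
from collections import defaultdict

W = {0: (0.9, 0.1), 1: (0.2, 0.8)}   # hypothetical binary channel W(y|x)
x = (0, 0, 0, 1, 1, 1)               # one fixed codeword, n = 6
n = len(x)

# Group every output sequence by its conditional type, stored as joint counts N(a, b).
shell_prob = defaultdict(float)
for y in itertools.product((0, 1), repeat=n):
    counts = [[0, 0], [0, 0]]
    prob = 1.0
    for a, b in zip(x, y):
        counts[a][b] += 1
        prob *= W[a][b]
    shell_prob[tuple(map(tuple, counts))] += prob

def conditional_divergence(counts):
    """n * D(V || W | Q_X) in bits, computed directly from the joint counts."""
    total = 0.0
    for a in (0, 1):
        row = sum(counts[a])
        for b in (0, 1):
            if counts[a][b] > 0:
                total += counts[a][b] * math.log2(counts[a][b] / row / W[a][b])
    return total

# Shells whose probability exceeds the type bound (there should be none).
violations = [c for c, pr in shell_prob.items()
              if pr > 2.0 ** (-conditional_divergence(c)) + 1e-12]
```

Each output sequence falls into exactly one shell, so the shell probabilities sum to one, which is the partition property behind the disjoint-union notation.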

Consider a codeword $X^{n}(i)$ whose type is $Q_{X}$, i.e., $X^{n}(i)\in\mathcal{T}_{Q_{X}}$ for some $Q_{X}\in\mathcal{P}_{n}(\mathcal{X})$. We have

\begin{align*}
W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))&=W_{Y|X}^{n}\left(\bigcup_{Q_{X}^{\prime}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{j:X^{n}(j)\in\mathcal{T}_{Q_{X}^{\prime}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X}^{\prime})}\mathcal{T}_{V}(X^{n}(j))\bigg|X^{n}(i)\right)\\
&\overset{a}{\geq}W_{Y|X}^{n}\left(\bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\\
&\overset{b}{=}1-W_{Y|X}^{n}\left(\bigsqcup_{V\notin\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\\
&\overset{c}{\geq}1-\sum_{V\notin\mathcal{V}_{n}(Q_{X})}2^{-nD(V\|W|Q_{X})}\\
&\overset{d}{\geq}1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\max_{V\notin\mathcal{V}_{n}(Q_{X})}2^{-nD(V\|W|Q_{X})}\\
&=1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\},
\end{align*}

where in $(a)$ we restrict to the subset in which $j=i$; $(b)$ follows from the fact that $\bigsqcup_{V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X})}\mathcal{T}_{V}(X^{n}(i))=\mathcal{Y}^{n}$; and $(c)$ and $(d)$ follow from properties of types. By averaging over all codewords, we obtain that

\begin{align*}
\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})&=\sum_{i=1}^{M}q(i)\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq 1-(n+1)^{|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&\overset{a}{\geq}1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&\overset{b}{\geq}1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\exp_{2}\left\{-n\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}\\
&=1-(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\exp_{2}\left\{-n\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\},
\end{align*}

where $(a)$ follows from a property of types and $(b)$ from $\sum_{i=1}^{M}q(i)\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\leq 1$. On the other hand,

\begin{align*}
P_{Y}^{n}(\mathcal{A})&\overset{a}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\mathcal{A}\cap\mathcal{T}_{Q_{Y}}\right|\ 2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\left(\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\right)\bigcap\mathcal{T}_{Q_{Y}}\right|2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left|\bigcup_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\ \bigsqcup_{V\in\mathcal{V}_{n}(Q_{X})}\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{Y}}\right|2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}\\
&\overset{b}{\leq}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\left|\bigcup_{i:X^{n}(i)\in\mathcal{T}_{Q_{X}}}\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{X}V}\right|2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&\leq\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\min\left\{\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}\big|\mathcal{T}_{V}(X^{n}(i))\big|,\big|\mathcal{T}_{Q_{X}V}\big|\right\}2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&\overset{c}{\leq}\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}\min\left\{M\cdot 2^{nH(V|Q_{X})},2^{nH(Q_{X}V)}\right\}2^{-n[D(Q_{X}V\|P_{Y})+H(Q_{X}V)]}\\
&=\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{V\in\mathcal{V}_{n}(Q_{X})}2^{-n\left[D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}\right]}\\
&\overset{d}{\leq}(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \max_{V\in\mathcal{V}_{n}(Q_{X})}2^{-n\left[D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}\right]}\\
&=(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}\exp_{2}\left\{-n\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\},
\end{align*}

where $(a)$ follows from a property of types; in $(b)$, the summation $\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}$ disappears because the non-emptiness of $\mathcal{T}_{V}(X^{n}(i))\cap\mathcal{T}_{Q_{Y}}$ implies that $Q_{Y}$ can only take the unique value $Q_{Y}=Q_{X}V$; and $(c)$ and $(d)$ follow from properties of types. Inserting the above expressions into (11), we obtain that

\[
\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\geq 1-2(n+1)^{|\mathcal{X}|(|\mathcal{Y}|+1)}2^{-n\underline{\mathit{\Gamma}}(R,n)},
\]

where the exponent is

\begin{align*}
\underline{\mathit{\Gamma}}(R,n)&=\min\left\{\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X}),\ \min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}\\
&=\min_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\min\left\{\min_{V\notin\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X}),\ \min_{V\in\mathcal{V}_{n}(Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}.
\end{align*}
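The step in the bound on $P_{Y}^{n}(\mathcal{A})$ that turns the minimum of two cardinality bounds into the term $|I(Q_{X};V)-R|^{+}$ can be spelled out as follows. This is a sketch that assumes $M=2^{nR}$, as the appearance of $R$ in that step suggests, and uses the identity $I(Q_{X};V)=H(Q_{X}V)-H(V|Q_{X})$.

```latex
\begin{align*}
\min\left\{M\cdot 2^{nH(V|Q_{X})},\,2^{nH(Q_{X}V)}\right\}2^{-nH(Q_{X}V)}
&=\min\left\{2^{-n\left[H(Q_{X}V)-H(V|Q_{X})-R\right]},\,1\right\}\\
&=\min\left\{2^{-n\left[I(Q_{X};V)-R\right]},\,1\right\}\\
&=2^{-n\left|I(Q_{X};V)-R\right|^{+}}.
\end{align*}
```

Multiplying by the remaining factor $2^{-nD(Q_{X}V\|P_{Y})}$ gives the summand $2^{-n\left[D(Q_{X}V\|P_{Y})+|I(Q_{X};V)-R|^{+}\right]}$.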

It remains to choose a suitable $\mathcal{V}_{n}(Q_{X})$. We choose it to be the relative-entropy ball centered at $W$ with radius $s$:

\[
\mathcal{V}_{n}(Q_{X})=\mathcal{V}_{n}(s,Q_{X}):=\left\{V\in\mathcal{P}_{n}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s\right\}
\]

parameterized by $s\geq 0$. As $n\to\infty$, $\mathcal{P}_{n}(\mathcal{X})$ becomes dense in $\mathcal{P}(\mathcal{X})$ and $\mathcal{V}_{n}(Q_{X})$ reduces to

\[
\mathcal{V}(Q_{X})=\mathcal{V}(s,Q_{X}):=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s\right\}.
\]

Consequently, the exponent $\underline{\mathit{\Gamma}}(R,n)$ converges as follows:

\begin{align}
\liminf_{n\to\infty}\underline{\mathit{\Gamma}}(R,n)&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \min_{V\in\mathcal{V}(s,Q_{X})}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right]\right\}\notag\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}, \tag{12}
\end{align}

where $\mathit{\Gamma}(s,Q_{X},R)$ is defined in (10). Note that (12) is also parameterized by $s\geq 0$. We can further write

\[
\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R):=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}. \tag{13}
\]

The reason for taking $\sup_{s\in[0,\infty)}$ in (13) is that the choice of $\mathcal{V}(Q_{X})$ is free: in principle, we can select any non-negative $s$ and use the corresponding $\mathcal{V}(s,Q_{X})$ as our $\mathcal{V}(Q_{X})$. Hence, for each $Q_{X}$ we may pick the value of $s$ that yields the largest (and thus tightest) lower bound $\underline{\mathit{\Gamma}}(R)$.

Furthermore, in (13), it might seem natural to write $\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X})=s$, since $\mathcal{V}(s,Q_{X})$ describes a relative entropy ball. However, this is true only if $V$ can range over conditional distributions such that $V\ll W$ and $V\neq W$ under $Q_{X}$. To elaborate, our discussion and the derived bounds should be applicable to all channels. In particular, the channel $W_{Y|X}$ may contain hybrid input symbols, meaning that $W_{Y|X}(y|x)=1$ for some $x,y$ and $W_{Y|X}(y|x)<1$ for others. If $Q_{X}$ is supported only on the noiseless input symbols, then $W$ is deterministic on $\mathrm{supp}(Q_{X})$ and hence, for any finite $s\geq 0$,

\[
D(V\|W|Q_{X})=\begin{cases}\infty&\text{if }D(V\|W|Q_{X})>s,\\ 0&\text{if }D(V\|W|Q_{X})\leq s.\end{cases}
\]

The case $D(V\|W|Q_{X})=0$ implies that $V|_{\mathrm{supp}(Q_{X})}=W$, i.e., $V$ is also noiseless when restricted to $x\in\mathrm{supp}(Q_{X})$. As a result, $\mathit{\Gamma}(s,Q_{X},R)=D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}$, which is a constant independent of $s$, as illustrated in Figure 4(a). Then in (13) we have

\begin{align*}
\sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}={}&\sup_{s\in[0,\infty)}\min\left\{\infty,D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\right\}\\
={}&D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\\
\overset{a}{=}{}&\sup_{s\in[0,\infty)}\min\left\{s,D(Q_{X}W\|P_{Y})+\big|H(Q_{X}W)-R\big|^{+}\right\}\\
={}&\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\},
\end{align*}

where $(a)$ can be observed from Figure 4(a). Interestingly, even though $\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X})=\infty\neq s$, in (13) it is still valid to write

\[
\sup_{s\in[0,\infty)}\min\left\{\inf_{V\notin\mathcal{V}(s,Q_{X})}D(V\|W|Q_{X}),\ \mathit{\Gamma}(s,Q_{X},R)\right\}=\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}. \tag{14}
\]

If $Q_{X}$ has support on at least one noisy input symbol, $D(\cdot\|W|Q_{X})$ can take a continuum of values. In this case, (14) holds directly. According to Lemma 48 in Appendix D-a, $\mathit{\Gamma}(s,Q_{X},R)$ is non-increasing in $s$: it decreases over a certain interval and then remains constant. Therefore, $\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}$ is attained at a unique fixed point $s^{*}=\mathit{\Gamma}(s^{*},Q_{X},R)$, i.e., the intersection point of the curve $\mathit{\Gamma}(s,Q_{X},R)$ and the straight line $s$. Such an intersection can occur either in the decreasing regime of $\mathit{\Gamma}(s,Q_{X},R)$ (Figure 4(b)) or in the constant regime (Figure 4(c)).

To summarize, for any $Q_{X}$ supported on $\mathcal{X}$, (14) is always true. Furthermore, since (14) is achieved at the intersection point $s^{*}$, we can equivalently write

\[
\sup_{s\in[0,\infty)}\min\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\}=\inf_{s\in[0,\infty)}\max\left\{s,\mathit{\Gamma}(s,Q_{X},R)\right\} \tag{15}
\]

as can be seen from Figure 4. The reason for doing so is that the right-hand side of (15) facilitates a convenient derivation from variational forms to dual forms in Section IV-C. Substituting (14) and (15) into (13) completes the proof. ∎
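The exchange in (15) relies only on $\mathit{\Gamma}(s,Q_{X},R)$ being continuous and non-increasing in $s$, so that it crosses the line $s$ exactly once. The sketch below illustrates this on a toy function `g`, a hypothetical stand-in with a decreasing regime followed by a constant regime; it is not the actual $\mathit{\Gamma}(s,Q_{X},R)$ of any channel.

```python
import numpy as np

def g(s):
    """Toy stand-in for Gamma(s, Q_X, R): decreasing on [0, 1], constant afterwards."""
    return np.maximum(1.5 - s, 0.5)

s = np.linspace(0.0, 3.0, 3001)
sup_min = float(np.max(np.minimum(s, g(s))))   # left-hand side of (15) on the grid
inf_max = float(np.min(np.maximum(s, g(s))))   # right-hand side of (15) on the grid
```

For this toy function the crossing lies in the decreasing regime, at $s^{*}=0.75$, and both grid values agree with it up to the grid spacing.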

Figure 4: Three possible ways in which $\mathit{\Gamma}(s,Q_{X},R)$ intersects $s$. Panels: (a) $\mathrm{supp}(Q_{X})\subseteq\mathcal{X}_{l}$; (b) $\mathrm{supp}(Q_{X})\cap\mathcal{X}_{o}\neq\emptyset$; (c) $\mathrm{supp}(Q_{X})\cap\mathcal{X}_{o}\neq\emptyset$. Here, $\mathcal{X}_{l}$ denotes the set of noiseless input symbols, and $\mathcal{X}_{o}$ denotes the set of noisy input symbols.

The idea behind the proof of Proposition 30 is closely related to hypothesis testing. We are in fact looking for a decision region $\mathcal{A}$ such that $\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A}^{c})\leq 2^{-ns}$ and $P_{Y}^{n}(\mathcal{A})\leq 2^{-n\mathit{\Gamma}(s)}$ for some exponent $\mathit{\Gamma}(s)$. For the optimal region, the average probability of testing error is $\frac{1}{2}\big[\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A}^{c})+P_{Y}^{n}(\mathcal{A})\big]=\frac{1}{2}-\frac{1}{4}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}$. We then optimize over $s$ to find the optimal decision region. From the expression in Proposition 30, the following properties of $\underline{\mathit{\Gamma}}(R)$ can be established; they are proved in Appendix D-c.

Lemma 31.

We have the following properties of $\underline{\mathit{\Gamma}}(R)$:

(i) $\underline{\mathit{\Gamma}}(R)=0$ when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

(ii) $\underline{\mathit{\Gamma}}(R)>0$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

IV-B An achievability bound for the strong converse exponent

In this subsection, we show an upper bound for $\mathit{\Gamma}_{\mathrm{uni}}(R)$ (the uniform formulation), denoted by $\overline{\mathit{\Gamma}}(R)$. We develop a novel deterministic code construction technique based on the type covering lemma. The key observation is that, since we are operating below $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, we cannot afford codeword repetitions, while random coding produces repeated codewords, albeit rarely. In contrast, our construction covers only joint types with mutual information less than $R$, and covers each of them only once.

Proposition 32 (Achievability).

The strong converse exponent $\mathit{\Gamma}_{\mathrm{uni}}(R)$ in the uniform formulation can be bounded from above as follows:

\begin{equation*}
\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\overline{\mathit{\Gamma}}(R):=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R}\left[D(Q_{Y}\|P_{Y})+\big|R-\iota(Q_{XY})\big|^{+}\right],
\end{equation*}

where $\iota(Q_{XY})$ is defined in Definition 2.

Proof.

Consider any joint type $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$. Following our notation convention, we introduce the backward conditional type $\widebar{V}_{X|Y}=Q_{XY}/Q_{Y}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})$. Take any $y^{n}\in\mathcal{T}_{Q_{Y}}$. Then (1) yields

\begin{align*}
\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}) &=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\ \sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{V}}(y^{n})\}\ W_{Y|X}^{n}(y^{n}|X^{n}(i))\\
&\overset{a}{=}\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{-n\left[D(V\|W|Q_{X})+H(V|Q_{X})\right]},
\end{align*}

where $(a)$ follows from the properties of types. Furthermore, in $(a)$ we have defined

\begin{equation*}
k_{\widebar{V}}(y^{n}):=\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{V}}(y^{n})\}, \tag{16}
\end{equation*}

which is the number of codewords that lie in the $\widebar{V}$-shell of $y^{n}$. In other words, $k_{\widebar{V}}(y^{n})$ counts the number of codewords that have joint type $Q_{XY}$ with $y^{n}$. Since $P_{Y}^{n}(y^{n})=2^{-n[D(Q_{Y}\|P_{Y})+H(Q_{Y})]}$ by the properties of types, we can evaluate the following ratio:

\begin{align*}
\frac{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})}{P_{Y}^{n}(y^{n})} &=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})-D(V\|W|Q_{X})-H(V|Q_{X})\right]}\\
&=\frac{1}{M}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[D(Q_{Y}\|P_{Y})+I(Q_{X};V)-D(V\|W|Q_{X})\right]}\\
&=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}, \tag{17}
\end{align*}

where the last equality follows from Lemma 3.

Using the identity $|a-b|=a+b-2\min\{a,b\}$, the total variation satisfies the following property:

\begin{equation*}
1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}=\sum_{y^{n}}\min\left\{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}),P_{Y}^{n}(y^{n})\right\}. \tag{18}
\end{equation*}

Combining this and (17), we obtain that

\begin{align*}
1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}P_{Y}^{n}(y^{n})\min\left\{\frac{\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})}{P_{Y}^{n}(y^{n})},1\right\}\\
&\overset{a}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},1\right\}\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\frac{1}{|\mathcal{P}_{n}(\mathcal{X}|Q_{Y})|}\right\}\\
&\overset{b}{\geq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\min\left\{k_{\widebar{V}}(y^{n})2^{n\left[\iota(Q_{XY})-R\right]},\frac{1}{|\mathcal{P}_{n}(\mathcal{X}|Q_{Y})|}\right\}\\
&\overset{c}{\geq}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]},(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\right\}\\
&\overset{d}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\min\left\{k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]},1\right\}, \tag{19}
\end{align*}

where $(a)$ follows from the properties of types and (17); $(b)$ follows from the fact that the minimum of two sums is greater than or equal to the sum of the termwise minima, i.e., $\min\{\sum_{i}a_{i},\sum_{i}b_{i}\}\geq\sum_{i}\min\{a_{i},b_{i}\}$ (equivalent to the triangle inequality); and $(c)$ follows from the properties of types. In $(d)$, we artificially multiply $k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}$ by an extra factor $(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\leq 1$ and then pull this factor out of the minimum. We can do so because $k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}$ is exponential in $n$, so comparing it with 1 or with any polynomial factor does not affect the exponential order.
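The identity (18), together with the hypothesis-testing interpretation mentioned before Lemma 31, can be checked on a small example. The sketch below uses exact rational arithmetic; the two distributions `p_tilde` and `p` are arbitrary illustrative choices rather than distributions induced by any code, and the decision region $\mathcal{A}=\{y:\tilde{P}(y)>P(y)\}$ is the assumed optimal test.

```python
from fractions import Fraction

def total_variation(p, q):
    # (1/2) * ||p - q||_1
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def overlap(p, q):
    # Right-hand side of (18): sum over y of min{p(y), q(y)}.
    return sum(min(a, b) for a, b in zip(p, q))

def average_test_error(p_tilde, p):
    # Decision region A = {y : p_tilde(y) > p(y)} (assumed optimal test).
    miss = sum(a for a, b in zip(p_tilde, p) if a <= b)        # p_tilde(A^c)
    false_alarm = sum(b for a, b in zip(p_tilde, p) if a > b)  # p(A)
    return (miss + false_alarm) / 2

# Toy distributions on a three-letter alphabet (illustrative values only).
p_tilde = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
p = [Fraction(1, 3)] * 3

print(1 - total_variation(p_tilde, p), overlap(p_tilde, p))
print(average_test_error(p_tilde, p), Fraction(1, 2) - total_variation(p_tilde, p) / 2)
```

Each printed pair coincides: the first line is (18), and the second is the relation $\frac{1}{2}-\frac{1}{4}\|\cdot\|_{1}$ for the average testing error.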

To design a good code, we need to maximize (19), for which it suffices to make $k_{\widebar{V}}(y^{n})$ large (or at least nonzero) for all joint types. However, due to the low rate and hence the limited number of codewords, this can only be achieved for certain joint types, and even for those, a large $k_{\widebar{V}}(y^{n})$ would be too greedy. A more reasonable approach is to set $k_{\widebar{V}}(y^{n})=1$, meaning that a single codeword covers $y^{n}$. The type covering lemma guarantees the existence of such a covering for all $y^{n}$, provided that the mutual information is less than the code rate.

Based on the above analysis, we construct our code $\mathcal{C}$ in the following way. Take any $\epsilon>0$ and examine all joint types $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$: if the mutual information of a joint type satisfies $I(Q_{X};V)\leq R-2\epsilon$, then according to the type covering lemma (e.g., [moser2019advanced, Lem. 3.34]), there exists a covering code with $2^{n[I(Q_{X};V)+\epsilon]}$ codewords in $\mathcal{T}_{Q_{X}}$ that ensures $k_{\widebar{V}}(y^{n})\geq 1$ for all the corresponding output sequences $y^{n}\in\mathcal{T}_{Q_{Y}}$. We collect all codewords in this covering code into our code $\mathcal{C}$. It is noteworthy that such a collection may contain repeated codewords. Under this strategy, the number of codewords we have assigned so far is

\begin{align*}
|\mathcal{C}_{\text{eff}}| &=\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{n[I(Q_{X};V)+\epsilon]}\\
&\overset{a}{\leq}(n+1)^{|\mathcal{X}||\mathcal{Y}|}\max_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{n[I(Q_{X};V)+\epsilon]}\\
&\leq(n+1)^{|\mathcal{X}||\mathcal{Y}|}2^{n(R-\epsilon)}\leq 2^{nR}
\end{align*}

for sufficiently large $n$, where $(a)$ follows from the properties of types. The inequality $|\mathcal{C}_{\text{eff}}|\leq 2^{nR}$ confirms that $\mathcal{C}$ is a valid code. Specifically, each of the $M$ codewords is selected with uniform probability $1/M=2^{-nR}$. Of these, $|\mathcal{C}_{\text{eff}}|$ are used to cover all joint types $Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})$ satisfying $I(Q_{X};V)\leq R-2\epsilon$; as a result, $k_{\widebar{V}}(y^{n})\geq 1$ for all $y^{n}$ in those joint type classes. The remaining $M-|\mathcal{C}_{\text{eff}}|$ codewords could be used to cover the other joint types. However, their contribution is small, and we simply lower-bound it by zero. To sum up, in (19) we have

\begin{equation*}
k_{\widebar{V}}(y^{n})\ 2^{n\left[\iota(Q_{XY})-R\right]}\geq\begin{cases}2^{n\left[\iota(Q_{XY})-R\right]}&\text{if }I(Q_{X};V)\leq R-2\epsilon\\ 0&\text{if }I(Q_{X};V)>R-2\epsilon\end{cases}
\end{equation*}

for all $y^{n}\in\mathcal{T}_{Q_{Y}}$. Consequently, (19) simplifies to

\begin{align*}
1-\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}
&\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\ |\mathcal{T}_{Q_{Y}}|\ \min\left\{2^{n\left[\iota(Q_{XY})-R\right]},1\right\}\\
&\overset{a}{\geq}(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+\left|R-\iota(Q_{XY})\right|^{+}\right]}\\
&\geq(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\max_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}2^{-n\left[D(Q_{Y}\|P_{Y})+\left|R-\iota(Q_{XY})\right|^{+}\right]}\\
&=(n+1)^{-(|\mathcal{X}|+1)|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R-2\epsilon}\left[D(Q_{Y}\|P_{Y})+\big|R-\iota(Q_{XY})\big|^{+}\right]\right\},
\end{align*}

where $(a)$ follows from the properties of types. Taking $\epsilon\to 0$ and $n\to\infty$ completes the proof. ∎
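To get a feel for the "sufficiently large $n$" requirement in the codeword count above, note that $(n+1)^{|\mathcal{X}||\mathcal{Y}|}2^{n(R-\epsilon)}\leq 2^{nR}$ is equivalent to $|\mathcal{X}||\mathcal{Y}|\log_{2}(n+1)\leq n\epsilon$. The following Python sketch searches for the smallest blocklength at which this holds. The function names and the example parameters (binary alphabets, $\epsilon=0.05$) are illustrative assumptions and do not come from the paper.

```python
from math import log2

def budget_holds(n, nx, ny, eps):
    # (n+1)^{|X||Y|} * 2^{n(R-eps)} <= 2^{nR}
    #   <=>  |X||Y| * log2(n+1) <= n * eps   (the rate R cancels)
    return nx * ny * log2(n + 1) <= n * eps

def min_blocklength(nx, ny, eps):
    # Assumes 0 < eps < |X||Y|, so the condition fails at n = 1. Because
    # n*eps - |X||Y|*log2(n+1) is convex in n and vanishes at n = 0, the
    # condition keeps holding from its first success onward.
    n = 1
    while not budget_holds(n, nx, ny, eps):
        n += 1
    return n

# Hypothetical example: binary input and output alphabets, eps = 0.05.
print(min_blocklength(2, 2, 0.05))
```

The threshold grows quickly as $\epsilon$ shrinks, which is consistent with the proof taking $n\to\infty$ before letting $\epsilon\to 0$.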

Based on Proposition 32, the following property of $\overline{\mathit{\Gamma}}(R)$ holds, which is proved in Appendix D-d.

Lemma 33.

If $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$, then $\overline{\mathit{\Gamma}}(R)=0$.

IV-C Dual forms of $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$ and their equality

So far, we have obtained a lower bound $\underline{\mathit{\Gamma}}(R)$ for $\mathit{\Gamma}_{\mathrm{non}}(R)$ and an upper bound $\overline{\mathit{\Gamma}}(R)$ for $\mathit{\Gamma}_{\mathrm{uni}}(R)$, stated in Propositions 30 and 32, respectively. Figure 5 presents examples of these two bounds, using the same channel models as in Figure 1. Observing Figure 5, a natural question arises: do the two proposed bounds coincide when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$? If they do, then the exact exponent is pinned down, since $\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)$. In this subsection, we show that the answer is yes by reformulating $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$ into their dual forms in terms of $J_{\alpha,\beta}$ defined by (6). These dual forms are presented in Propositions 34 and 35, respectively.

Figure 5: Examples of the converse bound $\underline{\mathit{\Gamma}}(R)$ and the achievability bound $\overline{\mathit{\Gamma}}(R)$ for the strong converse exponent. Panels: (a) a fully noisy binary channel; (b) a fully noisy ternary channel; (c) a fully noiseless ternary channel; (d) a hybrid ternary channel.
Proposition 34.

$\underline{\mathit{\Gamma}}(R)$ has the following dual form:

\begin{align*}
\underline{\mathit{\Gamma}}(R)\geq\underline{\underline{\mathit{\Gamma}}}(R):=\ &\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\} \tag{20}\\
=\ &\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right]. \tag{21}
\end{align*}

Proposition 35.

$\overline{\mathit{\Gamma}}(R)$ has the following dual form:

\begin{equation*}
\overline{\mathit{\Gamma}}(R)=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].
\end{equation*}

Proposition 34 is proved in Appendix B-a and Proposition 35 is proved in Appendix B-b. Comparing the dual forms of $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$, we establish their equality in the following Proposition 36.

Proposition 36.

$\overline{\mathit{\Gamma}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

Proposition 36 is proved in Appendix B-c. Equipped with the propositions in this section, the proof of the main result, Theorem 17, follows from straightforward logical reasoning.

Proof of Theorem 17.

First, $\underline{\underline{\mathit{\Gamma}}}(R)\leq\underline{\mathit{\Gamma}}(R)\leq\mathit{\Gamma}_{\mathrm{non}}(R)\leq\mathit{\Gamma}_{\mathrm{ren}}(R)\leq\mathit{\Gamma}_{\mathrm{uni}}(R)\leq\overline{\mathit{\Gamma}}(R)$ holds for all $R$ by combining Remark 15, Proposition 30, Proposition 32, and Proposition 34.

When $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, by Proposition 36, $\underline{\underline{\mathit{\Gamma}}}(R)=\underline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=\overline{\mathit{\Gamma}}(R)$.

When $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, by Proposition 30, Proposition 34, and Lemma 46, $\mathit{\Gamma}_{\mathrm{non}}(R)\geq\underline{\mathit{\Gamma}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)=0$. On the other hand, according to the soft covering lemma (e.g., [moser2019advanced, Ch. 19]), under the uniform formulation there exists a code for which $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\to 0$ as $n\to\infty$, so that $\left(1-\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\right)\to 1=2^{-n\cdot 0}$: a zero strong converse exponent is achievable, and thus $\mathit{\Gamma}_{\mathrm{uni}}(R)\leq 0$. Hence, $\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=0=\underline{\underline{\mathit{\Gamma}}}(R)$.

To sum up, we conclude that $\mathit{\Gamma}_{\mathrm{non}}(R)=\mathit{\Gamma}_{\mathrm{ren}}(R)=\mathit{\Gamma}_{\mathrm{uni}}(R)=\underline{\underline{\mathit{\Gamma}}}(R)$ for all $R$. Combining this with Proposition 34 completes the proof. Since $\underline{\underline{\mathit{\Gamma}}}(R)$ is now exact, rather than merely a lower bound, we drop the underline and denote it by $\mathit{\Gamma}(R)$ to simplify notation. ∎

V Error Exponent for Noiseless Channels

In this section, we discuss the soft-covering error exponent when $W_{Y|X}$ is noiseless: for each $x\in\mathcal{X}$, there exists a symbol $y\in\mathcal{Y}$ such that $W_{Y|X}(y|x)=1$. Then we have $W_{Y|X}(y|x)=\mathbbm{1}\{y=w(x)\}$ for all $x\in\mathcal{X}$, $y\in\mathcal{Y}$, and for some function $w$.

Remark 37 (Choice of alphabet).

If $w(\cdot)$ is a bijection, then, without loss of generality, we assume that $\mathcal{X}=\mathcal{Y}$. If $w(\cdot)$ is a surjection, then in the achievability proof we may select, for each $y\in\mathcal{Y}$, a single representative $x\in w^{-1}(y)$ and construct an alphabet $\mathcal{X}_{\mathcal{C}}\subset\mathcal{X}$ consisting of these representatives, which hence satisfies $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$. In the converse proof, given any code $\mathcal{C}$, we first restrict the symbols: if $y_{1}=w(x_{1})=w(x_{2})$, we replace every occurrence of $x_{2}$ in the codewords with $x_{1}$. Since both $x_{1}$ and $x_{2}$ map to the same output symbol $y_{1}$, this modification leaves the output sequences unchanged and, consequently, does not affect the induced distribution $\tilde{P}_{Y^{n}|\mathcal{C}\sim q}$ or $\tilde{P}_{Y^{n}|\mathcal{C}}$. Under this restriction, the code alphabet reduces to some $\mathcal{X}_{\mathcal{C}}\subset\mathcal{X}$ such that $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$, and $w(\cdot)$ can be regarded as bijective. Thus, in both cases, one can always identify an input alphabet $\mathcal{X}_{\mathcal{C}}$ with $|\mathcal{X}_{\mathcal{C}}|=|\mathcal{Y}|$ for encoding.
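The symbol restriction described in this remark can be sketched in a few lines of Python. The helper names `reduce_alphabet` and `outputs`, the choice of the smallest input symbol as representative, and the toy map `w` are all assumptions made for illustration; any fixed representative per output symbol would serve equally well.

```python
def reduce_alphabet(codebook, w):
    # Pick one representative input per output symbol (here: the smallest
    # input in sorted order) and rewrite every codeword with representatives.
    representative = {}
    for x in sorted(w):
        representative.setdefault(w[x], x)
    return [tuple(representative[w[x]] for x in codeword) for codeword in codebook]

def outputs(codebook, w):
    # Output sequences of a noiseless channel, applied symbol by symbol.
    return [tuple(w[x] for x in codeword) for codeword in codebook]

# Toy noiseless channel: inputs 'a' and 'b' both map to output 0.
w = {"a": 0, "b": 0, "c": 1}
codebook = [("a", "b", "c"), ("b", "b", "a")]
reduced = reduce_alphabet(codebook, w)

print(reduced)
print(outputs(codebook, w) == outputs(reduced, w))
```

After the reduction, the codewords use only as many distinct input symbols as there are output symbols, while the output sequences, and hence the induced output distribution, are unchanged.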

V-A Uniform formulation

Under a noiseless channel, the output distribution (1) reduces to

\begin{equation*}
\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\frac{1}{M}\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\sum_{i=1}^{M}\mathbbm{1}\{X^{n}(i)=x^{n}\}=:\frac{k(y^{n})}{M}, \tag{22}
\end{equation*}

where $k(y^{n})\in\mathbb{Z}^{+}$ counts the number of codewords mapping to $y^{n}$. We begin by showing the converse result stated by Theorem 19, which is valid for both rational and irrational $P_{Y}$.

Proof of Theorem 19.

When $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$, it is relatively easy to approximate large probabilities but harder to approximate small ones, in particular those smaller than the step size $1/M$ in (22). This leads us to consider the following 'bad' set:

\begin{equation*}
\mathcal{B}:=\left\{y^{n}\in\mathcal{Y}^{n}:\frac{1}{M}\geq 2P_{Y}^{n}(y^{n})\right\}=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R+\frac{1}{n}}\mathcal{T}_{Q_{Y}}. \tag{23}
\end{equation*}

For any $y^{n}\in\mathcal{B}$, we have

\begin{equation*}
\left|\frac{k(y^{n})}{M}-P_{Y}^{n}(y^{n})\right|\begin{cases}\geq\dfrac{1}{M}-P_{Y}^{n}(y^{n})\geq P_{Y}^{n}(y^{n}),&\text{if }k(y^{n})\geq 1\\ =P_{Y}^{n}(y^{n}),&\text{if }k(y^{n})=0\end{cases}.
\end{equation*}

An error of at least $P_{Y}^{n}(y^{n})$ is therefore unavoidable for every $y^{n}\in\mathcal{B}$. In this sense, $\mathcal{B}$ is regarded as 'bad': it produces errors no matter what code is applied. As a result,

\begin{align*}
\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1} &\geq\frac{1}{2}\sum_{y^{n}\in\mathcal{B}}P_{Y}^{n}(y^{n})\\
&\overset{a}{\geq}\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R+\frac{1}{n}}2^{-nD(Q_{Y}\|P_{Y})}\\
&\geq\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R+\frac{1}{n}}D(Q_{Y}\|P_{Y})\right\},
\end{align*}

where $(a)$ follows from the properties of types. Taking $n\to\infty$ completes the proof of the variational form.

To obtain the dual form, consider the following.

\begin{align*}
\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}D(Q_{Y}\|P_{Y})\overset{a}{=}\ &\max_{\delta\geq 0}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\delta\left[R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\right]\big\}\\
=\ &\max_{\delta\geq 0}\left\{\delta R+\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\delta\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]\right\}\\
\overset{b}{=}\ &\max_{\delta\geq 0}\left\{\delta R-\log\sum_{y}P_{Y}^{1-\delta}(y)\right\}\\
\overset{c}{=}\ &\max_{\alpha\in(-\infty,0)\cup[1,\infty)}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\},
\end{align*}

where $(a)$ follows from the convexity of the optimization problem; $(b)$ follows directly from (46); and $(c)$ follows by setting $\alpha:=\frac{1}{1-\delta}$. ∎
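The primal and dual forms above can be compared numerically on a binary alphabet. The sketch below is a rough grid search under several assumptions: $P_{Y}=(p,1-p)$ with the illustrative values $p=0.2$ and $R=1.5$, logarithms to base 2, and the dual objective taken as $\delta R-\log_{2}\sum_{y}P_{Y}^{1-\delta}(y)$, which is the closed form of the inner minimization over $Q_{Y}$. The function names are hypothetical.

```python
from math import log2

def primal_value(p, rate, steps=20000):
    # min D(Q||P) over binary Q = (q, 1-q) subject to D(Q||P) + H(Q) >= rate,
    # approximated by a grid search over q in (0, 1).
    best = float("inf")
    for i in range(1, steps):
        q = i / steps
        cross_entropy = -(q * log2(p) + (1 - q) * log2(1 - p))  # D(Q||P) + H(Q)
        if cross_entropy >= rate:
            divergence = q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))
            best = min(best, divergence)
    return best

def dual_value(p, rate, delta_max=10.0, steps=20000):
    # max over delta in [0, delta_max] of delta*rate - log2(sum_y P(y)^(1-delta)).
    best = float("-inf")
    for i in range(steps + 1):
        delta = delta_max * i / steps
        log_sum = log2(p ** (1 - delta) + (1 - p) ** (1 - delta))
        best = max(best, delta * rate - log_sum)
    return best

p, rate = 0.2, 1.5  # illustrative values with H(P_Y) < rate < H_{-inf}(P_Y)
print(primal_value(p, rate), dual_value(p, rate))
```

For these values the two numbers agree up to the grid resolution, and the maximizing $\delta$ exceeds 1, which corresponds to a negative $\alpha=\frac{1}{1-\delta}$ in the last line of the derivation.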

It is noteworthy that $\overline{E}^{\mathrm{nl}}(R)$ is finite only in a certain region, as specified precisely in the following Lemma 38, which is proved in Appendix D-e.

Lemma 38.

$\overline{E}^{\mathrm{nl}}(R)<\infty$ when $R\leq H_{-\infty}(P_{Y})$, and $\overline{E}^{\mathrm{nl}}(R)=\infty$ when $R>H_{-\infty}(P_{Y})$.

Next, we prove the achievability result stated by Theorem 20, which is also valid for both rational and irrational $P_{Y}$. To prove it, we need the following lemma.

Lemma 39.

When $R\geq H(P_{Y})$, we have $N_{1}(R)\geq N_{2}(R)$, where

\begin{align*}
N_{1}(R) &=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}H(Q_{Y}),\\
N_{2}(R) &=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R}\left[R-D(Q_{Y}\|P_{Y})\right].
\end{align*}

Lemma 39 is proved in Appendix D-f. Recall (22) and observe that a strategy for constructing a good code is to make $k(y^{n})\approx MP_{Y}^{n}(y^{n})$, so that $k(y^{n})/M$ approximates $P_{Y}^{n}(y^{n})$ as closely as possible. More precisely, to establish the achievability result in Theorem 20, we show that there exists a code in which $k(y^{n})$ differs from either $\lfloor MP_{Y}^{n}(y^{n})\rfloor$ or $\lceil MP_{Y}^{n}(y^{n})\rceil$ by at most a polynomial quantity in $n$.

Proof of Theorem 20.

Taking inspiration from the proof of Proposition 32 regarding the achievability of the strong converse exponent, we provide a deterministic code construction for the problem at hand. Let us first assume that $R\geq H(P_{Y})$. Define

\begin{equation*}
\mathcal{G}:=\left\{y^{n}\in\mathcal{Y}^{n}:MP_{Y}^{n}(y^{n})\geq 1\right\}=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}\mathcal{T}_{Q_{Y}}. \tag{24}
\end{equation*}

$\mathcal{G}$ is a 'good' set in the sense that all sequences in $\mathcal{G}$ satisfy $\lfloor MP_{Y}^{n}(y^{n})\rfloor\geq 1$. Thus, it is always preferable to cover these sequences, since doing so yields nonzero $k(y^{n})$ values that can align with $MP_{Y}^{n}(y^{n})$. In fact, ignoring the factor of two, $\mathcal{G}$ is approximately the complement of the 'bad' set $\mathcal{B}$ defined in (23). Take any $\epsilon>0$. The size of $\mathcal{G}$ is bounded from below by

\begin{align*}
|\mathcal{G}| &=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}|\mathcal{T}_{Q_{Y}}|\geq\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}|\mathcal{T}_{Q_{Y}}|\\
&\overset{a}{\geq}(n+1)^{-|\mathcal{Y}|}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\leq R}2^{nH(Q_{Y})}\\
&\overset{b}{\geq}(n+1)^{-|\mathcal{Y}|}\ 2^{n\left[N_{1}(R)-\epsilon\right]},
\end{align*}

where $(a)$ follows from the properties of types. In $(b)$, the maximization is over all types, so it differs from the maximization over distributions (namely $N_{1}(R)$ as defined in Lemma 39) by at most $\epsilon$ for sufficiently large $n$.

To design a good code, we adopt the alphabet choice described in Remark 37. With this choice, $k(y^{n})$ represents the number of codewords mapped one-to-one to $y^{n}$, and hence can be directly constructed via a repetition of codewords. Our goal is to design $k(y^{n})$ so that it satisfies $\lfloor MP_{Y}^{n}(y^{n})\rfloor\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+\mathrm{poly}(n)$ under the constraint that $\sum_{y^{n}}k(y^{n})=M$. Namely, consider the following two codes:

\begin{align*}
\mathcal{C}_{1}:\quad &k(y^{n})=\lfloor MP_{Y}^{n}(y^{n})\rfloor,\quad\forall y^{n}\in\mathcal{Y}^{n},\quad M_{1}:=|\mathcal{C}_{1}|=\sum_{y^{n}}\ \lfloor MP_{Y}^{n}(y^{n})\rfloor.\\
\mathcal{C}_{2}:\quad &k(y^{n})=\begin{cases}\lceil MP_{Y}^{n}(y^{n})\rceil&\text{if }y^{n}\in\mathcal{G}\\ 0&\text{if }y^{n}\notin\mathcal{G}\end{cases},\quad M_{2}:=|\mathcal{C}_{2}|=\sum_{y^{n}\in\mathcal{G}}\lceil MP_{Y}^{n}(y^{n})\rceil.
\end{align*}

Clearly $M_{1}\leq M$, meaning that we can use $\mathcal{C}_{1}$ to cover all sequences $y^{n}\in\mathcal{Y}^{n}$ and there will be some codewords left. Note that $\mathcal{G}^{c}:=\mathcal{Y}^{n}\backslash\mathcal{G}$ is not covered, since $\lfloor MP_{Y}^{n}(y^{n})\rfloor=0$ for all $y^{n}\in\mathcal{G}^{c}$. If $M_{2}\geq M$, then after applying $\mathcal{C}_{1}$, each sequence in $\mathcal{G}$ can be covered at most once more. Hence, there exists a code $\mathcal{C}_{3}$ as follows.

\begin{equation*}
\mathcal{C}_{3}:\quad k(y^{n})=\begin{cases}\lfloor MP_{Y}^{n}(y^{n})\rfloor\text{ or }\lceil MP_{Y}^{n}(y^{n})\rceil&\text{if }y^{n}\in\mathcal{G}\\ 0&\text{if }y^{n}\notin\mathcal{G}\end{cases},\quad|\mathcal{C}_{3}|=M.
\end{equation*}

If $M_{2}<M$, then after applying $\mathcal{C}_{1}$, each sequence in $\mathcal{G}$ can still be covered more than once. Equivalently, we use $\mathcal{C}_{2}$ to cover $\mathcal{Y}^{n}$, and there will be some codewords left. The number of leftover codewords is

\begin{align*}
\Delta M &:=M-M_{2}=M-\sum_{y^{n}\in\mathcal{G}}\lceil MP_{Y}^{n}(y^{n})\rceil\leq M-\sum_{y^{n}\in\mathcal{G}}MP_{Y}^{n}(y^{n})=MP_{Y}^{n}(\mathcal{G}^{c})\\
&=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}MP_{Y}^{n}(\mathcal{T}_{Q_{Y}})\\
&\overset{a}{\leq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}2^{n\left[R-D(Q_{Y}\|P_{Y})\right]}\\
&\overset{b}{\leq}(n+1)^{|\mathcal{Y}|}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}2^{n\left[R-D(Q_{Y}\|P_{Y})\right]}\\
&=(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{n\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}\left[R-D(Q_{Y}\|P_{Y})\right]\right\}\\
&\leq(n+1)^{|\mathcal{Y}|}\ 2^{nN_{2}(R)},
\end{align*}

where $(a)$ and $(b)$ follow from the properties of types, and $N_{2}(R)$ is defined in Lemma 39. Now let us consider using these $\Delta M$ codewords to continue covering $\mathcal{G}$ after $\mathcal{C}_{2}$ is applied. We can cover each sequence in $\mathcal{G}$ another $\frac{\Delta M}{|\mathcal{G}|}$ times, which is at most

\begin{equation*}
\frac{\Delta M}{|\mathcal{G}|}\leq(n+1)^{2|\mathcal{Y}|}2^{n\left[N_{2}(R)-N_{1}(R)+\epsilon\right]}\leq(n+1)^{2|\mathcal{Y}|}2^{n\epsilon}\leq 2^{2n\epsilon}
\end{equation*}

for all sufficiently large $n$. Here we have used $N_{1}(R)\geq N_{2}(R)$ when $R\geq H(P_{Y})$ from Lemma 39. Hence, there exists a code $\mathcal{C}_{4}$ as follows, with $|\mathcal{C}_{4}|=M$:

\begin{equation*}
\mathcal{C}_{4}:\quad\begin{cases}\lceil MP_{Y}^{n}(y^{n})\rceil\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+2^{2n\epsilon}&y^{n}\in\mathcal{G},\\ k(y^{n})=0&y^{n}\notin\mathcal{G}.\end{cases}
\end{equation*}

According to the above discussion, one of $\mathcal{C}_{3}$ and $\mathcal{C}_{4}$ always exists and has exactly $M$ codewords. Therefore, for all $\epsilon>0$ and sufficiently large $n$, there exists a code $\mathcal{C}$ as follows, with $|\mathcal{C}|=\sum_{y^{n}}k(y^{n})=M$:

\begin{equation*}
\mathcal{C}:\quad\begin{cases}\lfloor MP_{Y}^{n}(y^{n})\rfloor\leq k(y^{n})\leq\lceil MP_{Y}^{n}(y^{n})\rceil+2^{2n\epsilon}&y^{n}\in\mathcal{G},\\ k(y^{n})=0&y^{n}\notin\mathcal{G}.\end{cases}
\end{equation*}

Choose 𝒞\mathcal{C} to be our covering code, which yields that

|k(yn)MPYn(yn)|{1+22nϵMif yn𝒢PYn(yn)if yn𝒢}23nϵmin{1M,PYn(yn)}.\left|\frac{k(y^{n})}{M}-P_{Y}^{n}(y^{n})\right|\leq\left\{\begin{array}[]{@{}l l@{}}\dfrac{1+2^{2n\epsilon}}{M}&\text{if }y^{n}\in\mathcal{G}\\ P_{Y}^{n}(y^{n})&\text{if }y^{n}\notin\mathcal{G}\end{array}\right\}\leq 2^{3n\epsilon}\min\left\{\frac{1}{M},P_{Y}^{n}(y^{n})\right\}.

Consequently, we have

12P~Yn|𝒞PYn1\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1} 23nϵ1ynmin{1M,PYn(yn)}\displaystyle\leq 2^{3n\epsilon-1}\sum_{y^{n}}\min\left\{\frac{1}{M},P_{Y}^{n}(y^{n})\right\}
=23nϵ1QY𝒫n(𝒴)|𝒯QY|min{2nR,2n[D(QYPY)+H(QY)]}\displaystyle=2^{3n\epsilon-1}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}|\mathcal{T}_{Q_{Y}}|\min\left\{2^{-nR},2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\right\}
𝑎23nϵ1QY𝒫n(𝒴)2nH(QY)min{2nR,2n[D(QYPY)+H(QY)]}\displaystyle\overset{a}{\leq}2^{3n\epsilon-1}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}2^{nH(Q_{Y})}\min\left\{2^{-nR},2^{-n\left[D(Q_{Y}\|P_{Y})+H(Q_{Y})\right]}\right\}
𝑏12(n+1)|𝒴| 23nϵmaxQY𝒫n(𝒴)min{2n[RH(QY)],2nD(QYPY)}\displaystyle\overset{b}{\leq}\frac{1}{2}(n+1)^{|\mathcal{Y}|}\ 2^{3n\epsilon}\max_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\min\left\{2^{-n[R-H(Q_{Y})]},2^{-nD(Q_{Y}\|P_{Y})}\right\}
=12(n+1)|𝒴|exp2{nminQY𝒫n(𝒴)[D(QYPY)+|RD(QYPY)H(QY)|+3ϵ]},\displaystyle=\frac{1}{2}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}-3\epsilon\right]\right\},

where $(a)$ and $(b)$ follow from properties of types. The exponent is exactly $\underline{E}^{\mathrm{nl}}(R)$ as $\epsilon\to 0$ and $n\to\infty$.

Note that the above discussion assumes $R\geq H(P_{Y})$. If $R<H(P_{Y})$, we can simply apply the trivial bound $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\leq 1$, and the error exponent is $0$. Since $\underline{E}^{\mathrm{nl}}(R)$ also vanishes for $R<H(P_{Y})$, it is an achievable error exponent for rates both below and above $H(P_{Y})$. Hence, the variational form in the theorem is proven.

To obtain the dual form, consider the following.

$$\begin{aligned}
&\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}\\
&\quad=\max_{\lambda\in[0,1]}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\lambda\left[R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\right]\big\}\\
&\quad=\max_{\lambda\in[0,1]}\left\{\lambda R+\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\lambda\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]\right\}\\
&\quad\overset{a}{=}\max_{\lambda\in[0,1]}\left\{\lambda R-\log\sum_{y}P_{Y}^{1-\lambda}(y)\right\}\\
&\quad\overset{b}{=}\max_{\alpha\geq 1}\left\{\frac{\alpha-1}{\alpha}\left[R-H_{\frac{1}{\alpha}}(P_{Y})\right]\right\},
\end{aligned}$$

where (a)(a) follows from (46) and (b)(b) follows by setting α:=11λ\alpha:=\frac{1}{1-\lambda}. ∎
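The equality between the variational form and the dual form can be checked numerically on a small example. The following is a minimal sketch, assuming a binary output alphabet, logarithms to base 2, and a plain grid search over $Q_{Y}$ and $\lambda$; the function names are illustrative and not taken from the paper. The dual objective is evaluated as $\lambda R-\log\sum_{y}P_{Y}^{1-\lambda}(y)$.

```python
from math import log2


def kl(q, p):
    """Binary relative entropy D((q, 1-q) || (p, 1-p)) in bits."""
    return sum(a * log2(a / b) for a, b in ((q, p), (1 - q, 1 - p)) if a > 0)


def h2(q):
    """Binary entropy of (q, 1-q) in bits."""
    return -sum(a * log2(a) for a in (q, 1 - q) if a > 0)


def variational_form(p, R, grid=20000):
    """Grid approximation of min_Q { D(Q||P) + |R - D(Q||P) - H(Q)|^+ }."""
    best = float("inf")
    for i in range(grid + 1):
        q = i / grid
        d = kl(q, p)
        best = min(best, d + max(R - d - h2(q), 0.0))
    return best


def dual_form(p, R, grid=20000):
    """Grid approximation of max over lambda in [0,1] of
    lambda * R - log2(sum_y P(y)^(1 - lambda))."""
    best = float("-inf")
    for i in range(grid + 1):
        lam = i / grid
        s = p ** (1 - lam) + (1 - p) ** (1 - lam)
        best = max(best, lam * R - log2(s))
    return best
```

For instance, with $P_{Y}=(0.8,0.2)$ the two functions should agree up to the grid resolution for rates on either side of $H(P_{Y})$, and both should vanish when $R<H(P_{Y})$.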

The idea behind this proof is a quantization of $P_{Y}^{n}$ with a step size of $1/M$. For every sequence in the good set $\mathcal{G}$, the resulting error is on the order of $1/M$ (at most $(1+2^{2n\epsilon})/M$). For sequences outside $\mathcal{G}$, each probability is less than $1/M$, so we choose not to cover them, and the error equals their probability. The key step is to show that such a construction can indeed be realized under the constraint that the number of codewords is exactly $M$.
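The quantization idea can be illustrated on a toy instance. The sketch below is not the construction of $\mathcal{C}_{3}$ or $\mathcal{C}_{4}$ from the proof; it is a simplified largest-remainder rounding, assumed here only for illustration, that assigns $\lfloor MP_{Y}^{n}(y^{n})\rfloor$ repetitions to each sequence with probability at least $1/M$, leaves all other sequences uncovered, and spreads the leftover codewords over the covered sequences so that exactly $M$ codewords are used. All function names are hypothetical.

```python
from itertools import product
from math import floor, prod


def quantize_product_distribution(p, n, M):
    """Return (k, prob, good): repetition numbers k(y^n) that sum to M."""
    seqs = list(product(range(len(p)), repeat=n))
    prob = {y: prod(p[s] for s in y) for y in seqs}
    # "Good" sequences: probability at least 1/M, so they get covered.
    good = [y for y in seqs if prob[y] >= 1.0 / M]
    if not good:
        raise ValueError("M is too small: no sequence has probability >= 1/M")
    k = {y: 0 for y in seqs}
    for y in good:
        k[y] = floor(M * prob[y])
    # Distribute the remaining codewords, largest fractional part first.
    leftover = M - sum(k.values())
    by_remainder = sorted(good, key=lambda y: M * prob[y] - k[y], reverse=True)
    for j in range(leftover):
        k[by_remainder[j % len(by_remainder)]] += 1
    return k, prob, good


def total_variation(k, prob, M):
    """Half the L1 distance between the code-induced law k/M and prob."""
    return 0.5 * sum(abs(k[y] / M - prob[y]) for y in prob)
```

Since uncovered sequences receive no codeword, the resulting total variation is never smaller than the total probability of the uncovered set, which is the term that dominates the error in the proof above.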

V-B Rational-irrational discrepancy under the uniform formulation

Recalling (22), under the uniform formulation, every value of $\tilde{P}_{Y^{n}|\mathcal{C}}$ is an integer multiple of $1/M$, i.e., $\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=k(y^{n})/M\in\mathbb{Q}$. This gives rise to the rational-irrational discrepancy. In the following, we state Khintchine’s theorem and use it to prove the linear converse in Theorem 25.

Lemma 40 (Khintchine’s theorem [bugeaud2004approximation, Thm. 1.10], [queffelec2013diophantine, Thm. 3.3.2]).

Let Ψ:1>0\Psi:\mathbb{R}_{\geq 1}\to\mathbb{R}_{>0} be a continuous function such that MM2Ψ(M)M\mapsto M^{2}\Psi(M) is non-increasing and that M=1MΨ(M)<\sum_{M=1}^{\infty}M\Psi(M)<\infty. Then for almost every pp\notin\mathbb{Q} (in the sense of full Lebesgue measure in \mathbb{R}), we have |K/Mp|<Ψ(M)\left|K/M-p\right|<\Psi(M) for only finitely many K,MK,M\in\mathbb{Z}.

Proof of Theorem 25.

Let $\widebar{y}\in\mathcal{Y}$ be an irrational symbol, i.e., $P_{Y}(\widebar{y})\notin\mathbb{Q}$. Define $\mathcal{A}:=\left\{y^{n}\in\mathcal{Y}^{n}:y_{1}=\widebar{y}\right\}$ to be the set of output sequences in which $\widebar{y}$ appears in the first position. According to (11),

12P~Yn|𝒞PYn1|P~Yn|𝒞(𝒜)PYn(𝒜)|,\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\geq\left|\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|, (25)

where

$$\begin{aligned}
\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})&=\sum_{y^{n}\in\mathcal{A}}\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\frac{1}{M}\sum_{y^{n}\in\mathcal{A}}k(y^{n})=:\frac{K}{M},\\
P_{Y}^{n}(\mathcal{A})&=\sum_{y^{n}\in\mathcal{A}}P_{Y}^{n}(y^{n})=P_{Y}(\widebar{y})\sum_{y^{n-1}\in\mathcal{Y}^{n-1}}P_{Y}^{n-1}(y^{n-1})=P_{Y}(\widebar{y}).
\end{aligned}$$

The quantity k(yn)k(y^{n}) is defined in (22) and K:=yn𝒜k(yn)K:=\sum_{y^{n}\in\mathcal{A}}k(y^{n}) is an integer. Hence, P~Yn|𝒞(𝒜)\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})\in\mathbb{Q} and PYn(𝒜)P_{Y}^{n}(\mathcal{A})\notin\mathbb{Q}, and (25) describes a problem of Diophantine approximation. Taking Ψ(M)=M(2+ϵ)\Psi(M)=M^{-(2+\epsilon)}, it follows immediately from Lemma 40 that

$$\left|\tilde{P}_{Y^{n}|\mathcal{C}}(\mathcal{A})-P_{Y}^{n}(\mathcal{A})\right|=\left|\frac{K}{M}-P_{Y}(\widebar{y})\right|\geq\frac{1}{M^{2+\epsilon}}=2^{-n(2+\epsilon)R}$$

for all sufficiently large $n$: Lemma 40 permits only finitely many exceptional pairs $(K,M)$, whereas $M=2^{nR}$ grows without bound with the number of channel uses $n$. ∎
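The Diophantine lower bound used in this proof can be observed numerically. The sketch below is an illustration under explicit assumptions: the irrational probability is taken to be $1/\sqrt{2}$ (a quadratic irrational, chosen here only as an example), the rate is $R=1$ so that $M=2^{n}$, and $\epsilon=0.5$. Khintchine’s theorem is an almost-everywhere statement, so the script does not prove anything for a specific value; it only compares the best rational approximation $K/M$ with $M^{-(2+\epsilon)}$ in floating-point arithmetic.

```python
from math import sqrt


def nearest_fraction_gap(p, M):
    """Distance from p to the closest fraction K/M with integer K."""
    K = round(M * p)
    return abs(K / M - p)


def gaps_vs_bound(p, eps, max_n):
    """Rows (n, gap, bound) for M = 2^n, where bound = M^-(2 + eps)."""
    rows = []
    for n in range(1, max_n + 1):
        M = 2 ** n
        rows.append((n, nearest_fraction_gap(p, M), M ** -(2 + eps)))
    return rows


if __name__ == "__main__":
    for n, gap, bound in gaps_vs_bound(sqrt(0.5), 0.5, 16):
        print(n, gap, bound, gap >= bound)
```

The range of $n$ is kept small because the bound $M^{-(2+\epsilon)}$ quickly approaches the resolution of double-precision floats.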

Alternatively, if all entries of $P_{Y}$ are rational, a perfect covering can be achieved at high rates, as stated in Theorem 26. This is because $MP_{Y}^{n}(y^{n})$ can always be an integer for sufficiently large $n$, making $k(y^{n})=MP_{Y}^{n}(y^{n})$ a valid code construction. The detailed proof is as follows.

Proof of Theorem 26.

Given PY(y)=AyByP_{Y}(y)=\frac{A_{y}}{B_{y}}, we have

MPYn(yn)=Mi=1nAyiByi=i=1n(2logMnAyiByi).MP_{Y}^{n}(y^{n})=M\prod_{i=1}^{n}\frac{A_{y_{i}}}{B_{y_{i}}}=\prod_{i=1}^{n}\left(2^{\frac{\log M}{n}}\frac{A_{y_{i}}}{B_{y_{i}}}\right).

When Rlog(lcm({By}y𝒴))R\geq\log(\mathrm{lcm}(\{B_{y}\}_{y\in\mathcal{Y}})), there always exist sufficiently large MM and nn such that 2logMn2^{\frac{\log M}{n}} is a multiple of ByB_{y} for each y𝒴y\in\mathcal{Y} and logMnR\frac{\log M}{n}\to R as nn\to\infty. Therefore, we can employ the alphabet choice in Remark 37 and set a repetition number k(yn)=MPYn(yn)k(y^{n})=MP_{Y}^{n}(y^{n}). ∎

Example 41.

Consider a ternary output $P_{Y}=[\frac{1}{2},\frac{1}{3},\frac{1}{6}]$. Then $E_{\mathrm{uni}}^{\mathrm{nl}}(R)=\infty$ when $R\geq\log 6$.
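The ternary example can be verified with exact rational arithmetic. The following sketch assumes $M=6^{n}$ (rate $\log 6$ per symbol) and checks that $MP_{Y}^{n}(y^{n})$ is an integer for every $y^{n}$, so that these values form valid repetition numbers summing to $M$. The helper name is illustrative.

```python
from fractions import Fraction
from itertools import product
from math import lcm, prod


def perfect_covering_counts(p, n):
    """Repetition numbers k(y^n) = M * P^n(y^n) for M = lcm(denominators)^n."""
    base = lcm(*(f.denominator for f in p))
    M = base ** n
    counts = {}
    for y in product(range(len(p)), repeat=n):
        k = M * prod((p[s] for s in y), start=Fraction(1))
        if k.denominator != 1:
            raise ValueError("M * P^n(y^n) is not an integer")
        counts[y] = int(k)
    return base, M, counts


if __name__ == "__main__":
    p_y = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
    base, M, counts = perfect_covering_counts(p_y, 3)
    print(base, M, sum(counts.values()))
```

With these counts the code-induced distribution coincides with $P_{Y}^{n}$, so the total variation is exactly zero.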

V-C Non-uniform formulation

If the messages are non-uniformly distributed, then in the achievability proof, one must address not only how to construct a good covering code, but also how to select the optimal message distribution. It turns out that an effective choice is to treat PYnP_{Y}^{n} as a source and apply lossless source coding, with the decoding index serving as the message index in the covering setting. More generally, after formulating the lossless source coding problem in Definition 42, we show the equivalence between noiseless soft covering and lossless source coding in Lemma 43.

Definition 42 (Lossless source coding).

Let $\mathcal{M}=\{1,\dots,M\}$ with $M=2^{nR}$. Consider a discrete memoryless source $Y^{n}$ whose components are i.i.d. according to $P_{Y}$. An $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme consists of an encoder $\mathcal{E}:\mathcal{Y}^{n}\to\mathcal{M}$ and a decoder $\mathcal{D}:\mathcal{M}\to\hat{\mathcal{Y}}^{n}$, with $\mathrm{Pr_{e}}:=\Pr\{Y^{n}\neq\hat{Y}^{n}\}$ denoting the probability of decoding error.

Lemma 43 (Equivalence between noiseless soft covering and lossless source coding).

Let w:𝒳𝒴w:\mathcal{X}\to\mathcal{Y} be a bijective function. Consider a noiseless channel WY|X(y|x)=𝟙{y=w(x)}W_{Y|X}(y|x)=\mathbbm{1}\{y=w(x)\}.

  • (a)

    For any (R,n,PY,Pre)(R,n,P_{Y},\mathrm{Pr_{e}}) lossless source coding scheme, there exists an (R,n,PY,WY|X,q)(R,n,P_{Y},W_{Y|X},q) non-uniform soft-covering scheme such that 12P~Yn|𝒞qPYn1Pre\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}\leq\mathrm{Pr_{e}}.

  • (b)

    For any $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme, there exists an $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme such that $\mathrm{Pr_{e}}\leq\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}$.

Lemma 43 is proved in Appendix D-g. Now we employ this lemma to show Theorem 21.

Proof of Theorem 21.

We first show achievability. Follow the alphabet choice in Remark 37 so that w()w(\cdot) is a bijection. To apply Lemma 43, we need to specify a lossless source coding scheme for the i.i.d. source PYnP_{Y}^{n}. Take any ϵ>0\epsilon>0 and define

𝒜:=QY𝒫n(𝒴):H(QY)Rϵ𝒯QY.\mathcal{A}:=\bigsqcup_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})\leq R-\epsilon}\mathcal{T}_{Q_{Y}}.

Since there are at most $(n+1)^{|\mathcal{Y}|}$ types and $|\mathcal{T}_{Q_{Y}}|\leq 2^{nH(Q_{Y})}$, we have $|\mathcal{A}|\leq(n+1)^{|\mathcal{Y}|}2^{n(R-\epsilon)}\leq M=2^{nR}$ for all sufficiently large $n$, meaning that every sequence in $\mathcal{A}$ can be given a distinct label using at most $M$ codewords. Therefore, there exists an $(R,n,P_{Y},\mathrm{Pr_{e}})$ lossless source coding scheme in which all sequences in $\mathcal{A}$ are correctly decoded, and hence $\mathrm{Pr_{e}}\leq P_{Y}^{n}(\mathcal{A}^{c})$ with $\mathcal{A}^{c}=\mathcal{Y}^{n}\backslash\mathcal{A}$. According to Lemma 43(a), there exists an $(R,n,P_{Y},W_{Y|X},q)$ non-uniform soft-covering scheme such that

12P~Yn|𝒞qPYn1\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1} PrePYn(𝒜c)=QY𝒫n(𝒴):H(QY)>RϵPYn(𝒯QY)\displaystyle\leq\mathrm{Pr_{e}}\leq P_{Y}^{n}(\mathcal{A}^{c})=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R-\epsilon}P_{Y}^{n}(\mathcal{T}_{Q_{Y}})
𝑎(n+1)|𝒴|exp2{nminQY𝒫n(𝒴):H(QY)>RϵD(QYPY)},\displaystyle\overset{a}{\leq}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R-\epsilon}D(Q_{Y}\|P_{Y})\right\},

where (a)(a) follows from property of types. Taking ϵ0\epsilon\to 0 and nn\to\infty completes the proof of achievability.
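The type-counting bound in the achievability step can be evaluated exactly for a binary source. The sketch below assumes $P_{Y}=(1-p,p)$, logarithms to base 2, and indexes types by the number $j$ of ones in $y^{n}$; it compares the exact probability of the uncovered types $\{Q_{Y}:H(Q_{Y})>R-\epsilon\}$ with the bound $(n+1)^{|\mathcal{Y}|}2^{-n\min D(Q_{Y}\|P_{Y})}$. The function names are illustrative.

```python
from math import comb, log2


def h2(q):
    """Binary entropy of (q, 1-q) in bits."""
    return -sum(a * log2(a) for a in (q, 1 - q) if a > 0)


def kl(q, p):
    """Binary relative entropy D((q, 1-q) || (p, 1-p)) in bits."""
    return sum(a * log2(a / b) for a, b in ((q, p), (1 - q, 1 - p)) if a > 0)


def uncovered_mass_and_bound(p, n, R, eps):
    """Exact P^n(A^c) and the type bound (n+1)^2 * 2^(-n * min D)."""
    exact, d_min = 0.0, float("inf")
    for j in range(n + 1):  # j = number of ones, so the type is q = j / n
        q = j / n
        if h2(q) > R - eps:
            exact += comb(n, j) * p ** j * (1 - p) ** (n - j)
            d_min = min(d_min, kl(q, p))
    bound = (n + 1) ** 2 * 2.0 ** (-n * d_min)
    return exact, bound
```

For moderate $n$ the polynomial prefactor makes the bound loose, but both quantities decay with the same exponent as $n$ grows.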

For the converse, given any covering code, by alphabet restriction in Remark 37, w()w(\cdot) can be regarded bijective. Hence, according to Lemma 43(b), for any (R,n,PY,WY|X,q)(R,n,P_{Y},W_{Y|X},q) non-uniform soft-covering scheme, there exists a corresponding (R,n,PY,Pre)(R,n,P_{Y},\mathrm{Pr_{e}}) lossless source coding scheme such that

12P~Yn|𝒞qPYn112Pre𝑎12(n+1)|𝒴|exp2{nminQY𝒫n(𝒴):H(QY)>R+ϵD(QYPY)},\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\frac{1}{2}\mathrm{Pr_{e}}\overset{a}{\geq}\frac{1}{2}(n+1)^{-|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):H(Q_{Y})>R+\epsilon}D(Q_{Y}\|P_{Y})\right\},

where (a)(a) follows from the converse exponent of the lossless source coding, e.g., [csiszar2011information, Thm. 2.15]. Taking ϵ0\epsilon\to 0 and nn\to\infty completes the proof of the converse.

So far, we have proven the variational form, which is exactly the error exponent of the lossless source coding problem. The dual form thus follows immediately from [csiszar2011information, Problem 2.15]. ∎

V-D HH_{-\infty}-constrained formulation

Proof of Theorem 22.

We begin with the converse. In fact, the proof of the converse is identical to that of Theorem 19 in Section V-A. The ‘bad’ set \mathcal{B} defined in (23) applies here as well, because in the HH_{-\infty}-constrained formulation, the minimal probability in the message set is also at least 1/M1/M. All output sequences with probability PYn(yn)1/2MP_{Y}^{n}(y^{n})\leq 1/2M contribute to the covering error. Ergo, E¯nl(R)\overline{E}^{\mathrm{nl}}(R) in Theorem 19, which serves as a converse for the uniform formulation, also constitutes a converse for the HH_{-\infty}-constrained formulation, even though the latter is a more general formulation that encompasses the former.

For the achievability, we follow the alphabet choice in Remark 37 so that w()w(\cdot) is a bijection. Similar to the non-uniform case, we consider a lossless source coding scheme for PYnP_{Y}^{n}, where only sequences with probability greater than 1/M1/M are encoded. This can be interpreted as a lossless source coding scheme where the rate is measured by HH_{-\infty} of the encoded messages, instead of H0H_{0}, which is equal to the logarithm of the size of the message set. Those sequences are in fact from the good set 𝒢\mathcal{G} defined in (24). The corresponding source coding scheme is given by

(yn)={1,2,,|𝒢|if yn𝒢,0if yn𝒢,𝒟(i)={1(i)if i=1,2,,|𝒢|,declare errorif i=0.\mathcal{E}(y^{n})=\begin{cases}1,2,\dots,|\mathcal{G}|&\text{if }y^{n}\in\mathcal{G},\\ 0&\text{if }y^{n}\notin\mathcal{G},\end{cases}\qquad\mathcal{D}(i)=\begin{cases}\mathcal{E}^{-1}(i)&\text{if }i=1,2,\dots,|\mathcal{G}|,\\ \text{declare error}&\text{if }i=0.\end{cases} (26)

Clearly, all sequences in 𝒢c=𝒴n\𝒢\mathcal{G}^{c}=\mathcal{Y}^{n}\backslash\mathcal{G} contribute to the decoding error. According to Lemma 43(a), there exists an (R,n,PY,WY|X,q)(R,n,P_{Y},W_{Y|X},q) non-uniform soft-covering scheme such that

12P~Yn|𝒞qPYn1\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1} Pre=PYn(𝒢c)=QY𝒫n(𝒴):D(QYPY)+H(QY)>RPYn(𝒯QY)\displaystyle\leq\mathrm{Pr_{e}}=P_{Y}^{n}(\mathcal{G}^{c})=\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}P_{Y}^{n}(\mathcal{T}_{Q_{Y}})
𝑎(n+1)|𝒴|exp2{nminQY𝒫n(𝒴):D(QYPY)+H(QY)>RD(QYPY)},\displaystyle\overset{a}{\leq}(n+1)^{|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})>R}D(Q_{Y}\|P_{Y})\right\},

where (a)(a) follows from property of types. The exponent here is exactly E¯nl(R)\overline{E}^{\mathrm{nl}}(R) as nn\to\infty. However, directly from Lemma 43(a), the soft-covering coding scheme here is under the non-uniform formulation. We need to show that construction in (26) can indeed result in a soft-covering coding scheme under the HH_{-\infty}-constrained formulation. Specifically, the corresponding message distribution in (66) (see Appendix D-g) needs to satisfy q(i)1/Mq(i)\geq 1/M for all i=0,1,,|𝒢|i=0,1,\dots,|\mathcal{G}|. First, the definition of 𝒢\mathcal{G} in (24) ensures that q(i)1/Mq(i)\geq 1/M for i=1,,|𝒢|i=1,\dots,|\mathcal{G}|. It remains to verify that q(0)1/Mq(0)\geq 1/M. Since q(0)=Preq(0)=\mathrm{Pr_{e}}, it is equivalent to check E¯nl(R)R\overline{E}^{\mathrm{nl}}(R)\leq R. By (65) in Appendix D-b, when E¯nl(R)\overline{E}^{\mathrm{nl}}(R) is finite, we have E¯nl(R)=D(QYPY)\overline{E}^{\mathrm{nl}}(R)=D(Q_{Y}^{*}\|P_{Y}) for some optimizer QY𝒫(𝒴)Q_{Y}^{*}\in\mathcal{P}(\mathcal{Y}) such that D(QYPY)+H(QY)=RD(Q_{Y}^{*}\|P_{Y})+H(Q_{Y}^{*})=R. Therefore, E¯nl(R)R\overline{E}^{\mathrm{nl}}(R)\leq R. This completes the proof. ∎

VI Error Exponent for Noisy Channels

This section addresses soft covering error exponents for noisy channels. We prove Theorem 27 and Theorem 28.

VI-A Achievability

Proof of Theorem 27.

The noiseless soft-covering construction can be carried over to noisy channels by covering the input distribution instead of the output. This leads to a high-rate improvement in the achievable error exponent compared to random coding. The output distribution in (1) can be written as

P~Yn|𝒞(yn)=xnP~Xn|𝒞(xn)W(yn|xn),\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})=\sum_{x^{n}}\tilde{P}_{X^{n}|\mathcal{C}}(x^{n})\ W(y^{n}|x^{n}),

where P~Xn|𝒞\tilde{P}_{X^{n}|\mathcal{C}} is the code-induced input distribution. Pick any PX𝒮P_{X}\in\mathcal{S}. Then PY=PXWP_{Y}=P_{X}W. We have

12P~Yn|𝒞PYn1𝑎12P~Xn|𝒞PXn1𝑏2nE¯nl(R),\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}\overset{a}{\leq}\frac{1}{2}\left\|\tilde{P}_{X^{n}|\mathcal{C}}-P_{X}^{n}\right\|_{1}\overset{b}{\leq}2^{-n\underline{E}^{\mathrm{nl}}(R)},

where $(a)$ follows from the data processing inequality for the total variation distance, and $(b)$ follows from Theorem 20. Here $\underline{E}^{\mathrm{nl}}(R)$ is the noiseless bound of the $(R,n,P_{X})$ covering problem. Hence, we can follow the code construction in the proof of Theorem 20 in Section V-A to cover the input distribution $P_{X}^{n}$. The best bound of this form is then obtained by maximizing over $P_{X}\in\mathcal{S}$. ∎
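Step $(a)$ relies on the fact that passing two input distributions through the same channel cannot increase their total variation distance. A minimal numerical sketch of this data processing inequality follows, assuming randomly drawn input distributions and a randomly drawn channel matrix on small alphabets; all names are illustrative.

```python
import random


def tv(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def push_forward(p, W):
    """Output distribution of channel W (rows indexed by inputs) under input p."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]


def random_dist(k, rng):
    """A random probability vector of length k."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]


if __name__ == "__main__":
    rng = random.Random(1)
    p, q = random_dist(4, rng), random_dist(4, rng)
    W = [random_dist(3, rng) for _ in range(4)]
    print(tv(push_forward(p, W), push_forward(q, W)), "<=", tv(p, q))
```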

VI-B Converse

Proof of Theorem 28.

Consider any code $\mathcal{C}$. We claim that there exists a type $\widebar{Q}_{X}\in\mathcal{P}_{n}(\mathcal{X})$ such that

i=1Mq(i) 1{Xn(i)𝒯Q¯X}2nϵ.\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\geq 2^{-n\epsilon}. (27)

To see this, suppose i=1Mq(i) 1{Xn(i)𝒯QX}<2nϵ\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}<2^{-n\epsilon} for all types QX𝒫n(𝒳)Q_{X}\in\mathcal{P}_{n}(\mathcal{X}). Then we have

i=1Mq(i)\displaystyle\sum_{i=1}^{M}q(i) =i=1Mq(i)QX𝒫n(𝒳)𝟙{Xn(i)𝒯QX}=QX𝒫n(𝒳)i=1Mq(i) 1{Xn(i)𝒯QX}\displaystyle=\sum_{i=1}^{M}q(i)\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}=\sum_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{Q_{X}}\}
|𝒫n(𝒳)|2nϵ(n+1)|𝒳|2nϵ<1\displaystyle\leq|\mathcal{P}_{n}(\mathcal{X})|2^{-n\epsilon}\leq(n+1)^{|\mathcal{X}|}2^{-n\epsilon}<1

for sufficiently large nn, thereby leading to a contradiction. Hence, (27) is proven, which implies that even though 𝒞\mathcal{C} is not necessarily a constant composition code, it has a constant composition subset that includes most probable codewords.
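The pigeonhole argument behind (27) can be made concrete. Since there are at most $(n+1)^{|\mathcal{X}|}$ types, the type carrying the largest message probability must carry at least $(n+1)^{-|\mathcal{X}|}$, which exceeds $2^{-n\epsilon}$ for large $n$. The sketch below assumes a randomly generated code and message distribution; it is only an illustration, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def dominant_type_mass(codewords, q, alphabet_size):
    """Return the type (as symbol counts) with the largest total q-mass."""
    mass = defaultdict(float)
    for x, qi in zip(codewords, q):
        t = tuple(x.count(a) for a in range(alphabet_size))
        mass[t] += qi
    best = max(mass, key=mass.get)
    return best, mass[best]


if __name__ == "__main__":
    rng = random.Random(0)
    n, size, M = 12, 3, 64
    code = [tuple(rng.randrange(size) for _ in range(n)) for _ in range(M)]
    w = [rng.random() for _ in range(M)]
    q = [v / sum(w) for v in w]
    print(dominant_type_mass(code, q, size), (n + 1) ** (-size))
```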

We follow (11) and set

𝒜=i:Xn(i)𝒯Q¯XV𝒱n(Q¯X)𝒯V(Xn(i)).\mathcal{A}=\bigcup_{i:X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}}\bigsqcup_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\mathcal{T}_{V}(X^{n}(i)).

That is, we restrict attention to codewords of type $\widebar{Q}_{X}$ and to $V$-shells with $V\in\mathcal{V}_{n}(\widebar{Q}_{X})$, where $\mathcal{V}_{n}(\widebar{Q}_{X})$ is a subset of $\mathcal{P}_{n}(\mathcal{Y}|\widebar{Q}_{X})$ to be chosen later. Following the proof of Proposition 30, for any codeword $X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}$ we obtain

WY|Xn(𝒜|Xn(i))\displaystyle W_{Y|X}^{n}(\mathcal{A}|X^{n}(i)) WY|Xn(V𝒱n(Q¯X)𝒯V(Xn(i))|Xn(i))maxV𝒱n(Q¯X)WY|Xn(𝒯V(Xn(i))|Xn(i))\displaystyle\geq W_{Y|X}^{n}\left(\bigsqcup_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\mathcal{T}_{V}(X^{n}(i))\bigg|X^{n}(i)\right)\geq\max_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}W_{Y|X}^{n}\left(\mathcal{T}_{V}(X^{n}(i))\big|X^{n}(i)\right)
𝑎(n+1)|𝒳||𝒴|exp2{nminV𝒱n(Q¯X)D(VW|Q¯X)},\displaystyle\overset{a}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\},

where (a)(a) follows from property of types. Averaging over all codewords further yields that

$$\begin{aligned}
\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(\mathcal{A})&=\sum_{i=1}^{M}q(i)\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\ W_{Y|X}^{n}(\mathcal{A}|X^{n}(i))\\
&\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\}\sum_{i=1}^{M}q(i)\ \mathbbm{1}\{X^{n}(i)\in\mathcal{T}_{\widebar{Q}_{X}}\}\\
&\overset{a}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\},
\end{aligned}$$

where (a)(a) follows from (27). On the other hand, similarly to the calculation of PYn(𝒜)P_{Y}^{n}(\mathcal{A}) in the proof of Proposition 30, here we have

PYn(𝒜)\displaystyle P_{Y}^{n}(\mathcal{A}) (n+1)|𝒳||𝒴|exp2{nminV𝒱n(Q¯X)[D(Q¯XVPY)+|I(Q¯X;V)R|+]}\displaystyle\leq(n+1)^{|\mathcal{X}||\mathcal{Y}|}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\left[D(\widebar{Q}_{X}V\|P_{Y})+\big|I(\widebar{Q}_{X};V)-R\big|^{+}\right]\right\}
(n+1)|𝒳||𝒴| 2nϵexp2{nminV𝒱n(Q¯X)[D(Q¯XVPY)+|I(Q¯X;V)R|+]}.\displaystyle\leq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{n\epsilon}\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}\left[D(\widebar{Q}_{X}V\|P_{Y})+\big|I(\widebar{Q}_{X};V)-R\big|^{+}\right]\right\}.

Inserting the above expressions into (11), it is then reasonable to choose

𝒱n(QX)={V𝒫n(𝒴|QX):D(VW|QX)+3ϵ<D(QXVPY)+|I(QX;V)R|+}\mathcal{V}_{n}(Q_{X})=\left\{V\in\mathcal{P}_{n}(\mathcal{Y}|Q_{X}):D(V\|W|Q_{X})+3\epsilon<D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right\}

and hence

12P~Yn|𝒞qPYn1\displaystyle\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1} (n+1)|𝒳||𝒴| 2nϵ(12nϵ)exp2{nminV𝒱n(Q¯X)D(VW|Q¯X)}\displaystyle\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}(1-2^{-n\epsilon})\exp_{2}\left\{-n\min_{V\in\mathcal{V}_{n}(\widebar{Q}_{X})}D(V\|W|\widebar{Q}_{X})\right\}
(n+1)|𝒳||𝒴| 2nϵ(12nϵ)exp2{nmaxQX𝒫n(𝒳)minV𝒱n(QX)D(VW|QX)}.\displaystyle\geq(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n\epsilon}(1-2^{-n\epsilon})\exp_{2}\left\{-n\max_{Q_{X}\in\mathcal{P}_{n}(\mathcal{X})}\ \min_{V\in\mathcal{V}_{n}(Q_{X})}D(V\|W|Q_{X})\right\}.

Taking nn\to\infty and ϵ0\epsilon\to 0, 𝒱n(Q¯X)\mathcal{V}_{n}(\widebar{Q}_{X}) then boils down to

𝒱(QX)\displaystyle\mathcal{V}(Q_{X}) ={V𝒫(𝒴|𝒳):D(VW|QX)<D(QXVPY)+|I(QX;V)R|+}\displaystyle=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})<D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right\} (28)
={V𝒫(𝒴|𝒳):ι(QXY)>min[R,I(QX;V)]}.\displaystyle=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(Q_{XY})>\min[R,I(Q_{X};V)]\right\}. (29)

where Lemma 3 is used and thus we have acquired a bound

E¯(R):=maxQX𝒫(𝒳)infV𝒱(QX)D(VW|QX),\overline{E}(R):=\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{V\in\mathcal{V}(Q_{X})}D(V\|W|Q_{X}), (30)

which contains an unconstrained maximization over all input distributions. Clearly $\overline{E}(R)\geq 0$. Furthermore, take any $Q_{X}\notin\mathcal{S}$; then $D(Q_{X}W\|P_{Y})>0$ and hence (28) implies that $W\in\mathcal{V}(Q_{X})$. Ergo, for any $Q_{X}\notin\mathcal{S}$, we have $\min_{V\in\mathcal{V}(Q_{X})}D(V\|W|Q_{X})=0$. As a result, the maximization $\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}$ in (30) can be restricted to $Q_{X}\in\mathcal{S}$. To be consistent with our notation ($Q$ associated with $V$ and $P$ associated with $W$), we rewrite (30) as

E¯(R):=maxPX𝒮infV𝒱(PX)D(VW|PX).\overline{E}(R):=\max_{P_{X}\in\mathcal{S}}\ \inf_{V\in\mathcal{V}(P_{X})}D(V\|W|P_{X}).

Moreover, Lemma 3 implies that

I(PX;V)ι(PXVY|X)=D(VW|PX)D(PXVPY)0,I(P_{X};V)-\iota(P_{X}V_{Y|X})=D(V\|W|P_{X})-D(P_{X}V\|P_{Y})\geq 0,

where the inequality is simply the data processing inequality of the relative entropy. Consequently, when PX𝒮P_{X}\in\mathcal{S} is taken, (29) reduces to

𝒱(PX)={V𝒫(𝒴|𝒳):ι(PXVY|X)>R},\mathcal{V}(P_{X})=\left\{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})>R\right\},

which completes the proof of the variational form in the theorem.

To reach the dual form, fix PX𝒮P_{X}\in\mathcal{S} and consider the following arguments.

minV𝒫(𝒴|𝒳):ι(PXVY|X)RD(VW|PX)\displaystyle\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X}) =maxα0minV𝒫(𝒴|𝒳){D(VW|PX)+α[Rι(PXVY|X)]}\displaystyle=\max_{\alpha\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(V\|W|P_{X})+\alpha\left[R-\iota(P_{X}V_{Y|X})\right]\right\}
=𝑎maxα0{αRxPX(x)log(yWY|X1+α(y|x)PYα(y))}\displaystyle\overset{a}{=}\max_{\alpha\geq 0}\left\{\alpha R-\sum_{x}P_{X}(x)\log\left(\sum_{y}\frac{W_{Y|X}^{1+\alpha}(y|x)}{P_{Y}^{\alpha}(y)}\right)\right\}
=𝑏maxα0[α(R𝔼PXD1+α(WY|XPY))]\displaystyle\overset{b}{=}\max_{\alpha\geq 0}\left[\alpha\left(R-\mathbb{E}_{P_{X}}D_{1+\alpha}(W_{Y|X}\|P_{Y})\right)\right]

where $(a)$ follows from (38) in Appendix A and $(b)$ follows from Definition 6. ∎

From Theorem 28, we can summarize the following properties of the proposed converse bound E¯(R)\overline{E}(R), which are proved in Appendix D-h.

Lemma 44.

We have the following properties of E¯(R)\overline{E}(R):

(i) E¯(R)=0\overline{E}(R)=0 when RminPX𝒮I(PX;W)R\leq\min_{P_{X}\in\mathcal{S}}I(P_{X};W).  (ii) E¯(R)>0\overline{E}(R)>0 when R>minPX𝒮I(PX;W)R>\min_{P_{X}\in\mathcal{S}}I(P_{X};W).

(iii) E¯(R)=\overline{E}(R)=\infty when R>minPX𝒮maxV𝒫(𝒴|𝒳)ι(PXVY|X)R>\min_{P_{X}\in\mathcal{S}}\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\iota(P_{X}V_{Y|X}).

VII Conclusion

In the present work, we have characterized the exact strong converse exponent of the classical soft covering for rates below the mutual information. This exponent is expressed using a two-parameter information quantity that, to the best of our knowledge, has not been studied in the literature on error exponents with respect to a given channel. A promising direction for future work is a deeper investigation into the implications, properties, and potential applications of this new quantity.

Moreover, this work reveals that the conventional random coding bound is generally not tight in the achievability regime of rates above the mutual information, and that the traditional formulation assuming uniformly distributed words in the code can inherently lead to a rational–irrational discrepancy: even in the noiseless channel case, the reliability exponent diverges at sufficiently high rate for target distributions with all rational values, whereas for typical irrational probability values the exponent remains finite for all rates. Future work includes designing a well-behaved deterministic code for noisy channels that outperforms random coding in both high-rate and low-rate regimes, or even an optimal code that achieves the exact error exponent. Our observations also raise an important open question: can a similar rational–irrational discrepancy arise in other information-theoretic settings?

Appendix A A Random Coding Achievability for the Strong Converse Exponent

In this appendix, we prove Theorem 18. We start with the following lemma.

Lemma 45.

Let KBinomial(M,p)K\sim\text{Binomial}(M,p) be a binomial random variable and M2M\geq 2. We have

12𝔼|KMp|p12pmin{Mp,1}.\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|\leq p-\frac{1}{2}p\min\{Mp,1\}.
Proof.

Write

12𝔼|KMp|=12k=0MpPr{K=k}(pkM)+12k=MpMPr{K=k}(kMp).\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|=\frac{1}{2}\sum_{k=0}^{\lfloor Mp\rfloor}\Pr\{K=k\}\left(p-\frac{k}{M}\right)+\frac{1}{2}\sum_{k=\lceil Mp\rceil}^{M}\Pr\{K=k\}\left(\frac{k}{M}-p\right). (31)

and consider the following two cases.

  • 1)

    Mp<1Mp<1: In this case Mp=0\lfloor Mp\rfloor=0 and Mp=1\lceil Mp\rceil=1. (31) reduces to

    12𝔼|KMp|\displaystyle\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right| =12Pr{K=0}p+12k=1MPr{K=k}(kMp)\displaystyle=\frac{1}{2}\Pr\{K=0\}p+\frac{1}{2}\sum_{k=1}^{M}\Pr\{K=k\}\left(\frac{k}{M}-p\right)
    =12(1p)Mp+𝔼K2M12[1(1p)M]p=(1p)Mp\displaystyle=\frac{1}{2}(1-p)^{M}p+\frac{\mathbb{E}K}{2M}-\frac{1}{2}[1-(1-p)^{M}]p=(1-p)^{M}p
    𝑎p12Mp2,\displaystyle\overset{a}{\leq}p-\frac{1}{2}Mp^{2},

    where (a)(a) follows from, when M2M\geq 2,

    (1p)M1Mp+12M(M1)p21Mp+12M2p21Mp+12Mp=112Mp.(1-p)^{M}\leq 1-Mp+\frac{1}{2}M(M-1)p^{2}\leq 1-Mp+\frac{1}{2}M^{2}p^{2}\leq 1-Mp+\frac{1}{2}Mp=1-\frac{1}{2}Mp.
  • 2)

    $Mp\geq 1$: In this case $\lfloor Mp\rfloor\geq 1$, and analyzing (31) directly becomes challenging. Instead, we apply Jensen’s inequality for the square root, a commonly used technique in the soft covering problem:

    12𝔼|KMp|=𝔼|K𝔼K|2MVar(K)2M=12p(1p)M12pM12p.\frac{1}{2}\mathbb{E}\left|\frac{K}{M}-p\right|=\frac{\mathbb{E}\left|K-\mathbb{E}K\right|}{2M}\leq\frac{\sqrt{\text{Var}(K)}}{2M}=\frac{1}{2}\sqrt{\frac{p(1-p)}{M}}\leq\frac{1}{2}\sqrt{\frac{p}{M}}\leq\frac{1}{2}p.

Combining these two cases gives the result. ∎
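Lemma 45 can be sanity-checked by computing the left-hand side exactly from the binomial probability mass function. The sketch below assumes small values of $M$ so that the sum can be enumerated directly; the function names are illustrative.

```python
from math import comb


def half_mean_abs_dev(M, p):
    """Exact value of (1/2) * E|K/M - p| for K ~ Binomial(M, p)."""
    return 0.5 * sum(
        comb(M, k) * p ** k * (1 - p) ** (M - k) * abs(k / M - p)
        for k in range(M + 1)
    )


def lemma_bound(M, p):
    """Right-hand side of the lemma: p - (1/2) * p * min(M * p, 1)."""
    return p - 0.5 * p * min(M * p, 1.0)


if __name__ == "__main__":
    for M in (2, 5, 50):
        for p in (0.001, 0.1, 0.5, 0.9):
            print(M, p, half_mean_abs_dev(M, p), lemma_bound(M, p))
```

The grid covers both regimes of the proof, $Mp<1$ and $Mp\geq 1$.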

Now we prove Theorem 18 using the random coding strategy.

Proof of Theorem 18.

We first prove the variational form (7). Take any PX𝒮P_{X}\in\mathcal{S} and generate a random code 𝒞={Xn(1),,Xn(M)}\mathcal{C}=\{X^{n}(1),\dots,X^{n}(M)\} with M=2nRM=2^{nR}, where each codeword Xn(i)X^{n}(i) for any i=1,,Mi=1,\dots,M is drawn from i.i.d. PXP_{X}. Consider any joint type QXY𝒫n(𝒳𝒴)Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y}). Similar to the proof of Proposition 32, introduce the backward conditional type V¯X|Y=QXY/QY𝒫n(𝒳|QY)\widebar{V}_{X|Y}=Q_{XY}/Q_{Y}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y}). We define

ω(QXY)\displaystyle\omega(Q_{XY}) :=WY|Xn(yn|xn),(xn,yn)𝒯QXY.\displaystyle:=W_{Y|X}^{n}(y^{n}|x^{n}),\quad\qquad\forall(x^{n},y^{n})\in\mathcal{T}_{Q_{XY}}. (32)
pV¯(QY)\displaystyle p_{\widebar{V}}(Q_{Y}) :=Pr{Xn𝒯V¯(yn)},yn𝒯QY.\displaystyle:=\Pr\{X^{n}\in\mathcal{T}_{\widebar{V}}(y^{n})\},\ \quad\forall y^{n}\in\mathcal{T}_{Q_{Y}}. (33)

The values of ω(QXY)\omega(Q_{XY}) and pV¯(QY)p_{\widebar{V}}(Q_{Y}) are uniquely determined by the joint type QXYQ_{XY}, independently of the particular sequence xn,ynx^{n},y^{n}. Take any yn𝒯QYy^{n}\in\mathcal{T}_{Q_{Y}}. Then we can write

P~Yn|𝒞(yn)\displaystyle\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n}) =1Mi=1MWY|Xn(yn|Xn(i))=V¯𝒫n(𝒳|QY)kV¯(yn)Mω(QXY),\displaystyle=\frac{1}{M}\sum_{i=1}^{M}W_{Y|X}^{n}(y^{n}|X^{n}(i))=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\frac{k_{\widebar{V}}(y^{n})}{M}\omega(Q_{XY}),
PYn(yn)\displaystyle P_{Y}^{n}(y^{n}) =xnPXn(xn)WY|Xn(yn|xn)=V¯𝒫n(𝒳|QY)pV¯(QY)ω(QXY),\displaystyle=\sum_{x^{n}}P_{X}^{n}(x^{n})W_{Y|X}^{n}(y^{n}|x^{n})=\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}p_{\widebar{V}}(Q_{Y})\omega(Q_{XY}), (34)

where kV¯(yn)k_{\widebar{V}}(y^{n}) is defined in (16). Under random coding,

12𝔼𝒞P~Yn|𝒞PYn1\displaystyle\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1} =12yn𝔼𝒞|P~Yn|𝒞(yn)PYn(yn)|\displaystyle=\frac{1}{2}\sum_{y^{n}}\mathbb{E}_{\mathcal{C}}\left|\tilde{P}_{Y^{n}|\mathcal{C}}(y^{n})-P_{Y}^{n}(y^{n})\right|
=12QY𝒫n(𝒴)yn𝒯QY𝔼𝒞|V¯𝒫n(𝒳|QY)ω(QXY)(kV¯(yn)MpV¯(QY))|\displaystyle=\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\mathbb{E}_{\mathcal{C}}\left|\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\left(\frac{k_{\widebar{V}}(y^{n})}{M}-p_{\widebar{V}}(Q_{Y})\right)\right|
12QY𝒫n(𝒴)yn𝒯QYV¯𝒫n(𝒳|QY)ω(QXY)𝔼𝒞|kV¯(yn)MpV¯(QY)|\displaystyle\leq\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\ \mathbb{E}_{\mathcal{C}}\left|\frac{k_{\widebar{V}}(y^{n})}{M}-p_{\widebar{V}}(Q_{Y})\right|
𝑎QY𝒫n(𝒴)yn𝒯QYV¯𝒫n(𝒳|QY)ω(QXY)(pV¯(QY)12pV¯(QY)min{MpV¯(QY),1})\displaystyle\overset{a}{\leq}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\left(p_{\widebar{V}}(Q_{Y})-\frac{1}{2}p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\}\right)
=𝑏QY𝒫n(𝒴)yn𝒯QY(PYn(yn)12V¯𝒫n(𝒳|QY)ω(QXY)pV¯(QY)min{MpV¯(QY),1})\displaystyle\overset{b}{=}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{y^{n}\in\mathcal{T}_{Q_{Y}}}\left(P_{Y}^{n}(y^{n})-\frac{1}{2}\sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}\omega(Q_{XY})\ p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\}\right)
=112QY𝒫n(𝒴)V¯𝒫n(𝒳|QY)|𝒯QY|ω(QXY)pV¯(QY)min{MpV¯(QY),1}.\displaystyle=1-\frac{1}{2}\sum_{Q_{Y}\in\mathcal{P}_{n}(\mathcal{Y})}\ \sum_{\widebar{V}\in\mathcal{P}_{n}(\mathcal{X}|Q_{Y})}|\mathcal{T}_{Q_{Y}}|\ \omega(Q_{XY})\ p_{\widebar{V}}(Q_{Y})\min\{Mp_{\widebar{V}}(Q_{Y}),1\}. (35)

where $(a)$ follows from Lemma 45, since $k_{\widebar{V}}(y^{n})\sim\text{Binomial}(M,p_{\widebar{V}}(Q_{Y}))$ by its definition (16), and in $(b)$ we identify $P_{Y}^{n}(y^{n})$ from (34). Furthermore, from (32) and (33) we have

$$
\begin{aligned}
|\mathcal{T}_{Q_{Y}}|\ \omega(Q_{XY}) &\overset{a}{\geq}(n+1)^{-|\mathcal{Y}|}\ 2^{-n[D(V\|W|Q_{X})+H(V|Q_{X})-H(Q_{Y})]}\\
&=(n+1)^{-|\mathcal{Y}|}\ 2^{-n[D(Q_{XY}\|P_{XY})-D(Q_{XY}\|P_{X}Q_{Y})]},\\
p_{\widebar{V}}(Q_{Y}) &\overset{b}{\geq}(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-n[D(Q_{Y}\widebar{V}\|P_{X})+H(Q_{Y}\widebar{V})-H(\widebar{V}|Q_{Y})]}\\
&=(n+1)^{-|\mathcal{X}||\mathcal{Y}|}\ 2^{-nD(Q_{XY}\|P_{X}Q_{Y})},
\end{aligned}
$$

where $(a)$ and $(b)$ follow from standard properties of types. Then (35) simplifies to

$$
\begin{aligned}
\frac{1}{2}\mathbb{E}_{\mathcal{C}}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}
&\leq 1-\frac{1}{2}(n+1)^{-\left(|\mathcal{X}|+1\right)|\mathcal{Y}|}\sum_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}2^{-n\left[D(Q_{XY}\|P_{XY})+|D(Q_{XY}\|P_{X}Q_{Y})-R|^{+}\right]}\\
&\leq 1-\frac{1}{2}(n+1)^{-\left(|\mathcal{X}|+1\right)|\mathcal{Y}|}\exp_{2}\left\{-n\min_{Q_{XY}\in\mathcal{P}_{n}(\mathcal{X}\mathcal{Y})}\left[D(Q_{XY}\|P_{XY})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right]\right\}.
\end{aligned}
$$

Taking $n\to\infty$ completes the proof of (7).
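As a sanity check on the exponent bookkeeping above, the following minimal Python sketch numerically verifies the two entropy identities behind steps $(a)$ and $(b)$ and the way the resulting exponents combine into $D(Q_{XY}\|P_{XY})+|D(Q_{XY}\|P_{X}Q_{Y})-R|^{+}$. The alphabet sizes, the random distributions, and the rate value `R` are illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(0)
NX, NY = 2, 3  # illustrative alphabet sizes (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

P_X = rand_dist(NX)
W = [rand_dist(NY) for _ in range(NX)]               # W[x][y] = W(y|x)
flat = rand_dist(NX * NY)
Q = [flat[x * NY:(x + 1) * NY] for x in range(NX)]   # Q[x][y] = Q_XY(x, y)

Q_X = [sum(Q[x]) for x in range(NX)]
Q_Y = [sum(Q[x][y] for x in range(NX)) for y in range(NY)]
pairs = [(x, y) for x in range(NX) for y in range(NY)]

def entropy(p):
    return -sum(v * math.log2(v) for v in p)

def expect(f):
    """Expectation of f(x, y) under Q_XY."""
    return sum(Q[x][y] * f(x, y) for x, y in pairs)

D_V_W = expect(lambda x, y: math.log2(Q[x][y] / Q_X[x] / W[x][y]))      # D(V||W|Q_X)
H_V = -expect(lambda x, y: math.log2(Q[x][y] / Q_X[x]))                 # H(V|Q_X)
H_Vbar = -expect(lambda x, y: math.log2(Q[x][y] / Q_Y[y]))              # H(Vbar|Q_Y)
D_QX_PX = sum(q * math.log2(q / p) for q, p in zip(Q_X, P_X))           # D(Q_X||P_X)
D_joint = expect(lambda x, y: math.log2(Q[x][y] / (P_X[x] * W[x][y])))  # D(Q_XY||P_XY)
D_prod = expect(lambda x, y: math.log2(Q[x][y] / (P_X[x] * Q_Y[y])))    # D(Q_XY||P_X Q_Y)

exp_a = D_V_W + H_V - entropy(Q_Y)        # exponent appearing in step (a)
exp_b = D_QX_PX + entropy(Q_X) - H_Vbar   # exponent appearing in step (b)

R = 0.3  # illustrative rate in bits (assumption)
# Exponent of |T| * omega * p * min{M p, 1} with M = 2^{nR}, polynomial factors ignored.
total = exp_a + exp_b + max(exp_b - R, 0.0)
target = D_joint + max(D_prod - R, 0.0)
print(exp_a - (D_joint - D_prod), exp_b - D_prod, total - target)
```

All three printed differences should vanish up to floating-point rounding.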

Next, we prove the dual form (8). Pick any $P_{X}\in\mathcal{S}$ and introduce the corresponding joint distribution $P_{XY}=P_{X}W_{Y|X}$. In line with our notation convention, we also define the associated backward channel $\widebar{W}_{X|Y}:=P_{XY}/P_{Y}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})$. For any other distribution $Q_{XY}$, it is straightforward to verify the following identities:

$$
\begin{aligned}
D(Q_{XY}\|P_{XY}) &=D(\widebar{V}\|\widebar{W}|Q_{Y})+D(Q_{Y}\|P_{Y}), &&(36)\\
D(Q_{XY}\|P_{X}Q_{Y}) &=D(\widebar{V}\|\widebar{W}|Q_{Y})+\iota(Q_{XY}), &&(37)
\end{aligned}
$$

where $\iota(Q_{XY})$ is defined in Definition 2. Notably, $\iota(Q_{XY})$ satisfies the following property [yagli2019exact, Cor. 1], which can be viewed as a corollary of (46):

$$
\min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\left\{D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\iota(Q_{XY})\right\}=-\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}[2^{-\lambda\iota_{X;Y}}|Y]\right)\right]
\tag{38}
$$

for any $\lambda\in\mathbb{R}$. Now consider the following chain of transformations:

$$
\begin{aligned}
&\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left\{D(Q_{XY}\|P_{XY})+\big|D(Q_{XY}\|P_{X}Q_{Y})-R\big|^{+}\right\}\\
&=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\ \max_{\lambda\in[0,1]}\big\{D(Q_{XY}\|P_{XY})+\lambda\left[D(Q_{XY}\|P_{X}Q_{Y})-R\right]\big\}\\
&\overset{a}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\ \max_{\lambda\in[0,1]}\left\{D(Q_{Y}\|P_{Y})+(1+\lambda)D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\left[\iota(Q_{XY})-R\right]\right\}\\
&\overset{b}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \max_{\lambda\in[0,1]}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+(1+\lambda)D(\widebar{V}\|\widebar{W}|Q_{Y})+\lambda\left[\iota(Q_{XY})-R\right]\right\}\\
&\overset{c}{=}\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\ \max_{\lambda\in[0,1]}\left\{D(Q_{Y}\|P_{Y})-(1+\lambda)\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right)\right]-\lambda R\right\}\\
&\overset{d}{=}\max_{\lambda\in[0,1]}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})-(1+\lambda)\mathbb{E}_{Q_{Y}}\left[\log\left(\mathbb{E}_{\widebar{W}_{X|Y}}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right)\right]-\lambda R\right\}\\
&\overset{e}{=}\max_{\lambda\in[0,1]}\left\{-\log\left(\mathbb{E}_{P_{Y}}\left[\mathbb{E}_{\widebar{W}_{X|Y}}^{1+\lambda}\big[2^{-\frac{\lambda}{1+\lambda}\iota_{X;Y}}\big|Y\big]\right]\right)-\lambda R\right\}\\
&\overset{f}{=}\max_{\lambda\in[0,1]}\left\{\lambda\left[I_{\frac{1}{1+\lambda}}(P_{X};W_{Y|X})-R\right]\right\}\\
&\overset{g}{=}\max_{\alpha\in[\frac{1}{2},1]}\left\{\frac{1-\alpha}{\alpha}\left[I_{\alpha}(P_{X};W_{Y|X})-R\right]\right\},
\end{aligned}
$$

where $(a)$ follows from (36) and (37); $(b)$ follows from the minimax theorem [rockafellar1970convex, Thm 36.3], since the expression is convex in $\widebar{V}$ and linear in $\lambda$, so we can swap $\min_{\widebar{V}}$ and $\max_{\lambda}$; and $(c)$ follows from (38), applied with $\frac{\lambda}{1+\lambda}$ in place of $\lambda$ after factoring out $1+\lambda$. In $(d)$, the minimax theorem is invoked again. Convexity of the expression in $Q_{Y}$ is evident, but its concavity in $\lambda$ is not obvious. To see it, return to the right-hand side of $(b)$: for fixed $\widebar{V}$, the expression is linear in $\lambda$ (a trivial case of being concave), so its minimum over $\widebar{V}$ is a concave function of $\lambda$. Step $(e)$ is a direct result of (46); $(f)$ follows from Definition 7; and $(g)$ is obtained by setting $\alpha:=\frac{1}{1+\lambda}$. This completes the proof of (8). ∎
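The decompositions (36) and (37) can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the alphabets and distributions are random and hypothetical, the backward channel is taken as $P_{XY}(x,y)/P_{Y}(y)$ (normalized over $x$), and $\iota(Q_{XY})$ is assumed to be $\mathbb{E}_{Q_{XY}}[\log\frac{W_{Y|X}(Y|X)}{P_{Y}(Y)}]$, which is the form consistent with the derivative of $\iota$ used later in this appendix (Definition 2 itself is not reproduced here).

```python
import math
import random

random.seed(1)
NX, NY = 3, 2  # illustrative alphabet sizes (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

P_X = rand_dist(NX)
W = [rand_dist(NY) for _ in range(NX)]                 # W[x][y] = W(y|x)
P_Y = [sum(P_X[x] * W[x][y] for x in range(NX)) for y in range(NY)]
# Backward channel W_bar[x][y] = P_XY(x, y) / P_Y(y), a distribution over x for each y.
W_bar = [[P_X[x] * W[x][y] / P_Y[y] for y in range(NY)] for x in range(NX)]

flat = rand_dist(NX * NY)
Q = [flat[x * NY:(x + 1) * NY] for x in range(NX)]     # Q[x][y] = Q_XY(x, y)
Q_Y = [sum(Q[x][y] for x in range(NX)) for y in range(NY)]
V_bar = [[Q[x][y] / Q_Y[y] for y in range(NY)] for x in range(NX)]
pairs = [(x, y) for x in range(NX) for y in range(NY)]

def expect(f):
    """Expectation of f(x, y) under Q_XY."""
    return sum(Q[x][y] * f(x, y) for x, y in pairs)

D_joint = expect(lambda x, y: math.log2(Q[x][y] / (P_X[x] * W[x][y])))  # D(Q_XY||P_XY)
D_prod = expect(lambda x, y: math.log2(Q[x][y] / (P_X[x] * Q_Y[y])))    # D(Q_XY||P_X Q_Y)
D_back = expect(lambda x, y: math.log2(V_bar[x][y] / W_bar[x][y]))      # D(Vbar||Wbar|Q_Y)
D_out = sum(q * math.log2(q / p) for q, p in zip(Q_Y, P_Y))             # D(Q_Y||P_Y)
iota = expect(lambda x, y: math.log2(W[x][y] / P_Y[y]))                 # assumed form of iota

gap_36 = D_joint - (D_back + D_out)
gap_37 = D_prod - (D_back + iota)
col_sums = [sum(W_bar[x][y] for x in range(NX)) for y in range(NY)]
print(gap_36, gap_37, col_sums)
```

Both gaps should be zero up to rounding, and each column of `W_bar` should sum to one.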

Appendix B Dual forms of the Strong Converse Bounds $\underline{\mathit{\Gamma}}(R)$ and $\overline{\mathit{\Gamma}}(R)$

In this appendix, we establish the dual forms of the strong converse bounds $\underline{\mathit{\Gamma}}(R)$ (in Proposition 30) and $\overline{\mathit{\Gamma}}(R)$ (in Proposition 32). That is, we prove Proposition 34 and Proposition 35, which are stated in Section IV-C.

B-a Proof of Proposition 34

Proof.

We first show (20) in Proposition 34. Fix $Q_{X}\in\mathcal{P}(\mathcal{X})$ and $R\geq 0$. The function $\mathit{\Gamma}(s,Q_{X},R)$ defined in (10) is the value of a convex optimization over $V$: the objective function is $D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}=\max\{D(Q_{X}V\|P_{Y}),\,D(V\|P_{Y}|Q_{X})-R\}$, which is convex in $V$; the inequality constraint $D(V\|W|Q_{X})\leq s$ is also convex in $V$; and the equality constraint $\sum_{y}V_{Y|X}(y|x)=1$ is linear in $V$. Therefore, $\mathit{\Gamma}(s,Q_{X},R)$, as the optimal value over $V$, equals its Lagrangian dual by strong duality [boyd2004convex, Ch. 5]. Explicitly,

$$
\begin{aligned}
\mathit{\Gamma}(s,Q_{X},R) &=\max_{\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}+\delta\left[D(V\|W|Q_{X})-s\right]\right\}\\
&=\max_{\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\ \max_{\mu\in[0,1]}\left\{D(Q_{X}V\|P_{Y})+\mu\big[I(Q_{X};V)-R\big]+\delta\big[D(V\|W|Q_{X})-s\big]\right\}\\
&\overset{a}{=}\max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{X}V\|P_{Y})+\mu\big[I(Q_{X};V)-R\big]+\delta\big[D(V\|W|Q_{X})-s\big]\right\}\\
&\overset{b}{=}\max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\big\{(1+\delta)D(Q_{X}V\|P_{Y})+(\mu+\delta)\left[I(Q_{X};V)-R\right]+\delta\left[R-\iota(Q_{XY})\right]-\delta s\big\},
\end{aligned}
$$

where $(a)$ follows from the minimax theorem: the objective function is convex in $V$ and linear in $\mu$, so we can swap $\min_{V}$ and $\max_{\mu}$. $(b)$ follows from Lemma 3. Plugging this expression of $\mathit{\Gamma}(s,Q_{X},R)$ into Proposition 30 yields that

$$
\begin{aligned}
\underline{\mathit{\Gamma}}(R) &=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max\big\{s,\mathit{\Gamma}(s,Q_{X},R)\big\}=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\left\{s+\big|\mathit{\Gamma}(s,Q_{X},R)-s\big|^{+}\right\}\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max_{\lambda\in[0,1]}\left\{s+\lambda\big[\mathit{\Gamma}(s,Q_{X},R)-s\big]\right\}\\
&=\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \inf_{s\in[0,\infty)}\ \max_{\lambda,\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\\
&\qquad\big\{\lambda(1+\delta)D(Q_{X}V\|P_{Y})+\lambda(\mu+\delta)\left[I(Q_{X};V)-R\right]+\lambda\delta\left[R-\iota(Q_{XY})\right]+(1-\lambda-\lambda\delta)s\big\}\\
&\overset{a}{\geq}\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \max_{\mu\in[0,1],\delta\geq 0}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\left\{D(Q_{Y}\|P_{Y})+\frac{\mu+\delta}{1+\delta}\left[I(Q_{X};V)-R\right]+\frac{\delta}{1+\delta}\left[R-\iota(Q_{XY})\right]\right\}\\
&\overset{b}{=}\min_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}\\
&\overset{c}{=}\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}.
\end{aligned}
$$

Here $(a)$ follows from choosing $\lambda=\frac{1}{1+\delta}$; this is a valid choice because $\delta\geq 0$ gives $\frac{1}{1+\delta}\in[0,1]$. $(b)$ follows from defining $\alpha:=\frac{\mu+\delta}{1+\delta}$ and $\beta:=\frac{\delta}{1+\delta}$ (and clearly $\alpha\geq\beta$). $(c)$ follows again from the minimax theorem: the objective function is convex in $V$ and linear in $(\alpha,\beta)$, so we can swap $\min_{V}$ and $\max_{\alpha,\beta}$. Finally, we merge $\min_{Q_{X}}$ and $\min_{V}$ together as $\min_{Q_{XY}}$. This completes the proof of (20).
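The change of variables in steps $(a)$ and $(b)$ is easy to test numerically. The following sketch, a minimal check with randomly sampled multipliers (the sampling ranges and the helper name `reparametrize` are hypothetical), confirms that $\lambda=\frac{1}{1+\delta}$ cancels the coefficient of $s$, normalizes the coefficient of the divergence term to one, and produces $(\alpha,\beta)$ with $0\leq\beta\leq\alpha\leq 1$.

```python
import random

random.seed(2)

def reparametrize(mu, delta):
    """Map (mu, delta), mu in [0, 1], delta >= 0, to (lam, alpha, beta)."""
    lam = 1.0 / (1.0 + delta)
    alpha = (mu + delta) / (1.0 + delta)
    beta = delta / (1.0 + delta)
    return lam, alpha, beta

checks = []
for _ in range(1000):
    mu = random.random()              # mu in [0, 1)
    delta = random.expovariate(1.0)   # delta >= 0, arbitrary sampling law
    lam, alpha, beta = reparametrize(mu, delta)
    s_coeff = 1.0 - lam - lam * delta     # coefficient of s after substitution
    d_coeff = lam * (1.0 + delta)         # coefficient of the divergence term
    checks.append(
        0.0 <= beta <= alpha <= 1.0
        and abs(s_coeff) < 1e-12
        and abs(d_coeff - 1.0) < 1e-12
        and abs(lam * (mu + delta) - alpha) < 1e-12
        and abs(lam * delta - beta) < 1e-12
    )

all_ok = all(checks)
print(all_ok)
```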

Next, we prove (21) in Proposition 34. Observe that $D(Q_{Y}\|P_{Y})+I(Q_{X};V)=D(Q_{XY}\|Q_{X}P_{Y})$ is convex in $Q_{XY}$ due to the joint convexity of $D(\cdot\|\cdot)$. Then for $\alpha\leq 1$, the expression $D(Q_{Y}\|P_{Y})+\alpha I(Q_{X};V)=(1-\alpha)D(Q_{Y}\|P_{Y})+\alpha D(Q_{XY}\|Q_{X}P_{Y})$ is also convex in $Q_{XY}$. Moreover, $\iota(Q_{XY})$ is linear in $Q_{XY}$. Hence, the objective function of $\underline{\underline{\mathit{\Gamma}}}(R)$ is convex in $Q_{XY}$ and linear in $(\alpha,\beta)$, so by the minimax theorem, we can swap $\min_{Q_{XY}}$ and $\max_{\alpha,\beta}$ and obtain that

$$
\begin{aligned}
\underline{\underline{\mathit{\Gamma}}}(R) &=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{Y};\widebar{V})-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\} &&(39)\\
&=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+(\beta-\alpha)R+K_{\alpha,\beta}(Q_{Y})\right\}, &&(40)
\end{aligned}
$$

where we have defined

$$
K_{\alpha,\beta}(Q_{Y})=\min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\big[\alpha I(Q_{Y};\widebar{V})-\beta\iota(Q_{XY})\big],
\tag{41}
$$

which is a convex optimization over $\widebar{V}$ and can be solved using the method of Lagrange multipliers. However, directly plugging in the derivative $\frac{\partial I(Q_{Y};\widebar{V})}{\partial\widebar{V}_{X|Y}(x|y)}$ leads to an equation that is too complicated to solve. Instead, we employ the following variational form of the mutual information $I(Q_{Y};\widebar{V})$:

$$
I(Q_{Y};\widebar{V})=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y}).
\tag{42}
$$

Therefore, (41) can be rewritten as

$$
K_{\alpha,\beta}(Q_{Y})=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\ \min_{\widebar{V}\in\mathcal{P}(\mathcal{X}|\mathcal{Y})}\big[\alpha D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})-\beta\iota(Q_{XY})\big].
\tag{43}
$$
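The variational form (42) can be illustrated numerically. The sketch below uses hypothetical random distributions and checks two facts: the divergence $D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})$ is smallest when $S_{X}$ equals the marginal $Q_{X}$ of $\widebar{V}_{X|Y}Q_{Y}$, where it equals the mutual information, and for any other $S_{X}$ the excess is exactly $D(Q_{X}\|S_{X})$.

```python
import math
import random

random.seed(3)
NX, NY = 3, 3  # illustrative alphabet sizes (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

Q_Y = rand_dist(NY)
V_bar = [rand_dist(NX) for _ in range(NY)]   # V_bar[y][x] = Vbar(x|y)
Q_X = [sum(Q_Y[y] * V_bar[y][x] for y in range(NY)) for x in range(NX)]

def div_to_product(S_X):
    """D(Vbar Q_Y || S_X Q_Y) in bits."""
    return sum(
        Q_Y[y] * V_bar[y][x] * math.log2(V_bar[y][x] / S_X[x])
        for y in range(NY) for x in range(NX)
    )

# With S_X = Q_X the divergence is the mutual information I(Q_Y; Vbar).
mutual_info = div_to_product(Q_X)

# Any other S_X can only increase the divergence.
others = [div_to_product(rand_dist(NX)) for _ in range(500)]

# The increase equals D(Q_X || S_X).
S = rand_dist(NX)
gap = div_to_product(S) - mutual_info
kl_marginal = sum(q * math.log2(q / s) for q, s in zip(Q_X, S))
print(mutual_info, min(others), gap - kl_marginal)
```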

Now, to optimize over $\widebar{V}$, we can introduce Lagrange multipliers $\gamma_{y}$ for each $y\in\mathcal{Y}$ and write the Lagrangian as

$$
\mathcal{L}=\alpha D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})-\beta\iota(Q_{XY})+\sum_{y}\gamma_{y}\left(\sum_{x}\widebar{V}_{X|Y}(x|y)-1\right).
$$

Substituting

$$
\frac{\partial D(\widebar{V}_{X|Y}Q_{Y}\|S_{X}Q_{Y})}{\partial\widebar{V}_{X|Y}(x|y)}=Q_{Y}(y)\left(1+\log\frac{\widebar{V}_{X|Y}(x|y)}{S_{X}(x)}\right),\quad\frac{\partial\iota(Q_{XY})}{\partial\widebar{V}_{X|Y}(x|y)}=Q_{Y}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}
$$

into $\frac{\partial\mathcal{L}}{\partial\widebar{V}_{X|Y}(x|y)}=0$ gives that

$$
Q_{Y}(y)\left(\alpha+\alpha\log\frac{\widebar{V}_{X|Y}(x|y)}{S_{X}(x)}-\beta\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)+\gamma_{y}=0,
$$

which yields the following expression for the optimizer $\widebar{V}^{*}$:

$$
\widebar{V}^{*}_{X|Y}(x|y)=C(y)S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}
\tag{44}
$$

with the $y$-dependent normalization factor

$$
C(y)=\left[\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right]^{-1}.
$$

Hence, (43) reduces to

$$
\begin{aligned}
K_{\alpha,\beta}(Q_{Y}) &=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\log\frac{\widebar{V}^{*}_{X|Y}(x|y)}{S_{X}(x)}-\beta\iota(\widebar{V}^{*}_{X|Y}Q_{Y})\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\left(\log C(y)+\frac{\beta}{\alpha}\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)-\beta\iota(\widebar{V}^{*}_{X|Y}Q_{Y})\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{x,y}Q_{Y}(y)\widebar{V}^{*}_{X|Y}(x|y)\log C(y)\right]=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[\alpha\sum_{y}Q_{Y}(y)\log C(y)\right]\\
&=\min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left[-\alpha\sum_{y}Q_{Y}(y)\log\left(\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right)\right].
\end{aligned}
$$
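For a fixed $S_{X}$, the inner minimization over $\widebar{V}$ and its closed form can be checked numerically. The sketch below is an illustration only: the values of $\alpha$ and $\beta$, the alphabets, and all distributions are hypothetical choices, and $\iota$ is taken as the expectation of $\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}$, consistent with the derivative stated above. It evaluates the objective at the tilted solution (44), compares it with $-\alpha\sum_{y}Q_{Y}(y)\log(1/C(y))^{-1}$ written via the normalizer, and confirms that random backward channels do no better.

```python
import math
import random

random.seed(4)
NX, NY = 3, 2          # illustrative alphabet sizes (assumption)
alpha, beta = 0.8, 0.5  # illustrative multipliers with alpha >= beta (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

P_X = rand_dist(NX)
W = [rand_dist(NY) for _ in range(NX)]   # W[x][y] = W(y|x)
P_Y = [sum(P_X[x] * W[x][y] for x in range(NX)) for y in range(NY)]
Q_Y = rand_dist(NY)
S_X = rand_dist(NX)
ratio = [[W[x][y] / P_Y[y] for y in range(NY)] for x in range(NX)]

def objective(V_bar):
    """alpha * D(Vbar Q_Y || S_X Q_Y) - beta * iota(Q_Y Vbar); V_bar[y][x] = Vbar(x|y)."""
    return sum(
        Q_Y[y] * V_bar[y][x]
        * (alpha * math.log2(V_bar[y][x] / S_X[x]) - beta * math.log2(ratio[x][y]))
        for y in range(NY) for x in range(NX)
    )

# norm[y] = 1 / C(y), the normalizer of the tilted solution (44).
norm = [sum(S_X[x] * ratio[x][y] ** (beta / alpha) for x in range(NX)) for y in range(NY)]
V_star = [[S_X[x] * ratio[x][y] ** (beta / alpha) / norm[y] for x in range(NX)]
          for y in range(NY)]

closed_form = -alpha * sum(Q_Y[y] * math.log2(norm[y]) for y in range(NY))
value_at_star = objective(V_star)
random_vals = [objective([rand_dist(NX) for _ in range(NY)]) for _ in range(500)]
print(closed_form, value_at_star, min(random_vals))
```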

Inserting this result into (40) yields that

$$
\begin{aligned}
\underline{\underline{\mathit{\Gamma}}}(R) &=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{S_{X}\in\mathcal{P}(\mathcal{X})}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\\
&\qquad\left\{D(Q_{Y}\|P_{Y})+(\beta-\alpha)R-\alpha\sum_{y}Q_{Y}(y)\log\left(\sum_{x}S_{X}(x)\left(\frac{W_{Y|X}(y|x)}{P_{Y}(y)}\right)^{\frac{\beta}{\alpha}}\right)\right\}\\
&\overset{a}{=}\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\ \min_{S_{X}\in\mathcal{P}(\mathcal{X})}\left\{-\log\left(\sum_{y}P_{Y}^{1-\beta}(y)\left[\sum_{x}S_{X}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right]^{\alpha}\right)+(\beta-\alpha)R\right\} &&(45)\\
&=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right],
\end{aligned}
$$

where $(a)$ is due to the following identity [yagli2019exact, Lem. 20] [verdu2021error, Thm. 1]:

$$
\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})-\sum_{y}Q_{Y}(y)F(y)\right]=-\log\left(\sum_{y}P_{Y}(y)\ 2^{F(y)}\right)
\tag{46}
$$

which holds for any function $F(\cdot)$, with the unique minimizer $Q_{Y}^{*}(y)\ \propto\ 2^{F(y)}P_{Y}(y)$. This completes the proof of (21). ∎
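Identity (46) and its use in step $(a)$ can be checked with a few lines of Python. The sketch below is illustrative only: the reference distribution, the function `F_generic`, and the values of $\alpha$, $\beta$, $S_{X}$ and $W_{Y|X}$ are hypothetical random choices, and `gibbs_min` is a helper name introduced here. It first tests (46) for a generic $F$, then instantiates $F(y)=\alpha\log\sum_{x}S_{X}(x)\big(W_{Y|X}(y|x)/P_{Y}(y)\big)^{\beta/\alpha}$ and compares the result with the expression inside (45).

```python
import math
import random

random.seed(5)
NX, NY = 3, 4  # illustrative alphabet sizes (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

def gibbs_min(P, F):
    """Closed-form value and minimizer of min_Q [D(Q||P) - sum_y Q(y) F(y)], in bits."""
    z = sum(p * 2.0 ** f for p, f in zip(P, F))
    return -math.log2(z), [p * 2.0 ** f / z for p, f in zip(P, F)]

def objective(Q, P, F):
    """D(Q||P) - sum_y Q(y) F(y), in bits."""
    return (sum(q * math.log2(q / p) for q, p in zip(Q, P))
            - sum(q * f for q, f in zip(Q, F)))

# Generic check of (46).
P_Y = rand_dist(NY)
F_generic = [random.uniform(-2.0, 2.0) for _ in range(NY)]
value, Q_star = gibbs_min(P_Y, F_generic)
value_at_star = objective(Q_star, P_Y, F_generic)
random_vals = [objective(rand_dist(NY), P_Y, F_generic) for _ in range(500)]

# Instantiation that leads to the expression inside (45).
alpha, beta = 0.8, 0.5
S_X = rand_dist(NX)
W = [rand_dist(NY) for _ in range(NX)]   # W[x][y] = W(y|x)
F_tilt = [
    alpha * math.log2(sum(S_X[x] * (W[x][y] / P_Y[y]) ** (beta / alpha) for x in range(NX)))
    for y in range(NY)
]
value_tilt, _ = gibbs_min(P_Y, F_tilt)
expr_45 = -math.log2(sum(
    P_Y[y] ** (1.0 - beta)
    * sum(S_X[x] * W[x][y] ** (beta / alpha) for x in range(NX)) ** alpha
    for y in range(NY)
))
print(value - value_at_star, min(random_vals) - value, value_tilt - expr_45)
```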

For later convenience, we identify the optimizer here. The minimizer $Q_{Y}^{*}$ that yields (45) is

$$
Q_{Y}^{*}(y)=\frac{P_{Y}^{1-\beta}(y)\left(\displaystyle\sum_{x}S_{X}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}}{\displaystyle\sum_{y'}P_{Y}^{1-\beta}(y')\left(\sum_{x'}S_{X}(x')\ W_{Y|X}^{\frac{\beta}{\alpha}}(y'|x')\right)^{\alpha}}.
\tag{47}
$$

Recall that we introduced $S_{X}$ merely to apply the variational form of mutual information (42), so the minimizer $S_{X}^{*}$ in (45) is exactly the marginal $Q_{X}^{*}$ of the minimizer $Q_{XY}^{*}$ of (39). Hence, fixing $\alpha$ and $\beta$, if $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$ in (6) yields an optimizer $Q_{X}^{*}$ (which must satisfy (63) in Appendix C), then we can plug $S_{X}=Q_{X}^{*}$ into (44) and (47) and obtain the corresponding optimizer $Q_{XY}^{*}$ of (39) as

$$
Q_{XY}^{*}(x,y)=\frac{Q_{X}^{*}(x)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ P_{Y}^{1-\beta}(y)\left(\displaystyle\sum_{x'}Q_{X}^{*}(x')\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x')\right)^{\alpha-1}}{\displaystyle\sum_{y'}P_{Y}^{1-\beta}(y')\left(\sum_{x'}Q_{X}^{*}(x')\ W_{Y|X}^{\frac{\beta}{\alpha}}(y'|x')\right)^{\alpha}}.
\tag{48}
$$

Analogous to Lemma 31, one can observe the following properties of $\underline{\underline{\mathit{\Gamma}}}(R)$ given its definition in (20).

Lemma 46.

We have the following properties of $\underline{\underline{\mathit{\Gamma}}}(R)$:

(i) $\underline{\underline{\mathit{\Gamma}}}(R)=0$ when $R\geq\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.  (ii) $\underline{\underline{\mathit{\Gamma}}}(R)>0$ when $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$.

Proof.

For simplicity, write $\underline{\underline{\mathit{\Gamma}}}(R)=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\gamma(R,Q_{XY})$ with

$$
\gamma(R,Q_{XY}):=\max_{\alpha,\beta\in[0,1],\alpha\geq\beta}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}.
$$

Given $R$, let $Q_{XY}^{*}\in\mathcal{P}(\mathcal{X}\mathcal{Y})$ minimize $\gamma(R,Q_{XY})$, i.e., $\underline{\underline{\mathit{\Gamma}}}(R)=\gamma(R,Q_{XY}^{*})$. Setting $\alpha=\beta=0$ gives $\underline{\underline{\mathit{\Gamma}}}(R)=\gamma(R,Q_{XY}^{*})\geq D(Q_{Y}^{*}\|P_{Y})\geq 0$. Hence, $\underline{\underline{\mathit{\Gamma}}}(R)$ is non-negative for all $R$. (i) follows immediately from Lemma 31 and the fact that $\underline{\underline{\mathit{\Gamma}}}(R)\leq\underline{\mathit{\Gamma}}(R)$.

For (ii), suppose to the contrary that $\underline{\underline{\mathit{\Gamma}}}(R)=0$. Then $\gamma(R,Q_{XY}^{*})=0$, and hence plugging any pair $(\alpha,\beta)\in[0,1]^{2}$ with $\alpha\geq\beta$ into the objective function of $\gamma(R,Q_{XY}^{*})$ yields a non-positive value. First, plugging in $\alpha=\beta=0$ gives $D(Q_{Y}^{*}\|P_{Y})=0$, so $Q_{Y}^{*}=P_{Y}$. Second, plugging in $\alpha=\beta=1$ and combining it with Lemma 3 gives $D(V^{*}\|W|Q_{X}^{*})=0$, so $V^{*}=W$ and hence $Q_{X}^{*}\in\mathcal{S}$. Finally, plugging in $\alpha=1,\beta=0$ gives $I(Q_{X}^{*};V^{*})\leq R$. To sum up, we obtain $R\geq I(Q_{X}^{*};W)$ with $Q_{X}^{*}\in\mathcal{S}$, contradicting the condition that $R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W)$. ∎
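The three substitutions used in this proof are the vertices of the triangle $\{0\leq\beta\leq\alpha\leq 1\}$, and a function that is linear in $(\alpha,\beta)$ attains its maximum over a triangle at a vertex. The sketch below illustrates this numerically under stated assumptions: the distributions are hypothetical, $\iota(Q_{XY})$ is taken as $\mathbb{E}_{Q_{XY}}[\log\frac{W_{Y|X}(Y|X)}{P_{Y}(Y)}]$, and the identity attributed to Lemma 3 is assumed to read $D(V\|W|Q_{X})=D(Q_{Y}\|P_{Y})+I(Q_{X};V)-\iota(Q_{XY})$, which is the form consistent with step $(b)$ in the proof of (20).

```python
import math
import random

random.seed(6)
NX, NY = 2, 3  # illustrative alphabet sizes (assumption)

def rand_dist(k):
    """Random strictly positive probability vector of length k."""
    w = [random.random() + 0.1 for _ in range(k)]
    s = sum(w)
    return [v / s for v in w]

P_X = rand_dist(NX)
W = [rand_dist(NY) for _ in range(NX)]   # W[x][y] = W(y|x)
P_Y = [sum(P_X[x] * W[x][y] for x in range(NX)) for y in range(NY)]
pairs = [(x, y) for x in range(NX) for y in range(NY)]

def terms(Q):
    """Return D(Q_Y||P_Y), I(Q_X;V), iota(Q_XY), D(V||W|Q_X) in bits."""
    Q_X = [sum(Q[x]) for x in range(NX)]
    Q_Y = [sum(Q[x][y] for x in range(NX)) for y in range(NY)]
    d_out = sum(q * math.log2(q / p) for q, p in zip(Q_Y, P_Y))
    mi = sum(Q[x][y] * math.log2(Q[x][y] / (Q_X[x] * Q_Y[y])) for x, y in pairs)
    iota = sum(Q[x][y] * math.log2(W[x][y] / P_Y[y]) for x, y in pairs)
    d_vw = sum(Q[x][y] * math.log2(Q[x][y] / (Q_X[x] * W[x][y])) for x, y in pairs)
    return d_out, mi, iota, d_vw

def gamma_vertices(R, Q):
    """gamma(R, Q) evaluated only at the vertices (0,0), (1,0), (1,1)."""
    d_out, mi, iota, _ = terms(Q)
    return max(d_out + a * (mi - R) + b * (R - iota) for a, b in [(0, 0), (1, 0), (1, 1)])

def gamma_grid(R, Q, n=40):
    """gamma(R, Q) evaluated on a grid of the triangle 0 <= beta <= alpha <= 1."""
    d_out, mi, iota, _ = terms(Q)
    return max(d_out + (i / n) * (mi - R) + (j / n) * (R - iota)
               for i in range(n + 1) for j in range(i + 1))

flat = rand_dist(NX * NY)
Q_rand = [flat[x * NY:(x + 1) * NY] for x in range(NX)]
d_out, mi, iota, d_vw = terms(Q_rand)
lemma3_gap = d_vw - (d_out + mi - iota)
vertex_gap = gamma_grid(0.2, Q_rand) - gamma_vertices(0.2, Q_rand)

# At Q_XY = P_X W the objective vanishes when R = I(P_X; W) and is positive for smaller R.
Q_true = [[P_X[x] * W[x][y] for y in range(NY)] for x in range(NX)]
I_true = terms(Q_true)[1]
gamma_at_matching_rate = gamma_vertices(I_true, Q_true)
gamma_below = gamma_vertices(0.5 * I_true, Q_true)
print(lemma3_gap, vertex_gap, gamma_at_matching_rate, gamma_below)
```

The last two values concern a single candidate $Q_{XY}$ only; they illustrate the mechanism of the lemma and do not compute the minimum over $Q_{XY}$.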

B-b Proof of Proposition 35

Proof.

Given any $R>0$, define

$$
\begin{aligned}
\mathit{\Gamma}_{1}(R) &:=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\left[D(Q_{Y}\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}+\big|R-\iota(Q_{XY})\big|^{+}\right] &&(49)\\
&=:\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\Delta(Q_{XY})=\Delta(Q_{XY}^{\star}). &&(50)
\end{aligned}
$$

That is, let $\Delta(\cdot)$ denote the objective function in $\mathit{\Gamma}_{1}(R)$ and let $Q_{XY}^{\star}$ be one of the optimizers. We claim that $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$.

We first show that this is true when $\mathit{\Gamma}_{1}(R)$ vanishes. Suppose $\mathit{\Gamma}_{1}(R)=0$. Then $Q_{Y}^{\star}=P_{Y}$ and $I(Q_{X}^{\star};V^{\star})\leq R\leq\iota(Q_{XY}^{\star})$. However, Lemma 3 gives that $I(Q_{X}^{\star};V^{\star})-\iota(Q_{XY}^{\star})=D(V^{\star}\|W|Q_{X}^{\star})\geq 0$, so $V^{\star}=W$, and hence $Q_{X}^{\star}\in\mathcal{S}$ and $R=I(Q_{X}^{\star};W)$. Due to the continuity of $I(\cdot\ ;W)$, we have $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$. By Lemma 33, we further have $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)=0$.

Next, we show that $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$ also holds when both are positive. In the rest of the discussion, we assume that $\mathit{\Gamma}_{1}(R)>0$. We further write $\mathit{\Gamma}_{1}(R)$ as

$$
\begin{aligned}
\mathit{\Gamma}_{1}(R) &=\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1]}\big\{D(Q_{Y}\|P_{Y})+\alpha\left[I(Q_{X};V)-R\right]+\beta\left[R-\iota(Q_{XY})\right]\big\}\\
&=:\min_{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y})}\ \max_{\alpha,\beta\in[0,1]}F(Q_{XY},\alpha,\beta),
\end{aligned}
\tag{51}
$$

where we have used $F(Q_{XY},\alpha,\beta)$ to denote the objective function. It is shown in the proof of Proposition 34 that $Q_{XY}\mapsto F(Q_{XY},\alpha,\beta)$ is convex and $(\alpha,\beta)\mapsto F(Q_{XY},\alpha,\beta)$ is linear (a trivial case of being concave). Hence, according to the minimax theorem, the optimal value in (51) is attained at a saddle point (there might exist other non-saddle points that also produce the optimal value, but at least one saddle point exists). Now, suppose we have a saddle point, denoted by $(Q_{XY}',\alpha',\beta')$. It must satisfy [rockafellar1970convex, Lem. 36.2]

$$
F(Q_{XY}',\alpha,\beta)\leq F(Q_{XY}',\alpha',\beta')\leq F(Q_{XY},\alpha',\beta'),\quad\forall\ Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}),\ \alpha,\beta\in[0,1].
\tag{52}
$$

It should be noted that every saddle point is both a minimax solution (of $\min_{Q_{XY}}\max_{\alpha,\beta}F(Q_{XY},\alpha,\beta)$) and a maximin solution (of $\max_{\alpha,\beta}\min_{Q_{XY}}F(Q_{XY},\alpha,\beta)$). Now we make the following claim:

Claim 1. If $\alpha'<1$, then $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$.

Proof of claim. If $\alpha'<1$, then $Q_{XY}'$ must give $I(Q_{X}';V')\leq R$; otherwise the first inequality in (52) is violated, as $F(Q_{XY}',\alpha=1,\beta=\beta')>F(Q_{XY}',\alpha',\beta')$. Since this saddle point $(Q_{XY}',\alpha',\beta')$ is also a minimax solution, in (50) we can take $Q_{XY}^{\star}=Q_{XY}'$ and hence $\mathit{\Gamma}_{1}(R)=\Delta(Q_{XY}')$: the minimization $\min_{Q_{XY}}$ in (49) can always occur at a point where $I(Q_{X}';V')\leq R$, so $\min_{Q_{XY}}$ can be reduced to the constrained optimization $\min_{Q_{XY}:I(Q_{X};V)\leq R}$, and thus $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$. $\square$

It remains to discuss the case in which $\alpha'=1$. Comparing (51) and (20), we can immediately write from (21) (noting that the condition $\alpha\geq\beta$ is not used in the proof of (21)) that

$$
\mathit{\Gamma}_{1}(R)=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right].
\tag{53}
$$

Since the saddle point $(Q_{XY}',\alpha'=1,\beta')$ belongs to the maximin solutions, $(\alpha'=1,\beta')$ must be a maximizer of (53), and correspondingly the marginal $Q_{X}'$ of $Q_{XY}'$ belongs to the set of optimizers $Q_{X}^{*}$ of $J_{1,\beta'}$ in (6). Specifically, $Q_{X}'$ is in the set

$$
\mathcal{P}(\mathcal{X}_{\beta'})=\operatorname*{argmax}_{Q_{X}\in\mathcal{P}(\mathcal{X})}\ \left[\sum_{x,y}Q_{X}(x)\ W_{Y|X}^{\beta'}(y|x)\ P_{Y}^{1-\beta'}(y)\right],
\tag{54}
$$

which is a linear optimization over $Q_{X}$. Here $\mathcal{X}_{\beta'}:=\operatorname*{argmax}_{x}\left[\sum_{y}W_{Y|X}^{\beta'}(y|x)\ P_{Y}^{1-\beta'}(y)\right]$ is the set of optimal symbols and $\mathcal{P}(\mathcal{X}_{\beta'})$ is the set of input distributions supported on $\mathcal{X}_{\beta'}$.

Since $\alpha'=1$, we can first exclude two trivial cases: $\beta'=0$ or $1$. If $\beta'=0$, then (53) gives $\mathit{\Gamma}_{1}(R)=-R$, while (49) implies that $\mathit{\Gamma}_{1}(R)\geq 0$. Thus, this case can never happen. Similarly, if $\beta'=1$, then (53) gives $\mathit{\Gamma}_{1}(R)=0$, while we have assumed the strict positivity of $\mathit{\Gamma}_{1}(R)$. In conclusion, as long as $\mathit{\Gamma}_{1}(R)>0$, we have $\beta'\in(0,1)$ and can make the following claim.

Claim 2. A saddle point $(Q_{XY}',\alpha'=1,\beta'\in(0,1))$ must satisfy $I(Q_{X}';V')\geq R=\iota(Q_{XY}')$.

Proof of claim. This follows directly from the first inequality in (52). $\square$

In the rest of the discussion, we assume that $\beta'\in(0,1)$. In most circumstances, $\mathcal{X}_{\beta'}$ contains only one symbol, and the corresponding maximizer of (54) is a deterministic distribution. More generally, however, $\mathcal{X}_{\beta'}$ may contain multiple symbols, and all distributions supported on these symbols are maximizers of (54); the marginal $Q_{X}'$ of the saddle point $Q_{XY}'$ is only one of those distributions. Let us consider this general situation: if $\mathcal{X}_{\beta'}$ contains multiple symbols, we take any two symbols, say, $x_{1},x_{2}\in\mathcal{X}_{\beta'}$. Denote

$$
G_{x}(\beta):=2^{(\beta-1)D_{\beta}(W_{Y|x}\|P_{Y})}=\sum_{y}W_{Y|X}^{\beta}(y|x)\ P_{Y}^{1-\beta}(y),
$$

where $D_{\beta}$ is the $\beta$-Rényi divergence in Definition 6. Then we have $G_{x_{1}}(\beta')=G_{x_{2}}(\beta')$, meaning that these two functions of $\beta$ intersect at $\beta'$. Since they are both analytic, in a neighborhood of $\beta'$ they either coincide or intersect only at $\beta'$. The former indicates a certain symmetry in $W_{Y|X}$ and $P_{Y}$ (for example, $W_{Y|X}$ is a binary symmetric channel and $P_{Y}$ is uniform). We say that $x_{1}$ and $x_{2}$ belong to the same class if $G_{x_{1}}(\beta)=G_{x_{2}}(\beta)$ for all $\beta$ in a neighborhood of $\beta'$. We now proceed according to how many classes $\mathcal{X}_{\beta'}$ contains, which leads to the following three cases.
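The notion of a class can be made concrete with a small numerical example. The sketch below evaluates $G_{x}(\beta)$ for the binary symmetric channel with uniform $P_{Y}$ mentioned above (the crossover probability 0.1 is an arbitrary choice) and for a hypothetical asymmetric channel, confirms the endpoint values $G_{x}(0)=G_{x}(1)=1$ for strictly positive distributions, and compares a finite-difference derivative with the closed form $\sum_{y}W_{Y|X}^{\beta}(y|x)P_{Y}^{1-\beta}(y)\ln\frac{W_{Y|X}(y|x)}{P_{Y}(y)}$. The natural logarithm is used for the derivative; a different logarithm base only rescales it by a constant.

```python
import math

def G(W_x, P_Y, beta):
    """G_x(beta) = sum_y W(y|x)^beta * P_Y(y)^(1-beta), for strictly positive entries."""
    return sum(w ** beta * p ** (1.0 - beta) for w, p in zip(W_x, P_Y))

def dG(W_x, P_Y, beta):
    """Derivative of G_x with respect to beta (natural logarithm)."""
    return sum(w ** beta * p ** (1.0 - beta) * math.log(w / p) for w, p in zip(W_x, P_Y))

betas = [k / 20 for k in range(21)]

# Binary symmetric channel with uniform output: both inputs give the same function of beta.
bsc = [[0.9, 0.1], [0.1, 0.9]]
uniform = [0.5, 0.5]
same_class = all(abs(G(bsc[0], uniform, b) - G(bsc[1], uniform, b)) < 1e-12 for b in betas)

# Hypothetical asymmetric channel; P_Y is its output under a uniform input.
asym = [[0.9, 0.1], [0.4, 0.6]]
P_asym = [0.5 * 0.9 + 0.5 * 0.4, 0.5 * 0.1 + 0.5 * 0.6]
max_gap = max(abs(G(asym[0], P_asym, b) - G(asym[1], P_asym, b)) for b in betas)

# Endpoint values: G_x(0) = sum_y P_Y(y) = 1 and G_x(1) = sum_y W(y|x) = 1.
endpoints = [G(asym[x], P_asym, b) for x in range(2) for b in (0.0, 1.0)]

# Central finite difference versus the closed-form derivative at beta = 0.3.
h = 1e-6
fd = (G(asym[0], P_asym, 0.3 + h) - G(asym[0], P_asym, 0.3 - h)) / (2 * h)
exact = dG(asym[0], P_asym, 0.3)
print(same_class, max_gap, endpoints, fd - exact)
```

In this example the two inputs of the symmetric channel fall into one class, while the two inputs of the asymmetric channel give functions that differ at interior values of $\beta$.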

Case 1. $\mathcal{X}_{\beta'}$ contains only one class, and there is only one symbol in that class.

In this case, $\mathcal{X}_{\beta'}$ contains a single symbol, say, $\mathcal{X}_{\beta'}=\{x_{0}\}$ for some $x_{0}\in\mathcal{X}$. Then $\mathcal{P}(\mathcal{X}_{\beta'})=\{\delta_{x,x_{0}}\}$, so (54) has a unique solution, which is a deterministic distribution. Since the saddle marginal $Q_{X}'\in\mathcal{P}(\mathcal{X}_{\beta'})$ is in this solution set, we must have $Q_{X}'=\delta_{x,x_{0}}$; however, this gives $I(Q_{X}';V')=0<R$, contradicting $I(Q_{X}';V')\geq R$ in Claim 2. Ergo, there exists no saddle point of the form $(Q_{XY}',\alpha'=1,\beta')$. The saddle point must have $\alpha'<1$. We are back to Claim 1 and hence $\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R)$.

Case 2. $\mathcal{X}_{\beta'}$ contains only one class, and there is more than one symbol in that class.

In this case, take any two symbols, say, $x_{1},x_{2}\in\mathcal{X}_{\beta'}$. Then $G_{x_{1}}(\beta)$ and $G_{x_{2}}(\beta)$ are locally the same function. We can further take their derivatives at $\beta'$ and write $G_{x_{1}}'(\beta')=G_{x_{2}}'(\beta')$. As a result, the expression

$$
\sum_{y}W_{Y|X}^{\beta'}(y|x)\ P_{Y}^{1-\beta'}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}
\tag{55}
$$

is independent of the choice of symbol in that class, i.e., (55) is identical for all $x\in\mathcal{X}_{\beta'}$. Then, take any $Q_{X}^{*}\in\mathcal{P}(\mathcal{X}_{\beta'})$ (so $Q_{X}^{*}$ is an optimizer in $J_{1,\beta'}$); its corresponding joint optimizer $Q_{XY}^{*}$ for $\min_{Q_{XY}}F(Q_{XY},1,\beta')$ is given by (48) (with $\alpha=1$ and $\beta=\beta'$). The expression

$$
\iota(Q_{XY}^{*})=\frac{\displaystyle\sum_{x}Q_{X}^{*}(x)\sum_{y}W_{Y|X}^{\beta'}(y|x)\ P_{Y}^{1-\beta'}(y)\log\frac{W_{Y|X}(y|x)}{P_{Y}(y)}}{\displaystyle\sum_{x}Q_{X}^{*}(x)\sum_{y}W_{Y|X}^{\beta'}(y|x)\ P_{Y}^{1-\beta'}(y)}
$$

is therefore independent of $Q_{X}^{*}$. Since the saddle point $Q_{XY}'$ also has marginal $Q_{X}'\in\mathcal{P}(\mathcal{X}_{\beta'})$ and satisfies $\iota(Q_{XY}')=R$ according to Claim 2, we must have $\iota(Q_{XY}^{*})=R$ for all $Q_{X}^{*}\in\mathcal{P}(\mathcal{X}_{\beta'})$. Note that $\mathcal{P}(\mathcal{X}_{\beta'})$ contains a deterministic distribution (which yields zero mutual information) and the saddle marginal (which yields $I(Q_{X}';V')\geq R$ by Claim 2). Since $\mathcal{P}(\mathcal{X}_{\beta'})$ is convex and the mutual information varies continuously along the segment joining these two distributions, there must exist some $Q_{X}^{\star}\in\mathcal{P}(\mathcal{X}_{\beta'})$ such that $I(Q_{X}^{\star};V^{\star})=R=\iota(Q_{XY}^{\star})$. Plugging this $Q_{XY}^{\star}$ into (50) gives

Δ(QXY)\displaystyle\Delta(Q_{XY}^{\star}) =D(QYPY)+|I(QX;V)R|++|Rι(QXY)|+\displaystyle=D(Q_{Y}^{\star}\|P_{Y})+\big|I(Q_{X}^{\star};V^{\star})-R\big|^{+}+\big|R-\iota(Q_{XY}^{\star})\big|^{+}
=D(QYPY)+[I(QX;V)R]+β[Rι(QXY)]\displaystyle=D(Q_{Y}^{\star}\|P_{Y})+\left[I(Q_{X}^{\star};V^{\star})-R\right]+\beta^{\prime}\left[R-\iota(Q_{XY}^{\star})\right]
=J1,β(WY|XPY)+(β1)R=Γ1(R).\displaystyle=J_{1,\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-1)R=\mathit{\Gamma}_{1}(R).

This means the minimization minQXY\min_{Q_{XY}} in (49) can occur at a point where I(QX;V)=RI(Q_{X}^{\star};V^{\star})=R, so Γ¯(R)=Γ1(R)\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R).

Case 3. 𝒳β\mathcal{X}_{\beta^{\prime}} contains more than one class.

In this case, we cannot deduce much about $\beta^{\prime}$, and hence about the corresponding $R$. However, let us perturb $\beta^{\prime}$ within its neighborhood, say, to some $\beta^{\prime\prime}\in(0,1)$. Since there are only finitely many symbols, in the neighborhood of $\beta^{\prime}$ the functions $G_{x}(\beta)$ for $x$'s from different classes intersect at only finitely many points, so $\mathcal{X}_{\beta^{\prime\prime}}$ must contain only one class. Take any symbol, say, $x_{1}$, from this class. Then we can write $\beta^{\prime\prime}$ as

β′′\displaystyle\beta^{\prime\prime} =argmaxβ[0,1]{J1,β(WY|XPY)+(β1)(R+ΔR)}\displaystyle=\operatorname*{argmax}_{\beta\in[0,1]}\left\{J_{1,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-1)(R+\Delta R)\right\}
=argmaxβ[0,1]{(1β)[Dβ(WY|x1PY)(R+ΔR)]}\displaystyle=\operatorname*{argmax}_{\beta\in[0,1]}\left\{(1-\beta)\left[D_{\beta}\left(W_{Y|x_{1}}\|P_{Y}\right)-(R+\Delta R)\right]\right\}

for some R+ΔRR+\Delta R in the neighborhood of RR. Note that this is a strictly concave maximization due to the strict concavity of β(1β)Dβ\beta\mapsto(1-\beta)D_{\beta} (if (1β)Dβ(1-\beta)D_{\beta} is linear in β\beta, then the maximizer β′′\beta^{\prime\prime} must be 0 or 1). Thus, this new R+ΔRR+\Delta R is uniquely determined by β′′\beta^{\prime\prime}. If the corresponding saddle point at R+ΔRR+\Delta R has α′′<1\alpha^{\prime\prime}<1, then we are back to Claim 1. If it has α′′=1\alpha^{\prime\prime}=1, then that saddle point must be (QXY′′,α′′=1,β′′)(Q_{XY}^{\prime\prime},\alpha^{\prime\prime}=1,\beta^{\prime\prime}), and we are back to Case 1 or Case 2 and can obtain that Γ¯(R+ΔR)=Γ1(R+ΔR)\overline{\mathit{\Gamma}}(R+\Delta R)=\mathit{\Gamma}_{1}(R+\Delta R). In summary, for all R+ΔRR+\Delta R (where ΔR0\Delta R\neq 0) in the neighborhood of RR, we have Γ¯(R+ΔR)=Γ1(R+ΔR)\overline{\mathit{\Gamma}}(R+\Delta R)=\mathit{\Gamma}_{1}(R+\Delta R). Ergo, Γ¯()=Γ1()\overline{\mathit{\Gamma}}(\cdot)=\mathit{\Gamma}_{1}(\cdot) holds almost everywhere in the neighborhood of RR. On the other hand, Γ¯()\overline{\mathit{\Gamma}}(\cdot) and Γ1()\mathit{\Gamma}_{1}(\cdot) are both continuous due to the Berge maximum theorem [aliprantis2006infinite, Item 17.31] (note that in Γ¯()\overline{\mathit{\Gamma}}(\cdot), R{QXY𝒫(𝒳𝒴):I(QX;V)R}R\twoheadrightarrow\{Q_{XY}\in\mathcal{P}(\mathcal{X}\mathcal{Y}):I(Q_{X};V)\leq R\} is a continuous and compact correspondence), so Γ¯(R)=Γ1(R)\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R) also holds at RR.

To sum up, Γ¯(R)=Γ1(R)\overline{\mathit{\Gamma}}(R)=\mathit{\Gamma}_{1}(R) holds everywhere, in both vanishing and positive regimes. The dual form of Γ¯(R)\overline{\mathit{\Gamma}}(R) follows immediately from (53). This completes the proof. ∎

B-c Proof of Proposition 36

Proof.

We first show that both $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ are convex in $R$. By Propositions 34 and 35, both $\overline{\mathit{\Gamma}}(R)$ and $\underline{\underline{\mathit{\Gamma}}}(R)$ take the dual form

\[\max_{\alpha,\beta}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right]. \tag{56}\]

Their only difference is the optimization range in (α,β)(\alpha,\beta). For a fixed combination of (α,β)(\alpha,\beta), the objective function in (56) is linear in RR (a trivial case of being convex). Then after maximizing over (α,β)(\alpha,\beta), (56) is convex in RR.
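The convexity step can be illustrated numerically. The following Python sketch is only a sanity check, not part of the proof: the list `pairs` holds randomly drawn, hypothetical stand-ins for the values $\big(J_{\alpha,\beta}(W_{Y|X}\|P_{Y}),\ \beta-\alpha\big)$ on a finite grid of $(\alpha,\beta)$, and the function name `dual_value` is our own. It checks the convexity inequality for the pointwise maximum of the resulting affine functions of $R$.

```python
import random

def dual_value(rate, pairs):
    """Pointwise maximum over (J, slope) pairs of the affine map R -> J + slope * R."""
    return max(j + slope * rate for j, slope in pairs)

random.seed(0)
# Hypothetical stand-ins for (J_{alpha,beta}, beta - alpha) on a grid of (alpha, beta).
pairs = [(random.uniform(0.0, 1.0), random.uniform(-1.0, 1.0)) for _ in range(50)]

def convexity_holds(r1, r2, t):
    """Check f(t r1 + (1-t) r2) <= t f(r1) + (1-t) f(r2) up to rounding."""
    lhs = dual_value(t * r1 + (1 - t) * r2, pairs)
    rhs = t * dual_value(r1, pairs) + (1 - t) * dual_value(r2, pairs)
    return lhs <= rhs + 1e-12

checks = [
    convexity_holds(random.uniform(0, 2), random.uniform(0, 2), random.random())
    for _ in range(1000)
]
all_convex = all(checks)
```

The check does not depend on the particular values in `pairs`, which mirrors the argument above: only the affine dependence on $R$ for each fixed $(\alpha,\beta)$ is used.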

Next, we show that both Γ¯(R)\overline{\mathit{\Gamma}}(R) and Γ¯¯(R)\underline{\underline{\mathit{\Gamma}}}(R) are monotone decreasing when R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W). Combining convexity and Lemma 33, Γ¯(R)\overline{\mathit{\Gamma}}(R) is convex in RR and positive when R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W). As RR grows, Γ¯(R)\overline{\mathit{\Gamma}}(R) remains convex and must vanish when RR reaches minPX𝒮I(PX;W)\min_{P_{X}\in\mathcal{S}}I(P_{X};W). Ergo, Γ¯(R)\overline{\mathit{\Gamma}}(R) must be monotone decreasing in RR when R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W). Similar arguments can be established for Γ¯¯(R)\underline{\underline{\mathit{\Gamma}}}(R) by combining its convexity and Lemma 46.

Finally, to prove our conclusion, it suffices to show that when R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W), the optimization over α,β[0,1]\alpha,\beta\in[0,1] in Γ¯(R)\overline{\mathit{\Gamma}}(R) occurs at αβ\alpha\geq\beta. Let R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W) and the optimizers in Γ¯(R)\overline{\mathit{\Gamma}}(R) be α,β\alpha^{\prime},\beta^{\prime}, i.e., Γ¯(R)=Jα,β(WY|XPY)+(βα)R\overline{\mathit{\Gamma}}(R)=J_{\alpha^{\prime},\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-\alpha^{\prime})R. Suppose α<β\alpha^{\prime}<\beta^{\prime}. Take ΔR0+\Delta R\to 0^{+} and we have

Γ¯(R+ΔR)\displaystyle\overline{\mathit{\Gamma}}(R+\Delta R) =maxα,β[0,1][Jα,β(WY|XPY)+(βα)(R+ΔR)]\displaystyle=\max_{\alpha,\beta\in[0,1]}\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)(R+\Delta R)\right]
Jα,β(WY|XPY)+(βα)(R+ΔR)\displaystyle\geq J_{\alpha^{\prime},\beta^{\prime}}\left(W_{Y|X}\|P_{Y}\right)+(\beta^{\prime}-\alpha^{\prime})(R+\Delta R)
=Γ¯(R)+(βα)ΔR>Γ¯(R),\displaystyle=\overline{\mathit{\Gamma}}(R)+(\beta^{\prime}-\alpha^{\prime})\Delta R>\overline{\mathit{\Gamma}}(R),

meaning that Γ¯(R)\overline{\mathit{\Gamma}}(R) is locally increasing in RR, which contradicts the decreasing property that we just showed. Hence, the optimizers α,β\alpha^{\prime},\beta^{\prime} for Γ¯(R)\overline{\mathit{\Gamma}}(R) must satisfy that αβ\alpha^{\prime}\geq\beta^{\prime}. ∎

Appendix C A Different Approach to Proving the Converse Part of Theorem 17

Based on Arimoto’s techniques in [arimoto1973converse], we provide a different proof of the converse part of Theorem 17 for the uniform formulation. We begin with the following lemma about the additivity of Jα,βJ_{\alpha,\beta}.

Lemma 47 (Additivity of Jα,βJ_{\alpha,\beta}).

Jα,β(WY|XnPYn)=nJα,β(WY|XPY).J_{\alpha,\beta}\big(W_{Y|X}^{n}\|P_{Y}^{n}\big)=nJ_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right).

Proof.

We follow the arguments in [gallager1965simple, Thm. 4] and [arimoto1973converse, Lem. 1]. According to (6), $J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)$ is defined via the following optimization problem:

\[\max_{Q_{X}}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\quad\text{s.t.}\quad Q_{X}(x)\geq 0\ \ \forall x\in\mathcal{X},\ \text{ and }\ \sum_{x}Q_{X}(x)=1. \tag{57}\]

The Lagrangian of this problem is

\[\mathcal{L}=-\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}-\sum_{x}\lambda_{x}Q_{X}(x)+\mu\left(\sum_{x}Q_{X}(x)-1\right).\]

Since $\alpha\in[0,1]$, the objective in (57) is concave in $Q_{X}$, so problem (57) is convex. The KKT conditions are therefore necessary and sufficient for the optimizer $Q_{X}$ and multipliers $\lambda_{x},\mu$ [boyd2004convex, Ch. 5.5.3]. Specifically, letting $Q_{X}^{*},\lambda_{x}^{*},\mu^{*}$ be the optimizers, we have

\begin{align}
\sum_{x}Q_{X}^{*}(x) &=1; \tag{58}\\
\lambda_{x}^{*} &\geq 0,\ \ \lambda_{x}^{*}Q_{X}^{*}(x)=0 \quad \forall x\in\mathcal{X}; \tag{59}\\
0 &=\frac{\partial\mathcal{L}}{\partial Q_{X}(x)}\bigg|_{Q_{X}^{*}(x)}=-\alpha\sum_{y}P_{Y}^{1-\beta}(y)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ \eta_{y}^{\alpha-1}-\lambda_{x}^{*}+\mu^{*} \quad \forall x\in\mathcal{X}, \tag{60}
\end{align}

where

\[\eta_{y}=\sum_{x}Q_{X}^{*}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x). \tag{61}\]

Multiplying (60) by $Q_{X}^{*}(x)$ and summing over $x$ gives

\[\mu^{*}=\alpha\sum_{y}P_{Y}^{1-\beta}(y)\ \eta_{y}^{\alpha}, \tag{62}\]

where we have plugged in (58) and (59). Since $\lambda_{x}^{*}\geq 0$, (60) gives $\alpha\sum_{y}P_{Y}^{1-\beta}(y)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ \eta_{y}^{\alpha-1}=\mu^{*}-\lambda_{x}^{*}\leq\mu^{*}$. Combining this with (62), $Q_{X}^{*}$ must satisfy

\[\sum_{y}P_{Y}^{1-\beta}(y)\ W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\ \eta_{y}^{\alpha-1}\leq\sum_{y}P_{Y}^{1-\beta}(y)\ \eta_{y}^{\alpha}\quad\forall x\in\mathcal{X}, \tag{63}\]

with equality for every $x\in\mathrm{supp}(Q_{X}^{*})$ due to complementary slackness (59).

If $Q_{X}^{*}$ satisfies (63), then $Q_{X}^{*n}$ also satisfies the $n$-shot version of (63) (replacing $W_{Y|X}$ by $W_{Y|X}^{n}$ and $P_{Y}$ by $P_{Y}^{n}$). To see this, inserting $Q_{X}^{*n}$ in (61) gives

\[\eta_{y^{n}}=\sum_{x_{1},\dots,x_{n}}\prod_{i=1}^{n}Q_{X}^{*}(x_{i})W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})=\prod_{i=1}^{n}\left[\sum_{x_{i}}Q_{X}^{*}(x_{i})W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})\right]=\prod_{i=1}^{n}\eta_{y_{i}}\]

and hence the $n$-shot version of (63) becomes

\[\prod_{i=1}^{n}\left[\sum_{y_{i}}P_{Y}^{1-\beta}(y_{i})\ W_{Y|X}^{\frac{\beta}{\alpha}}(y_{i}|x_{i})\ \eta_{y_{i}}^{\alpha-1}\right]\leq\prod_{i=1}^{n}\left[\sum_{y_{i}}P_{Y}^{1-\beta}(y_{i})\ \eta_{y_{i}}^{\alpha}\right]\quad\forall x^{n}\in\mathcal{X}^{n},\]

which holds because (63) applies to each factor (i.e., to each $x_{i}$) and all factors are non-negative, with equality whenever every $x_{i}\in\mathrm{supp}(Q_{X}^{*})$. Since (63) is a necessary and sufficient condition for the optimizer, $Q_{X}^{*n}$ is optimal for the $n$-shot problem. Its objective value in (57) is $\sum_{y^{n}}P_{Y}^{n}(y^{n})^{1-\beta}\eta_{y^{n}}^{\alpha}=\big[\sum_{y}P_{Y}^{1-\beta}(y)\eta_{y}^{\alpha}\big]^{n}$, i.e., the $n$-th power of the single-letter optimal value, which gives the claimed additivity. ∎
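The product structure of $\eta_{y^{n}}$ used in the proof of Lemma 47 can be checked by brute force on a small example. The sketch below is illustrative only: the channel `W`, the input distribution `Q`, and the parameter values are hypothetical choices of ours, and `eta_single` / `eta_block` are our own helper names. It compares the $n$-fold sum over $x^{n}$ with the product of single-letter terms from (61).

```python
import itertools
import math

# Hypothetical toy instance: binary input, ternary output.
W = [[0.7, 0.2, 0.1],   # W(y | x = 0)
     [0.1, 0.3, 0.6]]   # W(y | x = 1)
Q = [0.4, 0.6]          # an arbitrary input distribution Q_X
alpha, beta = 0.8, 0.5
s = beta / alpha

def eta_single(y):
    """Single-letter eta_y = sum_x Q(x) W(y|x)^(beta/alpha), as in (61)."""
    return sum(Q[x] * W[x][y] ** s for x in range(len(Q)))

def eta_block(ys):
    """Brute-force n-shot eta: sum over all x^n of prod_i Q(x_i) W(y_i|x_i)^(beta/alpha)."""
    total = 0.0
    for xs in itertools.product(range(len(Q)), repeat=len(ys)):
        total += math.prod(Q[x] * W[x][y] ** s for x, y in zip(xs, ys))
    return total

n = 3
max_gap = max(
    abs(eta_block(ys) - math.prod(eta_single(y) for y in ys))
    for ys in itertools.product(range(3), repeat=n)
)
```

Here `max_gap` is the largest discrepancy between the two sides over all output sequences of length `n`; it should vanish up to floating-point rounding for any product input distribution.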

Now we prove the converse part of Theorem 17 for the uniform formulation, i.e., we show Γuni(R)Γ(R)\mathit{\Gamma}_{\mathrm{uni}}(R)\geq\mathit{\Gamma}(R).

Proof.

For simplicity, we first narrow down our discussions to the 1-shot case (n=1n=1). Let 𝒞={X(1),,X(M)}\mathcal{C}^{*}=\{X^{*}(1),\ldots,X^{*}(M)\} be the optimal code that achieves the minimal total variation using MM codewords, i.e.,

12P~Yn|𝒞PYn1=min𝒞:|𝒞|=M12P~Yn|𝒞PYn1.\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}^{*}}-P_{Y}^{n}\right\|_{1}=\min_{\mathcal{C}:|\mathcal{C}|=M}\frac{1}{2}\left\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\right\|_{1}.

Then there exists a probability distribution over codes, say, $\tilde{Q}_{\mathcal{C}}:=\tilde{Q}_{\mathcal{C}}(X(1),\ldots,X(M))$, that satisfies the following two properties.

  • (i)

    The expected total variation under $\tilde{Q}_{\mathcal{C}}$ equals the minimal total variation:

    \[\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}\right]=\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}^{*}}-P_{Y}\right\|_{1}.\]
  • (ii)

    The marginal distribution of each codeword is identical, i.e.,

    Q~X(X(i)):=X(1)X(i1)X(i+1)X(M)Q~𝒞(X(1),,X(M))\tilde{Q}_{X}(X(i)):=\sum_{X(1)}\cdots\sum_{X(i-1)}\sum_{X(i+1)}\cdots\sum_{X(M)}\tilde{Q}_{\mathcal{C}}(X(1),\ldots,X(M))

    is the same for each i=1,,Mi=1,\ldots,M.

One example of such a Q~𝒞\tilde{Q}_{\mathcal{C}} is [arimoto1973converse]:

Q~𝒞={1M!if 𝒞=𝒞 up to permutations of codewords,0otherwise.\tilde{Q}_{\mathcal{C}}=\begin{cases}\dfrac{1}{M!}&\text{if }\mathcal{C}=\mathcal{C}^{*}\text{ up to permutations of codewords},\\ 0&\text{otherwise}.\end{cases}
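To see why this permutation construction satisfies property (ii), one can enumerate it for a small codebook. In the sketch below, `code_star` is a hypothetical stand-in for $\mathcal{C}^{*}$ (with a repeated codeword on purpose), and exact rational arithmetic is used so that the comparison is not affected by rounding. Every position turns out to have the same marginal, namely the empirical distribution of the codewords.

```python
import itertools
from collections import Counter
from fractions import Fraction

code_star = ["a", "b", "b", "c"]   # hypothetical optimal code with M = 4 codewords
M = len(code_star)

# Uniform distribution over all M! orderings of the codewords
# (repeated codewords are treated as distinct positions).
orderings = list(itertools.permutations(code_star))
weight = Fraction(1, len(orderings))

def marginal(position):
    """Marginal distribution of the codeword placed at the given position."""
    dist = Counter()
    for ordering in orderings:
        dist[ordering[position]] += weight
    return dict(dist)

marginals = [marginal(i) for i in range(M)]
empirical = {x: Fraction(c, M) for x, c in Counter(code_star).items()}
```

Property (i) holds for the same reason that makes this enumeration trivial: the induced output distribution, and hence the total variation, does not depend on the order of the codewords.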

Take any $\alpha,\beta\in[0,1]$ with $\alpha\geq\beta$. Then we have the following:

\begin{align*}
1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1} &\leq 1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}^{*}}-P_{Y}\right\|_{1}=\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[1-\frac{1}{2}\left\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\right\|_{1}\right]\\
&\overset{a}{=}\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\sum_{y}P_{Y}(y)\min\left\{\frac{1}{M}\sum_{i=1}^{M}\frac{W_{Y|X}(y|X(i))}{P_{Y}(y)},1\right\}\\
&\overset{b}{\leq}\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\sum_{y}P_{Y}(y)\left(\frac{1}{M}\sum_{i=1}^{M}\frac{W_{Y|X}(y|X(i))}{P_{Y}(y)}\right)^{\beta}\\
&=M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\ \mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left(\sum_{i=1}^{M}W_{Y|X}(y|X(i))\right)^{\frac{\beta}{\alpha}\cdot\alpha}\\
&\overset{c}{\leq}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\left(\sum_{i=1}^{M}W_{Y|X}(y|X(i))\right)^{\frac{\beta}{\alpha}}\right]\right)^{\alpha}\\
&\overset{d}{\leq}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(\mathbb{E}_{\tilde{Q}_{\mathcal{C}}}\left[\sum_{i=1}^{M}W_{Y|X}^{\frac{\beta}{\alpha}}(y|X(i))\right]\right)^{\alpha}\\
&\overset{e}{=}M^{-\beta}\sum_{y}P_{Y}^{1-\beta}(y)\left(M\ \mathbb{E}_{\tilde{Q}_{X}(X(i))}\left[W_{Y|X}^{\frac{\beta}{\alpha}}(y|X(i))\right]\right)^{\alpha}\\
&=M^{-(\beta-\alpha)}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}\tilde{Q}_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\\
&\leq M^{-(\beta-\alpha)}\max_{Q_{X}\in\mathcal{P}(\mathcal{X})}\sum_{y}P_{Y}^{1-\beta}(y)\left(\sum_{x}Q_{X}(x)W_{Y|X}^{\frac{\beta}{\alpha}}(y|x)\right)^{\alpha}\\
&=2^{-\left[J_{\alpha,\beta}\left(W_{Y|X}\|P_{Y}\right)+(\beta-\alpha)R\right]},
\end{align*}

where $(a)$ follows from (18); $(b)$ holds because $\min\{x,1\}\leq x^{\beta}$ for $x\geq 0$ and $\beta\in[0,1]$; $(c)$ applies Jensen's inequality to the concave function $x\mapsto x^{\alpha}$; $(d)$ follows from the relation $(x+y)^{s}\leq x^{s}+y^{s}$ for $x,y\geq 0$ and $0\leq s\leq 1$, applied with $s=\beta/\alpha$ (which is at most 1 since $\alpha\geq\beta$); and $(e)$ follows from the second property of $\tilde{Q}_{\mathcal{C}}$. Extending this expression to the $n$-shot case, our conclusion immediately follows from Lemma 47. ∎
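The one-shot chain of inequalities can be sanity-checked on a toy example. The sketch below makes several assumptions of our own: a binary symmetric channel `W`, a uniform target `P_Y`, and the function names `one_minus_tv` and `arimoto_bound`. For a fixed code, the marginal $\tilde{Q}_{X}$ of the permutation construction is the empirical distribution of the codewords, so the quantity on the line before the maximization over $Q_{X}$ can be evaluated directly and compared with $\sum_{y}\min\{\tilde{P}_{Y|\mathcal{C}}(y),P_{Y}(y)\}=1-\frac{1}{2}\|\tilde{P}_{Y|\mathcal{C}}-P_{Y}\|_{1}$.

```python
import itertools

# Hypothetical toy instance (n = 1): binary symmetric channel, uniform target.
W = [[0.9, 0.1],
     [0.1, 0.9]]
P_Y = [0.5, 0.5]

def one_minus_tv(code):
    """1 - TV between the code-induced output distribution and P_Y."""
    M = len(code)
    induced = [sum(W[x][y] for x in code) / M for y in range(len(P_Y))]
    return sum(min(induced[y], P_Y[y]) for y in range(len(P_Y)))

def arimoto_bound(code, alpha, beta):
    """M^(alpha-beta) * sum_y P_Y^(1-beta) * (sum_x Q(x) W(y|x)^(beta/alpha))^alpha,
    with Q the empirical distribution of the codewords."""
    M = len(code)
    s = beta / alpha
    total = 0.0
    for y in range(len(P_Y)):
        inner = sum(W[x][y] ** s for x in code) / M
        total += P_Y[y] ** (1 - beta) * inner ** alpha
    return M ** (alpha - beta) * total

grid = [0.1 * k for k in range(1, 11)]
violations = [
    (code, a, b)
    for M in (1, 2, 3)
    for code in itertools.product((0, 1), repeat=M)
    for a in grid for b in grid
    if a >= b and one_minus_tv(code) > arimoto_bound(code, a, b) + 1e-12
]
```

The list `violations` collects every code and parameter pair with $\alpha\geq\beta$ for which the bound would fail; by steps $(b)$ through $(e)$ it should remain empty.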

Appendix D Proofs of Some Lemmas and Propositions

D-a Properties of Γ(s,QX,R)\mathit{\Gamma}(s,Q_{X},R)

Lemma 48.

Fix any QX𝒫(𝒳)Q_{X}\in\mathcal{P}(\mathcal{X}) and R0R\geq 0. For simplicity, write f(s):=Γ(s,QX,R)f(s):=\mathit{\Gamma}(s,Q_{X},R), which is defined in (10). Then f(s)f(s) is a continuous, non-increasing function of ss. It is strictly decreasing on the interval [0,s0][0,s_{0}] for some s00s_{0}\geq 0 and constant on [s0,)[s_{0},\infty).

Proof.

For convenience, define the objective function in $f(s)$ to be $g(V):=D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}=\max\big\{D(Q_{X}V\|P_{Y}),D(V\|P_{Y}|Q_{X})-R\big\}$, where the second equality uses $D(V\|P_{Y}|Q_{X})=D(Q_{X}V\|P_{Y})+I(Q_{X};V)$. Then $f(s)=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s}g(V)$.

We first show continuity. Clearly, $g(\cdot)$, being the maximum of two continuous functions, is continuous. Also, $s\twoheadrightarrow\{V:D(V\|W|Q_{X})\leq s\}$ is a continuous correspondence with compact and non-empty values (i.e., given any $s$, the feasible set $\{V:D(V\|W|Q_{X})\leq s\}$ is a compact and non-empty set of $V$). By the Berge maximum theorem [aliprantis2006infinite, Item 17.31], $f(s)$ is continuous.

Next, the non-increasing property of f(s)f(s) is straightforward: the feasible set D(VW|QX)sD(V\|W|Q_{X})\leq s expands as ss grows, and thus the minimization value can never increase. When s=0s=0, the only feasible VV is V=WV=W and then f(0)=D(QXWPY)+|I(QX;W)R|+0f(0)=D(Q_{X}W\|P_{Y})+\big|I(Q_{X};W)-R\big|^{+}\geq 0. On the other hand, note that D(VW|QX)D(V\|W|Q_{X}) is bounded when VWV\ll W, so when ss0:=maxV𝒫(𝒴|𝒳):VWD(VW|QX)s\geq s_{0}:=\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):V\ll W}D(V\|W|Q_{X}), we have

f(s)=minV𝒫(𝒴|𝒳):VW[D(QXVPY)+|I(QX;V)R|+],f(s)=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):V\ll W}\left[D(Q_{X}V\|P_{Y})+\big|I(Q_{X};V)-R\big|^{+}\right],

which is a constant. Therefore, f(s)f(s0)f(s)\equiv f(s_{0}) on [s0,)[s_{0},\infty).

Now, before moving to the decreasing property on [0,s0][0,s_{0}], we prove the convexity of f(s)f(s). Take s1,s20s_{1},s_{2}\geq 0. Suppose that the optimizers for f(s1)f(s_{1}) and f(s2)f(s_{2}) are V1V_{1} and V2V_{2}, respectively; that is,

f(si)=g(Vi),D(ViW|QX)si,i=1,2.f(s_{i})=g(V_{i}),\quad D(V_{i}\|W|Q_{X})\leq s_{i},\quad i=1,2.

For any t[0,1]t\in[0,1], take st=ts1+(1t)s2s_{t}=ts_{1}+(1-t)s_{2} and Vt=tV1+(1t)V2V_{t}=tV_{1}+(1-t)V_{2}. By the convexity of D(W|QX)D(\cdot\|W|Q_{X}), we have D(VtW|QX)=D([tV1+(1t)V2]W|QX)tD(V1W|QX)+(1t)D(V2W|QX)ts1+(1t)s2=stD(V_{t}\|W|Q_{X})=D\big([tV_{1}+(1-t)V_{2}]\big\|W\big|Q_{X}\big)\leq tD(V_{1}\|W|Q_{X})+(1-t)D(V_{2}\|W|Q_{X})\leq ts_{1}+(1-t)s_{2}=s_{t}. Now since D(VtW|QX)stD(V_{t}\|W|Q_{X})\leq s_{t}, we further have

f(st)\displaystyle f(s_{t}) =minV𝒫(𝒴|𝒳):D(VW|QX)stg(V)g(Vt)=g([tV1+(1t)V2])\displaystyle=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):D(V\|W|Q_{X})\leq s_{t}}g(V)\leq g(V_{t})=g\big([tV_{1}+(1-t)V_{2}]\big)
𝑎tg(V1)+(1t)g(V2)=tf(s1)+(1t)f(s2),\displaystyle\overset{a}{\leq}tg(V_{1})+(1-t)g(V_{2})=tf(s_{1})+(1-t)f(s_{2}), (64)

where (a)(a) follows from the convexity of g()g(\cdot). To see this, observe that both D(QXVPY)D(Q_{X}V\|P_{Y}) and D(VPY|QX)D(V\|P_{Y}|Q_{X}) are convex in VV. Thus, g()g(\cdot) being the maximum of two convex functions, is also convex. (64) thereby indicates the convexity of f()f(\cdot).

Equipped with convexity and the non-increasing monotonicity, f(s)f(s) must be strictly decreasing on the interval [0,s0][0,s_{0}]. It is also noteworthy that on this interval, the optimization of VV is achieved at the boundary where D(VW|QX)=sD(V\|W|Q_{X})=s. Similar arguments can be found in the proof of [csiszar2011information, Cor. 10.4]. ∎

D-b Proof of Proposition 23

Proof.

We only show the expression for $\underline{E}^{\mathrm{nl}}(R)$; that for $E_{\mathrm{rc}}^{\mathrm{nl}}(R)$ follows from analogous reasoning. We apply the method in [csiszar2011information, Cor. 10.4] and first show the convexity of $\overline{E}^{\mathrm{nl}}(\cdot)$. Let $Q_{Y1},Q_{Y2}$ be the optimizers for $\overline{E}^{\mathrm{nl}}(R_{1})$ and $\overline{E}^{\mathrm{nl}}(R_{2})$, respectively, i.e.,

E¯nl(Ri)=D(QYiPY),D(QYiPY)+H(QYi)Ri,i=1,2.\overline{E}^{\mathrm{nl}}(R_{i})=D(Q_{Yi}\|P_{Y}),\quad D(Q_{Yi}\|P_{Y})+H(Q_{Yi})\geq R_{i},\quad i=1,2.

Take any $t\in[0,1]$. Note that $D(Q_{Y}\|P_{Y})+H(Q_{Y})$ is linear in $Q_{Y}$. Then $D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq R$ implies that $D(tQ_{Y}\|P_{Y})+H(tQ_{Y})\geq tR$ for the unnormalized distribution $tQ_{Y}$, and hence the mixture $tQ_{Y1}+(1-t)Q_{Y2}$ is feasible at rate $tR_{1}+(1-t)R_{2}$. Thus,

\begin{align*}
\overline{E}^{\mathrm{nl}}\left(tR_{1}+(1-t)R_{2}\right) &=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})\geq tR_{1}+(1-t)R_{2}}D(Q_{Y}\|P_{Y})\\
&\leq D\left(\left[tQ_{Y1}+(1-t)Q_{Y2}\right]\|P_{Y}\right)\\
&\overset{a}{\leq}tD(Q_{Y1}\|P_{Y})+(1-t)D(Q_{Y2}\|P_{Y})\\
&=t\overline{E}^{\mathrm{nl}}(R_{1})+(1-t)\overline{E}^{\mathrm{nl}}(R_{2}),
\end{align*}

where (a)(a) follows from the convexity of D(PY)D(\cdot\|P_{Y}). So far, we have shown that E¯nl(R)\overline{E}^{\mathrm{nl}}(R) is convex in RR.

Moreover, observe that E¯nl(R)\overline{E}^{\mathrm{nl}}(R) is non-decreasing in RR and non-negative. Due to its convexity, E¯nl(R)\overline{E}^{\mathrm{nl}}(R) is strictly increasing in RR in the interval where it is finite and positive. Then the optimization of QYQ_{Y} must be achieved at the boundary:

E¯nl(R)=minQY𝒫(𝒴):D(QYPY)+H(QY)=RD(QYPY),when RH(PY).\overline{E}^{\mathrm{nl}}(R)=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}D(Q_{Y}\|P_{Y}),\quad\text{when }R\leq H_{-\infty}(P_{Y}). (65)

Hence,

E¯nl(R)\displaystyle\underline{E}^{\mathrm{nl}}(R) =minQY𝒫(𝒴){D(QYPY)+|RD(QYPY)H(QY)|+}\displaystyle=\min_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}
=minRH(PY)minQY𝒫(𝒴):D(QYPY)+H(QY)=R{D(QYPY)+|RD(QYPY)H(QY)|+}\displaystyle=\min_{R^{\prime}\leq H_{-\infty}(P_{Y})}\ \min_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R^{\prime}}\left\{D(Q_{Y}\|P_{Y})+\big|R-D(Q_{Y}\|P_{Y})-H(Q_{Y})\big|^{+}\right\}
=minRH(PY){E¯nl(R)+|RR|+}=𝑎minRR{E¯nl(R)+|RR|+},\displaystyle=\min_{R^{\prime}\leq H_{-\infty}(P_{Y})}\left\{\overline{E}^{\mathrm{nl}}(R^{\prime})+\big|R-R^{\prime}\big|^{+}\right\}\overset{a}{=}\min_{R^{\prime}\leq R}\left\{\overline{E}^{\mathrm{nl}}(R^{\prime})+\big|R-R^{\prime}\big|^{+}\right\},

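where $(a)$ follows from the fact that $\overline{E}^{\mathrm{nl}}(\cdot)$ is monotone increasing: any $R^{\prime}>R$ contributes at least $\overline{E}^{\mathrm{nl}}(R)$, which is already attained at $R^{\prime}=R$. ∎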
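The last equality in the proof of Proposition 23 only uses monotonicity, which can be seen on a discretized example. In the sketch below, `exponent` is a hypothetical convex, non-decreasing curve standing in for $\overline{E}^{\mathrm{nl}}(\cdot)$ (it is not the true exponent of any source), `h_max` plays the role of $H_{-\infty}(P_{Y})$, and the minimizations over $R^{\prime}$ are taken on a finite grid.

```python
def exponent(rate):
    """Hypothetical stand-in for a convex, non-decreasing exponent curve."""
    return max(0.0, rate - 1.0) ** 2

h_max = 3.0                                   # plays the role of H_{-infinity}(P_Y)
grid = [k / 100 for k in range(int(h_max * 100) + 1)]

def penalized(rate, candidates):
    """min over R' in candidates of exponent(R') + |rate - R'|^+."""
    return min(exponent(rp) + max(rate - rp, 0.0) for rp in candidates)

# Compare the minimum over all R' <= h_max with the minimum over R' <= R.
mismatches = [
    r for r in grid
    if penalized(r, grid) != penalized(r, [rp for rp in grid if rp <= r])
]
```

Restricting the candidates to $R^{\prime}\leq R$ does not change the minimum, so `mismatches` should be empty for any non-decreasing choice of `exponent`.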

D-c Proof of Lemma 31

Proof.

Let PXP_{X}^{*} be the minimizer of minPX𝒮I(PX;W)\min_{P_{X}\in\mathcal{S}}I(P_{X};W), i.e., minPX𝒮I(PX;W)=I(PX;W)\min_{P_{X}\in\mathcal{S}}I(P_{X};W)=I(P_{X}^{*};W).

For (i), clearly Γ¯(R)\underline{\mathit{\Gamma}}(R) is non-negative. Recall the expression of Γ(s,QX,R)\mathit{\Gamma}(s,Q_{X},R) in (10). Take QX=PXQ_{X}=P_{X}^{*}. Then Γ(s,PX,R)0\mathit{\Gamma}(s,P_{X}^{*},R)\equiv 0 for all s0s\geq 0, because V=WV=W is the optimizer. Then Γ¯(R)infs[0,)max{s,Γ(s,PX,R)}=0\underline{\mathit{\Gamma}}(R)\leq\inf_{s\in[0,\infty)}\max\big\{s,\mathit{\Gamma}(s,P_{X}^{*},R)\big\}=0, and hence Γ¯(R)=0\underline{\mathit{\Gamma}}(R)=0.

For (ii), suppose not. Owing to (15), there exists some QX𝒫(𝒳)Q_{X}^{*}\in\mathcal{P}(\mathcal{X}) such that Γ(s,QX,R)0\mathit{\Gamma}(s,Q_{X}^{*},R)\equiv 0 for all s[0,)s\in[0,\infty). Take s=0s=0, the only optimizer is V=WV^{*}=W and thus 0=Γ(0,QX,R)=D(QXWPY)+|I(QX;W)R|+0=\mathit{\Gamma}(0,Q_{X}^{*},R)=D(Q_{X}^{*}W\|P_{Y})+\big|I(Q_{X}^{*};W)-R\big|^{+}, indicating that QX𝒮Q_{X}^{*}\in\mathcal{S} and I(QX;W)RI(Q_{X}^{*};W)\leq R. This contradicts the condition R<minPX𝒮I(PX;W)R<\min_{P_{X}\in\mathcal{S}}I(P_{X};W). ∎

D-d Proof of Lemma 33

Proof.

Since $I(\cdot\ ;W)$ is continuous, when $\min_{P_{X}\in\mathcal{S}}I(P_{X};W)\leq R\leq\max_{P_{X}\in\mathcal{S}}I(P_{X};W)$, we can write $R=I(P_{X};W)$ for some $P_{X}\in\mathcal{S}$. Taking $Q_{XY}=P_{X}W_{Y|X}$, we have $\overline{\mathit{\Gamma}}(R)\leq D(P_{Y}\|P_{Y})+\big|R-\iota(P_{X}W_{Y|X})\big|^{+}=0$ because $\iota(P_{X}W_{Y|X})=I(P_{X};W)=R$. On the other hand, $\overline{\mathit{\Gamma}}(R)$ is non-negative, so $\overline{\mathit{\Gamma}}(R)=0$. ∎

D-e Proof of Lemma 38

Proof.

Observe that

maxQY𝒫(𝒴)[D(QYPY)+H(QY)]=maxQY𝒫(𝒴)[yQY(y)logPY(y)]=minylogPY(y)=H(PY),\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\big[D(Q_{Y}\|P_{Y})+H(Q_{Y})\big]=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y})}\left[-\sum_{y}Q_{Y}(y)\log P_{Y}(y)\right]=-\min_{y}\log P_{Y}(y)=H_{-\infty}(P_{Y}),

where the last equality follows from Remark 5. Therefore, when $R>H_{-\infty}(P_{Y})$, the minimization defining $\overline{E}^{\mathrm{nl}}(R)$ has an empty feasible set, so the exponent is not well-defined as a finite quantity and we write $\overline{E}^{\mathrm{nl}}(R)=\infty$. In fact, the bad set $\mathcal{B}$ in (23) becomes empty and hence we can only apply the trivial bound $\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}}-P_{Y}^{n}\big\|_{1}\geq 0$, yielding an infinite exponent. ∎
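The maximization in the proof of Lemma 38 is easy to reproduce numerically. In the sketch below, `P_Y` is a hypothetical four-symbol distribution of our choosing and logarithms are taken to base 2. The function `d_plus_h` evaluates $D(Q_{Y}\|P_{Y})+H(Q_{Y})$ through its simplified form $-\sum_{y}Q_{Y}(y)\log P_{Y}(y)$, and the maximum is attained by the point mass on the least likely symbol.

```python
import math
import random

P_Y = [0.5, 0.3, 0.15, 0.05]   # hypothetical target distribution

def d_plus_h(Q):
    """D(Q||P_Y) + H(Q), which simplifies to -sum_y Q(y) log2 P_Y(y)."""
    return -sum(q * math.log2(p) for q, p in zip(Q, P_Y))

h_minus_inf = -math.log2(min(P_Y))          # H_{-infinity}(P_Y)

# Point mass on the least likely symbol attains the maximum.
point_mass = [1.0 if p == min(P_Y) else 0.0 for p in P_Y]

random.seed(1)

def random_distribution(k):
    """A random probability vector of length k (normalized uniform weights)."""
    w = [random.random() for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

never_exceeds = all(
    d_plus_h(random_distribution(len(P_Y))) <= h_minus_inf + 1e-12
    for _ in range(2000)
)
```

Because the objective is linear in $Q_{Y}$, random sampling is only a spot check; the maximum over the simplex is always attained at a vertex.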

D-f Proof of Lemma 39

Proof.

Let QY:=argmaxQY𝒫(𝒴):D(QYPY)+H(QY)=RH(QY)Q_{Y}^{*}:=\text{argmax}_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}H(Q_{Y}). Note that the optimization here has an equality constraint D(QYPY)+H(QY)=RD(Q_{Y}\|P_{Y})+H(Q_{Y})=R.

Observe that RD(QYPY)R-D(Q_{Y}\|P_{Y}) is concave in QYQ_{Y}, with its maximum attained at QY=PYQ_{Y}=P_{Y}. However, the feasible set of N2(R)N_{2}(R) does not contain PYP_{Y}. Then the maximization of RD(QYPY)R-D(Q_{Y}\|P_{Y}) occurs at the boundary of the linear equation D(QYPY)+H(QY)=RD(Q_{Y}\|P_{Y})+H(Q_{Y})=R, yielding that

N2(R)\displaystyle N_{2}(R) =maxQY𝒫(𝒴):D(QYPY)+H(QY)=R[RD(QYPY)]\displaystyle=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}\left[R-D(Q_{Y}\|P_{Y})\right]
=maxQY𝒫(𝒴):D(QYPY)+H(QY)=RH(QY)=H(QY).\displaystyle=\max_{Q_{Y}\in\mathcal{P}(\mathcal{Y}):D(Q_{Y}\|P_{Y})+H(Q_{Y})=R}H(Q_{Y})=H(Q_{Y}^{*}).

Similarly, H(QY)H(Q_{Y}) is also concave in QYQ_{Y}, but the feasible set for N1(R)N_{1}(R) expands with RR. This implies that the maximizer for N1(R)N_{1}(R) lies either on the boundary D(QYPY)+H(QY)=RD(Q_{Y}\|P_{Y})+H(Q_{Y})=R or at the uniform distribution QY=1/|𝒴|Q_{Y}=1/|\mathcal{Y}|. In summary, N1(R)=H(QY)N_{1}(R)=H(Q_{Y}^{*}) or N1(R)=log|𝒴|N_{1}(R)=\log|\mathcal{Y}|, and therefore N1(R)N2(R)N_{1}(R)\geq N_{2}(R). ∎

D-g Proof of Lemma 43

Proof.

For (a)(a), let PYn,Y^n,IP_{Y^{n},\hat{Y}^{n},I} denote the joint distribution of the source sequence YnY^{n}, the reconstruction sequence Y^n\hat{Y}^{n}, and the random index II taking values in the set {1,,M}\{1,\dots,M\}; namely,

PYn,Y^n,I(yn,y^n,i):=PYn(yn)𝟙{i=(yn)} 1{y^n=𝒟(i)}.P_{Y^{n},\hat{Y}^{n},I}(y^{n},\hat{y}^{n},i):=P_{Y}^{n}(y^{n})\mathbbm{1}\{i=\mathcal{E}(y^{n})\}\ \mathbbm{1}\{\hat{y}^{n}=\mathcal{D}(i)\}.

Its marginal on II determines a message distribution

q(i):=PI(i)=ynPYn(yn)𝟙{i=(yn)}.q(i):=P_{I}(i)=\sum_{y^{n}}P_{Y}^{n}(y^{n})\mathbbm{1}\{i=\mathcal{E}(y^{n})\}. (66)

We choose qq as the message distribution, and w1𝒟w^{-1}\circ\mathcal{D} as the covering code 𝒞\mathcal{C}. Since ww is a bijection, w1w^{-1} is well-defined. The induced distribution in (2) then becomes

P~Yn|𝒞q(yn)\displaystyle\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n}) =i=1Mq(i)xn𝟙{yn=w(xn)}𝟙{xn=w1(𝒟(i))}\displaystyle=\sum_{i=1}^{M}q(i)\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\mathbbm{1}\{x^{n}=w^{-1}(\mathcal{D}(i))\}
=i=1MynPYn(yn)𝟙{i=(yn)}xn𝟙{yn=w(xn)}𝟙{w(xn)=𝒟(i)}\displaystyle=\sum_{i=1}^{M}\sum_{y^{n}}P_{Y}^{n}(y^{n})\mathbbm{1}\{i=\mathcal{E}(y^{n})\}\sum_{x^{n}}\mathbbm{1}\{y^{n}=w(x^{n})\}\mathbbm{1}\{w(x^{n})=\mathcal{D}(i)\}
=y^nPY^n(y^n)𝟙{yn=y^n}.\displaystyle=\sum_{\hat{y}^{n}}P_{\hat{Y}^{n}}(\hat{y}^{n})\mathbbm{1}\{y^{n}=\hat{y}^{n}\}.

Here PY^nP_{\hat{Y}^{n}} is the marginal of PYn,Y^n,IP_{Y^{n},\hat{Y}^{n},I}. Also, note that PYnP_{Y^{n}} is naturally the marginal of PYn,Y^n,IP_{Y^{n},\hat{Y}^{n},I}. We obtain that

P~Yn|𝒞qPYn1\displaystyle\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1} =y^nPY^n(y^n)𝟙{Yn=y^n}PYn1\displaystyle=\left\|\sum_{\hat{y}^{n}}P_{\hat{Y}^{n}}(\hat{y}^{n})\mathbbm{1}\{Y^{n}=\hat{y}^{n}\}-P_{Y}^{n}\right\|_{1}
PY^n𝟙{Yn=Y^n}PYn,Y^n1\displaystyle\leq\left\|P_{\hat{Y}^{n}}\mathbbm{1}\{Y^{n}=\hat{Y}^{n}\}-P_{Y^{n},\hat{Y}^{n}}\right\|_{1}
=yn,y^n|PY^n(yn)𝟙{yn=y^n}PYn,Y^n(yn,y^n)|\displaystyle=\sum_{y^{n},\hat{y}^{n}}\left|P_{\hat{Y}^{n}}(y^{n})\mathbbm{1}\{y^{n}=\hat{y}^{n}\}-P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})\right|
=yn=y^n[PY^n(yn)PYn,Y^n(yn,y^n)]+yny^nPYn,Y^n(yn,y^n)\displaystyle=\sum_{y^{n}=\hat{y}^{n}}\left[P_{\hat{Y}^{n}}(y^{n})-P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})\right]+\sum_{y^{n}\neq\hat{y}^{n}}P_{Y^{n},\hat{Y}^{n}}(y^{n},\hat{y}^{n})
=1Pr{Yn=Y^n}+Pr{YnY^n}=2Pre,\displaystyle=1-\Pr\{Y^{n}=\hat{Y}^{n}\}+\Pr\{Y^{n}\neq\hat{Y}^{n}\}=2\mathrm{Pr_{e}},

where the inequality follows from the monotonicity of the total variation.

For (b)(b), let 𝒞:𝒳n\mathcal{C}:\mathcal{M}\to\mathcal{X}^{n} be the covering code. Define

𝒥\displaystyle\mathcal{J} ={yn𝒴n:there exists i such that yn=w(𝒞(i))}={yn𝒴n:P~Yn|𝒞q(yn)>0}\displaystyle=\{y^{n}\in\mathcal{Y}^{n}:\text{there exists }i\in\mathcal{M}\text{ such that }y^{n}=w(\mathcal{C}(i))\}=\{y^{n}\in\mathcal{Y}^{n}:\tilde{P}_{Y^{n}|\mathcal{C}\sim q}(y^{n})>0\}

We choose our encoder and decoder in the lossless source coding scheme to be

(yn)\displaystyle\mathcal{E}(y^{n}) ={i for some i such that yn=w(𝒞(i))if yn𝒥0if yn𝒥,\displaystyle=\begin{cases}i\text{ for some }i\in\mathcal{M}\text{ such that }y^{n}=w(\mathcal{C}(i))&\text{if }y^{n}\in\mathcal{J}\\ 0&\text{if }y^{n}\notin\mathcal{J}\end{cases},
𝒟(i)\displaystyle\mathcal{D}(i) ={w(𝒞(i))if i(𝒴n)declare errorif i=0.\displaystyle=\begin{cases}w(\mathcal{C}(i))&\text{if }i\in\mathcal{M}\cap\mathcal{E}(\mathcal{Y}^{n})\\ \text{declare error}&\text{if }i=0\end{cases}.

It is noteworthy that the given soft covering code $\mathcal{C}$ may contain repeated codewords. As a result, for some $y^{n}\in\mathcal{J}$, there may exist more than one index $i$ such that $y^{n}=w(\mathcal{C}(i))$, and we simply select one representative index among them as the encoder output $\mathcal{E}(y^{n})$. Consequently, only $y^{n}\in\mathcal{J}$ can be reconstructed correctly. We obtain that

P~Yn|𝒞qPYn1yn𝒥PYn(yn)=Pre.\displaystyle\left\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\right\|_{1}\geq\sum_{y^{n}\notin\mathcal{J}}P_{Y}^{n}(y^{n})=\mathrm{Pr_{e}}.

Observe that the lossless source coding scheme uses $M+1$ indices (some of them may be unused if $\mathcal{C}$ contains repeated codewords), and its rate satisfies $\frac{1}{n}\log(M+1)\to R$ as $n\to\infty$. ∎
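Both directions of Lemma 43 can be traced on a one-shot toy example. The sketch below assumes a hypothetical four-symbol source `P_Y`, takes the noiseless channel $w$ to be the identity map, and uses a source code with $M=2$ indices; the dictionaries `encoder` and `decoder` are our own illustrative choices of $\mathcal{E}$ and $\mathcal{D}$.

```python
P_Y = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}   # hypothetical source / target, n = 1

# Part (a): a source code with M = 2 indices; the channel w is the identity.
encoder = {0: 0, 1: 1, 2: 0, 3: 0}       # E(y)
decoder = {0: 0, 1: 1}                   # D(i)

# Message distribution q(i) from (66) and the output distribution it induces.
q = {i: sum(p for y, p in P_Y.items() if encoder[y] == i) for i in decoder}
induced = {y: sum(q[i] for i in decoder if decoder[i] == y) for y in P_Y}

l1_distance = sum(abs(induced[y] - P_Y[y]) for y in P_Y)
pr_error = sum(p for y, p in P_Y.items() if decoder[encoder[y]] != y)

# Part (b): outputs never produced by the covering code are decoding errors.
reachable = set(decoder.values())        # the set J
pr_error_b = sum(p for y, p in P_Y.items() if y not in reachable)
```

In this particular example the bound of part (a) is met with equality, since `l1_distance` and `2 * pr_error` both equal 0.6 up to rounding, while part (b) gives the lower bound 0.3.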

D-h Proof of Lemma 44

Proof.

Fix PX𝒮P_{X}\in\mathcal{S} and define

E¯(R,PX):=minV𝒫(𝒴|𝒳):ι(PXVY|X)RD(VW|PX).\overline{E}(R,P_{X}):=\min_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X}):\iota(P_{X}V_{Y|X})\geq R}D(V\|W|P_{X}).

Let VV^{*} be its optimizer, i.e., E¯(R,PX)=D(VW|PX)\overline{E}(R,P_{X})=D(V^{*}\|W|P_{X}) with ι(PXVY|X)R\iota(P_{X}V^{*}_{Y|X})\geq R. When RI(PX;W)R\leq I(P_{X};W), we have V=WV^{*}=W and E¯(R,PX)=0\overline{E}(R,P_{X})=0. When R>I(PX;W)R>I(P_{X};W), note that ι(PXWY|X)=I(PX;W)\iota(P_{X}W_{Y|X})=I(P_{X};W), so VWV^{*}\neq W and hence E¯(R,PX)>0\overline{E}(R,P_{X})>0. When R>maxV𝒫(𝒴|𝒳)ι(PXVY|X)R>\max_{V\in\mathcal{P}(\mathcal{Y}|\mathcal{X})}\iota(P_{X}V_{Y|X}), VV^{*} does not exist; we can only apply a trivial bound 12P~Yn|𝒞qPYn10\frac{1}{2}\big\|\tilde{P}_{Y^{n}|\mathcal{C}\sim q}-P_{Y}^{n}\big\|_{1}\geq 0 and the exponent is E¯(R,PX)=\overline{E}(R,P_{X})=\infty.

In view that E¯(R)=maxPX𝒮E¯(R,PX)\overline{E}(R)=\max_{P_{X}\in\mathcal{S}}\overline{E}(R,P_{X}), the above turning points, including the one from zero to nonzero value, and the one from finite to infinite value, of E¯(R)\overline{E}(R) are obtained via minimizing over PX𝒮P_{X}\in\mathcal{S}. ∎