On the Strong Converse Exponent and Error Exponent of the Classical Soft Covering
Abstract
This paper establishes the exact strong converse exponent of the soft covering problem in the classical setting. This exponent characterizes the slowest achievable convergence speed of the total variation to one when a code of rate below mutual information is applied to a discrete memoryless channel for synthesizing a product output distribution. The proposed exponent is expressed through a new two-parameter information quantity, differing from the more commonly studied Rényi divergence or Rényi mutual information. In addition, we demonstrate the non-tightness of random coding for rates both below and above mutual information. Discussions on the latter start with noiseless channels, where we develop a deterministic code construction that outperforms random codes in error exponents. We further observe that the conventional formulation, which assumes a uniform distribution over messages, inherently introduces a discrepancy in error exponents depending on whether the components of the target distribution are rational or irrational numbers. To eliminate this discrepancy, we propose a new formulation in which messages are allowed to be distributed non-uniformly, and the rate is given by the logarithm of the smallest nonzero message probability (corresponding to Rényi entropy of order $-\infty$). The exact error exponent is characterized in this formulation for noiseless channels. Furthermore, for noisy channels, we provide a high-rate improvement in achievability and derive a converse bound on the error exponent.
I Introduction
The soft covering lemma is a fundamental lemma used in various information theoretic problems, such as channel resolvability, channel simulation, and lossy source coding. It first appeared in [wyner1975common, Thm. 6.3] where Wyner used the normalized (with a factor $1/n$) relative entropy to quantify probabilistic closeness and derived the achievability proof of the common information as the optimal rate of shared randomness between two agents. The concept of soft covering has subsequently been applied in wiretap channels [wyner1975wire, hayashi2006general, parizi2016exact, yu2018renyi] and in other problems including channel synthesis [cuff2010coordination, cuff2013distributed, hsieh2016channel] and channel resolvability [han1993approximation, bloch2013strong, watanabe2014strong, liu2016e_][koga2013information, Ch. 6] with tighter measures of probabilistic closeness such as total variation and relative entropy [winter2005secret, hou2013informational, hou2014effective, cuff2015stronger, cuff2016soft]. It is worth noting that channel resolvability is closely related to soft covering in the sense that the former deals with simulating all output distributions including non-product ones, while the latter focuses solely on the product ones. More recently, developments in quantum information theory have prompted extensive research into the soft covering [ahlswede2002strong, hayashi2015quantum, cheng2023error, atif2023lossy, atif2024quantum, shen2024optimal, he2024quantum, hayashi2025resolvability][wilde2011classical, Ch. 17][hayashi2016quantum, Sec. 9.4] and channel simulation [massar2000amount, winter2001compression, winter2004extrinsic, luo2009channel, wilde2012information, bennett2014quantum, radhakrishnan2017one, anshu2019convex] for quantum channels.
The general idea of soft covering is to simulate a given output distribution using a specific channel $n$ times. Suppose we are given a discrete memoryless channel (DMC) with transition probability distribution and a desired output distribution . Consider any code , with , and with each encoded message being drawn from the uniform distribution on . The output distribution induced by the code is
We aim for to effectively cover the space of . In other words, we want the non-product distribution to asymptotically approximate the product distribution under the criterion of the total variation . Let be the set of all 1-shot input distributions whose outputs are under the DMC . According to the soft covering lemma, e.g., [moser2019advanced, Ch. 19], when , we can achieve a good covering: there exists a code that ensures to vanish exponentially fast. The achievable error exponent has been studied extensively in the literature both in the classical and the classical-quantum settings [hayashi2006general, parizi2016exact, yu2018renyi][cheng2023error][yassaee2019almost][yagli2019exact]. All of these studies employ a random coding strategy when investigating the decaying behavior of the total variation and relative entropy at high rates. However, it is not guaranteed that random coding yields a tight exponent; even if it does, it only establishes an achievability result. An important open problem is to develop a converse bound showing that no code can make the total variation decay too fast. In this work, we demonstrate the non-tightness of random coding and provide a converse bound that holds for all codes.
On the other hand, when , we expect that the covering error should approach one exponentially fast, . We are particularly interested in the minimal achievable exponent:
which characterizes the slowest convergence of to one. It can be understood as an optimal performance among all the soft covering codes: given a low rate, how well a code can possibly behave to avoid poor covering. Therefore, for an arbitrary code , we can deduce that . In this sense, we refer to as the strong converse exponent, as it holds for all codes, not just for random codes.
The dual problem of the soft covering lemma is the packing lemma in channel coding, where a similar exponent, called the reliability function [shannon1959probability], has been thoroughly studied at rates both below [gallager1965simple, haroutunian1968estimates, sibson1969information, blahut1974hypothesis, csiszar1972class, arimoto1977information, augustin1978noisy, verdu2021error][gallager1968information, Ch. 5][csiszar2011information, Ch. 10] and above [arimoto1973converse, dueck1979reliability, oohama2015two, mosonyi2017strong] the capacity. The latter case is known as the strong converse of channel coding [wolfowitz1957coding, wolfowitz1960note, winter1999coding], as the corresponding exponent holds for all codes when the probability of error approaches one. However, for soft covering, the strong converse exponent has not been explored in the literature in terms of either lower or upper bounds. It may be noted that [cheng2023error] provides a lower bound on the performance of the random code ensemble at rates below mutual information. [watanabe2014strong] and [hayashi2025resolvability] prove the asymptotic strong converse for classical and classical-quantum channels but do not characterize the exponent.
In this work, we characterize the exact strong converse exponent of the classical soft covering problem. It is stated in Theorem 17, and a novel two-parameter Rényi-type information quantity is introduced in the process. The converse is derived using a hypothesis testing perspective, and the achievability is established through a deterministic code construction technique combined with the type covering lemma. In addition, we provide a random coding achievability bound for the strong converse exponent in Theorem 18 that is expressed using the Rényi mutual information. By comparing the exact , the proposed random coding achievability, and the random coding converse in [cheng2023error], we conclude that random coding fails to produce the tight exponent.
Building on the observation of non-tightness of random coding in the strong converse exponents, we provide a new characterization of the error exponents for . In particular, we establish a new lower bound on the error exponents (see Theorem 20) in terms of Rényi entropy for noiseless channels using a new deterministic code construction. We also derive a new upper bound on the error exponents (see Theorem 19) that matches the lower bound until they both reach a slope of unity. The lower bound strictly outperforms random codes for noiseless channels. Furthermore, this lower bound can be extended to noisy channels (see Theorem 27), achieving a high-rate improvement over the random coding exponent given in [yagli2019exact, Thm. 1]. We also provide a new sphere-packing style upper bound on the error exponents of the noisy channels, given in Theorem 28.
While investigating the behavior of error exponents for noiseless channels, we uncover an interesting phenomenon: the conventional formulation that assumes a uniform distribution on messages yields different exponents depending on whether the components of the target distribution are rational or irrational numbers. These discrepancies are characterized in Theorems 25 and 26. The reason for this is that the code-induced distribution takes values in multiples of , so soft covering essentially approximates within a quantization grid of step size . If is rational, a perfect covering with zero error is achievable at sufficiently high rates. In contrast, if is irrational, a nonzero covering error may persist due to limitations imposed by Diophantine approximation [queffelec2013diophantine, Ch. 3]. To address this discrepancy, we propose a new formulation of the soft covering problem, in which the distribution of the messages is allowed to be non-uniform, and the covering rate is defined not by the number of messages but by the inverse of the smallest nonzero message probability. We call this formulation -constrained soft covering (see Definition 11). We provide an exact characterization of the error exponents for this formulation without any discrepancy for all rates for noiseless channels (see Theorem 22). It aligns with that of the uniform distribution at low rates. In summary, the error exponents of the soft covering problems exhibit a very complex behavior.
This paper is organized as follows. Section II introduces useful definitions. All the main results are presented in Section III. Section IV provides a detailed proof of the characterization of the strong converse exponent. In Section V, we provide a detailed discussion on error exponents for noiseless channels under different formulations and generalize the corresponding results to noisy channels in Section VI.
II Preliminaries
II-A Notation
This paper applies the method of types, as well as basic notions from discrete probability theory and a range of information quantities. Basic properties of types can be found in many papers and textbooks, e.g. [csiszar2011information, Ch. 2] and [csiszar1998method]. Basic notations we will use are collected below.
Let denote a finite message set, and denote the finite input and output alphabets of a channel, respectively, and , , and denote the sets of all probability distributions on , , and , respectively. Likewise, let denote the set of all conditional distributions on given . For any block length , let and denote the sets of all -types and joint -types on and , respectively, and let denote the set of all conditional -types on given an input type . Denote by the type class corresponding to an -type , and by the conditional type class (or -shell) given and a conditional type . Let and denote the conditional entropy and the mutual information under the joint distribution , respectively, and define . Let denote the set of all input distributions whose induced output distribution under is . For a distribution , let denote its support, and write if . Finally, define .
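As a small illustration of the method of types used throughout, the following Python sketch enumerates all $n$-types on a finite alphabet, computes the size of each type class, and checks the standard bounds $(n+1)^{-|\mathcal{X}|} e^{nH(P)} \le |T_P| \le e^{nH(P)}$. The function names (`n_types`, `type_class_size`, `entropy`) and the choice $n=6$, $|\mathcal{X}|=3$ are ours and serve only as an example.

```python
from math import comb, exp, log

def n_types(n, k):
    """All n-types on an alphabet of size k, as tuples of symbol counts summing to n."""
    if k == 1:
        return [(n,)]
    return [(c,) + rest for c in range(n + 1) for rest in n_types(n - c, k - 1)]

def type_class_size(counts):
    """Number of sequences with the given symbol counts (multinomial coefficient)."""
    size, remaining = 1, sum(counts)
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def entropy(counts):
    """Shannon entropy in nats of the empirical distribution given by counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

n, k = 6, 3  # illustrative block length and alphabet size
types = n_types(n, k)
print("number of types:", len(types))
for t in types[:5]:
    # The type class size is sandwiched by exp(n H) up to a polynomial factor.
    print(t, type_class_size(t), exp(n * entropy(t)))
```

The type classes partition the set of all length-$n$ sequences, so their sizes sum to $|\mathcal{X}|^n$, while the number of types grows only polynomially in $n$.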
Consider a joint distribution with marginal distributions and . In this paper, we follow the notation that , where and are the forward and backward transition probabilities, respectively. In other words, the letters and are consistently associated throughout this paper’s notations: whenever a joint distribution appears, the corresponding marginals and conditionals are all implicitly defined and need not be restated. Moreover, we write .
In this work, the desired output distribution is denoted by and the DMC used for covering is denoted by . Definitions of some useful information-theoretic quantities are listed below.
Definition 1 (Information density).
The information density is defined as
Definition 2 (Expectation of the information density).
Given , the expectation of the information density under is defined as
Lemma 3.
One can easily verify the following identity, and hence we state it without proof.
Definition 4 (-Rényi entropy).
Given , the -Rényi entropy is defined as
Remark 5.
.
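For readers who wish to experiment numerically, here is a minimal Python sketch of the Rényi entropy. It assumes the standard convention $H_\alpha(P) = \frac{1}{1-\alpha}\log\sum_{x} P(x)^\alpha$ with the sum restricted to the support of $P$, together with the limiting cases $H_1$ (Shannon entropy), $H_{\infty} = -\log\max_x P(x)$, and $H_{-\infty} = -\log\min_{x:P(x)>0} P(x)$. The helper `meets_rate_constraint` is a hypothetical name for the condition that the smallest nonzero probability is at least $e^{-nR}$.

```python
from math import inf, log

def renyi_entropy(p, alpha):
    """Renyi entropy in nats of the pmf p, evaluated on its support."""
    support = [x for x in p if x > 0]
    if alpha == 1:
        return -sum(x * log(x) for x in support)   # Shannon entropy
    if alpha == inf:
        return -log(max(support))                  # min-entropy
    if alpha == -inf:
        return -log(min(support))                  # set by the smallest nonzero mass
    return log(sum(x ** alpha for x in support)) / (1 - alpha)

def meets_rate_constraint(p, n, rate):
    """True if every nonzero probability in p is at least exp(-n * rate)."""
    return renyi_entropy(p, -inf) <= n * rate

p = [0.5, 0.3, 0.2, 0.0]
for a in (-inf, -200, 0, 1, 2, inf):
    print(a, renyi_entropy(p, a))
print(meets_rate_constraint(p, n=1, rate=2.0))
```

Since every nonzero mass is at least $e^{-nR}$ under this constraint, the support can contain at most $e^{nR}$ points, and a uniform distribution on $e^{nR}$ points satisfies the constraint with equality.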
Definition 6 (-Rényi divergence).
Given , their -Rényi divergence is defined as
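A minimal numerical sketch of the Rényi divergence is given below. It assumes the usual convention $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\sum_x P(x)^\alpha Q(x)^{1-\alpha}$ for $\alpha \in (0,1)\cup(1,\infty)$, with the relative entropy recovered at $\alpha = 1$, and it further assumes that the support of $P$ is contained in that of $Q$. The function names and the example distributions are ours.

```python
from math import log

def kl_divergence(p, q):
    """Relative entropy D(p||q) in nats; assumes supp(p) is inside supp(q)."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha > 0 in nats; alpha = 1 gives the KL divergence."""
    if alpha == 1:
        return kl_divergence(p, q)
    total = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p, q) if a > 0)
    return log(total) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
for alpha in (0.5, 0.999999, 1, 1.000001, 2):
    print(alpha, renyi_divergence(p, q, alpha))
```

The printed values are non-decreasing in the order and approach the relative entropy as the order tends to one.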
Definition 7 (-Rényi mutual information).
Given , the -Rényi mutual information is defined as
The second equality follows from [yagli2019exact, Rmk. 24][verdu2021error, Item 20], and .
Remark 8.
If is noiseless, i.e., for each , there exists a symbol such that , then [verdu2021error, Eq. (85)].
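The noiseless special case can be checked numerically. The sketch below assumes the closed form usually attributed to Sibson, $I_\alpha(P_X, W) = \frac{\alpha}{\alpha-1}\log\sum_y\big(\sum_x P_X(x) W(y|x)^\alpha\big)^{1/\alpha}$, which may differ in presentation from the definition displayed in the paper. For an identity channel this expression reduces to the Rényi entropy of order $1/\alpha$ of the input. All names are illustrative.

```python
from math import log

def sibson_mi(p_x, w, alpha):
    """Sibson alpha-mutual information in nats; w[x][y] = W(y|x), alpha > 0, alpha != 1."""
    num_outputs = len(w[0])
    inner = [
        sum(p * w[x][y] ** alpha for x, p in enumerate(p_x)) ** (1 / alpha)
        for y in range(num_outputs)
    ]
    return alpha / (alpha - 1) * log(sum(inner))

def renyi_entropy(p, alpha):
    """Renyi entropy in nats for alpha != 1, evaluated on the support of p."""
    return log(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)

p_x = [0.5, 0.3, 0.2]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # noiseless channel
for alpha in (0.5, 2.0, 3.0):
    # Both columns agree for the identity channel.
    print(alpha, sibson_mi(p_x, identity, alpha), renyi_entropy(p_x, 1 / alpha))

bsc = [[0.9, 0.1], [0.1, 0.9]]  # a noisy example for comparison
print(sibson_mi([0.5, 0.5], bsc, 2.0))
```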
II-B Notions of soft covering
The precise formulations of the soft covering problem are given in this subsection. We begin with two cases: uniform and non-uniform, depending on whether the message set has a uniform distribution or not.
Definition 9 (Uniform soft covering).
An (uniform) soft-covering scheme consists of a message set with , where each message is drawn from the uniform distribution, and a code . This soft-covering scheme induces the following distribution at the output of the DMC :
| (1) |
The covering error is defined as the total variation between the achieved non-product output distribution and the product desired one :
Definition 10 (Non-uniform soft covering).
An (non-uniform) soft-covering scheme consists of a message set with , where messages are drawn from distribution , and a code . This soft-covering scheme induces the following distribution at the output of the DMC :
| (2) |
The covering error is defined as the total variation between the achieved non-product output distribution and the product desired one :
| (3) |
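To make the uniform and non-uniform formulations concrete, the following Python sketch computes the output distribution induced by a code over a DMC by brute-force enumeration and evaluates the covering error as a total variation distance to the product target. The representation (codewords as tuples of input indices, the channel as a row-stochastic matrix `w[x][y]`) and the function names are our own choices, and the enumeration is only feasible for very small block lengths.

```python
from itertools import product
from math import prod

def induced_output(code, w, num_outputs, msg_probs=None):
    """Output pmf on Y^n when codeword code[m] is sent with probability msg_probs[m].

    If msg_probs is None, messages are uniform (the uniform formulation)."""
    if msg_probs is None:
        msg_probs = [1 / len(code)] * len(code)
    n = len(code[0])
    return {
        y: sum(pm * prod(w[x][b] for x, b in zip(cw, y)) for pm, cw in zip(msg_probs, code))
        for y in product(range(num_outputs), repeat=n)
    }

def covering_error(code, w, q, msg_probs=None):
    """Total variation between the induced output pmf and the product target q^n."""
    induced = induced_output(code, w, len(q), msg_probs)
    return 0.5 * sum(abs(pv - prod(q[b] for b in y)) for y, pv in induced.items())

bsc = [[0.9, 0.1], [0.1, 0.9]]       # binary symmetric channel, crossover 0.1
target = [0.5, 0.5]
print(covering_error([(0, 0)], bsc, target))                           # one codeword
print(covering_error([(0, 0), (0, 1), (1, 0), (1, 1)], bsc, target))   # all of X^2

noiseless = [[1.0, 0.0], [0.0, 1.0]]
print(covering_error([(0,), (1,)], noiseless, [0.3, 0.7]))              # uniform messages
print(covering_error([(0,), (1,)], noiseless, [0.3, 0.7], [0.3, 0.7]))  # non-uniform messages
```

The last two lines illustrate, in a single-shot toy case, why non-uniform messages can match a target exactly on a noiseless channel while two uniform messages cannot.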
Besides these two formulations, we introduce a variation based on the non-uniform case: in addition to assuming that the messages have a non-uniform distribution , we further impose an extra condition that the smallest nonzero probability among all messages is at least . This condition can be equivalently written as
| (4) |
according to Remark 5. Under this condition, the size of the message set need not be ; instead, . Moreover, given a rate , define
| (5) |
to be the set of all probabilities satisfying condition (4). Soft covering under this condition is referred to as -constrained formulation, and is given by the following definition.
Definition 11 (-constrained soft covering).
The reason for this new formulation of the soft covering problem is that, when , the conventional uniform formulation would lead to an inevitable discrepancy depending on whether ’s are rational or irrational. This discrepancy arises even in the simplest case, when the DMC is noiseless, because in this case, the uniform formulation attempts to approximate using only rational probabilities which are multiples of . Switching to the non-uniform formulation in Definition 10 can certainly eliminate this rational-irrational discrepancy, but alters the problem substantially and thereby shifts the error exponent relative to the uniform case. The -constrained formulation in Definition 11, on the other hand, resolves the discrepancy while still sharing an overlapping error-exponent region with the uniform formulation. A detailed discussion of this point is presented in Section III-B and Section V.
In information theory, proofs of achievability are commonly facilitated by the technique of random coding; accordingly, we also define the random-coding soft covering below. In this definition, the message set is uniformly distributed.
Definition 12 (Random-coding soft covering).
An (random-coding) soft-covering scheme consists of a message set with , where each message is drawn from the uniform distribution, and a random code , where each codeword is generated randomly from i.i.d. , i.e., for all . The induced output distribution is given by (1), and the covering error is defined as the expectation of the total variation between and :
where the expectation is taken with respect to i.i.d. .
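The expectation in the random-coding formulation can be approximated by Monte Carlo simulation for toy parameters. The sketch below is only illustrative: the function names, the seed, and the parameter values are our assumptions, and the brute-force evaluation of the total variation limits it to very small block lengths.

```python
import random
from itertools import product
from math import prod

def covering_error(code, w, q):
    """Total variation between the uniform-message induced output and q^n."""
    n = len(code[0])
    error = 0.0
    for y in product(range(len(q)), repeat=n):
        induced = sum(prod(w[x][b] for x, b in zip(cw, y)) for cw in code) / len(code)
        error += abs(induced - prod(q[b] for b in y))
    return error / 2

def random_coding_error(p_x, w, q, n, num_codewords, trials, seed=0):
    """Monte Carlo estimate of the expected covering error over i.i.d. p_x codebooks."""
    rng = random.Random(seed)
    symbols = range(len(p_x))
    total = 0.0
    for _ in range(trials):
        code = [tuple(rng.choices(symbols, weights=p_x, k=n)) for _ in range(num_codewords)]
        total += covering_error(code, w, q)
    return total / trials

bsc = [[0.9, 0.1], [0.1, 0.9]]
uniform = [0.5, 0.5]   # uniform input to a BSC yields a uniform output
for m in (2, 8, 32):
    print(m, random_coding_error(uniform, bsc, uniform, n=4, num_codewords=m, trials=200))
```

The estimate is an average over independently drawn codebooks, so individual codebooks may perform better or worse than the reported value.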
Remark 13.
The notation indicates that is the code corresponding to a message set with distribution . If is uniform, we omit the notation , as in Definitions 9 and 12 above. This convention will be used throughout the paper: specifically, every occurrence of or refers to the non-uniform formulation (including Definitions 10 and 11), whereas refers to the uniform formulation (including Definitions 9 and 12), unless otherwise specified.
Next, we define the error exponent (denoted by ) and the strong converse exponent (denoted by ) for the above four formulations of the soft covering problem.
Definition 14 (Error exponent and strong converse exponent for soft covering).
Remark 15.
It is clear that and , since the -constrained formulation is a special case of the non-uniform formulation, and the uniform formulation is a special case of the -constrained formulation.
III Main Results
This section summarizes the main results of this work, including the strong converse exponent in Section III-A and bounds for the error exponent in Sections III-B and III-C.
III-A Results on the strong converse exponent
To begin with, the following Theorem 17 characterizes the exact strong converse exponents of the soft covering problem in the uniform, non-uniform, and -constrained formulations, and shows that they are all identical.
Theorem 17 (The exact strong converse exponent).
The exact strong converse exponents for the uniform, non-uniform, and the -constrained soft-covering formulations coincide, and are given by
where
| (6) |
Theorem 17 is proved in Section IV-C. The quantity , defined in (6), involves two parameters, which is uncommon in the literature of error exponents. Similar exponents taking two parameters have appeared in the context of lossy source coding, e.g., [blahut1974hypothesis, jitsumatsu2025computation]; however, lossy source coding and soft covering address fundamentally different problems. The former employs a distortion function as a criterion for loss and directly evaluates the probability of error, whereas the latter centers on a forward channel and quantifies the error through a probabilistic divergence. Although the notion of covering is intrinsically revealed in source coding, the two settings are structurally distinct. Typically, when a channel is involved in the problem formulation, the error exponent takes the form of , where is the Rényi mutual information with respect to the given channel, with different domains of maximization over . This form is observed in both packing [gallager1965simple][sibson1969information][arimoto1973converse][arimoto1977information][augustin1978noisy][csiszar1995generalized][verdu2015alpha][verdu2021error] and covering (when ) [hayashi2006general][parizi2016exact][yassaee2019almost][yagli2019exact] problems. In this work, we give an achievability bound for random codes that exactly takes this form, as formally stated in Theorem 18.
Theorem 18 (A random coding achievability bound).
The strong converse exponent for the random coding formulation has the following upper bound.
| (7) | ||||
| (8) |
where is the -Rényi mutual information, defined in Definition 7.
See Appendix A for a proof of Theorem 18. On the other hand, a lower bound for is given in [cheng2023error, Thm. 3]:
where is the Rényi divergence in Definition 6. Hence, defining
it can be further bounded via
| (9) |
with
It is also noteworthy that provides an achievability bound for ; specifically, . Examples of the exact exponent (in Theorem 17) and the aforementioned random coding bounds and are illustrated in Figure 1, where the matrix entry represents . These examples include channels with fully noisy, fully noiseless, and hybrid input symbols (i.e., a mix of noisy and noiseless inputs). Figure 1 clearly shows that random coding is not tight in general in the converse regime of the soft covering problem: due to (9), the random-coding exponent must lie between and , while the exact exponent is lower and hence tighter.
III-B Results on error exponents for noiseless channels
The previous subsection reveals an intriguing phenomenon: when , random coding fails to achieve a tight strong converse exponent. A natural question is whether such non-tightness also arises in the error exponent regime for . In this subsection, we show that the answer is yes when is noiseless. The following Theorems 19, 20, 21, and 22 summarize our results on error exponents for noiseless channels, where exponents for different formulations are defined in Definition 14, with the superscript ‘nl’ indicating ‘noiseless’.
Theorem 19 (Converse for noiseless channels).
For noiseless channels under the uniform formulation, we have the following upper bound on .
Theorem 20 (Achievability for noiseless channels).
For noiseless channels under the uniform formulation, we have the following lower bound on .
Theorem 21 (Error exponent for non-uniform case).
For noiseless channels under the non-uniform formulation, the exact error exponent is given by
Theorem 22 (Error exponent for -constrained case).
For noiseless channels under the -constrained formulation, the exact error exponent is given by where is defined in Theorem 19.
Theorem 19 and Theorem 20 are proved in Section V-A; Theorem 21 is proved in Section V-C; and Theorem 22 is proved in Section V-D. Figure 2 presents plots of these proposed bounds or exponents as well as the random coding exponent (see Remark 16) using a binary and a ternary target output distribution, respectively. Under the non-uniform and the -constrained formulation, an exact error exponent is established. Under the uniform formulation, the error exponent lies between the two bounds and . However, these two bounds overlap in the vicinity of , which is given by the following proposition.
Proposition 23.
Let be the value where the function has a tangent line of slope . Then
Here, denotes the random-coding error exponent for noiseless channels and is given in Remark 16.
Proposition 23 is proved in Appendix D-b. In summary, for , all three curves – , , and – overlap, implying that random coding is tight. For , random coding is not tight; nevertheless, and still coincide, and an exact exponent exists via the construction of a deterministic code that is provided in the proof of Theorem 20. For , and no longer coincide. It is also noteworthy that diverges when ; see Lemma 38.
Assuming uniform messages is conventional in many problems in information theory. Nonetheless, in the soft covering problem, adhering to this uniform formulation highlights an interesting discrepancy between rational and irrational output distributions for noiseless channels. This is because in (1) can only take rational values as it is a multiple of . If is irrational in some symbols, then approximating by amounts to a Diophantine approximation, which necessarily incurs nonzero errors for arbitrarily large according to studies in number theory. We summarize the rational-irrational discrepancy in Theorem 25 and Theorem 26.
Definition 24.
Given a distribution , we say that is rational, denoted , if for all . Otherwise, we say that is irrational, denoted .
Theorem 25 (Linear converse for irrational ).
Suppose is noiseless. Then for every and almost every (in the sense of Lebesgue measure on ), we have .
Theorem 26 (Infinite achievability for rational ).
Suppose is noiseless and , say, with coprime for . Then when , where refers to the least common multiple of ’s among all .
Theorem 25 and Theorem 26 are proved in Section V-B. Evidently, the discrepancy concerns whether an infinite exponent, corresponding to perfect covering with no error, is achievable. For rational ’s, this is achievable at high rates, whereas for most irrational ’s, it is not. It is noteworthy that the linear converse in Theorem 25 applies to most irrational probabilities, which are dense in . However, it may not be possible to claim a universal converse for an arbitrary irrational . Let be an irrational symbol, i.e., . If is an irrational algebraic number, then the linear converse of holds, by virtue of the Thue-Siegel-Roth theorem [queffelec2013diophantine, Thm. 3.1.4]. On the other hand, can also be a ‘good’ irrational number, known as a Liouville number [queffelec2013diophantine, Def. 3.1.8], which admits infinitely many integer pairs such that ; consequently, it is not clear how to establish a linear converse.
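The rational-irrational discrepancy can be observed numerically in small cases. On a noiseless channel with $M$ equiprobable messages, the induced distribution has masses that are multiples of $1/M$, so the best covering error is the total variation distance from the target to the nearest point of this grid. The sketch below computes that distance by largest-remainder rounding, which we assume (by a standard exchange argument) to be optimal for the $\ell_1$ distance. The function names and the example targets are ours.

```python
from fractions import Fraction
from itertools import product
from math import floor, prod, sqrt

def product_pmf(q, n):
    """The n-fold product pmf of q, listed over all length-n sequences."""
    return [prod(t) for t in product(q, repeat=n)]

def best_grid_tv(target, num_messages):
    """Smallest total variation between target and a pmf whose masses are k/num_messages."""
    scaled = [p * num_messages for p in target]
    counts = [floor(s) for s in scaled]
    leftover = num_messages - sum(counts)   # codewords still to be assigned
    by_remainder = sorted(range(len(target)), key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:       # round up the largest fractional parts
        counts[i] += 1
    return sum(abs(p - Fraction(c, num_messages)) for p, c in zip(target, counts)) / 2

rational = [Fraction(1, 3), Fraction(2, 3)]
print(best_grid_tv(product_pmf(rational, 2), 9))   # 0: nine messages cover Q^2 exactly
print(best_grid_tv(product_pmf(rational, 2), 8))   # 5/72: eight messages cannot

r = 1 / sqrt(2)                                     # floating-point stand-in for an irrational mass
print(min(best_grid_tv([r, 1 - r], m) for m in range(1, 200)))
```

In the rational example, the denominators of $Q$ have least common multiple 3, and $3^2 = 9$ messages at block length 2 give zero error, whereas the irrational example leaves a strictly positive error for every tested message count.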
To eliminate such a discrepancy, must be allowed to also take irrational values; hence, messages cannot be uniformly distributed. Switching from the uniform to the non-uniform formulation indeed removes this discrepancy and yields an exact exponent. However, the error exponent under the non-uniform formulation deviates from both and for all , including in the low-rate regime near . In fact, for noiseless channels, soft covering under the non-uniform formulation is equivalent to lossless source coding (see Lemma 43 in Section V-C). In order to remove the rational-irrational discrepancy without substantially altering the feature of soft covering, we propose the -constrained formulation in Definition 11. Its exponent is exact, and is identical to by Theorem 22. Thus, in the neighborhood of , this new formulation aligns with the uniform formulation.
III-C Results on error exponents for noisy channels
More generally, in this subsection, we consider noisy channels. Note that in Theorem 20 becomes a straight line of slope 1 for large . Consequently, at high rates, this noiseless achievability can exceed the random-coding exponent whose slope is , thus demonstrating a high-rate improvement over random coding under the uniform formulation for noisy channels. This noisy achievability is summarized in Theorem 27. Furthermore, we provide a general converse bound for noisy channels under the non-uniform formulation in Theorem 28.
Theorem 27 (High-rate achievability).
For noisy channels under the uniform formulation, we have the following lower bound on .
Theorem 28 (Converse).
For noisy channels under the non-uniform formulation, we have the following upper bound on .
Theorem 27 is proved in Section VI-A and Theorem 28 is proved in Section VI-B. Figure 3 exhibits examples of these two proposed bounds as well as the random coding error exponent in Remark 16. As an immediate consequence of Theorem 26 and Theorem 27, we obtain Corollary 29 below, which demonstrates an achievable infinite error exponent for noisy channels, provided that some rational input distribution exists and the rate is sufficiently large.
Corollary 29 (Infinite achievability for noisy channels).
Suppose some is rational, say, with coprime for . Then when .
IV Strong Converse Exponent
In this section, we prove Theorem 17. The sketch of the proof is as follows. In Section IV-A, we establish a lower bound for , which serves as a converse result, while Section IV-B presents an achievability result that provides an upper bound for . Both bounds are expressed in variational forms in terms of the optimization over joint distributions. We convert them into dual representations in Section IV-C and show that they coincide when . Since , this coincidence characterizes an exact exponent for both uniform and non-uniform formulations. Additionally, in Appendix C, we provide an alternative proof of the converse part of Theorem 17 only for the uniform formulation following Arimoto’s techniques in [arimoto1973converse].
IV-A A lower bound for the strong converse exponent
In the following, we show a lower bound for for the non-uniform formulation, denoted by .
Proposition 30 (Converse).
The strong converse exponent in the non-uniform formulation can be bounded from below as follows:
where
| (10) |
Proof.
We use a hypothesis testing perspective, and apply the following well-known property of the total variation (e.g., [moser2019advanced, Def. 2.24]) in terms of decision regions:
| (11) |
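The decision-region view of the total variation can be verified by brute force on a small alphabet. The sketch below assumes the standard form of the property, namely that the total variation equals the maximum of $P(\mathcal{A}) - Q(\mathcal{A})$ over all regions $\mathcal{A}$, so that any particular region yields a lower bound. The distributions and names are illustrative.

```python
from itertools import chain, combinations

def total_variation(p, q):
    """Total variation distance, half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def region_gap(p, q, region):
    """P(A) - Q(A) for a decision region A given as a collection of indices."""
    return sum(p[i] - q[i] for i in region)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
indices = range(len(p))
all_regions = chain.from_iterable(combinations(indices, r) for r in range(len(p) + 1))
best_gap = max(region_gap(p, q, region) for region in all_regions)
likelihood_region = [i for i in indices if p[i] > q[i]]   # the maximizing region

print(total_variation(p, q), best_gap, region_gap(p, q, likelihood_region))
```

Because every region gives a lower bound, the proof is free to pick a structured region (here built from conditional type classes around the codewords) rather than the exact maximizer.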
We take a collection of conditional types around each codeword to create a set that could be used to discriminate from , as follows:
That is, restrict to the output type classes whose types are in the form of , where is the type of some codeword and . Here can be any subset of . Later, we will choose an appropriate subset to yield the optimal bound. When a codeword is fixed, so is its type. Then the -shells for different must be disjoint. We thereby write disjoint union in .
Consider a codeword whose type is , i.e., for some . We have
where in we restrict to the subset in which ; follows from the fact that ; and follow from properties of types. By averaging over all codewords, we obtain that
where follows from property of types and from . On the other hand,
where follows from property of types; in , the summation is missing because the non-emptiness of implies that can only take the unique value ; and follow from properties of types. Inserting the above expressions into (11), we obtain that
where the exponent is
It remains to choose a suitable . We choose it to be the relative-entropy-ball centered at with radius :
parameterized by . As , becomes dense in and boils down to
Consequently, the exponent converges as follows:
| (12) |
where is defined in (10). Note that (IV-A) is also parameterized by . We can further write
| (13) |
The reason why we put in (13) is that since the choice of is free, in principle we can select any non-negative and construct the corresponding as our . Hence, there exists some value for each that generates the maximal (thus the tightest) lower bound .
Furthermore, in (13), it might seem natural to write since describes a relative entropy ball. However, this is true only if can be those conditional distributions such that and under . To elaborate, our discussions and the derived bounds should be applicable to all channels. Without loss of generality, it is possible that the channel contains hybrid input symbols, meaning that for some and for others. If is supported only on those noiseless input symbols, then is deterministic and hence for any finite ,
The case implies that , i.e., is also noiseless when restricted to . As a result, , which is a constant independent of , as illustrated in Figure 4(a). Then in (13) we have
where can be observed from Figure 4(a). Interestingly, even though , in (13) it is still valid to write
| (14) |
If has support on at least one noisy input symbol, can thereby take continuous values. In this case, writing (14) is explicit and correct. According to Lemma 48 in Appendix D-a, is non-increasing in : it decreases over a certain interval and then remains constant. Therefore, is attained at a unique fixed point where , i.e., the intersection point of the curve and the straight line . Such an intersection can occur either in the decreasing regime of (Figure 4(b)) or in the constant regime (Figure 4(c)).
To summarize, for any supported on , (14) is always true. Furthermore, since (14) is achieved at the intersection point , we can equivalently write
| (15) |
from Figure 4. The reason for doing so is that the right-hand side of (15) facilitates a convenient derivation from variational forms to dual forms in Section IV-C. Substituting (14) and (15) into (13) completes the proof. ∎
The idea of proving Proposition 30 is closely related to hypothesis testing. We are in fact looking for a decision region such that and with some exponent . The average probability of testing error is . Then we optimize over to find the optimal decision region. Observing the expression in Proposition 30, the following properties of can be established, which are proved in Appendix D-c.
Lemma 31.
We have the following properties of :
(i) when . (ii) when .
IV-B An achievability bound for the strong converse exponent
In this subsection, we show an upper bound for (the uniform formulation), denoted by . We develop a novel deterministic code construction technique based on the type covering lemma. The key observation is that since we are operating below , we cannot afford codeword repetitions, while random coding produces repeated codewords, albeit rarely. In contrast, we only cover joint types with mutual information less than , and cover only once.
Proposition 32 (Achievability).
The strong converse exponent in the uniform formulation can be bounded from above as follows:
where is defined in Definition 2.
Proof.
Consider any joint type . Following our notation convention, we introduce the backward conditional type . Take any . (1) yields that
where follows from property of types. Furthermore, in we have defined
| (16) |
which is the number of codewords that lie in the -shell of . In other words, counts the number of codewords that have joint type with . Since according to property of types, we can evaluate the following ratio:
| (17) |
where the last equality follows from Lemma 3.
Using the identity , the total variation satisfies the following property:
| (18) |
Combining this and (17), we obtain that
| (19) |
where follows from property of types and (17); follows from the fact that the minimum of a sum is greater than or equal to the sum of the minima (equivalent to the triangle inequality); and follows from property of types. In , we artificially add an extra term to and then extract it outside the minimum. The reason for doing so is that is exponential in , so comparing it with 1 or any polynomial factor does not affect the exponential order.
For the purpose of designing a good code, we need to maximize (IV-B), and it suffices to make large (or at least nonzero) for all joint types. However, due to the low rate and hence the limited number of codewords, this can only be achieved for certain joint types, and even for those, a large would seem too greedy. A more reasonable approach is to set , meaning that there exists a single codeword that covers . The type covering lemma guarantees the existence of such a covering for all ’s, provided that the mutual information is less than the code rate.
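The selection rule described above, which targets only joint types whose mutual information is below the code rate, can be illustrated with a small sketch. It computes the joint type of a pair of sequences and its empirical mutual information in nats. The function names and the helper `is_covered` are hypothetical and introduced only for illustration; the sketch does not implement the type covering lemma itself.

```python
from collections import Counter
from math import log

def joint_type(x, y):
    """Empirical joint distribution (joint type) of two equal-length sequences."""
    n = len(x)
    return {pair: c / n for pair, c in Counter(zip(x, y)).items()}

def mutual_information(joint):
    """Mutual information (in nats) of a joint pmf given as {(a, b): prob}."""
    px, py = Counter(), Counter()
    for (a, b), v in joint.items():
        px[a] += v
        py[b] += v
    return sum(v * log(v / (px[a] * py[b])) for (a, b), v in joint.items() if v > 0)

def is_covered(x, y, rate):
    """Hypothetical rule: the joint type of (x, y) is targeted iff its MI is below the rate."""
    return mutual_information(joint_type(x, y)) < rate
```

For instance, the pair ("0011", "0101") has an independent joint type with zero mutual information and would be targeted at any positive rate, whereas ("0011", "0011") has mutual information log 2 and would be targeted only at rates above log 2.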
Based on the above analysis, we construct our code in the following way. Take any and examine all joint types : if its mutual information satisfies , then according to the type covering lemma (e.g., [moser2019advanced, Lem. 3.34]), there exists a covering code with codewords in that ensures for all the corresponding output sequences . We collect all codewords in this covering code into our code . It is noteworthy that such a collection may contain repeated codewords. Under this strategy, the number of codewords we have assigned so far is
for sufficiently large , where follows from property of types. The inequality confirms that is a valid code. Specifically, the codewords are drawn from a uniform distribution of . among them are used for the sake of covering all joint types satisfying ; as a result, for all ’s in those joint type classes. There are codewords left, which can be used to cover the other joint types. However, this fraction is small, and we apply a trivial lower bound of zero on it. To sum up, in (IV-B) we have
for all . Consequently, (IV-B) simplifies to
where follows from property of types. Taking and completes the proof. ∎
Lemma 33.
If , then .
IV-C Dual forms of and and their equality
So far, we have obtained a lower bound for and an upper bound for , stated in Propositions 30 and 32, respectively. Figure 5 presents examples of these two bounds, using the same channel models as in Figure 1. Observing Figure 5, a natural question arises: do the proposed two bounds coincide when ? If they do, then an exact exponent can be tightly squeezed out since . In this subsection, we show that the answer is yes by reformulating and into their dual forms in terms of defined by (6). These dual forms are presented in Propositions 34 and 35, respectively.
Proposition 34.
has the following dual form:
| (20) | ||||
| (21) |
Proposition 35.
has the following dual form:
Proposition 34 is proved in Appendix B-a and Proposition 35 is proved in Appendix B-b. Comparing the dual forms of and , we establish their equality in the following Proposition 36.
Proposition 36.
when .
Proposition 36 is proved in Appendix B-c. Equipped with propositions in this section, the proof of the main result, Theorem 17, follows from straightforward logical reasoning.
Proof of Theorem 17.
When , by Proposition 36, .
When , by Proposition 30, Proposition 34, and Lemma 46, . On the other hand, according to the soft covering lemma, e.g., [moser2019advanced, Ch. 19], under the uniform formulation there exists a code that can make for sufficiently large , yielding that : a zero strong converse exponent is achievable and thus . Ergo, .
To sum up, we can conclude that for all . Combining it with Proposition 34 completes the proof. Since is now exact, rather than merely a lower bound, we drop the underline and denote it by to simplify notation. ∎
V Error Exponent for Noiseless Channels
In this section, we discuss the soft-covering error exponent when is noiseless: for each , there exists a symbol such that . Then we have for all , , and for some function .
Remark 37 (Choice of alphabet).
If is a bijection, then, without loss of generality, we assume that . If is a surjection, for each , in the achievability proof we may select a single representative and construct an alphabet consisting of these representatives and hence satisfying . In the converse proof, given any code , we first proceed with a restriction of symbols: if , we replace every occurrence of in the codewords with . Since both and map to the same output symbol , this modification leaves the output sequences unchanged and, consequently, does not affect the induced distribution or . Under this restriction, the code alphabet set reduces to some such that and can be regarded as bijective. Thus, in both cases, one can always identify an input alphabet set such that for encoding.
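The symbol restriction in Remark 37 can be sketched in a few lines. The sketch below assumes the noiseless channel is given as a dictionary `f` from input symbols to output symbols; the names `restrict_alphabet` and `outputs`, and the choice of the smallest input symbol as the representative, are assumptions made for illustration.

```python
def restrict_alphabet(codewords, f):
    """Replace each input symbol by a fixed representative with the same output f(x)."""
    representative = {}
    for x in sorted(f):
        representative.setdefault(f[x], x)  # smallest input symbol for each output
    return [tuple(representative[f[x]] for x in cw) for cw in codewords]

def outputs(codewords, f):
    """Output sequences produced by the deterministic channel f."""
    return [tuple(f[x] for x in cw) for cw in codewords]

f = {0: "a", 1: "a", 2: "b"}          # hypothetical surjection: inputs 0 and 1 collide
code = [(0, 1, 2), (1, 1, 0)]          # hypothetical code
restricted = restrict_alphabet(code, f)
```

In this toy case the restricted code uses only the symbols 0 and 2, and the output sequences of the original and restricted codes coincide, so the induced output distribution is unchanged.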
V-A Uniform formulation
Under a noiseless channel, the output distribution (1) reduces to
| (22) |
where counts the number of codewords mapping to . We begin by showing the converse result stated by Theorem 19, which is valid for both rational and irrational .
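To make the structure of (22) concrete, the following sketch computes the output distribution induced by a uniformly chosen codeword over an identity channel and compares it with a product target in total variation. Exact rational arithmetic is used so that the multiples of 1/M are visible. The target distribution, the code, and all function names are hypothetical choices for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def induced_distribution(codewords):
    """Output pmf induced by uniformly chosen codewords over an identity channel."""
    m = len(codewords)
    return {y: Fraction(c, m) for y, c in Counter(codewords).items()}

def product_pmf(p, n):
    """n-fold product of a single-letter pmf p given as {symbol: Fraction}."""
    out = {}
    for seq in product(sorted(p), repeat=n):
        prob = Fraction(1)
        for s in seq:
            prob *= p[s]
        out[seq] = prob
    return out

def total_variation(p, q):
    """Half the L1 distance between two pmfs given as dictionaries."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

target = product_pmf({0: Fraction(3, 4), 1: Fraction(1, 4)}, 2)
code = [(0, 0), (0, 0), (0, 1), (1, 0)]  # hypothetical code with M = 4 codewords
induced = induced_distribution(code)
tv = total_variation(induced, target)
```

Here every induced probability is an integer multiple of 1/4, and the resulting total variation is 1/8.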
Proof of Theorem 19.
When , it is relatively easy to approximate large probabilities but harder to approximate small ones, in particular those smaller than the step size in (22). This leads us to consider the following ‘bad’ set:
| (23) |
For any , we have
An error of value is unavoidable for all . In this sense, is regarded as ‘bad’, producing errors no matter what code is applied. As a result,
where follows from property of types. Taking completes the proof of the variational form.
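The unavoidable error can be checked numerically on a toy example. The sketch below relies only on the elementary fact that, under the uniform formulation, each output probability must be a multiple of 1/M, so the per-sequence error is at least the distance from the target probability to the nearest such multiple. It does not reproduce the exact threshold used in (23); the function names and the example target are assumptions.

```python
from fractions import Fraction

def nearest_grid_error(p, m):
    """Smallest |p - k/m| over integers k >= 0 (k/m is an achievable uniform-code mass)."""
    k = int(p * m)  # floor, since p >= 0
    return min(abs(p - Fraction(k, m)), abs(p - Fraction(k + 1, m)))

def unavoidable_l1_error(target, m):
    """Lower bound on the L1 distance to the target for any uniform code with m codewords."""
    return sum(nearest_grid_error(p, m) for p in target.values())

# Hypothetical target: two-fold product of (3/4, 1/4), with M = 4 codewords.
target = {
    (0, 0): Fraction(9, 16), (0, 1): Fraction(3, 16),
    (1, 0): Fraction(3, 16), (1, 1): Fraction(1, 16),
}
bound = unavoidable_l1_error(target, 4)
```

For a sequence whose probability is at most 1/(2M), the nearest multiple of 1/M is zero, so the error equals the probability itself. In the toy example the bound on the L1 distance is 1/4, that is, a total variation of at least 1/8.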
To obtain the dual form, consider the following.
where follows from the convexity of the optimization problem; follows directly from (46); and follows by setting . ∎
It is noteworthy that is finite only in a certain region, as specified precisely in the following Lemma 38, which is proved in Appendix D-e.
Lemma 38.
when , and when .
Next, we prove the achievability result stated by Theorem 20, which is also valid for both rational and irrational . Towards proving it, we need the following lemma.
Lemma 39.
When , we have , where
Lemma 39 is proved in Appendix D-f. Recall (22) and observe that a strategy for constructing a good code is to make so that can approximate as closely as possible. More precisely, to establish the achievability result in Theorem 20, we show that there exists a code in which differs from either or by at most a polynomial quantity in .
Proof of Theorem 20.
Taking inspiration from the proof of Proposition 32 regarding the achievability of the strong converse exponent, we provide a deterministic code construction for the problem at hand. Let us first assume that . Define
| (24) |
is actually a ‘good’ set in the sense that all sequences in satisfy that . Thus, it is always preferable to cover these sequences, since doing so will yield nonzero values that can align with . In fact, ignoring the factor of two, is approximately the complement of the ‘bad’ set defined in (23). Take any . The size of is bounded by
where follows from property of types. In , the maximization is over all types, so it differs from the maximization over distributions (namely as defined in Lemma 39) by at most for sufficiently large .
To design a good code, we adopt the alphabet choice described in Remark 37. With this choice, represents the number of codewords mapped one-to-one to , and hence can be directly constructed via a repetition of codewords. Our goal is to design so that it satisfies under the constraint that . Namely, consider the following two codes:
Clearly , meaning that we can use to cover all sequences and there will be some codewords left. Note that is not covered since for all . If , then after applying , each sequence in can be covered at most once more. Hence, there exists a code as follows.
If , then after applying , each sequence in can still be covered more than once. Equivalently, we use to cover , and there will be some codewords left. The number of those codewords that are left is
where and follow from properties of types, and is defined in Lemma 39. Now let us consider using these codewords to continue covering after is applied. We can cover each sequence in by more times, which is at most
for all sufficiently large . Here we have used when from Lemma 39. Hence, there exists a code as follows, with :
According to the above discussions, either of or is possible and has exactly codewords. Therefore, we can summarize that for all and sufficiently large , there exists a code as follows, with :
Choose to be our covering code, which yields that
Consequently, we have
where and follow from properties of types. The exponent is exactly as and .
Note that all the above discussions are based on the assumption that . If , we can just apply a trivial bound and the error exponent is . Since also vanishes for , it is thus an achievable error exponent for values both below and above . Hence, the variational form in the theorem is proven.
The idea behind this proof is actually a quantization of with a minimal gap of . For all sequences in the good set , the resulting error is . For sequences outside , each probability is less than so we choose not to cover them, and the error equals their probability. The key step is to show that such a construction can indeed be realized under the constraint that the number of codewords is exactly .
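The quantization idea can be illustrated with a simplified stand-in for the construction above. The sketch rounds M times each target probability down and then hands the leftover codewords to the sequences with the largest remainders, so that exactly M codewords are used and every induced probability is within 1/M of its target. This largest-remainder rule is an assumption made for illustration and is not the two-stage covering used in the proof; the function name is hypothetical.

```python
from fractions import Fraction

def quantize_counts(target, m):
    """Integer repetition counts summing to m, each within one of m * p(y)."""
    counts = {y: int(p * m) for y, p in target.items()}  # floor(m * p)
    leftover = m - sum(counts.values())
    # Hand the leftover codewords to the sequences with the largest remainders.
    by_remainder = sorted(target, key=lambda y: target[y] * m - counts[y], reverse=True)
    for y in by_remainder[:leftover]:
        counts[y] += 1
    return counts

# Hypothetical target pmf and codebook size.
target = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
counts = quantize_counts(target, 7)
```

With these inputs the counts are 4, 2, and 1, which sum to 7.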
V-B Rational-irrational discrepancy under the uniform formulation
Recalling (22), under the uniform formulation, always takes the form . This gives rise to the rational-irrational discrepancy. In the following, we state Khintchine’s theorem and use it to prove the linear converse in Theorem 25.
Lemma 40 (Khintchine’s theorem [bugeaud2004approximation, Thm. 1.10], [queffelec2013diophantine, Thm. 3.3.2]).
Let be a continuous function such that is non-increasing and that . Then for almost every (in the sense of full Lebesgue measure in ), we have for only finitely many .
Proof of Theorem 25.
Let be an irrational symbol, i.e., . Define to be the set of output sequences in which appears in the first position. According to (11),
| (25) |
where
The quantity is defined in (22) and is an integer. Hence, and , and (25) describes a problem of Diophantine approximation. Taking , it follows immediately from Lemma 40 that
for all sufficiently large , since representing the number of channel uses can take infinitely many integers. ∎
Alternatively, if is rational, for high rates a perfect covering can be achieved, as stated in Theorem 26. This is because can always be an integer for sufficiently large , making a valid code construction. The detailed proof is as follows.
Proof of Theorem 26.
Given , we have
When , there always exist sufficiently large and such that is a multiple of for each and as . Therefore, we can employ the alphabet choice in Remark 37 and set a repetition number . ∎
Example 41.
Consider a ternary output . Then when .
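The perfect covering available for rational targets can be verified with exact arithmetic. The sketch below uses a hypothetical ternary distribution (1/2, 1/3, 1/6), which is not necessarily the one in Example 41, and takes M to be the n-th power of the common denominator. This choice of M is a convenient assumption and need not be the smallest one that works. It requires Python 3.9 or later for the variadic `math.lcm`.

```python
from fractions import Fraction
from itertools import product
from math import lcm

def perfect_counts(p, n):
    """Repetition counts m * p(y^n) for the n-fold product of a rational pmf p."""
    d = lcm(*(v.denominator for v in p.values()))  # common denominator of p
    m = d ** n                                      # hypothetical number of codewords
    counts = {}
    for seq in product(sorted(p), repeat=n):
        prob = Fraction(1)
        for s in seq:
            prob *= p[s]
        counts[seq] = prob * m
    return m, counts

p = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
m, counts = perfect_counts(p, 2)
```

Every count is an integer and the counts sum to M = 36, so repeating each output sequence the indicated number of times reproduces the product target exactly.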
V-C Non-uniform formulation
If the messages are non-uniformly distributed, then in the achievability proof, one must address not only how to construct a good covering code, but also how to select the optimal message distribution. It turns out that an effective choice is to treat as a source and apply lossless source coding, with the decoding index serving as the message index in the covering setting. More generally, after formulating the lossless source coding problem in Definition 42, we show the equivalence between noiseless soft covering and lossless source coding in Lemma 43.
Definition 42 (Lossless source coding).
Let with . Consider a discrete memoryless source subject to i.i.d. . An lossless source coding scheme consists of an encoder and a decoder , with denoting the probability of decoding error.
Lemma 43 (Equivalence between noiseless soft covering and lossless source coding).
Let be a bijective function. Consider a noiseless channel .
-
(a)
For any lossless source coding scheme, there exists an non-uniform soft-covering scheme such that .
-
(b)
For any non-uniform soft-covering scheme, there exists an lossless source coding scheme such that .
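The direction in part (a) can be illustrated on a toy source. The sketch below assumes one natural message distribution, namely the source distribution restricted to the correctly decoded set and renormalized; the construction actually used in the proof (Appendix D-g) is not reproduced here and may differ. Under this assumption, the total variation to the source equals the decoding error probability. All names and the example source are hypothetical.

```python
from fractions import Fraction

def covering_from_source_code(source, encoded):
    """Message pmf obtained by renormalizing the source pmf on the encoded set."""
    mass = sum(source[y] for y in encoded)
    return {y: source[y] / mass for y in encoded}, 1 - mass

def total_variation(p, q):
    """Half the L1 distance between two pmfs given as dictionaries."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

source = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
message_pmf, decoding_error = covering_from_source_code(source, ["a", "b"])
tv = total_variation(message_pmf, source)
```

In this example both the decoding error and the total variation equal 1/4.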
Proof of Theorem 21.
We first show achievability. Follow the alphabet choice in Remark 37 so that is a bijection. To apply Lemma 43, we need to specify a lossless source coding scheme for the i.i.d. source . Take any and define
Clearly , meaning that we can establish a one-to-one label for every sequence in using at most codewords. Therefore, there exists a lossless source coding scheme, where all sequences in can be correctly decoded and hence with . According to Lemma 43(a), there exists an non-uniform soft-covering scheme such that
where follows from property of types. Taking and completes the proof of achievability.
For the converse, given any covering code, by alphabet restriction in Remark 37, can be regarded as bijective. Hence, according to Lemma 43(b), for any non-uniform soft-covering scheme, there exists a corresponding lossless source coding scheme such that
where follows from the converse exponent of the lossless source coding, e.g., [csiszar2011information, Thm. 2.15]. Taking and completes the proof of the converse.
So far, we have proven the variational form, which is exactly the error exponent of the lossless source coding problem. The dual form thus follows immediately from [csiszar2011information, Problem 2.15]. ∎
V-D -constrained formulation
Proof of Theorem 22.
We begin with the converse. In fact, the proof of the converse is identical to that of Theorem 19 in Section V-A. The ‘bad’ set defined in (23) applies here as well, because in the -constrained formulation, the minimal probability in the message set is also at least . All output sequences with probability contribute to the covering error. Ergo, in Theorem 19, which serves as a converse for the uniform formulation, also constitutes a converse for the -constrained formulation, even though the latter is a more general formulation that encompasses the former.
For the achievability, we follow the alphabet choice in Remark 37 so that is a bijection. Similar to the non-uniform case, we consider a lossless source coding scheme for , where only sequences with probability greater than are encoded. This can be interpreted as a lossless source coding scheme where the rate is measured by of the encoded messages, instead of , which is equal to the logarithm of the size of the message set. Those sequences are in fact from the good set defined in (24). The corresponding source coding scheme is given by
| (26) |
Clearly, all sequences in contribute to the decoding error. According to Lemma 43(a), there exists an non-uniform soft-covering scheme such that
where follows from property of types. The exponent here is exactly as . However, directly from Lemma 43(a), the soft-covering coding scheme here is under the non-uniform formulation. We need to show that the construction in (26) can indeed result in a soft-covering coding scheme under the -constrained formulation. Specifically, the corresponding message distribution in (66) (see Appendix D-g) needs to satisfy for all . First, the definition of in (24) ensures that for . It remains to verify that . Since , it is equivalent to check . By (65) in Appendix D-b, when is finite, we have for some optimizer such that . Therefore, . This completes the proof. ∎
VI Error Exponent for Noisy Channels
This section addresses soft covering error exponents for noisy channels. We prove Theorem 27 and Theorem 28.
VI-A Achievability
Proof of Theorem 27.
The noiseless soft-covering result can be carried over to noisy channels by covering the input distribution instead of the output distribution. This leads to a high-rate improvement in the achievable error exponent compared with random coding. The output distribution in (1) can be written as
where is the code-induced input distribution. Pick any . Then . We have
where follows from the data processing inequality of the total variation, and follows from Theorem 20. Here is the noiseless bound of the covering problem. Hence, we can follow the code construction in the proof of Theorem 20 in Section V-A and then cover the input distribution . An optimal bound is further generated by maximizing over . ∎
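The data processing step used in this proof can be checked numerically: passing two input distributions through the same channel cannot increase their total variation. The sketch below draws a random channel and two random input distributions with a fixed seed; the dimensions and function names are arbitrary assumptions.

```python
import random

def push_through(p, channel):
    """Output pmf when input pmf p is sent through a row-stochastic channel matrix."""
    n_out = len(channel[0])
    return [sum(p[x] * channel[x][y] for x in range(len(p))) for y in range(n_out)]

def total_variation(p, q):
    """Half the L1 distance between two pmfs given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_pmf(k, rng):
    """Random pmf on k points obtained by normalizing uniform weights."""
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
channel = [random_pmf(4, rng) for _ in range(3)]  # hypothetical 3-input, 4-output channel
p, q = random_pmf(3, rng), random_pmf(3, rng)
tv_in = total_variation(p, q)
tv_out = total_variation(push_through(p, channel), push_through(q, channel))
```

The output-side quantity `tv_out` never exceeds the input-side quantity `tv_in`, which is why a good covering of the input distribution yields an achievable bound for the output.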
VI-B Converse
Proof of Theorem 28.
Consider any code . We claim that there exists a type such that
| (27) |
To see this, suppose for all types . Then we have
for sufficiently large , thereby leading to a contradiction. Hence, (27) is proven, which implies that even though is not necessarily a constant composition code, it has a constant composition subset that includes most probable codewords.
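The existence of a dominant type follows from a pigeonhole argument, which the sketch below illustrates for a uniformly weighted code. Since the number of types of length-n sequences is at most (n+1) raised to the alphabet size, some type must contain at least the reciprocal fraction of the codewords. The exact threshold in (27) is not reproduced here; the function names and the example code are assumptions.

```python
from collections import Counter

def type_of(seq, alphabet):
    """Type (vector of symbol counts) of a sequence over the given alphabet."""
    c = Counter(seq)
    return tuple(c[a] for a in alphabet)

def dominant_type(codewords, alphabet):
    """Most common type among the codewords and the fraction of codewords having it."""
    types = Counter(type_of(cw, alphabet) for cw in codewords)
    best, count = types.most_common(1)[0]
    return best, count / len(codewords)

codewords = ["001", "010", "100", "111", "011"]  # hypothetical binary code, n = 3
best, fraction = dominant_type(codewords, "01")
```

Here the dominant type has two zeros and one one, and it contains three of the five codewords, well above the pigeonhole guarantee of 1/16.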
We follow (11) and set
That is, we restrict attention to codewords with type and -shell in , where is some subset of that will be properly chosen later. Following the proof of Proposition 30, we take a codeword and can obtain that
where follows from property of types. Averaging over all codewords further yields that
where follows from (27). On the other hand, similarly to the calculation of in the proof of Proposition 30, here we have
Inserting the above expressions into (11), it is then reasonable to choose
and hence
Taking and , then boils down to
| (28) | ||||
| (29) |
where Lemma 3 is used and thus we have acquired a bound
| (30) |
which contains an unconstrained maximization over all input distributions. Clearly . Furthermore, take any ; then and hence (28) implies that . Ergo, for any , we have . As a result, the maximization in (30) only occurs at those . To be consistent with our notations ( associated with and associated with ), we rewrite (30) as
Moreover, Lemma 3 implies that
where the inequality is simply the data processing inequality of the relative entropy. Consequently, when is taken, (29) reduces to
which completes the proof of the variational form in the theorem.
From Theorem 28, we can summarize the following properties of the proposed converse bound , which are proved in Appendix D-h.
Lemma 44.
We have the following properties of :
(i) when . (ii) when .
(iii) when .
VII Conclusion
In the present work, we have characterized the exact strong converse exponent of the classical soft covering for rates below the mutual information. This exponent is expressed using a two-parameter information quantity that, to the best of our knowledge, has not been studied in the literature on error exponents with respect to a given channel. A promising direction for future work is a deeper investigation into the implications, properties, and potential applications of this new quantity.
Moreover, this work reveals that the conventional random coding bound is generally not tight in the achievability regime of rates above the mutual information, and that the traditional formulation assuming uniformly distributed words in the code can inherently lead to a rational–irrational discrepancy: even in the noiseless channel case, the reliability exponent diverges at sufficiently high rate for target distributions with all rational values, whereas for typical irrational probability values the exponent remains finite for all rates. Future work includes designing a well-behaved deterministic code for noisy channels that outperforms random coding in both high-rate and low-rate regimes, or even an optimal code that achieves the exact error exponent. Our observations also raise an important open question: can a similar rational–irrational discrepancy arise in other information-theoretic settings?
Appendix A A Random Coding Achievability for the Strong Converse Exponent
In this appendix, we prove Theorem 18. We start with the following lemma.
Lemma 45.
Let be a binomial random variable and . We have
Proof.
Write
| (31) |
and consider the following two cases.
Now we prove Theorem 18 using the random coding strategy.
Proof of Theorem 18.
We first prove the variational form (7). Take any and generate a random code with , where each codeword for any is drawn from i.i.d. . Consider any joint type . Similar to the proof of Proposition 32, introduce the backward conditional type . We define
| (32) | ||||
| (33) |
The values of and are uniquely determined by the joint type , independently of the particular sequence . Take any . Then we can write
| (34) |
where is defined in (16). Under random coding,
| (35) |
where follows from Lemma 45 because according to its definition (16); and we identify in from (34). Furthermore, from (32) and (33) we have
where and follow from general properties of types. Then (A) simplifies to
Taking completes the proof of (7).
Next, we prove the dual form (8). Pick any and introduce the corresponding joint distribution . In line with our notation convention, we also define the associated backward channel . For any other distribution , it is straightforward to verify the following identities.
| (36) | ||||
| (37) |
where is defined in Definition 2. It is noteworthy that satisfies the following property [yagli2019exact, Cor. 1], which can be viewed as a corollary of (46):
| (38) |
for any . Now consider the following chain of transformations:
where follows from (36) and (37); follows from the minimax theorem [rockafellar1970convex, Thm 36.3]: the expression is convex in and linear in so we can swap and ; and follows from (38). In , the minimax theorem is invoked again: it is evident that the expression is convex in ; however, its concavity in is not obvious. We need to return to the right-hand side of : for fixed , it is linear in (a trivial case of being concave), so the minimization over yields a concave function of . is a direct result of (46); follows from Definition 7; and is obtained by setting . This completes the proof of (8). ∎
Appendix B Dual forms of the Strong Converse Bounds and
In this appendix, we establish the dual forms of the strong converse bounds (in Proposition 30) and (in Proposition 32). That is, we prove Proposition 34 and Proposition 35 that are stated in Section IV-C.
B-a Proof of Proposition 34
Proof.
We first show (20) in Proposition 34. Fix and . The function defined in (10) describes a convex optimization of : the objective function is , which is convex in ; the inequality constraint is also convex in ; and the equality constraint is , which is linear in . Therefore, , as the optimal value over , equals its Lagrangian dual due to the strong duality [boyd2004convex, Ch. 5]. Explicitly,
where follows from the minimax theorem: the objective function is convex in and linear in , so we can swap and . follows from Lemma 3. Plugging this expression of into Proposition 30 yields that
Here follows from choosing ; this is a valid choice because gives . follows from defining and (and clearly ). follows again from the minimax theorem: the objective function is convex in and linear in , so we can swap and . Finally, we merge and together as . This completes the proof of (20).
Next, we prove (21) in Proposition 34. Observe that is convex in due to the joint convexity of . Then for , the expression is also convex in . Moreover, is linear in . Hence, the objective function of is convex in and linear in , so by the minimax theorem, we can swap and and obtain that
| (39) | ||||
| (40) |
where we have defined
| (41) |
which is a convex optimization over and can be solved using the method of the Lagrange multiplier. However, directly plugging in the derivative leads to an equation too complicated to solve. Instead, we employ the following variational form of the mutual information :
| (42) |
Therefore, (41) can be rewritten as
| (43) |
Now, to optimize over , we can introduce Lagrange multipliers for each and write the Lagrangian as
Substituting
into gives that
further generating the following solution of the optimizer :
| (44) |
with some -dependent normalization factor
Hence, (43) reduces to
For later convenience, we identify the optimizer here. The minimizer that yields (B-a) is
| (47) |
Recall that we introduced merely to apply the variational form of mutual information (42), so the minimizer in (B-a) is exactly the marginal of the minimizer of (39). Hence, fixing and , if in (6) yields an optimizer (which must satisfy (63) in Appendix C), then we can plug in (44) and (47) and obtain the corresponding optimizer of (39) as
| (48) |
Lemma 46.
We have the following properties of :
(i) when . (ii) when .
Proof.
For simplicity, write with
Given , let optimize , i.e., . Setting gives . Hence, is non-negative for all . (i) follows immediately from Lemma 31 and that .
For (ii), suppose not. Then and hence plugging any combination of such that into the objective function of will yield a non-positive value. First, plugging gives that , so . Second, plugging and combining it with Lemma 3 gives that , so and hence . Finally, plugging gives that . To sum up, we obtain that with , contradicting the condition that . ∎
B-b Proof of Proposition 35
Proof.
Given any , define
| (49) | ||||
| (50) |
That is, let denote the objective function in and let be one of the optimizers. We will claim that .
We first show this is true when vanishes. Suppose . Then and . However, Lemma 3 gives that , so , and hence and . Due to the continuity of , we have . By Lemma 33, we further have .
Next, we show that also holds in their positive regimes. In the rest of our discussions, we may assume that . We further write as
| (51) |
where we have used to denote the objective function. It is shown in the proof of Proposition 34 that is convex and is linear (a trivial case of being concave). Hence, according to the minimax theorem, the optimization in (B-b) can be achieved at its saddle points (there might exist some other non-saddle points that also produce the optimal value, but at least one saddle point exists). Now, suppose we have a saddle point, denoted by . It must satisfy [rockafellar1970convex, Lem. 36.2]
| (52) |
It should be noted that the saddle points belong to both minimax () and maximin () solutions. Now we make the following claim:
Claim 1. If , then .
Proof of claim. If , then must give , otherwise the first inequality in (52) is violated as . Since this saddle point is also the minimax solution, then in (50) we can take and hence : the minimization in (49) can always occur at a point where , so can reduce to a constrained optimization and thus .
It remains to discuss the case in which . Comparing (B-b) and (20), we can immediately write from (21) that (note that the condition is not used in the proof of (21)).
| (53) |
Since the saddle point belongs to the maximin solutions, must be the maximizers of (53), and correspondingly the marginal of belongs to the set of optimizers of in (6). Specifically, is in the set
| (54) |
which is a linear optimization over . Here is the set of optimal symbols and is the set of input distributions supported on .
Since , we can first exclude two trivial cases: or 1. If , then (53) gives , while (49) implies that . Thus, this case can never happen. Similarly, if , then (53) gives , while we have assumed the strict positivity of . In conclusion, as long as , we have and can make the following claim.
Claim 2. A saddle point must satisfy that .
Proof of claim. This follows directly from the first inequality in (52).
In the rest of our discussions, we may assume that . In most circumstances, may contain only one symbol, and the corresponding maximizer of (54) is a deterministic distribution. More generally, however, may contain multiple symbols, and all distributions supported on these symbols are maximizers of (54); the marginal of the saddle point is only one of those distributions. Let us consider this general situation: if contains multiple symbols, we take any two symbols, say, . Denote
where is the -Rényi divergence in Definition 6. Then we have , meaning that these two functions of intersect at . Since they are both analytic, in the neighborhood of , they either overlap or intersect only at . The former indicates a certain symmetry in and (for example, is a binary symmetric channel and is uniform). We define that and belong to the same class, if for all in the neighborhood of . We now conduct our discussions based on how many classes contains, which can be classified into the following three cases.
Case 1. contains only one class, and there is only one symbol in that class.
In this case, basically contains a single symbol, say, for some . Then , so (54) has a unique solution which is a deterministic distribution. Since the saddle marginal is in this solution set, we must have ; however, this gives , contradicting in Claim 2. Ergo, there exists no saddle point in the form of . The saddle point must have . We are back to Claim 1 and hence .
Case 2. contains only one class, and there is more than one symbol in that class.
In this case, take any two symbols, say, . Then and are locally the same function. We can further take their derivatives at and write . As a result, the expression
| (55) |
will be independent of symbol choice in that class, i.e., (55) is identical for all . Then, take any (so is an optimizer in ) and its corresponding joint optimizer for is given by (48) (under and ). The expression
will be independent of . Since the saddle point also has marginal and satisfies according to Claim 2, we must have for all . Note that contains the deterministic distribution (which yields a zero mutual information) and the saddle point (which yields by Claim 2). There must exist some such that . Plugging this in (50) gives
This means the minimization in (49) can occur at a point where , so .
Case 3. contains more than one class.
In this case, we cannot deduce much about , and hence about the corresponding . However, let us perturb in its neighborhood, say, to some . Due to the existence of only finitely many symbols, in the neighborhood of , there are only finitely many intersections of ’s for ’s from different classes, so must contain only one class. Take any symbol, say, from this class. Then we can write as
for some in the neighborhood of . Note that this is a strictly concave maximization due to the strict concavity of (if is linear in , then the maximizer must be 0 or 1). Thus, this new is uniquely determined by . If the corresponding saddle point at has , then we are back to Claim 1. If it has , then that saddle point must be , and we are back to Case 1 or Case 2 and can obtain that . In summary, for all (where ) in the neighborhood of , we have . Ergo, holds almost everywhere in the neighborhood of . On the other hand, and are both continuous due to the Berge maximum theorem [aliprantis2006infinite, Item 17.31] (note that in , is a continuous and compact correspondence), so also holds at .
To sum up, holds everywhere, in both vanishing and positive regimes. The dual form of follows immediately from (53). This completes the proof. ∎
B-c Proof of Proposition 36
Proof.
We first show that both and are convex in . Combining Propositions 34 and 35, both and take the dual form
| (56) |
Their only difference is the optimization range in . For a fixed combination of , the objective function in (56) is linear in (a trivial case of being convex). Then after maximizing over , (56) is convex in .
Next, we show that both and are monotone decreasing when . Combining convexity and Lemma 33, is convex in and positive when . As grows, remains convex and must vanish when reaches . Ergo, must be monotone decreasing in when . Similar arguments can be established for by combining its convexity and Lemma 46.
Finally, to prove our conclusion, it suffices to show that when , the optimization over in occurs at . Let and the optimizers in be , i.e., . Suppose . Take and we have
meaning that is locally increasing in , which contradicts the decreasing property that we just showed. Hence, the optimizers for must satisfy that . ∎
Appendix C A Different Approach to Proving the Converse Part of Theorem 17
Based on Arimoto’s techniques in [arimoto1973converse], we provide a different proof of the converse part of Theorem 17 for the uniform formulation. We begin with the following lemma about the additivity of .
Lemma 47 (Additivity of ).
Proof.
We follow the arguments in [gallager1965simple, Thm. 4] and [arimoto1973converse, Lem. 1]. According to (6), is defined via solving the following optimization problem:
| (57) |
The Lagrangian of this problem is
Since , problem (57) is convex. The KKT conditions are necessary and sufficient for the optimizer and multipliers [boyd2004convex, Ch. 5.5.3]. Specifically, letting be the optimizers, we have
| (58) |
| (59) |
| (60) |
where
| (61) |
Multiplying (60) by and summing over gives
| (62) |
where we have plugged in (58) and (59). Comparing (59), (60), and (62), must satisfy
| (63) |
with equality for every due to complementary slackness (59).
If satisfies (63), then also satisfies the -shot version of (63) (replacing by and by ). To see this, inserting in (61) gives
and hence (63) becomes
which is true as long as the inequality holds for each in the product. Considering that (63) is a necessary and sufficient condition for the optimizer, our conclusion is thus proven. ∎
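The product step in this proof can be illustrated numerically. Since the quantity in (61) is not reproduced here, the sketch below uses Gallager's classical sum as a stand-in (an assumption, in the spirit of the cited [gallager1965simple, Thm. 4]) and checks only that it factorizes when both the input distribution and the channel are products. The channel matrix, input distribution, and function name are hypothetical.

```python
import numpy as np


def gallager_sum(p, w, rho):
    """Compute sum_y ( sum_x p(x) * w(y|x)^(1/(1+rho)) )^(1+rho).

    p: input distribution, shape (|X|,)
    w: channel matrix with rows indexed by x and columns by y
    """
    inner = p @ (w ** (1.0 / (1.0 + rho)))
    return float(np.sum(inner ** (1.0 + rho)))


# Toy single-letter input distribution and channel (assumptions).
p = np.array([0.3, 0.7])
w = np.array([[0.9, 0.1],
              [0.2, 0.8]])
rho = 0.5

# Two-letter product input and product channel via Kronecker products:
# kron(w, w)[(x1, x2), (y1, y2)] = w[x1, y1] * w[x2, y2].
p2 = np.kron(p, p)
w2 = np.kron(w, w)

single = gallager_sum(p, w, rho)
double = gallager_sum(p2, w2, rho)
```

Here `double` agrees with `single ** 2` up to floating-point error, which mirrors the statement that the condition holds for the product as long as it holds for each factor.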
Now we prove the converse part of Theorem 17 for the uniform formulation, i.e., we show .
Proof.
For simplicity, we first narrow down our discussions to the 1-shot case (). Let be the optimal code that achieves the minimal total variation using codewords, i.e.,
Then there exists a probability distribution over codes, say , that satisfies the following two properties.
(i) The expected total variation under equals the minimal total variation:
(ii) The marginal distribution of each codeword is identical, i.e.,
is the same for each .
One example of such a is [arimoto1973converse]:
Take any with , then we have the following:
where follows from (18). holds because when . applies Jensen’s inequality to the concave function . follows from the relation for and . follows from the second property of . Extending this expression to the -shot case, our conclusion immediately follows from Lemma 47. ∎
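Properties (i) and (ii) can be checked on a toy instance. The sketch below assumes the code distribution is uniform over all orderings of the optimal code's codewords, which is the symmetrization used in [arimoto1973converse]; the exact formula of the example in the proof is not reproduced here. It also assumes uniform messages, so that the induced output is the average of the channel rows selected by the codewords. The channel, target, and code are hypothetical.

```python
import itertools
from collections import Counter

import numpy as np

# Toy channel with 3 input symbols (rows) and 2 output symbols (columns).
w = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
target = np.array([0.5, 0.5])   # toy target output distribution
code = (0, 2, 2)                # toy code: three codewords, one repeated


def total_variation(codewords):
    """Total variation between the induced output and the target,
    with messages uniform over the codeword positions."""
    induced = w[list(codewords)].mean(axis=0)
    return 0.5 * float(np.abs(induced - target).sum())


# Uniform distribution over all orderings of the codeword positions.
orderings = list(itertools.permutations(code))

# Property (i): every ordering has the same total variation, so the
# expectation equals the total variation of the original code.
tvs = [total_variation(c) for c in orderings]

# Property (ii): the marginal of each codeword position is the same,
# namely the empirical distribution of the original codewords.
marginals = [Counter(c[m] for c in orderings) for m in range(len(code))]
```

The induced output depends on the code only through the multiset of its codewords, which is why reordering leaves the total variation unchanged.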
Appendix D Proofs of Some Lemmas and Propositions
D-a Properties of
Lemma 48.
Fix any and . For simplicity, write , which is defined in (10). Then is a continuous, non-increasing function of . It is strictly decreasing on the interval for some and constant on .
Proof.
For convenience, define the objective function in to be Then .
We first show continuity. Clearly, being the maximum of two continuous functions, is continuous. Also, the constraint defines a continuous, compact, and non-empty correspondence (i.e., given any , the feasible set is a continuous, compact, and non-empty set of ). By the Berge maximum theorem [aliprantis2006infinite, Item 17.31], is continuous.
Next, the non-increasing property of is straightforward: the feasible set expands as grows, and thus the minimization value can never increase. When , the only feasible is and then . On the other hand, note that is bounded when , so when , we have
which is a constant. Therefore, on .
Now, before moving to the decreasing property on , we prove the convexity of . Take . Suppose that the optimizers for and are and , respectively; that is,
For any , take and . By the convexity of , we have . Now since , we further have
| (64) |
where follows from the convexity of . To see this, observe that both and are convex in . Thus, being the maximum of two convex functions, is also convex. (64) thereby indicates the convexity of .
Equipped with convexity and the non-increasing monotonicity, must be strictly decreasing on the interval . It is also noteworthy that on this interval, the optimization of is achieved at the boundary where . Similar arguments can be found in the proof of [csiszar2011information, Cor. 10.4]. ∎
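The shape established in Lemma 48 (non-increasing, convex, strictly decreasing up to a threshold and constant afterwards) is typical of a minimization whose feasible set grows with the parameter. The following toy sketch illustrates that shape only. The objective `(x - 2)^2` and the feasible set `0 <= x <= r` are assumptions chosen for simplicity and are unrelated to the quantity in (10).

```python
import numpy as np

# Grid over which the toy minimization is carried out (assumption).
GRID = np.linspace(0.0, 5.0, 5001)


def value(r):
    """Toy value function: minimum of (x - 2)^2 over 0 <= x <= r."""
    feasible = GRID[GRID <= r + 1e-12]
    return float(np.min((feasible - 2.0) ** 2))


rates = np.linspace(0.0, 4.0, 41)           # r = 0.0, 0.1, ..., 4.0
values = np.array([value(r) for r in rates])

first_diffs = np.diff(values)               # <= 0 for a non-increasing function
second_diffs = np.diff(values, 2)           # >= 0 for a convex function
```

In this toy case the minimizer sits on the boundary `x = r` while `r < 2`, matching the remark that the optimization is attained at the boundary on the strictly decreasing interval, and the value stays at zero once the unconstrained minimizer `x = 2` becomes feasible.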
D-b Proof of Proposition 23
Proof.
We only show the expression for , while that for follows from analogous reasoning. We apply the method in [csiszar2011information, Cor. 10.4] and first show the convexity of . Let be the optimizer for and , respectively, i.e.,
Take any . Note that is linear in . Then implies that for the unnormalized distribution . Thus,
where follows from the convexity of . So far, we have shown that is convex in .
Moreover, observe that is non-decreasing in and non-negative. Due to its convexity, is strictly increasing in in the interval where it is finite and positive. Then the optimization of must be achieved at the boundary:
| (65) |
Hence,
where follows from the fact that is monotone increasing. ∎
D-c Proof of Lemma 31
Proof.
Let be the minimizer of , i.e., .
For (i), clearly is non-negative. Recall the expression of in (10). Take . Then for all , because is the optimizer. Then , and hence .
For (ii), suppose not. Owing to (15), there exists some such that for all . Taking , the only optimizer is , and thus , indicating that and . This contradicts the condition . ∎
D-d Proof of Lemma 33
Proof.
Since is continuous, when , we can write for some . Taking , we have because . On the other hand, is non-negative, so . ∎
D-e Proof of Lemma 38
D-f Proof of Lemma 39
Proof.
Let . Note that the optimization here has an equality constraint .
Observe that is concave in , with its maximum attained at . However, the feasible set of does not contain . Then the maximization of occurs at the boundary of the linear equation , yielding that
Similarly, is also concave in , but the feasible set for expands with . This implies that the maximizer for lies either on the boundary or at the uniform distribution . In summary, or , and therefore . ∎
D-g Proof of Lemma 43
Proof.
For , let denote the joint distribution of the source sequence , the reconstruction sequence , and the random index taking values in the set ; namely,
Its marginal on determines a message distribution
| (66) |
We choose as the message distribution, and as the covering code . Since is a bijection, is well-defined. The induced distribution in (2) then becomes
Here is the marginal of . Also, note that is naturally the marginal of . We obtain that
where the inequality follows from the monotonicity of the total variation.
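The monotonicity used in the last step says that passing to a marginal cannot increase the total variation. A minimal numerical sketch on randomly generated joint distributions (toy data, not the distributions of the proof) is:

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance between two distributions on the same alphabet."""
    return 0.5 * float(np.abs(p - q).sum())


rng = np.random.default_rng(1)

# Two toy joint distributions on a 4 x 3 alphabet (assumptions).
joint_p = rng.random((4, 3))
joint_p /= joint_p.sum()
joint_q = rng.random((4, 3))
joint_q /= joint_q.sum()

tv_joint = total_variation(joint_p, joint_q)

# Marginalize out the second coordinate of both distributions.
tv_marginal = total_variation(joint_p.sum(axis=1), joint_q.sum(axis=1))
```

The inequality `tv_marginal <= tv_joint` follows from the triangle inequality applied to each row sum, which is the discrete form of the data-processing property of the total variation.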
For , let be the covering code. Define
We choose our encoder and decoder in the lossless source coding scheme to be
It is noteworthy that the given soft covering code may contain repeated codewords. As a result, for some , there exists more than one index such that , and we just select one representative index among them as our source coding encoder . Consequently, only the source sequences that coincide with some codeword can be reconstructed correctly. We obtain that
Observe that the lossless source coding scheme contains indices (some of them may be unused if contains repeated codewords), and the rate is for sufficiently large . ∎
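The construction in the second half of this proof can be sketched as follows. The encoder and decoder definitions of the proof are not reproduced here, so the sketch assumes the natural choice: the decoder maps an index to its codeword, and the encoder maps a source sequence to the first index carrying that codeword, with an arbitrary fallback index for sequences outside the codebook. All names and the toy data are hypothetical.

```python
def build_source_code(codebook):
    """Build (encoder, decoder) from a covering codebook that may contain repeats.

    encoder: sequence -> representative index (first occurrence wins)
    decoder: index -> codeword
    """
    encoder = {}
    for index, word in enumerate(codebook):
        encoder.setdefault(word, index)  # later duplicates stay unused
    decoder = dict(enumerate(codebook))
    return encoder, decoder


def error_probability(source_pmf, encoder, decoder, fallback=0):
    """Probability that the decoded sequence differs from the source sequence."""
    error = 0.0
    for sequence, prob in source_pmf.items():
        index = encoder.get(sequence, fallback)
        if decoder[index] != sequence:
            error += prob
    return error


# Toy codebook with a repeated codeword and a toy source distribution.
codebook = ["aa", "ab", "aa"]
source_pmf = {"aa": 0.5, "ab": 0.25, "ba": 0.125, "bb": 0.125}

encoder, decoder = build_source_code(codebook)
p_err = error_probability(source_pmf, encoder, decoder)
```

In this toy instance index 2 is never used by the encoder, and the error probability equals the source probability of the sequences absent from the codebook.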
D-h Proof of Lemma 44
Proof.
Fix and define
Let be its optimizer, i.e., with . When , we have and . When , note that , so and hence . When , does not exist; we can only apply a trivial bound and the exponent is .
Given that , the above turning points of , namely the one from zero to a nonzero value and the one from a finite to an infinite value, are obtained by minimizing over . ∎