Heterogeneous Causal Effects

This document proposes a general framework for testing the validity of instruments in heterogeneous causal effect models. It generalizes previous testable implications to allow for multivalued or unordered treatments and covariates. A nonparametric test statistic is constructed based on these implications. The test is shown to be asymptotically valid and consistent. Technical issues are addressed through an extended continuous mapping theorem and delta method. Bootstrap resampling is used to approximate the asymptotic distribution and construct critical values, improving power over conservative prior tests. The framework allows testing instrument validity in a variety of settings.


Instrument Validity for Heterogeneous Causal Effects

arXiv:2009.01995v1 [econ.EM] 4 Sep 2020

Zhenting Sun∗
National School of Development, Peking University

September 7, 2020

Abstract

This paper provides a general framework for testing instrument validity in heterogeneous
causal effect models. We first generalize the testable implications of the instrument validity
assumption provided by Balke and Pearl (1997), Imbens and Rubin (1997), and Heckman and
Vytlacil (2005). The generalization involves the cases where the treatment can be multivalued
(and ordered) or unordered, and there can be conditioning covariates. Based on these testable
implications, we propose a nonparametric test which is proved to be asymptotically size controlled
and consistent. Because of the nonstandard nature of the problem in question, the test statistic
is constructed based on a nonsmooth map, which causes technical complications. We provide an
extended continuous mapping theorem and an extended delta method, which may be of
independent interest, to establish the asymptotic distribution of the test statistic under null. We then
extend the bootstrap method proposed by Fang and Santos (2018) to approximate this asymptotic
distribution and construct a critical value for the test. Compared to the test proposed by Kitagawa
(2015), our test can be applied in more general settings and may achieve power improvement.
Evidence that the test performs well on finite samples is provided via simulations. We revisit the
empirical study of Card (1993) and use their data to demonstrate application of the proposed test
in practice. We show that a valid instrument for a multivalued treatment may not remain valid if
the treatment is coarsened.

Keywords: Instrument validity, heterogeneous causal effects, general nonparametric test, power
improvement, extended continuous mapping theorem, extended delta method

∗ This article is a revised version of the first chapter of the author’s doctoral thesis at UC San Diego. I am deeply grateful
to Brendan K. Beare, Zheng Fang, Andres Santos, Yixiao Sun, and Kaspar Wüthrich for their constant support on this paper.
I thank Shengtao Dai, Tongyu Li, and Xingyu Li for their excellent work as research assistants. I also thank Roy Allen, Qihui
Chen, Asad Dossani, Graham Elliott, Ivan Fernandez-Val, Wenzheng Gao, James D. Hamilton, Jungbin Hwang, Toru Kitagawa,
Sungwon Lee, Juwon Seo, and all seminar participants for their insightful suggestions and comments.

1 Introduction
The local average treatment effect (LATE) framework, introduced by the seminal works Imbens and
Angrist (1994) and Angrist et al. (1996), is a commonly used approach in studies of instrumental
variable (IV) models with treatment effect heterogeneity. The local quantile treatment effect (LQTE)
is a concept similar to LATE. While LATE shows the treatment effect on the mean of the outcome,
LQTE is more informative in regard to the effect on the outcome distribution.1 These causal effect
models rely on several strong and sometimes controversial assumptions of IV validity: 1) The instru-
ment should not affect the outcome directly; 2) it should be as good as being randomly assigned;
and 3) it affects the treatment in monotone fashion. Violations of these conditions can generally lead
to inconsistent treatment effect estimates. Relevant surveys and discussion of this can be found in
Angrist and Pischke (2008), Angrist and Pischke (2014), Imbens (2014), Imbens and Rubin (2015),
Koenker et al. (2017), Melly and Wüthrich (2017), and Huber and Wüthrich (2018). Since the
plausibility of the analyses of such models depends on IV validity, economics research has developed
methods to examine these conditions based on testable implications.
This paper provides a general framework for testing such IV validity assumptions. We first gen-
eralize the testable implications obtained by Balke and Pearl (1997), Imbens and Rubin (1997), and
Heckman and Vytlacil (2005) for binary treatments. The generalization includes the cases where the
treatment can be multivalued (and ordered) or unordered, and conditioning covariates may exist.2
Then based on these testable implications, we propose a nonparametric test which can easily be
applied in practice.
Kitagawa (2015) was the first paper to propose a test of IV validity in heterogeneous causal
effect models based on the testable implications in the literature. The test, constructed using a
bootstrap method, is for binary treatments. It was shown to be asymptotically uniformly size con-
trolled and consistent. Since the bootstrap critical value converges to a number larger than the
1 − α quantile of the asymptotic distribution of the test statistic over a large region of the null, the
test could be conservative. Mourifié and Wan (2017) reformulated as conditional inequalities the
testable implications used in Kitagawa (2015). Then they showed that these inequalities could be
tested in the intersection bounds framework of Chernozhukov et al. (2013) using the Stata package
provided by Chernozhukov et al. (2015). The test is also for binary treatments and could be conser-
vative as well. It restricts the support of the outcome variables to be compact, ruling out the case
where outcomes can be unbounded. Huber and Mellace (2015) derived a testable implication for
a weaker LATE identifying condition, that is, that the potential outcomes are mean independent of
instruments, conditional on each selection type. However, the condition of potential outcomes being
mean independent of instruments is not sufficient if we are concerned with distributional features
of a complier’s potential outcomes, such as the quantile treatment effects for compliers; see Abadie
et al. (2002) for details. The focus of the present paper is on full statistical independence of potential
outcomes and instruments.

1 See, for example, studies of LQTE in Abadie (2002), Ananat and Michaels (2008), Cawley and Meyerhoefer (2012),
Frölich and Melly (2013), and Eren and Ozbeklik (2014).
2 Studies of LATE with binary treatments can be found in Angrist (1990), Angrist and Krueger (1991), and Vytlacil (2002).
Those with multivalued treatments can be found in Angrist and Imbens (1995), Angrist and Krueger (1995), and Vytlacil
(2006). Identification of causal effects in unordered choice (treatment) models can be found in Heckman et al. (2006),
Heckman and Vytlacil (2007), Heckman et al. (2008), and Heckman and Pinto (2018).

The null hypothesis for the testable implications used in Kitagawa (2015) consists of a set of
inequalities. The reason why the test proposed by Kitagawa (2015) could be conservative is that
they used an upper bound on the asymptotic distribution of the test statistic under null to construct
the bootstrap critical value. The upper bound is identical to the asymptotic distribution when all the
inequalities in the null are binding. In the study described in the present paper, we solve a technical
issue and establish the pointwise asymptotic distribution of the test statistic under null. Then we
construct the critical value based on this asymptotic distribution, rather than on an upper bound,
and therefore improve the power of the test.
A modified variance-weighted Kolmogorov–Smirnov (KS) test statistic is employed in our test.
As mentioned by Kitagawa (2015), variance-weighted KS statistics have been widely applied in
the literature on conditional moment inequalities, such as in Andrews and Shi (2013), Armstrong
(2014), Armstrong and Chan (2016), and Chetverikov (2018). More general KS statistics can be
found in the stochastic dominance testing literature, such as in Abadie (2002), Barrett and Donald
(2003), Horváth et al. (2006), Linton et al. (2010), Barrett et al. (2014), and Donald and Hsu
(2016).
There are two major complications in deriving and approximating the asymptotic distribution of
the test statistic under null. First, the test statistic involves a nonsmooth (nondifferentiable) map
of unknown parameters (underlying probability distributions), and the delta method fails to work.
We provide an extended continuous mapping theorem and an extended delta method, which might
be of independent interest, to overcome this difficulty. By showing that the conditions of the ex-
tended delta method are satisfied under several weak assumptions, we establish the null asymptotic
distribution of the test statistic. Second, since the null asymptotic distribution involves a nonlinear
function, the standard bootstrap method may fail to approximate this distribution consistently. Dis-
cussion of this issue can be found in Dümbgen (1993), Andrews (2000), Hirano and Porter (2012),
Hansen (2017), Fang and Santos (2018), and Hong and Li (2018). To achieve a consistent
approximation, we extend the bootstrap approach proposed by Fang and Santos (2018)3 and provide a
valid bootstrap critical value. The test is found to be asymptotically size controlled and consistent.
Evidence that the test performs well on finite samples is provided via simulations.
We now introduce the following notation, which will be used throughout the paper. We let
$\rightsquigarrow$ denote Hoffmann–Jørgensen weak convergence in a metric space. For a set D, denote the
space of bounded functions on D by $\ell^\infty(D)$:

$\ell^\infty(D) = \{ f : D \to \mathbb{R} : \|f\|_\infty < \infty \}$, where $\|f\|_\infty = \sup_{x \in D} |f(x)|$.

3 Other applications of this bootstrap method can be found in Beare and Moon (2015), Beare and Fang (2017), Seo (2018),
Beare and Shi (2019), and Sun and Beare (2019). A similar bootstrap approach can be found in Hong and Li (2018).

If D is a topological space, let C (D) denote the set of continuous functions on D:

C (D) = {f : D → R : f is continuous} .

2 Setup and Testable Implications

2.1 Binary Treatment


To formally introduce the topic of interest, we first consider the heterogeneous causal effect model
of Imbens and Angrist (1994). Let Y ∈ R be the observable outcome variable, and let D ∈ {0, 1}
be the observable treatment variable, where D = 1 indicates that an individual receives treatment.
Let Z ∈ {0, 1} be a binary instrumental variable. Let Ydz ∈ R be the potential outcome variable4
for D = d and Z = z, where d, z ∈ {0, 1}. Similarly, let Dz be the potential treatment variable
for Z = z. The instrument validity assumption for binary treatment and binary IV is formalized as
follows.

Assumption 2.1 IV validity for binary D and binary Z:

(i) Instrument Exclusion: With probability 1, Yd0 = Yd1 for each d ∈ {0, 1}.

(ii) Random Assignment: The variable Z is jointly independent of (Y00 , Y01 , Y10 , Y11 , D0 , D1 ).

(iii) Instrument Monotonicity: The potential treatment response indicators satisfy D1 ≥ D0 with
probability 1.

Assumption 2.1 is from Imbens and Rubin (1997), but it does not require strict instrument mono-
tonicity. In this paper, we are not concerned with the strict monotonicity assumption, which is also
known as the instrument relevance assumption.5
Let (Ω, A, P) be a probability space on which all random elements are well defined. Let BRm
denote the Borel σ-algebra on Rm for all m ∈ N. For all Borel sets B and C, we follow Kitagawa
(2015) and define probability measures as follows:6

P1 (B, C) = P (Y ∈ B, D ∈ C|Z = 1) and P0 (B, C) = P (Y ∈ B, D ∈ C|Z = 0) .

Under Assumption 2.1(i), we can define a potential outcome variable Yd such that Yd = Yd0 = Yd1
almost surely. Imbens and Rubin (1997) showed that for every Borel set B,

P1 (B, {1}) − P0 (B, {1}) = P (Y1 ∈ B, D1 > D0 )


and P0 (B, {0}) − P1 (B, {0}) = P (Y0 ∈ B, D1 > D0 ) . (1)
4 See Rubin (1974) and Splawa-Neyman et al. (1990) for further discussion of the potential outcomes.
5 As mentioned by Kitagawa (2015), the instrument relevance assumption can be assessed by inferring the coefficient in
the first-stage regression of D onto Z.
6 For simplicity of notation, we implicitly assume that (Y, D, Z) is $(\mathcal{A}, \mathcal{B}_{\mathbb{R}^3})$-measurable.

To see why (1) is true, we can write

P1 (B, {1}) − P0 (B, {1}) = P (Y ∈ B, D = 1|Z = 1) − P (Y ∈ B, D = 1|Z = 0)


= P (Y1 ∈ B, D1 = 1) − P (Y1 ∈ B, D0 = 1) = P (Y1 ∈ B, D1 = 1, D0 = 0) ,

where the second equality follows from Assumptions 2.1(i) and 2.1(ii) and the third equality follows
from Assumption 2.1(iii). Similar reasoning gives the second equation in (1). Since the probabilities
in (1) are nonnegative, we obtain the testable implication of Assumption 2.1 in Balke and Pearl
(1997), Imbens and Rubin (1997), and Heckman and Vytlacil (2005): For all B ∈ BR ,

P1 (B, {1}) − P0 (B, {1}) ≥ 0 and P0 (B, {0}) − P1 (B, {0}) ≥ 0. (2)

To understand (2) graphically, suppose that Y is a continuous variable and that pz (y, d) is the
derivative of the function Pz ((−∞, y], {d}) with respect to y for all d, z ∈ {0, 1}. The following
graphs show a case where (2) holds.

Figure 1: A special case satisfying testable implication (2). Panel (a) plots p1 (y, 1) and p0 (y, 1) for y between −2 and 2, with P1 (B, {1}) > P0 (B, {1}); panel (b) plots p0 (y, 0) and p1 (y, 0) for y between −2 and 2, with P0 (B, {0}) > P1 (B, {0}).

The first inequality in (2) is shown in Figure 1a, where the derivative p1 (y, 1) is greater than
p0 (y, 1) everywhere. The second inequality in (2) is shown in Figure 1b, where the derivative
p0 (y, 0) is greater than p1 (y, 0) everywhere. Additional graphical examples can be found in Kitagawa
(2015).
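
As an informal illustration of implication (2), the following Python sketch computes the largest empirical violation of the two inequalities over closed intervals B whose endpoints are observed outcomes. The function names and the toy data are hypothetical and are not part of the paper; a positive value in a finite sample may reflect sampling noise only, which is why a formal test statistic and critical value are needed.

```python
from itertools import combinations_with_replacement

def cond_prob(sample, lo, hi, d, z):
    """Empirical P(Y in [lo, hi], D = d | Z = z) from (y, d, z) triples."""
    n_z = sum(1 for (_, _, zi) in sample if zi == z)
    if n_z == 0:
        return 0.0
    hits = sum(1 for (yi, di, zi) in sample
               if zi == z and di == d and lo <= yi <= hi)
    return hits / n_z

def max_violation_binary(sample):
    """Largest empirical violation of (2) over closed intervals [lo, hi]
    with endpoints at observed outcomes; 0.0 means no violation was found."""
    ys = sorted({yi for (yi, _, _) in sample})
    worst = 0.0
    for lo, hi in combinations_with_replacement(ys, 2):
        # (2) requires P1(B,{1}) - P0(B,{1}) >= 0 and P0(B,{0}) - P1(B,{0}) >= 0,
        # so the violations are the reversed differences.
        v1 = cond_prob(sample, lo, hi, 1, 0) - cond_prob(sample, lo, hi, 1, 1)
        v0 = cond_prob(sample, lo, hi, 0, 1) - cond_prob(sample, lo, hi, 0, 0)
        worst = max(worst, v1, v0)
    return worst

# Hypothetical toy data: triples (y, d, z).
toy_valid = [(0.5, 0, 0), (1.5, 0, 0), (2.5, 1, 0), (3.5, 0, 0),
             (0.5, 0, 1), (1.5, 1, 1), (2.5, 1, 1), (3.5, 1, 1)]
# Swapping the instrument labels reverses the direction of the inequalities.
toy_flipped = [(y, d, 1 - z) for (y, d, z) in toy_valid]

print(max_violation_binary(toy_valid))
print(max_violation_binary(toy_flipped))
```

The toy data are chosen so that the first sample satisfies both inequalities on every interval, while the relabeled sample does not.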

2.2 Multivalued (and Ordered) Treatment


Section 2.1 discussed the case where the treatment and the instrument are both binary. In many
applications, D and Z can be multivalued. See, for example, Angrist and Imbens (1995), where the
treatment variable is the number of years of schooling completed by a student and can take more
than two values. Now suppose that D ∈ D = {d1 , d2 , . . .} and Z ∈ Z = {z1 , z2 , . . . , zK }. We let dmax
be the maximum value of D if it exists, and dmin the minimum value of D if it exists. Suppose there
exist potential variables Ydz for d ∈ D and z ∈ Z, and Dz for z ∈ Z. Then the IV validity assumption
for multivalued treatment D and multivalued instrument Z is formalized as follows.

Assumption 2.2 IV validity for multivalued D and multivalued Z:

(i) Instrument Exclusion: With probability 1, Ydz1 = Ydz2 = · · · = YdzK for all d ∈ D.

(ii) Random Assignment: The variable Z is jointly independent of (Ỹ , D̃), where

Ỹ = (Yd1 z1 , . . . , Yd1 zK , Yd2 z1 , . . . , Yd2 zK , . . .) and D̃ = (Dz1 , Dz2 , . . . , DzK ) .

(iii) Instrument Monotonicity: The potential treatment response variables satisfy Dzk+1 ≥ Dzk with
probability 1 for all k ∈ {1, 2, . . . , K − 1}.

Assumption 2.2 is similar to Assumptions 1 and 2 of Angrist and Imbens (1995). Since we allow mul-
tivalued Z, the monotonicity assumption needs to hold for each pair (Dzk , Dzk+1 ). The next lemma
establishes a testable implication of Assumption 2.2 when the treatment variable has a maximum
value and/or a minimum value.

Lemma 2.1 A testable implication of Assumption 2.2 is that for all k with 1 ≤ k ≤ K − 1, all Borel
sets B, and all C = (−∞, c] with c ∈ R, the following hold:

P (Y ∈ B, D = dmax |Z = zk ) ≤ P (Y ∈ B, D = dmax |Z = zk+1 ) if dmax exists


and P (Y ∈ B, D = dmin |Z = zk ) ≥ P (Y ∈ B, D = dmin |Z = zk+1 ) if dmin exists; (3)
P (D ∈ C|Z = zk ) ≥ P (D ∈ C|Z = zk+1 ) . (4)

Lemma 2.1 generalizes testable implication (2) to the case where the treatment and the instrument
can both be multivalued. The testable implication (first-order stochastic dominance) discussed by
Angrist and Imbens (1995) for Assumption 2.2 is equivalent to (4). Clearly, if D and Z are both
binary as assumed in Section 2.1, with dmax = 1 and dmin = 0, then (3) is equivalent to (2) and (4)
is implied by (3). To the best of our knowledge, (3) is new in the literature.
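
The stochastic dominance condition (4) can be inspected informally in a sample by comparing empirical distribution functions of D across consecutive instrument values. The sketch below is illustrative only: the function name and the toy schooling data are hypothetical, and it assumes every instrument value is observed at least once.

```python
def cdf_gaps(pairs, z_values):
    """pairs: list of (d, z); z_values: instrument values in increasing order.
    For each consecutive pair (z_k, z_{k+1}), return the largest empirical value of
    P(D <= c | Z = z_{k+1}) - P(D <= c | Z = z_k) over observed cutoffs c.
    Under (4) the population counterpart is never positive."""
    cuts = sorted({d for (d, _) in pairs})

    def cdf(c, z):
        ds = [d for (d, zi) in pairs if zi == z]
        return sum(1 for d in ds if d <= c) / len(ds)

    gaps = []
    for z_lo, z_hi in zip(z_values, z_values[1:]):
        gaps.append(max(cdf(c, z_hi) - cdf(c, z_lo) for c in cuts))
    return gaps

# Hypothetical toy data: years of schooling d and a three-valued instrument z.
toy_dz = [(8, 1), (10, 1), (12, 1), (12, 1),
          (10, 2), (12, 2), (12, 2), (16, 2),
          (8, 3), (8, 3), (12, 3), (16, 3)]

print(cdf_gaps(toy_dz, [1, 2, 3]))
```

In this toy sample the step from the first to the second instrument value is consistent with (4), while the step from the second to the third is not.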

2.3 Unordered Treatment


Studies of identification of causal effects in unordered choice (treatment) models can be found in
Heckman et al. (2006), Heckman and Vytlacil (2007), and Heckman et al. (2008). Heckman and
Pinto (2018) showed that the assumptions7 in the preceding literature could be relaxed, and they
defined a new monotonicity condition for the identification of causal effects in such models. We
follow Heckman and Pinto (2018) and suppose that the support D of D is an unordered set with
D = {d1 , d2 , . . . , dJ } and that the support Z of Z with Z = {z1 , . . . , zK } can be unordered as well.
The unordered monotonicity condition proposed by Heckman and Pinto (2018) is as follows.

Assumption 2.3 The potential treatment response indicators satisfy the condition that for all d ∈ D
and all z, z ′ ∈ Z, 1 {Dz′ = d} ≥ 1 {Dz = d} almost surely or 1 {Dz′ = d} ≤ 1 {Dz = d} almost surely.
7 See Heckman and Pinto (2018, pp. 2–3) for a discussion of these assumptions.

It is worth noting that in Assumption 2.3, D is allowed to be a vector random element. In the case
where D, Z ∈ {0, 1}, Assumption 2.3 is equivalent to the assumption that 1 {D1 = 1} ≥ 1 {D0 = 1}
almost surely or 1 {D1 = 1} ≤ 1 {D0 = 1} almost surely. According to the context of the issue of
interest, we can prespecify a set C ⊂ D × Z × Z and assume that 1 {Dz′ = d} ≤ 1 {Dz = d} almost
surely for all (d, z, z ′ ) ∈ C, which is similar to Assumption 2.1(iii). With this monotonicity condition,
we introduce the IV validity assumption for unordered treatment.

Assumption 2.4 IV validity for unordered D and unordered Z:

(i) Instrument Exclusion: With probability 1, Ydz = Ydz′ for all d ∈ D and all z, z ′ ∈ Z.

(ii) Random Assignment: The random element Z is jointly independent of (Ỹ , D̃), where

Ỹ = (Yd1 z1 , . . . , Yd1 zK , Yd2 z1 , . . . , Yd2 zK , . . . , YdJ z1 , . . . , YdJ zK ) and D̃ = (Dz1 , Dz2 , . . . , DzK ) .

(iii) Instrument Monotonicity: The potential treatment elements satisfy 1 {Dz′ = d} ≤ 1 {Dz = d}
with probability 1 for all (d, z, z ′ ) ∈ C.

Under this assumption, we can define Yd by Yd = Ydz almost surely for all z, and hence

P (Y ∈ B, D = d|Z = z ′ ) = E[1{Yd ∈ B} · 1{Dz′ = d}]


≤ E[1{Yd ∈ B} · 1{Dz = d}] = P (Y ∈ B, D = d|Z = z)

for all Borel sets B and all (d, z, z ′ ) ∈ C.

Lemma 2.2 A testable implication of Assumption 2.4 is given by

P (Y ∈ B, D = d|Z = z ′ ) ≤ P (Y ∈ B, D = d|Z = z) (5)

for all Borel sets B and all (d, z, z ′ ) ∈ C, where C is a prespecified subset of D × Z × Z.
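
To make the role of the prespecified set C in (5) concrete, the following sketch evaluates the empirical counterpart of (5) for a list of (d, z, z′) triples with unordered labels. All names and data are hypothetical, and the search over intervals with observed endpoints is a simplification adopted for this illustration.

```python
def violation_unordered(sample, restrictions):
    """sample: list of (y, d, z) with arbitrary (unordered) labels d and z.
    restrictions: iterable of (d, z, z_prime) triples for which (5) asserts
    P(Y in B, D = d | Z = z_prime) <= P(Y in B, D = d | Z = z).
    Returns the largest empirical value of LHS - RHS over closed intervals B
    with endpoints at observed outcomes (0.0 if none is positive)."""
    ys = sorted({y for (y, _, _) in sample})

    def prob(lo, hi, d, z):
        n_z = sum(1 for (_, _, zi) in sample if zi == z)
        hits = sum(1 for (yi, di, zi) in sample
                   if zi == z and di == d and lo <= yi <= hi)
        return hits / n_z if n_z else 0.0

    worst = 0.0
    for (d, z, z_prime) in restrictions:
        for i, lo in enumerate(ys):
            for hi in ys[i:]:
                worst = max(worst, prob(lo, hi, d, z_prime) - prob(lo, hi, d, z))
    return worst

# Hypothetical toy data with treatments "a", "b", "c" and instruments "p", "q".
toy_unordered = [(1.0, "a", "p"), (2.0, "a", "p"), (3.0, "b", "p"), (4.0, "c", "p"),
                 (1.0, "a", "q"), (2.0, "b", "q"), (3.0, "b", "q"), (4.0, "c", "q")]

print(violation_unordered(toy_unordered, [("a", "p", "q")]))
print(violation_unordered(toy_unordered, [("b", "p", "q")]))
```

The two calls differ only in the restriction imposed, which shows that the conclusion depends on which triples are placed in C.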

2.4 Conditioning Covariates


In this section, we consider the case where conditioning covariates may exist, that is, the random
assignment assumption holds conditional on some covariates. Suppose X is a conditioning covariate
vector, let $\mathcal{X}$ be the set of possible values of X, and let $\mathcal{X} = \{x_1, x_2, \ldots, x_L\}$.
First, consider the case introduced in Section 2.2 where the treatment and the instrument are
both multivalued (and ordered). A testable implication with conditioning covariates is as follows.

Lemma 2.3 A testable implication of the conditional version of Assumption 2.2 is that

P (Y ∈ B, D = dmax |Z = zk , X = xl ) ≤ P (Y ∈ B, D = dmax |Z = zk+1 , X = xl ) if dmax exists


and P (Y ∈ B, D = dmin |Z = zk , X = xl ) ≥ P (Y ∈ B, D = dmin |Z = zk+1 , X = xl ) if dmin exists,
and P (D ∈ C|Z = zk , X = xl ) ≥ P (D ∈ C|Z = zk+1 , X = xl ) , (6)

for all k with 1 ≤ k ≤ K − 1, all l with 1 ≤ l ≤ L, all B ∈ BR , and all C = (−∞, c] with c ∈ R.

Second, consider the case introduced in Section 2.3 where the treatment and the instrument can
both be unordered. A testable implication with conditioning covariates is as follows.

Lemma 2.4 A testable implication of the conditional version of Assumption 2.4 is given by

P (Y ∈ B, D = d|Z = z ′ , X = xl ) ≤ P (Y ∈ B, D = d|Z = z, X = xl ) (7)

for all Borel sets B, all (d, z, z ′ ) ∈ C, and all l with 1 ≤ l ≤ L, where C is a prespecified subset of
D × Z × Z.

The inequality in (7) is similar to the generalized regression monotonicity (GRM) hypothesis in Hsu
et al. (2019). The major difference is that Z is allowed to be unordered in (7).

3 Test Formulation
To highlight the idea, we first introduce the test for the case where the treatment is multivalued
(and ordered), with support D = {d1 , d2 , . . .}. The other cases will be discussed as extensions in
later sections. Also, we let Z be multivalued with support Z = {z1 , . . . , zK }. The test is constructed
based on the testable implication given in (3) and (4). Without loss of generality, we assume that
both dmin and dmax exist, with dmin = 0 and dmax = 1. In practice, we can always normalize dmin
and dmax to 0 and 1, respectively. Then (3) and (4) are equivalent to

$(-1)^d \cdot \{ P(Y \in B, D = d \mid Z = z_{k+1}) - P(Y \in B, D = d \mid Z = z_k) \} \le 0$

and $P(D \in C \mid Z = z_{k+1}) - P(D \in C \mid Z = z_k) \le 0$ (8)

for all k with 1 ≤ k ≤ K − 1, all closed intervals B in R, each d ∈ {0, 1}, and all C = (−∞, c] with
c ∈ R. Here, (3) and (4) originally require (8) to hold for all Borel sets B. Similarly to Lemma
B.7 of Kitagawa (2015), we can show (by applying Lemma C1 of Andrews and Shi (2013)) that (8)
holding for all closed intervals B is equivalent to (8) holding for all Borel sets B.
By definition, for all B, C ∈ BR and all k with 1 ≤ k ≤ K, P (Y ∈ B, D ∈ C|Z = zk ) =
P (Y ∈ B, D ∈ C, Z = zk )/P (Z = zk ). We now define function spaces

  
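$\mathcal{G}_K = \{ 1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}} : k = 1, 2, \ldots, K \}$,
$\mathcal{G} = \{ (1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}}, 1_{\mathbb{R} \times \mathbb{R} \times \{z_{k+1}\}}) : k = 1, 2, \ldots, K - 1 \}$,
$\mathcal{H}_1 = \{ (-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} : B \text{ is a closed interval in } \mathbb{R},\ d \in \{0, 1\} \}$,
$\bar{\mathcal{H}}_1 = \{ (-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} : B \text{ is a closed, open, or half-closed interval in } \mathbb{R},\ d \in \{0, 1\} \}$,
$\mathcal{H}_2 = \{ 1_{\mathbb{R} \times C \times \mathbb{R}} : C = (-\infty, c],\ c \in \mathbb{R} \}$,
$\bar{\mathcal{H}}_2 = \{ 1_{\mathbb{R} \times C \times \mathbb{R}} : C = (-\infty, c] \text{ or } C = (-\infty, c),\ c \in \mathbb{R} \}$,
$\mathcal{H} = \mathcal{H}_1 \cup \mathcal{H}_2$, and $\bar{\mathcal{H}} = \bar{\mathcal{H}}_1 \cup \bar{\mathcal{H}}_2$. (9)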

Let $\mathcal{P}$ denote the set of probability measures on $(\mathbb{R}^3, \mathcal{B}_{\mathbb{R}^3})$. We use an i.i.d. sample
$\{(Y_i, D_i, Z_i)\}_{i=1}^n$ distributed according to some probability distribution $Q \in \mathcal{P}$, that is,
$Q(G) = \mathbb{P}((Y_i, D_i, Z_i) \in G)$ for all $G \in \mathcal{B}_{\mathbb{R}^3}$, to construct a test for the testable implication given in
(3) and (4) (or in (8)). For every $Q \in \mathcal{P}$ and every measurable function v, by an abuse of notation
we define

$Q(v) = \int v \, dQ$. (10)

Define, by convention (see, for example, Folland (1999, p. 45)), that

0 · ∞ = 0. (11)

For each Q ∈ P, the closure of H in L2 (Q) is equal to H̄ (Lemma A.1). For every Q ∈ P and every
(h, g) ∈ H̄ × G with g = (g1 , g2 ), define

$\phi_Q(h, g) = \frac{Q(h \cdot g_2)}{Q(g_2)} - \frac{Q(h \cdot g_1)}{Q(g_1)}$. (12)

With (11), φQ is always well defined. Then the null hypothesis equivalent to (8) is

$H_0 : \sup_{(h,g) \in \mathcal{H} \times \mathcal{G}} \phi_Q(h, g) \le 0$ (13)

if the underlying distribution of the data is Q. Since Q(v) is continuous on L2 (Q), (13) is equivalent
to sup(h,g)∈H̄×G φQ (h, g) ≤ 0. The alternative hypothesis is naturally set to

$H_1 : \sup_{(h,g) \in \mathcal{H} \times \mathcal{G}} \phi_Q(h, g) > 0$.

Define the sample analogue of φQ by

$\hat{\phi}_Q(h, g) = \frac{\hat{Q}(h \cdot g_2)}{\hat{Q}(g_2)} - \frac{\hat{Q}(h \cdot g_1)}{\hat{Q}(g_1)}$,

where Q̂ denotes the empirical probability measure of Q such that for every measurable function v,

$\hat{Q}(v) = \frac{1}{n} \sum_{i=1}^{n} v(Y_i, D_i, Z_i)$, (14)

and $\{(Y_i, D_i, Z_i)\}_{i=1}^n$ is the i.i.d. sample distributed according to Q.
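
The sample analogue of φ_Q is a difference of two conditional sample means of h, one among observations with Z = z_{k+1} and one among observations with Z = z_k. The following sketch makes this explicit; the helper names are hypothetical, h is passed as a function of (y, d) because the elements of H do not depend on z, and a ratio with zero denominator is set to 0 in line with convention (11).

```python
def phi_hat(sample, h, z_lo, z_hi):
    """Sample analogue of phi_Q(h, g) for g = (1{Z = z_lo}, 1{Z = z_hi}):
    Q_hat(h * g2) / Q_hat(g2) - Q_hat(h * g1) / Q_hat(g1), which equals the
    mean of h given Z = z_hi minus the mean of h given Z = z_lo."""
    def cond_mean(z):
        vals = [h(y, d) for (y, d, zi) in sample if zi == z]
        return sum(vals) / len(vals) if vals else 0.0  # convention 0 * inf = 0
    return cond_mean(z_hi) - cond_mean(z_lo)

def h1(lo, hi, d):
    """Element of H_1: (-1)^d * 1{Y in [lo, hi], D = d}, with d in {0, 1}."""
    return lambda y, dd: (-1) ** d * (1.0 if (lo <= y <= hi and dd == d) else 0.0)

def h2(c):
    """Element of H_2: 1{D <= c}."""
    return lambda y, dd: 1.0 if dd <= c else 0.0

# Hypothetical toy data: triples (y, d, z) with d_min = 0 and d_max = 1.
toy_sample = [(0.5, 0, 0), (1.5, 0, 0), (2.5, 1, 0), (3.5, 0, 0),
              (0.5, 0, 1), (1.5, 1, 1), (2.5, 1, 1), (3.5, 1, 1)]

print(phi_hat(toy_sample, h1(0.0, 4.0, 1), 0, 1))
print(phi_hat(toy_sample, h2(0), 0, 1))
```

Nonpositive values for every h and g are what the null hypothesis in (13) predicts at the population level.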


The goal of this section is to construct a test for the H0 in (13). To evaluate the ability of the
test to provide size control, we consider a “local” sequence of probability distributions
$\{P_n\}_{n=1}^{\infty} \subset \mathcal{P}$
under which the testable implication is true and $P_n$ converges to some probability measure $P \in \mathcal{P}$.
We introduce the next two assumptions to formalize the above settings.

Assumption 3.1 $\{(Y_i, D_i, Z_i)\}_{i=1}^n$ is an i.i.d. data set distributed according to probability distribution
$P_n$ for each n, where $D_i$ and $Z_i$ are discrete variables with support $\mathcal{D}$ and $\mathcal{Z}$, respectively.

Assumption 3.2 There is a probability measure P ∈ P such that


$\lim_{n \to \infty} \int \left[ \sqrt{n} \left\{ dP_n^{1/2} - dP^{1/2} \right\} - \frac{1}{2} v_0 \, dP^{1/2} \right]^2 = 0$ (15)

for some measurable function $v_0$, where $dP_n^{1/2}$ and $dP^{1/2}$ denote the square roots of the densities of $P_n$
and P, respectively.

Assumptions 3.1 and 3.2 assume an i.i.d. sample whose distribution Pn is allowed to change as n
increases, and to converge to some probability measure P as defined in (3.10.10) of van der Vaart
and Wellner (1996). In the local analysis of Fang and Santos (2018), they considered the case where
the value of the underlying parameter may be close to a point at which the map involved in the test
statistic is only directionally differentiable (not fully differentiable). A similarly convergent probabil-
ity sequence was introduced to show the local size control of their test.8 As will be shown later, the
map involved in our test statistic is nondifferentiable (neither fully nor directionally differentiable).
We follow Fang and Santos (2018) and assume such a convergent probability sequence to show the
local size control of our test.
Clearly, H × G ⊂ L2 (P ) × (L2 (P ) × L2 (P )). Under Assumption 3.2, define a metric ρP on
L2 (P ) × (L2 (P ) × L2 (P )) by

$\rho_P((h, g), (h', g')) = \|h - h'\|_{L^2(P)} + \|g_1 - g_1'\|_{L^2(P)} + \|g_2 - g_2'\|_{L^2(P)}$ (16)

for all (h, g) , (h′ , g ′ ) ∈ L2 (P ) × (L2 (P ) × L2 (P )) with g = (g1 , g2 ) and g ′ = (g1′ , g2′ ). By Lemma A.8,
the closure of H × G in L2 (P ) × (L2 (P ) × L2 (P )) under ρP is equal to H̄ × G, where H̄ is defined in
(9). Define

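$\Lambda(Q) = \prod_{k=1}^{K} Q(1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}})$ for all $Q \in \mathcal{P}$, and $T_n = n \cdot \prod_{k=1}^{K} \hat{P}_n(1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}})$,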

where P̂n is the empirical probability measure of Pn defined as in (14). Under Assumption 3.2, we
mainly consider the nontrivial case where Λ(P ) > 0. Also, for every Q ∈ P, define
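$\sigma_Q^2(h, g) = \Lambda(Q) \cdot \left\{ \frac{Q(h^2 \cdot g_2)}{Q^2(g_2)} - \frac{Q^2(h \cdot g_2)}{Q^3(g_2)} + \frac{Q(h^2 \cdot g_1)}{Q^2(g_1)} - \frac{Q^2(h \cdot g_1)}{Q^3(g_1)} \right\}$ (17)

for all $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$ with $g = (g_1, g_2)$, where $Q^m(g_j) = [Q(g_j)]^m$ for $m \in \mathbb{N}$ and $j \in \{1, 2\}$.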

Lemma 3.1 Under Assumptions 3.1 and 3.2, $\sqrt{T_n}(\hat{\phi}_{P_n} - \phi_P) \rightsquigarrow \mathbb{G}$ for some tight9 random element $\mathbb{G}$
which almost surely has a uniformly $\rho_P$-continuous path, and for all $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$ with $g = (g_1, g_2)$,
the variance $Var(\mathbb{G}(h, g))$ is equal to the $\sigma_P^2(h, g)$ given in (17), where

$\sigma_P^2(h, g) \le \frac{1}{4} \cdot \max_{(g_1', g_2') \in \mathcal{G}} \{ \Lambda(P)/P(g_2') + \Lambda(P)/P(g_1') \} \le \frac{1}{2} \cdot (K - 1)^{-(K-1)}$, (18)

and K is the number of elements in $\mathcal{Z}$. In particular, $\sigma_P^2(h, g) \le 1/4$ for all $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$ when K = 2.

8 See Examples 2.1 and 2.2 of Fang and Santos (2018).
9 In a metric space, tightness implies separability.

Lemma 3.1 provides the asymptotic distribution of $\sqrt{T_n}(\hat{\phi}_{P_n} - \phi_P)$ and its asymptotic variance,
which is uniformly bounded by 1 for all K > 1. We used the quantity $\sqrt{T_n}$ instead of $\sqrt{n}$ to establish
the asymptotic distribution in order to achieve a known bound for the asymptotic variance. The
bound in (18) will be useful when we construct the test statistic. By (17), for every $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$
with $g = (g_1, g_2)$, define the sample analogue of $\sigma_P^2(h, g)$ by

$\hat{\sigma}_{P_n}^2(h, g) = \frac{T_n}{n} \cdot \left\{ \frac{\hat{P}_n(h^2 \cdot g_2)}{\hat{P}_n^2(g_2)} - \frac{\hat{P}_n^2(h \cdot g_2)}{\hat{P}_n^3(g_2)} + \frac{\hat{P}_n(h^2 \cdot g_1)}{\hat{P}_n^2(g_1)} - \frac{\hat{P}_n^2(h \cdot g_1)}{\hat{P}_n^3(g_1)} \right\}$. (19)

Note that for each $h \in \bar{\mathcal{H}}$ and each $g_l \in \mathcal{G}_K$, if $\hat{P}_n(g_l) = 0$ then $\hat{P}_n(h \cdot g_l) = 0$. By (11), $\hat{\sigma}_{P_n}^2$ is well
defined.
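
The bound in (18) can be checked numerically for given instrument probabilities. One way to see the second inequality is that Λ(P)/P(g) is a product of K − 1 probabilities summing to at most 1, so by the arithmetic-geometric mean inequality it is at most (K − 1)^{−(K−1)}. The sketch below evaluates both sides; the function names and the probability vectors are hypothetical choices for illustration.

```python
from math import prod

def variance_bound(p):
    """First bound in (18): 1/4 times the maximum over consecutive pairs of
    Lambda(P)/P(Z = z_{k+1}) + Lambda(P)/P(Z = z_k),
    where p = (P(Z = z_1), ..., P(Z = z_K)) and Lambda(P) is their product."""
    lam = prod(p)
    return 0.25 * max(lam / p[k] + lam / p[k + 1] for k in range(len(p) - 1))

def universal_bound(K):
    """Second bound in (18): 1/2 * (K - 1)^(-(K - 1))."""
    return 0.5 * (K - 1) ** (-(K - 1))

print(variance_bound([0.3, 0.7]), universal_bound(2))
print(variance_bound([0.2, 0.3, 0.5]), universal_bound(3))
```

For K = 2 the first bound equals 1/4 for any instrument probabilities, since Λ(P)/P(g_2) + Λ(P)/P(g_1) is then the sum of the two probabilities.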
We may extend the idea of Kitagawa (2015) and construct the test statistic to be

$\sup_{(h,g) \in \mathcal{H} \times \mathcal{G}} \frac{\sqrt{T_n} \, \hat{\phi}_{P_n}(h, g)}{\max\{\xi, \hat{\sigma}_{P_n}(h, g)\}}$ (20)

for some positive number (trimming parameter) ξ. Here, ξ plays two roles: (1) Since σ̂Pn can
be zero, ξ bounds the denominator away from zero; (2) as shown in the Monte Carlo studies of
Kitagawa (2015) and the present paper, different values of ξ, from small (close to 0) to large (close
to 1), may lead to different powers of the test for the same data generating process (DGP), and for
some values of ξ the power could be close to 0. Kitagawa (2015) suggests that if there is no prior knowledge available about
a likely alternative, the default choice of ξ could be set to 0.07 according to the simulation studies
for the binary treatment and binary instrument case. They also suggest that users report test results
using different values of ξ. However, the underlying distribution of the data can never be fully
explored or represented by limited simulation designs, so an “optimal” value of ξ which is plausible
for all possible DGPs may not exist. If we repeat the test using the same data set but different
values of ξ and make a decision based on all these results, we might encounter an issue of multiple
comparisons. As a consequence, the size of the test, or more precisely the “family-wise error rate,”
may not be controlled by the nominal significance level. With all these considerations, this paper
constructs the test statistic in a way that, loosely speaking, computes the weighted average of the
test statistics in (20) over ξ. If we put all the weight on one particular value of ξ, the test statistic
degenerates to the test statistic in (20).
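
For a fixed trimming parameter ξ, the statistic in (20) can be computed by brute force in small samples. The sketch below is a simplified illustration, not the paper's implementation: it assumes D is normalized so that d_min = 0 and d_max = 1, it searches only intervals with endpoints at observed outcomes, and it starts the supremum at 0 because an interval containing no observation gives a ratio of 0. All names and data are hypothetical.

```python
from math import sqrt, prod

def ks_statistic(sample, z_values, xi):
    """Brute-force version of (20) for triples (y, d, z) and ordered z_values."""
    n = len(sample)
    p_hat = {z: sum(1 for s in sample if s[2] == z) / n for z in z_values}
    lam_hat = prod(p_hat.values())      # equals T_n / n
    t_n = n * lam_hat
    ys = sorted({y for (y, _, _) in sample})
    ds = sorted({d for (_, d, _) in sample})

    hs = []                              # candidate functions h(y, d)
    for i, lo in enumerate(ys):
        for hi in ys[i:]:
            for d in (0, 1):             # elements of H_1
                hs.append(lambda y, dd, lo=lo, hi=hi, d=d:
                          (-1) ** d * float(lo <= y <= hi and dd == d))
    for c in ds:                         # elements of H_2
        hs.append(lambda y, dd, c=c: float(dd <= c))

    def moments(h, z):
        """Conditional sample mean and variance of h given Z = z."""
        vals = [h(y, d) for (y, d, zi) in sample if zi == z]
        if not vals:
            return 0.0, 0.0
        m = sum(vals) / len(vals)
        return m, max(sum(v * v for v in vals) / len(vals) - m * m, 0.0)

    best = 0.0
    for z_lo, z_hi in zip(z_values, z_values[1:]):
        for h in hs:
            m_lo, v_lo = moments(h, z_lo)
            m_hi, v_hi = moments(h, z_hi)
            var = 0.0                    # bracket in (19), using 0 * inf = 0
            if p_hat[z_lo] > 0:
                var += v_lo / p_hat[z_lo]
            if p_hat[z_hi] > 0:
                var += v_hi / p_hat[z_hi]
            sigma = sqrt(lam_hat * var)
            best = max(best, sqrt(t_n) * (m_hi - m_lo) / max(xi, sigma))
    return best

# Hypothetical toy data: triples (y, d, z); the second sample swaps instrument labels.
toy_null = [(0.5, 0, 0), (1.5, 0, 0), (2.5, 1, 0), (3.5, 0, 0),
            (0.5, 0, 1), (1.5, 1, 1), (2.5, 1, 1), (3.5, 1, 1)]
toy_alt = [(y, d, 1 - z) for (y, d, z) in toy_null]

print(ks_statistic(toy_null, [0, 1], 0.07))
print(ks_statistic(toy_alt, [0, 1], 1.0))
```

With ξ = 1 the denominator equals 1 for every (h, g) in the two-instrument case, so the second call returns the unweighted statistic.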
Let Ξ be a predetermined closed subset of [0, 1] such that 0 ∉ Ξ. The set Ξ contains all the
values of ξ used for constructing the test statistic. Only one of the values greater than (or equal to)
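the bound in Lemma 3.1, say 1, needs to be included in Ξ. The test statistic in (20) reduces to the
unweighted KS statistic when ξ = 1. Also, for every $A \subset \bar{\mathcal{H}} \times \mathcal{G}$, define a map
$\mathcal{S}_A : \ell^\infty(\Xi \times \bar{\mathcal{H}} \times \mathcal{G}) \to \ell^\infty(\Xi)$ by

$\mathcal{S}_A(\psi)(\xi) = \sup_{(h,g) \in A} \psi(\xi, h, g)$

for all $\psi \in \ell^\infty(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$. For simplicity of notation, we will write $\mathcal{S}$ for $\mathcal{S}_{\bar{\mathcal{H}} \times \mathcal{G}}$. Define
$\mathcal{M} : \ell^\infty(\bar{\mathcal{H}} \times \mathcal{G}) \to \ell^\infty(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$ by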

M(ϕ)(ξ, h, g) = max{ξ, ϕ(h, g)} (21)

for all ϕ ∈ ℓ∞ (H̄ × G) and (ξ, h, g) ∈ Ξ × H̄ × G. Let ν be a positive measure on Ξ.

Assumption 3.3 The measure ν satisfies that 0 < ν(Ξ) < ∞ and SH×G (φ̂Pn /M(σ̂Pn )) ∈ L1 (ν) for all
ω ∈ Ω and all n.

Note that for every finite sample set,

SH×G (φ̂Pn /M(σ̂Pn )) = S(φ̂Pn /M(σ̂Pn )). (22)

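See the discussion in Section 4 about the computational simplification of $\mathcal{S}_{\mathcal{H} \times \mathcal{G}}(\hat{\phi}_{P_n}/\mathcal{M}(\hat{\sigma}_{P_n}))$.
Define a function $\mathcal{I} : L^1(\nu) \to \mathbb{R}$ by $\mathcal{I}(f) = \int_\Xi f \, d\nu$ for all $f \in L^1(\nu)$. Now we set the test statistic
to

$\sqrt{T_n} \, \mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}} \left( \frac{\hat{\phi}_{P_n}}{\mathcal{M}(\hat{\sigma}_{P_n})} \right)$. (23)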

The measure ν could be a Dirac measure centered at some fixed ξ ∈ Ξ. This is equivalent to using a
particular value for the trimming parameter to construct the test statistic as in (20). Or ν could be
a discrete or continuous probability measure that assigns probabilities to the elements of Ξ. This is
equivalent to using a weighted average of the test statistics in (20) over ξ. By using (23), we take
into account the fact that the values of ξ may influence the power of the test, and we can also avoid
the multiple testing issue. Define


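\[ \Psi_{H \times G} = \{(h, g) \in H \times G : \phi_P(h, g) = 0\} \quad \text{and} \quad \Psi_{\bar{H} \times G} = \{(h, g) \in \bar{H} \times G : \phi_P(h, g) = 0\}. \tag{24} \]

Since $1_{\{a\} \times \{0\} \times \mathbb{R}}, -1_{\{a\} \times \{1\} \times \mathbb{R}} \in H$ for all $a \in \mathbb{R}$, $\Psi_{H \times G}$ and $\Psi_{\bar{H} \times G}$ are not empty.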
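The construction of the statistic in (23) can be sketched numerically. The snippet below is a minimal illustration, assuming that the supremum over H × G has already been reduced to a finite list of functions (so that the estimates of φ and σ are plain arrays) and that ν is a discrete measure with given weights on finitely many values of ξ. The function name and arguments are hypothetical and are not taken from the paper.

```python
import numpy as np

def integrated_sup_statistic(phi_hat, sigma_hat, t_n, xis, weights):
    """Sketch of sqrt(T_n) * I(S(phi_hat / M(sigma_hat))) on a finite grid.

    phi_hat, sigma_hat : estimates over a finite list of (h, g) pairs (assumption)
    xis, weights       : support points and masses of a discrete measure nu
    """
    phi_hat = np.asarray(phi_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    # For each xi: M(sigma)(xi, h, g) = max{xi, sigma(h, g)}, then sup over (h, g).
    sups = np.array([np.max(phi_hat / np.maximum(xi, sigma_hat)) for xi in xis])
    # I(f) is the integral of f with respect to nu, here a weighted sum.
    return float(np.sqrt(t_n) * np.dot(weights, sups))

phi = [0.2, -0.1, 0.05]
sigma = [0.5, 0.05, 0.01]
dirac = integrated_sup_statistic(phi, sigma, 100.0, xis=[0.1], weights=[1.0])
average = integrated_sup_statistic(phi, sigma, 100.0, xis=[0.1, 1.0], weights=[0.5, 0.5])
```

With a single support point the function returns the trimmed statistic for that ξ (the Dirac case); with several support points it returns the weighted average over ξ.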

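Theorem 3.1 Suppose Assumptions 3.1, 3.2, and 3.3 hold. If the H0 in (13) is true with Q = Pn for
all n, then

\[ \sqrt{T_n}\, I \circ S_{H \times G}\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) \rightsquigarrow I \circ S_{\Psi_{\bar{H} \times G}}\left( \frac{G}{M(\sigma_P)} \right), \tag{25} \]

where G is as in Lemma 3.1.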

Theorem 3.1 provides the pointwise asymptotic distribution of the test statistic if the H0 in (13)
is true with Q = Pn for all n.¹⁰ To find this asymptotic distribution, we employed the extended
delta method provided in Appendix A. Because the map M is nondifferentiable, the existing delta
methods fail to work in establishing the weak convergence in (25). In Appendix A, we provide
an extended continuous mapping theorem and an extended delta method elaborated by Theorems
A.1 and A.2, respectively, to deal with this technical issue. See further discussion in Remark B.3.
Theorem A.1 can be viewed as an extension of Theorem 1.11.1 of van der Vaart and Wellner (1996),
and Theorem A.2 can be viewed as an extension of Theorem 3.9.5 of van der Vaart and Wellner
(1996) and of Theorem 2.1 of Fang and Santos (2018).
In Theorem 3.1, we consider the general case, where D = {d1 , d2 , . . .}. If D is a finite set with
D = {d1 , d2 , . . . , dJ }, then I ◦ SΨH̄×G (G/M(σP )) = I ◦ SΨH×G (G/M(σP )) under null, because it
can be shown that in this special case ΨH̄×G is equal to the closure of ΨH×G in H̄ × G under ρP and
G/M(σP ) is continuous under ρP for every fixed ξ. We summarize this in the following corollary.

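Corollary 3.1 Under the assumptions of Theorem 3.1 with D = {d1 , d2 , . . . , dJ },

\[ \sqrt{T_n}\, I \circ S_{H \times G}\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) \rightsquigarrow I \circ S_{\Psi_{H \times G}}\left( \frac{G}{M(\sigma_P)} \right), \tag{26} \]

where G is as in Lemma 3.1.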

3.1 Bootstrap-Based Inference


It was shown that the asymptotic distribution in (26) involves a map $S_{\Psi_{H \times G}}$, where $\Psi_{H \times G}$ depends
on the underlying probability measure P. Therefore, we need to find a “valid” estimator $\widehat{\Psi}_{H \times G}$ for
$\Psi_{H \times G}$ in order to consistently approximate the asymptotic distribution. If $\widehat{\Psi}_{H \times G}$ can be constructed
appropriately, a natural approximation of $S_{\Psi_{H \times G}}$ can be constructed by $S_{\widehat{\Psi}_{H \times G}}$. By the definition of
$\Psi_{H \times G}$ in (24), we construct $\widehat{\Psi}_{H \times G}$ by

\[ \widehat{\Psi}_{H \times G} = \left\{ (h, g) \in H \times G : \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g)}{M(\hat{\sigma}_{P_n})(\xi_0, h, g)} \right| \le \tau_n \right\} \tag{27} \]

with $\tau_n \to \infty$ and $\tau_n/\sqrt{n} \to 0$ as $n \to \infty$, where $\xi_0$ is a small positive number. We suggest using
$\xi_0 = 0.001$ in practice. It can be shown that $S_{\widehat{\Psi}_{H \times G}}$ can also be used to approximate the asymptotic
distribution in (25) when D = {d1 , d2 , . . .}.¹¹ This is a method similar to that which is used in Beare
and Shi (2019) and Sun and Beare (2019) to estimate contact sets in independent contexts. See
Linton et al. (2010) and Lee et al. (2013) for further discussion of estimation of contact sets. Each
(h, g) is included in $\widehat{\Psi}_{H \times G}$ if $\sqrt{T_n}\,|\hat{\phi}_{P_n}(h, g)|$ is no more than $\tau_n$ estimated standard deviations from
zero. As mentioned by Sun and Beare (2019), we can effectively use pointwise confidence intervals
to select points in this way.
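The selection rule in (27) amounts to a pointwise thresholding of studentized estimates. A minimal sketch follows, assuming the estimates are stored as arrays over a finite list of (h, g) pairs; the function name is hypothetical, the default ξ0 = 0.001 is the value suggested above, and the default τn = 2 is only a placeholder that should be chosen for the sample size at hand.

```python
import numpy as np

def estimate_contact_set(phi_hat, sigma_hat, t_n, tau_n=2.0, xi0=0.001):
    """Boolean mask of the (h, g) pairs kept in the estimated contact set.

    A pair is kept when sqrt(T_n) * |phi_hat| / max{xi0, sigma_hat} <= tau_n.
    """
    phi_hat = np.asarray(phi_hat, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    ratio = np.sqrt(t_n) * phi_hat / np.maximum(xi0, sigma_hat)
    return np.abs(ratio) <= tau_n

mask = estimate_contact_set([0.0, -0.01, -0.5], [0.3, 0.3, 0.3], t_n=400.0)
mask_all = estimate_contact_set([0.0, -0.01, -0.5], [0.3, 0.3, 0.3], t_n=400.0, tau_n=np.inf)
```

Setting `tau_n` to infinity keeps every pair, which corresponds to taking the supremum over the whole of H × G.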
10 More precisely, the weak convergence in (25) is under Pn .
11 See the equation in (B.50).

3.1.1 Test Procedure

We implement the test in the following sequence of steps:

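(1) Obtain the bootstrap sample $\{(\hat{Y}_i, \hat{D}_i, \hat{Z}_i)\}_{i=1}^{n}$ drawn independently with replacement from the
sample $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$.

(2) Calculate the bootstrap version of $\hat{\phi}_{P_n}$ by

\[ \hat{\phi}^B_{P_n}(h, g) = \frac{\hat{P}^B_n(h \cdot g_2)}{\hat{P}^B_n(g_2)} - \frac{\hat{P}^B_n(h \cdot g_1)}{\hat{P}^B_n(g_1)}, \tag{28} \]

let $T^B_n = n \cdot \prod_{k=1}^{K} \hat{P}^B_n(1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}})$, and calculate the bootstrap version of $\hat{\sigma}_{P_n}$ by

\[ \hat{\sigma}^B_{P_n}(h, g) = \sqrt{\frac{T^B_n}{n}} \cdot \sqrt{ \frac{\hat{P}^B_n(h^2 \cdot g_2)}{\hat{P}^B_n(g_2)^2} - \frac{\hat{P}^B_n(h \cdot g_2)^2}{\hat{P}^B_n(g_2)^3} + \frac{\hat{P}^B_n(h^2 \cdot g_1)}{\hat{P}^B_n(g_1)^2} - \frac{\hat{P}^B_n(h \cdot g_1)^2}{\hat{P}^B_n(g_1)^3} } \tag{29} \]

for all $(h, g) \in \bar{H} \times G$, where $\hat{P}^B_n(v) = n^{-1} \sum_{i=1}^{n} v(\hat{Y}_i, \hat{D}_i, \hat{Z}_i)$ for all measurable v.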

(3) Calculate the bootstrap version of the test statistic by

\[ I \circ S_{\widehat{\Psi}_{H \times G}}\left( \sqrt{T^B_n}\, (\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n}) / M(\hat{\sigma}^B_{P_n}) \right). \tag{30} \]

Since the $I \circ S_{\Psi_{\bar{H} \times G}}$ in the asymptotic distribution in (25) is a nonlinear map, the bootstrap
test statistic in (30) was constructed following the idea of Fang and Santos (2018). The
nonlinearity of the map $I \circ S_{\Psi_{\bar{H} \times G}}$ may cause inconsistencies in the bootstrap approximation. See
Dümbgen (1993), Andrews (2000), and Fang and Santos (2018) for details. Because of the
denominator $M(\hat{\sigma}^B_{P_n})$, our approach is an extension of that of Fang and Santos (2018).
Similarly to (23), we can simplify the calculation of (30) in practice. See Section 4 for details
regarding Monte Carlo simulations.

(4) Repeat steps (1), (2), and (3) nB times independently, for (say) nB = 1000. Given the nominal
significance level α, calculate the bootstrap critical value $\hat{c}_{1-\alpha}$ by

\[ \hat{c}_{1-\alpha} = \inf\left\{ c : P\left( I \circ S_{\widehat{\Psi}_{H \times G}}\left( \sqrt{T^B_n}\, (\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n}) / M(\hat{\sigma}^B_{P_n}) \right) \le c \,\middle|\, \{(Y_i, D_i, Z_i)\}_{i=1}^{n} \right) \ge 1 - \alpha \right\}. \tag{31} \]

In practice, we approximate $\hat{c}_{1-\alpha}$ by computing the 1 − α quantile of the nB independently
generated bootstrap statistics, with nB chosen as large as is computationally convenient.

(5) The decision rule for the test is: Reject H0 if $\sqrt{T_n}\, I \circ S_{H \times G}(\hat{\phi}_{P_n}/M(\hat{\sigma}_{P_n})) > \hat{c}_{1-\alpha}$.
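Steps (1) to (5) can be sketched end to end for the special case of a binary treatment and a binary instrument. The code below is an illustration under stated assumptions, not the paper's implementation: the function class is taken to be $(-1)^d \cdot 1\{Y \in B, D = d\}$ over closed intervals B; interval endpoints come from a coarse quantile grid rather than from all observed outcomes; the non-bootstrap σ̂ is assumed to be the sample analogue of (29); ν puts equal mass on a few values of ξ; the number of bootstrap draws is kept small; and an empty estimated contact set falls back to the whole class (the conservative τn = ∞ case). All function names are hypothetical.

```python
import numpy as np

def function_matrix(y, d, n_grid=12):
    """Columns hold f = (-1)^d * 1{Y in [a, b], D = d} over a grid of intervals."""
    pts = np.quantile(y, np.linspace(0.0, 1.0, n_grid))  # assumed endpoint grid
    cols = []
    for i in range(n_grid):
        for j in range(i, n_grid):
            in_b = (y >= pts[i]) & (y <= pts[j])
            cols.append(np.where(in_b & (d == 0), 1.0, 0.0))
            cols.append(np.where(in_b & (d == 1), -1.0, 0.0))
    return np.column_stack(cols)

def phi_sigma(F, z):
    """Sample analogues of (28) and (29) with g1 = 1{Z = 0} and g2 = 1{Z = 1}."""
    n = len(z)
    g1, g2 = (z == 0).astype(float), (z == 1).astype(float)
    p1, p2 = g1.mean(), g2.mean()
    m1, m2 = (F * g1[:, None]).mean(axis=0), (F * g2[:, None]).mean(axis=0)
    s1, s2 = (F**2 * g1[:, None]).mean(axis=0), (F**2 * g2[:, None]).mean(axis=0)
    t = n * p1 * p2
    phi = m2 / p2 - m1 / p1
    var = s2 / p2**2 - m2**2 / p2**3 + s1 / p1**2 - m1**2 / p1**3
    sigma = np.sqrt(t / n * np.clip(var, 0.0, None))
    return phi, sigma, t

def iv_validity_test(y, d, z, xis=(0.1, 0.2, 1.0), tau=2.0, xi0=0.001,
                     n_boot=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    F = function_matrix(y, d)
    phi, sigma, t = phi_sigma(F, z)

    def integrated_sup(process, sig, keep):
        # nu puts equal weight on each xi in xis (assumption)
        return float(np.mean([np.max(process[keep] / np.maximum(xi, sig[keep]))
                              for xi in xis]))

    everything = np.ones(F.shape[1], dtype=bool)
    stat = integrated_sup(np.sqrt(t) * phi, sigma, everything)
    # Estimated contact set as in (27); fall back to all functions if empty.
    contact = np.abs(np.sqrt(t) * phi / np.maximum(xi0, sigma)) <= tau
    if not contact.any():
        contact = everything
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        phi_b, sigma_b, t_b = phi_sigma(F[idx], z[idx])
        boot[b] = integrated_sup(np.sqrt(t_b) * (phi_b - phi), sigma_b, contact)
    # Empirical version of (31): smallest c with bootstrap CDF >= 1 - alpha.
    crit = np.sort(boot)[int(np.ceil((1 - alpha) * n_boot)) - 1]
    return stat, crit, bool(stat > crit)

# Illustrative data in which the outcome depends directly on Z when D = 1.
rng = np.random.default_rng(1)
n = 400
z = rng.integers(0, 2, size=n)
d = rng.integers(0, 2, size=n)
y = 0.3 * rng.normal(size=n) + 3.0 * d * (1 - z)
stat, crit, reject = iv_validity_test(y, d, z)
```

Because the functions are evaluated once on the original sample, resampling the rows of `F` together with `z` is the same as evaluating the same functions on each bootstrap sample.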

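Theorem 3.2 Suppose Assumptions 3.1, 3.2, and 3.3 hold.

(i) If the H0 in (13) is true with Q = Pn for all n and the CDF of $I \circ S_{\Psi_{\bar{H} \times G}}(G_0/M(\sigma_P))$ is increasing
and continuous at its 1 − α quantile $c_{1-\alpha}$, where $G_0$ is the asymptotic limit given by Lemma B.8,
then $\lim_{n \to \infty} P(\sqrt{T_n}\, I \circ S_{H \times G}(\hat{\phi}_{P_n}/M(\hat{\sigma}_{P_n})) > \hat{c}_{1-\alpha}) \le \alpha$. If, in addition, Pn = P for all large
n, then $\lim_{n \to \infty} P(\sqrt{T_n}\, I \circ S_{H \times G}(\hat{\phi}_{P_n}/M(\hat{\sigma}_{P_n})) > \hat{c}_{1-\alpha}) = \alpha$.

(ii) If the H0 in (13) is false with Q = P and Pn = P for all large n, then
$\lim_{n \to \infty} P(\sqrt{T_n}\, I \circ S_{H \times G}(\hat{\phi}_{P_n}/M(\hat{\sigma}_{P_n})) > \hat{c}_{1-\alpha}) = 1$.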

It is implied by Theorem 11.1 of Davydov et al. (1998) that in (i) of Theorem 3.2, the CDF of
$I \circ S_{\Psi_{\bar{H} \times G}}(G_0/M(\sigma_P))$ is differentiable and has a positive derivative everywhere except at countably
many points in its support, provided that $I \circ S_{\Psi_{\bar{H} \times G}}(G_0/M(\sigma_P)) \ne 0$. If $I \circ S_{\Psi_{\bar{H} \times G}}(G_0/M(\sigma_P)) = 0$
at null configurations, our test statistic converges to zero in probability and so does the critical value.
Theorem 3.2 does not show clearly how the rejection rate of the test will behave asymptotically in
this case. As discussed in Sun and Beare (2019), this is a common theoretical limitation for irregular
testing problems. Tests based on the machinery of Fang and Santos (2018), and also those based on
generalized moment selection (Andrews and Soares, 2010; Andrews and Shi, 2013), may encounter
this issue. One practical resolution is to replace the bootstrap critical value ĉ1−α with max{ĉ1−α , η}
or ĉ1−α + η, where η is some small positive constant. See, for instance, Donald and Hsu (2016,
p. 13). Simulation results showed that the empirical rejection rates of our test with η = 0 are lower
than the nominal significance level when I ◦ SΨH̄×G (G0 /M(σP )) = 0 under null configurations.

3.2 Binary Treatment: Power Improvement


In this section, we consider the special case where the treatment D and the instrument Z are both
binary. Kitagawa (2015) constructed a test for the instrument validity assumption based on testable
implication (2) when D and Z are both binary. We now compare the results from Section 3.1 with
those of Kitagawa (2015). Let z1 = 0, z2 = 1, d1 = 0, and d2 = 1. All the results in Section 3.1 hold
in this case, and the test statistic in (23) is now numerically equal to the one constructed by Kitagawa
(2015) if we let ν be a Dirac measure. Recall that the instrument is allowed to be multivalued under
the constructions in Section 3.¹²
The testing strategy in this paper is different from that of Kitagawa (2015). To make this point
clear, we consider a simple case where Pn = P for all n and the H0 in (13) is true with Q = P.¹³
We establish the asymptotic distribution in (26) and use it to construct the critical value, while
Kitagawa (2015) used an upper bound of the asymptotic distribution to construct the critical value.
12 For the case where the treatment is binary and the instrument is multivalued, Kitagawa (2015) constructed the test
statistic by first computing the normalized differences of two empirical probability measures between neighboring pairs of
values of instruments (ordered according to the propensity score), and then taking the maximum value of all these differences.
Since these differences can be mutually correlated, it would not be straightforward to obtain the asymptotic distribution of
their test statistic and approximate its null distribution by bootstrap.
13 Our test achieves size control under Assumption 3.2 (the convergence of a “local” sequence of probability distributions),
while the test of Kitagawa (2015) achieves uniform size control under different conditions. Assuming a fixed P makes the
comparison more explicit.
As introduced in Section 2, we follow Kitagawa (2015) and define probability measures

P1 (B, C) = P (Y ∈ B, D ∈ C|Z = 1) and P0 (B, C) = P (Y ∈ B, D ∈ C|Z = 0)

for all $B, C \in \mathcal{B}_{\mathbb{R}}$. Now we define

\[ F_b = \left\{ (-1)^d \cdot 1_{B \times \{d\}} : B \text{ is a closed interval}, \ d \in \{0, 1\} \right\}, \]

and write $P_d(f) = \int f \, dP_d$ for all measurable f and each d ∈ {0, 1}. Kitagawa (2015) showed that
their critical value converged to the 1 − α quantile of the distribution of $\sup_{f \in F_b} G_H(f)/(\xi \vee \sigma_H(f))$,
where $H = \lambda P_1 + (1 - \lambda) P_0$, $\lambda = P(Z = 1)$, $G_H$ is an H-Brownian bridge, and $\sigma_H(f)$ is the standard
deviation of $G_H(f)$, that is, $\sigma_H^2(f) = H(f^2) - H^2(f)$. Let $F_b^* = \{f \in F_b : P_0(f) = P_1(f)\}$. Then it is
easy to show that $H(f) = P_0(f) = P_1(f)$ for all $f \in F_b^*$. Let ν be a Dirac measure centered at some
ξ. It can be shown that

\[ \sup_{f \in F_b} \frac{G_H(f)}{\xi \vee \sigma_H(f)} \ \ge \ \sup_{f \in F_b^*} \frac{G_H(f)}{\xi \vee \sigma_H(f)} \ \overset{L}{=} \ I \circ S_{\Psi_{H \times G}}\left( \frac{G}{M(\sigma_P)} \right), \tag{32} \]

where $I \circ S_{\Psi_{H \times G}}(G/M(\sigma_P))$ is the asymptotic distribution of the test statistic in (26) and “$\overset{L}{=}$” means
equivalence in distribution. The bootstrap critical value proposed in the present paper is based on
$I \circ S_{\Psi_{H \times G}}(G/M(\sigma_P))$ (equivalently, $\sup_{f \in F_b^*} G_H(f)/(\xi \vee \sigma_H(f))$), while the one of Kitagawa (2015) is
based on the upper bound $\sup_{f \in F_b} G_H(f)/(\xi \vee \sigma_H(f))$. Specifically, Kitagawa (2015) constructed a
bootstrap approximation for the Gaussian process $G_H/(\xi \vee \sigma_H)$, denoted by $G_H^B/(\xi \vee \sigma_H^B)$, and then
computed the bootstrap test statistic by $\sup_{f \in F_b} G_H^B(f)/(\xi \vee \sigma_H^B(f))$. We estimate $F_b^*$ by a subset of
$F_b$, denoted by $\widehat{F}_b^*$, and compute the bootstrap test statistic by $\sup_{f \in \widehat{F}_b^*} G_H^B(f)/(\xi \vee \sigma_H^B(f))$. Because
the supremum is taken over a subset of $F_b$, our bootstrap test statistic is numerically no larger than that of
Kitagawa (2015), and hence our critical value is no larger. It can also be shown that our critical value
converges to the 1 − α quantile of $\sup_{f \in F_b^*} G_H(f)/(\xi \vee \sigma_H(f))$. Since the test statistic in (23) is numerically
equivalent to that of Kitagawa (2015), this shows that the power of the test can be improved by use of our approach. See
the simulation evidence in Appendix C.

3.3 Unordered Treatments


With testable implication (5), we define the function space

 
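\[ H \times G = \left\{ \left( 1_{B \times \{d\} \times \mathbb{R}}, \left( 1_{\mathbb{R} \times \mathbb{R} \times \{z\}}, 1_{\mathbb{R} \times \mathbb{R} \times \{z'\}} \right) \right) : B \text{ is a closed interval}, \ (d, z, z') \in C \right\}. \tag{33} \]

For every probability measure Q with (10), we define $\phi_Q(h, g) = Q(h \cdot g_2)/Q(g_2) - Q(h \cdot g_1)/Q(g_1)$
for every (h, g) ∈ H × G with g = (g1 , g2 ). Testable implication (5) is equivalent to the H0 in

\[ H_0 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) \le 0 \quad \text{and} \quad H_1 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) > 0 \]
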
if Q is the underlying probability distribution of the data. Then we can follow the test procedure in
Section 3.1.1 to conduct the test with the function space H × G defined in (33).

3.4 Conditioning Covariates


We follow the setup in Section 2.4 and suppose X is a dX -dimensional vector random variable. First,
consider the testable implication in Lemma 2.3 with dmin = 0 and dmax = 1. Define function spaces

 
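\[ G = \left\{ \left( 1_{\mathbb{R} \times \mathbb{R} \times \{z_k\} \times \{x_l\}}, 1_{\mathbb{R} \times \mathbb{R} \times \{z_{k+1}\} \times \{x_l\}} \right) : k = 1, 2, \ldots, K - 1, \ l = 1, 2, \ldots, L \right\}, \]
\[ H_1 = \left\{ (-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R} \times \mathbb{R}^{d_X}} : B \text{ is a closed interval}, \ d \in \{0, 1\} \right\}, \]
\[ H_2 = \left\{ 1_{\mathbb{R} \times C \times \mathbb{R} \times \mathbb{R}^{d_X}} : C = (-\infty, c], \ c \in \mathbb{R} \right\}, \quad \text{and} \quad H = H_1 \cup H_2. \tag{34} \]

For every probability measure Q with (10), we define $\phi_Q(h, g) = Q(h \cdot g_2)/Q(g_2) - Q(h \cdot g_1)/Q(g_1)$
for every (h, g) ∈ H × G with g = (g1 , g2 ). Testable implication (6) is equivalent to the H0 in

\[ H_0 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) \le 0 \quad \text{and} \quad H_1 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) > 0 \]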

if Q is the underlying probability distribution of the data. Then we can follow the test procedure in
Section 3.1.1 to conduct the test with the function space H × G defined by the H and the G in (34).
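Next, consider the testable implication in Lemma 2.4. Define the function space

\[ H \times G = \left\{ \left( 1_{B \times \{d\} \times \mathbb{R} \times \mathbb{R}^{d_X}}, \left( 1_{\mathbb{R} \times \mathbb{R} \times \{z\} \times \{x_l\}}, 1_{\mathbb{R} \times \mathbb{R} \times \{z'\} \times \{x_l\}} \right) \right) : B \text{ is a closed interval}, \ (d, z, z') \in C, \ l = 1, 2, \ldots, L \right\}. \tag{35} \]

For every probability measure Q with (10), we define $\phi_Q(h, g) = Q(h \cdot g_2)/Q(g_2) - Q(h \cdot g_1)/Q(g_1)$
for every (h, g) ∈ H × G with g = (g1 , g2 ). Testable implication (7) is equivalent to the H0 in

\[ H_0 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) \le 0 \quad \text{and} \quad H_1 : \sup_{(h,g) \in H \times G} \phi_Q(h, g) > 0 \]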

if Q is the underlying probability distribution of the data. Then we can follow the test procedure in
Section 3.1.1 to conduct the test with the function space H × G defined in (35).

4 Simulation Evidence
We first designed Monte Carlo simulations for the case where D and Z are both multivalued random
variables such that D ∈ {0, 1, 2} and Z ∈ {0, 1, 2}. Simulation comparisons with Kitagawa (2015)
for the case where D and Z are both binary are given in Appendix C. Each simulation consisted of
1000 Monte Carlo iterations and 1000 bootstrap iterations. To expedite the simulation, we employed
the warp-speed method of Giacomini et al. (2013). As shown in (18), $\sigma_P^2$ is bounded by
$(1/2) \cdot (K - 1)^{-(K-1)}$, where K = 3 in our setting. In each simulation, the measure ν was set to be a Dirac
measure δξ centered at one of the following values of ξ: 0.07, 0.1, 0.13, 0.16, 0.19, 0.22, 0.25, 0.28,
0.3, and 1, or to be a probability measure ν̄ξ that assigns equal probabilities (weights) to the values
of ξ listed above. The nominal significance level α was set to 0.05.

When calculating the supremum in the test statistic $\sqrt{T_n}\, I \circ S_{H \times G}(\hat{\phi}_{P_n}/M(\hat{\sigma}_{P_n}))$ in (23), we
followed the numerical computation approach used by Kitagawa (2015). Specifically, we calculated
the supremum using only the closed intervals B with the values of $\{Y_i\}_{i=1}^{n}$ observed in the data as
the endpoints, that is, B = [a, b] with a, b ∈ {Y1 , Y2 , . . . , Yn } and a ≤ b. It is not hard to show that
the test statistic calculated in this way is equal to that in (23). We also used such closed intervals to
calculate the bootstrap test statistic $I \circ S_{\widehat{\Psi}_{H \times G}}(\sqrt{T^B_n}\, (\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/M(\hat{\sigma}^B_{P_n}))$ in (30). From all such
intervals, we found those that satisfy the inequality in (27) and used them to calculate the supremum
of $\sqrt{T^B_n}\, (\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/M(\hat{\sigma}^B_{P_n})$ for each ξ listed above.
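The reduction to intervals with observed endpoints can be sketched as follows. This is a minimal illustration with hypothetical function names: it enumerates every closed interval [a, b] with a ≤ b whose endpoints are distinct observed outcome values, and evaluates the corresponding indicators on the sample.

```python
import numpy as np

def closed_intervals(y):
    """All [a, b] with a <= b and both endpoints among the observed values of y."""
    pts = np.unique(y)                 # sorted distinct observed outcomes
    i, j = np.triu_indices(len(pts))   # index pairs with i <= j
    return np.column_stack((pts[i], pts[j]))

def interval_indicators(y, intervals):
    """Boolean matrix whose (i, k) entry is 1{y_i in the k-th interval}."""
    y = np.asarray(y)[:, None]
    return (y >= intervals[:, 0]) & (y <= intervals[:, 1])

y_obs = np.array([3.0, 1.0, 2.0, 2.0])
intervals = closed_intervals(y_obs)
indicators = interval_indicators(y_obs, intervals)
```

With m distinct outcome values there are m(m + 1)/2 such intervals, so for large samples the enumeration grows quadratically and a coarser endpoint grid may be preferable in exploratory work.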

4.1 Size Control and Tuning Parameter Selection


The first set of simulations was designed to investigate the size of the test and the selection of
the tuning parameter. As shown in (27), the estimate $\widehat{\Psi}_{H \times G}$ involves a tuning parameter $\tau_n$ with
$\tau_n \to \infty$ and $\tau_n/\sqrt{n} \to 0$ as $n \to \infty$. In practice, we need to use a particular value of $\tau_n$ for
each sample size n. For this set of simulations, we set n to 3000 and $\tau_n$ to 1, 2, 3, 4, and ∞. For
$\tau_n = \infty$, $\widehat{\Psi}_{H \times G} = H \times G$ and the test is conservative. We compared the rejection rates obtained
using each of these values of $\tau_n$ and decided which value would be a good option for sample sizes
close to 3000. We let U ∼ Unif(0, 1), V ∼ Unif(0, 1), N0 ∼ N(0, 1), N1 ∼ N(1, 1), N2 ∼ N(2, 1),
Z = 2 × 1{U ≤ 0.5} + 1{0.5 < U ≤ 0.7} (P(Z = 2) = 0.5), Dz = 2 × 1{V ≤ 0.33} + 1{0.33 < V ≤ 0.66}
for z = 0, 1, 2, $D = \sum_{z=0}^{2} 1\{Z = z\} \times D_z$, and $Y = \sum_{d=0}^{2} 1\{D = d\} \times N_d$. All the variables U , V ,
N0 , N1 , and N2 are mutually independent. Clearly, Assumption 2.2 holds in this case with z1 = 0,
z2 = 1, and z3 = 2.
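The data generating process just described can be written down directly. The sketch below is illustrative only; the function name is hypothetical, and drawing Y from N(D, 1) is used as a distributionally equivalent shortcut for selecting among independent N0, N1, and N2.

```python
import numpy as np

def simulate_null_dgp(n, rng):
    """Draw (Y, D, Z) from the size-simulation design of Section 4.1."""
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    z = 2 * (u <= 0.5) + 1 * ((u > 0.5) & (u <= 0.7))
    # D_z is the same for every z, so the instrument does not move the treatment.
    d = 2 * (v <= 0.33) + 1 * ((v > 0.33) & (v <= 0.66))
    # Y = N_D with N_d ~ N(d, 1).
    y = rng.normal(loc=d, scale=1.0)
    return y, d, z

y_sim, d_sim, z_sim = simulate_null_dgp(3000, np.random.default_rng(0))
```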
Table 1 shows the results of the simulations. The rejection rates were influenced by the values of
τn and ξ. For each measure ν, a smaller τn yields greater rejection rates, because a smaller τn leads
to a smaller critical value according to (27). For τn = 2, all the rejection rates were close to those
for τn = ∞ (the conservative case). Similar to the pattern of the results shown in Kitagawa (2015),
some rejection rates for τn = 2 with δξ centered at particular values of ξ were slightly upwardly
biased compared to the nominal size. Overall, however, the results showed good performance of
the test in terms of size control. When sample sizes are less than or close to 3000, we suggest using
τn = 2 in practice to achieve good size control without a significant power loss. When the sample
size increases, τn should be increased accordingly. It is also worth noting that when we used the
measure ν̄ξ , the rejection rates could be well controlled by the nominal significance level. Thus if
we have no additional information about the choice of ξ, ν̄ξ can be a default choice for us.

Table 1: Rejection Rates under H0 for Multivalued D and Multivalued Z

       ξ for δξ
τn     0.07   0.1    0.13   0.16   0.19   0.22   0.25   0.28   0.3    1      ν̄ξ
1      0.079  0.060  0.047  0.068  0.056  0.058  0.061  0.061  0.061  0.061  0.054
2      0.073  0.050  0.037  0.050  0.050  0.055  0.048  0.048  0.048  0.048  0.047
3      0.073  0.048  0.037  0.050  0.050  0.049  0.048  0.048  0.048  0.048  0.047
4      0.073  0.048  0.037  0.050  0.050  0.049  0.048  0.048  0.048  0.048  0.047
∞      0.073  0.048  0.037  0.050  0.050  0.049  0.048  0.048  0.048  0.048  0.047

4.2 Rejection Rates against Fixed Alternatives


The second set of simulations was designed to investigate the power of the test. Six data generating
processes (DGPs) in total were considered, and Assumption 2.2 did not hold with z1 = 0, z2 = 1, and
z3 = 2. Sample sizes were set to n = 200, 600, 1000, 1100, and 2000. The probability P(Z = 2) = rn ,
with rn = 1/2, 1/6, 1/2, 1/11, and 1/2 for the corresponding sample sizes. We set τn to 2, as
suggested in the preceding set of simulations. DGPs (1)–(4) are the cases where (3) was violated and
(4) was not violated, and DGPs (5) and (6) are the cases where both (3) and (4) were violated. We
let U ∼ Unif(0, 1), V ∼ Unif(0, 1), W ∼ Unif(0, 1), and Z = 2 × 1{U ≤ rn } + 1{rn < U ≤ rn + 0.2}.
For DGPs (1)–(4), we let Dz = 2 × 1{V ≤ 0.45} + 1{0.45 < V ≤ 0.55} for z = 0, 1, 2,
$D = \sum_{z=0}^{2} 1\{Z = z\} \times D_z$, $N_{00} \sim N(0, 1)$, $N_{10} \sim N(0, 1)$, and $N_{dz} \sim N(0, 1)$ for d = 0, 1, 2 and z = 1, 2.

(1): $N_{20} \sim N(-0.7, 1)$ and $Y = \sum_{z=0}^{2} 1\{Z = z\} \times \left( \sum_{d=0}^{2} 1\{D = d\} \times N_{dz} \right)$.

(2): $N_{20} \sim N(0, 1.675^2)$ and $Y = \sum_{z=0}^{2} 1\{Z = z\} \times \left( \sum_{d=0}^{2} 1\{D = d\} \times N_{dz} \right)$.

(3): $N_{20} \sim N(0, 0.515^2)$ and $Y = \sum_{z=0}^{2} 1\{Z = z\} \times \left( \sum_{d=0}^{2} 1\{D = d\} \times N_{dz} \right)$.

(4): $N_{20a} \sim N(-1, 0.125^2)$, $N_{20b} \sim N(-0.5, 0.125^2)$, $N_{20c} \sim N(0, 0.125^2)$, $N_{20d} \sim N(0.5, 0.125^2)$,
$N_{20e} \sim N(1, 0.125^2)$, $N_{20} = 1\{W \le 0.15\} \times N_{20a} + 1\{0.15 < W \le 0.35\} \times N_{20b} + 1\{0.35 < W \le 0.65\} \times N_{20c} + 1\{0.65 < W \le 0.85\} \times N_{20d} + 1\{W > 0.85\} \times N_{20e}$,
and $Y = \sum_{z=0}^{2} 1\{Z = z\} \times \left( \sum_{d=0}^{2} 1\{D = d\} \times N_{dz} \right)$.

For DGPs (5) and (6), we let N0 ∼ N(0, 1), N1 ∼ N(1, 1), and N2 ∼ N(2, 1).

(5): D0 = 2 × 1{V ≤ 0.6} + 1{0.6 < V ≤ 0.8}, D1 = 2 × 1{V ≤ 0.33} + 1{0.33 < V ≤ 0.66},
D2 = D1 , $D = \sum_{z=0}^{2} 1\{Z = z\} \times D_z$, and $Y = \sum_{d=0}^{2} 1\{D = d\} \times N_d$.

(6): D0 = 2 × 1{V ≤ 0.33} + 1{0.33 < V ≤ 0.66}, D1 = 2 × 1{V ≤ 0.6} + 1{0.6 < V ≤ 0.8},
D2 = D0 , $D = \sum_{z=0}^{2} 1\{Z = z\} \times D_z$, and $Y = \sum_{d=0}^{2} 1\{D = d\} \times N_d$.

All the variables U , V , N00 , N10 , N20 , N01 , N11 , N21 , N02 , N12 , N22 , N0 , N1 , and N2 were set
to be mutually independent for each DGP. We briefly explain how DGPs (1)–(4) violate (3), which
is shown graphically in Figure 2. We let $p_z(y, d)$ be the derivative of P(Y ∈ (−∞, y], D = d|Z = z)
with respect to y for all d, z ∈ {0, 1, 2}. Similarly to Figure 1, if (3) were true, then we would have
$p_0(y, 2) \le p_1(y, 2) \le p_2(y, 2)$ everywhere. For DGPs (1)–(4), $p_1(y, 2) = p_2(y, 2)$ held for all y, but
$p_0(y, 2) \le p_1(y, 2)$ did not hold on some range of R. DGPs (5) and (6) are the cases where the
monotonicity assumption did not hold and both (3) and (4) were violated.
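DGPs (1)–(3) differ only in the distribution of the outcome in the cell D = 2, Z = 0, which makes them easy to simulate. The following sketch uses a hypothetical function name and signature; `mean20` and `sd20` are the mean and standard deviation of N20, so DGP (1) corresponds to (−0.7, 1), DGP (2) to (0, 1.675), and DGP (3) to (0, 0.515).

```python
import numpy as np

def simulate_alternative_dgp(n, r_n, mean20, sd20, rng):
    """Draw (Y, D, Z) from DGPs (1)-(3) of Section 4.2."""
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    z = 2 * (u <= r_n) + 1 * ((u > r_n) & (u <= r_n + 0.2))
    d = 2 * (v <= 0.45) + 1 * ((v > 0.45) & (v <= 0.55))
    # N_dz ~ N(0, 1) for every cell except (d, z) = (2, 0).
    y = rng.normal(size=n)
    cell = (d == 2) & (z == 0)
    y[cell] = rng.normal(mean20, sd20, size=int(cell.sum()))
    return y, d, z

y_alt, d_alt, z_alt = simulate_alternative_dgp(
    20000, r_n=0.5, mean20=-0.7, sd20=1.0, rng=np.random.default_rng(0))
shifted = (d_alt == 2) & (z_alt == 0)
```

Only the density of Y in the cell D = 2, Z = 0 changes, which is why the violation shows up in the comparison of p0(y, 2) with p1(y, 2).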

Figure 2: Curves of p0 (y, 2) (dashed) and p1 (y, 2) (solid) for DGPs (1)–(4)
[Four panels, each plotted for y from −2 to 2: (a) DGP (1), (b) DGP (2), (c) DGP (3), (d) DGP (4).]

Table 2 shows the rejection rates under DGPs (1)–(6), that is, the power of the test. For each DGP
and each measure ν, the rejection rate increased as the sample size n was increased. The results for
ν = ν̄ξ showed that if we have no information about the choice of ξ, using the weighted average of
the statistics over ξ is a desirable option. When n > 200, the rejection rates for using ν = ν̄ξ were at
a relatively high level compared to the results for using a Dirac measure.

5 Empirical Application
We revisit one empirical example discussed by Kitagawa (2015) to show the performance of the
proposed test in practice. The example is from Card (1993), who used college proximity as an
instrument for years of schooling to study the causal link between education and earnings. The data
are from the Young Men Cohort of the National Longitudinal Survey. In the original study of Card
(1993), the educational level D is a multivalued treatment variable, while Kitagawa (2015) treated
it as a binary treatment variable T with T = 1{D ≥ 16}. The results of the test of Kitagawa (2015)
showed that the instrument was not valid when no covariates were controlled.
We use the originally defined treatment variable D to reconduct the test. Specifically, the
treatment D is educational attainment observed in 1976 (the variable “ed76”), the instrument Z is whether
an individual grew up near a 4-year college (the variable “nearc4”), and the outcome is log wage
observed in 1976 (the variable “lwage76”) in the data set. The available sample size is 3010. We
follow the setup in Section 3 with D = {1, 2, . . . , 18} and Z = {0, 1}. The instrument Z = 1 implies
that an individual grew up near a 4-year college. Table 3 shows the p-values obtained from our
test using each measure ν. From these results we conclude that we do not reject the validity of
instrument Z.
The testable implication used by Kitagawa (2015) for binary T is that

\[ P(Y \in B, T = 0 \mid Z = 1) - P(Y \in B, T = 0 \mid Z = 0) \le 0 \quad \text{and} \quad P(Y \in B, T = 1 \mid Z = 1) - P(Y \in B, T = 1 \mid Z = 0) \ge 0 \tag{36} \]

for all closed intervals B. The inequalities in (36) are equivalent to the following for all closed
intervals B:

\[ P(Y \in B, D < 16 \mid Z = 1) - P(Y \in B, D < 16 \mid Z = 0) \le 0 \quad \text{and} \quad P(Y \in B, D \ge 16 \mid Z = 1) - P(Y \in B, D \ge 16 \mid Z = 0) \ge 0, \tag{37} \]

which are different from those in the testable implication given in (3) and (4) and are not implied by
Assumption 2.2. Thus a valid instrument Z for multivalued D which satisfies the testable implication
given in (3) and (4) may not satisfy the inequalities in (36), that is, Z may not remain valid for binary
(or coarsened) T . This provides a possible explanation for why we do not reject the validity of Z but
Kitagawa (2015) rejected it.

Table 2: Rejection Rates under H1 for Multivalued D and Multivalued Z

             ξ for δξ
DGP   n      0.07   0.1    0.13   0.16   0.19   0.22   0.25   0.28   0.3    1      ν̄ξ
(1)   200    0.060  0.140  0.175  0.200  0.185  0.155  0.153  0.153  0.153  0.153  0.159
      600    0.672  0.683  0.616  0.482  0.323  0.230  0.214  0.214  0.214  0.214  0.516
      1000   0.606  0.729  0.790  0.792  0.775  0.738  0.715  0.715  0.715  0.715  0.777
      1100   0.889  0.859  0.720  0.504  0.314  0.216  0.217  0.217  0.217  0.217  0.658
      2000   0.969  0.988  0.993  0.987  0.989  0.979  0.975  0.975  0.975  0.975  0.991
(2)   200    0.030  0.060  0.074  0.076  0.076  0.069  0.072  0.072  0.072  0.072  0.064
      600    0.347  0.168  0.069  0.054  0.059  0.059  0.056  0.056  0.056  0.056  0.083
      1000   0.404  0.379  0.294  0.146  0.088  0.059  0.062  0.062  0.062  0.062  0.153
      1100   0.434  0.123  0.054  0.059  0.059  0.059  0.060  0.060  0.060  0.060  0.084
      2000   0.896  0.897  0.775  0.521  0.269  0.177  0.154  0.154  0.154  0.154  0.635
(3)   200    0.087  0.177  0.240  0.307  0.325  0.297  0.290  0.290  0.290  0.290  0.262
      600    0.695  0.719  0.728  0.693  0.577  0.466  0.434  0.434  0.434  0.434  0.673
      1000   0.660  0.743  0.826  0.856  0.880  0.887  0.875  0.875  0.875  0.875  0.878
      1100   0.884  0.924  0.899  0.773  0.622  0.516  0.517  0.517  0.517  0.517  0.840
      2000   0.968  0.985  0.991  0.995  0.995  0.998  0.999  0.999  0.999  0.999  0.999
(4)   200    0.038  0.099  0.147  0.155  0.148  0.138  0.135  0.135  0.135  0.135  0.146
      600    0.402  0.376  0.366  0.290  0.207  0.209  0.189  0.189  0.189  0.189  0.304
      1000   0.331  0.433  0.407  0.406  0.444  0.475  0.477  0.477  0.477  0.477  0.483
      1100   0.498  0.526  0.492  0.355  0.203  0.137  0.137  0.137  0.137  0.137  0.403
      2000   0.597  0.704  0.710  0.725  0.741  0.769  0.791  0.791  0.791  0.791  0.796
(5)   200    0.365  0.487  0.589  0.626  0.685  0.752  0.780  0.780  0.780  0.780  0.699
      600    0.980  0.990  0.995  0.997  0.998  0.998  0.998  0.998  0.998  0.998  0.998
      1000   0.994  0.998  0.999  0.999  1.000  1.000  1.000  1.000  1.000  1.000  1.000
      1100   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
      2000   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
(6)   200    0.372  0.482  0.545  0.616  0.659  0.701  0.711  0.711  0.711  0.711  0.664
      600    0.704  0.823  0.904  0.929  0.962  0.981  0.988  0.988  0.988  0.988  0.965
      1000   0.992  0.999  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
      1100   0.912  0.957  0.979  0.984  0.990  0.995  0.995  0.995  0.995  0.995  0.990
      2000   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000

Table 3: p-values Obtained from the Proposed Test for Each Measure ν

ξ for δξ
0.07   0.1    0.13   0.16   0.19   0.22   0.25   0.28   0.3    1      ν̄ξ
0.958  0.975  0.975  0.975  0.975  0.975  0.975  0.975  0.975  0.975  0.973

To be more explicit, we consider a simpler example. Let U ∼ Unif (0, 1), V ∼ Unif (0, 1), Yd ∼
Unif (d, d + 1) for d ∈ {0, 1, 2}, Z = 1 {U ≤ 0.5}, D0 = 2 × 1 {V ≤ 0.1} + 1 {0.1 < V ≤ 0.5},
D1 = 2 × 1 {V ≤ 0.5} + 1 {0.5 < V ≤ 0.6}, $D = \sum_{z=0}^{1} 1\{Z = z\} \times D_z$, and $Y = \sum_{d=0}^{2} 1\{D = d\} \times Y_d$,
where U , V , Y0 , Y1 , and Y2 are mutually independent. We can verify that Assumption 2.2 holds
for Z and D in this example. It can be shown that for every Borel set B and each z ∈ {0, 1},
P (Y ∈ B, D ≥ 1|Z = z) = P (Y1 ∈ B, Dz = 1) + P (Y2 ∈ B, Dz = 2). Let B = [1, 2]. Then we have

\[ P(Y \in B, D \ge 1 \mid Z = 1) - P(Y \in B, D \ge 1 \mid Z = 0) = P(D_0 = 0, D_1 = 1) - P(D_0 = 1, D_1 = 2) < 0. \tag{38} \]

The inequality in (38) shows that the valid instrument Z for multivalued D does not satisfy
inequalities such as those in (37). Equivalently, the instrument Z is not valid for the coarsened treatment
T = 1{D ≥ 1}. The reason why Z does not remain valid is as follows. Assumption 2.1 for Z and T
specified in this example requires $Y'_{t0} = Y'_{t1}$ almost surely for t ∈ {0, 1}, where $Y'_{tz}$ is the potential
outcome variable for T = t and Z = z with t ∈ {0, 1} and z ∈ {0, 1}. With the potential outcome
variables, we can write

\[ Y = \sum_{d=0}^{2} 1\{D = d\} \cdot Y_d = \sum_{z=0}^{1} 1\{Z = z\} \cdot \left( \sum_{t=0}^{1} 1\{T = t\} \cdot Y'_{tz} \right). \]

For every ω ∈ Ω with Z(ω) = z and T (ω) = 1, $Y(\omega) = \sum_{d=1}^{2} 1\{D_z(\omega) = d\} \cdot Y_d(\omega) = Y'_{1z}(\omega)$. If
$Y'_{10} = Y'_{11}$ almost surely, it follows that

\[ Y'_{10} = Y'_{11} = \sum_{d=1}^{2} 1\{D = d\} \cdot Y_d + 1\{D = 0\} \cdot W \ \text{ almost surely with } \ D = \sum_{z=0}^{1} 1\{Z = z\} \cdot D_z, \tag{39} \]

where W is a random variable such that $W(\omega) = Y'_{10}(\omega) = Y'_{11}(\omega)$ for almost all ω with T (ω) = 0.
However, (39) shows that Z affects $Y'_{10}$ and $Y'_{11}$ through D, and therefore $Y'_{10}$ and $Y'_{11}$ are not
necessarily independent of Z. Thus Assumption 2.1(ii) may fail for Z and (coarsened) T .
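The sign of the difference in (38) can be checked by simulating the simple example. The sketch below uses a hypothetical function name; it draws the example's variables and estimates P(Y ∈ B, D ≥ 1 | Z = 1) − P(Y ∈ B, D ≥ 1 | Z = 0) for B = [1, 2]. From the definitions of D0 and D1, P(D0 = 0, D1 = 1) = 0.1 and P(D0 = 1, D1 = 2) = 0.4, so the population value is −0.3.

```python
import numpy as np

def coarsening_gap(n, rng):
    """Monte Carlo estimate of the left-hand side of (38) with B = [1, 2]."""
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    z = (u <= 0.5).astype(int)
    d0 = 2 * (v <= 0.1) + 1 * ((v > 0.1) & (v <= 0.5))
    d1 = 2 * (v <= 0.5) + 1 * ((v > 0.5) & (v <= 0.6))
    d = np.where(z == 1, d1, d0)
    # Y = Y_D with Y_d ~ Unif(d, d + 1).
    y = rng.uniform(low=d, high=d + 1)
    event = (y >= 1.0) & (y <= 2.0) & (d >= 1)
    return float(event[z == 1].mean() - event[z == 0].mean())

gap = coarsening_gap(200_000, np.random.default_rng(0))
```

A clearly negative estimate illustrates that the second inequality of the binary-treatment implication fails for T = 1{D ≥ 1} even though Z is valid for the multivalued D.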
For empirical or theoretical reasons, we may want to coarsen a multivalued treatment to be a
binary variable in some circumstances. However, Angrist and Imbens (1995, p. 436) and Marshall
(2016) showed that such coarsening may lead to inconsistent estimates for the average per-unit
treatment effect and the effect of obtaining a particular treatment intensity level beyond obtaining
only the preceding level. They provided several special cases in which the estimates could be con-
sistent, such as the case where the instrument only affects reaching a particular treatment intensity
and the case where the effect at all intensities other than a particular one is zero. But further dis-
cussion of Marshall (2016) showed that these cases are often implausible in practice. For the data
set of Card (1993), the treatment variable defined by Kitagawa (2015), T = 1{D ≥ 16}, can be
considered as a four-year college degree. The simple numerical example designed above shows that
coarsening may undermine the validity of the instrument for T , so the IV estimate for the effect of
obtaining a college degree may be inconsistent. This provides another perspective for understand-
ing the inconsistency of the coarsened estimator. In general, therefore, coarsening is not a desirable
option for us. This also shows the significance of the generalization of the test in the present paper.

6 Conclusion
In this paper, we provided a general framework for testing instrument validity in heterogeneous
causal effect models. We generalized the testable implications of the instrument validity assump-
tions in the literature, and based on them we proposed a nonparametric bootstrap test. An extended
continuous mapping theorem and an extended delta method were provided to establish the asymp-
totic distribution of the test statistic, which may be of independent interest. The proposed test can
be applied in more general settings and may achieve power improvement.

References
Abadie, A. (2002). Bootstrap tests for distributional treatment effects in instrumental variable mod-
els. Journal of the American Statistical Association, 97(457):284–292.

Abadie, A., Angrist, J., and Imbens, G. (2002). Instrumental variables estimates of the effect of
subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117.

Ananat, E. O. and Michaels, G. (2008). The effect of marital breakup on the income distribution of
women with children. Journal of Human Resources, 43(3):611–629.

Andrews, D. W. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of the
parameter space. Econometrica, 68(2):399–405.

Andrews, D. W. and Shi, X. (2013). Inference based on conditional moment inequalities. Economet-
rica, 81(2):609–666.

Andrews, D. W. and Soares, G. (2010). Inference for parameters defined by moment inequalities
using generalized moment selection. Econometrica, 78(1):119–157.

Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from social
security administrative records. The American Economic Review, 80(3):313–336.

Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal
effects in models with variable treatment intensity. Journal of the American Statistical Association,
90(430):431–442.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instru-
mental variables. Journal of the American Statistical Association, 91(434):444–455.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and
earnings? The Quarterly Journal of Economics, 106(4):979–1014.

Angrist, J. D. and Krueger, A. B. (1995). Split-sample instrumental variables estimates of the return
to schooling. Journal of Business & Economic Statistics, 13(2):225–235.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton University Press.

Angrist, J. D. and Pischke, J.-S. (2014). Mastering Metrics: The Path from Cause to Effect. Princeton
University Press.

Armstrong, T. B. (2014). Weighted KS statistics for inference on conditional moment inequalities.
Journal of Econometrics, 181(2):92–116.

Armstrong, T. B. and Chan, H. P. (2016). Multiscale adaptive inference on conditional moment
inequalities. Journal of Econometrics, 194(1):24–43.

Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance.
Journal of the American Statistical Association, 92(439):1171–1176.

Barrett, G. F. and Donald, S. G. (2003). Consistent tests for stochastic dominance. Econometrica,
71(1):71–104.

Barrett, G. F., Donald, S. G., and Bhattacharya, D. (2014). Consistent nonparametric tests for Lorenz
dominance. Journal of Business & Economic Statistics, 32(1):1–13.

Beare, B. K. and Fang, Z. (2017). Weak convergence of the least concave majorant of estimators for
a concave distribution function. Electronic Journal of Statistics, 11(2):3841–3870.

Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering. Econometric
Theory, 31(3):471–492.

Beare, B. K. and Shi, X. (2019). An improved bootstrap test of density ratio ordering. Econometrics
and Statistics, 10:9–26.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling.
Working Paper 4483, National Bureau of Economic Research.

Cawley, J. and Meyerhoefer, C. (2012). The medical care costs of obesity: An instrumental variables
approach. Journal of Health Economics, 31(1):219–230.

Chernozhukov, V., Kim, W., Lee, S., and Rosen, A. M. (2015). Implementing intersection bounds in
Stata. The Stata Journal, 15(1):21–44.

Chernozhukov, V., Lee, S., and Rosen, A. M. (2013). Intersection bounds: Estimation and inference.
Econometrica, 81(2):667–737.

Chetverikov, D. (2018). Adaptive tests of conditional moment inequalities. Econometric Theory,
34(1):186–227.

Davydov, Y. A., Lifshits, M. A., and Smorodina, N. V. (1998). Local Properties of Distributions of
Stochastic Functionals, volume 173. American Mathematical Society.

Donald, S. G. and Hsu, Y.-C. (2016). Improving the power of tests of stochastic dominance. Econo-
metric Reviews, 35(4):553–585.

Dümbgen, L. (1993). On nondifferentiable functions and the bootstrap. Probability Theory and
Related Fields, 95(1):125–140.

Eren, O. and Ozbeklik, S. (2014). Who benefits from Job Corps? A distributional analysis of an
active labor market program. Journal of Applied Econometrics, 29(4):586–611.

Fang, Z. and Santos, A. (2018). Inference on directionally differentiable functions. The Review of
Economic Studies, 86(1):377–412.

Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons.

Frölich, M. and Melly, B. (2013). Unconditional quantile treatment effects under endogeneity. Jour-
nal of Business & Economic Statistics, 31(3):346–357.

Giacomini, R., Politis, D. N., and White, H. (2013). A warp-speed method for conducting Monte
Carlo experiments involving bootstrap estimators. Econometric Theory, 29(3):567–589.

Hansen, B. E. (2017). Regression kink with an unknown threshold. Journal of Business & Economic
Statistics, 35(2):228–240.

Heckman, J. J. and Pinto, R. (2018). Unordered monotonicity. Econometrica, 86(1):1–35.

Heckman, J. J., Urzua, S., and Vytlacil, E. (2006). Understanding instrumental variables in models
with essential heterogeneity. The Review of Economics and Statistics, 88(3):389–432.

Heckman, J. J., Urzua, S., and Vytlacil, E. (2008). Instrumental variables in models with multiple
outcomes: The general unordered case. Annales d’Economie et de Statistique, pages 151–174.

Heckman, J. J. and Vytlacil, E. (2005). Structural equations, treatment effects, and econometric
policy evaluation. Econometrica, 73(3):669–738.

Heckman, J. J. and Vytlacil, E. J. (2007). Econometric evaluation of social programs, Part II: Using
the marginal treatment effect to organize alternative econometric estimators to evaluate social
programs, and to forecast their effects in new environments. In Handbook of Econometrics, pages
4875–5143. Amsterdam: Elsevier.

Hirano, K. and Porter, J. R. (2012). Impossibility results for nondifferentiable functionals. Econo-
metrica, 80(4):1769–1790.

Hong, H. and Li, J. (2018). The numerical delta method. Journal of Econometrics, 206(2):379–394.

Horváth, L., Kokoszka, P., and Zitikis, R. (2006). Testing for stochastic dominance using the weighted
McFadden-type statistic. Journal of Econometrics, 133(1):191–205.

Hsu, Y.-C., Liu, C.-A., and Shi, X. (2019). Testing generalized regression monotonicity. Econometric
Theory, 35(6):1146–1200.

Huber, M. and Mellace, G. (2015). Testing instrument validity for LATE identification based on
inequality moment constraints. Review of Economics and Statistics, 97(2):398–411.

Huber, M. and Wüthrich, K. (2018). Local average and quantile treatment effects under endogeneity:
A review. Journal of Econometric Methods, 8(1).

Imbens, G. (2014). Instrumental variables: An econometrician’s perspective. Technical report,
National Bureau of Economic Research.

Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment
effects. Econometrica, 62(2):467–475.

Imbens, G. W. and Rubin, D. B. (1997). Estimating outcome distributions for compliers in instru-
mental variables models. The Review of Economic Studies, 64(4):555–574.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences.
Cambridge University Press.

Kitagawa, T. (2015). A test for instrument validity. Econometrica, 83(5):2043–2063.

Koenker, R., Chernozhukov, V., He, X., and Peng, L. (2017). Handbook of Quantile Regression. CRC
Press.

Lee, S., Song, K., and Whang, Y.-J. (2013). Testing functional inequalities. Journal of Econometrics,
172(1):14–32.

Linton, O., Song, K., and Whang, Y.-J. (2010). An improved bootstrap test of stochastic dominance.
Journal of Econometrics, 154(2):186–202.

Marshall, J. (2016). Coarsening bias: How coarse treatment measurement upwardly biases instru-
mental variable estimates. Political Analysis, 24(2):157–171.

Melly, B. and Wüthrich, K. (2017). Local quantile treatment effects. In Handbook of Quantile Regres-
sion, pages 145–164. Chapman and Hall/CRC.

Mourifié, I. and Wan, Y. (2017). Testing local average treatment effect assumptions. Review of
Economics and Statistics, 99(2):305–313.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688.

Seo, J. (2018). Tests of stochastic monotonicity with improved power. Journal of Econometrics,
207(1):53–70.

Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. (1990). On the application of probability
theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pages 465–
472.

Sun, Z. and Beare, B. K. (2019). Improved nonparametric bootstrap tests of Lorenz dominance.
Journal of Business & Economic Statistics. Forthcoming.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.

Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result.
Econometrica, 70(1):331–341.

Vytlacil, E. (2006). Ordered discrete-choice selection models and local average treatment effect
assumptions: Equivalence, nonequivalence, and representation results. The Review of Economics
and Statistics, 88(3):578–581.

Instrument Validity for Heterogeneous Causal Effects
Supplementary Appendix

Zhenting Sun
National School of Development, Peking University

September 7, 2020

The appendix consists of three sections. Section A provides auxiliary theorems and lemmas, some
of which may be of independent interest, such as the extended continuous mapping theorem and
the extended delta method. Section B contains the proofs of the main results in the paper. Section C
shows the power comparisons between the proposed test and the test of Kitagawa (2015) via Monte
Carlo simulations.

A Auxiliary Results
We follow van der Vaart and Wellner (1996) to introduce some notation we use multiple times in
the appendix. Let (Ω, A, P) be an arbitrary probability space. For an arbitrary map T : Ω → R̄, we
define the outer integral or outer expectation of T with respect to P by

$$E^*[T] = \inf\left\{E[U] : U \ge T,\ U : \Omega \to \bar{\mathbb{R}} \text{ measurable and } E[U] \text{ exists}\right\}.$$

The outer probability of an arbitrary subset B of Ω is

$$P^*(B) = \inf\left\{P(A) : A \supset B,\ A \in \mathcal{A}\right\}.$$

The inner integral (or inner expectation) and the inner probability can be defined as

$$E_*[T] = -E^*[-T] \quad \text{and} \quad P_*(B) = 1 - P^*(\Omega \setminus B),$$

respectively. We denote a minimal measurable majorant of T (resp. a maximal measurable minorant)
by $T^*$ (resp. $T_*$), which always exists by Lemma 1.2.1 of van der Vaart and Wellner (1996). Suppose
T is a real-valued map defined on an arbitrary product probability space
$(\Omega_1 \times \Omega_2, \mathcal{A}_1 \times \mathcal{A}_2, P_1 \times P_2)$.
We write $E^*[T]$ for the outer expectation as before, and for every $\omega_1$, we define

$$E_2^*[T](\omega_1) = \inf \int U(\omega_2)\, dP_2(\omega_2), \tag{A.1}$$

where the infimum is taken over all measurable functions $U : \Omega_2 \to \bar{\mathbb{R}}$ with
$U(\omega_2) \ge T(\omega_1, \omega_2)$ for all $\omega_2$ such that $\int U\, dP_2$ exists. Then
$E_1^*[E_2^*[T]]$ is the outer integral of the function $E_2^*[T] : \Omega_1 \to \bar{\mathbb{R}}$,
and we call $E_1^*[E_2^*[T]]$ the repeated outer expectation. We define the repeated inner expectation
$E_{1*}[E_{2*}[T]]$ analogously.1
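
To make the outer and inner probabilities concrete, the following sketch evaluates them on a small finite space. The four-point sample space, the uniform measure, and the coarse σ-field generated by the partition {1, 2}, {3, 4} are illustrative assumptions and not objects from the paper. The set {1} is not measurable with respect to this σ-field, so its outer and inner probabilities differ, while they coincide for the measurable set {1, 2}.

```python
from fractions import Fraction

# Hypothetical sample space with four equally likely points.
OMEGA = frozenset({1, 2, 3, 4})
# Hypothetical coarse sigma-field generated by the partition {1, 2}, {3, 4}.
SIGMA_FIELD = [frozenset(), frozenset({1, 2}), frozenset({3, 4}), OMEGA]


def prob(event):
    """Probability of a measurable event under the uniform measure."""
    return Fraction(len(event), len(OMEGA))


def outer_prob(b):
    """Outer probability: smallest P(A) over measurable A containing B."""
    return min(prob(a) for a in SIGMA_FIELD if b <= a)


def inner_prob(b):
    """Inner probability: one minus the outer probability of the complement."""
    return 1 - outer_prob(OMEGA - b)


for subset in (frozenset({1}), frozenset({1, 2})):
    print(sorted(subset), outer_prob(subset), inner_prob(subset))
```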

Theorem A.1 (Extended continuous mapping) Let D and E be metric spaces with metrics d and e,
respectively. Let D0 ⊂ D. Let X be Borel measurable and take values in D0 . Suppose, in addition, that
either of the following conditions holds:

(a) Let Dn ⊂ D. Let Xn : Ω → D with Xn (ω) ∈ Dn for all ω ∈ Ω and all n. Let gn be a random
map with gn (ω) : Dn → E (for every ω ∈ Ω, gn (ω) is a map on Dn ). The random map gn
satisfies the condition that for every ε > 0 there is a measurable set A ⊂ Ω with P(A) ≥ 1 − ε
such that if xn → x with xn ∈ Dn and x ∈ D0 , then gn (xn ) converges to g (x) uniformly on A
(supω∈A e(gn (ω)(xn ), g(x)) → 0),2 where g : D0 → E is a fixed (deterministic) map. Also, X is
separable.

(b) Let Dn (ω) ⊂ D for all ω ∈ Ω and all n. Let Xn : Ω → D with Xn (ω) ∈ Dn (ω) for all ω ∈ Ω and all
n. Let gn be a random map with gn (ω) : Dn (ω) → E (for every ω ∈ Ω, gn (ω) is a map on Dn (ω)).
The random map gn satisfies the condition that for every ε > 0 there is a measurable set A ⊂ Ω
with P(A) ≥ 1 − ε such that for every subsequence {xnm }, if xnm → x with xnm ∈ Dnm (ωnm ),
ωnm ∈ A, and x ∈ D0 , then gnm (ωnm ) (xnm ) converges to g (x), where g : D0 → E is a fixed
continuous map.

Then we have that

(i) Xn ⇝ X implies that gn (Xn ) ⇝ g (X);

(ii) If Xn converges to X in outer probability,3 then gn (Xn ) converges to g (X) in outer probability;

(iii) If Xn converges to X outer almost surely,4 then gn (Xn ) converges to g (X) outer almost surely.

Remark A.1 Theorem A.1 is an extension of Theorem 1.11.1 (extended continuous mapping) of van der
Vaart and Wellner (1996). Theorem 1.11.1 of van der Vaart and Wellner (1996) assumes that every
gn is a fixed map. Theorem A.1 allows every gn to be random. Theorem A.1(i) will be used to establish
Theorem A.2 (extended delta method).

1 Additional technical details about the repeated expectations can be found in van der Vaart and Wellner (1996, pp. 10–12).
2 This is a condition similar to almost uniform convergence. See Definition 1.9.1(ii) of van der Vaart and Wellner (1996).
By Lemma 1.9.2(iii) of van der Vaart and Wellner (1996), almost uniform convergence is equivalent to outer almost sure
convergence if the limit is Borel measurable.
3 See Definition 1.9.1(i) of convergence in outer probability in van der Vaart and Wellner (1996).
4 See Definition 1.9.1(iii) of outer almost sure convergence in van der Vaart and Wellner (1996).

Proof of Theorem A.1. Suppose Condition (a) holds. Assume the weakest of the three assumptions:
the one in (i) that Xn ⇝ X. First, let D∞ be the set of all x for which there exists a sequence {xn }
with xn ∈ Dn and xn → x. By the representation theorem (see, for example, Theorem 9.4 of Pollard
(1990) or Theorem 1.10.4 of van der Vaart and Wellner (1996)), along the lines of the second
paragraph in the proof of Theorem 1.11.1 of van der Vaart and Wellner (1996), we can show that
P∗ (X ∈ D∞ ) = 1. Second, fix ε and a measurable set A with P(A) ≥ 1 − ε that satisfies the
assumptions, and suppose there is some subsequence such that xn′ → x with xn′ ∈ Dn′ for all n′
and x ∈ D0 ∩ D∞ . Since x ∈ D∞ , there is a sequence yn → x with yn ∈ Dn for all n. Fill out the
subsequence xn′ to an entire sequence by putting xn = yn for all n ∉ {n′ }. Then by assumption,
gn (xn ) → g(x) uniformly on A on this entire sequence, hence also on the subsequence, that is,
gn′ (xn′ ) → g(x) uniformly on A. Third, let xm → x in D0 ∩ D∞ . For every m, there is a sequence
ym,n ∈ Dn with ym,n → xm as n → ∞. Fix a small ε > 0 and a measurable set A with P(A) ≥ 1 − ε
that satisfies the assumptions. Now we have that gn (ym,n ) → g(xm ) uniformly on A. For every
m, take nm such that |ym,nm − xm | < 1/m and |gnm (ym,nm ) − g(xm )| < 1/m uniformly on A and
such that nm is increasing in m. Then ym,nm → x, and hence gnm (ym,nm ) → g(x) uniformly on
A. Since |g(xm ) − g(x)| ≤ |gnm (ym,nm ) − g(xm )| + |gnm (ym,nm ) − g(x)| uniformly on A, we have
|g(xm ) − g(x)| → 0. Thus g is continuous on D0 ∩ D∞ .
For simplicity of notation, we will write D0 for D0 ∩ D∞ . Without loss of generality, we assume
that X takes its values in D0 . Since g is continuous on D0 now, g(X) is Borel measurable.
(i). Let F be an arbitrary closed set in E. By the assumptions, for every ε > 0 there is a
measurable set A ⊂ Ω with P (A) ≥ 1 − ε such that if xn → x with xn ∈ Dn and x ∈ D0 , then gn (xn )
converges to g (x) uniformly on A, that is, supω∈A |gn (ω) (xn ) − g (x)| → 0. Fix ε and A. Then

$$\bigcap_{k=1}^{\infty} \overline{\bigcup_{m=k}^{\infty} \bigcup_{\omega \in A} (g_m(\omega))^{-1}(F)} \subset g^{-1}(F) \cup (D - D_0). \tag{A.2}$$

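Suppose x is an element of the set on the left-hand side of (A.2). For every n, there exist $n' > n$,
$\omega_{n'} \in A$, and $x_{n'} \in (g_{n'}(\omega_{n'}))^{-1}(F) \subset D_{n'}$ such that $d(x_{n'}, x) \le 1/n$.
Therefore, there is a subsequence $x_{n_m} \in (g_{n_m}(\omega_{n_m}))^{-1}(F) \subset D_{n_m}$ with
$\omega_{n_m} \in A$ such that $n_m \uparrow \infty$ and $x_{n_m} \to x$ as $m \to \infty$. By the
definition of A, either $g_{n_m}(\omega_{n_m})(x_{n_m}) \to g(x)$ or $x \notin D_0$. Since F is closed, this
implies that $g(x) \in F$ or $x \notin D_0$. Then for every k,

$$\limsup_{n\to\infty} P^*\left(g_n(X_n) \in F\right) \le \limsup_{n\to\infty} P^*\left(\left\{\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right\} \cup A^c\right)
= \limsup_{n\to\infty} E\left[\left(1\left\{\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right\} \vee 1\{A^c\}\right)^*\right], \tag{A.3}$$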

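where the equality is from Lemmas 1.2.3(i) and 1.2.1 of van der Vaart and Wellner (1996). Then by
Lemmas 1.2.2(viii), 1.2.1, and 1.2.3(i) of van der Vaart and Wellner (1996),

$$E\left[\left(1\left\{\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right\} \vee 1\{A^c\}\right)^*\right]
= E\left[\left(1\left\{\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right\}\right)^* \vee \left(1\{A^c\}\right)^*\right]
\le P^*\left(\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right) + P(A^c). \tag{A.4}$$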

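By (A.3) and (A.4), together with Theorem 1.3.4(iii) (portmanteau) of van der Vaart and Wellner
(1996), we have

$$\limsup_{n\to\infty} P^*\left(g_n(X_n) \in F\right) \le \limsup_{n\to\infty} P^*\left(\left\{X_n \in \cup_{m=k}^{\infty} g_m^{-1}(F)\right\} \cap A\right) + P(A^c)$$
$$\le \limsup_{n\to\infty} P^*\left(X_n \in \overline{\cup_{m=k}^{\infty} \cup_{\omega \in A} (g_m(\omega))^{-1}(F)}\right) + \varepsilon
\le P\left(X \in \overline{\cup_{m=k}^{\infty} \cup_{\omega \in A} (g_m(\omega))^{-1}(F)}\right) + \varepsilon.$$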

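Letting k → ∞ together with (A.2) gives

$$\limsup_{n\to\infty} P^*\left(g_n(X_n) \in F\right) \le P\left(X \in \bigcap_{k=1}^{\infty} \overline{\cup_{m=k}^{\infty} \cup_{\omega \in A} (g_m(\omega))^{-1}(F)}\right) + \varepsilon \le P\left(g(X) \in F\right) + \varepsilon.$$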

Since ε can be arbitrarily small, we can conclude that lim supn→∞ P∗ (gn (Xn ) ∈ F ) ≤ P (g (X) ∈ F ).
By Theorem 1.3.4(iii) of van der Vaart and Wellner (1996) again, gn (Xn ) ⇝ g (X).

(ii). Choose δn ↓ 0 with P (d (Xn , X) ≥ δn ) → 0. Fix ε > 0. Let A ⊂ Ω be a measurable set
with P (A) ≥ 1 − ε that satisfies the assumptions. Let Bn (ω) be the set of all x such that there is
a y ∈ Dn with d (y, x) < δn and e (gn (ω) (y) , g (x)) > ε. Let Bn = ∪ω∈A Bn (ω). Suppose x ∈ Bn
for infinitely many n. Then there are sequences ωnm ∈ A and xnm ∈ Dnm with xnm → x such
that e (gnm (ωnm ) (xnm ) , g (x)) > ε for each m. This implies that xnm → x with xnm ∈ Dnm but
that gnm (xnm ) does not converge to g(x) uniformly on A. Thus by assumption, x ∉ D0 . Note
that x ∈ lim sup Bn is equivalent to x ∈ Bn for infinitely many n. Thus we can conclude that
lim sup Bn ∩D0 = ∅. Since g is continuous on D0 , Bn ∩D0 is relatively open in D0 and hence relatively
Borel. This is because if z ∈ D0 is close enough to x ∈ Bn ∩ D0 , then d(y, z) ≤ d(y, x) + d(x, z) < δn
and e (gn (ω) (y) , g (z)) ≥ e (gn (ω) (y) , g (x)) − e (g (z) , g (x)) > ε. Since X takes values in D0 by
assumption, by Lemma 1.2.3(i) of van der Vaart and Wellner (1996),

P∗ (X ∈ Bn ) = E ∗ [1 {X ∈ Bn }] = E [1 {X ∈ Bn ∩ D0 }] .

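Also, by the dominated convergence theorem,

$$E\left[1\{X \in B_n \cap D_0\}\right] \le E\left[1\{X \in \cup_{m=n}^{\infty} (B_m \cap D_0)\}\right]
\to E\left[1\{X \in \cap_{n=1}^{\infty} \cup_{m=n}^{\infty} (B_m \cap D_0)\}\right] = P\left(X \in \limsup B_n \cap D_0\right) = 0.$$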

This implies that P∗ (X ∈ Bn ) → 0 as n → ∞. Now we have that

P∗ (e(gn (Xn ), g(X)) > ε) ≤ P∗ ({e(gn (Xn ), g(X)) > ε} ∩ A) + P (Ac )


≤ P∗ (X ∈ Bn or d(Xn , X) ≥ δn ) + ε → ε.

Since ε is arbitrary, the claim holds.


(iii). By Lemmas 1.9.3(i) and 1.9.2(iii) of van der Vaart and Wellner (1996), it suffices to
prove that $\sup_{m \ge n} e(g_m(X_m), g(X))$ converges to 0 in outer probability. Choose $\delta_n \downarrow 0$ with
$P^*\left(\sup_{m \ge n} d(X_m, X) \ge \delta_n\right) \to 0$. Fix ε > 0. Let A ⊂ Ω be a measurable set with P (A) ≥ 1 − ε
such that if xn → x with xn ∈ Dn and x ∈ D0 , then gn (xn ) converges to g (x) uniformly on
A. Let Bn (ω) be the set of all x such that there are m ≥ n and y ∈ Dm with d (y, x) < δn and
e (gm (ω) (y) , g (x)) > ε. Let Bn = ∪ω∈A Bn (ω). Then we can finish the proof along the lines of the
proof of (ii).
Suppose Condition (b) holds. Repeat the proofs of (i), (ii), and (iii) under Condition (a) with
the properties of gn and g under Condition (b). For (ii), let Bn (ω) be the set of all x such that there
is a y ∈ Dn (ω) with d (y, x) < δn and e (gn (ω) (y) , g (x)) > ε. For (iii), let Bn (ω) be the set of all
x such that there are m ≥ n and y ∈ Dm (ω) with d (y, x) < δn and e (gm (ω) (y) , g (x)) > ε. The
key difference is that Condition (a) requires that Xn (ω) ∈ Dn for all ω holds for some fixed Dn .
Condition (b) only requires that Xn (ω) ∈ Dn (ω) for all ω holds for some random Dn which can take
different values Dn (ω) for different ω. On the other hand, Condition (b) strengthens the properties
of gn and g so that the claims hold as well.

Theorem A.2 (Extended delta method) Let D and E be metric spaces, and let rn be constants with
rn → ∞. Let φ̂n : Ω → DF ⊂ D be a random element for every n. Let D0 ⊂ D.

(i) Let $F : D_F \to E$ satisfy the condition that for every ε > 0, there is a measurable set A ⊂ Ω with
P(A) ≥ 1 − ε such that for some map $F'_\phi$ on $D_0$,

$$r_n\left(F(\hat\phi_n + r_n^{-1} h_n) - F(\hat\phi_n)\right) \to F'_\phi(h)$$

uniformly on A for every convergent sequence $\{h_n\} \subset D$ with $\hat\phi_n(\omega) + r_n^{-1} h_n \in D_F$ for all n
and all ω and $h_n \to h \in D_0$. If $X_n : \Omega \to D_F$ are maps with
$X_n(\omega) - \hat\phi_n(\omega) + \hat\phi_n(\omega') \in D_F$ for all $\omega, \omega' \in \Omega$ and
$r_n(X_n - \hat\phi_n) \rightsquigarrow X$, where X is separable and takes its values in $D_0$, then
$r_n(F(X_n) - F(\hat\phi_n)) \rightsquigarrow F'_\phi(X)$. Moreover, if $F'_\phi$ is continuous on all of D, then
$r_n(F(X_n) - F(\hat\phi_n)) - F'_\phi(r_n(X_n - \hat\phi_n))$ converges to zero in outer probability.

(ii) Let $F : D_F \to E$ satisfy the condition that for every ε > 0, there is a measurable set A ⊂ Ω with
P(A) ≥ 1 − ε such that for some continuous map $F'_\phi$ on $D_0$,

$$r_{n_m}\left\{F\left(\hat\phi_{n_m}(\omega_{n_m}) + r_{n_m}^{-1} h_{n_m}\right) - F\left(\hat\phi_{n_m}(\omega_{n_m})\right)\right\} \to F'_\phi(h)$$

for every convergent subsequence $\{h_{n_m}\} \subset D$ with
$\hat\phi_{n_m}(\omega_{n_m}) + r_{n_m}^{-1} h_{n_m} \in D_F$, $\omega_{n_m} \in A$, and
$h_{n_m} \to h \in D_0$. If $X_n : \Omega \to D_F$ are maps with $r_n(X_n - \hat\phi_n) \rightsquigarrow X$, where X takes its values
in $D_0$, then $r_n(F(X_n) - F(\hat\phi_n)) \rightsquigarrow F'_\phi(X)$. Moreover, if $F'_\phi$ is continuous on all of D, then
$r_n(F(X_n) - F(\hat\phi_n)) - F'_\phi(r_n(X_n - \hat\phi_n))$ converges to zero in outer probability.

Remark A.2 Theorem A.2 is an extension of Theorem 3.9.5 (delta method) of van der Vaart and Wellner
(1996). Here, φ̂n is allowed to be random, which is the key difference between the two theorems.
Theorem A.2 is used to establish the asymptotic distribution of the test statistic under the null.
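
A scalar illustration of the differentiability condition in Theorem A.2(i) may help. The sketch below takes F(x) = x², a hypothetical limit φ = 1.5, a fixed direction h, and a random centering φ̂n = φ + u/rn with u drawn from [−1, 1]; all of these are assumptions chosen purely for illustration. In this case rn(F(φ̂n + h/rn) − F(φ̂n)) = 2φ̂n h + h²/rn, which differs from the candidate derivative 2φh by at most (2|h| + h²)/rn whatever the realization of φ̂n, so the convergence is uniform over the random centering.

```python
import math
import random


def scaled_increment(center, h, r_n):
    """r_n * (F(center + h / r_n) - F(center)) for the illustrative F(x) = x**2."""
    return r_n * ((center + h / r_n) ** 2 - center ** 2)


random.seed(0)
phi = 1.5   # hypothetical limit of the random centering
h = 0.7     # fixed direction

results = []
for n in (10**2, 10**4, 10**6):
    r_n = math.sqrt(n)
    # Random centering phi_hat_n within 1 / r_n of phi (an assumption of this sketch).
    center = phi + random.uniform(-1.0, 1.0) / r_n
    gap = abs(scaled_increment(center, h, r_n) - 2 * phi * h)
    results.append((r_n, gap))

for r_n, gap in results:
    print(f"r_n = {r_n:8.1f}   |increment - 2*phi*h| = {gap:.6f}")
```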

Proof of Theorem A.2. (i). The proof mainly relies on the results of Theorem A.1. Define
$D_n(\omega) = \{h \in D : \hat\phi_n(\omega) + r_n^{-1} h \in D_F\}$ for every n and every ω ∈ Ω. Let
$D_n = \cap_{\omega \in \Omega} D_n(\omega)$. Define
$g_n(\omega)(h) = r_n\left(F(\hat\phi_n(\omega) + r_n^{-1} h) - F(\hat\phi_n(\omega))\right)$ for every n, every ω ∈ Ω, and every
$h \in D_n$. Here, $g_n$ is a random map because of $\hat\phi_n$. For every n and every ω ∈ Ω, $g_n(\omega) : D_n \to E$.
By the assumptions, for every ε > 0 there is a measurable set A ⊂ Ω with P(A) ≥ 1 − ε such that if $h_n \in D_n$ with
$h_n \to h \in D_0$, then $g_n(h_n) \to F'_\phi(h)$ uniformly on A. Also, $r_n(X_n(\omega) - \hat\phi_n(\omega)) \in D_n$ for all ω by
assumption. Now by Theorem A.1(i) (under Condition (a)),

$$r_n\left(F(X_n) - F(\hat\phi_n)\right) = g_n\left(r_n(X_n - \hat\phi_n)\right) \rightsquigarrow F'_\phi(X).$$

Moreover, suppose $F'_\phi$ is continuous on all of D, and let $f_n(h) = (g_n(h), F'_\phi(h))$ for every $h \in D_n$.
By Theorem A.1(i) again,

$$\left(r_n(F(X_n) - F(\hat\phi_n)),\ F'_\phi(r_n(X_n - \hat\phi_n))\right) = f_n\left(r_n(X_n - \hat\phi_n)\right) \rightsquigarrow \left(F'_\phi, F'_\phi\right)(X).$$

Thus by Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996),
$r_n(F(X_n) - F(\hat\phi_n)) - F'_\phi(r_n(X_n - \hat\phi_n)) \rightsquigarrow 0$. The claim follows from Lemma 1.10.2(iii) of van der Vaart and
Wellner (1996).
(ii). Together with the continuity of Fφ′ , by arguments similar to the proof of (i), we can show
that the claim holds by Theorem A.1(i) (under Condition (b)).

Lemma A.1 Let P be the set of probability measures defined in Section 3. Let H1 , H̄1 , H2 , H̄2 , H, and
H̄ be as in (9). Then for every Q ∈ P, the closures of H1 and H2 in L2 (Q) are equal to H̄1 and H̄2 ,
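respectively. Also, the closure of H in L2 (Q) is equal to H̄ for every Q ∈ P.

Proof of Lemma A.1. Let $\mathcal{H}_{1d} = \{(-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} : B \text{ is a closed interval in } \mathbb{R}\}$ for d ∈ {0, 1}. We
first show that the closure of $\mathcal{H}_{1d}$ in $L_2(Q)$ is equal to

$$\bar{\mathcal{H}}_{1d} = \left\{(-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} : B \text{ is a closed, open, or half-closed interval in } \mathbb{R}\right\}.$$

If this is true, the first claim of the Lemma follows from $\bar{\mathcal{H}}_1 = \bar{\mathcal{H}}_{10} \cup \bar{\mathcal{H}}_{11}$.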
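Suppose there is a sequence $\{h_n\} \subset \mathcal{H}_{1d}$ such that $\|h_n - h\|_{L_2(Q)} \to 0$ for some $h \in L_2(Q)$.
Then $\{h_n\}$ is a Cauchy sequence, that is, $\|h_n - h_m\|_{L_2(Q)} \to 0$ as $n, m \to \infty$. By the definition of $\mathcal{H}_{1d}$,
$h_n = (-1)^d \cdot 1_{B_n \times \{d\} \times \mathbb{R}}$, where $B_n$ is a closed interval in $\mathbb{R}$. It is possible that
$\int 1_{B_n \times \{d\} \times \mathbb{R}}\, dQ \to 0$, and in this case there is a $B = \{a\}$ for some $a \in \mathbb{R}$ such that
$Q(B \times \mathbb{R} \times \mathbb{R}) = 0$ and $h_n \to (-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} \in \mathcal{H}_{1d}$. If
$\int 1_{B_n \times \{d\} \times \mathbb{R}}\, dQ \not\to 0$, then there is an ε > 0 such that for all $n_\varepsilon > 0$, there
is an $n > n_\varepsilon$ such that $\|h_n\|^2_{L_2(Q)} > \varepsilon$. For a $\delta_1 \ll \varepsilon$, there is an $N_1$ such that
$\|h_n - h_m\|^2_{L_2(Q)} < \delta_1$ for all $m, n > N_1$. Thus there is an $n_1 > N_1$ such that $\|h_{n_1}\|^2_{L_2(Q)} > \varepsilon$ and
$\|h_n - h_{n_1}\|^2_{L_2(Q)} < \delta_1$ for all $n > N_1$. Now let $\delta_2$ be such that $0 < \delta_2 \ll \delta_1$. Then there is an
$N_2 > n_1$ such that $\|h_n - h_m\|^2_{L_2(Q)} < \delta_2$ for all $m, n > N_2$. Thus there is an $n_2 > N_2$ such that
$\|h_{n_2}\|^2_{L_2(Q)} > \varepsilon$ and $\|h_n - h_{n_2}\|^2_{L_2(Q)} < \delta_2$ for all $n > N_2$. In this way, we can find a sequence
$\{h_{n_k}\}_k$ with $h_{n_k} = (-1)^d \cdot 1_{B_{n_k} \times \{d\} \times \mathbb{R}}$, $\|h_{n_k}\|^2_{L_2(Q)} > \varepsilon$,
$\|h_n - h_{n_k}\|^2_{L_2(Q)} < \delta_k$ for all $n > n_k$, and $\delta_k \downarrow 0$. Let
$B^\infty = \cup_{j=1}^{\infty} \cap_{k=j}^{\infty} B_{n_k}$. For every K, $\|h_{n_k} - h_{n_K}\|^2_{L_2(Q)} < \delta_K$ for all $k > K$.
Notice that for every $K' > K$,

$$\left\|h_{n_K} - (-1)^d \cdot 1_{(\cap_{k=K'}^{\infty} B_{n_k}) \times \{d\} \times \mathbb{R}}\right\|^2_{L_2(Q)}
= \int \left|1_{B_{n_K} \times \{d\} \times \mathbb{R}} - 1_{(\cap_{k=K'}^{\infty} B_{n_k}) \times \{d\} \times \mathbb{R}}\right|^2 dQ$$
$$= \int 1_{(B_{n_K} \setminus (\cap_{k=K'}^{\infty} B_{n_k})) \times \{d\} \times \mathbb{R}}\, dQ
+ \int 1_{((\cap_{k=K'}^{\infty} B_{n_k}) \setminus B_{n_K}) \times \{d\} \times \mathbb{R}}\, dQ.$$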

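Because $B_{n_k}$ is a closed interval for all k, we have that for every $K'' \ge K'$, there exist $L_1$ and $L_2$
with $K' \le L_1 \le L_2 \le K''$ such that
$\cup_{k=K'}^{K''} (B_{n_K} \setminus B_{n_k}) = (B_{n_K} \setminus B_{n_{L_1}}) \cup (B_{n_K} \setminus B_{n_{L_2}})$. Then since

$$\|h_{n_k} - h_{n_K}\|^2_{L_2(Q)} = Q\left((B_{n_K} \setminus B_{n_k}) \times \{d\} \times \mathbb{R}\right) + Q\left((B_{n_k} \setminus B_{n_K}) \times \{d\} \times \mathbb{R}\right) < \delta_K$$

for all $k > K$, we have

$$\int 1_{(B_{n_K} \setminus (\cap_{k=K'}^{\infty} B_{n_k})) \times \{d\} \times \mathbb{R}}\, dQ
= Q\left(\left(B_{n_K} \setminus (\cap_{k=K'}^{\infty} B_{n_k})\right) \times \{d\} \times \mathbb{R}\right)
= Q\left(\cup_{k=K'}^{\infty} (B_{n_K} \setminus B_{n_k}) \times \{d\} \times \mathbb{R}\right) \le 2\delta_K.$$

Similarly, it is easy to show that
$\int 1_{((\cap_{k=K'}^{\infty} B_{n_k}) \setminus B_{n_K}) \times \{d\} \times \mathbb{R}}\, dQ \le 2\delta_K$. Thus it follows that

$$\left\|h_{n_K} - (-1)^d \cdot 1_{(\cap_{k=K'}^{\infty} B_{n_k}) \times \{d\} \times \mathbb{R}}\right\|^2_{L_2(Q)} \le 4\delta_K,$$

which is true for all $K' > K$. Letting $K' \to \infty$, by the dominated convergence theorem
(recall that $B^\infty = \cup_{j=1}^{\infty} \cap_{k=j}^{\infty} B_{n_k}$) we have

$$\left\|h_{n_K} - (-1)^d \cdot 1_{B^\infty \times \{d\} \times \mathbb{R}}\right\|^2_{L_2(Q)} \le 4\delta_K.$$

This implies that $\|h_{n_K} - (-1)^d \cdot 1_{B^\infty \times \{d\} \times \mathbb{R}}\|_{L_2(Q)} \to 0$ as $K \to \infty$, because $\delta_K \downarrow 0$. Finally, we have

$$\left\|h_n - (-1)^d \cdot 1_{B^\infty \times \{d\} \times \mathbb{R}}\right\|_{L_2(Q)} \le \|h_n - h_{n_K}\|_{L_2(Q)} + \left\|h_{n_K} - (-1)^d \cdot 1_{B^\infty \times \{d\} \times \mathbb{R}}\right\|_{L_2(Q)} \to 0.$$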

Clearly, B ∞ can be a closed, open, or half-closed interval in R. Also, every element of H̄1d is equal to
the limit of a sequence of elements of H1d under the L2 (Q) norm. Thus the closure of H1d in L2 (Q)
is equal to H̄1d for every Q ∈ P. Similarly, we can show that the closure of H2 in L2 (Q) is equal to
H̄2 for every Q ∈ P. As a result, the closure of H = H1 ∪ H2 in L2 (Q) is equal to H̄ = H̄1 ∪ H̄2 for
every Q ∈ P.
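
The closure statement in Lemma A.1 can be illustrated with a simple case: indicators of the closed intervals [1/n, 1 − 1/n] approach the indicator of the open interval (0, 1) in L2(Q). The sketch below assumes, purely for illustration, that the first coordinate is uniform on (0, 1) under Q and ignores the other coordinates; the squared L2(Q) distance between two interval indicators is then the Q-mass of their symmetric difference. Function names are hypothetical.

```python
from fractions import Fraction


def mass(a, b):
    """Uniform(0, 1) mass of an interval with endpoints a and b (endpoints carry no mass)."""
    lo, hi = max(a, Fraction(0)), min(b, Fraction(1))
    return max(hi - lo, Fraction(0))


def sq_l2_distance(i1, i2):
    """Squared L2(Q) distance between two interval indicators: Q(symmetric difference)."""
    inter = mass(max(i1[0], i2[0]), min(i1[1], i2[1]))
    return mass(*i1) + mass(*i2) - 2 * inter


def closure_gap(n):
    """Squared distance between 1_[1/n, 1 - 1/n] and 1_(0, 1)."""
    closed = (Fraction(1, n), 1 - Fraction(1, n))
    return sq_l2_distance(closed, (Fraction(0), Fraction(1)))


for n in (4, 40, 400):
    print(n, closure_gap(n))
```

The gap equals 2/n in this setup, so the open-interval indicator is an L2(Q) limit of closed-interval indicators even though it is not itself of that form.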

Lemma A.2 Let H1 and H2 be defined as in (9). Then H1 is a VC class5 with VC index V (H1 ) = 3,
and H2 is a VC class with VC index V (H2 ) = 2.

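5 See the definition of VC class of functions in van der Vaart and Wellner (1996, p. 141).

Proof of Lemma A.2. All the functions $h \in \mathcal{H}_1$ take the form $h = -1_{B \times \{1\} \times \mathbb{R}}$ or
$h = 1_{B \times \{0\} \times \mathbb{R}}$, where B is a closed interval. If $h = -1_{B \times \{1\} \times \mathbb{R}}$, the subgraph of h is

$$C_{1B} = \left\{(y, w, z, t) \in \mathbb{R}^4 : t < -1_{B \times \{1\} \times \mathbb{R}}(y, w, z)\right\}.$$

If $h = 1_{B \times \{0\} \times \mathbb{R}}$, the subgraph of h is

$$C_{0B} = \left\{(y, w, z, t) \in \mathbb{R}^4 : t < 1_{B \times \{0\} \times \mathbb{R}}(y, w, z)\right\}.$$

Let $\mathcal{C} = \{C_{dB} : B \text{ is a closed interval in } \mathbb{R},\ d \in \{0, 1\}\}$.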


Suppose there are two different points a1 = (y1 , w1 , z1 , t1 ) , a2 = (y2 , w2 , z2 , t2 ) ∈ R4 with y1 <
y2 , w1 = w2 = 0, and 0 ≤ t1 , t2 < 1. Then there is a point ȳ ∈ (y1 , y2 ). Let B0 = {ȳ}, B1 = [y1 , ȳ],
B2 = [ȳ, y2 ], and B3 = [y1 , y2 ]. Now we have ∅ = C0B0 ∩ {a1 , a2 }, {a1 } = C0B1 ∩ {a1 , a2 },
{a2 } = C0B2 ∩ {a1 , a2 }, and {a1 , a2 } = C0B3 ∩ {a1 , a2 }. Thus C shatters {a1 , a2 }.
Suppose now that there are three different points a1 = (y1 , w1 , z1 , t1 ), a2 = (y2 , w2 , z2 , t2 ), a3 =
(y3 , w3 , z3 , t3 ) in R4 . Without loss of generality, suppose t1 ≤ t2 ≤ t3 < 1, so that it is possible for C
to pick out {aj } for each j ∈ {1, 2, 3}.

(1) Suppose t1 ≥ 0. In this case, we need w1 = w2 = w3 = 0 in order to pick out {aj } for each j.
Without loss of generality, suppose y1 ≤ y2 ≤ y3 . If we want C to pick out {a1 , a3 }, we need to
find a closed interval B such that y1 , y3 ∈ B, in which case a1 , a3 ∈ C0B . However, a2 ∈ C0B
for all such B.

(2) Suppose t1 < 0, t2 ≥ 0. Then we need w2 = w3 = 0 in order to pick out {aj } for each j ∈ {2, 3}
by using C0B for some closed interval B. But in this case, C can never pick out {a2 }, {a3 }, or
{a2 , a3 }, since for every closed interval B, a1 ∈ C0B .

(3) Suppose t1 , t2 < 0, t3 ≥ 0. Then we need w3 = 0 in order to pick out {a3 } by using C0B for
some closed interval B. In this case, C can never pick out {a3 }, since for every closed interval
B, a1 , a2 ∈ C0B .

(4) Suppose t1 , t2 , t3 < 0. Then for every closed interval B, a1 , a2 , a3 ∈ C0B . If we want C to pick
out {aj , aj ′ } for all j ≠ j ′ , we need to use C1B . If wj ≠ 1, then for every B, aj ∈ C1B . Thus
we consider w1 = w2 = w3 = 1.

(a) Suppose −1 ≤ t1 , t2 , t3 < 0. Without loss of generality, we assume that y1 ≤ y2 ≤ y3 .
But now if we want C to pick out {a2 }, we need to find a closed interval B such that
y1 , y3 ∈ B but y2 ∉ B, which is not possible.
(b) Suppose tj < −1 for some j ∈ {1, 2, 3}. In this case, aj ∈ C1B for every closed interval B.

Therefore, we conclude that H1 is a VC class with VC index V (H1 ) = 3. Similarly, we can show
that H2 is a VC class with VC index V (H2 ) = 2.
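
The combinatorial core of the argument for H1 is that closed intervals on the line pick out every subset of two points but never the two outer points of three without the middle one. The brute-force check below is a sketch of that fact only: it works with the interval class on the y-coordinate (corresponding to the case w = 0 and 0 ≤ t < 1 in the proof) rather than with the full subgraphs in R⁴, and the function names are hypothetical.

```python
def picked_out(points, interval):
    """Subset of `points` lying in the closed interval [lo, hi]."""
    lo, hi = interval
    return frozenset(p for p in points if lo <= p <= hi)


def shatters(points):
    """True if closed intervals pick out every subset of `points`."""
    pts = sorted(points)
    # Candidate endpoints: the points themselves, midpoints between neighbours,
    # and one value on each side. This grid realizes every subset that some
    # closed interval can pick out, including the empty set.
    midpoints = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    grid = [pts[0] - 1] + pts + midpoints + [pts[-1] + 1]
    intervals = [(lo, hi) for lo in grid for hi in grid if lo <= hi]
    realized = {picked_out(pts, iv) for iv in intervals}
    return len(realized) == 2 ** len(pts)


print(shatters([0.0, 1.0]))        # two points
print(shatters([0.0, 1.0, 2.0]))   # three points
```

Two points are shattered and three are not, which matches the convention under which the VC index is the smallest number of points that cannot be shattered.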

Lemma A.3 Let H be defined as in (9). Then H is totally bounded under k·kLr (Q) for every probability
measure Q ∈ P and every r ≥ 1.

Proof of Lemma A.3. Let $N(\varepsilon, \mathcal{H}_j, L_r(Q))$ denote the covering number under the $L_r(Q)$ norm for
$\mathcal{H}_j$ for j ∈ {1, 2} and all ε > 0, where $\mathcal{H}_j$ is defined as in (9). Since $\mathcal{H}_1$ and $\mathcal{H}_2$ are VC classes by
Lemma A.2 with $V(\mathcal{H}_1) = 3$ and $V(\mathcal{H}_2) = 2$, by Theorem 2.6.7 of van der Vaart and Wellner (1996)
with envelope function F = 1 and r ≥ 1 we have that for every probability measure Q,

$$N(\varepsilon, \mathcal{H}_1, L_r(Q)) \le K_1 \cdot 3\, (16e)^3 (1/\varepsilon)^{2r} \quad \text{and} \quad N(\varepsilon, \mathcal{H}_2, L_r(Q)) \le K_2 \cdot 2\, (16e)^2 (1/\varepsilon)^{r}$$

for universal constants $K_1, K_2 \ge 1$ and every ε ∈ (0, 1). Since $\mathcal{H} = \mathcal{H}_1 \cup \mathcal{H}_2$, we have

$$N(\varepsilon, \mathcal{H}, L_r(Q)) \le K_1 \cdot 3\, (16e)^3 (1/\varepsilon)^{2r} + K_2 \cdot 2\, (16e)^2 (1/\varepsilon)^{r}, \tag{A.5}$$

which implies that $\mathcal{H}$ is totally bounded.

Lemma A.4 Let H̄ be as in (9). Then H̄ is compact under k·kL2 (Q) for every Q ∈ P.

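Proof of Lemma A.4. By Lemma A.3, $\mathcal{H}$ is totally bounded under $\|\cdot\|_{L_2(Q)}$ for all Q ∈ P. Suppose
that $\mathcal{H} \subset \bigcup_{j \in J} B_{\varepsilon/2}(h_j)$, where J is a finite index set and $B_{\varepsilon/2}(h_j)$ is an open ball with center $h_j$
and radius ε/2 under $\|\cdot\|_{L_2(Q)}$. By Lemma A.1, $\bar{\mathcal{H}}$ is equal to the closure of $\mathcal{H}$ in $L_2(Q)$. Clearly,
$\bar{\mathcal{H}} \subset \overline{\bigcup_{j \in J} B_{\varepsilon/2}(h_j)} \subset \bigcup_{j \in J} B_{\varepsilon}(h_j)$, and therefore

$$N(\varepsilon, \bar{\mathcal{H}}, L_2(Q)) \le N(\varepsilon/2, \mathcal{H}, L_2(Q)), \tag{A.6}$$

which, together with (A.5), implies that $\bar{\mathcal{H}}$ is totally bounded. Since $\bar{\mathcal{H}}$ is closed and $L_2(Q)$ is complete,
$\bar{\mathcal{H}}$ is compact in $L_2(Q)$.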

Let H̄ and GK be defined as in (9). Let $\mathcal{V} = \{h \cdot f : h \in \bar{\mathcal{H}},\ f \in \mathcal{G}_K\}$. Then define

Ṽ = V ∪ GK . (A.7)

Lemma A.5 The function space Ṽ is Donsker and pre-Gaussian uniformly in Q ∈ P.

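Proof of Lemma A.5. For every δ > 0 and every Q ∈ P, define

$$\tilde{\mathcal{V}}_{\delta,Q} = \left\{v - v' : v, v' \in \tilde{\mathcal{V}},\ \|v - v'\|_{L_2(Q)} < \delta\right\} \quad \text{and} \quad
\tilde{\mathcal{V}}_{\infty}^{2} = \left\{(v - v')^2 : v, v' \in \tilde{\mathcal{V}}\right\}.$$

First, we show that $\tilde{\mathcal{V}}_{\delta,Q}$ is Q-measurable6 for all Q ∈ P. Similarly to the construction of $\mathcal{H}$, we
construct function spaces by

$$\mathcal{H}_{q1} = \left\{(-1)^d \cdot 1_{B \times \{d\} \times \mathbb{R}} : B = [a, b],\ a, b \in \mathbb{Q},\ a \le b,\ d \in \{0, 1\}\right\},$$
$$\mathcal{H}_{q2} = \left\{1_{\mathbb{R} \times C \times \mathbb{R}} : C = (-\infty, c],\ c \in \mathbb{Q}\right\}, \quad \text{and} \quad \mathcal{H}_q = \mathcal{H}_{q1} \cup \mathcal{H}_{q2},$$

where $\mathbb{Q}$ denotes the set of all rational numbers. Now define

$$\tilde{\mathcal{V}}_q = \{h \cdot f : h \in \mathcal{H}_q,\ f \in \mathcal{G}_K\} \cup \mathcal{G}_K \quad \text{and} \quad
\tilde{\mathcal{V}}_{q\delta,Q} = \left\{v - v' : v, v' \in \tilde{\mathcal{V}}_q,\ \|v - v'\|_{L_2(Q)} < \delta\right\}.$$

6 See Definition 2.3.3 of Q-measurable class in van der Vaart and Wellner (1996).

By construction, $\mathcal{G}_K$ is a finite set. Since $\mathbb{Q}$ is countable (and therefore the set of ordered pairs of
elements of $\mathbb{Q}$ is countable), $\mathcal{H}_{q1}$ and $\mathcal{H}_{q2}$ are countable (and therefore $\mathcal{H}_q$ and $\tilde{\mathcal{V}}_q$ are countable).
Clearly, $\tilde{\mathcal{V}}_{q\delta,Q}$ is a countable subset of $\tilde{\mathcal{V}}_{\delta,Q}$. For every $v \in \tilde{\mathcal{V}}$, there is a sequence
$\{v_m\} \subset \tilde{\mathcal{V}}_q$ such that $v_m \to v$ pointwise, because $\mathbb{Q}$ is dense in $\mathbb{R}$. For example, if
$v = (-1)^d \cdot 1_{(\sqrt{2}, \sqrt{3}] \times \{d\} \times \mathbb{R}} \cdot 1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}}$, we can find
$v_m = (-1)^d \cdot 1_{[a_m, b_m] \times \{d\} \times \mathbb{R}} \cdot 1_{\mathbb{R} \times \mathbb{R} \times \{z_k\}}$ with $a_m \downarrow \sqrt{2}$,
$b_m \downarrow \sqrt{3}$, and $a_m, b_m \in \mathbb{Q}$. Suppose $v - v' \in \tilde{\mathcal{V}}_{\delta,Q}$ and $v_m, v'_m \in \tilde{\mathcal{V}}_q$ such that
$v_m \to v$ and $v'_m \to v'$ pointwise. It is easy to show that $\|v_m - v'_m\|_{L_2(Q)} < \delta$ for large m, that is,
$v_m - v'_m \in \tilde{\mathcal{V}}_{q\delta,Q}$ for large m. By Example 2.3.4 of van der Vaart and Wellner (1996),
$\tilde{\mathcal{V}}_{\delta,Q}$ is Q-measurable, and this is true for all δ > 0. Similarly, $\tilde{\mathcal{V}}_{\infty}^{2}$ is Q-measurable.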
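By the construction of $\tilde{\mathcal{V}}$, F = 1 is a measurable envelope function with $\int F^2\, dQ < \infty$. Also,
$\lim_{M \to \infty} \sup_{Q \in \mathcal{P}} \int F^2 \cdot 1\{F > M\}\, dQ = 0$. For all H ∈ P and all ε ≥ 2,

$$N\left(\varepsilon \|F\|_{L_2(H)}, \tilde{\mathcal{V}}, L_2(H)\right) = N\left(\varepsilon, \tilde{\mathcal{V}}, L_2(H)\right) = 1. \tag{A.8}$$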

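For all H ∈ P and all ε > 0,

$$N\left(\varepsilon, \mathcal{V}, L_2(H)\right) \le N\left(\frac{\varepsilon}{2}, \bar{\mathcal{H}}, L_2(H)\right) \cdot N\left(\frac{\varepsilon}{2}, \mathcal{G}_K, L_2(H)\right)
\le K \cdot N\left(\frac{\varepsilon}{2}, \bar{\mathcal{H}}, L_2(H)\right), \tag{A.9}$$

where K is the number of elements in $\mathcal{G}_K$. Thus by the definition of $\tilde{\mathcal{V}}$ in (A.7),

$$N\left(\varepsilon, \tilde{\mathcal{V}}, L_2(H)\right) \le K \cdot N\left(\frac{\varepsilon}{2}, \bar{\mathcal{H}}, L_2(H)\right) + K \tag{A.10}$$

for all H ∈ P and all ε > 0. Let $\mathcal{Q}$ denote the set of finitely discrete probability measures. The
results in (A.5), (A.6), (A.8), and (A.10) imply that

$$\int_0^{\infty} \sup_{H \in \mathcal{Q}} \sqrt{\log N\left(\varepsilon \|F\|_{L_2(H)}, \tilde{\mathcal{V}}, L_2(H)\right)}\, d\varepsilon
= \int_0^{2} \sup_{H \in \mathcal{Q}} \sqrt{\log N\left(\varepsilon, \tilde{\mathcal{V}}, L_2(H)\right)}\, d\varepsilon$$
$$\le \int_0^{2} \sqrt{\log\left\{K \cdot (K_1 + K_2) \cdot 3 \cdot (16e)^3 (4/\varepsilon)^4 + K\right\}}\, d\varepsilon < \infty.$$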

The claim of the Lemma follows from Theorem 2.8.3 of van der Vaart and Wellner (1996).
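
The finiteness of the last integral in the proof rests on the fact that the integrand grows only like the square root of log(1/ε) near zero. A rough numerical sketch of this, using a midpoint rule, is given below. The values K = 3 and K1 = K2 = 1 are hypothetical placeholders, since the universal constants are not specified in the text; the point is only that refining the grid near zero leaves the approximation essentially unchanged.

```python
import math


def integrand(eps, k=3, k1=1.0, k2=1.0):
    """sqrt(log{K (K1 + K2) 3 (16e)^3 (4/eps)^4 + K}) with placeholder constants."""
    bound = k * (k1 + k2) * 3 * (16 * math.e) ** 3 * (4 / eps) ** 4 + k
    return math.sqrt(math.log(bound))


def midpoint_integral(num_cells, upper=2.0):
    """Midpoint-rule approximation of the integral over (0, upper]."""
    width = upper / num_cells
    return sum(integrand((i + 0.5) * width) for i in range(num_cells)) * width


coarse = midpoint_integral(1_000)
fine = midpoint_integral(100_000)
print(f"coarse grid: {coarse:.4f}, fine grid: {fine:.4f}")
```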

Lemma A.6 The function space Ṽ defined in (A.7) is Glivenko–Cantelli uniformly in Q ∈ P.

Proof of Lemma A.6. Similarly to the proof of Lemma A.5, we can show that $\tilde{\mathcal{V}}$ is Q-measurable
for every Q ∈ P. With F = 1 being an envelope function of $\tilde{\mathcal{V}}$, we have
$\lim_{M \to \infty} \sup_{Q \in \mathcal{P}} \int F \cdot 1\{F > M\}\, dQ = 0$. Similarly to the proofs of Lemmas A.1, A.4, and A.5, we can show that for every
Q ∈ P and every ε > 0, the closure of $\mathcal{H}$ in $L_1(Q)$ is equal to $\bar{\mathcal{H}}$,
$N(\varepsilon, \bar{\mathcal{H}}, L_1(Q)) \le N(\varepsilon/2, \mathcal{H}, L_1(Q))$,
and $N(\varepsilon, \tilde{\mathcal{V}}, L_1(Q)) \le K \cdot N(\varepsilon/2, \bar{\mathcal{H}}, L_1(Q)) + K$. Then by (A.5), with the envelope function F = 1
we can show that $\sup_{H \in \mathcal{Q}_n} \log N(\varepsilon \|F\|_{L_1(H)}, \tilde{\mathcal{V}}, L_1(H)) = o(n)$, where $\mathcal{Q}_n$ is the collection of all
possible realizations of empirical measures of n observations. Then by Theorem 2.8.1 in van der
Vaart and Wellner (1996), $\tilde{\mathcal{V}}$ is Glivenko–Cantelli uniformly in Q ∈ P.

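Lemma A.7 Let $\mathcal{H}$ and $\mathcal{G}$ be defined as in (9), let $\rho_P$ be as in (16), and define $\overline{\mathcal{H} \times \mathcal{G}}$ as the closure of
$\mathcal{H} \times \mathcal{G}$ in $L_2(P) \times (L_2(P) \times L_2(P))$ under $\rho_P$. Then
$N(\varepsilon, \overline{\mathcal{H} \times \mathcal{G}}, \rho_P) = O(1/\varepsilon^4)$ as ε → 0.

Proof of Lemma A.7. By the constructions of $\mathcal{H} \times \mathcal{G}$ and the metric $\rho_P$,

$$N\left(\varepsilon, \mathcal{H} \times \mathcal{G}, \rho_P\right) \le N\left(\frac{\varepsilon}{3}, \mathcal{H}, L_2(P)\right) \cdot \left[N\left(\frac{\varepsilon}{3}, \mathcal{G}_K, L_2(P)\right)\right]^2,$$

where $\mathcal{G}_K$ is defined as in (9). By the construction of $\mathcal{G}_K$, $N(\varepsilon/3, \mathcal{G}_K, L_2(P)) \le K$, where K
is the number of elements in $\mathcal{G}_K$. This, together with (A.5), implies that
$N(\varepsilon, \mathcal{H} \times \mathcal{G}, \rho_P) = O(1/\varepsilon^4)$ as ε → 0. Similarly to (A.6),

$$N\left(\varepsilon, \overline{\mathcal{H} \times \mathcal{G}}, \rho_P\right) \le N\left(\frac{\varepsilon}{2}, \mathcal{H} \times \mathcal{G}, \rho_P\right) = O\left(\frac{1}{\varepsilon^4}\right) \quad \text{as } \varepsilon \to 0.$$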

Lemma A.8 Let H and G be defined as in (9), and let ρ_P be as in (16). Then \overline{H × G}, the closure of
H × G under ρ_P in Lemma A.7, is compact and \overline{H × G} = H̄ × G, where H̄ is defined as in (9).

Proof of Lemma A.8. The first claim follows from Lemma A.7 and the fact that L2 (P ) × (L2 (P ) ×
L2 (P )) is complete under ρP . The second claim holds by the constructions of ρP and G.

B Main Results
Proof of Lemma 2.1. Suppose Assumption 2.2 holds. Then we can define Y_d by Y_d = Y_{d z_1} =
Y_{d z_2} = ··· = Y_{d z_K} almost surely for all d ∈ D. First, suppose d_max exists. Under Assumption 2.2, for
all k with 1 ≤ k ≤ K − 1 and all Borel sets B,

\[
\begin{aligned}
P(Y \in B, D = d_{\max} \mid Z = z_k) &= P(Y_{d_{\max}} \in B, D_{z_k} = d_{\max}) \\
&= \sum_j P(Y_{d_{\max}} \in B, D_{z_k} = d_{\max}, D_{z_{k+1}} = d_j) \\
&= P(Y_{d_{\max}} \in B, D_{z_k} = d_{\max}, D_{z_{k+1}} = d_{\max})
\end{aligned}
\]

and

\[
\begin{aligned}
P(Y \in B, D = d_{\max} \mid Z = z_{k+1}) &= P(Y_{d_{\max}} \in B, D_{z_{k+1}} = d_{\max}) \\
&= \sum_j P(Y_{d_{\max}} \in B, D_{z_k} = d_j, D_{z_{k+1}} = d_{\max}).
\end{aligned}
\]

Thus P(Y ∈ B, D = d_max | Z = z_{k+1}) ≥ P(Y ∈ B, D = d_max | Z = z_k). Second, suppose d_min exists.
Then similarly, P(Y ∈ B, D = d_min | Z = z_k) ≥ P(Y ∈ B, D = d_min | Z = z_{k+1}).

Remark B.1 Lemmas 2.2, 2.3, and 2.4 can be proved analogously.
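
The inequalities in the proof of Lemma 2.1 can be checked by exact enumeration in a toy model. The sketch below assumes a binary treatment (d_min = 0, d_max = 1), a binary instrument, and three latent types with made-up probabilities; monotonicity is imposed by excluding the defier type. All numbers and names are hypothetical and serve only to illustrate the direction of the two inequalities.

```python
from fractions import Fraction as F

# Hypothetical latent types (D_{z1}, D_{z2}); monotonicity excludes (1, 0).
type_probs = {(1, 1): F(1, 5), (0, 1): F(1, 2), (0, 0): F(3, 10)}

# Hypothetical P(Y_d in B | type) for one fixed Borel set B. Exclusion means
# these probabilities do not depend on the instrument value.
p_y1_in_B = {(1, 1): F(1, 2), (0, 1): F(1, 4), (0, 0): F(3, 4)}
p_y0_in_B = {(1, 1): F(1, 3), (0, 1): F(2, 3), (0, 0): F(1, 6)}

def joint_prob(d, z_index):
    """Exact P(Y in B, D = d | Z = z) obtained by summing over latent types."""
    p_y = p_y1_in_B if d == 1 else p_y0_in_B
    return sum(prob * p_y[t] for t, prob in type_probs.items() if t[z_index] == d)

treated_z1, treated_z2 = joint_prob(1, 0), joint_prob(1, 1)
untreated_z1, untreated_z2 = joint_prob(0, 0), joint_prob(0, 1)
print(treated_z1, treated_z2, untreated_z1, untreated_z2)
```

Moving from z1 to z2 adds the complier mass to the treated cell and removes it from the untreated cell, which is exactly the pair of inequalities derived above.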

Lemma B.1 Let D_L = {R ∈ ℓ∞(Ṽ) : R(h · g_l)/R(g_l) exists for all h ∈ H̄ and all g_l ∈ G_K}. Define
L : D_L ⊂ ℓ∞(Ṽ) → ℓ∞(H̄ × G) by

\[
L(R)(h, g) = \frac{R(h \cdot g_2)}{R(g_2)} - \frac{R(h \cdot g_1)}{R(g_1)}
\]

for all R ∈ D_L and all (h, g) ∈ H̄ × G with g = (g_1, g_2). Then L is uniformly Hadamard differentiable⁷
along every sequence P_n → P in D_L, tangentially to ℓ∞(Ṽ), with the derivative L′_P defined by

\[
L'_P(H)(h, g) = \frac{H(h \cdot g_2) P(g_2) - P(h \cdot g_2) H(g_2)}{P^2(g_2)} - \frac{H(h \cdot g_1) P(g_1) - P(h \cdot g_1) H(g_1)}{P^2(g_1)} \tag{B.11}
\]

for all H ∈ ℓ∞(Ṽ).⁸

Remark B.2 By the definition of L, L(Q) = φ_Q for all Q ∈ P. We will apply Lemma B.1 along with
the suitable delta method to deduce the asymptotic distributions of √n(φ̂_{P_n} − φ_P) and the bootstrap
version of this random element.

Proof of Lemma B.1. For all t_n → 0, P_n → P, and H_n → H in ℓ∞(Ṽ) such that P_n ∈ D_L and
P_n + t_n H_n ∈ D_L, we have that for each (h, g) ∈ H̄ × G with g = (g_1, g_2),

\[
\begin{aligned}
& L(P_n + t_n H_n)(h, g) - L(P_n)(h, g) \\
&= \left\{ \frac{(P_n + t_n H_n)(h \cdot g_2)}{(P_n + t_n H_n)(g_2)} - \frac{(P_n + t_n H_n)(h \cdot g_1)}{(P_n + t_n H_n)(g_1)} \right\} - \left\{ \frac{P_n(h \cdot g_2)}{P_n(g_2)} - \frac{P_n(h \cdot g_1)}{P_n(g_1)} \right\} \\
&= \frac{t_n H_n(h \cdot g_2) P_n(g_2) - t_n P_n(h \cdot g_2) H_n(g_2)}{(P_n + t_n H_n)(g_2)\, P_n(g_2)} - \frac{t_n H_n(h \cdot g_1) P_n(g_1) - t_n P_n(h \cdot g_1) H_n(g_1)}{(P_n + t_n H_n)(g_1)\, P_n(g_1)}.
\end{aligned}
\]

Thus it is easy to show that

\[
\lim_{n \to \infty} \sup_{(h,g) \in \bar{H} \times G} \left| \frac{L(P_n + t_n H_n)(h, g) - L(P_n)(h, g)}{t_n} - L'_P(H)(h, g) \right| = 0,
\]

where L′_P is defined as in (B.11). This implies that L is uniformly differentiable and verifies the
derivative in (B.11).
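
A quick finite-difference check can make the derivative formula (B.11) concrete. The sketch below fixes a single pair (h, g), so that only the four numbers R(h·g_1), R(g_1), R(h·g_2), R(g_2) matter; the numerical values of P and of the direction H are arbitrary, and the function names are hypothetical. It illustrates only the ordinary directional derivative at a fixed P, not the uniform statement along sequences P_n → P.

```python
def ratio_map(R):
    """L(R)(h, g) for one fixed (h, g); R holds the four values that L needs."""
    return R["h_g2"] / R["g2"] - R["h_g1"] / R["g1"]

def derivative(P, H):
    """L'_P(H)(h, g) written out as in (B.11)."""
    return ((H["h_g2"] * P["g2"] - P["h_g2"] * H["g2"]) / P["g2"] ** 2
            - (H["h_g1"] * P["g1"] - P["h_g1"] * H["g1"]) / P["g1"] ** 2)

# Arbitrary illustrative values (not taken from the paper).
P = {"h_g2": 0.20, "g2": 0.60, "h_g1": 0.10, "g1": 0.40}
H = {"h_g2": 0.30, "g2": -0.50, "h_g1": -0.20, "g1": 0.50}

def difference_quotient(t):
    """[L(P + tH) - L(P)] / t for a scalar step t."""
    shifted = {key: P[key] + t * H[key] for key in P}
    return (ratio_map(shifted) - ratio_map(P)) / t

errors = [abs(difference_quotient(t) - derivative(P, H)) for t in (1e-2, 1e-4, 1e-6)]
print(errors)
```

The error shrinks roughly in proportion to t, as expected for a first-order approximation.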

Lemma B.2 Under Assumptions 3.1 and 3.2 with P_n, P ∈ ℓ∞(Ṽ), we have that sup_{v∈Ṽ} |√n(P_n −
P)(v) − Q_0(v)| → 0, where Q_0(v) = P(v v_0) for all v ∈ Ṽ and v_0 is as in Assumption 3.2, and that
√n(P̂_n − P) converges under P_n in distribution to the process G_P + Q_0 for a tight P-Brownian bridge
G_P with E[G_P(v_1) G_P(v_2)] = P(v_1 v_2) − P(v_1) P(v_2) for all v_1, v_2 ∈ Ṽ.

Proof of Lemma B.2. The Lemma holds by Assumptions 3.1 and 3.2, the facts that supv∈Ṽ |P (v)| ≤
1 and supv∈Ṽ |Pn (v 2 )| ≤ 1 for all n, Lemma A.5 in this paper, and Theorem 3.10.12 of van der Vaart
and Wellner (1996).
7 See the definitions of Hadamard differentiability and uniform Hadamard differentiability in van der Vaart and Wellner
(1996, pp. 372–375).
8 By (11), L′_P is well defined.

Lemma B.3 Under Assumptions 3.1 and 3.2 with Pn , P ∈ ℓ∞ (Ṽ), we have that Pn → P and that
P̂n → P , φ̂Pn → φP , Tn /n → Λ(P ), and σ̂Pn → σP almost uniformly.

Proof of Lemma B.3. By Lemma B.2 in this paper, Hölder’s inequality, and Lemma 3.10.11 of
van der Vaart and Wellner (1996), we have that

\[
\begin{aligned}
\|P_n - P\|_\infty &\le \|P_n - P - n^{-1/2} Q_0\|_\infty + \|n^{-1/2} Q_0\|_\infty \\
&\le \|P_n - P - n^{-1/2} Q_0\|_\infty + n^{-1/2} \sup_{v \in \tilde{V}} |P(v^2) P(v_0^2)|^{1/2} \to 0,
\end{aligned}
\]

where Q_0 is the function defined in Lemma B.2. By Lemma A.6 in this paper and Lemma 1.9.3
of van der Vaart and Wellner (1996), ‖P̂_n − P_n‖∞ → 0 almost uniformly. Then we have that
‖P̂_n − P‖∞ → 0 almost uniformly. The rest of the results follow from the constructions of φ̂_{P_n},
T_n/n, and σ̂_{P_n}. By the construction of H̄, the σ²_Q(h, g) in (17) can also be written as

\[
\sigma_Q^2(h, g) = \Lambda(Q) \cdot \left\{ \frac{|Q(h \cdot g_2)|}{Q^2(g_2)} - \frac{Q^2(h \cdot g_2)}{Q^3(g_2)} + \frac{|Q(h \cdot g_1)|}{Q^2(g_1)} - \frac{Q^2(h \cdot g_1)}{Q^3(g_1)} \right\}. \tag{B.12}
\]

Similarly to (B.12), we can write the σ̂²_{P_n}(h, g) in (19) as

\[
\hat{\sigma}_{P_n}^2(h, g) = \frac{T_n}{n} \cdot \left\{ \frac{|\hat{P}_n(h \cdot g_2)|}{\hat{P}_n^2(g_2)} - \frac{\hat{P}_n^2(h \cdot g_2)}{\hat{P}_n^3(g_2)} + \frac{|\hat{P}_n(h \cdot g_1)|}{\hat{P}_n^2(g_1)} - \frac{\hat{P}_n^2(h \cdot g_1)}{\hat{P}_n^3(g_1)} \right\}. \tag{B.13}
\]

Then the almost uniform convergence of P̂_n to P in ℓ∞(Ṽ) implies the almost uniform convergence
of the σ̂²_{P_n} in (B.13) to the σ²_P as in (B.12).
Proof of Lemma 3.1. By the Hadamard derivative of L in (B.11), together with Lemma B.2 in this
paper and Theorem 3.9.4 (delta method) of van der Vaart and Wellner (1996), we have that under
P_n,

\[
\sqrt{n}(\hat{\phi}_{P_n} - \phi_P) = \sqrt{n}\{L(\hat{P}_n) - L(P)\} \rightsquigarrow L'_P(G_P + Q_0). \tag{B.14}
\]

By Lemma B.3, T_n/n → Λ(P) almost uniformly. Thus by Lemmas 1.9.3(ii) and 1.10.2(iii), Example
1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner
(1996),

\[
\sqrt{T_n}(\hat{\phi}_{P_n} - \phi_P) = \sqrt{T_n/n} \cdot \sqrt{n}(\hat{\phi}_{P_n} - \phi_P) \rightsquigarrow \Lambda(P)^{1/2} L'_P(G_P + Q_0). \tag{B.15}
\]

Let G = Λ(P)^{1/2} L′_P(G_P + Q_0). Then G is tight, because G_P is tight and L′_P is a continuous map.
Thus (B.15) verifies the first claim of Lemma 3.1. Now we show the continuity of G under ρ_P.
Define a semimetric on Ṽ by

\[
\rho_2(v, v') = \big( E |G_P(v) - G_P(v')|^2 \big)^{1/2}
\]

for all v, v′ ∈ Ṽ. This semimetric is the one defined in van der Vaart and Wellner (1996, p. 39)
with p = 2. Since GP is tight, it follows from the discussion in Example 1.5.10 of van der Vaart and
Wellner (1996) that GP almost surely has a uniformly ρ2 -continuous path. Since GP is a P -Brownian
bridge,

\[
\rho_2^2(v, v') = P((v - v')^2) - P^2(v - v') \le \|v - v'\|_{L^2(P)}^2 \tag{B.16}
\]

for all v, v′ ∈ Ṽ. Therefore, G_P almost surely has a uniformly continuous path under ‖·‖_{L2(P)}. By
Lemma 3.10.11 of van der Vaart and Wellner (1996), P(v_0) = 0 and P(v_0²) < ∞, where v_0 is as in
Assumption 3.2. Hölder’s inequality implies that for every v ∈ L2(P), ‖v · 1‖_{L1(P)} ≤ 1 · ‖v‖_{L2(P)}. By
Hölder’s inequality, P and Q_0 are both continuous on Ṽ under ‖·‖_{L2(P)}, where Q_0 is as in Lemma
B.2. Suppose that there are (h, g), (h′, g′) ∈ H̄ × G with g = (g_1, g_2) and g′ = (g_1′, g_2′). Then for
j ∈ {1, 2} we have

\[
\begin{aligned}
\|g_j - g_j'\|_{L^2(P)} &\le \rho_P((h, g), (h', g')) \quad \text{and} \\
\|h \cdot g_j - h' \cdot g_j'\|_{L^2(P)} &\le \|h - h'\|_{L^2(P)} + \|g_j - g_j'\|_{L^2(P)} \le \rho_P((h, g), (h', g')).
\end{aligned} \tag{B.17}
\]

By (B.11) and (B.17), together with the continuity of G_P, P, and Q_0 under ‖·‖_{L2(P)}, we conclude
that G almost surely has a continuous path under ρ_P.
Next, we show the variance of G(h, g) for each (h, g) ∈ H̄ × G with g = (g_1, g_2). Since L′_P(H) is
linear in H, Var(G(h, g)) = Λ(P) · Var(L′_P(G_P)(h, g)). First, we have that

\[
\mathrm{Var}(L'_P(G_P)(h, g)) = E\left[ \left( \frac{G_P(h \cdot g_2) P(g_2) - P(h \cdot g_2) G_P(g_2)}{P^2(g_2)} - \frac{G_P(h \cdot g_1) P(g_1) - P(h \cdot g_1) G_P(g_1)}{P^2(g_1)} \right)^2 \right]. \tag{B.18}
\]

Since G_P is a Brownian bridge with E[G_P(v_1) G_P(v_2)] = P(v_1 v_2) − P(v_1) P(v_2) for all v_1, v_2 ∈ Ṽ, we
have

\[
\begin{aligned}
& E\left[ \left( \frac{G_P(h \cdot g_2) P(g_2) - P(h \cdot g_2) G_P(g_2)}{P^2(g_2)} \right)^2 \right] \\
&= \frac{P(h^2 \cdot g_2) - P^2(h \cdot g_2)}{P^2(g_2)} + \frac{P^2(h \cdot g_2)}{P^3(g_2)} - \frac{P^2(h \cdot g_2)}{P^2(g_2)} - \frac{2 P^2(h \cdot g_2)}{P^3(g_2)} + \frac{2 P^2(h \cdot g_2)}{P^2(g_2)} \\
&= \frac{P(h^2 \cdot g_2)}{P^2(g_2)} - \frac{P^2(h \cdot g_2)}{P^3(g_2)}.
\end{aligned} \tag{B.19}
\]

Similarly,

\[
E\left[ \left( \frac{G_P(h \cdot g_1) P(g_1) - P(h \cdot g_1) G_P(g_1)}{P^2(g_1)} \right)^2 \right] = \frac{P(h^2 \cdot g_1)}{P^2(g_1)} - \frac{P^2(h \cdot g_1)}{P^3(g_1)}. \tag{B.20}
\]

Also, we have that

\[
\begin{aligned}
& E\big[ (G_P(h \cdot g_2) P(g_2) - P(h \cdot g_2) G_P(g_2)) (G_P(h \cdot g_1) P(g_1) - P(h \cdot g_1) G_P(g_1)) \big] \\
&= P(g_2) P(g_1) P(h^2 g_2 g_1) - P(g_2) P(h g_1) P(h g_2 g_1) - P(h g_2) P(g_1) P(h g_2 g_1) \\
&\quad + P(h g_2) P(h g_1) P(g_2 g_1) = 0,
\end{aligned} \tag{B.21}
\]

where we use the fact that g_1 g_2 = 0 by the construction of G. By (B.21), the expectation on the
right-hand side of (B.18) is equal to the sum of the expectations in (B.19) and (B.20). Thus we now
have that

\[
\mathrm{Var}(L'_P(G_P)(h, g)) = \frac{P(h^2 \cdot g_2)}{P^2(g_2)} - \frac{P^2(h \cdot g_2)}{P^3(g_2)} + \frac{P(h^2 \cdot g_1)}{P^2(g_1)} - \frac{P^2(h \cdot g_1)}{P^3(g_1)},
\]

which, together with Var(G(h, g)) = Λ(P) · Var(L′_P(G_P)(h, g)), verifies that Var(G(h, g)) =
σ²_P(h, g) for the σ²_P in (17). For every (h, g) ∈ H̄ × G with g = (g_1, g_2),

\[
\begin{aligned}
\sigma_P^2(h, g) &= \Lambda(P) \left\{ \frac{P(h^2 \cdot g_2)}{P^2(g_2)} - \frac{P^2(h \cdot g_2)}{P^3(g_2)} + \frac{P(h^2 \cdot g_1)}{P^2(g_1)} - \frac{P^2(h \cdot g_1)}{P^3(g_1)} \right\} \\
&= \frac{\Lambda(P)}{P(g_2)} \frac{|P(h \cdot g_2)|}{P(g_2)} \left( 1 - \frac{|P(h \cdot g_2)|}{P(g_2)} \right) + \frac{\Lambda(P)}{P(g_1)} \frac{|P(h \cdot g_1)|}{P(g_1)} \left( 1 - \frac{|P(h \cdot g_1)|}{P(g_1)} \right).
\end{aligned}
\]

Since 0 ≤ |P(h g_j)|/P(g_j) ≤ 1 for j ∈ {1, 2}, σ²_P(h, g) ≤ 1/4 · {Λ(P)/P(g_2) + Λ(P)/P(g_1)}.
Recall that K is the number of elements in Z. We have that for each j ∈ {1, 2},

\[
\frac{\Lambda(P)}{P(g_j)} \le \max_{1 \le k' \le K} \frac{\prod_{k=1}^{K} P(1_{R \times R \times \{z_k\}})}{P(1_{R \times R \times \{z_{k'}\}})} \le \left( \frac{1}{K - 1} \right)^{K-1},
\]

which implies that

\[
\sigma_P^2(h, g) \le \frac{1}{4} \cdot \max_{(g_1', g_2') \in G} \left\{ \frac{\Lambda(P)}{P(g_2')} + \frac{\Lambda(P)}{P(g_1')} \right\} \le \frac{1}{2} \cdot (K - 1)^{-(K-1)}.
\]

When K = 2, σ²_P(h, g) ≤ 1/4 by the construction of Λ(P).
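
The variance calculation in (B.18) to (B.21) can be verified exactly on a small discrete example. The sketch below uses a hypothetical four-atom sample space with h taking values in {0, 1} and g_1, g_2 indicators of disjoint events; it computes the variance of the linear combination of Brownian-bridge coordinates directly from the covariance kernel P(uv) − P(u)P(v) and compares it with the closed form. The atoms and probabilities are made up for illustration, and the factor Λ(P) is left out.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical atoms: (probability, h, g1, g2) with g1 * g2 = 0.
atoms = [
    (F(1, 10), 1, 1, 0), (F(3, 10), 0, 1, 0),
    (F(2, 10), 1, 0, 1), (F(4, 10), 0, 0, 1),
]

def P(f):
    """Expectation of f(h, g1, g2) under the discrete measure."""
    return sum(p * f(h, g1, g2) for p, h, g1, g2 in atoms)

hg1 = lambda h, g1, g2: h * g1
hg2 = lambda h, g1, g2: h * g2
ind1 = lambda h, g1, g2: g1
ind2 = lambda h, g1, g2: g2

def bridge_cov(u, v):
    """Covariance kernel of the P-Brownian bridge: P(uv) - P(u)P(v)."""
    return P(lambda *a: u(*a) * v(*a)) - P(u) * P(v)

# L'_P(G_P)(h, g) is a linear combination of G_P evaluated at four functions.
terms = [
    (1 / P(ind2), hg2), (-P(hg2) / P(ind2) ** 2, ind2),
    (-1 / P(ind1), hg1), (P(hg1) / P(ind1) ** 2, ind1),
]
variance = sum(c1 * c2 * bridge_cov(u, v)
               for (c1, u), (c2, v) in product(terms, repeat=2))

closed_form = (P(lambda h, g1, g2: h * h * g2) / P(ind2) ** 2 - P(hg2) ** 2 / P(ind2) ** 3
               + P(lambda h, g1, g2: h * h * g1) / P(ind1) ** 2 - P(hg1) ** 2 / P(ind1) ** 3)
print(variance, closed_form)
```

Because the arithmetic uses exact fractions, agreement here is an identity rather than a numerical approximation.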

Lemma B.4 Under ρP , φP and σP are continuous on H̄ × G.

Proof of Lemma B.4. Suppose there are (h, g), (h^k, g^k) ∈ H̄ × G with g = (g_1, g_2), g^k = (g_1^k, g_2^k), and
(h^k, g^k) → (h, g) under ρ_P. Since G_K is finite, (h^k, g^k) → (h, g) under ρ_P implies that P(g_j^k) = P(g_j)
for each j ∈ {1, 2} when k is sufficiently large. If P(g_j) = 0,⁹ then by (11) P(h · g_j)/P(g_j) = 0,
P(h^k · g_j^k)/P(g_j^k) = 0 when k is large, and

\[
\left| \frac{P(h \cdot g_j)}{P(g_j)} - \frac{P(h^k \cdot g_j^k)}{P(g_j^k)} \right| = 0.
\]
9 If P(g_j) = 0 for some g_j ∈ G_K, then Λ(P) = 0, which is a trivial case. We consider this case only for the sake of
completeness.

If P(g_j) ≠ 0, then for each j ∈ {1, 2} and large k, P(g_j^k) = P(g_j) ≠ 0 and

\[
\left| \frac{P(h \cdot g_j)}{P(g_j)} - \frac{P(h^k \cdot g_j^k)}{P(g_j^k)} \right| \le \frac{\|h \cdot g_j - h^k \cdot g_j^k\|_{L^2(P)}}{P(g_j)} \le \frac{\rho_P((h, g), (h^k, g^k))}{P(g_j)}
\]

by Hölder’s inequality and (B.17). Thus we can conclude that

\[
\phi_P(h, g) - \phi_P(h^k, g^k) = \left\{ \frac{P(h \cdot g_2)}{P(g_2)} - \frac{P(h \cdot g_1)}{P(g_1)} \right\} - \left\{ \frac{P(h^k \cdot g_2^k)}{P(g_2^k)} - \frac{P(h^k \cdot g_1^k)}{P(g_1^k)} \right\} \to 0
\]

if (h^k, g^k) → (h, g) under ρ_P. Similarly, we can show that σ_P is continuous on H̄ × G under ρ_P.


We define some new notation which will be used in the following results. Define a random
element ϕ̂_P : Ω → ℓ∞(Ξ × H̄ × G) such that for each ω ∈ Ω and each (ξ, h, g) ∈ Ξ × H̄ × G,

\[
\hat{\varphi}_P(\omega)(\xi, h, g) = \frac{\phi_P(h, g)}{M(\hat{\sigma}_{P_n}(\omega))(\xi, h, g)}, \tag{B.22}
\]

and let ϕ_P ∈ ℓ∞(Ξ × H̄ × G) be such that for each (ξ, h, g) ∈ Ξ × H̄ × G,

\[
\varphi_P(\xi, h, g) = \frac{\phi_P(h, g)}{M(\sigma_P)(\xi, h, g)}.
\]

Here, σ̂_{P_n} is estimated from data, hence it depends on ω, and so does ϕ̂_P. When there is no danger
of confusion, we omit the ω from σ̂_{P_n} and ϕ̂_P for brevity. Given each sequence r_n → ∞ and each ν
which satisfies Assumption 3.3, define

\[
D_n(\omega) = \big\{ \psi \in \ell^\infty(\Xi \times \bar{H} \times G) : S\big(\hat{\varphi}_P(\omega) + r_n^{-1} \psi\big) \in L^1(\nu) \big\} \tag{B.23}
\]

for all ω ∈ Ω, and

\[
g_n(\omega)(\psi) = r_n\, I \circ S\big(\hat{\varphi}_P(\omega) + r_n^{-1} \psi\big) \tag{B.24}
\]

for all ω ∈ Ω and all ψ ∈ D_n(ω). Here, g_n also depends on ω; for brevity, we omit ω from g_n as
well. If the H0 in (13) is true with Q = P_n for all n, then S(ϕ̂_P) = 0 by Lemma B.5, and so g_n(ψ) =
r_n{I ∘ S(ϕ̂_P + r_n^{-1} ψ) − I ∘ S(ϕ̂_P)}. Define a correspondence Ψ : Ξ × ℓ∞(Ξ × H̄ × G) ↠ H̄ × G
by

\[
\Psi(\xi, \psi) = \big\{ (h, g) \in \bar{H} \times G : \psi(\xi, h, g) = S(\psi)(\xi) \big\} \tag{B.25}
\]

 
for all ξ ∈ Ξ and all ψ ∈ ℓ∞(Ξ × H̄ × G), and define a metric ρ_{ξψ} on Ξ × ℓ∞(Ξ × H̄ × G) by

\[
\rho_{\xi\psi}((\xi_1, \psi_1), (\xi_2, \psi_2)) = |\xi_1 - \xi_2| + \|\psi_1 - \psi_2\|_\infty \tag{B.26}
\]

for all (ξ_1, ψ_1), (ξ_2, ψ_2) ∈ Ξ × ℓ∞(Ξ × H̄ × G). Also, define a metric on Ξ × H̄ × G by

\[
\rho_{\xi h g}((\xi_1, h_1, g_1), (\xi_2, h_2, g_2)) = |\xi_1 - \xi_2| + \rho_P((h_1, g_1), (h_2, g_2)) \tag{B.27}
\]

for all (ξ_1, h_1, g_1), (ξ_2, h_2, g_2) ∈ Ξ × H̄ × G. For every set A ⊂ H̄ × G and every δ > 0, define

\[
A^\delta = \Big\{ (h, g) \in \bar{H} \times G : \inf_{(h', g') \in A} \rho_P((h, g), (h', g')) \le \delta \Big\}. \tag{B.28}
\]
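
To fix ideas, the correspondence in (B.25) and the enlargement in (B.28) can be mimicked on a finite grid. In the sketch below, the grid, the function `phi`, and its perturbation are hypothetical stand-ins for H̄ × G, ϕ_P(ξ, ·, ·), and a nearby ψ(ξ, ·, ·); the helper names are illustrative. It shows the kind of containment that upper hemicontinuity delivers: the maximizers of a small perturbation stay inside a δ-enlargement of the original maximizer set.

```python
def argmax_set(values, tol=1e-12):
    """Analogue of Psi: points where the function attains its supremum (up to tol)."""
    top = max(values.values())
    return {point for point, value in values.items() if value >= top - tol}

def enlarge(subset, universe, dist, delta):
    """Analogue of A^delta: points of the universe within delta of the subset."""
    return {x for x in universe if min(dist(x, a) for a in subset) <= delta}

grid = [i / 10 for i in range(11)]
phi = {x: -(x - 0.3) ** 2 for x in grid}        # hypothetical limit, maximized at 0.3
perturbed = {x: phi[x] + 0.15 * x for x in grid}  # sup-norm distance at most 0.15 from phi

dist = lambda x, y: abs(x - y)
psi_limit = argmax_set(phi)
psi_perturbed = argmax_set(perturbed)
# A tiny slack guards against floating-point rounding in |0.4 - 0.3|.
neighbourhood = enlarge(psi_limit, grid, dist, delta=0.1 + 1e-9)
print(psi_limit, psi_perturbed, sorted(neighbourhood))
```

The perturbation moves the maximizer from 0.3 to 0.4, which still lies in the enlarged set.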

Lemma B.5 Suppose Assumption 3.2 holds and the H0 in (13) is true with Q = Pn for all n. Then
the H0 in (13) is true with Q = P . This implies that sup(h,g)∈H̄×G φP (h, g) = 0, and hence that
S (ϕP ) = 0 and S (ϕ̂P ) = 0 for all ω ∈ Ω.

Proof of Lemma B.5. By Lemma B.3, we have kPn − P k∞ → 0. Thus φPn → φP in ℓ∞ (H̄ × G), and
by the assumption that sup(h,g)∈H×G φPn (h, g) ≤ 0 for all n, we have that sup(h,g)∈H×G φP (h, g) ≤ 0.
This implies that sup(h,g)∈H̄×G φP (h, g) ≤ 0 by the constructions of φP and H̄. By the construction of
H̄ × G, there is some (h, g) ∈ H̄ × G, such as h = 1{a}×{0}×R for some a ∈ R, for which φP (h, g) = 0.
Therefore, sup(h,g)∈H̄×G φP (h, g) = 0 under the assumptions. Because ξ ∈ Ξ is always positive by
the construction of Ξ, we have that S (ϕP ) (ξ) = 0 for all ξ ∈ Ξ. For the same reason, S (ϕ̂P ) (ξ) = 0
for all ξ ∈ Ξ and all ω ∈ Ω.

Lemma B.6 The correspondence Ψ defined in (B.25) is upper hemicontinuous¹⁰ at (ξ, ϕ_P) for all ξ ∈ Ξ.
In addition, suppose the H0 in (13) is true with Q = P. Then for every δ > 0 there is an ε > 0
such that Ψ(ξ′, ψ) ⊂ Ψ(ξ, ϕ_P)^δ (where the latter is defined as in (B.28)) for all ξ, ξ′ ∈ Ξ and all
ψ ∈ ℓ∞(Ξ × H̄ × G) with ‖ψ − ϕ_P‖∞ < ε.

Proof of Lemma B.6. We first show that Ψ is upper hemicontinuous at (ξ, ϕP ) for all ξ ∈ Ξ. We
do this in three steps. First, we show that Ψ (ξ, ϕP ) is compact for each ξ ∈ Ξ under ρP . Clearly,
given an arbitrary ξ ∈ Ξ, ϕP (ξ, ·, ·) is continuous on H̄ × G under ρP by Lemma B.4. Because H̄ × G
is compact by Lemma A.8, Ψ (ξ, ϕP ) is not empty. Since Ψ (ξ, ϕP ) ⊂ H̄ × G, it suffices to show
that Ψ (ξ, ϕP ) is closed in H̄ × G. Fix ξ ∈ Ξ. Suppose there is a sequence {(hn , gn )}n ⊂ Ψ (ξ, ϕP )
such that (hn , gn ) → (h, g) ∈ H̄ × G under ρP . Then for all n, ϕP (ξ, hn , gn ) = S (ϕP ) (ξ). Since
ϕP (ξ, ·, ·) is continuous by Lemma B.4, ϕP (ξ, hn , gn ) → ϕP (ξ, h, g) as (hn , gn ) → (h, g). Thus
ϕP (ξ, h, g) = S (ϕP ) (ξ), which implies that Ψ (ξ, ϕP ) is closed in H̄ × G and therefore compact.
Second, we show that if there is a sequence {(ξn , ψn ) , (hn , gn )} such that (hn , gn ) ∈ Ψ (ξn , ψn )
and ρ_{ξψ}((ξ_n, ψ_n), (ξ, ϕ_P)) → 0, where ρ_{ξψ} is defined in (B.26), then (h_n, g_n) has a limit point¹¹ in
Ψ (ξ, ϕP ). Notice that by the constructions of Ξ and H̄ × G, Ξ × H̄ × G is compact under the metric
ρξhg defined in (B.27). It is easy to show, by Lemma B.4, that ϕP is continuous on Ξ × H̄ × G under
10 See Definition 17.2 of upper hemicontinuity in Aliprantis and Border (2006).
11 See the definition of limit point in Aliprantis and Border (2006, p. 31).

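ρ_{ξhg}, and hence that it is uniformly continuous. Thus ρ_{ξψ}((ξ_n, ψ_n), (ξ, ϕ_P)) → 0 implies that

\[
\begin{aligned}
|S(\psi_n)(\xi_n) - S(\varphi_P)(\xi)| &\le \sup_{(h,g) \in \bar{H} \times G} |\psi_n(\xi_n, h, g) - \varphi_P(\xi_n, h, g)| \\
&\quad + \sup_{(h,g) \in \bar{H} \times G} |\varphi_P(\xi_n, h, g) - \varphi_P(\xi, h, g)| \to 0,
\end{aligned}
\]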

where sup_{(h,g)∈H̄×G} |ϕ_P(ξ_n, h, g) − ϕ_P(ξ, h, g)| converges to 0 because ϕ_P is uniformly continuous
on Ξ × H̄ × G under ρ_{ξhg}. This implies that ψ_n(ξ_n, h_n, g_n) → S(ϕ_P)(ξ). Suppose, by way of
contradiction, that (h_n, g_n) has no limit point in Ψ(ξ, ϕ_P). This implies that for each (h, g) ∈ Ψ(ξ, ϕ_P)
there exist an open neighborhood V_{h,g} of (h, g) and an n_{h,g} such that (h_n, g_n) ∉ V_{h,g} when n ≥ n_{h,g}.
Because we have shown that Ψ(ξ, ϕ_P) is compact in H̄ × G, there is a finite open cover V such that
Ψ(ξ, ϕ_P) ⊂ V = V_{h_1,g_1} ∪ ··· ∪ V_{h_M,g_M}. Let n_0 = max_{m≤M} n_{h_m,g_m}. Thus if n > n_0, then (h_n, g_n) ∉ V,
and hence (h_n, g_n) ∉ Ψ(ξ, ϕ_P). Since H̄ × G is compact and V^c is closed in H̄ × G, V^c is compact.
Notice that V^c ∩ Ψ(ξ, ϕ_P) = ∅. Thus

\[
\sup_{(h,g) \in V^c} \varphi_P(\xi, h, g) < \sup_{(h,g) \in \bar{H} \times G} \varphi_P(\xi, h, g) = \sup_{(h,g) \in \Psi(\xi, \varphi_P)} \varphi_P(\xi, h, g).
\]

Let δ = sup_{(h,g)∈H̄×G} ϕ_P(ξ, h, g) − sup_{(h,g)∈V^c} ϕ_P(ξ, h, g). Recall that (h_n, g_n) ∈ V^c for all n > n_0.
Thus ψ_n(ξ_n, h_n, g_n) = sup_{(h,g)∈H̄×G} ψ_n(ξ_n, h, g) = sup_{(h,g)∈V^c} ψ_n(ξ_n, h, g), so

\[
\begin{aligned}
\Big| \psi_n(\xi_n, h_n, g_n) - \sup_{(h,g) \in V^c} \varphi_P(\xi, h, g) \Big| &\le \sup_{(h,g) \in \bar{H} \times G} |\psi_n(\xi_n, h, g) - \varphi_P(\xi_n, h, g)| \\
&\quad + \sup_{(h,g) \in \bar{H} \times G} |\varphi_P(\xi_n, h, g) - \varphi_P(\xi, h, g)| \to 0.
\end{aligned}
\]

This implies that for sufficiently large n,

\[
\psi_n(\xi_n, h_n, g_n) \le \sup_{(h,g) \in V^c} \varphi_P(\xi, h, g) + \frac{\delta}{2} = \sup_{(h,g) \in \bar{H} \times G} \varphi_P(\xi, h, g) - \frac{\delta}{2}.
\]

This contradicts ψ_n(ξ_n, h_n, g_n) → S(ϕ_P)(ξ). Thus (h_n, g_n) has a limit point in Ψ(ξ, ϕ_P). Third, by
Theorem 17.20(ii) of Aliprantis and Border (2006), together with the fact that Ξ × ℓ∞(Ξ × H̄ × G)
is first countable under the metric ρ_{ξψ} defined in (B.26) (every metric space is first countable), Ψ is
upper hemicontinuous at (ξ, ϕ_P).
Now we prove the second claim in the Lemma. Fix δ > 0. Since Ψ is upper hemicontinuous at
(ξ, ϕ_P) for all ξ ∈ Ξ, we have that for each ξ there is an open ball B_{ε_ξ}(ξ, ϕ_P) under ρ_{ξψ} with center
(ξ, ϕ_P) and radius ε_ξ such that Ψ(ξ′, ϕ′) ⊂ Ψ(ξ, ϕ_P)^δ for all (ξ′, ϕ′) ∈ B_{ε_ξ}(ξ, ϕ_P), where Ψ(ξ, ϕ_P)^δ
is defined as in (B.28). Notice that {B_{ε_ξ/2}(ξ)}_{ξ∈Ξ} is an open cover of Ξ, where each B_{ε_ξ/2}(ξ) is
an open ball in R with center ξ and radius ε_ξ/2. Since Ξ is compact by construction, there is a
finite open cover {B_{ε_i}(ξ_i)}_{i=1}^{M} of Ξ with ε_i = ε_{ξ_i}/2. Let ε = min_{i≤M} ε_i. Then for every ξ′ ∈ Ξ
and every ψ ∈ ℓ∞(Ξ × H̄ × G) with ‖ψ − ϕ_P‖∞ < ε, there is an open ball B_{ε_{ξ_i}}(ξ_i, ϕ_P) such that
(ξ′, ψ) ∈ B_{ε_{ξ_i}}(ξ_i, ϕ_P). This implies that Ψ(ξ′, ψ) ⊂ Ψ(ξ_i, ϕ_P)^δ. Suppose the H0 in (13) is true with
Q = P. By Lemma B.5, we have that S(ϕ_P) = 0 and

\[
\Psi(\xi, \varphi_P) = \Psi(\tilde{\xi}, \varphi_P) = \big\{ (h, g) \in \bar{H} \times G : \phi_P(h, g) = 0 \big\}
\]

for all ξ, ξ̃ ∈ Ξ. Thus Ψ(ξ′, ψ) ⊂ Ψ(ξ, ϕ_P)^δ for all ξ ∈ Ξ, that is, the second claim holds.

Lemma B.7 Suppose Assumptions 3.1, 3.2, and 3.3 hold and the H0 in (13) is true with Q = P_n for all
n. For every ε > 0, there is a measurable set Ω_0 ⊂ Ω with P(Ω_0) ≥ 1 − ε such that for every subsequence
{ψ_{n_m}} with ψ_{n_m} ∈ D_{n_m}(ω_{n_m}), ω_{n_m} ∈ Ω_0, where D_{n_m}(ω_{n_m}) is defined in (B.23), and ψ_{n_m} → ψ for
some ψ ∈ C(Ξ × H̄ × G) under the ρ_{ξhg} defined in (B.27), we have that

\[
g_{n_m}(\omega_{n_m})(\psi_{n_m}) \to I \circ S_{\Psi(\xi, \varphi_P)}(\psi),
\]

where g_{n_m} is defined in (B.24).

Proof of Lemma B.7. For simplicity of notation, we replace n_m with n. Note that all the following
results hold for every subsequence indexed by n_m. By Lemma A.8, H̄ × G is compact under ρ_P.
By Lemma B.3, we have σ̂_{P_n} → σ_P almost uniformly. Then by construction, ϕ̂_P → ϕ_P almost
uniformly, where ϕ̂_P is defined in (B.22). By Lemma B.5, S(ϕ_P) = 0 and S(ϕ̂_P) = 0 for all ω ∈ Ω.
For every ψ ∈ C(Ξ × H̄ × G), since ϕ̂_P(ξ, ·, ·) + r_n^{-1} ψ(ξ, ·, ·) may not be continuous on H̄ × G,
Ψ(ξ, ϕ̂_P + r_n^{-1} ψ) may be empty. Here, we construct a modified version of ϕ̂_P, denoted by ϕ̃_P, such
that

(i) ϕ̃_P(ξ, ·, ·) is upper semicontinuous for every ω ∈ Ω, every n, and every ξ ∈ Ξ;

(ii) sup_{(h,g)∈H̄×G} ϕ̂_P(ξ, h, g) = sup_{(h,g)∈H̄×G} ϕ̃_P(ξ, h, g) for every ω ∈ Ω, every n, and every ξ ∈ Ξ;

(iii) sup_{(h,g)∈H̄×G} (ϕ̂_P + r_n^{-1} ψ)(ξ, h, g) = sup_{(h,g)∈H̄×G} (ϕ̃_P + r_n^{-1} ψ)(ξ, h, g) for every function ψ ∈
C(Ξ × H̄ × G), every ω ∈ Ω, every n, and every ξ ∈ Ξ;

(iv) for every ε > 0 there is a measurable set A ⊂ Ω with P(A) ≥ 1 − ε such that for all ϕ ∈
ℓ∞(Ξ × H̄ × G), ϕ̃_P + r_n^{-1} ϕ → ϕ_P uniformly on A.

Specifically, for all ω ∈ Ω, all (ξ, h, g) ∈ Ξ × H̄ × G, and all n, we define ϕ̃_P(ξ, h, g) by

\[
\tilde{\varphi}_P(\xi, h, g) = \lim_{\delta \downarrow 0} \sup_{(h', g') \in B_\delta(h, g)} \hat{\varphi}_P(\xi, h', g'), \tag{B.29}
\]

where B_δ(h, g) is an open ball in H̄ × G under ρ_P with center (h, g) and radius δ.
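
A one-dimensional analogue may help to see what the envelope in (B.29) achieves. The example below is purely illustrative and not part of the original argument: the function f on [0, 1] stands in for ϕ̂_P(ξ, ·, ·), and its envelope stands in for ϕ̃_P(ξ, ·, ·).

```latex
\[
f(x) = x \cdot 1\{x < 1/2\}, \quad x \in [0, 1], \qquad \sup_{x \in [0,1]} f(x) = \tfrac{1}{2} \ \text{is not attained, so } \arg\max f = \emptyset .
\]
\[
\tilde{f}(x) = \lim_{\delta \downarrow 0} \sup_{|x' - x| < \delta} f(x') = x \cdot 1\{x \le 1/2\}, \qquad
\sup_{x \in [0,1]} \tilde{f}(x) = \tilde{f}(1/2) = \tfrac{1}{2} .
\]
```

In this toy case the envelope is upper semicontinuous, has the same supremum as f, and attains it at x = 1/2, which mirrors properties (i) and (ii) and explains why the set of maximizers becomes nonempty after the modification.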
Fix ω ∈ Ω, n, and ξ ∈ Ξ. First, we prove (i), that is, ϕ̃_P(ξ, ·, ·) is upper semicontinuous at every
(h, g) ∈ H̄ × G. Fix (h, g) ∈ H̄ × G. By (B.29), for each ε > 0, there is a δ_ε > 0 such that

\[
\hat{\varphi}_P(\xi, h', g') \le \tilde{\varphi}_P(\xi, h, g) + \frac{\varepsilon}{2} \tag{B.30}
\]

for all (h′, g′) ∈ B_{δ_ε}(h, g), where B_{δ_ε}(h, g) denotes the open ball in H̄ × G under ρ_P with center
(h, g) and radius δ_ε. Fix (h_1, g_1) ∈ B_{δ_ε/2}(h, g). By definition, there is a δ_2 > 0 such that for all δ′
with 0 < δ′ ≤ δ_2,

\[
\tilde{\varphi}_P(\xi, h_1, g_1) \le \sup_{(h_2, g_2) \in B_{\delta'}(h_1, g_1)} \hat{\varphi}_P(\xi, h_2, g_2) + \frac{\varepsilon}{2}.
\]

Let δ = min{δ_ε/2, δ_2}. Then for this (h_1, g_1), we have that

\[
\tilde{\varphi}_P(\xi, h_1, g_1) \le \sup_{(h_2, g_2) \in B_\delta(h_1, g_1)} \hat{\varphi}_P(\xi, h_2, g_2) + \frac{\varepsilon}{2}.
\]

Notice that if (h_2, g_2) ∈ B_δ(h_1, g_1), then (h_2, g_2) ∈ B_{δ_ε}(h, g), and hence ϕ̂_P(ξ, h_2, g_2) ≤ ϕ̃_P(ξ, h, g) +
ε/2. This implies that sup_{(h_2,g_2)∈B_δ(h_1,g_1)} ϕ̂_P(ξ, h_2, g_2) ≤ ϕ̃_P(ξ, h, g) + ε/2, and hence ϕ̃_P(ξ, h_1, g_1) ≤
ϕ̃_P(ξ, h, g) + ε. This shows that for each ε > 0, there is a δ_ε > 0 such that for all (h_1, g_1) ∈ B_{δ_ε/2}(h, g),
ϕ̃_P(ξ, h_1, g_1) ≤ ϕ̃_P(ξ, h, g) + ε. Second, we prove (ii), that is,

\[
\sup_{(h,g) \in \bar{H} \times G} \hat{\varphi}_P(\xi, h, g) = \sup_{(h,g) \in \bar{H} \times G} \tilde{\varphi}_P(\xi, h, g). \tag{B.31}
\]

By the definition of ϕ̃_P, we have ϕ̂_P(ξ, h, g) ≤ ϕ̃_P(ξ, h, g) for all (h, g) ∈ H̄ × G, and hence
sup_{(h,g)∈H̄×G} ϕ̂_P(ξ, h, g) ≤ sup_{(h,g)∈H̄×G} ϕ̃_P(ξ, h, g). Also, by the definition of ϕ̃_P, ϕ̃_P(ξ, h, g) ≤
sup_{(h′,g′)∈H̄×G} ϕ̂_P(ξ, h′, g′) for all (h, g). Thus sup_{(h,g)∈H̄×G} ϕ̃_P(ξ, h, g) ≤ sup_{(h,g)∈H̄×G} ϕ̂_P(ξ, h, g),
and (B.31) holds. Similarly, by the definition of ϕ̃_P, we have that ϕ̂_P(ξ, h, g) + r_n^{-1} ψ(ξ, h, g) ≤
ϕ̃_P(ξ, h, g) + r_n^{-1} ψ(ξ, h, g) for all (h, g) ∈ H̄ × G, and hence

\[
\sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} \le \sup_{(h,g) \in \bar{H} \times G} \{\tilde{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\}.
\]

Fix (h, g) ∈ H̄ × G. Since ψ(ξ, ·, ·) is continuous under ρ_P, for every ε > 0 there is a δ̄ > 0 such that

\[
\sup_{(h', g') \in B_\delta(h, g)} \{\hat{\varphi}_P(\xi, h', g') + r_n^{-1} \psi(\xi, h, g) - \varepsilon\} \le \sup_{(h', g') \in B_\delta(h, g)} \{\hat{\varphi}_P(\xi, h', g') + r_n^{-1} \psi(\xi, h', g')\}
\]

for all δ ≤ δ̄. By the definition of ϕ̃_P, this implies that

\[
\begin{aligned}
\tilde{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g) - \varepsilon &\le \lim_{\delta \downarrow 0} \sup_{(h', g') \in B_\delta(h, g)} \{\hat{\varphi}_P(\xi, h', g') + r_n^{-1} \psi(\xi, h', g')\} \\
&\le \sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\}.
\end{aligned}
\]

Since ε is arbitrary, we have ϕ̃_P(ξ, h, g) + r_n^{-1} ψ(ξ, h, g) ≤ sup_{(h,g)∈H̄×G} {ϕ̂_P(ξ, h, g) + r_n^{-1} ψ(ξ, h, g)}.
This holds for all (h, g) ∈ H̄ × G, which implies that

\[
\sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} \ge \sup_{(h,g) \in \bar{H} \times G} \{\tilde{\varphi}_P(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\}.
\]

Thus (iii) is proved.

Last, we prove (iv). Since ϕ_P(ξ, ·, ·) is continuous, we have that

\[
\sup_{(\xi, h, g) \in \Xi \times \bar{H} \times G} \big| \tilde{\varphi}_P(\xi, h, g) + r_n^{-1} \varphi(\xi, h, g) - \varphi_P(\xi, h, g) \big| \le \sup_{(\xi, h, g) \in \Xi \times \bar{H} \times G} |\hat{\varphi}_P(\xi, h, g) - \varphi_P(\xi, h, g)| + r_n^{-1} \|\varphi\|_\infty.
\]

(iv) follows from the facts that ϕ̂_P → ϕ_P almost uniformly, as mentioned at the beginning of the
proof, and ‖ϕ‖∞ < ∞.
Fix ε > 0. By property (iv), let Ω_0 ⊂ Ω be a measurable set such that P(Ω_0) ≥ 1 − ε and
ϕ̃_P + r_n^{-1} ϕ → ϕ_P uniformly on Ω_0 for all ϕ ∈ ℓ∞(Ξ × H̄ × G). Let ψ_n ∈ D_n(ω_n), ω_n ∈ Ω_0, and
ψ ∈ C(Ξ × H̄ × G) be arbitrary maps such that ψ_n → ψ. By property (i) that we proved above, we
have that Ψ(ξ, ϕ̃_P + r_n^{-1} ψ) ≠ ∅ for all ω ∈ Ω_0, all n, and all ξ ∈ Ξ. It is easy to show that because
ψ_n → ψ in ℓ∞(Ξ × H̄ × G),

\[
\begin{aligned}
& \sup_{\xi \in \Xi} \Big| \sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi_n(\xi, h, g)\} - \sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} \Big| \\
&\le r_n^{-1} \sup_{(\xi, h, g) \in \Xi \times \bar{H} \times G} |\psi_n(\xi, h, g) - \psi(\xi, h, g)| = o(r_n^{-1}).
\end{aligned}
\]

Since ϕ̃_P + r_n^{-1} ψ converges to ϕ_P uniformly on Ω_0, by Lemma B.6 there is a sequence δ_n ↓ 0 such
that Ψ(ξ, ϕ̃_P(ω) + r_n^{-1} ψ) ⊂ Ψ(ξ, ϕ_P)^{δ_n} for all ξ ∈ Ξ and all ω ∈ Ω_0. (By Lemma B.6, δ_n does not
depend on ξ ∈ Ξ or on ω ∈ Ω_0.) Since S(ϕ_P) = 0 by Lemma B.5, we have that for all ξ ∈ Ξ,

\[
\Psi(\xi, \varphi_P) = \{(h, g) \in \bar{H} \times G : \phi_P(h, g) = 0\}. \tag{B.32}
\]

By Lemma B.5 and the constructions of ϕ̂_P and ϕ̃_P, we also have that for all ω, ϕ̂_P ≤ 0 and ϕ̃_P ≤ 0
on Ξ × H̄ × G, and ϕ̂_P(ξ, ·, ·) = 0 on Ψ(ξ, ϕ_P). Thus for every ξ ∈ Ξ,

\[
\begin{aligned}
\sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} &\ge \sup_{(h,g) \in \Psi(\xi, \varphi_P)} \{\hat{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} \\
&= \sup_{(h,g) \in \Psi(\xi, \varphi_P)} r_n^{-1} \psi(\xi, h, g).
\end{aligned}
\]

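By property (iii) of ϕ̃_P, together with the results shown above, we have that

\[
\begin{aligned}
& \sup_{\xi \in \Xi} \Big| \sup_{(h,g) \in \bar{H} \times G} \{\hat{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} - \sup_{(h,g) \in \Psi(\xi, \varphi_P)} r_n^{-1} \psi(\xi, h, g) \Big| \\
&= \sup_{\xi \in \Xi} \Big\{ \sup_{(h,g) \in \Psi(\xi, \tilde{\varphi}_P(\omega_n) + r_n^{-1} \psi)} \{\tilde{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} - \sup_{(h,g) \in \Psi(\xi, \varphi_P)} r_n^{-1} \psi(\xi, h, g) \Big\} \\
&\le \sup_{\xi \in \Xi} \Big\{ \sup_{(h,g) \in \Psi(\xi, \varphi_P)^{\delta_n}} \{\tilde{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} - \sup_{(h,g) \in \Psi(\xi, \varphi_P)} r_n^{-1} \psi(\xi, h, g) \Big\}.
\end{aligned}
\]

Then by the definition of Ψ(ξ, ϕ_P)^{δ_n},

\[
\begin{aligned}
& \sup_{\xi \in \Xi} \Big\{ \sup_{(h,g) \in \Psi(\xi, \varphi_P)^{\delta_n}} \{\tilde{\varphi}_P(\omega_n)(\xi, h, g) + r_n^{-1} \psi(\xi, h, g)\} - \sup_{(h,g) \in \Psi(\xi, \varphi_P)} r_n^{-1} \psi(\xi, h, g) \Big\} \\
&\le \sup_{\xi \in \Xi} \Big\{ \sup_{\rho_P((h_1, g_1), (h_2, g_2)) \le \delta_n} r_n^{-1} |\psi(\xi, h_1, g_1) - \psi(\xi, h_2, g_2)| \Big\} = o(r_n^{-1}).
\end{aligned}
\]

Finally, combining all the results above, we can conclude that

\[
\sup_{\xi \in \Xi} \Big| S\big(\hat{\varphi}_P(\omega_n) + r_n^{-1} \psi_n\big)(\xi) - r_n^{-1} \sup_{(h,g) \in \Psi(\xi, \varphi_P)} \psi(\xi, h, g) \Big| = o(r_n^{-1}).
\]

This implies that

\[
\begin{aligned}
& \Big| g_n(\omega_n)(\psi_n) - \int_\Xi \sup_{(h,g) \in \Psi(\xi, \varphi_P)} \psi(\xi, h, g)\, d\nu(\xi) \Big| \\
&\le \int_\Xi \Big| r_n S\big(\hat{\varphi}_P(\omega_n) + r_n^{-1} \psi_n\big)(\xi) - \sup_{(h,g) \in \Psi(\xi, \varphi_P)} \psi(\xi, h, g) \Big|\, d\nu(\xi) = o(1).
\end{aligned}
\]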


Proof of Theorem 3.1. By (B.14), √n(φ̂_{P_n} − φ_P) ⇝ L′_P(G_P + Q_0), where L′_P(G_P + Q_0) is tight as
shown in the proof of Lemma 3.1. By Lemma B.3, M(σ̂_{P_n}) → M(σ_P) almost uniformly, and hence
this convergence is also in outer probability by Lemma 1.9.3(ii) of van der Vaart and Wellner (1996).
By Lemma 1.10.2(iii) of van der Vaart and Wellner (1996), M(σ̂_{P_n}) ⇝ M(σ_P). By Example 1.4.7
(Slutsky’s lemma) of van der Vaart and Wellner (1996), we have that (√n(φ̂_{P_n} − φ_P), M(σ̂_{P_n})) ⇝
(L′_P(G_P + Q_0), M(σ_P)). Let ℓ∞(Ξ × H̄ × G)_+ = {ψ ∈ ℓ∞(Ξ × H̄ × G) : ‖1/ψ‖∞ < ∞}. Define a map
f : ℓ∞(H̄ × G) × ℓ∞(Ξ × H̄ × G)_+ → ℓ∞(Ξ × H̄ × G) by f(ϕ, ψ) = ϕ/ψ for all (ϕ, ψ) ∈ ℓ∞(H̄ × G) ×
ℓ∞(Ξ × H̄ × G)_+. Clearly, (L′_P(G_P + Q_0), M(σ_P)) takes its values in ℓ∞(H̄ × G) × ℓ∞(Ξ × H̄ × G)_+. It
is easy to show that f is continuous under the metric ‖(ϕ, ψ) − (ϕ′, ψ′)‖ = ‖ϕ − ϕ′‖∞ + ‖ψ − ψ′‖∞.
By Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996),

\[
f\big(\sqrt{n}(\hat{\phi}_{P_n} - \phi_P), M(\hat{\sigma}_{P_n})\big) = \frac{\sqrt{n}(\hat{\phi}_{P_n} - \phi_P)}{M(\hat{\sigma}_{P_n})} \rightsquigarrow \frac{L'_P(G_P + Q_0)}{M(\sigma_P)}.
\]

By Lemma B.5, we have that I ∘ S(φ_P/M(σ̂_{P_n})) = 0. Then by Theorem A.2(ii) and Lemma B.7,
together with the continuity of I ∘ S_{Ψ(ξ,ϕ_P)} under ‖·‖∞, we have

\[
\sqrt{n} \left\{ I \circ S\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) - I \circ S\left( \frac{\phi_P}{M(\hat{\sigma}_{P_n})} \right) \right\} \rightsquigarrow I \circ S_{\Psi(\xi, \varphi_P)}\left( \frac{L'_P(G_P + Q_0)}{M(\sigma_P)} \right). \tag{B.33}
\]

By Lemma B.3, T_n/n → Λ(P) almost uniformly. Then by Lemmas 1.9.3(ii) and 1.10.2(iii), Example
1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner
(1996), together with (B.33), we have that

\[
\sqrt{\frac{T_n}{n}} \cdot \sqrt{n} \left\{ I \circ S\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) \right\} \rightsquigarrow I \circ S_{\Psi(\xi, \varphi_P)}\left( \frac{G}{M(\sigma_P)} \right),
\]

where G = √Λ(P) · L′_P(G_P + Q_0) as in Lemma 3.1. By Lemma B.5, we have that Ψ(ξ, ϕ_P) = Ψ_{H̄×G}
defined by (24) for all ξ ∈ Ξ under the assumptions.

Remark B.3 If the H0 in (13) is true with Q = P_n for all n, it is easy to show that S(φ_P/M(σ_P)) = 0
(see Lemma B.5). Thus it suffices to find the asymptotic distribution of

\[
\sqrt{n}\, I \circ S_{H \times G}\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) = \sqrt{n} \left\{ I \circ S\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) - I \circ S\left( \frac{\phi_P}{M(\sigma_P)} \right) \right\}. \tag{B.34}
\]

If we can find the asymptotic distribution of √n(φ̂_{P_n}/M(σ̂_{P_n}) − φ_P/M(σ_P)) and the “derivative” of I ∘ S
(see, for example, the definition of Hadamard directional derivative in Shapiro (1990) and Fang and
Santos (2018)), then by the delta method of Fang and Santos (2018), it is straightforward to obtain the
asymptotic distribution of (B.34). However, establishing the limiting distribution of √n(φ̂_{P_n}/M(σ̂_{P_n}) −
φ_P/M(σ_P)) is technically tricky. By the constructions of φ_P and σ_P, we can view φ_P/M(σ_P) as a
map of P. Specifically, let V_0 = {v : v = h · g_l or v = h² · g_l for some h ∈ H̄ and g_l ∈ G_K} and
D_Q = {Q ∈ ℓ∞(V_0 ∪ G_K) : Q(h · g_l)/Q(g_l) and Q(h² · g_l)/Q(g_l) exist for all h ∈ H̄ and g_l ∈ G_K}. Then
we extend the definitions of φ_Q and σ_Q for all Q ∈ P, that is, the φ_Q defined in (12) and the σ_Q defined
in (17), to all Q ∈ D_Q. Clearly, P ⊂ D_Q by (11). Define a map T : D_Q → ℓ∞(Ξ × H̄ × G) by

\[
T(Q)(\xi, h, g) = \frac{\phi_Q(h, g)}{M(\sigma_Q)(\xi, h, g)}
\]

for all Q ∈ D_Q and (ξ, h, g) ∈ Ξ × H̄ × G. Now we have that T(P) = φ_P/M(σ_P) and T(P̂_n) =
φ̂_{P_n}/M(σ̂_{P_n}). Suppose we have weak convergence of √n(P̂_n − P) in some suitable space. Then if T is
Hadamard (directionally) differentiable, by the delta method we can establish weak convergence of

\[
\sqrt{n} \left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} - \frac{\phi_P}{M(\sigma_P)} \right) = \sqrt{n} \big\{ T(\hat{P}_n) - T(P) \big\}. \tag{B.35}
\]

Unfortunately, however, T is nondifferentiable, because of the nondifferentiability of the M defined in
(21) (M is not differentiable even when Ξ is a singleton), and hence it is not straightforward to show
the convergence of √n(T(P̂_n) − T(P)). Inspired by Kitagawa (2015), with the asymptotic distribution
of √n(φ̂_{P_n}/M(σ̂_{P_n}) − φ_P/M(σ̂_{P_n})) (which can be obtained by using Slutsky’s theorem), we can instead
establish the asymptotic distribution of

\[
\sqrt{n} \left\{ I \circ S\left( \frac{\hat{\phi}_{P_n}}{M(\hat{\sigma}_{P_n})} \right) - I \circ S\left( \frac{\phi_P}{M(\hat{\sigma}_{P_n})} \right) \right\}, \tag{B.36}
\]

where S(φ_P/M(σ̂_{P_n})) = 0 by Lemma B.5 if the H0 in (13) is true with Q = P_n for all n. However,
existing delta methods cannot be used to establish the asymptotic distribution of (B.36) either. Since
φ_P/M(σ̂_{P_n}) is a random element, delta methods such as Theorem 3.9.4 or Theorem 3.9.5 of van der
Vaart and Wellner (1996), or Theorem 2.1 of Fang and Santos (2018), do not work in this case. To
overcome the technical complications due to the random element φ_P/M(σ̂_{P_n}), we provide the extended
continuous mapping theorem and the extended delta method elaborated by Theorems A.1 and A.2,
respectively.

Proof of Corollary 3.1. By Lemma B.5, φ_P(h, g) ≤ 0 for all (h, g) ∈ H̄ × G, and there exists
(h^0, g^0) ∈ H̄ × G with g^0 = (g_1^0, g_2^0) such that φ_P(h^0, g^0) = 0. First, we show that if h^0 = (−1)^d ·
1_{A×{d}×R}, where d ∈ {0, 1} and A is a half-closed interval or an open interval, then for every closed
interval B such that B ⊂ A, we have that φ_P(h̃, g^0) = 0 with h̃ = (−1)^d · 1_{B×{d}×R}. Suppose, by
way of contradiction, that A = (a_1, a_2) and B = [b_1, b_2] with a_1 < b_1, a_2 > b_2, and φ_P(h̃, g^0) < 0
with h̃ = (−1)^d · 1_{B×{d}×R}. Let h_L = (−1)^d · 1_{(a_1,b_1)×{d}×R} and h_R = (−1)^d · 1_{(b_2,a_2)×{d}×R}. Then by
the definition of φ_P,

\[
\begin{aligned}
\phi_P(h^0, g^0) &= \frac{P(h^0 \cdot g_2^0)}{P(g_2^0)} - \frac{P(h^0 \cdot g_1^0)}{P(g_1^0)} = \frac{P((h_L + \tilde{h} + h_R) \cdot g_2^0)}{P(g_2^0)} - \frac{P((h_L + \tilde{h} + h_R) \cdot g_1^0)}{P(g_1^0)} \\
&= \phi_P(\tilde{h}, g^0) + \phi_P(h_L, g^0) + \phi_P(h_R, g^0).
\end{aligned}
\]

Since φ_P(h^0, g^0) = 0 but φ_P(h̃, g^0) < 0, we have φ_P(h_L, g^0) + φ_P(h_R, g^0) > 0. This implies that
either φ_P(h_L, g^0) > 0 or φ_P(h_R, g^0) > 0. However, since (h_L, g^0), (h_R, g^0) ∈ H̄ × G, Lemma B.5
shows that both φ_P(h_L, g^0) and φ_P(h_R, g^0) are nonpositive. This is a contradiction. When A is
a half-closed interval, we can show analogously that the claim is true. Second, we show that if
h^0 = 1_{R×C×R} with C = (−∞, c) for some c ∈ R, then there is a sequence of sets C_k = (−∞, c_k]
with c_k ↑ c such that φ_P(h^k, g^0) = 0 with h^k = 1_{R×C_k×R}. By assumption, D is a finite set. Under
Assumption 3.1, D is a discrete random variable with D ∈ D under P_n. Then D ∈ D under P by
Lemma B.3, and the claim holds.
The above results imply that Ψ_{H̄×G} ⊂ \overline{Ψ_{H×G}}, where \overline{Ψ_{H×G}} is the closure of Ψ_{H×G} in H̄ × G under
ρ_P. By (24) and Lemma B.4, Ψ_{H̄×G} = \overline{Ψ_{H×G}}. By Lemma 3.1, G almost surely has a continuous path
under ρ_P. By Lemma B.4, σ_P is continuous under ρ_P. Thus the Corollary follows from Theorem 3.1
and the continuity of G/M(σ_P) under ρ_P for every fixed ξ ∈ Ξ.
We now introduce the notation for the bootstrap elements. Let $(W_{n1}, \ldots, W_{nn})$ be a vector of random multinomial weights independent of $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$ for all $n$. As defined in (14), $\hat{P}_n$ is the empirical measure of an i.i.d. sample $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$ from probability distribution $P_n$. Given the sample values, the $\{(\hat{Y}_i, \hat{D}_i, \hat{Z}_i)\}_{i=1}^{n}$ introduced in Section 3.1.1 is an i.i.d. sample from $\hat{P}_n$. We can write the empirical measure of $\{(\hat{Y}_i, \hat{D}_i, \hat{Z}_i)\}_{i=1}^{n}$, given sample $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$, as $\hat{P}_n^B = n^{-1} \sum_{i=1}^{n} W_{ni} \delta_{(Y_i, D_i, Z_i)}$, where $\delta_{(Y_i, D_i, Z_i)}$ is a Dirac measure centered at $(Y_i, D_i, Z_i)$. Given the $\hat{\phi}^B_{P_n}$, $T_n^B$, and $\hat{\sigma}^B_{P_n}$ defined in Section 3.1.1, $\hat{\phi}^B_{P_n}/\mathcal{M}(\hat{\sigma}^B_{P_n})$ is a map of $\{(Y_i, D_i, Z_i, W_{ni})\}_{i=1}^{n}$ to the space $\ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$. We follow Section 3.6 of van der Vaart and Wellner (1996) and (A.1) to define the conditional outer expectations. When we compute the outer expectations as in (A.1), independence is understood in terms of a product space. Under Assumptions 3.1 and 3.2, each term $(Y_i, D_i, Z_i)$ of the sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$ has probability distribution $P$. Let $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$ be the coordinate projections on the first $\infty$ coordinates of the product space $((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty}) \times (\mathcal{W}, \mathcal{C}, P_W)$, and let the multinomial vectors $W$ depend on the last factor only. For each real-valued map $T$ on $((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty}) \times (\mathcal{W}, \mathcal{C}, P_W)$, we can take $(\Omega_1, \mathcal{A}_1, P_1) = ((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty})$ and $(\Omega_2, \mathcal{A}_2, P_2) = (\mathcal{W}, \mathcal{C}, P_W)$ and define a real-valued map $E_W^*[T]$ on $((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty})$ by

$$E_W^*[T](\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}) = E_2^*[T](\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}) \tag{B.37}$$

for each sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty} \in (\mathbb{R}^3)^{\infty}$, where $E_2^*[T]$ is defined as in (A.1). We call the left-hand side of (B.37) the conditional outer expectation of $T$ given the sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$. Since $E_W^*[T]$ is a real-valued map on $((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty})$, we can compute its outer and inner integrals (expectations) with respect to $((\mathbb{R}^3)^{\infty}, \mathcal{B}_{\mathbb{R}^3}^{\infty}, P^{\infty})$. For simplicity of notation, we write them as $E^*[E_W^*[T]]$ and $E_*[E_W^*[T]]$, respectively. If $T(\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}, \cdot)$ is a measurable integrable map on $(\mathcal{W}, \mathcal{C}, P_W)$ for every given sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$, we then write $E_W[T]$ for $E_W^*[T]$ and call the quantity $E_W[T](\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty})$ the conditional expectation of $T$ given the sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$. The conditional inner expectation is defined analogously. If $\mathbb{D}$ is a metric space with metric $d$, we define

$$BL_1(\mathbb{D}) = \{f : \mathbb{D} \to \mathbb{R} : \|f\|_{\infty} \le 1, \ |f(x_1) - f(x_2)| \le d(x_1, x_2) \text{ for all } x_1, x_2 \in \mathbb{D}\}.$$
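The weighted form of the bootstrap empirical measure, $n^{-1} \sum_{i} W_{ni} \delta_{(Y_i, D_i, Z_i)}$, can be illustrated with a minimal sketch. It assumes the weights are Multinomial$(n; 1/n, \ldots, 1/n)$, which is what drawing $n$ points with replacement from the sample produces; the function names `multinomial_weights` and `weighted_mean` and the scalar toy sample are hypothetical and only stand in for the paper's $(Y_i, D_i, Z_i)$ triples.

```python
import random


def multinomial_weights(n, rng):
    """Draw (W_n1, ..., W_nn): counts of n draws with replacement from n indices."""
    weights = [0] * n
    for _ in range(n):
        weights[rng.randrange(n)] += 1
    return weights


def weighted_mean(values, weights):
    """Integrate the identity map against n^{-1} * sum_i W_ni * delta_{x_i}."""
    n = len(values)
    return sum(w * v for w, v in zip(weights, values)) / n


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]  # toy scalar sample
weights = multinomial_weights(len(sample), rng)

# Mean under the bootstrap empirical measure; it equals the mean of the
# resampled data set in which observation i appears weights[i] times.
boot_mean = weighted_mean(sample, weights)
```

Setting every weight to one recovers the ordinary empirical measure, so the same function evaluates both $\hat{P}_n$ and $\hat{P}_n^B$.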

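Lemma B.8 Suppose Assumptions 3.1 and 3.2 hold.

(i) $\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n})$ satisfies

$$\sup_{f \in BL_1(\ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G}))} \left| E_W\left[ f\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right) \right] - E\left[ f\left( \frac{\mathbb{G}_0}{\mathcal{M}(\sigma_P)} \right) \right] \right| \to 0 \tag{B.38}$$

in outer probability, where $\mathbb{G}_0 = \sqrt{\Lambda(P)} \cdot \mathcal{L}'_P(\mathbb{G}_P)$ is tight and $\mathbb{G}_P$ is as in Lemma B.2;

(ii) $\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n}) \rightsquigarrow \mathbb{G}_0/\mathcal{M}(\sigma_P)$;¹²

(iii) For each continuous, bounded $f : \ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G}) \to \mathbb{R}$, $f(\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n}))$ is a measurable function of $\{W_{ni}\}_{i=1}^{n}$ for every given sequence $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$.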

p
Proof of Lemma B.8. (i). To explore the conditional property of TnB (φ̂B B
Pn − φ̂Pn )/M(σ̂Pn ),
∞ ∞
we consider the entire sequence {(Yi , Di , Zi )}i=1 .13 Each term (Yi , Di , Zi ) in {(Yi , Di , Zi )}i=1 has
probability distribution P under Assumptions 3.1 and 3.2. Now the P̂n defined in (14) can be viewed
p
12 This TnB (φ̂B B ∞
implies that Pn − φ̂Pn )/M(σ̂Pn ) is asymptotically measurable jointly in {(Yi , Di , Zi )}i=1 and W by
Lemma 1.3.8 of van der Vaart and Wellner (1996).
13
p We follow Section 3.6 of van der Vaart and Wellner (1996) to obtain the conditional property of the bootstrap element
TnB (φ̂B B ∞
Pn − φ̂Pn )/M(σ̂Pn ) given the entire sequence {(Yi , Di , Zi )}i=1 .

25

as being computed with the first n elements of {(Yi , Di , Zi )}i=1 that are distributed according to P .

By Lemma A.5, n(P̂n − P ) GP under P , where GP is the limit shown in Lemma B.2. By the
construction of Ṽ in (A.7), F = 1 is an envelope function of Ṽ and P ∗ (supv∈Ṽ |v − P (v)|2 ) < ∞,
where P ∗ is the outer probability measure of P . By Lemma A.5, Ṽ is Donsker. By Theorem 3.6.2 of
van der Vaart and Wellner (1996), we have that


sup |EW [f { n(P̂nB − P̂n )}] − E[f (GP )]| → 0 (B.39)
f ∈BL1 (ℓ∞ (Ṽ))

outer almost surely14 and

√ √
EW [f { n(P̂nB − P̂n )}∗ ] − EW [f { n(P̂nB − P̂n )}∗ ] → 0 (B.40)

almost surely for every f ∈ BL1 (ℓ∞ (Ṽ)). Here, the asterisks denote the measurable cover functions
with respect to {(Yi , Di , Zi )}∞
i=1 and W jointly. Then by Lemmas B.1, A.5, and A.6 in this paper, and
Theorem 3.9.13 of van der Vaart and Wellner (1996), we have


sup |EW [f { n(L(P̂nB ) − L(P̂n ))}] − E[f (L′P (GP ))]| → 0 (B.41)
f ∈BL1 (ℓ∞ (H̄×G))

outer almost surely and

√ √
EW [f { n(L(P̂nB ) − L(P̂n ))}∗ ] − EW [f { n(L(P̂nB ) − L(P̂n ))}∗ ] → 0 (B.42)

almost surely for every f ∈ BL1 (ℓ∞ (H̄ × G)). The outer almost sure convergence in (B.41) implies

that n(L(P̂nB ) − L(P̂n )) L′P (GP ) for almost every given sequence {(Yi , Di , Zi )}∞
i=1 . By Lemma
A.6 in this paper, and Lemmas 1.9.2 and 1.9.3 of van der Vaart and Wellner (1996), we have that
kP̂nB − P̂n k∞ → 0 outer almost surely for almost every given sequence {(Yi , Di , Zi )}∞
i=1 . By Lemma
A.6 again, kP̂n − P k∞ → 0 for almost every sequence {(Yi , Di , Zi )}∞
i=1 . Thus now we have that
kP̂nB − P k∞ ≤ kP̂nB − P̂n k∞ + kP̂n − P k∞ → 0 outer almost surely for almost every given sequence
{(Yi , Di , Zi )}∞ B B
i=1 . This implies that kσ̂Pn − σP k∞ → 0 and Tn /n → Λ(P ) outer almost surely for
almost every given sequence {(Yi , Di , Zi )}∞
i=1 . This, together with (B.41), and Lemmas 1.9.2(i) and
1.10.2(iii), Example 1.4.7 (Slutsky’s lemma), and Theorem 1.3.6 (continuous mapping) of van der
p
Vaart and Wellner (1996), implies that TnB (L(P̂nB ) − L(P̂n ))/M(σ̂PBn ) G0 /M(σP ) for almost
every given sequence {(Yi , Di , Zi )}∞
i=1 . Since GP is tight, G0 is tight by (B.11).
¹⁴ As discussed in van der Vaart and Wellner (1996, p. 183), $f\{\sqrt{n}(\hat{P}_n^B - \hat{P}_n)\}$ is measurable as a function of the random weights given the values of the sample. Thus we use the conditional expectation $E_W[f\{\sqrt{n}(\hat{P}_n^B - \hat{P}_n)\}]$ in (B.39). Similarly, we use the conditional expectation $E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}]$ in (B.41).

(ii). By (B.42) and Theorem 2.37 of Folland (1999) (Fubini), together with the dominated convergence theorem and Lemma 1.2.1 of van der Vaart and Wellner (1996),

$$E^*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] - E_*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] \to 0 \tag{B.43}$$

for every $f \in BL_1(\ell^{\infty}(\bar{\mathcal{H}} \times \mathcal{G}))$. By (B.41), together with the definition of outer almost sure convergence (Definition 1.9.1(iii) of van der Vaart and Wellner (1996)), we have that for every function $f \in BL_1(\ell^{\infty}(\bar{\mathcal{H}} \times \mathcal{G}))$,

$$\left| E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] - E[f(\mathcal{L}'_P(\mathbb{G}_P))] \right|^* \to 0 \tag{B.44}$$

almost surely. Thus by (B.44), together with Lemma 1.2.2(iii) of van der Vaart and Wellner (1996), we have that

$$\left| (E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}])^* - E[f(\mathcal{L}'_P(\mathbb{G}_P))] \right| \to 0 \tag{B.45}$$

almost surely for every $f \in BL_1(\ell^{\infty}(\bar{\mathcal{H}} \times \mathcal{G}))$. By Lemma 1.2.6 (Fubini's theorem) of van der Vaart and Wellner (1996),

$$E^*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] \ge E^*[E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}]] \ge E_*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}]. \tag{B.46}$$

Then by Lemma 1.2.1 of van der Vaart and Wellner (1996) and (B.43), we have that

$$E^*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] = E[(E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}])^*] + o(1). \tag{B.47}$$

Now with (B.45) we can conclude that

$$\begin{aligned}
&\left| E^*[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}] - E[f(\mathcal{L}'_P(\mathbb{G}_P))] \right| \\
&\quad = \left| E[(E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}])^*] + o(1) - E[f(\mathcal{L}'_P(\mathbb{G}_P))] \right| \\
&\quad \le E\left[ \left| (E_W[f\{\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))\}])^* - E[f(\mathcal{L}'_P(\mathbb{G}_P))] \right| \right] + o(1) \to 0
\end{aligned}$$

for every $f \in BL_1(\ell^{\infty}(\bar{\mathcal{H}} \times \mathcal{G}))$, where the equality is from (B.47) and the convergence is by the dominated convergence theorem together with the almost sure convergence in (B.45). This implies that $\sqrt{n}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n)) \rightsquigarrow \mathcal{L}'_P(\mathbb{G}_P)$ unconditionally. Similarly, by (B.39) and (B.40) we can easily show that $\sqrt{n}(\hat{P}_n^B - \hat{P}_n) \rightsquigarrow \mathbb{G}_P$ unconditionally. Thus we can conclude that $\hat{P}_n^B - \hat{P}_n \to 0$ in outer probability by Lemma 1.10.2(iii) of van der Vaart and Wellner (1996). By Lemma A.6 in this paper and Lemmas 1.9.3 and 1.2.2(i) of van der Vaart and Wellner (1996), we have that $\hat{P}_n^B \to P$ in outer probability, and hence $T_n^B/n \to \Lambda(P)$ and $\mathcal{M}(\hat{\sigma}^B_{P_n}) \to \mathcal{M}(\sigma_P)$ in outer probability by Theorem 1.9.5 (continuous mapping) of van der Vaart and Wellner (1996). By Lemma 1.10.2(iii), Example 1.4.7 (Slutsky's lemma), and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996), $\sqrt{T_n^B}(\mathcal{L}(\hat{P}_n^B) - \mathcal{L}(\hat{P}_n))/\mathcal{M}(\hat{\sigma}^B_{P_n}) \rightsquigarrow \mathbb{G}_0/\mathcal{M}(\sigma_P)$ unconditionally. This verifies (ii) of the Lemma.
(iii). This claim holds naturally under our constructions.
To explore the property of the bootstrap test statistic, we introduce the following notation. For all sets $A_1, A_2 \subset \bar{\mathcal{H}} \times \mathcal{G}$, define $\vec{d}_H(A_1, A_2) = \sup_{a \in A_1} \inf_{b \in A_2} \rho_P(a, b)$ and

$$d_H(A_1, A_2) = \max\left\{ \vec{d}_H(A_1, A_2), \ \vec{d}_H(A_2, A_1) \right\}.$$
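For intuition, the directed and symmetric Hausdorff distances just defined can be computed directly when both sets are finite. The sketch below is only an illustration under that assumption: it takes a generic distance function `dist` in place of $\rho_P$ (absolute difference on the real line in the example), and the function names are hypothetical.

```python
def directed_hausdorff(a_set, b_set, dist):
    """sup over a in A of inf over b in B of dist(a, b), for finite nonempty sets."""
    return max(min(dist(a, b) for b in b_set) for a in a_set)


def hausdorff(a_set, b_set, dist):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(a_set, b_set, dist),
        directed_hausdorff(b_set, a_set, dist),
    )


def abs_dist(x, y):
    """Stand-in for rho_P: absolute difference on the real line."""
    return abs(x - y)


A1 = [0.0, 1.0]
A2 = [0.0, 1.0, 3.0]

d_12 = directed_hausdorff(A1, A2, abs_dist)  # every point of A1 lies in A2
d_21 = directed_hausdorff(A2, A1, abs_dist)  # the point 3.0 is 2.0 away from A1
d_h = hausdorff(A1, A2, abs_dist)
```

The example shows why both directions matter: $A_1 \subset A_2$ makes one directed distance zero, yet the sets are still far apart in the symmetric distance.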

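Also, define

$$\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} = \left\{ (h, g) \in \bar{\mathcal{H}} \times \mathcal{G} : \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g)}{\mathcal{M}(\hat{\sigma}_{P_n})(\xi_0, h, g)} \right| \le \tau_n \right\}, \tag{B.48}$$

where $\xi_0$ and $\tau_n$ are as in (27). Notice the difference between $\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}$ in (27) and $\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}$ in (B.48). Clearly, $\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}} \subset \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}$.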

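Lemma B.9 Under Assumptions 3.1 and 3.2, if the $H_0$ in (13) is true with $Q = P_n$ for all $n$, then $d_H(\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}) \to 0$ in outer probability, where $\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}$ is defined as in (24).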

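Proof of Lemma B.9. First, under the assumptions, we have that for all $\varepsilon > 0$,

$$\begin{aligned}
\lim_{n \to \infty} \mathbb{P}^*\left( \vec{d}_H\left( \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}, \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \right) > \varepsilon \right)
&\le \lim_{n \to \infty} \mathbb{P}^*\left( \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \setminus \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \ne \emptyset \right) \\
&\le \lim_{n \to \infty} \mathbb{P}^*\left( \sup_{(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}} \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g) - \phi_P(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| > \tau_n \right).
\end{aligned}$$

By Theorem 3.1, $\sqrt{T_n}(\hat{\phi}_{P_n} - \phi_P) \rightsquigarrow \mathbb{G}$. By Lemma B.3, $\hat{\sigma}_{P_n} \to \sigma_P$ almost uniformly, which implies that $\hat{\sigma}_{P_n} \rightsquigarrow \sigma_P$ by Lemmas 1.9.3(ii) and 1.10.2(iii) of van der Vaart and Wellner (1996). Thus by Example 1.4.7 (Slutsky's lemma) and Theorem 1.3.6 (continuous mapping) of van der Vaart and Wellner (1996),

$$\sup_{(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}} \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g) - \phi_P(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| \rightsquigarrow \sup_{(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}} \left| \frac{\mathbb{G}(h, g)}{\xi_0 \vee \sigma_P(h, g)} \right|.$$

Since $\tau_n \to \infty$, we have that $\lim_{n \to \infty} \mathbb{P}^*(\vec{d}_H(\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}, \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}) > \varepsilon) = 0$.

Next, consider $\vec{d}_H(\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}})$. Define

$$d((h, g), A) = \inf_{(h', g') \in A} \rho_P((h, g), (h', g'))$$

for all $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$ and all subsets $A \subset \bar{\mathcal{H}} \times \mathcal{G}$. For each $\varepsilon > 0$, define

$$\tilde{D}_{\varepsilon} = \left\{ (h, g) \in \bar{\mathcal{H}} \times \mathcal{G} : d\left( (h, g), \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right) \ge \varepsilon \right\}.$$

The product space $\bar{\mathcal{H}} \times \mathcal{G}$ is compact under $\rho_P$ by Lemma A.8. Suppose there exist $\{(h_n, g_n)\}_n \subset \tilde{D}_{\varepsilon}$ such that $(h_n, g_n) \to (h, g)$ for some $(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}$. Then

$$\begin{aligned}
d\left( (h, g), \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right)
&= \inf_{(h', g') \in \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \rho_P((h, g), (h', g')) \\
&\ge \inf_{(h', g') \in \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \rho_P((h_n, g_n), (h', g')) - \rho_P((h, g), (h_n, g_n)) \ge \varepsilon - \rho_P((h, g), (h_n, g_n)),
\end{aligned}$$

which is true for all $n$. Letting $n \to \infty$ gives $d((h, g), \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}) \ge \varepsilon$. This implies that $\tilde{D}_{\varepsilon}$ is closed in $\bar{\mathcal{H}} \times \mathcal{G}$, which is compact, and thus $\tilde{D}_{\varepsilon}$ is compact. If $\tilde{D}_{\varepsilon} = \emptyset$, then clearly

$$\lim_{n \to \infty} \mathbb{P}^*\left( \vec{d}_H\left( \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right) > \varepsilon \right) = \lim_{n \to \infty} \mathbb{P}^*\left( \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}} \ \inf_{(h', g') \in \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \rho_P((h, g), (h', g')) > \varepsilon \right) = 0.$$

If $\tilde{D}_{\varepsilon} \ne \emptyset$, then there is a $\delta_{\varepsilon} > 0$ such that $\inf_{(h, g) \in \tilde{D}_{\varepsilon}} |\phi_P(h, g)| > \delta_{\varepsilon}$, since $\phi_P$ is continuous by Lemma B.4. Also, $\hat{\sigma}_{P_n}$ is uniformly bounded in $(h, g)$ and $\omega$, so there is a $\delta'_{\varepsilon} > 0$ such that for all $\omega \in \Omega$, $\inf_{(h, g) \in \tilde{D}_{\varepsilon}} |\phi_P(h, g)/(\xi_0 \vee \hat{\sigma}_{P_n}(h, g))| > \delta'_{\varepsilon}$. Thus if $\tilde{D}_{\varepsilon} \ne \emptyset$, we have

$$\begin{aligned}
&\lim_{n \to \infty} \mathbb{P}^*\left( \vec{d}_H\left( \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right) > \varepsilon \right) = \lim_{n \to \infty} \mathbb{P}^*\left( \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}} \ \inf_{(h', g') \in \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \rho_P((h, g), (h', g')) > \varepsilon \right) \\
&\quad \le \lim_{n \to \infty} \mathbb{P}^*\left( \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \left| \frac{\phi_P(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| > \delta'_{\varepsilon}, \ \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| \le \tau_n \right).
\end{aligned}$$

By Lemma B.3, we have that $\hat{\phi}_{P_n} \to \phi_P$ almost uniformly. Thus there is a measurable set $A$ with $\mathbb{P}(A) \ge 1 - \varepsilon$ such that for sufficiently large $n$,

$$\sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \left| \frac{\hat{\phi}_{P_n}(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| \ge \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \left| \frac{\phi_P(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| - \frac{\delta'_{\varepsilon}}{2}$$

uniformly on $A$. Thus we now have that

$$\begin{aligned}
&\lim_{n \to \infty} \mathbb{P}^*\left( \vec{d}_H\left( \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right) > \varepsilon \right) \\
&\quad \le \lim_{n \to \infty} \mathbb{P}^*\left( \left\{ \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \left| \frac{\phi_P(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| > \delta'_{\varepsilon} \right\} \cap \left\{ \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \sqrt{T_n} \left| \frac{\hat{\phi}_{P_n}(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| \le \tau_n \right\} \cap A \right) + \mathbb{P}(A^c) \\
&\quad \le \lim_{n \to \infty} \mathbb{P}^*\left( \sqrt{\frac{T_n}{n}} \, \frac{\delta'_{\varepsilon}}{2} < \sup_{(h, g) \in \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \setminus \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \sqrt{\frac{T_n}{n}} \left| \frac{\hat{\phi}_{P_n}(h, g)}{\xi_0 \vee \hat{\sigma}_{P_n}(h, g)} \right| \le \frac{\tau_n}{\sqrt{n}} \right) + \varepsilon = \varepsilon,
\end{aligned}$$

because $\tau_n/\sqrt{n} \to 0$ as $n \to \infty$. Here, $\varepsilon$ can be arbitrarily small.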

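Proof of Theorem 3.2. (i). Fix $\psi \in C(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$ under the $\rho_{\xi h g}$ defined in (B.27). It is easy to show that $\Xi \times \bar{\mathcal{H}} \times \mathcal{G}$ is compact under $\rho_{\xi h g}$, and thus $\psi$ is uniformly continuous on $\Xi \times \bar{\mathcal{H}} \times \mathcal{G}$. This implies that for every $\varepsilon > 0$, there is a $\delta > 0$ such that $|\psi(\xi', h', g') - \psi(\xi, h, g)| \le \varepsilon/\nu(\Xi)$ for all $(\xi, h, g), (\xi', h', g') \in \Xi \times \bar{\mathcal{H}} \times \mathcal{G}$ with $\rho_{\xi h g}((\xi', h', g'), (\xi, h, g)) \le \delta$. Also, by the constructions of $\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}$ in (24) and $\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}$ in (B.48), we have that

$$\left| \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}(\psi) - \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\psi) \right| \le \nu(\Xi) \sup_{\rho_{\xi h g}((\xi', h', g'), (\xi, h, g)) \le d_H\left( \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right)} |\psi(\xi', h', g') - \psi(\xi, h, g)|.$$

By Lemma B.9, this implies that

$$\mathbb{P}^*\left( \left| \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}(\psi) - \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\psi) \right| > \varepsilon \right) \le \mathbb{P}^*\left( d_H\left( \widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}, \Psi_{\bar{\mathcal{H}} \times \mathcal{G}} \right) > \delta \right) \to 0.$$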


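Notice that $|\mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}(\psi_1) - \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}(\psi_2)| \le \nu(\Xi) \|\psi_1 - \psi_2\|_{\infty}$ for all $\psi_1, \psi_2 \in \ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$. By Lemma S.3.6 of Fang and Santos (2018), $\mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}$ satisfies Assumption 4 of Fang and Santos (2018). Together with Lemma B.8, by repeating the proof of Theorem 3.2 of Fang and Santos (2018) with $\mathbb{G}_n^B = \sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n})$, where $\mathbb{G}_n^B$ replaces $\mathbb{G}_n^*$ in their notation, we can show that

$$\sup_{f \in BL_1(\mathbb{R})} \left| E_W\left[ f\left( \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right) \right) \right] - E\left[ f\left( \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}\left( \frac{\mathbb{G}_0}{\mathcal{M}(\sigma_P)} \right) \right) \right] \right| \to 0 \tag{B.49}$$

in outer probability, where $\mathbb{G}_0$ is the limit obtained in Lemma B.8 and $\mathbb{G}_0/\mathcal{M}(\sigma_P)$ is tight by Lemma B.8(i). Since the sample is finite, that is, we have only finitely many observations $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$ in the data set, by the constructions of $\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}$ in (27) and $\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}$ in (B.48) we have that

$$\mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}}\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right) = \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}}\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right). \tag{B.50}$$

Then (B.49) and (B.50) imply that

$$\sup_{f \in BL_1(\mathbb{R})} \left| E_W\left[ f\left( \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}}\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right) \right) \right] - E\left[ f\left( \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}\left( \frac{\mathbb{G}_0}{\mathcal{M}(\sigma_P)} \right) \right) \right] \right| \to 0 \tag{B.51}$$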

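in outer probability. Let $F$ denote the CDF of $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$, and define $\hat{F}_n$ by

$$\hat{F}_n(c) = \mathbb{P}\left( \mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}}\left( \frac{\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})}{\mathcal{M}(\hat{\sigma}^B_{P_n})} \right) \le c \ \middle| \ \{(Y_i, D_i, Z_i)\}_{i=1}^{\infty} \right).^{15}$$

Since by assumption $F$ is continuous and increasing at $c_{1-\alpha}$, by a proof similar to that of Theorem S.1.1 of Fang and Santos (2018) together with (B.51) in this paper, we can conclude that for each $\varepsilon > 0$,

$$\mathbb{P}^*(|\hat{c}_{1-\alpha} - c_{1-\alpha}| > \varepsilon) \to 0. \tag{B.52}$$

¹⁵ This conditional probability given $\{(Y_i, D_i, Z_i)\}_{i=1}^{\infty}$ is numerically equal to that given $\{(Y_i, D_i, Z_i)\}_{i=1}^{n}$ in (31).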
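In practice the conditional CDF $\hat{F}_n$ is approximated by simulation. The following is a minimal sketch under the assumption that $\hat{c}_{1-\alpha}$ is taken as the $1 - \alpha$ empirical quantile of a finite list of bootstrap statistics; the function name `bootstrap_critical_value` and the toy inputs are hypothetical, and the exact definition of $\hat{c}_{1-\alpha}$ is the one given in Section 3.1.1.

```python
import math


def bootstrap_critical_value(boot_stats, alpha):
    """Smallest c whose empirical CDF over the bootstrap draws is at least 1 - alpha."""
    ordered = sorted(boot_stats)
    # Number of draws that must lie at or below the critical value.
    k = math.ceil((1.0 - alpha) * len(ordered))
    k = min(max(k, 1), len(ordered))
    return ordered[k - 1]


def reject(test_stat, boot_stats, alpha):
    """Reject the null when the test statistic exceeds the bootstrap critical value."""
    return test_stat > bootstrap_critical_value(boot_stats, alpha)


toy_draws = [3.0, 1.0, 4.0, 2.0]  # hypothetical bootstrap statistics
c_half = bootstrap_critical_value(toy_draws, alpha=0.5)
```

With 1000 bootstrap iterations, as in the simulations of Appendix C, `boot_stats` would hold the 1000 simulated bootstrap test statistics for one sample.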

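By the definitions of $\mathbb{G}$ (in the proof of Lemma 3.1) and $\mathbb{G}_0$ (in Lemma B.8), together with the linearity of $\mathcal{L}'_P$, we have that $\mathbb{G} = \mathbb{G}_0 + \Lambda(P)^{1/2} \mathcal{L}'_P(Q_0)$. Let $H_n = \sqrt{n}(P_n - P)$. By Lemma B.2, $\|H_n - Q_0\|_{\infty} \to 0$ as $n \to \infty$. Notice that $P_n = P + n^{-1/2} H_n$. By Lemma B.1, we have that

$$\begin{aligned}
&\lim_{n \to \infty} \sup_{(h, g) \in \Psi_{\bar{\mathcal{H}} \times \mathcal{G}}} \left| \frac{\mathcal{L}(P_n)(h, g) - \mathcal{L}(P)(h, g)}{n^{-1/2}} - \mathcal{L}'_P(Q_0)(h, g) \right| \\
&\quad \le \lim_{n \to \infty} \sup_{(h, g) \in \bar{\mathcal{H}} \times \mathcal{G}} \left| \frac{\mathcal{L}(P + n^{-1/2} H_n)(h, g) - \mathcal{L}(P)(h, g)}{n^{-1/2}} - \mathcal{L}'_P(Q_0)(h, g) \right| = 0.
\end{aligned} \tag{B.53}$$

By construction, $\mathcal{L}(P) = 0$ on $\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}$ because $\mathcal{L}(P) = \phi_P$. By assumption, we have that $\mathcal{L}(P_n) = \phi_{P_n} \le 0$ on $\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}$, and (B.53) implies that $\mathcal{L}'_P(Q_0) \le 0$ on $\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}$. Thus we have that $\mathbb{G} \le \mathbb{G}_0$ and $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P)) \le \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$. Since $\mathbb{G}/\mathcal{M}(\sigma_P) \in \ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$, where $\ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$ is a Banach space under $\|\cdot\|_{\infty}$ and $\mathbb{G}$ is tight by Lemma 3.1, we have that $\mathbb{G}/\mathcal{M}(\sigma_P)$ is tight (hence separable¹⁶) and is Radon by Theorem 7.1.7 of Bogachev (2007). Since $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}$ is continuous and convex, Theorem 11.1(i) of Davydov et al. (1998) implies that the CDF of $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P))$ is everywhere continuous except possibly at the point

$$r_0 = \inf\left\{ r : \mathbb{P}\left( \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P)) \le r \right) > 0 \right\}.$$

Because $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P)) \le \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$, we have that

$$r_0 \le \inf\left\{ r : \mathbb{P}\left( \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P)) \le r \right) > 0 \right\} < c_{1-\alpha},$$

where the last inequality holds because the CDF of $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$ is continuous and increasing at $c_{1-\alpha}$. This implies that the CDF of $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P))$ is continuous at $c_{1-\alpha}$. Now by (25) and (B.52) in this paper, together with Example 1.4.7 (Slutsky's lemma), Theorem 1.3.6 (continuous mapping), and Theorem 1.3.4(vi) of van der Vaart and Wellner (1996), we conclude that

$$\lim_{n \to \infty} \mathbb{P}\left( \sqrt{T_n} \, \mathcal{I} \circ \mathcal{S}\left( \frac{\hat{\phi}_{P_n}}{\mathcal{M}(\hat{\sigma}_{P_n})} \right) > \hat{c}_{1-\alpha} \right) = \mathbb{P}\left( \mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}\left( \frac{\mathbb{G}}{\mathcal{M}(\sigma_P)} \right) > c_{1-\alpha} \right) \le \alpha, \tag{B.54}$$

where the inequality holds because $c_{1-\alpha}$ is the $1 - \alpha$ quantile for $\mathcal{I} \circ \mathcal{S}_{\Psi_{\bar{\mathcal{H}} \times \mathcal{G}}}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$. If, in addition, $P_n = P$ for all $n$, then by Assumption 3.2 we have that $v_0 = 0$ and hence $Q_0 = 0$. This implies that $\mathbb{G} = \mathbb{G}_0$ and that the inequality in (B.54) holds with equality.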
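¹⁶ See the definition of separability in van der Vaart and Wellner (1996, p. 17). The closure of a separable subset of a metric space is separable.

(ii). Let $\hat{c}'_{1-\alpha}$ be the bootstrap critical value obtained using the bootstrap test statistic $\mathcal{I} \circ \mathcal{S}(\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n}))$ in place of $\mathcal{I} \circ \mathcal{S}_{\widehat{\Psi_{\mathcal{H} \times \mathcal{G}}}}(\sqrt{T_n^B}(\hat{\phi}^B_{P_n} - \hat{\phi}_{P_n})/\mathcal{M}(\hat{\sigma}^B_{P_n}))$ in the test procedure in Section 3.1.1. By arguments similar to those in the proof of part (i), we can show that $\hat{c}'_{1-\alpha} \to c'_{1-\alpha}$ in outer probability, where $c'_{1-\alpha}$ is the $1 - \alpha$ quantile for $\mathcal{I} \circ \mathcal{S}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$.¹⁷ Clearly, $\hat{c}'_{1-\alpha} \ge \hat{c}_{1-\alpha}$ by construction. By Lemma B.3, $\hat{\phi}_{P_n}/\mathcal{M}(\hat{\sigma}_{P_n}) \to \phi_P/\mathcal{M}(\sigma_P)$ in $\ell^{\infty}(\Xi \times \bar{\mathcal{H}} \times \mathcal{G})$ almost uniformly, and hence almost uniformly

$$\mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}}\left( \frac{\hat{\phi}_{P_n}}{\mathcal{M}(\hat{\sigma}_{P_n})} \right) \to \mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}}\left( \frac{\phi_P}{\mathcal{M}(\sigma_P)} \right) > 0,$$

where the inequality follows from the assumption that the $H_0$ in (13) is false with $Q = P$. Thus we have that $[\mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}}(\sqrt{T_n} \hat{\phi}_{P_n}/\mathcal{M}(\hat{\sigma}_{P_n}))]^{-1} \to 0$ almost uniformly ($T_n/n \to \Lambda(P)$ almost uniformly by Lemma B.3). By Lemmas 1.9.3(ii) and 1.10.2(iii), Example 1.4.7 (Slutsky's lemma), and Theorems 1.3.6 (continuous mapping) and 1.3.4(vi) of van der Vaart and Wellner (1996), we now conclude that

$$\mathbb{P}^*\left( \mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}}\left( \frac{\sqrt{T_n} \hat{\phi}_{P_n}}{\mathcal{M}(\hat{\sigma}_{P_n})} \right) > \hat{c}_{1-\alpha} \right) \ge \mathbb{P}^*\left( \mathcal{I} \circ \mathcal{S}_{\mathcal{H} \times \mathcal{G}}\left( \frac{\sqrt{T_n} \hat{\phi}_{P_n}}{\mathcal{M}(\hat{\sigma}_{P_n})} \right) > \hat{c}'_{1-\alpha} \right) \to 1.$$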

¹⁷ Here, we implicitly assume that the CDF of $\mathcal{I} \circ \mathcal{S}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$ is continuous and strictly increasing at $c'_{1-\alpha}$. Theorem 11.1 of Davydov et al. (1998) implies that the CDF of $\mathcal{I} \circ \mathcal{S}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$ is differentiable and has a positive derivative everywhere except at countably many points in its support, provided that $\mathcal{I} \circ \mathcal{S}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$ is not a constant. By construction, $\mathcal{I} \circ \mathcal{S}(\mathbb{G}_0/\mathcal{M}(\sigma_P))$ is not a constant in general cases.

C Additional Monte Carlo Studies

The Monte Carlo experiments discussed in this section followed the design of Kitagawa (2015), where the treatment and the instrument were both binary, with $D \in \{0, 1\}$ and $Z \in \{0, 1\}$, and we compared our results with theirs. We simulated the limiting rejection rates using the approach proposed in the present paper and that proposed by Kitagawa (2015) with the same randomly generated data. In this special case, if the measure $\nu$ is set to be a Dirac measure, the asymptotic distribution of the test statistic under the null can be written as $\sup_{f \in \mathcal{F}_b^*} \mathbb{G}_H(f)/(\xi \vee \sigma_H(f))$ in (32). Since the test proposed by Kitagawa (2015) constructed the critical value based on the upper bound $\sup_{f \in \mathcal{F}_b} \mathbb{G}_H(f)/(\xi \vee \sigma_H(f))$ in (32), to show the power improvement of the proposed test on a finite sample more clearly, we constructed the critical value using $\sup_{f \in \mathcal{F}_b^*} \mathbb{G}_H(f)/(\xi \vee \sigma_H(f))$ instead of $\mathcal{I} \circ \mathcal{S}_{\Psi_{\mathcal{H} \times \mathcal{G}}}(\mathbb{G}/\mathcal{M}(\sigma_P))$, which is equivalent to it in distribution. That is, we approximated $\mathbb{G}_H$ and $\sigma_H$ by $\mathbb{G}_H^B$ and $\sigma_H^B$ following the bootstrap method of Kitagawa (2015). Then we estimated $\mathcal{F}_b^*$ by $\widehat{\mathcal{F}_b^*}$ in a way similar to (27), which is the key difference between our approach and that of Kitagawa (2015). Last, we constructed the bootstrap test statistic from $\sup_{f \in \widehat{\mathcal{F}_b^*}} \mathbb{G}_H^B(f)/(\xi \vee \sigma_H^B(f))$ and used it to create the critical value. Because of $\widehat{\mathcal{F}_b^*}$, our bootstrap test statistic can approximate the null distribution consistently and the power of the test can be improved. This new bootstrap test statistic is asymptotically equivalent to that in (30), and the new critical value is asymptotically equivalent to $\hat{c}_{1-\alpha}$ in Section 3.1.1.

Each simulation consisted of 1000 Monte Carlo iterations and 1000 bootstrap iterations. For each DGP, the measure $\nu$ was set to a Dirac measure centered at $\xi = 0.07$, $0.22$, $0.3$, and $1$. The nominal significance level $\alpha$ was set to 0.05.

C.1 Size Control and Tuning Parameter Selection


We first ran simulations to investigate the size of the test and the selection of the tuning parameter. As suggested in Section 4, for sample sizes less than 3000, we can use $\tau_n = 2$ for the tuning parameter. In this set of simulations, we set $n = 2000$ and $\tau_n = 1, 2, 3, 4, \infty$. For the DGP, we used $U \sim \mathrm{Unif}(0, 1)$, $V \sim \mathrm{Unif}(0, 1)$, $N_0 \sim N(0, 1)$, $N_1 \sim N(1, 1)$, $Z = 1\{U \le 0.5\}$, $D_0 = 1\{V \le 0.5\}$, $D_1 = 1\{V \le 0.5\}$, $D = \sum_{z=0}^{1} 1\{Z = z\} \times D_z$, and $Y = \sum_{d=0}^{1} 1\{D = d\} \times N_d$, where $U$, $V$, $N_0$, and $N_1$ were mutually independent. This DGP is equivalent to that used by Kitagawa (2015) to show the size control of their test. The results in Table 4 confirmed the conclusion from Table 1: For $\tau_n = 2$, the rejection rates were close to those for $\tau_n = \infty$ and close to the nominal size. Recall that a smaller tuning parameter $\tau_n$ yields greater power for the test. Thus we kept using $\tau_n = 2$ in this case.
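The size-control DGP above can be simulated in a few lines. This is a minimal sketch using only the Python standard library; the function name `simulate_size_dgp` and the seed handling are hypothetical, and the sketch generates data only, without implementing the test itself.

```python
import random


def simulate_size_dgp(n, seed=0):
    """Draw n observations (Y, D, Z) from the null DGP of Section C.1."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        z = 1 if u <= 0.5 else 0      # Z = 1{U <= 0.5}
        d0 = 1 if v <= 0.5 else 0     # D_0 = 1{V <= 0.5}
        d1 = 1 if v <= 0.5 else 0     # D_1 = 1{V <= 0.5}, same as D_0
        d = d1 if z == 1 else d0      # D = sum_z 1{Z = z} * D_z
        n0 = rng.gauss(0.0, 1.0)      # N_0 ~ N(0, 1)
        n1 = rng.gauss(1.0, 1.0)      # N_1 ~ N(1, 1)
        y = n1 if d == 1 else n0      # Y = sum_d 1{D = d} * N_d
        sample.append((y, d, z))
    return sample


size_sample = simulate_size_dgp(2000)
```

Because $D_0 = D_1$ and the outcome depends on $Z$ only through $D$, the distribution of $(Y, D)$ is the same for both values of $Z$, which is why this design is used to examine rejection rates under the null.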

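Table 4: Rejection Rates under H0 for Binary D and Binary Z

| τn | ξ = 0.07 | ξ = 0.22 | ξ = 0.3 | ξ = 1 |
|----|----------|----------|---------|-------|
| 1  | 0.077 | 0.052 | 0.048 | 0.069 |
| 2  | 0.058 | 0.048 | 0.040 | 0.067 |
| 3  | 0.056 | 0.046 | 0.040 | 0.067 |
| 4  | 0.056 | 0.046 | 0.040 | 0.067 |
| ∞  | 0.056 | 0.046 | 0.040 | 0.067 |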

C.2 Power Comparison


Four DGPs were considered for the power comparisons. The sample sizes were set to $n = 200$, $600$, $1000$, $1100$, and $2000$, and the tuning parameter was set to $\tau_n = 2$. The probability $\mathbb{P}(Z = 1)$ was set to $r_n$, with $r_n = 1/2$, $1/6$, $1/2$, $1/11$, and $1/2$ for the corresponding sample sizes. We let $U \sim \mathrm{Unif}(0, 1)$, $V \sim \mathrm{Unif}(0, 1)$, $W \sim \mathrm{Unif}(0, 1)$, $Z = 1\{U \le r_n\}$, $D_0 = 1\{V \le 0.45\}$, $D_1 = 1\{V \le 0.55\}$, $D = \sum_{z=0}^{1} 1\{Z = z\} \times D_z$, $N_{00} \sim N(0, 1)$, $N_{01} \sim N(0, 1)$, and $N_{11} \sim N(0, 1)$.

(1): $N_{10} \sim N(-0.7, 1)$ and $Y = \sum_{z=0}^{1} 1\{Z = z\} \times (\sum_{d=0}^{1} 1\{D = d\} \times N_{dz})$.

(2): $N_{10} \sim N(0, 1.675^2)$ and $Y = \sum_{z=0}^{1} 1\{Z = z\} \times (\sum_{d=0}^{1} 1\{D = d\} \times N_{dz})$.

(3): $N_{10} \sim N(0, 0.515^2)$ and $Y = \sum_{z=0}^{1} 1\{Z = z\} \times (\sum_{d=0}^{1} 1\{D = d\} \times N_{dz})$.

(4): $N_{10a} \sim N(-1, 0.125^2)$, $N_{10b} \sim N(-0.5, 0.125^2)$, $N_{10c} \sim N(0, 0.125^2)$, $N_{10d} \sim N(0.5, 0.125^2)$, $N_{10e} \sim N(1, 0.125^2)$, $N_{10} = 1\{W \le 0.15\} \times N_{10a} + 1\{0.15 < W \le 0.35\} \times N_{10b} + 1\{0.35 < W \le 0.65\} \times N_{10c} + 1\{0.65 < W \le 0.85\} \times N_{10d} + 1\{W > 0.85\} \times N_{10e}$, and $Y = \sum_{z=0}^{1} 1\{Z = z\} \times (\sum_{d=0}^{1} 1\{D = d\} \times N_{dz})$.

All the variables $U$, $V$, $N_{00}$, $N_{10}$, $N_{01}$, and $N_{11}$ were set to be mutually independent for each DGP. Table 5 shows a comparison of the powers of the two tests. The results suggest that the proposed test achieves a manifest power improvement over that of Kitagawa (2015).
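The four alternative DGPs differ only in the distribution of $N_{10}$, the outcome drawn when $D = 1$ and $Z = 0$. The sketch below generates data from them with the Python standard library; the names `R_N`, `draw_n10`, and `simulate_power_dgp` are hypothetical. Two simplifications are made that leave the distribution of $(Y, D, Z)$ unchanged: a single $N(0, 1)$ draw stands in for whichever of $N_{00}$, $N_{01}$, $N_{11}$ is selected, and in DGP (4) only the selected mixture component is drawn.

```python
import random

# P(Z = 1) = r_n for each sample size, as listed in Section C.2.
R_N = {200: 1 / 2, 600: 1 / 6, 1000: 1 / 2, 1100: 1 / 11, 2000: 1 / 2}


def draw_n10(dgp, rng):
    """Draw N_10, the outcome used when D = 1 and Z = 0, for DGPs (1) to (4)."""
    if dgp == 1:
        return rng.gauss(-0.7, 1.0)
    if dgp == 2:
        return rng.gauss(0.0, 1.675)
    if dgp == 3:
        return rng.gauss(0.0, 0.515)
    if dgp == 4:
        w = rng.random()  # W ~ Unif(0, 1) selects the mixture component
        if w <= 0.15:
            mean = -1.0
        elif w <= 0.35:
            mean = -0.5
        elif w <= 0.65:
            mean = 0.0
        elif w <= 0.85:
            mean = 0.5
        else:
            mean = 1.0
        return rng.gauss(mean, 0.125)
    raise ValueError("dgp must be 1, 2, 3, or 4")


def simulate_power_dgp(dgp, n, r_n, seed=0):
    """Draw n observations (Y, D, Z) from one of the alternative DGPs."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        z = 1 if rng.random() <= r_n else 0   # Z = 1{U <= r_n}
        v = rng.random()
        d0 = 1 if v <= 0.45 else 0            # D_0 = 1{V <= 0.45}
        d1 = 1 if v <= 0.55 else 0            # D_1 = 1{V <= 0.55}
        d = d1 if z == 1 else d0
        if d == 1 and z == 0:
            y = draw_n10(dgp, rng)            # the only cell that departs from N(0, 1)
        else:
            y = rng.gauss(0.0, 1.0)           # N_00, N_01, N_11 are all N(0, 1)
        sample.append((y, d, z))
    return sample


power_sample = simulate_power_dgp(dgp=1, n=600, r_n=R_N[600])
```

In every design the distribution of $Y$ among treated units depends on $Z$, so the outcome is affected by the instrument other than through $D$; the designs differ in whether that dependence shows up in location, scale, or shape.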

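Table 5: Rejection Rates under H1 for Binary D and Binary Z

| DGP | n | Proposed test, ξ = 0.07 | Proposed test, ξ = 0.22 | Proposed test, ξ = 0.3 | Proposed test, ξ = 1 | Kitagawa (2015), ξ = 0.07 | Kitagawa (2015), ξ = 0.22 | Kitagawa (2015), ξ = 0.3 | Kitagawa (2015), ξ = 1 |
|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|
| (1) | 200  | 0.202 | 0.198 | 0.186 | 0.110 | 0.198 | 0.193 | 0.182 | 0.106 |
| (1) | 600  | 0.300 | 0.434 | 0.418 | 0.180 | 0.240 | 0.406 | 0.375 | 0.144 |
| (1) | 1000 | 0.874 | 0.915 | 0.919 | 0.804 | 0.855 | 0.883 | 0.894 | 0.714 |
| (1) | 1100 | 0.309 | 0.493 | 0.452 | 0.163 | 0.263 | 0.451 | 0.423 | 0.153 |
| (1) | 2000 | 0.997 | 0.999 | 1.000 | 0.997 | 0.996 | 0.999 | 0.999 | 0.993 |
| (2) | 200  | 0.105 | 0.095 | 0.059 | 0.004 | 0.090 | 0.084 | 0.046 | 0.003 |
| (2) | 600  | 0.261 | 0.141 | 0.045 | 0.000 | 0.242 | 0.100 | 0.026 | 0.000 |
| (2) | 1000 | 0.907 | 0.814 | 0.500 | 0.105 | 0.887 | 0.781 | 0.421 | 0.030 |
| (2) | 1100 | 0.255 | 0.129 | 0.037 | 0.001 | 0.224 | 0.082 | 0.022 | 0.001 |
| (2) | 2000 | 1.000 | 0.996 | 0.949 | 0.674 | 1.000 | 0.994 | 0.909 | 0.252 |
| (3) | 200  | 0.211 | 0.209 | 0.202 | 0.211 | 0.185 | 0.188 | 0.195 | 0.205 |
| (3) | 600  | 0.203 | 0.427 | 0.473 | 0.351 | 0.191 | 0.377 | 0.458 | 0.331 |
| (3) | 1000 | 0.664 | 0.769 | 0.816 | 0.831 | 0.654 | 0.739 | 0.785 | 0.796 |
| (3) | 1100 | 0.229 | 0.442 | 0.487 | 0.341 | 0.203 | 0.399 | 0.443 | 0.321 |
| (3) | 2000 | 0.950 | 0.982 | 0.992 | 0.995 | 0.949 | 0.971 | 0.987 | 0.992 |
| (4) | 200  | 0.080 | 0.082 | 0.073 | 0.036 | 0.079 | 0.082 | 0.073 | 0.036 |
| (4) | 600  | 0.134 | 0.117 | 0.103 | 0.060 | 0.123 | 0.111 | 0.102 | 0.058 |
| (4) | 1000 | 0.307 | 0.306 | 0.224 | 0.127 | 0.307 | 0.281 | 0.212 | 0.116 |
| (4) | 1100 | 0.146 | 0.115 | 0.112 | 0.031 | 0.136 | 0.115 | 0.093 | 0.027 |
| (4) | 2000 | 0.660 | 0.703 | 0.556 | 0.325 | 0.649 | 0.673 | 0.505 | 0.271 |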

References
Aliprantis, C. D. and Border, K. (2006). Infinite Dimensional Analysis: A Hitchhiker’s Guide. Springer
Science & Business Media.

Bogachev, V. I. (2007). Measure Theory, volume 2. Springer Science & Business Media.

Davydov, Y. A., Lifshits, M. A., and Smorodina, N. V. (1998). Local Properties of Distributions of
Stochastic Functionals, volume 173. American Mathematical Society.

Fang, Z. and Santos, A. (2018). Inference on directionally differentiable functions. The Review of
Economic Studies, 86(1):377–412.

Folland, G. B. (1999). Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons.

Kitagawa, T. (2015). A test for instrument validity. Econometrica, 83(5):2043–2063.

Pollard, D. (1990). Empirical processes: Theory and applications. In NSF-CBMS Regional Conference
Series in Probability and Statistics, pages i–86. JSTOR.

Shapiro, A. (1990). On concepts of directional differentiability. Journal of Optimization Theory and
Applications, 66(3):477–487.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
