Anchoring-Based Causal Design (ABCD) : Estimating The Effects of Beliefs
Abstract
This article develops a covariate balancing approach for the estimation of treatment
effects on the treated (ATT) in a difference-in-differences (DID) research design when
panel data are available. We show that the proposed covariate balancing propensity score
(CBPS) DID estimator possesses several desirable properties: (i) local efficiency, (ii) double
robustness in terms of consistency, (iii) double robustness in terms of inference, and (iv)
faster convergence to the ATT compared to the augmented inverse probability weighting
(AIPW) DID estimators when both working models are locally misspecified. These latter
two characteristics set the CBPS DID estimator apart from the AIPW DID estimator
theoretically. Simulation studies and an empirical study demonstrate the desirable finite
sample performance of the proposed estimator.
Keywords: double robustness, local misspecification, panel data, treatment effects on the treated
(ATT),
1 Introduction
effects of policy interventions using observational data. In its canonical form, the DID approach
necessitates two groups and two periods, stipulating that no entity is exposed to the treatment
in the initial period, while a subset remains untreated in the subsequent period. A crucial
underpinning of the DID design is the so-called (unconditional) parallel trends assumption
(PTA), which posits that, in the absence of the treatment, the average outcomes for both the
treatment and comparison groups would have evolved along parallel paths over time. While the
PTA is inherently untestable, its validity has been questioned, particularly in scenarios where pre-
treatment characteristics, differing between the treatment and comparison groups, are correlated
with the outcome’s evolution. In such instances, the canonical DID setup becomes implausible,
prompting researchers to incorporate pre-treatment covariates into the DID framework. This
modification ensures the satisfaction of the PTA conditionally on these covariates (conditional
PTA).
Under this conditional PTA, three predominant estimation procedures have emerged: the
outcome regression (OR), the inverse probability weighting (IPW), and the augmented IPW
(AIPW), the latter offering double robustness in terms of consistency; see Sant’Anna and Zhao
(2020) for a comprehensive review. However, these methods confront the challenge of potential
misspecification of the outcome regression and/or propensity score models, leading to incorrect
inferences. While doubly robust estimators offer improved consistency by requiring only one
of the two working models to be correctly specified, the possibility that both
models are misspecified remains. For estimation of the average treatment effect (ATE),
Kang and Schafer (2007) highlight this limitation, demonstrating that the advantages of AIPW
estimators can substantially diminish when both the outcome regression and propensity score
models are slightly misspecified. To address this issue, Imai and Ratkovic (2014) proposed
the Covariate Balancing Propensity Score (CBPS) methodology, illustrating that the CBPS
estimator can significantly ameliorate the finite sample performance of doubly robust estimators.
Further theoretical expositions of the CBPS ATE estimator were provided by Fan et al. (2022).
Nevertheless, their investigation focuses on ATE estimation.
In this paper, we apply the CBPS methodology to the ATT estimator within the framework
of DID design. In particular, we rigorously investigate the robustness and efficiency properties
of the CBPS DID estimator. Our contributions to the DID literature are manifold. Firstly,
this article briefly reviews a range of existing ATT estimators within the DID framework
and then proposes a CBPS-based DID estimator when panel data are available. Secondly, we
demonstrate that our proposed estimator possesses the qualities of double robustness and
local efficiency. A notable distinction of our estimator is its double robustness not only in
terms of consistency but also in terms of inference. This characteristic guarantees that the
asymptotic linear representation of the CBPS DID estimator remains invariant even when one
variance based on the influence function. This feature sets our estimator apart from existing
doubly robust AIPW DID estimators because the asymptotic linear representation of the AIPW
DID estimator is not invariant when one of the working models is misspecified. The third
contribution of our work lies in establishing that our estimator can achieve a faster convergence
rate relative to the AIPW DID estimator, under the scenario where both the propensity score and
outcome regression models are locally misspecified. This situation has been seldom addressed in
existing DID literature. Lastly, the fourth contribution is the practicality of our estimator. It is
research.
Organization of this paper: The subsequent section of this paper delineates the settings
and assumptions used throughout the paper and briefly overviews some existing DID estimators.
In Section 3, we introduce the CBPS method and propose a CBPS-based DID estimator, and
derive its theoretical properties. We evaluate the finite sample performance of our proposed
CBPS DID estimator with Monte Carlo simulation in Section 4, and provide an empirical
supporting our arguments and findings are comprehensively compiled in the Appendix.
2 Difference-in-differences
2.1 Setup
We introduce the notation and framework that will be employed throughout this article. Our
analysis is based on a two-period, two-group structure (treatment and comparison groups). Let
Yit represent the outcome of interest for unit i at time t. We suppose that researchers have access
Define Dit as an indicator variable, where Dit = 1 if unit i receives treatment at time t, and
Dit = 0 otherwise. We note that Di0 = 0 for all units i, which simplifies the treatment indicator
to Di = Di1. The observed outcome Yit can also be rewritten as Yit = Di Yit(1) + (1 − Di)Yit(0),
where Yit(0) denotes the potential outcome of unit i at time t if the unit does not receive treatment
and Yit(1) denotes the potential outcome if the same unit receives treatment. Note that Yit(0) and
Yit(1) cannot be observed simultaneously for any unit. In the remainder of this paper, we assume
the availability of panel data on {Yi0 , Yi1 , Di , Xi }ni=1 , where Xi ∈ Rd is a vector of pre-treatment
The parameter of interest, the average treatment effect on the treated (ATT), is defined as:
According to the representation (2.2) above, we can show that the first term ($E[Y_{i1} \mid D_i = 1] = E[D_i Y_{i1}]/E[D_i]$)
can be estimated directly from the observed data. The main challenge in identifying the
ATT lies in computing the second term (E [Yi1 (0)|Di = 1]) from the observed data since Yi1 (0)
is missing for treated subjects with Di = 1. In order to identify the ATT (or E [Yi1 (0)|Di = 1]),
Assumption 1. Assume that the data {Yi0 , Yi1 , Di , Xi }ni=1 are independent and identically
distributed (iid).
Assumption 2. E[Yi1 (0) − Yi0 (0)|Xi , Di = 1] = E[Yi1 (0) − Yi0 (0)|Xi , Di = 0] almost surely.
Assumption 3. For some ε > 0, Pr(Di = 1) > ε and Pr(Di = 1|Xi ) ≤ 1 − ε almost surely.
Assumption 2, which we term the conditional PTA, posits that the average conditional
outcomes for the treatment and comparison groups would have followed parallel paths in the
absence of the treatment. Assumption 3 is an overlap condition, which states that at least a
small fraction of the population is exposed to the treatment, and for every specific value of
covariates Xi , at least a small portion is not treated. These three assumptions are commonly
utilized in semiparametric DID methods, see, e.g. (Heckman, Ichimura, and Todd 1997; Abadie
2005; Sant’Anna and Zhao 2020). Next, we briefly provide an overview of the existing approaches
There are two approaches to identify the ATT: the OR approach (Heckman, Ichimura, and
(i) The OR approach: under Assumptions 1-3, the ATT is identified as:
τ = E [∆Yi |Di = 1]−E [E [∆Yi |Xi , Di = 0] |Di = 1] = E [∆Yi |Di = 1]−E [m∆ (Xi )|Di = 1] := τ OR ,
(2.3)
where ∆Yi = Yi1 − Yi0 and m∆ (Xi ) = E [∆Yi |Xi , Di = 0]. Based on the identification result
(2.3), the OR approach requires researchers to model the conditional expectation of outcome
evolution E [∆Yi |Xi , Di = 0] at the first step. Researchers typically adopt a linear parametric
model Xi′ γ (outcome regression model) to specify the true conditional expectation of outcome
evolution E [∆Yi |Xi , Di = 0]. Consequently, the OR DID estimator is represented as follows:
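$$\hat\tau^{OR} = \frac{1}{n_{treat}} \sum_{i \in treat} \Delta Y_i - \frac{1}{n_{treat}} \sum_{i \in treat} X_i' \hat\gamma^{OLS}, \tag{2.4}$$
where $\hat\gamma^{OLS}$ is the OLS estimator from the regression of $\Delta Y_i$ on $X_i$ in the comparison group (Di = 0).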
(ii) The IPW approach: under Assumptions 1-3, the ATT is alternatively identified as:
$$\tau = E\left[\frac{D_i - \pi(X_i)}{E[D_i](1 - \pi(X_i))} \Delta Y_i\right] := \tau^{IPW}, \tag{2.5}$$
where π(Xi ) = Pr(Di = 1|Xi ) is the propensity score. Based on the identification result (2.5), the
IPW approach requires researchers to estimate the propensity score at the first step. Researchers
typically use a parametric model (e.g. $\pi(X_i'\beta) = \exp(X_i'\beta)/(1 + \exp(X_i'\beta))$) to specify the propensity score
π(Xi) and estimate parameters by the maximum likelihood method. Hence, the IPW DID
estimator is given by
$$\hat\tau^{IPW} = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{ML})}{\bar{D}(1 - \pi(X_i'\hat\beta^{ML}))} \Delta Y_i, \tag{2.6}$$
where $\hat\beta^{ML}$ is the maximum likelihood estimator and $\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i$.
(iii) AIPW approach: Consistency of the OR and IPW estimators depends on the correct
specification of the outcome regression model and the propensity score model, respectively. To
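achieve consistency in scenarios where one of the two working models is misspecified, Sant’Anna
and Zhao (2020) consider the AIPW DID estimator
$$\hat\tau^{AIPW} = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{ML})}{\bar{D}(1 - \pi(X_i'\hat\beta^{ML}))} (\Delta Y_i - X_i'\hat\gamma^{AIPW}), \tag{2.7}$$
with $\hat\gamma^{AIPW} = \hat\gamma^{OLS}$, which follows from the alternative identification of the ATT:
$$\tau = E\left[\frac{D_i - \pi(X_i)}{E[D_i](1 - \pi(X_i))} (\Delta Y_i - m_\Delta(X_i))\right] := \tau^{AIPW}. \tag{2.8}$$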
Observing the form of (2.7), it is apparent that AIPW procedure combines both OR and
IPW methodologies. This synthesis allows the AIPW estimator to mitigate some of the inherent
weaknesses associated with the OR and IPW approaches individually. Indeed, Sant’Anna and
Zhao (2020) show that the AIPW DID estimator is both locally efficient and doubly robust
in terms of consistency. In the cross-sectional setting, however, Kang and Schafer (2007)
demonstrated that the performance of the AIPW ATE estimator can be poor in scenarios where
both of the working models are slightly misspecified. To address this issue, Imai and Ratkovic
(2014) introduced the CBPS method and demonstrated that the CBPS ATE estimator can
significantly enhance performance over other existing ATE estimators, including the AIPW
ATE estimator, particularly when both working models are misspecified. In the subsequent
subsection, we extend the application of the CBPS method from estimating ATE to ATT
within the framework of DID research. Our objective is to propose a CBPS DID estimator
and to rigorously investigate its theoretical properties, focusing on its robustness and efficiency.
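The three estimators reviewed in this section, (2.4), (2.6) and (2.7), can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the Newton-Raphson logit fit, and the simulated data-generating process in `simulate_panel` are all assumptions introduced here for the example.

```python
import numpy as np

def fit_logit_mle(X, D, max_iter=100, tol=1e-10):
    """Logit propensity score fitted by Newton-Raphson on the log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (D - p) / len(D)
        if np.max(np.abs(score)) < tol:
            break
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X / len(D)
        beta = beta + np.linalg.solve(hessian, score)
    return beta

def did_or_ipw_aipw(dY, D, X):
    """Return the OR (2.4), IPW (2.6) and AIPW (2.7) DID estimates.

    dY is the outcome change Y_i1 - Y_i0, D the 0/1 treatment indicator,
    and X the covariate matrix including a constant column.
    """
    treated, control = D == 1, D == 0
    # Outcome regression: OLS of dY on X using comparison units only.
    gamma_ols = np.linalg.lstsq(X[control], dY[control], rcond=None)[0]
    # Propensity score: logit model fitted by maximum likelihood.
    ps = 1.0 / (1.0 + np.exp(-(X @ fit_logit_mle(X, D))))
    weight = (D - ps) / (D.mean() * (1.0 - ps))
    return {
        "or": dY[treated].mean() - (X[treated] @ gamma_ols).mean(),
        "ipw": np.mean(weight * dY),
        "aipw": np.mean(weight * (dY - X @ gamma_ols)),
    }

def simulate_panel(n=5000, tau=1.0, noise=1.0, seed=0):
    """Hypothetical DGP: logit selection, linear outcome trend, constant ATT tau."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    ps = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.3, 0.4, -0.4]))))
    D = (rng.uniform(size=n) < ps).astype(float)
    dY = X @ np.array([0.2, 0.5, -0.3]) + tau * D + noise * rng.normal(size=n)
    return dY, D, X

dY, D, X = simulate_panel()
print(did_or_ipw_aipw(dY, D, X))
```

In this illustrative design both working models are correct, so all three estimates should be close to the assumed ATT of one; the IPW estimate is typically the noisiest of the three.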
The CBPS method introduced by Imai and Ratkovic (2014) offers a distinct approach to
propensity score estimation. In contrast to the IPW method, the CBPS method imposes exact
finite sample balance of pre-treatment covariates across the treatment and comparison groups
rather than focusing on the predictive accuracy of treatment assignment. The CBPS DID
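estimator is defined as
$$\hat\tau^{CBPS} = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{CBPS})}{\bar{D}(1 - \pi(X_i'\hat\beta^{CBPS}))} \Delta Y_i, \tag{3.1}$$
where $\hat\beta^{CBPS}$ satisfies
$$\frac{1}{n} \sum_{i=1}^{n} \left[D_i - \frac{(1 - D_i)\pi(X_i'\hat\beta^{CBPS})}{1 - \pi(X_i'\hat\beta^{CBPS})}\right] X_i = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{CBPS})}{1 - \pi(X_i'\hat\beta^{CBPS})} X_i = 0. \tag{3.2}$$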
Recall that the IPW DID estimator (2.6) employs the maximum likelihood method to
estimate the propensity score, where $\hat\beta^{ML}$ satisfies
$$\frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{ML})}{1 - \pi(X_i'\hat\beta^{ML})} \frac{\dot\pi(X_i'\hat\beta^{ML})}{\pi(X_i'\hat\beta^{ML})} X_i = 0,$$
with $\dot\pi(v) = \partial\pi(v)/\partial v$. Hence the only difference between the IPW DID estimator (2.6) and the CBPS DID
estimator (3.1) is the method of propensity score estimation. On the other hand, by (3.2), the
CBPS DID estimator can equivalently be written as
$$\hat\tau^{CBPS} = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i - \pi(X_i'\hat\beta^{CBPS})}{\bar{D}(1 - \pi(X_i'\hat\beta^{CBPS}))} (\Delta Y_i - X_i'\gamma^{CBPS}), \tag{3.3}$$
where $\gamma^{CBPS}$ is any value. Hence the only difference between the AIPW DID estimator (2.7) and the
CBPS DID estimator (3.1) is the value of γ in (2.7) and (3.3). It is noteworthy that $\gamma^{CBPS}$ in
(3.3) can take any value and is not estimated in the actual CBPS estimation defined in (3.1).
In the following subsections, we will conduct a comprehensive theoretical analysis of the CBPS
DID estimator to elucidate its advantages. It is this arbitrariness of γ CBP S in (3.3) that plays a
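The balancing condition (3.2) and the invariance to γ in (3.3) can be checked numerically. The sketch below assumes a logit propensity score, for which π/(1 − π) = exp(x′β), so that (3.2) becomes the first-order condition of a concave function that damped Newton steps can maximize. The solver, the function names and the simulated data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fit_cbps_logit(X, D, max_iter=200, tol=1e-10):
    """Solve the balancing equations (3.2) for a logit propensity score."""
    n = len(D)
    X0 = X[D == 0]
    treated_sum = X[D == 1].sum(axis=0) / n        # (1/n) * sum of D_i X_i

    def objective(b):
        # Concave function whose gradient is the left-hand side of (3.2).
        return treated_sum @ b - np.exp(X0 @ b).sum() / n

    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        odds = np.exp(X0 @ beta)                   # pi/(1 - pi) on comparison units
        grad = treated_sum - X0.T @ odds / n       # left-hand side of (3.2)
        if np.max(np.abs(grad)) < tol:
            break
        hess = (X0 * odds[:, None]).T @ X0 / n
        step = np.linalg.solve(hess, grad)
        t, f_cur = 1.0, objective(beta)
        while objective(beta + t * step) < f_cur - 1e-12 and t > 1e-8:
            t *= 0.5                               # step-halving safeguard
        beta = beta + t * step
    return beta

def cbps_did(dY, D, X, gamma=None):
    """CBPS DID estimate: form (3.1) if gamma is None, form (3.3) otherwise."""
    ps = 1.0 / (1.0 + np.exp(-(X @ fit_cbps_logit(X, D))))
    weight = (D - ps) / (D.mean() * (1.0 - ps))
    resid = dY if gamma is None else dY - X @ gamma
    return np.mean(weight * resid)

# Hypothetical data: logit selection, linear trend, ATT equal to one.
rng = np.random.default_rng(42)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ np.array([-0.2, 0.5, -0.5]))))).astype(float)
dY = 1.0 * D + X @ np.array([0.3, 1.0, -1.0]) + rng.normal(size=n)

print(cbps_did(dY, D, X))                                      # form (3.1)
print(cbps_did(dY, D, X, gamma=np.array([5.0, -3.0, 2.0])))    # form (3.3), arbitrary gamma
```

Because the fitted weights balance every column of X exactly, subtracting X′γ for any γ changes the estimate only by numerical rounding error.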
In this subsection, we start with the scenario in which both working models are correctly
specified. We show that, in this case, the CBPS DID estimator attains the semiparametric
efficiency bound for the ATT under the DID framework. This property is the so-called local efficiency.
Sant’Anna and Zhao (2020) derived the semiparametric efficiency bound for ATT under a DID
framework.
Theorem 1. Under Assumptions 1-3 and Assumptions A (stated in Appendix A), if $\pi(X_i'\beta) = \pi(X_i)$ and $X_i'\gamma = m_\Delta(X_i)$ hold,
$$\sqrt{n}(\hat\tau^{CBPS} - \tau) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^e + o_p(1) \xrightarrow{d} N\left(0, E[(\eta_i^e)^2]\right), \tag{3.4}$$
where
$$\eta_i^e = \frac{D_i - \pi(X_i)}{E[D_i]\{1 - \pi(X_i)\}} (\Delta Y_i - m_\Delta(X_i)) - \frac{D_i}{E[D_i]} \tau \tag{3.5}$$
is the efficient influence function for the ATT and $E[(\eta_i^e)^2]$ is the semiparametric efficiency bound.
Theorem 1 shows asymptotic normality of the CBPS DID estimator when both of the
working models are correctly specified. The asymptotic variance of the CBPS DID estimator is
equal to the semiparametric efficiency bound derived by Sant’Anna and Zhao (2020). It should
be noted that the AIPW DID estimator also achieves the semiparametric efficiency bound when
In the previous subsection, we showed that the CBPS DID estimator is locally efficient. In
this subsection, we shift our focus from efficiency to robustness, under the scenario that one of
the two working models is misspecified. Firstly, we show that the CBPS DID estimator remains
consistent for the ATT even if either the propensity score model or the outcome regression
model (but not both) is misspecified. This property is referred to as double robustness in terms
of consistency.
Theorem 2. Under Assumptions 1-3 and Assumptions A, the CBPS DID estimator is doubly
robust in terms of consistency, that is, $\hat\tau^{CBPS} \xrightarrow{p} \tau$ if at least one of the following two conditions
holds:
1. The outcome regression model is correctly specified, that is, there exists some value γ0 such that $X_i'\gamma_0 = m_\Delta(X_i)$ a.s.
2. The propensity score model is correctly specified, that is, there exists some value β0 such that $\pi(X_i'\beta_0) = \pi(X_i)$ a.s.
Consequently, the CBPS DID estimator offers more flexibility and is less demanding in terms
and the IPW approach. It should be noted that the AIPW DID estimator is also doubly robust
in terms of consistency.
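The outcome-regression half of this double robustness can be seen from a short population-level calculation. What follows is a sketch of the idea rather than the formal proof; $\pi^*(X_i)$ is notation introduced here for the probability limit of the fitted propensity score, which may differ from $\pi(X_i)$ and is assumed to be bounded away from one. Suppose $X_i'\gamma_0 = m_\Delta(X_i)$ a.s. and take $\gamma = \gamma_0$ in the representation (3.3):

```latex
\begin{align*}
E\left[\frac{D_i - \pi^*(X_i)}{E[D_i]\,(1 - \pi^*(X_i))}\,(\Delta Y_i - m_\Delta(X_i))\right]
&= \frac{E\left[D_i\,(\Delta Y_i - m_\Delta(X_i))\right]}{E[D_i]}
 - \frac{1}{E[D_i]}\,E\left[\frac{(1 - D_i)\,\pi^*(X_i)}{1 - \pi^*(X_i)}\,(\Delta Y_i - m_\Delta(X_i))\right] \\
&= E\left[\Delta Y_i - m_\Delta(X_i) \mid D_i = 1\right] - 0 \\
&= \tau^{OR} = \tau .
\end{align*}
```

The first equality uses $(D_i - \pi^*)/(1 - \pi^*) = D_i - (1 - D_i)\pi^*/(1 - \pi^*)$. The second term vanishes because $E[\Delta Y_i - m_\Delta(X_i) \mid X_i, D_i = 0] = 0$ by the definition of $m_\Delta$, whatever the form of $\pi^*$, and the last line is the OR identification result (2.3).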
While double robustness in terms of consistency is a valuable property, it may not suffice
for inference. The next theorem shows that the asymptotic linear representation of the CBPS
DID estimator remains invariant even when one of the working models is misspecified so that
the asymptotic variance can be estimated based on the influence function. This is referred to as
the AIPW DID estimator is not invariant when one of the working models is misspecified. A
Theorem 3. Let $\beta_0^{AIPW}$, $\gamma_0^{AIPW}$, $\beta_0^{CBPS}$ and $\gamma_0^{CBPS}$ denote the probability limits of $\hat\beta^{ML}$, $\hat\gamma^{AIPW}$, $\hat\beta^{CBPS}$ and
$$\hat\gamma^{CBPS} = \left(\sum_{i=1}^{n} \frac{(1 - D_i)\dot\pi(X_i'\hat\beta^{CBPS})}{(1 - \pi(X_i'\hat\beta^{CBPS}))^2} X_i X_i'\right)^{-1} \sum_{i=1}^{n} \frac{(1 - D_i)\dot\pi(X_i'\hat\beta^{CBPS})\Delta Y_i}{(1 - \pi(X_i'\hat\beta^{CBPS}))^2} X_i,$$
respectively. Under Assumptions 1-3 and Assumptions A, if either $\pi(X_i'\beta_0^a) = \pi(X_i)$ a.s. or $X_i'\gamma_0^a = m_\Delta(X_i)$
a.s. for a = CBPS, AIPW, the CBPS DID and AIPW DID estimators satisfy:
$$\sqrt{n}(\hat\tau^{CBPS} - \tau) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^{CBPS} + o_p(1), \tag{3.6}$$
$$\sqrt{n}(\hat\tau^{AIPW} - \tau) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^{AIPW} + O_p(1), \tag{3.7}$$
where
$$\eta_i^a = \frac{D_i - \pi(X_i'\beta_0^a)}{E[D_i]\{1 - \pi(X_i'\beta_0^a)\}} (\Delta Y_i - X_i'\gamma_0^a) - \frac{D_i}{E[D_i]} \tau.$$
Note that $\eta_i^a$ is equal to the efficient influence
function (3.5) under the assumption that both working models are correctly specified.
Theorem 3 reveals that inference based on $\hat\tau^{AIPW}$ and its influence function may be misleading
when one of the working models is misspecified. In contrast, inference based on $\hat\tau^{CBPS}$ and
its influence function will remain accurate even when one of the working models is misspecified.
This double robustness in terms of inference significantly enhances the appeal of the CBPS DID
estimator. We note that although $\hat\gamma^{CBPS}$ does not appear in estimating $\hat\tau^{CBPS}$ (see (3.1)), it does
appear in estimating the asymptotic variance. Specifically, the asymptotic variance of the CBPS
DID estimator should be estimated by
$$\frac{1}{n} \sum_{i=1}^{n} \left[\frac{D_i - \pi(X_i'\hat\beta^{CBPS})}{\bar{D}\{1 - \pi(X_i'\hat\beta^{CBPS})\}} (\Delta Y_i - X_i'\hat\gamma^{CBPS}) - \frac{D_i}{\bar{D}} \hat\tau^{CBPS}\right]^2.$$
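The plug-in variance estimator described above can be sketched as follows. The example assumes a logit propensity score, for which the weight $\dot\pi/(1-\pi)^2$ in $\hat\gamma^{CBPS}$ simplifies to $\pi/(1-\pi)$. The solver, the function names, the 1.96 critical value and the simulated data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def fit_cbps_logit(X, D, max_iter=200, tol=1e-10):
    """Solve the balancing equations (3.2) for a logit model by damped Newton steps."""
    n = len(D)
    X0 = X[D == 0]
    treated_sum = X[D == 1].sum(axis=0) / n

    def objective(b):
        return treated_sum @ b - np.exp(X0 @ b).sum() / n

    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        odds = np.exp(X0 @ beta)
        grad = treated_sum - X0.T @ odds / n
        if np.max(np.abs(grad)) < tol:
            break
        step = np.linalg.solve((X0 * odds[:, None]).T @ X0 / n, grad)
        t, f_cur = 1.0, objective(beta)
        while objective(beta + t * step) < f_cur - 1e-12 and t > 1e-8:
            t *= 0.5
        beta = beta + t * step
    return beta

def cbps_did_inference(dY, D, X, z=1.96):
    """CBPS DID point estimate with influence-function-based variance and CI."""
    n = len(D)
    ps = 1.0 / (1.0 + np.exp(-(X @ fit_cbps_logit(X, D))))
    d_bar = D.mean()
    weight = (D - ps) / (d_bar * (1.0 - ps))
    tau_hat = np.mean(weight * dY)
    # gamma_hat^CBPS: weighted least squares on comparison units; for the logit
    # model the weight pi_dot / (1 - pi)^2 equals pi / (1 - pi).
    ls_weight = (1.0 - D) * ps / (1.0 - ps)
    gamma_hat = np.linalg.solve((X * ls_weight[:, None]).T @ X, X.T @ (ls_weight * dY))
    # Estimated influence function and its second moment.
    eta_hat = weight * (dY - X @ gamma_hat) - D / d_bar * tau_hat
    avar = np.mean(eta_hat ** 2)
    se = np.sqrt(avar / n)
    return {"tau": tau_hat, "avar": avar, "se": se,
            "ci": (tau_hat - z * se, tau_hat + z * se), "eta": eta_hat}

# Hypothetical data with ATT equal to one.
rng = np.random.default_rng(3)
n = 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ np.array([-0.2, 0.5, -0.5]))))).astype(float)
dY = 1.0 * D + X @ np.array([0.3, 1.0, -1.0]) + rng.normal(size=n)
result = cbps_did_inference(dY, D, X)
print(result["tau"], result["se"], result["ci"])
```

Note that $\hat\gamma^{CBPS}$ enters only the variance: by the balancing condition the point estimate is unchanged whether or not $X_i'\hat\gamma^{CBPS}$ is subtracted, and the estimated influence function averages to zero up to rounding.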
Although the CBPS DID estimator is shown to have desirable properties in scenarios where
at least one of the two working models is correctly specified, situations might arise where both
of the working models are misspecified. To address this issue, Fan et al. (2022) conduct a
theoretical investigation of the AIPW ATE and CBPS ATE estimators under the scenario that
both propensity score and outcome models are locally misspecified and find that the CBPS ATE
estimator may converge in probability to the ATE at a faster rate than the AIPW estimator.
In this subsection, we examine whether the CBPS DID estimator inherits such a desirable
property in the DID design. We consider the same locally misspecified models as Fan et al. (2022):
Assumption 4.
where u(Xi ; β ∗ ) and r(Xi ) are functions determining the directions of misspecification with
|u(Xi ; β ∗ )| ≤ C, |r(Xi )| ≤ C a.s. for some positive constant C, ξ ∈ R and δ ∈ R represent the
Theorem
4. Under Assumptions
h π̇(X ′ β ∗ )
1-4 and Assumptions A, suppose
i−1 h π(X ′ β ∗ ) i at least one entry of
π̇(Xi′ β ∗ )Xi′ E i X X′
1−π(X ′ β ∗ ) i i
E i
1−π(X ′ β ∗ )
u(Xi ;β ∗ )Xi
u(Xi ;β ∗ )
E 1−π(X ′ β ∗ ) − i
1−π(Xi′ β ∗ )
i
is nonzero, as n → ∞, the
Xi
i
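CBPS DID and AIPW DID estimators satisfy:
$$\hat\tau^{CBPS} - \tau = \frac{1}{n} \sum_{i=1}^{n} \eta_i^e + O_p(\xi^2\delta + \delta n^{-1/2} + \xi n^{-1/2}), \tag{3.10}$$
and
$$\hat\tau^{AIPW} - \tau = \frac{1}{n} \sum_{i=1}^{n} \eta_i^e + O_p(\xi\delta + \delta n^{-1/2} + \xi n^{-1/2}). \tag{3.11}$$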
If $\sqrt{n}\xi\delta \to \infty$, then the CBPS DID estimator converges in probability to the ATT at
a faster rate than the AIPW DID estimator since $\hat\tau^{CBPS} - \tau = O_p(n^{-1/2} + \xi^2\delta)$, whereas
$\hat\tau^{AIPW} - \tau = O_p(\xi\delta)$. On the other hand, if $\sqrt{n}\xi\delta \to 0$, the two estimators have the same
limiting distribution $N(0, E[(\eta_i^e)^2])$, but $\sqrt{n}(\hat\tau^{CBPS} - \tau)$ converges to the limit distribution at a
faster rate than $\sqrt{n}(\hat\tau^{AIPW} - \tau)$. Theorem 4 implies that the CBPS DID estimator demonstrates
greater robustness to slight model misspecification compared to the AIPW DID estimator.
These faster rates of convergence are attributed to the arbitrariness of γ CBP S in (3.3), which
effectively eliminates the product ξδ in the asymptotic linear representation of the CBPS DID
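As a worked illustration of the rates in Theorem 4, consider the purely hypothetical choice $\xi = \delta = n^{-1/5}$ (this choice is an assumption made here for illustration and does not appear in the paper). Substituting into the remainder terms of (3.10) and (3.11) gives:

```latex
\begin{align*}
\sqrt{n}\,\xi\delta &= n^{1/2}\, n^{-2/5} = n^{1/10} \to \infty, \\
\text{AIPW remainder:}\quad \xi\delta + \delta n^{-1/2} + \xi n^{-1/2}
  &= n^{-2/5} + 2 n^{-7/10} = O(n^{-2/5}), \\
\text{CBPS remainder:}\quad \xi^{2}\delta + \delta n^{-1/2} + \xi n^{-1/2}
  &= n^{-3/5} + 2 n^{-7/10} = o(n^{-1/2}).
\end{align*}
```

Under this choice the CBPS remainder is negligible relative to the $n^{-1/2}$ leading term, so $\hat\tau^{CBPS} - \tau = O_p(n^{-1/2})$, whereas the bound available for the AIPW DID estimator is only $O_p(n^{-2/5})$.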
4 Simulation
In this section, we conduct a series of simulation studies to examine the finite sample properties
of the CBPS DID estimator. The simulation designs here closely follow those in (Sant’Anna
and Zhao 2020; Fan et al. 2022). Throughout these simulations, we utilize a logistic working
model for the propensity score and a linear regression model for outcome evolution. For the
OR, IPW, and AIPW approaches, we estimate the propensity scores using maximum likelihood
We set the sample size n equal to 1000, and conduct 1000 Monte Carlo simulations for
each design. The performance of various DID estimators is compared in terms of average bias,
median bias, root mean square error (RMSE), empirical 95% coverage probability, the average
length of a 95% confidence interval, and the average of their plug-in estimator for the asymptotic
variance. The confidence intervals are constructed using the normal approximation, and the
asymptotic variances are estimated by their sample analogues. Additionally, we present the
semiparametric efficiency bound for each design calculated by Sant’Anna and Zhao (2020).
This helps to assess the potential loss of efficiency or accuracy of a semiparametric DID estimator.
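1. Both the propensity score and the outcome regression models are correctly specified.
2. The outcome regression model is correctly specified, but the propensity score model is misspecified.
3. The propensity score model is correctly specified, but the outcome regression model is misspecified.
4. Both the propensity score and the outcome regression models are misspecified.
5. Both the propensity score and the outcome regression models are locally misspecified.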
Let $X_i = (X_{1i}, X_{2i}, X_{3i}, X_{4i})'$ be generated from $N(0, I_4)$, where $I_4$ is the 4 × 4 identity matrix.
For j = 1, 2, 3, 4, let $Z_{ji} = (\tilde{Z}_{ji} - E[\tilde{Z}_{ji}])/\sqrt{Var(\tilde{Z}_{ji})}$, where $\tilde{Z}_{1i} = \exp(0.5X_{1i})$, $\tilde{Z}_{2i} = 10 + X_{2i}/(1 + \exp(X_{1i}))$,
$\tilde{Z}_{3i} = (0.6 + X_{1i}X_{3i}/25)^3$ and $\tilde{Z}_{4i} = (20 + X_{2i} + X_{4i})^2$. We consider the
DGP1. (PS and OR are correctly specified)
$$Y_{i1}(d) = 2f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i1}(d), \quad Y_{i0}(0) = f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i0},$$
$$Y_{i1}(d) = 2f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i1}(d), \quad Y_{i0}(0) = f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i0},$$
$$Y_{i1}(d) = 2f_{or}(X_i) + v(X_i, D_i) + \varepsilon_{i1}(d), \quad Y_{i0}(0) = f_{or}(X_i) + v(X_i, D_i) + \varepsilon_{i0},$$
$$Y_{i1}(d) = 2f_{or}(X_i) + v(X_i, D_i) + \varepsilon_{i1}(d), \quad Y_{i0}(0) = f_{or}(X_i) + v(X_i, D_i) + \varepsilon_{i0},$$
DGP5. (PS and OR are locally misspecified)
$$Y_{i1}(d) = 2f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i1}(d) + 2\delta r(Z_i), \quad Y_{i0}(0) = f_{or}(Z_i) + v(Z_i, D_i) + \varepsilon_{i0} + \delta r(Z_i),$$
$$\pi(Z_i) = \frac{\exp(f_{ps}(Z_i))}{1 + \exp(f_{ps}(Z_i))} \cdot \exp(\xi u(Z_i)), \quad D_i = 1\{\pi(Z_i) \geq U_i\},$$
treatment or not, εi0 and εi1(d) are independent standard normal random variables, Ui is an
independent normal random variable with a mean of Di · for (Wi ) and a variance of one. This
determine the directions of misspecification. It is important to note that in all the DGPs
mentioned above, the true ATT is zero, and the available data are {Yi0, Yi1, Di, Zi}, i = 1, ..., n, where
Zi = (1, Z1i, Z2i, Z3i, Z4i)′ includes a constant among the covariates. The realized outcomes Yi0
and Yi1 are generated according to Yi0 = Yi0(0) and Yi1 = Di Yi1(1) + (1 − Di)Yi1(0), respectively.
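The covariate construction used in these designs can be sketched as below. One assumption is made loudly here: the text standardizes each transformed covariate with its population mean and variance, whereas this sketch substitutes sample moments for simplicity. The function name and seed handling are likewise illustrative.

```python
import numpy as np

def generate_covariates(n, seed=0):
    """Draw X ~ N(0, I_4) and build the standardized transformed covariates Z.

    Returns X (n x 4) and Z (n x 5), where the first column of Z is a constant.
    Standardization uses sample moments in place of E[.] and Var(.) (an assumption).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 4))
    x1, x2, x3, x4 = X.T
    Z_tilde = np.column_stack([
        np.exp(0.5 * x1),
        10.0 + x2 / (1.0 + np.exp(x1)),
        (0.6 + x1 * x3 / 25.0) ** 3,
        (20.0 + x2 + x4) ** 2,
    ])
    Z_std = (Z_tilde - Z_tilde.mean(axis=0)) / Z_tilde.std(axis=0)
    Z = np.column_stack([np.ones(n), Z_std])
    return X, Z

X, Z = generate_covariates(1000)
print(Z.shape)
```

With a large auxiliary sample the sample moments could be replaced by Monte Carlo approximations of the population moments, which is closer to the definition in the text.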
4.2 Results
The results are summarized in the tables below. $\hat\tau^{IPW}$ represents the IPW DID estimator
(2.6), $\hat\tau^{OR}$ denotes the OR DID estimator (2.4), $\hat\tau^{AIPW}$ is the AIPW DID estimator (2.7), and
$\hat\tau^{CBPS}$ refers to the CBPS DID estimator (3.1). The abbreviations used are as follows: “Av.Bias”,
“Med.Bias”, “RMSE”, “Asy.V”, “Cover” and “CIL” stand for the average simulated bias, median
simulated bias, simulated root mean-squared error, average of the plug-in estimators for the
asymptotic variance, 95% coverage probability, and the average length of the 95% confidence
interval, respectively.
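The summary statistics listed above can be computed from the Monte Carlo draws along the following lines. This is a minimal sketch: the function name, its arguments and the 1.96 normal critical value are assumptions made for the example.

```python
import numpy as np

def summarize_monte_carlo(estimates, avar_estimates, n, true_att=0.0, z=1.96):
    """Summarize Monte Carlo output for one estimator.

    estimates: point estimates, one per replication.
    avar_estimates: plug-in asymptotic variance estimates, one per replication.
    n: sample size used in each replication.
    """
    estimates = np.asarray(estimates, dtype=float)
    avar_estimates = np.asarray(avar_estimates, dtype=float)
    error = estimates - true_att
    half_width = z * np.sqrt(avar_estimates / n)   # normal-approximation CI
    return {
        "Av.Bias": error.mean(),
        "Med.Bias": np.median(error),
        "RMSE": np.sqrt(np.mean(error ** 2)),
        "Asy.V": avar_estimates.mean(),
        "Cover": np.mean(np.abs(error) <= half_width),
        "CIL": np.mean(2.0 * half_width),
    }

print(summarize_monte_carlo([0.1, -0.1, 0.3], [1.0, 1.0, 1.0], n=100))
```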
Table 1: DGP1, both working models are correctly specified
Table 1 suggests that when both working models are correctly specified, all semiparametric
DID estimators show minimal Monte Carlo bias. However, $\hat\tau^{OR}$, $\hat\tau^{AIPW}$, and $\hat\tau^{CBPS}$ outperform
$\hat\tau^{IPW}$ in terms of bias, root mean square error, asymptotic variance, and the length of the
confidence interval. This implies that the IPW DID estimator is substantially less efficient
than the other three. Although $\hat\tau^{OR}$ tends to be slightly more efficient than $\hat\tau^{AIPW}$ and
Table 2: DGP2, correct outcome regression model with a misspecified propensity score model
Table 2 illustrates that when the propensity score model is misspecified, the CBPS DID
scenario. Conversely, Table 3 indicates that when the outcome regression model is misspecified,
τ̂ CBP S outperforms the other three estimators in terms of root mean square error, asymptotic
variance, and coverage probability. In this scenario, τ̂ OR displays a non-negligible bias, which
Table 3: DGP3, misspecified outcome regression model with a correct propensity score model
Table 4: DGP4, both models are misspecified
In Table 4, when both working models are misspecified, it is unsurprising that all semipara-
metric DID estimators exhibit bias, and generally, the inference procedures are misleading. In
this scenario, the CBPS DID estimator demonstrates smaller biases, lower root mean square
error (RMSE), reduced asymptotic variance, and shorter confidence interval lengths compared
to the OR and AIPW DID estimators. However, in DGP4, the IPW DID estimator appears to
Table 5 indicates that in scenarios when both working models are locally misspecified, τ̂ CBP S
shows the smallest bias and root mean square error (RMSE), along with the best coverage
three DID estimators. This corroborates the finding that IPW-based estimators are
sensitive to even slight misspecifications of the propensity score model, see, e.g. Kang and
Schafer (2007).
In summary, the findings presented in Table 1 indicate that the estimated variance of τ̂ CBP S
is very close to the semiparametric efficiency bound when both working models are correctly
specified. This supports our Theorem 1 regarding local semiparametric efficiency. In Tables
2 and 3, when one of the working models is misspecified, our proposed CBPS DID estimator
shows little bias, which supports the double robustness in terms of consistency stated in Theorem 2.
Furthermore, in DGP2 and DGP3, the CBPS DID estimator achieves a coverage probability
closer to 95% compared to the AIPW DID estimator, validating the superiority of double
robustness for inference demonstrated in Theorem 3. Lastly, Table 5 reveals that the CBPS
DID estimator is more robust to mild model misspecification than the AIPW DID estimator,
5 An empirical application
In this section, we apply our CBPS DID estimator to a real data sample. LaLonde (1986)
conducted a highly influential study evaluating the performance of different treatment effect
estimators based on observational data. The study focused on whether a new statistical method-
ology could replicate an experimental benchmark: the treatment effect of a National Supported
Work (NSW) labor training program on post-treatment earnings. Unfortunately, the results
were not satisfactory due to the potential presence of selection bias in the observational data.
Later, Dehejia and Wahba (1999) demonstrated that propensity score matching (PSM) based
estimators could closely replicate the experimental results. However, Smith and Todd (2005)
found that cross-sectional PSM estimators are highly sensitive to both model misspecification
and sample selection. They suggested that DID matching estimators were more appropriate.
Following the findings of Smith and Todd (2005), we use different samples and specifications
to evaluate the existing DID estimators. Specifically, we utilize data from the Current Population
Survey (CPS) to create a comparison group and use the control group from LaLonde’s original
experimental sample and the Dehejia and Wahba (DW) sample as our pseudo treatment group.
We consider two datasets: (1) LaLonde’s control group (425) + CPS (15,992), and (2) DW
control group (260) + CPS (15,992). Since no one received training under this setup, the true
ATT is zero, and a consistent estimator should yield an estimate close to zero. Therefore, we use deviations from zero to evaluate the
The pre-treatment covariates in the data include age, real earnings in 1974, years of education,
and dummy variables for high school dropout status, marital status, race (black), and ethnicity
(Hispanic). The outcome of interest is real earnings in 1978, and pre-treatment real earnings in
1975 are also available. As part of our analysis, similar to Monte Carlo simulations, we compare
the performance of the CBPS DID estimator, τ̂ CBP S with the IPW DID estimator, τ̂ IP W , the
OR DID estimator, τ̂ OR , and the AIPW DID estimator, τ̂ AIP W . We assume that the outcome
models are linear in parameters and the propensity score model follows a logistic specification.
To assess the sensitivity to model misspecification, we also consider three different specifications:
(i) linear covariates (Lin); (ii) addition of some quadratic covariates such as age squared and
education squared (Qua); (iii) addition of some interaction terms selected by SantAnna and
Zhao (S&Z). The results are summarized in Table 6, with standard error reported in parentheses.
Table 6: Deviation of different DID estimators for the effect of training on real earnings in 1978,
with CPS comparison group
The results in Table 6 reveal several interesting findings. First, $\hat\tau^{OR}$ displays the largest
bias across different datasets and covariate specifications. For the LaLonde sample, every DID
estimator shows more severe bias than under the DW sample. Second,
Abadie’s IPW DID estimator τ̂ IP W tends to have the largest standard error in all situations
although its bias is relatively small especially under the Qua and S&Z specifications. Third,
τ̂ AIP W and τ̂ CBP S perform better than the other two in terms of bias. The CBPS DID estimator
τ̂ CBP S is very close to the true ATT when adopting the linear specification under the DW
sample. Finally, when we compare τ̂ CBP S with τ̂ AIP W we find that the CBPS DID estimator
tends to have smaller standard errors in all situations. Taken together, the results suggest that
the proposed DID estimator is a compelling alternative to existing DID estimators. Additionally,
we use the Panel Study of Income Dynamics (PSID) to create a comparison group, with the
6 Conclusion
In this paper, we introduced an ATT estimator based on the CBPS method within a DID
framework. This framework is applicable when the parallel trends assumption holds after
conditioning on a set of pre-treatment covariates and when panel data are available. We
we found that while the CBPS DID estimator’s expression is similar to that of the IPW
estimator, its theoretical properties align more closely with those of the AIPW estimator. We
demonstrated that the CBPS DID estimator is locally semiparametrically efficient and exhibits
double robustness, similar to the AIPW DID estimator. Moreover, the asymptotic linear
representation of the CBPS DID estimator remains invariant even when one of the working
our estimator has a faster convergence rate than the AIPW DID estimator when both working
models are locally misspecified. These superior properties set the CBPS DID estimator apart
from the AIPW DID estimator. Our simulation results and empirical studies confirm these
theoretical properties, showcasing the advantages of the proposed CBPS DID estimator.
In this work, we have primarily concentrated on the theoretical development of the CBPS
DID estimator, especially in comparison to the AIPW DID estimator. An intriguing extension
involves adapting the CBPS DID estimator to high-dimensional settings. This is a nontrivial task,
as traditional regression methods tend to break down in high-dimensional settings, and CBPS
machine learning techniques into the first-step estimation of the propensity score and the
outcome evolution. This approach, known as double machine learning methodology, has been
explored in studies including (Chernozhukov et al. 2017; Chernozhukov et al. 2018; Chang 2020).
Our ongoing research aims to develop and investigate a high-dimensional CBPS DID estimator.
References
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,
Whitney Newey, and James Robins (2018). Double/debiased machine learning for treatment
Dehejia, Rajeev H and Sadek Wahba (1999). “Causal effects in nonexperimental studies: Reeval-
uating the evaluation of training programs”. Journal of the American statistical Association
Fan, Jianqing, Kosuke Imai, Inbeom Lee, Han Liu, Yang Ning, and Xiaolin Yang (2022).
Heckman, James J, Hidehiko Ichimura, and Petra E Todd (1997). “Matching as an econometric
evaluation estimator: Evidence from evaluating a job training programme”. The review of
Imai, Kosuke and Marc Ratkovic (2014). “Covariate balancing propensity score”. Journal of the
Kang, Joseph DY and Joseph L Schafer (2007). “Demystifying double robustness: A comparison
LaLonde, Robert J (1986). “Evaluating the econometric evaluations of training programs with
Ning, Yang, Sida Peng, and Kosuke Imai (2020). “Robust estimation of causal effects via a
Sant’Anna, Pedro HC and Jun Zhao (2020). “Doubly robust difference-in-differences estimators”.
Smith, Jeffrey A and Petra E Todd (2005). “Does matching overcome LaLonde’s critique of
A Mathematical appendix
To simplify notation, let $g(x)$ be generic notation for $\pi(x)$ and $m_\Delta(x)$, and let $g(x'\theta)$ denote a parametric
function that serves as a generic representation of $\pi(x'\beta)$ and $x'\gamma$. Additionally, for a generic
$W$, let $\|W\| = \sqrt{\operatorname{trace}(W'W)}$ denote the Euclidean norm of $W$.
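Let
\[
\lambda_i(\zeta) = \frac{D_i - \pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i))}\,(\Delta Y_i - X_i'\gamma),
\qquad
\dot\lambda_i(\zeta) = \frac{\partial \lambda_i(\zeta)}{\partial \zeta},
\]
where $\zeta = (\beta', \gamma')'$.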
Assumption A.
(i) Assume there is a known parametric function $g(x'\theta) = g(x)$ and $\theta \in \Theta \subset \mathbb{R}^K$, where $\Theta$ is a compact
parameter set.
(ii) Assume $g(X_i'\theta)$ is a.s. continuous in $\theta \in \Theta$ and is a.s. continuously differentiable, in addition,
(iv) Assume that $\hat\theta$ strongly converges to $\theta_0$ and can be asymptotically expressed as
\[
\sqrt{n}\,(\hat\theta - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi_g(D_i, \Delta Y_i, X_i, \theta_0) + o_p(1),
\]
where $E[\psi_g(D_i, \Delta Y_i, X_i, \theta_0)] = 0$ and $E[\psi_g(D_i, \Delta Y_i, X_i, \theta_0)\,\psi_g(D_i, \Delta Y_i, X_i, \theta_0)']$ is finite and positive
definite.
(v) Assume that $E\left[\|\lambda_i(\zeta_0)\|^2\right] < \infty$ and $E\left[\sup_{\zeta \in \Gamma_0} |\dot\lambda_i(\zeta)|\right] < \infty$, where $\Gamma_0$ is a small neighborhood of $\zeta_0$.
Assumption A (i)-(iv) represent standard conditions necessary for consistency and $\sqrt{n}$-asymptotically
linear representations of the first-step estimators, and (v) imposes some weak integrability conditions.
= E [Yi1 (1) − Yi0 (0)|Di = 1] − E [E [Yi1 (0) − Yi0 (0)|Xi , Di = 0] |Di = 1] (Assumption 2)
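\[
\begin{aligned}
\tau^{IPW} &= E\left[\frac{D_i - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,\Delta Y_i\right] \\
&= \frac{E[D_i \Delta Y_i]}{E[D_i]} - \frac{E\left[\frac{(1 - D_i)\,\pi(X_i)}{1 - \pi(X_i)}\,\Delta Y_i\right]}{E[D_i]} \\
&= \frac{E[D_i \Delta Y_i]}{\Pr(D_i = 1)} - \frac{E\left[\frac{\pi(X_i)}{1 - \pi(X_i)}\,E[(1 - D_i)\Delta Y_i \mid X_i]\right]}{\Pr(D_i = 1)} \\
&= E[\Delta Y_i \mid D_i = 1] - \frac{E\left[\frac{\pi(X_i)}{1 - \pi(X_i)}\,E[1 - D_i \mid X_i]\,E[Y_{i1}(0) - Y_{i0}(0) \mid X_i]\right]}{\Pr(D_i = 1)} \quad \text{(Assumption 2)} \\
&= E[\Delta Y_i \mid D_i = 1] - \frac{E\left[\pi(X_i)\,E[\Delta Y_i \mid X_i, D_i = 0]\right]}{\Pr(D_i = 1)}
\end{aligned}
\]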
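As a numerical sanity check of the first two equalities, the following sketch simulates a hypothetical data-generating process and computes the IPW expression in its weighted form and in its split form. The logistic propensity score, the linear untreated trend, the true ATT of 1, and all function names are assumptions made for illustration; none of them come from the paper. The two forms agree up to floating-point error because $(D_i - \pi)/(1 - \pi) = D_i - (1 - D_i)\pi/(1 - \pi)$ holds observation by observation.

```python
import numpy as np

def simulate(n, seed=0):
    """Hypothetical DGP (an assumption for illustration, not from the paper)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    pi = 1.0 / (1.0 + np.exp(-0.5 * x))   # true propensity score (assumed logistic)
    d = rng.binomial(1, pi)
    # Outcome change: untreated trend 0.5*x in both groups, so the true ATT is 1.
    dy = 1.0 * d + 0.5 * x + rng.normal(size=n)
    return x, pi, d, dy

def ipw_weighted(d, dy, pi):
    """Sample analogue of E[(D - pi) / (E[D](1 - pi)) * dY]."""
    p1 = d.mean()
    return np.mean((d - pi) / (p1 * (1.0 - pi)) * dy)

def ipw_split(d, dy, pi):
    """Sample analogue of E[D dY]/E[D] - E[(1 - D) pi/(1 - pi) dY]/E[D]."""
    p1 = d.mean()
    treated = np.mean(d * dy) / p1
    reweighted_controls = np.mean((1 - d) * pi / (1.0 - pi) * dy) / p1
    return treated - reweighted_controls

x, pi, d, dy = simulate(200_000)
tau_weighted = ipw_weighted(d, dy, pi)
tau_split = ipw_split(d, dy, pi)
print(tau_weighted, tau_split)
```

Both numbers should be close to the simulated ATT of 1 here because the true propensity score is plugged in; with an estimated or misspecified score this need not hold.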
A4. ATT in AIPW approach
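\[
\begin{aligned}
\tau^{AIPW} &= E\left[\frac{D_i - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,(\Delta Y_i - m_\Delta(X_i))\right] \\
&= E\left[\frac{D_i - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,\Delta Y_i\right] - E\left[\frac{E[D_i \mid X_i] - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,m_\Delta(X_i)\right] \\
&= \tau^{IPW} - 0 = \tau
\end{aligned}
\]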
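Suppose that both working models are correctly specified, that is, $\pi(X_i'\beta) = \pi(X_i)$ and $X_i'\gamma = m_\Delta(X_i)$. Then
\[
\begin{aligned}
\sqrt{n}\,(\hat\tau^{CBPS} - \tau)
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \frac{D_i - \pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i'\beta))}\,(\Delta Y_i - X_i'\gamma^{CBPS}) - \frac{D_i}{E[D_i]}\,\tau \right\} \\
&\quad - \sqrt{n}\,(\bar D - E[D_i])\, E\left[ \frac{(D_i - \pi(X_i'\beta))(\Delta Y_i - X_i'\gamma^{CBPS})}{E^2[D_i]\,(1 - \pi(X_i'\beta))} - \frac{D_i}{E^2[D_i]}\,\tau \right] \\
&\quad - \sqrt{n}\,(\hat\beta^{CBPS} - \beta)'\, E\left[ \frac{(1 - D_i)(\Delta Y_i - X_i'\gamma^{CBPS})\,\dot\pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i'\beta))^2}\,X_i \right] + o_p(1) \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \frac{D_i - \pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i'\beta))}\,(\Delta Y_i - X_i'\gamma^{CBPS}) - \frac{D_i}{E[D_i]}\,\tau \right\} + o_p(1) \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^{e} + o_p(1),
\end{aligned}
\]
because
\[
\begin{aligned}
E\left[ \frac{(D_i - \pi(X_i'\beta))(\Delta Y_i - X_i'\gamma^{CBPS})}{E^2[D_i]\,(1 - \pi(X_i'\beta))} - \frac{D_i}{E^2[D_i]}\,\tau \right]
&= E\left[ \frac{(D_i - \pi(X_i))(\Delta Y_i - m_\Delta(X_i))}{E^2[D_i]\,(1 - \pi(X_i))} - \frac{D_i}{E^2[D_i]}\,\tau \right] \\
&= \frac{\tau^{AIPW} - \tau}{E[D_i]} = 0,
\end{aligned}
\]
and
\[
E\left[ \frac{(1 - D_i)(\Delta Y_i - X_i'\gamma^{CBPS})\,\dot\pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i'\beta))^2}\,X_i \right]
= E\left[ \frac{E[(1 - D_i)(\Delta Y_i - m_\Delta(X_i)) \mid X_i]\,\dot\pi(X_i'\beta)}{E[D_i]\,(1 - \pi(X_i))^2}\,X_i \right] = 0
\]
by setting $\gamma^{CBPS} = \gamma$. (Note that $\gamma^{CBPS}$ can take any value.) Thus the conclusion follows by the CLT.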
A6. Proof of Theorem 2
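Case 1: If the outcome model is correctly specified, that is, $X_i'\gamma_0 = m_\Delta(X_i)$ but $\pi(X_i'\beta_0) \neq \pi(X_i)$, then by
the weak law of large numbers and the continuous mapping theorem, as $n \to \infty$,
\[
\begin{aligned}
\hat\tau^{CBPS} &\xrightarrow{p} E\left[ \frac{D_i - \pi(X_i'\beta_0)}{E[D_i]\,(1 - \pi(X_i'\beta_0))}\,(\Delta Y_i - m_\Delta(X_i)) \right] \\
&= E\left[ \frac{D_i \Delta Y_i}{E[D_i]} - \frac{D_i\, m_\Delta(X_i)}{E[D_i]} \right] - E\left[ \frac{(1 - D_i)(\Delta Y_i - m_\Delta(X_i))\,\pi(X_i'\beta_0)}{E[D_i]\,(1 - \pi(X_i'\beta_0))} \right] \\
&= E\left[ \frac{D_i \Delta Y_i}{E[D_i]} - \frac{D_i\, m_\Delta(X_i)}{E[D_i]} \right] - E\left[ \frac{E[(1 - D_i)(\Delta Y_i - m_\Delta(X_i)) \mid X_i]\,\pi(X_i'\beta_0)}{E[D_i]\,(1 - \pi(X_i'\beta_0))} \right] \\
&= \tau^{OR} - 0 = \tau.
\end{aligned}
\]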
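Case 2: If only the propensity score model is correctly specified, that is, $\pi(X_i'\beta_0) = \pi(X_i)$ but
$X_i'\gamma_0 \neq m_\Delta(X_i)$, then by the weak law of large numbers and the continuous mapping theorem, as $n \to \infty$,
\[
\begin{aligned}
\hat\tau^{CBPS} &\xrightarrow{p} E\left[ \frac{D_i - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,(\Delta Y_i - X_i'\gamma^{CBPS}) \right] \\
&= E\left[ \frac{D_i - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,\Delta Y_i \right] - E\left[ \frac{E[D_i \mid X_i] - \pi(X_i)}{E[D_i]\,(1 - \pi(X_i))}\,X_i'\gamma^{CBPS} \right] \\
&= \tau^{IPW} - 0 = \tau.
\end{aligned}
\]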
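The two cases can be illustrated numerically. The sketch below is not the CBPS DID estimator itself: it plugs fixed working models into the sample analogue of the limit expression above, which is enough to show why either correct model suffices. The data-generating process, the deliberately wrong working models (`pi_wrong`, `m_wrong`), and the helper name `dr_did` are all assumptions made for illustration.

```python
import numpy as np

def dr_did(d, dy, pi_work, m_work):
    """Sample analogue of E[(D - pi)/(E[D](1 - pi)) * (dY - m)] for fixed working models."""
    p1 = d.mean()
    return np.mean((d - pi_work) / (p1 * (1.0 - pi_work)) * (dy - m_work))

# Hypothetical DGP (assumed for illustration): true ATT = 1.
rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-0.5 * x))    # true propensity score
m_true = 0.5 * x                            # true untreated outcome evolution
d = rng.binomial(1, pi_true)
dy = 1.0 * d + m_true + rng.normal(size=n)

# Deliberately misspecified working models.
pi_wrong = np.full(n, 0.5)                  # ignores the covariate
m_wrong = 2.0 - x                           # wrong level and wrong slope

tau_case1 = dr_did(d, dy, pi_wrong, m_true)        # Case 1: outcome model correct
tau_case2 = dr_did(d, dy, pi_true, m_wrong)        # Case 2: propensity score correct
tau_both_wrong = dr_did(d, dy, pi_wrong, m_wrong)  # neither correct
print(tau_case1, tau_case2, tau_both_wrong)
```

Under this design the first two values should land near 1, while the third is visibly biased, which matches the limits derived in Cases 1 and 2.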
A7. Proof of Theorem 3
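\[
\begin{aligned}
\sqrt{n}\,(\hat\tau^{CBPS} - \tau)
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^{CBPS}
 - \sqrt{n}\,(\bar D - E[D_i])\, E\left[ \frac{(D_i - \pi(X_i'\beta_0^{CBPS}))(\Delta Y_i - X_i'\gamma^{CBPS})}{E^2[D_i]\,(1 - \pi(X_i'\beta_0^{CBPS}))} - \frac{D_i}{E^2[D_i]}\,\tau \right] \\
&\quad - \sqrt{n}\,(\hat\beta^{CBPS} - \beta_0^{CBPS})'\, E\left[ \frac{(1 - D_i)(\Delta Y_i - X_i'\gamma^{CBPS})\,\dot\pi(X_i'\beta_0^{CBPS})}{E[D_i]\,(1 - \pi(X_i'\beta_0^{CBPS}))^2}\,X_i \right] + o_p(1), \qquad \text{(A.1)}
\end{aligned}
\]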
whereas
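\[
\begin{aligned}
\sqrt{n}\,(\hat\tau^{AIPW} - \tau)
&= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i^{AIPW}
 - \sqrt{n}\,(\bar D - E[D_i])\, E\left[ \frac{(D_i - \pi(X_i'\beta_0^{AIPW}))(\Delta Y_i - X_i'\gamma_0^{AIPW})}{E^2[D_i]\,(1 - \pi(X_i'\beta_0^{AIPW}))} - \frac{D_i}{E^2[D_i]}\,\tau \right] \\
&\quad - \sqrt{n}\,(\hat\beta^{AIPW} - \beta_0^{AIPW})'\, E\left[ \frac{(1 - D_i)(\Delta Y_i - X_i'\gamma_0^{AIPW})\,\dot\pi(X_i'\beta_0^{AIPW})}{E[D_i]\,(1 - \pi(X_i'\beta_0^{AIPW}))^2}\,X_i \right] \\
&\quad - \sqrt{n}\,(\hat\gamma^{AIPW} - \gamma_0^{AIPW})'\, E\left[ \frac{D_i - \pi(X_i'\beta_0^{AIPW})}{E[D_i]\,(1 - \pi(X_i'\beta_0^{AIPW}))}\,X_i \right] + o_p(1). \qquad \text{(A.2)}
\end{aligned}
\]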
Case 1: If the outcome regression model is correct but the propensity score model is incorrect, the
third terms of both AIPW and CBPS expansions are zero but the fourth term of the AIPW expansion
Case 2: On the other hand, if the propensity score model is correct but the outcome regression model is
incorrect, the fourth term of the AIPW expansion is zero but the third term of the AIPW expansion is
nonzero and of order $O_p(1)$. However, the arbitrariness of $\gamma^{CBPS}$ in the CBPS expansion offers a significant
advantage. Specifically, the third term of the CBPS expansion is zero even when the outcome regression
model is incorrect by setting
\[
\gamma^{CBPS} = E\left[ \frac{(1 - D_i)\,\dot\pi(X_i'\beta_0^{CBPS})}{(1 - \pi(X_i'\beta_0^{CBPS}))^2}\,X_i X_i' \right]^{-1}
E\left[ \frac{(1 - D_i)\,\Delta Y_i\,\dot\pi(X_i'\beta_0^{CBPS})}{(1 - \pi(X_i'\beta_0^{CBPS}))^2}\,X_i \right] := \gamma_0^{CBPS}.
\]
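The choice of $\gamma_0^{CBPS}$ above is a weighted least-squares coefficient computed on the comparison group, with weights $(1 - D_i)\dot\pi/(1 - \pi)^2$. A minimal sketch of its sample analogue follows, assuming a logistic link for $\pi$ (so that $\dot\pi = \pi(1 - \pi)$), a simulated data set, and an arbitrary working value `beta`; the function name `pseudo_true_gamma` and all numerical choices are hypothetical. The final line evaluates the sample version of the third-term expectation (up to the constant factor $1/E[D_i]$), which the normal equations force to zero.

```python
import numpy as np

def pseudo_true_gamma(X, d, dy, beta):
    """Weighted least-squares analogue of gamma_0^CBPS, assuming a logistic pi."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    p_dot = p * (1.0 - p)                   # derivative of the logistic link
    w = (1 - d) * p_dot / (1.0 - p) ** 2    # weights; zero for treated units
    lhs = (X * w[:, None]).T @ X            # sum_i w_i X_i X_i'
    rhs = X.T @ (w * dy)                    # sum_i w_i X_i dY_i
    return np.linalg.solve(lhs, rhs), w

# Hypothetical data (assumed for illustration).
rng = np.random.default_rng(2)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X[:, 1])))
dy = d + np.sin(X[:, 1]) + rng.normal(size=n)   # nonlinear trend: a linear outcome model is wrong
beta = np.array([0.1, 0.4])                     # arbitrary working value for beta

gamma, w = pseudo_true_gamma(X, d, dy, beta)
# Sample analogue of E[(1 - D)(dY - X'gamma) pi_dot / (1 - pi)^2 X], omitting 1/E[D].
third_term_moment = X.T @ (w * (dy - X @ gamma)) / n
print(gamma, third_term_moment)
```

The moment vanishes even though the linear outcome model is misspecified in this simulation, which is the point of Case 2.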
A8. Proof of Theorem 4
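We provide a sketch of the proof because the details are very similar to those of (3.10) and (3.11) of Fan
et al. (2021). Letting $\beta_0^{CBPS}$ denote the probability limit of $\hat\beta^{CBPS}$, we decompose
\[
\begin{aligned}
\hat\tau^{CBPS} - \tau
&= \frac{1}{\bar D}\,\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{D_i - \pi(X_i'\beta_0^{CBPS})}{1 - \pi(X_i'\beta_0^{CBPS})}\,\left(\Delta Y_i - X_i'\gamma^{CBPS}\right) - D_i\,\tau \right] \\
&\quad + \frac{1}{\bar D}\,\frac{1}{n} \sum_{i=1}^{n} \left[ \left( \frac{D_i - \pi(X_i'\hat\beta^{CBPS})}{1 - \pi(X_i'\hat\beta^{CBPS})} - \frac{D_i - \pi(X_i'\beta_0^{CBPS})}{1 - \pi(X_i'\beta_0^{CBPS})} \right) \left(\Delta Y_i - X_i'\gamma^{CBPS}\right) \right] \\
&:= A_1 + A_2.
\end{aligned}
\]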
First, we write $\gamma^{CBPS} = \gamma^* + \delta A$ for any value $A$ since $\gamma^{CBPS}$ is arbitrary. Then we have
\[
\Delta Y_i - X_i'\gamma^{CBPS} = \Delta Y_i - m_\Delta(X_i) + \left\{ m_\Delta(X_i) - X_i'\gamma^{CBPS} \right\}
\]
Also, by the same argument as the proof of (C.1) in Fan et al. (2021), we have
where $u_i^* = u_i(X_i; \beta^*)$, $\pi_i^* = \pi(X_i'\beta^*)$ and $\dot\pi_i^* = \dot\pi(X_i'\beta^*)$. Hence by using a similar argument as the
$A_2 = O_p(\delta n^{-1/2})$.
and
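\[
\begin{aligned}
A_1 &= \frac{1}{n} \sum_{i=1}^{n} \eta_i^{e} + \frac{1}{E[D_i]}\, E\left[ \frac{D_i - \pi(X_i'\beta_0^{CBPS})}{1 - \pi(X_i'\beta_0^{CBPS})}\,\left( m_\Delta(X_i) - X_i'\gamma^{CBPS} \right) \right] + O_p(\xi n^{-1/2} + \delta n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \eta_i^{e} + \frac{1}{E[D_i]}\, E\left[ \frac{D_i - \pi(X_i'\beta_0^{CBPS})}{1 - \pi(X_i'\beta_0^{CBPS})}\,\delta \left( r(X_i) - X_i'A \right) \right] + O_p(\xi n^{-1/2} + \delta n^{-1/2}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \eta_i^{e} + O_p(\xi^2 \delta + \xi n^{-1/2} + \delta n^{-1/2}).
\end{aligned}
\]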
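To see the last equality, the second term is calculated as
\[
\begin{aligned}
&\frac{1}{E[D_i]}\, E\left[ \frac{D_i - \pi(X_i'\beta_0^{CBPS})}{1 - \pi(X_i'\beta_0^{CBPS})}\,\delta \left( r(X_i) - X_i'A \right) \right] \\
&= \frac{1}{E[D_i]}\, E\left[ \left\{ \frac{D_i - \pi(X_i'\beta^*)}{1 - \pi(X_i'\beta^*)} - \frac{(1 - D_i)\,\dot\pi_i^*\, X_i'(\beta_0^{CBPS} - \beta^*)}{\{1 - \pi(X_i'\beta^*)\}^2} \right\} \delta \left( r(X_i) - X_i'A \right) \right] + O(\xi^2 \delta) \\
&= \frac{1}{E[D_i]}\, E\left[ \left\{ \frac{\pi(X_i'\beta^*)(1 + \xi u_i^*) - \pi(X_i'\beta^*)}{1 - \pi(X_i'\beta^*)} - \frac{\{1 - \pi(X_i'\beta^*)(1 + \xi u_i^*)\}\,\dot\pi_i^*\, X_i'(\beta_0^{CBPS} - \beta^*)}{\{1 - \pi(X_i'\beta^*)\}^2} \right\} \delta \left( r(X_i) - X_i'A \right) \right] + O(\xi^2 \delta) \\
&= \frac{1}{E[D_i]}\, E\left[ \left\{ \frac{\xi u_i^*}{1 - \pi(X_i'\beta^*)} - \frac{\xi\, \dot\pi_i^*\, X_i'\, E\left[ \frac{\dot\pi_i^*}{1 - \pi_i^*} X_i X_i' \right]^{-1} E\left[ \frac{\pi_i^*}{1 - \pi_i^*} u_i^* X_i \right]}{1 - \pi(X_i'\beta^*)} \right\} \delta \left( r(X_i) - X_i'A \right) \right] + O(\xi^2 \delta) \\
&= \frac{\xi \delta}{E[D_i]}\, E\left[ \left\{ \frac{u_i^*}{1 - \pi(X_i'\beta^*)} - \frac{\dot\pi_i^*\, X_i'\, E\left[ \frac{\dot\pi_i^*}{1 - \pi_i^*} X_i X_i' \right]^{-1} E\left[ \frac{\pi_i^*}{1 - \pi_i^*} u_i^* X_i \right]}{1 - \pi(X_i'\beta^*)} \right\} \left( r(X_i) - X_i'A \right) \right] + O(\xi^2 \delta).
\end{aligned}
\]
Hence, assuming that at least one entry of
\[
E\left[ \left\{ \frac{u_i^*}{1 - \pi(X_i'\beta^*)} - \frac{\dot\pi_i^*\, X_i'\, E\left[ \frac{\dot\pi_i^*}{1 - \pi_i^*} X_i X_i' \right]^{-1} E\left[ \frac{\pi_i^*}{1 - \pi_i^*} u_i^* X_i \right]}{1 - \pi(X_i'\beta^*)} \right\} X_i \right]
\]
is nonzero, there exists $A$ such that
\[
E\left[ \left\{ \frac{u_i^*}{1 - \pi(X_i'\beta^*)} - \frac{\dot\pi_i^*\, X_i'\, E\left[ \frac{\dot\pi_i^*}{1 - \pi_i^*} X_i X_i' \right]^{-1} E\left[ \frac{\pi_i^*}{1 - \pi_i^*} u_i^* X_i \right]}{1 - \pi(X_i'\beta^*)} \right\} \left( r(X_i) - X_i'A \right) \right] = 0.
\]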
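The existence claim can be made concrete with one explicit choice of $A$. The shorthand $c_i$ and $v$ below is introduced only for this remark and does not appear in the original argument; $c_i$ stands for the scalar term in braces in the preceding display.

```latex
% Shorthand (introduced here for illustration only):
%   c_i = \frac{u_i^*}{1-\pi(X_i'\beta^*)}
%         - \frac{\dot\pi_i^* X_i' E[\frac{\dot\pi_i^*}{1-\pi_i^*} X_i X_i']^{-1}
%                 E[\frac{\pi_i^*}{1-\pi_i^*} u_i^* X_i]}{1-\pi(X_i'\beta^*)},
%   v   = E[c_i X_i].
\[
E\left[ c_i \left( r(X_i) - X_i'A \right) \right] = 0
\iff
v'A = E\left[ c_i\, r(X_i) \right].
\]
% This is a single linear equation in A. If at least one entry of v is nonzero,
% then v'v > 0 and one solution is
\[
A = \frac{E\left[ c_i\, r(X_i) \right]}{v'v}\; v ,
\qquad \text{since} \qquad
v'A = \frac{E\left[ c_i\, r(X_i) \right]}{v'v}\; v'v = E\left[ c_i\, r(X_i) \right].
\]
```

Any other $A$ that differs from this one by a vector orthogonal to $v$ satisfies the same condition, so the solution is not unique when $X_i$ has more than one component.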
This completes the proof of (3.10). The proof of (3.11) follows from the same argument except that
A9. Additional Application Results
Table 7: Deviation of different DID estimators for the effect of training on real earnings in 1978,
with PSID comparison group