De Chaisemartin D Haultfœuille 2020 Two Way Fixed Effects Estimators With Heterogeneous Treatment Effects
https://doi.org/10.1257/aer.20181169
Linear regressions with period and group fixed effects are widely
used to estimate treatment effects. We show that they estimate
weighted sums of the average treatment effects (ATE) in each group
and period, with weights that may be negative. Due to the negative
weights, the linear regression coefficient may for instance be negative
while all the ATEs are positive. We propose another estimator
that solves this issue. In the two applications we revisit, it is significantly
different from the linear regression estimator. (JEL C21, C23,
D72, J31, J51, L82)
VOL. 110 NO. 9 DE CHAISEMARTIN AND D’HAULTFŒUILLE: TWO-WAY FIXED EFFECTS 2965
$$\beta_{fe} = E\left[\sum_{(g,t):D_{g,t}=1} W_{g,t}\,\Delta_{g,t}\right], \qquad (1)$$
where Δg,t is the average treatment effect (ATE) in group g and period t and the
weights Wg,t sum to 1 but may be negative. Negative weights arise because β̂fe is
a weighted sum of several difference-in-differences (DID), which compare the
evolution of the outcome between consecutive time periods across pairs of groups.
However, the “control group” in some of those comparisons may be treated at both
periods. Then, its treatment effect at the second period gets differenced out by the
DID, hence the negative weights.
The negative weights are an issue when the ATEs are heterogeneous across
groups or periods. Then, one could have that βfe is negative while all the ATEs are
positive. For instance, 1.5 × 1 − 0.5 × 4, a weighted sum of 1 and 4, is strictly
negative. Using the dataset of Gentzkow, Shapiro, and Sinkinson (2011), we find
that 40 percent of the weights attached to βfe are negative, so βfe is not robust to
heterogeneous effects.1
Researchers may want to know how serious that issue is in the application they
consider. We show that conditional on all treatments, the absolute value of the expectation
of β̂fe divided by the standard deviation of the weights is equal to the minimal
value of the standard deviation of the ATEs across the treated (g, t) cells under which
the average treatment on the treated (ATT) may actually have the opposite sign than
that coefficient. One can estimate that ratio to assess the robustness of the two-way
FE coefficient. If that ratio is close to 0, that coefficient and the ATT can be of opposite
signs even under a small and plausible amount of treatment effect heterogeneity.
In that case, treatment effect heterogeneity would be a serious concern for the validity
of that coefficient. On the contrary, if that ratio is very large, that coefficient and
the ATT can only be of opposite signs under a very large and implausible amount of
treatment effect heterogeneity.
Finally, we propose a new estimator, DIDM, that is valid even if the treatment
effect is heterogeneous over time or across groups. It estimates the average treatment
effect across all the (g, t) cells whose treatment changes from t − 1 to t. It
relies on common trends assumptions on both potential outcomes. Those conditions
are partly testable, and we propose a test that amounts to looking at pretrends. This
test differs from the standard event study pretrends test (see Autor 2003), which has
been shown to be invalid when treatment effects are heterogeneous (see Abraham
and Sun 2018). We show that our estimator is asymptotically normal. We compute it
in the datasets of Gentzkow, Shapiro, and Sinkinson (2011) and Vella and Verbeek
(1998), and in both cases we find that it is significantly different from β̂fe.2 Our estimator
can be used in applications where, for each pair of consecutive dates, there are
1 Gentzkow, Shapiro, and Sinkinson (2011) does not estimate βfe, but βfd, the treatment coefficient in the
first-difference regression defined below. Forty-six percent of the weights attached to βfd are strictly negative.
2 In both cases, our estimator is also significantly different from β̂fd.
2966 THE AMERICAN ECONOMIC REVIEW SEPTEMBER 2020
groups whose treatment does not change. We estimate that this condition is satisfied
for around 80 percent of the papers using two-way fixed effects regressions found
in our survey of the AER.
Overall, our paper has implications for applied researchers estimating two-way
fixed effects regressions. First, we recommend that they compute the weights
attached to their regression and the ratio of |β̂fe| divided by the standard deviation
of the weights. To do so, they can use the twowayfeweights Stata package that is
available from the SSC repository. If many weights are negative, and if the ratio is
not very large, we recommend that they compute our new estimator, using the fuzzydid
and did_multiplegt Stata packages, also available from the SSC repository (see
de Chaisemartin, D’Haultfœuille, and Guyonvarch 2019, for explanations on how to
use the former package).
We extend our results in several important directions. First, another commonly
used regression is the first-difference regression of Yg,t − Yg,t−1, the change in
the mean outcome in group g, on period fixed effects and on Dg,t − Dg,t−1, the
change in the treatment. We let βfd denote the expectation of the coefficient
of Dg,t − Dg,t−1. We show that under common trends, βfd also identifies a weighted
sum of treatment effects, with potentially some negative weights. Second, in our
online Appendix we show that our results extend to fuzzy designs, where the treatment
varies within (g, t) cells, and to two-way fixed effects regressions with a
nonbinary treatment and with covariates.
Our paper is related to the DID literature. Our main result generalizes Theorem 1
in de Chaisemartin and D’Haultfœuille (2018). When the data have two groups
and two periods, the Wald-DID estimand considered therein is equal to βfe and βfd.
Our results on βfe and βfd are thus extensions of that theorem to the case with multiple
periods and groups.3 Moreover, our DIDM estimator is related to the Wald-TC
estimator with many groups and periods proposed in de Chaisemartin and
D’Haultfœuille (2018), and to the multiperiod DID estimator proposed by Imai
and Kim (2018). In Section III, we explain the differences between those three
estimators.
More recently, Borusyak and Jaravel (2017), Abraham and Sun (2018), Athey
and Imbens (2018), Callaway and Sant’Anna (2018), and Goodman-Bacon (2018)
study the special case of staggered adoption designs, where the treatment of a group
is weakly increasing over time. Those papers derive some important results specific
to that design that we do not consider here. Still, some of the results in those papers
are related to ours, and we describe precisely those connections later in the paper.
The most important dimension on which our paper differs from those is that our
results apply to any two-way fixed effects regressions, not only to those with stag-
gered adoption. In our survey of the AER papers estimating two-way fixed effects
regressions, less than 10 percent have a staggered adoption design. This suggests
that while staggered adoptions are an important research design, they may account
for a relatively small minority of the applications where two-way fixed effects
regressions have been used.
3
In fact, a preliminary version of our main result appeared in a working paper version of de Chaisemartin and
D’Haultfœuille (2018): see Theorems S1 and S2 in de Chaisemartin and D’Haultfœuille (2015).
I. Setup
One considers observations that can be divided into G groups and T periods. For
every (g, t) ∈ {1, …, G} × {1, …, T}, let Ng,t denote the number of observations
in group g at period t, and let N = ∑g,t Ng,t be the total number of observations.
The data may be an individual-level panel or repeated cross-section dataset where
groups are, say, individuals’ county of birth. The data could also be a cross section
where cohort of birth plays the role of time. For instance, Duflo (2001) compares
the schooling of different cohorts in Indonesia, some of which were exposed
to a school construction program. It is also possible that for all (g, t), Ng,t = 1,
e.g., a group is one individual or firm. All of the above are special cases of the data
structure we consider.
One is interested in measuring the effect of a treatment on some out-
come. Throughout the paper we assume that treatment is binary, but our results
apply to any ordered treatment, as we show in online Appendix Section 3.2.
Then, for every (i, g, t) ∈ {1, …, Ng,t} × {1, …, G} × {1, …, T}, let Di,g,t
and (Yi,g,t(0), Yi,g,t(1)) respectively denote the treatment status and the potential
outcomes without and with treatment of observation i in group g at period t.
The outcome of observation i in group g and period t is Yi,g,t = Yi,g,t(Di,g,t). For
all (g, t), let
$$D_{g,t} = \frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} D_{i,g,t}, \qquad Y_{g,t}(0) = \frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} Y_{i,g,t}(0),$$
$$Y_{g,t}(1) = \frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} Y_{i,g,t}(1), \qquad \text{and} \qquad Y_{g,t} = \frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} Y_{i,g,t}.$$
Here, Dg,t denotes the average treatment in group g at period t, while Yg,t(0),
Yg,t(1), and Yg,t respectively denote the average potential outcomes without and with
treatment and the average observed outcome in group g at period t.
Throughout the paper, we maintain the following assumptions.
Assumption 2 requires that units’ treatments do not vary within each (g, t) cell,
a situation we refer to as a sharp design. This is for instance satisfied when the
treatment is a group-level variable, for instance a county or a state law. This is also
mechanically satisfied when Ng,t = 1. In our survey in Section IIA, we find that
almost 80 percent of the papers using two-way fixed effects regressions and published
in the AER between 2010 and 2012 consider sharp designs. We focus on sharp
designs because of their prevalence, but in online Appendix Section 2, we show that
all the results in Sections II and III can be extended to fuzzy designs.
We consider Dg,t, Yg,t(0), Yg,t(1) as random variables. For instance, aggregate random
shocks may affect the average potential outcomes of group g at period t. The
treatment status of group g at period t may also be random. The expectations below
are taken with respect to the distribution of those random variables. Assumption 3
allows for the possibility that the treatments and potential outcomes of a group may
be correlated over time, but it requires that the potential outcomes and treatments of
different groups be independent.
Assumption 4 requires that the shocks affecting a group’s Yg,t(0) be mean independent
of that group’s treatment sequence. This rules out the possibility that a group
gets treated because it experiences negative shocks, the so-called Ashenfelter’s dip
(see Ashenfelter 1978). Assumption 4 is related to the strong exogeneity condition
in panel data models, which, as is well known, is necessary to obtain the consistency
of the fixed effects estimator (see, e.g., Wooldridge 2002).
We now define the FE regression described in the introduction.4
4 Throughout the paper, we assume that Dg,t in Regression 1 and Dg,t − Dg,t−1 in Regression 2 are
not collinear with the other independent variables in those regressions, so β̂fe and β̂fd are well defined.
5 As the independent variables in Regression 1 are constant within each (g, t) cell, Regression 1 is equivalent to
a (g, t)-level regression of Yg,t on group and period fixed effects and Dg,t, weighted by Ng,t.
A. A Decomposition Result
denote the average treatment effect across all treated units, and let δTR = E[ΔTR]
denote the expectation of that parameter, hereafter referred to as the ATT. For any
(g, t) ∈ {1, …, G} × {1, …, T}, let
$$\Delta_{g,t} = \frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} \left[Y_{i,g,t}(1) - Y_{i,g,t}(0)\right]$$
denote the ATE in cell (g, t). Note that δTR is equal to the expectation of a weighted
average of the treated cells’ Δg,t,
$$\delta^{TR} = E\left[\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1}\,\Delta_{g,t}\right]. \qquad (2)$$
Under the common trends assumption, we show that βfe is also equal to the expectation
of a weighted sum of the Δg,t terms, with potentially some negative weights.
Let εg,t denote the residual of observations in cell (g, t) in the regression of Dg,t on
group and period fixed effects,6
$$D_{g,t} = \alpha + \gamma_g + \lambda_t + \varepsilon_{g,t}.$$
6 εg,t arises from a unit-level regression, where the dependent and independent variables only vary at the (g, t)
level. Therefore, all the units in the same (g, t) cell have the same value of εg,t.
One can show that if the regressors in Regression 1 are not collinear, the average
value of εg,t across all treated (g, t) cells differs from 0:
$$\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1}\,\varepsilon_{g,t} \neq 0.$$
Then we let wg,t denote εg,t divided by that average:
$$w_{g,t} = \frac{\varepsilon_{g,t}}{\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1}\,\varepsilon_{g,t}}.$$
This result implies that in general, βfe ≠ δTR, so β̂fe is a biased estimator of the ATT.
To illustrate this, we consider a simple example of a staggered adoption design with
two groups and three periods, and where the treatments are nonstochastic: group 1
is untreated at periods 1 and 2 and treated at period 3, while group 2 is untreated at
period 1 and treated both at periods 2 and 3.8 We also assume that Ng,t/Ng,t−1 does
not vary across g: all groups experience the same growth of their number of observations
from t − 1 to t, a requirement that is for instance satisfied when the data is
a balanced panel. Then, one can show that
$$\varepsilon_{1,3} = 1 - 1/3 - 1 + 1/2 = 1/6,$$
$$\varepsilon_{2,2} = 1 - 2/3 - 1/2 + 1/2 = 1/3,$$
$$\varepsilon_{2,3} = 1 - 2/3 - 1 + 1/2 = -1/6.$$
The residual is negative in group 2 and period 3, because the regression predicts
a treatment probability larger than one in that cell, a classic extrapolation problem
with linear regressions. Then, under the common trends assumption, it follows from
Theorem 1 and the fact that the treatments are nonstochastic that
$$\beta_{fe} = \tfrac{1}{2} E[\Delta_{1,3}] + E[\Delta_{2,2}] - \tfrac{1}{2} E[\Delta_{2,3}].$$
7 In the proof, we show the following, stronger result:
$$E[\hat{\beta}_{fe} \mid D] = \sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\, E[\Delta_{g,t} \mid D].$$
8
A similar example appears in Borusyak and Jaravel (2017).
in the population of treated observations. Therefore, βfe is not equal to δTR. Perhaps
more worryingly, not all the weights are positive: the weight assigned to the ATE in
group 2, period 3 is strictly negative. Consequently, βfe may be a very misleading
measure of the treatment effect. Assume for instance that E[Δ1,3] = E[Δ2,2] = 1
and E[Δ2,3] = 4. At the period when they start receiving the treatment, both groups
experience a modest positive ATE. But this effect builds over time and in period 3,
one period after it has started receiving the treatment, group 2 now experiences a
large ATE. Then,
$$\beta_{fe} = 1/2 \times 1 + 1 - 1/2 \times 4 = -1/2.$$
Therefore, βfe is strictly negative, while E[Δ1,3], E[Δ2,2], and E[Δ2,3] are all positive.
More generally, the negative weights are an issue if the E[Δg,t] terms are heterogeneous,
across groups or over time.9 If E[Δ1,3] = E[Δ2,2] = E[Δ2,3] = 1,
then βfe = 1 = δTR.
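The residuals and weights of this two-group, three-period example can be checked numerically. The sketch below is illustrative and is not the authors' code: the helper name `twfe_weights` and the list-of-lists layout are assumptions, and it assumes that all cells have the same number of observations, in which case the residual of Dg,t on group and period fixed effects is the two-way demeaned treatment. Weights are then computed from the definition of wg,t, and exact fractions are used so that the values can be compared with those in the text.

```python
from fractions import Fraction

def twfe_weights(D):
    """Residuals eps[g][t] and weights w[(g, t)] for a binary G x T treatment table D.

    Hypothetical helper. It assumes every cell has the same number of
    observations, so the residual from regressing D on group and period fixed
    effects is D[g][t] - (group mean) - (period mean) + (overall mean).
    """
    G, T = len(D), len(D[0])
    row = [Fraction(sum(D[g]), T) for g in range(G)]
    col = [Fraction(sum(D[g][t] for g in range(G)), G) for t in range(T)]
    grand = Fraction(sum(map(sum, D)), G * T)
    eps = [[D[g][t] - row[g] - col[t] + grand for t in range(T)] for g in range(G)]
    treated = [(g, t) for g in range(G) for t in range(T) if D[g][t] == 1]
    # With equal cell sizes, N_{g,t} / N_1 = 1 / (number of treated cells).
    share = Fraction(1, len(treated))
    denom = sum(share * eps[g][t] for g, t in treated)
    w = {(g, t): eps[g][t] / denom for g, t in treated}
    return eps, w, share

D = [[0, 0, 1],   # group 1: treated at period 3 only
     [0, 1, 1]]   # group 2: treated at periods 2 and 3
eps, w, share = twfe_weights(D)

# Heterogeneous effects: E[Delta_{1,3}] = E[Delta_{2,2}] = 1 and E[Delta_{2,3}] = 4.
effects = {(0, 2): 1, (1, 1): 1, (1, 2): 4}
beta_fe = sum(share * w[cell] * effects[cell] for cell in w)

# Homogeneous effects equal to 1 in every treated cell.
beta_fe_homog = sum(share * w[cell] for cell in w)
```

Indices are zero-based, so `eps[0][2]` corresponds to ε1,3. Under these assumptions, `beta_fe` reproduces the value −1/2 obtained with heterogeneous effects, and `beta_fe_homog` equals 1.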
Here is some intuition as to why one weight is negative in this example. It
follows from equation (A4) in the proof of Theorem 1 (see also Theorem 1 in
Goodman-Bacon 2018) that in this simple example, βfe = (DID1 + DID2)/2, with
The first DID compares the evolution of the mean outcome from period 1 to 2 in
group 2 and in group 1. The second one compares the evolution of the mean out-
come from period 2 to 3 in group 1 and in group 2. The control group in the second
DID, group 2, is treated both in the pre- and in the post-period. Therefore, under
the common trends assumption, it follows from Lemma 1 in Appendix A (a sim-
ilar result appears in Lemma 1 of de Chaisemartin 2011 and in equation (13) of
Goodman-Bacon 2018) that DID1 = E[Δ2,2], but
Note that DID2 is equal to the ATE in group 1, period 3, minus the change in
group 2’s ATE between periods 2 and 3. Intuitively, the mean outcome of groups 1
and 2 may follow different trends from period 2 to 3 either because group 1 becomes
treated, or because group 2’s ATE changes. The intuition that negative weights arise
because β̂fe uses treated observations as controls also appears in Borusyak and
Jaravel (2017).
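The algebra behind this intuition can be written out as follows. This is a sketch reconstructed from the verbal description of the two DIDs, not a quotation of the paper's displayed equations; it assumes that each DID is expressed in terms of expected cell means E[Yg,t], that treatments are nonstochastic, and that common trends hold for the untreated outcome.

```latex
\begin{align*}
DID_1 &= E[Y_{2,2}] - E[Y_{2,1}] - \left(E[Y_{1,2}] - E[Y_{1,1}]\right)
       = E[\Delta_{2,2}], \\
DID_2 &= E[Y_{1,3}] - E[Y_{1,2}] - \left(E[Y_{2,3}] - E[Y_{2,2}]\right)
       = E[\Delta_{1,3}] - \left(E[\Delta_{2,3}] - E[\Delta_{2,2}]\right), \\
\frac{DID_1 + DID_2}{2}
      &= \tfrac{1}{2} E[\Delta_{1,3}] + E[\Delta_{2,2}] - \tfrac{1}{2} E[\Delta_{2,3}].
\end{align*}
```

The last line coincides with the expression for βfe in the example, so the negative weight on E[Δ2,3] comes entirely from the second DID, whose control group is treated in both periods.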
We now generalize the previous illustration by characterizing the (g, t) cells
whose ATEs are weighted negatively by βfe.
9 On the other hand, βfe does not rule out heterogeneous treatment effects within (g, t) cells, as it is identified by
variations across (g, t) cells, and does not leverage any within-cell variation.
= 1, D.,t > D.,t′ implies wg,t < wg,t′. Similarly, for all (g, g′, t) such
that Dg,t = Dg′,t = 1, Dg,. > Dg′,. implies wg,t < wg′,t.
Proposition 1 shows that βfe is more likely to assign a negative weight to periods
where a large fraction of groups are treated, and to groups treated for many periods.
Then, negative weights are a concern when treatment effects differ between periods
with many versus few treated groups, or between groups treated for many versus
few periods.
Proposition 1 has interesting implications in staggered adoption designs, a spe-
cial case of sharp designs defined as follows.
ASSUMPTION 6 (Staggered Adoption Designs): For all g, Dg,t ≥ Dg,t−1 for all
t ≥ 2.
Theorem 1 shows that in sharp designs with many groups and periods, β̂fe may
be a misleading measure of the treatment effect under the standard common trends
assumption, if the treatment effect is heterogeneous across groups and time periods.
In the corollary below, we propose two robustness measures that can be used to
assess how serious that concern is.
Those robustness measures are defined conditional on D, the vector
stacking together the treatments of all the (g, t) cells. Specifically, for
all (g, t) ∈ {1, …, G} × {1, …, T}, let Δ̃g,t = E(Δg,t | D) denote the ATE in
cell (g, t) conditional on D,11 let Δ̃TR = E(ΔTR | D) denote the ATT conditional on
D, and let β̃fe = E(β̂fe | D). The first measure we consider is the minimal value of
the standard deviation of the Δ̃g,t terms under which one could have that β̃fe is of a
different sign than Δ̃TR. Therefore, this summary measure applies to β̃fe and Δ̃TR,
10 Borusyak and Jaravel (2017) assumes that the treatment effect of cell (g, t) only depends on the number of
periods since group g has started receiving the treatment, whereas Proposition 1 does not rely on that assumption.
11 Δ̃g,t may differ from E(Δg,t). To see this, let us consider a simple example where
T = 2. Then, under Assumption 3, one has Δ̃g,t = E(Δg,t | Dg,1, Dg,2). One may for instance have
E(Δg,1 | Dg,1 = 0, Dg,2 = 0) < E(Δg,1 | Dg,1 = 1, Dg,2 = 1), if a group is more likely to be treated if her
treatment effect is initially high.
rather than βfe and δTR, the unconditional expectations of β̂fe and ΔTR on which we
have focused so far. However, one can show that when G, the number of groups, goes
to infinity, β̃fe − βfe and Δ̃TR − δTR both converge to 0. So if the number of groups is
large, β̃fe and Δ̃TR should not differ much from βfe and δTR, and our robustness measure
“almost” applies to βfe and δTR.
Let
$$\sigma(\tilde{\Delta}) = \left(\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(\tilde{\Delta}_{g,t} - \tilde{\Delta}^{TR}\right)^2\right)^{1/2},$$
$$\sigma(w) = \left(\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(w_{g,t} - 1\right)^2\right)^{1/2},$$
where σ(Δ̃) is the standard deviation of the conditional ATEs, and σ(w) is the standard
deviation of the w-weights,12 across the treated (g, t) cells. Let n = #{(g, t) : Dg,t = 1}
denote the number of treated cells. For every i ∈ {1, …, n}, let w(i) denote the
ith largest of the weights of the treated cells: w(1) ≥ w(2) ≥ ⋯ ≥ w(n), and let N(i)
and Δ̃(i) be the number of observations and the conditional ATE of the corresponding
cell. Then, for any k ∈ {1, …, n}, let
$$P_k = \sum_{i \geq k} \frac{N_{(i)}}{N_1}, \qquad S_k = \sum_{i \geq k} \frac{N_{(i)}}{N_1}\, w_{(i)}, \qquad \text{and} \qquad T_k = \sum_{i \geq k} \frac{N_{(i)}}{N_1}\, w_{(i)}^2.$$
(i) If σ(w) > 0, the minimal value of σ(Δ̃) compatible with β̃fe and Δ̃TR = 0 is
$$\underline{\sigma}_{fe} = \frac{|\tilde{\beta}_{fe}|}{\sigma(w)}.$$
(ii) If β̃fe ≠ 0 and at least one of the wg,t weights is strictly negative, the minimal
value of σ(Δ̃) compatible with β̃fe and with Δ̃g,t of a different sign than β̃fe
for all (g, t) is
$$\underline{\underline{\sigma}}_{fe} = \frac{|\tilde{\beta}_{fe}|}{\left[T_s + S_s^2/(1 - P_s)\right]^{1/2}},$$
12 One can show that $\sum_{(g,t):D_{g,t}=1} (N_{g,t}/N_1)\, w_{g,t} = 1$.
Similarly, if $\underline{\underline{\sigma}}_{fe}$ is close to 0, one may have, say, β̃fe > 0, while Δ̃g,t ≤ 0 for
all (g, t), even if the dispersion of the Δ̃g,t terms is relatively small. Notice that $\underline{\underline{\sigma}}_{fe}$
is only defined if at least one of the weights is strictly negative: if all the weights are
positive, then one cannot have that β̃fe is of a different sign than all the Δ̃g,t terms.
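The first robustness measure, |β̃fe|/σ(w), is simple to compute once the weights are known. The sketch below is illustrative only: the helper name `sigma_lower_bound` is hypothetical, and its inputs are the weights wg,t of the treated cells, their shares Ng,t/N1, and a value standing in for β̃fe. The inputs are taken from the two-group, three-period example, where each treated cell has share 1/3 and the products (Ng,t/N1)wg,t equal 1/2, 1, and −1/2, so that the weights are 3/2, 3, and −3/2. The value −0.5 used for the coefficient is the one obtained in that example with effects 1, 1, and 4.

```python
import math

def sigma_lower_bound(beta_fe, weights, shares):
    """Return (sigma(w), |beta_fe| / sigma(w)) for the treated cells.

    Hypothetical helper: `weights` are the w_{g,t} and `shares` the
    N_{g,t} / N_1 of the treated cells, listed in the same order.
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares N_{g,t}/N_1 must sum to 1 over treated cells")
    # The share-weighted mean of the weights is 1, hence the deviations (w - 1).
    sigma_w = math.sqrt(sum(s * (w - 1.0) ** 2 for w, s in zip(weights, shares)))
    if sigma_w == 0:
        raise ValueError("the measure is only defined if sigma(w) > 0")
    return sigma_w, abs(beta_fe) / sigma_w

weights = [1.5, 3.0, -1.5]          # w_{1,3}, w_{2,2}, w_{2,3}
shares = [1 / 3, 1 / 3, 1 / 3]      # N_{g,t} / N_1 with equal cell sizes
sigma_w, sigma_fe = sigma_lower_bound(-0.5, weights, shares)
```

A small value of `sigma_fe` relative to a plausible dispersion of treatment effects would indicate, following the text, that the coefficient and the ATT could have opposite signs under modest heterogeneity.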
When some of the weights wg,t are negative, β̂fe may still be robust to heterogeneous
treatment effects across groups and periods, provided the assumption below
is satisfied.
Assumption 7 requires that the weights attached to the fixed effects estimator
be uncorrelated with the conditional ATEs in the treated (g, t) cells. This is
often implausible. For instance, groups treated the most are also those with the
lowest value of wg,t, as shown in Proposition 1. But those groups could also be
those with the largest treatment effect. This would then induce a negative correlation
between w and Δ̃. The plausibility of Assumption 7 can be assessed
by looking at whether w is correlated with a predictor of the treatment effect
in each (g, t) cell. In the two applications we revisit in Section V, this test is
rejected.
When T = 2 and Ng,2/Ng,1 does not vary across g, meaning that all groups experience
the same growth of their number of units from period 1 to 2, one can show
that β̂fe = β̂fd. But β̂fe differs from β̂fd if T > 2 or Ng,2/Ng,1 varies across g.
We start by showing that a result similar to Theorem 1 also applies to β̂fd.
For any (g, t) ∈ {1, …, G} × {2, …, T}, let εfd,g,t denote the residual of observations
in group g and at period t in the regression of Dg,t − Dg,t−1 on period
fixed effects, among observations for which t ≥ 2. For any g ∈ {1, …, G},
let εfd,g,1 = εfd,g,T+1 = 0. One can show that if the regressors in Regression 2 are
not perfectly collinear,
$$\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(\varepsilon_{fd,g,t} - \frac{N_{g,t+1}}{N_{g,t}}\, \varepsilon_{fd,g,t+1}\right) \neq 0.$$
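Then we define
$$w_{fd,g,t} = \frac{\varepsilon_{fd,g,t} - \frac{N_{g,t+1}}{N_{g,t}}\, \varepsilon_{fd,g,t+1}}{\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(\varepsilon_{fd,g,t} - \frac{N_{g,t+1}}{N_{g,t}}\, \varepsilon_{fd,g,t+1}\right)}.$$
$$\beta_{fd} = E\left[\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{fd,g,t}\, \Delta_{g,t}\right].$$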
PROPOSITION 2: Suppose that Assumptions 1–2 and 6 hold and for all g,
Ng,t does not depend on t. Then, for all (g, t) such that Dg,t = 1, wfd,g,t < 0 if
and only if Dg,t−1 = 1 and D.,t − D.,t−1 > D.,t+1 − D.,t (with the convention
that D.,T+1 = D.,T).
Proposition 2 shows that for all t ∈ {2, …, T − 1} such that the increase in the
proportion of treated units is larger from t − 1 to t than from t to t + 1, the period-t
ATE of groups already treated in t − 1 receives a negative weight. Moreover, if,
at period T, at least one group becomes treated, the ATE of groups already treated
in T − 1 also receives a negative weight. Therefore, the treatment effect arising
at the date when a group starts receiving the treatment does not receive a negative
weight, only long-run treatment effects do. Then, negative weights are a concern
when instantaneous and long-run treatment effects may differ. Proposition 2 also
shows that the prevalence of negative weights depends on how the number of groups
that start receiving the treatment at date t evolves with t. Assume for instance that
this number decreases with t: many groups start receiving the treatment at date 1, a
bit less start at date 2, etc., a case hereafter referred to as the “more early adopters”
case. Then, if Ng,t is constant across (g, t), D.,t − D.,t−1 is decreasing in t, and all the
long-run treatment effects receive negative weights, except maybe those of period T
if D.,T = D.,T−1. Conversely, assume that the number of groups that start receiving
the treatment at date t increases with t: few groups start receiving the treatment at
date 1, a bit more start at date 2, etc., a case hereafter referred to as the “more late
adopters” case. Then, if Ng,t is constant across (g, t), D.,t − D.,t−1 is increasing in t,
and only the period-T long-run treatment effects receive negative weights. Overall,
negative weights are much more prevalent in the “more early adopters” than in the
“more late adopters” case.
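The sign pattern described by Proposition 2 can be illustrated on a small staggered design. The sketch below is not the authors' code: the helper name `fd_weights` is hypothetical, it assumes equal cell sizes (so that Ng,t+1/Ng,t = 1 and Ng,t/N1 is the inverse of the number of treated cells), and the six-group, four-period design is an invented "more early adopters" example in which three groups start treatment at period 2, two at period 3, and one at period 4.

```python
from fractions import Fraction

def fd_weights(D):
    """Weights w_fd[(g, t)] of the treated cells, assuming equal cell sizes.

    Hypothetical helper. Column index t = 0..T-1 stands for periods 1..T.
    """
    G, T = len(D), len(D[0])
    # eps[g][t] holds eps_fd for periods 1..T+1; it is 0 at period 1 and at period T+1.
    eps = [[Fraction(0)] * (T + 1) for _ in range(G)]
    for t in range(1, T):
        change = [D[g][t] - D[g][t - 1] for g in range(G)]
        mean_change = Fraction(sum(change), G)       # D_{.,t} - D_{.,t-1}
        for g in range(G):
            eps[g][t] = change[g] - mean_change      # residual on period fixed effects
    treated = [(g, t) for g in range(G) for t in range(T) if D[g][t] == 1]
    num = {(g, t): eps[g][t] - eps[g][t + 1] for g, t in treated}
    denom = sum(num.values()) / len(treated)         # average numerator over treated cells
    return {cell: value / denom for cell, value in num.items()}

D = [[0, 1, 1, 1]] * 3 + [[0, 0, 1, 1]] * 2 + [[0, 0, 0, 1]]
w_fd = fd_weights(D)

# Treated cells of groups already treated one period earlier, and cells of new adopters.
already_treated = [(g, t) for (g, t) in w_fd if t > 0 and D[g][t - 1] == 1]
new_adopters = [(g, t) for (g, t) in w_fd if t > 0 and D[g][t - 1] == 0]
negative_cells = [cell for cell in already_treated if w_fd[cell] < 0]
```

In this design the increase in the share of treated groups shrinks every period, so every cell in `already_treated` is expected to carry a negative weight while the cells in `new_adopters` carry positive weights, in line with the "more early adopters" discussion.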
We now come back to general sharp designs where the treatment may not follow
a staggered adoption. Let β̃fd = E(β̂fd | D) denote the expectation of β̂fd conditional
on the vector of treatment assignments D. Just as for β̃fe, one can show that the minimal
value of σ(Δ̃) compatible with β̃fd and Δ̃TR = 0 is $\underline{\sigma}_{fd} = |\tilde{\beta}_{fd}|/\sigma(w_{fd})$, where
$$\sigma(w_{fd}) = \left(\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(w_{fd,g,t} - 1\right)^2\right)^{1/2}$$
is the standard deviation of the wfd-weights. One can also show that $\underline{\underline{\sigma}}_{fd}$, the minimal
value of σ(Δ̃) compatible with β̃fd and Δ̃g,t of a different sign than β̃fd for
all (g, t), has the same expression as $\underline{\underline{\sigma}}_{fe}$, except that one needs to replace the
weights wg,t by the weights wfd,g,t in its definition. Estimators of $\underline{\sigma}_{fe}$ and $\underline{\sigma}_{fd}$ (or $\underline{\underline{\sigma}}_{fe}$
and $\underline{\underline{\sigma}}_{fd}$) can then be used to determine which of β̂fe or β̂fd is more robust to
heterogeneous treatment effects.
Finally, and similarly to the result shown in Corollary 2 for βfe, βfd is equal to δTR
under common trends and the following assumption.
ASSUMPTION 8 (wfd Uncorrelated with Δ̃):
$$E\left[\sum_{(g,t):D_{g,t}=1} \frac{N_{g,t}}{N_1} \left(w_{fd,g,t} - 1\right)\left(\Delta_{g,t} - \Delta^{TR}\right)\right] = 0.$$
Note that under the common trends assumption, one can jointly test Assumption 8
and Assumption 7, the assumption that the weights attached to βfe are uncorrelated
with the Δg,t terms: if β̂fe and β̂fd are significantly different, at least one of these two
assumptions must fail. In the two applications we revisit in Section V, β̂fe and β̂fd are
significantly different.
$$\delta^{S} = E\left[\frac{1}{N_S} \sum_{(i,g,t):t \geq 2,\, D_{g,t} \neq D_{g,t-1}} \left[Y_{i,g,t}(1) - Y_{i,g,t}(0)\right]\right],$$
with $N_S = \sum_{(g,t):t \geq 2,\, D_{g,t} \neq D_{g,t-1}} N_{g,t}$. The term δS is the ATE of all switching cells. In
staggered adoption designs, δS is the average of the treatment effect at the time when
a group starts receiving the treatment, across all groups that become treated at some
point.
We now show that δS can be unbiasedly estimated by a weighted average of DID
estimators. This result holds under the following supplementary assumptions.
(ii) If there is at least one g ∈ {1, …, G } such that Dg,t−1 = 1, Dg,t = 0, then
there exists at least one g′ ≠ g, g′ ∈ {1, …, G } such that Dg′,t−1 = Dg′,t = 1.
The first point of the stable groups assumption requires that between each pair of
consecutive time periods, if there is a “joiner” (i.e., a group switching from being
untreated to treated), then there should be another group that is untreated at both
dates. The second point requires that between each pair of consecutive time periods,
if there is a “leaver” (i.e., a group switching from being treated to untreated), then
there should be another group that is treated at both dates.
Notice that under Assumption 11, groups’ treatments are not indepen-
dent, so Assumption 3 cannot hold. Accordingly, we replace Assumption 3 by
Assumption 12. Assumption 12 requires that conditional on its own treatments, a
group’s outcomes be mean independent of the other groups’ treatments. It is weaker
than Assumption 3. Assumption 11 is necessary to show that our estimator is unbi-
ased, but it is not necessary to show that it is consistent. Accordingly, in Section 5 of
the online Appendix, we show that our estimator is consistent under Assumption 3.
For every g ∈ {1, …, G}, let Dg = (D1,g, …, DT,g).
13 It should be possible to weaken Assumptions 9–10, in particular to account for dynamic effects where Δg,t
may depend on (Dg,1, …, Dg,t−1). This introduces complications that are beyond the scope of this paper, but that we
address in de Chaisemartin and D'Haultfœuille (2020a).
14 Imposing Assumptions 9 and 10 does not change the decompositions obtained in Theorems 1 and 2; Yg,t(1) is
observed for all the treated (g, t) cells entering these decompositions, so those assumptions do not bring identifying
information for those cells.
We can now define our estimator. For all t ∈ {2, …, T} and for all (d, d′)
∈ {0, 1}², let
$$N_{d,d',t} = \sum_{g:D_{g,t}=d,\, D_{g,t-1}=d'} N_{g,t}$$
denote the number of observations with treatment d′ at period t − 1 and d at period t.
Let
$$DID_{+,t} = \sum_{g:D_{g,t}=1,\, D_{g,t-1}=0} \frac{N_{g,t}}{N_{1,0,t}} \left(Y_{g,t} - Y_{g,t-1}\right) - \sum_{g:D_{g,t}=D_{g,t-1}=0} \frac{N_{g,t}}{N_{0,0,t}} \left(Y_{g,t} - Y_{g,t-1}\right),$$
$$DID_{-,t} = \sum_{g:D_{g,t}=D_{g,t-1}=1} \frac{N_{g,t}}{N_{1,1,t}} \left(Y_{g,t} - Y_{g,t-1}\right) - \sum_{g:D_{g,t}=0,\, D_{g,t-1}=1} \frac{N_{g,t}}{N_{0,1,t}} \left(Y_{g,t} - Y_{g,t-1}\right).$$
Note that DID+,t is not defined when there is no group such that Dg,t = 1,
Dg,t−1 = 0, or no group such that Dg,t = 0, Dg,t−1 = 0. In such instances,
we let DID+,t = 0. Similarly, let DID−,t = 0 when there is no group such
that Dg,t = 1, Dg,t−1 = 1 or no group such that Dg,t = 0, Dg,t−1 = 1. Finally, let
$$DID_M = \sum_{t=2}^{T} \left(\frac{N_{1,0,t}}{N_S}\, DID_{+,t} + \frac{N_{0,1,t}}{N_S}\, DID_{-,t}\right).$$
In online Appendix Section 5, we also show that when G goes to infinity, DIDM
is a consistent and asymptotically normal estimator of δS. The DIDM estimator is
computed by the fuzzydid and did_multiplegt Stata packages.
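For readers who want to see the arithmetic of DIDM outside Stata, the following is a minimal sketch and not the implementation used by the fuzzydid or did_multiplegt packages. The helper name `did_m` and the list-of-lists layout are assumptions. It takes cell-level mean outcomes Yg,t, binary cell-level treatments Dg,t, and cell sizes Ng,t, forms the joiners' and leavers' DIDs for each pair of consecutive periods, sets a DID to 0 when one of its two sets of groups is empty, and weights the DIDs by the number of switching observations.

```python
def did_m(Y, D, N=None):
    """DID_M from G x T tables of cell means Y, binary treatments D, and cell sizes N.

    Hypothetical helper; N defaults to one observation per cell.
    """
    G, T = len(D), len(D[0])
    if N is None:
        N = [[1] * T for _ in range(G)]
    total, n_switch = 0.0, 0
    for t in range(1, T):
        def size_and_mean(d_now, d_prev):
            # N_{d,d',t} and the N_{g,t}-weighted mean outcome change of those groups.
            groups = [g for g in range(G) if D[g][t] == d_now and D[g][t - 1] == d_prev]
            size = sum(N[g][t] for g in groups)
            if size == 0:
                return 0, None
            change = sum(N[g][t] * (Y[g][t] - Y[g][t - 1]) for g in groups)
            return size, change / size

        n10, joiners = size_and_mean(1, 0)
        n00, stay_untreated = size_and_mean(0, 0)
        n11, stay_treated = size_and_mean(1, 1)
        n01, leavers = size_and_mean(0, 1)
        # A DID is set to 0 when one of the two sets of groups it compares is empty.
        did_plus = joiners - stay_untreated if n10 and n00 else 0.0
        did_minus = stay_treated - leavers if n11 and n01 else 0.0
        total += n10 * did_plus + n01 * did_minus
        n_switch += n10 + n01
    if n_switch == 0:
        raise ValueError("no cell changes treatment between consecutive periods")
    return total / n_switch

# Invented example: two staggered adopters plus one never-treated group, no noise,
# additive group and period effects, and effects of 1 at adoption and 4 one period later.
D = [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
effect = [[0, 0, 1.0], [0, 1.0, 4.0], [0, 0, 0]]
group_fe, period_fe = [1.0, 2.0, 3.0], [0.0, 0.5, 1.5]
Y = [[group_fe[g] + period_fe[t] + D[g][t] * effect[g][t] for t in range(3)]
     for g in range(3)]
estimate = did_m(Y, D)
```

In this noise-free example `estimate` equals 1, the average effect in the two switching cells, even though the effect of the early adopter grows to 4 in its second treated period.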
Here is the intuition underlying Theorem 3. The estimator DID+,t compares the
evolution of the mean outcome between t − 1 and t in two sets of groups: the joiners,
and those remaining untreated. Under Assumptions 4 and 5, DID+,t estimates
the joiners’ treatment effect. Similarly, DID−,t compares the evolution of the outcome
between t − 1 and t in two sets of groups: those remaining treated, and the
leavers. Under Assumptions 9 and 10, it estimates the leavers’ treatment effect.
Finally, DIDM is a weighted average of those DID estimators. Note that in staggered
designs, there are no groups whose treatment decreases over time, so DIDM is
only a weighted average of the DID+,t estimators. Note also that one can separately
estimate the joiners’ and the leavers’ treatment effect, by computing separately
weighted averages of the DID+,t and DID−,t estimators. The former estimator only
relies on Assumptions 4 and 5, while the latter only relies on Assumptions 9 and 10.
Note that DIDM is related to two other estimators. First, it is related to the Wald-TC
estimator in point 2 of Theorem S1 in the online Appendix of de Chaisemartin and
D’Haultfœuille (2018), but the weighting of DID+,t and DID−,t therein differs. As
a result, DIDM estimates ΔS under weaker assumptions. Second, DIDM is related to
the multiperiod DID estimator in Imai and Kim (2018). However, the m ultiperiod
DID estimator is a weighted average of the DID+,t, so it does not estimate the leav-
ers’ treatment effect, and applies to a smaller population. Besides, Imai and Kim
(2018) do not establish the properties of their estimator. Finally, they do not gen-
eralize it to nonbinary treatments, something we do in online Appendix Section 4.
There may be a b ias-variance trade-off between D IDMand the two-way fixed
effects regression estimators. For instance, assume that Regression 1 is correctly
specified:
Then, if the errors $\varepsilon_{g,t}$ are homoskedastic and uncorrelated, it follows from the
Gauss-Markov theorem that $\hat{\beta}_{fe}$ is the linear unbiased estimator of $\delta$, the constant
treatment effect parameter, with the lowest variance. As $\mathrm{DID}_M$ is also an unbiased linear
estimator of $\delta$, the variance of $\hat{\beta}_{fe}$ must be lower than that of $\mathrm{DID}_M$. With
heteroskedastic or correlated errors, one can construct examples where the variance of $\hat{\beta}_{fe}$ is
higher than that of $\mathrm{DID}_M$, but this still suggests that $\mathrm{DID}_M$ may often have a larger
variance than $\hat{\beta}_{fe}$, as we find in our applications in Section V.
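The Gauss-Markov argument can be checked numerically on a toy design. The sketch below is an illustration under stated assumptions, not a result from the paper: a balanced panel with one observation per cell, a hypothetical treatment path with 4 groups and 3 periods, and errors with unit variance. Both estimators are linear in the $Y_{g,t}$, so under homoskedastic uncorrelated errors their variances are the sums of their squared coefficients. The function names are hypothetical.

```python
import numpy as np


def fe_coefficients(D):
    """Coefficients a such that beta_fe = sum(a * Y), one observation per cell."""
    D = np.asarray(D, dtype=float)
    # Residual of D on group and period fixed effects (double demeaning).
    eps = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()
    return eps / np.sum(eps * D)


def did_m_coefficients(D):
    """Coefficients a such that DID_M = sum(a * Y), same setting."""
    D = np.asarray(D)
    a = np.zeros(D.shape)
    n_s = 0.0
    for t in range(1, D.shape[1]):
        joiners = ((D[:, t - 1] == 0) & (D[:, t] == 1)).astype(float)
        leavers = ((D[:, t - 1] == 1) & (D[:, t] == 0)).astype(float)
        stay_0 = ((D[:, t - 1] == 0) & (D[:, t] == 0)).astype(float)
        stay_1 = ((D[:, t - 1] == 1) & (D[:, t] == 1)).astype(float)
        n_s += joiners.sum() + leavers.sum()
        # Weights on Y_t - Y_{t-1} of (number of switchers) * DID term.
        step = np.zeros(D.shape[0])
        if joiners.any() and stay_0.any():
            step += joiners - joiners.sum() * stay_0 / stay_0.sum()
        if leavers.any() and stay_1.any():
            step += leavers.sum() * stay_1 / stay_1.sum() - leavers
        a[:, t] += step
        a[:, t - 1] -= step
    return a / n_s


# Hypothetical design: one never-treated group, one joiner, one leaver,
# one always-treated group.
D = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]])
a_fe = fe_coefficients(D)
a_did = did_m_coefficients(D)

# With unit-variance, uncorrelated errors, Var(sum(a * Y)) = sum(a ** 2).
var_fe = np.sum(a_fe ** 2)
var_did = np.sum(a_did ** 2)
```

In this design both coefficient arrays sum to zero within each group and each period and put total weight 1 on treated cells, so both estimators are unbiased for $\delta$ under the constant-effect specification, and `var_fe` is smaller than `var_did`, as Gauss-Markov requires.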
Note that $\mathrm{DID}_M$ uses groups whose treatment is stable to infer the trends that
would have affected switchers if their treatment had not changed. This strategy
could fail if switchers experience different trends than groups whose treatment is
stable. To assess whether this is a serious concern, we propose to use the following placebo
estimator, which essentially compares the outcome’s evolution from $t-2$ to $t-1$ in
groups that switch and do not switch treatment between $t-1$ and $t$. This placebo
estimator is defined under a modified version of Assumption 11.
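ASSUMPTION 13 (Existence of “Stable” Groups for the Placebo Test): For all
$t \in \{3, \ldots, T\}$:
(i) If there is at least one $g$ such that $D_{g,t-2} = D_{g,t-1} = 0$ and $D_{g,t} = 1$, then
there exists at least one $g' \neq g$ such that $D_{g',t-2} = D_{g',t-1} = D_{g',t} = 0$.
(ii) If there is at least one $g$ such that $D_{g,t-2} = D_{g,t-1} = 1$ and $D_{g,t} = 0$, then
there exists at least one $g' \neq g$ such that $D_{g',t-2} = D_{g',t-1} = D_{g',t} = 1$.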
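For all $t \in \{2, \ldots, T\}$ and for all $(d, d', d'') \in \{0, 1\}^3$, let
$$N_{d,d',d'',t} = \sum_{g:\, D_{g,t}=d,\, D_{g,t-1}=d',\, D_{g,t-2}=d''} N_{g,t}$$
denote the number of observations with treatment status $d''$ at period $t-2$, $d'$ at
period $t-1$, and $d$ at period $t$. Let
$$N^{pl}_S = \sum_{(g,t):\, t \geq 3,\, D_{g,t} \neq D_{g,t-1} = D_{g,t-2}} N_{g,t},$$
$$\mathrm{DID}^{pl}_{+,t} = \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=D_{g,t-2}=0} \frac{N_{g,t}}{N_{1,0,0,t}} \left(Y_{g,t-1} - Y_{g,t-2}\right) - \sum_{g:\, D_{g,t}=D_{g,t-1}=D_{g,t-2}=0} \frac{N_{g,t}}{N_{0,0,0,t}} \left(Y_{g,t-1} - Y_{g,t-2}\right),$$
$$\mathrm{DID}^{pl}_{-,t} = \sum_{g:\, D_{g,t}=D_{g,t-1}=D_{g,t-2}=1} \frac{N_{g,t}}{N_{1,1,1,t}} \left(Y_{g,t-1} - Y_{g,t-2}\right) - \sum_{g:\, D_{g,t}=0,\, D_{g,t-1}=D_{g,t-2}=1} \frac{N_{g,t}}{N_{0,1,1,t}} \left(Y_{g,t-1} - Y_{g,t-2}\right).$$
When there is no group such that $D_{g,t} = 1$, $D_{g,t-1} = D_{g,t-2} = 0$ or no group such
that $D_{g,t} = D_{g,t-1} = D_{g,t-2} = 0$, we let $\mathrm{DID}^{pl}_{+,t} = 0$, and we adopt the same
convention for $\mathrm{DID}^{pl}_{-,t}$. Let
$$\mathrm{DID}^{pl}_M = \sum_{t=3}^{T} \left( \frac{N_{1,0,0,t}}{N^{pl}_S}\, \mathrm{DID}^{pl}_{+,t} + \frac{N_{0,1,1,t}}{N^{pl}_S}\, \mathrm{DID}^{pl}_{-,t} \right).$$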
THEOREM 4: If Assumptions 1, 2, 4, 5, 9, 10, 12, and 13 hold, then $E[\mathrm{DID}^{pl}_M] = 0$.
Therefore, finding $\mathrm{DID}^{pl}_M$ significantly different from 0 would imply that those
assumptions are violated: groups that switch treatment experience different trends before that switch
than the groups used to reconstruct their counterfactual trends when they switch.15
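As a minimal sketch of the placebo estimator, assuming a sharp design with a binary treatment and cell-level data, the Python function below compares lagged outcome evolutions between switchers and stable groups. The function name `did_m_placebo`, the array layout (groups in rows, periods in columns) and the toy data are hypothetical; this is not the code of the Stata packages mentioned in the paper.

```python
import numpy as np


def did_m_placebo(Y, D, N):
    """Sketch of the placebo DID^pl_M. Y, D, N are (G, T) arrays."""
    Y, D, N = (np.asarray(a, dtype=float) for a in (Y, D, N))
    total = 0.0  # sum over t of N_{1,0,0,t} DID^pl_{+,t} + N_{0,1,1,t} DID^pl_{-,t}
    n_pl = 0.0   # N^pl_S: switching cells with stable treatment from t-2 to t-1
    for t in range(2, Y.shape[1]):
        lag_d_y = Y[:, t - 1] - Y[:, t - 2]  # evolution one period before
        n_t = N[:, t]
        pre_0 = (D[:, t - 2] == 0) & (D[:, t - 1] == 0)
        pre_1 = (D[:, t - 2] == 1) & (D[:, t - 1] == 1)
        joiners, stay_0 = pre_0 & (D[:, t] == 1), pre_0 & (D[:, t] == 0)
        leavers, stay_1 = pre_1 & (D[:, t] == 0), pre_1 & (D[:, t] == 1)
        n_pl += n_t[joiners].sum() + n_t[leavers].sum()
        if joiners.any() and stay_0.any():
            did_plus = (np.average(lag_d_y[joiners], weights=n_t[joiners])
                        - np.average(lag_d_y[stay_0], weights=n_t[stay_0]))
            total += n_t[joiners].sum() * did_plus
        if leavers.any() and stay_1.any():
            did_minus = (np.average(lag_d_y[stay_1], weights=n_t[stay_1])
                         - np.average(lag_d_y[leavers], weights=n_t[leavers]))
            total += n_t[leavers].sum() * did_minus
    if n_pl == 0:
        raise ValueError("no switching cell with a stable pre-period")
    return total / n_pl


# Hypothetical toy panel in which common trends hold exactly.
gamma = np.array([0.0, 1.0, -2.0, 3.0])[:, None]
lam = np.array([0.0, 0.5, 1.5])[None, :]
D = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]])
Y = gamma + lam + 2.0 * D
N = np.ones_like(Y)
placebo_ok = did_m_placebo(Y, D, N)

# Same panel, but the group that later leaves follows its own linear trend.
Y_trend = Y.copy()
Y_trend[2] += np.array([0.0, 1.0, 2.0])
placebo_bad = did_m_placebo(Y_trend, D, N)
```

In the first panel the placebo is zero; in the second, the leaver's differential pre-trend of 1 per period shows up as a placebo estimate of -1.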
Note that $\mathrm{DID}^{pl}_M$ compares the trends of switching and stable groups one period
before the switch. One can define other placebo estimators comparing those trends,
say, two or three periods before the switch. The $\mathrm{DID}^{pl}_M$ estimator and all those other
15 See also Callaway and Sant’Anna (2018), which proposes another placebo test in staggered adoption designs.
IV. Extensions
In this section, we briefly review some of the extensions in our online Appendix.
First, we show that the decomposition of $\beta_{fe}$ in Theorem 1 can be extended to fuzzy
designs where the treatment varies within $(g, t)$ cells and to applications with a
nonbinary treatment.16 In fuzzy designs or with a nonbinary treatment, the weights in
Theorem 1 remain essentially unchanged.
We also consider two-way fixed effects regressions with covariates. Specifically,
we study the coefficient of $D_{g,t}$ in a regression of $Y_{i,g,t}$ on group and period fixed
effects, $D_{g,t}$, and a vector of covariates $X_{g,t}$. We show that a result very similar to
Theorem 1 applies to that coefficient, up to two differences. First, including
covariates allows for different trends across groups, provided those differential trends are
fully accounted for by a linear model in $X_{g,t} - X_{g,t-1}$, the change in a group’s
covariates. Specifically, instead of Assumptions 4 and 5, one needs to assume that
for some vector $\gamma$ and constant $\lambda_t$, and where $X_g = (X_{g,1}, \ldots, X_{g,T})$. Importantly,
when the covariates are group-specific linear trends, the equation above is
equivalent to
meaning that from $t-1$ to $t$, the evolution of $Y(0)$ in group $g$ should deviate from
its group-specific linear trend $\gamma_g$ by an amount $\lambda_t$ common to all groups. Second, the
residual $\varepsilon_{g,t}$ in the weights in Theorem 1 has to be replaced by $\varepsilon^X_{g,t}$, the residual of
observations in cell $(g, t)$ in the regression of $D_{g,t}$ on group and period fixed effects
and $X_{g,t}$. Some of the corresponding weights may still be negative, as in Theorem 1.
Overall, two-way fixed effects regressions with covariates may rely on a more
plausible common trends assumption than those without covariates, but they still
require that the treatment effect be homogeneous across time and between groups.
Third, we show that under the common trends assumption and the assumption
that the ATE of a $(g, t)$ cell does not change over time, $\beta_{fe}$ and $\beta_{fd}$ identify weighted
sums of the ATEs of the $(g, t)$ cells whose treatment changes between $t-1$ and $t$.
In sharp designs, the weights attached to $\beta_{fd}$ are all positive, while for $\beta_{fe}$, the same
only holds in staggered adoption designs.
Fourth, we show that our $\mathrm{DID}_M$ estimator can easily be extended to nonbinary,
discrete treatments. Then, we define it as a weighted average of DID terms
comparing the evolution of the outcome in groups whose treatment went from $d$ to $d'$
between $t-1$ and $t$ and in groups with a treatment of $d$ at both dates, across all
possible values of $d$, $d'$, and $t$.
Finally, our twowayfeweights, fuzzydid, and did_multiplegt Stata packages can
handle all of those extensions.
16 The decomposition of $\beta_{fd}$ in Theorem 2 can also be extended to all of those cases.
Table 1—Papers Using Two-Way Fixed Effects Regressions Published in the AER
Note: This table reports the number of papers using two-way fixed effects regressions
published in the AER from 2010 to 2012.

                                                                              Number of papers
Panel A. Estimation method
  Fixed effects OLS regression                                                       13
  First-difference OLS regression                                                     6
  Fixed effects or first-difference OLS regression, with several treatment variables  6
  Fixed effects or first-difference 2SLS regression                                   3
  Other regression                                                                    5

Note: This table reports the estimation method and the research design used in the 33 papers
using two-way fixed effects regressions published in the AER from 2010 to 2012, and whether
those papers have stable groups.
A. Applicability
17 For instance, two papers use regressions with three-way fixed effects instead of two-way fixed effects.
papers consider sharp designs, while less than one-fourth consider fuzzy designs.
Finally, panel C assesses whether, in those applications, there are groups whose
exposure to the treatment remains stable between each pair of consecutive time
periods, the condition that has to be met to be able to compute the $\mathrm{DID}_M$ estimator. For
about one-half of the papers, reading the paper was not enough to assess this with
certainty. We then assessed whether they presumably have stable groups. Overall,
12 papers have stable groups, 14 presumably have stable groups, 5 presumably do
not have stable groups, and 2 do not have stable groups.
In online Appendix Section 6, we review each of the 33 papers. We explain where
two-way fixed effects regressions are used in the paper, and we detail our assess-
ment of whether the design is a sharp or a fuzzy design, and of whether the stable
groups assumption holds.
Gentzkow, Shapiro, and Sinkinson (2011) studies the effect of newspapers on
voters’ turnout in US presidential elections between 1868 and 1928. They regress the
first difference of the turnout rate in county $g$ between election years $t-1$ and $t$ on
state-year fixed effects and on the first difference of the number of newspapers
available in that county. This corresponds to Regression 2, with state-year fixed effects
as controls. As reproduced in Table 3, Gentzkow, Shapiro, and Sinkinson (2011)
finds that $\hat{\beta}_{fd} = 0.0026$ (standard error $= 9 \times 10^{-4}$). According to this
regression, one more newspaper increased voters’ turnout by 0.26 percentage points. On
the other hand, $\hat{\beta}_{fe} = -0.0011$ (standard error = 0.0011). Here, $\hat{\beta}_{fe}$ and $\hat{\beta}_{fd}$ are
significantly different (t-statistic = 2.86).
We use the twowayfeweights Stata package, downloadable with its help file from
the SSC repository, to estimate the weights attached to $\hat{\beta}_{fe}$. We find that 60 percent
are strictly positive, 40 percent are strictly negative. The negative weights sum to
−0.53. We find $\hat{\underline{\sigma}}_{fe} = 3 \times 10^{-4}$, meaning that $\beta_{fe}$ and the ATT may be of opposite
signs if the standard deviation of the ATEs across all the treated $(g, t)$ cells is equal
to 0.0003.18 Further, $\hat{\underline{\underline{\sigma}}}_{fe} = 7 \times 10^{-4}$, meaning that $\beta_{fe}$ may be of a different sign
than the ATEs of all the treated $(g, t)$ cells if the standard deviation of those ATEs is
equal to 0.0007. We also estimate the weights attached to $\hat{\beta}_{fd}$. Here, 54 percent are
strictly positive, and 46 percent are strictly negative. The negative weights sum to
−1.43. We find $\hat{\underline{\sigma}}_{fd} = 4 \times 10^{-4}$, and $\hat{\underline{\underline{\sigma}}}_{fd} = 6 \times 10^{-4}$.
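To make these diagnostics concrete, here is a hedged Python sketch for the simplest case only: a balanced panel with a binary treatment and equal cell sizes, where the residual of the treatment on group and period fixed effects is $D_{g,t} - D_{g,.} - D_{.,t} + D_{.,.}$ and the weights are that residual rescaled to average 1 over treated cells. The function names and the two-group, three-period design are hypothetical, and this is not the twowayfeweights implementation. The bound is computed as $|\beta_{fe}|$ divided by the standard deviation of the weights, following the Cauchy-Schwarz argument in the proof of Corollary 1.

```python
import numpy as np


def twfe_weights(D):
    """Weights w_{g,t} on treated cells (balanced panel, equal cell sizes)."""
    D = np.asarray(D, dtype=float)
    # Residual of D on group and period fixed effects (double demeaning).
    eps = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()
    treated = D == 1
    w = eps[treated] / eps[treated].mean()  # average of w over treated cells is 1
    cells = np.argwhere(treated)            # (g, t) index of each treated cell
    return cells, w


def sigma_lower_bound(beta_fe, w):
    """|beta_fe| / sd(w): minimal sd of the ATEs compatible with a zero ATT."""
    sd_w = np.sqrt(np.mean((w - 1.0) ** 2))  # undefined if all weights equal 1
    return abs(beta_fe) / sd_w


# Hypothetical staggered design: group 0 treated in the last period only,
# group 1 treated in the last two periods.
D = np.array([[0, 0, 1], [0, 1, 1]])
cells, w = twfe_weights(D)
W = w / len(w)             # weights that sum to 1 across treated cells
neg_sum = W[W < 0].sum()   # sum of the negative weights
bound = sigma_lower_bound(0.7, w)  # 0.7 is an arbitrary illustrative beta_fe
```

In this design `W` puts weight 1/2 on the first group's treated cell, 1 on the second group's first treated period, and -1/2 on its second treated period, so the negative weights sum to -1/2.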
Therefore, $\beta_{fe}$ and $\beta_{fd}$ can only receive a causal interpretation if the weights
attached to them are uncorrelated with the intensity of the treatment effect in each
county × election-year cell (Assumptions 7 and 8, respectively). This is not
warranted. First, as $\hat{\beta}_{fe}$ and $\hat{\beta}_{fd}$ significantly differ, Assumptions 7 and 8 cannot jointly
hold. Moreover, the weights attached to $\hat{\beta}_{fe}$ and $\hat{\beta}_{fd}$ are correlated with variables
that are likely to be themselves associated with the intensity of the treatment effect
in each cell. For instance, the correlation between the weights attached to $\hat{\beta}_{fd}$ and $t$,
the year variable, is equal to −0.06 (t-statistic = −3.28). The effect of newspapers
may be different in the last than in the first years of the panel. For instance, new
18 The number of newspapers is not binary, so strictly speaking, in this application the parameter of interest is
the average causal response parameter introduced in online Appendix Section 3.2, rather than the ATT.
Notes: This table reports estimates of the effect of one additional newspaper on turnout, as
well as a placebo estimate of the common trends assumption underlying $\mathrm{DID}_M$. Estimators are
computed using the data of Gentzkow, Shapiro, and Sinkinson (2011), with state-year fixed
effects as controls. Standard errors are clustered by county. To compute the $\mathrm{DID}_M$ estimators,
the number of newspapers is grouped into 4 categories: 0, 1, 2, and more than 3.
means of communication, like the radio, appear in the end of the period under con-
sideration, and may diminish the effect of newspapers.19 This would lead to a vio-
lation of Assumption 8.
The stable groups assumption holds: between each pair of consecutive elections,
there are counties where the number of newspapers does not change. We use the
fuzzydid Stata package, downloadable with its help file from the SSC repository, to
estimate a modified version of our $\mathrm{DID}_M$ estimator that accounts for the fact that
the number of newspapers is not binary (see online Appendix Section 3.2, where
we define this modified estimator). We include state-year fixed effects as controls
in our estimation. We find that $\mathrm{DID}_M = 0.0043$, with a standard error of 0.0014.
Therefore, $\mathrm{DID}_M$ is 66 percent larger than $\hat{\beta}_{fd}$, and the two estimators are
significantly different at the 10 percent level (t-statistic = 1.77); $\mathrm{DID}_M$ is also of a
different sign than $\hat{\beta}_{fe}$.
Our $\mathrm{DID}_M$ estimator only relies on a common trends assumption. To assess its
plausibility, we compute $\mathrm{DID}^{pl}_M$, the placebo estimator introduced in Section III.20
As shown in Table 3, our placebo estimator is small and not significantly
different from 0, meaning that counties where the number of newspapers increased or
decreased between $t-1$ and $t$ did not experience significantly different trends
in turnout from $t-2$ to $t-1$ than counties where that number was stable. Our
placebo estimator is estimated on a subset of the data: for each pair of
consecutive time periods $t-1$ and $t$, we only keep counties where the number of
newspapers did not change between $t-2$ and $t-1$. Still, almost 80 percent of the
county × election-year observations are used in the computation of the placebo
estimator. Moreover, when reestimated on this subsample, the $\mathrm{DID}_M$ estimator is
very close to the $\mathrm{DID}_M$ estimator in the full sample.
19 In fact, Gentzkow, Shapiro, and Sinkinson (2011) analyzes the 1868 to 1928 period separately from later
periods, because the growth of the radio may have changed newspapers’ effects.
20 Again, we need to slightly modify $\mathrm{DID}^{pl}_M$ to account for the fact that the number of newspapers is not binary.
Notes: This table reports estimates of the effect of the union premium, as well as placebo esti-
mators of the common trends assumption. Estimators are computed using the data of Vella and
Verbeek (1998). Standard errors are clustered at the worker level.
Jakubson (1991) has found an 8.3 percent union membership premium using that strategy, in
a sample of American males from the PSID followed from 1976 to 1980. Vella and
Verbeek (1998) estimates a similar regression and finds similar results, in a sample of
young American males from the NLSY followed from 1980 to 1987.21
We use the data in Vella and Verbeek (1998) to compute various estimators
of the union wage premium. As union status is often measured with
error (see, e.g., Freeman 1984; Card 1996), we discard changes in union status
happening twice in three consecutive years. Specifically, for individuals
with $D_{i,t-1} = 0$, $D_{i,t} = 1$, and $D_{i,t+1} = 0$, we replace $D_{i,t}$ by 0. Similarly, for
individuals with $D_{i,t-1} = 1$, $D_{i,t} = 0$, and $D_{i,t+1} = 1$, we replace $D_{i,t}$ by 1. Doing so,
we discard half of the union status changes in the initial data.22
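The cleaning rule above can be sketched in a few lines of Python. The function name `smooth_one_period_switches` and the example sequence are hypothetical. One point the text leaves open is how overlapping patterns such as 0, 1, 0, 1, 0 are handled; as an assumption, the sketch detects every pattern on the original sequence rather than on the partially corrected one.

```python
import numpy as np


def smooth_one_period_switches(d):
    """Undo status changes that are reversed in the very next period.

    d is one individual's 0/1 status over consecutive years. A value that
    differs from two identical neighbours (0,1,0 or 1,0,1) is replaced by
    the neighbours' value. Patterns are read from the original sequence.
    """
    d = np.asarray(d, dtype=int)
    out = d.copy()
    for t in range(1, len(d) - 1):
        if d[t - 1] == d[t + 1] and d[t] != d[t - 1]:
            out[t] = d[t - 1]
    return out


# Hypothetical union status history of one worker over eight years.
raw = [0, 1, 0, 0, 1, 1, 0, 1]
cleaned = smooth_one_period_switches(raw)
```

Here the isolated 1 in the second year and the isolated 0 in the seventh year are overwritten, while the lasting switch into union membership in the fifth year is kept.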
We start by estimating a two-way fixed effects regression of wages on union
membership with worker and year fixed effects. Table 4 shows that $\hat{\beta}_{fe} = 0.107$
(standard error = 0.030), a result close to that of the worker fixed effects
regressions in Jakubson (1991) and Vella and Verbeek (1998).
Then, we estimate the weights attached to $\hat{\beta}_{fe}$. Here, 820 are strictly positive, 196
are strictly negative, but the negative weights only sum to −0.01. Still, $\hat{\underline{\sigma}}_{fe} = 0.097$,
meaning that $\beta_{fe}$ and the ATT may be of opposite signs if the standard deviation
of the treatment effect across the unionized worker × year observations is equal
to 0.097, a substantial but still possible amount of heterogeneity. The weights are
negatively correlated with workers’ years of schooling (correlation = −0.12,
t-statistic = −1.88). The union premium may be lower for more educated
workers (see Freeman and Medoff 1984), as they may be less substitutable than less
educated ones. Then, $\hat{\beta}_{fe}$ may overestimate $\delta^{TR}$, the average union premium across
all unionized worker × year observations. We also find that $\hat{\beta}_{fd} = 0.060$ (standard
error = 0.032) and that $\hat{\beta}_{fe}$ and $\hat{\beta}_{fd}$ significantly differ (t-statistic = 1.91),23 thus
casting further doubt on Assumptions 7 and 8.
21 The fixed effects regression is not the main specification in Vella and Verbeek (1998). The authors favor
instead a dynamic selection model.
22 Keeping the original data does not change much the results presented below, except that the placebo
estimator $\mathrm{DID}^{pl,2}_M$ becomes significant.
23 The standard error of $\hat{\beta}_{fe} - \hat{\beta}_{fd}$ is computed with a worker-level clustered bootstrap.
The stable groups assumption holds: between each pair of consecutive years, there
are workers whose union membership status does not change. We therefore compute
our $\mathrm{DID}_M$ estimator. Table 4 shows that it is equal to 0.041 (standard error = 0.034).
In this case $\mathrm{DID}_M$ is significantly different from $\hat{\beta}_{fe}$ (t-statistic = 2.60) and $\hat{\beta}_{fd}$
(t-statistic = 2.36).24 As discussed in Section III, we can also estimate separately
the union premium for workers joining and leaving a union, something that was
previously done by Freeman (1984). The joiners’ effect estimate is equal to 0.059
(standard error = 0.053), the leavers’ effect is equal to 0.021 (standard error = 0.044),
and the two estimates do not significantly differ (t-statistic = 0.55).
The estimator $\mathrm{DID}_M$ relies on a common trends assumption. To assess its
plausibility, we compute $\mathrm{DID}^{pl}_M$, the placebo estimator introduced in Section III; $\mathrm{DID}^{pl}_M$
compares the wage growth of workers changing and not changing their union
status one period before that change. We also compute $\mathrm{DID}^{pl,2}_M$ and $\mathrm{DID}^{pl,3}_M$, two
other placebo estimators performing the same comparison two and three periods
before the change. As shown in Table 4, $\mathrm{DID}^{pl}_M$ is large, positive, and significant
(t-statistic = 2.49). On the other hand, $\mathrm{DID}^{pl,2}_M$ and $\mathrm{DID}^{pl,3}_M$ are smaller and
insignificant.
VI. Conclusion
24 The standard errors of $\hat{\beta}_{fe} - \mathrm{DID}_M$ and $\hat{\beta}_{fd} - \mathrm{DID}_M$ are computed with a worker-level clustered bootstrap.
Appendix A. Proofs
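$$E(Y_{g,t} | D) - E(Y_{g,t'} | D) - \left(E(Y_{g',t} | D) - E(Y_{g',t'} | D)\right) = D_{g,t} E(\Delta_{g,t} | D) - D_{g,t'} E(\Delta_{g,t'} | D) - \left(D_{g',t} E(\Delta_{g',t} | D) - D_{g',t'} E(\Delta_{g',t'} | D)\right).$$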
PROOF OF LEMMA 1:
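For all $(g, t) \in \{1, \ldots, G\} \times \{1, \ldots, T\}$,
\begin{align*}
E(Y_{g,t} | D) &= E\left(\frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} Y_{i,g,t} \,\Big|\, D\right) \\
&= E\left(\frac{1}{N_{g,t}} \sum_{i=1}^{N_{g,t}} \left(Y_{i,g,t}(0) + D_{i,g,t}\left(Y_{i,g,t}(1) - Y_{i,g,t}(0)\right)\right) \Big|\, D\right) \\
&= E(Y_{g,t}(0) | D) + D_{g,t} E(\Delta_{g,t} | D) \\
&= E(Y_{g,t}(0) | D_g) + D_{g,t} E(\Delta_{g,t} | D),
\end{align*}
where the third equality follows from Assumption 2, and the fourth from
Assumption 3. Therefore,
\begin{align*}
& E(Y_{g,t} | D) - E(Y_{g,t'} | D) - \left(E(Y_{g',t} | D) - E(Y_{g',t'} | D)\right) \\
&= E(Y_{g,t}(0) - Y_{g,t'}(0) | D_g) - E(Y_{g',t}(0) - Y_{g',t'}(0) | D_{g'}) \\
&\quad + D_{g,t} E(\Delta_{g,t} | D) - D_{g,t'} E(\Delta_{g,t'} | D) - \left(D_{g',t} E(\Delta_{g',t} | D) - D_{g',t'} E(\Delta_{g',t'} | D)\right) \\
&= E(Y_{g,t}(0) - Y_{g,t'}(0)) - E(Y_{g',t}(0) - Y_{g',t'}(0)) \\
&\quad + D_{g,t} E(\Delta_{g,t} | D) - D_{g,t'} E(\Delta_{g,t'} | D) - \left(D_{g',t} E(\Delta_{g',t} | D) - D_{g',t'} E(\Delta_{g',t'} | D)\right) \\
&= D_{g,t} E(\Delta_{g,t} | D) - D_{g,t'} E(\Delta_{g,t'} | D) - \left(D_{g',t} E(\Delta_{g',t} | D) - D_{g',t'} E(\Delta_{g',t'} | D)\right),
\end{align*}
where the second equality follows from Assumption 4, and the third from
Assumption 5. ∎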
PROOF OF THEOREM 1:
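It follows from the Frisch-Waugh theorem and the definition of $\varepsilon_{g,t}$ that
$$E(\hat{\beta}_{fe} | D) = \frac{\sum_{g,t} N_{g,t}\, \varepsilon_{g,t}\, E(Y_{g,t} | D)}{\sum_{g,t} N_{g,t}\, \varepsilon_{g,t}\, D_{g,t}}. \tag{A1}$$
Now, by definition of $\varepsilon_{g,t}$ again,
$$\sum_{t=1}^{T} N_{g,t}\, \varepsilon_{g,t} = 0 \quad \text{for all } g \in \{1, \ldots, G\}, \tag{A2}$$
$$\sum_{g=1}^{G} N_{g,t}\, \varepsilon_{g,t} = 0 \quad \text{for all } t \in \{1, \ldots, T\}. \tag{A3}$$
Then,
\begin{align*}
\sum_{g,t} N_{g,t}\, \varepsilon_{g,t}\, E(Y_{g,t} | D)
&= \sum_{g,t} N_{g,t}\, \varepsilon_{g,t} \left(E(Y_{g,t} | D) - E(Y_{g,1} | D) - E(Y_{1,t} | D) + E(Y_{1,1} | D)\right) \\
&= \sum_{g,t} N_{g,t}\, \varepsilon_{g,t} \left(D_{g,t} E(\Delta_{g,t} | D) - D_{g,1} E(\Delta_{g,1} | D) - D_{1,t} E(\Delta_{1,t} | D) + D_{1,1} E(\Delta_{1,1} | D)\right) \\
&= \sum_{g,t} N_{g,t}\, \varepsilon_{g,t}\, D_{g,t}\, E(\Delta_{g,t} | D) \\
&= \sum_{(g,t):\, D_{g,t}=1} N_{g,t}\, \varepsilon_{g,t}\, E(\Delta_{g,t} | D). \tag{A4}
\end{align*}
The first and third equalities follow from equations (A2) and (A3). The second
equality follows from Lemma 1. The fourth equality follows from Assumption 2.
Finally, Assumption 2 implies that
$$\sum_{g,t} N_{g,t}\, \varepsilon_{g,t}\, D_{g,t} = \sum_{(g,t):\, D_{g,t}=1} N_{g,t}\, \varepsilon_{g,t}. \tag{A6}$$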
PROOF OF PROPOSITION 1:
If for all $t \geq 2$, $N_{g,t}/N_{g,t-1}$ does not depend on $t$, then it follows from the
first order conditions attached to Regression 1 and a few lines of algebra
that $\varepsilon_{g,t} = D_{g,t} - D_{g,.} - D_{.,t} + D_{.,.}$. Therefore, $w_{g,t}$ is proportional
to $D_{g,t} - D_{g,.} - D_{.,t} + D_{.,.}$. Then, for all $(g, t, t')$ such that $D_{g,t} = D_{g,t'}$
PROOF OF COROLLARY 1:
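Proof of the First Point.—If the assumptions of the corollary hold and
$\tilde{\Delta}^{TR} = 0$, then
$$\begin{cases} \tilde{\beta}_{fe} = \sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\, \tilde{\Delta}_{g,t}, \\ 0 = \sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, \tilde{\Delta}_{g,t}, \end{cases}$$
where the first equality follows from (A7). These two conditions and the
Cauchy-Schwarz inequality imply
$$|\tilde{\beta}_{fe}| = \left| \sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1} (w_{g,t} - 1)(\tilde{\Delta}_{g,t} - \tilde{\Delta}^{TR}) \right| \leq \sigma(W)\, \sigma(\tilde{\Delta}).$$
Hence, $\sigma(\tilde{\Delta}) \geq \underline{\sigma}_{fe}$.
Now, we prove that we can rationalize this lower bound. Let us define
$$\tilde{\Delta}^{TR}_{g,t} = \frac{\tilde{\beta}_{fe}\, (w_{g,t} - 1)}{\sigma^2(W)}.$$
Then,
$$\sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, \tilde{\Delta}^{TR}_{g,t} = 0,$$
as it follows from the definition of $w_{g,t}$ that $\sum_{(g,t):\, D_{g,t}=1} (N_{g,t}/N_1)\, w_{g,t} = 1$.
Similarly,
\begin{align*}
\sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\, \frac{\tilde{\beta}_{fe}\, (w_{g,t} - 1)}{\sigma^2(W)}
&= \frac{\tilde{\beta}_{fe}}{\sigma^2(W)} \sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\, (w_{g,t} - 1) \\
&= \frac{\tilde{\beta}_{fe}}{\sigma^2(W)} \sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, (w_{g,t} - 1)^2 \\
&= \tilde{\beta}_{fe},
\end{align*}
where the second equality follows again from the fact that $\sum_{(g,t):\, D_{g,t}=1} (N_{g,t}/N_1)\, w_{g,t} = 1$.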
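The rationalization step can be verified numerically. The sketch below is only an illustration: the weights, the value of $\tilde{\beta}_{fe}$ and the function name `rationalizing_effects` are hypothetical, and the cell shares $N_{g,t}/N_1$ are taken to be equal. It builds the effects $\tilde{\beta}_{fe}(w_{g,t} - 1)/\sigma^2(W)$ and checks the three properties used in the proof: they average to zero, their weighted sum is $\tilde{\beta}_{fe}$, and their standard deviation equals $|\tilde{\beta}_{fe}|/\sigma(W)$.

```python
import numpy as np


def rationalizing_effects(beta_fe, w, shares):
    """Effects beta_fe * (w - 1) / sigma^2(W) from the proof's construction.

    w are the weights on treated cells and shares the ratios N_{g,t} / N_1;
    the shares must sum to 1 and the share-weighted mean of w must be 1.
    """
    w = np.asarray(w, dtype=float)
    shares = np.asarray(shares, dtype=float)
    var_w = np.sum(shares * (w - 1.0) ** 2)  # sigma^2(W)
    return beta_fe * (w - 1.0) / var_w


# Hypothetical inputs: three treated cells of equal size.
w = np.array([1.5, 3.0, -1.5])
shares = np.full(3, 1.0 / 3.0)
beta_fe = 0.7
delta = rationalizing_effects(beta_fe, w, shares)

att = np.sum(shares * delta)                 # average effect across treated cells
weighted_sum = np.sum(shares * w * delta)    # what the regression would estimate
sd_delta = np.sqrt(np.sum(shares * (delta - att) ** 2))
sd_w = np.sqrt(np.sum(shares * (w - 1.0) ** 2))
```

With these inputs the average effect is zero while the weighted sum equals 0.7, so a regression coefficient of 0.7 is compatible with a zero ATT once the effects have standard deviation `abs(beta_fe) / sd_w`.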
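Proof of the Second Point.—We first suppose that $\tilde{\beta}_{fe} > 0$. We seek to solve
$$\min_{\tilde{\Delta}_{(1)}, \ldots, \tilde{\Delta}_{(n)}} \sum_{i=1}^{n} \frac{N_{(i)}}{N_1} \left(\tilde{\Delta}_{(i)} - \tilde{\Delta}^{TR}\right)^2,$$
subject to
$$\tilde{\beta}_{fe} = \sum_{i=1}^{n} \frac{N_{(i)}}{N_1}\, w_{(i)}\, \tilde{\Delta}_{(i)}, \qquad \tilde{\Delta}_{(i)} \leq 0 \ \text{ for all } i \in \{1, \ldots, n\}.$$
Note that
$$\sum_{i=1}^{n} \frac{N_{(i)}}{N_1} \left(\tilde{\Delta}_{(i)} - \sum_{j=1}^{n} \frac{N_{(j)}}{N_1} \tilde{\Delta}_{(j)}\right)^2 = \sum_{i=1}^{n} \frac{N_{(i)}}{N_1} \tilde{\Delta}_{(i)}^2 - \left(\sum_{i=1}^{n} \frac{N_{(i)}}{N_1} \tilde{\Delta}_{(i)}\right)^2.$$
The Karush-Kuhn-Tucker necessary conditions for optimality are that for all $i$:
$$\tilde{\Delta}_{(i)} = \tilde{\Delta}^{TR} + \lambda w_{(i)} - \gamma_{(i)},$$
$$\gamma_{(i)} \geq 0,$$
$$\gamma_{(i)}\, \tilde{\Delta}_{(i)} = 0,$$
where $\tilde{\Delta}^{TR} = \sum_{i=1}^{n} (N_{(i)}/N_1)\, \tilde{\Delta}_{(i)}$, $2\lambda$ is the Lagrange multiplier of the
constraint $\sum_{i=1}^{n} (N_{(i)}/N_1)\, w_{(i)}\, \tilde{\Delta}_{(i)} = \tilde{\beta}_{fe}$ and $2 (N_{(i)}/N_1)\, \gamma_{(i)}$ is the Lagrange multiplier
of the constraint $\tilde{\Delta}_{(i)} \leq 0$.
These constraints imply that $\tilde{\Delta}_{(i)} = 0$ if and only if $\tilde{\Delta}^{TR} + \lambda w_{(i)} \geq 0$. Therefore,
if $\tilde{\Delta}^{TR} + \lambda w_{(i)} < 0$, $\tilde{\Delta}_{(i)} \neq 0$ so $\gamma_{(i)} = 0$, and $\tilde{\Delta}_{(i)} = \tilde{\Delta}^{TR} + \lambda w_{(i)}$. Therefore,
Therefore,
$$\lambda = \frac{\tilde{\beta}_{fe}}{T_s + S_s^2 / (1 - P_s)}.$$
Then, using what precedes,
\begin{align*}
\underline{\underline{\sigma}}^2_{fe} &= \sum_{i \geq s} \frac{N_{(i)}}{N_1} \left(\lambda w_{(i)}\right)^2 + \sum_{i < s} \frac{N_{(i)}}{N_1} \left(\tilde{\Delta}^{TR}\right)^2 \\
&= \lambda^2 T_s + (1 - P_s) \left(\frac{\lambda S_s}{1 - P_s}\right)^2 \\
&= \frac{\tilde{\beta}_{fe}^2}{T_s + S_s^2 / (1 - P_s)}.
\end{align*}
The result follows, once noted that equations (A8) and (A9) imply that
$s = \min\{i \in \{1, \ldots, n\} : w_{(i)} < -S_{(i)}/(1 - P_{(i)})\}$.
Finally, consider the case $\tilde{\beta}_{fe} < 0$. By letting $\tilde{\Delta}'_{(i)} = -\tilde{\Delta}_{(i)}$ and $\tilde{\beta}'_{fe} = -\tilde{\beta}_{fe}$,
we have
$$\underline{\underline{\sigma}}^2_{fe} = \min_{\tilde{\Delta}'_{(1)} \leq 0, \ldots, \tilde{\Delta}'_{(n)} \leq 0} \sum_{i=1}^{n} \frac{N_{(i)}}{N_1} \left(\tilde{\Delta}'_{(i)} - \sum_{j=1}^{n} \frac{N_{(j)}}{N_1} \tilde{\Delta}'_{(j)}\right)^2$$
subject to
$$\sum_{i=1}^{n} \frac{N_{(i)}}{N_1}\, w_{(i)}\, \tilde{\Delta}'_{(i)} = \tilde{\beta}'_{fe}.$$
This is the same program as before, with $\tilde{\beta}'_{fe}$ instead of $\tilde{\beta}_{fe}$. Therefore, by the same
reasoning as before, we obtain
$$\underline{\underline{\sigma}}^2_{fe} = \frac{\tilde{\beta}'^{\,2}_{fe}}{T_s + S_s^2 / (1 - P_s)} = \frac{\tilde{\beta}_{fe}^2}{T_s + S_s^2 / (1 - P_s)}. \ ∎$$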
PROOF OF COROLLARY 2:
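We have
\begin{align*}
\beta_{fe} &= E\left(\sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\, \tilde{\Delta}_{g,t}\right) \\
&= E\left(\left(\sum_{(g,t):\, D_{g,t}=1} \frac{N_{g,t}}{N_1}\, w_{g,t}\right) \tilde{\Delta}^{TR}\right) \\
&= E(\tilde{\Delta}^{TR}) \\
&= \delta^{TR}.
\end{align*}
The first equality follows from the law of iterated expectations and (A7).
The second equality follows from Assumption 7. By the definition of $w_{g,t}$,
$\sum_{(g,t):\, D_{g,t}=1} (N_{g,t}/N_1)\, w_{g,t} = 1$, hence the third equality. The fourth equality follows
from the law of iterated expectations. ∎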
PROOF OF THEOREM 2:
It follows from the Frisch-Waugh theorem and the definition of $\varepsilon_{fd,g,t}$ that
Then,
\begin{align*}
&= \sum_{(g,t):\, t \geq 2} N_{g,t}\, \varepsilon_{fd,g,t} \left(D_{g,t} \tilde{\Delta}_{g,t} - D_{g,t-1} \tilde{\Delta}_{g,t-1} - D_{1,t} \tilde{\Delta}_{1,t} + D_{1,t-1} \tilde{\Delta}_{1,t-1}\right) \\
&= \sum_{(g,t):\, t \geq 2} N_{g,t}\, \varepsilon_{fd,g,t} \left(D_{g,t} \tilde{\Delta}_{g,t} - D_{g,t-1} \tilde{\Delta}_{g,t-1}\right) \\
&= \sum_{g,t} \left(N_{g,t}\, \varepsilon_{fd,g,t} - N_{g,t+1}\, \varepsilon_{fd,g,t+1}\right) D_{g,t}\, \tilde{\Delta}_{g,t} \\
&= \sum_{(g,t):\, D_{g,t}=1} N_{g,t} \left(\varepsilon_{fd,g,t} - \frac{N_{g,t+1}}{N_{g,t}}\, \varepsilon_{fd,g,t+1}\right) \tilde{\Delta}_{g,t}.
\end{align*}
The first and third equalities follow from (A11). The second equality follows from
Lemma 1. The fourth equality follows from a summation by parts, and from the
fact that $\varepsilon_{fd,g,1} = \varepsilon_{fd,g,T+1} = 0$. The fifth equality follows from Assumption 2.
A similar reasoning yields
$$\sum_{(g,t):\, t \geq 2} N_{g,t}\, \varepsilon_{fd,g,t} \left(D_{g,t} - D_{g,t-1}\right) = \sum_{(g,t):\, D_{g,t}=1} N_{g,t} \left(\varepsilon_{fd,g,t} - \frac{N_{g,t+1}}{N_{g,t}}\, \varepsilon_{fd,g,t+1}\right). \tag{A13}$$
Combining (A10), (A12), (A13), and the law of iterated expectations yields the
result. ∎
PROOF OF PROPOSITION 2:
It follows from the first order conditions attached to Regression 2 and a few lines
of algebra that $\varepsilon_{fd,g,t} = D_{g,t} - D_{g,t-1} - D_{.,t} + D_{.,t-1}$. Therefore, under Assumption 6
and if $N_{g,t}$ does not vary across $t$, one has that for all $(g, t)$ such that $D_{g,t} = 1$,
$1 \leq t \leq T - 1$, $w_{fd,g,t}$ is proportional to $1 - D_{g,t-1} - (2 D_{.,t} - D_{.,t-1} - D_{.,t+1})$.
Now, $D_{.,t} - D_{.,t-1} \leq 1$, and under Assumption 6 $D_{.,t} - D_{.,t+1} \leq 0$, so
$1 - D_{g,t-1} - (2 D_{.,t} - D_{.,t-1} - D_{.,t+1})$ can only be strictly negative if $D_{g,t-1} = 1$.
Then, for all $(g, t)$ such that $D_{g,t} = 1$, $1 \leq t \leq T - 1$, $w_{fd,g,t}$ is strictly negative if
and only if $D_{g,t-1} = 1$ and $2 D_{.,t} - D_{.,t-1} - D_{.,t+1} > 0$.
Similarly, when $t = T$, under the same assumptions as above, one has that for
all $g$ such that $D_{g,T} = 1$, $w_{fd,g,T}$ is proportional to $1 - D_{g,T-1} - (D_{.,T} - D_{.,T-1})$.
Now, $D_{.,T} - D_{.,T-1} \leq 1$, so $1 - D_{g,T-1} - (D_{.,T} - D_{.,T-1})$ can only be strictly
negative if $D_{g,T-1} = 1$. Then, $w_{fd,g,T}$ is strictly negative if and only if $D_{g,T-1} = 1$
and $D_{.,T} - D_{.,T-1} > 0$.
Finally, when $t = 1$, one has that for all $g$ such that $D_{g,1} = 1$, $D_{g,2} = 1$ under
Assumption 6, so $w_{fd,g,1}$ is proportional to $D_{.,2} - D_{.,1}$, which is greater than 0 under
Assumption 6. ∎
PROOF OF THEOREM 3:
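First, by definition of $\mathrm{DID}_M$,
$$E(\mathrm{DID}_M) = \sum_{t=2}^{T} E\left(\frac{N_{1,0,t}}{N_S}\, E(\mathrm{DID}_{+,t} | D) + \frac{N_{0,1,t}}{N_S}\, E(\mathrm{DID}_{-,t} | D)\right). \tag{A14}$$
Let $t$ be greater than 2, and let us focus for now on the case where there is at least
one $g_1$ such that $D_{g_1,t-1} = 0$ and $D_{g_1,t} = 1$. Then Assumption 11 ensures that
there is at least another group $g_2$ such that $D_{g_2,t-1} = D_{g_2,t} = 0$. For every $g$ such
that $D_{g,t-1} = 0$ and $D_{g,t} = 1$, we have
$$E(Y_{g,t} - Y_{g,t-1} | D) = E(\Delta_{g,t} | D) + E(Y_{g,t}(0) - Y_{g,t-1}(0) | D). \tag{A15}$$
Under Assumptions 12, 4, and 5, for all $t \geq 2$, there exists a real number $\psi_{0,t}$
such that for all $g$,
$$E(Y_{g,t}(0) - Y_{g,t-1}(0) | D) = E(Y_{g,t}(0) - Y_{g,t-1}(0)) = \psi_{0,t}. \tag{A16}$$
Then,
\begin{align*}
N_{1,0,t}\, E(\mathrm{DID}_{+,t} | D)
&= \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=0} N_{g,t}\, E(\Delta_{g,t} | D) + \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=0} N_{g,t}\, E(Y_{g,t}(0) - Y_{g,t-1}(0) | D) \\
&\quad - \frac{N_{1,0,t}}{N_{0,0,t}} \sum_{g:\, D_{g,t}=D_{g,t-1}=0} N_{g,t}\, E(Y_{g,t}(0) - Y_{g,t-1}(0) | D) \\
&= \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=0} N_{g,t}\, E(\Delta_{g,t} | D) + \psi_{0,t} \left(\sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=0} N_{g,t} - \frac{N_{1,0,t}}{N_{0,0,t}} \sum_{g:\, D_{g,t}=D_{g,t-1}=0} N_{g,t}\right) \\
&= \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=0} N_{g,t}\, E(\Delta_{g,t} | D). \tag{A17}
\end{align*}
The first equality follows by (A15), the second by (A16), and the third after some
algebra. If there is no $g$ such that $D_{g,t-1} = 0$ and $D_{g,t} = 1$, (A17) still holds,
as $\mathrm{DID}_{+,t} = 0$ in this case.
A similar reasoning yields
$= \delta^S$. ∎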
PROOF OF THEOREM 4:
First, as with $\mathrm{DID}_M$, we have
$$E\left(\mathrm{DID}^{pl}_M\right) = \sum_{t=3}^{T} E\left(\frac{N_{1,0,0,t}}{N^{pl}_S}\, E(\mathrm{DID}^{pl}_{+,t} | D) + \frac{N_{0,1,1,t}}{N^{pl}_S}\, E(\mathrm{DID}^{pl}_{-,t} | D)\right). \tag{A19}$$
Let $t$ be greater than 3, and let us for now focus on the case where there exists at
least one $g_1$ such that $D_{g_1,t-2} = D_{g_1,t-1} = 0$ and $D_{g_1,t} = 1$. Then Assumption 13
ensures that there is at least another group $g_2$ such that $D_{g_2,t-2} = D_{g_2,t-1} = D_{g_2,t} = 0$. Then,
\begin{align*}
N_{1,0,0,t}\, E(\mathrm{DID}^{pl}_{+,t} | D)
&= \sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=D_{g,t-2}=0} N_{g,t}\, E(Y_{g,t-1}(0) - Y_{g,t-2}(0) | D) \\
&\quad - \frac{N_{1,0,0,t}}{N_{0,0,0,t}} \sum_{g:\, D_{g,t}=D_{g,t-1}=D_{g,t-2}=0} N_{g,t}\, E(Y_{g,t-1}(0) - Y_{g,t-2}(0) | D) \\
&= \psi_{0,t-1} \left(\sum_{g:\, D_{g,t}=1,\, D_{g,t-1}=D_{g,t-2}=0} N_{g,t} - \frac{N_{1,0,0,t}}{N_{0,0,0,t}} \sum_{g:\, D_{g,t}=D_{g,t-1}=D_{g,t-2}=0} N_{g,t}\right) \\
&= 0. \tag{A20}
\end{align*}
The second equality follows by (A16), and the third follows after some algebra. If
there exists no $g$ such that $D_{g,t-2} = D_{g,t-1} = 0$ and $D_{g,t} = 1$, (A20) still holds,
as $\mathrm{DID}^{pl}_{+,t} = 0$ in this case.
A similar reasoning yields
$$N_{0,1,1,t}\, E(\mathrm{DID}^{pl}_{-,t} | D) = 0. \tag{A21}$$
The result follows after plugging (A20) and (A21) into (A19). ∎
REFERENCES
Abraham, Sarah, and Liyang Sun. 2018. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Unpublished.
Ashenfelter, Orley. 1978. “Estimating the Effect of Training Programs on Earnings.” Review of Economics and Statistics 60 (1): 47–57.
Athey, Susan, and Guido W. Imbens. 2018. “Design-Based Analysis in Difference-in-Differences Settings with Staggered Adoption.” NBER Working Paper 24963.
Athey, Susan, and Scott Stern. 2002. “The Impact of Information Technology on Emergency Health Care Outcomes.” RAND Journal of Economics 33 (3): 399–432.
Autor, David H. 2003. “Outsourcing at Will: The Contribution of Unjust Dismissal Doctrine to the Growth of Employment Outsourcing.” Journal of Labor Economics 21 (1): 1–42.
Borusyak, Kirill, and Xavier Jaravel. 2017. “Revisiting Event Study Designs.” Unpublished.
Callaway, Brantly, and Pedro H. C. Sant’Anna. 2018. “Difference-in-Differences with Multiple Time Periods and an Application on the Minimum Wage and Employment.” arXiv e-print 1803.09015.
Card, David. 1996. “The Effect of Unions on the Structure of Wages: A Longitudinal Analysis.” Econometrica 64 (4): 957–79.
de Chaisemartin, Clément. 2011. “Fuzzy Differences in Differences.” Center for Research in Economics and Statistics Working Paper 2011-10.
de Chaisemartin, Clément, and Xavier D’Haultfœuille. 2015. “Fuzzy Differences-in-Differences.” arXiv e-print 1510.01757v2.
de Chaisemartin, Clément, and Xavier D’Haultfœuille. 2018. “Fuzzy Differences-in-Differences.” Review of Economic Studies 85 (2): 999–1028.