Shen, Chi
Feb 23 2020, New Haven, CT
网络上关于双重差分模型平行趋势检验的方法有很多,但是略有差异,因此总结一份,留作学习。
平行趋势检验的方法
需要说明的是两期数据是无法进行平行趋势检验的,因为时间长度不够,如果对于数据不自信,可以采用匹配与双重差分相结合的方法。以下方法主要适用于多年的面板数据。
1)示例数据说明
示例数据来源于普林斯顿大学网站上的数据集Panel101。
数据详情如下:
import pandas as pd
import numpy as np
panel = pd.read_stata("http://dss.princeton.edu/training/Panel101.dta")
panel.head()
country | year | y | y_bin | x1 | x2 | x3 | opinion | op | |
---|---|---|---|---|---|---|---|---|---|
0 | A | 1990 | 1.342788e+09 | 1.0 | 0.277904 | -1.107956 | 0.282554 | Str agree | 1.0 |
1 | A | 1991 | -1.899661e+09 | 0.0 | 0.320685 | -0.948720 | 0.492538 | Disag | 0.0 |
2 | A | 1992 | -1.123436e+07 | 0.0 | 0.363466 | -0.789484 | 0.702523 | Disag | 0.0 |
3 | A | 1993 | 2.645775e+09 | 1.0 | 0.246144 | -0.885533 | -0.094391 | Disag | 0.0 |
4 | A | 1994 | 3.008335e+09 | 1.0 | 0.424623 | -0.729768 | 0.946131 | Disag | 0.0 |
数据预处理:
panel["time"] = np.where(panel["year"] >= 1994, 1, 0)
panel["treated"] = np.where(panel["country"].isin(["E", "F", "G"]), 1, 0)
panel["did"] = panel["time"] * panel["treated"]
2)作图法
即画出被解释变量每一年均值的时序图,通过观察在干预前是否趋势一致来判断是否满足平行假设,当然这是一种比较粗略的方式,通过画图发现干预前不一致,也不一定说明不满足平行假设。
如下:
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=[10, 6])
sns.lineplot(x="year", y="y", hue="treated", estimator="mean", marker="s", ci=None, data=panel, ax=ax)
ax.set_xticks(np.arange(1990, 2000, 1))
ax.axvline(x=1994, linestyle='--', color='black', linewidth=1)
sns.despine()
3)回归法
基本的模型设定如下1:
\[y_{it} = \lambda_i + \delta_i + \beta_1 \cdot Treat_{it}*Pre_2 + \beta_2 \cdot Treat_{it}*Pre_1 + \beta_3 \cdot Treat_{it}*Current + \beta_4 \cdot Treat_{it}*Post_1 + \beta_5 \cdot Treat_{it}*Post_2 + \epsilon_{it}\]式中:
$\lambda_i$ 表示个体固定效应,$\delta_i$ 表示时间固定效应,Pre、Current和Post表示指示相对于干预年份的dummy变量,$Treat_{it}$ 表示干预的指示变量。这里需要考虑是否要放入控制变量2
原理很简单,就是将每一年与干预年的关系做成虚拟变量,然后与干预变量做交互,放入回归中,看干预前年份的回归系数是否有统计学意义,如果无统计学意义,则说明满足平行趋势假设。
当然最理想的情况是干预前的年份全都无统计学意义,如果有个别年份有差异,也可认为满足平行趋势假设。
首先生成年份的虚拟变量与treated交互项
panel["period"] = panel["year"] - 1994
vars_pre = ["pre_" + str(i) for i in np.arange(4, 0, -1)]
vars_post = ["post_" + str(i) for i in np.arange(1, 6, 1)]
for i, vars in enumerate(vars_pre):
panel[vars] = np.where(panel["period"] == (i-4), 1, 0) * panel["treated"]
for i, vars in enumerate(vars_post):
panel[vars] = np.where(panel["period"] == (i+1), 1, 0) * panel["treated"]
panel["current"] = np.where(panel["period"] == 0, 1, 0) * panel["treated"]
然后进行回归
主要是双向固定效应模型,可以采用ols,通过lsdv的估计方式加入个体和时间固定效应,也可以采用如stata
中的xtreg函数和R
中的plm函数直接进行面板回归,但是需要注意的是,最好计算稳健聚类的std.err,否则容易低估std.err,导致p值显著,出现假阳性。
import statsmodels.formula.api as smf
formula = "y ~ " + " + ".join(vars_pre) + " + current + " + " + ".join(vars_post) + "+ C(year) + C(country)"
did = smf.ols(formula, data=panel).fit(cov_type="cluster", cov_kwds={"groups" : panel["country"]})
did.summary()
需要特别说明的是:最好将pre_1
作为对照组(基期),即在模型中不放入pre_1
,以便更好的观察干预前的组间差异。若像本例中,不特别指定基期,那么模型会自动drop一期,不同软件中规则不一样,比如本例中是将最后一期post_6
给drop掉了,也就是post_6
的系数为0.
Dep. Variable: | y | R-squared: | 0.538 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.292 |
Method: | Least Squares | F-statistic: | 2.117 |
Date: | Sun, 23 Feb 2020 | Prob (F-statistic): | 0.194 |
Time: | 22:17:18 | Log-Likelihood: | -1599.7 |
No. Observations: | 70 | AIC: | 3249. |
Df Residuals: | 45 | BIC: | 3306. |
Df Model: | 24 | ||
Covariance Type: | cluster |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.004e+09 | 1.44e+09 | -0.697 | 0.486 | -3.83e+09 | 1.82e+09 |
C(year)[T.1991] | 1.003e+09 | 2.52e+09 | 0.398 | 0.691 | -3.94e+09 | 5.94e+09 |
C(year)[T.1992] | 4.278e+08 | 1.57e+09 | 0.273 | 0.785 | -2.65e+09 | 3.5e+09 |
C(year)[T.1993] | 4.003e+09 | 2.01e+09 | 1.995 | 0.046 | 7.03e+07 | 7.94e+09 |
C(year)[T.1994] | 4.616e+09 | 2.18e+09 | 2.115 | 0.034 | 3.38e+08 | 8.89e+09 |
C(year)[T.1995] | 5.214e+09 | 1.87e+09 | 2.788 | 0.005 | 1.55e+09 | 8.88e+09 |
C(year)[T.1996] | 3.412e+09 | 1.37e+09 | 2.494 | 0.013 | 7.31e+08 | 6.09e+09 |
C(year)[T.1997] | 4.195e+09 | 1.55e+09 | 2.700 | 0.007 | 1.15e+09 | 7.24e+09 |
C(year)[T.1998] | 3.075e+09 | 1.51e+09 | 2.031 | 0.042 | 1.07e+08 | 6.04e+09 |
C(year)[T.1999] | 1.376e+09 | 2.78e+09 | 0.495 | 0.620 | -4.07e+09 | 6.82e+09 |
C(country)[T.B] | -1.514e+09 | 2.635 | -5.75e+08 | 0.000 | -1.51e+09 | -1.51e+09 |
C(country)[T.C] | -3.835e+08 | nan | nan | nan | nan | nan |
C(country)[T.D] | 1.912e+09 | nan | nan | nan | nan | nan |
C(country)[T.E] | -1.067e+09 | nan | nan | nan | nan | nan |
C(country)[T.F] | 1.769e+09 | nan | nan | nan | nan | nan |
C(country)[T.G] | -8.405e+07 | 2.583 | -3.25e+07 | 0.000 | -8.4e+07 | -8.4e+07 |
pre_4 | 2.141e+09 | 1.71e+09 | 1.253 | 0.210 | -1.21e+09 | 5.49e+09 |
pre_3 | 1.241e+09 | 2.43e+09 | 0.510 | 0.610 | -3.53e+09 | 6.01e+09 |
pre_2 | 2.651e+09 | 6.91e+08 | 3.835 | 0.000 | 1.3e+09 | 4.01e+09 |
pre_1 | 2.618e+08 | 2.28e+09 | 0.115 | 0.909 | -4.21e+09 | 4.73e+09 |
current | -9.168e+07 | 1.84e+09 | -0.050 | 0.960 | -3.71e+09 | 3.52e+09 |
post_1 | -6.432e+09 | 2.91e+09 | -2.211 | 0.027 | -1.21e+10 | -7.31e+08 |
post_2 | -1.9e+08 | 1.56e+09 | -0.122 | 0.903 | -3.25e+09 | 2.87e+09 |
post_3 | 1.035e+09 | 1.17e+09 | 0.887 | 0.375 | -1.25e+09 | 3.32e+09 |
post_4 | -2.716e+09 | 3.4e+09 | -0.798 | 0.425 | -9.38e+09 | 3.95e+09 |
post_5 | 2.718e+09 | 2.4e+09 | 1.134 | 0.257 | -1.98e+09 | 7.42e+09 |
Omnibus: | 1.113 | Durbin-Watson: | 1.836 |
---|---|---|---|
Prob(Omnibus): | 0.573 | Jarque-Bera (JB): | 0.541 |
Skew: | -0.144 | Prob(JB): | 0.763 |
Kurtosis: | 3.320 | Cond. No. | 9.40e+15 |
Warnings:
[1] Standard Errors are robust to cluster correlation (cluster)
[2] The smallest eigenvalue is 9.79e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
画coef plot
为了更为直观的观察各年份虚拟变量交互项是否具有统计学意义(即是否包含0),可以将上述回归模型的系数作图,通常称为coef plot,stata
和R
中均有相应的package,都名为coefplot,操作十分简便。
以下为通过python自定义的coef plot函数3。
图中不仅可以看出干预前年份的平行趋势,同样可以看出干预后的效应,即did的效应。
def coef_plot(fit, vars, label, ax):
'''ploting the coef of a fit'''
# extract coef and ci from a model
coef_ci = pd.DataFrame({"vars" : fit.params.index.values[1:],
"coef" : fit.params.values[1:],
"lower" : fit.conf_int().iloc[1:, 0],
"upper" : fit.conf_int().iloc[1:, 1]})
coef_ci["err"] = coef_ci["coef"] - coef_ci["upper"]
df = coef_ci[coef_ci["vars"].isin(vars)]
# plot
df.plot(x="vars", y="coef", kind="scatter", color="black", yerr="err", marker="s", s=80, legend=False, ax=ax)
ax.set_xlabel("")
ax.set_ylabel("Coefficient", fontsize=16)
ax.axhline(y=0, linestyle='--', color='black', linewidth=2)
ax.xaxis.set_ticks_position('none')
ax.set_xticklabels(label, rotation=0, fontsize=16)
vars = vars_pre + vars_post
vars.insert(4, "current")
fig, ax = plt.subplots(figsize=[12, 8])
coef_plot(did, vars, vars, ax)