Summary
The surrogate index method allows policymakers to estimate long-run treatment effects before long-run outcomes are observable. We meta-analyse this approach over nine long-run RCTs in development economics, comparing surrogate estimates to estimates from actual long-run RCT outcomes. We introduce the M-lasso algorithm for constructing the surrogate approach’s first-stage predictive model and compare its performance with other surrogate estimation methods. Across methods, we find a negative bias in surrogate estimates. For the M-lasso method, in particular, we investigate reasons for this bias and quantify significant precision gains. This provides evidence that the surrogate index method incurs a bias-variance trade-off.
Introduction
The long-term effects of treatments and policies are important in many different fields. In medicine, one may want to estimate the effect of a surgery on life expectancy; in economics, the effect of a conditional cash transfer during childhood on adult income. One way to measure these effects would be to run a randomised controlled trial (RCT) and then wait to observe the long-run outcomes. However, the results would be observed too late to inform policy decisions made today.
A prominent solution to this issue is the surrogate index, a method for estimating long-run effects without long-run outcome data, originally proposed by Athey, Chetty, Imbens, and Kang (2019). Our paper contributes to the evolving literature on this method by examining its empirical performance in a wide range of RCT contexts. We also build on the literature initiated by LaLonde (1986) on the bias of non-experimental methods, broadening the set of estimators studied to include those that target long-term effects. Our findings and recommendations aim to guide practitioners intending to use the surrogate index method, thereby aiding in the development of effective long-term treatment strategies.
We test the surrogate approach on data from nine RCTs in development economics. These RCTs are selected on the basis of being long-running and having a sufficiently large sample size.
In each RCT, we first produce an unbiased estimate of the standard experimental average treatment effect by regressing long-term outcomes on treatment status. Next, we reanalyse the data using the surrogate index approach. If the surrogate estimate is close to the unbiased estimate from the experimental approach, then the surrogate index method is working well. We run meta-analyses on the difference between these estimates to understand how well the surrogate index method performs under different conditions.
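The comparison described above can be sketched on simulated data. This is a minimal illustration, not the paper's implementation: the data-generating process, the coefficients, the sample sizes, and the function names (`simulate`, `diff_in_means`, `surrogate_index_estimate`) are all assumptions made for the example, and plain OLS stands in for the first-stage model rather than the M-lasso. The simulation is built so that surrogacy holds exactly, meaning treatment affects the long-run outcome only through the two surrogates.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n, treated_share):
    """Simulate treatment, two surrogates and a long-run outcome.

    Surrogacy holds by construction: treatment moves the outcome only
    through the surrogates. All coefficients are made up for illustration.
    """
    w = (rng.random(n) < treated_share).astype(int)
    s1 = 0.5 * w + rng.normal(size=n)
    s2 = 0.2 * w + rng.normal(size=n)
    surrogates = np.column_stack([s1, s2])
    y = 1.0 * s1 + 2.0 * s2 + rng.normal(size=n)
    return w, surrogates, y


def diff_in_means(values, w):
    """Treated-minus-control mean; equal to the OLS slope on a treatment dummy."""
    return values[w == 1].mean() - values[w == 0].mean()


def surrogate_index_estimate(s_obs, y_obs, s_exp, w_exp):
    """Two-step surrogate index estimate with an OLS first stage."""
    # Step 1: in the observational data, predict the long-run outcome
    # from the surrogates.
    design_obs = np.column_stack([np.ones(len(s_obs)), s_obs])
    coef, *_ = np.linalg.lstsq(design_obs, y_obs, rcond=None)
    # Step 2: in the experimental data, build the predicted outcome (the
    # surrogate index) and compare its mean across treatment arms.
    design_exp = np.column_stack([np.ones(len(s_exp)), s_exp])
    index = design_exp @ coef
    return diff_in_means(index, w_exp)


# Observational sample: nobody is treated, but the long-run outcome is observed.
_, s_obs, y_obs = simulate(20_000, treated_share=0.0)
# Experimental sample: randomised treatment.
w_exp, s_exp, y_exp = simulate(20_000, treated_share=0.5)

experimental_ate = diff_in_means(y_exp, w_exp)  # uses long-run outcomes
surrogate_ate = surrogate_index_estimate(s_obs, y_obs, s_exp, w_exp)  # does not
true_effect = 0.5 * 1.0 + 0.2 * 2.0  # effect implied by the simulation

print(f"true: {true_effect:.2f}")
print(f"experimental: {experimental_ate:.3f}")
print(f"surrogate index: {surrogate_ate:.3f}")
```

Because surrogacy holds here by design, both estimates should land near the simulated effect. Adding a direct effect of treatment on `y` that bypasses the surrogates would be one way to see how a missing surrogate opens a gap between the two.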
We test many different implementations of the surrogate index estimator, varying (1) the set of surrogates used, (2) the first-stage prediction method used, and (3) the observational dataset used to construct the surrogate index. Notably, we introduce a new estimator called the M-lasso, which is specifically designed for use with the surrogate method.
When meta-analysing our results, we find that the surrogate index method is consistently negatively biased and underestimates positive long-term treatment effects by 0.05 standard deviations on average. This is the case regardless of which estimation method we use. We suggest that this is due to missing surrogates, as well as bias in the first-stage predictive model of the surrogate procedure.
While it is important to understand this negative bias as a potential shortcoming of the surrogate approach, we would not necessarily take it to dissuade researchers from this method altogether. Instead, one could interpret surrogate estimates as a reasonable lower bound on the true long-term treatment effect. Furthermore, there is often no better alternative for estimating the true effect.
We also study potential determinants of the surrogate bias for the M-lasso estimator. In particular, we find suggestive evidence that M-lasso bias is smaller for simpler interventions. However, we do not find that this bias depends on the predictive accuracy of the first-stage model in the observational dataset. Our evidence is also inconclusive about how bias is affected by longer time horizons between the surrogates and the outcomes.
We further show that, despite its potential bias, the surrogate index method yields significant precision gains, with standard errors on average 52% of the size of those from the long-term RCT estimates. Hence, even if researchers had access to long-term outcomes, they might still choose to use the surrogate index, depending on their willingness to trade off bias and variance.
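As a rough illustration of this trade-off, suppose a researcher judges estimators by mean squared error. That criterion is an assumption made here for the example, as is taking the two headline averages (a bias of 0.05 standard deviations and standard errors 52% as large) at face value for a single study. Writing b for the surrogate bias and σ for the standard error of the unbiased long-term RCT estimate, both in standard deviation units:

```latex
\begin{align*}
\mathrm{MSE}_{\text{RCT}} &= \sigma^2, \\
\mathrm{MSE}_{\text{surrogate}} &= b^2 + (0.52\,\sigma)^2, \\
\mathrm{MSE}_{\text{surrogate}} < \mathrm{MSE}_{\text{RCT}}
  &\iff \sigma > \frac{|b|}{\sqrt{1 - 0.52^2}}
  \approx \frac{0.05}{0.854} \approx 0.059 .
\end{align*}
```

Under these stylised assumptions, the surrogate estimate would have the lower mean squared error whenever the long-term RCT standard error exceeds roughly 0.06 standard deviations. The threshold shifts with the actual bias and precision in any given application.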
The rest of this paper proceeds as follows. Section 2 discusses related literature. Section 3 summarises the econometric theory behind the surrogate index approach, and section 4 describes the data we use in more detail. Section 5 explains the methods we use to produce comparable long-term RCT and surrogate index estimates. Section 6 presents results of the meta-analysis over nine RCTs for different implementations of the surrogate index. In it, we empirically characterise the bias and standard errors for the surrogate method, as well as examine which surrogates are selected by the M-lasso. Finally, section 7 concludes.
I liked this a lot. For context, I work as an RA on an impact evaluation project. I have light interests / familiarity with meta-analysis + machine learning, but I did not know what surrogate indices were going into the paper. Some comments below, roughly in order of importance:
Hi Geoffrey, thanks for these comments, they are really helpful as we move to submitting this to journals. Some miscellaneous responses:
4a. The negative bias is purely an empirical result, but one that we expect to arise in many applications. We can't say for sure whether it's always negative or attenuation bias, but the hypothesis we suggest to explain it is compatible with attenuation bias of the treatment effects to 0 and treatment effects generally being positive. However, when we talk about attenuation in the paper, we're typically talking about attenuation in the prediction of long-run outcomes, not attenuation in the treatment effects.
4b. The surrogate index is unbiased and consistent if the assumptions behind it are satisfied. This is the case for most econometric estimators. What we do in the paper is show that the key surrogacy assumption is empirically not perfectly satisfied in a variety of contexts. Since this assumption is not satisfied, the estimator is empirically biased and inconsistent in our applications. However, this is not what people typically mean when they say an estimator is theoretically biased and inconsistent. Personally, I think econometrics focuses too heavily on unbiasedness, and cares too much about asymptotic properties of estimators and too little about how well they perform in these empirical LaLonde-style tests. I am sympathetic to the ML willingness to trade off bias and variance.
4c. The normalisation depends on the standard deviation of the control group, not the standard error, so we should be fine to do that regardless of what the actual treatment effect is. We would be in trouble if there was no variation in the control group outcome, but this seems to occur very rarely (or never).
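A minimal sketch of this kind of normalisation, for readers who want to see it spelled out. The function name `standardise_by_control`, the choice to centre on the control mean, and the use of the sample standard deviation are assumptions made for illustration, not details taken from the paper:

```python
import numpy as np


def standardise_by_control(y, w):
    """Express outcomes in control-group standard deviation units.

    y: array of outcomes; w: array of 0/1 treatment indicators.
    Centring on the control mean is an illustrative choice.
    """
    control = y[w == 0]
    sd = control.std(ddof=1)  # sample standard deviation of the control group
    if not sd > 0:
        # No variation in the control group (or too few control
        # observations), so the scale is undefined.
        raise ValueError("control-group standard deviation must be positive")
    return (y - control.mean()) / sd


# Toy data: the control outcomes 1, 2, 3 have mean 2 and sample SD 1.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([0, 0, 0, 1, 1, 1])
z = standardise_by_control(y, w)
effect_in_sd_units = z[w == 1].mean() - z[w == 0].mean()
print(effect_in_sd_units)
```

The scale depends only on the spread of control outcomes, so it is defined whatever the size or sign of the treatment effect, and fails only in the zero-variation case mentioned above.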