I recently put out a revised version of my paper on how well we can generalize from impact evaluation results in development.
To summarize: imagine we have a set of impact evaluations on the effect of a given type of intervention (e.g. conditional cash transfer programs) on a given outcome (e.g. school enrollment rates) and want to say something about the true effect of a similar program in another setting. I argue for using τ², a measure of true inter-study variation used in the meta-analysis literature, to make inferences about how a similar program might fare in another setting, which I take to be the crux of what we mean when we talk about generalizability.
There is actually a very close link between research credibility and generalizability. τ² and the related I² statistic (which is the share of total variance that is not sampling variance) have a long history (see e.g. Rubin, 1981; Efron and Morris, 1975; Stein, 1955), but have perhaps most notably been used to make a research result more credible in its own setting by forming a shrinkage estimator. Gelman and Tuerlinckx's and Gelman and Carlin's work on Type S and Type M errors is also very relevant. Type S errors represent the probability that a hypothesized difference between two true effects has the wrong sign. In Gelman and Tuerlinckx's parlance, this means claiming that θi > θj when in fact θi < θj, where one might be considered to make such a claim if, for example, the estimate of θi were found to be significantly greater than the estimate of θj. Type M errors are similar but represent errors of magnitude.[1]
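As a rough illustration of how τ², I² and a shrinkage estimate fit together, here is a minimal Python sketch. It is not the R code from the paper's appendix. It assumes the DerSimonian-Laird moment estimator for τ² and the Higgins-Thompson form of I², which are common choices in the meta-analysis literature but may differ from the paper's approach. The function names and the input numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RandomEffectsFit:
    mu: float    # pooled mean effect under the random-effects model
    tau2: float  # estimated variance of true effects across studies
    i2: float    # share of total variation not due to sampling error


def dersimonian_laird(effects, std_errors):
    """Moment-based random-effects fit (assumed estimator, see lead-in)."""
    k = len(effects)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies, one standard error each")
    # Inverse-variance (fixed-effect) weights and pooled mean.
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Truncate at zero, since a variance cannot be negative.
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    # Re-weight with tau2 added to each study's sampling variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return RandomEffectsFit(mu=mu, tau2=tau2, i2=i2)


def shrink(effect, std_error, fit):
    """Pull one study's estimate toward the pooled mean.

    The weight on the study's own estimate grows with tau2 and
    falls with the study's sampling variance.
    """
    b = fit.tau2 / (fit.tau2 + std_error ** 2)
    return b * effect + (1.0 - b) * fit.mu


# Made-up effect sizes and standard errors, for illustration only.
fit = dersimonian_laird([0.1, 0.3, 0.5], [0.1, 0.1, 0.1])
shrunk = shrink(0.5, 0.1, fit)
```

With these made-up inputs, τ² comes out at about 0.03 and I² at 0.75, and the study estimate of 0.5 is shrunk to about 0.45, three quarters of the way from the pooled mean of 0.3 back to its own value.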
In my paper, I take it that when we are interested in generalizability, we are interested in generalizing from a set of results to make predictions about the effect of a similar program in another setting. Perhaps, as in Type S or Type M errors, we are interested in the sign or magnitude of that true effect. Just as τ² and I² can help improve a study’s own estimates or provide a measure of how likely they are to be correct, they can also help us form estimates about how likely our estimates are to be correct in other settings. How well we can predict the sign or magnitude of this true effect depends entirely on the parameters of the model, assuming the model is not misspecified, and is something we can estimate.
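To make the link between the model parameters and predictions in a new setting concrete, here is a small Python sketch. It assumes true effects are normally distributed across settings, θ ~ N(μ, τ²), and it plugs in μ and τ² as if they were known, ignoring their estimation uncertainty. The function names and the "within a relative tolerance" definition of a magnitude prediction are hypothetical choices for illustration, not the paper's definitions.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_sign_matches(mu, tau2):
    """P(true effect in a new setting has the same sign as mu).

    Under theta ~ N(mu, tau2) this is Phi(|mu| / tau). With tau2 == 0
    every setting shares the same effect, so the probability is 1.
    """
    if tau2 < 0:
        raise ValueError("tau2 must be non-negative")
    if tau2 == 0:
        return 1.0
    return _STD_NORMAL.cdf(abs(mu) / sqrt(tau2))


def prob_within(mu, tau2, rel_tol):
    """P(|theta_new - mu| <= rel_tol * |mu|) under the same assumption."""
    if tau2 < 0 or rel_tol < 0:
        raise ValueError("tau2 and rel_tol must be non-negative")
    if tau2 == 0:
        return 1.0
    z = rel_tol * abs(mu) / sqrt(tau2)
    return 2.0 * _STD_NORMAL.cdf(z) - 1.0


# Hypothetical parameter values, for illustration only.
p_sign = prob_sign_matches(mu=0.3, tau2=0.03)
p_half = prob_within(mu=0.3, tau2=0.03, rel_tol=0.5)
```

Both probabilities fall as τ² grows relative to μ, which is the sense in which a larger τ² means less generalizable results. A fuller treatment would also propagate the uncertainty in the estimates of μ and τ².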
There may well be other things we care about, in addition to the sign and magnitude of such an effect, but I would argue the sign and magnitude are certainly among the things that policymakers attempting to form evidence-based policy would care about, and the same approach could also be leveraged to answer many other similar questions (like the likelihood that an impact evaluation will find a significant effect of an intervention on a particular outcome). Nor does this approach preclude building more complicated models of the treatment effects — in fact, in the paper I include an example leveraging a mixed model to reduce the “residual” τ², or unexplained variation.
One small new statistic reported in this version: the correlation between the standard deviation of the treatment effect and τ². Each effect has a standard deviation associated with it. If we take the mean of these standard deviations within an intervention-outcome combination, that statistic is quite strongly correlated with τ² (0.54 using standardized values). It would be nice if this relationship held at the level of the individual study, as then one could estimate the generalizability of a result simply using the data from one’s own study. However, the relationship between a single study’s standard deviation and τ² is much noisier. Still, I think one promising area for further research is looking more closely at how well one can predict across-study variation using within-study variation.
Another change in this version: R code for both the random-effects and mixed models is now available as an online appendix. I know there are existing packages to estimate these kinds of models using a variety of methods (e.g. metafor’s “empirical Bayes” approach), but this code follows the paper and hopefully can be useful to someone.
[1] Aidan Coville and I have a related SSMART grant-supported working paper estimating Type S and Type M errors and the false positive report probability and false negative report probability in development economics.