
External validity and research credibility

I recently put out a revised version of my paper on how well we can generalize from impact evaluation results in development.

To summarize: imagine we have a set of impact evaluations on the effect of a given type of intervention (e.g. conditional cash transfer programs) on a given outcome (e.g. school enrollment rates) and want to say something about the true effect of a similar program in another setting. I argue for using τ², a measure of true inter-study variation used in the meta-analysis literature, to make inferences about how a similar program might fare in another setting, which I take to be the crux of what we mean when we talk about generalizability.

Research credibility and generalizability are in fact closely linked. τ² and the related I² statistic (the share of total variance that is not sampling variance) have a long history (see e.g. Rubin, 1981; Efron and Morris, 1975; Stein, 1955), but have perhaps most notably been used to make a research result more credible in its own setting by forming a shrinkage estimator. Gelman and Tuerlinckx’s and Gelman and Carlin’s work on Type S and Type M errors is also very relevant. A Type S error is a claim about the sign of a difference between two true effects that turns out to be wrong (in Gelman and Tuerlinckx’s parlance, claiming that θi > θj when in fact θi < θj, where one might be taken to make such a claim if, for example, the estimate of θi were found to be significantly greater than the estimate of θj); a Type M error is the analogous error of magnitude.[1]
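To make τ², I² and the shrinkage idea concrete, here is a minimal sketch using the DerSimonian-Laird moment estimator, one common textbook way of estimating τ². It is not the estimation approach used in the paper. The function names and the three effect sizes are made up purely for illustration.

```python
def dersimonian_laird(effects, std_errors):
    """Return (mu_hat, tau2, i2) for a random-effects meta-analysis.

    effects: study-level effect estimates
    std_errors: their standard errors (all assumed > 0)
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    w_sum = sum(weights)
    # Inverse-variance (fixed-effect) mean, used to build Cochran's Q.
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / w_sum
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = w_sum - sum(w ** 2 for w in weights) / w_sum
    # Moment estimate of the true inter-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c)
    # Share of total variation that is not sampling variance.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects pooled mean: weights now include tau2.
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu_hat = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return mu_hat, tau2, i2


def shrink(effect, std_error, mu_hat, tau2):
    """Shrink one study's estimate toward the pooled mean.

    The noisier the study relative to tau2, the harder it is pulled
    toward mu_hat.
    """
    b = std_error ** 2 / (std_error ** 2 + tau2)
    return b * mu_hat + (1.0 - b) * effect


# Illustrative (made-up) effect sizes and standard errors.
effects = [0.1, 0.3, 0.5]
std_errors = [0.1, 0.1, 0.1]
mu_hat, tau2, i2 = dersimonian_laird(effects, std_errors)
shrunk = [shrink(y, se, mu_hat, tau2) for y, se in zip(effects, std_errors)]
print(mu_hat, tau2, i2, shrunk)
```

The shrinkage weight b is the ratio of sampling variance to total variance, so a study with a large standard error in a literature with small τ² is moved almost all the way to the pooled mean.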

In my paper, I take it that when we are interested in generalizability, we are interested in generalizing from a set of results to make predictions about the effect of a similar program in another setting. Perhaps, as with Type S or Type M errors, we are interested in the sign or magnitude of that true effect. Just as τ² and I² can help improve a study’s own estimates or provide a measure of how likely they are to be correct, they can also help us gauge how likely our predictions are to be correct in other settings. Assuming the model is not misspecified, how well we can predict the sign or magnitude of this true effect depends entirely on the parameters of the model, and that predictive accuracy is something we can estimate.
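As a rough illustration of how the model parameters pin down predictive accuracy, suppose true effects across settings are normally distributed around a pooled mean with variance τ². The sketch below then computes the probability that the true effect in a new setting shares the sign of the pooled mean, and the probability that it falls within a chosen tolerance of it. The normality assumption, the function names, and the optional `se_mu` term for uncertainty in the pooled mean are my simplifications here, not necessarily the paper's exact setup.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def predictive_sd(tau2, se_mu=0.0):
    """SD of the true effect in a new setting: true heterogeneity (tau2)
    plus uncertainty about the pooled mean (se_mu)."""
    sd = math.sqrt(tau2 + se_mu ** 2)
    if sd <= 0:
        raise ValueError("tau2 and se_mu cannot both be zero")
    return sd


def prob_same_sign(mu_hat, tau2, se_mu=0.0):
    """P(true effect in a new setting has the same sign as mu_hat)."""
    return normal_cdf(abs(mu_hat) / predictive_sd(tau2, se_mu))


def prob_within(tau2, tolerance, se_mu=0.0):
    """P(|true effect in a new setting - mu_hat| <= tolerance)."""
    return 2.0 * normal_cdf(tolerance / predictive_sd(tau2, se_mu)) - 1.0


# Illustrative values only: pooled mean 0.3, tau2 = 0.03.
print(prob_same_sign(0.3, 0.03))
print(prob_within(0.03, tolerance=0.1))
```

With the pooled mean held fixed, a larger τ² lowers both probabilities, which is the sense in which τ² measures how far a result can be expected to travel.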

There may well be other things we care about, in addition to the sign and magnitude of such an effect, but I would argue the sign and magnitude are certainly among the things that policymakers attempting to form evidence-based policy would care about, and the same approach could also be leveraged to answer many other similar questions (like the likelihood that an impact evaluation will find a significant effect of an intervention on a particular outcome). Nor does this approach preclude building more complicated models of the treatment effects — in fact, in the paper I include an example leveraging a mixed model to reduce the “residual” τ², or unexplained variation.
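The "likelihood of finding a significant effect" question can be sketched in the same spirit. Assume a new study's estimate is normally distributed around the pooled mean, with variance equal to τ² plus that study's own sampling variance, and that "significant" means a two-sided z-test at the 5% level. The function name and inputs below are hypothetical, and this is a back-of-the-envelope calculation rather than the paper's procedure.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_significant(mu_hat, tau2, se_new, z_crit=1.96):
    """P(a new study's estimate is significantly different from zero).

    The new estimate y is assumed to be N(mu_hat, tau2 + se_new**2);
    it is called significant when |y| > z_crit * se_new.
    """
    sd = math.sqrt(tau2 + se_new ** 2)
    cutoff = z_crit * se_new
    upper_tail = 1.0 - normal_cdf((cutoff - mu_hat) / sd)
    lower_tail = normal_cdf((-cutoff - mu_hat) / sd)
    return upper_tail + lower_tail


# Illustrative values only.
print(prob_significant(mu_hat=0.3, tau2=0.03, se_new=0.1))
```

Note that with a zero pooled mean, heterogeneity alone pushes this probability above 5%, since some settings have true effects far from zero.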

One small new statistic reported in this version: the correlation between the standard deviation of the treatment effect and τ². Each effect has a standard deviation associated with it. If we take the mean of these standard deviations within an intervention-outcome combination, that statistic is quite strongly correlated with τ²: the correlation is 0.54 using standardized values. It would be nice if this relationship held at the level of the individual study, as one could then estimate the generalizability of a result using only the data from one’s own study. However, the relationship between a single study’s standard deviation and τ² is much noisier. Still, I think one promising area for further research is looking more closely at how well one can predict across-study variation using within-study variation.

Another change in this version: R code for both the random-effects and mixed model is now available as an online appendix. I know there are existing packages to estimate these kinds of models using a variety of methods (e.g. metafor’s “empirical Bayes” approach) but this code follows the paper and hopefully can be useful to someone.

[1] Aidan Coville and I have a related SSMART grant-supported working paper estimating Type S and Type M errors and the false positive report probability and false negative report probability in development economics.

Is it better to build the evidence base or improve the decision-making process?

A key reason we do impact evaluations is to inform policy decisions. It is absolutely crucial to build up an evidence base. However, in a paper leveraging AidGrade data combined with experimental data, I argue that we also shouldn’t neglect the decision-making process and that improvements in the decision-making process can sometimes dominate the returns from conducting an impact evaluation.

This perhaps sounds crazy, and I’m not at all suggesting we abandon impact evaluations. They can be important tools. What I do in the paper is to build a model of the returns to impact evaluation (in terms of improved policy outcomes) assuming policymakers are Bayesian updaters and have only altruistic motives (caring only about the impact of the project on intended beneficiaries). I then gather the real priors of policymakers, practitioners, researchers, and a comparison group of MTurk workers and use these priors to estimate the returns of impact evaluations. Since most projects have fairly small impacts, the typical return to an impact evaluation is also very small.*
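For intuition, here is a stripped-down version of the normal learning model. It assumes a single risk-neutral policymaker with a normal prior over a project's effect, a normally distributed evaluation result, and a simple rule: implement the project if the expected effect exceeds a threshold. All names and numbers are illustrative assumptions, and the paper's model is richer than this.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def posterior(prior_mean, prior_sd, signal, signal_se):
    """Normal-normal Bayesian update; returns (posterior mean, posterior sd)."""
    prior_prec = 1.0 / prior_sd ** 2
    signal_prec = 1.0 / signal_se ** 2
    post_var = 1.0 / (prior_prec + signal_prec)
    post_mean = post_var * (prior_prec * prior_mean + signal_prec * signal)
    return post_mean, math.sqrt(post_var)


def value_of_evaluation(prior_mean, prior_sd, signal_se, threshold=0.0):
    """Expected gain from seeing the evaluation before deciding.

    Payoff is (true effect - threshold) if the project is implemented
    and 0 otherwise; the policymaker implements when the expected
    effect exceeds the threshold.
    """
    # Before the evaluation is run, the posterior mean is itself normal
    # around the prior mean with this standard deviation.
    pre_sd = prior_sd ** 2 / math.sqrt(prior_sd ** 2 + signal_se ** 2)
    gap = prior_mean - threshold
    z = gap / pre_sd
    with_evaluation = gap * normal_cdf(z) + pre_sd * normal_pdf(z)
    without_evaluation = max(gap, 0.0)
    return with_evaluation - without_evaluation


# Illustrative: same evaluation precision, different prior uncertainty.
print(value_of_evaluation(prior_mean=0.0, prior_sd=0.5, signal_se=1.0))
print(value_of_evaluation(prior_mean=0.0, prior_sd=2.0, signal_se=1.0))
```

In this sketch the value of the evaluation rises with prior uncertainty and shrinks as the prior mean moves away from the decision threshold, since the evaluation is then less likely to change the decision.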

Thanks to the great advice of a referee, I also looked at different ways a decision could be made, comparing a single policymaker making a decision as a dictator and a group using majority voting. There is a large literature on the wisdom of the crowds, and there has also been some work to suggest people’s priors are better than meta-analysis results. There are many other ways in which decisions could be made, but even without considering more complicated decision-making rules, it is already apparent that changing the way in which decisions are made can sometimes be more valuable than conducting an impact evaluation. Of course, this depends on the quality of the decision-makers; for the relatively poorly-informed MTurk subjects, I observed something like a “folly” of the crowds when considering how they would behave if faced with a particularly noisy signal of the effects of a program.
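A toy calculation in the spirit of the Condorcet jury theorem shows how both a "wisdom" and a "folly" of the crowds can arise. It assumes an odd number of voters who each independently pick the better option with the same probability `p`. This is a deliberately simple stand-in and not the model in the paper, and the function name is hypothetical.

```python
import math


def prob_majority_correct(n_voters, p):
    """P(a simple majority of n independent voters picks the better option).

    n_voters must be odd so that ties cannot occur; p is each voter's
    probability of being correct.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    return sum(
        math.comb(n_voters, k) * p ** k * (1.0 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# A lone "dictator" is right with probability p; compare with a group.
for p in (0.6, 0.4):
    print(p, prob_majority_correct(3, p), prob_majority_correct(11, p))
```

When individual voters are better than a coin flip, the majority beats a single decision-maker; when they are worse, adding voters makes the group decision worse still.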

In another paper, joint with Aidan Coville, I focus on how policymakers update and find the situation may actually be worse, because policymakers (and practitioners and researchers – no one should feel superior here!) do not update as Bayesians but are subject to several behavioural biases.

In summary, we talk a lot about “evidence-based” decisions, but making an evidence-based decision takes a lot more than just evidence. There remains a lot of low-hanging fruit in this research area.

*I argue that an impact evaluation is most useful for the highly uncertain, possibly highly effective projects, a straightforward and well-known result of the normal learning model.

Four reasons your study should collect priors

Several of my research projects have involved collecting priors from policymakers, practitioners and researchers (e.g. this and this). I think that collecting priors is quite important and undervalued in economics.

They have several uses:

1) They can help you prioritize outcomes or tweak other features of your design

If you know that there is more disagreement as to whether an intervention will affect a certain set of outcomes, you can focus your attention on that set of outcomes. This can help maximize learning and hopefully ensure your work is widely cited.

2) They help you avoid the problem that, regardless of what results you find, people say they knew it already

Have you ever done a study and then had people say they knew the results already, when you’re pretty sure they didn’t? It would be really nice to avoid this situation and keep your research from being overly discounted.

3) They enable learning about updating

If you collect priors, you can also collect posteriors and start to say something about how people interpret evidence and what behavioural biases a group of people might have, as in my paper with Aidan Coville on how policymakers, practitioners and researchers update.

4) They can make null results more interesting

Researchers currently aren’t given much credit for null results, a problem that can lead to specification searching. However, if we know a priori that some null results were completely unexpected, they become more interesting and informative.

For all these reasons, I am happy to say that thanks to a SSMART-funded project, which elicited researchers’ and policymakers’ priors regarding the size of various interventions’ impacts, the World Bank’s Development Impact Evaluation group (DIME) is now capturing priors across their portfolio of impact evaluations through their monitoring system. This should lead to a large corpus of priors that can be very helpful in the future.

What do you think? Have you heard of any other interesting work eliciting priors?

Clear opinions, weakly held

Recently I encountered the phrase “strong opinions, weakly held” — something advocated in the rationalist community. Some backstory for it is here. I am interested in considering the first part of the phrase and will ignore the “weakly held” portion, as I trust everyone agrees on the importance of being able to change their minds in the face of new evidence.

What could “strong opinions” mean? I see four possibilities:

Definition 1) Narrow priors (or posteriors, if you will — depends on which point of time you are considering)

Definition 2) Strongly stated opinions, in the sense of making a point forcefully

Definition 3) Strongly stated opinions, in the sense of making a point with precise language that accurately conveys one’s beliefs

Definition 4) Having an opinion at all, even if one’s beliefs entertain a wide range of possible outcomes (e.g. a uniform distribution over the entire space)

I can see several possible arguments for or against “strong opinions” in the sense of each of those definitions. Nonetheless, it is wholly unclear to me which arguments are typically made, using which definitions. If at the bare minimum one would like statements to be made clearly, in the sense of Definition 3, presumably there are better ways of putting that. By the sheer number of things it could mean, it is an ironic phrase. Perhaps it is better put as “clear opinions, weakly held”.


After spam issues, re-enabling comments. Comments will be automatically locked 30 days after a post is made. Also testing out a new spam filter – apologies if your comments end up caught up in it (e-mail me to let me know if that is the case).