What about risk? Another benefit of meta-analysis

Part 1 of a series on risk; see part 2 here

Risk means many different things to many different people.
When we talk about the risk involved in a certain development program, the first important question is: who is really bearing the risk?

In asset pricing theory, the only risks that matter are risks that cannot be diversified away. If two potential investments have uncorrelated returns, an investor can hedge his or her bets by holding both in a diversified portfolio. In terms of development programs, risks can be mitigated by diversification when the success of different programs is not correlated. Since the success of different programs is indeed largely uncorrelated, as they are dispersed in space and in their targeted outcomes, an individual donor could diversify away much of the risk of not achieving his or her desired outcomes.

However, in the way we normally think about development, the donors are not part of the goal. It can be okay for an individual donor to have funded an unsuccessful program so long as, combined, the aid programs that are funded are successful (though again this depends on a number of other things, such as the extent to which unsuccessful programs hurt intended beneficiaries or others).

In other words, donors can hedge their own risks by diversifying the portfolio of programs they donate to, but this is irrelevant to the beneficiaries, who do not receive a diversified portfolio of programs. When we think about risk, we should instead pay attention to whether individual programs themselves are risky.

Yet whether an individual program run by one particular NGO in one particular place at one particular time is risky is precisely the thing on which we have the least information. Even if we were so lucky as to have an impact evaluation of the program so that we knew its past success, and even if we were so lucky as to have that impact evaluation capture the estimated effects on different parts of the population rather than just the mean effect, this still wouldn’t tell us how risky that program was. You can’t quantify the dispersion of the estimates of the mean, for example, when you only have one mean.

We have to resort to analysis of a more general type of program. When you repeat the program, you are not repeating exactly the same program – if nothing else, the intended beneficiaries have changed, having benefitted or been harmed by the previous program. Still, it seems reasonable to use past data as evidence on which to base future policy (in a Bayesian approach, it should still cause us to update our priors, even if it is not fully predictive). And at least when you observe multiple instances of the same program being rolled out, you can measure the variance across programs, or the standard deviation, a more typical measure.

This is another benefit of meta-analysis. As a side perk, you get to see how much results vary, under which contexts, and calculate the coefficient of variation within interventions. Some types of programs do systematically vary more in their results than others. I have a working paper on this and am happy to discuss further.
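To make the coefficient of variation concrete, here is a minimal sketch in Python. It assumes the usual definition (sample standard deviation divided by the absolute value of the mean) and uses made-up effect sizes for two hypothetical interventions; none of the numbers or names come from AidGrade's data, and the working paper may use a different calculation.

```python
from statistics import mean, stdev


def coefficient_of_variation(effects):
    """Sample standard deviation divided by the absolute value of the mean."""
    m = mean(effects)
    if m == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return stdev(effects) / abs(m)


# Hypothetical effect sizes from repeated evaluations of two intervention types.
# These values are invented purely for illustration.
effects_by_intervention = {
    "intervention_a": [0.10, 0.12, 0.08, 0.11],
    "intervention_b": [0.30, -0.05, 0.02, 0.15],
}

cv_by_intervention = {
    name: coefficient_of_variation(effects)
    for name, effects in effects_by_intervention.items()
}

for name, cv in cv_by_intervention.items():
    print(f"{name}: coefficient of variation = {cv:.2f}")
```

In this toy example the two interventions have similar mean effects, but the second varies far more around its mean, which is the sense in which it would be the riskier type of program.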

P.S. to Reinhart-Rogoff post

A couple of people asked: why am I calling for a pre-analysis registry for non-RCTs? Can’t people look at the data for non-experimental studies and then “game” the pre-analysis plans, essentially writing them after the analysis?

Not necessarily. There can still be substantial lag time before you get the data for non-experimental studies, as well. I’d say I’m averaging more than a year to get non-experimental data… even including the times when the data exist out there and I just have to repeatedly ask for them! Mostly, though, I’m thinking about the large number of development economics studies which use quasi-experimental methods but otherwise face the exact same lags as RCTs do. There’s got to be something for them.

Reinhart-Rogoff and the problem with economics research

If you haven’t read about the Reinhart and Rogoff scandal, you can read about it here, here or here, among other places. In brief, a major paper was found to have made a number of errors, from Excel errors to questionable exclusion of several data points.

There was a lot of outrage in the public, but the response of economists was much more muted in general. Partly, I think this is because economics is a small world and everyone knows everyone. Partly, I think it’s because nobody’s particularly surprised; errors and even misrepresentations happen all the time.
As a discipline, we should be focusing on better correcting mechanisms. But why is there such a big problem in economics in the first place and what can we do about it?

First, regarding data. There is a big push to get people to share their data and their code, but the devil is in the details. It’s not enough to put data or code out there – you need someone to look at it to see whether or not it’s any good. Nobody wants to closely examine data or code unless it’s a really important topic and they are trying to replicate the results, but there are very low incentives to replicate papers.
Solutions? Journals that require authors to share their data and code are already doing a good job of at least encouraging some sharing. What is needed is more pressure and attention to what exactly is shared, along with feedback mechanisms to correct any mistakes. AidGrade has feedback mechanisms explicitly built into its meta-analysis protocols. More radically, there is a sort of “GitHub for research” on the way that would allow all the usual features of forking along with automated posting of data and code. The nice thing about this is that it could take choice out of the picture, both eliminating the hurdle of manually posting data and serving as a commitment device for openness.

Second problem: people are biased and this bias can permeate their methods and affect their results, even unconsciously. It is easy enough to, after running a regression, think to run it on a subgroup or with different controls and, if you obtain a result that supports your priors, think this regression closer to capturing reality. Donald Green describes the problems associated with this well.

One thing that would help solve this problem is a pre-analysis registry where people can share their initial hypotheses and how they plan to test them. Then, if they deviate from them, at least we know and can consider the results in light of this.

There is already an effort by the AEA/J-PAL to have such a pre-analysis registry, and while it is a fantastic endeavour it does not go far enough. It only accepts randomized controlled trials, which make up a very small share of development economics research or economics research in general.

The other day I was looking for a place to post a pre-analysis plan for work I am doing. Since it was not an RCT, there was nowhere to post it. I tried signing up for the Open Science Framework but didn’t see a single public pre-analysis plan posted there. Though some may be hidden, that really is a shame and points to the fact that if people don’t have the incentive to do it, they won’t. So for now I am sharing my plans with friends, with the side benefit of getting feedback, but I would of course prefer for there to be a repository for these plans. Would anyone like to set one up with me? Let me know @evavivalt – this is the kind of work best done jointly, so spread the word.

Controversy in development economics

How to know you have hit upon a very controversial subject: two titans of development economics each castigate you for diametrically opposite reasons. Next time, I should let them fight directly!

I am trying to use AidGrade’s data to say something about the generalizability of impact evaluation results. I’m not coming in with an agenda, but basing this on the belief that:

1) People want to know what works. There are a lot of grandiose claims that impact evaluation can tell us this. Economists are usually very careful not to generalize from particular cases, knowing that results are heavily context-dependent and have no external validity. But there is also a sense in which we really do want to use the results to update our priors. We want to get something generalizable out of an impact evaluation, else why do one in the first place if it only tells us how successful something which will never again occur was?

The extent to which results are generalizable is an empirical question. So long as people are extrapolating from past results, whether explicitly or with a wink and a nudge when trying to get policy makers to agree to a new impact evaluation, we’d better at least know how generalizable the results are. You can say we know they aren’t generalizable. Fine. But people still talk as though they are, and they are likely to be to some non-zero degree. So what is that degree?

2) There are undoubtedly contexts under which results are more or less generalizable in practice. For example, since a lot of people don’t want to randomize, I suspect that RCTs may be done in “weirder” situations than quasi-experimental studies. I wonder if we can see this in the results. Some causal chains between interventions and outcomes may also be more complicated than others. The theory here is quite clear – why not test it out?

Unfortunately, I’m stuck between a rock and a hard place. Some say this work goes too far, others that it does not go far enough. I’m a fan of impact evaluations for what they do tell us about human behaviour. I also think they are often vastly overpriced and that many but not all of them would be more helpful were they to give more immediate, actionable feedback to the project implementer.

I’m not actually interested in participating in the War of the Randomistas. But when a war goes on, it seems that some on either side have hammers and everything looks like a nail.

If special interests kill it, so be it, but practically speaking it’s an important issue.

Big on BITSS

Briefly: BITSS is highly necessary.

As mentioned in the new transparency series on CEGA’s blog, there are a lot of problems in the discipline today. Focusing on interaction terms or particular subgroups is one way of increasing the odds of obtaining the elusive 5% significance level, so a lot of people do it, but this overstates the results’ true significance. Something can always appear significant by accident, and tests of significance are only legitimate if they are defined beforehand.
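A quick back-of-the-envelope calculation shows how fast this problem grows. The sketch below assumes that every null hypothesis is true and that the tests are independent, which is a simplification (subgroup and interaction tests on the same data are usually correlated); the function name is just an illustrative label.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests comes out
    'significant' at level alpha when there is no true effect anywhere."""
    return 1 - (1 - alpha) ** num_tests


# How the chance of a spurious "significant" finding grows with the number
# of subgroups or interaction terms tried (independence is an assumption).
for num_tests in (1, 5, 10, 20):
    print(f"{num_tests:2d} tests: {chance_of_false_positive(num_tests):.1%}")
```

Under these assumptions, trying ten different subgroups or interaction terms gives roughly a 40% chance of at least one result clearing the 5% threshold by accident.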

Because of this issue, AidGrade has been forced to take a very conservative stance with regard to the values it collects from studies. In the absence of pre-analysis plans, we have avoided collecting interaction terms and have also focused on results containing few controls (selectively adding controls being another way that people can lie with statistics). This is obviously a shame, because there is a real sense in which these terms could be important. However, in the absence of pre-analysis plans, this is the best we can do in order to avoid fraud.

Here’s hoping the discipline will change with all the new attention brought to bear on this topic!