
Variance neglect: Testing a novel bias with policymakers and others

Aidan Coville and I have a new working paper in which we look at how policymakers, development practitioners (such as international organization staff), and researchers update in response to evidence from impact evaluations. In particular, we test whether they are subject to two biases: updating more on “good news” than “bad news” and not taking the variance of the estimates adequately into consideration (here, we mean confidence intervals, but you could imagine testing for neglect of other kinds of variance, such as inter-study variance).

The first bias is sometimes called overconfidence (not to be confused with the kind of overconfidence in which one simply is too certain in one’s beliefs), and in our experiment we are able to distinguish it from confirmation bias. The second bias is novel to my knowledge, and we call it “variance neglect”. In our context, it refers to ignoring confidence intervals, though other types of variance, such as inter-study variance, could be neglected in the same way.

Variance neglect is related to “extension neglect”, in which people may, for example, neglect sample size when considering how much to trust a new finding. However, I think it is distinct, since sample size is not all that matters when considering variance. It is also reminiscent of Kahneman and Tversky’s prospect theory, in which people overweight low probability events and underweight high probability events. However, if someone overweights low probability events and underweights high probability events, that should have the effect of increasing the dispersion of their beliefs. What we find, in supplemental tests, is more consistent with a fundamental misunderstanding of confidence intervals. We give respondents one interval and ask them to provide a different interval (e.g. giving a 95% interval and asking for the interquartile range). For small ranges, they report overly dispersed distributions, but for large ranges they report overly narrow distributions.

Variance neglect is perhaps more closely related to the hot hand fallacy and the gambler’s fallacy, which also have to do with incorrect treatment or perception of variance. I would still distinguish variance neglect from those two biases, though, both because variance neglect does not require as restrictive a functional form and because the hot hand and gambler’s fallacies conceptually have to do with seeing repeated streaks, while I am more concerned with seeing noisy data once. That matters especially because I know from past work how rare it is to have multiple studies on the same intervention covering the same outcome; instead, studies often “run away from each other” as authors seek to make them unique.
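To make the interval-conversion task concrete, here is a minimal sketch of what a "correct" answer would look like if the underlying distribution were normal. The normality assumption, the function name `convert_interval`, and the example numbers are mine for illustration; they are not taken from the working paper.

```python
from statistics import NormalDist


def convert_interval(lower, upper, from_level=0.95, to_level=0.50):
    """Rescale a symmetric central interval to a different coverage level.

    Assumes the underlying distribution is normal (an illustrative
    assumption, not something the post specifies).
    """
    std_normal = NormalDist()
    # z-scores marking the upper end of each central interval
    z_from = std_normal.inv_cdf(0.5 + from_level / 2)
    z_to = std_normal.inv_cdf(0.5 + to_level / 2)
    center = (lower + upper) / 2
    half_width = (upper - lower) / 2 * z_to / z_from
    return center - half_width, center + half_width


# Hypothetical example: a 95% interval of [0, 10] and the implied
# interquartile range (the central 50% interval).
iqr_low, iqr_high = convert_interval(0.0, 10.0)
print(round(iqr_low, 2), round(iqr_high, 2))  # 3.28 6.72
```

Under these assumptions the interquartile range is only about a third as wide as the 95% interval, which is the kind of rescaling respondents appear to get wrong.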

We show how these biases can be easily fit into a quasi-Bayesian model. Testing for these biases is also straightforward. First, we elicit respondents’ priors. Then we randomly vary whether we show them high or low point estimates (relative to their priors) and large or small confidence intervals. Finally, we elicit their posteriors. That allows us to cleanly estimate whether they update more on the “good news” than the “bad news”. Testing for variance neglect is a little more complicated: for someone to suffer from variance neglect, they don’t need to update equally on large confidence intervals and small confidence intervals, nor do they need to perversely update more on large confidence intervals. Instead, all we need to show is that they do not update as differently when they see small confidence intervals as opposed to large confidence intervals as they would if they were Bayesian. Since we know their priors and we know what data we showed them, we can determine how a Bayesian would have updated and compare their responses.
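As a rough illustration of the Bayesian benchmark described above, the sketch below assumes a normal prior and a normally distributed estimate reported with a 95% confidence interval. This is a textbook normal-normal update, not necessarily the quasi-Bayesian model in the paper, and every name and number in it is hypothetical.

```python
def bayesian_posterior(prior_mean, prior_sd, estimate, ci_low, ci_high, z=1.96):
    """Normal-normal update: returns (posterior mean, posterior sd, weight on data).

    Assumes the confidence interval is a symmetric 95% interval, so the
    standard error is the interval width divided by 2 * 1.96.
    """
    se = (ci_high - ci_low) / (2 * z)
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / se ** 2
    weight = data_precision / (prior_precision + data_precision)
    post_mean = prior_mean + weight * (estimate - prior_mean)
    post_sd = (prior_precision + data_precision) ** -0.5
    return post_mean, post_sd, weight


def implied_weight(prior_mean, reported_posterior, estimate):
    """Share of the gap between prior and estimate that a respondent closed."""
    return (reported_posterior - prior_mean) / (estimate - prior_mean)


# Hypothetical respondent: prior mean 2, prior sd 1, shown a point estimate of 4.
narrow = bayesian_posterior(2.0, 1.0, 4.0, ci_low=3.02, ci_high=4.98)  # se = 0.5
wide = bayesian_posterior(2.0, 1.0, 4.0, ci_low=0.08, ci_high=7.92)    # se = 2.0
print(round(narrow[0], 2), round(wide[0], 2))  # 3.6 2.4

# Benchmark gap in updating between narrow and wide intervals (0.8 - 0.2).
bayes_gap = narrow[2] - wide[2]

# Made-up reported posteriors for the same two treatments.
observed_gap = implied_weight(2.0, 3.2, 4.0) - implied_weight(2.0, 3.0, 4.0)
print(observed_gap < bayes_gap)  # True
```

In this framing, variance neglect shows up as an observed gap that is smaller than the Bayesian gap: the respondent still updates somewhat more on the narrow interval, just not by as much as the benchmark implies.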

The setting is also pretty cool. We can’t bring policymakers to a lab, but we can bring the lab to policymakers. We leverage a series of World Bank and Inter-American Development Bank impact evaluation workshops in various countries. These workshops tend to be one week long, and over the course of the week policymakers working on a particular program learn about impact evaluation and try to design one for their program with the help of a researcher. Several different types of participants attend these meetings: “policymakers”, or officials working in developing country government agencies (both those in charge of the particular programs and monitoring and evaluation specialists); “practitioners”, which are mostly international organization staff (like World Bank operational staff) and NGO partners; and researchers. Apart from the workshops, we also conducted some surveys at the headquarters of the World Bank and the Inter-American Development Bank. Finally, we also ran the experiment on Amazon’s Mechanical Turk (where workers take surveys for cash) to obtain another comparison group.

Quick summary of results: we found significant evidence of overconfidence and variance neglect, and no respondent type (policymakers, practitioners, researchers, and MTurk workers) updated significantly better or worse than any other. We also disaggregated results by gender; a few results indicated that women suffered more from these biases, but this is difficult to interpret given that a) other results did not indicate that and b) the distribution of women in the sample varied by respondent type. Finally, we found that respondents updated more on more granular data, even when the data should have theoretically provided the same informational content (e.g. providing more/fewer quantiles of the same distribution). This suggests that when you have bad news, you should come bearing a lot of data.

The working paper has more details. There is still some work to do (e.g. non-parametric tests, robustness checks), but we are pretty happy with the initial results. We also ran several follow-up experiments on MTurk that try to parse the ultimate cause of the variance neglect we observe. We are running these experiments on a pre-specified larger sample and will add them to the next draft. Preliminary results suggest that misunderstanding confidence intervals (as opposed to inattentiveness or lack of trust in the experiment) is the crux of the issue.

This is not the only experiment that we have run in this space. We also ran a set of experiments on a similar sample that tried to determine how people weight different aspects of a study’s context or research design. For example, would a policymaker, practitioner or researcher prefer to see results from an RCT done in a different region or results from a quasi-experimental study done in their setting? This was a simple discrete choice experiment in which respondents were asked to repeatedly pick one of two studies with different attributes. Other attributes tested (in various permutations) include program implementer (government or NGO), sample size, point estimates, confidence intervals, and whether the program was recommended by a local expert. You would need far stronger assumptions than we would be comfortable making in order to say that anyone was making a “correct” or “incorrect” choice, but their preferences were interesting nonetheless. Tune in next time to find out how respondents answered.

Predictions of Social Science Research Results

There has been an explosion of studies collecting predictions about what they will find, especially in development economics. I am fully supportive of this, as you might guess from several past blog posts. I think it is in every applied researcher’s best interest to collect forecasts about what they will find. If nothing else, they help make null results more interesting and allow researchers to note which results were particularly unexpected.

Another issue previously alluded to that I would like to highlight, however, is that forecasts can improve experimental design. I’m writing a book chapter on this now so expect more on this topic soon.

Forecasts are also important in Bayesian analysis, as “priors” one could use. And in the very, very long-run, they have the potential to improve decision-making: there are never going to be enough studies to answer all the questions we want answered, so it would be very nice to be able to say something about when our predictions are likely to be pretty accurate even in the absence of a study. Which study outcomes are less predictable? Who should we listen to? How should we aggregate and interpret forecasts? In the absence of being able to run ten billion studies, anything we can do to even slightly increase the accuracy of our forecasts is important.

Towards this end, I am collaborating with Stefano DellaVigna (of these awesome papers with Devin Pope) to build a website that researchers can use to collect forecasts for their studies. To be clear, many people are collecting forecasts on their own already for their respective research projects, but we are trying to develop a common framework that people can use so as to gather predictions more systematically. Predictions would be elicited from both the general public and from experts specified by researchers using the platform (subject to the constraint that no one person should receive a burdensome number of requests to submit predictions – providing these predictions should be thought of as providing a public good, like referee reports). There has been a lot of great work using prediction markets, but we are currently planning to elicit priors individually, so that each forecast is independent.

The hope is that this proves to be a useful tool for researchers who want to collect forecasts for their own studies; that the coordination will ensure no one person is asked to give a burdensome number of forecasts (unless they want to); and that over time the platform will generate data that could be used for more systematic analyses of forecasts.

As part of this work, we are organizing a one-day workshop on forecasts at Berkeley on December 11, 2018, hosted by the Berkeley Initiative for Transparency in the Social Sciences. We are still finalizing the program but hope to bring together all the leading experts working in this area for good discussion. I’m excited about this agenda and will post a link to the program when it is finalized.

Edit: program can be found here.

Workshop on Causal Inference and Extrapolation

I have been helping the Global Priorities Institute at the University of Oxford organize a workshop on new approaches in causal inference and extrapolation, coming up March 16-17. It is pretty small, with a lot of breaks to facilitate good discussion (designed with SITE in mind as a model). As it is just before the annual conference of the Centre for the Study of African Economies, we are hoping others can make it, especially to the keynote, which is on the evening before the CSAE conference begins and is open to all. I’m really excited about both the institute and the workshop and hope that this becomes an annual event in some way, shape or form.

If you will be in town and would like to attend but have not signed up, please let me know. The programme is here.

Edit: site appears to be down (too much traffic?), so I am uploading a copy of the programme here.


Aidan Coville and I have been running a behavioural experiment with policymakers, practitioners and researchers to see how they update based on new evidence from impact evaluations and what biases they may have in updating. We have done some analysis of the numerical data, but we also have the audio transcripts from the one-on-one enumeration, where people were asked to describe their thought processes and why they gave the answers they gave. We are looking for someone proficient in text analysis to collaborate on a separate paper. This would be a great opportunity for grad students, but others welcome, too. Please pass on to anyone you know who might be interested.

Second, please see this post for a description of a Research Assistant position. Deadline is Jan. 31.

Finally, I am helping to organize a workshop for the Global Priorities Institute at the University of Oxford on causal inference and extrapolation. The call for papers is closed but please get in touch if you are interested in attending, as space is limited.

Clean meat is not a panacea

I am a very strong advocate of “clean” or “cultured” meat – meat that is grown in a lab rather than in an animal. But it’s because I’m a strong advocate that I think it’s better to look at what the evidence says and not let our biases get in the way.

And based on the results of two experiments, I suspect there will be more resistance to clean meat than we’d like to think.

Bobbie Macdonald and I did a set of two experiments leveraging Amazon’s Mechanical Turk, a site which allows people to take surveys for money. We focused on U.S. respondents, as the U.S. is one of the first places where clean meat is expected to be made commercially available.

First, we found that while many people were very excited about clean meat, a large group was concerned that it wouldn’t taste good, that it would be too expensive, and that it was “unnatural” and vaguely unsafe or unhealthy.

In one of our papers, we tried to overcome the “naturalistic heuristic” that people seemed to be using to judge clean meat to be unhealthy even in the absence of any evidence that this was the case (in fact, a subset of the sample was provided with evidence that clean meat may be healthier than conventional meat). We tried several messaging strategies: an approach that tried to directly debunk the idea that “natural is good” (“direct debunking”); an approach that noted that many of the food products respondents currently enjoy are “unnatural” (“embrace unnatural”); and a descriptive norms approach (“descriptive norms”). We also randomly primed half the sample with real negative statements about the “unnaturalness” of the product made by participants in another study. The negative priming turned out to have stronger effects than any of the messages intended to help overcome the naturalistic heuristic, with the “embrace unnatural” message faring the best among the latter.

The strong effect of the negative priming treatment is a bit worrisome, because we can easily imagine that if clean meat begins to threaten conventional meat producers, some of them may engage in campaigns to spread distrust via these kinds of negative social information.

In our other paper, we explored a different topic: whether merely knowing about clean meat products could change ethical beliefs. Standard models of cognitive dissonance (e.g. Rabin, 1994) would suggest that if people believe the costs of avoiding meat from factory farms are high, they will be less receptive to information about the environmental costs or the harm caused to animals by factory farming. If clean meat is regarded as a good substitute, it could lower the costs of avoiding meat from factory farms (e.g. by lowering monetary costs or being a closer substitute in taste and nutrition than vegetarian products). Thus, upon receipt of new information (e.g. a video about factory farming), people may be more receptive to it and likely to shift their ethical beliefs.

Contrary to expectations, we saw no effect on ethical views – until we restricted attention to those who viewed clean meat positively (remember: many thought that clean meat would not be a good substitute). Since it was possible that those who viewed clean meat positively were also more likely to change their views, to identify the causal effect of a positive perception of clean meat on ethical views we leveraged the randomized priming treatment previously described as an instrument. Those who were randomly selected to be shown negative statements about the “unnaturalness” of clean meat were less likely to view clean meat positively and were also less likely to exhibit changes in their ethical beliefs upon viewing the video. From a broader economics perspective, this is very interesting: it implies our ethical beliefs are a function of the products around us.

None of this is to say that companies developing clean meat products won’t be wildly successful. They could also help directly mitigate the effects of factory farming by providing a product that at least some proportion of the population will substitute towards. Further, while we did not observe shifts in ethical values on net in our experiments, it is possible that over time more people will perceive clean meat positively and clean meat will affect ethical values in the long run.

Nonetheless, these experiments suggest that there is still room for animal advocacy organizations to make a difference in changing people’s ethical views. And while some conventional meat companies have shown interest in clean meat, clean meat companies should prepare for negative advertising campaigns, since the effects of negative messages can be tough to overcome.