6 big data sources for development (and 2 not ready for prime time)

Impact evaluations are expensive: the World Bank average is $500,000 per evaluation. Results may also vary from context to context, so multiple studies may be needed to form a clear idea of what to do. This leads some to argue that impact evaluation should not be the main tool used (see, for instance, Pritchett et al.’s arguments here, or IPA’s “Goldilocks” initiative, which seeks to find “appropriately-sized” evaluations for different policy needs).

My guess is that in the future we will see a lot more leveraging of big data sets, especially given the heterogeneity of results.

With that in mind, I thought I would gather some large data sets that could help. Here are the top 6 types of data I found, as well as 2 cautionary tales. Am I missing any?

1. CDR data
Call detail records (CDR) are generated when people make phone calls or send SMS messages. These have been used to track people’s movements in disasters, to track migrant workers, and to estimate when transportation infrastructure might need improvement.

2. Satellite data
The night lights data have famously been used to estimate growth. Satellites can also measure vegetation or, conversely, urban expansion, among a number of other uses. Back in 2009 I wrote up a system to extract satellite data from NASA’s public-but-difficult-to-access database; Amazon has recently added the data to its own servers, making it much easier to access.

3. Apps
I am surprised that more people haven’t leveraged apps. ResearchKit, for example, allows researchers to collect medical information from people who sign up with an Apple device. You have to rely on people who choose to share their own information, but apparently tens of thousands of people have already voluntarily signed up to fill out surveys, take coordination tests, or share their daily activity. I am still looking for good health apps for a developing country context; let me know if you have any suggestions. I’ve toyed with the idea of making my own version of an app just to get the data that people transmit.

4. Online education data
Online education would seem to offer many opportunities to learn how best to convey new ideas and teach different material.

5. Social media data
My understanding is that it is illegal to scrape Facebook or LinkedIn. However, many companies will have an API from which one can legally obtain some of their data (here is a list of APIs). The rule of thumb more generally is that if it is not disallowed in a site’s robots.txt, it is okay, but there have been challenges to this. You can also legally get some data from Facebook if you start your own group, invite people to like it, and observe their interactions within your group.
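To make the robots.txt rule of thumb concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `research-bot` user agent, and the example.org URLs are all made up for illustration, and passing this check says nothing about a site's terms of service or the legal challenges mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed user agent and URLs, for illustration only.
public_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/groups/1")
private_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/private/data")
print(public_ok, private_ok)
```

For a live site, one would instead call `set_url()` with the site's robots.txt address and then `read()` before `can_fetch()`.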

6. Financial and other private sector data
Evaluations of microfinance programs have long benefitted from the fact that due to the nature of that type of program, data is collected on a regular basis. Other financial data are also typically produced on a frequent basis, and collaborations with companies can be very fruitful.

What didn’t work
Finally, David McKenzie has highlighted some data collection strategies that sounded promising but didn’t quite work, and it could be just as important to hear about them. One study attempted to measure inventory by photography; the other tried to use RFID technology to track inventory.

Other things to look at

There are many more things that can be done using AidGrade’s data set of impact evaluation results. (And many things I am planning to do using completely different data sets!)

A selection of other questions that can be answered using the same data (collaborators welcome):

– What is the shape of different interventions’ effects over time? I tend to have repeat observations, e.g. 3, 6 or 12 months after the start of each intervention or after the program concluded.

– Regarding the finding that government-implemented programs sadly do not perform as well as NGO-implemented programs, even after controlling for sample size: what could explain this? Do the effects vary by the type of government or NGO?

– When do you observe more specification searching in the social sciences? I largely peeled this part off my job market paper to submit to another journal.

– What are the general equilibrium effects of the interventions?

– How do policymakers respond to the exogenous information shock provided by the results of a new study?

You might have seen that I am looking for an RA. It might be possible to collaborate on one or more of these projects if there were interest.

Incentivizing researchers to add their results

The biggest barrier to maintaining a constantly-updated, comprehensive data set on different studies’ results (necessary for meta-analysis) is getting all those data.

Do you know anyone who could help build an app that encourages researchers to “See how your results compare” — so that if they enter in their data, they get some nice graphics about where their results fall in the distribution of all studies done to that point, perhaps disaggregated by region or other study characteristics?

Let’s leverage researchers’ curiosity about their own data. Navel-gazing for the public good.
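The core of a “See how your results compare” feature could be as simple as a percentile rank. The sketch below is only an illustration: the function name and the effect sizes are made up, and a real app would need standardized, comparable outcomes and probably some weighting by precision.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(new_effect: float, existing_effects: list[float]) -> float:
    """Percent of existing effects below new_effect, counting ties as half."""
    if not existing_effects:
        raise ValueError("need at least one existing study")
    ordered = sorted(existing_effects)
    below = bisect_left(ordered, new_effect)            # strictly smaller effects
    ties = bisect_right(ordered, new_effect) - below    # effects equal to new_effect
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Made-up standardized effect sizes, for illustration only.
made_up_effects = [-0.05, 0.02, 0.10, 0.10, 0.25, 0.40]
rank = percentile_rank(0.10, made_up_effects)
print(f"Your result sits at the {rank:.0f}th percentile of existing studies")
```

Disaggregating by region or other study characteristics would amount to filtering `existing_effects` before calling the function.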

Ending the war

Development economics has relied increasingly on randomized controlled trials (RCTs), championed by the likes of the folks at J-PAL, IPA, CEGA, and many others. On the other hand, the strategy has its discontents, who fear that a lot of money is going into evaluations that may not have much practical value, as impact evaluations may not have much external validity.

I was worried that my paper “How Much Can We Generalize from Impact Evaluations?”, which draws upon a unique data set of roughly 600 impact evaluations across 20 different types of development programs, would stoke the flames and end up criticized by both sides. It didn’t, because we’re all economists and care a lot about data. At the end of the day, to what extent results generalize, and when, is an empirical question. I think this “war” is poorly named, because we can all agree that it is critically important to look carefully at the data.

I am very heartened by initiatives like the Berkeley Initiative for Transparency in the Social Sciences (BITSS), which emphasize getting to the right answer, not just getting an answer. I suspect that meta-analysis will continue to grow in use in economics, and that the answer to the question “how much do results generalize?” will continue to be tested.

For my part, I intend for AidGrade’s data to be constantly updated and publicly available, and to continue to allow people to conduct their own meta-analyses instantly online, by selecting papers they wish to include (more filters to be added). I will be applying for grants to develop online training modules to help crowdsource the data (moving to a Wiki style), which will enable this to keep going and expanding in perpetuity, becoming more and more useful as more studies are completed.
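For readers who have not run a meta-analysis, the simplest version of pooling a set of selected papers is an inverse-variance weighted average. The sketch below is a textbook fixed-effect estimator with made-up numbers; it is not a description of how AidGrade's online tool works, and when results are heterogeneous a random-effects model would usually be the better choice.

```python
import math


def fixed_effect_pool(effects: list[float], std_errors: list[float]) -> tuple[float, float]:
    """Inverse-variance weighted mean of effects and its standard error."""
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect, and at least one study")
    # More precise studies (smaller standard errors) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up effect sizes and standard errors for two hypothetical studies.
pooled, pooled_se = fixed_effect_pool([0.1, 0.3], [0.1, 0.1])
print(f"pooled effect = {pooled:.3f} (SE {pooled_se:.3f})")
```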

We are in a new era. If I can borrow Chris Blattman’s conceptualization of “impact evaluations 1.0” (just run it) vs. “impact evaluations 2.0” (run it with an emphasis on mechanisms), I’d suggest a slightly modified “impact evaluations 3.0”: run it, with an emphasis on mechanisms, but then synthesize your results with those from other studies to build something bigger than the sum of your parts.


Apologies for the radio silence. Job market and all.