
The problem

In the 1990s, the World Health Organization (WHO) had an important function. They had to calculate, estimate, and publish the number of deaths caused by different diseases. These numbers influenced several things, from government spending on treatment programs, to the public perception of progress being made on different issues. However, even though the people doing the calculations were well-meaning and generally competent, there was a big problem. There was no oversight and the process lacked consistency, meaning that each WHO group used different methods, calculations, and assumptions. This resulted in estimates double- and triple-counting a single death.

This mis-estimation was potentially fatal, because funding and intellectual resources were being devoted to certain diseases at the expense of other, more important areas. A concerned staff member at the WHO noticed the problem after discovering that, in a lower income country, the deaths attributed to the four biggest killers (malaria, diarrhea, TB, and measles) added up to more than 100% of the total number of deaths in that country, and that was not even counting all other causes of death. When the employee brought up the concern with coworkers and management, it was largely dismissed. It would have looked bad, both for the individual groups and the WHO as a whole, to admit or address such a large mistake. Even after the staff member triple-checked his work and strengthened it through deeper research, it went unheard. The unspoken rule was: don’t embarrass the higher-ups.

The end line result was the founding of a completely new project outside of the WHO, the Global Burden of Disease Study, which measured impact correctly, did not double- or triple-count deaths, and is, in fact, used to this day by groups like GiveWell, the Gates Foundation, and many others.


This is a true story, paraphrased from Epic Measures, and it highlights one of my biggest concerns about the EA movement. Trying to calculate counterfactual impact is a very hard task and, much like with the WHO numbers, not only is each EA organization using a different system, each also has an incentive to publish high impact results. With impact, the calculations are even harder to do correctly than in the case of deaths, as it is often plausible that five different people or organizations were required for an action to happen. Sadly, if each of the five takes 100% credit, you will end up with the EA movement as a whole taking 500% credit for a given action.

This can also happen with donations. It would be very easy for an EA to find out about EA from Charity Science, to read blog posts from both GWWC and TLYCS, sign up for both pledges, and then donate directly to GiveWell (who would count this impact again). This person would become quadruple counted in EA, with each organization using their donations as impact to justify their running. The problem is that, at the end of the day, if the person donated $1000, TLYCS, GWWC, GiveWell, and Charity Science may each have spent $500 on programs for getting this person into the movement/donating. Each organization would proudly report they have 2:1 ratios and give themselves a pat on the back, when really the EA movement as a whole just spent $2000 for $1000 worth of donations.
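The arithmetic in this example can be checked with a short sketch. The figures are the hypothetical ones from the paragraph above, and the variable names are purely illustrative; nothing here reflects any organization's actual reporting.

```python
# Hypothetical figures from the example: one donor, four organizations.
donation = 1000  # dollars the donor actually gave
spend_per_org = {
    "Charity Science": 500,
    "GWWC": 500,
    "TLYCS": 500,
    "GiveWell": 500,
}

# Each organization counts the full donation as its own impact.
reported_ratio = {org: donation / cost for org, cost in spend_per_org.items()}

# Movement-wide view: the donation is counted once, all spending is counted.
total_spend = sum(spend_per_org.values())
movement_ratio = donation / total_spend

# Naively summing every organization's claimed impact overstates the total.
claimed_total = donation * len(spend_per_org)

print(reported_ratio)   # each organization sees a 2:1 ratio
print(total_spend)      # 2000
print(movement_ratio)   # 0.5, i.e. $2000 spent for $1000 donated
print(claimed_total)    # 4000 claimed against 1000 actually donated
```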

The previous example used donations because it’s easy and clear cut to make the case that this is the wrong move without getting into more difficult issues, but it generalizes to talent as well. For example, recently, Fortify Health was founded. Clearly the founders deserve 100% impact- without them, the project certainly would not have happened. But wait a second: both of them think that without Charity Science’s support, the project would definitely not have happened. So, technically, Charity Science could also take 100% credit. (Since from our perspective, if we did not help Fortify Health it would not have happened, so it is a 100% counterfactually caused by Charity Science project). But wait a second, what about the donors who funded the project early on (because of Charity Science’s recommendation)? Surely they deserve some credit for impact as well! What about the fact that without the EA movement, it would have been much less likely for Charity Science and Fortify Health to connect?

With multiple organizations and individuals, you can very easily attribute a lot more impact than actually happens. A project’s evaluation could easily create the perception of four times the impact it really had. This is even more likely if it's unclear where people are taking their credit for impact from (e.g. I might publish a report on Charity Science's overall impact with “supporting new charities” impact listed, but not specify the exact help I gave or how many others were involved). And this is before considering deliberate rounding or naive overestimation of the value of that project.

Sadly, all these issues occur even with everyone trying to be as honest and careful as they can be. To jump back to the financial example, you can imagine Charity Science, GWWC and TLYCS not knowing exactly how much the person who donates $1000 is actually donating, leading to different and often overoptimistic estimates across the organizations.

The solutions

Sadly, I cannot think of a silver bullet solution. Thankfully, though, I think there are some things that can really help.

Transparent sharing of data regarding impact and the methodology to calculate impact  

Mistakes like this are much more likely to happen the less clear and transparent the causal chain of impact is. Many organizations have internal counterfactual calculations, but it’s hard for donors or other organizations to make sense of the end line data without knowing how it was, in fact, estimated. Obviously, not all data is going to be shareable (e.g. the names of the people donating). However, the process for calculating impact can be shared and compared, which, in turn, can allow for an open discussion of these issues (e.g. how to disaggregate the impact of two organizations taking similar actions). It also gives the community a chance to sanity check each other’s numbers. If Charity Science were massively over-estimating something, it would be hard for external observers to point out the flaw without a high level of transparency.

Efforts towards a consistent evaluation process between organizations

The more similar the process that is used between organizations, the easier it would be to take seriously the end line numbers. Something like this could be coordinated on the EA Forum and could clear up a lot of confusion regarding impact evaluation. (For example, if I hire someone to Charity Science, does that count as a career change?) I think that current organizations have very different intuitions and processes, and thus, end line numbers. I also think that to increase consistency, donors should insist upon seeing the data before donating to an organization.

Independent unbiased external impact analysis

The solution to the WHO problem was not just more interdepartmental coordination and transparency. It was, in fact, independent external analysis. Although I think this is “the solution”, it's easily the hardest to execute well. Something like this would a) produce results that are very sensitive to the evaluators’ values (e.g. if they valued one cause a lot more than another, it would be hard to generalize), b) be very time consuming (I expect it would take many hours to get a strong understanding of all the aspects of an organization; likely months to years of full time work), and c) require a fairly unprecedented level of transparency in the charity world.

Things like this can happen. I think GiveWell’s external reviewing of poverty charities is a good example of something pretty close to the ideal, and I think it would allow for much stronger evaluation and accountability when considering and comparing the impacts of different organizations.   

Comments (19)
[anonymous]

Here are my less rushed thoughts on why this line of thought is mistaken. Would have been better to do this as a comment in the first place - sorry about that.

This is a shorter and less rushed version of the argument I made in an earlier post on counterfactual impact, which could have been better in a few ways. Hopefully, people will find this version clearer and more convincing.

Suppose that we are assessing the total lifetime impact of two agents: Darren, a GWWC member who gives $1m to effective charities over the course of his life; and GWWC, which, let’s assume in this example, moves only Darren’s money to effective charities. If Darren had not heard of GWWC, he would have had zero impact, and if GWWC had not had Darren as a member it would have had zero impact.

When we ask how much lifetime counterfactual impact someone had, we are asking how much impact they had compared to the world in which they did not exist. On this approach, when we are assessing Darren’s impact, we compare two worlds:

Actual world: Darren gives $1m to GWWC recommended charities.

Counterfactual worldD: Darren does not exist and GWWC acts as it would have if Darren did not exist.

In the actual world, an additional $1m is given to effective charities compared to the Counterfactual WorldD. Therefore, Darren’s lifetime counterfactual impact is $1m. Similarly, when we are assessing GWWC’s counterfactual impact, we compare two worlds:

Actual world: GWWC recruits Darren ensuring that $1m goes to effective charities

Counterfactual worldG: GWWC does not exist and Darren acts as he would have done if GWWC did not exist.

In the actual world, an additional $1m is given to effective charities compared to the Counterfactual WorldG. Therefore, GWWC’s lifetime counterfactual impact is $1m.

This seems to give rise to the paradoxical conclusion that the lifetime counterfactual impact of both GWWC and Darren is $2m, which is absurd as this exceeds the total benefit produced. We would assess the lifetime counterfactual impact of both Darren and GWWC collectively by comparing two worlds:

Actual world: GWWC recruits Darren ensuring that $1m goes to effective charities

Counterfactual worldG&D: GWWC does not exist and Darren does not exist.

The difference between the Actual world and the counterfactual worldG&D is $1m, not $2m, so, the argument goes, the earlier method of calculating counterfactual impact must be wrong. The hidden premise here is:

Premise. The sum of the counterfactual impact of any two agents, A and B, taken individually, must equal the sum of the counterfactual impact of A and B, taken collectively.

In spite of its apparent plausibility, this premise is false. It implies that the conjunction of the counterfactual worlds we use to assess the counterfactual impact of two agents, taken individually, must be the same as the counterfactual world we use to assess the counterfactual impact of two agents, taken collectively. But this is not so. The conjunction of the counterfactual worlds we use to assess the impact of Darren and GWWC, taken individually, is:

Counterfactual worldD+G: GWWC does not exist and Darren acts as he would have done if GWWC did not exist; and Darren does not exist and GWWC acts as it would have done if Darren did not exist.

This world is not equivalent to Counterfactual worldG&D. Indeed, in this world Darren does not exist and acts as he would have done had GWWC not existed. But if GWWC had not existed, Darren would, ex hypothesi, still have existed. Therefore, this is not a description of the relevant counterfactual world which determines the counterfactual impact of both Darren and GWWC. This shows that you cannot unproblematically aggregate counterfactual worlds; it does not show that we assessed the counterfactual impact of Darren or GWWC in the wrong way.

To reiterate this point, when we assess Darren’s lifetime counterfactual impact, we ask: “what would have happened if only Darren hadn’t existed?” When we assess Darren and GWWC’s lifetime counterfactual impact, we ask: “what would have happened if neither Darren nor GWWC had existed?” These questions inevitably produce different answers about what GWWC would have done: in one case, we ask what GWWC would have done if Darren hadn’t existed, and in the other we are assuming GWWC doesn’t even exist. This is why we get surprising answers when we mistakenly try to aggregate the counterfactual impact of multiple agents.
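As a rough illustration of the three comparisons above, here is a minimal sketch. The `money_moved` function is hypothetical and simply encodes the stipulations of the example: the $1m reaches effective charities only if both Darren and GWWC exist.

```python
def money_moved(exists):
    """Millions of dollars reaching effective charities in a given world.

    Encodes the example's stipulation: the $1m flows only when both
    Darren and GWWC exist.
    """
    return 1.0 if {"Darren", "GWWC"} <= set(exists) else 0.0


ACTUAL_WORLD = frozenset({"Darren", "GWWC"})


def counterfactual_impact(removed):
    """Actual outcome minus the outcome in the world without `removed`."""
    comparison_world = ACTUAL_WORLD - frozenset(removed)
    return money_moved(ACTUAL_WORLD) - money_moved(comparison_world)


darren_alone = counterfactual_impact({"Darren"})
gwwc_alone = counterfactual_impact({"GWWC"})
both_together = counterfactual_impact({"Darren", "GWWC"})

print(darren_alone, gwwc_alone, both_together)  # 1.0 1.0 1.0
print(darren_alone + gwwc_alone)                # 2.0, from two different comparison worlds
```

Each call uses a different comparison world, which is why the two individual figures cannot simply be added to obtain the collective figure.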

I agree with you that impact is importantly relative to a particular comparison world, and so you can't straightforwardly sum different people's impacts. But my impression is that Joey's argument is actually that it's important for us to try to work collectively rather than individually. Consider a case of three people:

Anna and Bob each have $600 to donate, and want to donate as effectively as possible. Anna is deciding between donating to TLYCS and AMF, Bob between GWWC and AMF. Casey is currently not planning to donate, but if introduced to EA by TLYCS and convinced of the efficacy of donating by GWWC, would donate $1000 to AMF.

It might be the case that Anna knows that Bob plans to donate to GWWC, and therefore she's choosing between a case of causing $600 of impact or $1000. I take Joey's point not to be that you can't think of Anna's impact as being $1000, but to be that it would be better to concentrate on the collective case than the individual case. Rather than considering what her impact would be holding fixed Bob's actions ($1000 if she donates to TLYCS, $600 if she gives to AMF), Anna should try to coordinate with Bob and think about their collective impact ($1200 if they give to AMF, $1000 if they give to TLYCS/GWWC).
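The four possible outcomes can be tabulated in a short sketch. It assumes, as the example seems to imply, that Casey donates only when Anna funds TLYCS and Bob funds GWWC; the function name `amf_total` is illustrative.

```python
from itertools import product


def amf_total(anna_choice, bob_choice):
    """Total dollars reaching AMF under the example's assumptions."""
    # Direct donations: $600 from each of Anna and Bob if they pick AMF.
    direct = 600 * (anna_choice == "AMF") + 600 * (bob_choice == "AMF")
    # Assumption: Casey gives $1000 only if both TLYCS and GWWC are funded.
    casey = 1000 if (anna_choice == "TLYCS" and bob_choice == "GWWC") else 0
    return direct + casey


outcomes = {
    (anna, bob): amf_total(anna, bob)
    for anna, bob in product(["TLYCS", "AMF"], ["GWWC", "AMF"])
}

for choices, total in outcomes.items():
    print(choices, total)

# Holding Bob fixed at GWWC, Anna compares 1000 (TLYCS) against 600 (AMF).
# Choosing jointly, the best outcome is both giving to AMF.
best_joint = max(outcomes, key=outcomes.get)
print(best_joint, outcomes[best_joint])  # ('AMF', 'AMF') 1200
```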

Given that, I would add 'increased co-ordination' to the list of things that could help with the problem. Given the highlighted fact that often multiple steps by different organisations are required to achieve particular impact, we should be thinking not just about how to optimise each step individually but also about the process overall.

[anonymous]

I think this is a fair comment. I probably misinterpreted the main emphasis of the piece. I thought his main point was that each of the organisations is misstating their impact. I do think this was part of the argument and I think a few others did as well given that a few people started talking about dividing up credit according to the Shapley value. But I think the main part is about coordination and I agree wholeheartedly with his points and yours on that front.

I'm interested in what norms we can use to better deal with the practical case.

e.g. Suppose:

1) GiveWell does research for a cost of $6

2) TLYCS does outreach using the research for a cost of $6

3) $10 is raised as a result.

Assume that if GiveWell didn't do the research, TLYCS wouldn't have raised the $10, and vice versa.

If you're a donor working out where to give, how should you approach the situation?

If you consider funding TLYCS with GiveWell held fixed, then you can spend $6 to raise $10, which is worth doing. But if you consider funding GiveWell+TLYCS together, then you can spend $12 to raise $10, which is not worth doing.
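A small sketch of the two margins, using the hypothetical $6/$6/$10 figures from this example. The helper `net_value` is illustrative and assumes, as stated above, that the $10 is raised only when both organizations are funded.

```python
COSTS = {"GiveWell": 6, "TLYCS": 6}


def raised(funded):
    """Dollars raised: $10 only if both GiveWell and TLYCS are funded."""
    return 10 if {"GiveWell", "TLYCS"} <= set(funded) else 0


def net_value(marginal, held_fixed=()):
    """Extra dollars raised by funding `marginal`, minus its cost,
    given that `held_fixed` is already funded by someone else."""
    before = raised(held_fixed)
    after = raised(set(held_fixed) | set(marginal))
    cost = sum(COSTS[org] for org in marginal)
    return (after - before) - cost


tlycs_margin = net_value({"TLYCS"}, held_fixed={"GiveWell"})
joint_margin = net_value({"GiveWell", "TLYCS"})

print(tlycs_margin)  # 4: spend $6 to raise $10
print(joint_margin)  # -2: spend $12 to raise $10
```

The same donation looks worthwhile or not depending on which other funding is treated as fixed.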

It seems like the solution is that the donor needs to think very carefully about which margin they're operating at. Here are a couple of options:

A) If GiveWell will definitely do the research whatever happens, then you ought to give.

B) Maybe GiveWell won't do the research if they don't think anyone will promote it, so the two orgs are coupled, and that means you shouldn't fund either. (Funding TLYCS causes GiveWell to raise more, which is bad in this case.)

C) If you're a large donor who is able to cover both funding gaps, then you should consider the value of funding the sum, rather than each org individually.

It seems true that donors don't often consider situations like (B), which might be a mistake. Though sometimes they do - e.g. GiveWell considers the costs of malaria net distribution incurred by other actors.

Likewise, it seems like donors often don't consider situations like (C). e.g. If there are enough interactions, maybe the EA Funds should calculate the cost-effectiveness of a portfolio of EA orgs, rather than estimate the ratios for each individual org.

On the other hand, I don't think these cases where two orgs are both 100% necessary for 100% of the impact are actually that common. In practice, if GiveWell didn't exist, TLYCS would do something else with the $6, which would mean they raise somewhat less than $10; and vice versa. So, the two impacts are fairly unlikely to add up to much more than $12.

[anonymous]

In case B, it looks to me like the donor should give to TLYCS under certain conditions but not under others.

(a) Suppose: Because you gave to TLYCS, GiveWell does the research at a cost of $6, fundraising from an otherwise ineffective donor, and getting $10 to GW charities. In this case, your $6 has raised $10 for effective charities minus the $6 from an otherwise ineffective donor (~0 value). So, I don't think causing GW to fundraise further would be bad in this case. Coordinating with GW to just get them to fundraise for donations to their effective charities is even better in this case, but donating to TLYCS is better than doing nothing.

(b) Suppose: Same as before, except GW fundraises from an effective donor who would otherwise have given the $6 to GW charities. In this case, giving to TLYCS is worse than doing nothing because you have spent $6 getting $10 to GW charities, minus what the effective donor would have given had you not acted (-$6 to GW charities), so you've spent $6 getting $4 to effective charities. Doing nothing would be better, as then $6 goes to effective charities.

This shows that the counterfactual impact of funged/leveraged donations needs to be considered carefully. GiveWell is starting to do this - e.g. if govt money is leveraged or funged they try to estimate the cost-effectiveness of govt money. Outside that, this is probably something EA donors should take more account of.

Another case that should be considered is causing irrational prioritisation with a given amount of funds. Imagine case (a) above except that instead of fundraising, GiveWell moves money from another research project with a counterfactual value of $9 to GW charities because they have not considered these coordination effects (they reason that $10>$9). In that case, you're spending $6 to get $10 to GW charities minus the $9 that would have gone to GW charities.
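The three variants can be laid side by side in a minimal sketch. The labels and the `displaced` amounts restate the hypothetical figures above; the comparison "net gain versus the $6 you spent" is one simple way to read the argument, not the only one.

```python
YOUR_SPEND = 6   # what you give to TLYCS
RAISED = 10      # dollars that reach GW charities as a result

# Dollars that would have reached GW charities anyway (hypothetical cases).
displaced = {
    "a: ineffective donor pays for research": 0,
    "b: effective donor pays for research": 6,
    "c: research funds diverted from a $9 project": 9,
}

# Net counterfactual gain for GW charities in each case.
net_gain = {case: RAISED - lost for case, lost in displaced.items()}

for case, gain in net_gain.items():
    verdict = "more than you spent" if gain > YOUR_SPEND else "less than you spent"
    print(f"{case}: net ${gain} for ${YOUR_SPEND} ({verdict})")
```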

Regarding C, this seems right. It would be a mistake for the EA funds to add up its impact as the sum of the impact of each of the individual grants it has made.

On the practical point, one help is that I think cases like these are fairly uncommon:

“The previous example used donations because it’s easy and clear cut to make the case that this is the wrong move without getting into more difficult issues, but it generalizes to talent as well. For example, recently, Fortify Health was founded. Clearly the founders deserve 100% impact- without them, the project certainly would not have happened. But wait a second: both of them think that without Charity Science’s support, the project would definitely not have happened. So, technically, Charity Science could also take 100% credit. (Since from our perspective, if we did not help Fortify Health it would not have happened, so it is a 100% counterfactually caused by Charity Science project). But wait a second, what about the donors who funded the project early on (because of Charity Science’s recommendation)? Surely they deserve some credit for impact as well! What about the fact that without the EA movement, it would have been much less likely for Charity Science and Fortify Health to connect? With multiple organizations and individuals, you can very easily attribute a lot more impact than actually happens.”

In our impact evaluations, and in my experiences talking to others in the community, we would never give 100% of the impact to each group. For instance, if Charity Science didn't exist, the founders of Fortify might well have ended up doing a similar idea anyway - it's not as if Charity Science is the only group promoting evidence-based global health charities, and if Charity Science didn't exist, another group like them probably would have sprung up eventually. What's more, even if the founders didn't do Fortify, they would probably have done something else high-impact instead. So, the impact of Charity Science should probably be much less than 100% of Fortify. And the same is true for the other groups involved.

At 80,000 Hours, we rarely claim more than 30% of the impact of an event or plan change, and we most often model our impact as a speed-up (e.g. we assume the career changer would have eventually made the same shift, but we made it come 0.5-4 years earlier). We also sometimes factor in costs incurred by other groups. All this makes it hard for credit to add up to more than 100% in practice.

[anonymous]

good points. This can also go the other way though - an org could leverage money from otherwise very ineffective orgs. Especially with policy changes, it can sometimes be the case that a good org comes up with a campaign that steers the entire advocacy ecosystem to a more effective path. A good example of this is campaigns for ordinary air pollution regulations on coal plants, which were started in the 1990s by the Clean Air Task Force among others and now have hundreds of millions in funding from Bloomberg. If these campaigns weren't started, environmental NGOs in the US and Europe would plausibly be working on something much worse.

I don't think the notion of 'credit' is a useful one. At FP, when we were looking at orgs working on policy change, we initially asked them how much credit they should take for a particular policy change. They ended up saying things like "40%". I don't really understand what this means. It turned out to be best to ask them when the campaign and policy change would have happened had they not acted (obviously a very difficult question). It's best to couch things in terms of counterfactual impact throughout and not to convert into 'credit'.

Similarly with voting, if an election is decided by one vote and there are one million voters for the winning party, I think it is inevitably misleading to ask how much of the credit each voter should get. One naturally answers that they get one millionth of the credit, but this is wrong as a proposition about their counterfactual impact, which is what we really care about.

Indeed, focusing on credit can lead you to attribute impact in cases of redundant causation when an org actually has zero counterfactual impact. Imagine 100 orgs are working for a big policy change, and only 50 of them were necessary to the outcome (though this could be any combination of them and they were all equally important). In this case, funding one of the orgs had zero counterfactual impact because the change would have happened without them. But on the 'credit approach', you'd end up attributing one hundredth of the impact to each of the orgs.
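A minimal sketch of this redundant-causation case, assuming the policy change happens whenever at least 50 of the 100 orgs are active. The function and constant names are illustrative.

```python
N_ORGS = 100
THRESHOLD = 50  # any 50 orgs suffice for the policy change


def policy_value(n_active):
    """1.0 if enough orgs are active for the change to happen, else 0.0."""
    return 1.0 if n_active >= THRESHOLD else 0.0


# Counterfactual impact of one org when all 100 are active:
# the change still happens with 99, so removing one changes nothing.
counterfactual_one_org = policy_value(N_ORGS) - policy_value(N_ORGS - 1)

# 'Credit approach': split the outcome equally among everyone involved.
equal_credit_share = policy_value(N_ORGS) / N_ORGS

# Contrast: if exactly 50 orgs were active, each one would be pivotal.
pivotal_impact = policy_value(THRESHOLD) - policy_value(THRESHOLD - 1)

print(counterfactual_one_org)  # 0.0
print(equal_credit_share)      # 0.01
print(pivotal_impact)          # 1.0
```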

I agree - I was talking a bit too loosely. When I said "assign credit of 30% of X" I meant "assign counterfactual impact of 30% of X". My point was just that even if you do add up all the counterfactual impacts (ignoring that this is a conceptual mistake like you point out), they rarely sum to more than 100%, so it's still not a big issue.

I'm not sure I follow the first paragraph about leveraging other groups.

[anonymous]

You argued that counterfactual impact may be smaller than it appears. But it may also be larger than it first appears due to leveraging other orgs away from ineffective activities. e.g. an NGO successfully advocates for a policy change P1 - the benefits of P1 are their counterfactual impact. But as a result of the proven success of this type of project, 100 other NGOs start working on similar projects where before they worked on ineffective projects. This latter effect should also be counted as the first org's counterfactual impact. This could be understood as leveraging additional money into an effective space.

Makes sense. I don't think Joey would object if orgs were counting this though.

[anonymous]

I don't agree. His logic entails that money/effort you leverage shouldn't be counted as your own counterfactual impact. If FHI convinces e.g. the UK government that biorisk is worth spending money on, then on Joey's approach, FHI would be wrong to count this additional money as its own impact.

This certainly has the potential to be a big problem; in practice it’ll largely depend on the methodologies used by the relevant organizations. FYI, TLYCS’s impact methodology takes steps to avoid double-counting and includes an explicit discussion of these counterfactual concerns. See the appendix of our annual report for details.

[anonymous]

I don't think the reasoning here is correct. It is possible and normal for the sum of the counterfactual impact of individual actors to exceed the counterfactual impact of the sum of individual actors. I will write something up on this.

I'd be curious what your reasoning is. My understanding is that the technical solution here is to calculate the Shapley value.
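For reference, a minimal sketch of a Shapley value calculation, which averages each participant's marginal contribution over every order in which the participants could join. The three-organization game below is hypothetical: one unit of impact occurs only if all of A, B and C take part.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average marginal contribution of each player over all join orders."""
    totals = {player: 0.0 for player in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for player in order:
            before = value(frozenset(coalition))
            coalition.add(player)
            totals[player] += value(frozenset(coalition)) - before
    return {player: total / len(orders) for player, total in totals.items()}


def impact(coalition):
    """Hypothetical game: one unit of impact only if A, B and C all take part."""
    return 1.0 if coalition == frozenset({"A", "B", "C"}) else 0.0


shares = shapley_values(["A", "B", "C"], impact)
print(shares)                # each organization receives one third
print(sum(shares.values()))  # shares add up to the total impact
```

Unlike individual counterfactual impacts, these shares always add up to the value produced by everyone together; whether that makes them the right quantity for funding decisions is a separate question.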

If we have: impact = one life saved = 100%

and several organizations assign this impact to themselves

There is still just one life saved

but several organizations are taking together more than 100% credit for it

100% != 200%

[anonymous]

See the post following this one explaining why this is not a puzzle.

I do not understand. For practical purposes it makes sense to me, we should not take more than 100% credit for anything we do.

If multiple organizations cooperate, they create a bigger impact; that is understandable. But the impact is always 100%, no matter how big it is. We can say that organizations A, B, and C, together with multiple other factors D, created 100% of the impact. Saying that each organization has 100% of the impact is misleading and can lead us to the wrong conclusion about how effective we are compared to others who are not using this math magic.

Maybe it would make sense to me if the counterfactual were always strictly zero, every action were completely irreplaceable, and it were all or nothing forever.

In the real world, and I think this is what the article is referring to, organizations evaluate their impact using surveys, and when they find out that a person is giving $10,000 a year and was strongly influenced by their activities, they add the money to their impact ... and 10 other organizations do the same.

But a lot of those organizations' activities would be very replaceable, and even if not, it is rarely all or nothing.

Then someone adds the impact of all those organizations together and says EA created an impact 100x bigger than it actually has.

When life is saved it doesn't matter whether by one person or by 100 people. When 1 impact = 1 life saved, 10 people cooperating on saving one life cannot have 10 impacts together. And if they want to get some representative numbers, they should divide the impact between themselves.

If I understood the problem well enough, a possible solution could be setting up a database of donations that is shared between many EA charities, in which donors are obviously anonymous. In this way donors can't be counted twice. In reality the database wouldn't even need to keep track of single donors but only of dollars donated, since we want to estimate if the dollars devoted to advocacy and movement building are less than new donations. Do you think this is viable? Does this offer a solution or at least improve the situation?
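A rough sketch of how such a shared record might avoid counting the same donation several times. Everything here is an assumption for illustration: the salted-hash donor token, the record layout of (token, period, amount), and the sample data are all hypothetical, and a real system would need a more careful privacy design.

```python
import hashlib


def donor_token(email, shared_salt="example-salt"):
    """Anonymous identifier derived from a normalized email (illustrative only)."""
    normalized = email.strip().lower()
    return hashlib.sha256((shared_salt + normalized).encode()).hexdigest()


# Hypothetical reports: three organizations each claim the same $1000 donation.
reports = {
    "GWWC": [(donor_token("donor@example.org"), "2018", 1000)],
    "TLYCS": [(donor_token("Donor@example.org"), "2018", 1000)],
    "GiveWell": [(donor_token("donor@example.org "), "2018", 1000)],
}

# Naive total: every organization's claim is added up.
naive_total = sum(amount for rows in reports.values() for _, _, amount in rows)

# Shared-database total: identical records collapse into one entry.
unique_records = {row for rows in reports.values() for row in rows}
deduplicated_total = sum(amount for _, _, amount in unique_records)

print(naive_total)         # 3000
print(deduplicated_total)  # 1000
```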

The EA Survey records donations for some individuals anonymously. This could be a good basis for comparison.
