Thanks for providing that background, Jessica; very helpful. It'd be great to see metrics for the OSP included in the dashboard at some point.
It might also make sense to have the dashboard provide links to additional information about the different programs (e.g. the blog posts you link to) so that users can contextualize the dashboard data.
Same goal as your analysis, really: to find the most cost-effective event models for producing valuable outcomes. With a very fat-tailed scoring rubric, I’m concerned that legitimate differences between the event types might be overshadowed by the particulars of the rubric. As some of the other comments indicate, it’s not obvious how to value different outcomes on a relative basis.
Even if you don’t want to use an equal-weighted scoring system, you could check whether the results change materially under a much less fat-tailed rubric (e.g. scores ranging from 1-5 instead of 1-50). You can think of that as a type of sensitivity analysis to see how dependent your findings are on the specifics of the scoring system.
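To make that concrete, here’s a rough sketch of what that check could look like, assuming you have outcome-level data; the event types, scores, costs, and column names are all made up for illustration, and the rank-preserving 1-5 remap is just one possible way to flatten the rubric:

```python
import pandas as pd

# Hypothetical outcome-level data: one row per reported valuable outcome,
# scored on the original fat-tailed 1-50 rubric (placeholder values).
outcomes = pd.DataFrame({
    "event_type": ["EAG", "EAG", "EAGx", "EAGx", "CEP", "CEP"],
    "score_fat_tailed": [50, 10, 20, 5, 10, 1],
})

# Hypothetical cost per event type (placeholder numbers).
cost_per_event_type = pd.Series({"EAG": 500_000, "EAGx": 120_000, "CEP": 40_000})

# Re-score the same outcomes on a flatter 1-5 rubric that preserves the
# relative ordering of outcome categories.
flat_map = {1: 1, 5: 2, 10: 3, 20: 4, 50: 5}
outcomes["score_flat"] = outcomes["score_fat_tailed"].map(flat_map)

# Compare the cost-effectiveness ranking of event types under each rubric.
for col in ["score_fat_tailed", "score_flat"]:
    value_per_dollar = outcomes.groupby("event_type")[col].sum() / cost_per_event_type
    print(f"\n{col}:")
    print(value_per_dollar.sort_values(ascending=False))
```

If the ranking of event types stays the same under both rubrics, that’s some evidence the headline findings aren’t an artifact of the scoring system.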
Thanks for sharing the survey response rate, that’s helpful info. I’ve shared some other thoughts on specific metrics via a comment on your other post.
Thank you for writing this sequence; it provides some nice analysis and transparency into CEA’s thinking.
I like your attempt to measure “valuable outcomes” in addition to “connections”-based metrics, especially since your other post suggests that “learning” creates about as much value as “connections”. I’d be curious to see an equal-weighted “valuable outcomes” measure (i.e. every outcome that passes a bar gets one point vs. different scores for different outcomes) and whether that changes any results. I think it’s reasonable to believe that the value of different outcomes follows a power law distribution; I just think it’s difficult to score those outcomes properly on an ex-ante basis.
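If outcome-level data is available, the equal-weighted version should be a small change; here’s a minimal sketch with made-up scores and an arbitrary threshold (both hypothetical):

```python
import pandas as pd

# Hypothetical outcomes with the differentiated scores used in the analysis.
outcomes = pd.DataFrame({
    "event_type": ["EAG", "EAGx", "EAGx", "CEP"],
    "score": [50, 10, 5, 1],
})

# Equal-weighted measure: every outcome that clears the bar counts as one point.
MIN_BAR = 5  # placeholder threshold
outcomes["equal_weighted"] = (outcomes["score"] >= MIN_BAR).astype(int)

print(outcomes.groupby("event_type")[["score", "equal_weighted"]].sum())
```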
I do wish you hadn’t relegated the discussion of EAGxVirtual to an appendix. I think the finding that “EAGxVirtual is unusually cost-effective” belongs in the “bottom line up front” section, and could conceivably be the most important finding of this analysis. Virtual events aren’t great for some things (e.g. building career or social connections) but they also have a lot of advantages. In addition to the 10x(!) difference in cost-effectiveness you mentioned, virtual events:
Thank you for this thoughtful response. It is helpful to see this list of tradeoffs you’re balancing in considering which metrics to present on the dashboard, and the metrics you’ve chosen seem reasonable. Might be worth adding your list to the “additional notes” section at the top of the dashboard (I found your comment more informative than the current “additional notes” FWIW).
While I understand some metrics might not be a good fit for the dashboard if they rely on confidential information or aren’t legible to external audiences, I would love to see CEA provide a description of what your most important metrics are for each of the major program areas even if you can’t share actual data for these metrics. I think that would provide valuable transparency into CEA’s thinking about what is valuable, and might also help other organizations think about which metrics they should use.
Other data points on community growth per CEA’s dashboard:
I want to start off by saying how great it is that CEA is publishing this dashboard. Previously, I’ve been very critical of the fact that the community didn’t have access to such data, so I want to express my appreciation to Angelina and everyone else who made this possible. My interpretation of the data includes some critical observations, but I don’t want that to overshadow the overall point that this dashboard represents a huge improvement in CEA’s transparency.
My TLDR take on the data is that Events seem to be going well, the Forum metrics seem decent but not great, Groups metrics look somewhat worrisome (if you have an expectation that these programs should be growing), and the newsletter and effectivealtruism.org metrics look bad. Thoughts on metrics for specific programs, and some general observations, below.
Events
FWIW, I don’t find the “number of connections made” metric informative. Asking people at the end of a conference how many people they’d hypothetically feel comfortable asking for a favor seems akin to asking kids at the end of summer camp how many friends they made that they plan to stay in touch with; if you asked even a month later you’d probably get a much lower number. The connections metric probably provides a useful comparison across years or events; I just don’t think the unit or metric is particularly meaningful. Whereas if you waited a year and asked people how many favors they’ve asked of people they met at an event, that would provide some useful information.
That said, I like that CEA is not solely relying on the connections metric. The “willingness to recommend” metric seems a lot better, and the scores look pretty good. I found it interesting that the scores for EAG and EAGx look pretty similar.
Online (forum)
It doesn’t seem great that after a couple of years of steady growth, hours of engagement on the forum spiked around FTX (and to a lesser extent WWOTF), then fell back to roughly April 2022 levels. Views by forum users follow the same pattern, as does the number of posts with >2 upvotes.
Monthly users seem to have spiked a lot around WWOTF (September 2022 users are >50% higher than March 2022 users), and are now dropping, but haven’t reverted as much as the other metrics. I’m not totally sure what to make of that. It would be interesting to see how new users acquired in mid-2022 have behaved subsequently.
Online (effectivealtruism.org)
It seems pretty bad that traffic to the homepage and intro pages grew only very modestly from early 2017 to early 2022 (CEA has acknowledged mistakenly failing to prioritize this site over that period). WWOTF, and then FTX, both seem to have led to enormous increases in traffic relative to that baseline, and homepage traffic remains significantly elevated (though it is falling rapidly).
IMO it is very bad that WWOTF doesn’t seem to have driven any traffic to the intro page, and that intro page traffic is at its lowest level since the data begins in April 2017 and has been falling steadily since FTX. Is CEA doing anything to address this?
Going forward, it would be great if the dashboard included some kind of engagement metric(s) such as average time on site in addition to showing the number of visitors.
Online (newsletter)
Subscriber numbers grew dramatically from 2016-2018 (perhaps boosted by some ad campaigns during the period of fastest growth?), then there were essentially no net additions of subscribers in 2019-2020. We then saw very modest growth in 2021 and 2022, followed by a decline in subscribers year to date in 2023. So 2019, 2020, and 2023 all seem problematic, and from the end of 2018 to today the subscriber count has grown only about 15% (total, not annually) despite huge tailwinds (e.g. much more spent on community building and groups, big investments in promoting WWOTF, etc.). And the 2023 YTD decline seems particularly bad. Do we have any insight into what’s going on? There are obviously people unsubscribing (have they given reasons why?); are we also seeing a drop in people signing up?
Going forward, it would be great if the dashboard included some kind of engagement metric(s) in addition to showing the number of subscribers.
Groups (UGAP)
I was somewhat surprised there wasn’t any growth between spring 2022 and spring 2023, as I would have expected a new program to grow pretty rapidly (like we saw between fall 2021 and fall 2022). Does CEA expect this program to grow in the future? Are there any specific goals for number of groups/participants?
Groups (Virtual)
The data’s kind of noisy, but it looks like the number of participants has been flat or declining since the data set starts. Any idea why that’s the case? I would have expected pretty strong growth. The number of participants completing all or most of the sessions also seems to be dropping, which seems like a bad trend.
The exit scores for the virtual programs have been very consistent for the last ~2 years. But the level of those scores (~80/100) doesn’t seem great. If I’m understanding the scale correctly, participants are grading the program at about a B-/C+ type level. Does CEA feel like it understands the reason for these mediocre scores and have a good sense of how to improve them?
General observations
I’d also like to see “the board be more diverse and representative of the wider EA community.” In addition to adding more members without ties to OpenPhil, I’d favor more diversity in the cause preferences of board members. Many of the members of the EV US and EV UK boards have clear preferences for longtermism, while none are clearly neartermist. The same can be said of the projects EV runs. This raises the question of whether EV sees its role as “a central hub for the effective altruism community, [balancing] the interests of different stakeholders” or whether EV is instead trying to steer the community in specific directions. I hope EV offers more transparency around this going forward.
Hopefully, EV will be expanding its boards, which would be an opportunity to address these issues. Expanding the US board seems particularly important, since two of the four board members (Zach and Nicole) are staff members of EV (a pretty unusual structure) and as such would need to recuse themselves from some votes. This dynamic, combined with Nick (in the US and UK) and Will (in the UK) recusing themselves from FTX related issues, means the effective board sizes will be quite small for some important decisions.
Thanks for running this analysis, Ollie! Interesting findings!
Agree that this exercise doesn’t yield an obvious conclusion. Given that you’ve found the results to be sensitive to the scoring system, I suggest trying to figure out how sensitive. You’ve crunched the numbers using max scores of 50 and 5; I imagine it’d be quick to do the same with max scores of 20, 10, and 1 (the other scores you used in your original scoring system).
The other methodology I’d suggest looking at would be to keep the same relative rankings you used originally, but condense the range of scores (to, say, 1, 2, 3, 4, 5 vs. 1, 5, 10, 20, 50). That would capture the fact that you think starting an EA project is more valuable than meeting a collaborator (which is lost by capping the scores at 5), but would assess it as 2.5x more valuable rather than 10x. (Btw, I think the technical term for “beheading” the data is “Winsorizing”, though that’s usually done using percentiles of the data set, which is another way you could do a sensitivity analysis.)
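On the Winsorizing aside: the percentile-based version just clamps the most extreme scores to chosen percentile values rather than remapping the rubric. A minimal sketch, with made-up scores and an arbitrary 5th/95th-percentile choice:

```python
import numpy as np

# Hypothetical outcome scores on the original fat-tailed rubric.
scores = np.array([1, 1, 5, 5, 10, 10, 20, 20, 50, 50])

# Winsorize: clamp everything below the 5th percentile or above the
# 95th percentile to those percentile values.
low, high = np.percentile(scores, [5, 95])
winsorized = np.clip(scores, low, high)

print(winsorized)
```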
This sort of more comprehensive sensitivity analysis would shed some light on whether your observation about EAGxAustralia is supported by the broader data set:
If that turns out to be a robust finding, it has pretty big implications for how events should be run. FWIW I’d consider that a more important finding than EAGx events looking more cost-effective than CEP events, and would suggest editing the “bottom line up front” section to note that.
Longer term, I’d look to refine the metrics you use for events and how you collect the data. I love that you’ve started looking beyond “number of connections” to “valuable outcomes”; this definitely seems like a move in the right direction. However, it’s also not feasible for you to score responses from attendees at scale going forward. So I’d suggest asking respondents to score the event themselves, while providing guidance on how different experiences should be scored (e.g. starting a new project = X) to promote consistency across respondents.
My hunch is that it’d be good to have people score the event along the different dimensions (connections, learning, motivation/positivity, action, other) you listed in the “How do attendees get value from EA community-building events?” post. That might make the survey too onerous, but if you could collect that data you’d have a lot of granularity about which events generated which type of value, and it’s probably easier to do relative scoring within categories than across them. You’d still be able to create a single score based on a weighted average of the different dimensions (where you’d presumably give connections and learning the most weight, since that’s where people seem to get the most value).
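If you did collect per-dimension scores, collapsing them into a single number would be straightforward. A sketch, with entirely made-up respondent scores and placeholder weights (skewed toward connections and learning, as above):

```python
import pandas as pd

# Hypothetical per-respondent scores (say, 1-5) on each value dimension.
responses = pd.DataFrame({
    "connections": [4, 5, 2],
    "learning":    [3, 4, 5],
    "motivation":  [5, 3, 4],
    "action":      [2, 1, 3],
    "other":       [1, 2, 1],
})

# Placeholder weights, summing to 1, with the most weight on connections
# and learning.
weights = pd.Series({
    "connections": 0.35,
    "learning":    0.35,
    "motivation":  0.10,
    "action":      0.15,
    "other":       0.05,
})

# Single score per respondent: weighted average across the dimensions.
responses["overall"] = responses[weights.index].mul(weights).sum(axis=1)
print(responses)
```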