Mikolaj Kniejski

128 karma

Comments
10

Thanks! I saw that post. It's an excellent approach. I'm planning to do something similar, but less time-consuming and narrower in scope. The range of theories of change pursued in AIS is limited and can be broken down into:

  • Evals
  • Field-building
  • Governance
  • Research

Evals can be measured by the quality and number of evals and by their relevance to x-risks. It seems pretty straightforward to differentiate a bad eval org from a good one: a good one engages with major labs, has a lot of evals, and ties them to existential risks.

Field-building—having a lot of participants who do awesome things after the project.

Research: I'd argue that the number of citations is also a good proxy for the impact of a paper. It's definitely easy to measure, and it tracks how much engagement a paper received. Unless someone did separate work to bring the paper to the attention of key decision makers, citations track that engagement closely.

I'm not sure how to think about governance.

Take this with a grain of salt. 


EDIT: Also, I think that engaging the broader ML community with AI safety is extremely valuable, and citations tell us whether an organization is good at that. Another thing that would be good to review is the transparency of organizations, how they estimate their own impact, and so on. This space is really unexplored, and that seems crazy to me. The amount of money that goes into AI safety is gigantic, and it would be worth exploring what happens with it.

I’m working on a project to estimate the cost-effectiveness of AIS orgs, something like Animal Charity Evaluators does. This involves gathering data on metrics such as:

  • People impacted (e.g., scholars trained).
  • Research output (papers, citations).
  • Funding received and allocated.

Some organizations (e.g., MATS, AISC) share impact analyses, but there’s no broad comparison. AI safety orgs operate on diverse theories of change, which makes standardized evaluation tricky, but I think rough estimates could help with prioritization.
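As a rough sketch of what such estimates could look like, the snippet below turns the metrics listed above into simple cost-per-output ratios. Everything in it is an assumption for illustration: the `OrgMetrics` fields, the `summarize` helper, and the figures for the placeholder organizations "Org A" and "Org B" are made up and do not describe any real org. It also naively attributes an org's entire funding to each output type, which a real comparison would need to refine.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OrgMetrics:
    """Hypothetical per-organization inputs; none of these figures are real."""
    name: str
    funding_usd: float   # funding received over the period
    people_trained: int  # e.g., scholars trained
    papers: int          # research output
    citations: int       # total citations of those papers


def cost_per_unit(funding_usd: float, units: int) -> Optional[float]:
    """Return dollars spent per unit of output, or None if there is no output."""
    if units <= 0:
        return None
    return funding_usd / units


def summarize(org: OrgMetrics) -> Dict[str, Optional[float]]:
    """Compute rough cost-per-output ratios for one organization."""
    return {
        "usd_per_person_trained": cost_per_unit(org.funding_usd, org.people_trained),
        "usd_per_paper": cost_per_unit(org.funding_usd, org.papers),
        "usd_per_citation": cost_per_unit(org.funding_usd, org.citations),
    }


# Placeholder organizations with made-up numbers, for illustration only.
orgs = [
    OrgMetrics("Org A", funding_usd=1_000_000, people_trained=50, papers=10, citations=200),
    OrgMetrics("Org B", funding_usd=500_000, people_trained=0, papers=20, citations=100),
]

for org in orgs:
    print(org.name, summarize(org))
```

Returning `None` when an org has no output of a given type keeps orgs with different theories of change in the same table without forcing a meaningless ratio on them.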

I’m looking for:

  1. Previous work
  2. Collaborators
  3. Feedback on the idea

If you have ideas for useful metrics or feedback on the approach, let me know!

I've always been impressed with Rethink Priorities' work, but this post is underwhelming.

As I understand it, the post argues that we can't treat LLMs as coherent persons. The author seems to think this idea is vaguely connected to the claim that LLMs are not experiencing pain when they say they do. I guess the reasoning goes something like this: If LLMs are not coherent personas, then we shouldn't interpret statements like "I feel pain" as genuine indicators that they actually feel pain, because such statements are more akin to role-playing than honest representations of their internal states.

I think this makes sense but the way it's argued for is not great.

1. The user is not interacting with a single dedicated system.

The argument here seems to be: If the user is not interacting with a single dedicated system, then the system shouldn't be treated as a coherent person.

This is clearly incorrect. Imagine we had the ability to simulate a brain. You could run the same brain simulation across multiple systems. A more hypothetical scenario: you take a group of frozen, identical humans, connect them to a realistic VR simulation, and ensure their experiences are perfectly synchronized. From the user’s perspective, interacting with this setup would feel indistinguishable from interacting with a single coherent person. Furthermore, if the system is subjected to suffering, the suffering would multiply with each instance the experience is replayed. This shows that coherence doesn't necessarily depend on being a "single" system.

2. An LLM model doesn't clearly distinguish the text it generates from the text the user inputs.

Firstly, this claim isn't accurate. If you provide an LLM with the transcript of a conversation, it can often identify which parts are its responses and which parts are user inputs. This is an empirically testable claim. Moreover, statements about how LLMs process text don't necessarily negate the possibility of them being coherent personas. For instance, it’s conceivable that an LLM could function exactly as described and still be a coherent persona. 

There is an interesting connection between those techniques and "Trapped Priors," and with the whole take on human cognition as Bayesian reasoning, with biases acting as strong priors. Why would those techniques work (assuming they do)?

I guess some, like "Try to speak truth," can make you consider a wide range of connected notions. For example, you say something like "Climate change is fake" and you start to consider "what would make it true?" Or you just feel (because of your prior) that it is true and ignore any further considerations (in which case the technique doesn't work).

 

Do you have any arguments for why this would be more important than working on evals of deceptive AI or evals of cybersecurity capabilities? I'm asking in general; I'm trying to figure out how one should think about prioritizing things like that.

This is an interesting article! I understand the main claim as follows:

  1. There are a number of simple rationality techniques, such as "Don’t make irrelevant personal attacks," that are both simpler and more effective than complex rationality techniques. 
  2. Irrationality regarding moral and political issues is often due to a failure to apply these simple techniques.
  3. If there were a strong social norm towards applying these techniques, people would apply them more consistently. 
  4. Therefore, we should focus on creating a social norm that encourages the use of these simple techniques, rather than emphasizing complex rationality techniques, because (implicitly) we want more people to be rational about moral and political issues.

An additional claim is that we typically focus on the "fun" parts of rationality, like self-improvement, instead of the simple but important aspects because they are less enjoyable. For example, discipline and restraint are harder to practice than self-improvement.

I assume this extra claim refers to the rationality community or the EA community.

So, the main point is essentially that rationality is mundane and simple (though not easy!), and we shouldn't try to make it more complex than it really is. This perspective is quite refreshing, and I’ve had some similar thoughts!

However, I’m concerned that, even though people might know about these techniques, the emotionally charged nature of political and moral topics can make it difficult to apply them. It’s not necessarily the other way around. Also, while I’m not sure if you would label these as complex or not, sometimes it takes time to figure out what you actually want in life, and this requires "complex" techniques.

I just want to flag that I raised the issue of inconsistencies in the use of the discount rate (if by "the discount rate in the GBD data" you mean the 3% or 4% discount rate in the standard inputs table) in an email sent a few days ago to one of the CE employees. Unfortunately, we failed to have a productive discussion, as the conversation died quickly when CE stopped responding. Here is one of the emails I sent:

 


Hi [name],
 

I might be wrong, but you are using a 1.4% rate in the CEA, while the value of life saved at various ages is copied from the GiveWell standard inputs, which use a 4% discount rate to calculate the value. Isn't this an inconsistency?
 

Mikolaj
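To give a feel for why mixing the two rates matters, here is a minimal sketch that discounts a flat stream of future life-years at 4% and at 1.4%. It is not GiveWell's or CE's actual formula: the helper `discounted_life_years`, the annual discounting convention, and the 50-year horizon are all assumptions chosen for illustration.

```python
def discounted_life_years(years: int, rate: float) -> float:
    """Present value of one life-year per year for `years` years.

    Year t (starting at t = 0) is weighted by 1 / (1 + rate) ** t.
    """
    return sum(1.0 / (1.0 + rate) ** t for t in range(years))


HORIZON_YEARS = 50  # assumed remaining life expectancy; illustrative only

value_at_4pct = discounted_life_years(HORIZON_YEARS, 0.04)
value_at_1_4pct = discounted_life_years(HORIZON_YEARS, 0.014)

print(f"4% rate:   {value_at_4pct:.1f} discounted life-years")
print(f"1.4% rate: {value_at_1_4pct:.1f} discounted life-years")
print(f"ratio:     {value_at_1_4pct / value_at_4pct:.2f}")
```

The lower rate always yields the larger present value, so values of life computed at 4% are understated relative to a model that otherwise discounts at 1.4%.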

I might have been too directive when writing this post. I lack the organizational context and knowledge of how CEAs are used to say definitively that this should be changed. I ultimately agree that this is a small change that might not affect the decisions made, and it's up to you to decide whether to account for it. However, some of the points you raised against updating this are incorrect.

I might have focused too much on the 10% reduction, while the real issue, as Elliot mentioned, is that you ignore two variables in the formula for DALYs averted:

Missing out on three 10% reductions in error X results in a difference of 1 − 0.9^3 = 27.1%, which could be significant. I generally view organizations as growing through small iterative changes and optimization rather than big leaps.
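The 27.1% figure comes from compounding: each 10% reduction leaves 90% of the previous value, so three of them leave 0.9 × 0.9 × 0.9 = 72.9%. A small sketch of that arithmetic (the function name `combined_reduction` is just an illustrative helper):

```python
from typing import Sequence


def combined_reduction(reductions: Sequence[float]) -> float:
    """Total fractional reduction from applying each reduction in turn."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r  # each step keeps (1 - r) of what is left
    return 1.0 - remaining


total = combined_reduction([0.10, 0.10, 0.10])
print(f"{total:.1%}")  # prints 27.1%
```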

My critique is only valid if you are trying to measure DALYs averted. If you choose to do something similar to GiveWell, which is more arbitrary, then it might not make sense to adjust for this anymore.

The three changes to the value of life saved come from different frameworks:

  1. GiveWell values don't represent DALYs averted but are mixed with other factors such as survey results.
  2. HLI's work is based on the assumption that death isn't the worst possible state and that there is a baseline quality of life that must be met for a life to be worth living.
  3. The change I'm suggesting is compatible with your current method of estimating the value of life saved. It doesn't introduce any new assumptions; it simply makes some assumptions explicit. Unless you state something like, "We used those values initially but then detached them from their original formulas and now we will update them in another way," my suggestion should fit within your existing framework.

EDIT:

I can't say much about the GiveWell 1.5% rate. I've heard it comes from the Rethink Priorities review, but that review suggests a 4.3% discount rate. Can you direct me to somewhere I can read more about it?

I agree that this probably wouldn't change much, but it applies to a lot of CEAs, and it seems like a fairly straightforward and safe change.
