EDIT: Someone on LessWrong linked a great report by Epoch that tries to answer exactly this.
With the release of OpenAI o1, I want to ask a question I've been wondering about for a few months.
In the same way the Chinchilla paper estimated the compute-optimal trade-off between model size and training data, are there any similar estimates for the optimal split of compute between inference and training?
In the release they show this chart:
The chart somewhat gets at what I want to know, but doesn't answer it completely. How much additional inference compute would a 1e25 FLOP o1-like model need in order to perform as well as a one-shotted 1e26 FLOP model?
Additionally, for some number of queries x, what is the optimal ratio of compute to spend on training versus inference? How does that change for different values of x?
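To make that second question concrete, here's a toy sketch of the optimization I have in mind. Everything in it is made up for illustration (the additive loss form, the exponents and constants, the `best_split` helper); the actual shape of these curves is exactly what I'm hoping someone has estimated.

```python
# Toy model: split a fixed compute budget between training and inference for x queries.
# The loss form L = a * C_train^-alpha + b * (C_infer per query)^-beta and all constants
# are assumptions for illustration only, not estimates from any paper.
import numpy as np

def best_split(total_compute, num_queries, a=1.0, alpha=0.3, b=1.0, beta=0.3):
    """Grid-search the training share of the budget that minimizes the toy loss."""
    shares = np.linspace(0.01, 0.99, 99)                    # fraction spent on training
    c_train = shares * total_compute                        # training FLOP
    c_infer = (1 - shares) * total_compute / num_queries    # inference FLOP per query
    loss = a * c_train ** -alpha + b * c_infer ** -beta
    i = int(np.argmin(loss))
    return shares[i], loss[i]

for x in (1e3, 1e6, 1e9):                                   # sweep number of queries
    share, loss = best_split(total_compute=1e25, num_queries=x)
    print(f"x={x:.0e}: optimal training share ~ {share:.2f}, toy loss {loss:.3g}")
```

In this toy form the optimal training share shrinks as x grows, i.e. more of the budget goes to inference when you serve more queries; whether real models behave anything like this is the thing I'm asking about.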
Are there any public attempts at estimating this stuff? If so, where can I read about it?
Good question! I'm actually not sure how to get it into my email, and I can't find it on the website either.
Edit: I think it's through the forecasting newsletter.
I can highly recommend following Sentinel's weekly minutes, an update from superforecasters on the likelihood of events that could plausibly cause a worldwide catastrophe.
It's perhaps the newsletter I most look forward to at this point. Read previous issues here:
Hi Ian,
Thanks for the question! I've been meaning to write down my thoughts on this for a while, so here is a longer perspective:
In 2015, USAID teamed up with GiveWell to cash-benchmark one of its programmes. The evidence came back showing that cash transfers outperformed the programme on every metric. What gets brought up less often is that the programme got its funding renewed shortly afterwards anyway! The cash benchmark alone was not sufficient; you also need a policy requiring that programmes which perform worse than cash be wound down.
This is a sentiment I'm fully behind. But what exactly that policy should look like is where it gets tricky.
How should the ministry cash-benchmark a music festival in Mali?[1] What is the cash benchmark for a programme that monitors the Senegalese election to ensure it is fair? And if cash-benchmarking should only apply to certain types of programming amenable to cash comparisons, such as global health, how will that shift funding?
I worry that instituting a selective high bar will move funding away from broadly cost-effective areas that can be benchmarked against cash, and towards broadly ineffective areas that can't easily be.
But even within areas amenable to cash-benchmarking, it's unclear what the policy should look like. How should the ministry cash-benchmark its funding to a large multilateral, which will go on to fund a thousand programmes across the world?
The answer to this, which many arrive at, is: "Clearly we need to move from demanding literal cash arms to just estimating how impactful programmes and organizations are compared to cash transfers. That way we still get the nice hurdle rate that programmes must clear, which is what we were really after anyway."
But having development ministries systematically estimate and compare the impact of their projects is exactly what development economists have been shouting about for decades!
To an extent, the ministry's lack of systematic measurement and comparison is a feature, not a bug. Almost any instantiation of cash-benchmarking removes the wriggle room to fund projects that are valuable for reasons you didn't want to state out loud. From a minister's perspective, cash-benchmarking doesn't solve a problem; it creates one!
This is not a facetious example, but a real project funded by the Norwegian government.
Two ideas off the top of my head