zhengdong

Comments
Makes sense that this would be a big factor in what to do with our time, and in AI timelines. And we're surprised too by how AI can outperform expectations, as in the sources you cited.

We'd still characterize creating synthetic data as a wide-open problem, rather than say with high confidence that naive approaches using current LMs will just work. Here's a general intuition, instead of parsing individual sources: we wouldn't expect making the dataset bigger by repeating the same example over and over to work. Generating data means having 'models' of the original data generators, humans. If we knew exactly what made human data 'good,' we could optimize directly for it and simplify massively (this runs into the well-defined-eval problem again---we can always craft datasets to beat specific benchmarks, of course).

An analogy (a disputed one, to be fair) is Ted Chiang's lossy compression. For every case of synthetic data working, there are also cases where it fails, like the Shumailov et al. result we cited. And if labs knew exactly what made human data 'good,' we'd argue you wouldn't see them continue to ramp up hiring contractors specifically to generate high-quality data in expert domains, like programming.

A fun exercise---take a very small open-source dataset, train your own very small LM, and have it augment (double!) its own dataset. Try different prompts, and plot n-gram distributions against the original data. Can you get even one behavior out of the next generation that looks like magic compared to the previous, or does improvement plateau? You may have nitpicks with this experiment, but I don't think it's that different from what's happening at large scale.
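A minimal sketch of that exercise, with toy stand-ins we picked ourselves: a word-level bigram model plays the role of the 'very small LM', a three-sentence corpus plays the role of the dataset, and total variation distance between bigram distributions stands in for the plots. None of this comes from our piece; it's just to make the loop concrete.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Toy "open-source dataset": swap in any small text corpus you like.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def train_bigram(sentences):
    """Count next-word frequencies for each word, with <s>/</s> boundary markers."""
    counts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, max_len=20):
    """Sample one sentence from the bigram model."""
    out, word = [], "<s>"
    while len(out) < max_len:
        nxt = counts[word]
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return out

def bigram_dist(sentences):
    """Normalized bigram frequencies over a list of sentences."""
    c = Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        c.update(zip(tokens, tokens[1:]))
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

# Generation 0 is the original data; generation 1 doubles it with the model's
# own samples, i.e. the naive augmentation loop described above.
model = train_bigram(corpus)
synthetic = [sample(model) for _ in corpus]
augmented = corpus + synthetic

# Compare n-gram distributions via total variation distance from the original.
p, q = bigram_dist(corpus), bigram_dist(augmented)
tv = 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))
print("synthetic sentences:", [" ".join(s) for s in synthetic])
print("TV distance, original vs augmented bigram distributions:", round(tv, 3))
```

Even at this toy scale, the self-generated half of the dataset can only recombine patterns the model already learned from the original data, which is the intuition above.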

Hey Aaron, thanks for your thorough comment. While we still disagree (explained a bit below), I'm also quite glad to have read it :)

Re scaling current methods: The hundreds of billions figure we quoted does require more context not in our piece; SemiAnalysis explains in a bit more detail how they get to that number (e.g. assuming training in 3 months instead of 2 years). We don't want to haggle over the exact scale before it becomes infeasible, though---even if we get another 2 OOMs in, the point we wanted to emphasize is that the 'current methods' route 1) requires regular scientific breakthroughs of the pre-TAI sort, and 2) even then doesn't guarantee capabilities that look like magic compared to what we have now, depending on how much you believe in emergence. Both would be bottlenecks. We're pretty sure current capabilities can be economically useful with more people and more fine-tuning; we're just skeptical of the sudden emergence of the exact capabilities we need for transformative growth.

On Epoch's work on algorithmic progress specifically, we think it's important to note that:

1) They do this by measuring progress on computer vision benchmarks, which isn't a good indicator of progress in either algorithms for control (the physical world matters for TAI) or even language---it might be cheeky to say there's been little algorithmic progress there, just scale ;) Computer vision is also the exact example Schaeffer et al. give of a subfield where emergent abilities do not arise---until you induce them by intentionally crafting the evaluations (see the toy simulation after this list).

2) That there even is a well-defined benchmark is a good sign for beating that benchmark. AI benefits from quantifiable evaluation (beating a world champion, CASP scores) when it measures what we want. But we'd say for really powerful AI we don't know what we want (see our wrong direction / philosophy hurdle), plus at some point the quantifiable metrics we do have stop measuring what we really want. (Is there really a difference between models that get 91.0 and 91.1 top-1 accuracy on ImageNet? Do people really look at MMLU over qualitative experience when they choose which language model to play with?)

3) We don't discount algorithmic progress at all! In fact we cite SemiAnalysis and the Epoch team's suggestions on where to research next. But again, these require human breakthroughs, bottlenecked on human research timescales---we don't have a step-by-step process we can just follow to improve a metric all the way to TAI, so hard-won past breakthroughs don't guarantee that future ones happen at the same clip.
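To make the metric-choice point from 1) concrete, here's a toy simulation (ours, in the spirit of Schaeffer et al., not reproduced from their paper): per-token accuracy is assumed to improve smoothly with log-compute, but scoring the same model with exact match over a 10-token answer makes the curve look like a sudden, 'emergent' jump.

```python
import math

def per_token_accuracy(log_compute):
    """Assumed smooth improvement: a logistic curve in log10(compute)."""
    return 1 / (1 + math.exp(-(log_compute - 20) / 2))

answer_length = 10  # exact match requires all 10 tokens to be correct

print(f"{'log10 compute':>13} {'per-token acc':>14} {'exact match':>12}")
for log_compute in range(10, 31, 2):
    p = per_token_accuracy(log_compute)
    exact = p ** answer_length  # the all-or-nothing metric
    print(f"{log_compute:>13} {p:>14.3f} {exact:>12.3f}")
```

The underlying capability curve is the same in both columns; only the metric changed, and only the exact-match column looks like a sharp emergent threshold.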

Re Constitutional AI: We agree that researchers will continue searching for ways to use human feedback more efficiently. But under our Baumol framework, the important step is going from one to zero, not n to one. And there we find it hard to believe that in high-stakes situations (say, judging AI debates), safety researchers are willing to hand over the reins. We'd also really contest that 'performs very similarly to human raters' is enough---it'd be surprising if we already had a free-lunch, no-information-lost way to simulate humans well enough to make better AI.

Re 2025 language models equipped with search: For this to be as useful as a panel of experts, the models need to be searching an index where what the experts know is recorded, in some sense, which 1) doesn't happen (experts are busy being experts), 2) is sometimes impossible (chef, LeBron), and 3) is maybe even less likely in the future, when an LLM is going to just hoover up your hard-won expertise. I know you mentioned you don't disagree with our point here though.

Re motte and bailey: We agree that our hurdles may overlap. But the point of our Baumol framework is that any valid hurdle (one where we don't know whether it's fundamentally the same problem behind the others) has the potential to bottleneck transformative growth on its own. And we allude to several cases where, for one reason or another, a promising invention did not meet expectations precisely because it could not clear every hurdle.

Hope this clarifies our view. None of this is conclusive, of course; like your piece, we're happy to just be going for intuition pumps, in our case to temper expectations.