
We’re releasing Squiggle AI, a tool that generates probabilistic models using the Squiggle language. This can provide early cost-effectiveness models and other kinds of probabilistic programs.

No prior Squiggle knowledge is required to use Squiggle AI. Simply ask for whatever you want to estimate, and the results should be fairly understandable. The Squiggle programming language acts as an adjustable backend, but isn’t mandatory to learn.

Beyond being directly useful, we’re interested in Squiggle AI as an experiment in epistemic reasoning with LLMs. We hope it will help highlight potential strengths, weaknesses, and directions for the field.

Screenshots

The “Playground” view after it finishes a successful workflow. Form on the left, code in the middle, code output on the right.
The “Steps” page. Shows all of the steps that the workflow went through, next to the form on the left. For each, shows a simplified view of recent messages to and from the LLM.

Motivation

Organizations in the effective altruism and rationalist communities regularly rely on cost-effectiveness analyses and Fermi estimates to guide their decisions. QURI's mission is to make these probabilistic tools more accessible and reliable for altruistic causes.

However, our experience with tools like Squiggle and Guesstimate has revealed a significant challenge: even highly skilled domain experts frequently struggle with the basic programming requirements and often make errors in their models. This suggests a need for alternative approaches.

Language models seem particularly well-suited to address these difficulties. Fermi estimates typically follow straightforward patterns and rely on common assumptions, making them potentially ideal candidates for LLM assistance. Previous direct experiments with Claude and ChatGPT alone proved insufficient, but with substantial iteration, we've developed a framework that significantly improves the output quality and user experience.

We're focusing specifically on cost-effectiveness estimates because they represent a strong balance between specificity and broad applicability. These models often contain recurring patterns and sub-variables, yet can be applied to a wide spectrum of human decisions - from different kinds of charitable giving to policy interventions.

Looking ahead, if we succeed in automating cost-effectiveness analyses, we believe this could serve as one foundation for enhanced LLM reasoning capabilities. This work could pave the way for more sophisticated applications of AI in decision-making and impact assessment.

Description

Squiggle AI combines LLM capabilities with specialized scaffolding to produce Fermi estimates in the Squiggle programming language. The frontend runs on Squiggle Hub. The code is open-source and available on GitHub.

Since LLMs don't naturally write valid Squiggle code, substantial work has gone into bridging this gap. Our system provides comprehensive documentation to the LLM, implements automatic error correction, and includes steps for model improvement and documentation.

Squiggle AI currently calls Claude 3.5 Sonnet. Each single LLM call costs $0.01 to $0.04, and full workflows range from $0.10 to $0.35. While we cover these costs currently, users can provide their own API keys. Most workflows complete within 20 seconds to 3 minutes. Resulting models are typically 100-200 lines long.
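For budgeting, the per-workflow range above translates into rough bounds on how many workflows a given amount covers. The following is a small arithmetic sketch in Python. The function name is ours, chosen for illustration, and it uses only the $0.10 and $0.35 per-workflow figures quoted above.

```python
# Per-workflow cost range quoted above, held in integer cents to avoid
# floating-point surprises with floor division.
WORKFLOW_COST_LOW_CENTS = 10
WORKFLOW_COST_HIGH_CENTS = 35


def workflows_for_budget(budget_cents: int) -> tuple:
    """Return (pessimistic, optimistic) counts of full workflows a budget covers."""
    pessimistic = budget_cents // WORKFLOW_COST_HIGH_CENTS
    optimistic = budget_cents // WORKFLOW_COST_LOW_CENTS
    return pessimistic, optimistic


# A $10.00 budget covers somewhere between 28 and 100 workflows.
print(workflows_for_budget(1000))  # (28, 100)
```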

The tool offers two specific workflows:

  • "Create": Takes a high-level prompt to generate and enhance new Squiggle code
  • "Improve": Fixes broken code and implements requested improvements

Example Outputs

Select recent examples:

Is restoring the Notre Dame Cathedral or donating to the AMF more cost-effective, as a charity?
Comparing health impacts against tourism benefits, the model estimates that a $1B donation to AMF would generate 8M QALYs versus 9K QALYs for Notre Dame. The core calculation uses just 12 lines of code.
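As a quick sanity check on the figures quoted above, the implied cost per QALY and the ratio between the two options can be worked out directly. This Python sketch uses only the point estimates from the summary. The generated model itself works with distributions, so treat this as back-of-the-envelope arithmetic rather than a reproduction of the model.

```python
# Point estimates quoted in the example summary above.
donation_usd = 1_000_000_000
qalys_amf = 8_000_000
qalys_notre_dame = 9_000

cost_per_qaly_amf = donation_usd / qalys_amf                # 125.0 USD per QALY
cost_per_qaly_notre_dame = donation_usd / qalys_notre_dame  # about 111,111 USD per QALY
effectiveness_ratio = qalys_amf / qalys_notre_dame          # about 889

print(f"AMF: ${cost_per_qaly_amf:,.0f} per QALY")
print(f"Notre Dame: ${cost_per_qaly_notre_dame:,.0f} per QALY")
print(f"Ratio: {effectiveness_ratio:,.0f}x")
```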

What is the cost-effectiveness of a 10-person nonprofit subscribing to Slack Pro?
This makes the core assumption that users will save 1 to 4 hours per month from having a searchable chat history. Results suggest an ROI of approximately 10x.
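To illustrate the shape of this kind of estimate, here is a minimal Monte Carlo sketch in Python. It is not the generated Squiggle model. Only the team size (10) and the 1 to 4 hours saved per person per month come from the example above. The subscription price and the value of an hour of staff time are hypothetical placeholders, and we read "1 to 4" as the 90% interval of a lognormal distribution, which is our understanding of how Squiggle treats `a to b`.

```python
import math
import random
import statistics

Z_95 = 1.6448536269514722  # 95th percentile of the standard normal


def lognormal_from_90ci(low, high, rng):
    """Sample a lognormal whose 5th and 95th percentiles are low and high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
    return rng.lognormvariate(mu, sigma)


def simulate_roi(n_samples=20_000, seed=0):
    """Return samples of monthly benefit divided by monthly subscription cost."""
    rng = random.Random(seed)
    team_size = 10                 # from the example prompt
    price_per_user_month = 9.0     # hypothetical placeholder, USD
    monthly_cost = team_size * price_per_user_month
    rois = []
    for _ in range(n_samples):
        hours_saved = lognormal_from_90ci(1, 4, rng)      # from the example
        value_per_hour = lognormal_from_90ci(20, 80, rng)  # hypothetical, USD
        monthly_benefit = team_size * hours_saved * value_per_hour
        rois.append(monthly_benefit / monthly_cost)
    return rois


rois = simulate_roi()
print(f"Median ROI: {statistics.median(rois):.1f}x")
```

Under these placeholder inputs the median lands close to 9x, the same order of magnitude as the result quoted above. Changing the assumed value of an hour shifts the answer proportionally, which is why the inputs deserve more scrutiny than the code.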

What is the cost-effectiveness of opening a new Bubble Tea store in Berkeley, CA?
Our most-tested example, popular in workshops for being fun and relevant (at least for people around Berkeley). When accounting for failure risk, the model usually shows negative expected value, suggesting a risky investment.

Estimate the cost-effectiveness of different Animal Welfare interventions
Inspired by a discussion of genetic modification of chickens.
Evaluated five intervention types, with legislative change emerging as most cost-effective, outperforming genetic modification (ranked second).

Estimate the probability that a single vote will decide the US presidential election, in various states
Inspired by some recent work for the 2024 election. Provides state-level probability estimates through two key data tables. Better input data would improve accuracy.

What is the probability that there will be an AGI Manhattan Project in the next decade, and how will it happen?
Projects 24% probability of an “AGI Manhattan Project” within a decade, primarily driven by potential crisis scenarios.

What is the expected benefit of a 30-year old male to get a HPV vaccine?
Analyzes benefits for 30-year-old males. Despite high costs ($500-$1,000 uninsured), estimates suggest $1k-$100k in expected benefits.

Estimate the cost-effectiveness of taboo activities for a teenager
Demonstrates how AI scaffolding can support reasoning that goes against social norms, in favor of first-principles thinking. If you ask Claude directly to estimate the benefits of smoking cigarettes or of recreational sex for teenagers, it will usually refuse. This estimate by Squiggle AI states that smoking cigarettes would provide $500 to $2k per year of positive social capital. This is clearly overconfident, but it demonstrates the system’s willingness to make unconventional assumptions.

A more extensive collection of examples is available here. While some older examples there were manually edited, the examples listed above were not. Examples were typically selected out of 2-3 generations each.

How Good Is It?

We don't yet have quantitative measures of output quality, partly due to the challenge of establishing ground truth for cost-effectiveness estimates. However, we do have some qualitative results.

Early Use

As the primary user, I (Ozzie) have seen dramatic improvements in efficiency - model creation time has dropped from 2-3 hours to 10-30 minutes. For quick gut-checks, I often find the raw AI outputs informative enough to use without editing.

Our three Squiggle workshops (around 20 total attendees) have shown encouraging results, with participants strongly preferring Squiggle AI over manual code writing. Early adoption has been modest but promising - in recent months, 30 users outside our team have run 168 workflows total.

Accuracy Considerations

As with most LLM systems, Squiggle AI tends toward overconfidence and may miss crucial factors. We recommend treating its outputs as starting points rather than definitive analyses. The tool works best for quick sanity checks and initial model drafts.

Current Limitations

Several technical constraints affect usage:

  • Code length soft-caps at 200 lines
  • Frequent workflow stalls from rate limits or API balance issues
  • Auto-generated documentation is decent but has gaps, particularly in outputting plots and diagrams

While slower and more expensive than single LLM queries, Squiggle AI provides more comprehensive and structured output, making it valuable for users who want detailed, adjustable, and documentable reasoning behind their estimates.

Alternatives

Direct LLM Estimates

Standard LLMs can perform basic numeric estimates quickly and cheaply. However, they typically lack support for executing programming functions or probabilistic simulations.

LLMs that Call Python

Recent systems with Python interpreters offer more computational power but aren't optimized for intuitive probabilistic modeling. The results are not as interactive as Squiggle playgrounds are.

Basic Squiggle Integration

It’s possible to ask LLMs to generate Squiggle code for you. While most LLMs don’t know much about Squiggle, you can use our custom prompt or the Squiggle Bot on ChatGPT. These can work in a pinch, but are generally a lot less powerful than Squiggle AI.

Using Squiggle AI

Getting Started

  1. Create a Squiggle Hub account (required for abuse prevention)
  2. Navigate to the "AI" section in the header menu on the right
  3. Choose your workflow type:
    • "Create" to generate new models from prompts
    • "Edit" to refine existing Squiggle code
  4. Fill out the corresponding form. Settings have hover tooltips with extra information.
  5. Click “Start Workflow”
  6. If you’d like to make multiple workflows for the same setting, click “Start Workflow” multiple times.
  7. You should see a new workflow begin. Each LLM query will likely take 5 to 10 seconds, and the full workflow can take up to 3 minutes to complete.
The Workflow form

Using Workflows

When you start a workflow, you'll see real-time updates as the LLM processes your request. The interface displays:

  • Step-by-step progress through the LLM stages
  • Detailed logs upon completion
  • The most recent working version (or last failing version if errors occur)

Note that workflows occasionally stall - if this happens, simply start a new one.

While all workflow runs are saved in our backend (and will be accessible to you), older workflow logs get cleared out over time.

Saving Your Work

To save and/or share your models:

Keep It Simple

Start with straightforward cost-effectiveness and probabilistic models. Complex requests often fail or require multiple iterations. It’s often fine to bring a bunch of considerations into a cost-benefit analysis; the challenge comes when some of them require difficult functions.

Generate Multiple Models

Generate 2-4 different model variations for each question to capture a range of perspectives. Since Squiggle AI makes parallel model generation straightforward, you can easily launch multiple workflows with identical inputs. Each individual model is fairly inexpensive.
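One simple way to use several runs together is to compare their medians and, if you want a single distribution, pool their samples with equal weight. The Python sketch below assumes you have exported output samples from each run as plain lists of positive numbers. The function names and the equal-weight pooling rule are our own illustrative choices, not something Squiggle AI does for you.

```python
import random
import statistics


def pool_runs(runs, n_per_run=1000, seed=0):
    """Build an equal-weight mixture by resampling the same count from each run."""
    rng = random.Random(seed)
    pooled = []
    for samples in runs:
        pooled.extend(rng.choices(samples, k=n_per_run))
    return pooled


def spread_of_medians(runs):
    """Ratio of the largest to the smallest run median (assumes positive values)."""
    medians = [statistics.median(samples) for samples in runs]
    return max(medians) / min(medians)


# Hypothetical outputs from two runs of the same prompt.
run_a = [1.0, 2.0, 3.0]
run_b = [10.0, 20.0, 30.0]

print(spread_of_medians([run_a, run_b]))  # 10.0
print(len(pool_runs([run_a, run_b])))     # 2000
```

A large spread between run medians is a signal that the runs rest on different assumptions, and that those assumptions are worth reading before trusting any single number.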

Be Specific with Details

When modeling specific scenarios:

  • Include key input statistics and metrics
  • Outline your intended outputs
  • Consider potential externalities and unusual but important considerations
  • Bring in research from tools like Perplexity

Optimize Your Workflow

  • Start with minimal settings (0 "Numeric" and "Documentation" steps)
  • Start with the “Create” workflow, then later use the "Improve" workflow to enhance your favorite models
  • Break complex problems into smaller components (100-200 lines each)

API Usage

For extensive use (20+ runs), consider using your own Anthropic API key. This helps ensure consistent service and reduces load on our shared token.

Complementary Tools

Research and Data Gathering

While Squiggle AI excels at model generation, it doesn't perform web searches or data collection. For comprehensive analysis, consider combining it with:

Alternative LLM Integration

OpenAI o1 and other models can complement Squiggle AI's capabilities. While these tools may outperform Claude 3.5 Sonnet on certain tasks, they typically come with other trade-offs. They can be useful for early ideation, coming up with crucial considerations, or even starting off with a Python model or similar.

Probability Distribution Tools

For estimating specific probability distributions, consider emerging tools like FutureSearch. The input estimates that Squiggle AI generates are often not well calibrated.

This ecosystem approach - using each tool for its strengths - often produces the best analyses.

Privacy

Squiggle AI outputs on Squiggle Hub are private. If you want to share them or make them public, you can explicitly do that by creating new models.

Note that our team can access Squiggle AI results for diagnostic purposes. If you require more complete data privacy, please contact us to discuss options.

Technical Details

Performance and Costs

Since Squiggle is a young programming language, LLM-written Squiggle code typically needs more troubleshooting and adjustment than LLM-written code in more established languages.

Squiggle AI currently uses Claude 3.5 Sonnet for all operations. It makes use of prompt caching to cache a lot of information (around 20k tokens) about the Squiggle language. LLM queries typically cost around $0.02 each to run, and more when large models are being edited.

In terms of accuracy, do not place trust in the default distributions or in the key model assumptions. We’ve found that LLM outputs are often highly overconfident (despite some warnings in the prompts), and often leave out important considerations. This is another reason to do runs multiple times (often, different runs reach very different results), and to adjust models heavily with your own views. In the future, there’s clearly more engineering work that can be done in this area, yet it will likely be very difficult to do fully robustly.

State Machine Details

Squiggle AI makes use of a set of discrete steps, each with specific instructions and custom functionality.

  1. Generation: Makes a first attempt at writing Squiggle code, based on a provided prompt. Tries to naively follow our Style Guide, including writing some tests.
  2. Bug Fixing: Attempts to fix parsing or runtime errors. This step is run until there are no further errors.
  3. Update Estimates: Checks the results of the model. If there are broken tests or suspicious outputs, it will make changes.
  4. Document: Makes changes to better match the style guide. This typically means improving variable annotation and model documentation, but it sometimes also means adjusting variables or adding code organization.

Step 1 is run once, step 2 is run until the code successfully executes, and steps 3 and 4 are run as many times as the user requests. In informal testing, it seems like more runs of steps 3 and 4 improve performance up to some point.

A simplified state machine, showing a typical path. Note that the stages “Update Estimates” and “Document” can be optionally run 0 or more times, based on user request. Also, the “Fix” state is used whenever there is broken code, even if it comes from “Update Estimates” or “Document.”
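The control flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual implementation: `llm_step` and `has_errors` are hypothetical callables standing in for the LLM call and the Squiggle parser/runtime check, `max_fix_attempts` is a safety cap we added, and running all "Update Estimates" steps before all "Document" steps is our guess at the ordering.

```python
from enum import Enum, auto
from typing import Callable, List, Tuple


class Step(Enum):
    GENERATE = auto()
    FIX = auto()
    UPDATE_ESTIMATES = auto()
    DOCUMENT = auto()


def run_workflow(
    prompt: str,
    llm_step: Callable[[Step, str], str],  # hypothetical: performs one LLM step
    has_errors: Callable[[str], bool],     # hypothetical: parse/runtime check
    numeric_steps: int = 1,
    documentation_steps: int = 1,
    max_fix_attempts: int = 5,             # assumption: cap to avoid endless loops
) -> Tuple[str, List[Step]]:
    """Run Generate once, Fix until clean, then the optional improvement steps."""
    trace: List[Step] = []

    def fix_until_clean(code: str) -> str:
        attempts = 0
        while has_errors(code):
            if attempts >= max_fix_attempts:
                raise RuntimeError("could not repair the code")
            code = llm_step(Step.FIX, code)
            trace.append(Step.FIX)
            attempts += 1
        return code

    code = llm_step(Step.GENERATE, prompt)
    trace.append(Step.GENERATE)
    code = fix_until_clean(code)

    plan = ((Step.UPDATE_ESTIMATES, numeric_steps), (Step.DOCUMENT, documentation_steps))
    for step, count in plan:
        for _ in range(count):
            code = llm_step(step, code)
            trace.append(step)
            # Any step can break the code, so Fix may run again here.
            code = fix_until_clean(code)
    return code, trace
```

Setting `numeric_steps` and `documentation_steps` to 0 corresponds to the minimal configuration: one generation pass followed only by whatever fixing is needed.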

Conclusion

We think Squiggle AI demonstrates the potential for combining LLMs with specialized programming frameworks to make probabilistic modeling more accessible. While the tool has limitations and should be used thoughtfully, early results suggest it can significantly reduce the barrier to entry for cost-effectiveness analysis and Fermi estimation. We invite the community to experiment heavily with the tool and to share takes and feedback.


Appendix: Lessons Learned During Development

LLM cost-effectiveness estimates seem tractable

We spent a few months on this scaffolding, and have found the final tool both useful and promising. There’s clearly a lot of useful work to be done in this area, even without fundamental advances in LLMs.

Given the current performance and limitations of Squiggle AI, it’s not clear how much we would expect it to be used now, especially without significant promotion and advertising. But it does seem clear that at the very least, future related tools have significant potential.

Cost-effectiveness estimation can fit into a larger category of “strategic decision-making.” So far, there’s been significantly more work to use LLMs to solve narrow tasks like coding problems, than there’s been to use LLMs to make high-level and general-purpose strategic decisions. While the latter might be difficult, we think it is tractable.

LLMs are willing to estimate more than you might expect, with some prompting

From Ozzie: 

In my use with Squiggle AI, I’ve been surprised at how much the LLMs were willing to go with. Claude is known for being overly restrictive against potentially controversial topics, but in my experience, it was often willing to make controversial estimates, with the right prompting. Many humans complain about cost-benefit analyses about sensitive topics, but LLMs are often much more willing.

There have been a few times where I’d ask Claude to tell me which of a few decisions seemed the best, and it would be very reluctant to answer. Then I’d ask it to estimate the impact of each answer and rank them, and it would do this right away.

Specialized Steps Improve Results

LLMs with very long prompts tend to forget or neglect large portions of the prompt. This makes it difficult to rely on one or two large prompts. Instead, it often seemed better to have hand-crafted prompts for different situations. So far we have done this for the four specific steps listed above, but we imagine that there’s a lot of further expansion to do here.

Claude 3.5 Sonnet often only returns a limited amount of text (up to around 200 lines). This means that any one step can only write so much code or suggest so many changes. It therefore can take multiple prompts for more complex things.

It’s difficult to make complex models understandable

As with other kinds of code, there can be a lot going on with estimation using programming. On one hand, all the key numbers and assumptions are organized in code. On the other hand, there’s a lot of information to represent.

Many viewers might not understand programming well, so providing an overview to these people is a challenge. In addition to the code, variables need substantial detailing. Many quantitative estimates can be highly sensitive to specific assumptions and definitions, so these should be carefully specified. Not only should the programs and assumptions be presented - outputs should be expressed as well. These are often multidimensional and might do best with custom visualizations.

Quick understandability isn’t particularly important for model authors, who already have a deep understanding of their work. But in cases where an AI can very quickly write models, it’s more important for users to be able to quickly understand the model inputs, assumptions, and outputs.

We’ve spent a lot of time on the Squiggle Playground and on a specific Style Guide to help address this, but it remains a significant challenge.

Prompt engineering can be a lot of work

A lot of building this tool has been making a big list of common errors that LLMs make when writing Squiggle code, along with their solutions. Most of these were addressed at the prompt level, though we also added a bunch of regular expressions and formal code checks to detect common errors.
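As an illustration of what a regular-expression check of this kind can look like, here is a minimal Python sketch. The two rules are placeholders chosen for illustration (Python-style syntax that we assume is not valid Squiggle). They are not the checks the tool actually uses, and the names `Rule`, `RULES`, and `find_issues` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    pattern: "re.Pattern[str]"
    hint: str


# Placeholder rules: assumed examples of Python habits leaking into generated code.
RULES = [
    Rule(
        "python_def",
        re.compile(r"^[ \t]*def\s+\w+\s*\(", re.MULTILINE),
        "looks like a Python function definition",
    ),
    Rule(
        "python_elif",
        re.compile(r"^[ \t]*elif\b", re.MULTILINE),
        "looks like a Python elif branch",
    ),
]


def find_issues(code: str) -> List[str]:
    """Return one human-readable message per rule match, with a 1-based line number."""
    issues = []
    for rule in RULES:
        for match in rule.pattern.finditer(code):
            line_no = code.count("\n", 0, match.start()) + 1
            issues.append(f"line {line_no}: {rule.name}: {rule.hint}")
    return issues


sample = "x = 1\ndef f(x):\n  return x\n"
print(find_issues(sample))
```

Messages like these can be fed back to the LLM in a fix step, which is cheaper than waiting for a parser error and often more specific.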

We haven’t set up formal evaluation systems yet, in large part because these can be complex and expensive to run. If we had additional time and budget, we’d likely do more work here.

Comments (11)

I have really enjoyed using Squiggle AI for estimation tasks, and particularly cost-benefit analyses, since going to your talk about it at Minifest in mid December. Thanks for this post and for building this!

I like this, and have been trying a similar visual approach using squiggle. I agree that LLM estimation using squiggle seems tractable and that it could help turn many text outputs into quantifiable/comparable numerical outputs. 

I am interested in creating a space to compare/rank these outputs. @Ozzie Gooen do you see squiggle hub as the space for this?

Oh interesting. Can you explain more about what you mean, and how this would work? I think there are a lot of ways this sort of thing could be done. 

This is so cool ! I'm using it to improve the BOTECs and cost-effectiveness estimates in my research into effective zakat, and islamic FAW interventions

Good to hear! Do let us know if there are any frustrations you have or improvements you'd like to see!

Recent systems with Python interpreters offer more computational power but aren't optimized for intuitive probabilistic modeling. The results are not as interactive as Squiggle playgrounds are.

I believe https://marimo.io notebooks with PyMC models could be made every bit as interactive and illustrative as Squiggle. And Marimo already has free cloud hosting, notebook sharing, export to interactive Quarto pages, online editor, AI assistant for cells, tabular data integrations, and so on...

For "intuitiveness", I agree that PyMC syntax in straightforward estimation cases where the user doesn't want to "infer" anything in a Bayesian sense from data, but rather chain a few distributions and see how they act together (specifically: using exclusively pm.Deterministic variables in the model and then pm.sample_prior_predictive() at the end), would be maybe somewhat more cumbersome than Squiggle, but is this difference significant enough to motivate supporting own language, web server, etc.? In addition, PyMC models could be scaled beyond these simplest models, I suspect farther than Squiggle permits.

This space can move somewhat quickly. I just looked into Marimo - seems interesting. It was announced about a year ago and seems to be run as an independent project by two people.

I think it's easy to get burned by jumping on neat new projects. Before I've had people argue that we should have been deep into the Julia ecosystem, or at some point, the OCaml ecosystem (OWL seemed neat for a few years, but then the lead developers left). We previously were excited by ReasonML / Rescript, but then that sort of fizzled out. 

We started Squiggle over 3 years ago and published the first main version, with the editor, 2 years ago. Then we wrote about how we didn't think that Python made sense at that point.

I'd flag that "Squiggle AI", despite the name, is fairly language-independent. Most of the software and learnings would allow us to change languages without too much difficulty (until/unless we really get into the details of composability). AI is also often good at translating between languages. We think we could have it optionally or only output Python, if that's a feature users would want later on, or if we think that's best.

All that said, I appreciate the suggestion. I don't think we made the wrong move looking back, but we'll keep our eyes on new technologies like this. Right now we work well with Squiggle - the UI / UX is very optimized for this kind of estimation, and it's very easy for us to customize and interact with. But it's definitely the case that it's a lot of work, and it might be the case that one of these options winds up good enough to spend the effort and risk to transfer to. 

Yeah, rather than Roman's argument feeling to me like a reason not to use Squiggle, this feels more like a reason for Squiggle to incorporate some python behind the scenes.

I think the target audience of squiggle is people who aren't comfortable with complex code, but who are comfortable with probabilistic thinking.

Seems like having a set of structured queries for LLMs, plus the custom squiggle code, plus allowing the models to improvise python and JS code... Could be a powerful tool that would be much easier for most people to use.

If you'd prefer, feel free to leave questions for Squiggle AI here, and I'll run the app on them and respond with the results.

Oh, I like this, that's cool! I used squiggle in Futuresearch when trying to get "ground truth" and spreadsheets were not having it, and I used "doc recommended" approach of taking the docs and shoving that into claude's project context. This seems much better.

Executive summary: QURI has released Squiggle AI, a tool that uses language models to automatically generate probabilistic models and cost-effectiveness analyses, making complex estimation more accessible while serving as an experiment in AI-assisted reasoning.

Key points:

  1. Tool combines LLMs with Squiggle programming language to generate cost-effectiveness analyses and Fermi estimates, requiring no prior coding knowledge
  2. Current performance shows promise but has limitations: overconfidence in estimates, 200-line code limit, and occasional workflow stalls
  3. Typical workflow costs $0.10-0.35 and takes 20 seconds to 3 minutes, producing 100-200 line models
  4. Early testing shows significant efficiency gains (reducing model creation from 2-3 hours to 10-30 minutes) but outputs should be treated as starting points rather than definitive analyses
  5. Best practices include generating multiple models per question, being specific with inputs, and combining with complementary research tools
  6. Development revealed that LLMs can handle controversial estimates with proper prompting, but making complex models easily understandable remains challenging

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
