• Soyweiser@awful.systems
    1 day ago

    The latter test fails if they write a specific bit of code to put out the ‘LLMs fail the river crossing’ fire, btw. Still a good test.

    • diz@awful.systems
      4 hours ago

      It would have to be more than just river crossings, yeah.

      Although I’m also dubious that their LLM is good enough to solve river crossing puzzles in general using a tool. It’s not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave showed incorrect code being run and then a correct answer magically appearing, so I think it wasn’t anything quite as general as that.
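
For a sense of what "a tool" could mean here, below is a minimal sketch of a generic river crossing solver: a breadth-first search over bank states. It is only an illustration, not what o3 or any vendor actually runs. The function name `solve_crossing`, the `forbidden` pair format, and the `boat_capacity` parameter are all hypothetical choices. The search itself is the easy part; the hard part is the translation step, i.e. turning a puzzle written in prose into `items` and `forbidden`, and many variants have constraints that don't fit this format at all.

```python
from collections import deque
from itertools import combinations


def solve_crossing(items, forbidden, boat_capacity=1):
    """Return a shortest list of boat loads (tuples), or None if unsolvable.

    Hypothetical input format: `items` is an iterable of names, `forbidden`
    is a list of frozensets that may not share a bank without the ferryman.
    """
    all_items = frozenset(items)

    def safe(bank):
        # A bank without the ferryman must not contain any forbidden group.
        return not any(group <= bank for group in forbidden)

    start = (all_items, "L")   # (items on the left bank, ferryman's side)
    goal = (frozenset(), "R")
    parent = {start: None}     # state -> (previous state, cargo carried)
    queue = deque([start])

    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk the parent links back to the start to recover the moves.
            moves = []
            while parent[state] is not None:
                state, cargo = parent[state]
                moves.append(cargo)
            return moves[::-1]

        left, side = state
        here = left if side == "L" else all_items - left
        new_side = "R" if side == "L" else "L"
        for k in range(boat_capacity + 1):
            for combo in combinations(sorted(here), k):
                cargo = frozenset(combo)
                new_left = left - cargo if side == "L" else left | cargo
                # The bank the ferryman just left is the unattended one.
                unattended = new_left if new_side == "R" else all_items - new_left
                if not safe(unattended):
                    continue
                nxt = (new_left, new_side)
                if nxt not in parent:
                    parent[nxt] = (state, tuple(sorted(cargo)))
                    queue.append(nxt)
    return None


# Classic wolf / goat / cabbage instance, encoded by hand.
classic = solve_crossing(
    ["wolf", "goat", "cabbage"],
    [frozenset({"wolf", "goat"}), frozenset({"goat", "cabbage"})],
)
```

Note that the encoding above was done by a human. A variant with, say, a boat that can't cross empty or an item that has to stay behind would need changes to the solver itself, which is where "translate the constraints" stops being a formality.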