Apple: ‘Reasoning’ AIs fail hard if they actually have to think

David Gerard@awful.systems · 2 months ago

Soyweiser@awful.systems · 2 months ago

The latter test fails if they write a specific bit of code to put out the ‘LLMs fail the river crossing’ fire, btw. Still a good test.

diz@awful.systems · 2 months ago

It would have to be more than just river crossings, yeah.

Although I’m also dubious that their LLM is good enough to solve arbitrary river crossing puzzles using a tool. It’s not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave showed incorrect code being run and then a correct answer magically appearing, so I don’t think it was anything quite as general as that.
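
To make the "translate in, translate back" point concrete, here is a minimal sketch of what such a tool might look like. It is not what o3 or any vendor actually runs. The names `solve_crossing`, `is_safe` and `describe` are hypothetical, and the puzzle is assumed to have one implicit farmer who always rows. The search itself is a plain breadth-first search. The part a model has to get right is the `is_safe` predicate (the puzzle's constraints) and the wording of the answer at the end.

```python
from collections import deque
from itertools import combinations

def solve_crossing(items, is_safe, boat_capacity=1):
    """Breadth-first search over bank states; returns a shortest list of trips.

    items: things to ferry (assumption: the farmer is implicit and always rows).
    is_safe: predicate on the frozenset of items left on a bank without the farmer.
    boat_capacity: how many items the farmer may carry per trip.
    Returns a list of (departure_side, cargo) pairs, or None if unsolvable.
    """
    everything = frozenset(items)
    start = (everything, "left")        # (items on left bank, farmer's side)
    goal = (frozenset(), "right")
    parent = {start: None}              # state -> (previous state, cargo)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        left, side = state
        here = left if side == "left" else everything - left
        new_side = "right" if side == "left" else "left"
        for k in range(boat_capacity + 1):
            for cargo in combinations(sorted(here), k):
                cargo = frozenset(cargo)
                new_left = left - cargo if side == "left" else left | cargo
                # The bank the farmer just left is the unattended one.
                unattended = new_left if new_side == "right" else everything - new_left
                if not is_safe(unattended):
                    continue
                nxt = (new_left, new_side)
                if nxt not in parent:
                    parent[nxt] = (state, cargo)
                    queue.append(nxt)
    if goal not in parent:
        return None
    trips = []
    state = goal
    while parent[state] is not None:    # walk back from goal to start
        prev, cargo = parent[state]
        trips.append((prev[1], sorted(cargo)))
        state = prev
    return trips[::-1]

def describe(trips):
    """Translate the tool's answer back into plain sentences."""
    lines = []
    for side, cargo in trips:
        what = " and ".join(cargo) if cargo else "nothing"
        dest = "right" if side == "left" else "left"
        lines.append(f"Row to the {dest} bank carrying {what}.")
    return lines

# The classic puzzle, with its constraints translated into a predicate.
def classic_is_safe(bank):
    return not ({"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank)

classic = solve_crossing(["wolf", "goat", "cabbage"], classic_is_safe)
```

Changing the puzzle means changing `is_safe` (and possibly `boat_capacity`), which is exactly the step where a memorized answer to the classic version stops helping. Variants that change the rules themselves, such as who is allowed to row, would need the state and moves rewritten too, not just the predicate.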