Apple Machine Learning Research has a new preprint: “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” [Apple, PDF] “Lar…
Further support for the memorization claim: I posted examples of novel river crossing puzzles where LLMs completely fail (on this forum).
Note that Apple’s actors / agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t really its own explanation but a plagiarized one, even if it changes the words).

edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506
I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much the public set improves versus the held-back one.
That would be the best way to actively catch the cheating happening here, given that the training datasets remain confidential. But I also don’t know that it would be conclusive or convincing unless you could be certain that the problems in the private set were similar to the public set.
In any case, either you’re double-dipping for credit in multiple places, or you absolutely should get more credit for the scoop here.
I’d just write the list, then assign randomly. Or perhaps pseudorandomly: sort by hash and then split in two.
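For what it’s worth, a minimal sketch of that hash-based split (Python; the puzzle texts and the choice of SHA-256 are just placeholder assumptions, not anything from the paper):

```python
import hashlib

def split_puzzles(puzzles):
    """Deterministically split puzzles into a public set and a held-back set
    by sorting on the SHA-256 hash of each puzzle's text and cutting the
    sorted list in half."""
    ranked = sorted(puzzles, key=lambda p: hashlib.sha256(p.encode("utf-8")).hexdigest())
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]  # (public, held_back)

# Placeholder puzzle texts for illustration only
puzzles = [f"puzzle {i}: ..." for i in range(20)]
public_set, held_back_set = split_puzzles(puzzles)
```

The advantage over plain random assignment is that anyone with the full puzzle list can reproduce the split later, so you can’t be accused of picking the sets after the fact.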
One problem is that it is hard to come up with 20 or more completely unrelated puzzles.
Although I don’t think we need a large number for statistical significance here, if it’s something like 8/10 solved in the public (“cheating”) set and 2/10 in the held-back set.
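Quick sanity check on that intuition, using a Fisher exact test on the hypothetical 8/10 vs 2/10 split (those counts are just the example above, not real results):

```python
from scipy.stats import fisher_exact

# Hypothetical outcome: 8/10 solved in the public set, 2/10 in the held-back set
table = [[8, 2],   # public set: solved, unsolved
         [2, 8]]   # held-back set: solved, unsolved

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)  # roughly 0.02, so ten puzzles per set could already be enough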