I love to show that kind of shit to AI boosters. (In case you’re wondering, the numbers were chosen randomly and the answer is incorrect).

They go waaa waaa it's not a calculator, and then I can point out that it got the leading 6 digits and the last digit correct, which is a lot better than it did on the "softer" parts of the test.

  • scruiser@awful.systems · 5 points · 7 hours ago

    Have they fixed it as in it genuinely uses Python completely reliably, or "fixed" it, as in they tweaked the prompt and now it uses Python 95% of the time instead of 50/50? I'm betting on the latter.

    • diz@awful.systems (OP) · 2 points · 3 hours ago

      Yeah, I'd also bet on the latter. They also added a fold-out button that shows you the code it wrote (folded by default), but you have to unfold it, or notice that it is absent.

    • aramova@infosec.pub · 4 points · 6 hours ago

      Non-deterministic LLMs will always have randomness in their output. The best they can hope for is layers of sanity checks, slowing things down and costing more.

      • scruiser@awful.systems · 5 points · 6 hours ago

        If you wire the LLM directly into a proof-checker (like with AlphaGeometry) or evaluation function (like with AlphaEvolve) and the raw LLM outputs aren’t allowed to do anything on their own, you can get reliability. So you can hope for better, it just requires a narrow domain and a much more thorough approach than slapping some extra firm instructions in an unholy blend of markup languages in the prompt.
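        The pattern is roughly this (a toy sketch, not AlphaGeometry's actual code; `llm_propose` and `proof_checker` are hypothetical stand-ins for the sampler and the formal checker):

```python
# Sketch of the "LLM + checker" pattern: the model only *proposes*
# candidates; a deterministic verifier decides what counts as an answer.
# Both functions below are toy stand-ins, not real APIs.

def llm_propose(problem, attempt):
    # Stand-in for one sampled LLM output; here just a toy guess.
    return problem + attempt

def proof_checker(problem, candidate):
    # Stand-in for a formal checker: deterministic accept/reject.
    return candidate % 7 == 0  # toy acceptance criterion

def solve(problem, max_attempts=100):
    for attempt in range(max_attempts):
        candidate = llm_propose(problem, attempt)
        if proof_checker(problem, candidate):
            return candidate  # only verified outputs escape the loop
    return None  # raw LLM output is never trusted on its own
```

        The reliability comes entirely from the checker: however random the proposals are, nothing unverified ever reaches the user.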

        In this case, solving math problems is actually something Google search could previously do (before dumping AI into it) and Wolfram Alpha can do, so it really seems like Google should be able to offer a product that does math problems right. Of course, this solution would probably involve bypassing the LLM altogether through preprocessing and post-processing.
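        A minimal sketch of what that preprocessing could look like (my own illustration, not anything Google ships): if the query parses as pure arithmetic, evaluate it exactly and never touch the model.

```python
# Preprocessing sketch: route pure-arithmetic queries around the LLM
# entirely, using Python's ast module for exact, safe evaluation.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("not pure arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def answer(query):
    try:
        return str(safe_eval(query))  # exact answer, no LLM involved
    except (ValueError, SyntaxError):
        return None  # fall through to the model (not sketched here)
```

        Anything the parser rejects falls through to whatever the model does; anything it accepts is answered exactly, every time.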

        Also, btw, LLMs can be (technically speaking) deterministic if the temperature is set all the way down to zero; it's just that this doesn't actually improve their performance at math or anything else. And it would still be "random" in the sense that minor variations in the prompt or previous context can induce seemingly arbitrary changes in output.
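        In toy form (my own illustration of the sampling step, not any vendor's implementation): at temperature zero, sampling collapses to argmax, so the same logits always yield the same token.

```python
# Toy token sampler: temperature 0 means greedy argmax (deterministic),
# any positive temperature means weighted random choice over softmax.
import math
import random

def sample_token(logits, temperature, rng):
    if temperature == 0:
        # Greedy decoding: no randomness left, always the top logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(logits) - 1
```

        Note the determinism is per-logits: change the prompt even slightly and the logits change, so the "same question, different wording" failure mode survives temperature 0 intact.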