AI training-data scrapers are constantly attacking the feddit.online instance. In the last 24 hours, about 70% of all traffic here came from bots with nefarious intent.

This is a continuation of the problems experienced last weekend, which were temporarily snuffed out when Cloudflare filtered the bots, but they always return eventually.

This is the reason many users have been complaining about poor uptime/stability.

  • Jerry on PieFed@feddit.online [M] · 8 days ago

    Thanks for understanding. I think the server is functioning normally now. Please let me know if it isn’t for you.

    It all began a few days ago. It was strange, but all the AI bots started a feeding frenzy on the server. The worst, by far, was coming out of Vietnam. It accounted for at least 60% of the AI traffic. Then Bangladesh took another 25%. In 3rd place was Claude. There was Tencent, Amazon, Meta, GPT, Baidu, Yandex, and more. All the bots you know well were scraping, except Google. It was like a shark feeding frenzy. They all ignored the robots.txt file (except Google).

    The server became unresponsive.

    I think I’ve got them all blocked now at the Cloudflare edge. It appears that 99% of the traffic I’m seeing is real. But I’ll keep monitoring. They constantly change how they access servers to break through, so I’m watching.

    In the past 24 hours, the firewall has blocked over 700,000 AI bot requests.

    I read that Lemmy/PieFed/Mbin are now the darling targets for training AI. Not for knowledge, but to learn how to write like a human. So they are all out to read the entire databases of Threadiverse servers.

    Feddit has over 2 years of just about every single post and comment from every Threadiverse server in a 130 GB database, so we’re apparently a juicy target.
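For context on the robots.txt point above: the file is only advisory, and the check happens on the crawler's side, which is why ignoring it is trivial. A minimal sketch of what a well-behaved crawler does before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URLs, and the `is_allowed` helper are assumptions for illustration, not the actual feddit.online configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; NOT the real feddit.online file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("GPTBot", "https://example.org/post/1"))
    print(is_allowed("Mozilla/5.0", "https://example.org/post/1"))
```

Nothing forces a scraper to run a check like this, so actual enforcement has to happen server-side, for example in firewall rules at the edge.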

    • Elvith Ma'for@feddit.org · 8 days ago

      If they really wanted to train on Fediverse data sets, why don’t they just set up an ActivityPub-compatible endpoint, subscribe to a fuck ton of communities, users, hashtags, … and get it all delivered to them for free?!

      • WhoIzDisIz@lemmy.today · 8 days ago (edited)

        If I had the time, knowledge, & hardware, I’d be setting up honeypots to accomplish exactly that, rotating through new names every so often as they figured out the truth of the current one.

        • FiniteBanjo@feddit.online [OP] · 8 days ago

          I remember hearing about one type of AI defence: the page has some invisible redirects at the top that human users won’t see but bots will, and these take the crawlers into an infinite maze of constantly regenerated random redirects.

          What you described is that but more nefarious because it would take the AI owners longer to notice they’re just wasting compute time lol.
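The maze idea above could be sketched roughly like this. It is only an illustration under assumptions, not any particular tool's implementation: the `/maze/` prefix and both function names are made up, and it uses ordinary links instead of HTTP redirects, since many HTTP clients cap the length of redirect chains. Each page is derived from its own URL, so the server stores nothing and the maze never ends.

```python
import hashlib
import random

MAZE_PREFIX = "/maze/"  # hypothetical URL prefix reserved for the trap


def maze_page(path: str, links_per_page: int = 5) -> str:
    """Return HTML for a maze page whose links lead only to more maze pages."""
    # Seed the generator from the path so the same URL always renders the
    # same page, without keeping any state on the server.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    links = [
        f"{MAZE_PREFIX}{rng.getrandbits(64):016x}" for _ in range(links_per_page)
    ]
    items = "\n".join(f'<li><a href="{href}">{href}</a></li>' for href in links)
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def hidden_entry_link() -> str:
    """Return a link hidden from human visitors, to be embedded in real pages."""
    return (
        f'<a href="{MAZE_PREFIX}start" style="display:none" rel="nofollow">'
        "archive</a>"
    )


if __name__ == "__main__":
    print(hidden_entry_link())
    print(maze_page(MAZE_PREFIX + "start"))
```

A web server would call `maze_page` for any request under the prefix. Listing the prefix as disallowed in robots.txt would keep crawlers that honour it out of the trap.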