

Excerpt from a message I just posted in a #diaspora team internal forum category. The context here is that I recently got pinged about slowness/load spikes on the diaspora* project web infrastructure (Discourse, Wiki, the project website, ...), and looking at the traffic logs makes me impressively angry.


In the last 60 days, the diaspora* web assets received 11.3 million requests. That works out to 2.19 req/s - which honestly isn't that much. I mean, it's more than your average personal blog, but nothing my infrastructure shouldn't be able to handle.
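For reference, the back-of-the-envelope math behind that rate, assuming a flat 60-day window:

```python
# quick sanity check of the request rate quoted above
total_requests = 11_300_000              # requests over the last 60 days (rounded)
window_seconds = 60 * 24 * 60 * 60       # 60 days in seconds
print(total_requests / window_seconds)   # ~2.18 req/s; the unrounded total gives the ~2.19 quoted above
```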

However, here's what's grinding my fucking gears. Looking at the top user agent statistics, here are the leaders:

  • 2.78 million requests - or 24.6% of all traffic - is coming from Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot).
  • 1.69 million requests - 14.9% - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  • 0.49m req - 4.3% - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
  • 0.25m req - 2.2% - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36
  • 0.22m req - 2.2% - meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)

and the list goes on like this. Summing up the top UA groups, it looks like my server is doing 70% of all its work for these fucking LLM training bots that don't do anything except crawl the fucking internet over and over again.

Oh, and of course, they don't just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don't give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki. And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10 req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users.

If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

Just for context, here's how sane bots behave - or, in this case, classic search engine bots:

  • 16.6k requests - 0.14% - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • 15.9k req - 0.14% - Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36

Because those bots realize that there's no point in crawling the same stupid shit over and over again.

I am so tired.

In reply to Dennis Schubert

I wish there was some way to automagically harm them somehow.

Like detect them and send them off to sniff something that wrecks their training data, like a list of randomly generated nonsense words or something.

In reply to Dennis Schubert

Would it be possible to redirect that sort of traffic to a decoy site/file?
In reply to Dennis Schubert

Yes. I plan to redirect them to randomly generated text built from LLM-generated snippets of absolute nonsense (and it isn't static, so it looks slightly different each time the page is loaded).

Sadly, I need to finish some ongoing infrastructure restructuring before I can deploy that across everything I host.
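Roughly the shape of the idea, as an illustrative sketch only - the Flask app, the snippet pool, and the user-agent markers below are placeholder assumptions, not the actual deployment:

```python
# Illustrative sketch: serve shuffled nonsense to known crawler user agents,
# so the decoy page looks slightly different on every load.
import random
from flask import Flask, request

app = Flask(__name__)

# hypothetical pool of pre-generated nonsense snippets (generated once, offline)
NONSENSE = [
    "The marmalade protocol negotiates its own birthday.",
    "Seventeen umbrellas agreed to federate the moon.",
    "A spare vowel was deployed to production on a Tuesday.",
    "The wiki's edit history is best enjoyed backwards, twice.",
]

BOT_MARKERS = ("GPTBot", "Amazonbot", "ClaudeBot", "meta-externalagent")

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def decoy(path):
    ua = request.headers.get("User-Agent", "")
    if any(marker in ua for marker in BOT_MARKERS):
        # pick a random subset of snippets, in random order, so no two responses are identical
        lines = random.sample(NONSENSE, k=random.randint(2, len(NONSENSE)))
        return "<p>" + "</p><p>".join(lines) + "</p>"
    return "real content would be served here"

if __name__ == "__main__":
    app.run()
```

In practice something like this would sit behind the existing web server and only kick in for bot traffic - and, as noted above, user-agent detection alone is easy for crawlers to evade.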

In reply to Dennis Schubert

Here are some of the ideas/responses I got on Lemmy: awful.systems/post/3153500
Apologies for cross-posting, but I thought you might not be seeing them over there.
In reply to Dennis Schubert

Apologies for an angry cross-post mastodon.social/@khobochka/113…

There are some suggestions there, ranging from easier than redirecting to generated content to more complex, e.g. redirecting to Hetzner's speedtest file, setting up tarpits for bots, and a few block lists.
There's also a list of AI fuckery here; a few hops away from it, people confirm that 20-75% of their traffic (depending on how much content is on the site) is LLM crawlers, and that these outrank the usual WordPress attacks during on-call shifts.

None of this, however, makes up for the sad reality that it would eat away your time and compute simply because the LLM training infrastructure exists. That makes me absolutely livid.

In reply to Dennis Schubert

THIS IS exactly what I have been dealing with. And Cloudflare does nothing to stop it.
In reply to Dennis Schubert

Sorry to read this. Hope there's a way to mitigate this.

I have to be cynical, but I find it pretty rich that these companies, as well as the techbro scene in general, will espouse meritocracy at every turn, yet happily siphon off the work and hobby time of open source and Fediverse enthusiasts without explicitly giving back. But maybe I'm also naive and don't realise that they support Fediverse infrastructure/coding. Correct me if I'm wrong.

In reply to Dennis Schubert

They do not, in fact, give anything back to the people they are stealing labour from.
In reply to Dennis Schubert

They also don't give a single flying fuck about robots.txt, because why should they.


Not as a specific request to you, @denschub, but in general I'm just wondering whether German law could apply to this. In a completely different context I was pointed to Vertragsfreiheit (freedom of contract) and stumbled over the following "detail":

Wikipedia on Vertragsfreiheit wrote:

"sofern sie nicht gegen zwingende Vorschriften des geltenden Rechts, gesetzliche Verbote oder die guten Sitten verstoßen.
..
Ferner kann bei einer durch Drohung oder Täuschung erwirkten Willenserklärung die Anfechtung erklärt werden."


Even though there hasn't been any kind of contract with those bots in the first place, robots.txt is a kind of general agreement on best practice and "gute Sitten". Somehow, "Täuschung" (deception) could be considered too, as well as the damage from the wasted energy you have to pay for.

All this is just another "move fast and break things" attitude from those who can afford it: they have the money and the backing of their all-powerful government.

So, I wonder if some lawyers in the EU could blow a really big hole into the hull of these ships to sink them and set an example.

btw
These bots are collecting data and information about the very people who tried to escape them. I wonder what John Connor's mother would think about that.

In reply to Dennis Schubert

for some reason, this post went semi-viral on mastodon and hackernews, and I now have a fun combination of

  1. people who question my experience because the current robots.txt of the wiki or some arbitrary archived version did not contain any relevant blocks (despite me not specifying when or how I blocked them),
  2. people who have had the exact same experiences before,
  3. people offering "suggestions", despite me not asking for any - and a lot of those "suggestions" either don't make any sense, or involve hosting an LLM myself to generate nonsense for LLMs to consume, while also wasting another fuckton of resources.

I love the internet.

In reply to Dennis Schubert

@Dennis Schubert

(despite me not specifying when or how I blocked them)


Um, what did you do? I would love my bandwidth back, lol

In reply to Dennis Schubert

Um, what did you do?


I turned off the public history pages.
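Mechanically, the effect is that anonymous requests for MediaWiki history/diff URLs are no longer served. A sketch of that kind of check - illustrative only, not the actual implementation - could look like this:

```python
# Illustrative sketch: refuse MediaWiki history/diff pages to anonymous clients.
from urllib.parse import urlparse, parse_qs

def is_history_request(url: str) -> bool:
    """True for MediaWiki URLs that render page history or individual revisions/diffs."""
    qs = parse_qs(urlparse(url).query)
    return "history" in qs.get("action", []) or "diff" in qs or "oldid" in qs

def allow(url: str, logged_in: bool) -> bool:
    # anonymous clients (and therefore crawlers) don't get history/diff pages
    return logged_in or not is_history_request(url)

assert allow("/wiki/index.php?title=FAQ&action=history", logged_in=False) is False
assert allow("/wiki/index.php?title=FAQ", logged_in=False) is True
```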

In reply to Dennis Schubert

@Dennis Schubert not sure if that would be the equivalent of privatizing all the user pages, or if I could even do something similar with Friendica
In reply to Andreas G

offtopic

reply guys


@Andreas G

Why shouldn't people consider, discuss, and react to important current developments in their environment?
Especially on something as crucial as LLMs right now, and their effect on a community like ours?
Dennis proved and published a very important matter - not unexpected, quite the contrary - but he investigated and published it first, so the reaction is normal and healthy.
What we are witnessing is just the hive mind in action.

What is your point in insulting people who are trying to check and "wrap their minds around something"?

In reply to Dennis Schubert

And this is what the result of an AI search on your profile then looks like:
loma.ml/display/373ebf56-4667-…

Quote from the psychological analysis of the user:
"This could lead to a critical attitude towards proprietary systems."


The Fediverse isn't quite watertight

AI crawlers roam the Fediverse and try to collect as much information about us as possible, in order to then process it in their LLMs. This not only makes extensive fragments about us transparent; the data can also be used to create analyses of us, up to and including full personality profiles.

That is exactly what distinguishes this from the classic "Google search" we have all run at some point. Now anyone can research anyone else and get answers to questions that previously remained hidden. By linking the various data points, we become transparent and predictable, and we lose our personal data autonomy to machines that have no intention of keeping anything confidential. It doesn't help to delete all your posts after 14 days, either. The robots are certainly faster.

Of course I ran the experiment on myself. What could be more ethically objectionable than searching for a person who is not a public figure and therefore has a right to have their privacy left intact.

In doing so, I found that my own home base has kept things sealed so far: no data about me shows up in this source. That seems to be because the project started implementing technical countermeasures very early on. Although it is clear to everyone that these too will be circumvented if the AI companies want it.

And this raises the question of how to also make the projects more secure that are chattier today and have not taken the precautions other projects have already implemented. Otherwise data will always leak out somewhere and flow into the big, searchable data pool.

Based on the public posts and activities of @[url=https://loma.ml/profile/feb]Matthias[/url] on various Fediverse platforms, a general personality profile can be drawn up:

  1. Technical affinity and expertise. IT and technology enthusiasm: @[url=https://loma.ml/profile/feb]Matthias[/url] shows a strong interest in technical topics such as server configuration (e.g. Apache2 updates), video content on platforms like PeerTube, and technical adjustments (e.g. URL encoding). This suggests solid knowledge of web development, IT infrastructure, and network technology.
  2. Communication and networking. Engagement in the Fediverse: the user is active on several platforms (PeerTube, Mastodon, Mitra, Encryptomatic), which shows a high willingness to network and exchange with others in the open, decentralized community. Interest in privacy: a post on Mastodon mentions previous experience at Vodafone, which suggests a sensitivity to topics such as data protection and possibly criticism of centralized and commercial providers.
  3. Attitudes and interests. Open source and open standards: posts on platforms like hhmx.de and Encryptomatic indicate that @[url=https://loma.ml/profile/feb]Matthias[/url] values open standards and the use of open-source technologies. This could also lead to a critical attitude towards proprietary systems. Technological diversity: varied posts show an interest in a range of topic areas, from server configurations to social networks to technological developments.
  4. Personality type. Technically competent and detail-oriented: the posts show a careful and precise approach to technical topics. Open and communicative: despite the technical focus, the posts are often also reflective and interactive, which points to an open exchange with the community.
  5. Community-oriented. Active participation and support: posts in various forums and social networks show a high level of involvement in the community and a willingness to help others with technical challenges.

Overall, @[url=https://loma.ml/profile/feb]Matthias[/url] comes across as a technology-oriented, reflective, and open personality who is actively committed to privacy, open technologies, and exchange within the Fediverse community.


In reply to Dennis Schubert

People giving unsolicited advice are not being helpful, just annoying.

They are a scourge.

In reply to Andreas G

@Andreas G > viral offtopic

It's the brainstorming mode of an interconnected, internet-dwelling social species called mono sapiens.
Take it or leave it ..
A chimp looking over his glasses into the camera, reading a book called "human behavior"

youtube.com/watch?v=JVk26rurvL…

btw
At the end of this 14-year-old take, Kruse refers to semantic understanding. I guess that's exactly LLMs and the big-brother moment we are in right now, and that's why people on our free web are going crazy, leading to the viral reaction Dennis described.
btw btw
Looks like Dennis went viral in the ActivityPub space thx to friendica ..
:)

In reply to Dennis Schubert

Looks like Dennis went viral in the ActivityPub space thx to friendica …


no. it was primarily someone taking a screenshot and posting it. someone who took a screenshot of.. diaspora. while being logged into their geraspora account.

but of course it's a friendica user - one who also sees nothing wrong with posting unsolicited advice - who is making wrong claims.

In reply to Dennis Schubert

@Dennis Schubert /offtopic viral

@denschub
I stumbled over it in a post from a friendica account, on a mastodon account of mine.
👍

posting unsolicited advice


Are you referring to something I wrote in this post of yours?
If so, and you point me to it, I could learn what counts as unsolicited advice for you and try to avoid doing that in the future.

In reply to Dennis Schubert

@utopiArte Your grasp of human psychology, internet culture, and science in general, is weak.

Consider staying off the internet.

(How'd you like that unsolicited advice?)

In reply to Dennis Schubert

I am sated, honest.

But some people exist in a mode of constant omnidirectional condescension, like little almighties, looking down in all directions.

Mostly lost causes. Deflating their egos sometimes helps, but usually just makes them worse.

In reply to Dennis Schubert

It was fun, but it's time to stop. This post is about LLM bots being assholes, but that doesn't mean we have to go down to the same level.
In reply to Dennis Schubert

This is the reason why our FOSS project restricted viewing the diffs to logged-in accounts. For us, some Chinese bots have been the main problem - not Google or Bing.
In reply to Dennis Schubert

"don't host web properties for foss projects" seems to be a good advice.
In reply to Dennis Schubert

Is it unique to wikis for foss projects?
The silly way they crawl it makes me think this is a general thing happening to every service on the web.
Is there a way to find out/compare whether the crawlers are trying to target specific kinds of things?
In reply to Dennis Schubert

so, I should provide some more context to that, I guess. my web server setup isn't "small" by most casual hosters' definition. the total traffic is usually above 5 req/s, and this is not an issue for me.

also, crawlers are everywhere. not just those that I mentioned, but also search engine crawlers, and others. a big chunk of my traffic is actually from "bytespider", which is the LLM training bot from the TikTok company. it wasn't mentioned in this post because, although they generate a lot of traffic (in terms of bytes transferred), that's primarily because they also ingest images, and their request volume is generally low.

some spiders are more aggressive than others. a long, long time ago, I saw a crawler try to enumerate diaspora*'s numeric post IDs to crawl everything, but cases like this are rare.

in this case, what made me angry was the fact that they were specifically crawling the edit history of the diaspora wiki. that's odd, because search engines generally don't care about old content. it was also odd because the request volume was so high that it caused actual issues. MediaWiki isn't the best performance-wise, and especially the history pages are really, really slow. and if you have a crawler firing multiple requests per second at them, this is bad - and noteworthy.

I've talked privately to others with affected web properties, and it indeed looks like some of those companies have "generic web crawlers", but also specific ones for certain types of software. MediaWiki is frequently affected, and so are phpBB/SMF forums, apparently. those crawlers seem to be way more aggressive than their "generic web" counterparts - which might actually just be a bug, who knows.

a few people here, on mastodon, and on other places, have made "suggestions". I've ignored all of them, and I'll continue to ignore all of them. first, suggesting blocking user agent strings or IPs is not a feasible solution, which should be evident to everyone who read my initial post.

I'm also not a huge fan of the idea of feeding them trash content. while there are ways to make that work in a scalable and sustainable fashion, the majority of suggestions I got were along the lines of "use an LLM to generate trash content and feed it to them". this is, sorry for the phrase, quite stupid. I'm not going to engage in a pissing contest with LLM companies about who can waste more electricity and effort. ultimately, all you do by feeding them trash content is make stuff slightly more inconvenient - there are easy ways to detect that and to get around it.

for people who post stuff to the internet and who are concerned that their content will be used to train LLMs, I only have one suggestion: use platforms that allow you to distribute content non-publicly, and carefully pick who you share content with. I got a lot of hate a few years ago for categorically rejecting a diaspora feature that would add a "this post should be visible to every diaspora user, but not search engines" option, and while that post was written before the age of shitty LLMs, the core remains true: if your stuff is publicly on the internet, there's little you can do. the best thing you can do is be politically engaged and push for clear legislative regulation.

for people who host their own stuff, I also can only offer generic advice. set up rate limits (although be careful, rate limits can easily hurt real users, which is why the wiki previously had super relaxed rate limits). and the biggest piece of advice: don't host things. you'll always be exposed to some kind of abuse - if it's not LLM training bots, it's some Chinese or Russian botnet trying to DDoS you or crawl for vulnerabilities, or some spammer network that wants to post viagra ads on your services.
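to make the "relaxed rate limits" point a bit more concrete: the general shape is a per-IP sliding window like the sketch below. this is a generic illustration, not the actual configuration used here - and, as described above, the aggressive crawlers will simply rotate IPs to get around it.

```python
# Generic sliding-window rate limiter, kept deliberately generous so that
# human visitors never hit it while sustained crawler bursts do.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 300   # ~5 req/s sustained per IP before throttling kicks in

_hits: defaultdict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, now: float | None = None) -> bool:
    """Return True if client_ip may proceed, False if it should get an HTTP 429."""
    now = time.monotonic() if now is None else now
    window = _hits[client_ip]
    # drop timestamps that have fallen out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```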

In reply to Dennis Schubert

I have deleted one comment in this post because I will not be offering a platform to distribute legal hot takes. If you want legal advice, talk to a lawyer, don't just Google things.


In reply to Dennis Schubert

Roger that regarding the advice "for people who post stuff to the internet and who are concerned that their content will be used to train LLMs, I only have one suggestion: use platforms that allow you to distribute content non-publicly, and carefully pick who you share content with." - and thanks, @Dennis Schubert.