A Twisty Maze of Ill-Behaved Bots

Like many others, I’ve been having significant problems with bot traffic on my hosted server recently. I’ve noticed a dramatic increase in bots that do not respect robots.txt, especially the crawl-delay I have set there. Not only that, but many of them send user-agent strings that precisely match what desktop browsers send. That is, they don’t identify themselves as bots.

They posed a particular problem on two sites: my blog, and the lists.complete.org archives.

The list archives are completely static, but they span many pages, so the ill-behaved bots absolutely hammer them by following links.

My blog runs WordPress. It has fewer pages, but because each one is generated by PHP, it takes fewer hits to start bogging down. There is also the Mastodon thundering herd problem: when a post is shared there, every instance that sees it fetches a link preview at roughly the same moment. Since I participate on Mastodon, this hits my server.

The solution was one of layers.

I had already added a crawl-delay line to robots.txt. It helped a bit, but many bots these days aren’t well-behaved. Next, I added WP Super Cache to my WordPress installation. I also enabled APCu in PHP and installed APCu Manager. Again, each step helped. Again, not quite enough.
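For reference, a crawl-delay rule looks something like the sketch below. The delay value shown is just an illustration, not necessarily what I run, and only well-behaved crawlers will honor it at all:

    User-agent: *
    Crawl-delay: 10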

Finally, I added Anubis. Installing it (especially if using the Docker container) was under-documented, but I figured it out. By default, it is designed to block AI bots and to present a JavaScript challenge to everything with “Mozilla” in its user-agent (which is most things).

That’s not quite what I want. If a bot is well-behaved, AI or otherwise, it will respect my robots.txt, and I can control it more precisely there. Also, I intentionally support non-JavaScript browsers on many of the sites I host, so I wanted to be judicious. Eventually I configured Anubis to challenge only requests whose user-agent looks fully like a real browser’s. In other words, real browsers should pass right through, and bad bots pretending to be real browsers will fail.
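For the curious, a rule along those lines might look roughly like this, assuming Anubis’s botPolicies.json policy format; the rule names and the regex are illustrative only, not my exact configuration:

    {
      "bots": [
        {
          "name": "browser-like-ua",
          "user_agent_regex": "Mozilla/5\\.0 \\(.+\\) .*(Chrome|Firefox|Safari)",
          "action": "CHALLENGE"
        },
        {
          "name": "everything-else",
          "user_agent_regex": ".*",
          "action": "ALLOW"
        }
      ]
    }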

That was quite effective. It reduced load further to the point where things are ordinarily fairly snappy.

I had previously been using mod_security to block some bots, but it seemed to be getting in the way of the Fediverse at times. When I disabled it, I observed another increase in speed. Anubis was likely going to get rid of those annoying bots itself anyhow.
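For anyone wanting to do the same, one way to switch ModSecurity off under Apache (assuming mod_security2) is a single directive; whether to put it globally or in a particular virtual host depends on your setup:

    <IfModule security2_module>
        SecRuleEngine Off
    </IfModule>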

As a final step, I migrated to a faster hosting option. This post will show me how well it survives the Mastodon thundering herd!

Update: Yes, it handled it quite nicely.

9 thoughts on “A Twisty Maze of Ill-Behaved Bots”

  1. PHP can be very efficient, but it depends on how the web server handles it. Classic mod_php is indeed slow, because PHP gets loaded for every request, even requests that involve no PHP code at all.

    PHP-FPM over a socket, with Apache or Nginx in front, will be much faster.

    Sadly, robots.txt is overused right now. It should point crawlers away from irrelevant or fast-changing parts of a site, but when it is applied to a *whole* website, no one will care.

    I think the worst crawlers are Microsoft’s AI crawlers now. A crawl takes several days, and that amount of traffic is causing temporary downtime on hosting services. Sites hosted on a VPS are fine.
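For readers unfamiliar with the setup described in the comment above, a minimal sketch of routing PHP to FPM under Apache via mod_proxy_fcgi; the socket path shown is a typical Debian-style default and will differ per system:

    <FilesMatch "\.php$">
        SetHandler "proxy:unix:/run/php/php8.2-fpm.sock|fcgi://localhost"
    </FilesMatch>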
