A Twisty Maze of Ill-Behaved Bots

Like many, bot traffic has been causing significant issues for my hosted server recently. I’ve been noticing a dramatic increase in bots that do not respect robots.txt, especially the crawl-delay I have set there. Not only that, but many of them are sending user-agent strings that are quite precisely matching what desktop browsers send. That is, they don’t identify themselves.

They posed a particular problem on two sites: my blog, and the lists.complete.org archives.

The list archives is a completely static site, but it has many pages, so the bots that are ill-behaved absolutely hammer it following links.

My blog runs WordPress. It has fewer pages, but by using PHP, doesn’t need as many hits to start to bog down. Also, there is a Mastodon thundering herd problem, and since I participate on Mastodon, this hits my server.

The solution was one of layers.

I had already added a crawl-delay line to robots.txt. It helped a bit, but many bots these days aren’t well-behaved. Next, I added WP Super Cache to my WordPress installation. Again, this helped. Again, not quite enough.

Finally, I added Anubis. Installing it (especially if using the Docker container) was under-documented, but I figured it out. By default, it is designed to block AI bots and try to challenge everything with “Mozilla” in its user-agent (which is most things) with a Javascript challenge.

That’s not quite what I want. If a bot is well-behaved, AI or otherwise, it will respect my robots.txt and I can more precisely control it there. Also, I intentionally support non-Javascript browsers on many of the sites I host, so I wanted to be judicious. Eventually I configured Anubis to only challenge things that present a user-agent that looks fully like a real browser. In other words, real browsers should pass right through, and bad bots pretending to be real browsers will fail.

That was quite effective. It reduced load further to the point where things are ordinarily fairly snappy.

As a final step, I migrated to a faster hosting option. This post will show me how well it survives the Mastodon thundering herd!

Reposts

Likes

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

To respond on your own website, enter the URL of your response which should contain a link to this post's permalink URL. Your response will then appear (possibly after moderation) on this page. Want to update or remove your response? Update or delete your post and re-enter your post's URL again. (Find out more about Webmentions.)