Bad bots will always ignore your robots.txt settings and skip the crawl-delay value, but bots created by spammers are not the only ones that do this. SEO and data-collection bots owned and run by well-known companies that provide SEO services do the same thing, and they can cause high CPU usage, slow down server responses, and waste bandwidth.
Enabling Cloudflare's Super Bot Fight Mode firewall feature will not stop every well-known bot that behaves badly.
Catch Bad Bots Yourself
Besides the firewall rules or WAF you may apply, the more practical solution will be to catch the bad bots yourself and then take action.
The simple, professional approach is to go to the log files of the web server. I’m using Nginx as a reverse proxy, but the same technique works for Apache, Varnish, etc.
# cat access.log | awk -F\" '{print $6}' | sort | uniq -c | sort -n
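The command above uses cat to read the log and pipes the output to awk, which splits each line on double quotes and prints the sixth field, the User-Agent column in the combined log format. sort and uniq -c then group identical values and count them, and the final sort -n orders the result by hit count. The result looks similar to the following output: each bot's user agent next to its number of hits.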
758 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36 X-Middleton/1
762 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
869 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 Edg/97.0.1072.62 X-Middleton/1
872 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0 X-Middleton/1
924 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0
955 Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-TW) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10 X-Middleton/1
958 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36 X-Middleton/1
1010 Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0 X-Middleton/1
1060 Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.7) Gecko/2009030719 Firefox/3.0.3 X-Middleton/1
1193 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36 X-Middleton/1
1202 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) X-Middleton/1
1226 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
1310 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36
1326 Mozilla/5.0 (X11; Linux x86_64; rv:29.0) Gecko/20100101 Firefox/29.0
1335 Mediapartners-Google
1336 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
1336 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
1357 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0
1404 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36
1446 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) X-Middleton/1
1568 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
1657 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
1734 Mozilla/5.0 (compatible; Uptime/1.0; http://uptime.com)
1785 Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)
1828 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 X-Middleton/1
1865 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36 X-Middleton/1
2033 -
2288 PHP/5.5
2474 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
3696 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36
4368 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36 X-Middleton/1
4834 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
5007 Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)
6493 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
6774 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
6868 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0) X-Middleton/1
7576 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
8461 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
13841 Photon/1.0
16427 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
44424 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
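The same count can be done in a script, which helps if you want to filter or post-process the result. The following is a minimal Python sketch, assuming the log uses the combined format so that the User-Agent is the sixth quote-delimited field, as in the awk command. The function name and the sample lines (IP addresses, timestamps, paths) are made up for illustration.

```python
from collections import Counter

def count_user_agents(lines):
    """Count hits per User-Agent in combined-format access log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split('"')
        # Splitting on double quotes puts the User-Agent at index 5,
        # the same field that awk exposes as $6.
        if len(fields) > 5:
            counts[fields[5]] += 1
    return counts

# Hypothetical sample lines; in practice, pass open("access.log") instead.
sample = [
    '203.0.113.5 - - [10/Jan/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '203.0.113.5 - - [10/Jan/2022:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '198.51.100.7 - - [10/Jan/2022:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Photon/1.0"',
]

counts = count_user_agents(sample)
# Ascending by hit count, like the final `sort -n`.
for agent, hits in sorted(counts.items(), key=lambda item: item[1]):
    print(hits, agent)
```

Lines that do not contain enough quoted fields are skipped rather than counted, so a malformed entry does not break the run.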
Bad Bots and High CPU Usage
As the output above shows, many bots hit your website's pages to collect data. When you run a CMS such as WordPress or Drupal, or any dynamic web application with a significant number of pages, these crawling hits will cause high CPU load and can even take your web server and database down.
If you are using Cloudflare, paid or free, you can go to the Firewall Bots section. But setting a challenge for well-known bad bots will not prevent other bots from consuming your server's CPU and causing a high load, which has the same effect as a DDoS.
At the same time, the Static resource protection option cannot do the job alone, and it has harmful side effects: it may block good bots.
Take Action
Unfortunately, bots like AhrefsBot (from the Ahrefs SEO and marketing tools), SemrushBot (from the SEMrush web data collection and marketing tool), and DotBot (from Moz.com) all behave like bad bots. As mentioned, they can cause the same effect as a DDoS, and they are not smart enough to measure their own hit rates and adapt to what your server can handle.
Anyway, our action on Cloudflare is to create a firewall rule, manually add the following expression with the JS Challenge action, and then deploy the rule.
(lower(http.user_agent) contains "petalbot") or (lower(http.user_agent) contains "ahrefs") or (lower(http.user_agent) contains "mj12bot") or (lower(http.user_agent) contains "aspiegelbot") or (lower(http.user_agent) contains "dotbot") or (lower(http.user_agent) contains "80legs")
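Before deploying the rule, it can be useful to check which user agents from your own log it would catch. The following is a rough Python sketch that mimics the expression's logic (lowercase the user agent, then test for each substring). It is an approximation for local testing, not Cloudflare's own evaluator, and the function and constant names are hypothetical.

```python
# Tokens copied from the firewall expression above.
BLOCKED_TOKENS = ("petalbot", "ahrefs", "mj12bot", "aspiegelbot", "dotbot", "80legs")

def would_challenge(user_agent):
    """Mirror `lower(http.user_agent) contains "<token>"` joined with `or`."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)

print(would_challenge("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))
print(would_challenge("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(would_challenge("Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"))
```

Note that the SemrushBot user agent is not matched, because the expression has no clause for it. If you want to challenge it as well, add another contains clause for "semrush".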
If you are using Apache as a web server, you can use .htaccess to prevent bots from hitting your website, as described in the "Apache way to prevent bad bots from stealing your bandwidth" tutorial.
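RewriteEngine on
# SetEnvIfNoCase matches case-insensitively, so one SemrushBot line is enough.
# No ^ anchor: as the log output shows, these names sit in the middle of the User-Agent string.
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
SetEnvIfNoCase User-Agent "DotBot" bad_bot
SetEnvIfNoCase User-Agent "Baidu" bad_bot
SetEnvIfNoCase User-Agent "Petalbot" bad_bot
SetEnvIfNoCase User-Agent "80legs" bad_bot
# Apache 2.2 access-control syntax; on Apache 2.4 it requires mod_access_compat.
Order Deny,Allow
Deny from env=bad_bot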
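Whether a pattern starts with ^ matters here, because crawler names usually sit in the middle of the User-Agent string rather than at the start. The following minimal Python sketch checks an anchored and an unanchored pattern against a user agent taken from the log output. It uses Python's re module as a stand-in for Apache's regex engine, which is an assumption that should hold for simple patterns like these.

```python
import re

ua = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

# re.IGNORECASE stands in for the case-insensitivity of SetEnvIfNoCase.
anchored = re.search(r"^AhrefsBot", ua, re.IGNORECASE)
unanchored = re.search(r"AhrefsBot", ua, re.IGNORECASE)

print(anchored is not None)    # False: the name is not at the start of the string
print(unanchored is not None)  # True: the name appears inside the string
```

Running a check like this against a few real entries from your log is a cheap way to confirm that a rule will actually set bad_bot before you rely on it.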
For Nginx, we can use an if condition in the server section, as follows:
# ~* makes the regex match case-insensitive
if ($http_user_agent ~* (AhrefsBot|SemrushBot|DotBot|Baidu|Petalbot|80legs)) {
    return 403;
}
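To get a feel for how much traffic such a condition would reject, you can replay the per-agent counts against the same alternation. The sketch below is a Python approximation, assuming re.IGNORECASE behaves like Nginx's ~* for this simple pattern. The rows are a small subset of the log output shown earlier, and the function name is hypothetical.

```python
import re

# Same alternation as the Nginx condition; re.IGNORECASE mirrors ~*.
BAD_BOTS = re.compile(r"(AhrefsBot|SemrushBot|DotBot|Baidu|Petalbot|80legs)", re.IGNORECASE)

# A few "hits user-agent" rows taken from the log output earlier in the post.
rows = [
    "762 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
    "6493 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)",
    "6774 Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
    "16427 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

def blocked_hits(rows):
    """Sum the hit counts of rows whose user agent matches BAD_BOTS."""
    total = 0
    for row in rows:
        hits, agent = row.split(" ", 1)
        if BAD_BOTS.search(agent):
            total += int(hits)
    return total

print(blocked_hits(rows))  # 7255 (762 + 6493)
```

MJ12bot is not in the alternation, so its hits are not counted here. Add its name to the list if you want to block it too.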
To block by IP address with iptables instead, see the "100 Percent CPU Utilization!? Know-How To Catch the Bad Boy’s IP Address" tutorial.