Identify and block fake/bad bots in PHP, including fake Google/Bing/Yahoo bots

Edited by Anonymous, Neo Mann, Charmed, Lynn and 9 others

I just launched a new website that nobody should know about except me. To my surprise, GoogleBot apparently showed up and started spidering, even though I have a robots.txt exclusion of all robots in place! Upon closer inspection I discovered this was not GoogleBot at all, but an impostor! So I am writing this HowTo, with accompanying code, to block bad bots efficiently. I have surveyed the PHP-based bot-blocking code available, and there are a lot of what I would call incomplete solutions out there. There are honeypot methods, but they don't address bots that obey robots.txt yet are really just scrapers. What is worse, many disguise themselves as GoogleBot or other popular bots that we do want scraping our site. (Well, we call it indexing when we like them!) First I will lay out my logic, then develop a simple, easy-to-use PHP script, which I will call badbotkiller.php, that you can include in your PHP programs to block bad bots once and for all.

Identify bad bots

The first thing we need to consider is the various cases, so that we can develop a comprehensive bad bot blocker.

  1. Bot claims to be GoogleBot or another known bot, but is not.
    We will use the official Google method to identify GoogleBot (a reverse DNS lookup followed by a forward lookup), BUT we will extend this method to the other popular spiders for Bing, Yahoo, etc.

  2. Bot does not claim to be a bot but instead impersonates a visitor to our site.
    To identify this type of bot we have 3 key markers.
    1. Real visitors do NOT visit robots.txt (unless they are the snoopy type, which we don't want anyway).
    2. Real visitors do NOT go to hundreds of pages. We can catch these bots by identifying IPs that do not visit robots.txt yet visit X number of pages within a given time frame.
    3. Real visitors do NOT visit invisible links. We can set up a honeypot to catch these bots (which is pretty much the most popular technique I have seen around the net). It's good, but far from a complete solution.
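
Marker 2 can be sketched as a sliding-window counter. This is only a minimal sketch, not the finished badbotkiller.php: the function name tooManyPages, the default limit of 100 pages, and the 600-second window are assumptions chosen for illustration, and the caller is assumed to store each IP's request timestamps somewhere (a file, a database, a cache).

```php
<?php
// Hypothetical helper: decide whether one IP requested too many pages
// within a time window. $hits is a list of Unix timestamps (seconds)
// recorded for that IP; $limit and $window are illustrative defaults.
function tooManyPages(array $hits, int $now, int $limit = 100, int $window = 600): bool
{
    // Keep only the requests that fall inside the window ending at $now.
    $recent = array_filter($hits, function ($t) use ($now, $window) {
        return $t > $now - $window && $t <= $now;
    });
    return count($recent) > $limit;
}

// Usage: 4 hits in the last 60 seconds against a limit of 3.
$hits = [1000, 1001, 1002, 1003];
var_dump(tooManyPages($hits, 1003, 3, 60)); // prints bool(true)
```

An IP that never fetched robots.txt but trips this check is a candidate for blocking. The right limit depends on the site, so treat the numbers as placeholders.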

Logic to block bad bots

Does this visitor claim to be a bot? How can we tell? A bot claims to be a bot by first visiting robots.txt. Also, if the user agent string has the word "bot" in it, then it is declaring itself to be a bot. We used to be able to test whether a visitor is a bot by checking whether it executes JavaScript, but today bots do all kinds of things they couldn't before. Really we only need to consider whether it accesses robots.txt and whether its user agent string declares it to be a bot.
If yes, then we test whether it is the bot it claims to be, using the official Google method mentioned earlier. As I said before, we do this test with other bots as well, because any self-respecting bot will comply with reverse IP best practices. First test whether the IP has a reverse (in-addr.arpa) lookup. If it does, check that the resulting hostname maps back to the original IP. Here is an example test done for Bing/MSN. The first IP has no reverse record at all, while a genuine msnbot hostname resolves forward to its own IP:

root: host 42.96.164.179
Host 179.164.96.42.in-addr.arpa. not found: 3(NXDOMAIN)
root: host msnbot-157-55-33-84.search.msn.com.
msnbot-157-55-33-84.search.msn.com has address 157.55.33.84

If no, then we need to test whether it is in fact a bot disguised as a website visitor. In this case we catch bots that pretend to be website visitors by testing:
  • Did this IP try to access robots.txt?
  • Did this IP try to access too many pages?
  • Did this bot fall for our honeypot?
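
Put together, the decision could look like the sketch below. It is only an outline under assumptions: the function name classifyVisitor and its boolean inputs are hypothetical, and each flag would have to come from one of the tests described in this article (the reverse DNS check, the robots.txt tracker, a page counter, and a honeypot).

```php
<?php
// Hypothetical outline of the decision logic. Every argument is a
// result the caller has already computed:
//   $agent          the User-Agent string
//   $visitedRobots  true if this IP fetched robots.txt
//   $dnsVerified    true if the IP passed the reverse/forward DNS test
//   $tooManyPages   true if the IP requested too many pages
//   $hitHoneypot    true if the IP followed an invisible link
function classifyVisitor(string $agent, bool $visitedRobots, bool $dnsVerified,
                         bool $tooManyPages, bool $hitHoneypot): string
{
    // A visitor declares itself a bot by fetching robots.txt or by
    // having "bot" in its agent string.
    $claimsBot = $visitedRobots || stripos($agent, 'bot') !== false;
    if ($claimsBot) {
        // A declared bot must be who it says it is.
        return $dnsVerified ? 'good bot' : 'bad bot';
    }
    // No claim: look for a bot disguised as a visitor.
    // (robots.txt access is already covered by $claimsBot above.)
    if ($tooManyPages || $hitHoneypot) {
        return 'bad bot';
    }
    return 'visitor';
}
```

Returning a label rather than blocking directly keeps the policy (block, throttle, or just log) in the hands of the calling script.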

Code to test logs for bad bots faking their identity

To get started, let's write a little code to test IP addresses from our logs.

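<?php

// Returns true when $ip passes a reverse lookup followed by a forward lookup
// AND the resulting hostname matches the bot named in $agent.
function testbotip($ip,$agent) {
        $hostname = gethostbyaddr($ip);
        // gethostbyaddr() returns the unmodified IP when there is no reverse record, or false on malformed input
        if ($hostname===false || $hostname==$ip) return false;
        $rips = gethostbynamel($hostname); // we use the list version because there could be multiple A records.
        print "Ip:$ip\tHostname: $hostname = $agent\t\t";
        if ($rips!==false && in_array($ip,$rips)) {
                //host is not faked so now let's see if it is who it says it is via agent
                //note: the case-insensitive modifier is a lowercase i; an uppercase I is rejected by PHP
                if(preg_match('/bing|msnbot/i',$agent)&&(preg_match('/\.msn\.com$/i',$hostname))) return true;
                //Google's crawlers resolve to googlebot.com or google.com hostnames
                if(preg_match('/google/i',$agent)&&(preg_match('/\.(googlebot|google)\.com$/i',$hostname))) return true;
                //Yahoo crawler hosts typically resolve under crawl.yahoo.net, so accept .net as well
                if(preg_match('/yahoo/i',$agent)&&(preg_match('/\.yahoo\.(com|net)$/i',$hostname))) return true;
                if(preg_match('/twittervir/i',$agent)&&(preg_match('/\.twttr\.com$/i',$hostname))) return true;
                //ok done standard ones we know ... now we need to try generic test.
                //good bots will give a domain where they can be looked up.  This should match their reverse ip domain.
                if (!preg_match('/([\w]+\.[\w]+)($|\.uk$)/',strtolower($hostname),$matches)) return false;
                $dom1 = $matches[0];
                //case-insensitive: is the hostname's domain mentioned in the agent string?
                return stripos($agent, $dom1) !== false;
        }
        return false;
}

// Usage: php thisfile.php testlog /path/to/access.log  (expects the combined log format)
if (isset($argv[1],$argv[2]) && $argv[1]=="testlog") {
        $fh = fopen($argv[2], 'r');
        if ($fh===false) exit("Cannot open log file {$argv[2]}\n");
        while (($line = fgetcsv($fh,4096," ")) !== false) {
                if (!isset($line[9])) continue; //skip blank or malformed lines
                $ip=$line[0];
                $req=$line[5];
                $agent=$line[9];
                if (preg_match('/bot/i',$agent)) {
                        if (testbotip($ip,$agent)) {
                                print "Passed\n";
                        } else {
                                print "FAILED\n";
                        }
                }
        }
        fclose($fh);
}

?>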

I recommend you run this code on a current log and check to see what bots it will be disallowing. Please feel free to edit this wiki to include more good agents/bots.

Code to track all bots accessing robots.txt

One way to get this information is to mine the access logs, but not all accounts have access to log files, and the format varies. It is much easier to use an .htaccess mod_rewrite rule, which is widely supported, to capture this info. Here is the code to capture the IP addresses of anyone who accesses the robots.txt file. We will consider these IPs to be robots. Now we can monitor their behavior and see whether they behave or need to be blocked. Of course, as outlined above, this will be only one of many tests.

Add the following lines to your .htaccess file.

RewriteEngine on
RewriteRule ^robots\.txt$ /robots.php

robots.php

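<?php
        //if your server is behind a proxy use $_SERVER['HTTP_X_FORWARDED_FOR'] instead
        //(only trust that header when your own proxy sets it, since clients can forge it)
        $ip=$_SERVER['REMOTE_ADDR'];
        $robotips = (string) @file_get_contents("robotips.txt"); //empty string on the first run
        $robotips = str_replace("|$ip|","",$robotips); //drop any older entry for this ip
        $robotips = substr("|$ip|$robotips",0,1000); //only keep 1k worth of spider ips, newest first
        file_put_contents("robotips.txt",$robotips,LOCK_EX); //lock so simultaneous requests don't corrupt the file
        header("Content-Type: text/plain"); //PHP would otherwise send text/html
        print file_get_contents("robots.txt");

?>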

Both of these files should be in your site's public_html folder. If everything is working properly, when you browse to robots.txt you should see your regular robots.txt file, and your IP address should be written to a text file called robotips.txt.
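
To use the tracker later, another script needs to ask whether a given IP is in the list. Here is a minimal sketch, assuming the |ip| format that robots.php writes; the function name ipVisitedRobots is hypothetical.

```php
<?php
// Hypothetical helper: was $ip recorded by robots.php?
// $robotips is the raw contents of robotips.txt, e.g. "|1.2.3.4||5.6.7.8|".
function ipVisitedRobots(string $ip, string $robotips): bool
{
    // The surrounding pipes stop 2.3.4 from matching inside 1.2.3.4.
    return strpos($robotips, "|$ip|") !== false;
}

// Usage from another script in the same folder:
// $list = (string) @file_get_contents("robotips.txt");
// if (ipVisitedRobots($_SERVER['REMOTE_ADDR'], $list)) { /* treat as a robot */ }
```

Because robots.php keeps only about 1k of addresses, an IP that visited robots.txt long ago may have dropped off the list, so a miss here is weaker evidence than a hit.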

Tips Tricks & Warnings

  • I propose all bots have a new standard that their reverse IP hostname should contain the exact same URL as the agent the bot reports. This way we can have an iron clad rule that will allow us to identify bad bots without writing override type rules that may go bad with time (i.e., MSNbot reports itself as bing - not good).
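
Under the proposed rule, the check would reduce to a single comparison. The sketch below is an assumption about how that might look, not an existing standard: the function name agentMatchesHost is hypothetical, and it simply takes the last two labels of each host name as the domain, which is too naive for suffixes such as .co.uk.

```php
<?php
// Hypothetical check for the proposed rule: the domain in the URL that
// the agent string advertises must equal the domain of the reverse
// DNS hostname.
function agentMatchesHost(string $agent, string $hostname): bool
{
    // Pull the first URL host out of the agent string,
    // e.g. "+http://www.bing.com/bingbot.htm" gives "www.bing.com".
    if (!preg_match('~https?://([^/\s;)]+)~i', $agent, $m)) {
        return false; // the agent advertises no URL
    }
    // Naive "last two labels" domain extraction (an assumption, see above).
    $domain = function (string $host): string {
        $labels = explode('.', strtolower(rtrim($host, '.')));
        return implode('.', array_slice($labels, -2));
    };
    return $domain($m[1]) === $domain($hostname);
}

// The Bing crawler would fail the proposed rule: it advertises
// bing.com but its hosts resolve under msn.com.
$agent = 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)';
var_dump(agentMatchesHost($agent, 'msnbot-157-55-33-84.search.msn.com')); // prints bool(false)
```

This is exactly the kind of mismatch that forces the override rules in the test code today.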

Questions and Answers

I need to block all bots sending fake traffic to my site any advice?

The best solution is to use JavaScript to check mouse movements and find atypical behaviour. Start by logging the mouse movements of known bots and see if you can find some type of signature behaviour.

I have built a similar script and was just comparing notes?

I have a site with over 700,000 contact forms and have found a couple more methods that help.

Sure. Contact me on my VisiHow board.

Daniel is a featured author with VisiHow. Daniel has achieved the level of "Lieutenant" with 24,290 points. Daniel has started 69 articles (including this one) and has also made 2,601 article edits. 17,578 people have read Daniel's article contributions.
Article Info

Categories : Ultimate Guide To Build & Promote A Website

Recent edits by: bunty, Dougie-1, call5414790082
