Spider Traps: How to Stop Harvesters from Crawling Your Site
Sometimes, your competitors will do almost anything to compete with you, including stealing your content.
To do so, they often use automated software similar to search engine crawlers, which makes the job far faster and easier than copying your website by hand. This can cause you a lot of trouble.
In this article, we will explore ways to prevent this from happening.
Online theft is rampant. I'm not talking about stealing usernames and passwords; I'm talking about stealing website content.
And the real trouble starts when it's your competitor stealing your content.
As we all know, content is king on the web. Whoever has the most content wins. So if your competitors want to grow quickly, one of the easiest ways is to use a website harvester.
A web harvester is not much different from a search engine crawler: it requests every URL it can find, then downloads all the content at those URLs.
So, how do you protect yourself from malicious crawlers?
It's really simple: you build a spider trap.
As the name implies, you create a section of your website dedicated to luring those unfriendly spiders, then trap them or ban them from your site.
What is involved in making a spider trap?
Usually some PHP code combined with a database and a URL rewriter.
The first thing you need to do is create an area of the website dedicated to catching bad robots, then use robots.txt to exclude that area from crawling.
You do this to make sure that Googlebot, Yahoo! Slurp, MSNbot and the others don't get trapped as well. Because well-behaved spiders follow the robots.txt exclusion protocol, you can politely deny them access to that location.
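As a sketch, if the trap lived in a directory called /trap/ (the directory name here is my placeholder, not something from the original setup), the robots.txt exclusion would look like this:

```
# Keep well-behaved crawlers (Googlebot, Slurp, MSNbot, ...) out of the trap
User-agent: *
Disallow: /trap/
```

Any spider that honors the Robots Exclusion Protocol will read this file first and skip the directory entirely; only crawlers that ignore robots.txt will ever reach the trap.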
There are various options from here. One of my favorite methods is to log the offender to a database or text file, then dynamically deny it access.
How does it work?
Let me give you a practical example.
I once had a client who was being harvested by many different bad spiders, many times a day. It was so bad that the rogue bots doubled his bandwidth usage.
So we designed a plan to build the trap described above. As soon as we capture a bot's user agent and IP information, we ban it from the site.
It works like this: the bad robot finds a decoy link on the website, and that link points to the trap directory.
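For illustration, a decoy link can be an ordinary anchor tag that human visitors never see but that harvesters parsing the raw HTML will still follow. The /trap/ path below is a placeholder:

```
<!-- Invisible to visitors, but harvesters that parse raw HTML will follow it -->
<a href="/trap/" style="display:none" aria-hidden="true">&nbsp;</a>
```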
Normally, an ordinary spider first checks the robots.txt file to make sure it is actually allowed to index the contents of the directory. Because the file excludes that directory, the "good" spiders never enter.
Bad spiders, however, ignore robots.txt and walk straight into the directory.
From there, a PHP script runs and captures the user agent and IP address.
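As a minimal sketch of such a capture script, assuming the trap lives at /trap/index.php and logs to a flat text file (the file and directory names are placeholders, not the author's actual code):

```php
<?php
// trap/index.php — minimal sketch; file and directory names are placeholders.

// Build one tab-separated log line from the visitor's IP and user agent.
function format_trap_entry(string $ip, string $userAgent): string
{
    return date('c') . "\t" . $ip . "\t" . $userAgent . "\n";
}

$ip = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
$ua = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';

// Append the offender to a flat text file; a database INSERT works the same way.
file_put_contents(__DIR__ . '/bad-bots.txt', format_trap_entry($ip, $ua), FILE_APPEND | LOCK_EX);
```

The LOCK_EX flag matters here: several bad bots can hit the trap at once, and without the lock two simultaneous writes could interleave in the log file.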
Another script picks up that information and immediately rewrites the .htaccess file to include the offending spider.
The server then reloads the .htaccess file, and the spider is denied access to the site.
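The rewritten .htaccess might end up with entries like these (the IP addresses are documentation examples, not real offenders, and this is Apache 2.2-style syntax; Apache 2.4 uses Require directives instead):

```
# Deny the captured offenders by IP address
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.7
```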
In another variation, the trap writes directly to a database or text file, and a small PHP script consults that list on every request, allowing or denying access accordingly.
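A sketch of that lookup, assuming a one-IP-per-line text file called banned-ips.txt (the file name and function are my placeholders):

```php
<?php
// Sketch of the allow/deny lookup; banned-ips.txt is a placeholder name.

// Return true if the visitor's IP appears in the blacklist (one IP per line).
function is_banned(string $ip, string $listFile): bool
{
    if (!is_readable($listFile)) {
        return false;
    }
    $banned = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return in_array($ip, $banned, true);
}

// Include this check at the top of every page (or via PHP's auto_prepend_file).
if (is_banned($_SERVER['REMOTE_ADDR'] ?? '', __DIR__ . '/banned-ips.txt')) {
    http_response_code(403);
    exit('Access denied.');
}
```

Unlike the .htaccess approach, this avoids rewriting a server configuration file on the fly; the trade-off is that every page request now pays for a file read.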
Remember, this is advanced stuff, and you don't want to take it lightly. Done carelessly, it can (and most likely will) get your site removed from the search indexes.
Not because you are doing something you shouldn't, but because sooner or later a well-behaved spider like Googlebot will end up on your blacklist.
So before you get into these advanced techniques, make sure you are very familiar with what they are and how they work.
A good starting point is to read this page. I don't recommend using that code as-is right now, but study what it does.
Also, search for "bot trap" and "spider trap" in a search engine to see what other options are available. Then choose the one that suits you best.
In the end, the best robot trap does exactly what it should: it stops harvesters from crawling your website while allowing legitimate search engines to index it effectively and efficiently. If you are worried about the risks of this strategy, don't automate it. Instead, use the manual method: search your server logs for unknown user agents with high activity, then ban them manually with .htaccess.
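For the manual approach, a .htaccess block like the following denies requests by user agent. The names shown are well-known offline-copier tools, listed as examples; substitute whatever user agents you actually find in your logs:

```
# Return 403 Forbidden to known harvester user agents
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP) [NC]
RewriteRule .* - [F,L]
```

The [NC] flag makes the match case-insensitive, and [F] sends the 403 response without serving any content.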