
Block Search Engines Using robots.txt


Search engine bots (also known as crawlers or spiders) are programs that automatically browse websites. They crawl a site’s content, which is then used for search or other purposes. At first, this may seem like a good thing, but not all of these programs are benign: some harvest email addresses, scrape content, or perform other unwanted actions. They can also put a heavy load on your site when it is visited by many different bots at once, or by a single bot that generates many requests in a short time. However, as the site’s owner, you can create a robots.txt file that provides instructions (the Robots Exclusion Protocol) specifically for these visitors.

Where can I find my robots.txt file?

You can find the robots.txt file in the public_html folder of your site. Log in to your cPanel and open File Manager.

Open the public_html folder.

Check whether a robots.txt file already exists. If it does, right-click the file and select Edit.

If the file does not exist, you can create it by clicking + File in the upper-left corner of File Manager.

Creating rules in a robots.txt file

The file usually contains one or more groups of instructions, each consisting of the name of a robot (User-agent) and the rules that apply to it. The most common rules are described below.

All robots have access to all parts of the site (the same effect as an empty or missing robots.txt file):

User-agent: *
Disallow:

One robot is prohibited from indexing the site, while all others are allowed. Note that when you enter multiple groups of instructions, they are separated by a blank line:

User-agent: robot_name
Disallow: /

User-agent: *
Disallow:

Requests from one particular robot are slowed down to one request every 10 seconds (note that not all crawlers honor the Crawl-delay directive; Googlebot, for example, ignores it):

User-agent: robot_name
Crawl-delay: 10

All robots are prohibited from accessing two specific directories:

User-agent: *
Disallow: /temp/
Disallow: /include/

All robots are prohibited from accessing a single file:

User-agent: *
Disallow: /folder/file.html
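Before publishing your rules, you can check how they will be interpreted using Python’s standard urllib.robotparser module. A minimal sketch, using a made-up robot name (examplebot) and the example rules from above:

```python
# Minimal sketch: verify robots.txt rules with Python's standard library.
# "examplebot" is a made-up robot name used only for this illustration.
import urllib.robotparser

rules = """
User-agent: examplebot
Crawl-delay: 10

User-agent: *
Disallow: /temp/
Disallow: /include/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("SomeBot", "/temp/page.html"))  # blocked directory
print(rp.can_fetch("SomeBot", "/index.html"))      # allowed path
print(rp.crawl_delay("examplebot"))                # delay for examplebot
```

This only tests your rules locally; well-behaved crawlers will read the live file at yourdomain.com/robots.txt.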

Putting some of these rules together, a complete robots.txt file can look like this:

User-agent: robot_name
Crawl-delay: 10

User-agent: *
Disallow: /temp/
Disallow: /include/

How do I evaluate which robots are crawling my site?

This can be done by checking the Apache log of your server. You can find the Apache log in your cPanel File Manager (/var/home/user/log).

If you need the latest entries, you can check the Raw Access section.

Unzip the downloaded file on your PC and open it with a text editor. Use the editor’s search function (Ctrl + F) and search for ‘bot’. The results will show you which robots have crawled your site.

Entries from Googlebot, for example, will contain ‘Googlebot’ in the user-agent string.
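This check can also be scripted. A short Python sketch that counts hits per bot user agent in an Apache combined-format log (the log lines below are fabricated samples for illustration only):

```python
# Sketch: count hits per crawler in an Apache combined-format access log.
# The log lines below are fabricated samples for illustration only.
import re
from collections import Counter

sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /about HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2023:14:02:01 +0000] "GET / HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0"',
]

def bot_hits(lines):
    """Return a Counter of user-agent strings that contain 'bot'."""
    counts = Counter()
    for line in lines:
        quoted = re.findall(r'"([^"]*)"', line)  # request, referer, user agent
        if quoted and "bot" in quoted[-1].lower():
            counts[quoted[-1]] += 1
    return counts

for agent, hits in bot_hits(sample_log).items():
    print(agent, hits)
```

Point the function at the lines of your own unzipped access log to see which crawlers visit your site most often.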

Additional information

Some search engines, such as Google, also offer an option to manage the crawl rate, meaning you can limit Googlebot’s crawl rate directly via Google. You can find more information on the page below:

https://support.google.com/webmasters/answer/48620?hl=en

Information about Bing crawl control can be found here:

https://www.bing.com/webmaster/help/crawl-control-55a30302

Check this page to see a list of malicious bots:

http://www.botreports.com/badbots/index.shtml

Conclusion

As mentioned at the beginning, not all requests may be useful for your site. There are scrapers, spambots, email harvesters, and other bots that you don’t want on your site. In other cases, the bots themselves might be doing something useful, but your plan’s resources are too limited to handle many requests at once.

So, in order to optimize your site’s performance, it is worth evaluating which crawlers will be useful for your site and which will not.
