Robots.txt and SEO

What exactly is robots.txt?

Robots.txt is a file that instructs search engine spiders not to crawl certain pages or sections of a website. Most major search engines (including Google, Bing, and Yahoo) recognize and respect the instructions in a robots.txt file. The file must sit in the root directory of a domain, not in a subfolder, and it must be named exactly “robots.txt.”
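
For example, for the placeholder domain example.com, crawlers will only look for the file at this exact address:

https://example.com/robots.txt

A copy stored in a subfolder, such as https://example.com/blog/robots.txt, would simply be ignored.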

Warning: The robots.txt file is not a suitable mechanism for keeping a page out of Google. If you want to exclude a page from Google’s index, the best thing to do is to use a noindex directive or protect the page with a password!
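
For reference, here is a minimal sketch of such a noindex directive. It goes into the <head> of the page itself, and the page must remain crawlable so that Google can actually see it:

<meta name="robots" content="noindex">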

Why is Robots.txt so important?

Strictly speaking, it isn’t, because most websites don’t actually need a robots.txt file. Google can usually find and index all of the important pages on a website on its own. Also, search engines will automatically NOT index pages that are unimportant or that duplicate other pages.

That being said, there are four reasons you might want to use a robots.txt file:

  1. Block Non-Public Pages: Sometimes you have pages on your website that you don’t want crawled, such as a staging version of a page or a login page. These pages need to exist, but you don’t want people or search engine robots randomly landing on them. This is a case where you would use robots.txt to keep these pages away from search engine crawlers and bots (see the sketch after this list).
  2. Reduce server load: By excluding areas of the site that are not relevant for search engines anyway, you reduce the load that crawlers put on your server and save bandwidth as well as costs.
  3. Maximize the Crawl Budget: If not all of your subpages make it into the index, you probably have a crawl budget problem. By blocking unimportant pages with robots.txt, Googlebot can spend more of your crawl budget on the pages that actually matter.
  4. Prevent resources from being indexed: Meta-robots directives actually work better than robots.txt instructions at keeping pages out of the index. However, meta directives often cannot be used with multimedia resources such as PDFs or images. This is where robots.txt comes into play.
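
Here is a rough sketch of what such a file could look like, assuming hypothetical /staging/ and /login/ sections and PDF downloads; swap in the paths that actually exist on your website:

# hypothetical paths, replace with your own
user-agent: *
disallow: /staging/
disallow: /login/
disallow: /*.pdf

Note that wildcard patterns such as /*.pdf are understood by major crawlers like Googlebot and Bingbot, but they are not part of the original robots.txt standard, so very old or obscure bots may ignore them.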

So: The robots.txt file tells search engine spiders not to crawl certain pages on your website.

Best practice: Create a Robots.txt file

Your first step is to actually create your robots.txt file. Since it is a text file, you can easily create it with a text editor such as Windows Notepad.

And no matter how you ultimately create your robots.txt file, the format is exactly the same:

user-agent: X
disallow: Y

The user agent is the specific bot that you want to address.

And whatever comes after “disallow” specifies the pages or sections that you want to block.

Here is an example:

user-agent: Googlebot
disallow: /images

This rule would tell Googlebot not to crawl your website’s images folder.

You can also use an asterisk (*) to address any bot that stops by your website.

Here is an example:

user-agent: *
disallow: /images

The “*” tells all spiders NOT to crawl the images folder.
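
Two more patterns are worth knowing; this is a minimal sketch, not tied to any particular site. A single slash after “disallow” blocks the entire website for the bots addressed:

user-agent: *
disallow: /

Leaving the “disallow” value empty, on the other hand, blocks nothing and lets every crawler access everything:

user-agent: *
disallow: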

This is just one of many ways to use a robots.txt file. This helpful guide from Google has more information on the different rules you can use to block or allow bots to crawl different pages of your website.

With the robots.txt tester, you can create or edit the robots.txt file for your website. It can also be used to check the syntax and to see what impact the rules have on your website.