Robots.txt is a text file that stores instructions for web crawlers. These instructions determine whether a bot may crawl, and as a result usually index, a website or parts of it. Individual directories or entire domains can be excluded from crawling.
Webmasters use the Robots.txt file to control the behavior of crawlers. The file is located in the root directory of a website. In addition to instructions about which paths may or may not be crawled, the file can tell crawlers about the URL structure of a site by referencing an XML sitemap.
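As a rough illustration of the two points above (root location and sitemap reference), the following Python sketch derives the Robots.txt address from any page URL and reads the sitemap entry with the standard library's urllib.robotparser. The domain www.example.com, the file content in SAMPLE and the helper name robots_url are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # The file sits at the root of scheme + host, regardless of the page path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical file content; www.example.com is a placeholder domain.
SAMPLE = """\
User-agent: *
Disallow: /internal/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

location = robots_url("https://www.example.com/shop/item?id=7")
sitemaps = parser.site_maps()  # Python 3.8+; None when no Sitemap line exists

print(location)  # https://www.example.com/robots.txt
print(sitemaps)  # ['https://www.example.com/sitemap.xml']
```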
Robots.txt – REP
The REP (Robots Exclusion Protocol, also known as the Robots Exclusion Standard) specifies that crawlers should look for and read the Robots.txt file before crawling a site. The file must be named exactly robots.txt, in lowercase, and stored in the root directory. However, creating this file does not guarantee that crawlers will stay away from a page, because not all bots follow the instructions.
A page that is nevertheless included in the index can still appear in the search engine results list, but without any descriptive text. The major search engines, such as Google and Bing, follow the instructions in the Robots.txt file.
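The behavior expected of a compliant crawler can be sketched in Python: read the rules first, then request only the URLs they permit. This is a minimal sketch, not how any particular search engine works; the bot name ExampleBot, the URLs and the function name allowed_urls are hypothetical, and the rules are passed in as text so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the given rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and crawl queue.
RULES = "User-agent: *\nDisallow: /private/\n"
QUEUE = [
    "https://www.example.com/",
    "https://www.example.com/private/report.html",
]

to_fetch = allowed_urls(RULES, "ExampleBot", QUEUE)
print(to_fetch)  # ['https://www.example.com/']
```

Nothing enforces this check on the server side, which is why a bot that ignores the protocol can still request the excluded page.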
Creation and control
The file can be created using any text editor, and crawlers read it exactly as written. Some tools can also generate the Robots.txt file. When creating the file, you first specify which user agents the instructions are intended for. To exclude content from crawling, a second part beginning with the term “Disallow” is added.
Before uploading the Robots.txt file, check it for errors. Even a small spelling mistake can prevent a directive from being followed, so that pages are either left out of the index or indexed against your intent, depending on the error. In the Google Search Console, you can check whether the file works properly.
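A simple pre-upload check for spelling mistakes could look like the following Python sketch. It is only an illustration and not a replacement for the Search Console check: the function name lint_robots and the set of accepted field names are assumptions (crawl-delay, for instance, is a widely seen but non-standard field).

```python
# Field names this sketch accepts; an assumption, not an official list.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[str]:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif field in {"allow", "disallow"} and not seen_user_agent:
            problems.append(f"line {number}: rule appears before any User-agent line")
    return problems


for problem in lint_robots("User-agent: *\nDisalow: /tmp/\nDisallow /x/\n"):
    print(problem)
# line 2: unknown field 'disalow'
# line 3: missing ':' separator
```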
A Robots.txt file that allows crawling looks like this:
User-agent: Googlebot
Disallow:
To prevent crawling, a forward slash is added:
User-agent: Googlebot
Disallow: /
These examples apply specifically to Google’s crawlers. Depending on the search engine by which you do not want to be indexed, you name its spider in the User-agent line:
- Googlebot for the Google search engine
- Googlebot-Image for the image search crawlers
- AdsBot-Google for Google AdWords
- Slurp for the Yahoo search engine
- bingbot for the Bing search engine
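The difference between an empty Disallow value and a single forward slash can be checked with Python's urllib.robotparser. This is a small sketch under stated assumptions: the helper can_crawl and the URL are hypothetical, and Python's parser only approximates how a real search engine matches user agents.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given rules let user_agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOW_ALL = "User-agent: Googlebot\nDisallow:\n"   # empty value: nothing is excluded
BLOCK_ALL = "User-agent: Googlebot\nDisallow: /\n"  # slash: everything is excluded

URL = "https://www.example.com/page.html"  # placeholder URL

print(can_crawl(ALLOW_ALL, "Googlebot", URL))  # True
print(can_crawl(BLOCK_ALL, "Googlebot", URL))  # False
print(can_crawl(BLOCK_ALL, "bingbot", URL))    # True, the rule names only Googlebot
```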
If you want to exclude your pages from multiple search engines, each bot must be listed on a separate line. To exclude directories or subpages from all search engines, you use a wildcard (*), i.e. a placeholder, in place of the bot name. The Robots.txt file then looks like this:
User-agent: *
Disallow: /sample-directory/
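Both cases, several bots listed on separate lines and a wildcard group for everyone else, can be combined in one file and tested with Python's urllib.robotparser. The file content, the bot name SomeOtherBot and the helper name check are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: two named bots are blocked completely,
# all other bots are kept out of one directory only.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: bingbot
Disallow: /

User-agent: *
Disallow: /sample-directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.example.com"  # placeholder domain


def check(agent: str, path: str) -> bool:
    """Return True if agent may request BASE + path under ROBOTS_TXT."""
    return parser.can_fetch(agent, BASE + path)


print(check("Googlebot", "/"))                              # False
print(check("bingbot", "/about.html"))                      # False
print(check("SomeOtherBot", "/about.html"))                 # True
print(check("SomeOtherBot", "/sample-directory/page.html")) # False
```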
Importance for search engine optimization
Robots.txt is directly relevant to search engine optimization, because pages that are excluded from crawling do not appear in the SERPs or, if they do, only without a description or with a “placeholder text”. Since a domain has only one Robots.txt file, the risk lies in excluding too many of its pages, which can also lead to a poor ranking. If no such file is created, duplicate content can be indexed, for example. Care must therefore be taken to ensure accuracy when creating a Robots.txt file.
If the file contains errors, certain pages may not be crawled and may be missing from the search engine index, which means that they cannot be found. If the file is set up correctly, it does not influence the ranking of a page. However, you should be aware that excluded subpages do not appear in the SERPs, while the main page does.