The correct robots.txt: how to edit the robots.txt file and block sections with Disallow


Good afternoon, dear friends! As you all know, search engine optimization is a responsible and delicate matter: you have to take absolutely every little detail into account to get an acceptable result.

Today we will talk about robots.txt, a file that is familiar to every webmaster. It contains the most basic instructions for search robots. As a rule, they gladly follow these instructions, and if the file is put together incorrectly they may refuse to index the web resource. Next, I will tell you how to compose a correct robots.txt and how to configure it.

In the preface I already described what it is; now I'll tell you why it is needed. Robots.txt is a small text file stored in the root of the site. It is used by search engines and clearly states the indexing rules, that is, which sections of the site should be indexed (added to the search) and which should not.

Typically, technical sections of a site are closed from indexing. Occasionally, non-unique pages are blacklisted (a copy-pasted privacy policy is a good example). Here the robots are "told" how to work with the sections that do need to be indexed. Very often, rules are written separately for several robots; we will talk about this further on.

If you configure robots.txt correctly, your site's position in search engines is bound to improve. Robots will take only useful content into account, ignoring duplicate and technical sections.

Creating robots.txt

To create the file, just use the standard tools of your operating system and then upload it to the server via FTP. Where it should live on the server is easy to guess: in the root. Typically this folder is called public_html.

You can easily get into it using any FTP client or your hosting's built-in file manager. Naturally, we will not upload an empty robots.txt to the server; let's write a few basic directives (rules) in it.

User-agent: *
Allow: /

With these lines in your robots file you address all robots (the User-agent directive) and allow them to index your entire site, including all technical pages (Allow: /).

Of course, this option is not particularly suitable for us: such a file will not do much for search engine optimization and definitely needs proper tuning. But before that, let's look at all the main directives and values used in robots.txt.

Directives

  • User-agent - one of the most important directives: it indicates which robots must follow the rules written after it. The rules apply until the next User-agent line in the file.
  • Allow - allows indexing of particular sections of the resource, for example "/" or "/tag/".
  • Disallow - on the contrary, prohibits indexing of sections.
  • Sitemap - the path to the site map (in XML format).
  • Host - the main mirror (with or without www, or one of several domains if you have more than one). The secure https protocol (if used) is also indicated here; with plain http it does not need to be specified.
  • Crawl-delay - sets the interval at which robots visit and download pages of your site; it helps reduce the load on the host.
  • Clean-param - excludes URL parameters on certain pages from indexing (like www.site.com/cat/state?admin_id8883278). Unlike the previous directives, it takes two values: the parameter itself and the path (address).
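For illustration, assuming the example URL above (the parameter name and path are just placeholders), such an entry would take the form:

Clean-param: admin_id /cat/state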

These are all the rules supported by the flagship search engines. Using them, we will build our robots.txt, with different variations for very different types of sites.

Settings

To configure the robots file properly, we need to know exactly which sections of the site should be indexed and which should not. For a simple one-page HTML + CSS website, it is enough to write a few basic directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we have specified the rules and values for all search engines. But it's better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files on our HTML site will be indexed. If we want to exclude a particular page or image, we need to specify the relative link to it in Disallow.
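For example (the page and image paths here are hypothetical):

Disallow: /old-page.html
Disallow: /images/photo.jpg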

You can also use services that generate the robots file automatically. I do not guarantee that they will produce a perfectly correct version, but you can try them as an introduction.

Among such services are:

With their help you can create robots.txt in automatic mode. Personally, I strongly do not recommend this option, because it is much easier to do it manually, customizing it for your platform.

When I talk about platforms, I mean all kinds of CMSs, frameworks, SaaS systems and much more. Next we will look at how to set up the robots file for WordPress and Joomla.

But before that, let’s highlight a few universal rules that can guide you when creating and setting up robots for almost any site:

Close from indexing (Disallow):

  • the site admin panel;
  • the personal account and registration/login pages;
  • cart, data from order forms (for an online store);
  • cgi folder (located on the host);
  • service sections;
  • ajax and json scripts;
  • UTM and Openstat tags;
  • various parameters.

Open (Allow):

  • Pictures;
  • JS and CSS files;
  • other elements that must be taken into account by search engines.

In addition, at the end do not forget to specify the Sitemap (path to the site map) and Host (main mirror) data.
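Put together, these universal rules might look roughly like this (a sketch only - the section paths are hypothetical and depend on your platform):

User-agent: *
Disallow: /admin/ # site admin panel
Disallow: /cart/ # cart and order forms
Disallow: /cgi-bin # cgi folder
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Allow: /*.css # CSS files
Allow: /*.js # JS files

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru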

Robots.txt for WordPress

As before, to create the file we drop robots.txt into the root of the site; its contents can then be edited over the same FTP or with a file manager.

There is a more convenient option - create a file using plugins. In particular, Yoast SEO has such a function. Editing robots directly from the admin panel is much more convenient, so I myself use this method of working with robots.txt.

How you decide to create this file is up to you; it is more important for us to understand exactly what directives should be there. On my sites running WordPress I use this option:

User-agent: * # rules for all robots, except Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (everything with the wp- prefix)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and users
Disallow: */trackback # notifications from WP that someone is linking to you
Disallow: */feed # feed in xml
Disallow: */rss # and rss
Disallow: */embed # built-in elements
Disallow: /xmlrpc.php #WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Disallow: /tag/ # tags (if available)
Allow: */uploads # open uploads (images, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and images in png format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and other formats
Allow: /wp-*.gif # /
Allow: /wp-admin/admin-ajax.php # used by plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-Param: openstat # and don’t forget about Openstat

Sitemap: # specify the path to the site map
Host: https://site.ru # main mirror

Attention! When copying lines to a file, do not forget to remove all comments (text after #).

This robots.txt option is the most popular among webmasters who use WP. Is it ideal? No. You can try to add something or, on the contrary, remove something. But keep in mind that mistakes are common when adapting the robots file to a particular engine. We will talk about them further on.

Robots.txt for Joomla

And although few people use Joomla in 2018, I believe this wonderful CMS cannot be ignored. When promoting projects on Joomla, you will certainly have to create a robots file; otherwise, how else would you block unnecessary elements from indexing?

As in the previous case, you can create a file manually by simply uploading it to the host, or use a module for these purposes. In both cases, you will have to configure it correctly. This is what the correct option for Joomla will look like:

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # don't forget to change the address here to yours
Sitemap: site.ru/sitemap.xml # and here

As a rule, this is enough to prevent unnecessary files from getting into the index.

Errors during setup

Very often people make mistakes when creating and setting up a robots file. Here are the most common of them:

  • Rules are written for only one User-agent.
  • Host and Sitemap are missing.
  • The http protocol is included in the Host directive (the protocol only needs to be specified for https).
  • Nesting rules are ignored when opening/closing images.
  • Pages with UTM and Openstat tags are not closed from indexing.
  • The Host and Sitemap directives are written separately for each robot.
  • The file is worked out only superficially.

It is very important to configure this small file correctly. If you make serious mistakes, you can lose a significant part of the traffic, so be extremely careful when setting up.

How to check a file?

For these purposes, it is better to use special services from Yandex and Google, since these search engines are the most popular and in demand (most often the only ones used); there is no point in considering search engines such as Bing, Yahoo or Rambler.

First, let's consider the option with Yandex. Go to Yandex.Webmaster, then to Tools – Robots.txt analysis.

Here you can check the file for errors, as well as check in real time which pages are open for indexing and which are not. Very convenient.

Google has exactly the same service. Go to Search Console, find the Crawl tab and select the robots.txt checking tool.

The functions here are exactly the same as in Yandex's service.

Please note that it shows me 2 errors. This is due to the fact that Google does not recognize the directives for clearing the parameters that I specified for Yandex:

Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

You can ignore this, because Google's robots only use the rules written for GoogleBot.

Conclusion

The robots.txt file is very important for the SEO of your website. Take its setup seriously, because if it is implemented incorrectly, all your work can go to waste.

Keep in mind all the instructions I've shared in this article, and don't forget that you don't have to copy my robots variations exactly. It is quite possible that you will have to further understand each of the directives, adjusting the file to suit your specific case.

And if you want to understand robots.txt and creating websites on WordPress more deeply, then I invite you to. Here you will learn how you can easily create a website, not forgetting to optimize it for search engines.

The first thing a search bot does when it comes to your site is look for and read the robots.txt file. What is this file? It is a set of instructions for the search engine.

It is a text file with a .txt extension located in the root directory of the site. This set of instructions tells the search robot which pages and files on the site to index and which not to. It also indicates the main mirror of the site and where to find the sitemap.

What is the robots.txt file needed for? For proper indexing of your site, so that the search results do not contain duplicate pages, various service pages and documents. Once you configure the directives in robots.txt correctly, you will save your site from many problems with indexing and mirroring.

How to create the correct robots.txt

Creating robots.txt is quite easy: create a text document in the standard Windows Notepad and write the directives for search engines in it. Then save the file under the name "robots" with the "txt" extension. Now upload it to the hosting, into the root folder of the site. Note that you can create only one "robots" document per site. If this file is missing from the site, the bot automatically "decides" that everything can be indexed.

Since there is only one file, it contains instructions for all search engines. You can write both separate instructions for each search engine and a general set for all of them at once. Instructions for different search bots are separated with the User-agent directive. We will talk more about this below.

Robots.txt directives

The file “for robots” can contain the following directives for managing indexing: User-agent, Disallow, Allow, Sitemap, Host, Crawl-delay, Clean-param. Let's look at each instruction in more detail.

User-agent directive

The User-agent directive indicates which search engine (more precisely, which specific bot) the instructions are for. If "*" is given, the instructions are intended for all robots. If a specific bot is named, such as Googlebot, the instructions are intended only for Google's main indexing robot. Moreover, if there are instructions written specifically for Googlebot as well as a general block for everyone else, Google will read only its own instructions and ignore the general one. The Yandex bot behaves the same way. Let's look at example entries:

User-agent: YandexBot - instructions only for the main Yandex indexing bot
User-agent: Yandex - instructions for all Yandex bots
User-agent: * - instructions for all bots

Disallow and Allow directives

The Disallow and Allow directives specify what to index and what not to. Disallow tells the robot not to index a page or an entire section of the site; Allow, on the contrary, indicates what should be indexed.

Disallow: / - prohibits indexing the entire site
Disallow: /papka/ - prohibits indexing the entire contents of the folder
Disallow: /files.php - prohibits indexing the files.php file

Allow: /cgi-bin – allows cgi-bin pages to be indexed

It is possible, and often simply necessary, to use special characters in the Disallow and Allow directives. They are needed to write wildcard (mask) rules.

The special character * matches any sequence of characters, including an empty one. It is implied by default at the end of every rule: even if you have not written it, the search engine will add it itself. Usage examples:

Disallow: /cgi-bin/*.aspx – prohibits indexing all files with the .aspx extension
Disallow: /*foto - prohibits indexing of files and folders containing the word foto

The special character $ cancels the effect of the special character “*” at the end of the rule. For example:

Disallow: /example$ - prohibits indexing '/example', but does not prohibit '/example.html'

And if you write it without the special symbol $, then the instruction will work differently:

Disallow: /example - disallows both '/example' and '/example.html'

Sitemap Directive

The Sitemap directive tells the search engine robot where the site map is located on the hosting. The sitemap should be in XML format (sitemap.xml). A site map is needed for faster and more complete indexing of the site; moreover, it does not have to be a single file - there can be several. Directive format:

Sitemap: http://site/sitemaps1.xml
Sitemap: http://site/sitemaps2.xml

Host directive

The Host directive tells the robot the main mirror of the site. If your site has any mirrors, you should always specify this directive; otherwise the Yandex robot will index at least two versions of the site, with and without www, until the mirror robot glues them together. Example entries:

Host: www.site.ru
Host: site.ru

In the first case, the robot will index the version with www, in the second case, without. It is allowed to specify only one Host directive in the robots.txt file. If you enter several of them, the bot will process and take into account only the first one.

A valid Host directive must contain the following data:
  • the connection protocol (HTTP or HTTPS);
  • a correctly written domain name (an IP address cannot be used);
  • a port number, if necessary (for example, Host: site.com:8080).

Directives made incorrectly will simply be ignored.

Crawl-delay directive

The Crawl-delay directive allows you to reduce the load on the server. It is needed when your site comes under the onslaught of various bots. The directive tells the search bot how long to wait between finishing the download of one page and starting the download of the next. It must come immediately after the Disallow and/or Allow entries. The Yandex search robot can read fractional values, for example 1.5 (one and a half seconds).
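A minimal sketch of where the directive goes (the blocked section and the one-and-a-half-second value are just examples):

User-agent: Yandex
Disallow: /search/
Crawl-delay: 1.5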

Clean-param directive

The Clean-param directive is needed for sites whose pages contain dynamic parameters that do not affect their content. This is various service information: session identifiers, user IDs, referrers and so on. The directive is used so that there are no duplicates of these pages: it tells the search engine not to re-download the repeating information, which also reduces the load on the server and the time the robot needs to crawl the site.

Clean-param: s /forum/showthread.php

This entry tells the search engine that the s parameter should be considered insignificant for all URLs that start with /forum/showthread.php. The maximum length of an entry is 500 characters.

We've sorted out the directives, let's move on to setting up our robots file.

Setting up robots.txt

Let's proceed directly to setting up the robots.txt file. It must contain at least two entries:

User-agent - indicates which search engine the instructions below are for.
Disallow - specifies which part of the site should not be indexed. It can block both a single page and entire sections of the site.

Moreover, you can indicate that these directives are intended for all search engines or for one specifically; this is set in the User-agent directive. If you want all bots to read the instructions, put an asterisk:
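User-agent: *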

If you want to write instructions for a specific robot, you must specify its name.

User-agent: YandexBot

A simplified example of a correctly composed robots file would be like this:

User-agent: *
Disallow: /files.php
Disallow: /foto/
Host: site.ru

where * indicates that the instructions are intended for all search engines;
Disallow: /files.php - prohibits indexing of the files.php file;
Disallow: /foto/ - prohibits indexing of the entire "foto" section with all nested files;
Host: site.ru - tells the robots which mirror to index.

If you don’t have pages on your site that need to be closed from indexing, then your robots.txt file should be like this:

User-agent: *
Disallow:
Host: site.ru

Robots.txt for Yandex

To indicate that the instructions are intended for the Yandex search engine, specify the directive User-agent: Yandex. If we enter "Yandex", all Yandex robots will follow the rules; if we specify "YandexBot", the command applies only to the main indexing robot.

You also need to specify the Host directive and indicate the main mirror of the site there. As I wrote above, this is done to avoid duplicate pages. A correct robots.txt for Yandex will look something like this:
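A minimal sketch that simply combines the directives already described (the blocked sections are hypothetical - substitute your own):

User-agent: Yandex
Disallow: /cgi-bin
Disallow: /search/
Disallow: /admin/
Host: site.ru

Sitemap: http://site.ru/sitemap.xml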

Robots.txt is a text file that contains site indexing parameters for the search engine robots.

Recommendations on the content of the file

Yandex supports the following directives:

Directive - What it does
User-agent * - Indicates the robot to which the rules listed in robots.txt apply.
Disallow - Prohibits indexing of site sections or individual pages.
Sitemap - Specifies the path to the Sitemap file posted on the site.
Clean-param - Indicates to the robot that the page URL contains parameters (such as UTM tags) that should be ignored when indexing.
Allow - Allows indexing of site sections or individual pages.
Crawl-delay - Specifies the minimum interval (in seconds) for the search robot to wait after loading one page before starting to load the next.

* Mandatory directive.

You'll most often need the Disallow, Sitemap, and Clean-param directives. For example:

User-agent: * # specify the robots that the directives are set for
Disallow: /bin/ # disables links from the shopping cart
Disallow: /search/ # disables links to pages of the search embedded on the site
Disallow: /admin/ # disables links from the admin panel
Sitemap: http://example.com/sitemap # specify the sitemap file of the site for the robot
Clean-param: ref /some_dir/get_book.pl

Robots from other search engines and services may interpret the directives in a different way. For the robots.txt file to be taken into account by the robot, it must be located in the root directory of the site and respond with an HTTP 200 code. The indexing robot doesn't support the use of files hosted on other sites.

You can check the server's response and the accessibility of robots.txt to the robot using the tool.

If your robots.txt file redirects to another robots.txt file (for example, when moving a site), add the redirect target site to Yandex.Webmaster and verify the rights to manage this site.


Note. The robot is case-sensitive in substrings (file names or paths, robot names) and case-insensitive in directive names.

Using the Cyrillic alphabet

The use of Cyrillic is prohibited in the robots.txt file and server HTTP headers.
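For domain names, Punycode is used instead, and Cyrillic paths are percent-encoded. A rough illustration (the encoded forms below are believed to correspond to сайт.рф and /корзина - double-check them with a converter):

#Incorrect:
User-agent: Yandex
Disallow: /корзина
Sitemap: сайт.рф/sitemap.xml

#Correct:
User-agent: Yandex
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0
Sitemap: https://xn--80aswg.xn--p1ai/sitemap.xml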

The robots.txt file is one of the most important files when optimizing any website. Its absence can lead to a high load on the site from search robots and to slow indexing and re-indexing, while an incorrect setup can make the site disappear from search completely or simply never get indexed. As a result, it will not be found in Yandex, Google or other search engines. Let's look at all the nuances of setting up robots.txt correctly.


How does robots.txt affect site indexing?

Search robots will index your site regardless of the presence of a robots.txt file. If such a file exists, then robots can be guided by the rules that are written in this file. At the same time, some robots may ignore certain rules, or some rules may be specific only to some bots. In particular, GoogleBot does not use the Host and Crawl-Delay directives, YandexNews has recently begun to ignore the Crawl-Delay directive, and YandexDirect and YandexVideoParser ignore more general directives in robots (but are guided by those specified specifically for them).

More about exceptions:
Yandex exceptions
Robot Exception Standard (Wikipedia)

The greatest load on the site is created by robots that download content from your site. By indicating exactly what to index and what to ignore, as well as at what time intervals to download, you can, on the one hand, significantly reduce the load on the site from robots and, on the other, speed up the crawling process by prohibiting the crawling of unnecessary pages.

Such unnecessary pages include AJAX and JSON scripts responsible for pop-up forms, banners, captcha output and so on; order forms and the shopping cart with all the purchase steps; the search functionality; the personal account; and the admin panel.

For most robots, it is also advisable to disable indexing of all JS and CSS. But for GoogleBot and Yandex, such files must be left for indexing, since they are used by search engines to analyze the convenience of the site and its ranking (Google proof, Yandex proof).
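For instance, while a general block may close the template or plugin folders, styles and scripts can be explicitly reopened for these bots with wildcard Allow rules (a sketch - the folder name is hypothetical):

User-agent: GoogleBot
Disallow: /plugins/
Allow: /plugins/*.css
Allow: /plugins/*.js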

Robots.txt directives

Directives are rules for robots. There is an original specification from 1994 and an extended standard from 1996. However, not all search engines and robots support every directive. For this reason, it is more useful for us to know not the standard itself but how the main robots are guided by particular directives.

Let's look at them in order.

User-agent

This is the most important directive that determines for which robots the rules follow.

For all robots:
User-agent: *

For a specific bot:
User-agent: GoogleBot

Please note that the user agent name in robots.txt is not case-sensitive. That is, the user agent for Google can just as easily be written like this:
user-agent: googlebot

Below is a table of the main user agents of various search engines.

Bot - Function

Google
Googlebot - Google's main indexing robot
Googlebot-News - Google News
Googlebot-Image - Google Images
Googlebot-Video - video
Mediapartners-Google, Mediapartners - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Googlebot for apps

Yandex
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot that accesses a page when it is added through the "Add URL" form
YandexFavicons - robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot

Bing
Bingbot - Bing's main indexing robot

Yahoo!
Slurp - Yahoo!'s main indexing robot

Mail.Ru
Mail.Ru - Mail.Ru's main indexing robot

Rambler
StackRambler - formerly Rambler's main indexing robot. However, on June 23, 2011 Rambler stopped supporting its own search engine and now uses Yandex technology on its services. No longer relevant.

Disallow and Allow

Disallow blocks pages and sections of the site from indexing.
Allow forces pages and sections of the site to be indexed.

But it's not that simple.

First, you need to know the additional operators and understand how they are used - these are *, $ and #.

* is any number of characters, including their absence. In this case, you don’t have to put an asterisk at the end of the line; it is assumed that it is there by default.
$ - indicates that the character before it should be the last one.
# is a comment; everything after this character in the line is not taken into account by the robot.

Examples of using:

Disallow: *?s=
Disallow: /category/$

Secondly, you need to understand how nested rules are applied.
Remember that the order in which the directives are written does not matter. Whether a path ends up open or closed to indexing is determined by which directories (the more specific paths) are specified. Let's look at an example.

Allow: *.css
Disallow: /template/

http://site.ru/template/ - closed from indexing
http://site.ru/template/style.css - closed from indexing
http://site.ru/style.css - open for indexing
http://site.ru/theme/style.css - open for indexing

If you need all .css files to be open for indexing, you will have to additionally register this for each of the closed folders. In our case:

Allow: *.css
Allow: /template/*.css
Disallow: /template/

Again, the order of the directives is not important.

Sitemap

Directive for specifying the path to the XML Sitemap file. The URL is written in the same way as in the address bar.

For example,

Sitemap: http://site.ru/sitemap.xml

The Sitemap directive is specified anywhere in the robots.txt file without being tied to a specific user-agent. You can specify multiple Sitemap rules.

Host

Directive for specifying the main mirror of the site (in most cases: with www or without www). Please note that the main mirror is specified WITHOUT http://, but WITH https://. Also, if necessary, the port is indicated.
The directive is supported only by Yandex and Mail.Ru bots. Other robots, in particular GoogleBot, will not take the command into account. Host is registered only once!

Example 1:
Host: site.ru

Example 2:
Host: https://site.ru

Crawl-delay

Directive for setting the time interval between the robot downloading website pages. Supported by Yandex robots, Mail.Ru, Bing, Yahoo. The value can be set in integer or fractional units (separator is a dot), time in seconds.

Example 1:
Crawl-delay: 3

Example 2:
Crawl-delay: 0.5

If the site has a small load, then there is no need to set such a rule. However, if indexing pages by a robot leads to the site exceeding the limits or experiencing significant loads to the point of server outages, then this directive will help reduce the load.

The higher the value, the fewer pages the robot will download in one session. The optimal value is determined individually for each site. It is better to start with not very large values - 0.1, 0.2, 0.5 - and gradually increase them. For search engine robots that are less important for promotion results, such as Mail.Ru, Bing and Yahoo, you can initially set higher values than for Yandex robots.

Clean-param

This rule tells the crawler that URLs with the specified parameters should not be indexed. The rule specifies two arguments: a parameter and the section URL. The directive is supported by Yandex.

Clean-param: author_id http://site.ru/articles/

Clean-param: author_id&sid http://site.ru/articles/

Clean-Param: utm_source&utm_medium&utm_campaign

Other options

In the extended robots.txt specification you can also find the Request-rate and Visit-time parameters. However, at the moment they are not supported by the major search engines.

The meaning of the directives:
Request-rate: 1/5 — load no more than one page in five seconds
Visit-time: 0600-0845 - load pages only between 6 a.m. and 8:45 a.m. GMT.

Closing robots.txt

If you need to configure your site to NOT be indexed by search robots, then you need to specify the following directives:

User-agent: *
Disallow: /

Make sure that these directives are written in the robots.txt of your site's test (staging) versions.

Correct setting of robots.txt

For Russia and the CIS countries, where Yandex’s share is significant, directives should be prescribed for all robots and separately for Yandex and Google.

To properly configure robots.txt, use the following algorithm:

  1. Close the site admin panel from indexing
  2. Close your personal account, authorization, and registration from indexing
  3. Block your shopping cart, order forms, delivery and order data from indexing
  4. Close ajax and json scripts from indexing
  5. Close the cgi folder from indexing
  6. Block plugins, themes, js, css from indexing for all robots except Yandex and Google
  7. Disable search functionality from indexing
  8. Close from indexing service sections that do not provide any value for the site in search (404 error, list of authors)
  9. Block technical duplicate pages from indexing, as well as pages on which all content in one form or another is duplicated from other pages (calendars, archives, RSS)
  10. Block pages with filter, sorting, comparison parameters from indexing
  11. Block pages with UTM tags and session parameters from indexing
  12. Check what is indexed by Yandex and Google using the “site:” parameter (type “site:site.ru” in the search bar). If the search contains pages that also need to be closed from indexing, add them to robots.txt
  13. Specify Sitemap and Host
  14. If necessary, enter Crawl-Delay and Clean-Param
  15. Check the correctness of robots.txt using Google and Yandex tools (described below)
  16. After 2 weeks, check the search results again to see whether any new pages that should not be indexed have appeared. If necessary, repeat the steps above.

Example robots.txt

# An example of a robots.txt file for setting up a hypothetical site https://site.ru

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Crawl-Delay: 5

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Crawl-Delay: 0.5

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru

How to add and where is robots.txt located

After you have created the robots.txt file, it must be placed on your website at site.ru/robots.txt - i.e. in the root directory. The search robot always accesses the file at the URL /robots.txt

How to check robots.txt

Robots.txt is checked using the following links:

  • In Yandex.Webmaster: the Tools > Robots.txt analysis tab
  • In Google Search Console: the Crawl tab > robots.txt file inspection tool

Typical errors in robots.txt

At the end of the article I will list a few typical mistakes made in the robots.txt file:

  • robots.txt is missing
  • in robots.txt the site is closed from indexing (Disallow: /)
  • the file contains only the most basic directives, there is no detailed elaboration of the file
  • in the file, pages with UTM tags and session identifiers are not blocked from indexing
  • the file contains only directives
    Allow: *.css
    Allow: *.js
    Allow: *.png
    Allow: *.jpg
    Allow: *.gif
    while the css, js, png, jpg, gif files are closed by other directives in a number of directories
  • the Host directive is specified several times
  • the https protocol is not specified in Host (for sites running over https)
  • the path to the Sitemap is incorrect, or the wrong protocol or site mirror is specified
