Robots.txt content. How to edit the robots.txt file. What does setting up the file do?


The robots.txt file is located in the root directory of your site. For example, on the site www.example.com the robots.txt file address will look like www.example.com/robots.txt. The robots.txt file is a regular text file, which complies with the robot exclusion standard, and includes one or more rules, each of which denies or allows a particular search robot to access a specific path on the site.

Here's an example of a simple robots.txt file with two rules. Explanations are given below.

# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Explanations

  1. A user agent called Googlebot should not crawl the directory http://example.com/nogooglebot/ and its subdirectories.
  2. All other user agents may crawl the entire site (this rule can be omitted; the result is the same, since full access is granted by default).
  3. The Sitemap file for this site is located at http://www.example.com/sitemap.xml.

Below are some tips for working with robots.txt files. We recommend that you study the full syntax of these files, as the syntax rules used to create them are not obvious and you must understand them.

Format and layout

You can create a robots.txt file in almost any text editor with support for UTF-8 encoding. Avoid using word processors, as they often save files in a proprietary format and add illegal characters, such as curly quotation marks, that are not recognized by search robots.

When creating and testing robots.txt files, use a testing tool. It allows you to analyze the syntax of a file and find out how it will function on your site.

Rules regarding file format and location

  • The file should be named robots.txt.
  • There should be only one such file on the site.
  • The robots.txt file must be placed in the site's root directory. For example, to control crawling of all pages on the site http://www.example.com/, the robots.txt file should be located at http://www.example.com/robots.txt. It must not be in a subdirectory (for example, at http://example.com/pages/robots.txt). If you have difficulty accessing the root directory, contact your hosting provider. If you don't have access to the site's root directory, use an alternative blocking method such as meta tags.
  • The robots.txt file can be added to addresses with subdomains (for example, http://website.example.com/robots.txt) or non-standard ports (for example, http://example.com:8181/robots.txt).
  • Any text after the # symbol is considered a comment.

Syntax

  • The robots.txt file must be a text file encoded in UTF-8 (which includes ASCII character codes). Other character sets cannot be used.
  • The robots.txt file consists of groups.
  • Each group may contain several rules, one per line. These rules are also called directives.
  • The group includes the following information:
    • To which user agent the group's directives apply.
    • Which directories or files this agent can access.
    • Which directories or files this agent cannot access.
  • Group directives are read from top to bottom. A robot follows the rules of only one group: the one whose user agent most closely matches it (see the example after this list).
  • By default it is assumed that if access to a page or directory is not blocked by a Disallow: rule, then the user agent can process it.
  • Rules are case-sensitive. Thus, the Disallow: /file.asp rule applies to the URL http://www.example.com/file.asp, but not to http://www.example.com/File.asp.
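For instance, in the hypothetical file sketched below, Googlebot-Image would follow only its own group (and so only the /images/ rule), while every other robot would follow the rules of the * group:

User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /private/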

Directives used in robots.txt files

  • User-agent: A mandatory directive; there may be several of them in a group. It determines which search robot the rules apply to. Each group begins with this line. Most user agents related to Google robots can be found in a special list and in the Internet Robots Database. The * wildcard character is supported to denote a path prefix, a suffix, or the entire path. Use the * sign as shown in the example below to block access for all crawlers (except AdsBot robots, which must be specified separately). We recommend that you familiarize yourself with the list of Google robots. Examples:
    # Example 1. Blocking access only for Googlebot
    User-agent: Googlebot
    Disallow: /

    # Example 2. Blocking access for Googlebot and AdsBot
    User-agent: Googlebot
    User-agent: AdsBot-Google
    Disallow: /

    # Example 3. Blocking access for all robots except AdsBot
    User-agent: *
    Disallow: /
  • Disallow: Points to a directory or page, relative to the root domain, that must not be crawled by the user agent defined above. If it is a page, the full path must be specified, as in the browser's address bar. If it is a directory, the path must end with a slash (/). The * wildcard character is supported to denote a path prefix, a suffix, or the entire path.
  • Allow: At least one Disallow: or Allow: directive must be present in each group. Points to a directory or page, relative to the root domain, that may be crawled by the user agent defined above. It is used to override a Disallow directive and allow crawling of a subdirectory or page inside a directory that is otherwise closed to crawling. If it is a page, the full path must be specified, as in the browser's address bar. If it is a directory, the path must end with a slash (/). The * wildcard character is supported to denote a path prefix, a suffix, or the entire path.
  • Sitemap: An optional directive; there may be several of these in the file, or none. It indicates the location of a Sitemap used on this site. The URL must be complete. Google does not process or validate URL variations with the http and https prefixes, or with and without the www element. Sitemaps tell Google which content should be crawled, as opposed to which content it can or cannot crawl. Example:
    Sitemap: https://example.com/sitemap.xml
    Sitemap: http://www.example.com/sitemap.xml

Other rules are ignored.

One more example

The robots.txt file consists of groups. Each of them starts with a User-agent line, which defines the robot that must follow the rules. Below is an example of a file with two groups and explanatory comments for both.

# Block Googlebot's access to example.com/directory1/... and example.com/directory2/...
# but allow access to directory2/subdirectory1/...
# Access to all other directories is allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/

# Block access to the entire site for another search engine.
User-agent: anothercrawler
Disallow: /

Full syntax of the robots.txt file

The full syntax is described in this article. We recommend that you familiarize yourself with it, as there are some important nuances in the syntax of the robots.txt file.

Useful rules

Here are some common rules for the robots.txt file:

Prohibit crawling of the entire site. Note that in some cases site URLs may be present in the index even if they have not been crawled. This rule does not apply to AdsBot robots, which must be specified separately.

User-agent: *
Disallow: /

To prevent crawling of a directory and all its contents, place a forward slash after the directory name. Do not use robots.txt to protect confidential information; use authentication for that. URLs disallowed in robots.txt can still be indexed, and the contents of robots.txt can be viewed by any user, revealing the location of files with sensitive information.

User-agent: *
Disallow: /calendar/
Disallow: /junk/

To allow crawling by only one crawler:

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

To allow crawling by all crawlers except one:

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /

To prevent a specific page from being crawled, specify this page after the slash.

User-agent: * Disallow: /private_file.html

To hide a specific image from the Google Images robot

User-agent: Googlebot-Image Disallow: /images/dogs.jpg

To hide all images from your site from the Google Images robot

User-agent: Googlebot-Image Disallow: /

To prevent all files of a specific type from being crawled (in this case, GIF files)

User-agent: Googlebot Disallow: /*.gif$

To block certain pages on your site but still show AdSense ads on them, use the Disallow rule for all robots except Mediapartners-Google. As a result, this robot will be able to access pages removed from search results in order to select ads to display to a particular user.

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
To specify a URL that ends with a particular fragment, use the $ symbol. For example, for URLs ending in .xls, use the following code:

User-agent: Googlebot
Disallow: /*.xls$

Most robots are well designed and do not cause any problems for website owners. But if a bot was written by an amateur or "something went wrong," it can create a significant load on the site it crawls. By the way, spiders do not penetrate the server the way viruses do: they simply request the pages they need remotely (essentially they are analogues of browsers, but without the page-viewing function).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a very simple syntax, which is described in great detail in, for example, the Yandex help and the Google help. It usually indicates which search bot the directives below are intended for (the "User-agent" name), along with allowing ("Allow") and prohibiting ("Disallow") rules; "Sitemap" is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created quite a long time ago and something was added later. There are directives and design rules that will only be understood by robots of certain search engines. In RuNet, only Yandex and Google are of interest, which means that you should familiarize yourself with their help on compiling robots.txt in particular detail (I provided the links in the previous paragraph).

For example, it used to be useful to tell the Yandex search engine which web address of your project is the main one using the special "Host" directive, which only this search engine understands (well, Mail.ru too, since their search is powered by Yandex). True, at the beginning of 2018 Yandex cancelled Host, and its functions, as with other search engines, are now handled by a 301 redirect.

Even if your resource does not have mirrors, it is useful to indicate which spelling of the address (with or without www) is the main one.

Now let's talk a little about the syntax of this file. Directives in robots.txt look like this:

<field>:<space><value>

The correct code should contain at least one "Disallow" directive after each "User-agent" entry. An empty file implies permission to index the entire site.

User-agent

"User-agent" directive must contain the name of the search bot. Using it, you can set up rules of behavior for each specific search engine (for example, create a ban on indexing a separate folder only for Yandex). An example of writing “User-agent” addressed to all bots visiting your resource looks like this:

User-agent: *

If you want to set certain conditions in the “User-agent” only for one bot, for example, Yandex, then you need to write this:

User-agent: Yandex

Name of search engine robots and their role in the robots.txt file

Every search engine's bot has its own name (for example, Rambler's is StackRambler). Here I will list the most famous of them:

Google (http://www.google.com): Googlebot
Yandex (http://www.ya.ru): Yandex
Bing (http://www.bing.com/): bingbot

In addition to their main bots, major search engines sometimes have separate instances for indexing blogs, news, images, and so on. A lot of information on the types of bots is available (for Yandex) and (for Google).

What should you do in this case? If you need to write a blocking rule that all types of Google robots must follow, use the name Googlebot, and all the other spiders of this search engine will obey as well. However, you can also block, for example, only the indexing of pictures by specifying the Googlebot-Image bot as the User-agent. This may not be very clear yet, but it will be easier with examples.
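For instance, a minimal sketch that hides only images while every other Google robot keeps crawling (the rule is illustrative):

User-agent: Googlebot-Image
Disallow: /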

Examples of using the Disallow and Allow directives in robots.txt

I'll give a few simple examples of using these directives, with an explanation of what they do.

  1. The code below allows all bots (the asterisk in User-agent) to index all content without exception. This is achieved by an empty Disallow directive.
     User-agent: *
     Disallow:
  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index. This is done by Disallow with "/" in the value field.
     User-agent: *
     Disallow: /
  3. In this case, all bots are prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory).
     User-agent: *
     Disallow: /image/
  4. To block a single file, it is enough to specify its absolute path:
     User-agent: *
     Disallow: /katalog1/katalog2/private_file.html

    Looking ahead a little, I’ll say that it’s easier to use the asterisk (*) symbol so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the "image" directory will be blocked, along with all files and directories whose names begin with "image" (i.e. the files "image.htm" and "images.htm", the directories "image", "images1", "image34", and so on):
     User-agent: *
     Disallow: /image
     The point is that, by default, an asterisk is implied at the end of the entry, matching any characters or none at all. More about that below.
  6. With Allow directives we allow access. They complement Disallow well. For example, this condition prohibits the Yandex search robot from downloading (indexing) everything except web pages whose address begins with /cgi-bin:
     User-agent: Yandex
     Allow: /cgi-bin
     Disallow: /

    Well, or this obvious example of using the Allow and Disallow combination:

     User-agent: *
     Disallow: /catalog
     Allow: /catalog/auto

  7. When describing paths for Allow-Disallow directives, you can use the symbols "*" and "$", thus defining certain logical expressions.
    1. The "*" symbol (asterisk) means any sequence of characters, including an empty one. The following example prohibits all search engines from indexing files with the ".php" extension:
       User-agent: *
       Disallow: *.php$
    2. Why is the $ sign needed at the end? The point is that, by the logic of robots.txt, an implied asterisk is added at the end of every directive (it is not written, but it is as if it were there). For example, we write: Disallow: /images

      Implying that this is the same as:

      Disallow: /images*

      That is, this rule prohibits the indexing of all files (web pages, images and other file types) whose address begins with /images, followed by anything at all (see the example above). The $ symbol simply cancels that implied asterisk at the end. For example:

      Disallow: /images$

      Only prevents indexing of the /images file, but not /images.html or /images/primer.html. Well, in the first example, we prohibited indexing only files ending in .php (having such an extension), so as not to catch anything unnecessary:

      Disallow: *.php$

  • In many engines, users get human-readable URLs, while system-generated URLs contain a question mark "?" in the address. You can take advantage of this and write the following rule in robots.txt:
     User-agent: *
     Disallow: /*?

    The asterisk after the question mark suggests itself, but, as we found out just above, it is already implied at the end. Thus, we will prohibit the indexing of search pages and other service pages created by the engine, which the search robot can reach. It won’t be superfluous, because the question mark is most often used by CMS as a session identifier, which can lead to duplicate pages being included in the index.

  • Sitemap and Host directives (for Yandex) in Robots.txt

    To avoid unpleasant problems with site mirrors, it was previously recommended to add a Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - indicates the main mirror of the site for Yandex

    For example, if you had not yet switched to a secure protocol, you had to specify in Host not the full URL but only the domain name (without http://, e.g. myhost.ru). If you had already switched to https, you needed to specify the full URL (such as https://myhost.ru).
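    In other words, the two cases looked roughly like this (the domain is a placeholder):

    # before moving to https
    Host: myhost.ru

    # after moving to https
    Host: https://myhost.ru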

    Canonical is a wonderful tool for combating duplicate content: the search engine simply will not index a page if a different URL is specified in Canonical. For example, for such a page of my blog (a page with pagination), Canonical points to https://site and there should be no problems with duplicate titles.

    But I digress...

    If your project is created on the basis of any engine, duplicate content will appear with high probability, which means you need to fight it, including with a ban in robots.txt and, especially, with the meta tag, because in the first case Google may ignore the ban, but it cannot disregard the meta tag (it was brought up that way).

    For example, in WordPress pages with very similar content can be included in the search engine index if indexing of the content of categories, the content of the tag archive, and the content of temporary archives is allowed. But if, using the Robots meta tag described above, you create a ban on the tag archive and temporary archive (you can leave the tags and prohibit indexing of the content of the categories), then duplication of content will not occur. How to do this is described in the link given just above (to the OlInSeoPak plugin)

    To summarize, I will say that the Robots file is intended for setting global rules for denying access to entire site directories, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions just above.

    Now let's look at specific examples of robots designed for different engines - Joomla, WordPress and SMF. Naturally, all three options created for different CMS will differ significantly (if not radically) from each other. True, they will all have one thing in common, and this moment is connected with the Yandex search engine.

    Because Yandex carries a lot of weight in RuNet, we need to take into account all the nuances of how it works, and here the Host directive will help us. It explicitly tells this search engine the main mirror of your site.

    For this, it is recommended to use a separate User-agent block intended only for Yandex (User-agent: Yandex). This is because other search engines may not understand Host, and its inclusion in the User-agent record intended for all search engines (User-agent: *) may lead to negative consequences and incorrect indexing.

    It's hard to say how things really stand, because search algorithms are a thing in themselves, so it's better to do as advised. But in that case we have to duplicate, under the User-agent: Yandex entry, all the rules that we set for User-agent: * (see the sketch below). If you leave User-agent: Yandex with an empty Disallow:, you will allow Yandex to go anywhere and drag everything into the index.
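    A minimal sketch of that structure (the blocked path and the domain are placeholders):

    User-agent: *
    Disallow: /wp-admin/

    User-agent: Yandex
    Disallow: /wp-admin/
    Host: myhost.ru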

    Robots for WordPress

    I will not give an example of a file that the developers recommend. You can watch it yourself. Many bloggers do not at all limit Yandex and Google bots in their walks through the content of the WordPress engine. Most often on blogs you can find robots automatically filled with a plugin.

    But, in my opinion, we should still help the search in the difficult task of sifting the wheat from the chaff. Firstly, it will take a lot of time for Yandex and Google bots to index this garbage, and there may not be any time left to add web pages with your new articles to the index. Secondly, bots crawling through garbage engine files will create additional load on your host’s server, which is not good.

    You can see my version of this file for yourself. It’s old and hasn’t been changed for a long time, but I try to follow the principle “don’t fix what isn’t broken,” and it’s up to you to decide: use it, make your own, or steal from someone else. I also had a ban on indexing pages with pagination until recently (Disallow: */page/), but recently I removed it, relying on Canonical, which I wrote about above.

    But in general, the only correct file for WordPress probably doesn't exist. You can, of course, implement any prerequisites in it, but who said that they will be correct. There are many options for ideal robots.txt on the Internet.

    I will give two extremes:

    1. You can find a megafile with detailed explanations (the # symbol marks comments that would be better deleted in a real file):

       User-agent: *      # general rules for robots, except Yandex and Google, because the rules for them are below
       Disallow: /cgi-bin # folder on the hosting
       Disallow: /?       # all request parameters on the main page
       Disallow: /wp-     # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
       Disallow: /wp/     # if there is a /wp/ subdirectory where the CMS is installed (if not, the rule can be deleted)
       Disallow: *?s=     # search
       Disallow: *&s=     # search
       Disallow: /search/ # search
       Disallow: /author/ # author archive
       Disallow: /users/  # author archive
       Disallow: */trackback # trackbacks, notifications in comments about an open link to an article
       Disallow: */feed   # all feeds
       Disallow: */rss    # rss feed
       Disallow: */embed  # all embeds
       Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (if you don't use it, the rule can be deleted)
       Disallow: /xmlrpc.php # WordPress API file
       Disallow: *utm=    # links with utm tags
       Disallow: *openstat= # links with openstat tags
       Allow: */uploads   # open the uploads folder with files

       User-agent: GoogleBot # rules for Google (comments are not duplicated)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Disallow: *utm=
       Disallow: *openstat=
       Allow: */uploads
       Allow: /*/*.js     # open js scripts inside /wp- (/*/ is for priority)
       Allow: /*/*.css    # open css files inside /wp- (/*/ is for priority)
       Allow: /wp-*.png   # images in plugins, the cache folder, etc.
       Allow: /wp-*.jpg   # images in plugins, the cache folder, etc.
       Allow: /wp-*.jpeg  # images in plugins, the cache folder, etc.
       Allow: /wp-*.gif   # images in plugins, the cache folder, etc.
       Allow: /wp-admin/admin-ajax.php # used by plugins so as not to block JS and CSS

       User-agent: Yandex # rules for Yandex (comments are not duplicated)
       Disallow: /cgi-bin
       Disallow: /?
       Disallow: /wp-
       Disallow: /wp/
       Disallow: *?s=
       Disallow: *&s=
       Disallow: /search/
       Disallow: /author/
       Disallow: /users/
       Disallow: */trackback
       Disallow: */feed
       Disallow: */rss
       Disallow: */embed
       Disallow: */wlwmanifest.xml
       Disallow: /xmlrpc.php
       Allow: */uploads
       Allow: /*/*.js
       Allow: /*/*.css
       Allow: /wp-*.png
       Allow: /wp-*.jpg
       Allow: /wp-*.jpeg
       Allow: /wp-*.gif
       Allow: /wp-admin/admin-ajax.php
       Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not blocking such pages but removing the tag parameters; Google does not support such rules
       Clean-Param: openstat # similar

       # Specify one or more Sitemap files (there is no need to duplicate them for each User-agent).
       # Google XML Sitemap creates 2 sitemaps, as in the example below.
       Sitemap: http://site.ru/sitemap.xml
       Sitemap: http://site.ru/sitemap.xml.gz

       # Specify the main mirror of the site as in the example below (with or without WWW; if you use HTTPS,
       # write the protocol; if you need to specify a port, indicate it).
       # The Host command is understood by Yandex and Mail.RU; Google does not take it into account.
       Host: www.site.ru
    2. But you can use a minimalist example:

       User-agent: *
       Disallow: /wp-admin/
       Allow: /wp-admin/admin-ajax.php
       Host: https://site.ru
       Sitemap: https://site.ru/sitemap.xml

    The truth probably lies somewhere in the middle. Also, don’t forget to add the Robots meta tag for “extra” pages, for example, using the wonderful plugin - . It will also help you set up Canonical.

    Correct robots.txt for Joomla

    User-agent: *
    Disallow: /administrator/
    Disallow: /bin/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /layouts/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/

    In principle, almost everything is taken into account here and it works well. The only thing is that you should add a separate User-agent: Yandex rule to insert the Host directive, which defines the main mirror for Yandex, and also specify the path to the Sitemap file.

    Therefore, in its final form, the correct robots.txt for Joomla should, in my opinion, look like this:

    User-agent: Yandex
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /component/tags*
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Host: vash_sait.ru (or www.vash_sait.ru)

    User-agent: *
    Allow: /*.css?*$
    Allow: /*.js?*$
    Allow: /*.jpg?*$
    Allow: /*.png?*$
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /tmp/
    Disallow: /layouts/
    Disallow: /cli/
    Disallow: /bin/
    Disallow: /logs/
    Disallow: /components/
    Disallow: /component/
    Disallow: /*mailto/
    Disallow: /*.pdf
    Disallow: /*%
    Disallow: /index.php
    Sitemap: http://path to your XML format map

    Yes, also note that in the second option there are Allow directives permitting the indexing of styles, scripts and images. This is written specifically for Google, because its Googlebot sometimes complains that robots.txt prohibits the indexing of these files (for example, from the active theme's folder) and even threatens to lower rankings because of it.

    Therefore, we allow this whole thing to be indexed in advance using Allow. By the way, the same thing happened in the example file for WordPress.

    Good luck to you! See you soon on the pages of the blog site


    Creating the file itself

    Robots.txt is a file with instructions for search robots. It is created at the root of the site. You can create it right now on your desktop using Notepad, just like you create any text file.

    To do this, right-click on an empty space and select Create – Text Document (not Word). It will open in regular Notepad. Name it robots; the txt extension is already correct. That's it for creating the file itself.

    How to compose robots.txt

    Now all that remains is to fill the file with the necessary instructions. Actually, commands for robots have the simplest syntax, much simpler than in any programming language. In general, you can fill the file in two ways:

    Look at another site, copy and change to suit the structure of your project.

    Write it yourself

    I already wrote about the first method in. It is suitable if the sites use the same engine and have no significant differences in functionality. For example, all WordPress sites have the same structure, but there may be various extensions, such as a forum, an online store and many additional directories. If you want to know how to change robots.txt, read this article; you can also read the previous one, but this one already says quite a lot.

    For example, you have a /source directory on your website, where the sources for the articles that you write on your blog are stored, but another webmaster does not have such a directory. And you, for example, want to close the source folder from indexing. If you copy robots.txt from another resource, then there will not be such a command there. You will have to add your instructions, delete unnecessary things, etc.
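    In that case the instruction you add might look like this (assuming the folder lives at /source/ in the site root):

    Disallow: /source/   # add this line to the relevant User-agent group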

    So in any case, it is useful to know the basic syntax of instructions for robots, which we will now analyze.

    How to write your instructions to robots?

    The first thing the file begins with is an indication of which search engines the instructions are addressed to. This is done like this:

    User-agent: Yandex

    or

    User-agent: Googlebot

    There is no need to put semicolons at the end of the line (this is not programming, after all). In general, it is clear that in the first case only the Yandex bot will read the instructions, and in the second only Google's. If the commands must be executed by all robots, write this:

    User-agent: *

    Great, we've sorted out how to address robots. It is not difficult. You can picture it with a simple example. You have three younger brothers, Vasya, Dima and Petya, and you are the eldest. Your parents left and told you to keep an eye on them.

    All three are asking you for something. Imagine that you need to give them an answer as if you were writing instructions to search robots. It will look something like this:

    User-agent: Vasya
    Allow: go to the football

    User-agent: Dima
    Disallow: go to the football (Dima broke the neighbors' window last time, so he is punished)

    User-agent: Petya
    Allow: go to the cinema (Petya is already 16 and is generally shocked that he should even have to ask your permission, but oh well, let him go)

    Thus, Vasya happily laces up his sneakers, Dima, with his head down, looks out the window at his brother, who is already thinking how many goals he will score today (Dima received the disallow command, that is, a ban). Well, Petya goes to his movie.

    From this example it is easy to understand that Allow is a permission, and Disallow is a prohibition. But in robots.txt we give commands not to people, but to robots, so instead of specific tasks, the addresses of pages and directories that need to be allowed or prohibited from indexing are written there.

    For example, I have a website site.ru. It's powered by WordPress. I start writing instructions:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-content/
    Disallow: /wp-includes/
    Allow: /wp-content/uploads/
    Disallow: /source/

    ...and so on.


    First, I addressed all robots. Secondly, I blocked the indexing of the engine's folders, but at the same time gave the robot access to the uploads folder. All pictures are usually stored there, and they are usually not blocked from indexing if you plan to receive traffic from image search.

    Well, remember, earlier in the article I said that you can have additional directories? You can create them yourself for various purposes. For example, on one of my sites there is a flash folder where I put flash games so that I can launch them on the site. Or source – this folder can store files available for users to download.

    In general, it doesn’t matter what the folder is called. If you need to close it, specify the path to it and the Disallow command.
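    For example, closing the flash games folder mentioned above might look like this (the folder name is illustrative):

    User-agent: *
    Disallow: /flash/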

    The Allow command is needed precisely in order to open some parts of already closed sections. After all, by default, if you do not have a robots.txt file, the entire site will be available for indexing. This is both good (you certainly won’t close something important by mistake), and at the same time bad (files and folders will be opened that should not be in the search results).

    To better understand this point, I suggest you look at this piece again:


    Disallow: /wp-content/

    Allow: /wp-content/uploads/

    As you can see, first we block indexing of the entire wp-content directory. It stores all your templates, plugins, but it also contains pictures. Obviously, they can be opened. This is why we need the Allow command.

    Extra options

    The listed commands are not the only things that can be specified in the file. There is also Host, which indicates the main mirror of the site. For those who didn't know, any website has two default spellings of its domain name: domain.com and www.domain.com.

    To avoid problems, you need to specify one option as the main mirror. This can be done both in webmaster tools and in the Robots.txt file. To do this we write: Host: domain.com

    What does this give? If someone tries to get to your site like this: www.domain.com, they will automatically be redirected to the version without www, because it will be recognized as the main mirror.

    The second directive is Sitemap. I think you already understand that it specifies the path to the sitemap in xml format. Example: Sitemap: http://domain.com/sitemap.xml

    Again, you can upload the map in Yandex.Webmaster, you can also specify it in robots.txt so that the robot reads this line and clearly understands where to look for the sitemap. For a robot, a site map is as important as for Vasya - the ball with which he will go to football. It's like him asking you (like an older brother) where the ball is. And you tell him:

    Behind the sofa

    Now you know how to properly configure and change robots.txt for Yandex and, in general, any other search engine to suit your needs.

    What does file customization do?

    I also spoke about this earlier, but I will say it again. Thanks to a clearly configured file with commands for robots, you can sleep easier knowing that the robot will not crawl into an unnecessary section and will not take unnecessary pages into the index.

    I also said that setting up robots.txt doesn't solve everything. In particular, it does not save you from duplicates that arise due to the fact that the engines are imperfect. Just like people. You allowed Vasya to go to football, but it’s not a fact that he won’t do the same thing there as Dima. It’s the same with duplicates: you can give a command, but you definitely can’t be sure that something extra won’t sneak into the index, ruining the positions.

    There is also no need to fear duplicates like the plague. For example, Yandex treats sites that have serious technical problems more or less normally. Another thing is that if you let the problem run its course, you really can lose a serious percentage of traffic. However, soon there will be an article about duplicates in our section dedicated to SEO, and then we will fight them.

    How can I get a normal robots.txt if I don’t understand anything myself?

    After all, creating robots.txt is not creating a website. It’s somehow simpler, so you can simply copy the contents of the file from any more or less successful blogger. Of course, if you have a WordPress site. If it is on a different engine, then you need to search for sites using the same cms. I have already said how to view the contents of a file on someone else’s website: Domain.com/robots.txt

    Bottom line

    I don't think there's much more to say here, because writing robot instructions shouldn't be your goal for the year. This is a task that even a beginner can complete in 30-60 minutes, and a professional can generally complete in just a couple of minutes. You will succeed and you can have no doubt about it.

    And to find out other useful and important tips for promoting and promoting a blog, you can look at our unique one. If you apply 50-100% of the recommendations from there, you will be able to successfully promote any sites in the future.

    First, I’ll tell you what robots.txt is.

    Robots.txt is a file located in the root folder of the site, containing special instructions for search robots. These instructions are needed so that, when visiting the site, a robot does not take a particular page or section into account; in other words, they let us close a page from indexing.

    Why do we need robots.txt?

    The robots.txt file is considered a key requirement for the SEO optimization of absolutely any website. Its absence may negatively affect the load created by robots and slow down indexing; moreover, the site may not be indexed completely, so users will not be able to reach some pages through Yandex and Google.

    How does robots.txt affect search engines?

    Search engines (Google in particular) will index the site even without a robots.txt file, but, as I said, not all pages. If such a file exists, the robots are guided by the rules specified in it. Moreover, there are several types of search robots; some take a rule into account while others ignore it. In particular, the GoogleBot robot does not take the Host and Crawl-Delay directives into account, the YandexNews robot has recently stopped taking the Crawl-Delay directive into account, and the YandexDirect and YandexVideoParser robots ignore the generally accepted directives in robots.txt (but take into account those written specifically for them).

    The greatest load on the site comes from robots that download content from it. Accordingly, by telling the robot which pages to index and which to ignore, as well as at what intervals to download content from the pages (this matters more for large sites with over 100,000 pages in the search index), we make it much easier for the robot to index and download the site's content.


    Files that are unnecessary for search engines include files that belong to the CMS, for example, in Wordpress – /wp-admin/. In addition, ajax, json scripts responsible for pop-up forms, banners, captcha output, and so on.

    For most robots, I also recommend blocking all Javascript and CSS files from indexing. But for GoogleBot and Yandex, it is better to index such files, since they are used by search engines to analyze the convenience of the site and its ranking.
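    A rough sketch of that split (the exact rules are illustrative; Yandex would get a similar group of its own):

    User-agent: *
    Disallow: /*.js     # hide scripts from most robots
    Disallow: /*.css    # hide styles from most robots

    User-agent: Googlebot
    Allow: /*.js        # Googlebot has its own group, so scripts and styles stay open to it
    Allow: /*.css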

    What is a robots.txt directive?



    Directives are rules for search robots. The first standard for writing robots.txt appeared in 1994, and the extended standard in 1996. However, as you already know, not all robots support all directives. Therefore, below I have described what the main robots are guided by when indexing website pages.

    What does User-agent mean?

    This is the most important directive that determines which search robots will follow further rules.

    For all robots:

    User-agent: *

    For a specific bot:

    User-agent: Googlebot

    Case does not matter in robots.txt: you can write both Googlebot and googlebot.

    Google search robots

    Google's main robots include Googlebot (the main indexing robot), Googlebot-Image (images), Googlebot-News (news), Googlebot-Mobile (the mobile version of the site), AdsBot-Google (ad landing pages) and Mediapartners-Google (AdSense).
    Yandex search robots

    Yandex, in addition to its main indexing robot, has separate robots:

    • used by the Yandex.Images service;
    • used by the Yandex.Video service;
    • for multimedia data;
    • for blog search;
    • accessing a page when it is added through the "Add URL" form;
    • indexing website icons (favicons);
    • Yandex.Direct;
    • Yandex.Metrica;
    • used by the Yandex.Catalog service;
    • used by the Yandex.News service;
    • YandexImageResizer;
    • the mobile services search robot.

    Bing, Yahoo, Mail.ru and Rambler also have their own search robots.

    Disallow and Allow directives

    Disallow blocks sections and pages of your site from indexing. Accordingly, Allow, on the contrary, opens them.

    There are some peculiarities.

    First, the additional operators are *, $ and #. What are they used for?

    "*" – any sequence of characters, including an empty one. By default it is already implied at the end of a rule, so there is no point in adding it again.

    "$" – indicates that the characters before it must come at the very end of the address; it cancels the implied trailing asterisk.

    "#" – a comment; the robot ignores everything that comes after this symbol.
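    For instance, a small sketch combining "$" and a comment (the path is illustrative):

    # block all PDF files, and nothing else
    Disallow: /*.pdf$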

    Examples of using Disallow:

    Disallow: *?s=

    Disallow: /category/

    Accordingly, the search robot will block pages such as site.ru/?s=query or site.ru/category/books/.

    But pages such as site.ru/blog/article.html will remain open for indexing.

    Now you need to understand how nested rules are applied and how the order of directives is handled. Inheritance is determined by which paths are specified: if we want to block a page or document from indexing, it is enough to write a directive for it. Let's look at an example.

    This is our robots.txt file

    User-agent: *
    Disallow: /template/

    Sitemap directive in robots.txt

    This directive specifies the path to the sitemap file. It can be placed anywhere in robots.txt, and several Sitemap files can be specified.

    Host directive in robots.txt

    This directive is needed to indicate the main mirror of the site (often with or without www). Note that Host is specified without the http:// protocol but with the https:// protocol. The directive is taken into account only by the Yandex and Mail.ru search robots; other robots, including GoogleBot, ignore it. Host should be specified only once in the robots.txt file.

    Example with http://

    Host: website.ru

    Example with https://

    Host: https://website.ru

    Crawl-delay directive

    Sets the time interval between a search robot's requests to the site's pages. The value is specified in seconds; fractional values are allowed.

    Example:
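    Crawl-delay: 3   # ask robots to wait about 3 seconds between requests; the value is illustrative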

    It is used mostly on large online stores, information sites and portals with traffic upwards of 5,000 visits per day. It asks the search robot to make indexing requests no more often than the given interval. If this directive is not specified, robots can create a serious load on the server.

    The optimal crawl-delay value is different for each site. For search engines Mail, Bing, Yahoo, the value can be set to a minimum value of 0.25, 0.3, since these search engine robots can crawl your site once a month, 2 months, and so on (very rarely). For Yandex, it is better to set a higher value.


    If the load on your site is minimal, then there is no point in specifying this directive.

    Clean-param directive

    The rule is interesting because it tells the crawler that pages with certain parameters do not need to be indexed. Two things are specified: the parameter name(s) and the URL path prefix they apply to. The directive is supported by the Yandex search engine.
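    A minimal sketch of the syntax, with a hypothetical session parameter and path:

    Clean-param: sid /forum/showthread.php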

    An example of a file with rules for three different bots:

    User-agent: *
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: *sort=
    Disallow: *view=

    User-agent: GoogleBot
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: *sort=
    Disallow: *view=
    Allow: /plugins/*.css
    Allow: /plugins/*.js
    Allow: /plugins/*.png
    Allow: /plugins/*.jpg
    Allow: /plugins/*.gif

    User-agent: Yandex
    Disallow: /admin/
    Disallow: /plugins/
    Disallow: /search/
    Disallow: /cart/
    Disallow: *sort=
    Disallow: *view=
    Allow: /plugins/*.css
    Allow: /plugins/*.js
    Allow: /plugins/*.png
    Allow: /plugins/*.jpg
    Allow: /plugins/*.gif
    Clean-Param: utm_source&utm_medium&utm_campaign

    In the example, we wrote down the rules for 3 different bots.

    Where to add robots.txt?

    It is added to the root folder of the site, so that it is available at a link like your_site.ru/robots.txt.

    How to check robots.txt?

    Yandex Webmaster

    On the Tools tab, select Robots.txt Analysis and then click check

    Google Search Console

    On the Scanning tab, choose the robots.txt file inspection tool and then click check.

    Conclusion:

    The robots.txt file must be present on every website being promoted, and only its correct configuration will allow you to obtain the necessary indexing.

    And finally, if you have any questions, ask them in the comments under the article. I'm also curious: how do you write your robots.txt?
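    A typical ready-made template of this kind looks roughly like the sketch below (the blocked folders and the domain are placeholders); its values are explained right after it:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-includes/

    User-agent: Yandex
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Host: your_site.ru

    Sitemap: http://your_site.ru/sitemap.xml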

    Explanation of values:

    • User-agent: * addresses all search engines at once; User-agent: Yandex addresses only Yandex.
    • Disallow: lists folders and files that are prohibited for indexing
    • Host – enter the name of your site without www.
    • Sitemap: link to the XML sitemap.

    Place the file in the root directory of the site using Filezilla or through the hosting site. Post it to the main directory so that it is available via the link: your_site.ru/robots.txt

    This is only suitable for those who use human-readable URLs (links written as words, not in the form p=333). Just go to Settings – Permalinks, select the bottom option and enter /%postname% in the field.

    Some people prefer to create this file themselves:

    First, create a text file on your computer and name it robots (don't use uppercase). When you have finished setting it up, its size should not exceed 500 KB.

    User-agent – the name of the search robot (Yandex, Googlebot, StackRambler). If you want to address everyone at once, put an asterisk *.

    And then specify the pages or folders that this robot should not index using Disallow:
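    For instance (the paths are purely illustrative):

    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/plugins/
    Disallow: /private_file.html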

    First, three directories are listed, and then a specific file.

    To allow indexing of everything and everyone, you need to write:

    User-agent: *
    Disallow:

    Setting up robots.txt for Yandex and Google

    For Yandex, you definitely need to add the Host directive to avoid duplicate pages. Only the Yandex bot understands this word, so write the instructions for it separately.

    For Google there are no extras. The only thing you need to know is how to access it. In the User-agent section you need to write:

    • Googlebot;
    • Googlebot-Image – if you limit image indexing;
    • Googlebot-Mobile - for the mobile version of the site.

    How to check the functionality of the robots.txt file

    This can be done in the "Webmaster Tools" section from Google search engine or on the Yandex.Webmaster website in the Check robots.txt section.

    If there are errors, correct them and check again. Achieve a good result, then don’t forget to copy the correct code into robots.txt and upload it to the site.

    Now you have an idea how to create robots.txt for all search engines. For beginners, I recommend using a ready-made file, substituting the name of your site.


