What is robots.txt? The correct robots.txt for WordPress

1) What is a search robot?
2) What is robots.txt?
3) How to create robots.txt?
4) What and why can be written to this file?
5) Examples of robot names
6) Example of finished robots.txt
7) How can I check if my file is working?

1. What is a search robot?

A robot (crawler) keeps a list of URLs that it can index and regularly downloads the documents corresponding to them. If, while analyzing a document, the robot finds a new link, it adds it to its list. Thus, any document or site that is linked to can be found by a robot, and therefore by Yandex search.

2. What is robots.txt?

Search robots look for the robots.txt file on a website first. If your site contains directories, content, etc. that you would like to hide from indexing (things the search engine has no business showing, for example the admin panel or other service pages), then you should carefully study the instructions for working with this file.

robots.txt is a plain text file (.txt) located in the root directory of your site. It contains instructions for search robots. These instructions may prohibit certain sections or pages of the site from being indexed, indicate the correct "mirror" of the domain, recommend that the search robot observe a certain time interval between downloading documents from the server, and so on.

3. How to create robots.txt?

Creating robots.txt is very simple. Open a regular text editor (or right-click - New - Text Document), for example Notepad, create a text file, and name it robots.txt.

4. What and why can be written to the robots.txt file?

Before you give a command to a search engine, you need to decide which bot it is addressed to. The User-agent directive is used for this.
Below are examples:

User-agent: * # the commands written after this line are addressed to all search robots
User-agent: YandexBot # addresses the main Yandex indexing robot
User-agent: Googlebot # addresses the main Google indexing robot

Allowing and prohibiting indexing
To allow and prohibit indexing there are two corresponding directives: Allow (allowed) and Disallow (prohibited).

User-agent: *
Disallow: /adminka/ # prohibits all robots from indexing the adminka directory, which supposedly contains the admin panel

User-agent: YandexBot # the command below will be addressed to Yandex
Disallow: / # we prohibit indexing of the entire site by the Yandex robot

User-agent: Googlebot # the command below is addressed to Google
Allow: /images # allow all contents of the images directory to be indexed
Disallow: / # and everything else is prohibited

The order doesn't matter

User-agent: *
Allow: /images
Disallow: /

User-agent: *
Disallow: /
Allow: /images
# in both cases robots are allowed to index files
# whose paths start with "/images"

Sitemap Directive
This command specifies the address of your sitemap:

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml # Indicates the sitemap address

Host directive
This directive denotes the main mirror of the site and:
1) is written AT THE END of your file;
2) is specified only once - otherwise only the first occurrence is accepted;
3) is indicated after the Allow or Disallow directives.

Host: www.yoursite.ru # mirror of your site

#If www.yoursite.ru is the main mirror of the site, then
#robots.txt for all mirror sites looks like this
User-Agent: *
Disallow: /images
Disallow: /include
Host: www.yoursite.ru

# Google ignores Host by default, so you need to do this
User-Agent: * # index all
Disallow: /admin/ # disallow admin index
Host: www.mainsite.ru # indicate the main mirror
User-Agent: Googlebot # now commands for Google
Disallow: /admin/ # ban for Google

5. Examples of robot names

Yandex robots
Yandex has several types of robots that solve different tasks: one is responsible for indexing images, another for indexing RSS data to collect information from blogs, another for multimedia data. The most important is YandexBot, which indexes the site to build the general database about it (headings, links, text, etc.). There is also a robot for fast indexing (news indexing, etc.).

YandexBot -- the main indexing robot;
YandexMedia -- a robot that indexes multimedia data;
YandexImages -- the Yandex.Images indexer;
YandexCatalog -- a robot that "pings" Yandex.Catalog, used to temporarily remove inaccessible sites from publication in the Catalog;
YandexDirect -- the Yandex.Direct robot, which interprets robots.txt in a special way;
YandexBlogs -- the blog search robot, which indexes posts and comments;
YandexNews -- the Yandex.News robot;
YandexPagechecker -- the micro-markup validator;
YandexMetrika -- the Yandex.Metrica robot;
YandexMarket -- the Yandex.Market robot;
YandexCalendar -- the Yandex.Calendar robot.

6. Example of finished robots.txt

And so we come to an example of a finished file. I hope that after the examples above everything is clear to you.

User-agent: *
Disallow: /admin/
Disallow: /cache/
Disallow: /components/

User-agent: Yandex
Disallow: /admin/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/

Sitemap: http://yoursite.ru/structure/my_sitemaps.xml

Hello, friends! This article shows what the correct robots.txt for a site is, where it is located, how to create the robots file, how to adapt a robots file from another site, and how to upload it to your blog.

What is the robots.txt file, why is it needed and what is it responsible for?

A robots txt file is a text file that contains instructions for search robots. Before accessing the pages of your blog, the robot first looks for the robots file, which is why it is so important. The robots txt file is a standard for preventing robots from indexing certain pages. The robots txt file will determine whether your confidential data will be released. The correct robots txt for a site will help in its promotion, since it is an important tool in the interaction between your site and search robots.

It is not for nothing that the robots.txt file is called the most important SEO tool: this small file directly affects the indexing of the site's pages and of the site as a whole. Conversely, an incorrect robots.txt can exclude some pages, sections, or even the whole site from search results. In that case you can have 1000 articles on your blog and still have practically no visitors, only purely random passers-by.

Yandex Webmaster has a training video in which Yandex compares the robots.txt file to a box with your personal belongings that you do not want to show to anyone. To prevent strangers from looking into this box, you seal it with tape and write on it: "Do not open."

Robots, like well-mannered individuals, do not open this box and will not be able to tell others what is inside. If there is no robots.txt file, the search engine robot assumes that all files are available: it will open the box, look at everything, and tell others what is in it. To keep the robot out of this box, you need to forbid it from going there; this is done with the Disallow directive, which translates from English as "prohibit", while Allow means "allow".

This is a regular .txt file, composed in ordinary Notepad or in the NotePad++ program, which suggests to robots that certain pages on the site should not be indexed. What is it for:

  • a properly composed robots.txt file keeps robots from indexing garbage, cluttering search results with unnecessary material, and producing duplicate pages, which is a very harmful phenomenon;
  • it keeps robots from indexing information that is intended for internal use;
  • prevents spy robots from stealing confidential data and using it to send spam.

This does not mean that we want to hide something secret from search engines; it is just that this information is of no value either to search engines or to visitors - for example, the login page, RSS feeds, etc. In addition, the robots.txt file specifies the site mirror as well as the sitemap. By default, a website built on WordPress does not have a robots.txt file, so you need to create one and upload it to the root folder of your blog. In this article we will look at robots.txt for WordPress: creating it, adjusting it, and uploading it to the site. So first, where is the robots.txt file?
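To make this concrete, here is a minimal sketch of what a robots.txt for a WordPress blog might look like (the folder names, the yoursite.ru domain and the sitemap path are only assumptions for illustration; adjust them to your own site):

User-agent: *
Disallow: /wp-admin/       # the WordPress admin panel
Disallow: /wp-includes/    # WordPress system files

Sitemap: http://yoursite.ru/sitemap.xml   # your sitemap address

Each of these lines is discussed in more detail further on in the article.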

Where is robots.txt and how to see it?

I think many beginners ask themselves the question: where is robots.txt located? The file sits in the root folder of the site, in the public_html folder, and it is quite easy to see. You can go to your hosting, open your site's folder and check whether this file is there or not. The video below shows how to do this. You can also view the file using Yandex Webmaster and Google Webmaster, but we will talk about that later.

There is an even simpler option that allows you to view not only your robots.txt but also the robots file of any site. You can download a robots.txt to your computer, adapt it for yourself and use it on your website (blog). This is done like this: open the site (blog) you need and add /robots.txt after the domain with a slash (see the screenshot)

and press Enter - the robots.txt file opens. In this case you cannot see where robots.txt is physically located, but you can view and download it.

How to create the right robots.txt for a site

There are various options for creating robots txt for a website:

  • use online generators that will quickly create a robots txt file; there are quite a lot of sites and services that can do this;
  • use plugins for WordPress that will help solve this problem;
  • create a robots txt file with your own hands manually in a regular notepad or NotePad++ program;
  • use a ready-made, correct robots.txt from someone else’s site (blog), replacing the site address in it with your own.

Generators

I had not previously used generators for creating robots.txt files, but before writing this article I decided to test 4 robots.txt generation services. I got certain results and will tell you about them below. These services are:

  • SEOlib ;
  • PR-CY service;
  • service Raskruty.ru;
  • SEO Café - you can get there using this link: info.seocafe.info/tools/robotsgenerator.

How to use a robots.txt generator in practice is shown in detail in the video below. During testing, I came to the conclusion that they are not suitable for beginners, and here is why. A generator only helps you create a correct entry without errors in the file itself, but to compose the correct robots.txt you still need knowledge: you need to know which folders to close and which not. For this reason, I do not recommend that beginners use a robots.txt generator to create the file.

Plugins for WordPress

There are plugins, for example PC Robots.txt, for creating the file. This plugin allows you to edit the file directly in the site's control panel. Another plugin, iRobots.txt SEO, has similar functionality. You can find a bunch of different plugins that let you work with the robots.txt file. If you wish, enter the phrase robots.txt in the “Search for plugins” field, click the “Search” button, and you will be offered several plugins. Of course, you need to read about each of them and look at the reviews.

The way robots.txt plugins for WordPress work is very similar to how the generators work. To get the correct robots.txt for a site you need knowledge and experience, and where can beginners get that? In my opinion, more harm than good can come from such services. And if you install a plugin, it also puts extra load on the hosting. For this reason, I do not recommend installing a robots.txt WordPress plugin.

Creating robots.txt manually

You can create robots.txt manually using ordinary Notepad or the NotePad++ program, but this requires knowledge and experience. This option is also not well suited to beginners. But over time, as you gain experience, you will be able to do it: create a robots.txt file for the site, write the Disallow directives, close the necessary folders from indexing, check the file and adjust it in just 10 minutes. The screenshot below shows robots.txt in Notepad:

We will not go through the procedure for creating a robots.txt file here; it is described in detail in many sources, for example in Yandex Webmaster. Before composing a robots.txt file, go to Yandex Webmaster, where each directive and what it is responsible for is described in detail, and compose the file based on this information (see the screenshot).

By the way, the new Yandex Webmaster offers detailed information; an article about it can be found on the blog. More precisely, two articles are presented there that will be of great benefit to bloggers, and not only beginners - I advise you to read them.

If you are not a beginner and want to make robots txt yourself, then you need to follow a number of rules:

  1. The use of national characters in the robots txt file is not allowed.
  2. The robots file size should not exceed 32 KB.
  3. The file name cannot be written as Robots or ROBOTS; it must be named exactly robots.txt, in lowercase.
  4. Each directive must start on a new line.
  5. You cannot specify more than one directive on one line.
  6. A “Disallow” directive with an empty value is equivalent to “Allow” - that is, it allows everything; this must be remembered.
  7. You cannot put a space at the beginning of a line.
  8. If you do not leave an empty line between different “User-agent” blocks, the robots may accept only the topmost block - the rest will be ignored.
  9. The directive parameter itself needs to be written in only one line.
  10. You cannot enclose directive parameters in quotes.
  11. You cannot close a line with a semicolon after a directive.
  12. If the robots file is not detected or is empty, then the robots will perceive this as “Everything is allowed.”
  13. You can make comments in the directive line (to make it clear what the line is), but only after the hash sign #.
  14. An empty line between the lines marks the end of the current User-agent block.
  15. The "Disallow" and "Allow" directives must contain only one parameter.
  16. For directives that refer to a directory, a slash is added, for example: Disallow: /wp-admin/.
  17. In the “Crawl-delay” section, you need to recommend to robots the time interval between downloading documents from the server, usually 4-5 seconds.
  18. Important: there should be no empty lines between directives within one block. A new block for another robot begins after one empty line - the empty line marks the end of the rules for the previous search robot, as the attached video shows in detail. An asterisk means any sequence of characters.
  19. I advise you to repeat all the rules separately for the Yandex robot, that is, repeat all the directives that were written for other robots separately for Yandex. At the end of the information for the Yandex robot, write the Host directive (Host is supported only by Yandex) and indicate your blog's address. Host tells Yandex which mirror of your site is the main one, with or without www.
  20. In addition, in a separate block of the robots.txt file, that is, separated by an empty line, it is recommended to indicate the address of your sitemap. Creating the file takes a few minutes and begins with the phrase “User-agent:”. If you want to block, for example, pictures from indexing, you set Disallow: /images/. A sketch that puts these rules together is shown right after this list.
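Below is a rough sketch that applies the rules above (the yoursite.ru domain, the /images/ folder and the 4-second interval are assumptions used purely for illustration):

User-agent: *              # rules for all robots
Disallow: /wp-admin/       # close the admin area from indexing
Disallow: /images/         # close pictures from indexing
Crawl-delay: 4             # recommended pause between downloads, in seconds

User-agent: Yandex         # the same rules repeated separately for Yandex
Disallow: /wp-admin/
Disallow: /images/
Crawl-delay: 4
Host: yoursite.ru          # main mirror, understood only by Yandex, written at the end

Sitemap: http://yoursite.ru/sitemap.xml   # sitemap address in a separate block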

Using the correct robots.txt from someone else's site

There is no ideal file; you periodically need to experiment and take into account changes in how search engines work, as well as the errors that may appear on your blog over time. So, to begin with, you can take someone else’s verified robots.txt file and install it for yourself.

Be sure to change the entries in the Host directive so that they reflect the address of your blog (see the screenshot, see also the video), and also substitute your site address into the sitemap address (the bottom two lines). Over time this file will need small adjustments - for example, if you notice that duplicate pages have begun to appear.
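For example, if the downloaded file was composed for someone else's domain, the lines you would change might look like this (othersite.ru and yoursite.ru are hypothetical domains):

# before
Host: othersite.ru
Sitemap: http://othersite.ru/sitemap.xml

# after
Host: yoursite.ru
Sitemap: http://yoursite.ru/sitemap.xml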

In the section "Where is robots.txt and how to see it" above, we looked at how to view and download robots.txt. So choose a good, trusted site with high TIC (Yandex's thematic citation index) and high traffic, open and download its correct robots.txt. Compare several sites, choose the robots.txt file that suits you, and upload it to your own website.

How to upload the robots.txt file to the root folder of the site

As already mentioned, after creating a site on WordPress there is no robots.txt file by default. It therefore has to be created and uploaded to the root folder of our website (blog) on the hosting. Uploading the file is quite simple: on TimeWeb hosting, as on other hostings, you can upload it either through the hosting file manager or through an FTP client. The video below shows the process of uploading a robots.txt file to TimeWeb hosting.

Checking the robots txt file

After uploading the robots.txt file, you need to check its presence and operation. To do this, we can look at the file from the browser, as shown above in the section "Where is robots.txt and how to see it". You can check the file's operation using Yandex Webmaster and Google Webmaster; remember that for this your site must already be added to Yandex Webmaster and to Google Webmaster.

To check in Yandex, go to our Yandex webmaster account, select a site if you have several of them. Select “Indexing settings”, “Robots.txt analysis”, and then follow the instructions.

In Google Webmaster we do the same thing: go to our account, select the desired site (if there are several), click "Crawl" and select the "robots.txt Tester" tool. The robots.txt file will open; you can edit or check it.

On the same page there are excellent instructions for working with the robots.txt file; you can read them. In conclusion, here is a video that shows what a robots.txt file is, how to find it, how to view and download it, how to work with a file generator, how to create robots.txt and adapt it for yourself, and other information:

Conclusion

So, in this article we looked at the question of what a robots txt file is and found out that this file is very important for the site. We learned how to make the correct robots txt, how to adapt a robots txt file from someone else’s site to yours, how to upload it to your blog, and how to check it.

From the article it became clear that, at first, it is better for beginners to use a ready-made, correct robots.txt, remembering to replace the domain in the Host directive with their own and to enter their blog's address in the Sitemap lines. You can download my robots.txt file here. After correcting it, you can use the file on your blog.

There is a separate website dedicated to the robots.txt file; you can go to it and find more detailed information. I hope everything works out for you and your blog will be well indexed. Good luck!

Best regards, Ivan Kunpan.

P.S. To promote your blog properly, you need to write and optimize your blog's articles correctly; then it will have high traffic and rankings. My information products, which incorporate three years of my experience, will help you with this. You can get the following products:

  • paid book;
  • mind map;
  • paid video course " ".


Hello, dear readers of the “Webmaster’s World” blog!

The robots.txt file is a very important file that directly affects the quality of your site's indexing, and therefore its search engine promotion.

That is why you must be able to correctly format robots.txt so as not to accidentally prohibit any important documents of the Internet project from being included in the index.

How to format the robots.txt file, what syntax should be used, how to allow and deny documents to the index will be discussed in this article.

About the robots.txt file

First, let's find out in more detail what kind of file this is.

The robots file is a file that shows search engines which pages and documents of the site can be added to the index and which cannot. It is needed because, by default, search engines try to index the whole site, and that is not always correct. For example, if you build a site on an engine (WordPress, Joomla, etc.), you will have folders that support the work of the administrative panel. Obviously the information in these folders should not be indexed; this is where the robots.txt file is used to restrict search engines' access.
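For illustration, a minimal sketch of such a restriction might look like this (the folder names are typical engine directories and are only assumptions; your CMS may use different ones):

User-agent: *
Disallow: /wp-admin/        # the WordPress administrative panel
Disallow: /administrator/   # the Joomla administrative panel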

The robots.txt file also contains the address of the site map (it improves indexing by search engines), as well as the main domain of the site (the main mirror).

A mirror is an exact copy of the site, i.e. when one site is available at two addresses, one of them is said to be the main domain and the other its mirror.

Thus, the file has quite a lot of functions, and important ones at that!

Robots.txt file syntax

The robots file contains blocks of rules that tell a particular search engine what can be indexed and what cannot. There can be one block of rules (for all search engines), but there can also be several of them - for some specific search engines separately.

Each such block begins with a “User-Agent” operator, which indicates which search engine these rules apply to.

User-Agent: A
(rules for robot “A”)

User-Agent: B
(rules for robot “B”)

The example above shows that the “User-Agent” operator has a parameter - the name of the search engine robot to which the rules are applied. I will indicate the main ones below:

After “User-Agent” there are other operators. Here is their description:

All operators have the same syntax, i.e. they should be used as follows:

Operator1: parameter1

Operator2: parameter2

So, first we write the name of the operator (in capital or small letters, it does not matter), then we put a colon and, separated by a space, indicate the parameter of this operator. Then, starting on a new line, we describe the second operator in the same way.

Important!!! An empty line means that the block of rules for this search engine is complete, so do not separate statements within a block with empty lines.

Example robots.txt file

Let's look at a simple example of a robots.txt file to better understand the features of its syntax:

User-agent: Yandex
Allow: /folder1/
Disallow: /file1.html
Host: www.site.ru

User-agent: *
Disallow: /document.php
Disallow: /folderxxx/
Disallow: /folderyyy/folderzzz
Disallow: /feed/

Sitemap: http://www.site.ru/sitemap.xml

Now let's look at the described example.

The file consists of three blocks: the first for Yandex, the second for all search engines, and the third contains the sitemap address (applied automatically for all search engines, so there is no need to specify “User-Agent”). We allowed Yandex to index the folder “folder1” and all its contents, but prohibited it from indexing the document “file1.html” located in the root directory on the hosting. We also indicated the main domain of the site to Yandex. The second block is for all search engines. There we banned the document "document.php", as well as the folders "folderxxx", "folderyyy/folderzzz" and "feed".

Please note that in the second block we did not prohibit the entire “folderyyy” folder, but only the folder inside it, “folderzzz”, i.e. we provided the full path to “folderzzz”. This should always be done when we prohibit a document located not in the root directory of the site but somewhere inside other folders.

Creating it will take less than two minutes.

The created robots file can be checked for functionality in the Yandex webmaster panel. If errors are suddenly found in the file, Yandex will show it.

Be sure to create a robots.txt file for your site if you don’t already have one. This will help your site develop in search engines. You can also read our other article about the method of meta tags and .htaccess.


Robots.txt is a text file containing information for search robots that help index portal pages.



Imagine that you went to an island for treasure. You have a map. The route is indicated there: “Approach a large stump. From there, take 10 steps east, then reach the cliff. Turn right, find a cave.”

These are the directions. Following them, you follow the route and find the treasure. A search bot works in much the same way when it starts indexing a site or page. It finds the robots.txt file. It reads which pages need to be indexed and which do not. And following these commands, it crawls the portal and adds its pages to the index.

What is robots.txt for?

Search robots start visiting sites and indexing pages after the site is uploaded to hosting and DNS is registered. They do their job regardless of whether you have any technical files or not. Robots.txt tells search engines that, when crawling the website, they need to take into account the parameters it contains.

The absence of a robots.txt file can lead to problems with site crawl speed and the presence of garbage in the index. Incorrect configuration of the file can result in the exclusion of important parts of the resource from the index and the presence of unnecessary pages in the output.

All this, as a result, leads to problems with promotion.

Let's take a closer look at what instructions are contained in this file and how they affect the behavior of the bot on your site.

How to make robots.txt

First, check if you have this file.

Enter the site address in the browser address bar, followed by a slash and the file name, for example: https://www.xxxxx.ru/robots.txt

If the file is present, a list of its parameters will appear on the screen.

If there is no file:

  1. The file is created in a regular text editor such as Notepad or Notepad++.
  2. You need to set the name robots, extension .txt. Enter data taking into account accepted design standards.
  3. You can check for errors using services such as Yandex Webmaster. There you need to select the “Robots.txt Analysis” item in the “Tools” section and follow the prompts.
  4. When the file is ready, upload it to the root directory of the site.

Setting rules

Search engines have more than one robot. Some bots only index text content, some only graphic content. And even among search engines themselves, the way crawlers work can be different. This must be taken into account when compiling the file.

Some of them may ignore some of the rules, for example, GoogleBot does not respond to information about which site mirror is considered the main one. But in general, they perceive and are guided by the file.

File Syntax

Document parameters: name of the robot (bot) “User-agent”, directives: allowing “Allow” and prohibiting “Disallow”.

Today there are two key search engines, Yandex and Google, so it is important to take the requirements of both into account when creating a website.

The format for creating entries is as follows, please note the required spaces and empty lines.
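Since the original illustration is not reproduced here, a rough sketch of the record format might look like this (the robot name and folder paths are placeholders):

User-agent: Yandex     # block of rules for the Yandex robot
Disallow: /folder1/    # closed for indexing
Allow: /folder2/       # open for indexing

User-agent: *          # after an empty line, a block of rules for all other robots
Disallow: /folder3/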

User-agent directive

The robot looks for records that begin with User-agent; the record should contain the name of the search robot. If it is not specified, bot access is considered unrestricted.

Disallow and Allow directives

If you need to disable indexing in robots.txt, use Disallow. With its help, the bot’s access to the site or certain sections is limited.

If robots.txt does not contain any prohibiting “Disallow” directives, it is considered that indexing of the entire site is allowed. Usually bans are prescribed after each bot separately.

All information that appears after the # sign is a comment and is not machine readable.

Allow is used to allow access.

The asterisk symbol serves as an indication of what applies to everyone: User-agent: *.

The opposite option means a complete ban on indexing for everyone.
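The original example is not shown here, so a minimal sketch of such a complete ban would be:

User-agent: *
Disallow: /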

Preventing indexing of the entire contents of a specific directory (folder)
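A sketch of such a rule (the folder name is a placeholder) might be:

User-agent: *
Disallow: /folder/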

To block a single file, you need to specify its absolute path.
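Again as a sketch, with a placeholder path to the file:

User-agent: *
Disallow: /folder/page.html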


Sitemap, Host directives

For Yandex, it is customary to indicate which mirror you want to designate as the main one. Google, as we remember, ignores this directive. If there are no mirrors, simply note how you consider it correct to write your website name - with or without www.
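A sketch of these two directives (the www.site.ru domain and the sitemap path are placeholders):

Host: www.site.ru
Sitemap: http://www.site.ru/sitemap.xml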

Clean-param directive

It can be used if the URLs of website pages contain changeable parameters that do not affect their content (this could be user ids, referrers).

For example, in a page address the “ref” parameter may identify the traffic source, i.e. indicate where the visitor came to the site from. The page content will be the same for all users.

You can point this out to the robot and it won't download duplicate information. This will reduce server load.
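The original example is not reproduced here, but a sketch of the Clean-param syntax (the ref parameter and the /catalog/page.php path are assumptions) might look like this:

User-agent: Yandex
Clean-param: ref /catalog/page.php   # ignore the ref parameter in URLs of this page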

Crawl-delay directive

Using this directive, you can determine how often the bot loads pages for analysis. The command is used when the server is overloaded and cannot keep up with crawl requests, and it tells the robot to pause for a set number of seconds between downloads.
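A sketch (the 4-second interval is an arbitrary example value):

User-agent: Yandex
Crawl-delay: 4   # ask the robot to wait 4 seconds between page downloads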

Robots.txt errors

  1. The file is not in the root directory. The robot will not look for it deeper and will not take it into account.
  2. The letters in the file name must be lowercase Latin.
    A common mistake in the name is to drop the letter s at the end and write robot.
  3. You cannot use Cyrillic characters in the robots.txt file. If you need to specify a domain in Russian, use the format in the special Punycode encoding.
  4. This is a method of converting domain names into a sequence of ASCII characters. To do this, you can use special converters.

This encoding looks like this:
сайт.рф = xn--80aswg.xn--p1ai

Additional information on what to close in robots.txt and on settings that meet the requirements of the Google and Yandex search engines can be found in their help documents. Different CMSs may also have their own particularities, and this should be taken into account.

Successful indexing of a new site depends on many factors. One of them is the robots.txt file, the correct filling of which should be familiar to any novice webmaster.

What is robots.txt and why is it needed?

This is a text file (document in .txt format) containing clear instructions for indexing a specific site. The file indicates to search engines which pages of a web resource need to be indexed and which should be prohibited from indexing.

It would seem, why prohibit the indexing of some site content? Let the search robot index everything indiscriminately, guided by the principle: the more pages, the better! But that's not true.


Not all the content that makes up a website is needed by search robots. There are system files, duplicate pages, keyword categories and much more that does not necessarily need to be indexed. Otherwise, the following situation cannot be ruled out.

When a search robot comes to your site, the first thing it does is look for the notorious robots.txt. If it does not find this file, or finds it but it is composed incorrectly (without the necessary prohibitions), the search engine's "messenger" begins to study the site at its own discretion.

In the process of such studying, it indexes everything, and it is far from certain that it starts with the pages that need to get into the search first (new articles, reviews, photo reports, etc.). Naturally, in this case the indexing of a new site may take some time.

In order to avoid such an unenviable fate, the webmaster needs to take care in time to create the correct robots.txt file.

“User-agent:” is the main directive of robots.txt

In practice, directives (commands) are written in robots.txt using special terms, the main one of which can be considered the "User-agent:" directive. It is used to specify the search robot that will be given certain instructions further on. For example:

  • User-agent: Googlebot - all commands that follow this basic directive relate exclusively to the Google search engine (its indexing robot);
  • User-agent: Yandex - the addressee in this case is the Russian search engine Yandex.

The robots.txt file can also be used to address all other search engines combined. The command in this case will look like this: User-agent: *. The special symbol "*" usually means "any text"; in our case, any search engine other than Yandex. Google, by the way, also takes this directive as applying to itself, unless you address it personally.

“Disallow:” command – prohibiting indexing in robots.txt

The main "User-agent:" directive addressed to search engines can be followed by specific commands. Among them, the most common is the "Disallow:" directive. Using this command, you can prevent the search robot from indexing the entire web resource or some part of it. It all depends on what clarification the directive is given. Let's look at examples:

User-agent: Yandex
Disallow: /

This kind of entry in the robots.txt file means that the Yandex search robot is not allowed to index this site at all, since the prohibitory sign “/” stands alone and is not accompanied by any clarifications.

User-agent: Yandex
Disallow: /wp-admin

As you can see, this time there is a clarification, and it concerns the wp-admin system folder in WordPress. That is, the indexing robot, following this command (the path specified in it), will refuse to index this entire folder.

User-agent: Yandex
Disallow: /wp-content/themes

Such an instruction admits the Yandex robot to the large "wp-content" category, in which it may index all contents except "themes".

Let’s explore the “forbidden” capabilities of the robots.txt text document further:

User-agent: Yandex
Disallow: /index$

In this command, as the example shows, another special sign, "$", is used. It tells the robot that it cannot index pages whose links end with the sequence of letters "index". At the same time, the robot is not prohibited from indexing a separate site file with the similar name "index.php". Thus, the "$" symbol is used when a selective approach to prohibiting indexing is needed.

Also, in the robots.txt file, you can prohibit indexing of individual resource pages that contain certain characters. It might look like this:

User-agent: Yandex
Disallow: *&*

This command tells the Yandex search robot not to index any pages on the website whose URLs contain the "&" character between any other symbols. However, there may be another situation:

User-agent: Yandex
Disallow: *&

Here the intention is to ban indexing of all pages whose links end in "&" (strictly speaking, to anchor the rule to the end of the URL you would add the "$" sign, i.e. Disallow: *&$; without it a trailing "*" is implied and the rule behaves like the previous one).

While there should be no questions about prohibiting the indexing of the site's system files, such questions may arise about banning the indexing of individual pages of the resource: why is this necessary at all? An experienced webmaster may have many considerations here, but the main one is the need to get rid of duplicate pages in the search. Using the "Disallow:" command and the group of special characters discussed above, you can deal with "undesirable" pages quite simply.

“Allow:” command – allowing indexing in robots.txt

The antipode of the previous directive is the "Allow:" command. Using the same clarifying elements, but with this command in the robots.txt file, you can allow the indexing robot to add the site elements you need to the search database. To confirm this, here is another example:

User-agent: Yandex
Allow: /wp-admin

For some reason the webmaster changed his mind and made the corresponding adjustments to robots.txt. As a result, from now on the contents of the wp-admin folder are officially allowed to be indexed by Yandex.

Even though the Allow: command exists, it is not used very often in practice. By and large, there is no need for it, since it is applied automatically. The site owner just needs to use the “Disallow:” directive, prohibiting this or that content from being indexed. After this, all other content of the resource that is not prohibited in the robots.txt file is perceived by the search robot as something that can and should be indexed. Everything is like in jurisprudence: “Everything that is not prohibited by law is permitted.”

"Host:" and "Sitemap:" directives

The overview of important directives in robots.txt is completed by the "Host:" and "Sitemap:" commands. The first is intended exclusively for Yandex, indicating to it which site mirror (with or without www) is considered the main one. For example, a site might specify:

User-agent: Yandex
Host: site

User-agent: Yandex
Host: www.site

Using this command also avoids unnecessary duplication of site content.

In turn, the "Sitemap:" directive indicates to the indexing robot the correct path to the so-called Site Map - the sitemap.xml and sitemap.xml.gz files (in the case of WordPress). A hypothetical example might be:

User-agent: *
Sitemap: http://site/sitemap.xml
Sitemap: http://site/sitemap.xml.gz

Writing this command in the robots.txt file will help the search robot index the Site Map faster. This, in turn, will also speed up the process of getting web resource pages into search results.

The robots.txt file is ready - what next?

Let's assume that you, as a novice webmaster, have mastered the entire array of information given above. What next? Create the robots.txt text document, taking into account the particularities of your site. To do this you need to:

  • use a text editor (for example, Notepad) to compose the robots.txt you need;
  • check the correctness of the created document, for example, using this Yandex service;
  • using an FTP client, upload the finished file to the root folder of your site (in the case of WordPress, we are usually talking about the public_html system folder).

Oh, and we almost forgot: a novice webmaster will no doubt want to first look at ready-made examples of this file made by others. Nothing could be simpler. Just enter site.ru/robots.txt in your browser's address bar, replacing "site.ru" with the name of the resource you are interested in. That's all.

Happy experimenting and thanks for reading!


