Robots file for Google. How to edit the robots.txt file

The first thing a search bot does when it comes to your site is look for and read the robots.txt file. What is this file? It is a set of instructions for the search engine.

It is a text file with the .txt extension, located in the root directory of the site. This set of instructions tells the search robot which pages and files of the site to index and which to skip. It also indicates the main mirror of the site and where to find the sitemap.

Why do you need a robots.txt file? For proper indexing of your site - so that search results do not contain duplicate pages, various service pages, and documents. Once the directives in robots.txt are configured correctly, you will save your site from many problems with indexing and site mirroring.

How to make the right robots.txt

Making robots.txt is fairly easy: create a text document in the standard Windows Notepad and write the directives for search engines in it. Then save the file under the name "robots" with the .txt extension. Now you can upload it to your hosting, into the root folder of the site. Note that only one robots.txt file can be created per site. If the file is missing, the bot automatically "decides" that everything may be indexed.

Since there is only one file, it contains the instructions for all search engines. You can write separate instructions for each search engine as well as a general block for all of them at once. Instructions for different search bots are separated with the User-agent directive. We will talk more about it below.

Robots.txt directives

The "For robots" file may contain the following directives for controlling indexing: User-Agent, Disallow, Allow, Sitemap, Host, Crawl-Delay, Clean-Param. Let's consider each instruction in more detail.

User-Agent Directive

The User-agent directive indicates which search engine (more precisely, which particular bot) the instructions are meant for. If "*" is specified, the instructions apply to all robots. If a specific bot is named, such as Googlebot, the instructions are intended only for Google's main indexing robot. Moreover, if there are instructions both for Googlebot specifically and for all other search engines, Google will read only its own block and ignore the general one. The Yandex bot behaves the same way. Here is an example of how the directive is written:

User-agent: YandexBot - instructions only for the main Yandex indexing bot
User-agent: Yandex - instructions for all Yandex bots
User-agent: * - instructions for all bots

Disallow and Allow directives

The Disallow and Allow directives tell the robot what to index and what not to. Disallow forbids indexing a page or a whole section of the site. Allow, on the contrary, indicates what must be indexed.

Disallow: / - forbids indexing the entire site
Disallow: /papka/ - forbids indexing all the contents of the /papka/ folder
Disallow: /files.php - forbids indexing the files.php file

Allow: /cgi-bin - allows indexing of /cgi-bin pages

In the Disallow and Allow directives it is often necessary to use special characters. They are used to write regular expressions.

The special character * stands for any sequence of characters. It is appended to the end of each rule by default: even if you did not write it, the search engine will assume it is there. Examples of use:

Disallow: /cgi-bin/*.aspx - forbids indexing all files with the .aspx extension
Disallow: /*foto - forbids indexing files and folders whose paths contain the word foto

The special character $ cancels the implicit "*" at the end of a rule. For example:

Disallow: /example$ - forbids indexing '/example' but does not forbid '/example.html'

And if we write the rule without the special character, the instruction works differently:

Disallow: /example - forbids both '/example' and '/example.html'

Sitemap Directive

The Sitemap directive tells the search robot where on the host the sitemap is located. The sitemap must be in the sitemaps.xml format. A sitemap is needed for faster and more complete indexing of the site. It does not have to be a single file; there may be several. The directive is written like this:

Sitemap: http://site.ru/sitemaps1.xml
Sitemap: http://site.ru/sitemaps2.xml

Host directive

The Host directive tells the robot which mirror of the site is the main one. To keep site mirrors out of the index, you should always specify this directive. If you do not, the Yandex robot will index at least two versions of the site - with www and without - until the mirror robot glues them together. An example:

Host: www.site.ru
Host: site.ru

In the first case the robot will index the version with www, in the second case the version without it. Only one Host directive may be written in the robots.txt file; if you write several, the bot will take only the first one into account.

A correct Host directive must contain the following:
- the connection protocol (http or https);
- a correctly written domain name (an IP address cannot be used);
- a port number, if necessary (for example, Host: site.com:8080).

Incorrectly written directives are simply ignored.

Crawl-Delay Directive

The Crawl-delay directive allows you to reduce the load on the server. It is needed when your site starts to buckle under the onslaught of various bots. Crawl-delay tells the search bot how long to wait between finishing the download of one page and starting the download of the next. The directive should go directly after the Disallow and/or Allow entries. The Yandex search robot understands fractional values, for example 1.5 (one and a half seconds).
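For illustration, a minimal block using it might look like this (the /search/ path is an assumed example):

User-agent: Yandex
Disallow: /search/
Crawl-delay: 1.5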

Clean-Param Directive

The Clean-param directive is needed for sites whose page URLs contain dynamic parameters that do not affect the content - various service information such as session identifiers, user IDs, referrers, and so on. To avoid duplicates of such pages, this directive is used: it tells the search engine not to download the same information over and over. The load on the server and the time the robot spends crawling the site will decrease.

Clean-param: s /forum/showthread.php

This entry tells the search engine that the s parameter is to be considered insignificant for all URLs that begin with /forum/showthread.php. The maximum length of a rule is 500 characters.

Now that we have covered the directives, let's move on to configuring our robots file.

Setting Robots.txt

We proceed directly to the configuration of the robots.txt file. It must contain at least two entries:

User-agent: - indicates which search engine the instructions below are intended for.
Disallow: - specifies which part of the site should not be indexed. It can close from indexing either a single page or entire sections.

You can specify whether these directives are intended for all search engines or for one specifically. This is set in the User-agent directive. If you want all bots to read the instructions, put an asterisk:

User-agent: *

If you want to write instructions for a specific robot, you need to specify its name:

User-agent: YandexBot

A simplified example of a correctly composed robots.txt file:

User-agent: *
Disallow: /files.php
Disallow: /razdel/
Host: site.ru

Here * means that the instructions are intended for all search engines;
Disallow: /files.php - forbids indexing of the files.php file;
Disallow: /razdel/ - forbids indexing of the entire /razdel/ section with all nested files;
Host: site.ru - tells robots which mirror to index.

If there are no pages on your site that need to be closed from indexing, your robots.txt should look like this:

User-agent: *
Disallow:
Host: site.ru

Robots.txt for Yandex

To indicate that these instructions are intended for the Yandex search engine, you must write User-agent: Yandex. If you specify "Yandex", the site will be crawled by all Yandex robots; if you specify "YandexBot", the rule applies only to the main indexing robot.

It is also necessary to write the Host directive, specifying the main mirror of the site. As I wrote above, this is done to prevent duplicate pages. A correct robots.txt for Yandex might look as shown below.
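Here is only a minimal sketch of such a file under the assumptions of this article (the site.ru placeholder domain and the /wp-admin/ service section are assumptions):

User-agent: Yandex
Disallow: /wp-admin/
Host: site.ru

Sitemap: http://site.ru/sitemap.xml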

Robots.txt is a text file located at the root of the site - http://site.ru/robots.txt. Its main purpose is to give search engines certain directives: what to do on the site and when.

The simplest robots.txt

The simplest robots.txt, which allows all search engines to index everything, looks like this:

User-agent: *
Disallow:

If the Disallow directive has no slash after it (an empty value), all pages are allowed for indexing.

The following directive completely forbids indexing of the site:

User-agent: *
Disallow: /

User-agent indicates who the directives are intended for; the asterisk means all search engines, while for Yandex you write User-agent: Yandex.

The Yandex Help says that its search robots process User-agent: *, but if a User-agent: Yandex block is present, User-agent: * is ignored.

Disallow and Allow directives

There are two main directives:

Disallow - forbid

Allow - allow

Example: on a blog we want to forbid indexing of the /wp-content/ folder, where plugins, the template, etc. are stored. But it also contains images that search engines should index so they can appear in image search. For this we use a scheme like this:

User-agent: *
Allow: /wp-content/uploads/ # allow indexing of images in the uploads folder
Disallow: /wp-content/

The order of directives matters for Yandex if they apply to the same pages or folders. If you specify it like this:

User-agent: *
Disallow: /wp-content/
Allow: /wp-content/uploads/

the Yandex robot will not load images from /uploads/, because the first matching directive, which blocks all access to the /wp-content/ folder, is executed.

Google treats this more loosely and follows all the directives of the robots.txt file regardless of their order.

Also, do not forget that directives with and without a trailing slash play different roles:

Disallow: /about - forbids access to the entire site.ru/about/ directory, and pages whose paths begin with about will also not be indexed: site.ru/about.html, site.ru/aboutlive.html, etc.

Disallow: /about/ - forbids robots from indexing pages in the site.ru/about/ directory, while pages such as site.ru/about.html etc. remain available for indexing.

Regular expressions in robots.txt

Two special characters are supported:

* - means any sequence of characters.

Example:

Disallow: /about* - blocks access to all pages whose paths contain about; in principle, this directive works the same way without the asterisk. But in some cases this expression is irreplaceable. For example, if one category contains pages both with .html at the end and without it, then to close from indexing all pages that contain .html we write this directive:

Disallow: /about/*.html

Now the page site.ru/about/live.html is closed from indexing, while site.ru/about/live remains open.

Another example by analogy:

User-agent: Yandex
Allow: /about/*.html # allow indexing
Disallow: /about/

All pages will be closed except those that end with .html.

$ - denotes the end of the string.

Example:

Disallow: /about - this robots.txt directive forbids indexing of all pages that start with about, and also bans pages in the /about/ directory.

By adding a dollar sign at the end - Disallow: /about$ - we tell robots that only the /about page itself must not be indexed, while /about/, pages like /aboutlive, etc. can be indexed.

Sitemap Directive

This directive indicates the path to the site map, in this form:

Sitemap: http://site.ru/sitemap.xml

Host directive

Indicated in this form:

Host: site.ru

Without http://, trailing slashes, and the like. If the main mirror of your site is the www version, then write:

Host: www.site.ru

Example robots.txt for Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?*
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*action=*
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*BACK_URL_ADMIN=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*PAGEN_*
Disallow: /*PAGE_*
Disallow: /*SHOWALL
Disallow: /*show_all=
Host: sitename.ru
Sitemap: https://www.sitename.ru/sitemap.xml

Example Robots.txt for WordPress

After adding all the necessary directives described above, you should end up with a robots file that looks something like the one sketched below.
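Here is only a minimal sketch of such a file, consistent with the description that follows (the site.ru domain and the specific paths are assumptions, not the author's exact file):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-content/uploads/

User-agent: Yandex
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-content/uploads/
Host: site.ru

Sitemap: http://site.ru/sitemap.xml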

This is, so to speak, a basic version of robots.txt for WordPress. It has two User-agent blocks - one for all search engines and a second one for Yandex, where the Host directive is specified.

The robots meta tag

You can close a page or the whole site from indexing not only with the robots.txt file; it can also be done with a meta tag.

<meta name="robots" content="noindex, nofollow">

It must be written inside the <head> tag, and this meta tag forbids indexing of the page. In WordPress there are plugins that let you set such meta tags, for example Platinum SEO Pack. With it you can close any page from indexing - it works through meta tags.

Crawl-Delay Directive

Using this directive, you can set the pause the search bot should make between downloading pages of the site.

User-agent: *
Crawl-delay: 5

The timeout between loading two pages will be 5 seconds. To reduce the load on the server, 15-20 seconds are usually set. This directive is needed for large, frequently updated sites on which search bots practically "live".

For ordinary sites and blogs this directive is not needed, but it can be used to limit the behavior of less relevant search robots (Rambler, Yahoo, Bing, etc.). After all, they also visit and index the site, creating load on the server.

Good afternoon, dear friends! As you all know, search engine optimization is a responsible and delicate business. You have to take absolutely every little thing into account to get an acceptable result.

Today we will talk about robots.txt - a file familiar to every webmaster. It is where all the basic instructions for search robots are written. As a rule, they gladly follow these instructions, but if the file is composed incorrectly they may refuse to index the web resource properly. Next, I will tell you how to make a correct version of robots.txt and how to configure it.

In the preface I already described what it is. Now I will tell you why it is needed. Robots.txt is a small text file stored at the root of the site. It is used by search engines, and it clearly spells out the indexing rules, i.e. which sections of the site should be indexed (added to the search) and which should not.

Usually the technical sections of a site are closed from indexing. Occasionally non-unique pages also end up on the blacklist (a copy-pasted privacy policy is one example). Here the robots are "told" the principles of working with the sections that need to be indexed. Very often rules are written separately for several robots. We will talk about this below.

With a properly configured robots.txt, your site has every chance of rising in search engine positions. Robots will take into account only useful content and ignore the duplicated and technical sections.

Creating robots.txt

To create the file, it is enough to use the standard text editor of your operating system and then upload it to the server via FTP. Where it lives on the server is easy to guess - in the root. Typically this folder is called public_html.

You can easily get into it with any FTP client or the hosting's built-in file manager. Naturally, we will not upload an empty robots file to the server; let's write a few main directives (rules) in it.

User-agent: *
Allow: /

Using these lines in your robots file, you address all robots (the User-agent directive) and allow them to index your site fully and completely, all pages included (Allow: /).

Of course, this option does not really suit us. The file will not be especially useful for search engine optimization; it definitely needs proper configuration. But before that, let's look at all the main directives and values of robots.txt.

Directives

User-agent - one of the most important directives, since it indicates which robots should follow the rules that come after it. The rules are taken into account up to the next User-agent in the file.
Allow - allows indexing of particular blocks of the resource, for example "/" or "/tag/".
Disallow - on the contrary, forbids indexing of sections.
Sitemap - the path to the sitemap (in XML format).
Host - the main mirror (with www or without, or, if you have several domains, the primary one). The secure HTTPS protocol is also indicated here (if it is used). If you have standard HTTP, it does not need to be specified.
Crawl-delay - with it, you can set the interval between visits and downloads of your site's files for robots. It helps reduce the load on the host.
Clean-param - lets you exclude URL parameters on certain pages from indexing (of the type www.site.com/cat/state?admin_id8883278). Unlike the previous directives, two values are specified here: the address and the parameter itself.

These are all the rules supported by the flagship search engines. It is with their help that we will create our robots.txt, using various combinations for various types of sites.

Setting

To configure the robots file competently, we need to know exactly which sections of the site should be indexed and which should not. In the case of a simple one-page site in HTML + CSS, it is enough to write a few major directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we specified rules and values for all search engines. But it is better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files of our HTML site will be indexed. If we want to exclude a particular page or image, we need to specify a relative link to that item in Disallow, as in the sketch below.
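For instance (the file and image names here are hypothetical, purely for illustration):

User-agent: *
Allow: /
Disallow: /old-page.html
Disallow: /images/photo.jpg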

You can also use automatic robots.txt generation services. There is no guarantee that they will produce a perfectly correct version, but you can try them to get familiar with the format.

With such services you can create robots.txt automatically. Personally, I do not really recommend this option, because it is much easier to do it by hand, tailoring the file to your platform.

By platforms I mean all sorts of CMSs, frameworks, SaaS systems and much more. Next we will talk about how to configure the robots file for WordPress and Joomla.

But before that, let's highlight a few universal rules you can follow when creating and configuring robots.txt for almost any site (a combined sketch follows after the lists).

Close from indexing (Disallow):

  • the site admin area;
  • personal account and registration/authorization pages;
  • the cart and order form data (for online stores);
  • the cgi-bin folder (located on the host);
  • service sections;
  • AJAX and JSON scripts;
  • UTM and OpenStat tags;
  • various URL parameters.

Open (Allow):

  • images;
  • JS and CSS files;
  • other elements that should be taken into account by search engines.

In addition, at the end do not forget to specify the Sitemap data (the path to the sitemap) and Host (the main mirror).
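A minimal sketch that follows this checklist; every path here is an assumption and must be adapted to your own site structure:

User-agent: *
Disallow: /admin/
Disallow: /cabinet/
Disallow: /register/
Disallow: /cart/
Disallow: /cgi-bin/
Disallow: /ajax/
Disallow: *utm=
Disallow: *openstat=
Allow: /images/
Allow: /*.js
Allow: /*.css

Sitemap: http://site.ru/sitemap.xml
Host: site.ru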

Robots.txt for WordPress

To create the file, we need to upload robots.txt to the root of the site. You can edit its contents with the same FTP clients and file managers.

There is a more convenient option: create the file with a plugin - Yoast SEO, in particular, has this feature. Editing robots.txt straight from the admin panel is much more convenient, so I myself use this way of working with robots.txt.

How you decide to create this file is up to you; what matters more for us is understanding which directives should be in it. On my sites running WordPress I use this option:

User-agent: * # rules for all robots, except Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (starting with wp-)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and user archives
Disallow: */trackback # notifications from WP that someone links to you
Disallow: */feed # feeds in XML
Disallow: */rss # and RSS
Disallow: */embed # embedded elements
Disallow: /xmlrpc.php # WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # OpenStat tags
Disallow: /tag/ # tags (if any)
Allow: */uploads # open downloads (images, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and images in PNG format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and in other formats
Allow: /wp-*.gif # /
Allow: /wp-admin/admin-ajax.php # works together with plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-param: openstat # and do not forget about OpenStat

Sitemap: # write the path to your sitemap here
Host: https://site.ru # the main mirror

Attention! When copying these lines into your file, do not forget to remove all the comments (the text after #).

This version of robots.txt is the most popular among webmasters who use WP. Is it perfect? No. You can try adding something or, on the contrary, removing something. But keep in mind that mistakes are common when editing robots.txt. We will talk about them further on.

Robots.txt for Joomla

Although few people still use Joomla in 2018, I believe this wonderful CMS should not be written off. When promoting projects on Joomla you will certainly have to create a robots file - how else would you close unnecessary elements from indexing?

As in the previous case, you can create the file manually and simply upload it to the host, or use a module for this purpose. In both cases you will have to configure it competently. A correct option for Joomla looks like this:

User-agent: *
Allow: /*.css?
Allow: /*.js?
Allow: /*.jpg?
Allow: /*.png?
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # do not forget to change the address to your own
Sitemap: http://site.ru/sitemap.xml # and here too

As a rule, this is enough to keep unnecessary files out of the index.

Errors when setting up

People very often make mistakes when creating and configuring a robots file. Here are the most common of them:

  • Rules are written for only one User-agent.
  • The Host and Sitemap directives are missing.
  • The HTTP protocol is written in the Host directive (the protocol only needs to be specified for HTTPS).
  • The nesting rules are violated when opening/closing images.
  • UTM and OpenStat tags are not closed from indexing.
  • The Host and Sitemap directives are duplicated for each robot.
  • The file is put together superficially.

It is very important to configure this little file properly. If you make gross mistakes you can lose a significant share of your traffic, so be extremely attentive when setting it up.

How to check the file?

For this purpose it is better to use the dedicated services from Yandex and Google, since these search engines are the most popular and in demand; there is little point in bothering with search engines like Bing, Yahoo, or Rambler.

To begin with, consider the Yandex option. Go to Yandex.Webmaster and open the robots.txt analysis tool.

Here you can check the file for errors, and also check in real time which pages are open for indexing and which are not. Very convenient.

Google has exactly the same kind of service. Go to Search Console, find the Crawl tab, and select the robots.txt testing tool.

The functions here are exactly the same as in the Yandex service.

Please note that it shows me 2 errors. This is because Google does not recognize the parameter-cleaning directives that I specified for Yandex:

Clean-param: utm_source&utm_medium&utm_campaign
Clean-param: openstat

It is not worth worrying about this, since Google's robots only use the rules written for GoogleBot.

Conclusion

The robots.txt file is very important for the SEO of your site. Approach its configuration with full responsibility, because if it is implemented incorrectly everything can go wrong.

Take into account all the tips I shared in this article, and remember that you do not have to copy my robots.txt examples exactly. You may well need to dig further into each of the directives and adjust the file to your specific case.

And if you want to get a deeper understanding of robots.txt and of building websites on WordPress, I invite you to keep studying the topic: you will learn how to easily create a site without forgetting to optimize it for search engines.

Robots.txt is a special file located in the root directory of the site. In it the webmaster specifies which pages and data to close from indexing by search engines. The file contains directives describing access to sections of the site (the so-called robots exclusion standard). For example, with it you can set different access rules for search robots intended for mobile devices and for regular computers, as sketched below. It is very important to configure it correctly.
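For instance, a minimal sketch with different rules for Yandex's regular and mobile robots (YandexMobileBot is Yandex's mobile robot; the /mobile/ path is only an assumed example):

User-agent: Yandex
Disallow: /mobile/

User-agent: YandexMobileBot
Allow: /mobile/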

Do you need robots.txt?

With robots.txt you can:

  • prohibit indexing of similar and unnecessary pages, so as not to waste the crawl limit (the number of URLs the search robot can crawl in one pass) - this way the robot will be able to index more important pages;
  • hide images from search results;
  • close unimportant scripts, style files, and other non-critical page resources from indexing.

However, if blocking such resources prevents the Google or Yandex crawler from analyzing pages, do not block those files. A short sketch follows below.
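A minimal sketch of these three uses; the paths are assumptions chosen purely for illustration:

User-agent: *
Disallow: /print/ # duplicate "printer-friendly" pages
Disallow: /images/ # hide images from search results
Disallow: /scripts/ # non-critical scripts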

Where is the file robots.txt?

If you just want to see what is in the robots.txt file, simply type site.ru/robots.txt in the browser's address bar.

Physically, the robots.txt file is located in the root folder of the site on the hosting. My hosting is beget.ru, so I will show the location of the robots.txt file using that hosting as an example.


How to create the right robots.txt

The robots.txt file consists of one or more rules. Each rule blocks or allows indexing of a particular path on the site.

  1. In a text editor, create a file named robots.txt and fill it in according to the rules below.
  2. The robots.txt file must be a text file in ASCII or UTF-8 encoding. Characters in other encodings are not allowed.
  3. There should be only one such file on the site.
  4. The robots.txt file must be placed in the root directory of the site. For example, to control indexing of all pages of the site http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It must not be in a subdirectory (for example, at http://example.com/pages/robots.txt). If you have difficulty accessing the root directory, contact your hosting provider. If you do not have access to the site's root directory, use an alternative blocking method such as meta tags.
  5. The robots.txt file can be added at addresses with subdomains (for example, http://website.example.com/robots.txt) or non-standard ports (for example, http://example.com:8181/robots.txt).
  6. Check the file in the Yandex.Webmaster and Google Search Console services.
  7. Upload the file to the root directory of your site.

Here is an example of a robots.txt file with two rules. An explanation follows below.

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Explanation

  1. The user agent named Googlebot must not index the directory http://example.com/nogooglebot/ or its subdirectories.
  2. All other user agents have access to the entire site (this rule can be omitted, the result will be the same, since full access is granted by default).
  3. The Sitemap file of this site is located at http://www.example.com/sitemap.xml.

Disallow and Allow directives

To prohibit a robot from indexing and accessing the site or some of its sections, use the Disallow directive.

User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to pages starting with "/cgi-bin"

In accordance with the standard, it is recommended to insert an empty line break before each User-agent directive.

The # symbol is used for comments. Everything after this symbol, up to the first line break, is ignored.

To allow a robot access to the site or some of its sections, use the Allow directive.

User-agent: Yandex
Allow: /cgi-bin
Disallow: / # forbids downloading everything except pages starting with "/cgi-bin"

Empty line breaks are not allowed between the User-agent, Disallow and Allow directives.

The Allow and Disallow directives from the matching User-agent block are sorted by URL prefix length (from shorter to longer) and applied in that order. If several directives match a given page of the site, the robot picks the last one in the sorted list. Thus, the order of directives in the robots.txt file does not affect how the robot uses them. Examples:

# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# allows downloading only pages starting with "/catalog"

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# forbids downloading pages starting with "/catalog"
# but allows downloading pages starting with "/catalog/auto"

When two directives with prefixes of the same length conflict, priority is given to the Allow directive.

Using the special characters * and $

When specifying the paths of the Allow and Disallow directives, you can use the special characters * and $, thereby setting certain regular expressions.

The special character * means any (including empty) sequence of characters.

The special character $ means the end of the string; the character before it is the last one.

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # forbids "/cgi-bin/example.aspx" and "/cgi-bin/private/test.aspx"
Disallow: /*private # forbids not only "/private" but also "/cgi-bin/private"

Sitemap Directive

If you describe the site structure using a Sitemap file, specify the path to the file as the parameter of the Sitemap directive (if there are several files, specify them all). Example:

User-agent: Yandex
Allow: /
Sitemap: https://example.com/site_structure/my_sitemaps1.xml
Sitemap: https://example.com/site_structure/my_sitemaps2.xml

The directive is cross-sectional, so the robot will use it regardless of where it appears in the robots.txt file.

The robot will remember the path to the file, process the data, and use the results when forming subsequent download sessions.

Crawl-Delay Directive

If the server is heavily loaded and cannot keep up with the robot's requests, use the Crawl-delay directive. It lets you set the minimum period of time (in seconds) for the search robot between the end of the download of one page and the start of the download of the next.

Before changing the crawl rate of the site, find out which pages the robot requests most often.

  • Analyze the server logs. Contact the person responsible for the site or the hosting provider.
  • Look through the URL list on the Indexing → Crawl statistics page in Yandex.Webmaster (set the switch to All pages).

If you find that the robot is requesting service pages, forbid their indexing in robots.txt with the Disallow directive. This will help reduce the number of unnecessary robot requests.

Clean-Param Directive

The directive only works with the Yandex robot.

If the addresses of the site pages contain dynamic parameters that do not affect their contents (identifiers of sessions, users, referrers, etc.), you can describe them using the Clean-Param directive.

Using this directive, the Yandex robot will not repeatedly reload duplicate information. This increases the efficiency of crawling your site and reduces the load on the server.

For example, the site has the pages:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

The ref parameter is used only to track which resource the request came from and does not change the content; the same page with the book book_id=123 is shown at all three addresses. Then, if you specify the directive as follows:

User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl

the Yandex robot will reduce all the page addresses to one:

www.example.com/some_dir/get_book.pl?book_id=123

If such a page exists on the site, it is the one that will participate in the search results.

Directive syntax

Clean-param: p0[&p1&p2&..&pn] [path]

The first field lists, separated by the & character, the parameters the robot should not take into account. The second field specifies the path prefix of the pages the rule applies to.

Note. The Clean-param directive is cross-sectional, so it can be specified anywhere in the robots.txt file. If several directives are specified, the robot will take all of them into account.

The prefix can contain a regular expression in a format similar to robots.txt, but with some limitations: only the characters A-Za-z0-9.-/*_ can be used. The * character is interpreted the same way as in robots.txt: a * is always implicitly appended to the end of the prefix. For example:

Clean-param: s /forum/showthread.php

Case is taken into account. There is a limit on the length of a rule: 500 characters. For example:

Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forum/*.php
Clean-param: someTrash&otherTrash

Host directive

At the moment, Yandex has stopped supporting this directive.

Correct Robots.txt: Setup

The contents of robots.txt differ depending on the type of site (online store, blog), the CMS used, the structure, and a number of other factors. Therefore, creating this file for a commercial site, especially for a complex project, should be left to an SEO specialist with sufficient experience.

An unprepared person will most likely not be able to make the right decision about which part of the content is better to close from indexing and which should be allowed to appear in search results.

Correct Robots.txt Example for WordPress

User-agent: * # general rules for robots, except Yandex and Google, because the rules for them are below
Disallow: /cgi-bin # folder on the hosting
Disallow: /? # all query parameters on the main page
Disallow: /wp- # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: /wp/ # if there is a subdirectory /wp/ where the CMS is installed (if not, the rule can be deleted)
Disallow: *?s= # search
Disallow: *&s= # search
Disallow: /search/ # search
Disallow: /author/ # author archives
Disallow: */trackback # trackbacks, notifications in the comments about an open link to an article
Disallow: */feed # all feeds
Disallow: */rss # RSS feed
Disallow: */embed # all embeds
Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (if you do not use it, the rule can be deleted)
Disallow: /xmlrpc.php # WordPress API file
Disallow: *utm*= # links with UTM tags
Disallow: *openstat= # links with OpenStat tags
Allow: */uploads # open the folder with uploads
Sitemap: http://site.ru/sitemap.xml # sitemap address

User-agent: GoogleBot # rules for Google (comments are not duplicated)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Disallow: *utm*=
Disallow: *openstat=
Allow: */uploads
Allow: /*/*.js # open JS scripts inside /wp- (/*/ - for priority)
Allow: /*/*.css # open CSS files inside /wp- (/*/ - for priority)
Allow: /wp-*.png # images in plugins, the cache folder, etc.
Allow: /wp-*.jpg # images in plugins, the cache folder, etc.
Allow: /wp-*.jpeg # images in plugins, the cache folder, etc.
Allow: /wp-*.gif # images in plugins, the cache folder, etc.
Allow: /wp-admin/admin-ajax.php # used by plugins so as not to block JS and CSS

User-agent: Yandex # rules for Yandex (comments are not duplicated)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-param: utm_source&utm_medium&utm_campaign # Yandex recommends not closing these pages from indexing but cleaning the tag parameters instead; Google does not support such rules
Clean-param: openstat # similarly

Robots.txt example for Joomla

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/

Robots.txt example for Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*?action=
Disallow: /*action=ADD_TO_COMPARE_LIST
Disallow: /*action=DELETE_FROM_COMPARE_LIST
Disallow: /*action=ADD2BASKET
Disallow: /*action=BUY
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*BACK_URL_ADMIN=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*?COURSE_ID=
Disallow: /*?PAGEN
Disallow: /*PAGEN_1=
Disallow: /*PAGEN_2=
Disallow: /*PAGEN_3=
Disallow: /*PAGEN_4=
Disallow: /*PAGEN_5=
Disallow: /*PAGEN_6=
Disallow: /*PAGEN_7=

Disallow: /*PAGE_NAME=search
Disallow: /*PAGE_NAME=user_post
Disallow: /*PAGE_NAME=detail_slide_show
Disallow: /*SHOWALL
Disallow: /*show_all=
Sitemap: http:// path to your XML-format sitemap

Robots.txt example for MODX

User-agent: *
Disallow: /assets/cache/
Disallow: /assets/docs/
Disallow: /assets/export/
Disallow: /assets/import/
Disallow: /assets/modules/
Disallow: /assets/plugins/
Disallow: /assets/snippets/
Disallow: /install/
Disallow: /manager/
Sitemap: http://site.ru/sitemap.xml

Robots.txt example for Drupal

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/*
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /index.php
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: *register*
Disallow: *login*
Disallow: /top-rated-
Disallow: /messages/
Disallow: /book/export/
Disallow: /user2userpoints/
Disallow: /myuserpoints/
Disallow: /tagadelic/
Disallow: /referral/
Disallow: /aggregator/
Disallow: /files/pin/
Disallow: /your-votes
Disallow: /comments/recent
Disallow: /*/edit/
Disallow: /*/delete/
Disallow: /*/export/html/
Disallow: /taxonomy/term/*/$
Disallow: /*/edit$
Disallow: /*/outline$
Disallow: /*/revisions$
Disallow: /*/contact$
Disallow: /*downloadpipe
Disallow: /node$
Disallow: /node/*/track$
Disallow: /*&
Disallow: /*%
Disallow: /*?page=0
Disallow: /*section
Disallow: /*order
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*votesupdown
Disallow: /*calendar
Disallow: /*index.php
Allow: /*?page=
Disallow: /*?
Sitemap: http:// path to your XML-format sitemap

ATTENTION!

CMSs are constantly updated, so you may need to close other pages from indexing as well. Depending on your goals, an indexing ban can be removed or, on the contrary, added.

Check robots.txt

Each search engine has its own requirements for the design of the Robots.txt file.

To check robots.txt for correct syntax and file structure, you can use one of the online services. For example, Yandex and Google offer their own site analysis services for webmasters, which include robots.txt analysis:

Checking robots.txt for the Yandex search robot

You can do this with a special tool from Yandex - Yandex.Webmaster - and in two ways.

Option 1:

In the drop-down list at the top right, select Robots.txt analysis, or follow the link http://webmaster.yandex.ru/robots.xml

Do not forget that all the changes you make to the robots.txt file become available not immediately, but only after some time.

Checking robots.txt for the Google search robot

  1. In Google Search Console, select your site, go to the testing tool, and view the contents of the robots.txt file. Syntax and logic errors in it will be highlighted, and their number is shown below the editing window.
  2. At the bottom of the page, enter the desired URL in the corresponding field.
  3. In the drop-down menu on the right, select the robot.
  4. Click the CHECK button.
  5. The status AVAILABLE or NOT AVAILABLE will be displayed. In the first case Google's robots can go to the address you specified, in the second they cannot.
  6. If necessary, make changes to the file in the editor and check again. Attention! These fixes are not automatically written to the robots.txt file on your site.
  7. Copy the modified content and add it to the robots.txt file on your web server.

In addition to checks from Yandex and Google, there are many other online robots.txt validators.

Robots.txt generators

  1. The service from seolib.ru. With this tool you can quickly check the restrictions in the robots.txt file.
  2. The generator from pr-cy.ru. As a result of running the robots.txt generator, you will get text that needs to be saved to a file named robots.txt and uploaded to the root directory of your site.

First I will tell you what robots.txt is.

Robots.txt is a file located in the root folder of the site, where special instructions for search robots are written. These instructions are needed so that when the robot enters the site it does not take certain pages or sections into account; in other words, we close those pages from indexing.

Why do you need robots.txt

The robots.txt file is considered a key requirement for the SEO optimization of absolutely any site. Its absence can negatively affect the load created by robots and slow down indexing; worse still, the site may not be indexed completely. Accordingly, users will not be able to reach some pages through Yandex and Google.

How does robots.txt influence search engines?

Search engines (Google in particular) will index the site even if there is no robots.txt file, but then, as was said, not only the needed pages get indexed. If such a file exists, the robots are guided by the rules specified in it. Moreover, there are several types of search robots: some may respect a rule while others ignore it. In particular, the GoogleBot robot does not take the Host and Crawl-delay directives into account, the YandexNews robot has recently stopped taking the Crawl-delay directive into account, and the YandexDirect and YandexVideoParser robots ignore the generally accepted directives in robots.txt (but take into account those written specifically for them).

The site is loaded most by the robots that download content from it. Accordingly, by telling the robot which pages to index, which to ignore, and at what time intervals to load content from pages (this matters more for large sites with more than 100,000 pages in the search engine index), we make indexing and content loading much easier for both the robot and the site.

Files that are unnecessary for search engines include those belonging to the CMS, for example /wp-admin/ in WordPress, as well as AJAX and JSON scripts responsible for pop-up forms, banners, captcha output, and so on.

For most robots I also recommend closing all JavaScript and CSS files from indexing. But for GoogleBot and Yandex it is better to leave such files open, since the search engines use them to analyze the usability of the site and its ranking; see the sketch below.
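A minimal sketch of that idea; the blanket *.js and *.css patterns are assumptions to be adapted to a real site:

User-agent: *
Disallow: /*.js
Disallow: /*.css

User-agent: GoogleBot
Allow: /*.js
Allow: /*.css

User-agent: Yandex
Allow: /*.js
Allow: /*.css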

What are robots.txt directives?

Directives are rules for search robots. The first standard for writing robots.txt appeared back in 1994, and an extended standard in 1996. However, as you already know, not all robots support every directive. Therefore, below I describe what the main robots are guided by when indexing site pages.

What does User-agent mean?

This is the main directive, which defines which search robots the subsequent rules apply to.

For all robots:

User-agent: *

For a specific bot:

User-agent: GoogleBot

Case in robots.txt does not matter for the bot name: you can write both GoogleBot and googlebot.

Google search robots

Googlebot - the main Google indexing robot
Googlebot-Image - indexes images for Google Images
Googlebot-Video - indexes video
Googlebot-News - used by Google News
Mediapartners-Google - the AdSense robot
AdsBot-Google - checks the quality of advertising landing pages
Search robots Yandex

YandexBot - the main Yandex indexing robot
YandexImages - used by the Yandex.Images service
YandexVideo - used by the Yandex.Video service
YandexMedia - indexes multimedia data
YandexBlogs - blog search
YandexAddurl - the robot that accesses a page when it is added through the "Add URL" form
YandexFavicons - the robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - used by the Yandex.Catalog service
YandexNews - used by the Yandex.News service
YandexImageResizer - the search robot of mobile services

Search robots Bing, Yahoo, Mail.Ru, Rambler

Disallow and Allow directives

Disallow closes sections and pages of your site from indexing. Allow, accordingly, opens them.

There are some peculiarities.

First, the additional operators *, $ and #. What are they used for?

"*" is any number of characters, including none. By default it is already implied at the end of the string, so there is no point in adding it there again.

"$" shows that the character before it must be the last one.

"#" is a comment; the robot ignores everything that comes after this symbol.

Examples of using Disallow:

Disallow: *?s=

Disallow: /category/

Accordingly, the search robot will close from indexing any page whose URL contains ?s= or lies under /category/, while pages that do not match these patterns remain open for indexing.

Now you need to understand how nested rules are processed. The order in which directives are written matters: rule inheritance is determined by which directories are specified. That is, if we want to close a page or document from indexing, it is enough to write the corresponding directive. Let's look at an example.

This is our robots.txt file:

Disallow: /template/

Sitemap directive in robots.txt

This directive can be specified anywhere in the file, and several Sitemap files can be listed.

Host Directive in Robots.txt

This directive is needed to indicate the main mirror of the site (usually with www or without). Note that the Host directive is written without the http:// protocol but with the https:// protocol if the site uses it. The directive is taken into account only by the search robots of Yandex and Mail.Ru; other robots, including GoogleBot, ignore it. Host is written only once in the robots.txt file.

Example with http://

Host: website.ru

Example with https://

Host: https://website.ru

Crawl-Delay Directive

Sets the interval at which the search robot may request the site's pages for indexing. The value is specified in seconds; fractional values can also be used.

An example is given below, after the recommendations.

It is used mostly on large online stores, information sites, and portals with traffic of 5,000 visitors per day or more. It is needed so that the search robot makes indexing requests only at a certain interval. If you do not specify this directive, the bots can create a serious load on the server.

The optimal Crawl-delay value is different for each site. For the Mail.Ru, Bing, and Yahoo search engines you can set a minimum value of 0.25-0.3, since the robots of these search engines may visit your site once a month, once every two months, and so on (very rarely). For Yandex it is better to set a larger value.
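A minimal sketch of this recommendation (0.25 and 0.3 come from the paragraph above; the value of 2 for Yandex is only an assumed example of a "larger" setting, and Mail.Ru and Bingbot are the commonly used user-agent tokens):

User-agent: Yandex
Crawl-delay: 2

User-agent: Mail.Ru
Crawl-delay: 0.25

User-agent: Bingbot
Crawl-delay: 0.3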


If the load on your site is minimal, there is no point in specifying this directive.

Clean-Param Directive

This rule is interesting in that it tells the crawler that pages with certain parameters do not need to be indexed. Two arguments are written: the parameter and the page URL (path prefix). This directive is supported by the Yandex search engine.

Example:

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-param: utm_source&utm_medium&utm_campaign

In the example, we prescribed the rules for 3 different bots.

Where to add robots.txt?

It is added to the root folder of the site so that it can be opened at an address of the form:

site.ru/robots.txt

How to check robots.txt?

Yandex.Webmaster

On the Tools tab, choose Robots.txt analysis and then click Check.

Google Search Console

On the Crawl tab, choose the robots.txt file testing tool and then click Check.

Conclusion:

A robots.txt file must be present on every site being promoted, and only a correct configuration will give you the indexing you need.

And finally, if you have any questions, ask them in the comments under the article. I am also curious: how do you write your robots.txt?