How to prevent specific pages from being indexed

In this article, we will look at how to find duplicate pages and remove them from the index for good.

How do duplicate pages arise

The main reason duplicates appear is imperfection in the site's CMS: almost all modern commercial and non-commercial CMSs generate duplicate pages. Another reason can be the low professional level of the site developer, whose mistakes also lead to duplicates.

What duplicate pages look like

1. The main page of the site opening both with www and without www

Example: www.site.ua and site.ua

site.ua/home.html and site.ua/

2. Dynamic content with URL parameters such as ?, index.php, &view

site.ua/index.php?option=com_k2&Itemid=141&id=10&lang=ru&task=category&view=itemlist

site.ua/index.php?option=com_k2&Itemid=141&id=10&lang=ru&layout=category&task=category&view=itemlist

3. URLs with and without a trailing slash

4. Filter pages in an online store (example)

site.ua/?itemid=&product_book&

5. Print versions of pages

site.ua/cadok/?tmpl=component&print=1&layout=default&page=

Why duplicate pages are dangerous

Imagine that you are reading a book in which the same, or very similar, text appears on page after page. How useful would that information be to you? Search engines are in the same position: among the duplicates on your site they have to find the useful content that actually deserves to be shown.

Search engines do not like such sites, so your site will not occupy high positions in the search results; duplicates are a direct threat to it.

How to find duplicates on your site

1. Use the site: operator, for example site:site.ua. This lets you see exactly which pages of your site, including duplicates, are in the search engine's index.

2. Enter fragments of phrases from your site into the search box; this reveals all the pages on which that text is present.

3. Google Webmaster Tools: under Search Appearance → HTML Improvements you can see pages with duplicate meta descriptions or titles.

5 ways to remove duplicate pages

1. Using the Robots.txt file

User-agent: *
Disallow: /*?
Disallow: /index.php?

This tells the search engine that pages whose URLs contain the "?" parameter or index.php? should not be indexed.

There is one "but": the Robots file is only a recommendation for search engines, and not a rule that they absolutely follow. If, for example, the link is put on such a page, it will fall into the index.

2. The .htaccess file allows you to solve the duplicate problem at the server level.

.htaccess is an Apache server configuration file that sits in the root of the site. It lets you adjust the server configuration for an individual site.

To glue duplicate pages together, use a 301 redirect.

Redirect 301 /home.html http://site.ua/ (for static site pages)

RewriteCond %{QUERY_STRING} ^id=45454
RewriteRule ^index\.php$ http://site.ua/news.html? [R=301,L] (redirect for dynamic pages)

Use a 410 response (complete removal of a duplicate).
It tells the client that such a page no longer exists on the server.

Redirect 410 /tag/video.html

Set up the domain with www and without www

Redirecting to the www version:

Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_HOST} ^site\.ua
RewriteRule ^(.*)$ http://www.site.ua/$1 [R=301,L]

Redirecting to the version without www:

Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.site\.ua$
RewriteRule ^(.*)$ http://site.ua/$1 [R=301,L]

Add a slash at the end of the URL

RewriteCond %{REQUEST_URI} (.*/[^/.]+)($|\?)
RewriteRule .* %1/ [R=301,L]

For sites with a large number of pages, finding and gluing every duplicate this way can be quite laborious.

3. Webmaster tools

The URL Parameters feature lets you forbid Google from crawling site pages with particular parameters.

Alternatively, you can remove pages manually.

Removing a page this way is possible only if the page:

is blocked from indexing in the robots.txt file;

returns a 404 server response;

or is blocked with the noindex tag.

4. The noindex meta tag. This is the most effective way to remove duplicates: it removes them permanently and irrevocably.

According to Google, the presence of the noindex tag completely removes a page from the index.
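The tag itself is placed in the head section of the duplicate page. Its typical form, with noindex for the page and follow so that the links on it are still crawled, looks like this:

<meta name="robots" content="noindex, follow" />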

Important: for the robot to remove the page, it must first be able to crawl it, so the page must not be closed from indexing in the robots.txt file.

You can add the tag conditionally with PHP regular expressions, using the preg_match() function.
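As a rough illustration only (this sketch is not from the original material, and the URL patterns in it are hypothetical), such a check in a PHP template could look like this:

<?php
// Minimal sketch: output a noindex meta tag for duplicate-looking URLs.
// The patterns below are placeholder examples; adapt them to your own CMS.
$uri = $_SERVER['REQUEST_URI'];
if (preg_match('#[?&](print|tmpl)=#', $uri) || preg_match('#^/search/#', $uri)) {
    echo '<meta name="robots" content="noindex, follow" />';
}
?>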

5. The rel="canonical" attribute

The rel="canonical" attribute lets you point search engines to the recommended (canonical) version of a page, so that duplicates do not get into the index.

rel="canonical" can be specified in two ways:

1. Using the Link HTTP header

Link: <http://site.ua/>; rel="canonical"

2. In the head section, add rel="canonical" to the non-canonical versions of the pages.
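For example, a non-canonical version of a page could point to the main version like this (site.ua/ is used here simply as a stand-in for the canonical URL):

<link rel="canonical" href="http://site.ua/" />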

In some popular CMSs the rel="canonical" attribute is added automatically, for example in Joomla! 3.0. Other CMSs have dedicated extensions for this.

To summarize: when developing a site, take into account the ways duplicates can appear, decide in advance how you will deal with them, and create a correct site structure.

Periodically check the number of pages in the index and make use of the webmaster tools.

Materials from Zebegesti were used in preparing this article.

Any page on a site can be open or closed to indexing by search engines. If a page is open, the search engine adds it to its index; if it is closed, the robot does not visit it and does not include it in the search results.

When building a site, it is important to close from indexing, at the code level, all pages that users and search engines should not see for one reason or another.

Such pages include the administrative part of the site (the admin panel), pages with service information (for example, the personal data of registered users), pages with multi-step forms (for example, complex registration forms), feedback forms, and so on.

Example:
A user profile on the Searchengines forum.
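As an illustration, a robots.txt fragment closing such service sections might look like the sketch below; the paths are hypothetical and depend on your CMS:

User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /register/
Disallow: /feedback/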

It is also mandatory to close from indexing pages whose content is already used on other pages. Such pages are called duplicates. Full or partial duplicates strongly pessimize a site, because they increase the amount of non-unique content on it.

As you can see, the content of the two pages partly coincides. That is why category pages on WordPress sites are either closed from indexing or display only the titles of the posts.

The same applies to tag pages, which are often present in the structure of WordPress blogs. A tag cloud makes navigation easier and lets users quickly find the information they are interested in. However, tag pages are partial duplicates of other pages, which means they should be closed from indexing.
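As a sketch of one possible approach (assuming a WordPress theme whose header.php you can edit), tag and date archives could be marked with a robots meta tag using the standard conditional tags is_tag() and is_date():

<?php
// Minimal sketch for header.php: mark tag and date archives as noindex, follow.
if (is_tag() || is_date()) {
    echo '<meta name="robots" content="noindex, follow" />';
}
?>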

Another example is a store running on the OpenCart CMS.

A product category page: http://www.masternet-instrument.ru/Lampy-Energosberegajuschie-c-906_910_947.html

A page of discounted products: http://www.masternet-instrument.ru/specials.php

These pages have similar content, because many of the same products appear on both.

Google is especially strict about content duplicated across different pages of a site. A large number of duplicates can earn you sanctions, up to temporary exclusion of the site from the Google search results.

Another case where page content should not be "shown" to the search engines is pages with non-unique content. A typical example is drug package inserts in an online pharmacy: the content of a drug description page such as http://www.piluli.ru/product271593/product_info.html is not unique and is published on hundreds of other sites.

It is almost impossible to make such content unique, since rewriting texts of this kind is a thankless and, here, impermissible task. The best solution in this case is either to close the page from indexing, or to write to the search engines and ask them to be lenient about content that cannot be made unique for one reason or another.

How to close pages from indexing

The classic tool for closing pages from indexing is the robots.txt file. It is located in the root directory of your site and exists specifically to tell search robots which pages they must not visit. It is an ordinary text file that you can edit at any time. If you have no robots.txt file, or if it is empty, search engines will by default index every page they can find.

The structure of the robots.txt file is quite simple. It may consist of one or more blocks (instructions). Each instruction, in turn, consists of two lines. The first line is called User-agent and determines which search engine must follow the instruction. If you want to prohibit indexing for all search engines, the first line should look like this:

User-agent: *

If you want to prohibit indexing of the page for only one search engine, for example Yandex, the first line looks like this:

User-agent: Yandex

The second line of the instruction is called Disallow (prohibit). To prohibit all pages of the site, write the following in this line:

Disallow: /

To allow indexing of all pages, the second line should be:

Disallow:

In the Disallow line you can specify particular folders and files that need to be closed from indexing.

For example, to prohibit indexing of the image folder and all of its contents, write:

Disallow: /image/

To "hide" from search engines specific files, list them:

User-Agent: *
Disallow: /myFile1.htm.
Disallow: /myFile2.htm.
Disallow: /myFile3.htm.

These are the basic principles of how the robots.txt file is structured. They will help you close individual pages and folders of your site from indexing.
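Putting these pieces together, a complete file could, for example, allow everything to all robots while hiding one hypothetical service folder from Yandex only:

User-agent: *
Disallow:

User-agent: Yandex
Disallow: /private/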

Another, less common way to prevent indexing is the robots meta tag. If you want to close a page from indexing, or to forbid search engines from indexing the links placed on it, you need to add this tag to the page's HTML code. It must be placed in the head area, before the closing head tag.

The robots meta tag consists of two parameters. Index is the parameter responsible for indexing the page itself, and follow is the parameter that allows or forbids indexing of the links located on the page.

To prohibit indexing, write noindex and nofollow instead of index and follow, respectively.

So, if you want to close a page from indexing and tell search engines to ignore the links on it, add the following line to the code:

<meta name="robots" content="noindex,nofollow">

If you do not want to hide the page itself from indexing, but need to "hide" the links on it, the robots meta tag will look like this:

<meta name="robots" content="index,nofollow">

If, on the contrary, you need to hide the page from the search engines while still letting them follow its links, the tag takes this form:

<meta name="robots" content="noindex,follow">

Most modern CMSs let you close individual pages from indexing directly from the admin panel, which removes the need to dig into the code and set these parameters by hand. Nevertheless, the methods described above were and remain the universal and most reliable tools for prohibiting indexation.

The Joomla CMS has one drawback: its page addresses. A duplicate appears when one article is available at two addresses. For example:

http://site.ru/dizayn/ikonki-sotsial-noy-seti-vkonrtakte.html

index.php?option=com_content&view=article&id=99:vkontakteicons&catid=5:Design&Itemid=5

How to remove duplicate pages in Joomla from indexing is described below.

How do duplicate pages appear? Very simply: in the example above we see two links to the same material. The first is a clean, human-readable (SEF) URL created by the JoomSEF component, which converts all links on the site into this readable form. The second is Joomla's internal system link; if the ARTIO JoomSEF component were not installed, all links on the site would look like the second one, unreadable and ugly. Now, about how harmful this really is and how to deal with the duplicates.

How duplicates harm the site. I would not call them a very big drawback, since in my opinion search engines should not heavily pessimize a site for such duplicates: they are created not on purpose but as part of how the CMS works. Moreover, this is a very popular system on which millions of sites are built, which means search engines have learned to understand this "feature".
Still, if you have the possibility and the desire, it is better to hide such duplicates from the eyes of big brother.

How to deal with duplicates in Joomla and other CMSs

1. Two duplicates of one page: block them in robots.txt

For example, the following two addresses of the same page end up in the search engine index:

http://site.ru/stristen.html?replyTocom=371
http://site.ru/stristen.html?iframe=true&width=900&height=450

To close such duplicates, add the following to robots.txt:

Disallow: /*?*
Disallow: /*?

With this we close from indexing all links on the site containing a "?". This option suits sites where SEF URLs are enabled and normal links contain no question marks.

2. Use the rel="canonical" tag

Suppose two links with different addresses lead to the same page. You can tell the Google and Yahoo search engines which address of the page is the main one; to do this, the rel="canonical" tag is added in the head section of the page. Yandex does not support this option.

For Joomla I found two extensions for setting the rel="canonical" tag: 1) plg_canonical_v1.2 and 2) 098_mod_canonical_1.1.0. You can test them, but I would go another way and simply forbid indexing of all links with a question mark, as shown in the example above.

3. Block Joomla duplicates (pages beginning with index.php) and other unneeded pages in robots.txt

Since all duplicate pages in Joomla begin with index.php, you can block them all with a single line in robots.txt: Disallow: /index.php. This also blocks the duplicate of the main page, which is available both at "http://site.ru/" and at "http://site.ru/index.php".

4. Gluing the domain with www and without via a 301 redirect

To glue together the domain with www and without, you need a redirect, namely a 301 redirect. To do this, we write the following in the .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www.site.ru
RewriteRule ^(.*)$ http://site.ru/$1 [R=301,L]

If you need to redirect from http://site.ru to www.site.ru, the record will look like this:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^site.ru
RewriteRule (.*) http://www.site.ru/$1 [R=301,L]

5. The Host directive defines the main domain, with www or without, for Yandex

If you have only just created your site, do not rush to perform the actions described in the previous point: first you need to write a correct robots.txt and add the Host directive, which defines the main domain in Yandex's eyes.

It looks like this:

User-agent: Yandex
Host: site.ru

Only Yandex understands the Host directive; Google does not.

6. Gluing Joomla duplicate pages in the .htaccess file

Very often the main page of a Joomla site is also available at http://site.ru/index.html or http://site.ru/index.php, that is, as a duplicate of the main page (http://site.ru). Of course, you could get rid of these duplicates by closing them in robots.txt, but it is better to do it with .htaccess.
To do this, add the following to that file:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
RewriteRule ^index\.php$ http://site.ru/ [R=301,L]

Use this code if you need to get rid of the index.php duplicate; do not forget to put your own domain in place of http://site.ru/.

To check whether it worked, simply enter the address of the duplicate (http://site.ru/index.php) in the browser. If everything is fine, you will be redirected to http://site.ru; the same will happen to search bots, and they will not see these duplicates.

By analogy you can glue Joomla duplicates with other suffixes to the URI of your main page; simply edit the code given above.

7. Specify the Sitemap in robots.txt

Although this has nothing to do with duplicates, since we are on the subject I also recommend specifying the path to the XML sitemap in the robots.txt file:

Sitemap: http://domen.com/sitemap.xml.gz
Sitemap: http://domen.com/sitemap.xml

Outcome

Summing up the above, for Joomla I would add the following lines to robots.txt:

Disallow: /index.php

Specify the main host for Yandex:

User-agent: Yandex
Host: site.ru

And these lines to .htaccess:

# Gluing the domain with www and without
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www.site.ru
RewriteRule ^(.*)$ http://site.ru/$1 [R=301,L]

# Gluing duplicate pages
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
RewriteRule ^index\.php$ http://site.ru/ [R=301,L]

If you use other ways of eliminating duplicates, know how to improve the above, or simply have something to say on the topic, write; I will be waiting in the comments.

Most robots are well behaved and do not create any problems for site owners. But if a bot is written carelessly, or "something goes wrong", it can create a significant load on the site it crawls. By the way, spiders do not get into the server like viruses at all: they simply request the pages they need remotely (essentially they are analogues of browsers, but without the page display function).

Robots.txt: the User-agent directive and search engine bots

Robots.txt has a fairly simple syntax, described in detail, for example, in the Yandex and Google help. It usually specifies which search bot the directives that follow are meant for: the name of the bot ("User-agent"), allowing ("Allow") and prohibiting ("Disallow") directives, and the actively used "Sitemap", which tells search engines where the sitemap file is located.

The standard was created quite a long time ago, and some things were added later. There are directives and formatting rules that only the robots of certain search engines understand. In RuNet, mostly only Yandex and Google matter, so it is their help pages on composing robots.txt that you should study especially carefully (I gave the links in the previous paragraph).

For example, it used to be useful to tell the Yandex search engine which of your web project's mirrors is the main one in the special "Host" directive, which only this search engine understands (well, Mail.ru too, since their search is powered by Yandex).
True, at the beginning of 2018 Yandex cancelled Host, and its function, as in other search engines, is now performed by a 301 redirect.

Even if your resource has no mirrors, it is useful to indicate which spelling variant of the address is the main one.

Now let's talk a little about the syntax of this file. Directives in robots.txt have the following form:

<field>:<space><value><space>
<field>:<space><value><space>

Correct code must contain at least one "Disallow" directive after each "User-agent" entry. An empty file implies permission to index the whole site.

User-agent

The "User-agent" directive must contain the name of a search bot. With it you can set up behavior rules for each specific search engine (for example, create a ban on indexing a particular folder for Yandex only). An example of a "User-agent" entry addressed to all bots visiting your resource looks like this:

User-agent: *

If you want to set conditions in "User-agent" for a single bot only, for example Yandex, you need to write this:

User-agent: Yandex

The names of search engine robots and their role in the robots.txt file

Each search engine's bot has its own name (for example, Rambler's is StackRambler). Here is a list of the best-known ones:

Google (http://www.google.com): Googlebot
Yandex (http://www.ya.ru): Yandex
Bing (http://www.bing.com/): bingbot

Large search engines sometimes have, in addition to their main bots, separate instances for indexing blogs, news, images and so on. You can learn a lot about the varieties of bots in the help sections of Yandex and Google.

How to handle this? If you need to write a prohibiting rule that all types of Google robots must obey, use the name Googlebot, and all of this search engine's other spiders will obey it too. However, you can also restrict only, say, the indexing of images by specifying the Googlebot-Image bot as the User-agent. Right now this may not be very clear, but with examples, I think, it will become easier.

Examples of using the Disallow and Allow directives in robots.txt

Here are a few simple examples of directive usage with an explanation of what each one does.

1. The code below allows all bots (the asterisk in User-agent points to them) to index all content without any exceptions. This is set by the empty Disallow directive.

User-agent: *
Disallow:

2. The following code, on the contrary, completely forbids all search engines from adding any pages of this resource to the index. Disallow with "/" in the value field does this.

User-agent: *
Disallow: /

3. In this case, all bots are forbidden to view the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory).

User-agent: *
Disallow: /image/

4. To block a single file, it is enough to write its absolute path:

User-agent: *
Disallow: /katalog1/katalog2/private_file.html

Running a little ahead, I will say that it is easier to use the asterisk symbol (*) so as not to write out the full path:

Disallow: /*private_file.html

5. The example below forbids the "image" directory, as well as all files and directories whose names begin with the characters "image", i.e.
files: "image.htm", "images.htm", catalogs: "Image", " images1 "," Image34 ", etc.): User-Agent: * Disallow: / image The fact is that by default, at the end of the record, an asterisk is meant that replaces any characters, including their absence. Read about it below.</li><li>Via <b>allow</b>we allow access. Well complements disallow. For example, this is the condition of the search robot Yandex we prohibit to dig (index) everything except the web brings, the address of which begins with / CGI-BIN: User-Agent: Yandex All: / CGI-BIN DISALLOW: / <p>Well, or such an obvious example of using Alla and Disallow's bundles:</p><p>User-Agent: * Disallow: / Catalog Allow: / Catalog / Auto</p></li><li>When describing the paths for Allow-Disallow directives, you can use symbols <b>"*" and "$"</b>, specifying, thus defined logical expressions. <ol><li>Symbol <b>"*"(star)</b> Means any (including an empty) sequence of characters. The following example prohibits all search engines to index files with the ".php" extension: User-Agent: * Disallow: * .php $</li><li>Why is needed at the end <b>$ sign (dollar)</b>? The fact is that by the logic of the formation of the Robots.txt file, at the end of each directive, the default asterisk (it is not, but it seems to be). For example, we write: Disallow: / Images <p>Implying that this is the same as:</p><p>Disallow: / Images *</p><p>Those. This rule prohibits the indexation of all files (webons, pictures and other file types) the address of which begins with / images, and then everything is accomplished (see example above). So here <b>symbol $.</b> It simply cancels this default (unprofitable) asterisk at the end. For example:</p><p>Disallow: / Images $</p><p>It prohibits only the indexing of the file / images, but not /images.html or /images/primer.html. Well, in the first example, we forbidden indexing only files ending on PHP (having such an extension) so that nothing superfluous is hooked:</p><p>Disallow: * .php $</p></li> </ol></li> </ol><li>In many engines, users (human-understandable urlas), while the system generated by the system, have a question mark "?" in the address. This can use and write such a rule in robots.txt: User-Agent: * Disallow: / *? <p>The asterisk after the question mark suggests itself, but she, as we figured out just above, is already implied at the end. Thus, we can prohibit indexation of search pages and other service pages created by the engine, to which the search robot can reach. It will not be superfluous, because the question mark most often CMS is used as a session identifier, which can lead to the index of the duplicate pages.</p></li><h2>Directives Sitemap and Host (for Yandex) in robots.txt</h2><p>In order to avoid unpleasant problems with site mirrors, it was used to be recommended to add a HOST directive to Robots.txt, which pointed the Yandex bot on the main mirror.</p><h3>Host Directive - Indicates the main site mirror for Yandex</h3><p>For example, before, if you <b>not yet switched to secure protocol</b>, Pointing to Host it was necessary not complete Ural, but a domain name (without http: //, i.e.en). If you have already switched to HTTPS, you will need to specify a complete ul (type https://myhost.ru).</p><blockquote><p>A wonderful tool for dealing with duplicates of content - the search engine will simply not index the page if another ul is registered in Canonical. 
For example, for the pagination pages of my blog, canonical points to https://site.ru/ and no problems with duplicated titles should arise.

<link rel="canonical" href="https://site.ru/" />

But I am getting distracted...

If your project is built on any engine, content duplication will occur with high probability, which means you need to fight it, including with a ban in robots.txt and, especially, with the meta tag, because Google may well ignore a robots.txt ban, but it cannot ignore the meta tag.

For example, in WordPress, pages with very similar content can get into the search engine index if indexing is allowed for the contents of categories, tag archives and date archives. But if, using the robots meta tag described above, you create a ban for the tag archive and the date archive (you can keep the categories but forbid indexing of their contents), content duplication will not arise. How to do this is described a little further on (in the material about the All in One SEO Pack plugin).

Summing up, the robots.txt file is meant for setting global rules that deny access to entire site directories, or to files and folders whose names contain specified characters (a mask). Examples of such bans can be seen a little higher up.

Now let's look at specific examples of robots.txt intended for different engines: Joomla, WordPress and SMF. Naturally, the three variants created for different CMSs will differ from one another considerably, if not dramatically. True, they will all have one thing in common, and that point is connected with the Yandex search engine.

Since Yandex carries considerable weight in RuNet, you have to take all the nuances of its work into account, and here the Host directive helps us. It explicitly tells this search engine the main mirror of your site.

It is advised to use a separate User-agent block intended for Yandex only (User-agent: Yandex). This is because the other search engines may not understand Host, and its inclusion in a User-agent record intended for all search engines (User-agent: *) could therefore lead to negative consequences and incorrect indexing.

How things really stand is hard to say, since search algorithms are a thing in themselves, so it is better to do as advised. But in that case you have to duplicate under User-agent: Yandex all the rules you set for User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, you thereby allow Yandex to go anywhere and drag everything into the index.

Robots.txt for WordPress

I will not give the example file that the developers recommend; you can look at it yourself. Many bloggers do not restrict the Yandex and Google bots at all in their walks through the innards of the WordPress engine. Most often on blogs you will find a robots.txt filled in automatically by a plugin.

But, in my opinion, one should still help the search engines in the difficult task of separating the wheat from the chaff. First, indexing this garbage takes a lot of the Yandex and Google bots' time, and they may not have any left for adding pages with your new articles to the index.
Second, bots crawling through the engine's service files create an additional load on your host's server, which is not good.

You can look at my version of this file. It is old and has not changed in a long time, but I try to follow the principle of "don't fix what isn't broken", and it is up to you to decide whether to use it, make your own, or borrow from someone else. Until recently I also had a ban on indexing pagination pages in it (Disallow: */page/), but I removed it, relying on canonical, which I wrote about above.

In general, the one and only correct file for WordPress probably does not exist. You can, of course, put any requirements you like into it, but who says they will be correct. There are plenty of "perfect" robots.txt variants on the web.

I will give two extremes:

1. You can find a mega-file with detailed explanations (the # symbol marks comments, which are better removed from the real file):

User-agent: *       # general rules for robots, except Yandex and Google,
                    # because the rules for them are given below
Disallow: /cgi-bin  # folder on the hosting
Disallow: /?        # all query parameters on the main page
Disallow: /wp-      # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: /wp/      # if there is a /wp/ subdirectory where the CMS is installed
                    # (if not, the rule can be deleted)
Disallow: *?s=      # search
Disallow: *&s=      # search
Disallow: /search/  # search
Disallow: /author/  # author archive
Disallow: */trackback  # trackbacks, notifications in the comments about an open
                       # link to an article appearing
Disallow: */feed    # all feeds
Disallow: */rss     # RSS feed
Disallow: */embed   # all embeds
Disallow: */wlwmanifest.xml  # the Windows Live Writer manifest XML file
                             # (if you do not use it, the rule can be deleted)
Disallow: /xmlrpc.php  # the WordPress API file
Disallow: *utm=     # links with UTM tags
Disallow: *openstat=  # links with OpenStat tags
Allow: */uploads    # open the folder with uploaded files

User-agent: Googlebot  # rules for Google (comments are not duplicated)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Allow: */uploads
Allow: /*/*.js      # open JS scripts inside /wp- (/*/ for priority)
Allow: /*/*.css     # open CSS files inside /wp- (/*/ for priority)
Allow: /wp-*.png    # images in plugins, the cache folder, etc.
Allow: /wp-*.jpg    # images in plugins, the cache folder, etc.
Allow: /wp-*.jpeg   # images in plugins, the cache folder, etc.
Allow: /wp-*.gif    # images in plugins, the cache folder, etc.
Allow: /wp-admin/admin-ajax.php  # used by plugins so that JS and CSS are not blocked

User-agent: Yandex  # rules for Yandex (comments are not duplicated)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign  # Yandex recommends not closing such pages
                    # from indexing but removing the tag parameters instead;
                    # Google does not support such rules
Clean-Param: openstat  # similarly

# Specify one or more Sitemap files (no need to duplicate them for each User-agent).
# Google XML Sitemap creates two sitemaps, as in the example below.
Sitemap: http://site.ru/sitemap.xml
Sitemap: http://site.ru/sitemap.xml.gz

# Specify the main site mirror, as in the example below (with or without www; if you use https,
# include the protocol; if you need to specify a port, do so). Only Yandex and Mail.Ru
# understand the Host command; Google does not take it into account.
Host: www.site.ru

2. Or you can take an example of minimalism:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Host: https://site.ru
Sitemap: https://site.ru/sitemap.xml

The truth probably lies somewhere in the middle. Do not forget to set the robots meta tag for the "extra" pages, for example with the help of a good plugin, which will also help you set up canonical.

Robots.txt for Joomla

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

In principle, almost everything is taken into account here and it works well. The only thing is to add a separate User-agent block in order to insert the Host directive, which defines the main mirror for Yandex, and also to specify the path to the Sitemap file.

Therefore, in its final form, a correct robots.txt for Joomla should, in my opinion, look like this:

User-agent: Yandex
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /layouts/
Disallow: /cli/
Disallow: /bin/
Disallow: /logs/
Disallow: /components/
Disallow: /component/
Disallow: /component/tags*
Disallow: /*mailto/
Disallow: /*.pdf
Disallow: /*%
Disallow: /index.php
Host: vash_sait.ru (or www.vash_sait.ru)

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /layouts/
Disallow: /cli/
Disallow: /bin/
Disallow: /logs/
Disallow: /components/
Disallow: /component/
Disallow: /*mailto/
Disallow: /*.pdf
Disallow: /*%
Disallow: /index.php
Sitemap: http://the-path-to-your-XML-sitemap

Yes, also note that the second variant contains Allow directives permitting the indexing of styles, scripts and images. This is written specifically for Google, because its Googlebot sometimes complains that the indexing of these files, for example in the folder of the active theme, is forbidden in robots.txt.
It even threatens to lower the site in the rankings because of this.

Therefore, all of this is allowed for indexing in advance using Allow. The same, by the way, was done in the example file for WordPress.

Good luck to you! See you soon on the pages of this blog.

Hi, friends! Judging by my statistics, more than half of webmasters and optimizers close duplicate pages from indexing not quite correctly. As a result, garbage documents stay in the search results longer, and sometimes the pages simply remain indexed and cannot be removed at all.

Below I will point out the basic mistakes made when trying to remove duplicates, as well as the correct prohibition methods for popular types of documents.

I will not dwell on the questions "Why are duplicates bad?" and "How do I find them?"; the answers can be found in a separate post. Today the focus is on how correct one method or another is for particular types of pages.

We are all human and can make mistakes. Fortunately, in this area they are usually not critical. I have identified 4 main reasons why duplicates are closed incorrectly:

1. Using several closing methods at once. Sometimes a webmaster closes a page in robots.txt, adds meta name="robots" to its head and, for good measure, rel="canonical" below that. But when a document is forbidden in robots.txt, the search spider cannot crawl its contents at all.

2. Using only one method, robots.txt. If 5-7 years ago this was practically the only way to remove duplicates, today it is neither the most effective nor the most universal.

3. Writing prohibition rules that are too general and affect normal documents. In my opinion, it is better to write two specific rules for specific parameters than one general rule that can potentially hit quality content.

4. Using a method that is not suitable for the given type of document (for example, a redirect for sorting pages).

I cannot say that an optimizer who uses only robots.txt should be fired on the spot. Much depends on the resource and the particulars of its indexing; the prohibition method has to be chosen based directly on the specifics of the project.

Now to the correct methods that let you remove duplicates and "garbage" from the search. The methods are listed in order of priority (1 is the highest).

1. Removal

If possible, first of all simply delete the unnecessary documents: when there is no material, there is nothing to prohibit. This applies to:

1. online store categories without products that will not be restocked;

2. the tag system.
An exception is tag pages that are set up properly, are genuinely interesting to visitors and have a high-quality title, description and a short text of their own: in other words, not just a list of related materials but a truly complete page.

3. infinitely nested URLs. This is when you can append an endless (or finite) number of nested segments to a URL: for example, the document site.ru/post/ may also be available at site.ru/post/post/post/. Preventing such a structure from being created has to be solved at the server and/or CMS level (such URLs must return a 404 error).

2. 301 redirect

All the "garbage" that cannot be deleted should be redirected to the main documents. A 301 redirect is used for this. What types of pages does this method suit?

1. www and non-www versions;

2. URLs with and without a trailing slash;

3. RSS feeds;

4. URLs with parameters that do not change the content;

5. attachments (attached files);

6. products available at different URLs (usually because they belong to several categories);

7. duplicates of the main page: site.ru/index.php, domen.ru/home.html and so on;

8. print versions (if they are linked only in the code);

9. the first pagination page. Some CMSs create duplicates at the URLs site.ru/category/ and site.ru/category/page/1/: the content of the first pagination page usually matches the content of the category, but the URLs differ.

3. The meta name="robots" meta tag

When a document cannot be deleted or redirected, meta name="robots" should come into play rather than the more popular robots.txt. This is confirmed in practice by my experiment, and in theory by the Google help.

This method suits the following pages:

1. sorting pages (by price, popularity and so on);

2. pagination;

3. pages with parameters (when the content changes);

4. filters (when they are not set up in the "quality" way, like the tags described above);

5. print versions;

6. pages of the CMS and its add-ons (plugins, hooks);

7. site search;

8. user profiles;

9. a mobile version located on a subdomain.

In general, meta name="robots" should be used in all cases where pages are not wanted in the search engine index but are wanted for visitors. Two clarifications apply here:

1) Pages to which this meta tag is added must not be closed from indexing in robots.txt.

2) On many sites some materials are reachable in only one way. For example, product cards may be reachable only from pagination pages inside categories (the sitemap does not count). If you use the standard prohibiting code:

<meta name="robots" content="noindex, nofollow"/>

it will be harder for the search robot to get to the product cards.
In that case you should specify the following attribute:

<meta name="robots" content="noindex, follow"/>

The search spider will then not include the document in the index, but it will follow the internal links and index the content behind them.

4. The rel="canonical" attribute of the link element

If for some reason meta name="robots" cannot be used, the well-known rel="canonical" attribute comes to the rescue. It points the indexing robot to the main (canonical) page. To use it, add the following code, with the URL of the canonical document, inside the head tag of the non-canonical documents:

<link rel="canonical" href="http://site.ru/url-osnovnogo-dokumenta/" />

This attribute is less preferable, because search algorithms treat it only as a recommendation (which is why meta name="robots" has priority). That is why, in my case, such pages kept appearing in and disappearing from the Yandex index.

The attribute can be suitable for prohibiting the indexing of the following types of pages:

1. sorting pages;

2. pages with parameters in the URL;

3. pagination pages (canonical points to the first or the base page, for example the category);

4. print versions.

5. Robots.txt

The previously most popular way of prohibiting indexing is only in 5th place in my ranking. It still works well in Yandex but carries little weight in Google; because of this lack of universality it ends up in this position.

It makes sense to prohibit something in robots.txt only after all the previous techniques have been applied and some "garbage" still could not be dealt with. The pages usually left for it are:

1. pages with parameters;

2. pages of the CMS and its plugins;

3. AMP pages (only for the Yandex robot, while it does not support this format);

4. a mobile version on a separate subdomain (a full ban plus a Host entry pointing to the main project).

6. Ajax

Sometimes you need to close from indexing not a whole page but only part of it. Ajax helps with this; for example, I have long used it on my blog to hide individual blocks.