What is robots.txt?

Robots.txt is a file instructing search engine crawler bots and other bots about the pages they can and cannot crawl on a website. The file is named robots.txt and is located in the site’s root folder. 

The robots.txt file helps protect sensitive information, reduce server load, and influence how search engines crawl a site. This is how it appears on a site.

Sample of a robots.txt file of a website
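
A minimal illustrative sketch of such a file (the exact rules vary from site to site, and the path below is hypothetical):

User-agent: *
Disallow: /admin-login

Sitemap: https://yourdomain.com/sitemap_index.xml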

It is essential to note that while search engine crawler bots generally obey the rules in the robots.txt file, other crawler bots may crawl those URLs even if you have instructed them not to. 

How to Locate Your robots.txt File

The robots.txt file should be located in your site’s root folder. You can access it at yourdomain.com/robots.txt. Just replace yourdomain.com with your domain name and enter it into your address bar. It will reveal your robots.txt file. 

In the case of a subdomain like blog.yourdomain.com, the robots.txt file will be located at blog.yourdomain.com/robots.txt

Example of a robots.txt File

The rules in a robots.txt file usually contain two components: user agents and directives. For example, the rule below instructs all crawlers not to crawl any URL on the site. The asterisk * is called a wildcard character, indicating that a rule is directed at all crawler bots.

User-agent: *
Disallow: /

You may direct the rule at a specific crawler by setting the user agent to the crawler’s name. For example, the rule below instructs Googlebot not to crawl any URL on the site. This means other bots can crawl the URLs on the site. 

User-agent: Googlebot
Disallow: /

The directive below instructs all web crawlers not to crawl the URL at yourdomain.com/admin-login

User-agent: *
Disallow: /admin-login

The rule below instructs all crawlers not to crawl URLs in the /recipes/ subdirectory. So, crawlers will not crawl URLs like yourdomain.com/recipes/ and yourdomain.com/recipes/vanilla-cake.

User-agent: *
Disallow: /recipes/

List of Some Common robots.txt User-Agents

Robots.txt rules typically consist of a user agent and one or more directives. The user agent specifies the web crawlers for which the directive is intended. It is declared using the User-agent string.

Some common user agents include:

  • *: The rule is directed at all crawler bots
  • Googlebot: The rule is directed at Google’s crawler bot
  • Googlebot-Image: The rule is directed at Google Images crawler bot
  • Bingbot: The rule is directed at Bing’s crawler bot
  • Slurp: The rule is directed at Yahoo’s crawler bot
  • YandexBot: The rule is directed at Yandex’s crawler bot
  • GPTBot: The rule is directed at OpenAI’s crawler bot
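
For instance, a site owner who wants to keep OpenAI’s crawler off the entire site, while leaving other bots unaffected, could direct a rule at GPTBot (an illustrative example):

User-agent: GPTBot
Disallow: /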

List of Some Common robots.txt Directives

Directives are the commands and instructions that inform web crawlers about what part of the site they can and cannot crawl and how they are expected to crawl it. Some common robots.txt directives include:

1 Disallow

The disallow directive tells web crawlers which URLs they are not allowed to crawl. The disallow rule is paired with a user agent. For example, the rule below tells Bingbot not to crawl any URL on the site. 

User-agent: Bingbot
Disallow: /

2 Allow

The allow directive informs web crawlers of URLs they may crawl within a path blocked by a disallow directive. The allow directive is only used together with a disallow directive. 

The allow directive is used to make exceptions. This is helpful when you want search engines to crawl specific URLs within a group of URLs you already blocked them from crawling using the disallow directive. 

The allow directive is used with a User-agent string and the Disallow directive. For example, the rule below disallows all crawlers from crawling URLs within the /recipes/ directory. However, it allows them to crawl the URL at /recipes/vanilla-cake.

User-agent: *
Disallow: /recipes/
Allow: /recipes/vanilla-cake

3 Sitemap

The sitemap directive specifies the location of the site’s XML sitemap. It does not include a user agent and is declared using the Sitemap string.

Sitemap: https://yourdomain.com/sitemap_index.xml

4 Crawl-Delay

The crawl-delay directive recommends the rate at which a site owner wants the crawler bot to crawl their site. It is used with a User-agent and is specified using Crawl-delay.

User-agent: *
Crawl-delay: 10

It is crucial to know that not all crawlers obey this directive. Additionally, those that do may interpret it differently. For example, the rule above may instruct crawlers to wait 10 seconds before making another request or to access a URL only once every 10 seconds.

Robots.txt Best Practices

An incorrectly configured robots.txt file could have serious consequences for your SEO. To reduce the chances of that happening, follow the best practices listed below.

1 Do Not Create Conflicting Rules

It is easy to create conflicting rules in a robots.txt file. This is particularly common in files that contain multiple complex rules. For example, you may unknowingly both allow and disallow crawlers to crawl the same URL. 

Such contradictory rules confuse crawlers and could cause them to behave differently from what you intended. Some crawlers resolve conflicting directives by obeying the first matching rule. Others follow the most specific rule (the one with the longest matching path) or the one they consider less restrictive.
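
For example, a hypothetical file like the one below both disallows and allows the same path, so different crawlers may treat URLs under /blog/ differently:

User-agent: *
Disallow: /blog/
Allow: /blog/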

2 Do Not Use the robots.txt File to Hide Your Pages

The primary purpose of the robots.txt file is to manage the behavior of the crawler bots that access a site. It is not intended to hide pages from search engines. So, avoid using it to hide the URLs you do not want search engines to index.

Google can still find those pages and could display them on search results pages if other pages link to them. Instead, use the noindex meta tag to specify URLs you do not want search engines to index.
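
A minimal sketch of such a tag, placed in the <head> of a page you want kept out of search results (note that crawlers must be able to crawl the page to see the tag):

<meta name="robots" content="noindex">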

3 Include Your Sitemap in Your robots.txt File

While search engines can find your content without a sitemap, it is good practice to include one on your site. When you do, add the sitemap’s URL to your robots.txt file. 

Search engines will typically check specific locations on your site to see if it contains a sitemap. However, including it in your robots.txt file speeds up the discovery process and informs them of its location. 
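
For instance, the sitemap reference can simply sit alongside your other rules in the same robots.txt file (an illustrative sketch using a hypothetical sitemap URL):

User-agent: *
Disallow: /recipes/

Sitemap: https://yourdomain.com/sitemap_index.xml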

4 Use a Separate robots.txt Rule for Your Subdomain

Robots.txt files only tell search engines how to crawl a specific domain or subdomain. If you have a subdomain, you have to create a separate robots.txt file for it. 

For example, if you have a domain at yourdomain.com and a subdomain at shop.yourdomain.com, each must have its own robots.txt file, as the robots.txt file of yourdomain.com will not control the crawling behavior for the URLs at shop.yourdomain.com. 
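
For instance, the two files might look like this (an illustrative sketch; the rules and paths are hypothetical):

# https://yourdomain.com/robots.txt
User-agent: *
Disallow: /admin-login

# https://shop.yourdomain.com/robots.txt
User-agent: *
Disallow: /checkout/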

5 Use the Correct Title Case for Your User Agents

User agents are case-sensitive. For example, Google’s crawler bot is called Googlebot. You must use this exact name, with a capital G and the rest in lowercase, when directing a rule at it.

Any rule you declare using user agents like googlebot, googleBot, or GoogleBot will not work. It must be Googlebot. So, use the crawler bot’s name exactly as specified by its developer. 

6 Understand Which User-Agents Search Engines Obey

Most search engines have multiple crawlers for crawling different types of content, but they do not necessarily obey rules addressed to every one of these crawlers. 

For example, Google has multiple crawlers and obeys rules directed at user agents such as Googlebot and Googlebot-Image. So, before creating rules, check the documentation to confirm which user agents a search engine obeys. 
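
As an illustration, a rule aimed specifically at Google’s image crawler uses its documented user agent name (the path here is hypothetical):

User-agent: Googlebot-Image
Disallow: /private-images/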

7 Understand Which Directives Search Engines Obey

Search engines do not obey all directives. For example, Google does not obey the crawl-delay directive. So, review the search engine guidelines before creating your directives. 
