What is robots.txt?
Robots.txt is a file instructing search engine crawler bots and other bots about the pages they can and cannot crawl on a website. The file is named robots.txt and is located in the site's root folder.
The robots.txt file helps keep crawlers away from parts of the site you do not want crawled, reduce server load, and influence how search engines crawl a site.
It is essential to note that while search engine crawler bots generally obey the rules in the robots.txt file, other bots may ignore them and crawl the URLs anyway.
How to Locate Your robots.txt File
The robots.txt file should be located in your site’s root folder. You can access it at yourdomain.com/robots.txt. Just replace yourdomain.com with your domain name and enter it into your address bar. It will reveal your robots.txt file.
In the case of a subdomain like blog.yourdomain.com, the robots.txt file will be located at blog.yourdomain.com/robots.txt.
Example of a robots.txt File
The rules in a robots.txt file usually contain two components: user agents and directives. For example, the rule below instructs all crawlers not to crawl any URL on the site. The asterisk (*) is a wildcard character, indicating that the rule is directed at all crawler bots.
User-agent: *
Disallow: /
You may direct the rule at a specific crawler by setting the user agent to the crawler’s name. For example, the rule below instructs Googlebot not to crawl any URL on the site. This means other bots can crawl the URLs on the site.
User-agent: Googlebot
Disallow: /
The directive below instructs all web crawlers not to crawl the URL at yourdomain.com/admin-login.
User-agent: *
Disallow: /admin-login
The rule below instructs all crawlers not to crawl URLs in the /recipes/ subdirectory. So, all crawlers will skip URLs like yourdomain.com/recipes/ and yourdomain.com/recipes/vanilla-cake.
User-agent: *
Disallow: /recipes/
List of Some Common robots.txt User-Agents
Robots.txt rules typically consist of a user agent and one or more directives. The user agent specifies the web crawlers for which the directive is intended. It is declared using the User-agent string.
Some common user agents include the following (a combined example appears after the list):
- * : The rule is directed at all crawler bots
- Googlebot: The rule is directed at Google's crawler bot
- Googlebot-Image: The rule is directed at Google Images' crawler bot
- Bingbot: The rule is directed at Bing's crawler bot
- Slurp: The rule is directed at Yahoo's crawler bot
- YandexBot: The rule is directed at Yandex's crawler bot
- GPTBot: The rule is directed at OpenAI's crawler bot
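As a rough sketch of how these user agents come together in a single file, the example below declares separate groups for GPTBot, Googlebot-Image, and all other crawlers. The /images/private/ path is a hypothetical placeholder.

# Block OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /

# Keep Google Images out of a hypothetical private images folder
User-agent: Googlebot-Image
Disallow: /images/private/

# Let every other crawler crawl everything (an empty Disallow blocks nothing)
User-agent: *
Disallow: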
List of Some Common robots.txt Directives
Directives are the commands and instructions that inform web crawlers about what part of the site they can and cannot crawl and how they are expected to crawl it. Some common robots.txt directives include:
- Disallow
- Allow
- Sitemap
- Crawl-delay
1 Disallow
The disallow directive tells web crawlers which URLs they are not allowed to crawl. The disallow rule is paired with a user agent. For example, the rule below tells Bingbot not to crawl any URL on the site.
User-agent: Bingbot
Disallow: /
2 Allow
The allow directive informs web crawlers of the URLs within a disallowed directory that they are allowed to crawl. It is only used together with a disallow directive.
The allow directive is used to make exceptions. This is helpful when you want search engines to crawl specific URLs within a group of URLs you have already blocked from crawling with the disallow directive.
The allow directive is used with a User-agent string and the Disallow directive. For example, the rule below disallows all crawlers from crawling URLs within the /recipes/ directory but allows them to crawl the URL at /recipes/vanilla-cake.
User-agent: *
Disallow: /recipes/
Allow: /recipes/vanilla-cake
3 Sitemap
The sitemap directive specifies the location of the site's XML sitemap. It does not include a user agent and is declared using the Sitemap string.
Sitemap: https://yourdomain.com/sitemap_index.xml
4 Crawl-Delay
The crawl-delay directive recommends the rate at which a site owner wants crawler bots to crawl their site. It is used with a User-agent and is specified using Crawl-delay.
User-agent: *
Crawl-delay: 10
It is crucial to know that not all crawlers obey this directive. Additionally, those that do may interpret it differently. For example, some crawlers read the rule above as an instruction to wait 10 seconds between requests, while others read it as permission to access only one URL every 10 seconds.
Robots.txt Best Practices
An incorrectly configured robots.txt file could have serious consequences for your SEO. To reduce the risk of this happening, follow the best practices listed below.
1 Do Not Create Conflicting Rules
It is easy to create conflicting rules in a robots.txt file. This is particularly common in files that contain multiple complex rules. For example, you may unknowingly allow and disallow crawlers from crawling the same URL.
Such contradictory rules confuse crawlers and could cause them to behave differently from how you intended. Some crawlers resolve conflicting directives by obeying the first matching rule. Others follow the most specific rule, typically the one with the longest matching path, or the one they consider least restrictive.
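As a minimal illustration, the rules below both disallow and allow the same hypothetical /blog/ path, so different crawlers may resolve them differently:

User-agent: *
Disallow: /blog/
Allow: /blog/

A crawler that applies the least restrictive rule would crawl URLs under /blog/, while one that simply obeys the first matching rule would skip them.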
2 Do Not Use the robots.txt File to Hide Your Pages
The primary purpose of the robots.txt file is to manage the behavior of the crawler bots that access a site. It is not intended to hide pages from search engines. So, avoid using it to hide the URLs you do not want search engines to index.
Google can still find those pages and could display them in search results if other pages link to them. Instead, use the noindex meta tag to specify URLs you do not want search engines to index.
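For reference, a standard robots noindex meta tag placed in a page's <head> section looks like this:

<!-- Tells search engines not to include this page in their index -->
<meta name="robots" content="noindex">

Keep in mind that crawlers can only see this tag if they are allowed to crawl the page, so a URL carrying a noindex tag should not also be blocked in robots.txt.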
3 Include Your Sitemap in Your robots.txt File
While search engines can find your content without a sitemap, it is good practice to include one on your site. When you do, add the sitemap’s URL to your robots.txt file.
Search engines will typically check specific locations on your site to see if it contains a sitemap. However, including it in your robots.txt file speeds up the discovery process and informs them of its location.
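For example, a small robots.txt file that combines crawl rules with a sitemap reference might look like the sketch below, which reuses the sitemap URL from the earlier example alongside an illustrative /recipes/ rule:

User-agent: *
Disallow: /recipes/

Sitemap: https://yourdomain.com/sitemap_index.xml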
4 Use a Separate robots.txt Rule for Your Subdomain
Robots.txt files only tell search engines how to crawl a specific domain or subdomain. If you have a subdomain, you have to create a separate robots.txt file for it.
For example, if you have a domain at yourdomain.com and a subdomain at shop.yourdomain.com, each must have its own robots.txt file, as the robots.txt file of yourdomain.com will not control the crawling behavior for the URLs at shop.yourdomain.com.
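As an illustration, the two hypothetical files below live at different locations and each applies only to its own host (the /checkout/ path is a placeholder):

# https://yourdomain.com/robots.txt (applies only to yourdomain.com)
User-agent: *
Disallow: /admin-login

# https://shop.yourdomain.com/robots.txt (applies only to shop.yourdomain.com)
User-agent: *
Disallow: /checkout/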
5 Use the Correct Title Case for Your User Agents
Treat user agent names as case-sensitive. For example, Google's crawler bot is called Googlebot, with a capital G and the rest in lowercase. Although some crawlers match user agent names case-insensitively, not all of them do, so rules declared with user agents like googlebot, googleBot, or GoogleBot are not guaranteed to work everywhere. To be safe, use the crawler bot's name exactly as its developer specifies it.
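For instance, a rule aimed at Google's main crawler should use the documented name exactly; the /private/ path below is a placeholder:

# Correct: matches the name Google documents for its main crawler
User-agent: Googlebot
Disallow: /private/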
6 Understand Which User-Agents Search Engines Obey
Most search engines operate multiple crawlers for different types of content, and each crawler only obeys rules directed at the user agent tokens it recognizes. For example, Google runs several crawlers, such as Googlebot, Googlebot-Image, and Googlebot-News, and each responds to rules aimed at its own token; rules aimed at names its crawlers do not recognize are ignored. So, before creating rules, check the search engine's documentation and confirm which user agent tokens its crawlers obey.
7 Understand Which Directives Search Engines Obey
Search engines do not obey all directives. For example, Google does not obey the crawl-delay directive. So, review the search engine's guidelines before creating your directives.