Robots.txt and the meta robots tag are two important factors in the SEO of any website. They have to be used wisely, or they may keep you from reaping the SEO benefits you are after. This article is a complete guide to both robots.txt and the meta robots tag.
Before getting into the details of the robots file and tag, you need to understand how search engine bots work:
- Search engine bots find different links online
- Once a bot finds a link, it crawls the page and indexes it in its search result database
- These indexed pages are shown in the search results for matching search queries
The above-mentioned process is called ‘Spidering’.
What is the use of Robots.txt and the Robots Meta Tag?
The robots.txt file tells search engines which pages of the website shouldn’t be crawled, and the robots meta tag tells them which pages shouldn’t be indexed.
How do the robots file and meta tag work?
- When a search engine bot comes to your website for the Spidering process, it looks for the robots file and tag.
- If it finds a robots file or meta tag, it follows the instructions and drops the explicitly specified webpage(s) from the Spidering process.
- If it doesn’t find any such instruction, it crawls all the webpages it finds during the Spidering process.
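The steps above can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration of the idea, not how any particular search engine is implemented; the robots.txt content, the bot name `examplebot` and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given bot is allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

# Links the (hypothetical) bot found while spidering.
found_links = [
    "https://www.example.com/",
    "https://www.example.com/private/report.html",
    "https://www.example.com/blog/post-1",
]

# The disallowed page is dropped; everything else stays in the crawl queue.
crawl_queue = allowed_urls(ROBOTS_TXT, "examplebot", found_links)
print(crawl_queue)
```

Note that `urllib.robotparser` treats each rule as a plain path prefix, so it is suitable for simple rules like the one shown here.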
Meta Robots tag
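<meta name="robots" content="noindex" />
* It will instruct all search bots not to index the specific page in which this tag is put. A bot has to crawl the page to see the tag, so the page must not also be blocked in robots.txt.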
For a specific search engine bot:
<meta name="user-agent name" content="noindex" />
<meta name="googlebot" content="noindex" />
* It will instruct Google's organic search bot not to index the specific page.
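To see how a bot might read these tags, here is a small sketch built on Python's standard `html.parser`. The class and function names are invented for this example, and real search engines may interpret more values than the single `noindex` value checked here.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Look for a noindex value in the generic or bot-specific robots meta tag."""

    def __init__(self, bot_name):
        super().__init__()
        # The generic "robots" tag applies to every bot, plus the bot's own name.
        self.names = {"robots", bot_name.lower()}
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() in self.names:
            tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
            if "noindex" in tokens:
                self.noindex = True

def is_noindex(html, bot_name):
    """Return True if the page asks the given bot not to index it."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return parser.noindex

page = '<html><head><meta name="googlebot" content="noindex" /></head></html>'
print(is_noindex(page, "googlebot"))
print(is_noindex(page, "bingbot"))
```

In this sample page only the `googlebot` tag is present, so the check succeeds for `googlebot` and fails for the hypothetical second bot.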
Robots.txt file
User-agent: [user-agent name]
Disallow: [URL pattern not to be crawled]
Values of different parameters of Robots.txt file:
- User agent name: Specific name of search-bot. The most common search-bot names are mentioned below:
- googlebot = Google organic bots
- googlebot-news = Google news bots
To check all available bot names, please visit http://www.robotstxt.org/db.html
- Disallow: A command that instructs the bot not to crawl the page(s)/sub-folder(s) passed in the value.
- Allow: A command that tells the bot it may access the page(s)/sub-folder(s) passed in the value, for example a single page inside a disallowed folder. It started as an extension used by Googlebot and is now honored by the major search engines.
- URL pattern not to be crawled: Put the path (starting from the site root, for example /folder/page.html) or the pattern of the webpage(s) which need to be dropped during crawling.
- Crawl-delay: You can specify how many seconds a search bot should wait between successive requests. Please be advised that Google ignores this parameter; for Googlebot, crawl rate has been managed through Google Search Console rather than robots.txt.
- Sitemap: Tells Google, Bing, Yahoo and Ask the location of sitemap.xml. This helps the bots find the sitemap and discover each link listed in it.
What can you pass as values in the URL field?
You can pass 2 different types of values in the URL field:
- Exact webpage or folder location
- Pattern: There are 2 different types of patterns, supported by the major search bots such as Google and Bing (but not necessarily by every bot), which are briefed below:
- * (star) matches any sequence of characters
- $ (dollar) matches the end of the URL
Check the complete list of pattern matching values with specific examples from Google here: https://support.google.com/webmasters/answer/6062596?hl=en
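The two pattern characters can be illustrated by translating a robots.txt pattern into a regular expression. The sketch below is a simplified reading of the rules described above, not the matching code of any search engine, and the function names are invented for the example.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with ".*" for each star.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing $, the pattern only has to match the start of the path.
    return re.compile("^" + body + ("$" if anchored else ""))

def path_matches(pattern, path):
    """Return True if the URL path is covered by the robots.txt pattern."""
    return pattern_to_regex(pattern).match(path) is not None

print(path_matches("/*.xml$", "/feeds/sitemap.xml"))         # star + end anchor
print(path_matches("/*.xml$", "/feeds/sitemap.xml?page=2"))  # does not end in .xml
print(path_matches("/profile/", "/profile/settings"))        # plain folder prefix
```

A pattern without `$` behaves as a prefix, which is why a folder rule such as `/profile/` also covers everything inside that folder.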
Example Robots.txt file:
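A file matching the description in the next section might look like the sketch below. The folder names `/profile/` and `/plugins/`, the `/*.xml$` pattern and the sitemap URL are hypothetical stand-ins chosen for illustration. The file is held in a Python string so that a few lines of code can split it into its per-bot blocks; the small parser is a simplification that assumes one User-agent line per block.

```python
# Hypothetical example file; the paths and the sitemap URL are assumptions.
EXAMPLE_ROBOTS_TXT = """\
User-agent: googlebot-image
Disallow: /profile/

User-agent: msnbot
Disallow: /*.xml$
Crawl-delay: 200

User-agent: *
Disallow: /plugins/

Sitemap: https://www.example.com/sitemap.xml
"""

def parse_groups(text):
    """Split robots.txt text into per-bot rule lists plus sitemap URLs."""
    groups, sitemaps, current = {}, [], None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field == "sitemap":
            sitemaps.append(value)  # Sitemap lines are not tied to a bot
        elif current is not None:
            current.append((field, value))
    return groups, sitemaps

groups, sitemaps = parse_groups(EXAMPLE_ROBOTS_TXT)
for bot, rules in groups.items():
    print(bot, rules)
print(sitemaps)
```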
Above robots.txt Decoded
The above-mentioned robots.txt file means the following:
- Instructs Google's image bot not to crawl the profile folder or any of its sub-folders.
- Instructs the MSN bot not to crawl any .xml file and to wait 200 seconds between requests (Crawl-delay values are read in seconds, not milliseconds).
- Instructs all other bots not to crawl the plugin folder.
- The last line indicates the location of the sitemap file.
File format of Robots.txt
You must write the instructions in a plain-text editor such as Notepad, without any rich formatting, and save the file as robots.txt. Make sure to use the exact name ‘robots.txt’ because the file name is case sensitive. With a wrong name, your efforts are wasted because the bots will not find the file.
Location of Robots.txt
Put it in the root folder. It should be accessible at the /robots.txt path directly under your domain.
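As a quick illustration, the location a bot would look at can be derived from any page URL on the site. The helper name below is invented for this sketch, and example.com is only a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.example.com/blog/post-1?ref=home"))
```

Whatever page the bot starts from, the file is looked up at the root of that host, which is why a copy placed in a sub-folder is never read.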
Use cases of Robots.txt:
We are all in the race to get more pages indexed, so you may wonder what the use case of robots can be. Here are a few kinds of pages that you may not want crawled or indexed by search engines:
- Categories and tags
- Internal search results
- Private profile details
- Duplicate content: If you have copied content which you don’t want to get indexed
- Preventing a few file types from getting indexed, such as .xls, .xml, .pdf, etc.
- Staging website
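Rules for cases like these can be written by hand, but a small generator shows the shape they take. This is a sketch only: the function name and the folder names passed to it are hypothetical, and the file-type rules rely on the * and $ patterns, which not every bot supports.

```python
def build_robots_txt(user_agent, blocked_folders, blocked_extensions):
    """Build robots.txt text that blocks the given folders and file types."""
    lines = [f"User-agent: {user_agent}"]
    for folder in blocked_folders:
        # Normalise to a root-relative folder path such as /category/
        lines.append(f"Disallow: /{folder.strip('/')}/")
    for ext in blocked_extensions:
        # Wildcard rule: any path that ends with this extension
        lines.append(f"Disallow: /*.{ext.lstrip('.')}$")
    return "\n".join(lines) + "\n"

# Hypothetical folder names for categories, tags and internal search results.
print(build_robots_txt("*", ["category", "tag", "search"], [".xls", ".pdf"]))
```

For a staging website, a single `Disallow: /` rule for all bots is the usual equivalent, since the whole site should stay out of the crawl.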
The robots file is a crucial and sensitive tool which has to be used with caution. If you are unsure of its usage, please don’t play with it, because it is like playing with fire and mistakes can harm you badly. Below are a few important tips and guidelines to follow:
- Make sure you don’t block important content.
- The file name has to be in all lowercase letters as it is case sensitive.
- The file has to be put in the root folder. If you put it in any other folder, it will be overlooked.
- Search bots read the file in blocks, one block per search-bot instruction. A bot follows the block that names it specifically, and the * block applies only to bots that have no block of their own. So be careful about which block a rule goes in.
- Robots.txt is not a secure method of keeping a webpage from being crawled. In some cases, search bots ignore the instructions. The best way to prevent crawling and indexing is to use other methods such as protecting a page with password access.
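A simple way to follow the first tip is to test your important URLs against the file before publishing it. The sketch below uses Python's standard `urllib.robotparser`; the file content, the bot name `otherbot` and the URLs are made up for illustration. It also shows the block behaviour from the tips above: the bot with its own block is not affected by the rules in the * block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: googlebot has its own block, everyone else uses the * block.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /products/
"""

def blocked_urls(robots_txt, user_agent, important_urls):
    """Return the important URLs that the given bot would not be allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in important_urls if not parser.can_fetch(user_agent, url)]

IMPORTANT = [
    "https://www.example.com/products/widget",
    "https://www.example.com/about",
]

# googlebot follows only its own block, so /products/ stays crawlable for it.
print(blocked_urls(ROBOTS_TXT, "googlebot", IMPORTANT))
# A bot without its own block falls back to the * block and loses /products/.
print(blocked_urls(ROBOTS_TXT, "otherbot", IMPORTANT))
```

An empty list means none of the important pages are blocked for that bot.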
This is all you need to know about robots without going deep into technical details. You can learn more in depth from Google here: https://support.google.com/webmasters/answer/6062596?hl=en
Please note that it is okay not to have a robots file or tag, so don’t hurt yourself with wrong details. Use it only when you need it and you are sure about its usage.