As with many things in the world of SEO, your robots.txt file is a small thing that can either do wonders for your site or tank it completely. Optimize it correctly, and you’ll improve your search rankings. Get it wrong, and you’ll waste weeks or months digging yourself out of the SEO hole. Here’s what you need to know.
What is the robots.txt file?
The robots.txt file is a simple file that tells search engine robots how to crawl your site. It’s part of a larger group of protocols that control bot crawling and help manage the way search engine rankings work, as well as how content on the web gets indexed and presented to users.
The file can be configured to address search engines separately or establish rules that apply to them all. It allows or restricts search engines from crawling certain pages or subdirectories on your site using “Allow” and “Disallow” directives. It can also be used to put the brakes on a search engine with a “Crawl-delay” directive, asking it to wait a certain number of seconds between requests, although not every crawler honors it (Googlebot, for one, ignores it).
Each group of rules in the file begins with a “User-agent” line that specifies which user agent is being addressed. The lines that follow specify which pages are disallowed and what crawl delay, if any, the website wants that user agent to apply.
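To make that layout concrete, here is a minimal sketch in Python that reads a small robots.txt with the standard library's urllib.robotparser module. The /private/ path, the 10-second delay, and the ExampleBot crawler name are assumptions made up for illustration, not values from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a single group that addresses every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" group.
private_ok = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
blog_ok = parser.can_fetch("ExampleBot", "https://www.example.com/blog/")
delay = parser.crawl_delay("ExampleBot")

print(private_ok, blog_ok, delay)  # False True 10
```

Python's parser reports the delay it finds in the file; whether a given search engine actually respects it is up to that search engine.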
Why is this important?
The robots.txt file is a surprisingly powerful tool. The delay function, for example, can keep your site from being overwhelmed by constant bot traffic. You can also use it to make the most of a bot’s crawl budget, which is roughly the number of pages on your site that a bot will look at on any given crawl: disallowing low-value pages leaves more of that budget for the pages you want ranked.
If you identify a malicious or unsavory bot, you can disallow it from crawling your site at all (although the worst offenders may choose to ignore your robots.txt file). You can also help keep your private site pages out of public searches by disallowing them, though a disallowed URL can still end up indexed if other sites link to it. Keep in mind that robots.txt is a set of requests rather than a lock, so it is at most a small part of protecting your site from hackers.
How to optimize your robots.txt file for SEO
If you’re just beginning to appreciate how important this file is, then your best first step is to use a robots.txt tester to take a look at your current file. The tester will help you figure out whether your file is configured to do what you want it to do.
To adjust your file, simply open it in any text editor app, such as Notepad, TextMate, or TextEdit. From there, you want to make sure you have disallowed these sorts of pages:
- User login pages
- Sensitive or personal data
- Duplicate content
- Testing pages
- Site-search pages
- Checkout pages
If you use WordPress, some of these pages, like login, may automatically be disallowed, and you’ll be able to see that in your robots.txt file. As an example, a “User-agent: *” line followed by a “Disallow” line naming your site testing folder will keep search engines from crawling any page in that folder.
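A rule of that kind might look like the two lines embedded in the sketch below, which also checks the rule with Python's urllib.robotparser. The folder name /testing/ and the page names are assumptions; substitute whatever your site actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: the folder name /testing/ is an assumption.
TESTING_RULES = """\
User-agent: *
Disallow: /testing/
"""

def is_allowed(rules, user_agent, url):
    """Return True if the given rules let user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(TESTING_RULES, "Googlebot", "/testing/new-layout.html"))  # False
print(is_allowed(TESTING_RULES, "Bingbot", "/testing/new-layout.html"))    # False
print(is_allowed(TESTING_RULES, "Googlebot", "/about.html"))               # True
```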
Search engine bots interpret the asterisk to apply to all search engines. If you want to keep a particular search engine from crawling a page, simply name that engine’s user agent on the “User-agent” line in place of the asterisk.
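As a sketch of the single-engine case, the rules below name Googlebot instead of using the asterisk. Googlebot is chosen only as an illustration, and the /testing/ path is again an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule that addresses one crawler by name.
SINGLE_BOT_RULES = """\
User-agent: Googlebot
Disallow: /testing/
"""

parser = RobotFileParser()
parser.parse(SINGLE_BOT_RULES.splitlines())

# Only the named crawler is blocked; others have no matching group.
googlebot_ok = parser.can_fetch("Googlebot", "/testing/new-layout.html")
other_ok = parser.can_fetch("Bingbot", "/testing/new-layout.html")

print(googlebot_ok, other_ok)  # False True
```

Because no group applies to any other crawler here, everything remains open to them by default.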
Once you have your robots.txt file just the way you want it, be sure to double-check it for typos and run it through the tester site again to make sure everything is correct. Don’t forget to re-upload the new robots.txt file to your site so changes will take effect.
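If you would rather script part of that double-check, one possible approach is to write down what you expect to be blocked and compare it against the file. The sketch below is not a replacement for a search engine's own tester; the function name find_rule_failures and every path in it are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def find_rule_failures(robots_txt, expectations, user_agent="*"):
    """Return the paths whose allowed/blocked status differs from expectations."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path, should_allow in expectations.items()
        if parser.can_fetch(user_agent, path) != should_allow
    ]

# Hypothetical paths standing in for login, checkout, and site-search pages.
EXPECTATIONS = {
    "/login/": False,
    "/checkout/cart": False,
    "/search/results": False,
    "/blog/first-post": True,
}

GOOD = "User-agent: *\nDisallow: /login/\nDisallow: /checkout/\nDisallow: /search/\n"
# Same file with "Disallow" misspelled on the login rule.
TYPO = "User-agent: *\nDisalow: /login/\nDisallow: /checkout/\nDisallow: /search/\n"

print(find_rule_failures(GOOD, EXPECTATIONS))  # []
print(find_rule_failures(TYPO, EXPECTATIONS))  # ['/login/']
```

The second call shows why typos matter: a misspelled directive is silently skipped, so the login page is left crawlable.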
If you have pages that Google or another search engine has already indexed, you may have to request their removal manually. Additionally, if you have particularly sensitive content that you don’t want malicious bots to find, don’t list it in your robots.txt file: the file is publicly readable, so every path in it is advertised to anyone who looks. Instead, use the meta robots tag to keep those pages out of search results.
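The meta robots tag lives in a page's head, for example as a meta element with name="robots" and content="noindex". A crawler has to be able to fetch the page to see the tag, so the page should not also be disallowed in robots.txt. As a rough sketch, the Python below checks whether a page carries the tag; the sample HTML and the names MetaRobotsFinder and robots_directives are made up for illustration.

```python
from html.parser import HTMLParser

class MetaRobotsFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]

def robots_directives(html):
    """Return the list of meta robots directives found in an HTML string."""
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.directives

# Hypothetical page that asks search engines not to index it.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'

print(robots_directives(PAGE))  # ['noindex', 'nofollow']
```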