On the virtues of the infamous Robots.txt file...

by Alexis Jones
05 December, 2019

A Robots.txt File?

Yes. A robots.txt file. It’s your chance to take control and tell web bots (think webcrawlers) how to crawl the pages of your site. The problem is that webcrawlers, if left to their own spidery devices, may end up crawling EVERYTHING on your site.

On the surface, this might sound like a good thing. But it’s not…

There are many cases where you won’t want webcrawlers to index certain pages of your site, or the very building blocks you used to create the site itself. Case in point: if I search for your business and one of the top results is a template you used to build your site, that makes for an embarrassing user experience.

Which is why it's important for you to understand how the robots.txt file works!

First, let's consider the format of a typical robots.txt file.

The Basic Format

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Example

User-agent: astrobot
Disallow: /how-to-bake-cakes-with-strychnine

User-agent here refers to the webcrawler you're giving instructions to. These will typically be the crawlers for Google, Bing, and Yahoo, but there are many others. In case you're wondering, Google's user-agent is called Googlebot. Once you've named the user-agent, you can tell it what to crawl and what's off limits by using the commands Disallow and Allow. For example, Disallow: /teenage-pictures-of-me-at-prom and Allow: /instagram-feed-from-bali.
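If you'd like to see how a crawler might read rules like these, here is a minimal sketch using Python's standard-library urllib.robotparser module. It reuses the astrobot example; "otherbot" and the /about page are made-up names for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The astrobot example, written as one record. Note that this parser
# treats a blank line as the end of a record, so none appears inside it.
rules = """\
User-agent: astrobot
Disallow: /how-to-bake-cakes-with-strychnine
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# astrobot is told to stay away from the disallowed path...
blocked = parser.can_fetch("astrobot", "/how-to-bake-cakes-with-strychnine")
# ...but may crawl everything else ("/about" is a hypothetical page).
allowed = parser.can_fetch("astrobot", "/about")
# A bot that is not named in the file ("otherbot" is hypothetical)
# has no rules applied to it at all.
unnamed = parser.can_fetch("otherbot", "/how-to-bake-cakes-with-strychnine")

print(blocked, allowed, unnamed)  # False True True
```

Real crawlers each have their own parser, so treat this as a way to sanity-check your rules rather than a guarantee of how any particular bot behaves.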

Want a real world example?

All you have to do is add /robots.txt to the end of your favorite website's domain. For instance, if you like pancakes, you might be an International House of Pancakes fan (and who isn’t a fan of pancakes?). Their robots.txt file can be found at https://www.ihop.com/robots.txt, and looks like this:

User-agent: *
Disallow: /LTO
Disallow: /Initiatives
sitemap: https://www.ihop.com/sitemap.xml

First, the * indicates that the rules that follow apply to ALL webcrawlers. Otherwise, IHOP would name the particular bots they’re targeting with each subset of rules (and there would be separate entries for bots like “msnbot” and “Slurp”, for example).

Second, the Disallow commands state which pages IHOP doesn’t want webcrawlers to crawl. In this case, /LTO and /Initiatives (along with anything whose path begins with them) are off limits.
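To make the first two points concrete, here is a small sketch that feeds the IHOP rules to Python's standard-library parser and asks about a few bots and paths. The paths /LTO/anything and /menu are invented for the example, and other crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

ihop_rules = """\
User-agent: *
Disallow: /LTO
Disallow: /Initiatives
sitemap: https://www.ihop.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ihop_rules.splitlines())

# Because of the *, every bot gets the same answers.
# "/LTO/anything" and "/menu" are hypothetical paths.
for bot in ("Googlebot", "msnbot", "Slurp"):
    for path in ("/LTO", "/LTO/anything", "/Initiatives", "/menu"):
        verdict = "allowed" if parser.can_fetch(bot, path) else "blocked"
        print(f"{bot:10} {path:15} {verdict}")
```

Each Disallow rule is matched against the start of the path, which is why /LTO/anything is blocked along with /LTO itself.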

Lastly, the sitemap line gives webcrawlers a guide to the site and its associated content.

If you're learning all of this for the first time, it can seem like Ancient Greek (actually, modern Greek would be equally confusing to us). We decided that all of the rules might go down easier with a glass of milk if they were turned into a quick robots.txt guide. Read on and, for all of you discerning web developers out there, let us know if you think we're forgetting anything important!

A Quick Robots.txt Rule Guide

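Your rulebook for what webcrawlers should and should not crawl for indexing. Keep in mind that the file is a set of requests, not a lock: well-behaved search engine crawlers follow it, but nothing forces a bot to obey.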

User-Agent

The web crawler you're feeding instructions to. This will usually be a search engine like Google, Bing, or Yahoo. If you're wondering what other options exist, go here: https://www.robotstxt.org/db.html

Allow

Although it wasn't part of the original robots.txt convention, Allow is understood by Googlebot, Bingbot, and most other major crawlers, and it is included in the robots.txt standard (RFC 9309). This command tells a crawler it may crawl a specific page, even when the parent folder has been flagged as not for crawling.
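Here is a hedged sketch of Allow carving an exception out of a blocked folder, again using Python's standard-library parser. The /private folder and both page names are made up. One caveat worth knowing: Python's parser applies the first rule that matches, while Google documents picking the most specific (longest) matching rule. Listing the Allow line first, as below, gives the same answer under either approach.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the /private folder, except for one page in it.
rules = """\
User-agent: *
Allow: /private/press-kit
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The specifically allowed page can be crawled...
press_kit_ok = parser.can_fetch("Googlebot", "/private/press-kit")
# ...while the rest of the folder stays off limits.
diary_ok = parser.can_fetch("Googlebot", "/private/diary")

print(press_kit_ok, diary_ok)  # True False
```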

Disallow

The thing to remember about this command is that each path you want to keep crawlers away from needs its own Disallow line. So if you have 5 URLs you'd like webcrawlers not to crawl for indexing, you'll need 5 Disallow lines in your robots.txt file. Rules match from the start of the path, though, so a single Disallow on a folder covers every page inside it.
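If you keep your blocked paths in a list, the one-line-per-path rule is easy to automate. The helper below is purely illustrative: build_robots_txt is a name we made up, and the five paths are placeholders rather than recommendations.

```python
def build_robots_txt(user_agent, disallowed_paths):
    """Return robots.txt text with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Five hypothetical paths produce five Disallow lines.
paths = ["/cart", "/checkout", "/drafts", "/search", "/tmp"]
text = build_robots_txt("*", paths)
print(text)
```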

Crawl Delay

This is where you can ask a bot to wait a set number of seconds between requests to your site. In the file itself, the command is written Crawl-delay. The thing to remember here is that Googlebot doesn't follow this rule. If you'd like to manually control the crawl rate for Googlebot, you'll need to do so within Google Search Console. More information on this can be found here: https://support.google.com/webmasters/answer/48620?hl=en
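As a rough illustration, this is how a polite crawler written in Python could look up the delay it has been asked to respect. The 10-second value and the bot names are assumptions picked for the example, and crawl_delay needs Python 3.6 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one bot gets a delay, everyone else just gets a Disallow.
rules = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /drafts
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named bot is asked to wait 10 seconds between requests.
bing_delay = parser.crawl_delay("bingbot")
# Bots covered only by the * record have no delay set, so this is None.
other_delay = parser.crawl_delay("otherbot")

print(bing_delay, other_delay)  # 10 None
```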

Sitemap

This is used to call out the location of your XML sitemap. Although only Google, Bing, Yahoo, and Ask will follow this rule, do we really care about any of the other bots? Most of the time, probably not.
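For the curious, Python's standard-library parser can also pull the sitemap location out of a robots.txt file. This sketch reuses the sitemap line from the IHOP file and assumes Python 3.8 or newer, which is when site_maps was added.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /LTO
sitemap: https://www.ihop.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Returns a list of sitemap URLs, or None when the file lists no sitemap.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://www.ihop.com/sitemap.xml']
```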

That's really all there is to it...

Understanding the importance of defining your user-agent, what you're allowing it to see, whether or not there's a delay AND the location of your sitemap will pretty much get you to the finish line.

But wait... we forgot one key ingredient to this tutorial cookie. Wondering how you add a robots.txt file to your site? That's a logical question. Very concrete, sequential. If you were thinking it, you must be a reasonable person.

We like reasonable people. Which is why we've answered the question below :).

Great…how do I add a robots.txt file to my site?

The creation of a robots.txt file is pretty simple. Open up the text editor of your choice and follow the rules above. Save it as “robots.txt” and you’re golden. The real problem people have is putting it in the right place on their website. This is key, because if you don’t put it in the right place, webcrawlers will assume you don’t have one.

So where is this magical land? It’s the main (root) directory of your domain. If this is confusing, we understand. As a rule of thumb, just remember it should ultimately resolve to this URL: www.yourcoolbananaswebsite.com/robots.txt. If it’s not placed here, all of your hard learning and work will be for naught.
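If it helps to see that rule of thumb spelled out, the sketch below works out where crawlers will look for robots.txt given any page on a site. The function name robots_url and the banana bread page are our own inventions for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url(
    "https://www.yourcoolbananaswebsite.com/blog/banana-bread?page=2"
)
print(location)  # https://www.yourcoolbananaswebsite.com/robots.txt
```

However deep the page sits, the answer is always the same spot at the top of the domain, which is why a robots.txt file tucked into a subfolder gets ignored.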

Perhaps we'll cover this step more comprehensively in a future tutorial, but for the purposes of this one, less is more. Besides which, we don't want to confuse anyone. We confuse ourselves enough already when it comes to file directories.

That said, don't hesitate to let us know if you'd like assistance with this.

As always, if you have any questions, we're here for you! We’re happy to help you navigate this landscape to make sure your site is indexed as efficiently as possible. So don't be a stranger, stranger!