I’m going to talk about setting up a robots.txt file, especially for a self-hosted WordPress blog, to help search engine crawlers index your site properly and to help with search engine optimization. Because of the recent content duplication rules in the Google index, you want to make sure that you’re submitting only one version of each post or page, and that the crawler isn’t trying to index pages it really doesn’t need to index at all: pages like trackbacks, admin, includes, and your RSS feed.

It seems from reading many blogs and postings that not everyone agrees about category pages. I’ve heard some say that they want their category pages indexed, and that it helps them. I think it depends on the site and on how you have been tagging things. On some of my sites I go overboard on tagging, so I end up with a ton of category pages, and I often file one post under several categories. Having a post on its own page, on the front page, and on 5 category pages doesn’t seem like a good plan for SEO, and it looks like an obvious setup for content duplication (in my eyes). So just to be safe, I filter out my category pages too in my robots.txt.

First, I read over on Lorelle on WordPress (link in sidebar) that Google now supports sitemap inclusion, and you can add this line to your robots.txt file:

User-agent: *
Sitemap: http://www.jtpratt.com/sitemap.xml

and you no longer have to submit your sitemap (the crawler will know what to do with it). So this is a new entry for me. I also read that you can tell the Google image crawler where to go (and where not to go) on your site, so I added this:

# The Googlebot-Image is the image bot for google
User-agent: Googlebot-Image
# Allow Everything
Allow: /*

I also saw that you can do the same for the AdSense crawler, which has nothing to do with indexing, but if you use AdSense it would be smart to have this as well:

# This is the ad bot for google
User-agent: Mediapartners-Google
# Allow Everything
Allow: /*
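
A rough way to check how these new entries behave is to feed them to a parser. The sketch below uses Python's standard-library urllib.robotparser and rests on a few assumptions that are not from the posts mentioned above: that parser only does plain prefix matching and does not understand the /* wildcard, so the allow-everything rule is written as Allow: /; the ad bot is written without a trailing asterisk; and the Disallow: /wp- section and the bot name SomeOtherBot are placeholders to show the fallback. It says nothing about how Google's own crawlers read the file.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
Sitemap: http://www.jtpratt.com/sitemap.xml

# The image bot gets its own section and may fetch everything.
User-agent: Googlebot-Image
Allow: /

# Same idea for the ad bot.
User-agent: Mediapartners-Google
Allow: /

# Placeholder general section, only here to show the fallback.
User-agent: *
Disallow: /wp-
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) lists the Sitemap URLs found in the file.
print(parser.site_maps())

# A bot that has its own section uses it instead of the "*" section.
image_url = "/wp-content/uploads/photo.jpg"
for bot in ("Googlebot-Image", "Mediapartners-Google", "SomeOtherBot"):
    print(bot, parser.can_fetch(bot, image_url))
```

In this sketch the two Google bots come back as allowed and the placeholder bot as blocked, because a crawler that finds a section with its own name follows that section rather than the general one.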

So these are all new entries for me. Now, Daily Blog Tips (link in sidebar) has a quick, down and dirty post on a robots.txt file for WordPress. It’s pretty simple:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
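
To see what those three lines actually cover, here is a minimal sketch, again assuming Python's standard-library urllib.robotparser and its plain prefix matching. The paths are made-up examples in the shape a WordPress blog typically produces, not URLs from any real site.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical paths, chosen only to exercise each rule.
paths = [
    "/wp-admin/",
    "/wp-content/uploads/photo.jpg",
    "/feed/",
    "/2007/05/some-post/",
    "/2007/05/some-post/feed/",
    "/2007/05/some-post/trackback/",
]

for path in paths:
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{verdict:8} {path}")
```

Under prefix matching, the first three paths are blocked and the last three are allowed. Two things stand out: the per-post feed and trackback URLs are still crawlable, and the /wp- rule also covers /wp-content/, which is where uploaded images live.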

I kinda like that, but it doesn’t seem to cover everything. Fili’s Tech has an article on SEO for WordPress too, and I like his ideas. So I ended up with something like this:

# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$

# Disallow parsing individual post feeds, categories and trackbacks..
Disallow: /trackback/
Disallow: /feed/
Disallow: /category/
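
The extension rules above rely on wildcards: * stands for any run of characters and a trailing $ pins the rule to the end of the URL. Python's built-in parser ignores both, so the sketch below uses a small helper written for this example (rule_matches is a hypothetical name, not a library function) that follows the wildcard conventions the major crawlers document. The test paths and the /*/feed/ pattern are illustrations only and are not part of the file above.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and anything else is treated as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let '.*' stand in for every '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # With '$' the whole path must match; otherwise only the start.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


checks = [
    ("/*.php$", "/wp-login.php"),
    ("/*.php$", "/index.php?p=12"),
    ("/*.css$", "/wp-content/themes/x/style.css"),
    ("/category/", "/category/seo/"),
    ("/feed/", "/2007/05/some-post/feed/"),
    ("/*/feed/", "/2007/05/some-post/feed/"),
]

for pattern, path in checks:
    print(pattern, path, rule_matches(pattern, path))
```

With this helper, /*.php$ matches /wp-login.php but not /index.php?p=12, because the $ requires the URL to end in .php. Likewise /feed/ only matches the site-wide feed, while a pattern such as /*/feed/ would be needed to reach the feed of an individual post.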

For right or wrong, I have one section for:

User-agent: Googlebot

and another section for:

User-agent: ia_archiver
User-agent: Scooter
User-agent: Atomz
User-agent: FAST-WebCrawler
User-agent: ArchitextSpider
User-agent: Googlebot
User-agent: Slurp.so/1.0
User-agent: Slurp/2.0j
User-agent: Slurp/2.0-KiteHourly
User-agent: Slurp/2.0-OwlWeekly
User-agent: Slurp/3.0-AU
User-agent: UltraSeek
User-agent: MantraAgent
User-agent: Lycos_Spider_(T-Rex)
User-agent: MSNBOT/0.1
User-agent: Gulliver
User-agent: Scrubby/
User-agent: ZyBorg

If you have any comments, improvements, or suggestions – please comment now!