Concentrating on robots.txt specifically for WordPress

I’m going to talk about setting up a robots.txt file specifically for your self-hosted WordPress blog, to help search engine crawlers index your site properly and to help with search engine optimization. Because of the recent duplicate-content rules in the Google index, you want to make sure that you’re submitting only one version of your posts and pages, and that the crawler isn’t trying to index pages it really doesn’t need to at all: pages like trackbacks, admin, includes, and your RSS feed.

From reading many blogs and postings, it seems that not everyone agrees about category pages. I’ve heard some say that they want their category pages indexed, and that it helps them. I think it depends on the site and on how you have been tagging things. On some of my sites I go overboard on tagging, so I end up with a ton of category pages, and I often file a post under many different categories. Having a post on its own page, listed on the front page, and repeated on 5 category pages doesn’t seem like a good plan for SEO, and it is an obvious setup for content duplication (in my eyes). So just to be safe, I filter out my category pages too in my robots.txt.

First, I read over on Lorelle on WordPress (link in sidebar) that Google now supports sitemap inclusion, and you can add these lines to your robots.txt file:
User-agent: *
Sitemap: http://www.jtpratt.com/sitemap.xml

and you no longer have to submit your sitemap by hand (the crawler will know what to do with it). So this is a new entry for me. I also read that you can tell the Google image crawler where to go (and where not to go) on your site, so I added this:

# The Googlebot-Image is the image bot for google
User-agent: Googlebot-Image
# Allow Everything
Allow: /*

I also saw that you can do the same for the AdSense crawler, which has nothing to do with indexing, but if you use AdSense it would be smart to have this as well:

# This is the ad bot for google
User-agent: Mediapartners-Google*
# Allow Everything
Allow: /*

So these are all new entries for me. Daily Blog Tips (link in sidebar) has a quick, down-and-dirty post on a robots.txt file for WordPress. It’s pretty simple:

User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
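
A quick way to see what these three rules actually cover is to feed them to Python's standard `urllib.robotparser` and ask about a few URLs. This is only a rough sketch: the sample paths below are made up for illustration, and the parser does plain prefix matching from the start of the path, which may not mirror every crawler exactly.

```python
from urllib.robotparser import RobotFileParser

# The simple rule set quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
"""

def is_allowed(path, agent="*"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical paths, chosen only to illustrate the matching.
SAMPLE_PATHS = [
    "/wp-admin/",
    "/wp-content/uploads/photo.jpg",
    "/feed/",
    "/2007/06/my-post/",
    "/2007/06/my-post/feed/",
    "/2007/06/my-post/trackback/",
]

for path in SAMPLE_PATHS:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Two things stand out when you run it. `Disallow: /wp-` also blocks everything under `/wp-content/`, including uploaded images. And because each rule is matched from the beginning of the path, `/feed/` and `/trackback/` only cover the site-wide URLs, not the per-post ones.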

I kinda like that, but it doesn’t seem to cover everything. Fili’s Tech has an article on SEO for WordPress too, and I like his ideas. So I ended up with something like this:

# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$

# Disallow parsing individual post feeds, categories and trackbacks..
Disallow: /trackback/
Disallow: /feed/
Disallow: /category/
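
The `*` and `$` in the extension rules are wildcard extensions rather than plain prefixes. Here is a minimal sketch of how such patterns can be matched, assuming Google-style semantics where `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The helper names and the sample paths are my own, not part of any crawler's code.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def rule_matches(pattern, path):
    """Return True if `path` is covered by the rule `pattern`."""
    return rule_to_regex(pattern).match(path) is not None

# Hypothetical paths, for illustration only.
print(rule_matches("/*.php$", "/wp-login.php"))             # True
print(rule_matches("/*.php$", "/index.php?p=12"))           # False: '$' anchors the end
print(rule_matches("/category/", "/category/seo/"))         # True
print(rule_matches("/feed/", "/2007/06/my-post/feed/"))     # False: prefix starts at the root
print(rule_matches("/*/feed/", "/2007/06/my-post/feed/"))   # True
```

The last two lines are the interesting ones: a plain `/feed/` or `/trackback/` rule does not reach the per-post URLs, so a wildcard form such as `/*/feed/` (my suggestion, not from the articles above) would be needed to catch those.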

For right or wrong, I have one section for:

User-agent: Googlebot

and another section for:

User-agent: ia_archiver
User-agent: Scooter

User-agent: Atomz
User-agent: FAST-WebCrawler
User-agent: ArchitextSpider
User-agent: Googlebot
User-agent: Slurp.so/1.0
User-agent: Slurp/2.0j
User-agent: Slurp/2.0-KiteHourly
User-agent: Slurp/2.0-OwlWeekly
User-agent: Slurp/3.0-AU

User-agent: UltraSeek
User-agent: MantraAgent
User-agent: Lycos_Spider_(T-Rex)
User-agent: MSNBOT/0.1
User-agent: Gulliver
User-agent: Scrubby/
User-agent: ZyBorg
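
Since both sections carry the same rules, one way to keep them in sync is to generate the file from a single rule list. The sketch below is just that, a sketch: it uses only a few of the agents and the plain directory rules from above, the function name is made up, and it checks the result with Python's `urllib.robotparser`, which may not behave exactly like each real crawler.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules and crawler names from this post.
RULES = ["/cgi-bin/", "/wp-admin/", "/wp-includes/"]
GROUPS = [
    ["Googlebot"],               # Google gets its own section
    ["ia_archiver", "Scooter"],  # a few of the other crawlers
]

def build_robots_txt(groups, rules):
    """Emit one section per group; every section gets the same rules."""
    blocks = []
    for agents in groups:
        lines = [f"User-agent: {agent}" for agent in agents]
        lines += [f"Disallow: {rule}" for rule in rules]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

text = build_robots_txt(GROUPS, RULES)
print(text)

parser = RobotFileParser()
parser.parse(text.splitlines())

googlebot_blocked = not parser.can_fetch("Googlebot", "/wp-admin/")
archiver_blocked = not parser.can_fetch("ia_archiver", "/wp-admin/")
# "SomeOtherBot" is a made-up name for a crawler that is not listed anywhere.
unlisted_blocked = not parser.can_fetch("SomeOtherBot", "/wp-admin/")

print(googlebot_blocked, archiver_blocked, unlisted_blocked)
```

Note what the last check shows: with only named sections and no `User-agent: *` section, a crawler that is not on the list is not restricted at all.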

If you have any comments, improvements, or suggestions – please comment now!

June 8th, 2007 | Posted in wordpress seo, wordpress | No Comments

3 Comments

Dave on July 3, 2008 at 11:32 am

You say you have one section for Googlebot and one for the others “for right or wrong”. Do you do anything different between the two sections?

admin on July 3, 2008 at 12:35 pm

no, I do both sections the same way – I just want to make sure google’s instructions are very clean and don’t get muddied by the other crawlers listings.