Are you reading your XML Sitemap when it updates and checking your SERPs (Search Engine Results Pages) to make sure your site is indexing OK? If you aren’t, you could unknowingly be driving your site into the ground and killing off your traffic.

This is the third installment of “Tracking Your Blog’s Progress”, and in this installment we’re going to talk about XML Sitemaps and the Google supplemental index. Do you have pages in Google’s supplemental index? Have you ever had a Google penalty for duplicate content? Is your blog sitemap listing pages for indexing that really shouldn’t be indexed by the search engine crawler at all? I’m going to tell you what you need to know about sitemaps and why you want to look at yours every now and again to make sure that the right pages in your blog are getting crawled at all times!

What is an XML Sitemap?

A sitemap tells the search crawler what to index on your site, and how important each page is. If your site has a sitemap, the search crawler looks for it before crawling and uses it as a roadmap for where to go and what to do. I use WordPress, so my sitemap is created and managed by the WordPress XML Sitemap Plugin. The default name of a sitemap is “sitemap.xml” and its default location is right off the root of your site, like mine: http://www.jtpratt.com/sitemap.xml.
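If you've never looked inside a sitemap, here is a rough sketch in Python of how a tiny one could be built by hand. It only illustrates the file format (a `urlset` of `url` entries, each with a `loc` and an optional `priority`). It is not how the WordPress plugin works internally, and the example.com URLs and priority values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: iterable of (url, priority) pairs. Returns the sitemap as an XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, priority in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # Priority is a hint between 0.0 and 1.0 about relative importance.
        ET.SubElement(node, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages, purely for illustration.
example = build_sitemap([
    ("http://www.example.com/", 1.0),
    ("http://www.example.com/about/", 0.6),
])
print(example)
```

A real generator would normally also write the XML declaration and a last-modified date for each entry. I left those out to keep the sketch short.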

You can also specify the location of your sitemap for the searchbots in your robots.txt, like this:


Sitemap: http://www.jtpratt.com/sitemap.xml.gz

My XML Sitemap WordPress plugin does this automatically for me. As you can see in the example, it also provides a gzipped version of my sitemap, and that's the one it points the search crawlers to.
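If you want to double check which sitemap your robots.txt is actually advertising, a few lines of Python will do it. This is only a sketch: it works on the text of a robots.txt file that you have already downloaded, and the `User-agent` and `Disallow` lines in the sample are assumptions added to make it look like a typical file.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the field name from its value.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

# Sample robots.txt; only the Sitemap line comes from this article.
robots = """User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.jtpratt.com/sitemap.xml.gz
"""
found = sitemap_urls(robots)
print(found)
```

If the list comes back empty, the crawlers are not being told about your sitemap through robots.txt.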

What to look for in your XML Sitemap

So first of course, if you have a blog you need a sitemap desperately. Whether you use Blogger, Drupal, Mambo, WordPress, Xoops, Postnuke, Joomla, or even have a static web site, there should be a "sitemap plugin" or "sitemap generator" of some kind you can use. Once you get one and set it up, view your sitemap and see what it looks like. Look at my sitemap: http://www.jtpratt.com/sitemap.xml. If yours generated properly, it should have all the normal things a sitemap would contain by default.

Default XML Sitemap Contents

  • Home Page
  • Content Pages
  • Post Pages
  • Category Pages
  • Tag Pages
  • Archive Pages

You want to make sure that YOUR sitemap contains the things you want it to. You can probably configure your sitemap generator to leave out the things you don't want in there, and of course you can also limit these things in your robots.txt file. The page types I refer to are based on WordPress, but no matter which blogging platform you use, most have category, tag, and archive pages. Let's talk about why you should care what is and what is not in your sitemap by going through the contents again one by one:

Default XML Sitemap Contents

  • Home Page: Your home page should of course be listed in your sitemap, and it's your job to make sure that your homepage has a good, descriptive, keyword-rich title and description. The WordPress wpSEO Plugin I use allows you to set a default description, which is what I use for my homepage.
  • Content Pages: In WordPress this is referred to just as a "page". My pages are set up with good titles and descriptions, so I include them in my sitemap. Some of my pages only point to posts or other pages, but I always write descriptive text and original content at the top of each for good indexing.
  • Post Pages: All my post pages are in my sitemap because they all have unique content and good titles and descriptions.
  • Category Pages: A category page in WordPress by default is just an "archive" page listing past posts in the category. There will be nothing unique at all about these pages, and they will get sent to Google's supplemental index unless you place original content on each category page. If you are going to keep your category pages in your sitemap (in WordPress), you have to create a "category template" page for each category you have. I already wrote a tutorial on how to do that: "Better SEO and $$$ with WordPress Category Templates". If you don't do this for every single category you have, you will want to remove categories from your sitemap. Here's an example of what a category page looks like with original content on it.
  • Tag Pages: I am currently not using tags at all in my WordPress sites, but if I were, I would remove all tag pages from my sitemap. If you're using category pages properly, there's no reason to have these in a sitemap unless you can find a way to have original content on them.
  • Archive Pages: Same with archive pages: with no original content, they should not be in a sitemap. During this exercise I realized that my sitemap contained archive pages, so I removed them. Those pages only contain excerpts of posts and have no original content at all, and I see no reason to ever have them indexed.
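Reading a long sitemap by eye gets old fast, so here is a sketch of how you might flag the category, tag, and archive pages automatically. The URL patterns are assumptions based on a common WordPress permalink layout (`/category/`, `/tag/`, and date-based archives like `/2008/05/`), and the sample URLs are made up. Adjust both to match your own site.

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical URL patterns; adjust them to your own permalink structure.
LISTING_PATTERNS = {
    "category": re.compile(r"/category/"),
    "tag": re.compile(r"/tag/"),
    "archive": re.compile(r"/\d{4}/(\d{2}/)?$"),  # /2008/ or /2008/05/
}

def listing_pages(sitemap_xml):
    """Return (kind, url) pairs for sitemap entries that look like listing pages."""
    root = ET.fromstring(sitemap_xml)
    hits = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        for kind, pattern in LISTING_PATTERNS.items():
            if pattern.search(url):
                hits.append((kind, url))
                break
    return hits

# Made-up sitemap content for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/2008/05/</loc></url>
  <url><loc>http://www.example.com/tag/seo/</loc></url>
  <url><loc>http://www.example.com/category/wordpress/</loc></url>
  <url><loc>http://www.example.com/2008/05/12/my-post/</loc></url>
</urlset>"""
flagged = listing_pages(sample)
for kind, url in flagged:
    print(kind, url)
```

Anything this flags is a candidate for removal, unless you have put original content on that page.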

Why does what I have indexed matter?

It matters because what the search crawler indexes on your site is what will be listed in the search engine(s) for your site. Google has two indexes: one with quality results, and the "supplemental index", which is technically speaking "everything else not so quality". It's for this reason bloggers and webmasters call it "supplemental index hell"! If you have web pages from your site listed in the supplemental index, you may have one of these problems:

  • duplicate content: when a page has pretty much the same content as (or very similar content to) another page in your site or someone else's
  • duplicate or bad titles and descriptions: you may have original content on each and every page, but the same title and / or description on multiple pages throughout your site
  • no original content: when you have data without complete sentences, random text or information - nothing really "searchable"
  • spammy content: when the crawler determines that you have more than "x" uses of the exact same keyword, or more keywords than original content. It could also be more hyperlinks than is acceptable for the amount of content, or anything else that makes the search crawler algorithm decide you are trying to "game" the search engine for better results
  • bad or sneaky redirects: if the crawler thinks you are trying to show it one version of content for rankings and another to visitors, you could go to the supplemental index and / or be penalized or removed from Google's index as well
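Of the problems in that list, duplicate titles are the easiest to check for yourself. Here's a minimal sketch, assuming you have already collected each page's title into a dictionary (how you collect them is up to you, and the paths and titles below are invented):

```python
from collections import defaultdict

def duplicate_titles(pages):
    """pages: dict of url -> title. Return {title: [urls]} for titles used more than once."""
    by_title = defaultdict(list)
    for url, title in pages.items():
        # Normalize case and whitespace so trivial differences still count as duplicates.
        key = " ".join(title.lower().split())
        by_title[key].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}

# Invented pages for illustration.
pages = {
    "/": "My Blog",
    "/about/": "About Me",
    "/category/seo/": "my blog",
    "/contact/": "Contact",
}
dupes = duplicate_titles(pages)
print(dupes)
```

The same idea works for meta descriptions: swap the titles for descriptions and run it again.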

You could have web pages go to the supplemental index permanently (like archive pages with no original content), which wouldn't adversely affect the rest of your site or your SERPs at all. OR, if Google thinks you're getting "spammy", it could assess a "Google penalty" on your site, where ALL your indexed pages (including the ones with original content) are sent to the "supplemental index" for a set period of time, from 90 days to 6 months. This is NOT FUN, and it's happened to me before on sites I own (which are now out of the supplemental index with no penalty). Setting up an XML sitemap properly, reading your sitemap, and checking your indexed pages ensures you know what's going on at all times and stay in control of your own search rankings.

How do I know if I have pages in "supplemental index hell"?

That's pretty easy to figure out. Do a Google search for your site (or any other) and look for this message under the results:

"In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included."

When you click through to the "omitted results", those are the pages in the "supplemental index". If you need to figure out whether you have pages in the supplemental index, and how many you have there, just use the Supplemental Index Ratio Calculator. It will tell you not only how many pages you have in the supplemental index, but also the percentage of your site's pages that went supplemental.
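The percentage itself is simple arithmetic, so you can work it out yourself once you have the two counts. I'm assuming here that the ratio is just supplemental pages divided by total indexed pages. I can't say exactly how the calculator gets its numbers, and the counts below are made up.

```python
def supplemental_ratio(total_indexed, supplemental):
    """Return the percentage of indexed pages that are in the supplemental index."""
    if total_indexed <= 0:
        raise ValueError("total_indexed must be positive")
    if not 0 <= supplemental <= total_indexed:
        raise ValueError("supplemental must be between 0 and total_indexed")
    return 100.0 * supplemental / total_indexed

# Made-up counts: 200 pages indexed, 50 of them supplemental.
ratio = supplemental_ratio(total_indexed=200, supplemental=50)
print(f"{ratio:.1f}% of indexed pages are supplemental")
```

Track this number over time: if it drops after you clean up your sitemap, your changes are working.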

How do I get out of Google's "supplemental index"?

That's pretty easy:

Steps to Submit a "Reconsideration Request":

  1. Set up an XML Sitemap
  2. Remove everything that doesn't contain original content
  3. Make sure you're not linking to (or being linked by) any "bad neighborhoods" by using the Bad Neighborhood Text Link Checker Tool
  4. Remove anything considered "spammy"
  5. Rebuild your sitemap
  6. Submit a Reconsideration Request

This is yet another example of how you should be "tracking your blog" from time to time to make sure that your blogging efforts aren't unknowingly being driven into the ground. Get the maximum benefit from your blog by making sure you are getting the most out of it visitor, traffic, and search engine wise at all times. Please comment now if you have questions or something to add to this article!