After I blogged about building a sitemap for Rails I contacted Mike Clark and asked him if he thought it would make a good Recipe for his upcoming book, Advanced Rails Recipes. He thought it was a good fit and it is currently in the Beta version of the book.
I wrote up my notes in the Recipe format, then Mike basically rewrote it for Rails 2.0 and added some additional content. Thanks Mike! I almost feel bad being cited as the author since after editing it is drastically different from the original.
The core concepts are still there and some thoughts were dropped since Recipes should be short. So, Here are some elaborations..
The Ping Protocol
There is a warning in the book about excessively pinging to Google to have them read your sitemap. I would recommend letting search engines crawl your sitemaps at their own speed. The ping example in the book was a nice overview of when to use an Observer and also provided complete coverage on how to submit sitemaps. Please use ping sparingly, if at all.
Sitemaps with over 50,000 entries
I work on sites where we use siteindex files because we submit well over 50,000 URLs to the search engines. I didn’t provide an example on how to build these in Rails because I’m not sure they provide any value to the typical site.
My theory is that if you build a sitemap with the 50,000 pages that were most recently updated you will give the search engines all they need. If a page isn’t updated for a while and it falls off the list is that really a problem? If the page was worth anything someone externally would be linking to it before it fell off the list. Now if your site is creating millions of pages a day this may not be the case.
If your pages are islands (no links to them) and you’re afraid they won’t be found unless they are all in the sitemap, I would suggest building the sitemap via a rake task that is kicked off via a cron job. This will also give you an opportunity to gzip the files. I’ll try to writeup some example code for the this when I find some free time.
Do I really need a sitemap?
If your site has navigation to all its pages, then a sitemap will probably not benefit you. I suggest checking what pages the search engines have in their index and if key content is missing then pursue a sitemap. Even if they are finding all your pages a sitemap certainly couldn’t hurt.
Just in case you didn’t know how to find the pages Google knows about on your site you can simply type site:youdomain.com in the Google or Yahoo search box.
Example results for my site are here