
Setting up Sunspot/Solr for OR queries, stemming and lower memory usage

by on Jan.06, 2011, under rails, tech

As I keep finding in Rails 3, the gems I used in Rails 2 no longer work or have fallen out of favor. In Rails 2, acts_as_ferret met my searching needs, but after submitting some fixes for Rails 3 and Ruby 1.9.2 I was still having issues, so I moved on to Sunspot.

One of the first things I wanted to change with Sunspot was to make the default boolean operator OR. This means that when someone searches for “car window” they get results that match car or window.

Not being a Solr expert, I first thought that all I needed to do was change this line in schema.xml

<solrQueryParser defaultOperator="AND"/>

to

<solrQueryParser defaultOperator="OR"/>

But it didn’t work. After some research and digging through the logs, I learned that Sunspot uses the dismax request handler. To make a long story short, dismax ignores defaultOperator and uses a "minimum should match" parameter (mm) instead, which Sunspot exposes as minimum_match. The good news is that setting it to 1 in your search query is easy and gives you the same behavior as defaultOperator="OR".

In your controller, the search would look something like this.

@articles = Article.search do
  keywords(actual_search) {minimum_match 1}
end
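
To make the effect of minimum_match a little more concrete, here is a toy model in plain Ruby. It is not Sunspot or Solr code, and the matches? helper and the sample documents are made up for illustration. It only shows the idea that a document matches when at least minimum_match of the query terms appear in it.

```ruby
# Toy model of "minimum should match"; a simplification, not how Solr scores or parses.
# A document matches when at least `minimum_match` query terms appear in it.
def matches?(document, query, minimum_match)
  doc_terms = document.downcase.split
  query_terms = query.downcase.split.uniq
  hits = query_terms.count { |term| doc_terms.include?(term) }
  hits >= minimum_match
end

docs = ["red car", "broken window", "car window tint", "blue bicycle"]

# With minimum_match 1, any single term is enough (OR-like behavior).
or_like = docs.select { |doc| matches?(doc, "car window", 1) }

# With minimum_match 2, both terms of a two-word query are required (AND-like).
and_like = docs.select { |doc| matches?(doc, "car window", 2) }

puts or_like.inspect
puts and_like.inspect
```

With a value of 1, a two-word search matches documents containing either word, which is the OR behavior I was after.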

The next thing I wanted was for searches for car to also return results for cars and other words with the same stem. This required a one-line change in schema.xml.

In the <analyzer> block, just add <filter class="solr.SnowballPorterFilterFactory" language="English" />

      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" />
      </analyzer>

Finally, because the model I am searching is small and Java eats quite a bit of memory, I wanted to reduce the Solr server’s memory footprint. This may come back to bite me as my dataset grows, but for now it is working fine. To adjust the memory parameters used by rake sunspot:solr:start, just edit your sunspot.yml file and add min_memory and max_memory lines.

development:
  solr:
    hostname: localhost
    port: 8982
    log_level: DEBUG
    min_memory: 64M
    max_memory: 64M

This will result in -Xms64M -Xmx64M being passed to Java on startup.
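
As a rough illustration of that mapping, here is a small standalone Ruby sketch. It is not Sunspot's actual implementation. The jvm_memory_flags helper is hypothetical, and it simply assumes that min_memory becomes -Xms and max_memory becomes -Xmx, as described above.

```ruby
require "yaml"

# Hypothetical helper (not part of Sunspot): read the solr section of a
# sunspot.yml-style config and build the matching JVM heap flags.
def jvm_memory_flags(yaml_text, environment)
  solr = YAML.safe_load(yaml_text).fetch(environment).fetch("solr")
  flags = []
  flags << "-Xms#{solr['min_memory']}" if solr["min_memory"]
  flags << "-Xmx#{solr['max_memory']}" if solr["max_memory"]
  flags
end

config = <<~YAML
  development:
    solr:
      hostname: localhost
      port: 8982
      log_level: DEBUG
      min_memory: 64M
      max_memory: 64M
YAML

puts jvm_memory_flags(config, "development").join(" ")
```

If either line is left out of the config, this sketch simply omits the corresponding flag.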


My sitemap notes are in the Advanced Rails Recipes Book

by on Mar.08, 2008, under rails, sitemap, tech

After I blogged about building a sitemap for Rails I contacted Mike Clark and asked him if he thought it would make a good Recipe for his upcoming book, Advanced Rails Recipes. He thought it was a good fit and it is currently in the Beta version of the book.

I wrote up my notes in the Recipe format, then Mike basically rewrote it for Rails 2.0 and added some additional content. Thanks Mike! I almost feel bad being cited as the author since after editing it is drastically different from the original. :-)

The core concepts are still there, but some thoughts were dropped since recipes should be short. So here are some elaborations.

The Ping Protocol

There is a warning in the book about pinging Google excessively to have them read your sitemap. I would recommend letting search engines crawl your sitemaps at their own speed. The ping example in the book was a nice overview of when to use an Observer and also provided complete coverage of how to submit sitemaps. Please use ping sparingly, if at all. :-)

Sitemaps with over 50,000 entries

I work on sites where we use sitemap index files because we submit well over 50,000 URLs to the search engines. I didn’t provide an example of how to build these in Rails because I’m not sure they provide any value to the typical site.

My theory is that if you build a sitemap with the 50,000 pages that were most recently updated, you will give the search engines all they need. If a page isn’t updated for a while and it falls off the list, is that really a problem? If the page was worth anything, someone externally would be linking to it before it fell off the list. If your site is creating millions of pages a day, this may not be the case.

If your pages are islands (no links to them) and you’re afraid they won’t be found unless they are all in the sitemap, I would suggest building the sitemap via a rake task that is kicked off by a cron job. This will also give you an opportunity to gzip the files. I’ll try to write up some example code for this when I find some free time.
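
In the meantime, here is a minimal sketch in plain Ruby of what the core of such a task might look like. It is not the recipe from the book. The build_sitemap and write_gzipped helpers, the example.com URLs, and the output path are all placeholders, and in a real app the entries would come from your models and the code would sit inside a rake task that cron runs.

```ruby
require "zlib"
require "time"
require "cgi"
require "tmpdir"

MAX_URLS = 50_000 # per-file URL limit in the sitemap protocol

# entries: array of hashes with :loc (String) and :updated_at (Time).
# Keeps only the most recently updated entries, newest first.
def build_sitemap(entries)
  recent = entries.sort_by { |entry| entry[:updated_at] }.reverse.first(MAX_URLS)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  recent.each do |entry|
    lines << "  <url><loc>#{CGI.escapeHTML(entry[:loc])}</loc>" \
             "<lastmod>#{entry[:updated_at].utc.iso8601}</lastmod></url>"
  end
  lines << "</urlset>"
  lines.join("\n") + "\n"
end

# Write the XML as a gzip-compressed file.
def write_gzipped(path, xml)
  Zlib::GzipWriter.open(path) { |gz| gz.write(xml) }
end

# Placeholder data standing in for records loaded from the database.
entries = [
  { loc: "http://example.com/articles/1", updated_at: Time.utc(2008, 3, 1) },
  { loc: "http://example.com/articles/2?a=1&b=2", updated_at: Time.utc(2008, 3, 8) }
]

xml = build_sitemap(entries)
path = File.join(Dir.tmpdir, "sitemap.xml.gz")
write_gzipped(path, xml)
puts xml
```

Sorting by update time before truncating is what keeps the 50,000 most recently updated pages in the file.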

Do I really need a sitemap?

If your site has navigation to all its pages, then a sitemap will probably not benefit you. I suggest checking which pages the search engines have in their index and, if key content is missing, pursuing a sitemap. Even if they are finding all your pages, a sitemap certainly couldn’t hurt.

In case you don’t know how to find the pages a search engine knows about on your site, you can simply type site:yourdomain.com in the Google or Yahoo search box.

Example results for my site are here
