chris finne

Best Solr Setup for Indexing Large Rails App

03 Jun 2010

Background on my Solr / Rails effort

I’ve been doing a lot of importing recently as I’m trying to find the fastest way to get my 22 million records into Solr. I’ve used DataImportHandler (DIH), and it is by far the fastest option, but I need to massage some of the data in Ruby first, so DIH is out. I use Matt Mitchell’s rsolr library. I’ve looked at Sunspot and other Solr libraries for Rails, but this time I want to start from a fairly low level: I’ve got a lot of docs and don’t want any unnecessary overhead that some of these higher-level libraries might add.
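To make the "massage in Ruby, then push to Solr" step concrete, here is a minimal sketch of batched indexing. The helper names (`to_solr_doc`, `index_in_batches`), the field names, and the batch size of 1000 are assumptions for illustration, not the code I actually ran. The `solr` argument is any object that responds to `add` and `commit`, which is the interface an rsolr connection exposes.

```ruby
# Hypothetical helper: massage one raw record (a Hash) into a Solr document.
# The :id and :title fields are made up for this example.
def to_solr_doc(record)
  {
    :id    => record[:id],
    :title => record[:title].to_s.strip
  }
end

# Send records to Solr in batches instead of one request per document.
# `solr` is any object responding to `add(docs)` and `commit`, for example
# the connection returned by RSolr.connect(:url => "http://localhost:8983/solr")
# (the URL is an assumption; use your own).
def index_in_batches(solr, records, batch_size = 1000)
  total = 0
  records.each_slice(batch_size) do |batch|
    docs = batch.map { |record| to_solr_doc(record) }
    solr.add(docs)
    total += docs.size
  end
  solr.commit # commit once at the end rather than after every batch
  total
end
```

Batching keeps the number of HTTP round trips down, and committing once at the end avoids paying the commit cost on every batch. The right batch size depends on document size and is worth measuring rather than guessing.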

rsolr-direct vs. normal rsolr/jetty setup

These are rough numbers, but I thought you might be interested.

On Snow Leopard, JRuby 1.5.0rc3 vs. REE 1.8.7:

But this is just a single-process, single-threaded test, so it is not very representative of most real-world setups.

Factors and Permutations

There are so many factors that determine where the bottleneck is and what the final performance will be:

Here are just a few of the permutations that I’ve tried:

The last one is what I’m starting to lean towards because it makes the best use of the multiple cores on my dev box and will likely do the best on a very large EC2 instance. My Mac has 8GB of DDR3 RAM and a 2.9GHz quad-core processor, and CPU seems to be the limiting factor. With 3 JRuby processes, the CPU was only at about 50%. I’m now running 7 JRuby processes, and the CPU is at about 90%.
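One simple way to divide the work among several independent indexing processes is to give each one its own contiguous slice of record ids. The sketch below shows that partitioning only; the `id_slices` name, the assumption that ids run from 1 to 22 million without large gaps, and the command-line convention in the comments are all hypothetical, not a description of my actual scripts.

```ruby
# Split the inclusive id range min_id..max_id into `workers` contiguous
# slices whose sizes differ by at most one.
def id_slices(min_id, max_id, workers)
  total = max_id - min_id + 1
  base, extra = total.divmod(workers)
  start = min_id
  (0...workers).map do |i|
    size = base + (i < extra ? 1 : 0) # first `extra` slices get one more id
    slice = (start..(start + size - 1))
    start += size
    slice
  end
end

# Hypothetical usage: start each process with its own worker index, e.g.
#   jruby index.rb 0 7
#   jruby index.rb 1 7
#   ...
# and inside index.rb pick the slice for that worker:
#   worker, workers = ARGV.map { |arg| arg.to_i }
#   my_ids = id_slices(1, 22_000_000, workers)[worker]
```

Because the slices never overlap, the processes need no coordination with each other, and the number of processes can be raised or lowered until the CPU is saturated.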