When we upgraded our Sitecore 8 project to Sitecore 9, we needed a new solution for indexing our data. Our Sitecore 8 project used Lucene for indexing, which posed two problems: Lucene was not supported in our Sitecore 8 environment configuration, and Lucene was not supported in Sitecore 9. To address both issues, we began migrating to Solr 8.1.1 and, later, to SolrCloud.
Inventorying the Indexes
Our first step in migrating the indexes was to take an inventory and clean up and consolidate what we could. We found that the data from many of our custom indexes could simply be brought into the respective Sitecore indexes (sitecore_master, sitecore_web) and handled naturally in Solr. This reduced both the number of indexes we had to migrate and the clutter in our project.
Solr Setup and Conversion
We quickly got an instance of Solr running in our local environments and could begin converting indexes to Solr cores. The conversion process for each index was straightforward: copy the Lucene configuration for that index to a new Solr configuration (with some minor updates), create the new Solr core files, and deploy those files to the Solr server. We also had to implement a few data converters so that the data we retrieved from Solr could be mapped properly into our ORM (Glass.Mapper).
With our project now converted to Solr, we could begin comparing search results to the old Lucene project to find inconsistencies. To dig into the differences, we turned to the Solr query debugger inside the Solr admin. This allowed us to gain a better understanding of how Solr scores its documents.
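The same debug output that the admin screen shows can also be requested over HTTP by adding Solr's `debugQuery=true` parameter to a select request. The following is a minimal sketch of building such a URL; the host, the core name `sitecore_web`, and the field `title_t` are placeholders for illustration, not values from our project.

```python
from urllib.parse import urlencode


def build_debug_query_url(base_url, core, query, rows=10):
    """Return a Solr /select URL that asks for per-document score explanations."""
    params = {
        "q": query,
        "fl": "*,score",       # return the stored fields plus the computed score
        "rows": rows,
        "debugQuery": "true",  # ask Solr to include its debug section in the response
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance, core, and field name.
url = build_debug_query_url("http://localhost:8983/solr", "sitecore_web", "title_t:upgrade")
print(url)
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, returns the matching documents along with an explanation of how each score was calculated.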
With query debugging enabled, Solr outputs the calculations it used to score each document. At first glance, this can look like some obscure calculation from physics class, so we did some investigation into scoring to demystify it. The scoring is broken into several tiers that build upon each other. At the bottom tier, each document is given a base score of 1. The next tier is where Solr does most of its calculations to give each document a relevancy score. The third tier applies index-time boosting: boost values added directly to content items in Sitecore. The final tier applies query-time boosting: boost values normally added in code and passed along in the query string of the Solr search request.
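To make query-time boosting concrete: in the standard Lucene query syntax that Solr accepts, a boost is written as `^` followed by a number after a clause. The sketch below builds such a query string; the helper `boosted_query` and the field names `title_t` and `body_t` are hypothetical, and it assumes a single-word search term with no special characters that would need escaping.

```python
def boosted_query(term, field_boosts):
    """Build a Lucene-syntax query that searches `term` in each field with its own boost."""
    clauses = [f"{field}:{term}^{boost}" for field, boost in field_boosts.items()]
    return " OR ".join(clauses)


# A match in the (hypothetical) title field counts three times as much as one in the body.
q = boosted_query("upgrade", {"title_t": 3, "body_t": 1})
print(q)  # title_t:upgrade^3 OR body_t:upgrade^1
```

The resulting string would be sent as the `q` parameter of the search request, which is where the boost becomes visible in the debugger output.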
The most complex part of scoring is the second tier, where the relevancy score is calculated. These calculations make up the bulk of what is displayed in the query debugger, and are based on several factors:
- Term frequency – The frequency of a search term in a document
- Inverse document frequency – The rarity of a search term across all documents
- Coordination factor – The number of search terms that are present in a document
- Field norm – The length of the field value in which the search term is found (a match in a shorter field counts for more)
With a good understanding of these factors and calculations, we were able to structure our content to achieve the search results we wanted.
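To show how the four factors interact, here is a simplified sketch modeled on Lucene's classic TF-IDF similarity. It is an illustration only: the exact formulas Solr applies depend on the similarity configured for the version in use, query normalization and boosts are left out, and the function names and sample data are made up for this example.

```python
import math


def tf(freq):
    """Term frequency: more occurrences help, with diminishing returns."""
    return math.sqrt(freq)


def idf(doc_freq, doc_count):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


def field_norm(num_terms):
    """Field norm: a match in a shorter field counts for more."""
    return 1.0 / math.sqrt(num_terms)


def coord(matched_terms, total_terms):
    """Coordination factor: reward documents that match more of the query terms."""
    return matched_terms / total_terms


def score(query_terms, field_tokens, doc_freqs, doc_count):
    """Combine the four factors for a single field of a single document."""
    if not field_tokens:
        return 0.0
    norm = field_norm(len(field_tokens))
    total = 0.0
    matched = 0
    for term in query_terms:
        freq = field_tokens.count(term)
        if freq == 0:
            continue  # this query term does not occur in the field
        matched += 1
        total += tf(freq) * idf(doc_freqs[term], doc_count) ** 2 * norm
    return coord(matched, len(query_terms)) * total


# Made-up sample data: three documents in the index, two of them containing "solr".
doc_freqs = {"solr": 2, "upgrade": 1}
short_doc = ["solr", "upgrade"]
long_doc = ["solr", "is", "a", "search", "server", "we", "use", "daily"]

query = ["solr", "upgrade"]
print(score(query, short_doc, doc_freqs, doc_count=3))
print(score(query, long_doc, doc_freqs, doc_count=3))
```

The short document scores higher on every count: it matches both query terms, one of which is rare, and its field is much shorter. This is the kind of effect we kept in mind when structuring content.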
When all was said and done, we were extremely happy with the results. The first thing we noticed was the sheer speed of searches; our results were coming back in a fraction of the time. This was likely due in part to indexing operations being handled by a separate server. On top of that, we now had a centralized index solution for both of our CD servers (previously, each CD server maintained its own separate indexes). We also gained an invaluable debugging tool in the Solr admin. Debugging search results and indexes became trivial compared to spinning up a Luke instance for Lucene.