An interesting write-up we found, with several points worth studying. Take a look and see whether you agree.
I was recently with a client doing a Best Practices assessment when I came across a common source of confusion related to sorting, faceting and schema design.
As background, Solr provides a schema that describes the Fields and Field Types (FT) that are used by an application. Field Types describe how Solr should handle the information contained in a Field. For instance, the integer FT tells Solr to treat the contents of any Field of type integer as, you guessed it, an integer. By integer here, I mean good old-fashioned Java ints.
Solr provides other FTs like long, double, float, string, date, as well as Text (which can be associated with Lucene's analysis process). Additionally, Solr provides several sortable FTs such as sint, slong, sdouble and sfloat. Therein lies the confusion. I think what happens is developers hear the word sortable and think they should use the sortable FT for any field they want to sort results by.
However, there is some subtlety here. Namely, sortable FTs manipulate the content so that the lexicographic order is the same as the numeric order for use during search. Sortables are thus really meant to be used when doing things like range queries (e.g. price:[2 TO 100]) and not for sorting as it relates to returning results. Because of this re-encoding, sortables take up more space in the index (and in memory) than their non-sortable compadres.
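To make the lexicographic-versus-numeric point concrete, here is a minimal Python sketch. It is not Solr's actual sortable encoding; the offset-and-zero-pad scheme and the names plain_term, sortable_term and range_query are assumptions chosen only to illustrate the idea that a term-based range query compares strings, so the stored form has to sort the same way the numbers do.

```python
# Illustrative only: NOT Solr's real encoding, just an order-preserving stand-in.
OFFSET = 2**31  # shifts the signed 32-bit range so it starts at zero
WIDTH = 10      # 2**32 - 1 has ten decimal digits


def plain_term(n: int) -> str:
    """Index the number as its ordinary decimal string."""
    return str(n)


def sortable_term(n: int) -> str:
    """Hypothetical encoding whose string order matches numeric order."""
    if not -OFFSET <= n < OFFSET:
        raise ValueError("value does not fit in a signed 32-bit int")
    return str(n + OFFSET).zfill(WIDTH)


def range_query(terms, lower, upper):
    """Return the terms t with lower <= t <= upper under string comparison."""
    return [t for t in terms if lower <= t <= upper]


prices = [2, 9, 10, 100, 250, -5]

# With plain strings, "2" sorts after "100", so the range [2 TO 100] is empty.
plain_hits = range_query([plain_term(p) for p in prices],
                         plain_term(2), plain_term(100))

# With the order-preserving terms, the same range behaves numerically.
sortable_hits = range_query([sortable_term(p) for p in prices],
                            sortable_term(2), sortable_term(100))
decoded_hits = sorted(int(t) - OFFSET for t in sortable_hits)

print(plain_hits)    # []
print(decoded_hits)  # [2, 9, 10, 100]
```

Note that every encoded term in this sketch is ten characters long, while the plain terms are as short as one character. Solr's real encoding is different, but the same trade-off applies: the order-preserving form costs extra space.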
What's this got to do with schema design? Well, this client had three fields, all defined as sortable integer FTs, as in:
1. fieldOriginal – The source of the content. This was the main field used for sorting.
2. fieldSearch – Copy field of Original, but rounded to the nearest 100. This was the main field for searching.
3. fieldFacet – Copy field of Original, but rounded based on a percentage of the original value so as to provide a sliding scale for faceting. This was the main field used for faceting.
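The three fields above could be derived roughly as follows. This is a hedged sketch, not the client's code: the article does not give the exact rounding rules, so the half-up rounding, the 10% figure and the power-of-ten step in facet_round are assumptions, and the helper names are hypothetical. Only the field names come from the list above.

```python
def round_to_step(value: int, step: int) -> int:
    """Round a non-negative int to the nearest multiple of step (halves round up)."""
    if value < 0 or step <= 0:
        raise ValueError("expects a non-negative value and a positive step")
    return (value + step // 2) // step * step


def round_to_nearest_100(value: int) -> int:
    """Rounding used for fieldSearch, as described in the list above."""
    return round_to_step(value, 100)


def facet_round(value: int, pct: int = 10) -> int:
    """Assumed sliding scale: round to the largest power of ten that is
    no bigger than pct percent of the value, so big values get coarse buckets."""
    step = 1
    # Grow the step while the next power of ten still fits within pct% of value.
    while step * 10 * 100 <= value * pct:
        step *= 10
    return round_to_step(value, step)


def derive_fields(original: int) -> dict:
    """Build the three field values from one source value."""
    return {
        "fieldOriginal": original,
        "fieldSearch": round_to_nearest_100(original),
        "fieldFacet": facet_round(original),
    }


print(derive_fields(1234))
# {'fieldOriginal': 1234, 'fieldSearch': 1200, 'fieldFacet': 1200}
print(derive_fields(56789))
# {'fieldOriginal': 56789, 'fieldSearch': 56800, 'fieldFacet': 57000}
```

Under these assumptions the rounded copies collapse many distinct source values into the same term, which is why only the unrounded field ends up with a very large number of unique terms.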
In this case, the client was using the Original for sorting, Search for searching, and Facet for faceting. They were not doing any range queries, so they did not need fieldSearch to be sortable. Furthermore, the Original field had over 1 million unique terms, so sorting on it was taking up a good chunk of memory and disk space. The other two fields were smaller, so the cost of sortables was not that big of a deal. Finally, this field pattern was replicated for several other fields as well, some of which also had a significant number of unique terms.
Thus, simply by changing the Fields to use integers where appropriate, we significantly reduced the memory footprint and the disk space required in this client application.
So, as is always the case, pay close attention to your schema design. While the Solr example schema is pretty good out of the box, you shouldn't just take it as gospel, either. Spend some time thinking about your needs during design and it will likely save you much time later when debugging and testing your application.
About Lucid Imagination
Lucid Imagination is the commercial company exclusively dedicated to Apache Solr/Lucene open source enterprise technology. It provides search solution development platforms built on the power of Solr/Lucene open source search via enterprise-grade subscriptions. Learn more about the company at www.lucidimagination.com.