Class CmsSearchSimilarity


  • public class CmsSearchSimilarity
    extends org.apache.lucene.search.similarities.Similarity
    Reduces the importance of the computeNorm(FieldInvertState) factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.

    This implementation was added because the default length norm is heavily biased toward small documents. With the default, even if a term occurs the same number of times in two documents, the shorter document (containing fewer terms) can easily score 3 times as high as the longer one. This implementation reduces the influence of the number of terms in a document on the score.
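
The size of that bias can be illustrated with a little arithmetic. The sketch below assumes the classic Lucene TF-IDF length norm, 1 / sqrt(numTerms), as "the default"; the class name, method name and document lengths are invented for illustration and are not part of OpenCms or Lucene.

```java
public class LengthNormBias {

    // Assumption: the classic TF-IDF length norm, 1 / sqrt(numTerms).
    static double classicLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int shortDoc = 100;  // hypothetical document with 100 terms
        int longDoc = 1000;  // hypothetical document with 1000 terms

        double ratio = classicLengthNorm(shortDoc) / classicLengthNorm(longDoc);

        // With equal term frequency, the norm alone favors the shorter
        // document by sqrt(1000 / 100), a bit more than a factor of 3.
        System.out.println("short/long norm ratio: " + ratio);
    }
}
```

Under this assumption a tenfold difference in length is already enough to produce the roughly 3x score gap described above.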

    Inspired by Chuck Williams' WikipediaSimilarity.

    Since:
    6.0.0
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity

        org.apache.lucene.search.similarities.Similarity.SimScorer
    • Constructor Summary

      Constructors 
      Constructor Description
      CmsSearchSimilarity()
      Creates a new instance of the OpenCms search similarity.
    • Method Summary

      Modifier and Type Method Description
      long computeNorm​(org.apache.lucene.index.FieldInvertState state)
      Special implementation for "compute norm" to reduce the significance of this factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.
      boolean getDiscountOverlaps()
      Returns true iff overlap tokens are discounted from the document's length.
      org.apache.lucene.search.similarities.Similarity.SimScorer scorer​(float boost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)  
      void setDiscountOverlaps​(boolean v)
      Sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • CmsSearchSimilarity

        public CmsSearchSimilarity()
        Creates a new instance of the OpenCms search similarity.

    • Method Detail

      • computeNorm

        public final long computeNorm​(org.apache.lucene.index.FieldInvertState state)
        Special implementation for "compute norm" to reduce the significance of this factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.

        Specified by:
        computeNorm in class org.apache.lucene.search.similarities.Similarity
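
The per-field behavior described above could look roughly like the following sketch. It is not the actual OpenCms code: the field name constant, the logarithmic damping formula and the square-root default are all assumptions chosen for illustration, and the sketch works with plain doubles instead of the encoded long value that computeNorm returns.

```java
public class PerFieldNormSketch {

    // Hypothetical stand-in for CmsSearchField.FIELD_CONTENT.
    static final String FIELD_CONTENT = "content";

    // Assumption: classic length norm used as the default for other fields.
    static double defaultNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Hypothetical dampened norm: it shrinks only with the logarithm of the
    // length, so short and long documents end up much closer together.
    static double dampenedNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    // Dampened norm for the content field, default norm for everything else.
    static double norm(String fieldName, int numTerms) {
        return FIELD_CONTENT.equals(fieldName)
            ? dampenedNorm(numTerms)
            : defaultNorm(numTerms);
    }

    public static void main(String[] args) {
        for (String field : new String[] {FIELD_CONTENT, "title"}) {
            double ratio = norm(field, 100) / norm(field, 1000);
            System.out.println(field + ": short/long norm ratio = " + ratio);
        }
    }
}
```

With these assumed formulas, the short-to-long ratio for the content field stays close to 1, while other fields keep the stronger preference for short documents.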
      • getDiscountOverlaps

        public boolean getDiscountOverlaps()
        Returns true iff overlap tokens are discounted from the document's length.
        Returns:
        true iff overlap tokens are discounted from the document's length.
        See Also:
        setDiscountOverlaps(boolean)
      • scorer

        public org.apache.lucene.search.similarities.Similarity.SimScorer scorer​(float boost,
                                                                                 org.apache.lucene.search.CollectionStatistics collectionStats,
                                                                                 org.apache.lucene.search.TermStatistics... termStats)
        Specified by:
        scorer in class org.apache.lucene.search.similarities.Similarity
        See Also:
        Similarity.scorer(float, org.apache.lucene.search.CollectionStatistics, org.apache.lucene.search.TermStatistics[])
      • setDiscountOverlaps

        public void setDiscountOverlaps​(boolean v)
        Sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm. By default this is true, meaning overlap tokens do not count when computing norms.
        Parameters:
        v - if true, tokens with position increment 0 are ignored when computing the norm, otherwise they are not ignored.
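
The effect of this flag on the field length that feeds into the norm can be sketched as follows. This is a simplified, self-contained illustration, not the class's implementation: the helper name and the sample token stream are hypothetical, and position increments are passed in as a plain array rather than read from a FieldInvertState.

```java
public class OverlapDiscountSketch {

    // Counts the tokens that contribute to the field length.
    // positionIncrements[i] is the position increment of token i; an
    // increment of 0 marks an overlap token (for example an injected synonym).
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // overlap token: ignored for the norm
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        // Hypothetical token stream: "quick", "fast" (synonym, increment 0), "fox"
        int[] increments = {1, 0, 1};
        System.out.println(fieldLength(increments, true));  // prints 2
        System.out.println(fieldLength(increments, false)); // prints 3
    }
}
```

With discounting enabled, adding synonyms at index time does not make a document look longer, so it is not penalized by the length norm for them.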