Class CmsSearchSimilarity

java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.opencms.search.CmsSearchSimilarity

public class CmsSearchSimilarity extends org.apache.lucene.search.similarities.Similarity
Reduces the importance of the computeNorm(FieldInvertState) factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.

This implementation was added because the default length norm appears to be heavily biased toward short documents. With the default, even if a term occurs the same number of times in two documents, the shorter document (the one containing fewer terms) can easily score 3x as high as the longer one. This implementation reduces the influence of the document's term count on the score.
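The size of that bias can be checked with a little arithmetic. The sketch below assumes that "default length norm" refers to the classic TF-IDF norm, 1 / sqrt(numTerms), as used by Lucene's ClassicSimilarity; the class name and the term counts are hypothetical and chosen only for illustration.

```java
public class LengthNormBias {

    /** Classic TF-IDF length norm: 1 / sqrt(number of terms in the field). */
    static double classicLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int shortDoc = 100; // hypothetical term count of a short document
        int longDoc = 900;  // hypothetical term count of a long document

        double shortNorm = classicLengthNorm(shortDoc);
        double longNorm = classicLengthNorm(longDoc);

        // With equal term frequency, the score ratio is driven by the norm ratio.
        System.out.println("short norm: " + shortNorm);
        System.out.println("long norm:  " + longNorm);
        System.out.println("ratio:      " + (shortNorm / longNorm));
    }
}
```

Under this assumption a document that is nine times longer receives a norm three times smaller, since sqrt(900 / 100) = 3, which matches the "3x" figure quoted above.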

Inspired by Chuck Williams' WikipediaSimilarity.

Since:
6.0.0
  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity

    org.apache.lucene.search.similarities.Similarity.SimScorer
  • Constructor Summary

    Constructors
    Constructor
    Description
    CmsSearchSimilarity()
    Creates a new instance of the OpenCms search similarity.
  • Method Summary

    Modifier and Type
    Method
    Description
    final long
    computeNorm(org.apache.lucene.index.FieldInvertState state)
    Special implementation for "compute norm" to reduce the significance of this factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.
    boolean
    getDiscountOverlaps()
    Returns true iff overlap tokens are discounted from the document's length.
    org.apache.lucene.search.similarities.Similarity.SimScorer
    scorer(float boost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • CmsSearchSimilarity

      Creates a new instance of the OpenCms search similarity.

  • Method Details

    • computeNorm

      public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
      Special implementation for "compute norm" to reduce the significance of this factor for the CmsSearchField.FIELD_CONTENT field, while keeping the Lucene default for all other fields.

      Specified by:
      computeNorm in class org.apache.lucene.search.similarities.Similarity
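The per-field behavior described above can be sketched without any Lucene dependency. Everything in this example is an assumption made for illustration: the field name "content" stands in for CmsSearchField.FIELD_CONTENT, the logarithmic damping formula is a hypothetical choice and not necessarily the one this class uses, and the sketch returns a double for readability, whereas the real method returns an encoded long.

```java
public class FieldNormSketch {

    // Assumed value; stands in for CmsSearchField.FIELD_CONTENT.
    static final String FIELD_CONTENT = "content";

    /** Classic length norm, used here as the "default" for all other fields. */
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Hypothetical damped norm: shrinks only logarithmically with field length. */
    static double dampedLengthNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    /** Chooses the norm by field name: damped for content, default otherwise. */
    static double lengthNorm(String fieldName, int numTerms) {
        return FIELD_CONTENT.equals(fieldName)
            ? dampedLengthNorm(numTerms)
            : defaultLengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Ratio of the norm for a 100-term field to the norm for a 900-term field.
        double contentRatio = lengthNorm("content", 100) / lengthNorm("content", 900);
        double titleRatio = lengthNorm("title", 100) / lengthNorm("title", 900);
        System.out.println("content field ratio: " + contentRatio);
        System.out.println("other field ratio:   " + titleRatio);
    }
}
```

With a damping function of this shape, the short document keeps only a small advantage in the content field, while other fields retain the full default bias.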
    • getDiscountOverlaps

      public boolean getDiscountOverlaps()
      Returns true iff overlap tokens are discounted from the document's length.
      Returns:
      true iff overlap tokens are discounted from the document's length.
      See Also:
      • #setDiscountOverlaps(boolean)
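Overlap tokens are tokens with a position increment of 0, for example synonyms injected at the same position as the original term. A minimal sketch of what discounting them means for the length fed into the norm follows; the class and method names are hypothetical, and the two int parameters mirror the values exposed by FieldInvertState.getLength() and FieldInvertState.getNumOverlap().

```java
public class DiscountOverlapsSketch {

    /**
     * Returns the field length used for the norm.
     * When discountOverlaps is true, overlap tokens do not count toward the length.
     */
    static int effectiveLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // Hypothetical field: 10 tokens, 2 of which are overlapping synonyms.
        System.out.println(effectiveLength(10, 2, true));
        System.out.println(effectiveLength(10, 2, false));
    }
}
```

Discounting overlaps keeps synonym expansion from making a document look longer, and therefore less relevant, than it really is.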
    • scorer

      public org.apache.lucene.search.similarities.Similarity.SimScorer scorer(float boost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
      Specified by:
      scorer in class org.apache.lucene.search.similarities.Similarity
      See Also:
      • Similarity.scorer(float, org.apache.lucene.search.CollectionStatistics, org.apache.lucene.search.TermStatistics[])