Indexed data

To build advanced search functions, you need to know which data is indexed and how. This section shows what is indexed by default and which options OpenCms offers to configure or implement additional field configurations and mappings.

The Solr index schema (schema.xml)

Have a look at the Solr schema.xml first. In the file <CATALINA_HOME>/webapps/<OPENCMS>/WEB-INF/solr/configsets/default/conf/schema.xml you will find the field definitions used by OpenCms. During indexing, the fields are filled with suitable information. Below we mention only some particularly interesting fields.

You may extend the schema.xml, but be careful when removing fields or changing the configuration of a field or field type. OpenCms assumes that most of the fields are present as in the default schema.xml.

Basic information on fields and field types

The aim of this section is to make you familiar with the concept, so that you can at least read the schema file. It is by no means complete. See the Solr documentation for a deeper understanding of fields and field types.

When you add a document to Solr, you basically provide a map from field names to values. In the schema file, each field name in the map must be declared as a field of a specific field type. Moreover, for each field, the schema declares whether the field is:

stored - meaning that Solr keeps the original value of the field and can return it in query results.
indexed - meaning that Solr processes the original value in a way that you can search in the field.

The field type tells Solr how to process the original value of the field to prepare it for search. The field type therefore also determines how you can query a field. Here are some interesting field types:

Text fields: Here you can store text. Different field types are typically used for different languages, since the original text should be processed in a language-specific way to get good search results. A field typically matches if you match one word of the original text. E.g., if your original value is "OpenCms is great!", then a search for "OpenCms" would match, and so would a search for "great".
String fields: String fields differ from text fields. For string fields, the original values are not processed and you will match the field only if you match the original string exactly. In the OpenCms schema, the field parent-folders is of type string. This field holds all parent folders of a resource. You typically filter by the field's value with fq=parent-folders:"/sites/default/sub-folder/". Note that for string fields you need exact values with quotes around them.
Date fields: Here you can store dates and have special query options - for example a range query like fq=lastmodified:[NOW-1DAY TO NOW].
Numeric fields: Here you can store numbers. Different field types are declared for different types of numbers. On numeric fields you can do range queries as well, or have specific sort behavior.
Spatial fields: Here you can store geo coordinates to realize distance searches, like returning all point-of-interest contents that are within a radius of 10 kilometers of a reference point.
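The two filter queries mentioned in the list above can be assembled programmatically. The following is only a minimal sketch in Python: the helper name build_filter_query and the match-all main query *:* are assumptions made for illustration, while the two fq values are taken from the text above.

```python
from urllib.parse import parse_qs, urlencode


def build_filter_query(base_query: str, filters: list[str]) -> str:
    """Return a URL query string with one fq parameter per filter."""
    params = [("q", base_query)] + [("fq", f) for f in filters]
    return urlencode(params)


filters = [
    # String field: the value must match exactly and is quoted.
    'parent-folders:"/sites/default/sub-folder/"',
    # Date field: range query using Solr date math.
    "lastmodified:[NOW-1DAY TO NOW]",
]

query_string = build_filter_query("*:*", filters)
print(query_string)

# Decoding the query string again yields the original filters.
print(parse_qs(query_string)["fq"])
```

urlencode takes care of escaping the quotes, brackets and spaces, which is easy to get wrong when the URL is built by hand.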

Besides indexing behavior and field type, you specify how many values a field can hold:

single-valued fields - hold exactly one value.
multi-valued fields - can hold multiple values.
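To picture the "map from field names to values", here is a small Python sketch of a document with one single-valued and one multi-valued field. It is an illustration only, not the way OpenCms builds its documents: the helper parent_folders is hypothetical, and whether the root folder "/" itself is part of the indexed value is an assumption.

```python
def parent_folders(root_path: str) -> list[str]:
    """List all parent folders of a root path, each ending with '/'."""
    parts = root_path.strip("/").split("/")[:-1]  # drop the last path segment
    folders = ["/"]
    for part in parts:
        folders.append(folders[-1] + part + "/")
    return folders


path = "/sites/default/sub-folder/article.html"

# A document is a map from field names to values.
document = {
    "path": path,                            # single-valued: exactly one value
    "parent-folders": parent_folders(path),  # multi-valued: a list of values
}

print(document["parent-folders"])
```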

Dynamic fields

Dynamic fields are fields for which you declare only a template for the field's name and thereby allow all fields whose names follow the template. Dynamic fields are declared like:

*_{suffix} - which allows all fields named {my special field name}_{suffix}, e.g., if the declaration is *_de, the fields content_de, special_value_de, ... are allowed.
{prefix}_* - which allows all fields named {prefix}_{my special field name}, e.g., if the declaration is de_*, the fields de_excerpt, de_special_value, ... are allowed.

The great thing is that you can add new fields without altering the schema. OpenCms uses this heavily, e.g., when

indexing properties (with *_prop, *_dprop, *_prop_s, ...)
allowing for search settings for XML contents
...
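The template idea behind dynamic fields can be sketched in a few lines of Python. This is only an illustration: the pattern list is a hypothetical selection, and the sketch returns the first matching pattern in list order without modeling Solr's own precedence rules for overlapping patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical selection of dynamic field declarations.
DYNAMIC_FIELD_PATTERNS = ["*_de", "de_*", "*_prop"]


def matching_pattern(field_name, patterns=DYNAMIC_FIELD_PATTERNS):
    """Return the first pattern that allows the field name, or None."""
    for pattern in patterns:
        if fnmatchcase(field_name, pattern):
            return pattern
    return None


for name in ["content_de", "de_excerpt", "Title_prop", "content"]:
    print(name, "->", matching_pattern(name))
```

A field name such as content matches none of the patterns, so it would have to be declared explicitly in the schema.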

Interesting fields in OpenCms

OpenCms indexes a lot of information for each resource. To get the full overview, look at the schema.xml, experiment with the Solr handler, or even look into the code. Here we list only some interesting fields.

Specially handled fields for sorting and filtering

Since OpenCms 11, a set of default fields for sort options is provided and used by the integrated list. We recommend using these fields for sorting in other search functions as well, since:

The fields are filled with values for all contents (even PDFs etc.)
The indexed values are the ones you typically need for the sort options
The field types are adjusted for sorting (specifically the one for the title sort option)
The fields are present without locale, e.g. disptitle_sort instead of disptitle_{locale}_sort, and additionally in a localized variant for each locale the content should be available in.
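The naming scheme for the localized and non-localized variants can be sketched as follows. The helper names sort_field and sort_param are hypothetical; only the resulting field names follow the pattern described above.

```python
def sort_field(base, suffix, locale=None):
    """Build the field name, e.g. disptitle_de_sort or disptitle_sort."""
    if locale:
        return f"{base}_{locale}_{suffix}"
    return f"{base}_{suffix}"


def sort_param(base, suffix, locale=None, descending=False):
    """Build the value of a Solr sort parameter for the field."""
    direction = "desc" if descending else "asc"
    return f"{sort_field(base, suffix, locale)} {direction}"


print(sort_field("disptitle", "sort", "de"))
print(sort_field("disptitle", "sort"))
print(sort_param("disptitle", "sort", "en"))
```

Falling back to the variant without locale is convenient when a search function does not know the locale of the request.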

These fields are used by the list type integrated since OpenCms 11.

For search settings, there's a short-hand notation with field settings to map correctly to the fields described here.

List of specific fields for sorting

disptitle_{locale}_sort: Default field for alphabetical sorting by title. It is treated specially on indexing. If the field is not filled via a search setting in the schema of a content type, its value falls back to the (localized) title property.

The field is defined as dynamic field *_sort in the schema and it is indexed in a suitable way for alpha-numerical sorting. If you want to improve sorting by locale specific sort fields, you could add dynamic fields *_{locale}_sort.

The field is indexed in such a way that, even for PDFs etc., the fields disptitle_sort and disptitle_{locale}_sort are filled for all locales the indexed file is available in.
disporder_{locale}_i: Default field for individual sorting of contents (by a manually assigned integer). You can fill it via a search setting in the schema of a content type, or set it via the (inherited) property display-order.

The field is always indexed as disporder_i and disporder_{locale}_i for all locales the indexed resource is available in.
instancedate_{locale}_dt: Default field for sorting by date. You can fill it via a search setting in the schema of a content type. Otherwise, it is set from the index field whose name is the value of the (inherited) property instancedate.copyfield. If this property is not set, the release date of the file is used, and if no release date is set either, the date last modified.

The field is always indexed as instancedate_dt and instancedate_{locale}_dt for all locales the indexed resource is available in.
instancedatecurrenttill_{locale}_dt: Default field used for time ranges. You can fill it via a search setting in the schema of a content type. Otherwise, it is set from the index field whose name is the value of the (inherited) property instancedate.copyfield. If this property is not set, the release date of the file is used, and if no release date is set either, the date last modified.

The field is always indexed as instancedatecurrenttill_dt and instancedatecurrenttill_{locale}_dt for all locales the indexed resource is available in.

Usually, instancedate and instancedatecurrenttill are the same, but you may have events where instancedate holds the start time and instancedatecurrenttill the end time of the event. The reason: you may have a list showing the next 10 events and want to show "running" events in it as well.
instancedaterange_{locale}_dr: Alternative field for instancedate_{locale}_dt, introduced especially for longer-lasting events that span more than one day. If you search for all events that take place on a certain day, it is sometimes interesting to retrieve all events starting on that day; in other situations you want to retrieve all events starting or still running on that day. With instancedaterange_{locale}_dr, filtering and also faceting by date ranges is possible.
geocoords_loc: Default field used for distance searching. If a content has no Geo coordinates set, the default coordinates 0.000000,0.000000 are used.
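How these fields might be combined in a query is sketched below in Python. The helper names, the match-all main query and the example coordinates are assumptions for illustration. The distance filter uses Solr's geofilt query parser, which expects the distance in kilometers; whether it fits your setup depends on the field type configured for geocoords_loc in your schema.

```python
from urllib.parse import urlencode


def upcoming_events_params(locale, rows=10):
    """Parameters for the next events, including events still running."""
    return [
        ("q", "*:*"),
        # Keep every event that has not ended yet.
        ("fq", f"instancedatecurrenttill_{locale}_dt:[NOW TO *]"),
        # Order by the start date, earliest first.
        ("sort", f"instancedate_{locale}_dt asc"),
        ("rows", str(rows)),
    ]


def within_radius_filter(lat, lon, km):
    """Filter query for contents within km kilometers of a point."""
    return f"{{!geofilt sfield=geocoords_loc pt={lat},{lon} d={km}}}"


params = upcoming_events_params("en")
params.append(("fq", within_radius_filter(50.94, 6.96, 10)))
print(urlencode(params))
```

Filtering on instancedatecurrenttill instead of instancedate is what keeps running events in the list. Note that contents without coordinates are indexed at 0.000000,0.000000 and would match a radius around that point.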

Specially handled fields for spell fields / did you mean

A "Did you mean?" feature can be implemented via the Solr spellcheck handler. To get good results, the field used for spell checking should have specialized indexing behavior, typically specific to the requested locale.

By default, Solr is configured with the fields de_spell and en_spell to support spellchecking in German and English. These fields have the special types spell_de and spell_en, respectively, and two spellcheckers, de and en, are configured, each using one of the fields. The spellcheckers are of type DirectSolrSpellChecker, i.e., they work on a field in the index directly and do not need an extra index.

The fields are filled with the localized extracted content of a file and the value of its title property. The spell fields are filled only for contents in sites or in the shared folder.

To get further locales supported, configure similarly:

A field type spell_{locale}
A field {locale}_spell of type spell_{locale}
A spellcheck component {locale} (in the solrconfig.xml, in the same folder as the schema.xml).

To use the spellchecker, you can query it as follows (example for the Solr handler; log in before you use it and remove the comments starting with //):

http://localhost:8080/opencms/handleSolrSelect
  ?q=OpenCns roks //otherwise the handler will not search
  &wt=xml //just to make the output more readable
  &rows=0 // we do not want results
  &spellcheck=on // turn on the spellcheck component
  &spellcheck.dictionary=en // use the English dictionary
  &spellcheck.q="OpenCns roks" // the spellcheck query
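The same request can be assembled without the manual comment removal. A minimal Python sketch, assuming the handler URL and parameters from the example above (the function name spellcheck_url is hypothetical):

```python
from urllib.parse import urlencode

HANDLER_URL = "http://localhost:8080/opencms/handleSolrSelect"


def spellcheck_url(text, dictionary):
    """Build the spellcheck request URL for the given text and dictionary."""
    params = {
        "q": text,                            # otherwise the handler will not search
        "wt": "xml",                          # readable output
        "rows": 0,                            # no results wanted
        "spellcheck": "on",                   # turn on the spellcheck component
        "spellcheck.dictionary": dictionary,  # e.g. "en" or "de"
        "spellcheck.q": f'"{text}"',          # the spellcheck query
    }
    return f"{HANDLER_URL}?{urlencode(params)}"


print(spellcheck_url("OpenCns roks", "en"))
```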

Here the handler would return (in OpenCms 11 with default demo):

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="opencns">
      <int name="numFound">1</int>
      <int name="startOffset">1</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>opencms</str>
      </arr>
    </lst>
    <lst name="opencns roks">
      <int name="numFound">1</int>
      <int name="startOffset">1</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>opencms rocks</str>
      </arr>
    </lst>
  </lst>
</lst>
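To use the suggestions in a "Did you mean?" link, the response has to be parsed. The following Python sketch extracts the suggestions from a response shaped like the one above. The sample is an abbreviated copy of that output, and the function name parse_suggestions is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the spellcheck part of the response shown above.
SAMPLE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="opencns">
      <arr name="suggestion"><str>opencms</str></arr>
    </lst>
    <lst name="opencns roks">
      <arr name="suggestion"><str>opencms rocks</str></arr>
    </lst>
  </lst>
</lst>
"""


def parse_suggestions(response_xml):
    """Map each misspelled term to the list of suggested replacements."""
    root = ET.fromstring(response_xml)
    result = {}
    # ".//" also finds the element when it is nested in a full response.
    for entry in root.findall(".//lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        result[entry.get("name")] = words
    return result


print(parse_suggestions(SAMPLE))
```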

Further interesting fields

solr_id: The unique identifier of the document in the Solr index since OpenCms 11. Unless you use serial dates, it is identical to the id field.
id: Structure id used as a unique identifier for a document (The structure id of the resource).
path: Full root path (The root path of the resource, e.g., /sites/default/flower_en/.content/article.html)
parent-folders: Parent folders (multi-valued field containing an entry for each parent path as root path).
type: Type name (the resource type name).
res_locales: Existing locale nodes for XML content and all available locales in the case of binary files.
created: The creation date (the date when the resource itself was created).
lastmodified: The date last modified (The last modification date of the resource itself).
released: The release date of the resource.
expired: The expiration date of the resource.
content: A general content field that holds all extracted resource data (all languages, type text_general).
content_{locale}: Extracted textual content optimized for the language specific search (Default languages: en, de, el, es, fr, hu, it). This is typically a field to search in and to highlight on.
{locale}_excerpt: Holds only an excerpt of the content, i.e., the first part of the localized content field. This can be useful if you print excerpts of contents and not use highlighting.
category_exact: All categories as exact strings for faceting purposes.
*_prop: All searched properties of a resource, i.e., properties set at the resource or at a folder the resource is located in, as searchable and stored text (field name: <Property_Definition_Name>_prop as text_general).
*_prop_s: Same as *_prop, but stored as String.
*_dprop: All properties directly set at a resource as searchable and stored text (field name: <Property_Definition_Name>_dprop as text_general).
*_dprop_s: Same as *_dprop but stored as String.
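A typical full-text query combines several of the fields listed above. The Python sketch below is only an illustration: the helper names and the resource type article are assumptions, and the search text is inserted without escaping Solr's special characters. The parameters fl, hl and hl.fl are standard Solr parameters for the returned fields and for highlighting.

```python
from urllib.parse import urlencode


def property_field(property_name, direct_only=False, as_string=False):
    """Field name for a property, e.g. Title_prop or Title_dprop_s."""
    suffix = "_dprop" if direct_only else "_prop"
    if as_string:
        suffix += "_s"
    return property_name + suffix


def search_params(text, locale, resource_type):
    """Search the localized content field and highlight the matches."""
    content_field = f"content_{locale}"
    return [
        ("q", f"{content_field}:({text})"),
        ("fq", f"type:{resource_type}"),
        ("fl", "path," + property_field("Title")),
        ("hl", "true"),
        ("hl.fl", content_field),
    ]


print(urlencode(search_params("flower", "en", "article")))
```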

Special indexing for series contents

Since OpenCms 11, you can define a series of events in one XML content. If the series is stored with the schema type CmsXmlSerialDateValue, the contents are indexed as follows:

For each date in the series, the content is indexed once.
The following fields (also in their variants without _{locale}) are set accordingly for each single event instance:
- instancedate_{locale}_dt (start time of the single event),
- instancedatecurrenttill_{locale}_dt (start time or end time of the single event, depending on configuration)
- instancedateend_{locale}_dt (end time of the single event)
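Since a series content is indexed once per date, a search can return several documents for the same content. The Python sketch below groups such results by the id field. The result documents are made up for illustration, and in particular the solr_id values are hypothetical; the text above only states that solr_id is the unique identifier and differs from id when serial dates are used.

```python
from collections import defaultdict

# Hypothetical result documents: one series with two dates, one single event.
docs = [
    {"id": "a1", "solr_id": "a1-0002", "instancedate_en_dt": "2024-05-08T18:00:00Z"},
    {"id": "a1", "solr_id": "a1-0001", "instancedate_en_dt": "2024-05-01T18:00:00Z"},
    {"id": "b2", "solr_id": "b2", "instancedate_en_dt": "2024-05-03T10:00:00Z"},
]


def group_instances(docs, date_field):
    """Map each content id to the sorted start dates of its instances."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc["id"]].append(doc[date_field])
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    return {content_id: sorted(dates) for content_id, dates in grouped.items()}


print(group_instances(docs, "instancedate_en_dt"))
```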