OpenCms documentation

Indexing configuration

Resources in OpenCms have to be indexed, i.e., added to a Solr index, to make them searchable. By default, OpenCms indexes resources in a reasonable way. If your requirements do not fit the default setup, there are many configuration options, which are explained here.

We also point out some noteworthy details of the default indexing process.

By default, OpenCms ships with a "Solr Online" and a "Solr Offline" index: one indexes the online versions of resources, the other the offline versions. "Solr Offline" is used internally and should not be removed. An index configuration brings together the index implementation (class attribute), the rebuild behavior, the field configuration (schema), and the sources of the indexed resources. To add a new Solr index, you can copy the default configuration and use it as a template.

<index class="org.opencms.search.solr.CmsSolrIndex">
	<name>Solr Online</name>
	<rebuild>auto</rebuild>
	<project>Online</project>
	<locale>all</locale>
	<configuration>solr_fields</configuration>
	<sources>
		<source>solr_source</source>
	</sources>
	<!-- optional parameters -->
</index>
Solr indexes take several optional parameters. Have a look at the index configurations in the default opencms-search.xml to find out about the various options.

Configurable post processor

OpenCms can post-process Solr documents found by a search, after each document has been checked for permissions. This allows you to add fields to a found document before the search result is returned. To use a post processor, add an optional parameter to the search index as follows:

<index class="org.opencms.search.solr.CmsSolrIndex">
   <name>Solr Offline</name>
   <rebuild>offline</rebuild>
   <project>Offline</project>
   <locale>all</locale>
   <configuration>solr_fields</configuration>
   <sources>
     [...]
   </sources>
   <param name="search.solr.postProcessor">my.package.MyPostProcessor</param>
</index>

The specified class for the parameter search.solr.postProcessor must be an implementation of org.opencms.search.solr.I_CmsSolrPostSearchProcessor.

The default indexes have a post processor configured that adds a "link" field to each document. The field contains the correct link to the search result. This is required in particular when using the Solr handler, but it is also useful for server-side search queries, because you no longer need a <cms:link> tag.

Index sources for Solr are configured in the file opencms-search.xml in exactly the same way as for Lucene indexes. An index source specifies which folders and which document types are added to an index. Each index then references such an index source definition by name.

For Solr indexes, to use the advanced XSD field mapping for XML contents, you must add the document type xmlcontent-solr to the list of indexed document types:

<indexsource>
	<name>solr_source</name>
	<indexer class="org.opencms.search.CmsVfsIndexer" />
	<resources>
		<resource>/sites/default/</resource>
	</resources>
	<documenttypes-indexed>
		<name>xmlcontent-solr</name>
		<name>containerpage</name>
		<name>xmlpage</name>
		<name>text</name>
		<name>pdf</name>
		<name>image</name>
		<name>msoffice-ole2</name>
		<name>msoffice-ooxml</name>
		<name>openoffice</name>
	</documenttypes-indexed>
</indexsource>
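
As a sketch of how an index and its source fit together, the following fragment defines a second index source and an index that references it by name. The names my_solr_source and My Solr Index and the folder /sites/default/documents/ are hypothetical and chosen only for illustration; all other elements are copied from the default configuration shown above.

```xml
<!-- Hypothetical index source: only XML contents and PDFs below one folder -->
<indexsource>
	<name>my_solr_source</name>
	<indexer class="org.opencms.search.CmsVfsIndexer" />
	<resources>
		<resource>/sites/default/documents/</resource>
	</resources>
	<documenttypes-indexed>
		<name>xmlcontent-solr</name>
		<name>pdf</name>
	</documenttypes-indexed>
</indexsource>

<!-- Hypothetical index that links to the source above via its name -->
<index class="org.opencms.search.solr.CmsSolrIndex">
	<name>My Solr Index</name>
	<rebuild>auto</rebuild>
	<project>Online</project>
	<locale>all</locale>
	<configuration>solr_fields</configuration>
	<sources>
		<source>my_solr_source</source>
	</sources>
</index>
```

In opencms-search.xml, index sources and indexes belong to different sections of the file, so the two elements would not appear next to each other as they do in this fragment.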

Document types, also defined in opencms-search.xml, declare how the content of documents of a specific resource type or MIME type is extracted.

OpenCms 8.5 introduced a new document type called xmlcontent-solr. Its implementation (CmsSolrDocumentXmlContent) performs a localized content extraction whose result is later used to fill the Solr input document.

<documenttype>
	<name>xmlcontent-solr</name>
	<class>org.opencms.search.solr.CmsSolrDocumentXmlContent</class>
	<mimetypes>
		<mimetype>text/html</mimetype>
	</mimetypes>
	<resourcetypes>
		<resourcetype>xmlcontent-solr</resourcetype>
	</resourcetypes>
</documenttype>

By default, the field configuration for OpenCms Solr indexes is implemented by the class org.opencms.search.solr.CmsSolrFieldConfiguration. The simplest Solr field configuration declared in opencms-search.xml looks as follows. See also the section about extending the CmsSolrFieldConfiguration.

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
	<name>solr_fields</name>
	<description>The Solr search index field configuration.</description>
	<fields />
</fieldconfiguration>

An existing Lucene field configuration can easily be transformed into a Solr field configuration. To do so, create a new Solr field configuration. As a template, you can use the snippet shown in the section about the Solr default field configuration. Just copy the list of fields from the Lucene configuration you want to convert into that skeleton.

There exists a specific strategy to map the Lucene field names to Solr field names:

  • Exact name matching: OpenCms tries to find an explicit Solr field whose name exactly matches the value of the name attribute. E.g., for <field name="meta"> ... </field>, OpenCms looks for an explicit Solr field definition named meta. To use this strategy, you have to edit Solr's schema.xml manually and add explicit field definitions named exactly like the Lucene fields.
  • Type specific fields: The existing Lucene configuration has no type specific field definitions, but the Solr schema.xml defines different data types for fields. To benefit from these types (e.g., language specific analyzing/tokenizing) without modifying Solr's schema.xml, define a type attribute for the relevant fields. The value of the type attribute can be the suffix of any <dynamicField> configured in schema.xml whose name starts with *_. The resulting field inside the Solr document is then named <luceneFieldName>_<dynamicFieldSuffix>.
  • Fallback: If no type attribute is defined and schema.xml contains no explicit field named like the Lucene field, OpenCms falls back to the field type text_general. E.g., a Lucene field <field name="title" index="true"> ... </field> is stored as a dynamic field named title_txt in the Solr index.
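
The three strategies can be illustrated with a few schema.xml entries. This is a sketch, not the schema shipped with OpenCms: the field types (string, text_en, text_general) and the indexed/stored values are assumptions for illustration, and the comments show the field names that would result according to the rules above.

```xml
<!-- Exact name matching: <field name="meta"> maps to this explicit field "meta" -->
<field name="meta" type="text_general" indexed="true" stored="false" />

<!-- Type specific fields: <field name="title-key" type="s"> becomes "title-key_s" -->
<dynamicField name="*_s" type="string" indexed="true" stored="true" />

<!-- Type specific fields: <field name="keywords" type="en"> becomes "keywords_en" -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" />

<!-- Fallback: <field name="title"> without a type attribute becomes "title_txt" -->
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" />
```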

An original field configuration is as follows:

<fieldconfiguration>
 <name>standard</name>
 <description>The standard OpenCms 8.0 search field configuration.</description>
 <fields>
   <field name="content" store="compress" index="true" excerpt="true">
     <mapping type="content"/>
   </field>
   <field name="title-key" display="-" store="true" index="untokenized" boost="0.0">
     <mapping type="property">Title</mapping>
   </field>
   <field name="title" display="%(key.field.title)" store="false" index="true">
     <mapping type="property">Title</mapping>
   </field>
   <field name="keywords" display="%(key.field.keywords)" store="true" index="true">
     <mapping type="property">Keywords</mapping>
   </field>
   <field name="description" store="true" index="true">
     <mapping type="property">Description</mapping>
   </field>
   <field name="meta" display="%(key.field.meta)" store="false" index="true">
     <mapping type="property">Title</mapping>
     <mapping type="property">Keywords</mapping>
     <mapping type="property">Description</mapping>
   </field>
 </fields>
</fieldconfiguration>

After conversion, it could look like this:

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
 <name>standard</name>
 <description>The standard OpenCms 8.0 Solr search field configuration.</description>
 <fields>
   <field name="content" store="compress" index="true" excerpt="true">
     <mapping type="content"/>
   </field>
   <field name="title-key" store="true" index="untokenized" boost="0.0" type="s">
     <mapping type="property">Title</mapping>
   </field>
   <field name="title" store="false" index="true" type="prop">
     <mapping type="property">Title</mapping>
   </field>
   <field name="keywords" store="true" index="true" type="prop">
     <mapping type="property">Keywords</mapping>
   </field>
   <field name="description" store="true" index="true" type="prop">
     <mapping type="property">Description</mapping>
   </field>
   <field name="meta" store="false" index="true" type="en">
     <mapping type="property">Title</mapping>
     <mapping type="property">Keywords</mapping>
     <mapping type="property">Description</mapping>
   </field>
 </fields>
</fieldconfiguration>

The default indexing mechanism has some noteworthy features. Typically, everything behaves as expected, but in corner cases the special behavior described here might cause trouble, so it is good to know about it.

The OpenCms Solr search index implements a default strategy for multi-language support. For binary documents, the language is determined automatically from the extracted text. An exception is made for documents that follow the naming scheme {name}_{locale}.{suffix}, e.g., mydoc_en.pdf. In this case, the locale from the document name is used. That is, mydoc_en.pdf will be indexed for English, independent of the actual language of its content. The default detection mechanism is implemented with: http://code.google.com/p/language-detection/.

For XML contents, the concrete language/locale information is available, and the localized fields end with an underscore followed by the locale, e.g., content_en, content_de or text_en, text_de. By default, the target fields of all field mappings defined within the XSD of a resource type are extended by the suffix _<locale>.
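
As an illustration, a search setting in the XSD of a resource type might look like the following. The element name Title and the target field name atitle are hypothetical, and the exact set of supported elements and attributes is an assumption that should be checked against the OpenCms documentation on XSD field mappings for your version.

```xml
<searchsettings>
	<!-- Maps the content element "Title" to a Solr field with the base name "atitle" -->
	<searchsetting element="Title" searchcontent="true">
		<solrfield targetfield="atitle" />
	</searchsetting>
</searchsettings>
```

With the default locale suffix, the English and German values of this element would end up in the fields atitle_en and atitle_de.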

OpenCms can index documents that are distributed over more than one resource. The relation between these resources is derived from their file names. The standard implementation can be found in org.opencms.search.documents.CmsDocumentDependency.

For better indexing performance, the extraction result is cached; see org.opencms.search.extractors.I_CmsExtractionResult.