In OpenCms since: 8.5 Documented since: 8.5 Latest revision for: 8.5 Valid for OpenCms: 10.5.3

Built on Apache Lucene, Apache Solr has grown over the years into an enterprise search platform. It is a standalone enterprise search server with a REST-like API. You put documents into it (called "indexing") via XML, JSON or binary over HTTP. You query it via HTTP GET and receive XML, JSON, or binary results. For a more detailed explanation of what Solr is and how it works, please visit the Apache Solr project website. Searching through Solr's powerful and flexible REST-like interface reduces development complexity. Moreover, you can rely on existing graphical interfaces that provide comfortable AJAX-based search functionality to the end users of your internet or intranet application.

Searching for content in OpenCms

Since version 8.5, OpenCms integrates Apache Solr, not only for full-text search but also as a powerful enterprise search platform.

Demo

The documentation itself features a Solr-based search facility. Look for the magnifier icon in the top navigation.

Quickstart example

Send a REST-like query

Imagine you want to show a list of "all articles that have changed since yesterday and whose property 'X' has the value 'Y'":

http://localhost:8080/opencms/opencms/handleSolrSelect?
    fq=type:v8article
    &fq=lastmodified:[NOW-1DAY TO NOW]
    &fq=Title_prop:Flower

Parameter explanation:

http://localhost:8080/opencms/opencms/handleSolrSelect    
   // The URI of the OpenCms Solr Select Handler configured in 
   // 'opencms-system.xml'
    ?fq=type:v8article                   // Filter query on the field type
                                         // with the value 'v8article'
    &fq=lastmodified:[NOW-1DAY TO NOW]   // Filter query on the field lastmodified
                                         // with a range query from 'NOW-1DAY TO NOW'
    &fq=Title_prop:Flower                // Filter query on the field Title_prop
                                         // with the value 'Flower'

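
The same request can also be assembled in code. The following Java sketch is not part of OpenCms: the class SolrSelectUrl and its method build are hypothetical helpers, and the handler URI is the one from the example above. It only illustrates that each filter query becomes its own URL-encoded fq parameter.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class SolrSelectUrl {

    /** Joins the filter queries into repeated, URL-encoded 'fq' parameters. */
    public static String build(String handlerUri, List<String> filterQueries) {
        String params = filterQueries.stream()
            // URLEncoder applies form encoding: reserved characters are
            // percent-encoded and a space becomes '+'.
            .map(fq -> "fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
        return handlerUri + "?" + params;
    }

    public static void main(String[] args) {
        System.out.println(build(
            "http://localhost:8080/opencms/opencms/handleSolrSelect",
            List.of("type:v8article", "lastmodified:[NOW-1DAY TO NOW]", "Title_prop:Flower")));
    }
}
```

Keeping every filter in a separate fq parameter, rather than joining them into one query string, lets Solr treat each filter independently.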
A note on the Solr query syntax

For a general overview of the Solr query syntax, see Solr query syntax. For advanced features, see Searching - Solr Reference Guide - Lucid Imagination.

Please note that many characters in the Solr Query Syntax (most notably the plus sign: "+") are special characters in URLs, so when constructing request URLs manually, you must properly URL-encode these characters.

                                  q=  +popularity:[10   TO   *]     +section:0
http://localhost:8983/solr/select?q=%2Bpopularity:[10%20TO%20*]%20%2Bsection:0
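
If you build such URLs in Java, the standard library can do the encoding. The following is a minimal sketch; the class name SolrQueryEncoding and the method encode are made up for illustration. Note that URLEncoder also encodes the colon and the square brackets, which is equally valid, and that it writes spaces as "+", so the sketch replaces those with %20 to stay close to the example above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrQueryEncoding {

    /** Percent-encodes a raw Solr query for use as a URL parameter value. */
    public static String encode(String rawQuery) {
        // Literal plus signs are turned into %2B first, so every '+' left in
        // the encoded string stands for a space and can be rewritten as %20.
        return URLEncoder.encode(rawQuery, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        String raw = "+popularity:[10 TO *] +section:0";
        // prints http://localhost:8983/solr/select?q=%2Bpopularity%3A%5B10%20TO%20*%5D%20%2Bsection%3A0
        System.out.println("http://localhost:8983/solr/select?q=" + encode(raw));
    }
}
```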

For more information, see Yonik Seeley's blog on Nested Queries in Solr.

You can pass any "Solr valid" input to the new OpenCms Solr request handler (handleSolrSelect). The Solr Wiki page Search and Indexing is a good place to get familiar with the Solr query syntax.

Retrieving the response

By default, Solr produces its response as XML or JSON. With the additional parameter 'wt' you can specify the QueryResponseWriter that Solr should use. For a query like the one shown above, the result can look similar to this (the sample was recorded with a filter on contentdate instead of lastmodified):

<response>
	<lst name="responseHeader">
		<int name="status">0</int>
		<int name="QTime">7</int>
		<lst name="params">
			<str name="qt">dismax</str>
			<str name="fl">*,score</str>
			<int name="rows">50</int>
			<str name="q">*:*</str>
			<arr name="fq">
				<str>type:v8article</str>
				<str>contentdate:[NOW-1DAY TO NOW]</str>
				<str>Title_prop:Flower</str>
			</arr>
			<long name="start">0</long>
		</lst>
	</lst>
	<result name="response" numFound="2" start="0">
		<doc>
			<str name="id">51041618-77f5-11e0-be13-000c2972a6a4</str>
			<str name="contentblob">[B:[B@6c1cb5</str>
			<str name="path">/sites/default/.content/article/a_00003.html</str>
			<str name="type">v8article</str>
			<str name="suffix">.html</str>
			<date name="created">2011-05-06T15:27:13Z</date>
			<date name="lastmodified">2011-08-17T13:58:29Z</date>
			<date name="contentdate">2012-09-03T10:41:13.56Z</date>
			<date name="released">1970-01-01T00:00:00Z</date>
			<date name="expired">292278994-08-17T07:12:55.807Z</date>
			<arr name="res_locales">
				<str>en</str>
				<str>de</str>
			</arr>
			<arr name="con_locales">
				<str>en</str>
			</arr>
			<str name="template_prop">
				/system/modules/com.alkacon.opencms.v8.template3/templates/main.jsp</str>
			<str name="style.layout_prop">/.content/style</str>
			<str name="NavText_prop">OpenCms 8 Demo</str>
			<str name="Title_prop">Flower Today</str>
			<arr name="content_en">
				<str>News from the world of flowers Flower Today In this [...]</str>
			</arr>
			<date name="timestamp">2012-09-03T10:45:47.055Z</date>
			<float name="score">1.0</float>
		</doc>
		<doc>
			<str name="id">ac56418f-77fd-11e0-be13-000c2972a6a4</str>
			<str name="contentblob">[B:[B@1d0e4a2</str>
			<str name="path">/sites/default/.content/article/a_00030.html</str>
			<str name="type">v8article</str>
			<str name="suffix">.html</str>
			<date name="created">2011-05-06T16:27:02Z</date>
			<date name="lastmodified">2011-08-17T14:03:27Z</date>
			<date name="contentdate">2012-09-03T10:41:18.155Z</date>
			<date name="released">1970-01-01T00:00:00Z</date>
			<date name="expired">292278994-08-17T07:12:55.807Z</date>
			<arr name="res_locales">
				<str>en</str>
				<str>de</str>
			</arr>
			<arr name="con_locales">
				<str>en</str>
			</arr>
			<str name="template_prop">
				/system/modules/com.alkacon.opencms.v8.template3/templates/main.jsp
			</str>
			<str name="style.layout_prop">/.content/style</str>
			<str name="NavText_prop">OpenCms 8 Demo</str>
			<str name="Title_prop">Flower Dictionary</str>
			<arr name="content_en">
				<str>The different types of flowers Flower Dictionary There are
					[...]</str>
			</arr>
			<date name="timestamp">2012-09-03T10:45:49.265Z</date>
			<float name="score">1.0</float>
		</doc>
	</result>
</response>
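
A client that receives such an XML response can read it with any XML parser. The following Java sketch uses only the JDK; the class SolrResponseReader and its methods are hypothetical and not part of OpenCms or Solr. It assumes the response layout shown above, with a result element carrying numFound and one doc element per hit.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SolrResponseReader {

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XML response", e);
        }
    }

    /** Returns the total number of matches from the numFound attribute. */
    public static long readNumFound(String xml) {
        Element result = (Element) parse(xml).getElementsByTagName("result").item(0);
        return Long.parseLong(result.getAttribute("numFound"));
    }

    /** Returns the value of the 'path' field of every doc element. */
    public static List<String> readPaths(String xml) {
        List<String> paths = new ArrayList<>();
        NodeList docs = parse(xml).getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            NodeList fields = ((Element) docs.item(i)).getElementsByTagName("str");
            for (int j = 0; j < fields.getLength(); j++) {
                Element field = (Element) fields.item(j);
                // Entries inside <arr> have no name attribute and are skipped here.
                if ("path".equals(field.getAttribute("name"))) {
                    paths.add(field.getTextContent().trim());
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        String xml = "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
            + "<doc><str name=\"path\">/sites/default/.content/article/a_00003.html</str></doc>"
            + "</result></response>";
        System.out.println(readNumFound(xml) + " " + readPaths(xml));
    }
}
```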

Sending a Java-API query

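import java.util.Date;
import java.util.List;
import org.opencms.main.OpenCms;
import org.opencms.search.CmsSearchResource;
import org.opencms.search.fields.I_CmsSearchField;
import org.opencms.search.solr.CmsSolrResultList;

String query = "fq=type:v8article&fq=lastmodified:[NOW-1DAY TO NOW]&fq=Title_prop:Flower";
CmsSolrResultList results = OpenCms.getSearchManager()
    .getIndexSolr("Solr Online Index").search(getCmsObject(), query);
for (CmsSearchResource searchRes : results) {
    // Single-valued, date, and multi-valued fields each have their own accessor.
    String path = searchRes.getField(I_CmsSearchField.FIELD_PATH);
    Date date = searchRes.getDateField(I_CmsSearchField.FIELD_DATE_LASTMODIFIED);
    List<String> cats = searchRes.getMultivaluedField(I_CmsSearchField.FIELD_CATEGORY);
}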

The class org.opencms.search.solr.CmsSolrResultList encapsulates a list of 'OpenCms resource documents' (CmsSearchResource).
The list can be accessed exactly like an ArrayList. Its entries are of the type CmsSearchResource, which extends CmsResource and holds the Solr implementation of I_CmsSearchDocument as a member. This lets you handle the results like any other List and work on the entries as you would on a CmsResource.

Using the CmsSolrQuery-class for querying Solr

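CmsSolrIndex index = OpenCms.getSearchManager().getIndexSolr("Solr Online Index");
// CmsSolrQuery takes request-style parameters: each name maps to an array of values.
Map<String, String[]> parameters = new HashMap<String, String[]>();
// Restrict the search to a single resource with a filter query on the path field.
parameters.put("fq", new String[] {"path:\"/sites/default/xmlcontent/article_0001.html\""});

CmsSolrQuery squery = new CmsSolrQuery(getCmsObject(), parameters);
CmsSolrResultList results = index.search(getCmsObject(), squery);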

Advanced search features

Solr comes with many advanced features, which are documented in the Solr wiki.

Querying multiple cores (indexes)

In Solr terminology, an index is called a core, so this section uses the term core instead of index. Multiple cores should only be required if you have completely different applications but want a single Solr server that manages all the data. See Solr Core Administration for detailed information. Assuming you have configured multiple Solr cores and want to query a specific one, you have to tell Solr/OpenCms which core to search. This is done with a special parameter:

http://localhost:8080/opencms/opencms/handleSolrSelect?
                             // The URI of the OpenCms Solr Select Handler
                             // configured in 'opencms-system.xml'
   core=My Solr Index Name   // Searches on the core with the name 'My Solr Index Name'
   &q=content_en:Flower      // for the text 'Flower'

Using the standard OpenCms Solr collector

Since version 8.5, OpenCms delivers a standard Solr collector. Use the collector name byQuery to simply pass a query string, or byContext to pass a query string and let OpenCms apply the user's request context. The implementing class for this collector is org.opencms.file.collectors.CmsSolrCollector.

<cms:contentload collector="byQuery" preload="true"
  param='fq=parent-folders:"/sites/default/"&fq=type:ddarticle&sort=lastmodified desc'>
  <cms:contentinfo var="info" />
  <c:if test='${info.resultSize != 0}'>
    <h3>Solr Collector Demo</h3>
    <cms:contentload editable="false">
      <cms:contentaccess var="content" />
      <%-- Title of the article --%>
      <h6>${content.value.Title}</h6>
      <%-- The text field of the article with image --%>
      <div class="paragraph">
        <c:if test="${content.value.Image.isSet}">
          <%-- Set the required variables for the image. --%>
          <c:set var="imgwidth">${(cms.container.width - 20) / 3}</c:set>
          <%-- Output the image using the cms:img tag. --%>
          <cms:img src="${content.value.Image}" />
        </c:if>
        ${cms:trimToSize(cms:stripHtml(content.value.Text), 300)}
      </div>
      <div class="clear"></div>
    </cms:contentload>
  </c:if>
</cms:contentload>

Indexing content with Solr

Search configuration

In general, the system-wide search configuration for OpenCms is done in the file opencms-search.xml (<CATALINA_HOME>/webapps/<OPENCMS_WEBAPP>/WEB-INF/config/opencms-search.xml).

Configuring the embedded/HTTP Solr Server

Since version 8.5 of OpenCms, a new optional node with the XPath opencms/search/solr is available. To enable the OpenCms embedded Solr server, your opencms-search.xml should start like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE opencms SYSTEM "http://www.opencms.org/dtd/6.0/opencms-search.dtd">
<opencms>
 <search>
   <solr enabled="true"/>
     [...]
 </search>
</opencms>

Optionally you can configure the Solr home directory and the main Solr configuration file name (default: solr.xml). OpenCms then concatenates those two paths to <solr_home>/<configfile>. An example for such a configuration would look like:

<solr enabled="true">
   <home>/my/solr/home/folder</home>
   <configfile>rabbit.xml</configfile>
</solr>

To disable Solr system-wide, remove the <solr/> node or set the enabled attribute to false:

<solr enabled="false"/>

It is also possible to connect with an external HTTP Solr server. To do so replace the line <solr enabled="true"/> with the following:

<solr enabled="true" serverUrl="http://mySolrServer" />

The OpenCms SolrSelect request handler does not support an external HTTP Solr server. If your HTTP Solr server is directly reachable at http://<your_server>, no permission check is performed and confidential indexed data is accessible. You are therefore responsible for protecting resources that have permission restrictions set in the OpenCms VFS. You can, however, use the method

org.opencms.search.solr.CmsSolrIndex.search(CmsObject, SolrQuery)

or

org.opencms.search.solr.CmsSolrIndex.search(CmsObject, String)

to make sure that permissions are also checked for HTTP Solr servers. A future version of OpenCms may feature secure access to HTTP Solr servers.

Configuring search index(es)

By default, OpenCms ships with a "Solr Online" index. To add a new Solr index, you can use the default configuration as a template:

<index class="org.opencms.search.solr.CmsSolrIndex">
	<name>Solr Online</name>
	<rebuild>auto</rebuild>
	<project>Online</project>
	<locale>all</locale>
	<configuration>solr_fields</configuration>
	<sources>
		<source>solr_source</source>
	</sources>
</index>

Configuring index sources

Index sources for Solr can be configured in the file opencms-search.xml exactly the same way as you do for Lucene indexes. In order to use the advanced XSD field mapping for XML contents, you must add the new document type xmlcontent-solr to the list of document types that are indexed:

<indexsource>
	<name>solr_source</name>
	<indexer class="org.opencms.search.CmsVfsIndexer" />
	<resources>
		<resource>/sites/default/</resource>
	</resources>
	<documenttypes-indexed>
		<name>xmlcontent-solr</name>
		<name>containerpage</name>
		<name>xmlpage</name>
		<name>text</name>
		<name>pdf</name>
		<name>image</name>
		<name>msoffice-ole2</name>
		<name>msoffice-ooxml</name>
		<name>openoffice</name>
	</documenttypes-indexed>
</indexsource>

The document type xmlcontent-solr

With OpenCms version 8.5 there is a new document type called xmlcontent-solr. Its implementation (CmsSolrDocumentXmlContent) performs a localized content extraction that is used later on to fill the Solr input document. As explained in the section about custom fields for XML content, it is possible to define a mapping between elements defined in the XSD of an XML resource type and a field of the Solr document. The values for those defined XSD field mappings are also extracted by the document type named xmlcontent-solr.

<documenttype>
	<name>xmlcontent-solr</name>
	<class>org.opencms.search.solr.CmsSolrDocumentXmlContent</class>
	<mimetypes>
		<mimetype>text/html</mimetype>
	</mimetypes>
	<resourcetypes>
		<resourcetype>xmlcontent-solr</resourcetype>
	</resourcetypes>
</documenttype>

The Solr default field configuration

By default, the field configuration for OpenCms Solr indexes is implemented by the class org.opencms.search.solr.CmsSolrFieldConfiguration. The simplest Solr field configuration declared in opencms-search.xml looks as follows. See also the section about extending the CmsSolrFieldConfiguration.

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
	<name>solr_fields</name>
	<description>The Solr search index field configuration.</description>
	<fields />
</fieldconfiguration>

Migrating a Lucene index to a Solr index

An existing Lucene field configuration can easily be transformed into a Solr field configuration. To do so, create a new Solr field configuration. As a template, you can use the snippet shown in the section about the Solr default field configuration. Then copy the list of fields from the Lucene configuration you want to convert into that skeleton.

There exists a specific strategy to map the Lucene field names to Solr field names:

  • Exact name matching: OpenCms tries to determine an explicit Solr field that has the exact name like the value of the name-attribute. E.g., OpenCms tries to find an explicit Solr field definition named meta for <field name="meta"> ... </field>. To make use of this strategy you have to edit the schema.xml of Solr manually and add an explicit field definition named according to the exact Lucene field names.
  • Type specific fields: The existing Lucene configuration does not provide type specific field definitions, but the Solr schema.xml defines different data types for fields. If you want to benefit from these type specific features (like language specific field analyzing/tokenizing) without editing the schema.xml of Solr, you have to define at least a type attribute for those fields. The value of the type attribute can be the suffix of any <dynamicField> configured in the schema.xml whose name starts with *_. The resulting field inside the Solr document is then named <luceneFieldName>_<dynamicFieldSuffix>.
  • Fallback: If you don't have a type attribute defined and there does not exist an explicit field in the schema.xml named according to the Lucene field name, OpenCms uses text_general as a fallback. E.g. a Lucene field <field name="title" index="true"> ... </field> will be stored as a dynamic field named title_txt in the Solr index.
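
The three rules above can be summarized in a few lines of Java. This is only an illustrative sketch, not the OpenCms implementation: the class SolrFieldNameMapping and the method resolve are hypothetical, the set of explicit schema fields is passed in by the caller, and it is an assumption that the rules are applied in the order in which they are listed.

```java
import java.util.Set;

public class SolrFieldNameMapping {

    /**
     * Resolves the Solr field name for a Lucene field.
     *
     * @param luceneName     value of the name attribute of the Lucene field
     * @param type           value of the optional type attribute, or null
     * @param explicitFields names of explicit field definitions in schema.xml
     */
    public static String resolve(String luceneName, String type, Set<String> explicitFields) {
        if (explicitFields.contains(luceneName)) {
            return luceneName;               // 1. exact name matching
        }
        if (type != null && !type.isEmpty()) {
            return luceneName + "_" + type;  // 2. type specific dynamic field
        }
        return luceneName + "_txt";          // 3. fallback
    }

    public static void main(String[] args) {
        System.out.println(resolve("meta", null, Set.of("meta")));  // meta
        System.out.println(resolve("title", "prop", Set.of()));     // title_prop
        System.out.println(resolve("title", null, Set.of()));       // title_txt
    }
}
```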

An original field configuration such as this:

<fieldconfiguration>
 <name>standard</name>
 <description>The standard OpenCms 8.0 search field configuration.</description>
 <fields>
   <field name="content" store="compress" index="true" excerpt="true">
     <mapping type="content"/>
   </field>
   <field name="title-key" display="-" store="true" index="untokenized" boost="0.0">
     <mapping type="property">Title</mapping>
   </field>
   <field name="title" display="%(key.field.title)" store="false" index="true">
     <mapping type="property">Title</mapping>
   </field>
   <field name="keywords" display="%(key.field.keywords)" store="true" index="true">
     <mapping type="property">Keywords</mapping>
   </field>
   <field name="description" store="true" index="true">
     <mapping type="property">Description</mapping>
   </field>
   <field name="meta" display="%(key.field.meta)" store="false" index="true">
     <mapping type="property">Title</mapping>
     <mapping type="property">Keywords</mapping>
     <mapping type="property">Description</mapping>
   </field>
 </fields>
</fieldconfiguration>

could look like this after conversion:

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
 <name>standard</name>
 <description>The standard OpenCms 8.0 Solr search field configuration.</description>
 <fields>
   <field name="content" store="compress" index="true" excerpt="true">
     <mapping type="content"/>
   </field>
   <field name="title-key" store="true" index="untokenized" boost="0.0" type="s">
     <mapping type="property">Title</mapping>
   </field>
   <field name="title" store="false" index="true" type="prop">
     <mapping type="property">Title</mapping>
   </field>
   <field name="keywords" store="true" index="true" type="prop">
     <mapping type="property">Keywords</mapping>
   </field>
   <field name="description" store="true" index="true" type="prop">
     <mapping type="property">Description</mapping>
   </field>
   <field name="meta" store="false" index="true" type="en">
     <mapping type="property">Title</mapping>
     <mapping type="property">Keywords</mapping>
     <mapping type="property">Description</mapping>
   </field>
 </fields>
</fieldconfiguration>

Indexed data

The following sections will show what data is indexed by default and what possibilities are offered by OpenCms to configure / implement additional field configurations / mappings.

The Solr index schema (schema.xml)

Have a look at the Solr schema.xml first. The file <CATALINA_HOME>/webapps/<OPENCMS>/WEB-INF/solr/conf/schema.xml contains the field definitions used by OpenCms, which are briefly summarized below.

Default index fields

By default, OpenCms indexes the following fields for each resource:

List of default index fields
id

Structure id used as a unique identifier for a document (The structure id of the resource).

path

Full root path (The root path of the resource, e.g., /sites/default/flower_en/.content/article.html)

path_hierarchy

The full path as a path-tokenized field (field type: text_path).

parent-folders

Parent folders (multi-valued field containing an entry for each parent path as root path).

type

Type name (the resource type name).

res_locales

Existing locale nodes for XML content and all available locales in the case of binary files.

created

The creation date (The date when the resource itself was created).

lastmodified

The date last modified (The last modification date of the resource itself).

contentdate

The content date (The date when the resource's content has been modified).

released, expired

The release date and the expiration date of the resource.

content

A general content field that holds all extracted resource data (all languages, type text_general).

contentblob

The serialized extraction result (content_blob) to improve the extraction performance while indexing.

category

All categories as general text.

category_exact

All categories as exact strings for faceting purposes.

text_<locale>

Extracted textual content optimized for the language specific search (Default languages: en, de, el, es, fr, hu, it).

timestamp

The time when the document was last indexed.

*_prop

All properties of a resource as searchable and stored text (field name: <Property_Definition_Name>_prop as text_general).

*_exact

All properties of a resource as exact, not stored strings (field name: <Property_Definition_Name>_exact as string).

Custom field configuration

Declarative field configurations with field mappings can be defined in the file opencms-search.xml. You can use exactly the same features as already known for OpenCms Lucene field configurations.

Please see the section about migrating a Lucene index to a Solr index.

Extending the CmsSolrFieldConfiguration

If the standard configuration options are still not flexible enough, you can extend the class org.opencms.search.solr.CmsSolrFieldConfiguration and define a custom Solr field configuration in opencms-search.xml:

<fieldconfiguration class="your.package.YourSolrFieldConfiguration">
   <name>solr_fields</name>
   <description>The Solr search index field configuration.</description>
   <fields/>
</fieldconfiguration>

Behind the walls

The request handler

The class org.opencms.main.OpenCmsSolrHandler offers the same functionality as the default select request handler of a standard Solr server installation. The Solr request handler is configured in the OpenCms default system configuration (opencms-system.xml):

<requesthandlers>
	<requesthandler class="org.opencms.main.OpenCmsSolrHandler" />
</requesthandlers>

Alternatively, the request handler class can be used as a servlet. To do so, add the handler class to the WEB-INF/web.xml of your OpenCms application:

<servlet>
	<description>
		The OpenCms Solr servlet.
	</description>
	<servlet-name>OpenCmsSolrServlet</servlet-name>
	<servlet-class>org.opencms.main.OpenCmsSolrHandler</servlet-class>
	<load-on-startup>1</load-on-startup>
</servlet>
[...]
<servlet-mapping>
	<servlet-name>OpenCmsSolrServlet</servlet-name>
	<url-pattern>/solr/*</url-pattern>
</servlet-mapping>

Permission check

OpenCms performs a permission check on every document in the result. Documents that the current user is not allowed to read are discarded, and the result is filled up on the fly with the next best matching documents. This security check is very expensive and should eventually be replaced or improved by a purely index-based permission check.
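
The idea of discarding hits and filling up the result can be sketched as follows. This is not the OpenCms implementation: the class PermissionFilteredSearch is hypothetical, the index is represented by a function that returns one page of ranked hits for a start offset and a page size, and the permission check by a simple predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

public class PermissionFilteredSearch {

    /** Collects up to 'rows' readable hits, fetching further pages as needed. */
    public static <T> List<T> search(
        BiFunction<Integer, Integer, List<T>> fetchPage, Predicate<T> mayRead, int rows) {

        List<T> visible = new ArrayList<>();
        int start = 0;
        while (visible.size() < rows) {
            List<T> page = fetchPage.apply(start, rows);
            if (page.isEmpty()) {
                break; // the index has no further matches
            }
            for (T hit : page) {
                if (visible.size() < rows && mayRead.test(hit)) {
                    visible.add(hit);
                }
            }
            start += page.size(); // continue with the next best matches
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> index = List.of("a.html", "secret-b.html", "c.html", "d.html");
        BiFunction<Integer, Integer, List<String>> fetch = (start, rows) ->
            index.subList(Math.min(start, index.size()), Math.min(start + rows, index.size()));
        Predicate<String> mayRead = hit -> !hit.startsWith("secret");
        System.out.println(search(fetch, mayRead, 2)); // [a.html, c.html]
    }
}
```

The sketch also shows why the check is costly: every rejected hit can trigger an additional round trip to the index.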

Configurable post processor

OpenCms can post-process Solr documents after they have been checked for permissions. This allows you to add fields to a found document before the search result is returned. To use the post processor, add an optional parameter to the search index as follows:

<index class="org.opencms.search.solr.CmsSolrIndex">
   <name>Solr Offline</name>
   <rebuild>offline</rebuild>
   <project>Offline</project>
   <locale>all</locale>
   <configuration>solr_fields</configuration>
   <sources>
     [...]
   </sources>
   <param name="search.solr.postProcessor">my.package.MyPostProcessor</param>
</index>

The specified class for the parameter search.solr.postProcessor must be an implementation of org.opencms.search.solr.I_CmsSolrPostSearchProcessor.

Multilingual support

The OpenCms Solr search index implements a default strategy for multi-language support. For binary documents, the language is determined automatically based on the extracted text. Documents that follow the name scheme {name}_{locale}.{suffix}, e.g., mydoc_en.pdf, are an exception: in this case the locale from the document name is used. That is, mydoc_en.pdf will be indexed for English, independent of the locale of its actual content. The default detection mechanism is implemented with: http://code.google.com/p/language-detection/.
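
The name scheme can be recognized with a regular expression. The following Java sketch is an illustration only and not the code OpenCms uses: the class FileNameLocale and the method localeOf are hypothetical, and it is an assumption that a locale is a two-letter language code with an optional two-letter country code. The sketch does not check whether the locale is actually configured in OpenCms.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameLocale {

    // {name}_{locale}.{suffix}, e.g. mydoc_en.pdf or mydoc_en_GB.pdf
    private static final Pattern SCHEME =
        Pattern.compile("^.+_([a-z]{2}(?:_[A-Z]{2})?)\\.[^.]+$");

    /** Returns the locale encoded in the file name, if the name follows the scheme. */
    public static Optional<String> localeOf(String fileName) {
        Matcher matcher = SCHEME.matcher(fileName);
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(localeOf("mydoc_en.pdf")); // Optional[en]
        System.out.println(localeOf("mydoc.pdf"));    // Optional.empty
    }
}
```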

For XML contents, the concrete language/locale information is available, and the localized fields end with an underscore followed by the locale, e.g., content_en, content_de or text_en, text_de. By default, all field mappings defined within the XSD of a resource type are extended by the suffix _<locale>.

Multilingual dependency resolving

OpenCms can index documents that are distributed over more than one resource. The dependencies between these resources are resolved based on their file names. The standard implementation can be found at: org.opencms.search.documents.CmsDocumentDependency.

The extraction result cache

For better index performance the extracted result is cached for siblings, see org.opencms.search.extractors.I_CmsExtractionResult.
