Solr search integration

Apache Solr has grown steadily over the years and can now be called an enterprise search platform based on Apache Lucene. It's a standalone enterprise search server with a REST-like API. You put documents into it (called "indexing"). You query it via HTTP GET and receive XML, JSON, or binary results. Additionally, you can connect via a special Java API (SolrJ).

OpenCms ships with an embedded Solr Server and already indexes resources for you. Moreover, it provides various ways to search with Solr. These ways range from support for building lists of contents without any Solr knowledge to sending queries to Solr directly and using the power of all Solr features either on the server or on the client directly.

About Solr

Apache Solr is an enterprise search engine based on Lucene. It can be installed as a stand-alone or an embedded server and can be queried via a REST API or a special Java API.

Basically, you have to add all the resources you want to search for to a Solr index. Adding a resource is called "indexing" it. To index a resource, you extract information from it (e.g., its content, its location, its type) and put each piece of information in a so-called "field" of a "document". Hence, each resource is represented by a document in a Solr index. Each document has various fields, e.g., one called "content" (holding the content), one called "location" (holding the link to the original resource), and so on. Once you have added a document for your resource to a Solr index, you can ask Solr for all documents fulfilling certain conditions (that you specify in the query). For example, you might index some PDF files and ask Solr for up to 10 documents that contain the word "OpenCms". Solr can even rank the results, returning a document that contains "OpenCms" very often before a document that contains "OpenCms" just once.
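The idea of documents, fields and ranked results can be illustrated with a few lines of plain Java. This is only a toy sketch under simplifying assumptions: the class ToyIndex and its methods are made up for this illustration, the field names "location" and "content" are taken from the explanation above, and the "ranking" just counts word occurrences, whereas Solr uses a far more elaborate relevance score. It needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy in-memory "index" (hypothetical, for illustration only; not how Solr works internally). */
final class ToyIndex {

    /** Each document is a map from field name to field value. */
    private final List<Map<String, String>> documents = new ArrayList<>();

    /** "Indexing": extract information from a resource and store it as fields of a document. */
    void index(String location, String content) {
        documents.add(Map.of("location", location, "content", content));
    }

    /** Counts case-insensitive, non-overlapping occurrences of word in text. */
    static int countOccurrences(String text, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        int count = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + needle.length())) {
            count++;
        }
        return count;
    }

    /** Returns the locations of up to rows documents containing word, most occurrences first. */
    List<String> search(String word, int rows) {
        return documents.stream()
            .filter(d -> countOccurrences(d.get("content"), word) > 0)
            .sorted(Comparator
                .comparingInt((Map<String, String> d) -> countOccurrences(d.get("content"), word))
                .reversed())
            .limit(rows)
            .map(d -> d.get("location"))
            .collect(Collectors.toList());
    }
}
```

Indexing three texts and calling search("OpenCms", 10) returns only the locations of the texts that mention the word, with the one mentioning it most often listed first.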

Solr has many more advanced features that you are already used to from visiting Google or Amazon:

Faceted search
Highlighting
Range queries
Sorting
Spellchecking
Auto suggestion/completion/correction
Thesaurus/Synonyms
...

So Solr is great for implementing a full-text search for your website with many advanced features. But of course, it is a great choice for very simple searches as well.

Learn more about Solr here.

Solr OpenCms integration

Since version 8.5 OpenCms ships with an embedded Solr server that is used by OpenCms itself and can be used for searches on your website as well. You can use it to implement everything from a simple list to a fully-featured full text search on the website.

OpenCms has already preconfigured most things, so you only have to search. In particular:

There are preconfigured Solr indexes "Solr Offline" and "Solr Online", which index the offline and online versions of the VFS resources, respectively.
There's a suitable schema (which defines the fields a document can have) for the indexes. It matches the information available for resources in the VFS.
Resources in the VFS are indexed automatically - even for PDF or Excel files the content is extracted and indexed.
We've also taken care to support multi-lingual setups properly.
Permission checks are performed before search results are returned.
You can search with nearly no Solr knowledge:
- Build lists with the integrated list type that provides an intuitive configuration interface for basic searches.
You can easily build complex search pages:
- Build powerful full-text searches using OpenCms' <cms:search> tag.
You can query Solr from the client using the Solr handler - this way you get a (close to) "native" Solr experience but still have permissions checked.
You can configure the indexing of XML contents very flexibly via search settings to allow for advanced search features when searching over your contents.

If the default configuration is not sufficient for your purpose, you can manipulate it with many configuration options:

Add other indexes
Feed an external Solr server with data from OpenCms
Change the document extractors for certain resource types to index the documents differently.

The default configuration will be sufficient for most scenarios. So play with it before thinking about reconfiguration.

Searching in OpenCms

We suggest using the integrated list type for searches, or, if this is not enough, the <cms:search> tag for full-text searches. For the first option it is helpful, and for the second it is necessary, to know how a Solr query is constructed in general.

The best way to play with Solr on your OpenCms instance is the Solr handler. Here you can type plain Solr queries and get responses immediately.

We explain a search with a concrete example (that you may vary a bit depending on your OpenCms installation):

Show articles in the default site that have been changed in the last year and sort them by the (English) title in ascending order.

Below, we show different solutions to perform that search.

Using the Solr handler

Here's what your query looks like in plain text:

http://localhost:8080/opencms/handleSolrSelect
   ?wt=xml
   &fq=type:m-article
   &fq=parent-folders:"/sites/default/"
   &fq=lastmodified:[NOW-1YEAR TO NOW]
   &sort=disptitle_en_sort asc

and with correct URL encoding (in particular "[" and "]" must be encoded):

http://localhost:8080/opencms/handleSolrSelect?wt=xml&fq=type:m-article
   &fq=parent-folders:"/sites/default/"&fq=lastmodified:%5BNOW-1YEAR%20TO%20NOW%5D&sort=disptitle_en_sort%20asc

Parameter explanation:

http://localhost:8080/opencms/handleSolrSelect  
  // The URI of the OpenCms Solr Select Handler configured in 
  // 'opencms-system.xml'
   ?wt=xml                              // By default the handler returns its result in JSON,
                                        // which is usually preferable when using it for a
                                        // client-side search feature, but for better readability,
                                        // we prefer XML as result format.

   &fq=type:m-article                   // Filter query on the field type
                                        // with the value 'm-article'
                                        // This is the resource type for articles in the
                                        // mercury template shipped with OpenCms 11.

   &fq=parent-folders:"/sites/default/" // Filter query on field parent-folders.
                                        // We filter by folder, showing only articles
                                        // in the default site.
                                        // Note the quotes around the folder - they are necessary
                                        // because of the special type of that field (string field).

   &fq=lastmodified:[NOW-1YEAR TO NOW]  // Filter query on the field lastmodified with a range
                                        // query for everything changed in the past year.
   &sort=disptitle_en_sort asc          // Sort by the field disptitle_en_sort in ascending order
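
If the request URL is assembled in code rather than typed by hand, the encoding can be left to the standard library. The following is a minimal sketch, not part of OpenCms: the class SolrHandlerUrl is a made-up helper, and it assumes Java 10 or later for URLEncoder.encode(String, Charset). Note that URLEncoder is stricter than the hand-encoded URL above (it also encodes ":", "/" and the quotes), which is equivalent once the server decodes the parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Builds an encoded request URL for the Solr handler (hypothetical helper for illustration). */
final class SolrHandlerUrl {

    private final String baseUrl;
    private final List<String> params = new ArrayList<>();

    SolrHandlerUrl(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Adds one request parameter; the value is percent-encoded, the name is used as-is. */
    SolrHandlerUrl param(String name, String value) {
        params.add(name + "=" + encode(value));
        return this;
    }

    /** Percent-encodes a value. URLEncoder writes spaces as "+", which is turned into "%20" here. */
    static String encode(String value) {
        // A literal "+" in the input is already encoded as "%2B", so this replacement only hits spaces.
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    String build() {
        return baseUrl + "?" + String.join("&", params);
    }

    public static void main(String[] args) {
        String url = new SolrHandlerUrl("http://localhost:8080/opencms/handleSolrSelect")
            .param("wt", "xml")
            .param("fq", "type:m-article")
            .param("fq", "parent-folders:\"/sites/default/\"")
            .param("fq", "lastmodified:[NOW-1YEAR TO NOW]")
            .param("sort", "disptitle_en_sort asc")
            .build();
        System.out.println(url);
    }
}
```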

A note on the Solr query syntax

We showed a very simple Solr query. To get familiar with the query syntax, look in the Solr reference guide or at one of the many tutorials around. Solr is evolving very fast and the number of query options keeps growing. The OpenCms Solr handler supports most of the options and you can play with it. Some features are not supported, however: for example, grouping and clustering are currently unavailable because of permission check issues.

Notes on the response

The handler directly returns a response with the search results. It might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">67</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="fl">*,score</str>
    <str name="qt">edismax</str>
    <str name="rows">1</str>
    <str name="fq">con_locales:en</str>
    <str name="sort">disptitle_en_sort asc</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="21" start="0" maxScore="1.0">
  <doc>
    <str name="id">0ead13f7-3a96-11e9-bd84-0242ac11002b</str>
    <str name="contentblob">...</str>
    <arr name="parent-folders">
      <str>/</str>
      <str>/sites/</str>
      <str>/sites/default/</str>
      <str>/sites/default/mercury-demo/</str>
      <str>/sites/default/mercury-demo/newsletters/</str>
      <str>/sites/default/mercury-demo/newsletters/.content/</str>
      <str>/sites/default/mercury-demo/newsletters/.content/article/</str>
    </arr>
    <str name="path">/sites/default/mercury-demo/newsletters/.content/article/a_00001.xml</str>
    <str name="path_hierarchy">/sites/default/mercury-demo/newsletters/.content/article/a_00001.xml</str>
    <str name="type">m-article</str>
    <str name="suffix">xml</str>
    <int name="size">1252</int>
    <date name="created">2018-10-31T13:12:11Z</date>
    <date name="lastmodified">2019-03-14T15:12:14Z</date>
    <date name="contentdate">2019-03-14T15:12:15Z</date>
    <date name="released">1970-01-01T00:00:00Z</date>
    <arr name="res_locales">
      <str>en</str>
    </arr>
    <arr name="con_locales">
      <str>en</str>
    </arr>
    <str name="template_prop">/system/modules/alkacon.mercury.template/templates/mercury.jsp</str>
    <str name="template_prop_s">/system/modules/alkacon.mercury.template/templates/mercury.jsp</str>
    
    <!-- ... many more fields here ... -->

    <date name="instancedate_dt">2019-03-14T15:12:14Z</date>
    <str name="disptitle_sort">An article</str>
    <int name="disporder_i">0</int>
    <str name="solr_id">0ead13f7-3a96-11e9-bd84-0242ac11002b</str>
    <date name="expired">2119-03-14T15:13:44.456Z</date>
    <date name="timestamp">2019-03-14T15:13:44.456Z</date>
    <float name="score">1.0</float>
    <str name="link">/mercury-demo/detail-pages/article/An-article/</str>
  </doc>

  <!-- more docs here -->

</result>
</response>

Looking at the result, you see two sections:

responseHeader - that basically tells what you queried
results - the list of results with some extra information.
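
A client that consumes the XML response typically reads the numFound attribute and the doc nodes from the result element. Below is a minimal sketch using only the JDK's DOM parser; the class SolrXmlResponse is a made-up name for this illustration, and the code assumes a response shaped like the example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Reads basic counts from a Solr XML response (hypothetical helper for illustration). */
final class SolrXmlResponse {

    /**
     * Returns a two-element array: the total number of matches (numFound)
     * and the number of doc elements found below the result element.
     */
    static int[] readCounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // The first <result> element holds the list of found documents.
            Element result = (Element) doc.getElementsByTagName("result").item(0);
            int numFound = Integer.parseInt(result.getAttribute("numFound"));
            int returned = result.getElementsByTagName("doc").getLength();
            return new int[] {numFound, returned};
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable Solr XML response", e);
        }
    }
}
```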

Considering the results section:

You see that we found 21 results in total, but you'll see only 10 doc nodes. The reason is that by default only 10 results are returned. You can request more by adding &rows=30 to your query. Or you can just query the next 10 results by adding &start=10 (looks like pagination, doesn't it?).
You see that each returned document has lots of fields. You can query on nearly all of these fields (and some more). But typically, you do not need all the fields in the query response. Add &fl=disptitle_en_sort,en_excerpt,link to your query to reduce the returned fields. Some additional fields are always returned since they are typically useful. This is an OpenCms feature, not a Solr feature.
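
The rows and start parameters are all that is needed for simple pagination. The arithmetic can be sketched as follows; the class SolrPaging and its method names are made up for this illustration, and pages are assumed to be numbered from 1.

```java
/** Pagination arithmetic for the rows and start parameters (hypothetical helper for illustration). */
final class SolrPaging {

    /** Offset of the first result shown on the given 1-based page. */
    static int startFor(int page, int rows) {
        if (page < 1 || rows < 1) {
            throw new IllegalArgumentException("page and rows must be positive");
        }
        return (page - 1) * rows;
    }

    /** Number of pages needed to show numFound results with rows results per page. */
    static int pageCount(long numFound, int rows) {
        if (rows < 1) {
            throw new IllegalArgumentException("rows must be positive");
        }
        return (int) ((numFound + rows - 1) / rows);
    }

    /** Request parameters to append to the query for the given 1-based page. */
    static String pagingParams(int page, int rows) {
        return "&rows=" + rows + "&start=" + startFor(page, rows);
    }
}
```

For the example above with 21 results and 10 rows per page, this yields three pages, and the second page is requested with &rows=10&start=10.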

You might wonder how much information anyone could gain by sending well-chosen queries to the handler. Since OpenCms 11 the handler refuses online searches by default, and if you really need it, you can configure what results to return. Read more about this in the topic about the handler.

Using the integrated list type

With the integrated list type, you do not need any Solr experience to perform the query. We assume things are prepared like in the OpenCms 11 demo:

We have a display formatter for articles
We have a formatter for the integrated list (not even necessary, but helpful)

Now add a new content of type "List". If configured as in the demo, you can drop a new list on the page. It's in the view "Advanced elements". Edit it as shown below (Only three intuitive steps are necessary!):

The search example with the integrated list

You can have a look at the found results in the list app and you should find that the results are identical to the ones found by the handler. Hence, you can do your Solr search without any Solr knowledge. Great, isn't it?

Learn more about the integrated list here.

Using the tag <cms:search>

The <cms:search> tag is great for feature-rich full-text searches where you are not satisfied with the integrated list type. For this simple example, it might look a bit complicated, but the tag has great advantages:

It lets you easily configure complex searches and provides methods to keep track of the state (are facet items checked, which page is currently displayed, ...).
It provides easy access to search results.

Here's a small JSP that does the search for us on the server and prints the titles used for sorting:

<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@ taglib prefix="cms" uri="http://www.opencms.org/taglib/cms"%>

<c:set var="config">
  {
    "searchforemptyquery" : true,
    "extrasolrparams"     : "&fq=type:m-article&fq=parent-folders:\"/sites/default/\"\
                             &fq=lastmodified:[NOW-1YEAR TO NOW]&sort=disptitle_en_sort asc"
  }
</c:set>

<cms:search var="search" configString="${config}" />

<c:forEach var="result" items="${search.searchResults}">
	<div>${result.fields['disptitle_en_sort']}</div>
</c:forEach>

Basically, we feed our plain Solr query to the tag's configuration. The extra config value searchforemptyquery is necessary, since otherwise no search is triggered as long as no string for a full-text search is provided.

The big advantage of the tag is the structure of the search result. This becomes obvious in more complex search scenarios.

Learn more about the tag and its use for a full-text search here.

Using Solr's Java API

As a last approach for searching, you can use Solr's Java API for the request:

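import org.opencms.main.OpenCms;
import org.opencms.search.CmsSearchResource;
import org.opencms.search.solr.CmsSolrResultList;

// The query as plain Solr request parameters, split over two lines for readability
String query = "&fq=type:m-article&fq=parent-folders:\"/sites/default/\""
             + "&fq=lastmodified:[NOW-1YEAR TO NOW]&sort=disptitle_en_sort asc";
// "Solr Online" is the preconfigured online index mentioned above
CmsSolrResultList results = OpenCms.getSearchManager()
    .getIndexSolr("Solr Online")
    .search(getCmsObject(), query);
for (CmsSearchResource result : results) {
  // Do something with the result
}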

Instead of constructing the query as a String, you can also use the OpenCms-specific CmsSolrQuery. Look up the JavaDoc for more information.

Using <cms:contentload>

You may also use <cms:contentload> to perform your Solr search and use the "byQuery" or "byContext" collectors:

<%@ taglib prefix="cms" uri="http://www.opencms.org/taglib/cms"%>
<cms:contentload collector="byQuery"
                 param='&fq=type:m-article&fq=parent-folders:"/sites/default/"\
                        &fq=lastmodified:[NOW-1YEAR TO NOW]&sort=disptitle_en_sort asc'>
  <cms:contentaccess var="content" />
  <%-- Title of the article --%>
  <div>${content.value.Title}</div>
</cms:contentload>

Note that this approach has several drawbacks. In particular, looping over the results and searching are combined and you get access to the found contents only, not to the documents indexed for the content.

We suggest preferring the integrated list or <cms:search> over querying Solr via <cms:contentload>.

Handling of permissions

OpenCms performs a permission check for all resulting documents, throws away those that the current user is not allowed to retrieve, and expands the result with the next best matching documents on the fly. This security check is very cost-intensive, and therefore the number of documents that can be retrieved via Solr is limited to 400 by default. Moreover, some Solr features are not supported.
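
The principle of filtering by permission and filling up with the next best matches can be sketched in plain Java. This is not OpenCms' actual implementation: the class PermissionFilteredSearch is made up for this illustration, the full ranked list is assumed to be available in memory (a real setup would fetch further batches from Solr), and how the cap (maxProcessed) is applied here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of permission-checked result lists (hypothetical, not the OpenCms implementation). */
final class PermissionFilteredSearch {

    /**
     * Walks the ranked matches in order, drops those the user may not read and
     * keeps going until rows readable results are collected.
     *
     * @param ranked       all matches, best match first
     * @param canRead      permission check for the current user
     * @param rows         number of results requested
     * @param maxProcessed upper bound on the number of matches that are checked (assumed semantics)
     */
    static <T> List<T> firstReadable(List<T> ranked, Predicate<T> canRead, int rows, int maxProcessed) {
        List<T> visible = new ArrayList<>();
        int limit = Math.min(ranked.size(), maxProcessed);
        for (int i = 0; i < limit && visible.size() < rows; i++) {
            T match = ranked.get(i);
            if (canRead.test(match)) {
                // Readable match: part of the result the user gets to see.
                visible.add(match);
            }
            // Unreadable matches are skipped, so the next best match moves up.
        }
        return visible;
    }
}
```

The sketch also shows why the check is expensive: to return rows results, more than rows matches may have to be inspected, which is the reason for an upper bound.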

OpenCms support of Solr features

OpenCms supports most Solr features and in most cases behaves exactly like a plain Solr server. But, due to the permission checks, you will experience the following restrictions:

The number of results to return is by default restricted to 400. You can change this as described here.
Features that could cheat the current permission check are disabled:
- Grouping
- Expanding of results

For most use cases, these restrictions should not cause a problem. If you really need more flexibility, you could configure your own special index on an external Solr server and request it directly.