Spellchecking for TinyMCE

Since version 9.5, OpenCms integrates a spellchecker for the WYSIWYG HTML editor (TinyMCE). Clicking the spellcheck icon triggers a spellcheck: misspelled words are highlighted and suggestions for correct spellings are provided.

This topic explains how the spellchecking mechanism works and how to add dictionaries to the spellchecker. It covers the configuration of the spellchecker, not its usage.

The HTML editor widget that allows spellchecking

General concept of the spellchecker

TODOs: Add schematic picture to explain the process

TinyMCE provides a spellcheck plugin. The plugin makes RPC calls to a specific URL, sending the text from the editor field together with its language, and waits for callbacks with spelling information (misspelled words and suggestions for replacements). In other words, it sends its data to a server-side spellchecker and waits for the spellchecking results.

The server-side spellchecker in OpenCms is implemented using Solr's SpellCheckComponent. The component is typically used to implement the "Did you mean?" feature in a search application. It takes a list of words and checks whether each of them occurs in a special Solr field. If some word A does not occur there, the SpellCheckComponent suggests words that are present in the Solr field and are spelled very similarly to word A. For spellchecking in OpenCms, we added an additional Solr core. It contains an index of all words from the dictionaries for the different languages, stored in entry_{locale} fields. Based on that index, an IndexBasedSpellChecker is configured for each language.
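The lookup-and-suggest idea described above can be illustrated with a small sketch. It is not Solr's algorithm: Python's difflib similarity matching is used here purely as a stand-in for the distance measure of the SpellCheckComponent, and the function name check_words and its parameters are hypothetical.

```python
import difflib


def check_words(words, dictionary, max_suggestions=3):
    """Return {misspelled word: [suggestions]} for all words not in the dictionary.

    'dictionary' plays the role of one entry_{locale} field; difflib is only
    a stand-in for Solr's own similarity measure (assumption for illustration).
    """
    known = sorted(dictionary)  # stable candidate order for reproducible results
    result = {}
    for word in words:
        if word in dictionary:
            continue  # the word occurs in the "field", so it counts as correct
        # Suggest dictionary words that are spelled similarly to the unknown word.
        result[word] = difflib.get_close_matches(word, known, n=max_suggestions)
    return result
```

A word that is present in the dictionary does not appear in the result at all, while an unknown word is mapped to a (possibly empty) list of similar dictionary words.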

Because the spellcheck is implemented using Solr, the format of the Solr response has to be adjusted to the format TinyMCE expects. This transformation is performed by the CmsSolrSpellchecker, which is reachable via the OpenCmsSpellcheckHandler. The handler is by default available under /handleSpellcheckDictionary and is called when TinyMCE sends RPC calls for spellchecking. The CmsSolrSpellchecker takes care of transforming request and response to the necessary formats. Moreover, the handler provides some extra maintenance features to update, exchange or add dictionaries.

Adjusting the dictionaries of the spellchecker

The spellchecker builds up its index from text files containing word lists. These files have to be located in the VFS under /system/workplace/editors/spellcheck/. The filenames must match the following schema.

Naming scheme for dictionary files

dict_{locale}.{txt|zip}: Dictionaries with this naming scheme are meant to be the default dictionaries for a language. They can be given either as raw text files with one word per line (suffix .txt) or as a zipped version of such a text file (suffix .zip). The locale should be given as a two-letter string.
custom_dict_{locale}.{txt|zip}: Dictionaries with this naming scheme are meant to be custom dictionaries for a language. They typically provide specific extensions to the default dictionaries. Here you may specify words like OpenCms that are not default English words but are still spelled correctly. Custom dictionaries can be given either as raw text files with one word per line (suffix .txt) or as a zipped version of such a text file (suffix .zip). The locale should be given as a two-letter string.
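The naming scheme can be checked mechanically. The following is a minimal sketch, not OpenCms code: the helper parse_dictionary_name and the DictionaryFile type are hypothetical, and the restriction of locales to two lowercase ASCII letters is an assumption derived from the "two-letter string" rule.

```python
import re
from typing import NamedTuple, Optional

# Assumption: locales are exactly two lowercase letters, e.g. "de" or "cs".
_DICTIONARY_NAME = re.compile(r"(custom_)?dict_([a-z]{2})\.(txt|zip)")


class DictionaryFile(NamedTuple):
    locale: str   # two-letter locale taken from the file name
    custom: bool  # True for custom_dict_*, False for dict_*
    zipped: bool  # True for .zip, False for .txt


def parse_dictionary_name(filename: str) -> Optional[DictionaryFile]:
    """Return the parts of a dictionary file name, or None if it does not match."""
    match = _DICTIONARY_NAME.fullmatch(filename)
    if match is None:
        return None
    return DictionaryFile(
        locale=match.group(2),
        custom=match.group(1) is not None,
        zipped=match.group(3) == "zip",
    )
```

Such a check could be run over a directory listing before uploading files to the VFS, to catch names the spellchecker would presumably ignore.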

The OpenCms default installation ships with dictionaries for German, English, Spanish, French and Russian. Feel free to replace these dictionaries or extend them with custom dictionaries.

Whenever you have updated a dictionary, call the URL of the spellcheck handler (by default /handleSpellcheckDictionary, possibly with an adjusted prefix) with the parameter rebuild.

The index for spellchecking is then rebuilt. This may take some time, and you may not get any feedback. You can also use the parameter check instead of rebuild. In that case, the index is rebuilt only if a dictionary has changed since the last build.

The check and rebuild parameters only work if you have the root administrator role. Moreover, you will always get an "error" or a timeout as the response. Nonetheless, the indexes will be rebuilt.

To add dictionaries for an additional language, the language must be configured before you add the dictionary.

Configuration of additional languages

In the default installation, the spellchecker supports the languages German, English, Spanish, French and Russian. To add support for other languages, some configuration steps must be performed.

We take the configuration for the Czech language as an example. It has the locale cs.

Adjusting the managed-schema of the spellcheck core

To add a new language, say Czech (locale cs), to the spellchecker, you have to add a new field definition to the managed-schema of the spellchecker's Solr core. The managed-schema is found under ${WEBAPP_HOME}/WEB-INF/solr/spellcheck/conf/. Just add the new field with the same attributes as the already existing fields, as done in line 17 in the managed-schema shown below.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="opencms_spellcheck" version="1.6">
  <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="entry_de" type="spell" indexed="true" stored="true"/>
  <field name="entry_en" type="spell" indexed="true" stored="true"/>
  <field name="entry_es" type="spell" indexed="true" stored="true"/>
  <field name="entry_fr" type="spell" indexed="true" stored="true"/>
  <field name="entry_ru" type="spell" indexed="true" stored="true"/>
  <field name="entry_cs" type="spell" indexed="true" stored="true" />
</schema>

The entry_cs field will be used to store the words of Czech dictionaries. The field will become the source for the Czech IndexBasedSpellChecker.

Adjusting the solrconfig.xml of the spellcheck core

Once you have added a new field definition for your language, say entry_cs, to the managed-schema, you can go on and add a spellchecker for the language to the solrconfig.xml. The solrconfig.xml is found under ${WEBAPP_HOME}/WEB-INF/solr/spellcheck/conf/.

Search for the node <searchComponent name="spellcheck" class="solr.SpellCheckComponent">. Here you have to add the language-specific spellchecker as a subnode: copy an existing <lst name="spellchecker"> subnode and replace every occurrence of its locale, say it, with your locale, say cs. Below you see an excerpt of the adjusted solrconfig.xml. Lines 7 to 13 show the already existing definition of the spellchecker for Italian, lines 14 to 20 its adjusted copy, defining the Czech spellchecker.

<!-- ... -->

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
	<str name="queryAnalyzerFieldType">spell</str>

	<!-- ... some spellcheckers ...-->
	<lst name="spellchecker">
		<str name="classname">solr.IndexBasedSpellChecker</str>
		<str name="spellcheckIndexDir">./spellchecker_it</str>
		<str name="field">entry_it</str>
		<str name="name">it</str>
		<str name="buildOnCommit">true</str>
	</lst>
	<lst name="spellchecker">
		<str name="classname">solr.IndexBasedSpellChecker</str>
		<str name="spellcheckIndexDir">./spellchecker_cs</str>
		<str name="field">entry_cs</str>
		<str name="name">cs</str>
		<str name="buildOnCommit">true</str>
	</lst>
</searchComponent>

<!-- ... -->

Having finished the configuration, you should restart your servlet container and afterward you can start adding dictionaries for the newly configured language.
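Since the two configuration files have to agree, a quick consistency check can help before restarting. The following sketch is an illustration under assumptions, not part of OpenCms: the function names are hypothetical, and the XML is expected to be passed in as strings with the structure shown in the excerpts above.

```python
import xml.etree.ElementTree as ET


def schema_fields(schema_xml):
    """Names of all <field> elements defined in the managed-schema."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}


def spellchecker_fields(solrconfig_xml):
    """Map spellchecker name (locale) to the Solr field it reads from."""
    root = ET.fromstring(solrconfig_xml)
    result = {}
    for lst in root.iter("lst"):
        if lst.get("name") != "spellchecker":
            continue
        # Collect the <str name="...">value</str> children of this spellchecker.
        values = {s.get("name"): (s.text or "").strip() for s in lst.findall("str")}
        if "name" in values and "field" in values:
            result[values["name"]] = values["field"]
    return result


def missing_fields(schema_xml, solrconfig_xml):
    """Fields referenced by a spellchecker but not defined in the schema."""
    defined = schema_fields(schema_xml)
    used = spellchecker_fields(solrconfig_xml).values()
    return sorted(field for field in used if field not in defined)
```

An empty result from missing_fields suggests that every configured spellchecker has a matching field definition. A non-empty result lists the fields that still need to be added to the managed-schema.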

Creating new dictionaries

Dictionaries are simply text files with one word per line. For example, the German dictionary, dict_de.txt, which is zipped as dict_de.zip and placed in the VFS under /system/workplace/editors/spellcheck/, looks as follows (these are all correct German words, including forms such as AachenerInnen with a capital "I"):

a
Aachen
Aachens
Aachener
Aachenern
AachenerInnen
AachenerIn
AachenerInnen
AachenerIn
Aachenerinnen
Aachenerin
Aacheners
Aal
Aale
...

To create the lists of default words, you may use the unmunch tool of hunspell. Be careful, though: it only works correctly with myspell dictionaries. Recheck the generated word list and remove comments if necessary. So far, we have not found a suitable tool to produce word lists from hunspell dictionaries as they are shipped with LibreOffice.
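Cleaning a generated word list and packaging it can be scripted. The following is a minimal sketch under several assumptions: comment lines are taken to start with "#", duplicates are dropped, the text is encoded as UTF-8, and the function names are hypothetical. The archive is built in memory, so the caller decides where to store it.

```python
import io
import zipfile


def clean_word_list(lines, comment_prefix="#"):
    """Strip whitespace, skip blank and comment lines, drop duplicates (order kept)."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip()
        if not word or word.startswith(comment_prefix):
            continue  # assumption: comments start with '#'
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def build_dictionary_zip(lines, locale, custom=False):
    """Return (zip file name, zip content as bytes) following the naming scheme."""
    base = ("custom_dict_" if custom else "dict_") + locale
    text = "\n".join(clean_word_list(lines)) + "\n"  # one word per line
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The archive holds a single text file, e.g. dict_de.txt inside dict_de.zip.
        archive.writestr(base + ".txt", text.encode("utf-8"))
    return base + ".zip", buffer.getvalue()
```

The returned bytes can be written to a local file and then uploaded to /system/workplace/editors/spellcheck/ in the VFS.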