Class CmsHtmlStripper

java.lang.Object
org.opencms.util.CmsHtmlStripper

public final class CmsHtmlStripper extends Object
Simple html tag stripper that can be configured with the names of the html tags to preserve.

All tags that are not explicitly allowed by calling one of the addPreserve... methods are removed from the result of stripHtml(String).

Instances are reusable but must not be shared between threads. To change the configuration between subsequent invocations of stripHtml(String), call reset() first.
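
The preserve-list behaviour described above can be illustrated with a small stand-in written against the Java standard library only. This is a sketch under stated assumptions, not the OpenCms implementation: the class name AllowListStripperSketch and its regex-based matching are hypothetical, whereas the real class parses its input with org.htmlparser. Only the method names mirror this page.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; NOT the OpenCms implementation.
final class AllowListStripperSketch {

    // Matches an opening, closing or self-closing tag and captures the tag name.
    private static final Pattern TAG = Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Lower-cased names of the tags to keep (tag names are case insensitive).
    private final Set<String> preserved = new HashSet<>();

    // Assumption of this sketch: returns false if the name was already registered.
    boolean addPreserveTag(String tagName) {
        return preserved.add(tagName.toLowerCase(Locale.ROOT));
    }

    // Clears the configuration so the instance can be reused with other tags.
    void reset() {
        preserved.clear();
    }

    // Copies the text through and drops every tag whose name is not preserved.
    String stripHtml(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            if (preserved.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        AllowListStripperSketch stripper = new AllowListStripperSketch();
        stripper.addPreserveTag("B");
        System.out.println(stripper.stripHtml("<p>Hello <b>world</b></p>"));

        // Reuse the same instance with a different configuration.
        stripper.reset();
        stripper.addPreserveTag("p");
        System.out.println(stripper.stripHtml("<p>Hello <b>world</b></p>"));
    }
}
```

The sketch keeps its configuration in a plain HashSet without synchronization, which is one reason an instance of this kind is safe to reuse sequentially but not to share between threads.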

Since:
6.9.2
  • Constructor Details

    • CmsHtmlStripper

      public CmsHtmlStripper()
      Default constructor that turns echo on and uses the settings for replacing tags.

    • CmsHtmlStripper

      public CmsHtmlStripper(boolean useTidy)
      Creates an instance, specifying whether tidy is used.

      Parameters:
      useTidy - if true tidy will be used
  • Method Details

    • addPreserveTag

      public boolean addPreserveTag(String tagName)
      Adds a tag that will be preserved by stripHtml(String).

      Parameters:
      tagName - the name of the tag to keep (case insensitive)
      Returns:
      true if the tagName was added correctly to the internal engine
    • addPreserveTagList

      public void addPreserveTagList(List<String> preserveTags)
      Convenience method for adding several tags to preserve.

      Parameters:
      preserveTags - a List<String> with the case-insensitive tag names of the tags to preserve
    • addPreserveTags

      public void addPreserveTags(String tagList, char separator)
      Convenience method for adding several tags to preserve in form of a delimiter-separated String.

      The String will be split with CmsStringUtil.splitAsList(String, char, boolean), passing tagList as the first argument, separator as the second argument and true as the third argument (trimming support).

      Parameters:
      tagList - a delimiter-separated String with case-insensitive tag names to preserve by stripHtml(String)
      separator - the delimiter that separates tag names in the tagList argument
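
      The splitting with trimming described for addPreserveTags can be sketched with the standard library alone. The helper below, TagListSplitSketch.splitTrimmed, is a hypothetical stand-in and not CmsStringUtil.splitAsList itself; in particular, skipping empty tokens is a choice made for this sketch, and the real method may treat them differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; NOT CmsStringUtil.splitAsList.
final class TagListSplitSketch {

    // Splits tagList at every separator and trims whitespace around each name.
    static List<String> splitTrimmed(String tagList, char separator) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= tagList.length(); i++) {
            // A token ends at a separator or at the end of the input.
            if (i == tagList.length() || tagList.charAt(i) == separator) {
                String token = tagList.substring(start, i).trim();
                // Assumption of this sketch: empty tokens are skipped.
                if (!token.isEmpty()) {
                    result.add(token);
                }
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitTrimmed("p, b ,br", ','));
    }
}
```

      With trimming in place, a tagList such as "p, b ,br" yields the same tag names as "p,b,br", so callers do not have to normalize whitespace around the separator themselves.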
    • reset

      public void reset()
      Resets the configuration of the tags to preserve.

      This is called from the constructor and only has to be called explicitly if this instance is reused with a different configuration (of tags to keep).

    • stripHtml

      public String stripHtml(String html) throws org.htmlparser.util.ParserException
      Extracts the text from the given html content, assuming the given html encoding.

      Additionally tags are replaced / removed according to the configuration of this instance.

      Please note:

      There are static process methods in the superclass that will not do the replacements / removals. Don't mix them up with this method.

      Parameters:
      html - the content to extract the plain text from.
      Returns:
      the text extracted from the given html content.
      Throws:
      org.htmlparser.util.ParserException - if the given html content cannot be parsed.