Class CmsHtmlStripper


  • public final class CmsHtmlStripper
    extends java.lang.Object
    Simple HTML tag stripper with a configurable list of allowed HTML tag names.

    All tags that are not explicitly allowed by calling one of the addPreserve... methods are removed from the result of stripHtml(String).

    Instances are reusable but must not be shared between threads. To change the configuration between subsequent invocations of stripHtml(String), call reset() first.

    Since:
    6.9.2
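
One way to respect the "reusable but not shareable" rule is to confine one instance to each thread. The following is only a minimal sketch of that idea: PerThreadInstanceSketch is a hypothetical helper that is not part of OpenCms, and the factory passed to it is an assumption (with the real class it could be CmsHtmlStripper::new).

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of OpenCms): hands out one instance per thread,
// so a non-thread-safe object such as a tag stripper is never shared.
final class PerThreadInstanceSketch<T> {

    private final ThreadLocal<T> local;

    // The factory is called lazily, once per thread, on that thread's first get().
    // With the real class this could be CmsHtmlStripper::new (assumption).
    PerThreadInstanceSketch(Supplier<T> factory) {
        this.local = ThreadLocal.withInitial(factory);
    }

    // Returns the calling thread's own instance, creating it on first use.
    T get() {
        return local.get();
    }
}
```

Each thread then configures its own instance (for example with addPreserveTag) before use, because configuration made on one thread's instance is not visible on another thread's instance.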
    • Constructor Summary

      Constructors 
      Constructor Description
      CmsHtmlStripper()
      Default constructor that turns echo on and uses the settings for replacing tags.
      CmsHtmlStripper(boolean useTidy)
      Creates an instance and controls whether tidy is used.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      boolean addPreserveTag(java.lang.String tagName)
      Adds a tag that will be preserved by stripHtml(String).
      void addPreserveTagList(java.util.List<java.lang.String> preserveTags)
      Convenience method for adding several tags to preserve.
      void addPreserveTags(java.lang.String tagList, char separator)
      Convenience method for adding several tags to preserve in form of a delimiter-separated String.
      void reset()
      Resets the configuration of the tags to preserve.
      java.lang.String stripHtml(java.lang.String html)
      Extracts the text from the given html content, assuming the given html encoding.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • CmsHtmlStripper

        public CmsHtmlStripper()
        Default constructor that turns echo on and uses the settings for replacing tags.

      • CmsHtmlStripper

        public CmsHtmlStripper(boolean useTidy)
        Creates an instance and controls whether tidy is used.

        Parameters:
        useTidy - if true, tidy will be used
    • Method Detail

      • addPreserveTag

        public boolean addPreserveTag(java.lang.String tagName)
        Adds a tag that will be preserved by stripHtml(String).

        Parameters:
        tagName - the name of the tag to keep (case insensitive)
        Returns:
        true if the tagName was added correctly to the internal engine
      • addPreserveTagList

        public void addPreserveTagList(java.util.List<java.lang.String> preserveTags)
        Convenience method for adding several tags to preserve.

        Parameters:
        preserveTags - a List<String> with the case-insensitive tag names of the tags to preserve
        See Also:
        addPreserveTag(String)
      • addPreserveTags

        public void addPreserveTags(java.lang.String tagList,
                                    char separator)
        Convenience method for adding several tags to preserve in form of a delimiter-separated String.

        The String will be split using CmsStringUtil.splitAsList(String, char, boolean), with tagList as the first argument, separator as the second argument, and the third argument set to true (trimming support).

        Parameters:
        tagList - a delimiter-separated String with case-insensitive tag names to preserve by stripHtml(String)
        separator - the delimiter that separates tag names in the tagList argument
        See Also:
        addPreserveTag(String)
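
The effect of splitting with trimming can be illustrated with a small stand-in. This sketch does not call CmsStringUtil.splitAsList; TagListSplitSketch and splitTrimmed are hypothetical names, and skipping empty entries is an assumption of the sketch that may differ from the OpenCms behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in (not the OpenCms implementation): splits a
// delimiter-separated tag list and trims whitespace around each name.
final class TagListSplitSketch {

    static List<String> splitTrimmed(String tagList, char separator) {
        List<String> result = new ArrayList<>();
        int start = 0;
        // Walk one position past the end so the last piece is flushed too.
        for (int i = 0; i <= tagList.length(); i++) {
            if (i == tagList.length() || tagList.charAt(i) == separator) {
                String piece = tagList.substring(start, i).trim();
                // Assumption of this sketch: empty entries are skipped.
                if (!piece.isEmpty()) {
                    result.add(piece);
                }
                start = i + 1;
            }
        }
        return result;
    }
}
```

In this sketch, splitTrimmed("p, br ,strong", ',') yields the three names p, br and strong, so whitespace around the separator in tagList does not end up in the tag names.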
      • reset

        public void reset()
        Resets the configuration of the tags to preserve.

        This method is called from the constructor. It only needs to be called explicitly if this instance is reused with a different configuration (of tags to keep).

      • stripHtml

        public java.lang.String stripHtml(java.lang.String html)
                                   throws org.htmlparser.util.ParserException
        Extracts the text from the given html content, assuming the given html encoding.

        Additionally, tags are replaced or removed according to the configuration of this instance.

        Please note:

        There are static process methods in the superclass that will not do the replacements / removals. Don't mix them up with this method.

        Parameters:
        html - the content to extract the plain text from.
        Returns:
        the text extracted from the given html content.
        Throws:
        org.htmlparser.util.ParserException - if something goes wrong.
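
The allow-list behaviour described on this page (preserve configured tags, drop all others, call reset() before reconfiguring) can be sketched with a self-contained stand-in. AllowListStripperSketch is hypothetical and regex-based. It is not the OpenCms implementation, which is built on org.htmlparser and can optionally use tidy, so the real output may differ in whitespace and markup handling. The boolean returned by addPreserveTag is also a choice of this sketch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mirrors the method names on this page.
// It only illustrates the allow-list idea and is not robust HTML parsing.
final class AllowListStripperSketch {

    // Matches an opening, closing, or self-closing tag; group 1 is the tag name.
    private static final Pattern TAG =
        Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    private final Set<String> preserved = new HashSet<>();

    // Tag names are stored lower-cased so lookups are case insensitive.
    // Sketch-only choice: returns true if the name was not present before.
    boolean addPreserveTag(String tagName) {
        return preserved.add(tagName.toLowerCase(Locale.ROOT));
    }

    // Clears the allow list so the instance can be reconfigured.
    void reset() {
        preserved.clear();
    }

    // Keeps tags whose name is on the allow list and removes all other tags.
    String stripHtml(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String replacement = preserved.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With "B" added as a preserved tag, the sketch turns "<p>Hello <b>world</b></p>" into "Hello <b>world</b>". After reset() the same input becomes "Hello world".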