Class CmsHtmlExtractor

java.lang.Object
org.opencms.util.CmsHtmlExtractor

public final class CmsHtmlExtractor extends Object
Extracts plain text from HTML.

Since:
6.0.0
  • Method Details

    • extractText

      public static String extractText(InputStream in, String encoding) throws org.htmlparser.util.ParserException, UnsupportedEncodingException
      Extracts the text from an HTML page that is read from an input stream.

      Parameters:
      in - the input stream that provides the HTML content
      encoding - the encoding of the content
      Returns:
      the extracted text from the page
      Throws:
      org.htmlparser.util.ParserException - if parsing the HTML fails
      UnsupportedEncodingException - if the given encoding is not supported
    • extractText

      public static String extractText(String content, String encoding) throws org.htmlparser.util.ParserException, UnsupportedEncodingException
      Extracts the text from an HTML page that is given as a string.

      Parameters:
      content - the HTML content
      encoding - the encoding of the content
      Returns:
      the extracted text from the page
      Throws:
      org.htmlparser.util.ParserException - if parsing the HTML fails
      UnsupportedEncodingException - if the given encoding is not supported
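The following is a minimal sketch of preparing the arguments for the InputStream overload, not code taken from OpenCms. It assumes the caller holds the markup as a String and wants the bytes in the stream to match the encoding name passed alongside it. The class name ExtractTextSketch and the helper toStream are hypothetical. The actual call to extractText appears only as a comment, so the block runs without OpenCms and org.htmlparser on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

public class ExtractTextSketch {

    /**
     * Hypothetical helper: encodes the markup with the same encoding name
     * that will later be passed to extractText, so bytes and name agree.
     */
    public static InputStream toStream(String html, String encoding)
            throws UnsupportedEncodingException {
        // String.getBytes(String) throws UnsupportedEncodingException
        // when the JVM does not know the given charset name.
        return new ByteArrayInputStream(html.getBytes(encoding));
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><h1>Title</h1><p>Hello, world.</p></body></html>";
        String encoding = "UTF-8";

        try (InputStream in = toStream(html, encoding)) {
            // With OpenCms and org.htmlparser available, the call would be:
            // String text = org.opencms.util.CmsHtmlExtractor.extractText(in, encoding);
            System.out.println(in.available() + " bytes ready for extraction");
        } catch (UnsupportedEncodingException e) {
            // The same exception type is declared by both extractText overloads.
            System.err.println("Unsupported encoding: " + encoding);
        }
    }
}
```

Both overloads declare the checked exceptions org.htmlparser.util.ParserException and UnsupportedEncodingException, so a caller of the real method has to catch or propagate both.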