Package org.opencms.search.extractors
Class CmsExtractorHtml
java.lang.Object
org.opencms.search.extractors.A_CmsTextExtractor
org.opencms.search.extractors.CmsExtractorHtml
- All Implemented Interfaces:
I_CmsTextExtractor
Extracts the text from an HTML document.
- Since:
- 6.0.0
Method Summary
Modifier and Type, Method, Description
- extractText(InputStream in, String encoding)
  Extracts the text and meta information from the document on the input stream, using the specified content encoding.
- static I_CmsTextExtractor getExtractor()
  Returns an instance of this text extractor.
Methods inherited from class org.opencms.search.extractors.A_CmsTextExtractor
combineContentItem, extractText, extractText, extractText, extractText, removeControlChars
Method Details
getExtractor
Returns an instance of this text extractor.
- Returns:
- an instance of this text extractor
extractText
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream, using the specified content encoding.
The encoding is a hint for the text extractor: if the value given is null, the text extractor should try to figure out the encoding itself.
- Specified by:
extractText in interface I_CmsTextExtractor
- Overrides:
extractText in class A_CmsTextExtractor
- Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
- Returns:
- the extracted text and meta information
- Throws:
Exception - if the text extraction fails
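The "encoding is a hint" contract described for extractText can be illustrated with a small, self-contained sketch. This is not OpenCms code and does not use the OpenCms API: the class EncodingHintSketch, its resolveCharset and lookup helpers, the <meta> tag sniffing, the UTF-8 fallback, and the regex-based tag stripping are all assumptions made for illustration. It only shows one plausible way an extractor might honor a non-null encoding and fall back to its own detection when the value is null.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-in sketch (hypothetical, not OpenCms code): treats the encoding argument as a hint. */
public class EncodingHintSketch {

    // Matches charset=... inside a <meta> tag, e.g. <meta charset="utf-8">
    // or <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the named charset, or null if the name is null, illegal, or unsupported. */
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers null names, IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    /** Uses the hint if it is usable, else a <meta> declaration, else UTF-8 (assumed default). */
    public static Charset resolveCharset(byte[] html, String encodingHint) {
        Charset hinted = (encodingHint == null) ? null : lookup(encodingHint);
        if (hinted != null) {
            return hinted;
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup can be scanned safely.
        int headLength = Math.min(html.length, 1024);
        String head = new String(html, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (matcher.find()) {
            Charset declared = lookup(matcher.group(1));
            if (declared != null) {
                return declared;
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the stream with the resolved charset and strips tags naively (illustration only). */
    public static String extractText(InputStream in, String encodingHint) throws IOException {
        byte[] bytes = in.readAllBytes(); // requires Java 9 or later
        String html = new String(bytes, resolveCharset(bytes, encodingHint));
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head><body><p>caf\u00e9</p></body></html>"
            .getBytes(StandardCharsets.ISO_8859_1);
        // A null hint makes the sketch detect the encoding from the <meta> tag.
        System.out.println(extractText(new ByteArrayInputStream(page), null));
        // A non-null hint is used directly, without inspecting the document.
        System.out.println(resolveCharset(page, "UTF-8"));
    }
}
```

A real HTML extractor would rely on an HTML parser rather than regular expressions. The sketch keeps the parsing trivial so that the hint-then-detect order stays visible.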