Package org.opencms.search.extractors
Class CmsExtractorHtml
java.lang.Object
org.opencms.search.extractors.A_CmsTextExtractor
org.opencms.search.extractors.CmsExtractorHtml
- All Implemented Interfaces:
I_CmsTextExtractor
Extracts the text from an HTML document.
- Since:
- 6.0.0
Method Summary
Modifier and Type, Method, Description
- extractText(InputStream in, String encoding): Extracts the text and meta information from the document on the input stream, using the specified content encoding.
- static I_CmsTextExtractor getExtractor(): Returns an instance of this text extractor.
Methods inherited from class org.opencms.search.extractors.A_CmsTextExtractor
combineContentItem, extractText, extractText, extractText, extractText, removeControlChars
Method Details
getExtractor
Returns an instance of this text extractor.
- Returns:
- an instance of this text extractor
extractText
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream, using the specified content encoding.
The encoding is a hint for the text extractor; if the value given is null, then the text extractor should try to figure out the encoding itself.
- Specified by:
- extractText in interface I_CmsTextExtractor
- Overrides:
- extractText in class A_CmsTextExtractor
- Parameters:
- in - the input stream for the document to extract the text from
- encoding - the encoding to use
- Returns:
- the extracted text and meta information
- Throws:
- Exception - if the text extraction fails
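The encoding contract of extractText (an explicit encoding is used as a hint, while null leaves detection to the extractor) can be illustrated with a small standalone sketch. This is not the OpenCms implementation: the class EncodingHintSketch and its methods resolveCharset and decode are hypothetical names, and the detection strategy shown here (check for a byte order mark, otherwise fall back to UTF-8) is an assumption chosen only to make the idea concrete.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of OpenCms: shows one way an extractor
// could treat the encoding argument as a hint.
class EncodingHintSketch {

    // Uses the hint when given; otherwise inspects a byte order mark (BOM)
    // and falls back to UTF-8 (an assumed default).
    static Charset resolveCharset(PushbackInputStream in, String encodingHint) throws IOException {
        if (encodingHint != null) {
            return Charset.forName(encodingHint);
        }
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream shorter than three bytes
            }
            n += r;
        }
        Charset detected = StandardCharsets.UTF_8;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            bomLength = 3; // UTF-8 BOM
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            detected = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            detected = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back every byte that was read but is not part of a BOM.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return detected;
    }

    // Reads the whole stream and decodes it with the resolved charset.
    static String decode(InputStream raw, String encodingHint) {
        try {
            PushbackInputStream in = new PushbackInputStream(raw, 3);
            Charset charset = resolveCharset(in, encodingHint);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int r;
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "<p>Hello</p>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[body.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(body, 0, withBom, 3, body.length);
        // No hint given, so the charset is taken from the BOM.
        System.out.println(decode(new ByteArrayInputStream(withBom), null)); // prints <p>Hello</p>
    }
}
```

Passing a non-null hint such as "ISO-8859-1" skips detection entirely, which mirrors the documented role of the encoding parameter. A real HTML extractor would likely also consult in-document declarations such as a meta charset tag; that step is left out of this sketch.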