Package org.apache.pdfbox.util
Class PDFMarkedContentExtractor
- java.lang.Object
  - org.apache.pdfbox.util.PDFStreamEngine
    - org.apache.pdfbox.util.PDFMarkedContentExtractor
public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
- Version:
$Revision$
- Author:
koch
Field Summary
- protected java.lang.String outputEncoding: The encoding that text will be written in (or null).
Constructor Summary
- PDFMarkedContentExtractor(): Instantiate a new PDFMarkedContentExtractor object.
- PDFMarkedContentExtractor(java.lang.String encoding): Instantiate a new PDFMarkedContentExtractor object.
- PDFMarkedContentExtractor(java.util.Properties props): Instantiate a new PDFMarkedContentExtractor object.
Method Summary
All Methods (all are concrete instance methods):
- void beginMarkedContentSequence(COSName tag, COSDictionary properties)
- void endMarkedContentSequence()
- java.util.List<PDMarkedContent> getMarkedContents()
- protected void processTextPosition(TextPosition text): This will process a TextPosition object and add the text to the list of characters on a page.
- void xobject(PDXObject xobject)
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, inspectFontEncoding, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
Constructor Detail
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws java.io.IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will not apply any encoding-specific conversion to the output text.
- Throws:
java.io.IOException - If there is an error loading the properties.
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(java.util.Properties props) throws java.io.IOException
Instantiate a new PDFMarkedContentExtractor object, loading all of the operator mappings from the properties object that is passed in. This constructor does not apply any encoding-specific conversion to the output text.
- Parameters:
props - The properties containing the mapping of operators to PDFOperator classes.
- Throws:
java.io.IOException - If there is an error reading the properties.
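As an illustration of the props parameter, the sketch below builds a java.util.Properties object that maps the three PDF marked-content operators (BMC, BDC, EMC) to operator processor class names. The class names on the right-hand side and the helper name markedContentOperators are assumptions made for this example; check them against the PDFMarkedContentExtractor.properties file shipped with your PDFBox version before relying on them.

```java
import java.util.Properties;

/** Hypothetical helper; only shows the shape of an operator-to-class mapping. */
final class OperatorMappingSketch {

    /** Builds a mapping from PDF operator names to processor class names. */
    static Properties markedContentOperators() {
        Properties props = new Properties();
        // Keys are PDF content stream operators. The values are ASSUMED
        // class names and may differ between PDFBox releases.
        props.setProperty("BMC",
                "org.apache.pdfbox.util.operator.BeginMarkedContentSequence");
        props.setProperty("BDC",
                "org.apache.pdfbox.util.operator.BeginMarkedContentSequenceWithProperties");
        props.setProperty("EMC",
                "org.apache.pdfbox.util.operator.EndMarkedContentSequence");
        return props;
    }
}
```

With PDFBox on the classpath, the result would be passed as new PDFMarkedContentExtractor(OperatorMappingSketch.markedContentOperators()), inside a block that handles java.io.IOException. A mapping this small would leave text and graphics operators unregistered, so a real configuration would likely need many more entries.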
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(java.lang.String encoding) throws java.io.IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will apply encoding-specific conversions to the output text.
- Parameters:
encoding - The encoding that the output will be written in.
- Throws:
java.io.IOException - If there is an error reading the properties.
-
-
Method Detail
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
-
endMarkedContentSequence
public void endMarkedContentSequence()
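This page gives no description for beginMarkedContentSequence and endMarkedContentSequence. One plausible reading, stated here as an assumption, is that they are called when the engine meets a begin (BMC/BDC) or end (EMC) marked-content operator, and that nested sequences are tracked with a stack. The sketch below models that idea with hypothetical stand-in classes (MarkedNode, MarkedContentModel); it uses no PDFBox types and is not the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for PDMarkedContent: a tag plus nested children. */
final class MarkedNode {
    final String tag;
    final List<MarkedNode> children = new ArrayList<>();

    MarkedNode(String tag) {
        this.tag = tag;
    }
}

/** Hypothetical model of begin/end pairing for marked-content sequences. */
final class MarkedContentModel {
    private final List<MarkedNode> topLevel = new ArrayList<>();
    private final Deque<MarkedNode> open = new ArrayDeque<>();

    /** Opens a sequence; it nests under the innermost open sequence, if any. */
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            topLevel.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    /** Closes the innermost open sequence; an unmatched end is ignored. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    /** Returns the top-level sequences collected so far. */
    List<MarkedNode> getMarkedContents() {
        return topLevel;
    }
}
```

In this model, calling begin("P"), begin("Span"), end(), end() yields one top-level node tagged P with a single child tagged Span, which mirrors the nesting that a list returned by getMarkedContents() would need to represent.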
-
xobject
public void xobject(PDXObject xobject)
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition in class PDFStreamEngine
- Parameters:
text - The text to process.
-
getMarkedContents
public java.util.List<PDMarkedContent> getMarkedContents()