This class takes a PDF document and strips out all of the text, ignoring the
formatting.
author: Ben Litchfield version: $Revision: 1.69 $
charactersByArticle is used to extract text by article divisions. For example,
in a PDF that has two columns like a newspaper, we want to extract the first column and
then the second column. In this example the PDF would have 2 beads (or articles), one for
each column. The size of charactersByArticle would be 5, because not all text on the
screen will fall into one of the articles. The five divisions are shown below:
Text before first article
first article text
text between first article and second article
second article text
text after second article
Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
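The layout of the divisions can be illustrated with a minimal sketch. This is not PDFBox source: the class ArticleDivisionsSketch and its methods are hypothetical, and the 2n + 1 pattern is an assumption generalized from the two cases stated above (2 beads give 5 divisions, no beads give a single entry).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class ArticleDivisionsSketch {

    // One division before the first bead, then for every bead the bead itself
    // plus the text that follows it. Assumed pattern: 2 * beadCount + 1.
    static int divisionCount(int beadCount) {
        if (beadCount < 0) {
            throw new IllegalArgumentException("beadCount must be >= 0");
        }
        return 2 * beadCount + 1;
    }

    // Human-readable labels for each division, in extraction order.
    static List<String> divisionLabels(int beadCount) {
        List<String> labels = new ArrayList<String>();
        if (divisionCount(beadCount) == 1) {
            labels.add("all text on the page");
            return labels;
        }
        labels.add("text before article 1");
        for (int i = 1; i <= beadCount; i++) {
            labels.add("article " + i + " text");
            if (i < beadCount) {
                labels.add("text between article " + i + " and article " + (i + 1));
            } else {
                labels.add("text after article " + beadCount);
            }
        }
        return labels;
    }
}
```

Calling divisionLabels(2) yields five labels in the same order as the list above.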
Instantiate a new PDFTextStripper object. This object will load properties from
Resources/PDFTextStripper.properties.
throws: IOException - If there is an error loading the properties.
Instantiate a new PDFTextStripper object, loading all of the operator mappings
from the properties object that is passed in.
Parameters: props - The properties containing the mapping of operators to PDFOperator classes. throws: IOException - If there is an error reading the properties.
This method is available for subclasses of this class. It will be called after processing
of the document finishes.
Parameters: pdf - The PDF document that is being processed. throws: IOException - If an IO error occurs.
End a page. Default implementation is to do nothing. Subclasses
may provide additional information.
Parameters: page - The page we are about to process. throws: IOException - If there is any error writing to the stream.
End a paragraph. Default implementation is to do nothing. Subclasses
may provide additional information.
throws: IOException - If there is any error writing to the stream.
Character strings are grouped by articles. It is quite common that there
will only be a single article. This returns a List that contains List objects;
the inner lists contain TextPosition objects.
A double List of TextPositions for all text strings on the page.
This will get the last page that will be extracted. This is inclusive;
for example, in a 5 page PDF an endPage value of 5 would extract the
entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults
to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
Value of property endPage.
This is the page that the text extraction will start on. The pages start
at page 1. For example in a 5 page PDF document, if the start page is 1
then all pages will be extracted. If the start page is 4 then pages 4 and 5
will be extracted. The default value is 1.
Value of property startPage.
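The startPage and endPage semantics (1-based, inclusive, defaults of 1 and Integer.MAX_VALUE) can be checked with a small standalone sketch. PageRangeSketch and pagesToExtract are hypothetical names, not PDFBox API, and clamping a start page below 1 up to 1 is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class PageRangeSketch {

    static final int DEFAULT_START_PAGE = 1;
    static final int DEFAULT_END_PAGE = Integer.MAX_VALUE;

    // Returns the 1-based page numbers that fall inside [startPage, endPage]
    // for a document with pageCount pages. Both bounds are inclusive.
    static List<Integer> pagesToExtract(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<Integer>();
        int first = Math.max(startPage, 1);      // assumption: clamp to page 1
        int last = Math.min(endPage, pageCount); // endPage may exceed the document
        for (int p = first; p <= last; p++) {
            pages.add(p);
        }
        return pages;
    }
}
```

With a 5 page document, the defaults select pages 1 to 5, an end page of 2 selects pages 1 and 2, and a start page of 4 selects pages 4 and 5, matching the examples in the descriptions above.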
This will return the text of a document. See writeText.
NOTE: The document must not be encrypted when coming into this method.
Parameters: doc - The document to get the text from. The text of the PDF document. throws: IOException - if the doc state is invalid or it is encrypted.
See Also: PDFTextStripper.getText(PDDocument) Parameters: doc - The document to extract the text from. The document text. throws: IOException - If there is an error extracting the text.
This will process the contents of a page.
Parameters: page - The page to process. Parameters: content - The contents of the page. throws: IOException - If there is an error processing the page.
This will process all of the pages and the text that is in them.
Parameters: pages - The pages object in the document. throws: IOException - If there is an error parsing the text.
This will set the last page to be extracted by this class.
Parameters: endPageValue - New value of property endPage.
setLineSeparator
public void setLineSeparator(String separator)
Set the desired line separator for output text. The line.separator
system property is used if the line separator preference is not set
explicitly using this method.
Parameters: separator - The desired line separator string.
setPageSeparator
public void setPageSeparator(String separator)
Set the desired page separator for output text. The line.separator
system property is used if the page separator preference is not set
explicitly using this method.
Parameters: separator - The desired page separator string.
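The fallback to the line.separator system property can be pictured as follows. This is a minimal sketch under the assumption that an unset preference is represented by null; SeparatorDefaultsSketch is a hypothetical class, and the way PDFBox stores the preference internally may differ.

```java
// Hypothetical sketch, not part of PDFBox.
final class SeparatorDefaultsSketch {

    // null means "not set explicitly" (an assumption of this sketch).
    private String lineSeparator;
    private String pageSeparator;

    void setLineSeparator(String separator) {
        this.lineSeparator = separator;
    }

    void setPageSeparator(String separator) {
        this.pageSeparator = separator;
    }

    // Falls back to the platform line separator when nothing was set.
    String effectiveLineSeparator() {
        return lineSeparator != null ? lineSeparator : System.getProperty("line.separator");
    }

    // The page separator uses the same fallback, as described above.
    String effectivePageSeparator() {
        return pageSeparator != null ? pageSeparator : System.getProperty("line.separator");
    }
}
```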
setShouldSeparateByBeads
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
Set whether the text stripper should group the text output by a list of beads. The default value is true.
Parameters: aShouldSeparateByBeads - The new grouping of beads.
setSortByPosition
public void setSortByPosition(boolean newSortByPosition)
The order of the text tokens in a PDF file may not be the same order
as they appear visually on the screen. For example, a PDF writer may
write out all text by font, so all bold or larger text first, then make a second
pass and write out the normal text.
The default is to not sort by position.
A PDF writer could choose to write each character in a different order. By
default PDFBox does not sort the text tokens before processing them due to
performance reasons.
Parameters: newSortByPosition - Tell PDFBox to sort the text positions.
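The effect of sorting can be sketched with a simple comparator. This is not the algorithm PDFBox uses: Glyph is a hypothetical stand-in for TextPosition, the ordering (top to bottom, then left to right) is an assumption, and y is assumed to grow downward on the page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class SortByPositionSketch {

    // Hypothetical stand-in for a positioned piece of text.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed reading order: smaller y first (higher on the page), then smaller x.
    static final Comparator<Glyph> READING_ORDER = new Comparator<Glyph>() {
        public int compare(Glyph a, Glyph b) {
            int byLine = Float.compare(a.y, b.y);
            return byLine != 0 ? byLine : Float.compare(a.x, b.x);
        }
    };

    // Concatenates the glyph text, either in file order or in sorted order.
    static String join(List<Glyph> glyphsInFileOrder, boolean sortByPosition) {
        List<Glyph> ordered = new ArrayList<Glyph>(glyphsInFileOrder);
        if (sortByPosition) {
            Collections.sort(ordered, READING_ORDER);
        }
        StringBuilder out = new StringBuilder();
        for (Glyph g : ordered) {
            out.append(g.text);
        }
        return out.toString();
    }
}
```

The extra sort is the cost that the default setting avoids.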
Set the bookmark where text extraction should start, inclusive.
Parameters: aStartBookmark - The starting bookmark.
setStartPage
public void setStartPage(int startPageValue)
This will set the first page to be extracted by this class.
Parameters: startPageValue - New value of property startPage.
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
By default the text stripper will attempt to remove text that overlaps other text.
For example, Word paints the same character several times in order to make it look bold.
By setting this to false all text will be extracted, which means that certain sections
will be duplicated, but performance will be better.
Parameters: suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
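One way to picture duplicate suppression is a tolerance check against the characters already kept. This is a minimal sketch, not the PDFBox implementation: Glyph, suppress and the tolerance parameter are hypothetical, and the quadratic scan is chosen for clarity only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class DuplicateSuppressionSketch {

    // Hypothetical stand-in for a positioned character.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph only if no already-kept glyph has the same text within
    // `tolerance` units in both x and y.
    static List<Glyph> suppress(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

A character painted twice with a small offset to simulate bold collapses to a single entry, while the comparison work per character hints at why turning suppression off is faster.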
setWordSeparator
public void setWordSeparator(String separator)
Set the desired word separator for output text. The PDFBox text extraction
algorithm will output a space character if there is enough space between
two words. By default a space character is used. If you need an accurate
count of characters that are found in a PDF document then you might want to
set the word separator to the empty string.
Parameters: separator - The desired word separator string.
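The idea of emitting a separator only when the gap between characters is wide enough can be sketched as below. WordSeparatorSketch, the parallel arrays and gapThreshold are hypothetical; how PDFBox decides that a gap is "enough space" is not specified here.

```java
// Hypothetical sketch, not part of PDFBox.
final class WordSeparatorSketch {

    // glyphs[i] starts at startX[i] and ends at endX[i] on a single line.
    // The word separator is emitted when the horizontal gap to the previous
    // glyph exceeds gapThreshold (an assumed, caller-supplied value).
    static String join(String[] glyphs, float[] startX, float[] endX,
                       float gapThreshold, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && startX[i] - endX[i - 1] > gapThreshold) {
                out.append(wordSeparator);
            }
            out.append(glyphs[i]);
        }
        return out.toString();
    }
}
```

Passing the empty string as the separator leaves only the characters found in the input, so the length of the result is the character count.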
This will tell if the text stripper should sort the text tokens
before writing to the stream.
true If the text tokens will be sorted before being written.
shouldSuppressDuplicateOverlappingText
public boolean shouldSuppressDuplicateOverlappingText()
This will add a character to the list of characters to be printed to
the text file.
Parameters: text - The description of the character to display.
This method is available for subclasses of this class. It will be called before processing
of the document starts.
Parameters: pdf - The PDF document that is being processed. throws: IOException - If an IO error occurs.
Start a new page. Default implementation is to do nothing. Subclasses
may provide additional information.
Parameters: page - The page we are about to process. throws: IOException - If there is any error writing to the stream.
Start a new paragraph. Default implementation is to do nothing. Subclasses
may provide additional information.
throws: IOException - If there is any error writing to the stream.
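The do-nothing hooks follow the template-method pattern, which can be sketched in isolation. Both classes below are hypothetical and simplified: the real hooks take PDDocument or PDPage arguments and may throw IOException, while this sketch uses page numbers and plain strings, and the call order shown is an assumption.

```java
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
class HookedStripperSketch {

    protected final StringBuilder out = new StringBuilder();

    // Default implementations do nothing; subclasses may override them.
    protected void startDocument() { }
    protected void endDocument() { }
    protected void startPage(int pageNumber) { }
    protected void endPage(int pageNumber) { }

    // Assumed call order: document hooks wrap the page hooks.
    final String process(List<String> pageTexts) {
        startDocument();
        for (int i = 0; i < pageTexts.size(); i++) {
            startPage(i + 1);
            out.append(pageTexts.get(i));
            endPage(i + 1);
        }
        endDocument();
        return out.toString();
    }
}

// A subclass that provides additional information around each page.
class TaggedStripperSketch extends HookedStripperSketch {

    @Override
    protected void startPage(int pageNumber) {
        out.append("[page ").append(pageNumber).append("]");
    }

    @Override
    protected void endPage(int pageNumber) {
        out.append("[/page]");
    }
}
```

The base class produces the page text unchanged, and the subclass adds markers without touching the processing loop.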
Write the string to the output stream.
Parameters: text - The text to write to the stream. throws: IOException - If there is an error when writing the text.
See Also: PDFTextStripper.writeText(PDDocument, Writer) Parameters: doc - The document to extract the text. Parameters: outputStream - The stream to write the text to. throws: IOException - If there is an error extracting the text.
This will take a PDDocument and write the text of that document to the print writer.
Parameters: doc - The document to get the data from. Parameters: outputStream - The location to put the text. throws: IOException - If the doc is in an invalid state.
Methods inherited from org.pdfbox.util.PDFStreamEngine