Java Doc for PDFTextStripper.java (PDFBox 0.7.3, package org.pdfbox.util)

java.lang.Object

org.pdfbox.util.PDFStreamEngine

org.pdfbox.util.PDFTextStripper

All known Subclasses:   org.pdfbox.util.PDFHighlighter,  org.pdfbox.examples.util.PrintTextLocations,  org.pdfbox.util.PDFTextStripperByArea,  org.pdfbox.util.PDFText2HTML
PDFTextStripper
public class PDFTextStripper extends PDFStreamEngine
This class takes a PDF document and strips out all of the text, ignoring the formatting.
author:
   Ben Litchfield
version:
   $Revision: 1.69 $

Field Summary
protected Vector charactersByArticle
     The charactersByArticle is used to extract text by article divisions.
protected Writer output
     The stream to write the output to.

Constructor Summary
public PDFTextStripper()
     Instantiate a new PDFTextStripper object.
public PDFTextStripper(Properties props)
     Instantiate a new PDFTextStripper object.

Method Summary
protected void endDocument(PDDocument pdf)
     This method is available for subclasses of this class.
protected void endPage(PDPage page)
     End a page.
protected void endParagraph()
     End a paragraph.
protected void flushText()
     This will print the text to the output stream.
protected List getCharactersByArticle()
     Character strings are grouped by articles.
protected int getCurrentPageNo()
     Get the current page number that is being processed.
public PDOutlineItem getEndBookmark()
     Get the bookmark where text extraction should end, inclusive.
public int getEndPage()
     This will get the last page that will be extracted.
public String getLineSeparator()
     This will get the line separator.
protected Writer getOutput()
     The output stream that is being written to.
public String getPageSeparator()
     This will get the page separator.
public PDOutlineItem getStartBookmark()
     Get the bookmark where text extraction should start, inclusive.
public int getStartPage()
     This is the page that the text extraction will start on.
public String getText(PDDocument doc)
     This will return the text of a document.
public String getText(COSDocument doc)

See Also:   PDFTextStripper.getText(PDDocument)
Parameters:
  doc - The document to extract the text from.
public String getWordSeparator()
     This will get the word separator.
protected void processPage(PDPage page, COSStream content)
     This will process the contents of a page.
protected void processPages(List pages)
     This will process all of the pages and the text that is in them.
public void setEndBookmark(PDOutlineItem aEndBookmark)
     Set the bookmark where the text extraction should stop.
public void setEndPage(int endPageValue)
     This will set the last page to be extracted by this class.
public void setLineSeparator(String separator)
     Set the desired line separator for output text.
public void setPageSeparator(String separator)
     Set the desired page separator for output text.
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
     Set if the text stripper should group the text output by a list of beads.
public void setSortByPosition(boolean newSortByPosition)
     The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
public void setStartBookmark(PDOutlineItem aStartBookmark)
     Set the bookmark where text extraction should start, inclusive.
public void setStartPage(int startPageValue)
     This will set the first page to be extracted by this class.
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
     By default the text stripper will attempt to remove text that overlaps. Word paints the same character several times in order to make it look bold.
public void setWordSeparator(String separator)
     Set the desired word separator for output text.
public boolean shouldSeparateByBeads()
     This will tell if the text stripper should separate by beads.
public boolean shouldSortByPosition()
     This will tell if the text stripper should sort the text tokens before writing to the stream.
public boolean shouldSuppressDuplicateOverlappingText()

protected void showCharacter(TextPosition text)
     This will add a character to the list of characters to be printed to the text file.
protected void startDocument(PDDocument pdf)
     This method is available for subclasses of this class.
protected void startPage(PDPage page)
     Start a new page.
protected void startParagraph()
     Start a new paragraph.
protected void writeCharacters(TextPosition text)
     Write the string to the output stream.
public void writeText(COSDocument doc, Writer outputStream)

public void writeText(PDDocument doc, Writer outputStream)
     This will take a PDDocument and write the text of that document to the print writer.

Field Detail
charactersByArticle
protected Vector charactersByArticle
The charactersByArticle is used to extract text by article divisions. For example, for a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of the charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are:
  Text before first article
  first article text
  text between first article and second article
  second article text
  text after second article
Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
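
The division layout can be illustrated with a small standalone sketch. It does not use PDFBox: the class ArticleDivisions and its methods are hypothetical, and the 2 * beads + 1 rule is an assumption extrapolated from the two-bead example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models how text falls into
// divisions before, inside, between, and after article beads.
final class ArticleDivisions {

    // Assumption: n beads give 2n + 1 divisions (5 for the two-column example).
    static int divisionCount(int beadCount) {
        return 2 * beadCount + 1;
    }

    // Labels each division in reading order, following the list above.
    static List<String> labels(int beadCount) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < beadCount; i++) {
            result.add(i == 0
                    ? "before article 1"
                    : "between article " + i + " and article " + (i + 1));
            result.add("article " + (i + 1));
        }
        result.add(beadCount == 0 ? "all text" : "after article " + beadCount);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(divisionCount(2)); // prints 5
        System.out.println(labels(2));
    }
}
```

With zero beads the sketch yields a single division, which matches the note that most PDFs produce a single entry.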

output
protected Writer output
The stream to write the output to.

Constructor Detail
PDFTextStripper
public PDFTextStripper() throws IOException
Instantiate a new PDFTextStripper object. This object will load properties from Resources/PDFTextStripper.properties.
throws:
  IOException - If there is an error loading the properties.

PDFTextStripper
public PDFTextStripper(Properties props) throws IOException
Instantiate a new PDFTextStripper object, loading all of the operator mappings from the properties object that is passed in.
Parameters:
  props - The properties containing the mapping of operators to PDFOperator classes.
throws:
  IOException - If there is an error reading the properties.

Method Detail
endDocument
protected void endDocument(PDDocument pdf) throws IOException
This method is available for subclasses of this class. It will be called after processing of the document finishes.
Parameters:
  pdf - The PDF document that is being processed.
throws:
  IOException - If an IO error occurs.

endPage
protected void endPage(PDPage page) throws IOException
End a page. Default implementation is to do nothing. Subclasses may provide additional information.
Parameters:
  page - The page that has just been processed.
throws:
  IOException - If there is any error writing to the stream.

endParagraph
protected void endParagraph() throws IOException
End a paragraph. Default implementation is to do nothing. Subclasses may provide additional information.
throws:
  IOException - If there is any error writing to the stream.

flushText
protected void flushText() throws IOException
This will print the text to the output stream.
throws:
  IOException - If there is an error writing the text.

getCharactersByArticle
protected List getCharactersByArticle()
Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects; the inner lists contain TextPosition objects.
Returns:
  A double List of TextPositions for all text strings on the page.

getCurrentPageNo
protected int getCurrentPageNo()
Get the current page number that is being processed.
Returns:
  A 1 based number representing the current page.

getEndBookmark
public PDOutlineItem getEndBookmark()
Get the bookmark where text extraction should end, inclusive. Default is null.
Returns:
  The ending bookmark.

getEndPage
public int getEndPage()
This will get the last page that will be extracted. This is inclusive; for example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
Returns:
  Value of property endPage.

getLineSeparator
public String getLineSeparator()
This will get the line separator.
Returns:
  The desired line separator string.

getOutput
protected Writer getOutput()
The output stream that is being written to.
Returns:
  The stream that output is being written to.

getPageSeparator
public String getPageSeparator()
This will get the page separator.
Returns:
  The page separator string.

getStartBookmark
public PDOutlineItem getStartBookmark()
Get the bookmark where text extraction should start, inclusive. Default is null.
Returns:
  The starting bookmark.

getStartPage
public int getStartPage()
This is the page that the text extraction will start on. The pages start at page 1. For example, in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
Returns:
  Value of property startPage.
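
The inclusive page window described for getStartPage and getEndPage can be modeled in a few lines. This is a standalone sketch rather than PDFBox code; the class PageRangeSketch and the method pagesToExtract are hypothetical names, and only the defaults (1 and Integer.MAX_VALUE) come from the descriptions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the page window; not PDFBox code.
final class PageRangeSketch {
    // Defaults taken from the descriptions: start at page 1, no upper limit.
    private int startPage = 1;
    private int endPage = Integer.MAX_VALUE;

    void setStartPage(int value) { startPage = value; }
    void setEndPage(int value) { endPage = value; }

    // Returns the 1-based page numbers that fall inside the inclusive window.
    List<Integer> pagesToExtract(int pageCount) {
        List<Integer> pages = new ArrayList<Integer>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (pageNo >= startPage && pageNo <= endPage) {
                pages.add(pageNo);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        PageRangeSketch range = new PageRangeSketch();
        range.setStartPage(4);
        System.out.println(range.pagesToExtract(5)); // prints [4, 5]
    }
}
```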

getText
public String getText(PDDocument doc) throws IOException
This will return the text of a document. See writeText.
NOTE: The document must not be encrypted when coming into this method.
Parameters:
  doc - The document to get the text from.
Returns:
  The text of the PDF document.
throws:
  IOException - if the doc state is invalid or it is encrypted.
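
The relationship between getText and writeText can be pictured as a String-returning method layered on a Writer-based one. The sketch below is a hypothetical miniature that treats pages as plain strings; it is not PDFBox code, and the assumption that getText collects writeText output in a StringWriter is an illustration, since the description only says "See writeText".

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical miniature, not PDFBox code: pages are plain strings here.
final class GetTextSketch {

    // Writer-based entry point: streams each page's text to the caller's Writer.
    static void writeText(String[] pages, Writer outputStream) throws IOException {
        for (String page : pages) {
            outputStream.write(page);
        }
    }

    // String-based entry point: collects the same output in memory.
    static String getText(String[] pages) throws IOException {
        StringWriter buffer = new StringWriter();
        writeText(pages, buffer);
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(getText(new String[] {"first page. ", "second page."}));
    }
}
```

Under this reading, writeText suits large documents because the text goes straight to a file or socket Writer, while getText is convenient when the whole text fits in memory.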

getText
public String getText(COSDocument doc) throws IOException

See Also:   PDFTextStripper.getText(PDDocument)
Parameters:
  doc - The document to extract the text from.
Returns:
  The document text.
throws:
  IOException - If there is an error extracting the text.

getWordSeparator
public String getWordSeparator()
This will get the word separator.
Returns:
  The desired word separator string.

processPage
protected void processPage(PDPage page, COSStream content) throws IOException
This will process the contents of a page.
Parameters:
  page - The page to process.
  content - The contents of the page.
throws:
  IOException - If there is an error processing the page.

processPages
protected void processPages(List pages) throws IOException
This will process all of the pages and the text that is in them.
Parameters:
  pages - The pages object in the document.
throws:
  IOException - If there is an error parsing the text.

setEndBookmark
public void setEndBookmark(PDOutlineItem aEndBookmark)
Set the bookmark where the text extraction should stop.
Parameters:
  aEndBookmark - The ending bookmark.

setEndPage
public void setEndPage(int endPageValue)
This will set the last page to be extracted by this class.
Parameters:
  endPageValue - New value of property endPage.

setLineSeparator
public void setLineSeparator(String separator)
Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.
Parameters:
  separator - The desired line separator string.

setPageSeparator
public void setPageSeparator(String separator)
Set the desired page separator for output text. The line.separator system property is used if the page separator preference is not set explicitly using this method.
Parameters:
  separator - The desired page separator string.
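
The default behavior of setLineSeparator and setPageSeparator can be sketched as two fields initialized from the line.separator system property. The class SeparatorSketch is hypothetical, not PDFBox code, and the placement of the separators (one after every line and one after every page) is an assumption made for illustration.

```java
// Hypothetical model of the separator settings; not PDFBox code.
final class SeparatorSketch {
    // Both default to the platform line separator, per the descriptions above.
    private String lineSeparator = System.getProperty("line.separator");
    private String pageSeparator = System.getProperty("line.separator");

    void setLineSeparator(String separator) { lineSeparator = separator; }
    void setPageSeparator(String separator) { pageSeparator = separator; }

    // pages[p][l] is line l of page p.
    // Assumption: a separator is appended after every line and every page.
    String render(String[][] pages) {
        StringBuilder out = new StringBuilder();
        for (String[] page : pages) {
            for (String line : page) {
                out.append(line).append(lineSeparator);
            }
            out.append(pageSeparator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SeparatorSketch s = new SeparatorSketch();
        s.setLineSeparator("\n");
        s.setPageSeparator("\f"); // form feed between pages
        System.out.print(s.render(new String[][] {{"a", "b"}, {"c"}}));
    }
}
```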

setShouldSeparateByBeads
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
Set if the text stripper should group the text output by a list of beads. The default value is true.
Parameters:
  aShouldSeparateByBeads - The new grouping of beads.

setSortByPosition
public void setSortByPosition(boolean newSortByPosition)
The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text first, then make a second pass and write out the normal text.
The default is to not sort by position.

A PDF writer could choose to write each character in a different order. By default PDFBox does not sort the text tokens before processing them due to performance reasons.
Parameters:
  newSortByPosition - Tell PDFBox to sort the text positions.
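
A minimal sketch of what sorting by position means, assuming tokens carry x and y coordinates with y growing downward as on screen. The Token class and readInVisualOrder are hypothetical and are not PDFBox's TextPosition or its actual sorting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of position sorting; not PDFBox code.
final class SortByPositionSketch {

    // Stand-in for a positioned text token.
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders tokens top to bottom, then left to right, and joins them.
    // Assumption: y grows downward, as on screen.
    static String readInVisualOrder(List<Token> fileOrder) {
        List<Token> sorted = new ArrayList<Token>(fileOrder);
        Collections.sort(sorted, new Comparator<Token>() {
            public int compare(Token a, Token b) {
                int byRow = Float.compare(a.y, b.y);
                return byRow != 0 ? byRow : Float.compare(a.x, b.x);
            }
        });
        StringBuilder out = new StringBuilder();
        for (Token t : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(t.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // File order: the writer emitted the bold tokens first, then the normal text.
        List<Token> tokens = Arrays.asList(
                new Token("Heading", 0f, 0f),
                new Token("important", 50f, 20f),
                new Token("This", 0f, 20f),
                new Token("is", 30f, 20f));
        System.out.println(readInVisualOrder(tokens)); // prints Heading This is important
    }
}
```

The sort adds work for every page, which is consistent with the note that sorting is off by default for performance reasons.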

setStartBookmark
public void setStartBookmark(PDOutlineItem aStartBookmark)
Set the bookmark where text extraction should start, inclusive.
Parameters:
  aStartBookmark - The starting bookmark.

setStartPage
public void setStartPage(int startPageValue)
This will set the first page to be extracted by this class.
Parameters:
  startPageValue - New value of property startPage.

setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
By default the text stripper will attempt to remove text that overlaps. Word paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
Parameters:
  suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
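
One way to picture duplicate suppression is to drop a character when the same character has already been seen at nearly the same position. The sketch below is an assumption about the general idea, not PDFBox's algorithm; the Glyph class, the extract method, and the tolerance parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of duplicate suppression; not PDFBox code.
final class DuplicateSuppressionSketch {

    // Stand-in for a positioned character.
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph unless suppression is on and an equal character
    // was already kept within 'tolerance' units in both directions.
    static String extract(List<Glyph> glyphs, boolean suppress, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            if (suppress) {
                for (Glyph k : kept) {
                    if (k.ch == g.ch
                            && Math.abs(k.x - g.x) < tolerance
                            && Math.abs(k.y - g.y) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hi" painted twice with a small offset to simulate bold.
        List<Glyph> bold = Arrays.asList(
                new Glyph('H', 10.0f, 10f), new Glyph('H', 10.3f, 10f),
                new Glyph('i', 18.0f, 10f), new Glyph('i', 18.3f, 10f));
        System.out.println(extract(bold, true, 1f));  // prints Hi
        System.out.println(extract(bold, false, 1f)); // prints HHii
    }
}
```

The inner comparison loop in this sketch only runs when suppression is enabled, which hints at why disabling it trades duplicated text for speed.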

setWordSeparator
public void setWordSeparator(String separator)
Set the desired word separator for output text. The PDFBox text extraction algorithm will output the word separator if there is enough space between two words. By default a space character is used. If you need an accurate count of the characters that are found in a PDF document then you might want to set the word separator to the empty string.
Parameters:
  separator - The desired word separator string.
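
The effect of the word separator on character counts can be shown with a small gap-based sketch. This is not PDFBox's extraction algorithm; the class WordSeparatorSketch, the join method, and the gapThreshold parameter are hypothetical, and the rule "emit the separator when the horizontal gap exceeds a threshold" is a simplifying assumption.

```java
// Hypothetical illustration of gap-based word separation; not PDFBox code.
final class WordSeparatorSketch {

    // starts[i] is the left edge of character i and widths[i] its width.
    // The separator is emitted when the gap to the previous character
    // exceeds gapThreshold.
    static String join(char[] chars, float[] starts, float[] widths,
                       float gapThreshold, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0) {
                float gap = starts[i] - (starts[i - 1] + widths[i - 1]);
                if (gap > gapThreshold) {
                    out.append(wordSeparator);
                }
            }
            out.append(chars[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] chars = {'a', 'b', 'c', 'd'};
        float[] starts = {0f, 5f, 15f, 20f};
        float[] widths = {5f, 5f, 5f, 5f};
        System.out.println(join(chars, starts, widths, 2f, " ")); // prints ab cd
        System.out.println(join(chars, starts, widths, 2f, ""));  // prints abcd
    }
}
```

With the empty separator the output length equals the number of characters actually present in the input, which is the use case the description mentions.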

shouldSeparateByBeads
public boolean shouldSeparateByBeads()
This will tell if the text stripper should separate by beads.
Returns:
  If the text will be grouped by beads.

shouldSortByPosition
public boolean shouldSortByPosition()
This will tell if the text stripper should sort the text tokens before writing to the stream.
Returns:
  true if the text tokens will be sorted before being written.

shouldSuppressDuplicateOverlappingText
public boolean shouldSuppressDuplicateOverlappingText()
Returns the suppressDuplicateOverlappingText.

showCharacter
protected void showCharacter(TextPosition text)
This will add a character to the list of characters to be printed to the text file.
Parameters:
  text - The description of the character to display.

startDocument
protected void startDocument(PDDocument pdf) throws IOException
This method is available for subclasses of this class. It will be called before processing of the document starts.
Parameters:
  pdf - The PDF document that is being processed.
throws:
  IOException - If an IO error occurs.

startPage
protected void startPage(PDPage page) throws IOException
Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.
Parameters:
  page - The page we are about to process.
throws:
  IOException - If there is any error writing to the stream.

startParagraph
protected void startParagraph() throws IOException
Start a new paragraph. Default implementation is to do nothing. Subclasses may provide additional information.
throws:
  IOException - If there is any error writing to the stream.
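
The start and end hooks (startDocument, startPage, endPage, endDocument) follow a template-method pattern: the base class drives the loop and subclasses override the empty hooks. The sketch below models that pattern with hypothetical classes HookedStripper and TaggedStripper that treat pages as plain strings. It is not PDFBox code, and the exact call order is an assumption based on the hook descriptions.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical base class standing in for the stripper; not PDFBox code.
class HookedStripper {
    protected Writer output;
    protected int currentPageNo;

    // Default implementations do nothing, as the hook descriptions state.
    protected void startDocument() throws IOException { }
    protected void endDocument() throws IOException { }
    protected void startPage() throws IOException { }
    protected void endPage() throws IOException { }

    // Assumed call order: document hooks wrap the page loop,
    // page hooks wrap each page.
    public void writeText(String[] pages, Writer outputStream) throws IOException {
        output = outputStream;
        startDocument();
        for (int i = 0; i < pages.length; i++) {
            currentPageNo = i + 1; // 1 based, like getCurrentPageNo()
            startPage();
            output.write(pages[i]);
            endPage();
        }
        endDocument();
    }
}

// Subclass that marks page boundaries by overriding the page hooks.
class TaggedStripper extends HookedStripper {
    protected void startPage() throws IOException {
        output.write("<page " + currentPageNo + ">");
    }

    protected void endPage() throws IOException {
        output.write("</page>");
    }

    // Convenience wrapper so callers do not have to handle IOException.
    static String render(String[] pages) {
        StringWriter out = new StringWriter();
        try {
            new TaggedStripper().writeText(pages, out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <page 1>one</page><page 2>two</page>
        System.out.println(render(new String[] {"one", "two"}));
    }
}
```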

writeCharacters
protected void writeCharacters(TextPosition text) throws IOException
Write the string to the output stream.
Parameters:
  text - The text to write to the stream.
throws:
  IOException - If there is an error when writing the text.

writeText
public void writeText(COSDocument doc, Writer outputStream) throws IOException

See Also:   PDFTextStripper.writeText(PDDocument, Writer)
Parameters:
  doc - The document to extract the text from.
  outputStream - The stream to write the text to.
throws:
  IOException - If there is an error extracting the text.

writeText
public void writeText(PDDocument doc, Writer outputStream) throws IOException
This will take a PDDocument and write the text of that document to the print writer.
Parameters:
  doc - The document to get the data from.
  outputStream - The location to put the text.
throws:
  IOException - If the doc is in an invalid state.

Methods inherited from org.pdfbox.util.PDFStreamEngine
public Map getColorSpaces()
public PDPage getCurrentPage()
public Map getFonts()
public Stack getGraphicsStack()
public PDGraphicsState getGraphicsState()
public Map getGraphicsStates()
public PDResources getResources()
public Matrix getTextLineMatrix()
public Matrix getTextMatrix()
public Map getXObjects()
public void processOperator(String operation, List arguments) throws IOException
protected void processOperator(PDFOperator operator, List arguments) throws IOException
public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
public void processSubStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
public void registerOperatorProcessor(String operator, OperatorProcessor op)
public void resetEngine()
public void setColorSpaces(Map value)
public void setFonts(Map value)
public void setGraphicsStack(Stack value)
public void setGraphicsState(PDGraphicsState value)
public void setGraphicsStates(Map value)
public void setTextLineMatrix(Matrix value)
public void setTextMatrix(Matrix value)
protected void showCharacter(TextPosition text)
public void showString(byte[] string) throws IOException

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException