org.archive.util.ms

Memory-efficient reading of .doc files. To extract the text from a .doc file, use {@link org.archive.util.ms.Doc#getText(SeekInputStream)}. That's basically the whole API. The other classes are necessary to make that method work, and you can probably ignore them.

Implementation/Format Details

These APIs differ from the POI API provided by Apache in that POI wants to load complete documents into memory. Though POI does provide an "event-driven" API that is memory efficient, that API cannot be used to scan text across block or piece boundaries.

This package provides a stream-based API for extracting the text of a .doc file. At this time, the package does not provide a way to extract style attributes, embedded images, subdocuments, change tracking information, and so on.

There are two layers of abstraction between the contents of a .doc file and reality. The first layer is the Block File System, and the second layer is the piece table.

The Block File System

All .doc files are secretly file systems, like a .iso file, but insane. A good overview of how this file system is arranged inside the file is available from the Jakarta POIFS project.

Subfiles and directories in a block file system are represented via the {@link org.archive.util.ms.Entry} interface. The root directory can be obtained via the {@link org.archive.util.ms.BlockFileSystem#getRoot()} method. From there, the child entries can be discovered.

The file system divides its subfiles into 512-byte blocks. Those blocks are not necessarily stored in a linear order; blocks from different subfiles may be interspersed with each other. The {@link org.archive.util.ms.Entry#open()} method returns an input stream that provides a continuous view of a subfile's contents. It does so by moving the file pointer of the .doc file behind the scenes.
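
The continuous view over scattered blocks can be sketched as follows. This is a minimal illustration of the idea, not the code behind Entry#open(): the class BlockChainSketch, the in-memory byte array standing in for the .doc file, the simplified allocation table (next[i] names the block that follows block i), and the end-of-chain marker are all assumptions made for the example. Only the 512-byte block size comes from the description above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class BlockChainSketch {
    static final int BLOCK_SIZE = 512;   // block size stated in the text
    static final int END_OF_CHAIN = -2;  // assumed end marker for this sketch

    /**
     * Reads a subfile by following its block chain.
     *
     * @param file   raw container bytes (hypothetical in-memory stand-in)
     * @param next   simplified allocation table: next[i] is the block after i
     * @param start  first block of the subfile
     * @param length subfile length in bytes
     */
    static byte[] readSubfile(byte[] file, int[] next, int start, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(length);
        int block = start;
        int remaining = length;
        while (remaining > 0 && block != END_OF_CHAIN) {
            int offset = block * BLOCK_SIZE;        // "seek" to the block
            int n = Math.min(BLOCK_SIZE, remaining);
            out.write(file, offset, n);             // copy this block's bytes
            remaining -= n;
            block = next[block];                    // look up the next block
        }
        return out.toByteArray();
    }

    /** Builds a 4-block container; each block is filled with one letter. */
    static byte[] demoFile() {
        byte[] file = new byte[4 * BLOCK_SIZE];
        Arrays.fill(file, 0, BLOCK_SIZE, (byte) 'b');
        Arrays.fill(file, BLOCK_SIZE, 2 * BLOCK_SIZE, (byte) 'x');
        Arrays.fill(file, 2 * BLOCK_SIZE, 3 * BLOCK_SIZE, (byte) 'a');
        Arrays.fill(file, 3 * BLOCK_SIZE, 4 * BLOCK_SIZE, (byte) 'y');
        return file;
    }

    public static void main(String[] args) {
        // One subfile lives in blocks 2 then 0; another in blocks 1 then 3.
        int[] next = {END_OF_CHAIN, 3, 0, END_OF_CHAIN};
        byte[] data = readSubfile(demoFile(), next, 2, BLOCK_SIZE + 10);
        System.out.println(data.length);
        System.out.println((char) data[0] + " ... " + (char) data[data.length - 1]);
    }
}
```

In the sketch the two subfiles are interleaved on purpose, so the reader has to jump backwards in the container to stay within one subfile.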

It's important to keep in mind that any given read on a stream produced by a BlockFileSystem may involve:

  1. Moving the file pointer to the start of the file to look up the main block allocation table.
  2. Navigating the file pointer through various allocation structures located throughout the file.
  3. Finally repositioning the file pointer at the start of the next block to be read.

So, this package lowers memory consumption at the expense of greater IO activity. A future version of this package will use internal caches to minimize IO activity, providing tunable trade-offs between memory and IO.

The Piece Table

The second layer of abstraction between you and the contents of a .doc file is the piece table. Some .doc files are produced using a "fast-save" feature that only writes recent changes to the end of the file. In this case, the text of the document may be fragmented within the document stream itself. Note that this fragmentation is in addition to the block fragmentation described above.

A .doc file contains several subfiles within its filesystem. The two that are important for extracting text are named WordDocument and 0Table. The WordDocument subfile contains the text of the document. The 0Table subfile contains supporting information, including the piece table.

The piece table is a simple map from logical character position to actual subfile stream position. Additionally, each piece table entry describes whether or not the piece stores text using 16-bit Unicode, or using 8-bit ANSI codes. One .doc file can contain both Unicode and ANSI text. A consequence of this is that every .doc file has a piece table, even those that were not "fast-saved".
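
The mapping from logical character position to stream position can be sketched as follows. This is a simplified, in-memory illustration and not the package's PieceTable class: the names PieceTableSketch, Piece, find and toFilePos, and the field layout of a piece, are hypothetical. The sketch assumes pieces are sorted by logical position and that a Unicode piece uses two bytes per character while an ANSI piece uses one.

```java
final class PieceTableSketch {
    /** Hypothetical piece: characters [charStart, charEnd) stored at filePos. */
    static final class Piece {
        final int charStart;
        final int charEnd;
        final long filePos;
        final boolean unicode;

        Piece(int charStart, int charEnd, long filePos, boolean unicode) {
            this.charStart = charStart;
            this.charEnd = charEnd;
            this.filePos = filePos;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }
    }

    /** Binary search for the piece covering a logical character position. */
    static Piece find(Piece[] table, int charPos) {
        int lo = 0;
        int hi = table.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Piece p = table[mid];
            if (charPos < p.charStart) {
                hi = mid - 1;
            } else if (charPos >= p.charEnd) {
                lo = mid + 1;
            } else {
                return p;
            }
        }
        throw new IllegalArgumentException("no piece covers position " + charPos);
    }

    /** Translates a logical character position into a byte position. */
    static long toFilePos(Piece[] table, int charPos) {
        Piece p = find(table, charPos);
        return p.filePos + (long) (charPos - p.charStart) * p.bytesPerChar();
    }

    /** Two made-up pieces: ten ANSI characters, then fifteen Unicode ones. */
    static Piece[] demoTable() {
        return new Piece[] {
            new Piece(0, 10, 1000L, false),
            new Piece(10, 25, 2000L, true),
        };
    }

    public static void main(String[] args) {
        Piece[] table = demoTable();
        System.out.println(toFilePos(table, 3));
        System.out.println(toFilePos(table, 12));
    }
}
```

Unlike this sketch, the package reads the table from the file as needed rather than holding it in an array, which is where the extra file pointer movement described in this section comes from.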

The reader returned by {@link org.archive.util.ms.Doc#getText(SeekInputStream)} consults the piece table to determine where in the WordDocument subfile the next piece of text is located. It also uses the piece table to determine how bytes should be converted to Unicode characters.
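
The byte-to-character conversion for a single piece might look like the sketch below. It is an illustration only and does not reproduce PieceReader or the package's own Cp1252 class: the name PieceDecodeSketch and its decode method are hypothetical, the JDK's windows-1252 charset stands in for the ANSI code page, and the 16-bit text is assumed to be little-endian.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PieceDecodeSketch {
    // Assumption: ANSI pieces use code page 1252, via the JDK's charset.
    static final Charset ANSI = Charset.forName("windows-1252");

    /**
     * Decodes the raw bytes of one piece.
     *
     * @param raw     bytes read from the piece's position in the stream
     * @param unicode true if the piece table marks the piece as 16-bit text
     */
    static String decode(byte[] raw, boolean unicode) {
        // Assumption: 16-bit pieces are little-endian UTF-16 code units.
        return new String(raw, unicode ? StandardCharsets.UTF_16LE : ANSI);
    }

    public static void main(String[] args) {
        byte[] wide = {0x48, 0x00, 0x69, 0x00};                 // two 16-bit units
        byte[] narrow = {(byte) 0x93, 'H', 'i', (byte) 0x94};   // four 8-bit codes
        System.out.println(decode(wide, true));
        System.out.println(decode(narrow, false).length());
    }
}
```

The same four bytes would decode to different text under the other setting, which is why the per-piece flag has to be consulted before any conversion.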

Note, however, that any read from such a reader may involve:

  1. Moving the file pointer to the piece table.
  2. Searching the piece table index for the next piece, which may involve moving the file pointer many times.
  3. Moving the file pointer to that piece's description in the piece table.
  4. Moving the file pointer to the start of the piece indicated by the description.

Since the "file pointer" in this context is the file pointer of the subfile, each move described above may additionally involve the block file system steps:

  1. Moving the file pointer to the start of the file to look up the main block allocation table.
  2. Navigating the file pointer through various allocation structures located throughout the file.
  3. Finally repositioning the file pointer at the start of the next block to be read.

A future implementation will provide an intelligent cache of the piece table, which will hopefully reduce the IO activity required.

Java Source File Name        Type       Comment
BlockFileSystem.java         Interface  Describes the internal file system contained in .doc files.
BlockInputStream.java        Class      InputStream for a file contained in a BlockFileSystem.
Cp1252.java                  Class      A fast implementation of code page 1252.
DefaultBlockFileSystem.java  Class      Default implementation of the Block File System. The overall structure of a BlockFileSystem file (such as a .doc file) is as follows.
DefaultEntry.java            Class
Doc.java                     Class      Reads .doc files.
DocTest.java                 Class
Entry.java                   Interface
HeaderBlock.java             Class
Piece.java                   Class
PieceReader.java             Class
PieceReaderTest.java         Class      Unit test for PieceReader.
PieceTable.java              Class      The piece table of a .doc file.