Memory-efficient reading of .doc files. To extract the text from a .doc
file, use {@link org.archive.util.ms.Doc#getText(SeekInputStream)}. That's
basically the whole API. The other classes are necessary to make that
method work, and you can probably ignore them.
Implementation/Format Details
These APIs differ from the POI API provided by Apache in that POI wants to
load complete documents into memory. Though POI does provide an "event-driven"
API that is memory efficient, that API cannot be used to scan text across block
or piece boundaries.
This package provides a stream-based API for extracting the text of
a .doc file. At this time, the package does not provide a way to extract
style attributes, embedded images, subdocuments, change tracking information,
and so on.
There are two layers of abstraction between the raw bytes of a .doc
file and the text it contains. The first layer is the Block File System, and
the second layer is the piece table.
The Block File System
All .doc files are secretly file systems, like a .iso file, but insane.
A good overview of how this file system is arranged inside the file is
available in the Jakarta POIFS documentation.
Subfiles and directories in a block file system are represented via the
{@link org.archive.util.ms.Entry} interface. The root directory can be
obtained via the {@link org.archive.util.ms.BlockFileSystem#getRoot()}
method. From there, the child entries can be discovered.
The file system divides its subfiles into 512-byte blocks. Those blocks
are not necessarily stored in a linear order; blocks from different subfiles
may be interspersed with each other. The
{@link org.archive.util.ms.Entry#open()} method returns an input stream that
provides a continuous view of a subfile's contents. It does so by moving
the file pointer of the .doc file behind the scenes.
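As a rough illustration of how a continuous view can be assembled from
scattered blocks, the sketch below follows a chain of block indices through
an allocation table. It is not the code in this package. The names
BlockChainSketch, readSubfile and nextBlock are hypothetical, the file header
is ignored, and the end-of-chain marker of -2 is an assumption of the sketch.
For brevity the whole file is held in a byte array, whereas the package itself
seeks within the file.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch only; not the org.archive.util.ms implementation. */
final class BlockChainSketch {
    static final int BLOCK_SIZE = 512;
    /** Assumed marker meaning "no further block in this chain". */
    static final int END_OF_CHAIN = -2;

    /**
     * Reads one subfile by following its chain of blocks.
     *
     * @param blocks     raw block area; block i starts at i * BLOCK_SIZE
     * @param nextBlock  hypothetical allocation table: nextBlock[i] is the
     *                   index of the block that follows block i
     * @param firstBlock index of the subfile's first block
     * @param length     length of the subfile in bytes
     */
    static byte[] readSubfile(byte[] blocks, int[] nextBlock,
                              int firstBlock, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(length);
        int block = firstBlock;
        int remaining = length;
        while (remaining > 0 && block != END_OF_CHAIN) {
            // The last block of a subfile may be only partially used.
            int n = Math.min(BLOCK_SIZE, remaining);
            out.write(blocks, block * BLOCK_SIZE, n);
            remaining -= n;
            // Jump to wherever the next block of this subfile happens to live.
            block = nextBlock[block];
        }
        return out.toByteArray();
    }
}
```

With a table in which block 2 is followed by block 0, calling readSubfile
with firstBlock = 2 returns the bytes of block 2 followed by the bytes of
block 0, regardless of their physical order in the file.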
It's important to keep in mind that any given read on a stream produced
by a BlockFileSystem may involve:
- Moving the file pointer to the start of the file to look up the main
block allocation table.
- Navigating the file pointer through various allocation structures located
throughout the file.
- Finally repositioning the file pointer at the start of the next block
to be read.
So, this package lowers memory consumption at the expense of greater IO
activity. A future version of this package will use internal caches to
minimize IO activity, providing tunable trade-offs between memory and IO.
The Piece Table
The second layer of abstraction between you and the contents of a .doc file
is the piece table. Some .doc files are produced using a "fast-save" feature
that only writes recent changes to the end of the file. In this case, the
text of the document may be fragmented within the document stream itself.
Note that this fragmentation is in addition to the block fragmentation
described above.
A .doc file contains several subfiles within its filesystem. The two
that are important for extracting text are named WordDocument
and 0Table. The WordDocument subfile contains the
text of the document. The 0Table subfile contains supporting
information, including the piece table.
The piece table is a simple map from logical character position to actual
subfile stream position. Additionally, each piece table entry records whether
the piece stores its text as 16-bit Unicode or as 8-bit ANSI codes. One .doc
file can contain both Unicode and ANSI text. A consequence
of this is that every .doc file has a piece table, even those that
were not "fast-saved".
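The mapping from logical character position to stream position can be
sketched as follows. This is a simplified model, not the package's own data
structure. The names PieceTableSketch, Piece and streamPosition are
hypothetical, and the sketch assumes each entry records a character range, a
starting byte offset in the WordDocument subfile, and a Unicode flag.

```java
import java.util.List;

/** Illustrative sketch only; not the org.archive.util.ms implementation. */
final class PieceTableSketch {

    /** One hypothetical piece table entry. */
    static final class Piece {
        final int charStart;     // first logical character position (inclusive)
        final int charEnd;       // last logical character position (exclusive)
        final int streamOffset;  // byte offset of the piece in the subfile
        final boolean unicode;   // true: 2 bytes per char, false: 1 byte

        Piece(int charStart, int charEnd, int streamOffset, boolean unicode) {
            this.charStart = charStart;
            this.charEnd = charEnd;
            this.streamOffset = streamOffset;
            this.unicode = unicode;
        }
    }

    /** Maps a logical character position to a byte offset in the subfile. */
    static long streamPosition(List<Piece> pieces, int charPos) {
        for (Piece p : pieces) {
            if (charPos >= p.charStart && charPos < p.charEnd) {
                int bytesPerChar = p.unicode ? 2 : 1;
                return p.streamOffset
                        + (long) (charPos - p.charStart) * bytesPerChar;
            }
        }
        throw new IndexOutOfBoundsException(
                "No piece contains character " + charPos);
    }
}
```

Note that consecutive logical positions need not be adjacent in the stream: a
later piece may sit at a lower byte offset than an earlier one.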
The reader returned by
{@link org.archive.util.ms.Doc#getText(SeekInputStream)} consults the piece
table to determine where in the WordDocument subfile the next piece of text
is located. It also uses the piece table to determine how bytes should be
converted to Unicode characters.
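A minimal sketch of the byte-to-character conversion is shown below. It
visits pieces in logical order and decodes each with a charset chosen by the
piece's Unicode flag. The names PieceDecodeSketch, Piece and decode are
hypothetical. The sketch assumes that 16-bit text is little-endian UTF-16 and
that "ANSI" means windows-1252, and it decodes from a byte array rather than
from a seekable stream as the package does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not the org.archive.util.ms implementation. */
final class PieceDecodeSketch {
    /** Assumption: 8-bit pieces use the windows-1252 code page. */
    static final Charset ANSI = Charset.forName("windows-1252");

    /** One hypothetical piece: where it starts, how long it is, its encoding. */
    static final class Piece {
        final int streamOffset;  // byte offset in the WordDocument subfile
        final int charCount;     // number of characters in the piece
        final boolean unicode;   // true: 16-bit text, false: 8-bit text

        Piece(int streamOffset, int charCount, boolean unicode) {
            this.streamOffset = streamOffset;
            this.charCount = charCount;
            this.unicode = unicode;
        }
    }

    /** Concatenates the decoded text of the pieces, given in logical order. */
    static String decode(byte[] wordDocument, Piece[] piecesInLogicalOrder) {
        StringBuilder text = new StringBuilder();
        for (Piece p : piecesInLogicalOrder) {
            int byteCount = p.unicode ? p.charCount * 2 : p.charCount;
            Charset charset = p.unicode ? StandardCharsets.UTF_16LE : ANSI;
            text.append(new String(wordDocument, p.streamOffset,
                                   byteCount, charset));
        }
        return text.toString();
    }
}
```

A document whose first piece is 8-bit text stored late in the stream and whose
second piece is 16-bit text stored early still decodes in reading order,
because the loop follows the piece order rather than the byte order.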
Note, however, that any read from such a reader may involve:
- Moving the file pointer to the piece table.
- Searching the piece table index for the next piece, which may
involve moving the file pointer many times.
- Moving the file pointer to that piece's description in the piece table.
- Moving the file pointer to the start of the piece indicated by the
description.
Since the "file pointer" in this context is the file pointer of the
subfile, each move described above may additionally involve:
- Moving the file pointer to the start of the file to look up the main
block allocation table.
- Navigating the file pointer through various allocation structures located
throughout the file.
- Finally repositioning the file pointer at the start of the next block
to be read.
A future implementation will provide an intelligent cache of the piece table,
which will hopefully reduce the IO activity required.