The piece table of a .doc file.
The piece table maps logical character positions in a document's text
stream to actual positions in the file stream. The piece table is stored
as two parallel arrays. The first array contains 32-bit integers giving
the logical character positions. The second array contains 64-bit data
structures that are mostly mysterious to me, except that each contains a
32-bit subfile offset. The second array is stored immediately after the
first array. I call the first array the charPos array and the second
array the filePos array.
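
To illustrate the mapping, here is a minimal Python sketch that finds the
piece containing a given logical character position. The function name
find_piece is hypothetical, and the sketch assumes that the charPos values
are sorted in ascending order and act as piece boundaries, so that n pieces
are described by n + 1 charPos values. The text above does not state this.

```python
import bisect


def find_piece(char_pos, cp):
    """Return the index of the piece whose range [start, end) contains cp.

    ASSUMPTION: char_pos holds ascending piece boundaries, so piece i
    covers logical positions char_pos[i] up to (not including)
    char_pos[i + 1].
    """
    if len(char_pos) < 2 or cp < char_pos[0] or cp >= char_pos[-1]:
        raise IndexError("character position is outside the piece table")
    # bisect_right returns the first boundary strictly greater than cp;
    # the piece that owns cp starts one boundary earlier.
    return bisect.bisect_right(char_pos, cp) - 1
```

The index returned can then be used to pick the matching entry of the
filePos array.
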
The arrays are preceded by a special tag byte (with the value 2),
followed by the combined size of both arrays in bytes. The number of
piece table entries is not stored; it must be deduced from this byte size.
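
As a rough sketch of how the entry count might be deduced, the Python
below reads the tag byte and the size, then splits the remaining bytes into
the two arrays. Several details are assumptions not stated above: that the
size is a 32-bit little-endian integer, that the integers in the charPos
array are little-endian, and that the charPos array holds one more value
than the filePos array (n + 1 boundaries for n pieces). The name
read_piece_table is hypothetical.

```python
import struct

PIECE_TABLE_TAG = 2   # the special tag byte
CHAR_POS_SIZE = 4     # one 32-bit charPos integer
FILE_POS_SIZE = 8     # one 64-bit filePos structure


def read_piece_table(data, offset=0):
    """Return (char_pos, file_pos_raw) parsed from data starting at offset."""
    if data[offset] != PIECE_TABLE_TAG:
        raise ValueError("expected piece table tag byte 2")
    # ASSUMPTION: the combined size is a 32-bit little-endian integer.
    (size,) = struct.unpack_from("<I", data, offset + 1)
    # ASSUMPTION: n pieces need n + 1 charPos values, so
    # size == (n + 1) * 4 + n * 8, which gives n == (size - 4) / 12.
    count, remainder = divmod(size - CHAR_POS_SIZE,
                              CHAR_POS_SIZE + FILE_POS_SIZE)
    if remainder != 0:
        raise ValueError("byte size does not fit a whole number of pieces")
    base = offset + 5
    char_pos = list(struct.unpack_from("<%dI" % (count + 1), data, base))
    file_base = base + (count + 1) * CHAR_POS_SIZE
    # The filePos structures are returned as raw 8-byte chunks, since
    # their internal layout is not described here.
    file_pos_raw = [
        data[file_base + i * FILE_POS_SIZE:file_base + (i + 1) * FILE_POS_SIZE]
        for i in range(count)
    ]
    return char_pos, file_pos_raw
```
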
Because of this bizarre structure, caching piece table entries is
something of a challenge. A single piece table entry is actually split
across two file locations. If there are many piece table entries, the
charPos and filePos information for one entry may be separated by many
bytes, potentially crossing block boundaries. The approach I took was to
use two different buffered streams, one per array. Up to n charPos
offsets and n filePos structures can be buffered in the two streams, so
looking up piece information requires no file seeking. (File seeking
must still occur to jump from one piece to the next.)
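
The two-stream idea can be sketched in Python as follows. This is an
illustration only, not the actual implementation: the class name
PieceCursor and its parameters are hypothetical, table_start is taken to be
the offset of the first charPos value, and the sketch again assumes
little-endian integers and n + 1 charPos values for n pieces. Each stream
is positioned once and then read sequentially, so any buffering in the
streams serves both arrays without seeking back and forth.

```python
import io
import struct


class PieceCursor:
    """Iterate over (start, end, raw_file_pos) using one stream per array."""

    def __init__(self, char_stream, file_stream, table_start, count):
        # char_stream and file_stream are two independent binary streams
        # over the same file, e.g. two open(path, "rb") handles.
        self._chars = char_stream
        self._files = file_stream
        self._chars.seek(table_start, io.SEEK_SET)
        # ASSUMPTION: the filePos array starts after count + 1 charPos values.
        self._files.seek(table_start + (count + 1) * 4, io.SEEK_SET)
        self._remaining = count
        self._start = self._read_char_pos()

    def _read_char_pos(self):
        (value,) = struct.unpack("<I", self._chars.read(4))
        return value

    def __iter__(self):
        return self

    def __next__(self):
        if self._remaining == 0:
            raise StopIteration
        self._remaining -= 1
        end = self._read_char_pos()   # sequential read from stream one
        raw = self._files.read(8)     # sequential read from stream two
        piece = (self._start, end, raw)
        self._start = end
        return piece
```

With real files, the buffer size of each handle bounds how many entries
are held in memory at once.
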
Note that the vast majority of .doc files in the world will have exactly
one piece table entry, representing the complete text of the document.
Only those documents that were "fast-saved" should have multiple pieces.
Finally, the text in a .doc file can consist of either 16-bit Unicode
characters (charset UTF-16LE) or 8-bit CP1252 characters. One .doc file
can contain both kinds of pieces. Whether or not a piece is CP1252 is
stored, bizarrely enough, as a flag in the filePos value. If the flag is
set, then the actual file position is the filePos with the flag cleared,
then divided by 2.
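
The flag handling described above might look like this in Python. Which
bit carries the flag is not stated here; the sketch assumes bit 30
(0x40000000), and the names CP1252_FLAG, decode_file_pos and
read_piece_text are hypothetical.

```python
CP1252_FLAG = 0x40000000  # ASSUMPTION: the flag is bit 30 of the filePos


def decode_file_pos(file_pos):
    """Return (byte_offset, encoding) for a 32-bit filePos value."""
    if file_pos & CP1252_FLAG:
        # Flag set: clear it, then divide by 2 to get the real position.
        return (file_pos & ~CP1252_FLAG) // 2, "cp1252"
    return file_pos, "utf-16-le"


def read_piece_text(stream_bytes, file_pos, char_count):
    """Decode char_count characters of one piece from stream_bytes."""
    offset, encoding = decode_file_pos(file_pos)
    # CP1252 pieces use one byte per character, UTF-16LE pieces use two.
    width = 1 if encoding == "cp1252" else 2
    raw = stream_bytes[offset:offset + char_count * width]
    return raw.decode(encoding)
```

For example, under these assumptions a filePos of 0x40000008 denotes a
CP1252 piece starting at byte 4, while a filePos of 8 denotes a UTF-16LE
piece starting at byte 8.
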
author: pjack