java.lang.Object
  org.apache.lucene.index.DocumentsWriter

final class DocumentsWriter

This class accepts multiple added documents and directly
writes a single segment file. It does this more
efficiently than creating a single segment per document
(with DocumentWriter) and doing standard merges on those
segments.
When a document is added, its stored fields (if any) and
term vectors (if any) are immediately written to the
Directory (i.e., these do not consume RAM). The freq/prox
postings are accumulated into a Postings hash table keyed
by term. Each entry in this hash table holds separate
byte streams (allocated as incrementally growing slices
into large shared byte[] arrays) for freq and prox, which
contain the postings data for multiple documents. If
vectors are enabled, each unique term in each document
also allocates a PostingVector instance to similarly
track the offsets and positions byte streams.
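The slice scheme is easiest to see in miniature. Below is
a minimal sketch, assuming each slice ends in a sentinel
byte that encodes its level and that full slices chain to
larger ones; the two level tables mirror the
levelSizeArray / nextLevelArray fields listed further
below, but the concrete sizes and sentinel encoding here
are illustrative assumptions, not authoritative values.

    // Sketch only: incrementally growing slices in one shared byte[].
    class ByteSliceSketch {
        // Slice size per level; a term's byte stream starts tiny.
        static final int[] levelSizeArray = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
        // Level that follows each level (the top level repeats).
        static final int[] nextLevelArray = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

        byte[] block = new byte[65536]; // one large shared block
        int upto;                       // next free offset in the block

        // Allocates a fresh level-0 slice for a new term's stream.
        int newSlice() {
            int start = upto;
            upto += levelSizeArray[0];
            block[upto - 1] = 16;       // end sentinel, level 0
            return start;
        }

        // Called when a write hits the sentinel: allocate the next,
        // larger slice; the old slice's tail then becomes a pointer
        // to the new one (the pointer rewrite is elided here).
        int growSlice(int endOffset) {
            int level = nextLevelArray[block[endOffset] & 15];
            int start = upto;
            upto += levelSizeArray[level];
            block[upto - 1] = (byte) (16 | level); // new end sentinel
            return start;
        }
    }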
Once the Postings hash is full (i.e., it is consuming the
allowed RAM) or the number of added docs is large enough
(when we are flushing by doc count instead of by RAM
usage), we create a real segment, flush it to disk, and
reset the Postings hash.
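In outline, the flush decision reduces to the following
check (a sketch; the parameter names are illustrative
stand-ins for the class's internal counters):

    // Sketch: should the buffered state be flushed to a segment?
    boolean triggerFlush(long numBytesUsed, long ramBufferSizeBytes,
                         int numDocsInRAM, int maxBufferedDocs,
                         boolean flushByRAMUsage) {
        if (flushByRAMUsage)
            return numBytesUsed >= ramBufferSizeBytes; // hash is "full"
        return numDocsInRAM >= maxBufferedDocs;        // doc-count flush
    }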
When adding a document we first organize all of its fields
by field name. We then process field by field, recording
the Posting hash per field. After each field we flush its
term vectors. When it's time to flush the full segment we
first sort the fields by name, and then go field by field
and sort each field's postings.
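In outline (a hypothetical sketch with stand-in types and
helpers, not the class's real API):

    // Sketch of the per-field ordering described above.
    void processFieldsSketch(Map<String, List<String>> fieldsByName) {
        // Fields are already organized by name; process field by field.
        for (Map.Entry<String, List<String>> field : fieldsByName.entrySet()) {
            for (String value : field.getValue())
                addToPostingHash(field.getKey(), value); // hypothetical
            flushTermVectors(field.getKey());            // hypothetical
        }
        // At full-segment flush: visit fields in sorted-name order and
        // sort each field's postings before writing them.
    }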
Threads:
Multiple threads are allowed into addDocument at once.
There is an initial synchronized call to getThreadState
which allocates a ThreadState for this thread. The same
thread will get the same ThreadState over time (thread
affinity) so that if there are consistent patterns (for
example each thread is indexing a different content
source) then we make better use of RAM. Then
processDocument is called on that ThreadState without
synchronization (most of the "heavy lifting" is in this
call). Finally the synchronized "finishDocument" is
called to flush changes to the directory.
Each ThreadState instance has its own Posting hash. Once
we're using too much RAM, we flush all Posting hashes to
a segment by merging the docIDs in the posting lists for
the same term across multiple thread states (see
writeSegment and appendPostings).
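Put together, the per-document flow looks roughly like
this (an illustrative sketch, not the actual
implementation; the two calls on ThreadState are
hypothetical stand-ins):

    // Sketch of the three-phase addDocument flow described above.
    boolean addDocumentSketch(Document doc, Analyzer analyzer)
            throws IOException {
        // Phase 1 (synchronized): acquire this thread's ThreadState;
        // thread affinity means a thread tends to get the same one.
        ThreadState state = getThreadState(doc, null);
        // Phase 2 (unsynchronized): the heavy lifting; analyze the doc
        // and fill this ThreadState's private Posting hash.
        state.processDocumentSketch(doc, analyzer);     // hypothetical
        // Phase 3 (synchronized): write stored fields / term vectors
        // and report whether the caller should now flush.
        return state.finishDocumentSketch();            // hypothetical
    }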
When flush is called by IndexWriter, or when we flush
internally (with autoCommit=false), we forcefully idle all
threads and flush only once they are all idle. This
means you can call flush from a given thread even while
other threads are actively adding/deleting documents.
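A minimal sketch of that idling step, assuming
pauseAllThreads and resumeAllThreads simply bracket the
actual flush work (doFlushSketch is a hypothetical
helper):

    // Sketch: idle all indexing threads around the flush.
    synchronized int flushSketch(boolean closeDocStore) throws IOException {
        pauseAllThreads();       // wait until every ThreadState is idle
        try {
            return doFlushSketch(closeDocStore); // hypothetical helper
        } finally {
            resumeAllThreads();  // let other threads resume indexing
        }
    }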
Exceptions:
Because this class directly updates in-memory posting
lists, and flushes stored fields and term vectors
directly to files in the directory, there are certain
limited times when an exception can corrupt this state.
For example, a disk-full condition while flushing stored
fields leaves that file in a corrupt state. Or, an OOM
exception while appending to the in-memory posting lists
can corrupt that posting list. We call such exceptions
"aborting exceptions": in these cases we must call
abort() to discard all docs added since the last flush.
All other exceptions ("non-aborting exceptions") may
still partially update the index structures. These
updates are consistent, but they represent only part of
the document seen up to the point the exception was hit.
When this happens, we immediately mark the document as
deleted so that the document is always atomically ("all
or none") added to the index.
Inner Class: final static class FieldMergeState
Inner Class: static class Num
Method Summary

synchronized void abort(AbortException ae)
    Called if we hit an exception when adding docs, flushing, etc.
List abortedFiles()
boolean addDocument(Document doc, Analyzer analyzer)
    Returns true if the caller (IndexWriter) should now flush
    (see the usage sketch after this summary).
void appendPostings(ThreadState.FieldData[] fields, TermInfosWriter termsOut, IndexOutput freqOut, IndexOutput proxOut)
synchronized boolean bufferDeleteTerm(Term term)
synchronized boolean bufferDeleteTerms(Term[] terms)
synchronized void clearBufferedDeletes()
synchronized void clearFlushPending()
synchronized void close()
String closeDocStore()
    Closes the currently open doc stores and returns the doc
    store segment name.
int compareText(char[] text1, int pos1, char[] text2, int pos2)
void copyBytes(IndexInput srcIn, IndexOutput destIn, long numBytes)
void createCompoundFile(String segment)
synchronized List files()
static void fillBytes(IndexOutput out, byte b, int numBytes)
synchronized int flush(boolean closeDocStore)
synchronized List getBufferedDeleteDocIDs()
synchronized HashMap getBufferedDeleteTerms()
synchronized byte[] getByteBlock()
synchronized char[] getCharBlock()
int getDocStoreOffset()
    Returns the doc offset into the shared doc store for the
    current buffered docs.
String getDocStoreSegment()
    Returns the current doc store segment we are writing to.
int getMaxBufferedDeleteTerms()
int getMaxBufferedDocs()
synchronized int getNumBufferedDeleteTerms()
int getNumDocsInRAM()
    Returns how many docs are currently buffered in RAM.
synchronized void getPostings(Posting[] postings)
double getRAMBufferSizeMB()
long getRAMUsed()
String getSegment()
    Gets the name of the current segment we are writing.
synchronized ThreadState getThreadState(Document doc, Term delTerm)
    Returns a free (idle) ThreadState that may be used for
    indexing this one document.
synchronized boolean hasDeletes()
synchronized boolean pauseAllThreads()
synchronized void recycleByteBlocks(byte[][] blocks, int start, int end)
synchronized void recycleCharBlocks(char[][] blocks, int numBlocks)
synchronized void recyclePostings(Posting[] postings, int numPostings)
synchronized void resumeAllThreads()
synchronized void setAborting()
synchronized boolean setFlushPending()
    Sets flushPending if it is not already set and returns
    whether it was set.
void setInfoStream(PrintStream infoStream)
    If non-null, various details of indexing are printed here.
void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
void setMaxBufferedDocs(int count)
    Sets max buffered docs, which means we will flush by doc
    count instead of by RAM usage.
void setRAMBufferSizeMB(double mb)
    Sets how much RAM we can use before flushing.
String toMB(long v)
boolean updateDocument(Term t, Document doc, Analyzer analyzer)
boolean updateDocument(Document doc, Analyzer analyzer, Term delTerm)
void writeNorms(String segmentName, int totalNumDoc)
    Writes norms in the "true" segment format.
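For orientation, this is roughly how a caller such as
IndexWriter consumes the addDocument contract above (a
hedged sketch built only from the signatures in this
summary; the meaning of flush's int return value is not
documented here):

    // Sketch: a true return from addDocument means "flush now".
    if (docWriter.addDocument(doc, analyzer)) {
        int flushResult = docWriter.flush(true); // closeDocStore=true
    }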
Field Summary

final static int BYTE_BLOCK_MASK
final static int BYTE_BLOCK_NOT_MASK
final static int BYTE_BLOCK_SHIFT
final static int BYTE_BLOCK_SIZE
final static int CHAR_BLOCK_MASK
final static int CHAR_BLOCK_SHIFT
final static int CHAR_BLOCK_SIZE
final static int MAX_TERM_LENGTH
final static int POSTING_NUM_BYTE
byte[] copyByteBuffer
final static int[] levelSizeArray
final static int[] nextLevelArray
long numBytesAlloc
long numBytesUsed
Method Detail

synchronized void abort(AbortException ae) throws IOException
    Called if we hit an exception when adding docs, flushing,
    etc. This resets our state, discarding any docs added
    since the last flush. If ae is non-null, it contains the
    root-cause exception (which we re-throw after we are done
    aborting).
synchronized void clearBufferedDeletes() throws IOException
synchronized void clearFlushPending()
synchronized void close()
String closeDocStore() throws IOException
    Closes the currently open doc stores and returns the doc
    store segment name. Returns null if there are no buffered
    documents.
int compareText(char[] text1, int pos1, char[] text2, int pos2)
void createCompoundFile(String segment) throws IOException
    Builds the compound file for the segment we just flushed.
synchronized int flush(boolean closeDocStore) throws IOException
    Flushes all pending docs to a new segment.
synchronized List getBufferedDeleteDocIDs()
synchronized HashMap getBufferedDeleteTerms()
synchronized byte[] getByteBlock()
synchronized char[] getCharBlock()
int getDocStoreOffset()
    Returns the doc offset into the shared doc store for the
    current buffered docs.
String getDocStoreSegment()
    Returns the current doc store segment we are writing to.
    This will be the same as the segment name when autoCommit
    is true.
int getMaxBufferedDeleteTerms()
int getMaxBufferedDocs()
synchronized int getNumBufferedDeleteTerms()
int getNumDocsInRAM()
    Returns how many docs are currently buffered in RAM.
synchronized void getPostings(Posting[] postings)
double getRAMBufferSizeMB()
long getRAMUsed()
String getSegment()
    Gets the name of the current segment we are writing.
synchronized ThreadState getThreadState(Document doc, Term delTerm) throws IOException
    Returns a free (idle) ThreadState that may be used for
    indexing this one document. This call also pauses if a
    flush is pending. If delTerm is non-null then we buffer
    this deleted term after the thread state has been
    acquired.
synchronized boolean hasDeletes()
synchronized boolean pauseAllThreads()
synchronized void recycleByteBlocks(byte[][] blocks, int start, int end)
synchronized void recycleCharBlocks(char[][] blocks, int numBlocks)
synchronized void recyclePostings(Posting[] postings, int numPostings)
synchronized void resumeAllThreads()
synchronized void setAborting()
synchronized boolean setFlushPending()
    Sets flushPending if it is not already set and returns
    whether it was set. This is used by IndexWriter to trigger
    a single flush even when multiple threads are trying to do
    so (see the sketch below).
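A hedged sketch of that idiom, assuming the thread that
wins setFlushPending() performs the flush and then clears
the flag:

    // Sketch: only the winner of setFlushPending() flushes.
    if (docWriter.setFlushPending()) {
        try {
            docWriter.flush(true);         // perform the single flush
        } finally {
            docWriter.clearFlushPending(); // allow future flushes
        }
    }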
void setInfoStream(PrintStream infoStream)
    If non-null, various details of indexing are printed here.
void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
void setMaxBufferedDocs(int count)
    Sets max buffered docs, which means we will flush by doc
    count instead of by RAM usage.
void setRAMBufferSizeMB(double mb)
    Sets how much RAM we can use before flushing.
void writeNorms(String segmentName, int totalNumDoc) throws IOException
    Writes norms in the "true" segment format. This is called
    only during commit, to create the .nrm file.